Continuent Blog: Essential MySQL Cluster Monitoring Using Nagios and NRPE


In a previous post, we went into detail about how to implement Tungsten-specific checks. In this post, we focus on the standard Nagios checks that help monitor the health of your cluster nodes.

Your database cluster contains your most business-critical data. The Replica nodes must be online, healthy and in sync with the Primary in order to be viable failover candidates.

This means keeping a close watch on the health of the database nodes from many angles, ranging from ensuring sufficient disk space to testing that replication traffic is flowing.

A robust monitoring setup is essential for cluster health and viability. If a Replica's replicator goes offline and you do not know about it, that Replica becomes effectively useless as a failover candidate because its data is stale.

Nagios Checks

The Power of Persistence

One of the best (and also the worst) things about Nagios is its built-in nagging: it keeps screaming for attention until someone deals with the problem.

On the Nagios server, services.cfg defines each service. For remote checks, the service calls the check_nrpe binary with at least one argument: the name of the check to execute on the remote host.

On the remote host, the NRPE daemon processes the request from the Nagios server by matching the requested check name against the commands defined in the /etc/nagios/nrpe.cfg file. If a match is found, the command is executed as the nrpe user. If the command needs different privileges, it must be run through sudo.
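The sudo route mentioned above might look like the following sketch. The command name check_example_root and the plugin check_example are hypothetical placeholders, and the NRPE daemon is assumed to run as the nrpe user, as it does with RHEL-family packages. The sudoers rule grants root only for that one plugin path.

```
# /etc/nagios/nrpe.cfg: hypothetical command that needs root privileges
command[check_example_root]=/usr/bin/sudo /usr/lib64/nagios/plugins/check_example

# /etc/sudoers.d/nrpe (edit with visudo): allow only that exact plugin, no password
nrpe ALL=(root) NOPASSWD: /usr/lib64/nagios/plugins/check_example
# Only needed where sudoers enables requiretty, since NRPE has no terminal
Defaults:nrpe !requiretty
```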

Prerequisites

Before You Can Use These Examples

This is NOT a Nagios tutorial as such, although we present configuration examples for the Nagios framework. You will need to already have the following:

Nagios server installed and fully functional
NRPE installed and fully functional on each cluster node you wish to monitor

Please note that installing and configuring Nagios and NRPE in your environment is not covered in this article.

Teach the Targets

Tell NRPE on the Database Nodes What to Do

The NRPE commands are defined in the /etc/nagios/nrpe.cfg file on each monitored database node. We will discuss three NRPE plugins called by the defined commands: check_disk, check_mysql and check_mysql_query.

First, let's ensure sufficient disk space using the check_disk plugin by defining two custom commands, each calling check_disk to monitor a different disk partition:

command[check_root]=/usr/lib64/nagios/plugins/check_disk -w 20 -c 10 -p /
command[check_disk_data]=/usr/lib64/nagios/plugins/check_disk -w 20 -c 10 -p /volumes/data
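A note on units: check_disk treats bare integers such as -w 20 -c 10 as megabytes of free space by default, so these commands only warn once less than 20 MB remain. The perfdata in the sample check_nrpe output later in this post (thresholds 51170 and 51180 on a 51190 MB volume) reflects exactly that. If percentages were intended, a sketch of the same two commands with percentage thresholds would be:

```
# Warn below 20% free, go critical below 10% free
command[check_root]=/usr/lib64/nagios/plugins/check_disk -w 20% -c 10% -p /
command[check_disk_data]=/usr/lib64/nagios/plugins/check_disk -w 20% -c 10% -p /volumes/data
```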

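Next, let's validate that we can log in to MySQL directly, bypassing the Connector, by pointing the check_mysql plugin at port 13306 in a custom command also called check_mysql. The host is given as 127.0.0.1 rather than localhost because the MySQL client library treats localhost as a request to use the local Unix socket and ignores the port number (make sure the nagios MySQL account is allowed to connect from 127.0.0.1):

command[check_mysql]=/usr/lib64/nagios/plugins/check_mysql -H 127.0.0.1 -u nagios -p secret -P 13306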

If a Connector proxy is running on that node, you may wish to run the same test to validate that logins work through the Connector on port 3306. This example uses a dedicated nagios user, so if the Connector is in proxy mode, make sure the nagios user is defined properly in the user.map file. The same check_mysql plugin is used, this time pointing at the Connector port in a custom command called check_mysql_connector. Use 127.0.0.1 here as well: with localhost, the client would connect over the Unix socket and could reach MySQL directly without ever touching the Connector:

command[check_mysql_connector]=/usr/lib64/nagios/plugins/check_mysql -H 127.0.0.1 -u nagios -p secret -P 3306
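For proxy mode, a sketch of the matching user.map entry follows, assuming the usual one-user-per-line layout of user name, password and data service name. The service name alpha is a placeholder for your own data service, and the same credentials must also exist as a MySQL account.

```
# user.map: <user> <password> <data service>  (alpha is a placeholder)
nagios secret alpha
```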

Finally, you may run any MySQL query you wish for further validation, normally via the local MySQL port 13306 (addressed as 127.0.0.1 so that the port is honored) to ensure that the check is testing the local host:

command[check_mysql_query]=/usr/lib64/nagios/plugins/check_mysql_query -q 'select mydatacolumn from nagios.test_data' -H 127.0.0.1 -u nagios -p secret -P 13306
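Keep in mind that check_mysql_query reads only the first column of the first row, and that value must be numeric or the plugin reports CRITICAL. Its -w and -c options take standard Nagios threshold ranges. The following hypothetical variant goes CRITICAL when the test table is empty; the command name check_mysql_query_count and the count(*) query are illustrations, not part of the original setup. The range 1: means "alert if the value is below 1".

```
command[check_mysql_query_count]=/usr/lib64/nagios/plugins/check_mysql_query -q 'select count(*) from nagios.test_data' -c 1: -H 127.0.0.1 -u nagios -p secret -P 13306
```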

Here are some other example commands you may define that are not Tungsten-specific:

command[check_total_procs]=/usr/lib64/nagios/plugins/check_procs -w 150 -c 200
command[check_users]=/usr/lib64/nagios/plugins/check_users -w 15 -c 25
command[check_load]=/usr/lib64/nagios/plugins/check_load -w 5,4,3 -c 6,5,4
command[check_procs]=/usr/lib64/nagios/plugins/check_procs -w 150 -c 200
command[check_zombie_procs]=/usr/lib64/nagios/plugins/check_procs -w 5 -c 10 -s Z

Additionally, there is no harm in defining NRPE agent commands that the upstream Nagios server may never call. This keeps administration simple: maintain a golden copy of nrpe.cfg in one place, push it to all nodes as needed, and then restart NRPE.
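The push-and-restart step can be sketched as a small shell function. Everything here is an assumption rather than part of the original post: passwordless SSH and sudo on each node, a systemd unit named nrpe (Debian and Ubuntu use nagios-nrpe-server), an nrpe group for file ownership, and the hypothetical names push_nrpe_cfg, GOLDEN_CFG and DRY_RUN.

```shell
#!/bin/sh
# Sketch: copy a golden nrpe.cfg to each node, install it, restart NRPE.
# Assumptions: passwordless SSH + sudo on every node, a systemd unit named
# "nrpe" and an "nrpe" group. DRY_RUN=1 prints commands instead of running them.
GOLDEN_CFG=${GOLDEN_CFG:-./nrpe.cfg}     # hypothetical local golden copy
NRPE_SERVICE=${NRPE_SERVICE:-nrpe}       # assumed service name

# Run a command, or only print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Usage: push_nrpe_cfg db1 db2 db3 ...
push_nrpe_cfg() {
    for host in "$@"; do
        run scp "$GOLDEN_CFG" "$host:/tmp/nrpe.cfg" || return 1
        # nrpe.cfg holds the MySQL password, so install it readable only by
        # root and the (assumed) nrpe group, then restart the daemon.
        run ssh "$host" "sudo install -o root -g nrpe -m 0640 /tmp/nrpe.cfg /etc/nagios/nrpe.cfg && sudo systemctl restart $NRPE_SERVICE" || return 1
    done
}
```

Running `DRY_RUN=1 push_nrpe_cfg db1 db2` first shows exactly what would be executed on each node before anything changes.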

Big Brother Sees You

Tell the Nagios server to begin watching

Here are the service check definitions for the /opt/local/etc/nagios/objects/services.cfg file:

# Service definition
define service{
service_description 	Root partition - Tungsten Clustering
servicegroups       	myclusters
host_name           	db1,db2,db3,db4,db5,db6,db7,db8,db9
check_command       	check_nrpe!check_root
contact_groups      	admin
use                 	generic-service
}
# Service definition
define service{
service_description 	Data partition - Tungsten Clustering
servicegroups       	myclusters
host_name           	db1,db2,db3,db4,db5,db6,db7,db8,db9
check_command       	check_nrpe!check_disk_data
contact_groups      	admin
use                 	generic-service
}
# Service definition
define service{
service_description 	mysql local login - Tungsten Clustering
servicegroups       	myclusters
host_name           	db1,db2,db3,db4,db5,db6,db7,db8,db9
contact_groups      	admin
check_command       	check_nrpe!check_mysql
use                 	generic-service
}
# Service definition
define service{
service_description 	mysql login via connector - Tungsten Clustering
servicegroups       	myclusters
host_name           	db1,db2,db3,db4,db5,db6,db7,db8,db9
contact_groups      	admin
check_command       	check_nrpe!check_mysql_connector
use                 	generic-service
}
# Service definition
define service{
service_description 	mysql local query - Tungsten Clustering
servicegroups       	myclusters
host_name           	db1,db2,db3,db4,db5,db6,db7,db8,db9
contact_groups      	admin
check_command       	check_nrpe!check_mysql_query
use                 	generic-service
}
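These services assume that a check_nrpe command object exists on the Nagios server. Many packaged configurations ship one in commands.cfg, but if yours does not, a typical definition looks like the sketch below. $USER1$ is assumed to be set in resource.cfg to the plugin directory (/opt/local/libexec/nagios in the command-line test in this post), and the part after the ! in check_command arrives as $ARG1$.

```
define command{
command_name 	check_nrpe
command_line 	$USER1$/check_nrpe -H $HOSTADDRESS$ -c $ARG1$
}
```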

Note

You must also add all of the hosts to the /opt/local/etc/nagios/objects/hosts.cfg file.
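A minimal host entry might look like the following sketch, repeated for db2 through db9. The linux-server template comes from the sample Nagios templates.cfg, and db1.example.com is a placeholder address. Objects that the service definitions reference, such as the myclusters servicegroup and the admin contact group, must also be defined in a file Nagios loads, or the configuration check will fail.

```
define host{
use          	linux-server
host_name    	db1
alias        	db1 - Tungsten Clustering
address      	db1.example.com
}
define servicegroup{
servicegroup_name	myclusters
alias            	Tungsten Clustering nodes
}
```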

Let's Get Practical

How to Test the Remote NRPE Calls From the Command Line

The best way to ensure things are working well is to divide and conquer. My favorite approach is to use the check_nrpe binary on the command line from the Nagios server to make sure that the call(s) to the remote monitored node(s) succeed long before I configure the Nagios server daemon and start getting those evil text messages and emails.

To test a remote NRPE client command from the Nagios server via the command line, use the check_nrpe command:

shell> /opt/local/libexec/nagios/check_nrpe -H db1 -c check_disk_data
DISK OK - free space: /volumes/data 40234 MB (78% inode=99%);| /volumes/data=10955MB;51170;51180;0;51190

The above command calls the NRPE daemon running on host db1 and executes the NRPE command "check_disk_data" as defined in the db1:/etc/nagios/nrpe.cfg file.
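The same divide-and-conquer test can be scripted to sweep every node and command in one pass. This is a sketch, not a Continuent tool: check_all and nagios_state are hypothetical names, and it relies on check_nrpe passing through the remote plugin's exit code (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN).

```shell
#!/bin/sh
# Sketch: run a list of NRPE commands against a list of nodes from the
# Nagios server, printing one line per check. check_nrpe exits with the
# remote plugin's status code, which nagios_state maps to a name.
# CHECK_NRPE defaults to the path used in this post; override if needed.
CHECK_NRPE=${CHECK_NRPE:-/opt/local/libexec/nagios/check_nrpe}

# Map a Nagios plugin exit code to its state name (0/1/2, else UNKNOWN).
nagios_state() {
    case "$1" in
        0) echo OK ;;
        1) echo WARNING ;;
        2) echo CRITICAL ;;
        *) echo UNKNOWN ;;
    esac
}

# Usage: check_all "db1 db2 db3" check_root check_disk_data check_mysql
# Returns the highest exit code seen, so 0 means every check was OK.
check_all() {
    hosts=$1
    shift
    worst=0
    for host in $hosts; do          # unquoted on purpose: split the host list
        for cmd in "$@"; do
            output=$("$CHECK_NRPE" -H "$host" -c "$cmd" 2>&1)
            rc=$?
            echo "$host $cmd $(nagios_state "$rc"): $output"
            if [ "$rc" -gt "$worst" ]; then worst=$rc; fi
        done
    done
    return "$worst"
}
```

For example, `check_all "db1 db2 db3" check_root check_disk_data check_mysql check_mysql_query` gives a quick pass/fail sweep before any service definitions are loaded into Nagios.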

The Wrap-Up

Put It All Together and Sleep Better Knowing Your Tungsten Cluster Is Under Constant Surveillance

Once your tests are working and your Nagios server config files have been updated, just restart the Nagios server daemon and you are on your way!
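Before restarting, it is worth running Nagios Core's built-in pre-flight check, `nagios -v <main config file>`, so that a typo in services.cfg or hosts.cfg does not take monitoring down. In the sketch below, nagios_preflight is a hypothetical name, and both default paths are assumptions modeled on the /opt/local layout used in this post.

```shell
#!/bin/sh
# Sketch: verify the Nagios configuration before restarting the daemon.
# Both paths are assumptions; adjust them to your install.
NAGIOS_BIN=${NAGIOS_BIN:-/opt/local/bin/nagios}
NAGIOS_CFG=${NAGIOS_CFG:-/opt/local/etc/nagios/nagios.cfg}

# Returns 0 if "nagios -v" accepts the configuration, 1 otherwise.
nagios_preflight() {
    if "$NAGIOS_BIN" -v "$NAGIOS_CFG" >/dev/null 2>&1; then
        echo "Nagios config OK; safe to restart"
    else
        echo "Nagios config has errors; run '$NAGIOS_BIN -v $NAGIOS_CFG' for details" >&2
        return 1
    fi
}
```

Chaining it as `nagios_preflight && <your restart command>` restarts the daemon only when the configuration is valid.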

You will likely need to tune the warning and critical thresholds in the nrpe.cfg file for your environment; as always, YMMV.

To learn about Continuent solutions in general, check out our Products & Solutions.

For more information about monitoring Tungsten clusters, please visit our documentation.

Tungsten Clustering is the most flexible, performant global database layer available today. Use it underneath your SaaS offering as a strong foundation on which to grow your worldwide business!

Want to learn more or run a POC? Contact us.

Published In

Categories:

Cluster Management, Database Administration, Monitoring and Observability

Series:

Tungsten University

Tags:

High Availability, Monitoring, MySQL, Nagios, NRPE

Author

Eric M. Stone

COO and VP of Product Management

Eric is a veteran of fast-paced, large-scale enterprise environments with 40 years of Information Technology experience. With a focus on HA/DR, from building data centers and trading floors to worldwide deployments, Eric has architected, coded, deployed and administered systems for a wide variety of customers, from Fortune 500 financial institutions to SMBs.

