
Essential MySQL Cluster Monitoring Using Nagios and NRPE

In a previous post we went into detail about how to implement Tungsten-specific checks. In this post we will focus on the other standard Nagios checks that would help keep your cluster nodes healthy.

Your database cluster contains your most business-critical data. The slave nodes must be online, healthy and in sync with the master in order to be viable failover candidates.

This means keeping a close watch on the health of the database nodes from many perspectives, from ensuring sufficient disk space to testing that replication traffic is flowing.

A robust monitoring setup is essential for cluster health and viability - if your replicator goes offline and you do not know about it, then that slave becomes effectively useless because it has stale data.

Nagios Checks

The Power of Persistence

One of the best (and also the worst) things about Nagios is the built-in nagging - it just keeps screaming until you pay attention.

The Nagios server uses services.cfg, which defines a service that calls the check_nrpe binary with at least one argument: the name of the check to execute on the remote host.

On the remote host, the NRPE daemon processes the request from the Nagios server, comparing the check name in the request against the list of commands defined in the /etc/nagios/nrpe.cfg file. If a match is found, the command is executed as the nrpe user. If different privileges are needed, then sudo must be employed.
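For example, a check that needs root privileges can be wrapped in sudo. Here is a hypothetical sketch - the check_raid plugin, its path, and the nrpe user name are assumptions about your installation, so adjust them to match your environment:

```shell
# /etc/nagios/nrpe.cfg - hypothetical command that needs elevated privileges
command[check_raid]=/usr/bin/sudo /usr/lib64/nagios/plugins/check_raid

# /etc/sudoers.d/nrpe - allow the nrpe user to run just that one plugin,
# with no password prompt (assumes the NRPE daemon runs as user "nrpe")
nrpe ALL=(root) NOPASSWD: /usr/lib64/nagios/plugins/check_raid
```

Granting sudo access only to the specific plugin binary keeps the privilege escalation as narrow as possible.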

Prerequisites

Before you can use these examples

This is NOT a Nagios tutorial as such, although we present configuration examples for the Nagios framework. You will need to already have the following:

  • Nagios server installed and fully functional
  • NRPE installed and fully functional on each cluster node you wish to monitor

Please note that installing and configuring Nagios and NRPE in your environment is not covered in this article.

Teach the Targets

Tell NRPE on the Database Nodes What To Do

The NRPE commands are defined in the /etc/nagios/nrpe.cfg file on each monitored database node. We will discuss three NRPE plugins called by the defined commands: check_disk, check_mysql and check_mysql_query.

First, let's ensure that we do not fill up our disk space using the check_disk plugin by defining two custom commands, each calling check_disk to monitor a different disk partition:

command[check_root]=/usr/lib64/nagios/plugins/check_disk -w 20 -c 10 -p /
command[check_disk_data]=/usr/lib64/nagios/plugins/check_disk -w 20 -c 10 -p /volumes/data
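Before wiring these into NRPE, you can run the plugin by hand on a database node to confirm the thresholds behave as expected. Note that bare -w/-c integers are interpreted by check_disk as minimum free space in MB; append a % sign if you want percentage thresholds instead:

```shell
# Run the plugin directly on the monitored node (path assumes a stock
# RHEL/CentOS nagios-plugins install; adjust for your distribution)
/usr/lib64/nagios/plugins/check_disk -w 20 -c 10 -p /

# The same check expressed as percentage-of-free-space thresholds:
/usr/lib64/nagios/plugins/check_disk -w 20% -c 10% -p /volumes/data
```

The plugin exits 0 (OK), 1 (WARNING) or 2 (CRITICAL), which is how NRPE reports status back to the Nagios server.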

Next, let's validate that we can log in to MySQL directly, bypassing the connector by using port 13306, with the check_mysql plugin in a custom command also named check_mysql:

command[check_mysql]=/usr/lib64/nagios/plugins/check_mysql -H localhost -u nagios -p secret -P 13306
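It can also help to confirm that the nagios MySQL user can connect on that port outside of Nagios entirely. A quick manual check, using the example credentials from above (note --protocol=TCP, since with -h localhost the mysql client would otherwise use the Unix socket and silently ignore -P):

```shell
# Verify the monitoring credentials and port by hand
mysql -h localhost -P 13306 --protocol=TCP -u nagios -psecret -e 'SELECT 1;'
```

If this fails, fix the grants or the port before debugging the Nagios side.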

If there is a connector running on that node, you may run the same test through the connector on port 3306 by defining a custom command called check_mysql_connector:

command[check_mysql_connector]=/usr/lib64/nagios/plugins/check_mysql -H localhost -u nagios -p secret -P 3306

Finally, you may run any MySQL query you wish for further validation, normally via the local MySQL port 13306 to ensure that the check tests the local host:

command[check_mysql_query]=/usr/lib64/nagios/plugins/check_mysql_query -q 'select mydatacolumn from nagios.test_data' -H localhost -u nagios -p secret -P 13306
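The query above assumes a nagios.test_data table exists. A minimal sketch of setting one up - the schema, row value, and grant are illustrative assumptions, so adapt them to your environment:

```shell
# One-time setup for the table the check_mysql_query command reads
mysql -h localhost -P 13306 --protocol=TCP -u root -p <<'SQL'
CREATE DATABASE IF NOT EXISTS nagios;
CREATE TABLE IF NOT EXISTS nagios.test_data (mydatacolumn INT);
INSERT INTO nagios.test_data (mydatacolumn) VALUES (1);
GRANT SELECT ON nagios.* TO 'nagios'@'localhost';
SQL
```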

Here are some other example commands you may define that are not Tungsten-specific:

command[check_total_procs]=/usr/lib64/nagios/plugins/check_procs -w 150 -c 200 
command[check_users]=/usr/lib64/nagios/plugins/check_users -w 15 -c 25
command[check_load]=/usr/lib64/nagios/plugins/check_load -w 5,4,3 -c 6,5,4  
command[check_procs]=/usr/lib64/nagios/plugins/check_procs -w 150 -c 200
command[check_zombie_procs]=/usr/lib64/nagios/plugins/check_procs -w 5 -c 10 -s Z

Additionally, there is no harm in defining commands that are never called, which simplifies administration - keep the master copy in one place, push updates to all nodes as needed, then restart the nrpe daemon.
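One way to push a single master copy out and restart the daemon on each node - a sketch assuming SSH access with sudo rights and a systemd service named nrpe (on some distributions the unit is called nagios-nrpe-server instead):

```shell
# Push the master nrpe.cfg to every database node and restart NRPE
for host in db1 db2 db3; do
    scp /etc/nagios/nrpe.cfg ${host}:/etc/nagios/nrpe.cfg
    ssh ${host} 'sudo systemctl restart nrpe'
done
```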

Big Brother Sees You

Tell the Nagios server to begin watching

Here are the service check definitions for the /opt/local/etc/nagios/objects/services.cfg file:

# Service definition
define service{
    service_description     Root partition - Tungsten Clustering 
    servicegroups           myclusters
    host_name               db1,db2,db3,db4,db5,db6,db7,db8,db9
    check_command           check_nrpe!check_root
    contact_groups          admin
    use                     generic-service
    }    
 
# Service definition
define service{
    service_description     Data partition - Tungsten Clustering 
    servicegroups           myclusters
    host_name               db1,db2,db3,db4,db5,db6,db7,db8,db9
    check_command           check_nrpe!check_disk_data
    contact_groups          admin
    use                     generic-service
    }    
 
# Service definition
define service{
    service_description     mysql local login - Tungsten Clustering
    servicegroups           myclusters
    host_name               db1,db2,db3,db4,db5,db6,db7,db8,db9
    contact_groups          admin
    check_command           check_nrpe!check_mysql
    use                     generic-service
    }    
 
# Service definition
define service{
    service_description     mysql login via connector - Tungsten Clustering
    servicegroups           myclusters
    host_name               db1,db2,db3,db4,db5,db6,db7,db8,db9
    contact_groups          admin
    check_command           check_nrpe!check_mysql_connector
    use                     generic-service
    }    
 
# Service definition
define service{
    service_description     mysql local query - Tungsten Clustering
    servicegroups           myclusters
    host_name               db1,db2,db3,db4,db5,db6,db7,db8,db9
    contact_groups          admin
    check_command           check_nrpe!check_mysql_query
    use                     generic-service
    }    

NOTE: You must also add all of the hosts into the /opt/local/etc/nagios/objects/hosts.cfg file.
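A minimal host definition for hosts.cfg might look like the following - the generic-host template and the address are assumptions about your Nagios setup:

```shell
# Host definition (one per database node)
define host{
    use                     generic-host
    host_name               db1
    alias                   db1
    address                 10.0.0.1
    }
```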

Let's Get Practical

How to test the remote NRPE calls from the command line

The best way to ensure things are working well is to divide and conquer. My favorite approach is to use the check_nrpe binary on the command line from the Nagios server to make sure that the call(s) to the remote monitored node(s) succeed long before I configure the Nagios server daemon and start getting those evil text messages and emails.

To test a remote NRPE client command from the Nagios server via the command line, use the check_nrpe binary:

shell> /opt/local/libexec/nagios/check_nrpe -H db1 -c check_disk_data
DISK OK - free space: /volumes/data 40234 MB (78% inode=99%);| /volumes/data=10955MB;51170;51180;0;51190

The above command calls the NRPE daemon running on host db1 and executes the NRPE command "check_disk_data" as defined in the db1:/etc/nagios/nrpe.cfg file.

The Wrap-Up

Put it all together and sleep better knowing your Tungsten Cluster is under constant surveillance

Once your tests are working and your Nagios server config files have been updated, just restart the Nagios server daemon and you are on your way!

Tuning the values in the nrpe.cfg file may be required for optimal performance; as always, YMMV.

To learn about Continuent solutions in general, check out https://www.continuent.com/solutions

For more information about monitoring Tungsten clusters, please visit https://docs.continuent.com/tungsten-clustering-6.0/ecosystem-nagios.html.

Tungsten Clustering is the most flexible, performant global database layer available today - use it underlying your SaaS offering as a strong base upon which to grow your worldwide business!


Want to learn more or run a POC? Contact us.

About the Author

Eric M. Stone
COO

Eric is a veteran of fast-paced, large-scale enterprise environments with 35 years of Information Technology experience. With a focus on HA/DR, from building data centers and trading floors to world-wide deployments, Eric has architected, coded, deployed and administered systems for a wide variety of disparate customers, from Fortune 500 financial institutions to SMBs.
