In a previous post we went into detail about how to implement Tungsten-specific checks. In this post we will focus on the other standard Nagios checks that would help keep your cluster nodes healthy.
Your database cluster contains your most business-critical data. The slave nodes must be online, healthy and in sync with the master in order to be viable failover candidates.
This means keeping a close watch on the health of the database nodes from many perspectives, from ensuring sufficient disk space to testing that replication traffic is flowing.
A robust monitoring setup is essential for cluster health and viability - if your replicator goes offline and you do not know about it, then that slave becomes effectively useless because it has stale data.
Nagios Checks
The Power of Persistence
One of the best (and also the worst) things about Nagios is the built-in nagging - it just keeps screaming until you pay attention to it.
The Nagios server uses services.cfg, which defines a service that calls the check_nrpe binary with at least one argument - the name of the check to execute on the remote host.
Once on the remote host, the NRPE daemon processes the request from the Nagios server, comparing the check name in the request with the list of defined commands in the /etc/nagios/nrpe.cfg file. If a match is found, the command is executed by the nrpe user. If different privileges are needed, then sudo must be employed.
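To make the lookup step concrete, here is a minimal shell sketch that mimics how a check name maps to a command line. It is an illustration only, not how the NRPE daemon is implemented; the helper name lookup_nrpe_command is hypothetical, and the sample entries are the disk checks defined later in this post.

```shell
#!/bin/sh
# Illustration only: mimic how a check name is matched against the
# command[...] entries of nrpe.cfg. lookup_nrpe_command is a hypothetical
# helper, not part of NRPE. It reads the config on standard input.
lookup_nrpe_command() {
    # Print whatever follows "command[<name>]=" for the first matching entry.
    sed -n "s|^command\[$1]=||p" | head -n 1
}

# Two sample entries, kept in a variable so the sketch runs anywhere.
sample_cfg='command[check_root]=/usr/lib64/nagios/plugins/check_disk -w 20 -c 10 -p /
command[check_disk_data]=/usr/lib64/nagios/plugins/check_disk -w 20 -c 10 -p /volumes/data'

printf '%s\n' "$sample_cfg" | lookup_nrpe_command check_disk_data
```

A name with no matching entry prints nothing, which loosely mirrors the daemon refusing a check that is not defined on that node.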
Prerequisites
Before you can use these examples
This is NOT a Nagios tutorial as such, although we present configuration examples for the Nagios framework. You will need to already have the following:
- Nagios server installed and fully functional
- NRPE installed and fully functional on each cluster node you wish to monitor
Please note that installing and configuring Nagios and NRPE in your environment is not covered in this article.
Teach the Targets
Tell NRPE on the Database Nodes What To Do
The NRPE commands are defined in the /etc/nagios/nrpe.cfg file on each monitored database node. We will discuss three NRPE plugins called by the defined commands: check_disk, check_mysql and check_mysql_query.
First, let's make sure that we do not run out of disk space. We use the check_disk plugin and define two custom commands, each calling check_disk to monitor a different disk partition:
command[check_root]=/usr/lib64/nagios/plugins/check_disk -w 20 -c 10 -p /
command[check_disk_data]=/usr/lib64/nagios/plugins/check_disk -w 20 -c 10 -p /volumes/data
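Nagios plugins report their result through the exit code: 0 for OK, 1 for WARNING, 2 for CRITICAL and 3 for UNKNOWN. The sketch below shows the comparison that the -w and -c thresholds imply. It is not how check_disk is implemented; classify_free_space is a hypothetical function, and it assumes that bare integer thresholds such as -w 20 are read as megabytes of free space, which is the plugin's default unit.

```shell
#!/bin/sh
# Illustration only: the decision implied by "-w 20 -c 10".
# classify_free_space is a hypothetical helper, not part of check_disk.
# Arguments: free space, warning threshold, critical threshold (all in MB).
classify_free_space() {
    free_mb=$1
    warn_mb=$2
    crit_mb=$3
    if [ "$free_mb" -lt "$crit_mb" ]; then
        echo "DISK CRITICAL"
        return 2
    elif [ "$free_mb" -lt "$warn_mb" ]; then
        echo "DISK WARNING"
        return 1
    fi
    echo "DISK OK"
    return 0
}

# 40234 MB free against the thresholds used above.
classify_free_space 40234 20 10
echo "exit code: $?"
```

If you would rather alert on a percentage of free space, check_disk also accepts thresholds with a percent sign, for example -w 20% -c 10%.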
Next, let's validate that we are able to log in to MySQL directly, bypassing the connector by using port 13306. We use the check_mysql plugin and define a custom command also called check_mysql:
command[check_mysql]=/usr/lib64/nagios/plugins/check_mysql -H localhost -u nagios -p secret -P 13306
If there is a connector running on that node, you may run the same test to validate that you are able to log in through the connector, using port 3306 and the check_mysql plugin, by defining a custom command called check_mysql_connector:
command[check_mysql_connector]=/usr/lib64/nagios/plugins/check_mysql -H localhost -u nagios -p secret -P 3306
Finally, you may use the check_mysql_query plugin to run any MySQL query you wish for further validation, normally via the local MySQL port 13306 to ensure that the check is testing the local host:
command[check_mysql_query]=/usr/lib64/nagios/plugins/check_mysql_query -q 'select mydatacolumn from nagios.test_data' -H localhost -u nagios -p secret -P 13306
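The MySQL checks above assume that a nagios account and the nagios.test_data table already exist. As a rough sketch, the shell function below prints SQL that would create them. Everything beyond the names used in the commands above is an assumption: the function name nagios_setup_sql, the 'localhost' account host, the INT column type and the CREATE USER IF NOT EXISTS syntax (MySQL 5.7 or later).

```shell
#!/bin/sh
# Illustration only: print setup SQL for the monitoring account and the
# test table referenced by the check_mysql_query example.
# Assumptions (not from the article): account host, column type, and
# MySQL 5.7+ syntax for CREATE USER IF NOT EXISTS.
nagios_setup_sql() {
    cat <<'EOF'
CREATE DATABASE IF NOT EXISTS nagios;
CREATE TABLE IF NOT EXISTS nagios.test_data (mydatacolumn INT NOT NULL);
INSERT INTO nagios.test_data (mydatacolumn) VALUES (1);
CREATE USER IF NOT EXISTS 'nagios'@'localhost' IDENTIFIED BY 'secret';
GRANT SELECT ON nagios.* TO 'nagios'@'localhost';
EOF
}

nagios_setup_sql
```

To apply it, pipe the output into the mysql client using an administrative account. One caveat, offered as general MySQL client behavior rather than something the commands above address: many client libraries treat the host name localhost as a request for the local Unix socket and ignore the port. If a check appears to bypass the port you intended, -H 127.0.0.1 forces a TCP connection.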
Here are some other example commands you may define that are not Tungsten-specific:
command[check_total_procs]=/usr/lib64/nagios/plugins/check_procs -w 150 -c 200
command[check_users]=/usr/lib64/nagios/plugins/check_users -w 15 -c 25
command[check_load]=/usr/lib64/nagios/plugins/check_load -w 5,4,3 -c 6,5,4
command[check_procs]=/usr/lib64/nagios/plugins/check_procs -w 150 -c 200
command[check_zombie_procs]=/usr/lib64/nagios/plugins/check_procs -w 5 -c 10 -s Z
Additionally, there is no harm in defining commands that may never be called. This allows for simple administration: keep the master copy of nrpe.cfg in one place, push updates to all nodes as needed, and then restart nrpe.
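One way to push a single master copy to every node is a small loop like the one below. Treat it as a sketch under stated assumptions: the node list, passwordless SSH with sudo rights, the /tmp staging path and a systemd unit named nrpe are all illustrative and may differ in your environment. It defaults to a dry run that only prints what it would do.

```shell
#!/bin/sh
# Illustration only: distribute one nrpe.cfg to several nodes and restart NRPE.
# Assumptions: node names, SSH/sudo access, and the service name "nrpe".
NODES="db1 db2 db3"
MASTER_COPY="./nrpe.cfg"
DRY_RUN=${DRY_RUN:-1}   # set DRY_RUN=0 to really execute the commands

# Either print the command (dry run) or execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

push_nrpe_cfg() {
    for node in $NODES; do
        run scp "$MASTER_COPY" "$node:/tmp/nrpe.cfg"
        run ssh "$node" "sudo install -m 644 /tmp/nrpe.cfg /etc/nagios/nrpe.cfg && sudo systemctl restart nrpe"
    done
}

push_nrpe_cfg
```

Review the dry-run output first, then rerun with DRY_RUN=0 once the commands look right for your hosts.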
Big Brother Sees You
Tell the Nagios server to begin watching
Here are the service check definitions for the /opt/local/etc/nagios/objects/services.cfg file:
# Service definition
define service{
    service_description   Root partition - Tungsten Clustering
    servicegroups         myclusters
    host_name             db1,db2,db3,db4,db5,db6,db7,db8,db9
    check_command         check_nrpe!check_root
    contact_groups        admin
    use                   generic-service
}
# Service definition
define service{
    service_description   Data partition - Tungsten Clustering
    servicegroups         myclusters
    host_name             db1,db2,db3,db4,db5,db6,db7,db8,db9
    check_command         check_nrpe!check_disk_data
    contact_groups        admin
    use                   generic-service
}
# Service definition
define service{
    service_description   mysql local login - Tungsten Clustering
    servicegroups         myclusters
    host_name             db1,db2,db3,db4,db5,db6,db7,db8,db9
    contact_groups        admin
    check_command         check_nrpe!check_mysql
    use                   generic-service
}
# Service definition
define service{
    service_description   mysql login via connector - Tungsten Clustering
    servicegroups         myclusters
    host_name             db1,db2,db3,db4,db5,db6,db7,db8,db9
    contact_groups        admin
    check_command         check_nrpe!check_mysql_connector
    use                   generic-service
}
# Service definition
define service{
    service_description   mysql local query - Tungsten Clustering
    servicegroups         myclusters
    host_name             db1,db2,db3,db4,db5,db6,db7,db8,db9
    contact_groups        admin
    check_command         check_nrpe!check_mysql_query
    use                   generic-service
}
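In a check_command such as check_nrpe!check_root, the text after the exclamation mark is passed to the command as $ARG1$. For that to work, a check_nrpe command must be defined on the Nagios server. The sketch below prints a commonly used definition; it assumes that the $USER1$ macro (normally set in resource.cfg) points at the directory holding the check_nrpe binary, and the function name print_check_nrpe_command is hypothetical.

```shell
#!/bin/sh
# Illustration only: print a check_nrpe command definition for the Nagios server.
# Assumption: $USER1$ resolves to the directory that contains check_nrpe.
# The quoted 'EOF' keeps the shell from expanding the Nagios $...$ macros.
print_check_nrpe_command() {
    cat <<'EOF'
define command{
    command_name  check_nrpe
    command_line  $USER1$/check_nrpe -H $HOSTADDRESS$ -c $ARG1$
}
EOF
}

print_check_nrpe_command
```

Redirect the output into whichever object file holds your command definitions, if an equivalent definition is not already there.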
NOTE: You must also add all of the hosts into the /opt/local/etc/nagios/objects/hosts.cfg file.
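Rather than typing nine near-identical host entries by hand, you could generate them. The following is a sketch, not a required layout: it assumes the linux-server template from the Nagios sample configuration is available, and that each host name resolves so it can double as the address. The alias text and the function name print_host_definitions are made up for illustration.

```shell
#!/bin/sh
# Illustration only: generate host definitions for db1..db9.
# Assumptions: the "linux-server" template exists, and the host names
# resolve via DNS or /etc/hosts so they can be used as the address.
print_host_definitions() {
    for i in 1 2 3 4 5 6 7 8 9; do
        cat <<EOF
define host{
    use        linux-server
    host_name  db$i
    alias      Tungsten node db$i
    address    db$i
}
EOF
    done
}

print_host_definitions
```

Review the generated text before appending it to hosts.cfg, and replace the address values with IP addresses if name resolution is not reliable on the Nagios server.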
Let's Get Practical
How to test the remote NRPE calls from the command line
The best way to ensure things are working well is to divide and conquer. My favorite approach is to use the check_nrpe binary on the command line from the Nagios server to make sure that the call(s) to the remote monitored node(s) succeed long before I configure the Nagios server daemon and start getting those evil text messages and emails.
To test a remote NRPE client command from a Nagios server via the command line, use the check_nrpe command:
shell> /opt/local/libexec/nagios/check_nrpe -H db1 -c check_disk_data
DISK OK - free space: /volumes/data 40234 MB (78% inode=99%);| /volumes/data=10955MB;51170;51180;0;51190
The above command calls the NRPE daemon running on host db1 and executes the NRPE command "check_disk_data" as defined in the db1:/etc/nagios/nrpe.cfg file.
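To repeat that test across several nodes and checks in one go, a loop along these lines may help. It is a sketch under assumptions: the host list is illustrative, the check names are the ones defined earlier in this post, and the helper names status_name and run_all_checks are hypothetical. The exit-code mapping (0 OK, 1 WARNING, 2 CRITICAL, anything else UNKNOWN) follows the standard Nagios plugin convention.

```shell
#!/bin/sh
# Illustration only: run every NRPE check against every host and summarize.
# Assumptions: host list and check_nrpe location; override via environment.
CHECK_NRPE=${CHECK_NRPE:-/opt/local/libexec/nagios/check_nrpe}
HOSTS=${HOSTS:-"db1 db2 db3"}
CHECKS="check_root check_disk_data check_mysql check_mysql_connector check_mysql_query"

# Translate a plugin exit code into its Nagios status name.
status_name() {
    case $1 in
        0) echo OK ;;
        1) echo WARNING ;;
        2) echo CRITICAL ;;
        *) echo UNKNOWN ;;
    esac
}

run_all_checks() {
    failures=0
    for host in $HOSTS; do
        for check in $CHECKS; do
            rc=0
            output=$("$CHECK_NRPE" -H "$host" -c "$check" 2>&1) || rc=$?
            echo "$host $check $(status_name "$rc"): $output"
            [ "$rc" -eq 0 ] || failures=$((failures + 1))
        done
    done
    echo "non-OK results: $failures"
}

if [ -x "$CHECK_NRPE" ]; then
    run_all_checks
else
    echo "check_nrpe not found at $CHECK_NRPE; set CHECK_NRPE to its location"
fi
```

Running this before enabling notifications gives you one line per host and check, so a misconfigured node stands out immediately.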
The Wrap-Up
Put it all together and sleep better knowing your Tungsten Cluster is under constant surveillance
Once your tests are working and your Nagios server config files have been updated, just restart the Nagios server daemon and you are on your way!
Tuning the values in the nrpe.cfg file may be required for optimal performance; as always, YMMV.
To learn about Continuent solutions in general, check out https://www.continuent.com/solutions
For more information about monitoring Tungsten clusters, please visit https://docs.continuent.com/tungsten-clustering-6.0/ecosystem-nagios.html.
Tungsten Clustering is the most flexible, performant global database layer available today - use it underlying your SaaS offering as a strong base upon which to grow your worldwide business!
For more information, please visit https://www.continuent.com/solutions
Want to learn more or run a POC? Contact us.