Continuent Blog: Global Multi-Primary MySQL Cluster Monitoring Using Nagios and NRPE


Your MySQL / MariaDB / Percona Server database cluster contains your most business-critical data. The Replica nodes must be online, healthy and in sync with the Primary in order to be viable failover candidates.

This means keeping a close watch on the health of the database nodes from many perspectives, from ensuring sufficient disk space to testing that replication traffic is flowing.

A robust monitoring setup is essential for cluster health and viability - if your replicator goes offline and you do not know about it, then that Replica becomes effectively useless because it has stale data.

Big Brother is Watching You!

The Power of Nagios

Even while you sleep, your servers are busy, and you simply cannot keep watch all the time. Now, more than ever, with global deployments, it is literally impossible to watch everything all the time.

Enter Nagios, your best big brother ever. As a long-time player in the monitoring market, Nagios has both free and paid versions. Our examples use the free Nagios 3 version.

Nagios and the related NRPE daemon together enable you to monitor almost anything. Tungsten Clustering provides a number of NRPE plugins right out of the box for your monitoring convenience.

The Puzzle Pieces

How does it all fit together?

The Nagios server uses services.cfg, which defines a service that calls the check_nrpe binary with at least one argument: the name of the check to execute on the remote host.

Once on the remote host, the NRPE daemon processes the request from the Nagios server, comparing the check name sent by the Nagios server request with the list of defined commands in the /etc/nagios/nrpe.cfg file. If a match is found, the command is executed by the nrpe user. If different privileges are needed, then sudo must be employed.
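To make the lookup step concrete, here is a minimal Python sketch of how a check name could be matched against the command definitions in nrpe.cfg. This is not the NRPE daemon's actual code; the function names and the regular expression are illustrative assumptions, and only the `command[name]=command line` syntax comes from the configuration shown in this article.

```python
import re

# Matches NRPE command definitions of the form: command[name]=command line
COMMAND_RE = re.compile(r"^\s*command\[([^\]]+)\]\s*=\s*(.+?)\s*$")


def parse_nrpe_commands(cfg_text):
    """Return a dict mapping check names to their command lines."""
    commands = {}
    for line in cfg_text.splitlines():
        if line.lstrip().startswith("#"):
            continue  # skip comment lines
        match = COMMAND_RE.match(line)
        if match:
            commands[match.group(1)] = match.group(2)
    return commands


def resolve_check(commands, check_name):
    """Mimic the lookup step: return the command line for a check, or None."""
    return commands.get(check_name)


# Hypothetical excerpt in the style of /etc/nagios/nrpe.cfg
SAMPLE_CFG = """\
# Tungsten checks
command[check_tungsten_online]=/usr/bin/sudo -u tungsten /opt/continuent/tungsten/cluster-home/bin/check_tungsten_online
"""

commands = parse_nrpe_commands(SAMPLE_CFG)
print(resolve_check(commands, "check_tungsten_online"))
print(resolve_check(commands, "check_not_defined"))  # no match, so nothing would run
```

The point of the sketch is that the Nagios server only ever sends a name; what actually runs is decided entirely by the configuration on the monitored node.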

Multi Primary Monitoring

Configuring Tungsten Cluster Monitoring with NRPE

For our example, we will have a Composite Active/Active dataservice called global with three active, writable member clusters (one per site), east, west and north.

Each site has two sub-services per node, one per remote site. Each sub-service is responsible for pulling writes from a remote cluster and applying them to the local node:

Cluster east has sub-services east_from_west and east_from_north
Cluster west has sub-services west_from_east and west_from_north
Cluster north has sub-services north_from_east and north_from_west

That means that three replication services must be checked on each node (the local service plus the two sub-services), in addition to the datasource state. To cover all of this, we use a number of plugins together.
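The naming pattern above is regular enough to derive programmatically. The following Python sketch lists the services to check on each node, assuming only the `<local>_from_<remote>` convention shown in this example; the function name is made up for illustration.

```python
from itertools import permutations


def subservices(clusters):
    """Map each cluster to its sub-services, one per remote cluster."""
    result = {cluster: [] for cluster in clusters}
    for local, remote in permutations(clusters, 2):
        result[local].append(f"{local}_from_{remote}")
    return result


clusters = ["east", "west", "north"]
for cluster, subs in subservices(clusters).items():
    # Each node runs the local service plus one sub-service per remote site
    services = [cluster] + subs
    print(f"{cluster}: {', '.join(services)}")
```

With three member clusters, every node ends up with three replication services; adding a fourth cluster would bring that to four per node.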

Prerequisites

Before you can use these examples

This is NOT a Nagios tutorial as such, although we present configuration examples for the Nagios framework. You will need to already have the following:

Tungsten Clustering v6.0.0 or greater deployed as a Composite Active/Active topology
Nagios server installed and fully functional
NRPE installed and fully functional on each cluster node you wish to monitor

Please note that installing and configuring Nagios and NRPE in your environment is not covered in this article.

Teach the Targets

Tell NRPE on the Database Nodes What To Do

The NRPE commands are defined in the /etc/nagios/nrpe.cfg file on each monitored database node:

command[check_tungsten_online]=/usr/bin/sudo -u tungsten /opt/continuent/tungsten/cluster-home/bin/check_tungsten_online
command[check_tungsten_latency]=/usr/bin/sudo -u tungsten /opt/continuent/tungsten/cluster-home/bin/check_tungsten_latency -w 2.5 -c 4.0
 
command[check_tungsten_progress_east]=/usr/bin/sudo -u tungsten /opt/continuent/tungsten/cluster-home/bin/check_tungsten_progress  -t 5 -s east
command[check_tungsten_progress_east_from_north]=/usr/bin/sudo -u tungsten /opt/continuent/tungsten/cluster-home/bin/check_tungsten_progress  -t 5 -s east_from_north
command[check_tungsten_progress_east_from_west]=/usr/bin/sudo -u tungsten /opt/continuent/tungsten/cluster-home/bin/check_tungsten_progress  -t 5 -s east_from_west
 
command[check_tungsten_progress_north]=/usr/bin/sudo -u tungsten /opt/continuent/tungsten/cluster-home/bin/check_tungsten_progress  -t 5 -s north
command[check_tungsten_progress_north_from_east]=/usr/bin/sudo -u tungsten /opt/continuent/tungsten/cluster-home/bin/check_tungsten_progress  -t 5 -s north_from_east
command[check_tungsten_progress_north_from_west]=/usr/bin/sudo -u tungsten /opt/continuent/tungsten/cluster-home/bin/check_tungsten_progress  -t 5 -s north_from_west
 
command[check_tungsten_progress_west]=/usr/bin/sudo -u tungsten /opt/continuent/tungsten/cluster-home/bin/check_tungsten_progress  -t 5 -s west
command[check_tungsten_progress_west_from_east]=/usr/bin/sudo -u tungsten /opt/continuent/tungsten/cluster-home/bin/check_tungsten_progress  -t 5 -s west_from_east
command[check_tungsten_progress_west_from_north]=/usr/bin/sudo -u tungsten /opt/continuent/tungsten/cluster-home/bin/check_tungsten_progress  -t 5 -s west_from_north

Note that sudo is used so that the nrpe user can run the tungsten-owned check scripts as the tungsten user. This relies on the sudo wildcard configuration described in the Security: Sudo Configuration section below.

Additionally, there is no harm in defining commands that may not be called on a given node, which allows for simple administration: keep a single authoritative copy of the file in one place, push updates to all nodes as needed, and then restart nrpe.
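Since the same file goes to every node, the progress check entries can be generated rather than typed by hand. Here is a minimal Python sketch, assuming the paths, the `-t 5` timeout and the service names used in this article; the function and constant names are hypothetical.

```python
CHECK_DIR = "/opt/continuent/tungsten/cluster-home/bin"
SUDO = "/usr/bin/sudo -u tungsten"


def progress_commands(clusters, timeout=5):
    """Build one nrpe.cfg line per replication service across all clusters."""
    lines = []
    for local in sorted(clusters):
        # The local service first, then one sub-service per remote cluster
        services = [local] + [
            f"{local}_from_{remote}" for remote in sorted(clusters) if remote != local
        ]
        for svc in services:
            lines.append(
                f"command[check_tungsten_progress_{svc}]={SUDO} "
                f"{CHECK_DIR}/check_tungsten_progress -t {timeout} -s {svc}"
            )
    return lines


for line in progress_commands(["east", "west", "north"]):
    print(line)
```

The output can be pasted into nrpe.cfg or written out by whatever configuration management tool you already use.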

Big Brother Sees You

Tell the Nagios server to begin watching

Here are the service check definitions for the /opt/local/etc/nagios/objects/services.cfg file:

# Service definition
define service{
    service_description         check_tungsten_latency - Tungsten Clustering
    servicegroups               myclusters
    host_name                   db1,db2,db3,db4,db5,db6,db7,db8,db9
    check_command               check_nrpe!check_tungsten_latency
    contact_groups              admin
    use                         generic-service
    }
 
# Service definition
define service{
    service_description         check_tungsten_online - Tungsten Clustering
    servicegroups               myclusters
    host_name                   db1,db2,db3,db4,db5,db6,db7,db8,db9
    check_command               check_nrpe!check_tungsten_online
    contact_groups              admin
    use                         generic-service
    }
 
 
# Service definition
define service{
    service_description         check_tungsten_progress - east
    host_name                   db1,db2,db3
    check_command               check_nrpe!check_tungsten_progress_east
    servicegroups               myclusters
    contact_groups              admin
    use                         generic-service
    }
 
 
# Service definition
define service{
    service_description         check_tungsten_progress - east_from_north
    host_name                   db1,db2,db3
    check_command               check_nrpe!check_tungsten_progress_east_from_north
    servicegroups               myclusters
    contact_groups              admin
    use                         generic-service
    }
 
 
# Service definition
define service{
    service_description         check_tungsten_progress - east_from_west
    host_name                   db1,db2,db3
    check_command               check_nrpe!check_tungsten_progress_east_from_west
    servicegroups               myclusters
    contact_groups              admin
    use                         generic-service
    }
 
 
# Service definition
define service{
    service_description         check_tungsten_progress - north
    host_name                   db7,db8,db9
    check_command               check_nrpe!check_tungsten_progress_north
    servicegroups               myclusters
    contact_groups              admin
    use                         generic-service
    }
 
 
# Service definition
define service{
    service_description         check_tungsten_progress - north_from_east
    host_name                   db7,db8,db9
    check_command               check_nrpe!check_tungsten_progress_north_from_east
    servicegroups               myclusters
    contact_groups              admin
    use                         generic-service
    }
 
 
# Service definition
define service{
    service_description         check_tungsten_progress - north_from_west
    host_name                   db7,db8,db9
    check_command               check_nrpe!check_tungsten_progress_north_from_west
    servicegroups               myclusters
    contact_groups              admin
    use                         generic-service
    }
 
 
# Service definition
define service{
    service_description         check_tungsten_progress - west
    host_name                   db4,db5,db6
    check_command               check_nrpe!check_tungsten_progress_west
    servicegroups               myclusters
    contact_groups              admin
    use                         generic-service
    }
 
 
# Service definition
define service{
    service_description         check_tungsten_progress - west_from_east
    host_name                   db4,db5,db6
    check_command               check_nrpe!check_tungsten_progress_west_from_east
    servicegroups               myclusters
    contact_groups              admin
    use                         generic-service
    }
 
 
# Service definition
define service{
    service_description         check_tungsten_progress - west_from_north
    host_name                   db4,db5,db6
    check_command               check_nrpe!check_tungsten_progress_west_from_north
    servicegroups               myclusters
    contact_groups              admin
    use                         generic-service
    }

NOTE: You must also add all of the hosts into the /opt/local/etc/nagios/objects/hosts.cfg file.
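The nine progress service definitions differ only in the service name and the host list, so they lend themselves to templating. Below is a minimal Python sketch, assuming the host-to-cluster mapping used in this example (db1-db3 in east, db4-db6 in west, db7-db9 in north); the template and function names are hypothetical.

```python
# Doubled braces are literal braces in str.format templates
SERVICE_TEMPLATE = """\
define service{{
    service_description         check_tungsten_progress - {svc}
    host_name                   {hosts}
    check_command               check_nrpe!check_tungsten_progress_{svc}
    servicegroups               myclusters
    contact_groups              admin
    use                         generic-service
    }}
"""

# Host layout taken from the service definitions in this article
CLUSTER_HOSTS = {
    "east": ["db1", "db2", "db3"],
    "west": ["db4", "db5", "db6"],
    "north": ["db7", "db8", "db9"],
}


def progress_services(cluster_hosts):
    """Return one service definition per replication service."""
    blocks = []
    for local in sorted(cluster_hosts):
        hosts = ",".join(cluster_hosts[local])
        services = [local] + [
            f"{local}_from_{remote}"
            for remote in sorted(cluster_hosts)
            if remote != local
        ]
        for svc in services:
            blocks.append(SERVICE_TEMPLATE.format(svc=svc, hosts=hosts))
    return blocks


print("\n".join(progress_services(CLUSTER_HOSTS)))
```

Generating the blocks keeps the check names in services.cfg in step with the command names defined in nrpe.cfg, which is the most common place for a typo to hide.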

Security: Sudo Configuration

To enable sudo for the nrpe user, create or edit the file /etc/sudoers.d/20_nagios. Using visudo rather than a plain editor means the syntax is checked before the file is saved:

sudo visudo -f /etc/sudoers.d/20_nagios

Defaults:nrpe !requiretty
Defaults:nrpe !visiblepw
nrpe          ALL=(tungsten)  NOPASSWD: /opt/continuent/tungsten/cluster-home/bin/check*

The above allows the nrpe OS user to execute any check scripts starting with "/opt/continuent/tungsten/cluster-home/bin/check" as the tungsten OS user.

Let's Get Practical

How to test the remote NRPE calls from the command line

The best way to ensure things are working well is to divide and conquer. My favorite approach is to use the check_nrpe binary on the command line from the Nagios server to make sure that the call(s) to the remote monitored node(s) succeed long before I configure the Nagios server daemon and start getting those evil text messages and emails.

To test a remote NRPE client command from a Nagios server via the command line, use the check_nrpe command:

shell> /opt/local/libexec/nagios/check_nrpe -H db1 -c check_tungsten_latency
OK: All slaves are running normally (max_latency=0.527)

The above command calls the NRPE daemon running on host db1 and executes the NRPE command "check_tungsten_latency" as defined in the db1:/etc/nagios/nrpe.cfg file.
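When there are many hosts and checks, the same test can be looped. The following Python sketch wraps the check_nrpe call shown above and translates the exit status using the standard Nagios plugin return codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). The check_nrpe path is the one used in this article; the helper names and the host and check lists are assumptions to adjust for your environment.

```python
import subprocess

CHECK_NRPE = "/opt/local/libexec/nagios/check_nrpe"  # path used in this article

# Standard Nagios plugin exit codes
STATES = {0: "OK", 1: "WARNING", 2: "CRITICAL", 3: "UNKNOWN"}


def build_command(host, check):
    """Return the argument list for one check_nrpe call."""
    return [CHECK_NRPE, "-H", host, "-c", check]


def state_name(returncode):
    """Translate a plugin exit code into a Nagios state name."""
    return STATES.get(returncode, "UNKNOWN")


def run_check(host, check, timeout=30):
    """Run one remote check and return (state, first line of output)."""
    try:
        proc = subprocess.run(
            build_command(host, check),
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except (OSError, subprocess.TimeoutExpired) as exc:
        # Missing binary or unresponsive host: report UNKNOWN rather than crash
        return "UNKNOWN", str(exc)
    return state_name(proc.returncode), proc.stdout.strip()


if __name__ == "__main__":
    for host in ["db1", "db2", "db3"]:
        for check in ["check_tungsten_online", "check_tungsten_latency"]:
            state, output = run_check(host, check)
            print(f"{host} {check}: {state} {output}")
```

Run it from the Nagios server itself so that the test exercises the same network path and NRPE access rules that the daemon will use.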

The Wrap-Up

Put it all together and sleep better knowing your Tungsten Cluster is under constant surveillance

Once your tests are working and your Nagios server config files have been updated, just restart the Nagios server daemon and you are on your way!

Tuning the values in the nrpe.cfg file may be required for optimal performance; as always, YMMV.

To learn about Continuent solutions in general, check out https://www.continuent.com/solutions

The Library

Please read the docs!

For more information about monitoring Tungsten clusters, please visit https://docs.continuent.com/tungsten-clustering-6.0/ecosystem-nagios.html.

Below is a list of the Nagios NRPE plugin scripts provided by Tungsten Clustering:

check_tungsten_latency - reports warning or critical status based on the replication latency levels provided.
check_tungsten_online - checks whether all the hosts in a given service are online and running. This command only needs to be run on one node within the service; the command returns the status for all nodes. The service name may be specified by using the -s SVCNAME option.
check_tungsten_policy - checks whether the policy is in AUTOMATIC mode and returns a CRITICAL if not.
check_tungsten_progress - executes a heartbeat operation and validates that the sequence number has incremented within a specific time period. The default is one (1) second, and may be changed using the -t SECS option.
check_tungsten_services - confirms that the services and processes are running; their state is not confirmed. To check state with a similar interface, use the check_tungsten_online command.
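To illustrate how warning and critical levels such as the `-w 2.5 -c 4.0` values in the nrpe.cfg example translate into a status, here is a small Python sketch. It is not the source of check_tungsten_latency: it assumes the common Nagios plugin convention that a value above a threshold triggers the alert, and the actual script may treat boundary values differently. The parsing helper relies only on the `max_latency=` field seen in the sample output earlier in this article.

```python
import re


def classify_latency(latency, warn=2.5, crit=4.0):
    """Return (exit_code, state) for a latency value in seconds.

    Assumption: a value strictly above a threshold raises that alert level.
    """
    if latency > crit:
        return 2, "CRITICAL"
    if latency > warn:
        return 1, "WARNING"
    return 0, "OK"


def parse_max_latency(output):
    """Extract the max_latency value from plugin output, or None if absent."""
    match = re.search(r"max_latency=([0-9]+(?:\.[0-9]+)?)", output)
    return float(match.group(1)) if match else None


sample = "OK: All slaves are running normally (max_latency=0.527)"
latency = parse_max_latency(sample)
print(latency, classify_latency(latency))
```

A helper like this is handy for choosing thresholds: feed it latency values collected during normal operation and see how often they would have triggered a warning.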

Tungsten Clustering is the most flexible, performant global database layer available today - use it underlying your SaaS offering as a strong base upon which to grow your worldwide business!

For more information, please visit https://www.continuent.com/solutions

Want to learn more or run a POC? Contact us.

Published In

Categories:

Geographic Distribution and Geo-Scale, Monitoring and Observability

Series:

Tungsten University

Tags:

Architecture, HA, High Availability, Monitoring, geo-distributed, Multisite, MySQL, Nagios, NRPE

Author

Eric M. Stone

COO and VP of Product Management

Eric is a veteran of fast-paced, large-scale enterprise environments with 40 years of Information Technology experience. With a focus on HA/DR, from building data centers and trading floors to world-wide deployments, Eric has architected, coded, deployed and administered systems for a wide variety of disparate customers, from Fortune 500 financial institutions to SMBs.
