How to Integrate Tungsten Clustering Monitoring Tools with PagerDuty Alerts

Overview

The Skinny

In this blog post we will discuss how to best integrate various Continuent-bundled cluster monitoring solutions with PagerDuty (pagerduty.com), a popular alerting service.

Agenda

What's Here?

  • Briefly explore the bundled cluster monitoring tools
  • Describe the procedure for establishing alerting via PagerDuty
  • Examine some of the multiple monitoring tools included with the Continuent Tungsten Clustering software, and provide examples of how to send an email to PagerDuty from each of the tools.

Exploring the Bundled Cluster Monitoring Tools

A Brief Summary

Continuent provides multiple methods out of the box to monitor the cluster health. The most popular is the suite of Nagios/NRPE scripts (i.e. cluster-home/bin/check_tungsten_*). We also have Zabbix scripts (i.e. cluster-home/bin/zabbix_tungsten_*). Additionally, there is a standalone script available, tungsten_monitor, based upon the shared Ruby-based tpm libraries. We also include a very old shell script called check_tungsten.sh, but it is obsolete.

Implementing a Simple PagerDuty Alert

How To Add a PagerDuty Email Endpoint for Alerting

  • Create a new user to get the alerts:
    Configuration -> Users -> Click on the [+ Add Users] button
    • Enter the desired email address and invite. Be sure to respond to the invitation before proceeding.
  • Create a new escalation policy:
    Configuration -> Escalation Policies -> Click on the [+ New Escalation Policy] button
    • Enter the policy name at the top, i.e. Continuent Alert Escalation Policy
    • "Notify the following users or schedules" - click in the box and select the new user created in the first step
    • "escalates after" Set to 1 minute, or your desired value
    • "If no one acknowledges, repeat this policy X times" - set to 1 time, or your desired value
    • Finally, click on the green [Save] button at the bottom
  • Create a new service:
    Configuration -> Services -> Click on the [+ New Service] button
    • General Settings: Name - Enter the service name, i.e. Continuent Alert Emails from Monitoring (what you type in this box will automatically populate the
    • Integration Settings: Integration Type - Click on the second radio choice "Integrate via email"
    • Integration Settings: Integration Name - Email (automatically set for you, no action needed here)
    • Integration Settings: Integration Email - Adjust this email address, i.e. alerts, then copy this email address into a notepad for use later
    • Incident Settings: Escalation Policy - Select the Escalation Policy you created in the second step, i.e. "Continuent Alert Escalation Policy"
    • Incident Settings: Incident Timeouts - Check the box in front of Auto-resolution
    • Finally, click on the green [Add Service] button at the bottom

At this point, you should have an email address like "alerts@yourCompany.pagerduty.com" available for testing.

Go ahead and send a test email to that email address to make sure the alerting is working.

If the test works, you have successfully set up a PagerDuty email endpoint to use for alerting. Congratulations!
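
If you would rather script the test email than use a mail client, a minimal Python sketch follows. It assumes a mail relay is listening on localhost port 25 (an assumption about your host, not something PagerDuty or Tungsten provides), and the addresses are the placeholders used in this post. The function names are hypothetical.

```python
import smtplib
from email.message import EmailMessage


def build_test_alert(sender, recipient):
    """Build a plain-text test message for the PagerDuty integration address."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = "TEST: Continuent alert email integration"
    msg.set_content("Test alert to confirm the PagerDuty email endpoint works.")
    return msg


def send_alert(msg, smtp_host="localhost", smtp_port=25):
    """Hand the message to an SMTP relay (assumed to accept mail from this host)."""
    with smtplib.SMTP(smtp_host, smtp_port) as smtp:
        smtp.send_message(msg)


# Example (uncomment on a host with a working local mail relay):
# send_alert(build_test_alert("you@yourCompany.com",
#                             "alerts@yourCompany.pagerduty.com"))
```

A new incident appearing in the PagerDuty service shortly after the send confirms the endpoint is wired up.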

How to Send Alerts to PagerDuty using the tungsten_monitor Script

Invoking the Bundled Script via cron

The tungsten_monitor script provides a mechanism for monitoring the cluster state when monitoring tools like Nagios aren't available.

Each time the tungsten_monitor runs, it will execute a standard set of checks:

  • Check that all Tungsten services for this host are running
  • Check that all replication services and datasources are ONLINE
  • Check that replication latency does not exceed a specified amount
  • Check that the local connector is responsive
  • Check disk usage

Additional checks may be enabled using various command line options.

The tungsten_monitor is able to send you an email when problems are found.

It is suggested that you run the script as root so it is able to use the mail program without warnings.

Alerts are cached to prevent them from being sent multiple times and flooding your inbox. You may pass --reset to clear out the cache or --lock-timeout to adjust the amount of time this cache is kept. The default is 3 hours.
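
To make the caching behaviour concrete, here is a small Python sketch of time-based alert suppression. It is not the tungsten_monitor implementation; the function name, the in-memory cache, and the alert key are hypothetical, and only the 3-hour default comes from the text above.

```python
import time

DEFAULT_LOCK_TIMEOUT = 3 * 60 * 60  # 3 hours, expressed in seconds


def should_send(alert_key, cache, now=None, lock_timeout=DEFAULT_LOCK_TIMEOUT):
    """Return True if the alert should be emailed, recording the send time.

    cache maps an alert key to the time it was last sent; it is a
    hypothetical in-memory stand-in for a real on-disk cache.
    """
    now = time.time() if now is None else now
    last_sent = cache.get(alert_key)
    if last_sent is not None and now - last_sent < lock_timeout:
        return False  # the same alert was sent recently, so suppress it
    cache[alert_key] = now
    return True


cache = {}
print(should_send("db1: replicator offline", cache, now=0))         # True
print(should_send("db1: replicator offline", cache, now=300))       # False
print(should_send("db1: replicator offline", cache, now=3 * 3600))  # True
cache.clear()  # similar in spirit to passing --reset
```

The second call, five minutes after the first, is suppressed; once the timeout has elapsed the same alert is sent again.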

An example root crontab entry to run tungsten_monitor every five minutes:

*/5 * * * * /opt/continuent/tungsten/cluster-home/bin/tungsten_monitor --from=you@yourCompany.com --to=alerts@yourCompany.pagerduty.com >/dev/null 2>/dev/null

An alternate example root crontab entry to run tungsten_monitor every five minutes, in case your version of cron does not support the */5 step syntax:

0,5,10,15,20,25,30,35,40,45,50,55 * * * * /opt/continuent/tungsten/cluster-home/bin/tungsten_monitor --from=you@yourCompany.com --to=alerts@yourCompany.pagerduty.com >/dev/null 2>/dev/null
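
The two schedules are equivalent: a step of */5 in the minute field selects every minute from 0 through 59 that is a multiple of 5. A quick Python check of that expansion (the helper name is hypothetical):

```python
def expand_minute_step(step):
    """Expand a cron minute-field step such as */5 into an explicit list."""
    return ",".join(str(minute) for minute in range(0, 60, step))


print(expand_minute_step(5))
# 0,5,10,15,20,25,30,35,40,45,50,55
```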

All messages will be sent to /opt/continuent/share/tungsten_monitor/lastrun.log

The online documentation is here:
http://docs.continuent.com/tungsten-clustering-6.0/cmdline-tools-tungsten_monitor.html

Big Brother is Watching You!

The Power of Nagios and the check_tungsten_* scripts

We have two very descriptive blog posts about how to implement the Nagios-based cluster monitoring solution:
https://www.continuent.com/global-multimaster-cluster-monitoring-using-nagios/
https://www.continuent.com/essential-cluster-monitoring-using-nagios-and-nrpe/

We also have Nagios-specific documentation to assist with configuration:
http://docs.continuent.com/tungsten-clustering-6.0/ecosystem-nagios.html

In the event you are unable to get Nagios working with Tungsten Clustering, please open a support case via our ZenDesk-based support portal at https://continuent.zendesk.com/.
For more information about getting support, visit https://docs.continuent.com/support-process/troubleshooting-support.html

There are many available NRPE-based check scripts, and the online documentation for each is listed below:
http://docs.continuent.com/tungsten-clustering-6.0/cmdline-tools-tungsten_health_check.html
http://docs.continuent.com/tungsten-clustering-6.0/cmdline-tools-check_tungsten_services.html
http://docs.continuent.com/tungsten-clustering-6.0/cmdline-tools-check_tungsten_progress.html
http://docs.continuent.com/tungsten-clustering-6.0/cmdline-tools-check_tungsten_policy.html
http://docs.continuent.com/tungsten-clustering-6.0/cmdline-tools-check_tungsten_online.html
http://docs.continuent.com/tungsten-clustering-6.0/cmdline-tools-check_tungsten_latency.html

Big Brother Tells You

Tell the Nagios server how to contact PagerDuty

The key is to have a contact defined for the PagerDuty-specific email address. This is handled in the Nagios configuration file /opt/local/etc/nagios/objects/contacts.cfg:

objects/contacts.cfg

define contact{
        use                      generic-contact
        contact_name             pagerduty
        alias                    PagerDuty Alerting Service Endpoint
        email                    alerts@yourCompany.pagerduty.com
}
 
define contactgroup{
        contactgroup_name       admin
        alias                   PagerDuty Alerts
        members                 pagerduty,anotherContactIfDesired,etc
}

Teach the Targets

Tell NRPE on the Database Nodes What To Do

The NRPE commands are defined in the /etc/nagios/nrpe.cfg file on each monitored database node:

/etc/nagios/nrpe.cfg

command[check_tungsten_online]=/usr/bin/sudo -u tungsten /opt/continuent/tungsten/cluster-home/bin/check_tungsten_online
command[check_tungsten_latency]=/usr/bin/sudo -u tungsten /opt/continuent/tungsten/cluster-home/bin/check_tungsten_latency -w 2.5 -c 4.0 
command[check_tungsten_progress_alpha]=/usr/bin/sudo -u tungsten /opt/continuent/tungsten/cluster-home/bin/check_tungsten_progress  -t 5 -s alpha
command[check_tungsten_progress_beta]=/usr/bin/sudo -u tungsten /opt/continuent/tungsten/cluster-home/bin/check_tungsten_progress  -t 5 -s beta
command[check_tungsten_progress_gamma]=/usr/bin/sudo -u tungsten /opt/continuent/tungsten/cluster-home/bin/check_tungsten_progress  -t 5 -s gamma

Note that sudo lets the nrpe user run the tungsten-owned check scripts as the tungsten user. This relies on a wildcard sudo rule that covers the check scripts.

Additionally, there is no harm in defining commands that a given node never calls. This simplifies administration: keep the master copy in one place, push updates to all nodes as needed, and then restart nrpe.

Big Brother Sees You

Tell the Nagios server to begin watching

Here are the service check definitions for the /opt/local/etc/nagios/objects/services.cfg file:

objects/services.cfg

# Service definition
define service{
    service_description         check_tungsten_online for all cluster nodes
    host_name                   db1,db2,db3,db4,db5,db6,db7,db8,db9
    check_command               check_nrpe!check_tungsten_online
    contact_groups              admin
    use                         generic-service
    }
 
 
# Service definition
define service{
    service_description         check_tungsten_latency for all cluster nodes
    host_name                   db1,db2,db3,db4,db5,db6,db7,db8,db9
    check_command               check_nrpe!check_tungsten_latency
    contact_groups              admin
    use                         generic-service
    }
 
 
# Service definition
define service{
    service_description         check_tungsten_progress for alpha
    host_name                   db1,db2,db3
    check_command               check_nrpe!check_tungsten_progress_alpha
    contact_groups             admin
    use                         generic-service
    }
 
# Service definition
define service{
    service_description         check_tungsten_progress for beta
    host_name                   db4,db5,db6
    check_command               check_nrpe!check_tungsten_progress_beta
    contact_groups             admin
    use                         generic-service
    }
 
# Service definition
define service{
    service_description         check_tungsten_progress for gamma
    host_name                   db7,db8,db9
    check_command               check_nrpe!check_tungsten_progress_gamma
    contact_groups             admin
    use                         generic-service
    }
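
Because the three check_tungsten_progress definitions differ only in the service name and host list, they can be generated rather than copied by hand. The Python sketch below is one way to do that. The helper name and the idea of generating the file are assumptions; the service names, hosts, and directives come from the definitions above.

```python
# Tungsten service name -> monitored hosts, as used in the definitions above
SERVICES = {
    "alpha": ["db1", "db2", "db3"],
    "beta": ["db4", "db5", "db6"],
    "gamma": ["db7", "db8", "db9"],
}

# Doubled braces are literal braces in str.format templates
TEMPLATE = """\
# Service definition
define service{{
    service_description         check_tungsten_progress for {name}
    host_name                   {hosts}
    check_command               check_nrpe!check_tungsten_progress_{name}
    contact_groups              admin
    use                         generic-service
    }}
"""


def render_progress_services(services):
    """Render one Nagios service definition per Tungsten service."""
    return "\n".join(
        TEMPLATE.format(name=name, hosts=",".join(hosts))
        for name, hosts in services.items()
    )


print(render_progress_services(SERVICES))
```

Keeping the host lists in one mapping also makes it harder to drop a node from a definition by accident.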

Summary

The Wrap-Up

In this blog post we discussed how to best integrate various cluster monitoring solutions with PagerDuty (pagerduty.com), a popular alerting service.

To learn about Continuent solutions in general, check out https://www.continuent.com/solutions

The Library

Please read the docs!

For more information about monitoring Tungsten clusters, please visit https://docs.continuent.com/tungsten-clustering-6.0/ecosystem-nagios.html.

Below is a list of the Nagios NRPE plugin scripts provided by Tungsten Clustering; the documentation page for each is linked earlier in this post.

  • check_tungsten_latency - reports warning or critical status based on the replication latency levels provided.
  • check_tungsten_online - checks whether all the hosts in a given service are online and running. This command only needs to be run on one node within the service; the command returns the status for all nodes. The service name may be specified by using the -s SVCNAME option.
  • check_tungsten_policy - checks whether the policy is in AUTOMATIC mode and returns a CRITICAL if not.
  • check_tungsten_progress - executes a heartbeat operation and validates that the sequence number has incremented within a specific time period. The default is one (1) second, and may be changed using the -t SECS option.
  • check_tungsten_services - confirms that the services and processes are running; their state is not confirmed. To check state with a similar interface, use the check_tungsten_online command.
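
Nagios plugins report their state through the exit code: 0 for OK, 1 for WARNING, 2 for CRITICAL, and 3 for UNKNOWN. The Python sketch below shows how warning and critical thresholds, such as the -w 2.5 -c 4.0 values passed to check_tungsten_latency in the nrpe.cfg example, translate into those states. It illustrates the plugin convention only and is not the bundled script's code; in particular, comparing with >= rather than > is an assumption, and the function name is hypothetical.

```python
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def classify_latency(latency, warn, crit):
    """Map a replication latency in seconds to a Nagios plugin exit code."""
    if latency is None:
        return UNKNOWN  # latency could not be determined
    if latency >= crit:
        return CRITICAL
    if latency >= warn:
        return WARNING
    return OK


for sample in (0.8, 3.1, 7.5, None):
    print(sample, classify_latency(sample, warn=2.5, crit=4.0))
```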

Tungsten Clustering is the most flexible, performant global database layer available today - use it underlying your SaaS offering as a strong base upon which to grow your worldwide business!

For more information, please visit https://www.continuent.com/solutions

Want to learn more or run a POC? Contact us.

About the Author

Eric M. Stone
COO

Eric is a veteran of fast-paced, large-scale enterprise environments with 35 years of Information Technology experience. With a focus on HA/DR, from building data centers and trading floors to worldwide deployments, Eric has architected, coded, deployed and administered systems for a wide variety of disparate customers, from Fortune 500 financial institutions to SMBs.
