In this blog post we will discuss how to best integrate various Continuent-bundled cluster monitoring solutions with PagerDuty (pagerduty.com), a popular alerting service.
- Briefly explore the bundled cluster monitoring tools
- Describe the procedure for establishing alerting via PagerDuty
- Examine some of the multiple monitoring tools included with the Continuent Tungsten Clustering software, and provide examples of how to send an email to PagerDuty from each of the tools.
Exploring the Bundled Cluster Monitoring Tools
A Brief Summary
Continuent provides multiple methods out of the box to monitor cluster health. The most popular is the suite of Nagios/NRPE scripts (i.e. cluster-home/bin/check_tungsten_*). We also have Zabbix scripts (i.e. cluster-home/bin/zabbix_tungsten_*). Additionally, there is a standalone script available, tungsten_monitor, based upon the shared Ruby-based tpm libraries. We also include a very old shell script called check_tungsten.sh, but it is obsolete.
Implementing a Simple PagerDuty Alert
How To Add a PagerDuty Email Endpoint for Alerting
- Create a new user to get the alerts:
Configuration -> Users -> Click on the [+ Add Users] button
- Enter the desired email address and invite. Be sure to respond to the invitation before proceeding.
- Create a new escalation policy:
Configuration -> Escalation Policies -> Click on the [+ New Escalation Policy] button
- Enter the policy name at the top, i.e. Continuent Alert Escalation Policy
- "Notify the following users or schedules" - click in the box and select the new user created in the first step
- "escalates after" - set to 1 minute, or your desired value
- "If no one acknowledges, repeat this policy X times" - set to 1 time, or your desired value
- Finally, click on the green [Save] button at the bottom
- Create a new service:
Configuration -> Services -> Click on the [+ New Service] button
- General Settings: Name - Enter the service name, i.e. Continuent Alert Emails from Monitoring (what you type in this box will automatically populate the
- Integration Settings: Integration Type - Click on the second radio choice "Integrate via email"
- Integration Settings: Integration Name - Email (automatically set for you, no action needed here)
- Integration Settings: Integration Email - Adjust this email address, i.e. alerts, then copy this email address into a notepad for use later
- Incident Settings: Escalation Policy - Select the Escalation Policy you created earlier, i.e. "Continuent Alert Escalation Policy"
- Incident Settings: Incident Timeouts - Check the box in front of Auto-resolution
- Finally, click on the green [Add Service] button at the bottom
At this point, you should have an email address like "alerts@yourCompany.pagerduty.com" available for testing.
Go ahead and send a test email to that email address to make sure the alerting is working.
If the test works, you have successfully set up a PagerDuty email endpoint to use for alerting. Congratulations!
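If you would rather script the test email than send it from a mail client, a minimal Python sketch follows. It assumes an SMTP relay is listening on localhost port 25, which may not be true on your hosts; the function names and the subject line are our own illustration, and the addresses are the placeholders used in this post.

```python
import smtplib
from email.message import EmailMessage


def build_test_alert(sender: str, recipient: str) -> EmailMessage:
    """Build a plain-text test message for the PagerDuty email integration."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = "TEST: Continuent cluster alert"
    msg.set_content("This is a test alert. Please acknowledge and resolve.")
    return msg


def send_alert(msg: EmailMessage, smtp_host: str = "localhost", smtp_port: int = 25) -> None:
    """Hand the message to an SMTP relay (assumed here to be local)."""
    with smtplib.SMTP(smtp_host, smtp_port) as smtp:
        smtp.send_message(msg)
```

Calling send_alert(build_test_alert("you@yourCompany.com", "alerts@yourCompany.pagerduty.com")) should open an incident on the new service if the integration email address is correct.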
How to Send Alerts to PagerDuty using the tungsten_monitor Script
Invoking the Bundled Script via cron
The tungsten_monitor script provides a mechanism for monitoring the cluster state when monitoring tools like Nagios aren't available.
Each time tungsten_monitor runs, it will execute a standard set of checks:
- Check that all Tungsten services for this host are running
- Check that all replication services and datasources are ONLINE
- Check that replication latency does not exceed a specified amount
- Check that the local connector is responsive
- Check disk usage
Additional checks may be enabled using various command line options.
The tungsten_monitor script can send you an email when it finds problems.
We suggest running the script as root so that it can use the mail program without warnings.
Alerts are cached to prevent them from being sent multiple times and flooding your inbox. You may pass --reset to clear out the cache or --lock-timeout to adjust the amount of time this cache is kept. The default is 3 hours.
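To make the caching idea concrete, here is a small Python sketch of alert suppression with a timeout. It is not how tungsten_monitor is implemented; the AlertCache class, its method names, and the use of seconds as the unit are assumptions made purely for illustration.

```python
import time
from typing import Dict, Optional


class AlertCache:
    """Remember when each alert was last sent and suppress repeats."""

    def __init__(self, lock_timeout_seconds: float = 3 * 60 * 60) -> None:
        # Three hours mirrors the default mentioned above.
        self.lock_timeout_seconds = lock_timeout_seconds
        self._last_sent: Dict[str, float] = {}

    def should_send(self, alert: str, now: Optional[float] = None) -> bool:
        """Return True (and record the send) unless the alert is still cached."""
        now = time.time() if now is None else now
        last = self._last_sent.get(alert)
        if last is not None and now - last < self.lock_timeout_seconds:
            return False
        self._last_sent[alert] = now
        return True

    def reset(self) -> None:
        """Forget every cached alert, similar in spirit to passing --reset."""
        self._last_sent.clear()
```

The same alert text is sent once, then stays quiet until the timeout elapses or the cache is reset.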
An example root crontab entry to run tungsten_monitor every five minutes:
*/5 * * * * /opt/continuent/tungsten/cluster-home/bin/tungsten_monitor --from=you@yourCompany.com --to=alerts@yourCompany.pagerduty.com >/dev/null 2>/dev/null
An alternate example root crontab entry that runs tungsten_monitor every five minutes, in case your version of cron does not support the */5 step syntax:
0,5,10,15,20,25,30,35,40,45,50,55 * * * * /opt/continuent/tungsten/cluster-home/bin/tungsten_monitor --from=you@yourCompany.com --to=alerts@yourCompany.pagerduty.com >/dev/null 2>/dev/null
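The two crontab entries above describe the same schedule. As a quick sanity check, the following Python sketch expands a step value into the explicit minute list; expand_step is a hypothetical helper written for this post, not part of cron or of Tungsten.

```python
def expand_step(step: int, start: int = 0, end: int = 59) -> str:
    """Expand a cron step such as */5 into an explicit comma-separated list."""
    if step <= 0:
        raise ValueError("step must be a positive integer")
    return ",".join(str(value) for value in range(start, end + 1, step))
```

Calling expand_step(5) yields the minute field used in the alternate entry, which is handy when you need a different interval on an older cron.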
All messages will be sent to
The online documentation is here:
Big Brother is Watching You!
The Power of Nagios and the check_tungsten_* scripts
We have two very descriptive blog posts about how to implement the Nagios-based cluster monitoring solution:
We also have Nagios-specific documentation to assist with configuration:
In the event you are unable to get Nagios working with Tungsten Clustering, please open a support case via our ZenDesk-based support portal https://continuent.zendesk.com/
For more information about getting support, visit https://docs.continuent.com/support-process/troubleshooting-support.html
There are many available NRPE-based check scripts, and the online documentation for each is listed below:
Big Brother Tells You
Tell the Nagios server how to contact PagerDuty
The key is to have a contact defined for the PagerDuty-specific email address, which is handled by the Nagios configuration file
alias PagerDuty Alerting Service Endpoint
alias PagerDuty Alerts
Teach the Targets
Tell NRPE on the Database Nodes What To Do
The NRPE commands are defined in the /etc/nagios/nrpe.cfg file on each monitored database node:
command[check_tungsten_online]=/usr/bin/sudo -u tungsten /opt/continuent/tungsten/cluster-home/bin/check_tungsten_online
command[check_tungsten_latency]=/usr/bin/sudo -u tungsten /opt/continuent/tungsten/cluster-home/bin/check_tungsten_latency -w 2.5 -c 4.0
command[check_tungsten_progress_alpha]=/usr/bin/sudo -u tungsten /opt/continuent/tungsten/cluster-home/bin/check_tungsten_progress -t 5 -s alpha
command[check_tungsten_progress_beta]=/usr/bin/sudo -u tungsten /opt/continuent/tungsten/cluster-home/bin/check_tungsten_progress -t 5 -s beta
command[check_tungsten_progress_gamma]=/usr/bin/sudo -u tungsten /opt/continuent/tungsten/cluster-home/bin/check_tungsten_progress -t 5 -s gamma
Note that sudo is in use to give the nrpe user access as the tungsten user to the tungsten-owned check scripts using the sudo wildcard configuration.
Additionally, there is no harm in defining commands that may never be called, which keeps administration simple: keep the master copy in one place, push updates to all nodes as needed, and then restart nrpe.
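If you keep that master copy under version control, you may prefer to generate the repetitive command lines rather than type them. The Python sketch below is one way to do that, assuming the install path and sudo prefix shown in this post; nrpe_command and progress_commands are hypothetical helper names.

```python
from typing import List

# Paths copied from the nrpe.cfg example above; adjust for your installation.
BIN_DIR = "/opt/continuent/tungsten/cluster-home/bin"
SUDO_PREFIX = "/usr/bin/sudo -u tungsten"


def nrpe_command(name: str, script: str, args: str = "") -> str:
    """Render one nrpe.cfg command definition line."""
    line = f"command[{name}]={SUDO_PREFIX} {BIN_DIR}/{script}"
    return f"{line} {args}".rstrip()


def progress_commands(services: List[str], timeout_seconds: int = 5) -> List[str]:
    """Render one check_tungsten_progress command per replication service."""
    return [
        nrpe_command(
            f"check_tungsten_progress_{service}",
            "check_tungsten_progress",
            f"-t {timeout_seconds} -s {service}",
        )
        for service in services
    ]
```

Passing ["alpha", "beta", "gamma"] to progress_commands reproduces the three progress lines shown above, and adding a service becomes a one-word change.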
Big Brother Sees You
Tell the Nagios server to begin watching
Here are the service check definitions for the
# Service definition
service_description check_tungsten_online for all cluster nodes
# Service definition
service_description check_tungsten_latency for all cluster nodes
# Service definition
service_description check_tungsten_progress for alpha
# Service definition
service_description check_tungsten_progress for beta
# Service definition
service_description check_tungsten_progress for gamma
In this blog post we discussed how to best integrate various cluster monitoring solutions with PagerDuty (pagerduty.com), a popular alerting service.
To learn about Continuent solutions in general, check out https://www.continuent.com/solutions
Please read the docs!
For more information about monitoring Tungsten clusters, please visit https://docs.continuent.com/tungsten-clustering-6.0/ecosystem-nagios.html.
Below is a list of the Nagios NRPE plugin scripts provided by Tungsten Clustering. Click on each to be taken to the associated documentation page.
- check_tungsten_latency - reports warning or critical status based on the replication latency levels provided.
- check_tungsten_online - checks whether all the hosts in a given service are online and running. This command only needs to be run on one node within the service; the command returns the status for all nodes. The service name may be specified by using the -s SVCNAME option.
- check_tungsten_policy - checks whether the policy is in AUTOMATIC mode and returns a CRITICAL if not.
- check_tungsten_progress - executes a heartbeat operation and validates that the sequence number has incremented within a specific time period. The default is one (1) second, and may be changed using the -t SECS option.
- check_tungsten_services - confirms that the services and processes are running; their state is not confirmed. To check state with a similar interface, use the
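The warning and critical levels mentioned for check_tungsten_latency follow the usual Nagios plugin convention of signalling status through the exit code (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). The Python sketch below illustrates that convention only; latency_status is a hypothetical function, and whether the real script compares with "greater than" or "greater than or equal" is an assumption here.

```python
from typing import Optional

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def latency_status(latency_seconds: Optional[float], warn: float, crit: float) -> int:
    """Map a replication latency reading onto a Nagios exit code."""
    if latency_seconds is None:
        return UNKNOWN  # no reading available
    if latency_seconds >= crit:
        return CRITICAL
    if latency_seconds >= warn:
        return WARNING
    return OK
```

With the thresholds from the nrpe.cfg example (-w 2.5 -c 4.0), a latency of 3 seconds would map to WARNING and 5 seconds to CRITICAL.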
Tungsten Clustering is the most flexible, performant global database layer available today - use it underlying your SaaS offering as a strong base upon which to grow your worldwide business!
Want to learn more or run a POC? Contact us.