Monitoring Made Easy: Watching Your Tungsten Cluster Using Built-In Tools

Agenda

What's Here?

  • Summary - Briefly describe the bundled cluster monitoring tools and related documentation pages
  • Explore the thinking behind cluster monitoring
  • Describe the use-cases for key monitoring tools included with the Continuent Tungsten Clustering software
  • Examine the best practices for using each tool along with examples

Summary

The Short Version

All businesses strive for maximum uptime, and monitoring is key to uptime - if you don’t know that something is broken, you won’t know to fix it!

This blog post shows you the thinking behind each included Tungsten Cluster monitoring tool, and when to use which tool.

Continuent provides multiple methods out of the box to monitor the cluster health.

The most popular is the suite of Nagios/NRPE scripts (cluster-home/bin/check_tungsten_*).

We also have Zabbix scripts (cluster-home/bin/zabbix_tungsten_*).

Additionally, there are standalone scripts available like tungsten_monitor and tungsten_health_check, based upon the shared Ruby-based tpm libraries. We also include a very old shell script called check_tungsten.sh, but it is obsolete.
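
All of these tools live in the installed cluster-home/bin directory. A quick way to see what is bundled on a node (a minimal sketch, assuming the default /opt/continuent installation path used later in this post):

    ls /opt/continuent/tungsten/cluster-home/bin/ | grep -E 'check_tungsten|zabbix_tungsten|tungsten_monitor|tungsten_health_check'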

Resources To Guide You

We have Nagios-specific documentation to assist with configuration.

In addition to this post, we have other detailed blog posts describing how to implement the Nagios-based cluster monitoring solutions.

The Thinking Behind Monitoring

Pay Attention To The Man Behind The Curtain

Why monitor?

  • More Uptime - if you do not know it is broken, you cannot fix it
  • Less Downtime - costs money in terms of lost revenue, lost reputation and lost time
  • Better Reliability - the more you can watch, the faster you can react to problems and potentially make improvements to prevent them from happening again
  • Trending - be able to notice changes or trends in your system to predict issues

What things should I watch in my cluster?

  • Manager
  • Replicator
  • Connector
  • Database
  • OS
  • Hardware
  • Network

What should I look for?

  • Errors
  • Delays
  • Lack of operation
  • Wrong states
  • Unusual activity
  • Threshold Exceeded (too high or low as compared to desired norm)

Exploring the Bundled Cluster Monitoring Tools

What tools are provided to monitor the cluster?

There are five available Nagios/NRPE-based check scripts, and the online documentation for each is listed below:

  • check_tungsten_services - verify that the specified services are running, i.e. via the `ps` command
  • check_tungsten_online - verify that all services are in the ONLINE state, either for a single node (-n) or for all nodes (default), and you may specify a service name using `-s` in case you have more than one
  • check_tungsten_policy - verify that the dataservice policy for the cluster is AUTOMATIC
  • check_tungsten_progress - verify that the Replicator sequence number is increasing within a specific time period which you may specify using `-t` (default: 1 second)
  • check_tungsten_latency - verify that the current replication latency is below the specified Warning (-w) and Critical (-c) levels in seconds
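
If you expose these checks through Nagios/NRPE, each one becomes a command definition in nrpe.cfg. Below is a minimal sketch, assuming the default /opt/continuent installation path and the example thresholds discussed later in this post; adjust the paths, flags and thresholds to suit your environment:

    # Hypothetical nrpe.cfg entries -- paths and threshold values are examples only
    command[check_tungsten_services]=/opt/continuent/tungsten/cluster-home/bin/check_tungsten_services
    command[check_tungsten_online]=/opt/continuent/tungsten/cluster-home/bin/check_tungsten_online -n
    command[check_tungsten_policy]=/opt/continuent/tungsten/cluster-home/bin/check_tungsten_policy
    command[check_tungsten_progress]=/opt/continuent/tungsten/cluster-home/bin/check_tungsten_progress -t 5
    command[check_tungsten_latency]=/opt/continuent/tungsten/cluster-home/bin/check_tungsten_latency -w 2 -c 4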

Some tools are designed to help without Nagios:

  • tungsten_monitor - provides a mechanism for monitoring the cluster state when monitoring tools like Nagios aren't available. For example, here is a crontab entry to run the check once per hour:
    10 * * * * /opt/continuent/tungsten/cluster-home/bin/tungsten_monitor --from=you@yourcompany.com --to=group@yourcompany.com >/dev/null 2>&1
  • tungsten_health_check - checks the cluster against known best practices, typically used on a periodic basis manually to verify the cluster, often during a health check call with Continuent

Two of the tools are designed to be run all the time and alert every time they find an issue:

  • check_tungsten_services - if the Java processes are not running, neither is the cluster node!
  • check_tungsten_online - if the services are not in the ONLINE state, something is not as it should be and requires investigation
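
Both can also be spot-checked by hand on a cluster node. A minimal sketch, assuming the default /opt/continuent installation path; Nagios-style plugins conventionally exit 0 for OK, 1 for WARNING and 2 for CRITICAL:

    cd /opt/continuent/tungsten/cluster-home/bin
    ./check_tungsten_services ; echo "exit status: $?"
    ./check_tungsten_online   ; echo "exit status: $?"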

One of the tools is designed to be run all the time but to alert only outside of planned maintenance:

  • check_tungsten_policy - ensure the policy is AUTOMATIC, because a cluster in MAINTENANCE mode cannot react automatically to an outage
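
One simple way to get that behavior is to wrap the check in a script that looks for a maintenance flag file. The wrapper and flag path below are purely illustrative and not part of the bundled tooling:

    #!/bin/bash
    # Hypothetical wrapper: suppress the policy alert during planned maintenance.
    # Touch /var/tmp/tungsten_maintenance before maintenance and remove it afterwards.
    if [ -f /var/tmp/tungsten_maintenance ]; then
        echo "OK: planned maintenance window, policy check suppressed"
        exit 0
    fi
    exec /opt/continuent/tungsten/cluster-home/bin/check_tungsten_policy "$@"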

Two of the tools are designed to be tuned to match your environment:

  • check_tungsten_progress - tune the time period using `-t` (default: 1 second)
    Perhaps your cluster does not have many updates, and so this check would signal an error condition when none existed. For example, wait for five seconds for a write to occur:
    `check_tungsten_progress -t 5`
  • check_tungsten_latency - tune the specified Warning (-w) and Critical (-c) levels in seconds.
    In this case, both the warning and critical values are required. A well-conditioned cluster should show replication latency under one second, so specify values that would indicate a real issue in order to limit false positives. For example:
    `check_tungsten_latency -w 2 -c 4`

Component  | Test                 | Tool                                  | Built-In | Order | Tunable
-----------|----------------------|---------------------------------------|----------|-------|--------
Manager    | Running?             | check_tungsten_services               | Yes      | 1     | No
Manager    | Online?              | check_tungsten_online                 | Yes      | 2     | No
Manager    | Policy automatic?    | check_tungsten_policy                 | Yes      | 5     | No
Replicator | Running?             | check_tungsten_services               | Yes      | 1     | No
Replicator | Online?              | check_tungsten_online                 | Yes      | 2     | No
Replicator | Latency too high?    | check_tungsten_latency                | Yes      | 4     | Yes
Replicator | Progressing?         | check_tungsten_progress               | Yes      | 3     | Yes
Connector  | Running?             | check_tungsten_services               | Yes      | 1     | No
Connector  | Listening/reachable? | check_mysql or the client application | No       | Last  | -
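
The Order column above suggests a sensible sequence when checking a node by hand. Here is a minimal sketch, not part of the bundled tools, that runs the built-in checks in that order and stops at the first failure, assuming the default /opt/continuent installation path and the example thresholds used earlier:

    #!/bin/bash
    # Illustrative only: run the bundled checks in the order shown in the table above,
    # stopping at the first one that does not return OK (non-zero exit status).
    BIN=/opt/continuent/tungsten/cluster-home/bin
    for check in check_tungsten_services \
                 check_tungsten_online \
                 "check_tungsten_progress -t 5" \
                 "check_tungsten_latency -w 2 -c 4" \
                 check_tungsten_policy; do
        echo "== $check"
        if ! $BIN/$check; then
            echo "Check failed: $check" >&2
            exit 1
        fi
    done
    echo "All Tungsten checks passed"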

What Other Things Should I Watch?

Component   | Test
------------|---------------------------
Database    | Running?
Database    | Listening/reachable?
Database    | Errors
Database    | Resources
OS          | Running?
OS: CPU     | Utilization too high?
OS: Memory  | Enough free RAM?
OS: Network | I/O bandwidth free?
OS: Network | Errors?
OS: Network | Packet latency low enough?
OS: Disk    | Enough free space?
OS: Disk    | I/O bandwidth free?
OS          | Other resources?
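
Most of these fall under standard database and OS monitoring rather than the Tungsten tools themselves. As one illustration, a simple disk-space guard like the sketch below could run from cron alongside the Tungsten checks; the threshold and path are arbitrary, and GNU df is assumed:

    #!/bin/bash
    # Illustrative disk-space check: alert if the Tungsten filesystem is over 90% full.
    THRESHOLD=90
    USED=$(df --output=pcent /opt/continuent | tail -1 | tr -dc '0-9')
    if [ "$USED" -ge "$THRESHOLD" ]; then
        echo "CRITICAL: /opt/continuent filesystem is ${USED}% full"
        exit 2
    fi
    echo "OK: /opt/continuent filesystem is ${USED}% full"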

The Wrap-Up

Continuent provides multiple methods out of the box to monitor the cluster health.

We have described the built-in tools that allow you to monitor cluster operations, and how to tune those tools to minimize false positives.

Our documentation is extensive; please use the many links provided to explore these utilities in depth.

If you have questions or concerns, or need a hand implementing any of this in your environment, please reach out to Continuent Support and we will be happy to help!

Lastly, in our next post, we will cover the new Prometheus exporters included in version 7, due out later this year.

About the Author

Eric M. Stone
COO

Eric is a veteran of fast-paced, large-scale enterprise environments with 35 years of Information Technology experience. With a focus on HA/DR, from building data centers and trading floors to worldwide deployments, Eric has architected, coded, deployed and administered systems for a wide variety of disparate customers, from Fortune 500 financial institutions to SMBs.
