Monitoring Made Easy: Watching Your Tungsten Cluster Using Built-In Tools

Agenda

What's Here?

  • Summary - Briefly describe the bundled cluster monitoring tools and related documentation pages
  • Explore the thinking behind cluster monitoring
  • Describe the use-cases for key monitoring tools included with the Continuent Tungsten Clustering software
  • Examine the best practices for using each tool along with examples

Summary

The Short Version

All businesses strive for maximum uptime, and monitoring is key to uptime - if you don’t know that something is broken, you won’t know to fix it!

This blog post shows you the thinking behind each included Tungsten Cluster monitoring tool, and when to use which tool.

Continuent provides multiple methods out of the box to monitor the cluster health.

The most popular is the suite of Nagios/NRPE scripts (cluster-home/bin/check_tungsten_*).

We also have Zabbix scripts (cluster-home/bin/zabbix_tungsten_*).

Additionally, there are standalone scripts available like tungsten_monitor and tungsten_health_check, based upon the shared Ruby-based tpm libraries. We also include a very old shell script called check_tungsten.sh, but it is obsolete.
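
All of these tools live in the installed cluster-home/bin directory. A quick way to see what is bundled on a node (a minimal sketch, assuming the default /opt/continuent installation path used later in this post):

    ls /opt/continuent/tungsten/cluster-home/bin/ | grep -E 'check_tungsten|zabbix_tungsten|tungsten_monitor|tungsten_health_check'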

Resources To Guide You

We have Nagios-specific documentation to assist with configuration.

In addition to this post, we have other detailed blog posts describing how to implement the Nagios-based cluster monitoring solutions.

The Thinking Behind Monitoring

Pay Attention To The Man Behind The Curtain

Why monitor?

  • More Uptime - if you do not know it is broken, you cannot fix it
  • Less Downtime - costs money in terms of lost revenue, lost reputation and lost time
  • Better Reliability - the more you can watch, the faster you can react to problems and potentially make improvements to prevent them from happening again
  • Trending - be able to notice changes or trends in your system to predict issues

What things should I watch in my cluster?

  • Manager
  • Replicator
  • Connector
  • Database
  • OS
  • Hardware
  • Network

What should I look for?

  • Errors
  • Delays
  • Lack of operation
  • Wrong states
  • Unusual activity
  • Threshold Exceeded (too high or low as compared to desired norm)

Exploring the Bundled Cluster Monitoring Tools

What tools are provided to monitor the cluster?

There are five available Nagios/NRPE-based check scripts, and the online documentation for each is listed below:

  • check_tungsten_services - verify that the specified services are running, i.e. via the `ps` command
  • check_tungsten_online - verify that all services are in the ONLINE state, either for a single node (-n) or for all nodes (default), and you may specify a service name using `-s` in case you have more than one
  • check_tungsten_policy - verify that the dataservice policy for the cluster is AUTOMATIC
  • check_tungsten_progress - verify that the Replicator sequence number is increasing within a specific time period which you may specify using `-t` (default: 1 second)
  • check_tungsten_latency - verify that the current replication latency is below the specified Warning (-w) and Critical (-c) levels in seconds
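
If you expose these checks through Nagios/NRPE, each one becomes a command definition in nrpe.cfg. Below is a minimal sketch, assuming the default /opt/continuent installation path and the example thresholds discussed later in this post; adjust the paths, flags and thresholds to suit your environment:

    # Hypothetical nrpe.cfg entries -- paths and threshold values are examples only
    command[check_tungsten_services]=/opt/continuent/tungsten/cluster-home/bin/check_tungsten_services
    command[check_tungsten_online]=/opt/continuent/tungsten/cluster-home/bin/check_tungsten_online -n
    command[check_tungsten_policy]=/opt/continuent/tungsten/cluster-home/bin/check_tungsten_policy
    command[check_tungsten_progress]=/opt/continuent/tungsten/cluster-home/bin/check_tungsten_progress -t 5
    command[check_tungsten_latency]=/opt/continuent/tungsten/cluster-home/bin/check_tungsten_latency -w 2 -c 4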

Some tools are designed to help without Nagios:

  • tungsten_monitor - provides a mechanism for monitoring the cluster state when monitoring tools like Nagios aren't available. For example, here is a crontab entry to run the check once per hour:
    10 * * * * /opt/continuent/tungsten/cluster-home/bin/tungsten_monitor --from=you@yourcompany.com --to=group@yourcompany.com >/dev/null 2>&1
  • tungsten_health_check - checks the cluster against known best practices, typically used on a periodic basis manually to verify the cluster, often during a health check call with Continuent

Two of the tools are designed to be run all the time and alert every time they find an issue:

  • check_tungsten_services - if the Java processes are not running, neither is the cluster node!
  • check_tungsten_online - if the services are not in the ONLINE state, something is not as it should be and requires investigation
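
Both can also be spot-checked by hand on a cluster node. A minimal sketch, assuming the default /opt/continuent installation path; Nagios-style plugins conventionally exit 0 for OK, 1 for WARNING and 2 for CRITICAL:

    cd /opt/continuent/tungsten/cluster-home/bin
    ./check_tungsten_services ; echo "exit status: $?"
    ./check_tungsten_online   ; echo "exit status: $?"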

One of the tools is designed to be run all the time but to alert only outside of planned maintenance:

  • check_tungsten_policy - ensure the policy is AUTOMATIC, because a cluster in MAINTENANCE mode cannot react automatically to an outage
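
One simple way to get that behavior is to wrap the check in a script that looks for a maintenance flag file. The wrapper and flag path below are purely illustrative and not part of the bundled tooling:

    #!/bin/bash
    # Hypothetical wrapper: suppress the policy alert during planned maintenance.
    # Touch /var/tmp/tungsten_maintenance before maintenance and remove it afterwards.
    if [ -f /var/tmp/tungsten_maintenance ]; then
        echo "OK: planned maintenance window, policy check suppressed"
        exit 0
    fi
    exec /opt/continuent/tungsten/cluster-home/bin/check_tungsten_policy "$@"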

Two of the tools are designed to be tuned to match your environment:

  • check_tungsten_progress - tune the time period using `-t` (default: 1 second)
    Perhaps your cluster does not have many updates, and so this check would signal an error condition when none existed. For example, wait for five seconds for a write to occur:
    `check_tungsten_progress -t 5`
  • check_tungsten_latency - tune the specified Warning (-w) and Critical (-c) levels in seconds.
    In this case, both the warning and critical values are required. A well-conditioned cluster should show replication latency under one second, so specify values that would indicate a real issue in order to limit false positives. For example:
    `check_tungsten_latency -w 2 -c 4`

Component  | Test                 | Tool                                  | Built-In | Order | Tunable
-----------|----------------------|---------------------------------------|----------|-------|--------
Manager    | Running?             | check_tungsten_services               | Yes      | 1     | No
Manager    | Online?              | check_tungsten_online                 | Yes      | 2     | No
Manager    | Policy automatic?    | check_tungsten_policy                 | Yes      | 5     | No
Replicator | Running?             | check_tungsten_services               | Yes      | 1     | No
Replicator | Online?              | check_tungsten_online                 | Yes      | 2     | No
Replicator | Latency too high?    | check_tungsten_latency                | Yes      | 4     | Yes
Replicator | Progressing?         | check_tungsten_progress               | Yes      | 3     | Yes
Connector  | Running?             | check_tungsten_services               | Yes      | 1     | No
Connector  | Listening/reachable? | check_mysql or the client application | No       | Last  | -
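
The Order column above suggests a sensible sequence when checking a node by hand. Here is a minimal sketch, not part of the bundled tools, that runs the built-in checks in that order and stops at the first failure, assuming the default /opt/continuent installation path and the example thresholds used earlier:

    #!/bin/bash
    # Illustrative only: run the bundled checks in the order shown in the table above,
    # stopping at the first one that does not return OK (non-zero exit status).
    BIN=/opt/continuent/tungsten/cluster-home/bin
    for check in check_tungsten_services \
                 check_tungsten_online \
                 "check_tungsten_progress -t 5" \
                 "check_tungsten_latency -w 2 -c 4" \
                 check_tungsten_policy; do
        echo "== $check"
        if ! $BIN/$check; then
            echo "Check failed: $check" >&2
            exit 1
        fi
    done
    echo "All Tungsten checks passed"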

What Other Things Should I Watch?

Component   | Test
------------|---------------------------
Database    | Running?
Database    | Listening/reachable?
Database    | Errors
Database    | Resources
OS          | Running?
OS: CPU     | Utilization too high?
OS: Memory  | Enough free RAM?
OS: Network | I/O bandwidth free?
OS: Network | Errors?
OS: Network | Packet latency low enough?
OS: Disk    | Enough free space?
OS: Disk    | I/O bandwidth free?
OS          | Other resources?
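
Most of these fall under standard database and OS monitoring rather than the Tungsten tools themselves. As one illustration, a simple disk-space guard like the sketch below could run from cron alongside the Tungsten checks; the threshold and path are arbitrary, and GNU df is assumed:

    #!/bin/bash
    # Illustrative disk-space check: alert if the Tungsten filesystem is over 90% full.
    THRESHOLD=90
    USED=$(df --output=pcent /opt/continuent | tail -1 | tr -dc '0-9')
    if [ "$USED" -ge "$THRESHOLD" ]; then
        echo "CRITICAL: /opt/continuent filesystem is ${USED}% full"
        exit 2
    fi
    echo "OK: /opt/continuent filesystem is ${USED}% full"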

The Wrap-Up

Continuent provides multiple methods out of the box to monitor the cluster health.

We have described the built-in tools that allow you to monitor cluster operations, and how to tune those tools to minimize false positives.

Our documentation is extensive; please use the many links provided to explore these utilities in depth.

If you have questions or concerns, or need a hand implementing any of this in your environment, please reach out to Continuent Support and we will be happy to help!

Lastly, in our next post, we will cover the new Prometheus exporters included in version 7, due out later this year.

About the Author

Eric M. Stone
COO

Eric is a veteran of fast-paced, large-scale enterprise environments with 35 years of Information Technology experience. With a focus on HA/DR, from building data centers and trading floors to worldwide deployments, Eric has architected, coded, deployed and administered systems for a wide variety of disparate customers, from Fortune 500 financial institutions to SMBs.
