MySQL / MariaDB Clustering with Tungsten: Failover Focus

In this blog post, we discuss failover concepts and administration.

Failover Triggers

So when is a failover triggered, anyway? Believe it or not, only ONE condition causes an automatic failover in a Tungsten Cluster: loss of the TCP socket connection to the MySQL database server.

By design, no other condition triggers a failover by default.

This has occasionally caused confusion when other factors affect the MySQL server, such as the server becoming very slow, yet the cluster does not fail over.
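
To make the distinction concrete, here is a minimal Python sketch of a socket-level check. It is not Tungsten's actual probe; the function names and the local stand-in listener are our own assumptions for illustration. It shows that a server which accepts TCP connections but never answers still looks "alive" at the socket level, while a stopped server does not.

```python
import socket


def socket_is_alive(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def start_silent_listener():
    """Stand-in for a hung server: it listens but never replies to anything."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
    server.listen(1)
    return server


server = start_silent_listener()
port = server.getsockname()[1]

# The listener never sends a byte, yet the TCP connection itself succeeds.
hung_but_reachable = socket_is_alive("127.0.0.1", port)

# Once the listener is closed, the connection attempt is refused.
server.close()
stopped = socket_is_alive("127.0.0.1", port)

print("hung server reachable:", hung_but_reachable)
print("stopped server reachable:", stopped)
```

Only the second case corresponds to the kind of condition described above; the first is the slow or hung server that does not, by default, cause a failover.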

What if the cluster does not fail over and MySQL is misbehaving?

A recent support case illustrated this situation. The database server appeared to be hung, yet the socket was alive. The master was effectively in an error state, and the DBA needed to force a failover.

A cluster failover may be triggered manually via the command line or through the Tungsten Dashboard.

First, identify the hostname of the bad master:

shell> cctrl
cctrl> ls

Next, tell the cluster that the bad master has failed:

cctrl> datasource {BadMasterHostNameHere} fail
cctrl> ls

Lastly, if the cluster has not automatically triggered a failover at this point, tell it to do so manually:

cctrl> failover
cctrl> ls
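
If you want to keep the steps above in a repeatable form, they can be assembled into a small script. The Python sketch below only builds the command text; the helper names and the host name db1 are hypothetical, and the run_in_cctrl function rests on the assumption that cctrl is on the PATH and reads commands from standard input, so verify that in your environment before relying on it.

```python
import subprocess


def build_failover_script(bad_master, force_failover=True):
    """Return the cctrl commands from the manual procedure, one per line."""
    commands = ["ls", f"datasource {bad_master} fail", "ls"]
    if force_failover:
        # Only needed if the cluster has not already failed over on its own.
        commands += ["failover", "ls"]
    return "\n".join(commands) + "\n"


def run_in_cctrl(script):
    """ASSUMPTION: cctrl is on the PATH and accepts commands on stdin."""
    return subprocess.run(
        ["cctrl"], input=script, text=True, capture_output=True, check=False
    )


# "db1" is a hypothetical host name standing in for the bad master.
script = build_failover_script("db1")
print(script)
```

Review the generated commands before running them; run_in_cctrl is deliberately not called here.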

The Nitty Gritty

The cluster does monitor a wide variety of items, as reflected in the /opt/continuent/tungsten/cluster-home/conf/ file.

***IMPORTANT: Please do NOT hand-edit this file unless instructed by Continuent support. Any changes to this file WILL alter the cluster behavior and Continuent cannot predict cluster behavior if this file is modified by hand.

The mappings in this file determine how the Tungsten manager and its rules interpret the return status from the utilities that probe the state of a database server. These mappings, in turn, decide whether a particular probe return status triggers an action such as a failover, or merely updates the database server state with a different status.

For example, see the extract below. The only section that is configured to trigger a failover is socket_io_error because socket_io_error.action=fail:

# Status values that indicate the db server is definitively stopped
socket_io_error.state = stopped
socket_io_error.threshold = 0
socket_io_error.action = fail
# Status values that indicate some sort of timeout condition
socket_connect_timeout.state = timeout
#socket_connect_timeout.threshold = 5
#socket_connect_timeout.action = fail
socket_connect_timeout.threshold = -1
socket_connect_timeout.action = none
login_response_timeout.state = timeout
#login_response_timeout.threshold = 5
#login_response_timeout.action = fail
login_response_timeout.threshold = -1
login_response_timeout.action = none
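
One way to read an extract like this is to list every status whose active (uncommented) action is fail. The short Python sketch below does that for the lines shown above. The parsing rules are our own assumption about the key = value layout visible in the extract, not a description of how the Tungsten manager loads the file, and the script only reads text; it does not touch any cluster file.

```python
EXTRACT = """\
# Status values that indicate the db server is definitively stopped
socket_io_error.state = stopped
socket_io_error.threshold = 0
socket_io_error.action = fail
# Status values that indicate some sort of timeout condition
socket_connect_timeout.state = timeout
#socket_connect_timeout.threshold = 5
#socket_connect_timeout.action = fail
socket_connect_timeout.threshold = -1
socket_connect_timeout.action = none
login_response_timeout.state = timeout
#login_response_timeout.threshold = 5
#login_response_timeout.action = fail
login_response_timeout.threshold = -1
login_response_timeout.action = none
"""


def parse_mappings(text):
    """Parse 'status.field = value' lines into {status: {field: value}}."""
    mappings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and commented-out settings
        key, _, value = line.partition("=")
        status, _, field = key.strip().rpartition(".")
        mappings.setdefault(status, {})[field] = value.strip()
    return mappings


def failover_statuses(mappings):
    """Return the statuses whose active action is 'fail'."""
    return sorted(s for s, f in mappings.items() if f.get("action") == "fail")


mappings = parse_mappings(EXTRACT)
print(failover_statuses(mappings))
```

For the extract above, the only status reported is socket_io_error.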

If the login_response_timeout section were changed to look like the following, then a failover would also be triggered if the manager is no longer able to get a login response from the MySQL database server.

login_response_timeout.threshold = 5
login_response_timeout.action = fail
#login_response_timeout.threshold = -1
#login_response_timeout.action = none

The change must be made on all nodes, and the manager must be restarted on every node for the change to take effect.
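
Because the same edit has to land on every node, a quick consistency check can catch a node that was missed. The Python sketch below compares the active settings from per-node copies of the file and reports any key whose value differs. The node names db1, db2 and db3 are hypothetical, and the copies are plain strings here; how you collect the file contents from each node is left to you.

```python
def active_settings(text):
    """Return {key: value} for every uncommented 'key = value' line."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            key, _, value = line.partition("=")
            settings[key.strip()] = value.strip()
    return settings


def differing_keys(configs_by_node):
    """Return {key: {node: value}} for keys that are not identical on all nodes."""
    parsed = {node: active_settings(text) for node, text in configs_by_node.items()}
    all_keys = set().union(*parsed.values())
    drift = {}
    for key in sorted(all_keys):
        values = {node: settings.get(key) for node, settings in parsed.items()}
        if len(set(values.values())) > 1:
            drift[key] = values
    return drift


EDITED = """\
login_response_timeout.threshold = 5
login_response_timeout.action = fail
#login_response_timeout.threshold = -1
#login_response_timeout.action = none
"""

UNEDITED = """\
login_response_timeout.threshold = -1
login_response_timeout.action = none
"""

# Hypothetical three-node cluster where db3 was missed.
drift = differing_keys({"db1": EDITED, "db2": EDITED, "db3": UNEDITED})
for key, values in drift.items():
    print(key, values)
```

An empty result means the active settings match on every node that was checked.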

In future articles, we will continue to cover more advanced subjects of interest!

Questions? Contact Continuent

About the Author

Eric M. Stone
COO and VP of Product Management

Eric is a veteran of fast-paced, large-scale enterprise environments with 35 years of Information Technology experience. With a focus on HA/DR, from building data centers and trading floors to world-wide deployments, Eric has architected, coded, deployed and administered systems for a wide variety of disparate customers, from Fortune 500 financial institutions to SMBs.
