
What is the Best Way to Check the Health of a Tungsten Cluster Before a Switch?

The Question

Recently, a customer asked us:

What would cause a node switch to fail in a Tungsten Cluster?

For example, we saw the following during a recent session where a switch failed:

cctrl> switch to db3 
 
SELECTED SLAVE: db3@alpha 
SET POLICY: MAINTENANCE => MAINTENANCE 
PURGE REMAINING ACTIVE SESSIONS ON CURRENT MASTER 'db1@alpha'
PURGED A TOTAL OF 0 ACTIVE SESSIONS ON MASTER 'db1@alpha'
FLUSH TRANSACTIONS ON CURRENT MASTER 'db1@alpha'
Exception encountered during SWITCH. 
Failed while setting the replicator 'db1' role to 'slave'
ClusterManagerException: Exception while executing command 'replicatorStatus' on manager 'db1'
Exception=Failed to execute '/alpha/db1/manager/ClusterManagementHelper/replicatorStatus alpha db3'
Reason= 
CLUSTER_MEMBER(true) 
STATUS(FAIL) 
+----------------------------------------------------------------------------+ 
|alpha | 
+----------------------------------------------------------------------------+ 
|Handler Exception: SYSTEM | 
|Cause:Exception | 
|Message:javax.management.MBeanException: MANAGER | 
|CLUSTER_MEMBER(true) | 
|STATUS(FAIL) | 
|Exception: ConnectionException | 
|Message: getResponseQueue():No response queue found for id: 1552059204364 | 

The Answer

In short, the Tungsten Manager is either unable to communicate with a remote resource or has insufficient memory.

Here are some possibilities to consider:

  • Network blockage - if the Manager is unable to communicate with a target resource (e.g. a Replicator or another Manager), then the above error will occur; see the connectivity sketch just below this list
  • Manager tuning - if restarting the Manager on all nodes clears the issue, then the Manager is most likely starved for resources
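
As a quick first check, basic node-to-node reachability can be verified from the OS shell. Below is a minimal sketch, assuming the default Manager group-communication port 7800 (visible in the members output further down) and that the Tungsten environment has been sourced so the manager wrapper script is on the PATH:

  shell> # from db1, confirm the Manager port is reachable on the other nodes
  shell> nc -zv db2 7800
  shell> nc -zv db3 7800

  shell> # to rule out resource starvation, restart the Manager on this node,
  shell> # then repeat on every other node, one at a time
  shell> manager restart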

The Solution

So what may be done to alleviate the problem?

  • Manager tuning - earlier versions of Tungsten Clustering did not allocate sufficient memory to the Manager's JVM, so make the following three configuration changes via tpm update (see the example below):
    • mgr-heap-threshold=200
    • property=wrapper.java.initmemory=80
    • mgr-java-mem-size=250
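
    For example, with an INI-based installation these settings could be added to the [defaults] section of /etc/tungsten/tungsten.ini and applied with tpm update. This is a sketch only; the path assumes a standard INI install, and the values above should be adjusted to fit the available RAM:

    shell> cat /etc/tungsten/tungsten.ini
    [defaults]
    ...
    mgr-heap-threshold=200
    property=wrapper.java.initmemory=80
    mgr-java-mem-size=250

    shell> tpm update
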
  • Network blockage - make sure the replicators are all online and caught up (a trepctl sketch follows the examples below), and check the Manager's view of the cluster using the following commands on every node:
    cctrl> ls
    cctrl> members
    cctrl> ping
    cctrl> ls resources 
    cctrl> cluster validate 
    cctrl> show alarms

    Here are example outputs from a healthy cluster:

    tungsten@db1:/home/tungsten # cctrl
     
    [LOGICAL] /east > members
    east/db1(ONLINE)/10.0.0.126:7800
    east/db2(ONLINE)/10.0.0.185:7800
    east/db3(ONLINE)/10.0.0.7:7800
     
     
    [LOGICAL] /east > ping
    NETWORK CONNECTIVITY: PING TIMEOUT=2
    NETWORK CONNECTIVITY: CHECKING MY OWN ('db1') CONNECTIVITY
    HOST db1/10.0.0.126: ALIVE
    (ping) result: true, duration: 0.01s, notes: ping -c 1 -w 2 10.0.0.126
    NETWORK CONNECTIVITY: CHECKING CLUSTER MEMBER 'db2'
    HOST db2/10.0.0.185: ALIVE
    (ping) result: true, duration: 0.00s, notes: ping -c 1 -w 2 10.0.0.185
    NETWORK CONNECTIVITY: CHECKING CLUSTER MEMBER 'db3'
    HOST db3/10.0.0.7: ALIVE
    (ping) result: true, duration: 0.01s, notes: ping -c 1 -w 2 10.0.0.7
     
     
    [LOGICAL] /east > cluster validate 
    ========================================================================
    CHECKING FOR QUORUM: MUST BE AT LEAST 2 DB MEMBERS
    QUORUM SET MEMBERS ARE: db1, db3, db2
    SIMPLE MAJORITY SIZE: 2
    GC VIEW OF CURRENT DB MEMBERS IS: db1, db2, db3
    VALIDATED DB MEMBERS ARE: db1, db3, db2
    REACHABLE DB MEMBERS ARE: db1, db3, db2
    ========================================================================
    MEMBERSHIP IS VALID BASED ON VIEW/VALIDATED CONSOLIDATED MEMBERS CONSISTENCY
    CONCLUSION: I AM IN A PRIMARY PARTITION OF 3 DB MEMBERS OUT OF THE REQUIRED MAJORITY OF 2
    VALIDATION STATUS=VALID CLUSTER
    ACTION=NONE
     
     
    [LOGICAL] /east > ls resources 
    +----------------------------------------------------------------------------+
    |RESOURCES                                                                   |
    +----------------------------------------------------------------------------+
    |                db1:DATASERVER:    ONLINE                                   |
    |                   db1:MANAGER:    ONLINE                                   |
    |                    db1:MEMBER:    ONLINE                                   |
    |                db1:REPLICATOR:    ONLINE                                   |
    |                db2:DATASERVER:    ONLINE                                   |
    |                   db2:MANAGER:    ONLINE                                   |
    |                    db2:MEMBER:    ONLINE                                   |
    |                db2:REPLICATOR:    ONLINE                                   |
    |                db3:DATASERVER:    ONLINE                                   |
    |                   db3:MANAGER:    ONLINE                                   |
    |                    db3:MEMBER:    ONLINE                                   |
    |                db3:REPLICATOR:    ONLINE                                   |
    |                  west:CLUSTER:    ONLINE                                   |
    +----------------------------------------------------------------------------+
     
     
    [LOGICAL] /east > show alarms
    +----------------------------------------------------------------------------+
    |ALARMS                                                                      |
    +----------------------------------------------------------------------------+
    |                                                                            |
    +----------------------------------------------------------------------------+
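
    Finally, to confirm that the replicators themselves are online and caught up, trepctl may be used from the OS shell on each node. The output below is a minimal sketch; the sequence number and latency values are illustrative:

    shell> trepctl services
    Processing services command...
    NAME              VALUE
    ----              -----
    appliedLastSeqno: 4156
    appliedLatency  : 0.521
    role            : slave
    serviceName     : east
    serviceType     : local
    started         : true
    state           : ONLINE
    Finished services command...

    A state of ONLINE with a low appliedLatency indicates that the replicator is healthy and caught up.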

Summary

The Wrap-Up

In this blog post, we discussed what can cause a node switch to fail in a Tungsten Cluster and what may be done about it.

To learn about Continuent solutions in general, check out https://www.continuent.com/solutions

The Library

Please read the docs!

For more information about Tungsten clusters, please visit https://docs.continuent.com

Tungsten Clustering is the most flexible, performant global database layer available today. Use it under your SaaS offering as a strong base upon which to grow your worldwide business!

For more information, please visit https://www.continuent.com/solutions

Want to learn more or run a POC? Contact us

About the Author

Eric M. Stone
COO

Eric is a veteran of fast-paced, large-scale enterprise environments with 35 years of Information Technology experience. With a focus on HA/DR, from building data centers and trading floors to world-wide deployments, Eric has architected, coded, deployed and administered systems for a wide variety of disparate customers, from Fortune 500 financial institutions to SMBs.
