
What is the Best Way to Check the Health of a Tungsten Cluster Before a Switch?

The Question

Recently, a customer asked us:

What would cause a node switch to fail in a Tungsten Cluster?

For example, we saw the following during a recent session where a switch failed:

cctrl> switch to db3 
 
SELECTED SLAVE: db3@alpha 
SET POLICY: MAINTENANCE => MAINTENANCE 
PURGE REMAINING ACTIVE SESSIONS ON CURRENT MASTER 'db1@alpha'
PURGED A TOTAL OF 0 ACTIVE SESSIONS ON MASTER 'db1@alpha'
FLUSH TRANSACTIONS ON CURRENT MASTER 'db1@alpha'
Exception encountered during SWITCH. 
Failed while setting the replicator 'db1' role to 'slave'
ClusterManagerException: Exception while executing command 'replicatorStatus' on manager 'db1'
Exception=Failed to execute '/alpha/db1/manager/ClusterManagementHelper/replicatorStatus alpha db3'
Reason= 
CLUSTER_MEMBER(true) 
STATUS(FAIL) 
+----------------------------------------------------------------------------+ 
|alpha | 
+----------------------------------------------------------------------------+ 
|Handler Exception: SYSTEM | 
|Cause:Exception | 
|Message:javax.management.MBeanException: MANAGER | 
|CLUSTER_MEMBER(true) | 
|STATUS(FAIL) | 
|Exception: ConnectionException | 
|Message: getResponseQueue():No response queue found for id: 1552059204364 | 

The Answer

In short, the Tungsten Manager is either unable to communicate with a remote resource or has insufficient memory.

Here are some possibilities to consider:

  • Network blockage - if the Manager is unable to communicate with a target resource (e.g. a Replicator or another Manager), then the above error will occur; see the connectivity sketch just below this list
  • Manager tuning - if restarting the Manager on all nodes clears the issue, then the Manager is most likely starved for resources
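
As a quick first check, basic node-to-node reachability can be verified from the OS shell. Below is a minimal sketch, assuming the default Manager group-communication port 7800 (visible in the members output further down) and that the Tungsten environment has been sourced so the manager wrapper script is on the PATH:

  shell> # from db1, confirm the Manager port is reachable on the other nodes
  shell> nc -zv db2 7800
  shell> nc -zv db3 7800

  shell> # to rule out resource starvation, restart the Manager on this node,
  shell> # then repeat on every other node, one at a time
  shell> manager restart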

The Solution

So what may be done to alleviate the problem?

  • Manager tuning - earlier versions of Tungsten Clustering did not allocate sufficient memory to the Manager's JVM, so make the following three configuration changes via tpm update (see the example below):
    • mgr-heap-threshold=200
    • property=wrapper.java.initmemory=80
    • mgr-java-mem-size=250
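
    For example, with an INI-based installation these settings could be added to the [defaults] section of /etc/tungsten/tungsten.ini and applied with tpm update. This is a sketch only; the path assumes a standard INI install, and the values above should be adjusted to fit the available RAM:

    shell> cat /etc/tungsten/tungsten.ini
    [defaults]
    ...
    mgr-heap-threshold=200
    property=wrapper.java.initmemory=80
    mgr-java-mem-size=250

    shell> tpm update
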
  • Network blockage - make sure the replicators are all online and caught up (a trepctl sketch follows the examples below), and check the Manager's view of the cluster using the following commands on every node:
    cctrl> ls
    cctrl> members
    cctrl> ping
    cctrl> ls resources 
    cctrl> cluster validate 
    cctrl> show alarms

    Here are example outputs from a healthy cluster:

    tungsten@db1:/home/tungsten # cctrl
     
    [LOGICAL] /east > members
    east/db1(ONLINE)/10.0.0.126:7800
    east/db2(ONLINE)/10.0.0.185:7800
    east/db3(ONLINE)/10.0.0.7:7800
     
     
    [LOGICAL] /east > ping
    NETWORK CONNECTIVITY: PING TIMEOUT=2
    NETWORK CONNECTIVITY: CHECKING MY OWN ('db1') CONNECTIVITY
    HOST db1/10.0.0.126: ALIVE
    (ping) result: true, duration: 0.01s, notes: ping -c 1 -w 2 10.0.0.126
    NETWORK CONNECTIVITY: CHECKING CLUSTER MEMBER 'db2'
    HOST db2/10.0.0.185: ALIVE
    (ping) result: true, duration: 0.00s, notes: ping -c 1 -w 2 10.0.0.185
    NETWORK CONNECTIVITY: CHECKING CLUSTER MEMBER 'db3'
    HOST db3/10.0.0.7: ALIVE
    (ping) result: true, duration: 0.01s, notes: ping -c 1 -w 2 10.0.0.7
     
     
    [LOGICAL] /east > cluster validate 
    ========================================================================
    CHECKING FOR QUORUM: MUST BE AT LEAST 2 DB MEMBERS
    QUORUM SET MEMBERS ARE: db1, db3, db2
    SIMPLE MAJORITY SIZE: 2
    GC VIEW OF CURRENT DB MEMBERS IS: db1, db2, db3
    VALIDATED DB MEMBERS ARE: db1, db3, db2
    REACHABLE DB MEMBERS ARE: db1, db3, db2
    ========================================================================
    MEMBERSHIP IS VALID BASED ON VIEW/VALIDATED CONSOLIDATED MEMBERS CONSISTENCY
    CONCLUSION: I AM IN A PRIMARY PARTITION OF 3 DB MEMBERS OUT OF THE REQUIRED MAJORITY OF 2
    VALIDATION STATUS=VALID CLUSTER
    ACTION=NONE
     
     
    [LOGICAL] /east > ls resources 
    +----------------------------------------------------------------------------+
    |RESOURCES                                                                   |
    +----------------------------------------------------------------------------+
    |                db1:DATASERVER:    ONLINE                                   |
    |                   db1:MANAGER:    ONLINE                                   |
    |                    db1:MEMBER:    ONLINE                                   |
    |                db1:REPLICATOR:    ONLINE                                   |
    |                db2:DATASERVER:    ONLINE                                   |
    |                   db2:MANAGER:    ONLINE                                   |
    |                    db2:MEMBER:    ONLINE                                   |
    |                db2:REPLICATOR:    ONLINE                                   |
    |                db3:DATASERVER:    ONLINE                                   |
    |                   db3:MANAGER:    ONLINE                                   |
    |                    db3:MEMBER:    ONLINE                                   |
    |                db3:REPLICATOR:    ONLINE                                   |
    |                  west:CLUSTER:    ONLINE                                   |
    +----------------------------------------------------------------------------+
     
     
    [LOGICAL] /east > show alarms
    +----------------------------------------------------------------------------+
    |ALARMS                                                                      |
    +----------------------------------------------------------------------------+
    |                                                                            |
    +----------------------------------------------------------------------------+
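
    Finally, to confirm that the replicators themselves are online and caught up, trepctl may be used from the OS shell on each node. The output below is a minimal sketch; the sequence number and latency values are illustrative:

    shell> trepctl services
    Processing services command...
    NAME              VALUE
    ----              -----
    appliedLastSeqno: 4156
    appliedLatency  : 0.521
    role            : slave
    serviceName     : east
    serviceType     : local
    started         : true
    state           : ONLINE
    Finished services command...

    A state of ONLINE with a low appliedLatency indicates that the replicator is healthy and caught up.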

Summary

The Wrap-Up

In this blog post, we discussed what can cause a node switch to fail in a Tungsten Cluster and what may be done about it.

To learn about Continuent solutions in general, check out https://www.continuent.com/solutions

The Library

Please read the docs!

For more information about Tungsten clusters, please visit https://docs.continuent.com

Tungsten Clustering is the most flexible, performant global database layer available today. Use it under your SaaS offering as a strong base upon which to grow your worldwide business!

For more information, please visit https://www.continuent.com/solutions

Want to learn more or run a POC? Contact us

About the Author

Eric M. Stone
COO

Eric is a veteran of fast-paced, large-scale enterprise environments with 35 years of Information Technology experience. With a focus on HA/DR, from building data centers and trading floors to world-wide deployments, Eric has architected, coded, deployed and administered systems for a wide variety of disparate customers, from Fortune 500 financial institutions to SMBs.
