Tungsten Clustering: Plugging the Holes - Risk Mitigation Through Best Practices

Introduction

Tungsten Clustering depends on a number of prerequisites and best practices to function optimally.

In this blog post, we explore a critical yet easily overlooked step when installing a Tungsten Cluster node: setting up start at boot, ideally under `systemd` control.

For a Tungsten Cluster to function properly, start-at-boot / stop-at-shutdown must be configured using deployall.

Tungsten Clustering relies on a voting quorum, so a node that is not configured to start at boot can badly degrade the cluster. If the Managers cannot form a majority of the quorum, even failover is at risk. As an example, imagine a three-node cluster where NO start-at-boot support has been deployed. If the first node reboots, no Tungsten service will run on it after the reboot. If a second node then restarts, the third node is no longer part of a majority of the quorum, so it shuns itself and the cluster is left in a shunned state. With start-at-boot support in place, each rebooted node brings its Manager back up on its own, so at least 2 Managers are up and running and failover can happen cleanly.
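
The majority rule behind this example can be sketched in a few lines of shell. This only illustrates the arithmetic for a three-node cluster; it is not how the Tungsten Manager itself evaluates quorum, and the function name has_quorum is hypothetical.

```shell
#!/bin/sh
# Illustrative only: a strict-majority check, NOT Tungsten's actual logic.
# has_quorum TOTAL UP -> prints "yes" if UP managers are a majority of TOTAL.
has_quorum() {
    total=$1
    up=$2
    if [ $((up * 2)) -gt "$total" ]; then
        echo yes
    else
        echo no
    fi
}

has_quorum 3 3   # all three managers running
has_quorum 3 2   # one node down
has_quorum 3 1   # two nodes rebooted and their managers never came back
```

With three nodes, losing a second Manager is what drops the cluster below a majority, which is why every node needs to restart its services at boot.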

The Question

Recently a customer asked us:

“What caused the failover to hang for a long time after a GCP virtual power-off was invoked?”

Plug The Hole: Root Cause

Tungsten processes (specifically the Tungsten Manager) were NOT under systemd control.

Tell Me More

This is a corner case where the Coordinator is also the Primary node, and that node is shut down.

When the Coordinator and Primary are the same node and Tungsten is NOT stopped by systemd during the power-off sequence, the MySQL Server is stopped while the Tungsten Manager is still running. The Manager then invokes a failover before the power-down completes. The power is then cut, and the failover never completes because that node was the active Coordinator and is now dead.

The Fine Print

There is a difference between a graceful power-down signal and an instant power-off/dirty fail.

Tungsten Cluster WILL fail over after an instant power failure of the Primary, even if that node was also the Coordinator, because:

the Manager acting as Coordinator would not have time to take any action, due to the instant power-off
the other two Managers on the remaining nodes would notice the missing Coordinator and elect a replacement.

When a GCP virtual power-off is invoked, the Linux systemd power-down sequence will gracefully shut down processes in the reverse order that they were started up.

As a result, we would expect the Tungsten processes to be stopped BEFORE the MySQL Server process when under systemd control.

What happened to cause the long delay was that the Tungsten processes were NOT under systemd control, so they were NOT STOPPED as part of the systemd graceful power-down process.

This allowed the Manager acting as Coordinator to begin a failover that never got to complete, because the Manager was killed by the power-off partway through.

The remaining Managers then had to wait out a lengthy timeout, because the Coordinator simply vanished when the power went down.
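
A quick way to see whether the Tungsten components are registered with systemd on a node is to ask systemctl about each unit. The sketch below assumes the unit names treplicator, tmanager and tconnector, which are the names used by the systemctl commands in this post; the function name check_tungsten_units is hypothetical.

```shell
#!/bin/sh
# Print "<unit>: <state>" for each Tungsten unit.
# Unit names are assumed from the systemctl commands shown in this post.
check_tungsten_units() {
    for unit in treplicator tmanager tconnector; do
        if ! command -v systemctl >/dev/null 2>&1; then
            echo "$unit: systemctl not available"
            continue
        fi
        # is-enabled prints the state (for example "enabled" or "disabled");
        # if it prints nothing, report the state as unknown.
        state=$(systemctl is-enabled "$unit" 2>/dev/null)
        echo "$unit: ${state:-unknown}"
    done
}

check_tungsten_units
```

Any unit that does not report as enabled will not be stopped by systemd during a graceful power-down, which is the situation described above.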

Plug The Hole: Solutions

The solution is to make the Tungsten Cluster start at boot and stop at shutdown using systemd or init via the deployall tool.

The deployall script will automatically detect the initialization system in use (systemd or init) and prefer systemd when both are available.

By default, the deployall script must be run manually to enable start-at-boot/stop-at-shutdown.

To automatically execute the deployall script at installation time, add the install=true tpm option to your configuration.

The online documentation for deployall may be found here:

https://docs.continuent.com/tungsten-clustering-7.0/cmdline-tools-deployall.html

Java Environment

Since systemd will start services using sudo, java needs to be accessible to the root user. Please ensure that the java environment is correct under sudo access.

If you downloaded and extracted a java tarball somewhere, then you will need the following update-alternatives --install command to register the location. For example, if you extracted the tarball under directory /opt/jre1.8.0_312/, then your command might look something like this:

shell> sudo update-alternatives --install /usr/bin/java java /opt/jre1.8.0_312/bin/java 20

Next, confirm that there is a selected java using update-alternatives --config like this:

shell> sudo update-alternatives --config java

There is 1 program that provides 'java'.

  Selection    Command
-----------------------------------------------
*+ 1           /usr/lib/jvm/jre-1.8.0-openjdk.x86_64/bin/java

Enter to keep the current selection[+], or type selection number:

Lastly, confirm the user environment is healthy for both root and the tungsten OS user:

tungsten@db7-demo:/home/tungsten # sudo which java
/usr/bin/java

tungsten@db7-demo:/home/tungsten # sudo java -version
openjdk version "1.8.0_312"
OpenJDK Runtime Environment (build 1.8.0_312-b07)
OpenJDK 64-Bit Server VM (build 25.312-b07, mixed mode)

tungsten@db7-demo:/home/tungsten # which java
/usr/bin/java

tungsten@db7-demo:/home/tungsten # java -version
openjdk version "1.8.0_312"
OpenJDK Runtime Environment (build 1.8.0_312-b07)
OpenJDK 64-Bit Server VM (build 25.312-b07, mixed mode)
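
The manual comparison above can be wrapped in a small helper. This is a minimal sketch: the function name compare_java_paths is hypothetical, and a mismatch is only a hint to look closer, since sudo commonly resets PATH and may legitimately resolve a different java.

```shell
#!/bin/sh
# compare_java_paths USER_PATH ROOT_PATH
# Reports whether both users resolve java, and whether the paths agree.
compare_java_paths() {
    if [ -z "$1" ] || [ -z "$2" ]; then
        echo "java missing for at least one user"
        return 1
    fi
    if [ "$1" = "$2" ]; then
        echo "OK: both resolve to $1"
    else
        echo "MISMATCH: user=$1 root=$2"
        return 1
    fi
}

# Typical use on a cluster node (requires sudo rights):
#   compare_java_paths "$(command -v java)" "$(sudo sh -c 'command -v java')"
compare_java_paths /usr/bin/java /usr/bin/java
```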

Cluster Start At Boot

When installing a new cluster, the tpm tungsten.ini option install=true will automatically install the services and start them using systemd or init.

When updating a running cluster, the following steps are needed to properly install the services, depending on the method in use:

Using init

When using the older init method of configuring start-at-boot/stop-at-shutdown, there is just a single command to run:

shell> deployall

Using systemd

When using the modern systemd method of configuring start-at-boot/stop-at-shutdown, there are potentially multiple steps to run, especially if the cluster is already up and running.

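For continuity-of-service reasons, the deployall script will NOT restart individual components that were already started by other methods. Each such component must be stopped and then started again under systemd, with the cluster in maintenance policy while you do so.

For example: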

shell> cctrl
cctrl> set policy maintenance
cctrl> exit

shell> deployall
shell> /opt/continuent/tungsten/tungsten-replicator/bin/replicator stop sysd
shell> sudo systemctl start treplicator

shell> /opt/continuent/tungsten/tungsten-manager/bin/manager stop sysd
shell> sudo systemctl start tmanager

shell> /opt/continuent/tungsten/tungsten-connector/bin/connector stop sysd
shell> sudo systemctl start tconnector

shell> cctrl
cctrl> set policy automatic
cctrl> exit
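
The per-component stop/start pairs above follow one pattern, so they can be sketched as a loop. This is an illustration rather than a supported tool: the run, migrate_component and migrate_all helpers are hypothetical, DRY_RUN defaults to 1 so nothing is executed unless you opt in, and the cctrl policy changes and the deployall call are deliberately left as manual steps.

```shell
#!/bin/sh
# Sketch of the stop/start sequence shown above.
# BASE is the install path used in this post; adjust it for your environment.
BASE=/opt/continuent/tungsten
DRY_RUN=${DRY_RUN:-1}

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# $1 = component name: replicator, manager or connector
migrate_component() {
    run "$BASE/tungsten-$1/bin/$1" stop sysd
    run sudo systemctl start "t$1"
}

migrate_all() {
    for c in replicator manager connector; do
        migrate_component "$c"
    done
}

# Set the policy to maintenance and run deployall first (manual steps),
# then return the policy to automatic once all components are back up.
migrate_all
```

Run it once with the default DRY_RUN=1 to review the six commands it would issue before setting DRY_RUN=0.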

Removing Cluster Start At Boot

To remove the boot scripts from the system, use the undeployall command:

shell> undeployall

Wrap-Up

In this post we explored a critical yet easily overlooked step when installing a Tungsten Cluster node: setting up start at boot and stop at shutdown, under either init or systemd control.

For a Tungsten Cluster to function properly, start-at-boot / stop-at-shutdown must be configured using deployall.

Smooth sailing!

Published In

Categories:

Cluster Management, Database Administration, Monitoring and Observability

Series:

MySQL High Availability (HA) & Disaster Recovery (DR)

Tags:

MySQL, MariaDB, systemd, installation, best practices

Author

Eric M. Stone

COO and VP of Product Management

Eric is a veteran of fast-paced, large-scale enterprise environments with 40 years of Information Technology experience. With a focus on HA/DR, from building data centers and trading floors to world-wide deployments, Eric has architected, coded, deployed and administered systems for a wide variety of disparate customers, from Fortune 500 financial institutions to SMBs.
