Tungsten Clustering depends on a number of prerequisites and best practices to function optimally.
In this blog post, we explore a critical, yet easily-overlooked step when installing a Tungsten Cluster node - setting up start at boot, ideally under `systemd` control.
To ensure proper functioning of a Tungsten Cluster, please ensure that start-at-boot / stop-at-shutdown has been configured using
Tungsten Clustering relies upon a voting quorum, so a node that is not configured to start at boot can badly impact cluster functionality. If the Managers cannot form a majority of the quorum, then even failover is at risk. As an example, imagine a three-node cluster where NO start-at-boot support has been deployed. If the first node reboots, no Tungsten services will run on it after the reboot. If the second node then restarts, the cluster will be in a shunned state, because the third node alone cannot form a majority of the quorum and will shun itself. If start-at-boot support is in place, rebooted nodes bring their Managers back automatically, so at least two Managers are up and running and failover can happen cleanly.
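The majority arithmetic behind the three-node example above can be sketched in a few lines of shell. This is only an illustration of a simple strict-majority rule; the function names `majority` and `has_quorum` are hypothetical, are not part of Tungsten, and the real Manager logic considers more than a head count.

```shell
#!/bin/sh
# Illustrative only: strict-majority arithmetic for a manager quorum.
# The function names are hypothetical and not part of Tungsten.

# Smallest number of managers that forms a strict majority of $1 members.
majority() {
    echo $(( $1 / 2 + 1 ))
}

# Succeeds (exit 0) when $2 running managers out of $1 form a majority.
has_quorum() {
    [ "$2" -ge "$(majority "$1")" ]
}

# Walk through the three-node scenario: all up, one down, two down.
for up in 3 2 1; do
    if has_quorum 3 "$up"; then
        echo "$up of 3 managers up: majority"
    else
        echo "$up of 3 managers up: no majority"
    fi
done
```

With two of three Managers running, the majority holds; with only one, it does not, which is the situation the lone third node finds itself in.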
Recently a customer asked us:
“What caused the failover to hang for a long time after a GCP virtual power-off was invoked?”
Plug The Hole: Root Cause
Tungsten processes (specifically the Tungsten Manager) were NOT under systemd control.
Tell Me More
This is a corner case where the coordinator is the primary node, and the node is shut down.
When the Coordinator and Primary are the same node, and Tungsten is NOT stopped by systemd during the power-off sequence, the MySQL Server is stopped while the Tungsten Manager remains running. The Manager then invokes the failover before the power-down completes. When the power is finally cut, the failover never completes, because that node was the active Coordinator and it is now dead.
The Fine Print
There is a difference between a graceful power-down signal and an instant power-off/dirty fail.
Tungsten Cluster WILL fail over in the event of a Primary instant power fail even if it was the COORDINATOR because:
- the Manager as Coordinator would not have time to take any action due to the instant power-off
- the other two Managers on the remaining nodes would notice the missing Coordinator and elect a replacement.
When a GCP virtual poweroff is invoked, the Linux systemd power-down sequence will gracefully shut down processes in the reverse order that they were started up.
As a result, we would expect the Tungsten processes to be stopped BEFORE the MySQL Server process when under systemd control.
What happened to cause the long delay was that the Tungsten processes were NOT under systemd control, so they were NOT STOPPED as part of the systemd graceful power-down process.
This allowed the Manager as Coordinator to begin a fail over that never got to complete, because it was stopped by the power-off in the middle.
The remaining Managers have a lengthy timeout to process because the Coordinator simply vanished due to the power down.
Plug The Hole: Solutions
The solution is to make the Tungsten Cluster start at boot and stop at shutdown using systemd or init via the deployall script.
The deployall script will automatically detect the initialization system in use (systemd or init) and prefer systemd when both are available.
By default, the deployall script must be run manually to enable start-at-boot/stop-at-shutdown.
To automatically execute the deployall script at installation time, add the install=true tpm option to your configuration.
The online documentation for deployall may be found here:
Since systemd will start services using sudo, java needs to be accessible to the root user. Please ensure that the java environment is correct under sudo access.
If you downloaded and extracted a java tarball somewhere, then you will need the following update-alternatives --install command to register the location. For example, if you extracted the tarball under directory /opt/jre1.8.0_312/, then your command might look something like this:
shell> sudo update-alternatives --install /usr/bin/java java /opt/jre1.8.0_312/bin/java 20
Next, confirm that there is a selected java using update-alternatives --config like this:
shell> sudo update-alternatives --config java

There is 1 program that provides 'java'.

  Selection    Command
-----------------------------------------------
*+ 1           /usr/lib/jvm/jre-1.8.0-openjdk.x86_64/bin/java

Enter to keep the current selection[+], or type selection number:
Lastly, confirm the user environment is healthy for both root and the tungsten OS user:
tungsten@db7-demo:/home/tungsten # sudo which java
/usr/bin/java
tungsten@db7-demo:/home/tungsten # sudo java -version
openjdk version "1.8.0_312"
OpenJDK Runtime Environment (build 1.8.0_312-b07)
OpenJDK 64-Bit Server VM (build 25.312-b07, mixed mode)
tungsten@db7-demo:/home/tungsten # which java
/usr/bin/java
tungsten@db7-demo:/home/tungsten # java -version
openjdk version "1.8.0_312"
OpenJDK Runtime Environment (build 1.8.0_312-b07)
OpenJDK 64-Bit Server VM (build 25.312-b07, mixed mode)
Cluster Start At Boot
When installing a new cluster, the tpm tungsten.ini flag install=true will automatically install services and start them with the systemd or initd command.
When updating a running cluster, the following steps are needed to properly install the services, depending on the method in use:
When using the older init method of configuring start-at-boot/stop-at-shutdown, there is just a single command to run:
When using the modern systemd method of configuring start-at-boot/stop-at-shutdown, there are potentially multiple steps to run, especially if the cluster is already up and running.
For continuity-of-service reasons, the deployall script will NOT restart individual components if they had already been previously started by other methods.
shell> cctrl
cctrl> set policy maintenance
cctrl> exit
shell> deployall
shell> /opt/continuent/tungsten/tungsten-replicator/bin/replicator stop sysd
shell> sudo systemctl start treplicator
shell> /opt/continuent/tungsten/tungsten-manager/bin/manager stop sysd
shell> sudo systemctl start tmanager
shell> /opt/continuent/tungsten/tungsten-connector/bin/connector stop sysd
shell> sudo systemctl start tconnector
shell> cctrl
cctrl> set policy automatic
cctrl> exit
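After a procedure like the one above, it may be worth confirming that each service is both enabled at boot and currently running under systemd. The following is a minimal sketch, not an official Continuent tool: the helper names `check_unit` and `check_all` are hypothetical, and the unit names are simply those used in the systemctl commands above. A node that does not run all three components would need the list adjusted.

```shell
#!/bin/sh
# Sketch only: report whether each Tungsten unit is enabled at boot and active.
# check_unit and check_all are hypothetical helper names, not Tungsten commands.
# Unit names are assumed from the systemctl commands in the procedure above.

# Print the state of one unit; succeed only if it is enabled AND active.
check_unit() {
    unit=$1
    # Both subcommands exit non-zero for any other state, hence "|| true".
    enabled=$(systemctl is-enabled "$unit" 2>/dev/null) || true
    active=$(systemctl is-active "$unit" 2>/dev/null) || true
    echo "$unit enabled=${enabled:-unknown} active=${active:-unknown}"
    [ "$enabled" = "enabled" ] && [ "$active" = "active" ]
}

# Check all three units; return non-zero if any one of them fails the check.
check_all() {
    rc=0
    for unit in treplicator tmanager tconnector; do
        check_unit "$unit" || rc=1
    done
    return $rc
}
```

Sourcing this file and running `check_all` prints one status line per unit and returns a non-zero exit code if any unit is not both enabled and active, which makes it easy to call from a post-install check.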
Removing Cluster Start At Boot
To remove the boot scripts from the system, use the
In this post we explored a critical, yet easily-overlooked step when installing a Tungsten Cluster node - setting up start at boot and stop at shutdown, under either systemd or init.