Continuent Blog: Zero-Downtime Cluster Maintenance: Comparing the Procedures for Upgrades versus DB/OS Maintenance

Overview

The Skinny

Part of the power of Tungsten Clustering for MySQL / MariaDB is the ability to perform true zero-downtime maintenance, allowing client applications full access to the database layer, while taking out individual nodes for maintenance and upgrades. In this blog post we cover various types of maintenance scenarios, the best practices associated with each type of action, and the key steps to ensure the highest availability.

Important Questions

Understand the Environment as a Whole First

Several questions are critical to answer before planning any cluster maintenance. (Note that at Continuent we use "Connector" and "Proxy" interchangeably to refer to our intelligent database proxy.)

For example:

What is the cluster topology?
- Standalone (connector writes to single cluster Primary)
  Single cluster: Locate the service name and the node list
- Composite Active/Passive (CAP) (active/passive, connectors write to single cluster Primary)
  Multiple clusters: Locate the list of service names and the nodes per service
- Composite Active/Active (CAA) (active/active, connectors write to multiple cluster Primaries)
  Multiple clusters: Locate the list of service names and the nodes per service
  - You may also have Multi-Site Active/Active (MSAA) or Dynamic Active/Active (DAA). Learn more about the differences between these active/active topologies here.
Where are the connectors installed?

Since the Connectors are where the calling client application meets the database layer, it is critical to understand where the Connectors are running and how they are being used by the various database clients to best plan the appropriate procedures for your deployment.
- On the application servers? There is an architectural expectation that a load balancer sits above multiple application servers and removes an application node in the event of a Connector failure, thereby removing any single point of failure at the application layer.
- On a Connector farm fronted by a load balancing solution?
- On the database nodes?
What load-balancing solution is being used in front of the Connectors, if any?

Tungsten Connectors are not inherently highly-available - they provide HA for the database layer, so some form of HA solution is required for the Connector layer.
- How are client connections drained from an individual Connector?
- Is this procedure fully documented and practiced?
Are there any downstream consumers running the Tungsten Replicator?
Is it possible to take a maintenance window?
Which times of day, across the whole week, see the least traffic?

Maintenance Scenario: Database or OS Restart

Rolling Thunder

This scenario covers cases where the database itself will be down, either because of an OS reboot or the restart of the MySQL daemon itself.

The best practice for operations involving an OS reboot or a database restart is a procedure called "Rolling Maintenance", where each database node is serviced one at a time.

The reason for this is that attention must be given to multiple layers in order to ensure uptime - look at both connections per Connector and connections per database (both available via `cctrl> ls`):

Database layer - the database will go down, so we need to prevent all client connections, both existing and new, through the Connector to that database node prior to the reboot or restart.
This is accomplished via the `shun` command within `cctrl`.
Connector layer - if the Connector is running on the database node and that node gets rebooted, then all connections through that Connector will be abruptly and ungracefully severed when the restart occurs.
This needs to be handled by draining connections from that Connector in the load balancer layer above. Note that if restarts are happening during a maintenance window and the application is offline, we do not have to drain the connector, further simplifying the process.

A summary of the Rolling Maintenance Procedure is as follows:

Perform the maintenance operation on all of the current Replicas, one at a time
Move the Primary role to a completed Replica using `cctrl> switch`
Perform the maintenance operation on the old Primary (now a Replica)
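
The summary above can be sketched as a small Python helper that only builds the order of operations as text. This is an illustration, not Continuent tooling: the function name, node names, and step wording are hypothetical, the `datasource <node> shun` and `datasource <node> recover` lines are the cctrl commands as we understand them, and draining a Connector that shares the node is left out.

```python
from typing import List


def rolling_maintenance_plan(primary: str, replicas: List[str]) -> List[str]:
    """Return the ordered steps for rolling maintenance on one cluster.

    The steps are plain strings for a person (or a wrapper script) to follow;
    nothing here talks to a real cluster.
    """
    if not replicas:
        raise ValueError("rolling maintenance needs at least one Replica to switch to")

    steps: List[str] = []

    def service(node: str) -> None:
        # Shun first so the Connectors stop routing to the node,
        # do the work, then bring the node back into the cluster.
        steps.append(f"cctrl> datasource {node} shun")
        steps.append(f"[{node}] perform maintenance (DB restart or OS reboot)")
        steps.append(f"cctrl> datasource {node} recover")

    # 1. Every current Replica, one at a time.
    for node in replicas:
        service(node)

    # 2. Move the Primary role to an already-serviced Replica.
    steps.append("cctrl> switch")

    # 3. The old Primary is now a Replica and is serviced the same way.
    service(primary)
    return steps


if __name__ == "__main__":
    for step in rolling_maintenance_plan("db1", ["db2", "db3"]):
        print(step)
```

Keeping the plan as data makes it easy to review the order before touching a node: the Primary is never serviced until the `switch` step has moved its role elsewhere.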

Here is a link to the online documentation for performing rolling maintenance:
https://docs.continuent.com/tungsten-clustering-6.1/operations-maintenance-dataservice.html

Maintenance Scenario: Upgrades

The Special Case

This scenario covers cases where the database will NOT be down, so the focus becomes the Connector only. This means that we can modify the above rolling maintenance procedures somewhat to make it much easier to accomplish.

In the upgrade scenario, all three layers of the Tungsten Clustering software need to be stopped and restarted using the new version of software in a new directory.

The Connector is the only layer of the three that will impact client availability, which makes this operation capable of being performed live and in-place, with no switch needed.

In this use case, one needs to look at the connections per Connector (available via `cctrl> ls`), and NOT the connections per database because the databases stay up the whole time:

Connector layer - if the Connector is running on the database node and that node gets upgraded, then all connections through that Connector will be gracefully severed when the upgrade occurs.
This still requires draining connections from each Connector in the load balancer layer above.
Database layer - the database stays up, so the Datasource stays ONLINE, no need to SHUN, and connections can happen without interruption from the database's perspective.
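
The difference between the two scenarios comes down to which layers need attention. A minimal sketch of that decision, assuming only the rules stated in this post (the `Actions` class, the `required_actions` function, and the scenario labels are made up for illustration):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Actions:
    shun_datasource: bool  # stop routing to the database via `shun` in cctrl
    drain_connector: bool  # drain client connections at the load balancer


def required_actions(scenario: str, connector_on_node: bool,
                     apps_offline: bool = False) -> Actions:
    """Map a maintenance scenario to the layers that need handling.

    scenario: "restart" (OS reboot or database restart) or "upgrade".
    connector_on_node: a Connector runs on the node being serviced.
    apps_offline: a maintenance window is in effect and clients are stopped.
    """
    if scenario not in ("restart", "upgrade"):
        raise ValueError("scenario must be 'restart' or 'upgrade'")

    # Only a restart takes the database down, so only a restart needs SHUN.
    shun = scenario == "restart"

    # A Connector on the node is interrupted in both scenarios, so it is
    # drained whenever applications are still connected.
    drain = connector_on_node and not apps_offline

    return Actions(shun_datasource=shun, drain_connector=drain)
```

For example, `required_actions("upgrade", connector_on_node=True)` asks for a drain but no shun, while the same call with `apps_offline=True` asks for neither.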

There are two types of upgrade procedures, and the correct one is based upon the deployment method chosen during installation: Staging or INI.

For both types of upgrades, use the `--no-connectors` argument to `tools/tpm update` to prevent the Connectors from restarting. Once the upgrade is completed on all nodes, drain each Connector one by one and, once all of its client connections are gone, run the `tpm promote-connector` command to stop and restart that Connector. As above, if this is happening during a maintenance window when applications are offline, we do not have to drain the Connectors and we do not need the `--no-connectors` argument.

In this manner one may upgrade an entire cluster with zero downtime!

A summary of the INI Upgrade Procedure is as follows:

Upgrade all of the current Replicas, one at a time
Perform an in-place upgrade on the Primary while live, with no switch required
Drain and execute `tpm promote-connector` on each node running a Connector (if the applications are online due to no maintenance window)

A summary of the Staging Upgrade Procedure is as follows:

Perform an in-place upgrade on the database and connector nodes (if any) all at once, with no switch required
Drain and execute `tpm promote-connector` on each node running a Connector (if the applications are online due to no maintenance window)
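
The two summaries differ only in how the update is rolled out, which a short Python sketch can make explicit. It is a planning aid under stated assumptions, not the documented procedure: the function name, the node names, and the "staging host" label are hypothetical, and any other `tpm update` arguments your deployment needs are omitted.

```python
from typing import List


def upgrade_plan(method: str, primary: str, replicas: List[str],
                 connector_nodes: List[str], apps_online: bool = True) -> List[str]:
    """Return the ordered upgrade steps for an INI or Staging deployment."""
    if method not in ("ini", "staging"):
        raise ValueError("method must be 'ini' or 'staging'")

    # Hold back Connector restarts only while applications are connected.
    flag = " --no-connectors" if apps_online else ""
    update = f"tools/tpm update{flag}"
    steps: List[str] = []

    if method == "ini":
        # INI: upgrade node by node, Replicas first, then the Primary in place.
        for node in replicas + [primary]:
            steps.append(f"[{node}] {update}")
    else:
        # Staging: one update run covers all nodes at once.
        steps.append(f"[staging host] {update}")

    if apps_online:
        # The Connectors still run the old version; cycle them one by one.
        for node in connector_nodes:
            steps.append(f"[{node}] drain client connections at the load balancer")
            steps.append(f"[{node}] tpm promote-connector")
    return steps


if __name__ == "__main__":
    for step in upgrade_plan("ini", "db1", ["db2", "db3"], ["db1", "db2", "db3"]):
        print(step)
```

Note that neither branch contains a `switch`: the databases stay online throughout, so the Primary is upgraded in place.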

Details of both the INI and Staging Upgrade Procedures are covered in a prior blog post.

Best Practices

Do's and Don'ts

Ensure that replication on each node has caught up, with minimal latency, before recovering or welcoming it back into the cluster.
Use `switch`, not `switch to {datasource}` if at all possible.
For both types of upgrades, if applications are running, use the `--no-connectors` argument to `tools/tpm update` to prevent the Connectors from restarting. Once the upgrade is completed on all nodes, drain each Connector one by one and, once all of its client connections are gone, run the `tpm promote-connector` command to stop and restart that Connector.
Consider any downstream stand-alone Tungsten Replicators in the topology. Occasionally, there are incompatible changes in THL which will cause an older version of the replicator to go offline. Upgrade the Replicator at the same time you are upgrading the cluster to address replication issues due to THL incompatibility.
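
The first practice above, waiting for replication to catch up, can be turned into a simple gate. The sketch below assumes the replicator status has already been captured as "name : value" text containing `state` and `appliedLatency` fields, as `trepctl status` prints them to our knowledge; the function names and the one-second threshold are illustrative choices, so pick a threshold that matches what "minimal latency" means in your environment.

```python
from typing import Dict


def parse_status(text: str) -> Dict[str, str]:
    """Parse 'name : value' lines into a dict, ignoring anything else."""
    fields: Dict[str, str] = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def caught_up(status_text: str, max_latency_s: float = 1.0) -> bool:
    """True only when the replicator is ONLINE and latency is within bounds."""
    fields = parse_status(status_text)
    try:
        latency = float(fields["appliedLatency"])
    except (KeyError, ValueError):
        return False  # missing or unparsable latency: treat as not ready
    # A negative latency is treated as "unknown", never as "caught up".
    return fields.get("state") == "ONLINE" and 0 <= latency <= max_latency_s


if __name__ == "__main__":
    # Made-up sample text, not captured from a real cluster.
    sample = "state : ONLINE\nappliedLatency : 0.25\n"
    print(caught_up(sample))
```

The gate fails closed: if the status cannot be read or parsed, the node is reported as not ready rather than assumed healthy.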

The Library

Please read the docs!

For more information about performing database or OS maintenance, please visit the docs page at https://docs.continuent.com/tungsten-clustering-6.1/operations-maintenance.html

For more information about the specific procedures, please visit the online documentation:

INI Upgrades: https://docs.continuent.com/tungsten-clustering-6.1/cmdline-tools-tpm-ini-upgrades.html
Staging Upgrades: https://docs.continuent.com/tungsten-clustering-6.1/cmdline-tools-tpm-cmdline-upgrade.html
The `tpm update` command: https://docs.continuent.com/tungsten-clustering-6.1/cmdline-tools-tpm-commands-update.html

For more information about Tungsten clusters, please visit https://docs.continuent.com

Summary

The Wrap-Up

In this blog post we discussed the power of Tungsten Clustering for MySQL / MariaDB to perform true zero-downtime maintenance, allowing client applications full access to the database layer, while taking out individual nodes for maintenance and upgrades. We covered various types of maintenance scenarios, the best practices associated with each type of action, and the key steps to ensure the highest availability.

Tungsten Clustering is the most flexible, performant global database layer available today – use it underlying your SaaS offering as a strong base upon which to grow your worldwide business!

For more information, please visit https://www.continuent.com/solutions.

Want to learn more or run a POC? Contact us.

Author

Eric M. Stone

COO and VP of Product Management

Eric is a veteran of fast-paced, large-scale enterprise environments with 40 years of Information Technology experience. With a focus on HA/DR, from building data centers and trading floors to world-wide deployments, Eric has architected, coded, deployed and administered systems for a wide variety of disparate customers, from Fortune 500 financial institutions to SMB’s.
