Continuent Blog: Geo-MySQL Reality Check: How Galera Cluster Caused Downtime

Synchronous Replication for MySQL HA and DR Deployments

Recently, a European-based mobile service operator approached us looking for improvements to their current Galera Cluster deployment [aka MariaDB Cluster and Percona XtraDB Cluster].

Synchronous replication promises a lot: a fully consistent cluster with infinite write scalability. It is great. Until it isn’t. We at Continuent have a lot of experience with this: the first two solutions we developed, m/cluster in 2004 and uni/cluster in 2007, were both based on synchronous replication.

We changed course in 2009 after concluding that eventual consistency using asynchronous replication is a better way to go. The promise of synchronous replication ultimately proved too good to be true.

Two main problems were causing major issues for this mobile provider’s 24/7/365 enterprise operations: (1) unplanned downtime and (2) the lack of a reliable geo-scale solution.

Let’s look at how using Tungsten Clustering addresses their issues.

Problem 1: Unplanned, Unexpected MySQL Downtime

Practically all clustering solutions address the risk of “split-brain” by establishing a quorum. In a split-brain, applications may write to two separate instances at the same time, each unaware of the other, which causes data drift and inconsistent databases.

You can read about split-brain scenarios here: https://www.continuent.com/resources/blog/what-earth-split-brain-scenario-mysql-database-cluster

And about quorum here:
https://www.continuent.com/resources/blog/why-does-mysql-mariadb-cluster-require-odd-number-nodes

The risk here is a Galera cluster taking itself offline. Within a 3-node Galera cluster, you can lose one node and the cluster remains online. Great. But if you lose two (2) nodes, the single remaining node no longer has quorum and stops serving traffic as well, which means your cluster is completely offline.

One of this company’s key frustrations with Galera was that the single healthy node had all of the data. So why couldn’t they use it? (Technically, one could disable split-brain protection in a Galera cluster, but that would be VERY dangerous.)
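The quorum rule behind this behavior can be sketched in a few lines. This is a minimal illustration that assumes equal node weights and a plain strict-majority rule; the function name `has_quorum` is hypothetical and is not part of Galera or Tungsten.

```python
def has_quorum(reachable_nodes: int, cluster_size: int) -> bool:
    """Strict majority: more than half of the configured nodes must be reachable.

    Assumption: every node carries the same weight.
    """
    return reachable_nodes * 2 > cluster_size


# Walk a 3-node cluster through the loss of one node, then two.
for reachable in (3, 2, 1):
    state = "has quorum" if has_quorum(reachable, 3) else "no quorum"
    print(f"{reachable} of 3 nodes reachable: {state}")
```

With one node left out of three, the majority test fails, so the surviving node cannot tell a double failure apart from a network partition and refuses to continue on its own.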

Tungsten Clustering, on the other hand, can keep your cluster, and thus your application, online even when losing 2 out of 3 nodes. The split-brain protection in Tungsten Clustering contains many rules and checks to determine whether a node is actually down (machine crashed, MySQL database server crashed, etc.) as opposed to unreachable because of a network partition.

This also eliminates the precarious situation of having no failover during online maintenance activities. Although both solutions support zero-downtime maintenance, each time you do maintenance on a 3-node Galera cluster, your cluster is at risk: if one other node goes offline, your cluster will shut down. With Tungsten Clustering, you can perform zero-downtime maintenance on a 3-node cluster and your cluster will remain online even if you lose one of the remaining online nodes.

Problem 2: Deploy at Geo-Scale

Deploying MySQL at geo-scale is a requirement that we often hear from prospective customers. Synchronous solutions like Galera struggle with any deployment over a WAN due to a single issue: physics. The speed of light sets a hard limit on the commit round trips that synchronous replication requires. Latency kills performance.

Any replication technology that uses synchronous, semi-synchronous, or virtually synchronous replication will add significant latency to every application write. This is because the transaction MUST be acknowledged by a remote node before it is committed on the primary node, and the application waits for that commit acknowledgment the whole time.
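A rough back-of-the-envelope calculation shows why. The sketch below assumes light travels through optical fiber at about 200,000 km/s, that each commit needs at least one full round trip to the remote site, and that commits on a single connection happen one after another. The 6,000 km distance is an illustrative figure, not a number from this customer's deployment.

```python
# Approximate signal speed in optical fiber (an assumption, about 2/3 of c).
FIBER_SPEED_KM_PER_S = 200_000


def min_round_trip_ms(distance_km: float) -> float:
    """Best-case round-trip time over fiber, ignoring routing and processing delays."""
    return 2 * distance_km / FIBER_SPEED_KM_PER_S * 1000


def max_serial_commits_per_second(rtt_ms: float, round_trips_per_commit: int = 1) -> float:
    """Upper bound on commits per second for one connection waiting on each commit."""
    return 1000 / (rtt_ms * round_trips_per_commit)


rtt = min_round_trip_ms(6000)  # hypothetical distance between two sites
print(f"best-case RTT: {rtt:.1f} ms")
print(f"max serial commits/s: {max_serial_commits_per_second(rtt):.1f}")
```

Under these assumptions the best-case round trip is about 60 ms, which caps a single connection at fewer than 17 commits per second before any real-world network overhead is counted.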

While synchronous replication seems like “the dream” for data synchronization, all transactions must go through a certification phase, which impacts performance even in a local cluster, not to mention geo-scale clusters. Even on the fastest of links, this will cause the application to crawl. A WAN link with even a small latency can make applications unusable.

Continuent has been deploying MySQL clusters at geo-scale since 2009, when we recognized that asynchronous replication is a requirement for loosely coupled clusters across WANs (and between cloud providers and on-prem). Adding sites to your existing deployment will not affect the performance of your existing clusters. This was important for this prospect because they need to deploy their applications close to customers for the best customer experience.

Added Benefit: Intelligent MySQL Proxy

There is one other key component of Tungsten Clustering we should not forget: the Tungsten Proxy (aka Tungsten Connector), an intelligent MySQL proxy that is integrated with Tungsten Clustering. Having loosely coupled clusters deployed in a geo-distributed Active/Active or Active/Passive topology requires intelligent routing. The Proxy is aware of the state of all nodes in the geo deployment, so failover between sites can happen quickly and without having to reconfigure application servers. The Tungsten Proxy also offers a variety of load balancing and routing algorithms to utilize all nodes in the deployment.
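To make the idea of state-aware routing concrete, here is a deliberately simplified sketch. It is not the Tungsten Connector's actual algorithm or API: the `Node` type, the `route` function, and the node and site names are all hypothetical, and the policy shown (prefer the local site, fall back to a remote one) is just one plausible rule.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Node:
    name: str
    site: str
    role: str      # "primary" or "replica"
    online: bool


def route(nodes: list[Node], local_site: str, is_write: bool) -> Optional[Node]:
    """Pick a target node, preferring the local site and falling back to remote sites."""
    wanted_role = "primary" if is_write else "replica"
    candidates = [n for n in nodes if n.online and n.role == wanted_role]
    if not candidates and not is_write:
        # No replica is available: let reads fall back to any online primary.
        candidates = [n for n in nodes if n.online and n.role == "primary"]
    # Local-site nodes sort first (False < True); the sort is stable otherwise.
    candidates.sort(key=lambda n: n.site != local_site)
    return candidates[0] if candidates else None


nodes = [
    Node("eu-db1", "eu", "primary", online=False),  # local primary is down
    Node("eu-db2", "eu", "replica", online=True),
    Node("us-db1", "us", "primary", online=True),
    Node("us-db2", "us", "replica", online=True),
]
print(route(nodes, local_site="eu", is_write=True))   # write goes to the remote primary
print(route(nodes, local_site="eu", is_write=False))  # read stays on the local replica
```

Because the routing decision lives in the proxy layer, the application keeps using the same connection endpoint while the target node changes underneath it.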

Wrap Up

This was another successful presentation covering two common issues that we are asked about: local HA and deployments at geo-scale. One of the other items mentioned was that Tungsten Clustering is a complete product, meaning it comes with all of the components needed to deploy geo-scale MySQL clusters. Alongside the Tungsten Replicator and the Tungsten Connector, it includes Tungsten Manager, an intelligent manager and orchestrator, and Tungsten Dashboard, the GUI administration and monitoring tool for Tungsten Cluster deployments. I am confident that this mobile service operator will be using Tungsten Clustering for years to come.

Published In

Categories:

Geographic Distribution and Geo-Scale, Zero Downtime Maintenance

Series:

Competitor Comparisons

Tags:

active/active, Galera

Author

Matthew Lang

VP of Customer Success, Americas

Matthew has over 25 years of experience in database administration, database programming, and system architecture, including the creation of a database replication product that is still in use today. He has designed highly available, scalable systems that have allowed startups to quickly become enterprise organizations, utilizing a variety of technologies including open source projects, virtualization and cloud.
