Automatic failover sounds like a solved problem. Your primary goes down, a replica gets promoted, and the application reconnects. In practice, the decision to automate — and how far to automate — is one of the most consequential architectural choices a team can make.
Done well, automatic failover reduces downtime from minutes to seconds. Done poorly, it creates split-brain conditions, promotes stale replicas, or triggers cascading failures that a human operator would have avoided entirely.
This article examines where automation works, where it breaks down, what the major tools offer, and how to decide which approach fits your environment.
Why Failover Is Hard
On the surface, failover seems mechanical: detect a failure, pick the best replica, promote it, reconfigure the remaining replicas, and redirect application traffic. Each of those steps is individually straightforward. The difficulty is in the coordination and in the edge cases.
A reliable failover system requires three components working together: replication to keep replicas in sync with the primary, orchestration to detect failures and manage promotion logic, and connectivity (a proxy or router) to redirect application traffic to the new primary. If any one of these is missing or poorly integrated with the others, failover will have gaps: data is lost, the wrong node is promoted, or the application does not know where to send traffic after the switch. Many of the differences between failover tools come down to whether these three components are tightly integrated or assembled from separate projects.
Consider what happens when a primary becomes unreachable. The failover system has to answer several questions simultaneously:
- Is the primary actually down, or is there a network partition between the monitoring system and the primary?
- Are the replicas still connected to the primary even though the failover system cannot reach it?
- Which replica has the most recent data?
- Are there in-flight transactions that haven't been replicated yet?
- Is it safe to promote right now, or would doing so cause data loss?
Getting the answers wrong - even to just one of these - can produce outcomes worse than the original outage.
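The questions above can be collapsed into a single promote-or-wait gate. The following is a sketch only, not any particular tool's logic; the function name, inputs, and thresholds are all illustrative assumptions:

```python
# Sketch of a promotion-safety gate. All names and thresholds are
# illustrative assumptions, not any real tool's implementation.

def safe_to_promote(observers_reporting_down: int,
                    total_observers: int,
                    replicas_still_replicating: int,
                    candidate_lag_seconds: float,
                    max_acceptable_lag_seconds: float = 5.0) -> bool:
    """Return True only if every safety question has a clear answer."""
    # Is the primary actually down? Require a majority of independent
    # observers to agree, not just one monitor's view.
    quorum = observers_reporting_down > total_observers // 2
    # If any replica can still reach the primary, the "failure" is
    # probably a partition on the monitor's side - do not promote.
    replicas_see_primary = replicas_still_replicating > 0
    # Is the best candidate close enough to the primary's position
    # that promoting it bounds the data-loss window?
    candidate_fresh = candidate_lag_seconds <= max_acceptable_lag_seconds
    return quorum and not replicas_see_primary and candidate_fresh

# A lone monitor losing sight of the primary is not enough:
assert safe_to_promote(1, 3, 0, 0.5) is False
# Majority agreement, no replica connectivity, fresh candidate: promote.
assert safe_to_promote(2, 3, 0, 0.5) is True
# Majority agreement, but the candidate is far behind: wait.
assert safe_to_promote(3, 3, 0, 60.0) is False
```

Real systems answer these questions with richer signals (GTID positions, replica I/O thread state, fencing status), but the shape of the decision is the same: any single unclear answer should block promotion.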
The Case for Automation
Manual failover has an obvious weakness: it requires a human. And humans need to be paged, need to wake up, need to assess the situation, need to connect to the right systems, and need to execute the right commands in the right order without mistakes. Under pressure. Possibly at 3 AM.
For many organizations, the time it takes to do all of that exceeds their downtime tolerance. If your SLA requires recovery within 30 seconds, manual intervention is not an option.
Automatic failover also removes one of the biggest sources of human error in database operations. Even experienced DBAs occasionally promote the wrong replica, forget to disable writes on the old primary, or skip a step in a runbook they have executed dozens of times before.
Automation is most effective when failure modes are well-understood, topologies are predictable, and the environment has been thoroughly tested. In a standard three-node cluster within a single datacenter, automated local failover is a mature and reliable practice.
The Case for Manual Intervention
Automation is only as good as its failure detection logic. And failure detection in distributed systems is famously unreliable.
The classic problem is the false positive: the monitoring system concludes the primary is dead when it's actually alive but temporarily unreachable. If the failover system promotes a replica while the old primary is still accepting writes, the result is a split-brain - two nodes both believing they are the primary, both accepting writes, and diverging silently.
There are also situations where automation will technically do the right thing but at the wrong time. During a planned maintenance window, during a known network event, during a storage hiccup that resolves in seconds - in all of these cases, a human would hold off. An automated system, unless explicitly told otherwise, will not.
Manual intervention is also the safer choice when the failure is ambiguous. Partial failures - where some replicas can reach the primary and some cannot - are particularly dangerous for automated systems. A human can investigate, gather information from multiple sources, and make a judgment call. An automated system has to rely on predefined rules that may not account for the specific combination of circumstances.
The worst failover is the one that makes things worse.
For cross-datacenter failover, the stakes are higher still. Network partitions between datacenters are more common than total server failures, and the risk of split-brain increases significantly. Many teams choose to automate failover within a datacenter but require manual intervention for cross-site promotion.
Split-Brain: The Central Risk
Split-brain deserves its own discussion because it is the single most dangerous outcome of a poorly managed failover.
In a split-brain scenario, two (or more) nodes independently accept write traffic. Each applies its own stream of changes, and when the partition resolves, the two datasets have diverged in ways that may be impossible to reconcile automatically.
The consequences range from silent data corruption to application-visible inconsistencies to complete loss of one side's changes. And the problem is often not detected immediately - it surfaces hours or days later when someone notices conflicting records or missing data.
Independent testing by Jepsen has confirmed that these risks are not theoretical - they have been observed in production-grade clustering solutions under controlled test conditions.
Split-brain prevention requires several things working together:
- Quorum-based decision making. A failover should only proceed if a majority of cluster members agree that the primary is genuinely unavailable. This is why odd numbers of nodes (3, 5, 7) are standard practice - they make majority decisions unambiguous.
- Fencing the old primary. Before a new primary starts accepting writes, the old primary must be prevented from doing so. This can be done by setting read_only, killing the process, blocking network access, or using STONITH (Shoot The Other Node In The Head) mechanisms. If you cannot reliably fence the old primary, you cannot safely automate failover.
- Replication position awareness. The promoted replica should be the one with the most recent data. Promoting a significantly behind replica means all transactions between its position and the old primary's position are effectively lost.
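Fencing in particular is about ordering: the old primary must stop accepting writes before the new one starts. A toy simulation (class and method names are illustrative, not any product's API) of why that sequence prevents split-brain:

```python
# Toy model of a failover sequence showing why fencing must complete
# before promotion. All names here are illustrative.

class Node:
    def __init__(self, name: str):
        self.name = name
        self.read_only = True   # replicas start read-only
        self.writes = []

    def write(self, txn: str):
        if self.read_only:
            raise RuntimeError(f"{self.name} is fenced/read-only")
        self.writes.append(txn)

def failover(old_primary: Node, new_primary: Node):
    # Step 1: fence the old primary (read_only, kill, or STONITH).
    old_primary.read_only = True
    # Step 2: only then open the new primary for writes.
    new_primary.read_only = False

a, b = Node("a"), Node("b")
a.read_only = False            # a is the current primary
a.write("t1")
failover(old_primary=a, new_primary=b)
b.write("t2")                  # new writes land on exactly one node
try:
    a.write("t3")              # a stale client retrying the old primary
except RuntimeError:
    pass                       # rejected: fencing held
assert a.writes == ["t1"] and b.writes == ["t2"]
```

If the two steps in failover were reversed, there would be a window in which both nodes accept writes - which is exactly the split-brain condition described above.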
Any failover tool that does not address all three of these concerns is incomplete for production use.
The Tool Landscape
Several tools and platforms address MySQL failover, each with different design philosophies and trade-offs.
A Note on Managed Services
Before looking at self-managed solutions, it is worth acknowledging that cloud-managed MySQL services - Amazon Aurora, Google Cloud SQL for MySQL, and Azure Database for MySQL - handle failover automatically as part of the platform. The cloud provider owns the detection logic, promotion process, and traffic rerouting. For teams that are comfortable delegating this control, managed services remove the failover decision entirely.
The trade-off is that they also remove choice. You typically cannot select which replica gets promoted, cannot inspect the failover logic, cannot intervene if the automated decision is wrong, and cannot extend the system across cloud providers or into on-premises infrastructure. Cross-region failover support varies by provider and often involves significant replication lag and manual steps. And because these are proprietary implementations built on MySQL-compatible (not always fully MySQL-native) engines, migration away from them can be complex.
For the rest of this article, we focus on self-managed solutions where the team controls the failover behavior directly - which is where the "automate vs. manual" question actually applies.
MHA (Master High Availability)
MHA was created by Yoshinori Matsunobu and became one of the most widely deployed MySQL failover tools. It is designed specifically for asynchronous replication topologies and focuses on minimizing data loss during promotion.
MHA's approach is methodical: when the primary fails, it identifies the most advanced replica, retrieves any missing relay log events from other replicas, applies them to the promotion candidate, and then promotes it. This relay log synchronization step is what distinguishes MHA from simpler tools - it tries to recover as many transactions as possible before completing the failover.
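The relay log synchronization step can be sketched as: find the most advanced replica, compute what every other replica is missing, and apply it before promotion. The model below is illustrative (events as simple ordered lists, valid because single-source replication applies events in one order), not MHA's actual binlog-position implementation:

```python
# Illustrative sketch of MHA-style relay log synchronization.
# Replica state is modeled as an ordered list of applied events;
# real MHA works with binlog/relay log file and position pairs.

def sync_and_pick(replicas: dict) -> tuple:
    """replicas maps name -> ordered list of applied events.
    Assumes each replica's list is a prefix of the longest one,
    which holds when all events come from a single primary."""
    # The most advanced replica defines the recoverable high-water mark.
    donor = max(replicas, key=lambda r: len(replicas[r]))
    high_water = replicas[donor]
    # Copy missing events to every other replica before promotion,
    # so transactions already replicated somewhere are not lost.
    for name, events in replicas.items():
        events.extend(high_water[len(events):])
    return donor, high_water

replicas = {
    "replica1": ["e1", "e2", "e3", "e4"],   # most advanced
    "replica2": ["e1", "e2"],               # candidate, behind
}
donor, final = sync_and_pick(replicas)
assert donor == "replica1"
assert replicas["replica2"] == ["e1", "e2", "e3", "e4"]
```

The point of the extra step is visible in the example: without it, promoting replica2 would silently discard e3 and e4 even though another replica had already received them.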
The trade-offs are worth understanding. MHA requires passwordless SSH between nodes, which introduces operational complexity. It is written in Perl, and while it is still maintained, active development has slowed considerably. By design, it exits after performing a single failover to prevent flapping, which means you need external automation to restart monitoring after recovery. MHA also does not natively understand MySQL Group Replication or native MariaDB GTID.
Importantly, MHA handles only one part of the failover problem: orchestration. It does not include a proxy for traffic routing or a monitoring dashboard - those must be built separately, typically by pairing MHA with ProxySQL or HAProxy and custom scripting. It also does not manage post-failover recovery: once the new primary is promoted, bringing the old primary back as a replica and returning the cluster to its original topology is a manual process.
For teams running traditional asynchronous replication in a single datacenter, MHA remains a solid and well-understood choice. For more complex topologies or teams that need an integrated solution, it may not be the right fit.
Orchestrator
Orchestrator, originally developed at GitHub by Shlomi Noach, takes a fundamentally different approach. Rather than just watching a primary and reacting to its failure, Orchestrator continuously discovers and maps the entire replication topology. It understands how every node relates to every other node, and it uses that understanding to make smarter promotion decisions.
One of Orchestrator's key strengths is its false-positive detection. It does not rely solely on its own view of the primary - it also checks whether replicas can still see the primary. If Orchestrator cannot reach the primary, but all replicas are happily replicating, it correctly concludes that the problem is with Orchestrator's network path, not with the primary itself.
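That cross-check reduces to a simple rule. The sketch below uses hypothetical boolean inputs, not Orchestrator's actual data structures:

```python
# Sketch of replica-assisted failure detection. Inputs are hypothetical
# simplifications of what a topology-aware monitor would collect.

def diagnose(monitor_sees_primary: bool,
             replica_io_threads_running: list) -> str:
    if monitor_sees_primary:
        return "primary healthy"
    if any(replica_io_threads_running):
        # Replicas are still streaming from the primary, so the primary
        # is alive; only the monitor's network path to it is broken.
        return "monitor partitioned - do not fail over"
    # Neither the monitor nor any replica can reach the primary.
    return "primary down - failover justified"

assert diagnose(False, [True, True, False]) == \
    "monitor partitioned - do not fail over"
assert diagnose(False, [False, False, False]) == \
    "primary down - failover justified"
```

Requiring agreement between the monitor's view and the replicas' view is what turns a single ambiguous signal into a defensible failover decision.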
Orchestrator supports both automated and manual failover, with configurable hooks for pre- and post-failover actions. It provides a web UI for topology visualization and can be deployed in a multi-node Raft configuration for its own high availability.
The learning curve is steeper than MHA's. Orchestrator requires a backend database to store topology state, and its configuration has many options that require careful tuning. Like MHA, Orchestrator focuses on the orchestration layer - it does not include a proxy or load balancer. Traffic routing must be handled externally, typically through pre- and post-failover hooks that update ProxySQL, HAProxy, or DNS. This means the team assembling the solution is responsible for testing that all the pieces work together under every failure scenario, and for maintaining compatibility as each component is updated independently. Orchestrator also does not automate failback or recovery of the old primary - those remain manual operations.
MySQL InnoDB Cluster (Group Replication + MySQL Router)
MySQL InnoDB Cluster is Oracle's native high availability solution, combining Group Replication, MySQL Router, and MySQL Shell into an integrated stack.
Group Replication provides virtually synchronous replication with automatic conflict detection. In single-primary mode - the recommended configuration for most use cases - one node handles writes and the others serve reads. When the primary fails, the remaining members use a consensus protocol to elect a new primary, and MySQL Router automatically redirects traffic.
This integration is InnoDB Cluster's biggest advantage. Because the replication layer, the routing layer, and the management layer are all designed to work together, failover is more tightly coordinated than in systems where these components are bolted together from separate projects. That said, MySQL Router is a relatively basic proxy - it handles connection routing based on ports (separate ports for read-write and read-only traffic), but does not perform intelligent read/write splitting within a single connection or offer the load balancing sophistication of dedicated proxy solutions.
The limitations are equally real. Group Replication is sensitive to network latency, making it difficult to use across geographically distant datacenters. It also only supports the InnoDB storage engine and requires MySQL 5.7 or later (with most practical improvements arriving in 8.0), so teams running MariaDB, Percona Server, or other storage engines will need to look elsewhere. InnoDB ClusterSet extends the model to support multiple sites, but emergency cross-site failover explicitly warns of split-brain risk and does not guarantee data consistency - the original primary cluster is invalidated during the process and must be manually repaired afterward. Write throughput is also constrained because every transaction must be certified across all group members before commit.
For single-site deployments where all nodes are within the same low-latency network, InnoDB Cluster provides a well-integrated, low-operational-overhead solution. For multi-region deployments, the constraints become significant.
Galera Cluster (Percona XtraDB Cluster, MariaDB Galera Cluster)
Galera Cluster takes yet another approach: it avoids the need for a separate failover manager entirely by making every node a writable primary. Using a certification-based synchronous replication protocol, Galera ensures that all nodes apply the same transactions in the same order. If a node fails, the remaining nodes continue operating with no promotion step required - the concept of "failover" largely disappears from the operational model.
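Certification can be sketched as: each transaction carries a write set, and a transaction fails certification if a transaction committed after its snapshot touched overlapping rows (first committer wins). This is a deliberately simplified model; real Galera certifies against a window of write sets ordered by global sequence number:

```python
# Simplified model of certification-based conflict detection.
# Real Galera uses global seqnos and a bounded certification window;
# this sketch keeps the full history for clarity.

committed_write_sets = []   # write sets certified so far, in commit order

def certify(write_set: set, last_seen_seqno: int) -> bool:
    """A transaction whose snapshot was taken at last_seen_seqno fails
    certification if any later-committed write set overlaps its own."""
    for seqno, ws in enumerate(committed_write_sets):
        if seqno >= last_seen_seqno and ws & write_set:
            return False            # conflict: transaction must abort
    committed_write_sets.append(write_set)
    return True

# Two transactions start from the same snapshot (seqno 0) and both
# try to update row "r1". The first certifies; the second aborts.
assert certify({"r1", "r2"}, last_seen_seqno=0) is True
assert certify({"r1"}, last_seen_seqno=0) is False
# A non-overlapping transaction from the same snapshot still commits.
assert certify({"r9"}, last_seen_seqno=0) is True
```

Because every node runs the same deterministic certification over the same ordered stream, all nodes reach the same commit/abort decision without a promotion step - which is why the failover concept largely disappears.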
This is a genuinely appealing property. There is no promotion delay, no risk of promoting a stale replica, and no need for external orchestration to detect failure and trigger a role change. The cluster manages its own membership through group communication, and nodes that lose connectivity are expelled automatically once the remaining members have a quorum.
The trade-offs show up in other places. Every write must be certified across all nodes before commit, which means write latency increases with the number of nodes and with network distance between them. Under sustained write pressure, Galera uses flow control to throttle the fastest node down to the speed of the slowest, which can produce unexpected write stalls in production. Cross-datacenter deployments are possible, but require careful attention to network latency and flow control tuning. Like InnoDB Cluster, Galera only supports the InnoDB storage engine.
Beyond performance, there are correctness considerations. A March 2026 Jepsen analysis of MariaDB Galera Cluster identified scenarios in which acknowledged committed transactions could be lost during network partitions and node membership changes - precisely the conditions under which failover occurs. These findings highlight the importance of independent testing under realistic failure conditions, regardless of which replication technology a team selects.
Galera is a strong fit for teams that want to eliminate the failover problem at the replication layer and are willing to accept the write performance constraints that come with synchronous certification. For write-heavy workloads or geographically distributed deployments, those constraints can become significant.
Tungsten Cluster
Tungsten Cluster takes a different architectural approach. Rather than relying on synchronous replication, it uses its own asynchronous replication engine (Tungsten Replicator) and layers cluster management, intelligent proxy routing, and failover orchestration on top. Notably, it ships all three components - replication, orchestration, and connectivity - as a single integrated and tested product, rather than requiring teams to assemble them from separate projects.
The architecture consists of several integrated components. The Tungsten Manager monitors all nodes and orchestrates failover decisions using a majority quorum of managers - which directly addresses split-brain prevention. The Tungsten Connector acts as an intelligent MySQL proxy that routes application traffic to the correct primary, handles automatic read/write splitting without application changes, and critically, keeps existing application connections alive during failover events rather than dropping them. This is a meaningful difference from solutions where the application must detect the failure and reconnect to a new endpoint.
One aspect worth noting is how Tungsten handles multi-region deployments. Because the replication layer is asynchronous, write performance on the primary is not affected by network latency to remote replicas - a significant advantage over synchronous approaches when clusters span geographic regions. The system supports several topologies for multi-site operation, including Composite Active/Passive, Composite and Dynamic Active/Active, and Distributed Datasource Groups where failover between datacenters happens automatically based on quorum decisions.
Tungsten also handles what happens after failover - an area where many tools stop short. The failed primary is automatically shunned and can be easily recovered as a replica, and failback to the original topology can be performed with a single command. Rolling upgrades and zero-downtime maintenance operations are built in, which addresses a common operational concern adjacent to failover: how do you perform MySQL, OS, or hardware maintenance without triggering a disruptive role change?
On the compatibility side, Tungsten supports all MySQL variants - MySQL Community, MySQL Enterprise, MariaDB, and Percona Server - across versions, and works with all storage engines. It can be deployed on-premises, in any cloud, and in hybrid or multi-cloud configurations. This is relevant for teams running MariaDB or Percona Server who are excluded from InnoDB Cluster, or for organizations that need to avoid vendor lock-in to a single cloud provider.
The trade-off is that Tungsten Cluster is a commercial product with annual subscription fees, which puts it in a different category from the open-source tools. However, for teams that have evaluated the total cost of ownership - including the engineering time required to build, integrate, test, and maintain a DIY failover stack - the comparison is often more nuanced than the licensing cost alone would suggest.
A Decision Framework
There is no universal answer to "should we automate failover?" The right answer depends on your specific environment, your team's capabilities, and your tolerance for different types of risk.
Automate failover when:
- Your downtime tolerance is measured in seconds, not minutes.
- You have a well-understood, stable topology (for example, a standard three-node cluster within a single datacenter).
- You have thoroughly tested the failover process under realistic failure conditions - not just clean shutdowns, but network partitions, storage failures, and partial outages.
- You have reliable fencing mechanisms for the old primary.
- Your monitoring and alerting are mature enough that you will know when automation fires and can verify the result quickly.
Keep failover manual when:
- Your topology spans multiple datacenters and the risk of split-brain during a network partition is significant.
- Your failure modes are complex or unpredictable - for example, environments with unusual storage configurations, mixed replication modes, or heavy cross-database dependencies.
- Your team has not tested automated failover under realistic failure conditions.
- The data loss risk of a wrong promotion outweighs the cost of a few extra minutes of downtime.
- You do not have reliable fencing for the old primary.
Consider a hybrid approach when:
- You want automatic failover within a datacenter, but manual control for cross-site promotion.
- You want automation to detect and prepare for failover but require human approval before executing.
- You operate in a regulated environment where audit requirements mandate human oversight of certain operational decisions.
Many production environments land on this hybrid model: automated local failover for speed and reliability, with manual (or semi-automated) cross-site failover for safety.
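The hybrid model can be expressed as a small policy function. The names and the approval mechanism below are illustrative assumptions, not a specific tool's configuration:

```python
# Sketch of a hybrid failover policy: automatic within a datacenter,
# human approval required for cross-site promotion. All names are
# illustrative.

def failover_action(primary_dc: str, candidate_dc: str,
                    human_approved: bool = False) -> str:
    if candidate_dc == primary_dc:
        return "promote"                # local: automate for speed
    if human_approved:
        return "promote"                # cross-site: only with sign-off
    return "page on-call and wait"      # cross-site: default to safety

assert failover_action("dc1", "dc1") == "promote"
assert failover_action("dc1", "dc2") == "page on-call and wait"
assert failover_action("dc1", "dc2", human_approved=True) == "promote"
```

In practice the "approval" input might be a ticket, a chat-ops command, or a pre-failover hook that blocks until acknowledged - the policy shape is the same.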
Practical Recommendations
Regardless of which tool or approach you choose, several practices significantly reduce failover risk.
- Test failover regularly. Not just in staging - in production, during business hours, with real traffic. If you have never tested your failover process under production conditions, you do not know if it works. Controlled switchovers (promoting a replica while the primary is still healthy) are a low-risk way to verify the entire pipeline.
- Monitor replication lag continuously. The safety of any failover depends on replicas being close to the primary's state. If your replicas regularly fall behind, every failover carries a larger data loss window. Understand your baseline lag and alert on deviations.
- Implement proper fencing. Whatever tool you use, make sure the old primary cannot continue accepting writes after a failover. read_only flags, network isolation, process termination - use what works in your environment, but do not skip this step.
- Plan for the failure of your failover system. If the tool that manages failover is itself unavailable, what happens? Orchestrator solves this with Raft-based HA for its own deployment. Tungsten Cluster uses a quorum of managers. MHA requires external monitoring to restart. Whatever your tool, make sure it is not a single point of failure.
- Document and rehearse manual procedures. Even if you automate everything, keep a runbook for manual failover. Automation can fail, and when it does, your team needs to be able to act without it.
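The "baseline lag plus deviation" recommendation can be sketched as a rolling-window monitor. The window size, deviation multiplier, and absolute floor below are illustrative tuning knobs, not recommended values:

```python
# Sketch of baseline-relative replication lag alerting. The window,
# multiplier, and floor are illustrative assumptions to be tuned
# against your own baseline.

from collections import deque

class LagMonitor:
    def __init__(self, window=10, deviation_factor=3.0, floor_seconds=1.0):
        self.samples = deque(maxlen=window)
        self.deviation_factor = deviation_factor
        self.floor_seconds = floor_seconds

    def observe(self, lag_seconds: float) -> bool:
        """Record a lag sample; return True if it should alert."""
        baseline = (sum(self.samples) / len(self.samples)
                    if self.samples else lag_seconds)
        self.samples.append(lag_seconds)
        # Alert on a large deviation from baseline, with an absolute
        # floor so tiny baselines do not make the alert hair-trigger.
        threshold = max(baseline * self.deviation_factor,
                        self.floor_seconds)
        return lag_seconds > threshold

m = LagMonitor()
assert not any(m.observe(s) for s in [0.2, 0.3, 0.2, 0.25])
assert m.observe(5.0)   # sudden jump well above baseline: alert
```

Alerting on deviation from your own baseline, rather than a fixed number, keeps the alert meaningful across workloads whose normal lag differs by orders of magnitude.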
Summary
MySQL automatic failover is a spectrum, not a binary choice. The question is not whether to automate, but how much to automate, under what conditions, and with what safeguards.
The tools available today - MHA, Orchestrator, MySQL InnoDB Cluster, Galera Cluster, Tungsten Cluster, and others - each make different trade-offs between simplicity, safety, integration depth, and operational scope. None of them eliminates the need to understand what is happening underneath.
Automation reduces response time and removes human error from well-understood failure scenarios. Manual intervention provides judgment and adaptability for ambiguous or high-stakes situations. The best failover strategies use both - and know clearly when each applies.