High availability for MySQL in a hybrid cloud setup sounds great in theory. You get flexibility, resilience, and freedom from putting all your eggs in one basket.
In practice, most outages in hybrid environments don’t come from MySQL itself; they come from the edges: networks, routing, storage, and unclear failover behavior. In this post, we’ll walk through the most common traps we see in real hybrid deployments and, more importantly, how to avoid them.
What “Hybrid-Cloud HA” Actually Looks Like
When teams talk about MySQL HA in a hybrid cloud, they usually mean some mix of:
- An on-premises MySQL cluster with replicas in the cloud (or the other way around).
- Multiple data centers combined with one or more cloud regions.
- A blend of VMs, managed MySQL services, and sometimes containers.
The goal is resilience: if a data center, a cloud region, or even a provider fails, the application keeps running. The challenge is that every connection between those environments now affects overall availability.
Pitfall 1: Assuming the Network Will “Just Work”
The weakest link in most hybrid HA designs is the network between sites.
WAN links behave very differently from local networks:
- VPNs and private interconnects drop or flap more often than teams expect
- Latency and jitter change throughout the day
- A small routing or firewall change can quietly break replication
How to avoid it:
- Design with the assumption that WAN links fail intermittently
- Use asynchronous replication between sites by default
- Actively monitor replication lag and replication errors, not just server uptime
- Plan for graceful degradation, such as running one site read-only if replication stalls
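As a rough illustration of monitoring replication state rather than just uptime, here is a minimal Python sketch that classifies one row of `SHOW REPLICA STATUS` output. It assumes the MySQL 8.0.22+ column names (`Replica_IO_Running`, `Replica_SQL_Running`, `Seconds_Behind_Source`); the `classify_replica` function, the `ReplicaHealth` type, and the 30-second threshold are hypothetical choices for this example, not recommendations.

```python
from dataclasses import dataclass
from typing import Any, Mapping


@dataclass(frozen=True)
class ReplicaHealth:
    state: str   # "ok", "lagging", or "broken"
    reason: str


def classify_replica(status: Mapping[str, Any], max_lag_seconds: int = 30) -> ReplicaHealth:
    """Classify one SHOW REPLICA STATUS row that was fetched elsewhere as a dict."""
    if status.get("Replica_IO_Running") != "Yes":
        error = status.get("Last_IO_Error") or "no error text"
        return ReplicaHealth("broken", f"IO thread not running: {error}")
    if status.get("Replica_SQL_Running") != "Yes":
        error = status.get("Last_SQL_Error") or "no error text"
        return ReplicaHealth("broken", f"SQL thread not running: {error}")

    lag = status.get("Seconds_Behind_Source")
    if lag is None:
        # MySQL reports NULL when it cannot compute the lag.
        return ReplicaHealth("broken", "lag is unknown (NULL)")
    if int(lag) > max_lag_seconds:
        return ReplicaHealth("lagging", f"{int(lag)}s behind, limit is {max_lag_seconds}s")
    return ReplicaHealth("ok", f"{int(lag)}s behind")
```

A "broken" or "lagging" result is the kind of signal that could trigger the graceful degradation described above, for example switching the affected site to read-only.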
Pitfall 2: Stretching Clusters Across Environments
Running a single MySQL cluster stretched across on-prem and cloud using synchronous or semi-synchronous replication looks clean on paper. In practice, it often leads to:
- Slower commits due to cross-site round trips
- Increased risk of split-brain during partial outages
- Application-visible incidents triggered by minor network glitches
How to avoid it:
- Use asynchronous replication between sites
- Treat each location as its own failure domain
- Make cross-site promotion a deliberate action, not an automatic reaction
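To see why cross-site round trips slow commits, consider a deliberately simplified model in Python: if every commit on a session must wait for one acknowledgement from the remote site, that session cannot commit faster than one transaction per round trip. The function name and the example numbers are illustrative assumptions; the model ignores group commit and parallel sessions, which raise aggregate throughput in practice.

```python
def max_serial_commits_per_second(round_trip_ms: float, local_commit_ms: float = 1.0) -> float:
    """Upper bound on commits per second for ONE session committing serially,
    assuming each commit waits for a single cross-site acknowledgement."""
    per_commit_ms = round_trip_ms + local_commit_ms
    if per_commit_ms <= 0:
        raise ValueError("commit time must be positive")
    return 1000.0 / per_commit_ms


# Assumed example: 39 ms WAN round trip plus 1 ms of local work per commit.
print(max_serial_commits_per_second(39.0, 1.0))  # 25.0
```

Under the same model, a local commit with a 1 ms round trip would allow roughly 500 serial commits per second, which is the gap a stretched cluster exposes to the application.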
Pitfall 3: No Clear Owner for Writes
One of the most dangerous situations in hybrid setups is ambiguity around who is allowed to write.
Common warning signs:
- Both on-prem and cloud accept writes “temporarily”
- Ad-hoc active-active replication between sites
- Different teams believe different sites are primary
This almost always ends in data inconsistency.
How to avoid it:
- Decide, and document, one clear write authority at any moment
- If you use multi-primary designs, strictly limit them by:
  - Partitioning data (by region, tenant, or shard)
  - Establishing site affinity (e.g. by user or customer)
  - Defining conflict rules up front
- Encode these rules in tooling or automation, not in tribal knowledge
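One way to encode the write-authority rule in tooling is a check that fails loudly when reality and documentation disagree. The Python sketch below assumes you have already collected each site's `@@global.read_only` value into a dictionary; the function name, the exception type, and the site names in the usage example are hypothetical.

```python
from typing import Mapping


class WriteAuthorityError(Exception):
    """Raised when the observed writers do not match the documented one."""


def verify_write_authority(read_only_by_site: Mapping[str, bool], documented_writer: str) -> str:
    """Return the single writable site, or raise if the rule is violated.

    read_only_by_site maps a site name to its @@global.read_only value.
    """
    writers = sorted(site for site, read_only in read_only_by_site.items() if not read_only)
    if len(writers) != 1:
        raise WriteAuthorityError(f"expected exactly one writable site, found {writers}")
    if writers[0] != documented_writer:
        raise WriteAuthorityError(
            f"{writers[0]} accepts writes but {documented_writer} is the documented writer"
        )
    return writers[0]
```

For example, `verify_write_authority({"on_prem": False, "cloud": True}, "on_prem")` returns `"on_prem"`, while a state where both sites have `read_only` disabled raises an error instead of silently allowing dual writes.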
Pitfall 4: DNS and Routing That Don’t Keep Up
In hybrid HA, routing is just as important as replication. Replication moves the data, but routing determines where applications actually connect when failures occur.
Problems we often see:
- DNS TTLs that are too long
- Load balancers sending writes to read-only sites
- Client-side caching masking topology changes
How to avoid it:
- Keep DNS TTLs short and validate real client behavior
- Prefer database-aware routers or proxies over static hostnames
- Separate read and write endpoints and enforce their roles consistently
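The idea of routing by database role rather than by static hostname can be sketched in a few lines of Python. This is not how any particular proxy works; the `Backend` type, its fields, and both functions are hypothetical, and the sketch assumes health and `read_only` state are refreshed by some other component.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass(frozen=True)
class Backend:
    host: str
    read_only: bool
    healthy: bool


def write_target(backends: Sequence[Backend]) -> Optional[Backend]:
    """Return the one healthy writable backend, or None if that is ambiguous."""
    writable = [b for b in backends if b.healthy and not b.read_only]
    # Refuse to route writes when zero or several backends look writable.
    return writable[0] if len(writable) == 1 else None


def read_targets(backends: Sequence[Backend]) -> List[Backend]:
    """Prefer healthy read-only backends; fall back to any healthy backend."""
    replicas = [b for b in backends if b.healthy and b.read_only]
    return replicas or [b for b in backends if b.healthy]
```

Returning `None` for writes when the topology is ambiguous is a deliberate choice in this sketch: rejecting writes briefly is easier to recover from than sending them to the wrong site.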
Pitfall 5: Mixing Storage and Backups Without a Plan
Recovery depends on more than replication. Without a consistent backup strategy across storage systems, failures become painful to recover from.
Hybrid environments almost always involve different storage systems:
- SAN or NAS on-prem
- Block or object storage in the cloud
- Different snapshot and restore semantics
Those differences tend to surface during recovery, when a snapshot taken on one platform may not restore cleanly on the other.
How to avoid it:
- Rely on portable logical backups that work everywhere
- Regularly test restores across environments
- Clearly define RPO and RTO expectations per site and design backups accordingly
For more info about creating a backup plan, read our blog post here.
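A per-site RPO is easier to enforce when something checks it continuously. The Python sketch below flags sites whose most recent successful backup is older than that site's RPO. It treats backup age as a proxy for potential data loss, which assumes binary logs are not available to roll forward; the function name and the site names in the usage note are hypothetical.

```python
from datetime import datetime, timedelta
from typing import List, Mapping, Optional


def sites_exceeding_rpo(
    last_backup_at: Mapping[str, Optional[datetime]],
    rpo: Mapping[str, timedelta],
    now: datetime,
) -> List[str]:
    """Return the sites whose newest backup is missing or older than their RPO."""
    late = []
    for site, limit in rpo.items():
        last = last_backup_at.get(site)
        if last is None or now - last > limit:
            late.append(site)
    return sorted(late)
```

With an RPO of one hour for both `on_prem` and `cloud`, a site whose last backup finished two hours ago is reported, while a site backed up thirty minutes ago is not.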
Pitfall 6: Designing for Cloud First, On-Prem Second
When a company leans too heavily on cloud features and treats on-prem as a fallback, it usually leads to:
- Manual, unreliable failover
- Inconsistent security models
- Completely different operational procedures
How to avoid it:
- Define one HA and security model for both environments
- Make sure either side can operate independently if the other is unavailable
- Use the same monitoring, alerting, and SLOs everywhere
Pitfall 7: Automating Failover Too Early
Automatic cross-site failover promises speed, but without careful design it can turn minor issues into major incidents.
Risks include:
- False positives promoting the wrong site
- Network partitions creating dual primaries
- Painful manual recovery after the fact
How to avoid it:
- Start with manual, well-rehearsed failover
- Use multiple signals before promotion (DB health, replication state, app checks)
- Practice failover regularly, including partial failures
- Implement and test failback procedures too
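Requiring multiple signals before promotion might look like the following Python sketch. It combines the three signals mentioned above (DB health, replication state, app checks) and only ever produces advice for an operator, in keeping with a manual-first approach. The `PromotionSignals` type, its field names, and the wording of the results are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PromotionSignals:
    primary_db_healthy: bool             # direct database health check
    primary_app_checks_passing: bool     # application-level checks against the primary
    candidate_replication_healthy: bool  # candidate site is in a known-good replication state


def promotion_advice(signals: PromotionSignals) -> str:
    """Recommend promotion only when every signal agrees; otherwise hold."""
    if signals.primary_db_healthy or signals.primary_app_checks_passing:
        return "hold: at least one signal still sees the primary working"
    if not signals.candidate_replication_healthy:
        return "hold: candidate site is not in a known-good replication state"
    return "recommend promotion: ask an operator to confirm"
```

A single failed health check therefore never triggers promotion on its own, which is the false-positive case the risks above describe.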
A Safer Reference Architecture
A safer hybrid architecture starts by accepting that failures will happen and designing for controlled, predictable recovery.
A simple, reliable hybrid MySQL HA pattern looks like this:
- Independent HA per site
  - On-prem: local MySQL HA
  - Cloud: local MySQL HA
- Asynchronous replication between sites
  - One site handles writes
  - The other stays ready for recovery
- Controlled site switchover
  - Stop writes
  - Catch replicas up
  - Promote target site
  - Update routing
- Unified observability
  - Same metrics, logs, and alerts everywhere
This design trades a small recovery window for far fewer surprises.
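The controlled switchover steps can be expressed as a small, ordered procedure. The Python sketch below is only a skeleton: every callable passed in is a hypothetical hook you would implement against your own tooling, and the abort path (resuming writes on the original site when replicas fail to catch up) is an assumption of this example rather than something the pattern prescribes.

```python
from typing import Callable, List


def controlled_switchover(
    stop_writes: Callable[[], None],
    replicas_caught_up: Callable[[], bool],
    promote_target: Callable[[], None],
    update_routing: Callable[[], None],
    resume_writes_on_source: Callable[[], None],
) -> List[str]:
    """Run the switchover steps in order and return a log of what happened."""
    log: List[str] = []

    stop_writes()
    log.append("writes stopped")

    if not replicas_caught_up():
        # Assumed abort path: never promote a site that is missing transactions.
        resume_writes_on_source()
        log.append("aborted: replicas not caught up, writes resumed on source")
        return log
    log.append("replicas caught up")

    promote_target()
    log.append("target promoted")

    update_routing()
    log.append("routing updated")
    return log
```

Keeping the order explicit in code makes it harder to promote the target before writes have stopped, which is the step most likely to be skipped under pressure.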
Before You Go Live: a Reality Check
Many hybrid MySQL designs fail not because of missing features, but because basic operational questions were never answered. Before calling your hybrid MySQL setup “highly available,” make sure:
- Everyone agrees where writes happen right now.
- Replication lag and failures are visible and alerted on.
- You’ve tested restores across environments.
- Routing follows database roles.
- Failovers have been practiced end-to-end.
If your team can confidently answer those questions, you’re already ahead of most hybrid deployments. If not, the best improvement is often a simpler, more explicit design, not more automation.