Continuent Blog: MySQL HA in Hybrid Cloud: Avoiding the Pitfalls

High availability for MySQL in a hybrid cloud setup sounds great in theory. You get flexibility, resilience, and freedom from putting all your eggs in one basket.

In practice, most outages in hybrid environments don’t come from MySQL itself; they come from the edges: networks, routing, storage, and unclear failover behavior. In this post, we’ll walk through the most common traps we see in real hybrid deployments and, more importantly, how to avoid them.

What “Hybrid-Cloud HA” Actually Looks Like

When teams talk about MySQL HA in a hybrid cloud, they usually mean some mix of:

An on-premises MySQL cluster with replicas in the cloud (or the other way around).
Multiple data centers combined with one or more cloud regions.
A blend of VMs, managed MySQL services, and sometimes containers.

The goal is resilience: if a data center, a cloud region, or even a provider fails, the application keeps running. The challenge is that every connection between those environments now affects overall availability.

Pitfall 1: Assuming the Network Will “Just Work”

The weakest link in most hybrid HA designs is the network between sites.

WAN links behave very differently from local networks:

VPNs and private interconnects drop or flap more often than teams expect
Latency and jitter change throughout the day
A small routing or firewall change can quietly break replication

How to avoid it:

Design with the assumption that WAN links fail intermittently
Use asynchronous replication between sites by default
Actively monitor replication lag and replication errors, not just server uptime
Plan for graceful degradation, such as running one site read-only if replication stalls
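
As a rough illustration of monitoring replication itself rather than server uptime, here is a minimal Python sketch. It assumes you have already fetched one row of `SHOW REPLICA STATUS` (MySQL 8.0.22+ column naming) from the cross-site replica. The `ReplicaStatus` class, the `assess_replica` function, and both lag thresholds are hypothetical and should be tuned to your own links.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReplicaStatus:
    """Subset of one SHOW REPLICA STATUS row (MySQL 8.0.22+ naming)."""
    io_running: str                 # Replica_IO_Running: "Yes", "No" or "Connecting"
    sql_running: str                # Replica_SQL_Running: "Yes" or "No"
    seconds_behind: Optional[int]   # Seconds_Behind_Source; NULL maps to None
    last_error: str = ""            # Last_IO_Error or Last_SQL_Error, if any


def assess_replica(status: ReplicaStatus,
                   warn_lag_s: int = 30,     # assumed threshold
                   max_lag_s: int = 300) -> str:  # assumed threshold
    """Return 'ok', 'warn' or 'stalled' for one cross-site replica."""
    threads_up = status.io_running == "Yes" and status.sql_running == "Yes"
    # A server can be "up" while replication is broken, so check the
    # replication threads and errors before looking at lag.
    if not threads_up or status.last_error or status.seconds_behind is None:
        return "stalled"
    if status.seconds_behind > max_lag_s:
        return "stalled"
    if status.seconds_behind > warn_lag_s:
        return "warn"
    return "ok"
```

A "stalled" result is the kind of signal that could trigger the graceful-degradation plan, for example switching the lagging site to read-only.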

Pitfall 2: Stretching Clusters Across Environments

Running a single MySQL cluster stretched across on-prem and cloud using synchronous or semi-synchronous replication looks clean on paper. In practice, it often leads to:

Slower commits due to cross-site round trips
Increased risk of split-brain during partial outages
Application-visible incidents triggered by minor network glitches
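
To get a feel for the commit slowdown, a back-of-the-envelope model helps. The sketch below assumes each commit waits for one acknowledgement from the remote site, so it pays roughly one cross-site round trip on top of the local commit time. The function names and the example numbers are illustrative, not measurements.

```python
def commit_latency_ms(local_commit_ms: float, cross_site_rtt_ms: float) -> float:
    """Approximate commit time when every commit waits for one remote ack."""
    return local_commit_ms + cross_site_rtt_ms


def max_serial_commits_per_second(commit_ms: float) -> float:
    """Upper bound for one connection committing back to back."""
    return 1000.0 / commit_ms


# Assumed numbers: 2 ms local commit, 40 ms round trip between sites.
local_only = max_serial_commits_per_second(2.0)                          # 500.0
stretched = max_serial_commits_per_second(commit_latency_ms(2.0, 40.0))  # about 23.8
```

Under these assumptions a single connection drops from about 500 to about 24 commits per second, before any jitter or packet loss on the link is taken into account.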

How to avoid it:

Use asynchronous replication between sites
Treat each location as its own failure domain
Make cross-site promotion a deliberate action, not an automatic reaction

Pitfall 3: No Clear Owner for Writes

One of the most dangerous situations in hybrid setups is ambiguity around who is allowed to write.

Common warning signs:

Both on-prem and cloud accept writes “temporarily”
Ad-hoc active-active replication between sites
Different teams believe different sites are primary

This almost always ends in data inconsistency.

How to avoid it:

Decide on, and document, a single write authority at any given moment
If you use multi-primary designs, strictly limit them by:
- Partitioning data (by region, tenant, or shard)
- Establishing site affinity (e.g., by user or customer)
- Defining conflict rules up front
Encode these rules in tooling or automation, not in tribal knowledge
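
One way to move write ownership out of tribal knowledge is to make it an explicit check in code. The following is a minimal sketch that assumes data is partitioned by tenant; the tenant names, site names, `TENANT_HOME_SITE` mapping, and `assert_write_allowed` helper are all hypothetical.

```python
from typing import Mapping


class WriteOwnershipError(Exception):
    """Raised when a site tries to write data it does not own."""


# Hypothetical tenant-to-site assignment. In practice this would live in
# versioned configuration so that changes are reviewed and auditable.
TENANT_HOME_SITE = {
    "acme": "onprem",
    "globex": "cloud",
}


def assert_write_allowed(tenant: str, local_site: str,
                         ownership: Mapping[str, str] = TENANT_HOME_SITE) -> None:
    """Refuse a write unless local_site is the documented owner of the tenant."""
    owner = ownership.get(tenant)
    if owner is None:
        raise WriteOwnershipError(f"tenant {tenant!r} has no assigned write site")
    if owner != local_site:
        raise WriteOwnershipError(
            f"tenant {tenant!r} is owned by {owner!r}, not {local_site!r}"
        )
```

Calling `assert_write_allowed("acme", "cloud")` raises instead of silently accepting a "temporary" write at the wrong site.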

Pitfall 4: DNS and Routing That Don’t Keep Up

In hybrid HA, routing is just as important as replication. Replication moves the data, but routing determines where applications actually connect when failures occur.

Problems we often see:

DNS TTLs that are too long
Load balancers sending writes to read-only sites
Client-side caching masking topology changes

How to avoid it:

Keep DNS TTLs short and validate real client behavior
Prefer database-aware routers or proxies over static hostnames
Separate read and write endpoints and enforce their roles consistently
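
To show what routing by database role can look like, here is a small sketch of an endpoint selector. It assumes something else keeps the `read_only` flag in sync with each server's actual state; the `Endpoint` class, `pick_endpoint` function, and hostnames are made up for the example and are not the API of any particular proxy.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Endpoint:
    host: str        # hypothetical hostname
    site: str
    read_only: bool  # assumed to mirror the server's read-only state


def pick_endpoint(endpoints: Sequence[Endpoint], want_write: bool) -> Endpoint:
    """Choose an endpoint by role instead of by a static hostname."""
    writable = [e for e in endpoints if not e.read_only]
    if want_write:
        # Zero writable endpoints means a site is mid-switchover; more than
        # one means a possible dual primary. Fail loudly in both cases.
        if len(writable) != 1:
            raise RuntimeError(
                f"expected exactly one writable endpoint, found {len(writable)}"
            )
        return writable[0]
    # Reads prefer read-only endpoints and fall back to the writer.
    candidates = [e for e in endpoints if e.read_only] or writable
    if not candidates:
        raise RuntimeError("no endpoints available")
    return candidates[0]
```

The point of the design is that writes can never land on a read-only site by accident: the selector refuses rather than guesses.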

Pitfall 5: Mixing Storage and Backups Without a Plan

Recovery depends on more than replication. Without a consistent backup strategy across storage systems, failures become painful to recover from.

Hybrid environments almost always involve different storage systems:

SAN or NAS on-prem
Block or object storage in the cloud
Different snapshot and restore semantics

These differences matter most during recovery, when a snapshot taken on one platform often cannot be restored directly on the other.

How to avoid it:

Rely on portable logical backups that work everywhere
Regularly test restores across environments
Clearly define RPO and RTO expectations per site and design backups accordingly
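
RPO and RTO targets are easier to enforce when they are checked mechanically. The sketch below is a simplified model that treats the age of the last restorable backup as the worst-case data loss (ignoring any binary logs you may also keep) and compares a measured restore time against the RTO. The `backup_gaps` function and all values are hypothetical.

```python
from datetime import datetime, timedelta
from typing import List


def backup_gaps(last_backup_at: datetime, now: datetime, rpo: timedelta,
                measured_restore: timedelta, rto: timedelta) -> List[str]:
    """Return which targets a site currently misses: 'rpo', 'rto', both or none."""
    problems = []
    # Worst-case data loss if we had to restore from the last backup right now.
    if now - last_backup_at > rpo:
        problems.append("rpo")
    # measured_restore should come from a real restore test, not an estimate.
    if measured_restore > rto:
        problems.append("rto")
    return problems
```

Running a check like this per site, with restore durations taken from actual cross-environment restore tests, makes gaps visible before an incident does.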

For more info about creating a backup plan, read our blog post here.

Pitfall 6: Designing for Cloud First, On-Prem Second

When a company leans too heavily on cloud features and treats on-prem as a fallback, it usually leads to:

Manual, unreliable failover
Inconsistent security models
Completely different operational procedures

How to avoid it:

Define one HA and security model for both environments
Make sure either side can operate independently if the other is unavailable
Use the same monitoring, alerting, and SLOs everywhere

Pitfall 7: Automating Failover Too Early

Automatic cross-site failover promises speed, but without careful design it can turn minor issues into major incidents.

Risks include:

False positives promoting the wrong site
Network partitions creating dual primaries
Painful manual recovery after the fact

How to avoid it:

Start with manual, well-rehearsed failover
Use multiple signals before promotion (DB health, replication state, app checks)
Practice failover regularly, including partial failures
Implement and test failback procedures too
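
The "multiple signals" rule can be sketched as a small gate in front of any promotion. This example assumes three independent checks of the current primary site, sampled periodically, and only proposes promotion (for a human to confirm) when every check has failed several times in a row. `HealthSample`, `should_propose_promotion`, and the default of three consecutive samples are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class HealthSample:
    """One round of checks against the current primary site."""
    db_reachable: bool
    replication_healthy: bool
    app_check_passed: bool


def should_propose_promotion(samples: Sequence[HealthSample],
                             required_consecutive: int = 3) -> bool:
    """True only if all signals failed in each of the last N samples."""
    if len(samples) < required_consecutive:
        return False
    recent = samples[-required_consecutive:]
    return all(
        not s.db_reachable and not s.replication_healthy and not s.app_check_passed
        for s in recent
    )
```

A single failed probe, or a database check that fails while the application still works, does not pass the gate, which filters out many of the false positives caused by brief network glitches.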

A Safer Reference Architecture

A safer hybrid architecture starts by accepting that failures will happen and designing for controlled, predictable recovery.

A simple, reliable hybrid MySQL HA pattern looks like this:

Independent HA per site
- On-prem: local MySQL HA
- Cloud: local MySQL HA
Asynchronous replication between sites
- One site handles writes
- The other stays ready for recovery
Controlled site switchover
- Stop writes
- Catch replicas up
- Promote target site
- Update routing
Unified observability
- Same metrics, logs, and alerts everywhere
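
The controlled site switchover steps above can be sketched as follows. This is a toy model, not a real orchestration tool: `Site`, `controlled_switchover`, and the integer `applied_position` (a stand-in for a GTID set or binlog coordinate) are hypothetical, and `update_routing` is whatever callable repoints your proxies or DNS.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Site:
    name: str
    writable: bool
    applied_position: int  # stand-in for a GTID set or binlog coordinate


def controlled_switchover(source: Site, target: Site,
                          update_routing: Callable[[str], None]) -> None:
    """Move the write role from source to target in a fixed, safe order."""
    if not source.writable or target.writable:
        raise RuntimeError("expected a writable source and a read-only target")
    # 1. Stop writes on the current primary site.
    source.writable = False
    # 2. Make sure the target has applied everything the source committed.
    #    On failure both sites stay read-only until an operator decides.
    if target.applied_position < source.applied_position:
        raise RuntimeError("target has not caught up; aborting before promotion")
    # 3. Promote the target site.
    target.writable = True
    # 4. Point applications at the new primary.
    update_routing(target.name)
```

The ordering is the design choice here: writes stop before promotion, so there is never a moment when both sites accept writes.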

This design trades a small recovery window for far fewer surprises.

Before You Go Live: a Reality Check

Many hybrid MySQL designs fail not because of missing features, but because basic operational questions were never answered. Before calling your hybrid MySQL setup “highly available,” make sure:

Everyone agrees where writes happen right now.
Replication lag and failures are visible and alerted.
You’ve tested restores across environments.
Routing follows database roles.
Failovers have been practiced end-to-end.

If your team can confidently answer those questions, you’re already ahead of most hybrid deployments. If not, the best improvement is often a simpler, more explicit design, not more automation.

Published In

Categories:

Cluster Management, Total Cost of Ownership

Series:

Tungsten University

Tags:

High Availability, hybrid cloud, Architecture, cross-site replication, failover

Author

Dmitry Skripka

Director of Marketing Infrastructure

Dmitry has over 15 years of experience in web development and digital production. Before joining Continuent, he worked at a range of companies, including The Moscow Times and Severalnines. He now runs our website and marketing activities such as webinars and white papers.
