MySQL Replication Lag – Part 2: Diagnosis and Practical Fixes | Blog

Blog

In the previous article, we examined how MySQL replication works and where lag originates. We looked at the mechanics: binlog generation, event transfer, relay log writes, and transaction apply.

This article focuses entirely on execution.

If you are dealing with MySQL replication lag, this is the operational guide:

How to measure lag correctly
How to determine where the delay is occurring
Concrete MySQL slave lag fixes
How to reduce replication delay safely
When architectural adjustments are required
Where cluster-level tooling (including Tungsten Cluster) can help

No theory repetition. This is about diagnosis and correction.

Measure Replication Lag Correctly

The first operational mistake teams make is assuming that Seconds_Behind_Master is sufficient.

It is not.

It is a derived metric and can read zero even when lag exists, and it can spike during large transactions without representing steady-state delay. You need multiple measurements.

Check Replication State

Start with:

SHOW SLAVE STATUS\G

Focus on:

Seconds_Behind_Master
Slave_IO_Running
Slave_SQL_Running
Relay_Log_Space
Last_SQL_Error
Last_IO_Error

If either I/O or SQL threads are not running, this is not lag - it is replication failure. If both are running, continue.

Check Relay Log Growth

If Relay_Log_Space is continuously increasing, the replica is receiving events faster than it is applying them.

You can also inspect:

SHOW GLOBAL STATUS LIKE 'Relay_log_space';

If relay logs are growing, but CPU and disk look healthy, suspect large transactions or locking in the apply sage.

Use a Heartbeat Table

For accurate measurement, implement a heartbeat table on the primary:

CREATE TABLE replication_heartbeat (
    id INT PRIMARY KEY,
    ts TIMESTAMP NOT NULL
);

Update it periodically:

UPDATE replication_heartbeat SET ts = NOW() WHERE id = 1;

On the replica:

SELECT TIMESTAMPDIFF(SECOND, ts, NOW()) AS replication_delay
FROM replication_heartbeat
WHERE id = 1;

This measures actual apply delay on the replica.

Heartbeat-based measurement is significantly more reliable than Seconds_Behind_Master.

Compare Write Rates

Run the same queries on primary and replica and compare rate per second:

SHOW GLOBAL STATUS LIKE 'Com_insert';
SHOW GLOBAL STATUS LIKE 'Com_update';
SHOW GLOBAL STATUS LIKE 'Com_delete';

If primary write throughput exceeds replica apply throughput, lag is expected.

Identify Where the Bottleneck Is

Replication lag always corresponds to one of three bottlenecks:

network
disk I/O
apply execution (CPU / locking / transaction structure)

Your goal is to isolate which and the following checks determine which stage is responsible.

Network-Related Replication Lag

Network delay is common in cross-region replication.

Indicators

replica I/O thread frequently in Waiting for master to send event
high RTT between nodes
limited throughput
lag increases proportionally with binlog size

Measure Network Latency

Note

iperf3 requires both a server and a client to run, and it actively consumes network bandwidth during the test, so it’s NOT recommended for use in production environments.

ping primary_host

Measure sustained bandwidth:

iperf3 -c primary_host

If RTT is high (tens to hundreds of milliseconds), semi-sync replication will increase commit latency, but asynchronous replication will still accumulate lag if bandwidth is insufficient.

How to Reduce Replication Delay (Network)

Note

Avoiding link saturation from other traffic can be the most effective approach - reduce additional network traffic on the same link or interface.

Increase available bandwidth
Reduce cross-region distance
Use private network links
Enable compression (if CPU allows)
Avoid saturating link with other traffic

Cluster managers that support WAN-optimized replication (including Tungsten Cluster) can improve binlog streaming efficiency and manage multi-region topologies more predictably than native MySQL alone, particularly in active-active or cross-datacenter setups.

However, network throughput limits cannot be bypassed by configuration alone.

Disk I/O Bottlenecks

Large Transactions

Large transactions are a predictable source of replication delay. Replication applies transactions atomically. A large transaction blocks worker threads until fully applied.

Detect Large Transactions

Check binary log sizes, look for sudden jumps:

SHOW BINARY LOGS;

Check current SQL thread activity:

SHOW PROCESSLIST;

If the replica SQL thread is processing a single large statement for an extended period, lag will increase.

Reduce Replication Delay from Large Transactions

Operational fixes:

break bulk updates into batches
use smaller DELETE statements with LIMIT
avoid multi-million row UPDATE in single transaction
avoid long-running explicit transactions

This is one of the most effective MySQL slave lag fixes available.

Enable and Tune Parallel Replication

Modern MySQL versions support parallel replication. Parallel replication is often the difference between sustained lag and stable performance under write-heavy workloads.

Check configuration:

SHOW VARIABLES LIKE 'slave_parallel_workers';

If zero, replication is single-threaded. Enable workers:

SET GLOBAL slave_parallel_workers = 8;
SET GLOBAL slave_parallel_type = 'LOGICAL_CLOCK';
SET GLOBAL slave_preserve_commit_order = 1;

Parallel replication reduces apply-stage bottlenecks when transactions are independent. Though, parallelism benefits are limited when the workload updates the same rows repeatedly.

Lock Contention on Replica

If replicas serve read traffic, long-running queries can block apply threads.

Diagnose Lock Waits

SHOW ENGINE INNODB STATUS\G

Look for:

Lock waits involving replication threads
Metadata lock conflicts

Also, if replication threads are blocked, lag accumulates:

SELECT * FROM performance_schema.data_locks;

Reduce Lock-Based Replication Delay

Limit analytical queries on replicas.
Use separate replicas for reporting.
Kill long-running blocking sessions.
Review index coverage to reduce read locking impact.

Intelligent proxy routing can help isolate read-heavy workloads from replicas used for failover. Tungsten Cluster includes such routing control, which can reduce operational contention in mixed workloads.

CPU Saturation

High CPU usage slows transaction apply. CPU limitations often appear after enabling row-based replication on write-heavy systems.

Check:

top
mpstat -P ALL 1

If CPU remains above 85–90% during lag growth, compute capacity is insufficient.

Fixes include:

Increase CPU allocation
Reduce unnecessary indexes
Optimize queries
Enable parallel replication

Configuration Alignment

Certain settings affect replication performance.

Check:

SHOW VARIABLES LIKE 'sync_binlog';
SHOW VARIABLES LIKE 'innodb_flush_log_at_trx_commit';
SHOW VARIABLES LIKE 'binlog_format';

Misalignment between primary and replica durability settings can create apply imbalances. Row-based replication increases event volume but improves correctness.

Changing durability settings reduces latency, but increases crash recovery risk. Such changes must align with availability objectives.

When Replication Lag Is Architectural

If you have:

cross-region replication
high write throughput
low-latency SLA requirements
strict failover windows

Then replication tuning alone may not solve the problem.

At that point, consider:

read/write traffic separation
active-active architecture
intelligent proxy routing
automated failover systems aware of replication state

Tungsten Cluster operates on asynchronous replication but adds:

coordinated failover
replication state monitoring
intelligent routing
multi-region topology control

It does not eliminate replication delay by design. Instead, it manages its operational impact and ensures safe promotion decisions based on actual replication state. For many SaaS platforms, that combination — advanced async replication with strong cluster management — provides better stability than moving to fully synchronous systems.

A Structured Workflow for MySQL Replication Lag Fixes

When lag appears, follow this order:

Confirm replication threads are running

measure real delay using heartbeat
check relay log growth
inspect disk I/O saturation
inspect CPU utilization
detect large transactions
review parallel replication configuration
check lock contention
compare write throughput rates

Only change one variable at a time.

Do not:

restart replicas without diagnosis
promote replicas under heavy lag
disable durability without risk evaluation
assume semi-sync will fix lag

Reducing Replication Delay in Production

Reducing replication delay is rarely about one setting.

It is about balancing:

write generation rate
replica apply capacity
network throughput
storage performance
concurrency design

In most production environments, sustained MySQL replication lag is resolved by:

enabling parallel replication
improving replica disk I/O
reducing large transactions
isolating read workloads
scaling hardware appropriately

Replication lag is rarely random. It corresponds to measurable constraints.

Final Observations

MySQL replication lag is a capacity and architecture problem.

It is caused by:

network constraints
storage limitations
CPU saturation
lock contention
transaction design
configuration misalignment

It is reduced by:

accurate measurement
removing bottlenecks
enabling concurrency
optimizing write patterns
using an architecture that supports your workload scale

Asynchronous replication remains the most operationally stable model for many systems, provided lag is monitored and managed correctly. Semi-sync reduces data loss windows but does not eliminate replication delay. Fully synchronous models remove lag but introduce latency and scalability trade-offs.

If you combine strong monitoring, controlled failover logic, intelligent routing, and advanced asynchronous replication management, you can operate high-write MySQL systems with predictable replication behavior.

Replication lag is not eliminated by default. It is engineered into acceptable bounds.

Published In

Categories:

Advanced Replication, Performance

Series:

Tungsten University

Tags:

replication, practice, how to, Performance Tuning

Author

Dmitry Skripka

Director of Marketing Infrastructure

Dmitry has over 15 years of experience in web development & digital productions. Before joining Continuent he was working at a wide range of companies like The Moscow Times & Severalnines. Now he's in charge of our website & all sorts of marketing activities like webinars, white papers, etc.

View All Dmitry’s Posts

MySQL Replication Lag – Part 2: Diagnosis and Practical Fixes