Summary
Asynchronous replication is often avoided because of the typical assumption that a replica may be promoted before it has received or applied all committed transactions, a risk synchronous replication appears to remove.
Yet synchronous replication is NOT the safer choice. Why? It can introduce:
- slowdowns
- blocked writes
- stale reads
- conflicting updates
- difficult recovery scenarios
Tungsten's HA model is built around verified transaction state rather than assumed lockstep behavior, using the following safeguards:
- THL tracking
- Candidate filtering
- Binlog draining
- Apply waits
- Recovery safeguards
- Node shunning to protect data integrity during failure
Overview
Synchronous replication is often treated as the safer choice because it appears to keep every node in lockstep. That confidence is misleading. Synchronous clustering can move coordination problems into the write path, creating slowdowns, blocked writes, stale reads, conflicting updates, and difficult recovery scenarios when nodes, networks, or queues fall out of step.
Asynchronous replication is often feared for the opposite reason: people assume a replica may be promoted before it has received or applied all committed transactions. That risk is real in basic async designs where failover is based mainly on availability.
Continuent deliberately chose asynchronous replication for Tungsten Cluster because that risk can be controlled more directly than the coordination risks of synchronous clustering. Tungsten Replicator records complete transactions into the Transaction History Log (THL), and Tungsten Manager uses stored and applied THL position, datasource health, latency, quorum, and recovery rules to decide whether promotion is safe.
The result is an HA model built around verified transaction state rather than assumed lockstep behavior. The sections below explain how Tungsten Cluster uses THL tracking, candidate filtering, binlog draining, apply waits, recovery safeguards, and shunning to protect data integrity during failure.
THL - Transaction History Log: Establishing a Shared View of State
Tungsten Replicator extracts change data from the source database (from the binary logs, or binlogs, for MySQL) and records complete transactions into the Transaction History Log (THL). Each transaction is written as a complete unit and assigned an incremental global transaction ID, allowing the system to determine whether a given transaction has been received and applied on any datasource.
This creates a durable, ordered record of change that every node in the cluster can reason about. More importantly, Tungsten tracks two distinct positions for each replica: how much THL it has stored, and how much it has applied to the database. That distinction becomes critical during failover. A replica may be ahead in received transactions but behind in applied state, and both dimensions must be considered before promotion.
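To make the stored-versus-applied distinction concrete, here is a minimal Python sketch. The class and field names (`ReplicaThlState`, `stored_seqno`, `applied_seqno`) are illustrative, not Tungsten's actual API:

```python
from dataclasses import dataclass

@dataclass
class ReplicaThlState:
    """Illustrative model of the two THL positions Tungsten tracks per replica."""
    name: str
    stored_seqno: int   # highest global transaction ID written into local THL
    applied_seqno: int  # highest global transaction ID applied to the database

    def fully_caught_up(self, primary_seqno: int) -> bool:
        # Both dimensions matter: a replica can hold events it has not yet applied.
        return self.stored_seqno >= primary_seqno and self.applied_seqno >= primary_seqno

# A replica may be ahead in received transactions but behind in applied state:
r = ReplicaThlState("db2", stored_seqno=1050, applied_seqno=1042)
print(r.fully_caught_up(primary_seqno=1050))  # → False: stored is current, applied is not
```

A promotion decision that looked only at one of these two numbers could pick a node missing committed work, which is why both are evaluated before failover.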
Safeguard function: Establishes a shared transaction record that distinguishes between received and fully applied data across replicas.
Tungsten Manager: Turning State Into Decisions
Tungsten Manager acts as the control plane for the cluster. It monitors replication status across datasources, maintains cluster health, and communicates state changes to Tungsten Connector, so traffic can be redirected when needed. Failover is based on the full observed state of the system rather than a single failure signal.
At any point, the Manager is reasoning over:
- datasource role and availability (ONLINE, STANDBY, ARCHIVE)
- replicator state (ONLINE, SYNCHRONIZING, or otherwise)
- applied and stored THL positions
- Manager availability across the cluster
These inputs feed a rules engine that determines whether to restart services, promote a new Primary, delay failover, or block unsafe recovery paths entirely.
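As a simplified picture of what the rules engine does with these inputs, consider the sketch below. The boolean flags and action names are illustrative only and do not correspond to Tungsten Manager's actual rule set:

```python
def decide_action(primary_alive: bool, candidate_available: bool,
                  managers_have_quorum: bool) -> str:
    """Illustrative dispatch over observed cluster state (not Tungsten's rule engine)."""
    if not managers_have_quorum:
        return "block-recovery"      # no majority: refuse unsafe recovery paths entirely
    if primary_alive:
        return "no-op"               # healthy cluster: nothing to do
    if candidate_available:
        return "promote-candidate"   # full observed state supports a safe promotion
    return "delay-failover"          # wait rather than promote an unsafe node

print(decide_action(primary_alive=False, candidate_available=True,
                    managers_have_quorum=True))  # → promote-candidate
```

The point of the sketch is the shape of the decision: promotion is one possible outcome among several, and "do nothing yet" is a legitimate answer when the observed state does not support a safe move.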
Safeguard function: Keeps failover tied to known datasource, replicator, Manager, and transaction state.
Promotion Candidate Selection: Filtering Before Comparison
When a Primary failure occurs, Tungsten Manager does not immediately compare all replicas. It first removes candidates that are not safe to promote. A replica is excluded if it is:
- not ONLINE
- not configured as a STANDBY
- marked as ARCHIVE
- missing an ONLINE Manager
- running a replicator that is not ONLINE or SYNCHRONIZING
Only after this filtering step does the Manager evaluate viable candidates. It compares both applied and stored THL positions, which allows it to resolve cases where one replica has applied more transactions while another has received a more complete set of events from the Primary. In those situations, preference can be given to the replica with the most complete stored THL, reducing the likelihood of transaction loss.
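The filter-then-compare flow can be sketched in Python. The exclusion rules mirror the list above; the tie-breaking order (stored THL first, then applied position) is an illustrative simplification of the preference described, not the Manager's exact algorithm:

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    name: str
    state: str             # datasource state, e.g. "ONLINE"
    role: str              # e.g. "STANDBY" or "ARCHIVE"
    manager_online: bool
    replicator_state: str  # e.g. "ONLINE" or "SYNCHRONIZING"
    applied_seqno: int
    stored_seqno: int

def viable(c: Candidate) -> bool:
    # Filtering happens before any comparison: unsafe nodes never compete.
    return (c.state == "ONLINE"
            and c.role == "STANDBY"
            and c.manager_online
            and c.replicator_state in ("ONLINE", "SYNCHRONIZING"))

def pick(candidates: list[Candidate]) -> Candidate:
    # Prefer the most complete stored THL, breaking ties on applied position.
    return max(filter(viable, candidates),
               key=lambda c: (c.stored_seqno, c.applied_seqno))

replicas = [
    Candidate("db2", "ONLINE", "STANDBY", True, "ONLINE", applied_seqno=1048, stored_seqno=1050),
    Candidate("db3", "ONLINE", "STANDBY", True, "ONLINE", applied_seqno=1050, stored_seqno=1049),
    Candidate("db4", "ONLINE", "ARCHIVE", True, "ONLINE", applied_seqno=1050, stored_seqno=1050),
]
print(pick(replicas).name)  # → db2: most complete stored THL among viable nodes
```

Note that db4 holds the most data of all three but is never considered: an ARCHIVE node fails the filter before the comparison stage is reached.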
Safeguard function: Prevents promotion of replicas that are incomplete, unavailable, or not in a valid state for leadership.
Latency Thresholds: Preventing Stale Promotion
Replication lag is treated as a hard safety constraint rather than an operational inconvenience. Tungsten enforces this through the promotion latency threshold:
policy.slave.promotion.latency.threshold = 900
A replica whose applied latency exceeds this threshold is not considered a failover candidate. This prevents the system from promoting a node that is significantly behind, even if it is otherwise healthy. The goal is to restore availability only when the new Primary represents a sufficiently current view of the data.
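The threshold check itself is simple; a hedged sketch, assuming "exceeds" means strictly greater than the configured value:

```python
PROMOTION_LATENCY_THRESHOLD = 900  # seconds, mirroring the property above

def eligible_for_promotion(applied_latency_seconds: float) -> bool:
    """A replica lagging past the threshold is simply not a failover candidate."""
    return applied_latency_seconds <= PROMOTION_LATENCY_THRESHOLD

print(eligible_for_promotion(120.0))   # → True
print(eligible_for_promotion(1800.0))  # → False: too stale to become Primary
```

Treating lag as a hard gate rather than a tiebreaker means a badly delayed replica is excluded outright, even if it would otherwise win every other comparison.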
Safeguard function: Prevents a significantly delayed Replica from becoming Primary.
Binlog Drain Behavior: Completing Recovery Before Promotion
One of the most important protections applies when the Primary fails but its binary logs remain accessible. In this scenario, Tungsten Replicator continues extracting remaining events from the Primary’s binlogs and writing them into THL.
With the default setting:
replicator.store.thl.stopOnDBError = false
Failover is briefly held while Tungsten finishes extracting any remaining recoverable binlog events. In practice, this is a short, bounded step and does not materially change RTO, which is typically under one minute.
That means:
- remaining binlog events are read from the failed Primary
- those events are written into THL
- replicas receive the complete set of transactions
This behavior keeps failover aligned with the most complete recoverable transaction history.
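The three steps above can be sketched as a drain-then-promote sequence. Everything here is hypothetical plumbing (the event iterable, THL object, and promotion callback are stand-ins), intended only to show the ordering guarantee:

```python
def drain_binlog_then_promote(remaining_binlog_events, thl, promote):
    """Illustrative drain step: write recoverable Primary events into THL,
    then hand off to promotion. Arguments are hypothetical stand-ins."""
    drained = 0
    for event in remaining_binlog_events:  # events still readable from the failed Primary
        thl.append(event)                  # replicas will receive the complete set
        drained += 1
    promote()                              # failover proceeds only after the drain
    return drained

thl_log = []
promoted = []
n = drain_binlog_then_promote(["txn-101", "txn-102"], thl_log,
                              lambda: promoted.append(True))
print(n, thl_log, promoted)  # → 2 ['txn-101', 'txn-102'] [True]
```

The essential property is sequencing: promotion is a strictly later step than the drain, so no recoverable transaction is abandoned for the sake of a slightly faster failover.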
Safeguard function: Preserves recoverable transactions from the failed Primary with minimal RTO impact before promotion proceeds.
THL Apply Wait: Confirming the Promotion Candidate
Even after a candidate replica is selected, Tungsten performs one more consistency check before promotion. The Manager confirms that the candidate has applied the THL events already stored on that node, so the new Primary starts from its most complete available transaction state. This is controlled by:
manager.failover.thl.apply.wait.timeout = 0
With the default setting, Tungsten gives the selected candidate time to finish applying its local THL before it becomes Primary, which in practice completes within normal failover time bounds. This keeps promotion aligned with the candidate’s stored transaction history and helps ensure that failover moves forward from the most complete applied state available.
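A sketch of this apply wait, assuming (as an illustration, not a documented guarantee) that a timeout of 0 means "wait without a deadline":

```python
import time

def wait_for_thl_apply(get_applied, get_stored, timeout_seconds=0, poll=0.01):
    """Illustrative apply wait: block promotion until the candidate has applied
    all THL it has stored. The 0-means-no-deadline semantics are an assumption."""
    deadline = None if timeout_seconds == 0 else time.monotonic() + timeout_seconds
    while get_applied() < get_stored():
        if deadline is not None and time.monotonic() > deadline:
            return False  # timed out before the candidate caught up
        time.sleep(poll)
    return True  # all stored THL applied; promotion can proceed

applied = [1048]
STORED = 1050

def get_applied():
    applied[0] = min(applied[0] + 1, STORED)  # simulate the applier catching up
    return applied[0]

print(wait_for_thl_apply(get_applied, lambda: STORED))  # → True
```

The check closes the gap described earlier: a candidate chosen partly for its complete stored THL must actually finish applying that THL before it takes writes as Primary.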
Safeguard function: Promotes the selected replica only after its stored THL has been applied.
Recovery Safeguards: Controlled Reentry After Failover
After a successful failover, the former Primary remains out of service until it is explicitly recovered. This prevents a node that may have a different transaction history from automatically rejoining the active topology.
Recovery is handled as a controlled step. Operators can validate the former Primary’s state, inspect possible orphaned events with tools such as tungsten_find_orphaned, and decide whether the node should be recovered or rebuilt before it returns to service.
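The recover-or-rebuild decision can be reduced, for illustration only, to a comparison of transaction histories. Real validation is richer than a single sequence-number check (which is what tools such as tungsten_find_orphaned help with), so treat this as a caricature of the reasoning:

```python
def recovery_path(former_primary_seqno: int, new_primary_seqno: int) -> str:
    """Illustrative decision only: a former Primary ahead of the cluster holds
    orphaned transactions the rest of the cluster never saw."""
    if former_primary_seqno > new_primary_seqno:
        return "rebuild"  # histories diverged; rejoining as-is would be unsafe
    return "recover"      # history is a prefix of the cluster's; safe to rejoin

print(recovery_path(former_primary_seqno=1052, new_primary_seqno=1050))  # → rebuild
print(recovery_path(former_primary_seqno=1050, new_primary_seqno=1050))  # → recover
```

The operator, not an automatic process, makes this call, which is exactly the point of keeping the former Primary out of service until then.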
Safeguard function: Prevents a former Primary from rejoining the cluster until its state has been validated or safely rebuilt.
Quorum and Shunning: Protecting Against Split Brain
Tungsten Cluster also enforces quorum rules to prevent split-brain scenarios during network partitions. The Manager requires either an odd number of members or a witness to establish a majority. Without a majority, a partition cannot safely continue operating as the authoritative cluster.
During partition events:
- only the majority partition remains active
- a Primary in a minority partition can be shunned
- in full partition scenarios, nodes may enter FAILSAFE SHUN mode
These mechanisms are designed to prevent multiple writable partitions from forming and to preserve a recoverable transaction path under adverse network conditions.
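The core majority test behind these rules is small; a sketch, with a witness treated simply as one more voting member (a simplification of Tungsten's actual witness handling):

```python
def majority_partition(reachable: int, total: int) -> bool:
    """A partition is authoritative only with a strict majority of voting members."""
    return reachable > total // 2

# 3-node cluster split 2/1: only the 2-node side keeps operating.
print(majority_partition(2, 3))  # → True
print(majority_partition(1, 3))  # → False: a Primary on this side would be shunned
```

The strict-majority rule is why an odd member count (or a witness) matters: an even split of an even-sized cluster leaves no side with a majority, so neither side can safely claim to be the authoritative cluster.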
Safeguard function: Prevents multiple writable partitions during network failure.
Bringing It Together
Tungsten Cluster’s replication is asynchronous, and its failover behavior is tightly controlled through explicit safeguards. Data integrity is preserved through a combination of mechanisms that operate together:
- THL visibility: transaction-level visibility through THL, including stored and applied state
- Candidate filtering: strict filtering and comparison of promotion candidates
- Latency enforcement: enforced latency thresholds that prevent stale failover
- Binlog draining: delayed failover to drain remaining Primary binlog events
- Apply before promotion: mandatory application of THL events before promotion
- Controlled recovery: controlled recovery paths with orphaned-event analysis
- Split-brain protection: quorum enforcement and shunning to prevent split brain
These safeguards define how the system behaves when correctness is at risk. The result is a model where asynchronous replication provides flexibility and performance, while failover and recovery are governed by rules designed to keep the cluster as consistent and recoverable as possible. In practice, these safeguards operate within normal recovery time expectations while ensuring the cluster moves forward from the most complete available state.