Database clustering is supposed to remove doubt. You deploy multiple nodes so that failures do not cost you data, and so that the system continues to reflect a single, coherent state. Once a transaction commits, it should exist cluster-wide. When a write succeeds, a read that follows should not drift backwards in time.
In practice, commit is expected to mean final.
Galera Cluster is a synchronous multi-master replication solution for MySQL and MariaDB. Each node can accept reads and writes, and transactions are replicated to peer nodes during commit using a certification process that checks for conflicts before finalizing changes. The model promises virtually synchronous replication and cluster-wide consistency: all nodes in the cluster should hold the same data at all times, so applications can read from or write to any node and get consistent results. Once a transaction commits, it is expected to be durable across nodes, immediately visible to subsequent reads, and preserved through failover. In practice, this allows applications to treat nodes as interchangeable and to assume that acknowledged writes represent a stable cluster-wide state.
Kyle Kingsbury’s March 2026 Jepsen report on MariaDB Galera Cluster 12.1.2 identifies a gap between this promise and observed behavior under fault conditions. The report examines what a commit actually guarantees in Galera, and whether acknowledged transactions behave the way applications expect. Jepsen observed committed transactions that later disappeared. It demonstrated concurrent updates that silently overwrite one another. It showed successful commits that are not immediately visible to subsequent reads. Some of these behaviors appear only when failures overlap. Others appear during ordinary operation, without any injected faults.
This mix makes the Jepsen findings difficult to dismiss. The anomalies are not limited to rare multi failure scenarios. Some emerge during normal concurrency.
Jepsen Identifies Four Scenarios Across Failure and Normal Operation
The report ties these behaviors to four concrete findings:
- MDEV-38974 — Write loss under coordinated node crashes. Transactions acknowledged to clients did not survive recovery when multiple nodes failed close together, despite recovery completing successfully. This is a durability violation: transactions reported as committed were not preserved across recovery.

- MDEV-38976 — Write loss under a crash combined with a partition. Committed writes were lost when a node crash overlapped with a network partition, even with durability settings intended to prevent data loss. Commit acknowledgement did not correspond to durable state, meaning acknowledged transactions could later disappear.

- MDEV-38977 — Lost update in healthy clusters. Two clients updating the same value concurrently both received successful commits, but one update was silently discarded. This is the P4 Lost Update anomaly, and it violates both Snapshot Isolation and Repeatable Read guarantees.

- MDEV-38999 — Stale reads after commit. A client was able to read stale data immediately after another transaction committed, indicating commit-visibility lag within the cluster. This breaks read-after-write visibility: a transaction can commit successfully while subsequent reads observe an earlier state.
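The Lost Update anomaly (MDEV-38977) is the easiest of the four to picture in code. The following is a minimal, illustrative Python sketch, not Galera code: two clients perform a read-modify-write on the same row from the same snapshot, with no first-committer-wins check, so both commits succeed and one deposit silently vanishes.

```python
# Hypothetical in-memory model of the P4 Lost Update anomaly: both clients
# read the same committed snapshot, compute independent updates, and both
# commits are acknowledged, with the second overwriting the first.

def lost_update_demo():
    row = {"balance": 100}        # last committed state of the shared row

    # Both transactions take their snapshot before either commits.
    snapshot_a = row["balance"]
    snapshot_b = row["balance"]

    # Each client computes its update from its own snapshot.
    new_a = snapshot_a + 50       # client A deposits 50
    new_b = snapshot_b + 25       # client B deposits 25

    # Both commits succeed; B silently overwrites A's acknowledged write.
    row["balance"] = new_a        # A commits
    row["balance"] = new_b        # B commits

    return row["balance"]

result = lost_update_demo()
print(result)  # 125, not the expected 175: A's deposit was lost
```

Under Snapshot Isolation or Repeatable Read, the second commit should have failed or been serialized after the first; both succeeding is precisely what makes the anomaly silent.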
Two Classes of Issues: Durability Under Failure and Correctness During Normal Operation
The findings ultimately concentrate into two classes: durability under failure and correctness during normal operation. Each weakens a different assumption that applications typically rely on.
Durability Under Failure
When commit acknowledgement is treated as a durability boundary, downstream systems begin acting on a state that is assumed to be permanent. Events are emitted. Workflows advance. External systems update. If acknowledged transactions can later disappear, those actions are no longer anchored to the source of truth. The commit becomes less like writing to disk and more like leaving a note on a whiteboard that might be erased later.
Where this appears: MDEV-38974 and MDEV-38976, where acknowledged transactions may not survive overlapping failures.
What this causes: The immediate effect is not an outage, but divergence. External systems retain state based on writes that no longer exist. Event streams develop gaps and audit trails no longer reconcile.
Operational impact: Because operations that appeared successful may later be missing, recovery shifts from failover to reconciliation, manual repair, and reconstruction of lost transactions. That is the lucky path, and only if the missing entries are detected. In the worst case, data diverges silently from the expected external state, and reconciliation becomes a near-impossible task.
Consistency During Normal Operation
Application logic assumes that successful transactions produce a coherent result. When updates can be overwritten silently or committed changes are not immediately visible, that assumption no longer holds. Transactions that succeed independently do not necessarily produce a consistent final state. Logic that assumes read after write visibility or preserved update ordering begins to drift.
Where this appears: MDEV-38977 and MDEV-38999, where concurrent operations can succeed while producing inconsistent state.
What this causes: The result is silent state drift. Because each transaction individually succeeds, there is no clear signal that the resulting state is no longer accurate. Counters drift, ordering assumptions break, and state shared across services begins to diverge. The cluster remains available while consistency erodes inside normal traffic.
Operational impact: Detection happens after an incorrect state has already propagated. Teams must detect and reconcile lost updates, stale reads, and ordering anomalies after they have already been observed by applications.
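The stale-read side of this class (MDEV-38999) comes down to commit acknowledgement and visibility living on different timelines. The sketch below is a deliberately simplified two-node model, not cluster code: a write is acknowledged on one node while a read routed to a peer, before replication apply runs, still observes the earlier state.

```python
# Hypothetical two-node model: an acknowledged commit on node A is not yet
# visible on node B, so a read routed to B immediately after the ack is stale.

node_a = {"x": 0}   # node that accepts the write
node_b = {"x": 0}   # peer node; replication apply has not yet run

def commit_on_a(key, value):
    node_a[key] = value
    return "committed"          # acknowledged before node B applies

def read_from_b(key):
    return node_b[key]

def apply_on_b():
    node_b.update(node_a)       # replication apply, runs later

ack = commit_on_a("x", 42)
stale = read_from_b("x")        # read immediately after the ack
apply_on_b()
fresh = read_from_b("x")

print(ack, stale, fresh)        # committed 0 42: the first read was stale
```

Nothing in this sequence raises an error, which is the point: the staleness is only visible if the application compares what it wrote with what it subsequently read.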
Why These Issues Surface Late and Are Hard to Trace
Despite differing causes, these issues share the same operational pattern: they do not produce an immediate failure signal. The cluster remains available. Replication appears healthy. Monitoring stays green. The only signal is incorrect data.
When the signal is incorrect data rather than a system failure, investigation typically begins in application logic, retry behavior, or caching layers. The replication model rarely appears as the immediate suspect. What operators see is not a single failure, but a series of isolated anomalies. A missing record. An overwritten value. A stale read. Each looks like an application edge case and rarely reproduces under test conditions.
Nothing crashes, alarms do not fire, and yet the data no longer reflects what the application believes it wrote. By the time replication is considered, the triggering sequence has already passed and inconsistencies may have propagated into downstream systems.
These behaviors are not independent. They all stem from how Galera establishes write ordering and cluster state.
Preventing Write Loss and Inconsistent State
The issues Jepsen found in Galera are not all the same kind of failure, but they point back to the same design question: when is write ordering actually established?
In Galera’s synchronous, optimistic, certification-based multi-master model, transactions execute independently on different nodes and are only ordered during certification. That ordering happens before changes are fully applied across the cluster, so acknowledgement reflects certification agreement, not completed cluster-wide apply. The cluster therefore does not advance along a single shared transaction history.
The Jepsen findings show what can occur inside those windows: committed transactions can be lost, and successful operations can still produce stale reads or overwritten updates.
A system that avoids these behaviors establishes a single ordered transaction history so commit, reads, and failover derive from the same sequence.
Continuent later moved away from synchronous clustering after encountering failure scenarios that were difficult to reason about in production. Founder Eero Teerikorpi led the transition to an asynchronous architecture designed to separate commit acknowledgement from cluster wide certification. That work became Continuent Tungsten Cluster in 2010.
The resulting architecture anchors commit, reads, and recovery to a single replication log:

- Acknowledged writes survive failover: Commit occurs after the transaction is written to a durable replication log. Failover selects the most advanced log position, so acknowledged transactions remain part of the recovered state.

- Concurrent updates execute in order: Transactions are serialized into the replication stream, so read-modify-write paths follow a single ordered sequence instead of overlapping independent writers.

- Reads observe committed state in order: Read routing follows replica apply position, so visibility advances along the same history that failover preserves.
Under this model, the issues identified by Jepsen do not arise because commit, reads, and recovery all advance the same transaction history.
Main Takeaway
The Jepsen report shows that MariaDB Galera Cluster can lose acknowledged transactions under certain failure conditions and exhibit consistency anomalies during normal operation. These findings highlight the importance of understanding how replication models behave under concurrency and failure.
The findings ultimately narrow the question to what commit actually guarantees. If acknowledgement, visibility, and recovery are not anchored to the same ordering boundary, transactions can succeed while the resulting state diverges from what clients observed.
In the next article, we will compare Galera Cluster with Continuent Tungsten Cluster and examine how architectural differences influence durability, consistency, and operational behavior.