High Availability in an Active/Active Cluster: When Transparent Reconnects Are Not a Good Idea

One amazing feature of the Tungsten Connector Proxy is that it is fully aware of active/active clusters. It has the notion of sites (think multi-site) and can even be configured to write to a preferred primary site while reading from another site. Dynamic Active/Active is one of Continuent’s exclusive features: it allows instant failover between sites while providing data consistency for applications that are not "active/active aware".

On the other hand, one of the sweetest (and earliest!) features provided by the Connector is “transparent reconnects”. When a node failure occurs between two SQL transactions, the Tungsten Proxy is able to transparently reconnect the application to the newly elected primary; what’s amazing is that this can happen without having the application notice it at all!

Within multi-active clusters, it is perfectly fine to reconnect between the nodes of a local cluster. During local failover, the manager will wait for the failed primary’s binary log events to be fully transferred to the new primary, and application connections will be paused. Once the new primary is available, application reconnection occurs quickly, and the client application experiences only a very small execution delay.

Reconnecting across sites is a different matter. In active/active, failover between sites is done manually. Data might have been written to the failing site and not yet transferred to the second. Even outside a transaction, there are simple cases where a reconnection would create data inconsistencies:

Transaction 1: INSERT INTO TEST VALUES (1, 'Joe')

<Site failure>

Transaction 2: UPDATE TEST SET name='John' WHERE id=1

If a transparent reconnect occurs between transaction 1 and transaction 2, and transaction 1 has not reached the other site, transaction 2 will not update anything and "John" will never appear. Meanwhile, the application will happily continue without receiving any error.
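The scenario can be reproduced in miniature. The following is only a sketch: two in-memory SQLite databases stand in for the two sites, and "replication" is simply never performed, which models transaction 1 not having reached the second site. The names site_a, site_b and make_site are made up for the illustration and have nothing to do with Tungsten itself.

```python
import sqlite3


def make_site():
    # Each in-memory SQLite database stands in for one cluster site (assumption).
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE test (id INTEGER PRIMARY KEY, name TEXT)")
    return conn


site_a = make_site()
site_b = make_site()

# Transaction 1 commits on site A only; it has not been replicated to site B.
site_a.execute("INSERT INTO test VALUES (1, 'Joe')")
site_a.commit()

# <Site failure>: the application is silently switched over to site B.
active = site_b

# Transaction 2 raises no error, yet it matches no row on site B.
cursor = active.execute("UPDATE test SET name = 'John' WHERE id = 1")
active.commit()

rows_updated = cursor.rowcount
rows_on_b = active.execute("SELECT id, name FROM test").fetchall()
print(rows_updated, rows_on_b)
```

The UPDATE completes successfully with zero affected rows, which is exactly why an application that does not inspect row counts would never notice the problem.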

With Tungsten Active/Active clusters, it is up to the Database Administrator issuing the failover to make sure that the changes made on the failing site can be safely ignored before resuming application traffic.

When it comes to Dynamic Active/Active (DAA), reconnection will be automatic and immediate. Thus, great attention must be paid to side effects. In order to prevent them, the Tungsten Connector Proxy, where the core logic of Dynamic Active/Active resides, has a default configuration setting that prevents cross-site auto-reconnects when writing data. Applications that are writing data when a site fails will receive an error, allowing them to double-check consistency before resuming their work. This setting is connector-allow-cross-site-reconnectsForWrites=false.
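What "double-check consistency before resuming" could look like on the application side is sketched below. Everything here is hypothetical: WriteReconnectRefused stands in for whatever error the database driver actually raises, and FakeSite is a toy object with row_exists and set_name methods, not a real connection. The point is only the shape of the logic: reconnect explicitly, verify that earlier writes are present, and only then write again.

```python
class WriteReconnectRefused(Exception):
    """Hypothetical stand-in for the driver error seen when a site fails mid-write."""


class MissingPriorWrite(Exception):
    """Raised when a row written earlier is absent on the surviving site."""


class FakeSite:
    """Toy site used only for this illustration; not a real database connection."""

    def __init__(self, rows, fail_writes=False):
        self.rows = dict(rows)
        self.fail_writes = fail_writes

    def row_exists(self, row_id):
        return row_id in self.rows

    def set_name(self, row_id, name):
        if self.fail_writes:
            raise WriteReconnectRefused("site failed during write")
        if row_id in self.rows:
            self.rows[row_id] = name


def update_name(connect, row_id, new_name):
    """Update a row; if the connection is lost, verify prior writes before retrying."""
    conn = connect()
    try:
        conn.set_name(row_id, new_name)
    except WriteReconnectRefused:
        # Reconnect explicitly, then check consistency before writing again.
        conn = connect()
        if not conn.row_exists(row_id):
            raise MissingPriorWrite(f"row {row_id} did not reach the surviving site")
        conn.set_name(row_id, new_name)
    return conn


failed_site = FakeSite({1: "Joe"}, fail_writes=True)
surviving_site = FakeSite({})  # transaction 1 never arrived here
sites = iter([failed_site, surviving_site])
try:
    update_name(lambda: next(sites), 1, "John")
    outcome = "updated"
except MissingPriorWrite:
    outcome = "inconsistent"
print(outcome)  # prints: inconsistent
```

Because the application receives an error instead of being silently moved, it gets the chance to detect the missing row rather than running an UPDATE that quietly does nothing.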

Since read operations are less of an issue, we have made the choice to allow cross-site reconnects for read operations. For highly sensitive environments, it is however still possible to forbid these with the configuration flag connector-allow-cross-site-reconnectsForReads=false. For tolerant applications and when all risks are understood and appreciated, it is possible to allow full reconnects via connector-allow-cross-site-reconnectsForWrites=true and connector-allow-cross-site-reconnectsForReads=true.
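The combined effect of the two flags can be summarized as a small decision function. This is a model of the behavior described above, written for illustration; it is not the Connector's actual code, and the function and parameter names are invented. The defaults mirror the ones stated in this article: cross-site reconnects are refused for writes and allowed for reads.

```python
def reconnect_allowed(is_write, cross_site, allow_writes=False, allow_reads=True):
    """Model of the reconnect policy described in the text (illustrative only)."""
    if not cross_site:
        # Reconnecting between nodes of the local cluster is always fine.
        return True
    # Across sites, the decision depends on the kind of operation.
    return allow_writes if is_write else allow_reads


# Default settings: writes are refused across sites, reads go through.
print(reconnect_allowed(is_write=True, cross_site=True))
print(reconnect_allowed(is_write=False, cross_site=True))

# Highly sensitive environment: reads are refused across sites as well.
print(reconnect_allowed(is_write=False, cross_site=True, allow_reads=False))

# Tolerant application: both kinds of cross-site reconnect are allowed.
print(reconnect_allowed(is_write=True, cross_site=True, allow_writes=True))
```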

When it comes to high availability, great care must be taken when doing automatic application reconnects. Tungsten Clustering handles that for you in local clusters, and has proper protection in Active/Active environments, especially when reconnections are as immediate as in Dynamic Active/Active clusters!

About the Author

Gilles Rayrat
VP of Engineering

Gilles has over 20 years of experience in software engineering. Having previously held positions at Orange and Xerox, he joined the Continuent adventure in 2005. As the connectivity expert at Continuent, he has worn many hats including software development, QA, support, project and operations management. Gilles has held most of the engineering positions that he now manages, giving him both deep and wide experience.