Introduction
Recently a customer asked us about an outage they experienced after running modified Ansible scripts.
This blog post covers how to handle the database-node side of Ansible operations when a Tungsten cluster is in use.
Ansible is a deeply powerful admin tool, and as always, with great power comes great responsibility. Invoking an Ansible script can have a devastating impact in a production environment if changes are made incorrectly or without understanding what will happen.
Yet Ansible can safely co-exist with Tungsten Clusters as long as best practices are followed.
In this case, the customer was left with a cluster in a messy state as seen in the `cctrl` output displaying the status of all database nodes:
Tungsten Clustering 7.0.2 build 161
ecommerce_prod: session established, encryption=false, authentication=false
jgroups: unencrypted, database: unencrypted
[LOGICAL] /ecommerce_prod > ls
COORDINATOR
ROUTERS:
+---------------------------------------------------------------------------------+
|connector@ecommercep-connector1[1125](ONLINE, created=190354519, active=0) |
|connector@ecommercep-connector2[1219](ONLINE, created=190746390, active=0) |
|connector@ecommercep-connector3[1121](ONLINE, created=190386656, active=0) |
+---------------------------------------------------------------------------------+
DATASOURCES:
+---------------------------------------------------------------------------------+
|ecommercep-db1(slave:FAILED(DATASERVER 'ecommercep-db1@ecommerce_prod' STOPPED), progress=-1, |
|latency=-1.000) |
|STATUS [CRITICAL] [2025/10/27 01:11:24 PM UTC] |
|REASON[DATASERVER 'ecommercep-db1@ecommerce_prod' STOPPED] |
+---------------------------------------------------------------------------------+
| MANAGER(state=ONLINE) |
| REPLICATOR(role=slave, master=ecommercep-db2, state=SUSPECT) |
| DATASERVER(state=STOPPED) |
| CONNECTIONS(created=308462872, active=0) |
+---------------------------------------------------------------------------------+
+---------------------------------------------------------------------------------+
|ecommercep-db2(master:OFFLINE, progress=42221738159, THL latency=0.512) |
|STATUS [CRITICAL] [2025/10/27 01:11:34 PM UTC] |
|REASON[DATASERVER 'ecommercep-db2@ecommerce_prod' STOPPED, BUT THERE IS NO SLAVE TO PROMOTE] |
+---------------------------------------------------------------------------------+
| MANAGER(state=ONLINE) |
| REPLICATOR(role=master, state=DEGRADED) |
| DATASERVER(state=ONLINE) |
| CONNECTIONS(created=263024693, active=0) |
+---------------------------------------------------------------------------------+
+---------------------------------------------------------------------------------+
|ecommercep-db3(slave:FAILED:ARCHIVE (DATASERVER 'ecommercep-db3@ecommerce_prod' STOPPED), |
|progress=42221738159, latency=0.585) |
|STATUS [CRITICAL] [2025/10/27 01:11:38 PM UTC] |
|REASON[DATASERVER 'ecommercep-db3@ecommerce_prod' STOPPED] |
+---------------------------------------------------------------------------------+
| MANAGER(state=ONLINE) |
| REPLICATOR(role=slave, master=ecommercep-db2, state=ONLINE) |
| DATASERVER(state=STOPPED) |
| CONNECTIONS(created=0, active=0) |
+---------------------------------------------------------------------------------+
Status Analysis of cctrl Output
From the output above, MySQL appears to have been stopped on all three database hosts.
Timeline
MySQL was stopped on db1 at 1:11:24, which caused this node to go into a FAILED state. Since db1 is a Replica, no further action was needed.
MySQL was then stopped on db2, the Primary node, at 1:11:34. This would normally have triggered a failover, but because db1 was already in a FAILED state and db3 is configured as an ARCHIVE server (not a failover candidate), there was no host available to promote.
MySQL was then stopped on db3 at 1:11:38, putting that node into the FAILED state as well.
Tungsten Behavior
A Tungsten cluster is unable to automatically withstand the shutdown of all database processes on all nodes at once, so when there is a cascading multi-node failure, the outcome that you see above is expected.
When the nodes come back online, Tungsten does not automatically recover, because the Manager processes cannot know why the database servers were shut down. Managers in this condition deliberately stay offline until you or your team have performed the correct checks, to avoid the data corruption that would result from selecting the wrong node as the Primary.
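Those checks boil down to confirming, on every node, whether MySQL is actually running and which host holds the most recent data. A minimal sketch of such per-node checks (the `mysqld` service name is an assumption and may differ in your installation):

shell> systemctl status mysqld                                  # is the database process running?
shell> trepctl status | grep -E 'role|state|appliedLastSeqno'   # replicator role, state and applied sequence number
shell> echo 'ls' | cctrl                                        # the Manager's view of the whole cluster

Compare `appliedLastSeqno` across the nodes; the host with the highest value holds the most recent transactions and is normally the safest Primary candidate.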
Repairing the Damage
In this case, since db2 was stable, the following commands allow db2 to be chosen as the Primary and tell the Management layer to clean up the remaining Replicas once the Primary is back ONLINE.
shell> cctrl
cctrl> set force true
cctrl> datasource ecommercep-db2 welcome
cctrl> datasource ecommercep-db2 online
cctrl> recover
cctrl> exit
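Once the recover command completes, it is worth confirming that the cluster really is healthy again before handing it back to the application. A quick post-recovery check might look like this:

shell> cctrl
cctrl> ls
cctrl> cluster heartbeat
cctrl> exit

The `ls` output should show all three datasources ONLINE, and the heartbeat should replicate through to both Replicas.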
Best Practices
Before you touch packages, services, or configs on database nodes, put the cluster in a known state and make the cluster do the heavy lifting. That means: validate topology with cctrl, drain and shun nodes deliberately, perform rolling changes with serial limits, and rejoin/welcome via the cluster after service restarts. Build idempotence and explicit dependencies into your roles so you never “succeed” your way into split-brain, partial reinstalls, or dangling replication.
Best Practices: Tungsten Cluster
To perform operations on an entire cluster at once, MAINTENANCE mode is your friend - this prevents the Managers from doing anything at all, and also locks the Connector proxies to their current write Primary:
shell> cctrl
cctrl> ls
cctrl> set policy maintenance
cctrl> ls
cctrl> exit
There are clearly-documented procedures for safely performing maintenance and controlled shutdowns: place the cluster into MAINTENANCE mode, handle the Replica nodes one by one, and move the Primary with a controlled switch, as sketched below. Doing this blindly, via automated scripts that do not have the correct checks in place, will undoubtedly cause outages, because the operations happen without the cluster's knowledge.
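As an illustration only, using the hostnames from the example cluster above, one round of node-by-node maintenance could be sketched like this; consult the official maintenance procedures for the full sequence and its checks:

shell> cctrl
cctrl> set policy maintenance
cctrl> datasource ecommercep-db1 shun
(perform the OS/MySQL work on ecommercep-db1, restart MySQL, and wait for it to catch up)
cctrl> datasource ecommercep-db1 welcome
cctrl> datasource ecommercep-db1 online
(repeat for ecommercep-db3, then move the Primary role before touching db2)
cctrl> switch to ecommercep-db1
(ecommercep-db2 is now a Replica and can be maintained the same way)
cctrl> set policy automatic
cctrl> exit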
Best Practices: Ansible
Here’s the practical takeaway: treat every Ansible run as a controlled operation on a living system.
The `strategy: linear`, `serial: 1`, and `throttle: 1` settings are a must for a rolling MySQL restart. Here is a script excerpt that shows the concept:
- name: Rolling MySQL restart with hard safety rails
  hosts: db_nodes
  strategy: linear
  serial: 1                       # critical: one host at a time
  pre_tasks:
    - name: Fail if cluster not healthy (pre-flight)
      ansible.builtin.shell: "echo 'ls' | cctrl"
      register: cluster_ls
      changed_when: false
      failed_when: "'CRITICAL' in cluster_ls.stdout or 'SUSPECT' in cluster_ls.stdout or 'OFFLINE' in cluster_ls.stdout"
      delegate_to: "{{ groups['managers'][0] }}"
      run_once: true

    - name: Put cluster in MAINTENANCE
      ansible.builtin.shell: "echo 'set policy maintenance' | cctrl"
      changed_when: false
      delegate_to: "{{ groups['managers'][0] }}"
      run_once: true

  tasks:
    - name: Apply package/config changes
      ansible.builtin.package:
        name: mysql-server
        state: latest
      throttle: 1

    - name: Restart mysqld (must never be parallel)
      ansible.builtin.systemd:
        name: mysqld
        state: restarted
      throttle: 1                 # belt and suspenders, even if serial later changes

  post_tasks:
    - name: Return to AUTOMATIC policy (if maintenance was used)
      ansible.builtin.shell: "echo 'set policy automatic' | cctrl"
      changed_when: false
      delegate_to: "{{ groups['managers'][0] }}"
      run_once: true
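Before running a play like this against production, a dry run against a single node is a cheap safeguard; the playbook file name and host below are hypothetical:

shell> ansible-playbook rolling-mysql-restart.yml --limit ecommercep-db3 --check --diff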
Here is a quick checklist you can use before, during, and after the run:
- Confirm cluster health and roles in `cctrl`; fail the play if anything is degraded.
- Select targets with `--limit` and `serial: 1` for rolling changes; avoid parallel blasts.
- Quiesce a specific node via `cctrl> datasource HOSTNAME shun`, or set maintenance mode for the entire cluster as needed; do not rely on systemd alone (see the task sketch after this list).
- Apply changes.
- Post-checks: replication status, latency, and cluster view must match the intended state.
- When done, welcome back the specific node via `cctrl> datasource HOSTNAME welcome`, or set automatic mode for the entire cluster.
- Always have a tested rollback and a rejoin path ready.
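To tie the checklist back into Ansible, the shun and welcome steps can be wrapped around the restart as tasks delegated to a Manager host. This is a minimal sketch, not a complete role: it reuses the `managers` group from the excerpt above and assumes your inventory hostnames match the cluster's datasource names.

- name: Shun this node before touching MySQL
  ansible.builtin.shell: "echo 'datasource {{ inventory_hostname }} shun' | cctrl"
  changed_when: false
  delegate_to: "{{ groups['managers'][0] }}"

- name: Restart mysqld on the shunned node
  ansible.builtin.systemd:
    name: mysqld
    state: restarted

- name: Welcome the node back into the cluster
  ansible.builtin.shell: "echo 'datasource {{ inventory_hostname }} welcome' | cctrl"
  changed_when: false
  delegate_to: "{{ groups['managers'][0] }}"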
Summary
If there’s one lesson from this incident, it’s that automation amplifies intent — good and bad. Ansible can elegantly orchestrate Tungsten Clustering, but only when your playbooks respect the cluster’s control plane, state, and sequencing.
The outage our customer saw wasn’t mysterious; it was a predictable result of running host-level changes without coordinating with the cluster Managers, skipping health checks, and bypassing guardrails that prevent actions from rippling across nodes at the wrong time.
When you design playbooks around the cluster’s perspective and not just the OS, you turn risky node tweaks into routine, reversible operations.