Introduction
Recently a customer asked us about an outage they experienced after running modified Ansible scripts.
This blog post covers how to handle the database-node side of Ansible operations when a Tungsten cluster is in use.
Ansible is a deeply powerful admin tool, and as always, with great power comes great responsibility. Invoking an Ansible script can have a devastating impact in a production environment if changes are made incorrectly or without understanding what will happen.
Yet Ansible can safely co-exist with Tungsten Clusters as long as best practices are followed.
In this case, the customer was left with a cluster in a messy state as seen in the `cctrl` output displaying the status of all database nodes:
Tungsten Clustering 7.0.2 build 161
ecommerce_prod: session established, encryption=false, authentication=false
jgroups: unencrypted, database: unencrypted
[LOGICAL] /ecommerce_prod > ls
COORDINATOR
ROUTERS:
+---------------------------------------------------------------------------------+
|connector@ecommercep-connector1[1125](ONLINE, created=190354519, active=0) |
|connector@ecommercep-connector2[1219](ONLINE, created=190746390, active=0) |
|connector@ecommercep-connector3[1121](ONLINE, created=190386656, active=0) |
+---------------------------------------------------------------------------------+
DATASOURCES:
+---------------------------------------------------------------------------------+
|ecommercep-db1(slave:FAILED(DATASERVER 'ecommercep-db1@ecommerce_prod' STOPPED), progress=-1, |
|latency=-1.000) |
|STATUS [CRITICAL] [2025/10/27 01:11:24 PM UTC] |
|REASON[DATASERVER 'ecommercep-db1@ecommerce_prod' STOPPED] |
+---------------------------------------------------------------------------------+
| MANAGER(state=ONLINE) |
| REPLICATOR(role=slave, master=ecommercep-db2, state=SUSPECT) |
| DATASERVER(state=STOPPED) |
| CONNECTIONS(created=308462872, active=0) |
+---------------------------------------------------------------------------------+
+---------------------------------------------------------------------------------+
|ecommercep-db2(master:OFFLINE, progress=42221738159, THL latency=0.512) |
|STATUS [CRITICAL] [2025/10/27 01:11:34 PM UTC] |
|REASON[DATASERVER 'ecommercep-db2@ecommerce_prod' STOPPED, BUT THERE IS NO SLAVE TO PROMOTE] |
+---------------------------------------------------------------------------------+
| MANAGER(state=ONLINE) |
| REPLICATOR(role=master, state=DEGRADED) |
| DATASERVER(state=ONLINE) |
| CONNECTIONS(created=263024693, active=0) |
+---------------------------------------------------------------------------------+
+---------------------------------------------------------------------------------+
|ecommercep-db3(slave:FAILED:ARCHIVE (DATASERVER 'ecommercep-db3@ecommerce_prod' STOPPED), |
|progress=42221738159, latency=0.585) |
|STATUS [CRITICAL] [2025/10/27 01:11:38 PM UTC] |
|REASON[DATASERVER 'ecommercep-db3@ecommerce_prod' STOPPED] |
+---------------------------------------------------------------------------------+
| MANAGER(state=ONLINE) |
| REPLICATOR(role=slave, master=ecommercep-db2, state=ONLINE) |
| DATASERVER(state=STOPPED) |
| CONNECTIONS(created=0, active=0) |
+---------------------------------------------------------------------------------+
Status Analysis of cctrl Output
From the output above, MySQL appears to have been stopped on all three database hosts.
Timeline
MySQL was stopped on db1 at 1:11:24, which caused this node to go into a FAILED state. Since db1 is a Replica, no further action was needed.
MySQL was then stopped on db2, the Primary node, at 1:11:34. This would normally have triggered a failover, but because db1 was already in a FAILED state and db3 is configured as an ARCHIVE server (not a failover candidate), there was no host available to promote.
MySQL was then stopped on db3 at 1:11:38, putting that node into the FAILED state as well.
Tungsten Behavior
A Tungsten cluster is unable to automatically withstand the shutdown of all database processes on all nodes at once, so when there is a cascading multi-node failure, the outcome that you see above is expected.
When the nodes come back online, Tungsten does not automatically recover, because the Manager processes cannot know why the database servers were shut down. Managers in this condition deliberately stay offline until you or your team have performed the correct checks, to avoid the data corruption that would result from selecting the wrong node as the Primary.
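Those checks boil down to confirming, on every node, whether MySQL is actually running and which host holds the most recent data. A minimal sketch of such per-node checks (the `mysqld` service name is an assumption and may differ in your installation):

shell> systemctl status mysqld                                  # is the database process running?
shell> trepctl status | grep -E 'role|state|appliedLastSeqno'   # replicator role, state and applied sequence number
shell> echo 'ls' | cctrl                                        # the Manager's view of the whole cluster

Compare `appliedLastSeqno` across the nodes; the host with the highest value holds the most recent transactions and is normally the safest Primary candidate.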
Repairing the Damage
In this case, since db2 was stable, the following commands allow db2 to be chosen as the Primary and tell the Management layer to clean up the remaining Replicas once the Primary is back ONLINE.
shell> cctrl
cctrl> set force true
cctrl> datasource ecommercep-db2 welcome
cctrl> datasource ecommercep-db2 online
cctrl> recover
cctrl> exit
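Once the recover command completes, it is worth confirming that the cluster really is healthy again before handing it back to the application. A quick post-recovery check might look like this:

shell> cctrl
cctrl> ls
cctrl> cluster heartbeat
cctrl> exit

The `ls` output should show all three datasources ONLINE, and the heartbeat should replicate through to both Replicas.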
Best Practices
Before you touch packages, services, or configs on database nodes, put the cluster in a known state and make the cluster do the heavy lifting. That means: validate topology with cctrl, drain and shun nodes deliberately, perform rolling changes with serial limits, and rejoin/welcome via the cluster after service restarts. Build idempotence and explicit dependencies into your roles so you never “succeed” your way into split-brain, partial reinstalls, or dangling replication.
Best Practices: Tungsten Cluster
To perform operations on an entire cluster at once, MAINTENANCE mode is your friend - this prevents the Managers from doing anything at all, and also locks the Connector proxies to their current write Primary:
shell> cctrl
cctrl> ls
cctrl> set policy maintenance
cctrl> ls
cctrl> exit
There are clearly-documented procedures for safely performing maintenance and controlled shutdowns: place the cluster into MAINTENANCE mode, handle the Replica nodes one by one, and move the Primary with a controlled switch, as sketched below. Doing this blindly, via automated scripts that do not have the correct checks in place, will undoubtedly cause outages, because the operations happen without the cluster's knowledge.
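As an illustration only, using the hostnames from the example cluster above, one round of node-by-node maintenance could be sketched like this; consult the official maintenance procedures for the full sequence and its checks:

shell> cctrl
cctrl> set policy maintenance
cctrl> datasource ecommercep-db1 shun
(perform the OS/MySQL work on ecommercep-db1, restart MySQL, and wait for it to catch up)
cctrl> datasource ecommercep-db1 welcome
cctrl> datasource ecommercep-db1 online
(repeat for ecommercep-db3, then move the Primary role before touching db2)
cctrl> switch to ecommercep-db1
(ecommercep-db2 is now a Replica and can be maintained the same way)
cctrl> set policy automatic
cctrl> exit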
Best Practices: Ansible
Here’s the practical takeaway: treat every Ansible run as a controlled operation on a living system.
The `strategy: linear`, `serial: 1`, and `throttle: 1` settings are a must for a rolling MySQL restart. Here is a script excerpt that shows the concept:
- name: Rolling MySQL restart with hard safety rails
  hosts: db_nodes
  strategy: linear
  serial: 1                       # critical: one host at a time
  pre_tasks:
    - name: Fail if cluster not healthy (pre-flight)
      ansible.builtin.shell: "echo 'ls' | cctrl"
      register: cluster_ls
      changed_when: false
      failed_when: "'CRITICAL' in cluster_ls.stdout or 'SUSPECT' in cluster_ls.stdout or 'OFFLINE' in cluster_ls.stdout"
      delegate_to: "{{ groups['managers'][0] }}"
      run_once: true

    - name: Put cluster in MAINTENANCE
      ansible.builtin.shell: "echo 'set policy maintenance' | cctrl"
      changed_when: false
      delegate_to: "{{ groups['managers'][0] }}"
      run_once: true

  tasks:
    - name: Apply package/config changes
      ansible.builtin.package:
        name: mysql-server
        state: latest
      throttle: 1

    - name: Restart mysqld (must never be parallel)
      ansible.builtin.systemd:
        name: mysqld
        state: restarted
      throttle: 1                 # belt and suspenders, even if serial later changes

  post_tasks:
    - name: Return to AUTOMATIC policy (if maintenance was used)
      ansible.builtin.shell: "echo 'set policy automatic' | cctrl"
      changed_when: false
      delegate_to: "{{ groups['managers'][0] }}"
      run_once: true
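Before running a play like this against production, a dry run against a single node is a cheap safeguard; the playbook file name and host below are hypothetical:

shell> ansible-playbook rolling-mysql-restart.yml --limit ecommercep-db3 --check --diff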
Here is a quick checklist you can use before, during, and after the run:
- Confirm cluster health and roles in `cctrl`; fail the play if anything is degraded.
- Select targets with `--limit` and `serial: 1` for rolling changes; avoid parallel blasts.
- Quiesce a specific node via `cctrl> datasource HOSTNAME shun`, or set maintenance mode for the entire cluster as needed; do not rely on systemd alone (see the task sketch after this list).
- Apply changes.
- Post-checks: replication status, latency, and cluster view must match the intended state.
- When done, welcome back the specific node via `cctrl> datasource HOSTNAME welcome`, or set automatic mode for the entire cluster.
- Always have a tested rollback and a rejoin path ready.
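To tie the checklist back into Ansible, the shun and welcome steps can be wrapped around the restart as tasks delegated to a Manager host. This is a minimal sketch, not a complete role: it reuses the `managers` group from the excerpt above and assumes your inventory hostnames match the cluster's datasource names.

- name: Shun this node before touching MySQL
  ansible.builtin.shell: "echo 'datasource {{ inventory_hostname }} shun' | cctrl"
  changed_when: false
  delegate_to: "{{ groups['managers'][0] }}"

- name: Restart mysqld on the shunned node
  ansible.builtin.systemd:
    name: mysqld
    state: restarted

- name: Welcome the node back into the cluster
  ansible.builtin.shell: "echo 'datasource {{ inventory_hostname }} welcome' | cctrl"
  changed_when: false
  delegate_to: "{{ groups['managers'][0] }}"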
Summary
If there’s one lesson from this incident, it’s that automation amplifies intent — good and bad. Ansible can elegantly orchestrate Tungsten Clustering, but only when your playbooks respect the cluster’s control plane, state, and sequencing.
The outage our customer saw wasn’t mysterious; it was a predictable result of running host-level changes without coordinating with the cluster Managers, skipping health checks, and bypassing guardrails that prevent actions from rippling across nodes at the wrong time.
When you design playbooks around the cluster’s perspective and not just the OS, you turn risky node tweaks into routine, reversible operations.