Troubleshooting Failed DDL Across All Replicas In a Tungsten Cluster

Introduction:
The Situation

Recently we had a customer support case where all of the Replicas were SUSPECT in cctrl:

[LOGICAL] /east > ls
...
DATASOURCES: 
+----------------------------------------------------------------------------+ 
|db1(slave:ONLINE, progress=-1, latency=-1.000) | 
|STATUS [OK] [2021/07/16 11:32:15 PM UTC] | 
+----------------------------------------------------------------------------+ 
| MANAGER(state=ONLINE) | 
| REPLICATOR(role=slave, master=db2, state=SUSPECT) | 
| DATASERVER(state=ONLINE) | 
+----------------------------------------------------------------------------+
+----------------------------------------------------------------------------+ 
|db2(master:ONLINE, progress=15010618, THL latency=0.942) | 
|STATUS [OK] [2021/05/29 09:20:27 PM UTC] | 
+----------------------------------------------------------------------------+ 
| MANAGER(state=ONLINE) | 
| REPLICATOR(role=master, state=ONLINE) | 
| DATASERVER(state=ONLINE) | 
+----------------------------------------------------------------------------+
+----------------------------------------------------------------------------+ 
|db3(slave:ONLINE, progress=-1, latency=-1.000) | 
|STATUS [OK] [2021/06/25 03:22:28 PM UTC] | 
+----------------------------------------------------------------------------+ 
| MANAGER(state=ONLINE) | 
| REPLICATOR(role=slave, master=db2, state=SUSPECT) | 
| DATASERVER(state=ONLINE) | 
+----------------------------------------------------------------------------

Further investigation into `trepctl status` and the `trepsvc.log` file showed that an ALTER TABLE query was failing on ALL Replicas because of a MySQL error: Incorrect date value: '0000-00-00, which lead to Replicator error “Statement failed on slave but succeeded on master

`trepctl status` extract:

pendingError           : Event application failed: seqno=15005005 fragno=0 message=java.sql.SQLException: Statement failed on slave but succeeded on master
pendingErrorCode       : NONE
pendingErrorEventId    : mysql-bin.000106:0000000022670495;11584008
pendingErrorSeqno      : 15005005
pendingExceptionMessage: java.sql.SQLException: Statement failed on slave but succeeded on master
                         alter table channel add column jira char(32)

`trepcvs.log` extract:

2021/07/20 01:44:28.716 | Jul 20, 2021 1:44:28 AM org.drizzle.jdbc.internal.mysql.MySQLProtocol executeQuery
2021/07/20 01:44:28.716 | WARNING: Could not execute query org.drizzle.jdbc.internal.common.query.DrizzleQuery@163d79cb: Incorrect date value: '0000-00-00' for column 'ch_last_up' at row 2193
2021/07/20 01:44:28.716 |     db3 | [east - q-to-dbms-0] ERROR pipeline.SingleThreadStageTask Event application failed: seqno=15005005 fragno=0 message=java.sql.SQLException: Statement failed on slave but succeeded on master

We tested by hand and saw the same result:

mysql> alter table custdb.channel add column jira char(32); 
ERROR 1292 (22007): Incorrect date value: '0000-00-00' for column 'ch_last_up' at row 2193

Troubleshooting Steps

Run the following on both the Primary and one of the Replicas to compare the golden data on the Primary to the potentially bad data on the Replicas:

select count(*) from custdb.channel;
select count(*) from custdb.channel where ch_last_up = '0000-00-00';

RESULTS for the Primary

mysql> select count(*) from custdb.channel; 
+----------+ 
| count(*) | 
+----------+ 
| 129896 | 
+----------+ 
1 row in set (0.00 sec)
mysql> select count(*) from custdb.channel where ch_last_up = '0000-00-00'; 
+----------+ 
| count(*) | 
+----------+ 
| 0 | 
+----------+ 
1 row in set, 1 warning (0.18 sec)

RESULTS for the Replica

mysql> select count(*) from custdb.channel; 
+----------+ 
| count(*) | 
+----------+ 
| 129896 | 
+----------+ 
1 row in set (0.00 sec)
mysql> select count(*) from custdb.channel where ch_last_up = '0000-00-00'; 
+----------+ 
| count(*) | 
+----------+ 
| 2 | 
+----------+ 
1 row in set, 1 warning (0.18 sec)

Note that on the Replica there are 2 rows that have the “bad” data in them!

The key to solving the issue is to clean up those two rows so that the ALTER TABLE will succeed.

Root Cause

The issue is that the ALTER TABLE failed due to bad data in the date/time fields.

The customer used to be running MySQL 5.5 where a value of 0 and 0000-00-00 was allowed, by default, in date fields. After a recent upgrade to version 5.7, the sql_mode of MySQL changed by default to include STRICT_TRANS_TABLE, this in turn disallows 0 and 0000-00-00 as values.

The customer’s application has been fixed, and as a safeguard the customer implemented the zerodate2null filter that would convert such values to null if it came across them. In this instance there were old rows of data in the databases that still contained these values, meaning that the ALTER TABLE failed it’s data validation.

Solution Steps

Change the bad fields to null values, which will allow the ALTER TABLE to proceed.

First Attempt - Failed!

The customer needed to run this manually on EVERY replica node:

set session sql_log_bin = 0; 
update custdb.channel set ch_last_up = NULL where ch_last_up = '0000-00-00';
alter table custdb.channel add column jira char(32);

When attempting to run that, an error was encountered with the update statement:

ERROR 1292 (22007): Incorrect date value: '0000-00-00' for column 'ch_last_up' at row 1 

Second Attempt - Success!

So we gathered the current SQL_MODE:

mysql> show variables like 'sql_mode';
| sql_mode | ONLY_FULL_GROUP_BY,STRICT_TRANS_TABLES,NO_ZERO_IN_DATE,NO_ZERO_DATE,ERROR_FOR_DIVISION_BY_ZERO,NO_AUTO_CREATE_USER,NO_ENGINE_SUBSTITUTION | 

And tried again, this time with an altered SQL_MODE - removing ONLY_FULL_GROUP_BY and STRICT_TRANS_TABLES:

set session sql_mode = 'NO_ZERO_IN_DATE,NO_ZERO_DATE,ERROR_FOR_DIVISION_BY_ZERO,NO_AUTO_CREATE_USER,NO_ENGINE_SUBSTITUTION';
set session sql_log_bin = 0; 
update custdb.channel set ch_last_up = NULL where ch_last_up = '0000-00-00';
alter table custdb.channel add column jira char(32);
exit;

Which worked!

mysql> set session sql_mode = 'NO_ZERO_IN_DATE,NO_ZERO_DATE,ERROR_FOR_DIVISION_BY_ZERO,NO_AUTO_CREATE_USER,NO_ENGINE_SUBSTITUTION'; 
Query OK, 0 rows affected, 1 warning (0.00 sec)

mysql> set session sql_log_bin = 0; 
Query OK, 0 rows affected (0.00 sec)

mysql> update custdb.channel set ch_last_up = NULL where ch_last_up = '0000-00-00'; 
Query OK, 2 rows affected, 1 warning (1.93 sec) 
Rows matched: 2 Changed: 2 Warnings: 1

mysql> alter table custdb.channel add column jira char(32); 
Query OK, 129896 rows affected (4.19 sec) 
Records: 129896 Duplicates: 0 Warnings: 0

Once that had executed on all Replicas, we made sure ALL clusters wee in maintenance mode, then one-by-one executed the following on Replicas ONLY (NOT ON THE PRIMARY) (This was to ensure the replicator saw, and cached, the updated tables columns):

replicator restart

Once all replicators were restarted, all clusters were placed back into automatic mode.

The final step was to return all Replicators to a healthy state on every Replica node.

To do that, we needed to skip the failed ALTER query, since that had been handled manually on each Replica node.

To be sure we skipped the correct event, we check the `trepctl status` output from above, for the two key bits of information - the pendingErrorSeqno and the pendingExceptionMessage:

pendingErrorSeqno      : 15005005
pendingExceptionMessage: java.sql.SQLException: Statement failed on slave but succeeded on master
                         alter table channel add column jira char(32)

Since the failing event shown was indeed the ALTER TABLE, we can use the displayed pendingErrorSeqno to come safely back online:

ttrepctl online -skip-seqno 15005005 res

When this last step was completed on all Replicas, the cluster was healthy once again!

Wrap-Up

Data drift occurred in this case because the tighter data value restrictions in newer MySQL versions invalidated data which used to be ok (in this case a date column had an invalid value of ‘0000-00-00’), which then blocked a simple ALTER TABLE.

The issue was resolved by setting a more permissive SQL_MODE, then locating and cleaning the bad data with proper NULLs before re-issuing the ALTER TABLE.

About the Author

Eric M. Stone
COO

Eric is a veteran of fast-paced, large-scale enterprise environments with 35 years of Information Technology experience. With a focus on HA/DR, from building data centers and trading floors to world-wide deployments, Eric has architected, coded, deployed and administered systems for a wide variety of disparate customers, from Fortune 500 financial institutions to SMB’s.

Add new comment