Continuent Blog: Troubleshooting Data Differences in a MySQL Database Cluster

Blog

Overview

The Skinny

From time to time we are asked how to check whether or not there are data discrepancies between Master/Slave nodes within a MySQL (or MariaDB) cluster that's managed with Tungsten Clustering. This is always a challenging task, not least because we hope and believe that our replication mechanism would avoid such occurrences, that said there can be factors outside of our control that can appear to "corrupt" data - such as inadvertent execution of DML against a slave using a root level user account.

Tungsten Replicator, the core replication component in our Tungsten Clustering solution for MySQL (& MariaDB), is just that, a replicator - it takes transactions from the binary logs and replicates them around. The replicator isn't a data synchronisation tool in that respect, the replicator won't/can't compare tables - this is by design and is one of the many benefits in the product that avoids a) us being tightly coupled to the database and b) avoids the inherent performance impact of what could be incredibly resource consuming processes, add to that the complications of how to confidently complete such checks in extremely active environments.

Agenda

What's Here?

The following steps walk through the recommended methods for troubleshooting based on a 3-node cluster using MySQL 5.7 community edition, managed by Tungsten Clustering 6.0

What's Not Here?

There are a number of tools that can help identify and fix data drift, and even structural differences, however a lot of them assume native MySQL replication is in place and therefore just the default usage of such products within a Tungsten Clustering environment can cause further issues, and may even fail completely.

In this blog post, I am not going to cover the best practices to avoid data drift nor the rights and wrongs of it, what I will cover is how to utilise existing third-party tools designed for doing such tasks.

Identify Structural Differences

Simple, Yet Effective

If you suspect that there are differences to a table structure, a simple method to resolve this will be to compare schema DDL, mysqldump offers an easy and fast way to extract DDL without row data, then using simple OS commands we can identify such differences.

Extract DDL on the Master node, specifying the schema in place of :

mysqldump -u root -p --no-data -h localhost --databases  >master.sql

Repeat the same on the Slave node(s):

mysqldump -u root -p --no-data -h localhost --databases  >slave.sql

Now, using diff, you can compare the results:
```
diff master.sql slave.sql
```

Using the output of diff, you can then craft the necessary DDL statements to re-align your structure

Identify Data Differences

The Real Challenge

The first challenge when looking at data differences is that in busy environments, and especially if you are running a Multi-Site Multi-Master (pre v6) or Composite Multi-Master (v6) Topology, then you may well be presented with false positives due to the constant changing environment.

It is possible to use pt-table-checksum from the Percona Toolkit to identify data differences, providing you use the syntax described below for bypassing the native replication checks.

First of all, it is advisable to familiarise yourself with the product by reading through the providers own documentation here:
https://www.percona.com/doc/percona-toolkit/

Once you are ready, ensure you install the latest version of the persona toolkit on all nodes, or at least ensure the version you install is compatible with your release of MySQL.

Next, execute the following on the Master node:

shell> pt-table-checksum --set-vars innodb_lock_wait_timeout=500 --recursion-method=none --ignore-databases=mysql --ignore-databases-regex=tungsten* h=localhost,u=tungsten,p=secret

It is important to include the ignore-database options - we do not want to compare the mysql schema, nor do we want to compare any tungsten tracking schemas.

You can add additional schemas to these options if necessary within your environment.

On first run, this will create a database called percona, and within that database a table called checksums. The process will gather checksum information on every table in every database excluding any listed using the ignore options mentioned previously. The tables and the processes will replicate through Tungsten Replicator and therefore you can now query these tables on the slave nodes, the following is an example SELECT that you can use:

SELECT db, tbl, SUM(this_cnt) AS total_rows, COUNT(*) AS chunks
FROM percona.checksums
WHERE (
 master_cnt  this_cnt
 OR master_crc  this_crc
 OR ISNULL(master_crc)  ISNULL(this_crc))
GROUP BY db, tbl;

This SELECT will return any tables that it detects are different, it won't show you the differences, or indeed how many, this is just a basic check.

To identify and fix the changes, you could use then use pt-table-sync (Also within the Percona Toolkit), however this product would by default assume native replication and also try and fix the problems for you. This assumption is unavoidable, therefore within a Tungsten Clustering environment we need to supply the --print switch. This won't execute the statement but will instead, display them on the screen (Or redirect to a file) and then from here you can gather the SQL needed to be executed to fix the mistakes and process this manually.

The output should be reviewed carefully to determine whether you want to manually patch the data, if there are significant differences, then you may need to consider using tungsten_provision_slave to reprovision a node instead.
To use pt-table-sync, first identify the tables with differences on each slave, in this example, the SELECT statement above identified that there was a data difference on the departments table within the employees database on db2. Execute the pt-table-sync script on the master, passing in the database name, table name and the slave host that the difference exists on:

shell> pt-table-sync --databases employees --tables departments --print h=db1,u=tungsten,p=secret,P=13306 h=db2

The first h= option should be the Master (also the node you run the script from) the second h= option relates to the slave that the difference exist on.

Executing the script will output SQL statements that can be used to patch the data, for example the above statement produces the following output:

UPDATE `employees`.`departments` SET `dept_name`='Sales' WHERE `dept_no`='d007' LIMIT 1 /*percona-toolkit src_db:employees src_tbl:departments src_dsn:P=13306,h=db1,p=...,u=tungsten dst_db:employees dst_tbl:departments dst_dsn:P=13306,h=db2,p=...,u=tungsten lock:0 transaction:1 changing_src:0 replicate:0 bidirectional:0 pid:24524 user:tungsten host:db1*/;

The UPDATE staments could now be issued directly on the slave to correct the problem.

Warning

Remember, at the start I mentioned one way data drift can happen is due to the inadvertent execution of DML on a slave, which is highly unrecommended, however in the following examples I contradict myself and suggest the only way to fix the data is to actually do just that. Care should be taken, and ALWAYS ensure you have a FULL backup, it would be recommended to place the cluster into MAINTENANCE mode and shun the slave node before making any changes so as not to cause any potential interruption to connected clients!

Summary

The Wrap-Up

In this blog post we discussed how to check whether or not there are data discrepancies between Master/Slave nodes within a cluster.

To learn about Continuent solutions in general, check out https://www.continuent.com/solutions

The Library

Please read the docs!

For more information about troubleshooting data differences in Tungsten clusters, please visit http://docs.continuent.com/tungsten-clustering-6.0/troubleshooting-data.html.

Tungsten Clustering is the most flexible, performant global database layer available today - use it underlying your SaaS offering as a strong base upon which to grow your worldwide business!

For more information, please visit https://www.continuent.com/solutions

Want to learn more or run a POC? Contact us.

Published In

Categories:

Advanced Replication

Series:

Tungsten University

Tags:

Architecture, geo-distributed, Multisite, MySQL, Percona, pt-table-checksum, pt-table-sync, replication, Replicator

Author

Chris Parker

VP of Customer Success, EMEA

Chris is based in the UK, and has over 20 years of experience working as a database administrator. Prior to joining Continuent, Chris managed large-scale Oracle and MySQL deployments at Warner Bros., BBC, and prior to joining the Continuent Team, he worked at the online fashion company, Net-A-Porter.

View All Chris’s Posts

Troubleshooting Data Differences in a MySQL Database Cluster