Recently, a customer asked us:
After importing a new section of user data into our Tungsten cluster, we are seeing perpetually rising replication lag. We are sitting at an estimated convergence time of 8.5 hours after importing around 50 million rows, and the lag is climbing continuously. We are currently migrating some of our users from a NoSQL database into our Tungsten cluster. We have a procedure that writes out a batch of CSV files after translating our old data into columns, and then we recursively send them to the write master using the mysql client. Specifically, our import SQL uses LOAD DATA LOCAL INFILE to read in each large CSV file. We have 20k records per CSV file, and 12 workers insert them in parallel.
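As a sketch, each worker's import statement might look something like the following (the file path, table, and column names here are hypothetical examples, not the customer's actual schema):

```sql
-- Hypothetical import statement; path, table, and columns are examples only.
LOAD DATA LOCAL INFILE '/data/import/users_00042.csv'
INTO TABLE users
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\n'
IGNORE 1 LINES
(user_id, email, created_at);
```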
In cases like this, the slaves struggle because the target database is unable to keep up with the apply stage of replication.
A Deep Dive
Each CSV file will equate to a transaction, which means one sequence number will contain 20,000 rows.
Naturally, the load is quick on the master, because
LOAD DATA INFILE is a fast and efficient process, especially with 12 parallel threads running.
However, the replicators are not applying these loads in parallel, so downstream it is the same as a single load of 20,000 rows at a time.
Reducing the number of records per CSV and the number of parallel workers may reduce the latency but with 50 million records there is going to be some significant overhead to process this downstream to the slaves.
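One low-risk way to experiment with smaller batches is to split the existing CSV files before loading them. Here is a minimal sketch using the standard `split` utility (file names are examples; the sample data is generated just for illustration):

```shell
# Create a sample 20,000-row CSV (a stand-in for one real import file).
seq 1 20000 | sed 's/^/user_/' > users_sample.csv

# Split it into 5,000-row chunks; each chunk then loads as its own,
# smaller transaction on the cluster.
split -l 5000 users_sample.csv users_sample_part_

# Four chunk files are produced: users_sample_part_aa .. users_sample_part_ad
ls users_sample_part_*
```

Each resulting chunk becomes one transaction (one sequence number) downstream, so the slaves apply smaller units of work at a time.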
LOAD DATA is treated as unsafe, and when
binlog_format=MIXED the statement is logged in ROW-based format.
So both ROW and MIXED end up as ROW, which means the slower method is used.
To reduce this overhead for
LOAD DATA INFILE operations, set the
binlog_format to STATEMENT for that specific
LOAD DATA session only - this will generate less THL traffic to be processed.
It is critical that this setting is applied only at session level for the
LOAD DATA process, and not anywhere else.
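Applied at session level, that might look like the following sketch (the file path and table are examples; adjust the restored format to whatever your server normally runs):

```sql
-- Affects this import session only; other connections keep their format.
SET SESSION binlog_format = 'STATEMENT';

LOAD DATA LOCAL INFILE '/data/import/users_00042.csv'
INTO TABLE users
FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n';

-- Restore the server's usual format for the rest of the session.
SET SESSION binlog_format = 'MIXED';
```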
Select a slave, and add the following property to your ini file:
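Based on the setting discussed below, the entry would be along these lines (please verify the exact key against the documentation for your Tungsten version):

```ini
optimize-row-events=false
```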
As it stands, your 20,000-record CSV file is most likely being converted into a single INSERT covering all 20,000 rows. In some cases the
optimize-row-events setting being
true is good, as it can batch multiple inserts into a single statement for quicker processing; however, it could be that 20,000 records per insert is causing too much stress on the target database, hence it is slower to apply.
Turning this optimize parameter off will create a single INSERT per row, which may not sound quicker, but it means the target database is processing one insert at a time rather than 20,000.
Also, does the target table have indexes? Huge data loads into a heavily indexed table will increase the time the database takes to process these rows.
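To see what the applier must maintain on every insert, it can help to inspect the target table's indexes. A sketch, with a hypothetical table and index name:

```sql
-- List the indexes the database must update for each inserted row.
SHOW INDEX FROM users;

-- For a one-off bulk load, temporarily dropping a secondary index and
-- re-adding it after the import can speed things up considerably:
ALTER TABLE users DROP INDEX idx_users_email;
-- ... run the import ...
ALTER TABLE users ADD INDEX idx_users_email (email);
```

Rebuilding the index once at the end is generally cheaper than maintaining it row by row across 50 million inserts.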
In this blog post we discussed importing CSV data into a Tungsten Cluster.
To learn about Continuent solutions in general, check out https://www.continuent.com/solutions
Please read the docs!
For more information about monitoring Tungsten clusters, please visit https://docs.continuent.com.
Tungsten Clustering is the most flexible, performant global database layer available today - use it underlying your SaaS offering as a strong base upon which to grow your worldwide business!
For more information, please visit https://www.continuent.com/solutions
Want to learn more or run a POC? Contact us.