Understanding THL, Events and Storage: Part 1

When Tungsten Replicator extracts data, it writes the extracted information into the Tungsten History Log, or THL. These files use a specific format and store all of the extracted information in a way that can easily be used to recreate and generate data in a target.

Each transaction from the source is written into the THL as an event, so within a single THL file there will be one or more events stored. For each event, we record information about the overall transaction, as well as the content of the transaction itself. That event can contain one or more statements, or rows, or both. Because we don't want an ever-increasing single file, the replicator also divides the THL into multiple files to make the data easier to manage.

We'll get into the details soon. Until then, let's start with the basics of the THL: files, sequence numbers, and how to select events.

The simplest way to look at the files is to start with the thl command, which provides a simple interface for looking at the THL data and understanding its contents.

Let's start by getting a list of the THL files and the events they contain, which we can do with the index command:

$ thl index
LogIndexEntry thl.data.0000000001(0:295)
LogIndexEntry thl.data.0000000002(296:591)

This shows us that there are two files: the first covers sequence numbers 0 to 295 and the second covers 296 to 591.
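
If you want to work with this output in a script, here is a minimal sketch that turns each LogIndexEntry line into a file name, a first and last sequence number, and a count. It assumes the two numbers in parentheses are an inclusive first:last range, and the function name thl_index_summary is made up for this example. The sample lines are copied from the output above so the snippet runs without a replicator installed.

```shell
#!/bin/sh
# Summarise `thl index` output read from stdin.
# Assumption: "(first:last)" is an inclusive sequence-number range.
thl_index_summary() {
    # Split on spaces, parentheses and colons, so that:
    # $1=LogIndexEntry  $2=file name  $3=first seqno  $4=last seqno
    awk -F'[ ():]+' '/^LogIndexEntry/ {
        printf "%s first=%d last=%d count=%d\n", $2, $3, $4, $4 - $3 + 1
    }'
}

# Sample lines copied from the output above. On a live system you
# would pipe `thl index` into the function instead.
sample='LogIndexEntry thl.data.0000000001(0:295)
LogIndexEntry thl.data.0000000002(296:591)'

printf '%s\n' "$sample" | thl_index_summary
```

Under the inclusive-range assumption, each of the two files works out to 296 sequence numbers.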

We can look inside using the list command to thl. This supports a number of different selection mechanisms. First, you can select a single item:

$ thl list -seqno 1
SEQ# = 1 / FRAG# = 0 (last frag)
 - TIME = 2018-12-06 12:44:10.0
 - EPOCH# = 0
 - EVENTID = mysql-bin.000108:0000000000001574;-1
 - SOURCEID = ubuntu
 - METADATA = [mysql_server_id=1;dbms_type=mysql;tz_aware=true;strings=utf8;service=alpha;shard=msg;tungsten_filter_columnname=true;tungsten_filter_primarykey=true;tungsten_filter_enumtostring=true]
 - TYPE = com.continuent.tungsten.replicator.event.ReplDBMSEvent
 - OPTIONS = [foreign_key_checks = 1, unique_checks = 1, time_zone = '+00:00']
 - SQL(0) =
 - ACTION = INSERT
 - SCHEMA = msg
 - TABLE = msg

We can also specify a range, using -low or -from for the start and -high or -to for the end. If you specify only one of these, thl assumes you mean from the start or to the end of the log. For example:

$ thl list -to 100

Will list all events from the start up to and including event 100. While:

$ thl list -low 45

Lists all events from 45 until the end.

You can also be explicit:

$ thl list -from 45 -to 60

Will list events from 45 to 60 inclusive.
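
When you are chasing a specific event, it is often handy to see a few events either side of it. The sketch below builds the thl list command for such a window. The helper name thl_window_cmd and the default radius of 5 are hypothetical; it only uses the -from and -to options shown above, and it prints the command rather than running it so you can check it first.

```shell
#!/bin/sh
# Print the thl command that lists a window of events around SEQNO.
# Hypothetical helper: only the -from and -to options shown above are used.
thl_window_cmd() {
    seqno=$1
    radius=${2:-5}                 # events either side; default is arbitrary
    from=$((seqno - radius))
    # The index shown earlier starts at sequence number 0, so clamp there.
    if [ "$from" -lt 0 ]; then
        from=0
    fi
    to=$((seqno + radius))
    echo "thl list -from $from -to $to"
}

thl_window_cmd 50      # window of 5 either side of event 50
thl_window_cmd 2 10    # lower bound is clamped to 0
```

To run the generated command rather than just print it, you could wrap the call in eval, for example eval "$(thl_window_cmd 50)".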

Finally, you can use two special options, -first and -last, which will show you the first (Surprise!) and last events. You can also supply an optional number. So:

$ thl list -first

Shows the first event, while:

$ thl list -first 10

Shows the first 10 events.

The -last option can be really useful when diagnosing an issue, because it means we can look at the most recent events captured by the replicator without first having to find the event IDs and work it out.

Up to now I've focused on the events, but there's another critical element to consider when thinking about replication of data: the THL data and the files themselves.

This is important to consider because of the file and THL load generated - if you have a busy system then you will generate a lot of events and that, in turn, will generate a lot of THL that needs to be stored. On a very busy system, it's possible to fill up a disk very quickly with THL.

You can see the files by looking in the directory where they are stored within your installation; by default this is the thl/SERVICE directory. For example:

$ ll /opt/continuent/thl/alpha/
total 896
drwxrwxr-x 2 mc mc   4096 Dec  6 12:44 ./
drwxrwxr-x 3 mc mc   4096 Dec  6 12:43 ../
-rw-rw-r-- 1 mc mc      0 Dec  6 12:43 disklog.lck
-rw-rw-r-- 1 mc mc 669063 Dec  6 12:44 thl.data.0000000001
-rw-rw-r-- 1 mc mc 236390 Dec  6 12:45 thl.data.0000000002

You can see that these files correspond to the output of thl index (fortunately), and you can also see that the two files are different in size. THL is automatically managed by the replicator - it creates new files on a configurable boundary, but also automatically clears those files away.
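
If you want to keep an eye on how much space the THL is using, a minimal sketch like the following adds up the sizes of the thl.data.* files in a directory. The function name thl_dir_bytes and the THL_DIR variable are made up for this example, and the default path is simply the one from the listing above.

```shell
#!/bin/sh
# Total size in bytes of the thl.data.* files in a directory.
# Hypothetical helper; it only reads file sizes and never touches the THL.
thl_dir_bytes() {
    dir=$1
    total=0
    for f in "$dir"/thl.data.*; do
        [ -f "$f" ] || continue                 # glob matched nothing
        size=$(wc -c < "$f" | tr -d ' ')        # strip padding some wc versions add
        total=$((total + size))
    done
    echo "$total"
}

# THL_DIR is an assumed variable name; the default is the path shown above.
thl_dir_bytes "${THL_DIR:-/opt/continuent/thl/alpha}"
```

For the two files in the listing above, that would come to 669063 + 236390 = 905453 bytes. If the directory does not exist or holds no THL files, the function prints 0.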

We'll cover that in a future session, but for the moment I'm going to admit that I skipped something important earlier. These two THL files actually contain two copies of the exact same set of data. The difference is that the second file uses a new feature that will come in a forthcoming release which is compression of the THL.

In this case, we are talking about 295 row-based events, but the second THL file is only a little over a third of the size of the first. That's quite a saving for something that normally uses up a lot of space.

If we look at the true output of thl index, we can see that the file is indeed compressed:

$ thl index
LogIndexEntry thl.data.0000000001(0:295)
LogIndexEntry thl.data.0000000002(296:591) - COMPRESSED

For installations where you are paying for your disk storage in a cloud environment, reducing your THL overhead by more than half is fairly significant. We've done a lot of testing and consistently got compression ratios of 2:1 or higher.
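
You can check the ratio for the two files in the directory listing earlier with a little arithmetic. This is only a sketch: the function name thl_ratio is made up, and it simply divides the two byte counts using awk for the floating-point maths.

```shell
#!/bin/sh
# Compression ratio and percentage saving from two file sizes in bytes.
# Usage: thl_ratio UNCOMPRESSED_BYTES COMPRESSED_BYTES
thl_ratio() {
    awk -v raw="$1" -v packed="$2" 'BEGIN {
        printf "ratio=%.2f:1 saving=%.1f%%\n", raw / packed, (1 - packed / raw) * 100
    }'
}

# Sizes of thl.data.0000000001 and thl.data.0000000002 from the listing.
thl_ratio 669063 236390    # prints: ratio=2.83:1 saving=64.7%
```

So for this sample the compressed file comes out at a ratio of about 2.8:1, comfortably above the 2:1 mark.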

We've also implemented encryption at this level, which means you get on-disk encryption without the need for an encrypted filesystem. Combined with compression, you get secure and efficient storage of the THL, which should lower your storage and I/O overhead as well as your storage costs.

About the Author

MC Brown
Former VP of Products

MC Brown has been a professional writer and technologist for over 25 years. He is an author and contributor to over 26 books covering a wide array of topics, and a technical and architectural advisor on databases, cloud and grid computing, and operating system development. During his time he has worked for and with Sun, MySQL, Oracle, Couchbase, VMware, Microsoft and IBM, and written for O'Reilly, IBM, Computerworld, IBM developerWorks and many others.
