Replicating into Elasticsearch Webinar Followup

We had a great webinar on Wednesday looking at how we can use Tungsten Replicator for moving data into Elasticsearch, whether that's for analytics, searching, or reporting.

You can ahead and watch the video recording of that session here

We had one question on that session, which I wanted to answer in full:

Can UPDATE and DELETE be converted to INSERT?

One of the interesting issues with replicating databases between databases is that we don't always want the information in the same format when it goes over another side. Typically when replicating within a homogeneous environment the reason we are using replication is that we want an identical copy over on the other side of the process. In heterogeneous, especially when we move the data out to an analytics environment like Elasticsearch, we might not. That covers a whole range of aspects, from the datatypes and other columns.

When it comes to the point in this question, what we've really got is an auditing rather than strict data replicating style scenario, and it sets up a different set of problems.

The simple answer is that currently, no, we don't.

The longer answer is that we want to do this correctly and it needs some careful considerations depending on your target environment.

Usually, we want to update the target database with a like-for-like operation for the target database, and we do that by making use of a primary key or other identifiers to update the right information. There can only be one primary key or identifier and if we are tracking modifications rather than the current record state, we need to not use the primary ID. But that also means we don't normally track what operation created the entry we just modify the record accordingly which would additionally require another field. That means for the above request to make sense within the typical replication deployment, we need to do two things:

  1. Stop using a primary key as the record identifier
  2. Record the operation information (and changes) in the target. For most situations, this means you want to know the old and new data, for example, in an update operation.

Fortunately, within Elasticsearch we can do this a bit more easily, first by using the auto-generated ID, and second by using the document structure within Elasticsearch to model the old/new information, or add a new field to say what operation that is.

We haven't that within Elasticsearch yet but have learnt some good practices for handling that type of change within our Kafka applier. Those changes will be in a release shortly and we'll be adding them into Elasticsearch soon after.

Have many more questions about Elasticsearch or Tungsten Replicator? Feel free to go ahead and ask!

About the Author

MC Brown
Former VP of Products

MC Brown is a professional writer and technologist for over 25 years, an author and contributor to over 26 books covering a wide array of topics, and technical and architectural advisor on databases, cloud and grid computing, and operating system development. During his time he has worked for and with Sun, MySQL, Oracle, Couchbase, VMware, Microsoft and IBM, and written for O'Reilly, IBM, Computerworld, IBM developerWorks and many others.

Add new comment