Database Disaster Recovery (DR) for the Enterprise
You know that a robust Disaster Recovery (DR) strategy is critical for your business: unplanned (or even planned) downtime can have a huge financial impact on the enterprise while also eroding customer confidence in your products and services. Unexpected outages will happen at some point, as in March of this year, when an entire datacenter in Europe went offline due to a fire. This was not a brief outage but an extended period of being completely offline.
In times of disaster, failover (and eventual failback) needs to be reliable, (relatively) fast, and easy. Working out the details of failover while a disaster is unfolding is NOT going to produce the desired result, which is why we advocate testing your DR plan regularly.
InnoDB Cluster with Disaster Recovery
If you do a search for “InnoDB Cluster with DR,” you’ll get quite a few hits with blogs and articles about implementing DR with InnoDB Cluster. The articles generally explain that it’s not advisable to deploy a cluster with nodes spread over a WAN, because Group Replication is sensitive to the high latencies of WAN links. Instead, they advise using native MySQL asynchronous replication to replicate to a standalone MySQL database in another region or datacenter.
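To make that advice concrete, here is a minimal sketch of what the offsite replica setup usually amounts to. It only builds the SQL statements to run on the standalone DR server; it does not connect to anything. It assumes MySQL 8.0.23 or later syntax and GTID-based replication, and the function name, host name, and replication user are hypothetical placeholders, not taken from any of those articles.

```python
def replica_setup_statements(source_host, repl_user, source_port=3306):
    """Return the SQL to run on the standalone DR server, in order.

    Assumptions (not from the original article): MySQL 8.0.23+ statement
    syntax, GTIDs enabled on both sides, and a replication account that
    already exists on the source. The password is left as a placeholder.
    """
    return [
        # Keep the DR server read-only so that only replication writes to it.
        "SET PERSIST super_read_only = ON;",
        # Point the DR server at a cluster member, using GTID auto-positioning.
        (
            "CHANGE REPLICATION SOURCE TO "
            f"SOURCE_HOST = '{source_host}', "
            f"SOURCE_PORT = {source_port}, "
            f"SOURCE_USER = '{repl_user}', "
            "SOURCE_PASSWORD = '<password>', "
            "SOURCE_AUTO_POSITION = 1;"
        ),
        # Start the asynchronous replication threads.
        "START REPLICA;",
    ]


if __name__ == "__main__":
    # Hypothetical host and user, for illustration only.
    for stmt in replica_setup_statements("cluster-node1.example.com", "repl"):
        print(stmt)
```

Note how little is here: three statements give you a copy of the data in another site, and nothing more.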
After doing this, the articles state that you now have DR. But is this true DR? No, not really!
I would say this is not DR but simply an offsite replica, which, while a step in the right direction, falls short of a complete DR solution.
There are still these points to consider:
- How do you do a database failover? (Remember, it should be reliable, fast and easy)
- How do you fail back?
- Are you comfortable with just a standalone MySQL server in your DR site? (Performance, single point of failure)
- How do you point your applications to the correct datacenter, or have your applications fail over?
- How do you manage and orchestrate all of the above? And where can you see the status of the entire topology?
If you decide to spend the time to address all of the above yourself, the result will be an unsupported “DIY” solution that can pose serious risks to the business.
Consider what needs to happen for a simple failover:
- The DR node must be made read/write (it should be read-only prior to this, with the exception of the replication user, to avoid corruption).
- Replication needs to catch up as much as possible.
- Apps need to be directed to the DR database, either through a proxy or by having their DSNs updated (which will cause an app outage as well).
After you’ve done all of these steps, you would want your monitoring to report against the new primary database.
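The database side of those steps can be sketched as follows. This is a hedged illustration, not a supported procedure: it assumes MySQL 8.0.22 or later (where STOP REPLICA and SHOW REPLICA STATUS exist), it lets replication drain before the node is opened for writes, and it leaves application redirection and monitoring out entirely. The function names and the force flag are hypothetical.

```python
# Substring of the applier state that MySQL reports once the relay log is
# fully applied (the wording before it differs between server versions).
CAUGHT_UP_MARKER = "has read all relay log"


def relay_log_applied(status):
    """True if the replica's applier reports nothing left to apply.

    `status` is a dict holding one row of SHOW REPLICA STATUS output.
    """
    state = status.get("Replica_SQL_Running_State") or ""
    return CAUGHT_UP_MARKER in state


def failover_statements(status, force=False):
    """Return the SQL to run on the DR node, in order, to promote it."""
    if not relay_log_applied(status) and not force:
        raise RuntimeError(
            "replica is still applying changes; wait, or pass force=True"
        )
    return [
        # Stop replicating from the (presumably unreachable) primary site.
        "STOP REPLICA;",
        # Open the node for application writes; turning read_only off
        # also turns super_read_only off.
        "SET GLOBAL read_only = OFF;",
    ]


if __name__ == "__main__":
    # Hypothetical status row, for illustration only.
    row = {
        "Replica_SQL_Running_State":
            "Replica has read all relay log; waiting for more updates",
    }
    for stmt in failover_statements(row):
        print(stmt)
```

Even this small sketch has to decide what to do when the replica has not caught up, which is exactly the kind of detail you do not want to be working out during an outage.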
Also, as noted above, this approach leaves you with only a single MySQL instance in your DR site. Running a single instance for production is extremely risky, which is why you considered implementing a MySQL cluster in production in the first place.
Failback is even more complicated, because after the initial failover you are left with a single database. You must now reprovision the failed cluster from the single active instance, or from a backup of that instance (you are taking backups at both sites, right?). After rebuilding the primary cluster, you perform the failback by repeating the failover steps in the opposite direction. This could easily take several days, especially for large datasets and busy systems.
So, Now What?
DR and multi-site topologies are always challenging to implement when using synchronous or semi-synchronous replication technologies with MySQL.
Fortunately, it doesn’t have to be so difficult, and you can read all about it here: https://www.continuent.com/resources/blog/dr-with-innodb-cluster-possible