Long Live Backups! The Role of Recovery in a Continuous MySQL Environment

Failure To Plan is Planning for Failure

Calculating acceptable business risk and then building strategies and protocols to address those risks are key parts of disaster recovery planning.

When planning for a disaster, two target objectives are often used to define the recovery plan:

Recovery Time Objective (RTO)

Recovery Point Objective (RPO)

RPO defines how much data can be lost. For example, if backups are made once per day, the worst-case loss is close to 24 hours of data, which happens when the failure strikes just before the next backup runs.

RTO defines how long the business can be down. For a restore from backup, that includes the time it takes to locate the correct backup, transfer it, decompress it, and restore it so the database becomes available again.
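To make the two objectives concrete, here is a minimal Python sketch of the arithmetic. The step durations and the one-hour objectives are made-up illustrative numbers, and the function names are hypothetical; plug in measurements from your own restore drills.

```python
from __future__ import annotations

from datetime import timedelta


def worst_case_data_loss(backup_interval: timedelta) -> timedelta:
    """Worst case: the failure strikes just before the next backup runs,
    so everything written since the previous backup is lost."""
    return backup_interval


def estimated_recovery_time(steps: dict[str, timedelta]) -> timedelta:
    """Total downtime is the sum of every manual restore step."""
    return sum(steps.values(), timedelta())


def meets_objective(actual: timedelta, objective: timedelta) -> bool:
    """An objective is met when the actual figure does not exceed it."""
    return actual <= objective


# Assumed durations for a backup-only recovery (illustrative, not measured).
restore_steps = {
    "locate the correct backup": timedelta(minutes=15),
    "transfer it": timedelta(minutes=45),
    "decompress it": timedelta(minutes=30),
    "restore it": timedelta(hours=2),
}

loss_window = worst_case_data_loss(timedelta(hours=24))  # daily backups
downtime = estimated_recovery_time(restore_steps)

# Assumed business targets: one hour of data, one hour of downtime.
rpo_target = timedelta(hours=1)
rto_target = timedelta(hours=1)

print("Worst-case data loss:", loss_window, "| meets RPO:",
      meets_objective(loss_window, rpo_target))
print("Estimated downtime:  ", downtime, "| meets RTO:",
      meets_objective(downtime, rto_target))
```

With these assumed numbers, a daily backup alone misses both targets, which is the gap the rest of this post is about.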

Without clustering, a database failure would require manual recovery from a backup, a process that requires downtime and potentially loses any data written since the last backup.

While a proper clustering solution can enable continuous MySQL operations with a very fast recovery time (RTO) and a very small loss window (RPO), database clustering cannot protect against all eventualities.

To Backup, or Not To Backup, That Is the Question

Why are backups even needed? There are multiple MySQL replicas in a fault-tolerant cluster, so the data is completely safe, correct?

NO!

Please remember that writes to the Primary are copied to all Replica nodes in a cluster as quickly as possible. That means that a bad write will propagate throughout the entire cluster, rendering every copy of the data partially or completely useless.
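The effect is easy to see in a toy model. The Python sketch below is not how Tungsten or MySQL replication is implemented; it is a deliberately simplified, hypothetical simulation in which each node is a dictionary and every write applied to the Primary is replayed on each Replica.

```python
import copy


class ToyCluster:
    """Toy model only: one Primary plus N Replicas, each a dict of
    table name -> list of rows."""

    def __init__(self, replica_count, data):
        self.primary = copy.deepcopy(data)
        self.replicas = [copy.deepcopy(data) for _ in range(replica_count)]

    def apply_write(self, write):
        """Apply a write to the Primary, then replay it on every Replica."""
        write(self.primary)
        for replica in self.replicas:
            write(replica)

    def take_backup(self):
        """A backup is an independent, point-in-time copy of the data."""
        return copy.deepcopy(self.primary)

    def restore(self, backup):
        """Rebuild every node from the backup."""
        self.primary = copy.deepcopy(backup)
        self.replicas = [copy.deepcopy(backup) for _ in self.replicas]


def insert_order(db):
    db["orders"].append({"id": 3})


def drop_orders(db):
    # Stand-in for an accidental DROP on the Primary.
    db.pop("orders", None)


cluster = ToyCluster(2, {"orders": [{"id": 1}, {"id": 2}]})
backup = cluster.take_backup()

cluster.apply_write(insert_order)  # a good write made after the backup
cluster.apply_write(drop_orders)   # the bad write

nodes = [cluster.primary, *cluster.replicas]
table_gone_everywhere = all("orders" not in node for node in nodes)

cluster.restore(backup)
rows_after_restore = len(cluster.primary["orders"])
```

In this model the bad write removes the table from every node, and only the independent backup brings it back. The order inserted after the backup is still lost, which is the RPO at work.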

There are many possible reasons to need a backup. Here are just a few:

  • Admin error resulting in loss or corruption of data
    (e.g., someone types "DROP DATABASE" on your Primary by mistake…)
  • Application error, or automated SQL, leading to corruption
  • Malicious activity/hacking leading to partial or total data loss

It is for reasons like these (and more!) that backups are REQUIRED for the safety of the data and the business operation.

If I Must Have Backups, Then Why Bother with Clustering?

What is the Tungsten “Essential Toolkit”?

“Instead of always using a hammer, you can use a screw driver, or pliers, and sometimes you don’t have to do anything at all.”

Continuent knows the pain of DBAs, SREs, DevOps engineers, and SysAdmins first-hand; that’s why our engineers have distilled various manual processes down to single, seemingly magical commands. Some of my favorites include:

  • With "switch," we can redirect connections to another part of the cluster, so you can perform maintenance, patches, or updates without bringing your application down.
  • With "recover," we can restore optimal cluster health: the command automatically checks the status and state of every node and makes any necessary configuration changes.
  • With "tprovision," we can take a backup from one node and restore it on another.

You might be wondering if clustering is worth it if you still have to maintain a classic, proper backup process.

Clustering reduces the impact of a wide variety of risks that would otherwise cause a long outage with significant data loss. Clustering can protect against database, host, network, site, and even regional failures, and remediate them rapidly. It also provides read-scaling and automated recovery. Backups alone provide none of this.

With a cluster in place (true at least for Tungsten Clustering), there is a:

  • dramatic drop in the number of failure scenarios to recover from manually
  • lower total cost of ownership (TCO)
  • decrease in administrative overhead

Furthermore, having a fully-integrated, infrastructure-agnostic clustering solution like Tungsten Clustering makes it easier than ever to deal with complex cluster operations. Check out this blog about “the Boss” (a cluster manager or orchestrator, such as Tungsten Manager), which makes DR, multi-site, hybrid-cloud, and multi-cloud MySQL easy and cost-effective.

On top of reliability, resilience, HA, DR, load balancing, performance and geo-scale distribution, Tungsten Clustering comes with an essential toolkit to make your MySQL environment easy to manage.

Conclusion: Disasters Happen...So Plan For Them

Even with clustering, backups are always necessary!

Best practices for data availability include a number of methods for ensuring business continuity. Use them all, and do not rely upon any single tool.

Check out: “3, 2, 1 MySQL Backup is Fun!” to learn more about backup planning for a clustered environment.

OBSOLETE CONTENT BELOW

For Disaster Recovery and other situations where you’d ordinarily be in a bind and counting down against your RPO and RTO, the cluster not only provides real-time replicas to recover from, it also makes the recovery process itself extremely easy and, in many cases, completely automated.

As one expert in the field asked:
"When I think of clustering and continuous operations, I don't think of RPO and RTO at all -- that's why I put in a cluster..."

So instead of going the classic backup-and-recovery route every time things go wrong, you only need it when the data on your Primary becomes corrupted.

Furthermore, having a fully-integrated, infrastructure-agnostic clustering solution like Tungsten Clustering makes it easier than ever to deal with complex cluster operations. Learn more about what makes DR, multi-site, hybrid-cloud, multi-cloud MySQL easy and cost-effective.

Why are Backups Necessary in a Replicated MySQL Operation?

Just because a database is replicated, clustered, and fault-tolerant doesn't mean it's immune to the issues that come up with any live operational database. Clustering protects the availability of your database, but if the data on your Primary becomes corrupt, then assume all your Replicas are corrupt as well. And what’s the point of a highly available, corrupt database?

For example, if someone types "DROP DATABASE" on your Primary by mistake…

As covered in the recent backup blog, classic backup-and-recovery is still required for situations such as:

  • Admin error resulting in loss or corruption of data
  • Application error, or automated SQL, leading to corruption
  • Malicious activity/hacking

So, I hope you see that there is no alternative to a proper backup process.

Tungsten Clustering is designed to work for us and ensure database availability. That's why we focus on providing the highest quality enterprise support, training and education. Our solutions are not for the faint of heart: like any type of automation, they may serve to highlight our human imperfections if we’re not careful. But, along with providing an essential level of assurance that our database will be available, and reducing the pain of the classic backup process (while not eliminating it), the technology frees us to spend more time on creative problem solving.

“As machines become more and more efficient and perfect, so it will become clear that imperfection is the greatness of man.” – Ernst Fischer.

About the Author

Eric M. Stone
COO

Eric is a veteran of fast-paced, large-scale enterprise environments with 35 years of Information Technology experience. With a focus on HA/DR, from building data centers and trading floors to world-wide deployments, Eric has architected, coded, deployed and administered systems for a wide variety of customers, from Fortune 500 financial institutions to SMBs.