“We don’t take backups, we use replication instead.” If you happen to agree with this statement, I urge you to continue reading. But even if you think you have a good backup plan, I still urge you to continue reading. Taking backups is usually not one of the most exciting parts of the job, but it might be the part that saves your company from a catastrophe. So, let’s make a plan!
3, 2, 1…
Our backup strategy should be able to recover from the following scenarios:
- Loss of a physical machine, VM, or cloud instance
- Loss of an entire physical site in a disaster
- Admin error resulting in loss or corruption of data
- Application error leading to corruption
- Malicious activity/hacking
Of course, an enterprise-level MySQL clustering solution like Tungsten Clustering can mitigate hardware loss; however, corruption or loss of data may still require restoring from a backup.
The “3-2-1 backup rule” states that we should keep 3 copies of our data: 2 local and 1 “offsite.” Looking at the list above, it should be clear that the goal of this rule is to mitigate the loss of the backups themselves. Why 2 local copies? Quite simply, one could fail. Consider unauthorized access to your database: if the perpetrator can gain access to your production database, it’s quite possible they could gain access to your primary backup as well. Therefore, it makes sense to keep the second backup copy separated from the first -- stored on a different system, ideally on a different subnet and with different credentials from the first backup. Now we’ve made it much more difficult to compromise our second copy.
The “offsite copy” at one time meant writing the backup to tape and storing that tape in a completely separate location from the datacenter -- an offsite vault or even an employee’s home! Although that approach is still in use, today the offsite copy could be on disk in another data center, or in an object store from your favorite cloud provider. Of course, the credentials for the offsite storage should be different from those of the production system, to protect the backup from nefarious entities. If you are deployed in the cloud, consider hosting the offsite copy with a different cloud provider; remember, if your primary cloud provider is having an issue, chances are you won’t be able to access the backups stored there either.
How long should we keep our backups? This is a business decision that needs to be addressed by stakeholders in your organization. Would a backup from a year ago be needed for research purposes? Two years? And for compliance: what are the minimum retention period and backup frequency that must be maintained? These are important questions that must be answered, documented, and signed off, along with the costs of maintaining all of the backups.
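Once the retention period is agreed upon, enforcing it can be a one-line cron job. Below is a minimal sketch of a pruning step; the directory, file pattern, and 30-day window are illustrative, and the script creates its own demo files so it can run standalone:

```shell
#!/bin/sh
# Demo of a simple retention policy: keep 30 days of backups, prune older ones.
# BACKUP_DIR, the file pattern, and RETENTION_DAYS are hypothetical -- set
# them to match the retention policy your stakeholders signed off on.
BACKUP_DIR="${BACKUP_DIR:-/tmp/mysql-backups-demo}"
RETENTION_DAYS=30

mkdir -p "$BACKUP_DIR"

# Demo data so the sketch runs standalone: one "old" and one "recent" backup.
touch -d "40 days ago" "$BACKUP_DIR/db-old.sql.gz"
touch "$BACKUP_DIR/db-recent.sql.gz"

# Delete any backup older than the retention window (GNU find).
find "$BACKUP_DIR" -name '*.sql.gz' -mtime +"$RETENTION_DAYS" -delete

# Only db-recent.sql.gz should remain.
ls "$BACKUP_DIR"
```

Note that if compliance requires a *minimum* retention, pruning too aggressively is as much a failure as not pruning at all, so this number belongs in the documented, signed-off policy.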
RPO and RTO
Two other questions also need to be addressed when setting up the backup policy. The first is RPO, or “Recovery Point Objective”: quite simply, how much data, measured in time, can the business tolerate losing during a disaster? The obvious answer is “None!”, but this is not realistic in a backup scenario, especially if we are recovering from corruption or hacking -- we do not want the latest version of the data! This is again a business decision, with costs mounting as the number gets lower. If the RPO is 24 hours, then a simple nightly backup is sufficient. If it’s an hour or less, then consider more frequent full backups combined with incremental backups.
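The RPO translates directly into a backup schedule. As a sketch, a 24-hour RPO needs only a nightly job, while a sub-hour RPO adds frequent incremental work -- for MySQL, typically copying the binary logs off-host. The script names and paths below are hypothetical placeholders, not real tools:

```
# Illustrative crontab for a 24-hour RPO: one full backup nightly.
30 2 * * * /usr/local/bin/mysql-full-backup.sh

# For an RPO of an hour or less, add an incremental step, e.g. shipping
# the binary logs to backup storage every 15 minutes:
*/15 * * * * /usr/local/bin/mysql-binlog-ship.sh
```

The incremental step is what keeps the worst-case data loss near 15 minutes rather than a full day; the nightly full backup keeps restores simple.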
RTO, or “Recovery Time Objective,” is the amount of time that the business can tolerate being offline during a disaster while a backup is restored. This is often overlooked, even in organizations with frequent, viable backups: the time it takes to locate the correct backup, transfer it, decompress it, work out the steps needed to restore it, and so on, becomes much longer than expected. The stakeholders should decide how much time you have to recover. If this is agreed to be 8 hours then, depending on the size of the data to be recovered, it might be fairly easy to meet.
If the RTO is 5 minutes -- or else the business could fail -- the plan is much different: perhaps the backup is stored on a filesystem that can be quickly mounted to the production instance, with the applications simply restarted. Or consider using a database cluster, such as Tungsten Cluster, with near real-time recovery. But even with a cluster, proper backup planning is a must.
Are the Backups Viable?
This is an easy question to answer! If you have tested a backup by restoring it and verifying that it works, then yes, you have a viable backup. If you have not tested your backup, then NO, it is not considered viable and you are at risk. Test your backups! This can even be done with automated scripts that perform the restore and verification for you. Imagine a scenario where the business has defined the RPO and RTO and given you the resources to meet all of the backup requirements, you have implemented the plan, and then, during a disaster, you discover that your backup cannot be restored -- and you are the one who has to deliver the news to the organization…
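As a starting point for such automation, here is a minimal sketch of sanity checks on a compressed logical backup. To be clear, these only catch missing, truncated, or corrupt dump files -- the real viability test is still an automated restore into a scratch MySQL instance followed by verification queries. The path is hypothetical, and the script writes a tiny fake dump so it runs standalone:

```shell
#!/bin/sh
# Basic sanity checks on a gzipped mysqldump file. These complement, not
# replace, a full test restore. BACKUP path is a hypothetical example.
BACKUP="${1:-/tmp/mysql-verify-demo/latest.sql.gz}"

# Demo data so the sketch runs standalone: a tiny fake dump.
mkdir -p "$(dirname "$BACKUP")"
printf 'CREATE TABLE t (id INT);\n-- Dump completed\n' | gzip > "$BACKUP"

# 1. The file must exist and be non-empty.
[ -s "$BACKUP" ] || { echo "FAIL: backup missing or empty"; exit 1; }

# 2. The gzip stream must be intact (catches truncated transfers).
gzip -t "$BACKUP" || { echo "FAIL: corrupt archive"; exit 1; }

# 3. mysqldump writes a "-- Dump completed" marker as its last act; its
#    absence usually means the dump was cut off mid-write.
gzip -dc "$BACKUP" | tail -n 5 | grep -q 'Dump completed' \
  || { echo "FAIL: no completion marker"; exit 1; }

echo "PASS: $BACKUP passed basic checks"
```

Wiring checks like these (and, ideally, a scripted restore) into the backup job itself means a bad backup pages you the night it happens, not the night you need it.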
There’s one more important step: Making a Disaster/Recovery plan. You get paged at 3:00 AM to restore a backup and your RTO is 10 minutes. Will you panic and try to remember the restore procedure that you did 8 months ago, or will you simply follow a detailed plan that’s been tested and approved?
Now the Fun Part!
Well, backup and recovery is not necessarily fun. I’ve yet to see an Employee of the Month award for a good implementation of backup and recovery. It’s also time-consuming. But at some point it will save your business -- and many jobs.