Enterprises require high availability for their business-critical applications. Even a small unplanned outage, or even a planned maintenance operation, can cause lost sales, reduced productivity, and eroded customer confidence. Additionally, updating and retrieving data needs to remain robust to keep up with user demand.
Let’s take a look at how Continuent Clustering helps enterprises keep their data available and globally scalable, and compare it to Amazon’s RDS running MySQL (RDS/MySQL).
Replicas and Failover
What does RDS do?
Having multiple copies of a database is ideal for high availability. RDS/MySQL approaches this with “Multi-AZ” deployments. The term “Multi-AZ” here is a bit confusing: enabling it simply means a single “failover replica” is created in a different availability zone from the primary database instance. Only one failover replica can be created, so a “Multi”-AZ deployment gives you just one failover candidate with a copy of the database. The failover replica has a single purpose – to serve as a failover target – and cannot be used for anything else (more about this later).
The failover process for RDS happens automatically and takes between one and two minutes. It also updates the DNS record for the database to point to the failover replica. As a result, we have the following consequences:
- Application downtime of one to two minutes
- The application must reconnect to the database, which, depending on the application, may report cryptic errors to the user or even crash
- JVMs cache DNS lookups by default, so you may need to reconfigure your Java environment's DNS cache TTL to pick up the new record
- You no longer have any failover candidate until another replica is brought online
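Applications typically cope with this reconnect requirement by wrapping connection attempts in a retry loop with backoff. A minimal sketch, where `connect` is a stand-in for your driver's connect call (e.g. `mysql.connector.connect` with your own arguments):

```python
import time

def connect_with_retry(connect, max_wait=120.0, base_delay=1.0):
    """Retry a database connection with exponential backoff.

    During an RDS failover the DNS record flips to the standby, so a
    client that gives up on the first error stays broken for the whole
    1-2 minute window. `connect` is any zero-argument callable that
    returns a connection object or raises OSError on failure.
    """
    delay, waited = base_delay, 0.0
    while True:
        try:
            return connect()
        except OSError as exc:
            if waited >= max_wait:
                raise TimeoutError("gave up waiting for failover") from exc
            time.sleep(delay)
            waited += delay
            delay = min(delay * 2, 15.0)  # cap the backoff interval
```

Real drivers raise driver-specific errors (e.g. `mysql.connector.Error` rather than a bare `OSError`), so widen the `except` clause accordingly.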
How does Continuent Clustering handle failover?
Using Continuent Clustering, you set up your cluster to have a primary (master), and 2 or more replicas (slaves). Each slave in a cluster is a candidate for failover, and since this is a true cluster, the application simply connects to the cluster with no modification, and any changes to the cluster happen behind the scenes to the application. This is made possible using the Connector, which is an intelligent proxy that speaks the MySQL protocol.
During failover, the cluster selects a slave to promote to master. The Connector temporarily holds connections from the applications until the failover process is complete. Once the slave has been promoted to master, the Connector resumes the connections, now directed at the new master. The advantages here are:
- Fast failover time, often within 10 seconds!
- Applications do not get disconnected. No errors reported to the users.
- Applications do not need to be aware of a new master
- After a failover in a 3-node cluster, there is still another slave that can handle a subsequent failover. A 5-node cluster with a failed master would still have 3 slaves online!
- In many cases the failed master can be repaired and added back into the cluster, saving reprovisioning time.
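The hold-and-resume behaviour can be pictured with a small toy model (an illustration of the idea only, not the Connector's actual implementation):

```python
import threading

class PauseGate:
    """Toy model of a proxy that holds client traffic during failover.

    While a failover is in progress, callers block at the gate instead
    of receiving connection errors; once the new master is promoted,
    the gate opens and the queued requests proceed as if nothing
    happened.
    """

    def __init__(self):
        self._open = threading.Event()
        self._open.set()  # traffic flows normally to start

    def pause(self):
        """Failover begins: hold all new requests."""
        self._open.clear()

    def resume(self):
        """New master promoted: release held requests."""
        self._open.set()

    def forward(self, query, timeout=30.0):
        """Block until the gate is open, then 'execute' the query."""
        if not self._open.wait(timeout):
            raise TimeoutError("failover exceeded the client timeout")
        return f"executed: {query}"
```

The key point the model captures: from the application's point of view a failover is just one slow query, not a dropped connection.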
Performance and Scalability
RDS/MySQL provides “read replicas,” which, although not automatic failover candidates, are replicas of the primary instance and can be used for reading data, offloading some traffic from the master. Note that a read replica can be manually promoted to a master. A read replica will have a different IP address, thus to take advantage of using it for reads, your application must be designed to send reads to the read replica, and writes to the master.
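What that application-side routing looks like can be sketched as follows; the endpoint names are hypothetical and the statement classifier is deliberately naive:

```python
# Hypothetical endpoints: with RDS read replicas, the application
# itself must decide which host receives each statement.
PRIMARY = "mydb.example.us-east-1.rds.amazonaws.com"
READ_REPLICA = "mydb-replica.example.us-east-1.rds.amazonaws.com"

READ_ONLY_VERBS = {"SELECT", "SHOW", "DESCRIBE", "EXPLAIN"}

def pick_endpoint(sql):
    """Naive read/write split: route read-only statements to the read
    replica and everything else (writes, DDL, transaction control) to
    the primary. A real application must also keep reads that need
    read-your-writes consistency on the primary, because replication
    to the read replica is asynchronous.
    """
    verb = sql.lstrip().split(None, 1)[0].upper()
    return READ_REPLICA if verb in READ_ONLY_VERBS else PRIMARY
```

Every application that wants the read replica's capacity has to carry some version of this logic, and keep it correct as the schema and query mix evolve.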
Coming back to the “failover replica,” note that it uses synchronous replication. This means that each write to the primary database will block until the write has been committed and acknowledged on the failover replica. This introduces extra latency for your applications, and it can be significant for write-heavy systems.
In a Continuent Cluster, a slave is not only a failover candidate but can be used for reads as well. The 3-node cluster mentioned above already has 3 nodes available for reading, and once again, using the power of the Connector, reads can be automatically directed to slaves without modifying your application! Since the Connector is a true proxy and router, there are quite a few algorithms available for splitting reads and writes. If your application is already read/write aware, great! We can leverage your existing logic. If not, the Connector offers read/write splitting algorithms for you to use.
Also note that by adding more slaves in a Continuent Cluster, you are scaling the number of nodes available for reads without impacting your application.
With maintenance tasks, you are in control with Continuent Clustering. Plan your maintenance when you want, and perform many maintenance tasks, like OS patches and MySQL upgrades, with no downtime. Imagine upgrading from MySQL 5.6 to MySQL 5.7 with NO downtime!
RDS/MySQL requires a maintenance window, and during that window, your instances may be restarted. This of course translates to application downtime.
Benefits of a True Cluster
There are many benefits of using Continuent Clustering. Some of the benefits we discussed are: Automatic failover with no application disconnect, read scaling, read/write splitting, and ease of maintenance. However, there are many more benefits – multi-master (deploy replicated clusters across the continent or across the globe), cloud compatibility (think running a cluster in AWS and being able to failover to Google Cloud!), replication to other databases and data warehouses, and support from engineers with 20-30 years of experience each in databases and clustering!
The Player Accounts team at Riot Games needed to consolidate the player account infrastructure and provide a single, global accounts system for the League of Legends player base. To do this, they migrated hundreds of millions of player accounts into a consolidated, globally replicated composite database cluster in AWS. This provided higher fault tolerance and lower latency access to account data. In this talk by Tyler Turk (Infrastructure Engineer, Riot Games), we discuss this effort to migrate eight disparate database clusters into AWS as a single composite database cluster replicated in four different AWS regions, provisioned with Terraform, and managed and operated with Ansible.
Why does the DIY approach fail to deliver vs. the Continuent Clustering solution for geo-distributed MySQL multimaster deployments?
- Continuent Clustering is a complete solution, comprising the Replicator, Manager and Connector components
- With DIY, you must first decide the architecture, then select the individual tools to handle each layer of the topology. Each part must be installed, configured, maintained, monitored and fixed separately.
- With DIY, you must craft scripts to connect everything together. In Continuent Clustering, the three core components, Manager, Connector and Replicator, handle all of the messaging and control in a seamlessly-orchestrated fashion.
- Continuent Clustering has more than ten (10) years of development maturity. No DIY solution can match the depth of enterprise experience we bring to the table.
- Continuent Clustering is designed from the ground up to provide 24×7 data access
- Development of Database High-Availability Solutions
- DIY requires significant in-house investment of time, money and human resources to build a full solution, and even then it could not come close to matching what Continuent offers based on ten years of development efforts.
- Creating a simple solution is relatively fast, but managing and correctly automating all the possible corner cases that a geo-distributed multi-master solution exposes is very difficult to master. It is often the corner cases that lead to downtime.
- DIY also requires extensive institutional knowledge to maintain the multiple portions of the chosen architecture. If the staff are lost, often a DIY solution becomes impossible to manage and maintain.
- Continuent Clustering is a complete, proven and supported solution with significant resources including extensive documentation, release notes, white papers and instructional training videos & webinars.
- Continuent Clustering is under continuous development with new features and bug fixes, keeping pace with changes in the environment (new MySQL versions, OS versions, etc.)
- Database Administration (DBA)
- DIY requires that your DBA do a lot of manual work. Properly automating local failover is an enormous task, especially when trying to avoid split-brain situations.
- With DIY, scripting a cohesive solution for a geo-distributed, active-active, highly-available deployment is definitely a non-trivial task. Continuent Clustering provides a global mesh right out of the box.
- With DIY, recovering from failures is time-consuming and laborious. Continuent Clustering automates most normal operations.
- With DIY, almost all updates and upgrades (MySQL version, schema changes, etc.) will require down-time. Continuent Clustering allows continuous operation during all changes.
- With Continuent Clustering, you can add new sites/clusters by following standard, easy-to-follow instructions
- All solutions must be monitored to be fully effective, so Continuent Clustering includes various monitoring scripts as part of the core product.
- We offer a cron-based watcher script that alerts via email if you do not already have a monitoring system.
- The included shell scripts are very easy to read, understand and modify to suit your needs.
- We include Nagios and Zabbix support as part of the base product.
- Re-provisioning a slave can be handled using a single command.
- All enterprise databases need to be backed up on a regular basis, regardless of the number of slave replicas, because replication faithfully propagates mistakes (such as an accidental DROP TABLE) to every copy.
- Continuent Clustering fully supports and is tightly integrated with both MySQL’s mysqldump command and Percona’s excellent, free XtraBackup.
- A configurable cron-based backup script is also provided as part of the core distribution, powerful enough to select an appropriate node and perform automated backups.
- Backup and Restore may also be handled via the Manager’s cctrl command-line interface.
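The “select an appropriate node” step can be sketched like this (a toy model with hypothetical field names, not the actual script's logic):

```python
def pick_backup_node(nodes):
    """Choose where to run a backup: prefer an online slave with the
    smallest replication lag, so the master keeps serving writes
    undisturbed. Each node is a dict such as:
        {"host": "db2", "role": "slave", "online": True, "lag_s": 0.4}
    """
    candidates = [n for n in nodes if n["role"] == "slave" and n["online"]]
    if not candidates:
        raise RuntimeError("no online slave available to back up")
    return min(candidates, key=lambda n: n["lag_s"])
```

Running backups on a slave rather than the master is the standard way to keep the backup's I/O load away from the write path.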
- Multi-master Replication (strength of Tungsten Replicator)
- Tungsten Replicator handles complex topologies with ease.
- Tungsten Replicator can switch roles very easily (i.e. master becomes a slave, or a slave becomes a master).
- Tungsten Replicator is cluster-aware and can automatically switch to another THL source if the current one becomes unavailable.
- Management (zero downtime maintenance, etc.)
- We have already engineered the rules, communications and tools needed to control the entire clustering process end-to-end.
- With Continuent Clustering, there is a unified view of all the cluster resources. With DIY, there is no such view.
- With Continuent Clustering, it is very easy to change the role of a node; with DIY, there are many different moving parts to control individually.
- With DIY, failures may not be handled correctly because one layer does not know what another layer is doing.
- With Continuent Clustering, a node may be taken down for maintenance without impacting operations, then easily returned to the cluster and re-synced.
- Connectivity/Intelligent Proxy and Router (multisite aware proxy, etc.)
- Connector is able to intelligently pause and resume client traffic during switches and failovers.
- Connector is able to load-balance in multiple ways, including across sites using intelligent proxying and query redirection
- Connector gets status updates and signals from the manager, and is therefore integrated into the management process.
- Connector is able to reconnect to another manager if it loses connection with the current one.
- Tungsten Connector has a full view of all nodes.
- Connector Bridge mode provides full-speed performance with very low latency
- Support (can’t beat our 24/7 support)
- All layers are supported by the same team end-to-end. Rapid response for urgent cases. Our SLA is 1 hr max. For urgent cases, we average 6 minutes.
- For a DIY solution, there is no single place to go for help. Many of the tools/solutions have no available support, or require support contracts from multiple vendors.
- Staff spread out over the globe for follow-the-sun support around the clock.
- 24x7x365 enterprise-level support is included in all term licenses.
- You are supported by mature, knowledgeable staff with decades of experience.
- Built-in diagnostics-gathering and submission tools covering all layers including the OS and database.
- Tungsten Replicator has features that are not available with native MySQL replication
- Easy management – no need to log into MySQL to manage replication.
- Global Transaction ID (GTID) for ANY version of MySQL.
- Off-the-shelf MySQL support (MySQL Community/Enterprise, Percona Server and MariaDB)
- Inspect every event in detail, including the event source, timestamp and time zone using simple commands.
- Easy to query to get performance data, latency, status, and errors.
- High tolerance for network outages, automatically picks up where it left off.
- Able to use either a master or another slave as a THL source. Use a slave to reduce load on the master!
- Can automatically switch to another THL source if the current one becomes unavailable.
- Dedicated replicator log files to help quickly diagnose issues.
- Ability to skip transactions as needed with a single intuitive command.
- Ability to replicate to other targets in addition to MySQL: Hadoop, Oracle, AWS Redshift, Kafka, HPE Vertica, Cassandra, Elasticsearch and others, with the same single extraction from the MySQL source(s).
- Over 40 filters available! Replicate subsets of schemas, tables, columns, and even do data transformation.
- Create time-delayed replicas so that roll-back to a known good state is easy.
- Implement complex topologies, such as fan-in.
- Parallel apply support, based on schema.
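The time-delayed replica idea from the list above can be sketched as a simple applier loop (a toy model, not the Replicator's actual implementation):

```python
import time
from collections import deque

def apply_delayed(events, apply, delay_s, now=time.time):
    """Toy model of a time-delayed replica.

    Each event is a (commit_timestamp, statement) pair from the
    master's log; the applier refuses to apply anything younger than
    `delay_s` seconds. If a bad statement (say, an accidental DROP
    TABLE) hits the master, the delay gives you a window to stop the
    replica before the damage is replicated.
    """
    pending = deque(events)
    applied = []
    while pending:
        ts, stmt = pending[0]
        if now() - ts < delay_s:
            break  # too fresh; a real replicator would sleep and re-check
        pending.popleft()
        apply(stmt)
        applied.append(stmt)
    return applied, list(pending)
```

The delayed replica then doubles as a standing "known good state" that is always `delay_s` seconds behind the master.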
Learn how Adobe Sign serves its global e-signature customers in the cloud. Adobe Sign is a cloud-based, enterprise-class e-signature service that lets you replace paper and ink signature processes using a browser or mobile device. For Adobe Sign and companies using it, application uptime is absolutely crucial. Learn how Adobe Sign secured their SaaS revenue using Tungsten Clustering on cloud-based services. We discuss how Adobe ensures their continuous operation, with site-level and cross-site failover for Adobe Sign application availability, 24/7.