Deepthi Sigireddi on Distributed Database Architecture in the Cloud Native Era
Subscribe on:
Transcript
Srini Penchikala: Hi everyone. My name is Srini Penchikala. I am the Lead Editor for AI/ML and the Data Engineering community at InfoQ and also a podcast host. In today’s podcast, I will be speaking with Deepthi Sigareddi, technical lead for Vitess, a Cloud Native Computing Foundation graduated open-source project. We will discuss the topic of distributed databases architecture, especially in the cloud native era where it’s critical for the database solutions to provide capabilities like performance, availability, scalability and resilience. Hi Deepthi, thank you for joining me in this podcast. First question, can you introduce yourself and tell our listeners about your career and what areas have you been focusing on recently?
Introductions [01:12]
Deepthi Sigireddi: Hi Srini. Thank you for inviting me to speak with you today. I am currently the technical lead for Vitess, open-source project and also the Vitess engineering lead at PlanetScale, which is a database as a service company. That database as a service is actually built on Vitess. My career started a long time ago and almost from the beginning I have worked on databases, but to start with as an application developer, so there is a database back-end, Oracle, DB2, Informix, and you are writing things that will run against, that data is stored in the tables and applications will fetch the data, do something with them.
But what happened very early on starting in the year 2000 is that I was working on doing supply chain planning solutions for the retail industry. And the retail industry has massive amounts of data. So we had to actually start thinking about how do you run anything on these massive amounts of data? Because the hardware that was available to us at that time, it was actually not possible to load everything into memory and process it and then write out the results. So we had to come up with parallelizable, essentially parallel computing solutions to work with these monolithic databases where people kept their data.
I did that for about 10 years and then I moved on to working on cloud security at a startup that was eventually acquired by IBM, and this was mobile device management, so mobile security. And even there you have a lot of data and it was being stored in Oracle and we were actually doing what is now called custom sharding because we would store different users data in different schemas in the same Oracle database so it was essentially a multi-tenant system. Fast-forward a few more years, and then I joined PlanetScale and started working on Vitess, which is a massively scalable distributed database system built around MySQL. MySQL is the foundation, but MySQL is monolithic and cannot actually scale beyond the limits of one server. And what Vitess does is that it actually solves many of those limitations for the largest users of MySQL.
Srini Penchikala: Yes. MySQL is a popular open-source database and it’s been around for a long time so Vitess is built on top of that. So as you said, Vitess is a distributed database. Can you discuss the high-level architecture of how the database works, how it brings this distributed data management capabilities to MySQL?
Deepthi Sigireddi: So let me give a little bit of history of Vitess first, which will set the context for all the things that Vitess is capable of doing. Vitess was created at YouTube in 2010, and the reason it was created is because YouTube, at that time, was having a lot of difficulty dealing with the amount of traffic they were receiving and all of the video metadata was stored in MySQL. The site would basically go down every day because it was just exploding and MySQL was not able to keep up. By that time, YouTube was already part of Google, but they were running their own infrastructure and some of the engineers decided that this was not a situation that you could live with. It was simply not tenable to continue the way they were going, and they decided that they had to solve this problem in some fundamental way, and that’s how Vitess was born and it grew in phases.
So they basically, to start with, identified what was the biggest pain point, like MySQL runs out of connections, stuff like that so let’s do some connection pooling. Let’s put a management layer in front of MySQL so that we can reduce the load on MySQL type of things, right? But eventually, it has grown to where we are. So with Vitess, you can take your data and shard it vertically, meaning you take your set of tables and put a subset on a different MySQL server and Vitess can make it transparently look as if there is only one database behind the scenes, even though there are multiple MySQL servers and different tables live on different servers.
Or you can do horizontal sharding where a single table is actually distributed across multiple servers but still looks like a single table to the application. So that’s a little bit of history and features. In terms of architecture, what Vitess does is that as part of the Vitess system, every MySQL that is being managed has a sidecar process, which we call VTTablet, which manages that MySQL.
And then the process that applications interact with is called VTGate. It’s a gateway that accepts requests for data and that can be MySQL protocol or it can be gRPC calls and it will decide how to route those queries to the backing VTTablets, which will then send the queries to the underlying MySQL and then return the results to VTgate, which will aggregate them if necessary and then send them back to the client application. So at a very high level, you have the client talks to VTGate. VTGAte figures out how to parse, plan, execute that query, sends it down to VTTablets, which will send it down to MySQL, and then the whole path travels back in reverse with the results.
Cloud Native Databases [07:05]
Srini Penchikala: Yes, I’m interested in getting into specifics of a distributed database. But before we do that, I would like to ask a couple of high-level questions. So what should the database developers consider when looking for a cloud-native database if they have a database need and they would like to leverage a cloud database? What should they look for in what kind of high-level architecture and design considerations?
Deepthi Sigireddi: At this point in time, almost every commercially available database has a cloud offering and they are capable of running in the cloud. That was not always the case. So really, the things that you want to look for when you are looking for a cloud database are usability because it’s important. You have very little control over how the database is being run behind the scenes so you have to be able to see what the configuration is, whether it works for you and be able to tune it to some extent. There’ll be some things that any service provider will not allow you to tune, but it needs to be usable for your application needs.
Compatibility, whatever it is that you’re doing needs to be supported by the database you’re looking at because there are many, many cloud databases. There’s Amazon RDS, Google Cloud SQL, Mongo has a cloud offering. So you have SQL databases, NoSQL databases, Oracle has cloud database offerings so there are many options so that compatibility is very important. And then uptime. You have no control over how well the database is staying up, and many times the database is critical to operations so you have to look at the historical uptime and the SLA that the cloud provider is providing you.
What you should not need to worry about are things like, will my data be safe? Will my database be available? Will I lose my data? Can I get my data out? All those are baseline things that the cloud providers need to make sure happen and you shouldn’t have to worry about them, but you need to check those boxes when you’re choosing an option.
Srini Penchikala: And kind of more of a follow-up question to that. I know not every application requires a massively parallel distributed database. So what are the things that database developers and application developers should look for when they could use a database that’s not a cloud-native database?
Deepthi Sigireddi: That’s a good question. I don’t know that, at this point of software system evolution, there are very many situations where you can’t use a cloud database. Usually, it is because you have highly sensitive data that legally you are not allowed to store in a third party. Because otherwise, in terms of encryption, in terms of network security, things have come a long way, and in general, it is safe to do things in the cloud. So the only thing I can really think of is legal reasons that you would not use something in the cloud.
Srini Penchikala: Yes. Also, I kind of misphrased my question. So I think cloud databases are always good choices for pretty much most of the applications, but the distributed database may not be needed for some of the apps so you can use a standalone database in the cloud.
Deepthi Sigireddi: Correct. Not everyone needs an infinitely scalable database – that is actually perfectly right. And that is why when you are starting off, you want to choose your database based on usability, based on compatibility with whatever applications and frameworks you’re already using, whatever tooling you’re already using, and just choose something that you can rely on that will stay up. So that goes back to my previous explanation of how people should make these decisions.
Sharding and Replication [11:06]
Srini Penchikala: So assuming that a distributed database is the right choice, one of the architectural highlights of Vitess is sharding, how it manages the database across different shards. So can you tell us how data sharding or partitioning works as well as data replication works in Vitess database?
Deepthi Sigireddi: Okay. Sharding was designed to provide scalability, right? So we’ll talk about that first, and then replication is being used to provide high availability, so we’ll talk about that next. In Vitess, sharding is customizable, meaning you choose which column in your table you want to use as the sharding key. It is not hardcoded, it is not fixed, and you can also choose the function based on which you are sharding. So we have several functions available and we also have a public interface using which people can build custom sharding functions as well. Essentially, what it means is that whenever you are inserting a row into a sharded table, VTGate will compute which shard it needs to go into and it will write it into that shard. And then whenever you are trying to read a row, VTGate will compute which shard that row lives in and fetch it from that shard.
Sometimes there are queries where you cannot compute specifically one shard. For example, if you say, “Get me all the items in my store, which cost more than a million dollars.” For a query like that, you don’t know which shard the data is living in. In fact, it can live in multiple shards. So what VTGate will do is called a scatter query. It will actually send that query to all the shards and gather the results and bring them back. So this is basically how sharding works from the user point of view. Behind the scenes, we have a lot of tooling that helps people to move from their currently unsharded databases. So people get into terabytes of data and then everything starts slowing down and managing things becomes very difficult. At that point, they decide, “Okay, we need to shard and either we need to do vertical sharding, which is separate tables into multiple instances or horizontal sharding”. We’ll focus on horizontal sharding.
So we have the tools using which you can define what your sharding scheme is, how you want to shard each table based on which column. You choose the tables that you want to keep together and first you may do a vertical sharding to move them onto their own servers, and then you do a horizontal sharding to split them up across servers, and the tooling will basically copy all the data onto the new shards and it’ll keep it up to date with your current database to which you can keep writing.
Your system is up and running, you can copy everything in the background, and then when you’re ready to switch to your sharded configuration and you issue that command, we will stop accepting all writes for a very small period of time. This should take less than 30 seconds, maybe 10 or 15 seconds. We’ll stop accepting all writes on the original database or databases and switch everything over to the new shards and then we’ll open it up again for writes so that the new data will flow to the new shards. And in this process, you can optionally also set up what we call a reverse replication so that the old databases are kept in sync with new data and changes that are happening on your new shards just in case you need to roll back.
So we have the facility to do a cut over and then reverse it if something goes wrong because things can always go wrong. So we have a lot of tooling around how you do the sharding, how you make it safe, how you do the cut overs, rollbacks, all of those things.
Srini Penchikala: Okay, that’s for the sharding. Is replication… How does the replication work?
Deepthi Sigireddi: So in our modern software world, application services, they have to stay up all the time. Downtime is not acceptable, right? Downtime is tracked for every popular service and beyond a certain point, users will just leave. If your service is not up, they will go somewhere else. They have options. Given that Vitess is backing many… Originally it backed YouTube. YouTube had to be up 24 by 7, people all over the world are using it. And then over time, other places like Slack, GitHub, Square Cash, you want to make a money transaction. You don’t want the database to be down, right? You don’t want the service to be down. So the service needs the database to be up at all times because you have this stack where you have the user, who’s interacting with the web app from a browser, and then you have all these other components and your uptime is only as high as the weakest of those components.
It’s like the weakest link concept, right? So everything has to be highly available. And the way, traditionally, everybody has achieved high availability with MySQL is through replication, because MySQL comes with replication as a feature so you can have a primary which is actually being written to, and then you can have replicas which are followers, and they’re just getting all the changes from the primary and applying the changes to their local database, and it’s like a always available copy of the data.
So in Vitess also, we use replication to provide high availability, and we have tooling around how to guarantee very high availability. So let’s say you want to do planned maintenance and you have a primary and you have one or more replicas. In the old days, if you have one MySQL server, you have to take it down for maintenance and then it’s unavailable. What we can do is we can actually transition the leadership from the current primary to a replica that is following, and we’ll basically say, “Okay, let’s find the one that is most caught up. If they are lagging, let’s get them to catch up and then switch over so that you have a new primary. And during that period, whatever request we are getting from applications, we’ll hold them up to some limit.” Right? Let’s say a thousand requests.
We can buffer, we can do buffering so that requests don’t error out and it’ll take five or 10 seconds for this transition to happen and after that, we can start serving requests again. So we use replication for high availability and around that replication, we have built features to do planned maintenance, but we also handle unplanned failures. So let’s say the primary went down for some reason. It’s out of memory, there’s a disk error, there’s a network issue, something or the other. The primary MySQL is not reachable.
In Vitess, we have a monitoring component called VTOrc, which stands for Vitess Orchestrator, and there’s a history behind that name as well. VTOrc monitors the cluster and it will say, if the primary is unreachable, I’ll elect a new one and I’ll elect whichever one is most ahead so that we don’t lose data. It will monitor replication on the replicas and make sure that they are always replicating correctly from the cluster primary and it can monitor for some other error conditions as well and fix them. So that’s how Vitess handles high availability by having all of this tooling to handle both planned maintenance and unplanned failures.
Srini Penchikala: So let’s quickly follow up on that, Deepthi, is like you mentioned, if one of the nodes in the cluster is not available or needs to be down for maintenance and the second node or replica node will take over the leadership. So again, the data needs to be properly sharded, partitioned and replicated, right? And then the primary comes back up so now we want to go back to the previous primary node.
So how much of this process is automated using the tools you mentioned, the VT Orchestrator, and how much of this is manual and also what is the maximum delay in switching between the nodes?
Deepthi Sigireddi: For planned maintenance, because it has to be initiated by someone, it’s not automated by Vitess. But Vitess users automate them in some fashion because typically what will happen is that they’re doing a rolling upgrade of the cluster, whether it’s a Vitess version upgrade or it’s some underlying hardware upgrade, you basically want to apply it to all the nodes, so it has to be triggered from outside. So they will call a Vitess command, because we have a command line interface. They will call a Vitess command. You can call an RPC also, but most people use the command line, which will trigger that failover.
Now, the other part of it, the unplanned failures. As long as VTOrc is running, it does the monitoring and the failover so no human intervention is needed actually for that, and that’s the whole idea of it. In terms of timing, we’ve seen things happen as quickly as five to 10 seconds, and especially during the planned failures because we buffer the requests, applications don’t even see it, or even if they see it is seen as a slight response delay versus errors. When you have an unplanned failure, then yes, applications are going to see errors, but typically, it is under 30 seconds because we can detect the failure in less than 15 seconds and then repair it immediately.
Srini Penchikala: For most of the use cases, I think that is tolerable, right? So that…
Deepthi Sigireddi: I think 30 seconds when you have an unplanned failure, like a node failure in a cloud or something like that, or an out of memory error, is actually really good because what used to happen is that somebody would have to be paged and they would have to go and do something manually, which would always take at least 30 minutes.
Consistency, Availability, and Distributed Transactions [21:31]
Srini Penchikala: Absolutely, and it’s planned. So usually the users are aware of the change and they’re expecting some outage. We cannot have a database discussion without getting into the transactions. So I know they’ve been the topic of the talk for a long time. Some people love them, some people hate them depending on how we use them. So distributed databases come with a distributed transaction kind of thing. So how does Vitess balance the conflicting capabilities of data consistency and data availability? So if you want one, you can have the other good old cap theorem. So how do the transactions work in the Vitess database?
Deepthi Sigireddi: So I think consistency and availability applies even if you are not using transactions. So I’ll talk about my view of how things happen in Vitess regarding consistency and availability first, and then we’ll talk about transactions. So if people are always reading from the primary instance of a shard, then there is actually no consistency problem in Vitess because that’s always the most up-to-date view of the data. That is the authoritative view of the data. When we get into consistency issues is that in order to scale read workloads, people will run many replicas and they will actually start reading data from the replicas. So then we get into consistency issues because replicas may be caught up to different extents. It may just be one or two seconds. So you have A, which is the primary, you have B and C, and maybe B is just one second behind A and C is two seconds behind B.
And if you fetch some data from B and some from C, you may end up with an inconsistent view of the data, right? The way, historically, Vitess has dealt with it is if consistency is important, you always read from the primary. If it’s not important, you read from the replicas. Now, people have used tricks like if in one user session they’re actually writing anything, then they always read from the primary. If it’s a read-only session, then they read from replicas or they will write something in a transaction and before they read something, they’ll set a timer like 60 seconds. So for the next 60 seconds, just read from the primary so that you get the up-to-date data. After that, you can go back to reading from replicas so people use these kinds of tricks.
What we would really like to do, and this gets a little bit into the roadmap, is to provide a feature in Vitess where you can specify that I want the read-after-write type of consistency, and Vitess can guarantee it whether you are going to a primary or to a replica. That’s a feature that we don’t have yet. Because you always have to have something more to do, no software is a hundred percent done. It’s never perfect so this is one of the things we would like to do. Now, coming to transactions. In a distributed system, there is always a possibility of something ending up being a distributed transaction.
Ideally, most write transactions go to one shard, in which case there’s actually no distributed transaction problem because MySQL itself will provide the transactional guarantees. But maybe in one transaction you’re updating two rows and they live in different shards like it’s a bank transaction and you have a sender and you have a recipient and you want to subtract $10 from the sender’s balance and add $10 to the recipient’s balance. And either both need to happen or neither of them should happen so that you actually have the consistency of the data.
You don’t want money to disappear or get created from thin air. So Square Case actually solved this problem and they came up with a very creative solution without actually having the true distributed transaction support in Vitess. And the way they did it was that they write this in a ledger and they use that to reconcile things after the fact so that even if something failed, they’re always able to repair it. The other thing we do in Vitess, which gets us out of some failure modes, is that when there is a distributed transaction, and suppose it involves four shards, right? We don’t execute all of the writes in parallel.
So we say, “Okay, let’s write to shard one. If it fails, we roll it back. Let’s write to shard two. If that fails, we can roll back both two and one.” So we have open transactions to all four shards and we’ll do one at a time and if at any point there is a failure, we roll back all of them. Once we have done the write when we have to commit the transaction, that is when we issue all the four commits in. Once MySQL has accepted a write, the probability of it rejecting a commit is actually very low, and it will only happen if you have a network problem and you’re not able to reach the server or if the server goes down. So we’ve reduced the universe of possible failures when you’re doing this type of a best effort distributed transaction. So that’s what we call it in Vitess. We are doing best effort distributed transactions. These are not truly atomic distributed transactions. But truly atomic distributed transactions is another thing we have on the roadmap that we would like to get to in the next one or two years.
Database Schema Management Best Practices [27:20]
Srini Penchikala: Switching gears a little bit, so schema management and data contracts are getting a lot of attention lately for a good reason. One of the interesting features is the VTAdmin that’s used for schema management APIs. Can you discuss about this a little bit and also share what you’re seeing as best practices on creating, managing and versioning schemas for the data models?
Deepthi Sigireddi: Okay, so VTAdmin is a part of Vitess that was built a couple of years ago. Previously, we had a very simple web UI, which you would use to see the state of the cluster and get some… It was mostly read-only. You could take some actions on the Vitess components and so on. So VTAdmin is like the next generation new generation administration component for Vitess and it has an API and a UI and all the APIs are well-structured. The problem with the previous thing we had was that we had HTTP APIs. We did not have gRPC APIs, and the response is not well-structured so no one could rely on a contract for the response. This time around, the API responses are well-structured. They are protobuf responses and you can build things relying on them. So we have a structured gRPC API and a VTAdmin API, and the UI actually talks to the API and gets the data and then renders it on the browser.
There are many things we show in the VTAdmin UI, and one of the things is schemas. And right now you can view the schemas, but you can’t really manage them from the UI. Even though we have the APIs behind the scenes, we haven’t built out the UI flows for doing the schema management so that’s something that will also be in the future. It’s in the roadmap. In terms of best practices for managing schemas and schema versions, I know there are lots of tools out there which people use to manage schema versions. Every framework comes with its own internal table, which versions the schemas and so on. In my personal opinion, store your schema and git. Store it in a version control system. And schema changes are actually a very fraught topic for the MySQL community because for at least 10 years or maybe even more, there have been multiple iterations of online schema change tools built for MySQL.
Because historically, MySQL performed very poorly when you tried to do a schema change on a large table, which is under load, right? So there would be locks taken, which would block access to the table completely. You couldn’t write anything to it, things would become unusable. MySQL itself, the MySQL team at Oracle, has been improving this and they have actually made a certain category of schema changes instant so that things don’t get locked or blocked or whatever. So they’ve been improving it, but of course, the community needed this many years before anything was done in Oracle’s product so there have been many iterations of these schema change tools.
Our stance is that schema changes should be non-blocking. They should happen in the background, and ideally, you need to be able to revert them without losing any data. So those are the principles on which Vitess’s online schema change system is built. So Vitess has an online schema change system and you can say that, “Hey, this is the schema change I want to do, start it, just cut it over when it’s done, or wait for me to tell you and I’ll initiate a cut over.” So all those things are possible with Vitess online schema changes and the biggest thing is they’re non-blocking. Your system will keep running.
Obviously, there’ll be a little additional load because you have to copy everything in the background while the system is still running, and you have to allow for that, but you shouldn’t be running the system at a hundred percent CPU and memory all the time anyway, so you should have room for these kinds of things. So that’s basically what we do in Vitess for non-blocking schema changes.
Vitess Database Project Roadmap [31:43]
Srini Penchikala: What are some of the new features coming up on the roadmap of Vitess project and interesting features that you can share with our listeners?
Deepthi Sigireddi: So compatibility and performance are things that we continuously work on so MySQL compatibility. MySQL keeps adding syntax and features, and it’s not like even historically at any point we supported a hundred percent of MySQL’s syntax. It has been an ongoing effort. So there is work happening on that pretty much continuously. A couple of things were added very recently. One is Common Table Expressions and the other one is window functions. The support for them is pretty basic, but we’ll be working on expanding the support for that so that we can cover all of the forms in which those things can be used. Performance is another area. So one of the ways in which Vitess removes stress from the backing MySQL instances is by using connection pooling.
Everybody doesn’t get a dedicated connection because MySQL connections are very heavy. They get a lightweight session to VTGate, which multiplexes everything over a single gRPC connection to a VTTablet, and the VTTablet maintains a pool of connections to MySQL. So we recently shipped a new connection pooling implementation, which makes the performance of the connection pool much better in terms of wait times or even memory utilization because we would tend to cycle through the whole pool and then go back. Whereas, if your queries per second is not very high, you may not even need to use the whole pool. You may not even need to keep a hundred or 150 open connections to MySQL so it’s just a much more efficient connection pooling, and we have ongoing work to improve our benchmarks.
We benchmark query performance on Vitess every night, and we publish those results on a dedicated website called benchmark.vitess.io, and we have an ongoing effort, basically continuous improvement of the benchmarks. Beyond that, in terms of functionality, we recently shipped point-in-time recovery, like the second iteration of point-in-time recovery in Vitess. There had been a previous iteration four years ago, and we’ll be making improvements to that and to online DDL based on user feedback.
Online Resources [34:20]
Srini Penchikala: Okay, thank you, Deepthi. Do you have any recommendations on what online resources our listeners can check out to learn more about distributed database in general or Vitess database in particular?
Deepthi Sigireddi: Vitess is part of the Cloud Native Computing Foundation. Google donated Vitess to CNCF in 2018, and Vitess graduated from CNCF in 2019. We have a website, vitess.io, where we have documentation, we have examples, we have quick start guides to downloading and running Vitess on your laptop. You can run the examples on a laptop. You can run it within Kubernetes. Vitess has always run within Kubernetes, so all of those are available on our website. And we also have links to videos because myself and some of our other maintainers and also community members, people who are using Vitess in their own companies, they go and talk about what they are doing.
So we have links to those videos on our website, but you can also just go to YouTube and search for Vitess and there are plenty of talks, and some of those talks are actually very good in terms of providing an introduction. A much more detailed introduction than what I just did to Vitess features and architecture, and they have some nice diagrams which are easier to consume sometimes than words in terms of how the architecture works.
Srini Penchikala: Before we wrap up today’s discussion, do you have any additional comments or any remarks?
Deepthi Sigireddi: I just want to say that working on Vitess and Open Source has been a really positive experience for me, and I encourage people to get involved in something that’s bigger than just whatever company or team you are working in, because it opens you up to new interactions with people, new experiences. It just enriches life as a software developer.
Srini Penchikala: Sounds good. Thank you, Deepthi. Thank you very much for joining this podcast. It’s been great to discuss one of the very important topics in cloud hosted database solutions, the distributed databases topic. And also, it’s great to talk to a practitioner like you who brings a lot of practical knowledge and experience to these discussions. To our listeners, thank you for listening to this podcast. If you would like to learn more about data engineering topics, check out the AI/ML and Data Engineering community web pages on infoq.com website. I encourage you to listen to the recent podcasts and check out the articles and news items the InfoQ team has been posting on the website. Thank you, Deepthi. Thanks for your time.
Deepthi Sigireddi: Thank you, Srini. This was great.
.
From this page you also have access to our recorded show notes. They all have clickable links that will take you directly to that part of the audio.