I read this blog post from Discord, which presents a really interesting approach to a problem that they ran into. Basically, in the cloud you have the choice between fast and ephemeral disks and persistent (and much slower) disks.
Technically speaking you can try to get really fast persistent disks, but those are freaking expensive. Ultra disks on Azure and io2 disks on AWS and Extreme persistent disks for GCP. The cost for a single 1TB disk with 25K IOPS would be around 1,800 USD / month, for example. That is just for the disk!
To give some context, an r6gd.16xlarge instance with 64 cores and 512 GB (!) of RAM as well as 3.8 TB of NVMe drive will cost you less than that. That machine can also do 80K IOPS, for context.
At Discord scale, they ran into the limitations of I/O in the cloud with their database and had to come up with a solution for that.
My approach would be to… not do anything, to be honest. They are using a database (ScyllaDB) that is meant to run on commodity hardware. That means that losing a node is an expected scenario. The rest of the nodes in the cluster will be able to pitch in and recover the data when the node comes back or is replaced. That looks like the perfect scenario for running with fast ephemeral disks, no?
There are two problems with ephemeral disks. First, they are ephemeral , wich means that a hardware problem can make the whole disk go poof. Not an issue, that is why you are using a replicated database, no? We’ll get to that.
Another issue that is raised by Discord is that they rely on disk snapshots for backups and other important workflows. The nice thing about snapshotting a cloud disk is that it is typically fast and cheap to do. Otherwise you need to deal with higher level backup systems at the database layer. The issue is that you probably do want that, because restoring a system from a set of disk snapshots that were taken at different times and at various states is likely to be… challenging.
Discord's solution to this is to create a set of ephemeral disks that are mirrored into persistent disks. Reads are going to the fast disks and writes are spread around. A failure on the ephemeral disks will lead to a recovery of the data from the network disk.
Their post has the gory details and a lot more explanation aside. I wanted to write this post for a simple reason. As a database guy, this system scares me to pieces.
The issue is that I don’t know what kind of data consistency we can expect in such an environment. Do I get any guarantees about the equivalence of data between the fast disks and the network disk? Can they diverge from one another? What happens when you have partial errors?
Consider a transaction that modifies three different pages that end up going to three different fast disks + the network disk. If one of the fast disks has an error during write, what is written to the persistent disk?
Can I get an out-of-date read from this system if we read from the network disk for some reason? That may mean that two pages that were written in one transaction are coming back as different versions. That will likely violate assumptions and invariants and can lead to all sorts of… interesting problems.
Given that this system is meant to handle the failure modes, it sounds really scary because it is an additional layer of complexity (and one that the database is unaware of) to deal with.
Aside from the snapshots, I assume that the reason for this behavior is to avoid the cost of rebuilding a node when there is a failure. I don’t have enough information to say what is the failure rate and the impact on the overall system, but the solution provided is elegant, beautiful, and quite frankly, pretty scary to me.
There have been quite a few unknowns that we had to deal with in the realm of storage. But this adds a whole new layer of things that can break.
More posts in "re" series:
- (16 Aug 2022) How Discord supercharges network disks for extreme low latency
- (02 Jun 2022) BonsaiDb performance update
- (14 Jan 2022) Are You Sure You Want to Use MMAP in Your Database Management System?
- (09 Dec 2021) Why IndexedDB is slow and what to use instead
- (23 Jun 2021) The performance regression odyssey
- (27 Oct 2020) Investigating query performance issue in RavenDB
- (27 Dec 2019) Writing a very fast cache service with millions of entries
- (26 Dec 2019) Why databases use ordered indexes but programming uses hash tables
- (12 Nov 2019) Document-Level Optimistic Concurrency in MongoDB
- (25 Oct 2019) RavenDB. Two years of pain and joy
- (19 Aug 2019) The Order of the JSON, AKA–irresponsible assumptions and blind spots
- (10 Oct 2017) Entity Framework Core performance tuning–Part III
- (09 Oct 2017) Different I/O Access Methods for Linux
- (06 Oct 2017) Entity Framework Core performance tuning–Part II
- (04 Oct 2017) Entity Framework Core performance tuning–part I
- (26 Apr 2017) Writing a Time Series Database from Scratch
- (28 Jul 2016) Why Uber Engineering Switched from Postgres to MySQL
- (15 Jun 2016) Why you can't be a good .NET developer
- (12 Nov 2013) Why You Should Never Use MongoDB
- (21 Aug 2013) How memory mapped files, filesystems and cloud storage works
- (15 Apr 2012) Kiip’s MongoDB’s experience
- (18 Oct 2010) Diverse.NET
- (10 Apr 2010) NoSQL, meh
- (30 Sep 2009) Are you smart enough to do without TDD
- (17 Aug 2008) MVC Storefront Part 19
- (24 Mar 2008) How to create fully encapsulated Domain Models
- (21 Feb 2008) Versioning Issues With Abstract Base Classes and Interfaces
- (18 Aug 2007) Saving to Blob
- (27 Jul 2007) SSIS - 15 Faults Rebuttal
- (29 May 2007) The OR/M Smackdown
- (06 Mar 2007) IoC and Average Programmers
- (19 Sep 2005) DLinq Mapping