How to waste CPU and kill your disk by scaling 100 million inefficiently
I recently run into this blog post Scaling to 100M: MySQL is a Better NoSQL (from about 6 months ago) and cringed, hard. Go ahead and read it, I’ll wait. There are so much stuff going on here that I disagree with that I barely even know where to start.
I think that what annoys me the most about this post is that it attempts to explain a decision, but does that in a way that clearly shows a lack of depth in the decision making process.
I absolutely agree on the first section, you shouldn’t make your database choice based on hype, or by whatever it is “everyone” is doing. But saying that “if everyone jumps off the roof…” is generally a bad argument to make when literally everyone jumps off the roof (maybe it is on fire, maybe it is 1 meter drop, maybe it has a pool to jump into, etc). If this sounds ridiculous, this is because it is.
In particular, I take offense at:
This post will explain why we’ve found that using MySQL for the key/value use case is better than most of the dedicated NoSQL engines, and provide guidelines to follow when using MySQL in this way.
Then they go to list some of their requirements. I’m assuming that you read the post, so I’ll answer it directly.
The dataset they are talking about is about 210GB, and is composed of about 100 million records. In other words, you can fit that entire thing to memory in an AWS instance such as d2.8xlarge, at a cost of about 1.5$ / hour for a 3 year plan. Read this again, their dataset can actually fit in memory.
And even with that, they report a rate of 200K request per minute, which is funny, because the typical metric is looking at requests per second. At which point we are talking about around 3,400 req/second. But they have three database servers, so we are probably talking about around a thousand requests per second overall.
Oh, and they report an average of 1 – 1.5 ms latency numbers. Leaving aside the fact that averages means nothing (a percentiles summary would work much better), that is a really long time to process a single request.
I really liked this one:
Our existing system has scaling / throughput / concurrency / latency figures that are impressive for any NoSQL engine.
No, it isn’t. Just to give you some idea, assuming even distribution of the data, each site entry is about 2KB in size, so their throughput numbers are less than 10 MB / second.
Now, let us talk about the ways that their approach is actually broken. To start with, they have statements such as this one:
Serial keys impose locks… …Also notice that we are not using serial keys; instead, we are using varchar(50), which stores client-generated GUID values—more about that in the next section.
I mean, okay, so you have no idea how to generate serial keys without requiring locks, because things like that are so hard. I can think of several ways without even trying hard (Snowflake, HiLo, ranges, guid.comb, to name just a few). Now, why would you want want to take the time to do something like this? Because using a GUID is… how shall we say it, a horrible idea!
GUIDs are not sorted, which means that you are inserting (at a high rate) a lot of entries to the table, which forces a lot of page splits, which results in a bigger and deeper B+Tree, which result in a higher cost to find records, which is what you were trying to prevent in the first place.
Allowing sequential inserts can improve your insert performance (and afterward, the query speed) by orders of magnitude. So that is most certainly something that you really want to invest the 30 minutes it takes to code a sequential number solution from scratch, if you can use the literally dozens of ready made solutions.
But the thing that is really takes the cake is the fact that all of their queries take the following form:
So a sub-select is required to run this query (which with most reasonable query optimizers will be exactly equivalent to the query plan of an inner join), but the usage of TEXT data in the site information will mean at least another disk seek (to load the actual value) after the relevant row was located.
Now, it is possible that MySQL was a good decision for their use case, but this is:
- Not an optimal usage of MySQL in the first place.
- Small data set, can fit on one machine, can actually fit into memory
- Inflexible system, very hard to change (needing another queryable field is now a major operation)
- Low overall performance
That last one is very important. Just to give you some idea, for the size that they are talking about, we can probably handle the full 200,000 request per minute that they are talking about on their three way cluster using a single machine, and doing that in one second.
Assuming that I’m trying to find a dedicated solution to the problem (trie for the routing, simple memory mapped storage for the actual site data, where the routing trie will contain the position of the data). Of course, you would be crazy to do that. Just having a speedy solution for this is not enough, you also need to handle all of the rest of the associated costs of a database (operations, metrics, backup/restore, replication, etc).
But the provided solution is just Not Good.