How to waste CPU and kill your disk by scaling 100 million inefficiently
I recently ran into the blog post Scaling to 100M: MySQL is a Better NoSQL (from about 6 months ago) and cringed, hard. Go ahead and read it, I’ll wait. There is so much going on in there that I disagree with that I barely know where to start.
I think that what annoys me the most about this post is that it attempts to explain a decision, but does so in a way that clearly shows a lack of depth in the decision making process.
I absolutely agree with the first section: you shouldn’t make your database choice based on hype, or on whatever it is “everyone” is doing. But “if everyone jumped off the roof…” is a bad argument to make when literally everyone is jumping off the roof. If everyone is doing it, there is probably a reason (maybe the building is on fire, maybe it is a one meter drop, maybe there is a pool to jump into, etc.). If that sounds ridiculous, that is because it is.
In particular, I take offense at:
This post will explain why we’ve found that using MySQL for the key/value use case is better than most of the dedicated NoSQL engines, and provide guidelines to follow when using MySQL in this way.
Then they go on to list some of their requirements. I’m assuming that you have read the post, so I’ll answer them directly.
The dataset they are talking about is about 210GB, composed of about 100 million records. In other words, you can fit the entire thing into memory on an AWS instance such as a d2.8xlarge, at a cost of about $1.50 / hour on a 3 year plan. Read that again: their dataset can actually fit in memory.
And even with that, they report a rate of 200K requests per minute, which is funny, because the typical metric is requests per second. At that point we are talking about roughly 3,400 requests / second. But they have three database servers, so we are probably talking about around a thousand requests per second per server.
Oh, and they report average latency numbers of 1 – 1.5 ms. Leaving aside the fact that averages mean nothing (a percentile summary would be far more useful), that is a really long time to process a single request.
I really liked this one:
Our existing system has scaling / throughput / concurrency / latency figures that are impressive for any NoSQL engine.
No, they aren’t. Just to give you some idea, assuming an even distribution of the data, each site entry is about 2KB in size, so their throughput numbers come to less than 10 MB / second.
Now, let us talk about the ways that their approach is actually broken. To start with, they have statements such as this one:
Serial keys impose locks… …Also notice that we are not using serial keys; instead, we are using varchar(50), which stores client-generated GUID values—more about that in the next section.
I mean, okay, so you have no idea how to generate serial keys without requiring locks, because things like that are so hard. I can think of several ways without even trying hard (Snowflake, HiLo, ranges, guid.comb, to name just a few). Now, why would you want to take the time to do something like this? Because using a GUID is… how shall we say it, a horrible idea!
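To make the ranges / HiLo idea concrete, here is a minimal sketch of how it could look in MySQL, assuming a hypothetical id_ranges table (this is an illustration, not anything from their system): each client claims a whole block of keys with a single atomic statement, so nothing is locked while rows are actually being inserted.

    -- Hypothetical HiLo / range allocator: one row per key sequence.
    CREATE TABLE id_ranges (
        name    VARCHAR(32) NOT NULL PRIMARY KEY,
        next_hi BIGINT      NOT NULL
    ) ENGINE = InnoDB;

    INSERT INTO id_ranges (name, next_hi) VALUES ('sites', 0);

    -- Claim a block of 1,000 ids atomically; LAST_INSERT_ID(expr) stores
    -- the claimed value for this connection so it can be read back without
    -- touching the row again.
    UPDATE id_ranges
       SET next_hi = LAST_INSERT_ID(next_hi + 1)
     WHERE name = 'sites';
    SELECT LAST_INSERT_ID() AS hi;

    -- The application then hands out ids (hi - 1) * 1000 + 1 .. hi * 1000
    -- locally, and only touches this table again when the block runs out.

Since LAST_INSERT_ID is connection-scoped, concurrent clients each get their own block without stepping on one another.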
GUIDs are not sorted, which means that you are inserting (at a high rate) a lot of entries into the table in random key order, which forces a lot of page splits, which results in a bigger and deeper B+Tree, which results in a higher cost to find records, which is what you were trying to prevent in the first place.
Allowing sequential inserts can improve your insert performance (and, afterward, your query speed) by orders of magnitude. That is most certainly worth the 30 minutes it would take to code a sequential number solution from scratch, never mind that you can just pick one of the literally dozens of ready made solutions.
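If you want to see the effect on your own data, a rough way to compare (hypothetical tables, MySQL / InnoDB assumed, not the schema from the post) is to load the same rows under both key schemes and look at how much space the clustered index ends up taking:

    -- Hypothetical comparison tables: same payload, different primary keys.
    CREATE TABLE sites_seq (
        id   BIGINT      NOT NULL AUTO_INCREMENT PRIMARY KEY,
        data TEXT
    ) ENGINE = InnoDB;

    CREATE TABLE sites_guid (
        id   VARCHAR(50) NOT NULL PRIMARY KEY,   -- client-generated GUID
        data TEXT
    ) ENGINE = InnoDB;

    -- After loading the same rows into both, compare the on-disk size of
    -- the clustered index (which, in InnoDB, is the table itself):
    SELECT table_name,
           data_length  / 1024 / 1024 AS data_mb,
           index_length / 1024 / 1024 AS index_mb
      FROM information_schema.tables
     WHERE table_schema = DATABASE()
       AND table_name IN ('sites_seq', 'sites_guid');

The randomly keyed table will typically come out noticeably larger for the same data, which is the page-split cost made visible.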
But the thing that really takes the cake is the form that all of their queries take: a sub-select to locate the relevant row (which, with most reasonable query optimizers, will produce exactly the same plan as an inner join), followed by reading the TEXT column holding the site information, which means at least one more disk seek (to load the actual value) after the relevant row has been located.
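The actual query isn’t reproduced here, but the general pattern being described looks roughly like this (table and column names are made up purely for illustration):

    -- Hypothetical illustration of the pattern, not the query from the post:
    -- a lookup through a routing table, written as a sub-select...
    SELECT data
      FROM sites
     WHERE id = (SELECT site_id
                   FROM routes
                  WHERE route = '/some/path');

    -- ...and the equivalent inner join, which a reasonable optimizer will
    -- usually turn into the same plan anyway.
    SELECT s.data
      FROM sites s
      JOIN routes r ON r.site_id = s.id
     WHERE r.route = '/some/path';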
Now, it is possible that MySQL was a good decision for their use case, but what they describe is:
- Not an optimal usage of MySQL in the first place.
- A small data set that can fit on one machine, and can actually fit into memory.
- An inflexible system that is very hard to change (needing another queryable field is now a major operation).
- Low overall performance.
That last one is very important. Just to give you some idea, for the data size they are talking about, a single machine could probably handle the full 200,000 requests that they spread over a minute across their three way cluster, and do so in about one second.
That is assuming I was building a dedicated solution to the problem (a trie for the routing, simple memory mapped storage for the actual site data, with the routing trie holding the position of each site’s data). Of course, you would be crazy to actually do that: having a fast solution is not enough, you also need to handle all of the rest of the associated costs of a database (operations, metrics, backup/restore, replication, etc.).
But the provided solution is just Not Good.
Comments
The latency numbers probably are 95% network.
Smells like the network limited the throughput. They need to increase the DOP (degree of parallelism) or batch the requests.
I wonder why they did not pick Postgres. I cannot imagine it being significantly slower than MySQL, and it's far more advanced. MySQL is an anti-pattern just by itself.
I really don't understand those decisions; it seems like they tried to save money but it actually cost more. There are many good solutions for that problem (Redis, RavenDB, and even memcached), especially since all the data can fit in RAM. Memory is cheap today and 1TB RAM servers are not rare. Go figure...
I really don't get why people continue to choose MySQL. You hear so many stories of how people chose MySQL, traffic increased and they had performance issues, and then resorted to a bunch of hacks to make it work. Just use a better database to start with. I think memcached would have been a better solution to the problem proposed anyway.
I don't understand this myth about GUIDs being orders of magnitude slower than INTs. I was sure it was wrong, but I thought I'd measure it. Here are my results:
So... about twice as slow. 100K records / second is rather ok for their requirements.
The code is at https://github.com/mdpopescu/public/tree/master/SqlBenchmark if you want to make sure I didn't make a mistake.
That was per minute, sorry, not per second. I wish I could edit a comment...
@Marcel - I would guess that with a narrow table (i.e. a small number of columns) the hit of a page split is pretty low. If you had a wide table, physically moving the data during a split would be more expensive. You would probably see the same cost again on each index as well.
@marcel
a) What is the cost of retrieving a random record from the database? b) Only 100k rows? Try 100 million; 100k is going to fit into memory. c) You could also look at the query plan differences.
Yeah, inserting 100,000 guids into a one-column table with no nonclustered indexes doesn't really demonstrate anything of value. If you want to see the performance problems of using guids as identifiers, you need to have a wider table & multiple nonclustered indexes that use the guid as the clustering key. The amount of work required to keep each nonclustered index in order does not increase linearly with the size of the data set.
Kimberly Tripp has been talking about this for many years. Back in 2010 she showed some tests with a sales database containing a base set of 6.7 million rows. Then, adding 10,000 rows took 17 seconds for a copy of the DB using ints as identifiers, and over 5 minutes for the copy using GUIDs. Then she ran another insert of 10k rows and the times for the second run went to 24 seconds for ints and 7+ minutes for guid ids.
http://www.sqlskills.com/blogs/kimberly/disk-space-is-cheap/
Now imagine if the data set were 100 million rows. It's really easy to see how the difference could reach multiple orders of magnitude, as Oren said.
Ted, I agree. Why use MySQL when you could be using PostgreSQL.
@marcel
Would it make a difference in performance if you used batches of transactions?
Is there an implicit COMMIT in your code?
Marcel, So you see a 100% cost for using guids, and you are using an extremely small table. The rows are small, and the entire thing can fit into memory with no issue. Now, try to do the same using SqlBulkCopy, but do that over 50GB or so. Afterward, try to issue a few queries (by id), and see the difference.
Going up to ten million records and using bulk inserts, the insert times are indeed scarier - almost two orders of magnitude worse in the case of GUID or string keys.
Are the bulk inserts relevant though? I thought they were inserting these records one at a time, following external events, though I might have misunderstood the article.
Surprisingly, the query time is almost identical. (I've only done 1000 queries though, not 10 million.)
All in all, an interesting exercise; I'll definitely keep it in mind for the next time I'm working with huge tables. (I used to work for a company that kept records for pretty much everyone in the US, so a lot of tables had hundreds of millions of records.)
Just to give some insight into the actual cost: in the sequential case, the cost on 'any reasonably coded database engine' should be dominated by networking, housekeeping and/or memory speed. The reason is that, given the sequential nature of the insert, the number of disk seeks / accesses is low. Why? Because you can pack a bunch of GUIDs into a data page, and you usually keep tree traversal to a minimum. However, in the random case (as long as you don't have enough memory to put the entire damn thing in memory), the cost should be dominated by tree traversal (aka disk page faults).
So, no matter what database engine you use, chances are that sequential beats random and ints beat GUIDs.
@Marcel The idea behind the bulk inserts is that w/ enough users hitting the database, inserting one-at-a-time looks like a bulk insert. Think walmart or any big box store on Black Friday/Cyber Monday.
http://corporate.walmart.com/_news_/news-archive/2014/12/02/walmart-breaks-its-online-records-on-cyber-monday-2014
Walmart did 1.5 billion page views in 2014, so if you figure 10% conversion, that's 150 million orders in 24 (really 12) hours, or about 1700 inserts/sec
"GUIDs are not sorted, which means that you are inserting (at a high rate) a lot of entries to the table, which forces a lot of page splits, which results in a bigger and deeper B+Tree, which result in a higher cost to find records, which is what you were trying to prevent in the first place."
Would the equality "=" operator use a hash index or a B-Tree in the proposed queries?
Marcel Popescu: Not sure about MySQL, but SQL Server has NEWSEQUENTIALID, which at least in theory would help somewhat, allowing a clustered PK index and better page handling. The downside of this approach is that you still cannot get the PK of the record before it is saved to the DB. Of course, you can also use the underlying Windows UuidCreateSequential function, which NEWSEQUENTIALID wraps.
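For reference, a minimal T-SQL sketch of what that looks like (hypothetical table, not from the post); NEWSEQUENTIALID() can only be used as a column default, which is exactly why the key isn't known before the insert:

    -- Hypothetical SQL Server table using NEWSEQUENTIALID() as the default
    -- for a clustered primary key: values increase on a given machine, so
    -- pages fill up in order instead of splitting all over the tree.
    CREATE TABLE dbo.Sites (
        Id   UNIQUEIDENTIFIER NOT NULL
             CONSTRAINT DF_Sites_Id DEFAULT NEWSEQUENTIALID(),
        Data NVARCHAR(MAX)    NULL,
        CONSTRAINT PK_Sites PRIMARY KEY CLUSTERED (Id)
    );

    -- The generated key is only available after the insert, e.g. via OUTPUT:
    INSERT INTO dbo.Sites (Data)
    OUTPUT inserted.Id
    VALUES (N'...');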
With only 100M rows, that's a little more than 1,000 inserts a second for a full day. Focusing on inserts seems like a distraction for this use case.
Most modern flash does at least 80,000 IOPS, with higher-end commercial offerings doing 10x that... You can likely get better numbers with little to no memory when running on SSDs.
Even AWS gives you 3,000 IOPS per TB of disk allocated with general purpose SSD.
Marcel Popescu: MSSqlServer has the unfortunate default of automatically applying the CLUSTERED attribute to Primary Key constraints even if the column type is UNIQUEIDENTIFIER.
For the life of me I can't figure out why you would ever want to cluster on a GUID column.
It makes a difference as follows:
With CLUSTERED PRIMARY KEYS:
EYQBOD 00:00:01.0323501 968663.634555758 operations per second
VHECDG 00:00:15.2126126 65734.9283975062 operations per second
OADNEJ 00:00:16.9577344 58970.1416717554 operations per second
With NONCLUSTERED PRIMARY KEYS:
APYWPU 00:00:00.9581030 1043729.11889432 operations per second
KCDQJI 00:00:11.6992822 85475.3294180732 operations per second
LEYLQG 00:00:18.3282841 54560.4811963821 operations per second
It's pretty clear you want your UNIQUEIDENTIFIER columns NONCLUSTERED...
Change to Code:
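The commenter's actual code change isn't included above. Purely as a rough illustration (made-up table names, T-SQL assumed), the difference being measured is whether the GUID primary key is also the clustering key:

    -- Illustration only, not the commenter's benchmark code.
    -- Clustered PK on the GUID: rows are physically ordered by a random key.
    CREATE TABLE dbo.ItemsClustered (
        Id      UNIQUEIDENTIFIER NOT NULL,
        Payload NVARCHAR(400)    NULL,
        CONSTRAINT PK_ItemsClustered PRIMARY KEY CLUSTERED (Id)
    );

    -- Nonclustered PK on the GUID: the table stays a heap, so the random
    -- key only churns a narrow index rather than the row order itself.
    CREATE TABLE dbo.ItemsNonclustered (
        Id      UNIQUEIDENTIFIER NOT NULL,
        Payload NVARCHAR(400)    NULL,
        CONSTRAINT PK_ItemsNonclustered PRIMARY KEY NONCLUSTERED (Id)
    );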
Marcel, Bulk insert is simply an easy way to generate a lot of load on the db, so you can see just the cost of making this change, without the costs of network, query parsing, etc.
Regarding queries, just running 1,000 queries isn't enough. You need to run about half a million and look at the results, but even so you can see that the int is MUCH faster.
Kyle, Typically indexes in databases use a B+Tree; using a hash index is a rather unusual situation. Note that a hash doesn't allow range queries, which can sometimes be very useful, and it also has atypical growth costs: when you need to increase the hash table size, you may be paying a LOT of I/O.
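A quick way to see the range query limitation for yourself, assuming MySQL's MEMORY engine (which actually supports hash indexes; InnoDB will quietly give you a B-tree instead) and a made-up table:

    -- MEMORY engine table with an explicit hash index (hypothetical).
    CREATE TABLE lookup_demo (
        id  INT NOT NULL,
        val VARCHAR(50),
        PRIMARY KEY (id) USING HASH
    ) ENGINE = MEMORY;

    -- An equality probe can use the hash index directly...
    EXPLAIN SELECT val FROM lookup_demo WHERE id = 42;

    -- ...but a range predicate cannot be answered by a hash structure,
    -- so this one ends up scanning.
    EXPLAIN SELECT val FROM lookup_demo WHERE id BETWEEN 40 AND 50;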
Sander, See the links in the post. A sequential guid helps, sure, but it is still slower than an int.
Rich, Inserts are easy to measure, but you'll actually feel this in every single query. When your B+Tree is fragmented, you are forcing more nesting, and that means more work and more I/O.