How to waste CPU and kill your disk by scaling 100 million inefficiently
I recently ran into the blog post Scaling to 100M: MySQL is a Better NoSQL (from about 6 months ago) and cringed, hard. Go ahead and read it, I’ll wait. There is so much going on in there that I disagree with that I barely know where to start.
I think that what annoys me the most about this post is that it attempts to explain a decision, but does so in a way that clearly shows a lack of depth in the decision making process.
I absolutely agree with the first section: you shouldn’t make your database choice based on hype, or on whatever it is “everyone” is doing. But “if everyone jumped off the roof…” is a bad argument to make when literally everyone is jumping off the roof. If everyone is doing it, there is probably a reason (maybe the building is on fire, maybe it is a one meter drop, maybe there is a pool to jump into, etc.). If that sounds ridiculous, that is because it is.
In particular, I take offense at:
This post will explain why we’ve found that using MySQL for the key/value use case is better than most of the dedicated NoSQL engines, and provide guidelines to follow when using MySQL in this way.
Then they go on to list some of their requirements. I’m assuming that you have read the post, so I’ll answer them directly.
The dataset they are talking about is about 210GB, composed of about 100 million records. In other words, you can fit the entire thing into memory on an AWS instance such as a d2.8xlarge, at a cost of about $1.50 / hour on a 3 year plan. Read that again: their dataset can actually fit in memory.
And even with that, they report a rate of 200K requests per minute, which is funny, because the typical metric is requests per second. At that point we are talking about roughly 3,400 requests / second. But they have three database servers, so we are probably talking about around a thousand requests per second per server.
Oh, and they report average latency numbers of 1 – 1.5 ms. Leaving aside the fact that averages mean nothing (a percentile summary would be far more useful), that is a really long time to process a single request.
I really liked this one:
Our existing system has scaling / throughput / concurrency / latency figures that are impressive for any NoSQL engine.
No, they aren’t. Just to give you some idea, assuming an even distribution of the data, each site entry is about 2KB in size, so their throughput numbers come to less than 10 MB / second.
Now, let us talk about the ways that their approach is actually broken. To start with, they have statements such as this one:
Serial keys impose locks… …Also notice that we are not using serial keys; instead, we are using varchar(50), which stores client-generated GUID values—more about that in the next section.
I mean, okay, so you have no idea how to generate serial keys without requiring locks, because things like that are so hard. I can think of several ways without even trying hard (Snowflake, HiLo, ranges, guid.comb, to name just a few). Now, why would you want to take the time to do something like this? Because using a GUID is… how shall we say it, a horrible idea!
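To make the ranges / HiLo idea concrete, here is a minimal sketch of how it could look in MySQL, assuming a hypothetical id_ranges table (this is an illustration, not anything from their system): each client claims a whole block of keys with a single atomic statement, so nothing is locked while rows are actually being inserted.

    -- Hypothetical HiLo / range allocator: one row per key sequence.
    CREATE TABLE id_ranges (
        name    VARCHAR(32) NOT NULL PRIMARY KEY,
        next_hi BIGINT      NOT NULL
    ) ENGINE = InnoDB;

    INSERT INTO id_ranges (name, next_hi) VALUES ('sites', 0);

    -- Claim a block of 1,000 ids atomically; LAST_INSERT_ID(expr) stores
    -- the claimed value for this connection so it can be read back without
    -- touching the row again.
    UPDATE id_ranges
       SET next_hi = LAST_INSERT_ID(next_hi + 1)
     WHERE name = 'sites';
    SELECT LAST_INSERT_ID() AS hi;

    -- The application then hands out ids (hi - 1) * 1000 + 1 .. hi * 1000
    -- locally, and only touches this table again when the block runs out.

Since LAST_INSERT_ID is connection-scoped, concurrent clients each get their own block without stepping on one another.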
GUIDs are not sorted, which means that you are inserting (at a high rate) a lot of entries into the table in random key order, which forces a lot of page splits, which results in a bigger and deeper B+Tree, which results in a higher cost to find records, which is what you were trying to prevent in the first place.
Allowing sequential inserts can improve your insert performance (and, afterward, your query speed) by orders of magnitude. That is most certainly worth the 30 minutes it would take to code a sequential number solution from scratch, never mind that you can just pick one of the literally dozens of ready made solutions.
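If you want to see the effect on your own data, a rough way to compare (hypothetical tables, MySQL / InnoDB assumed, not the schema from the post) is to load the same rows under both key schemes and look at how much space the clustered index ends up taking:

    -- Hypothetical comparison tables: same payload, different primary keys.
    CREATE TABLE sites_seq (
        id   BIGINT      NOT NULL AUTO_INCREMENT PRIMARY KEY,
        data TEXT
    ) ENGINE = InnoDB;

    CREATE TABLE sites_guid (
        id   VARCHAR(50) NOT NULL PRIMARY KEY,   -- client-generated GUID
        data TEXT
    ) ENGINE = InnoDB;

    -- After loading the same rows into both, compare the on-disk size of
    -- the clustered index (which, in InnoDB, is the table itself):
    SELECT table_name,
           data_length  / 1024 / 1024 AS data_mb,
           index_length / 1024 / 1024 AS index_mb
      FROM information_schema.tables
     WHERE table_schema = DATABASE()
       AND table_name IN ('sites_seq', 'sites_guid');

The randomly keyed table will typically come out noticeably larger for the same data, which is the page-split cost made visible.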
But the thing that really takes the cake is the form that all of their queries take: a sub-select to locate the relevant row (which, with most reasonable query optimizers, will produce exactly the same plan as an inner join), followed by reading the TEXT column holding the site information, which means at least one more disk seek (to load the actual value) after the relevant row has been located.
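The actual query isn’t reproduced here, but the general pattern being described looks roughly like this (table and column names are made up purely for illustration):

    -- Hypothetical illustration of the pattern, not the query from the post:
    -- a lookup through a routing table, written as a sub-select...
    SELECT data
      FROM sites
     WHERE id = (SELECT site_id
                   FROM routes
                  WHERE route = '/some/path');

    -- ...and the equivalent inner join, which a reasonable optimizer will
    -- usually turn into the same plan anyway.
    SELECT s.data
      FROM sites s
      JOIN routes r ON r.site_id = s.id
     WHERE r.route = '/some/path';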
Now, it is possible that MySQL was a good decision for their use case, but what they describe is:
- Not an optimal usage of MySQL in the first place.
- A small data set that can fit on one machine, and can actually fit into memory.
- An inflexible system that is very hard to change (needing another queryable field is now a major operation).
- Low overall performance.
That last one is very important. Just to give you some idea, for the data size they are talking about, a single machine could probably handle the full 200,000 requests that they spread over a minute across their three way cluster, and do so in about one second.
That is assuming I was building a dedicated solution to the problem (a trie for the routing, simple memory mapped storage for the actual site data, with the routing trie holding the position of each site’s data). Of course, you would be crazy to actually do that: having a fast solution is not enough, you also need to handle all of the rest of the associated costs of a database (operations, metrics, backup/restore, replication, etc.).
But the provided solution is just Not Good.
Comments
The latency numbers probably are 95% network.
Smells like the network limited the throughput. They need to increase the DOP (degree of parallelism) or batch the requests.
I wonder why they did not pick Postgres. I cannot imagine it being significantly slower than MySQL, and it's far more advanced. MySQL is an anti-pattern just by itself.
I really don't understand those decisions; it seems like they tried to save money but it actually cost more. There are many good solutions for that problem (Redis, RavenDB, and even memcached), especially since all the data can fit in RAM. Memory is cheap today and 1TB RAM servers are not rare. Go figure...
I really don't get why people continue to choose MySQL. You hear so many stories of how people chose MySQL, traffic increased and they had performance issues, and then resorted to a bunch of hacks to make it work. Just use a better database to start with. I think memcached would have been a better solution to the problem proposed anyway.
I don't understand this myth about GUIDs being orders of magnitude slower than INTs. I was sure it was wrong, but I thought I'd measure it. Here are my results:
So... about twice as slow. 100K records / second is rather ok for their requirements.
The code is at https://github.com/mdpopescu/public/tree/master/SqlBenchmark if you want to make sure I didn't make a mistake.
That was per minute, sorry, not per second. I wish I could edit a comment...
@Marcel - I would guess that with a narrow table (i.e. a small number of columns) the hit of a page split is pretty low. If you had a wide table, physically moving the data during a split would be more expensive. You would probably see the same cost again on each index as well.
@marcel
a) What is the cost of retrieving a random record from the database? b) Only 100k rows? Try 100 million; 100k is going to fit into memory. c) You could also look at the query plan differences.
Yeah, inserting 100,000 guids into a one-column table with no nonclustered indexes doesn't really demonstrate anything of value. If you want to see the performance problems of using guids as identifiers, you need to have a wider table & multiple nonclustered indexes that use the guid as the clustering key. The amount of work required to keep each nonclustered index in order does not increase linearly with the size of the data set.
Kimberly Tripp has been talking about this for many years. Back in 2010 she showed some tests with a sales database containing a base set of 6.7 million rows. Then, adding 10,000 rows took 17 seconds for a copy of the DB using ints as identifiers, and over 5 minutes for the copy using GUIDs. Then she ran another insert of 10k rows and the times for the second run went to 24 seconds for ints and 7+ minutes for guid ids.
http://www.sqlskills.com/blogs/kimberly/disk-space-is-cheap/
Now imagine if the data set were 100 million rows. It's really easy to see how the difference could reach multiple orders of magnitude, as Oren said.
Ted, I agree. Why use MySQL when you could be using PostgreSQL.
@marcel
Would it make a difference in performance if you used batches of transactions?
Is there an implicit COMMIT in your code?
Marcel, So you see a 100% cost for using guids, and you are using an extremely small table. The rows are small, and the entire thing can fit into memory with no issue. Now, try to do the same using SqlBulkCopy, but do that over 50GB or so. Afterward, try to issue a few queries (by id), and see the difference.
Going up to ten million records and using bulk inserts, the insert times are indeed scarier - almost two orders of magnitude worse in the case of GUID or string keys.
Are the bulk inserts relevant though? I thought they were inserting these records one at a time, following external events, though I might have misunderstood the article.
Surprisingly, the query time is almost identical. (I've only done 1000 queries though, not 10 million.)
All in all, an interesting exercise; I'll definitely keep it in mind for the next time I'm working with huge tables. (I used to work for a company that kept records for pretty much everyone in the US, so a lot of tables had hundreds of millions of records.)
Just to give some insight into the actual cost: in the sequential case, the cost on 'any reasonably coded database engine' should be dominated by networking, housekeeping and/or memory speed. The reason is that, given the sequential nature of the insert, the number of disk seeks / accesses is low. Why? Because you can pack a bunch of GUIDs into a data page, and you usually keep tree traversal to a minimum. However, in the random case (as long as you don't have enough memory to put the entire damn thing in memory), the cost should be dominated by tree traversal (aka disk page faults).
So, no matter what database engine you use, chances are that sequential beats random and ints beat GUIDs.
@Marcel The idea behind the bulk inserts is that w/ enough users hitting the database, inserting one-at-a-time looks like a bulk insert. Think walmart or any big box store on Black Friday/Cyber Monday.
http://corporate.walmart.com/_news_/news-archive/2014/12/02/walmart-breaks-its-online-records-on-cyber-monday-2014
Walmart did 1.5 billion page views in 2014, so if you figure 10% conversion, that's 150 million orders in 24 (really 12) hours, or about 1700 inserts/sec
"GUIDs are not sorted, which means that you are inserting (at a high rate) a lot of entries to the table, which forces a lot of page splits, which results in a bigger and deeper B+Tree, which result in a higher cost to find records, which is what you were trying to prevent in the first place."
Would the equality "=" operator use a hash index or a B-Tree in the proposed queries?
Marcel Popescu: Not sure about MySQL, but SQL Server has NEWSEQUENTIALID, which at least in theory would help somewhat, allowing a clustered PK index and better page handling. The downside of this approach is that you still cannot get the PK of the record before it is saved to the DB. Of course, you can also use the underlying Windows UuidCreateSequential function, which NEWSEQUENTIALID wraps.
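For reference, a minimal T-SQL sketch of what that looks like (hypothetical table, not from the post); NEWSEQUENTIALID() can only be used as a column default, which is exactly why the key isn't known before the insert:

    -- Hypothetical SQL Server table using NEWSEQUENTIALID() as the default
    -- for a clustered primary key: values increase on a given machine, so
    -- pages fill up in order instead of splitting all over the tree.
    CREATE TABLE dbo.Sites (
        Id   UNIQUEIDENTIFIER NOT NULL
             CONSTRAINT DF_Sites_Id DEFAULT NEWSEQUENTIALID(),
        Data NVARCHAR(MAX)    NULL,
        CONSTRAINT PK_Sites PRIMARY KEY CLUSTERED (Id)
    );

    -- The generated key is only available after the insert, e.g. via OUTPUT:
    INSERT INTO dbo.Sites (Data)
    OUTPUT inserted.Id
    VALUES (N'...');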
With only 100M rows, that's a little more than 1,000 inserts a second for a full day. Focusing on inserts seems like a distraction for this use case.
Most modern flash does at least 80,000 IOPS, with higher-end commercial offerings doing 10x that... You can likely get better numbers with little to no memory when running on SSDs.
Even AWS gives you 3,000 IOPS per TB of disk allocated with general purpose SSD.
Marcel Popescu: MSSqlServer has the unfortunate default of automatically applying the CLUSTERED attribute to Primary Key constraints even if the column type is UNIQUEIDENTIFIER.
For the life of me I can't figure out why you would ever want to cluster on a GUID column.
It makes a difference as follows:
With CLUSTERED PRIMARY KEYS:
EYQBOD 00:00:01.0323501 968663.634555758 operations per second
VHECDG 00:00:15.2126126 65734.9283975062 operations per second
OADNEJ 00:00:16.9577344 58970.1416717554 operations per second
With NONCLUSTERED PRIMARY KEYS:
APYWPU 00:00:00.9581030 1043729.11889432 operations per second
KCDQJI 00:00:11.6992822 85475.3294180732 operations per second
LEYLQG 00:00:18.3282841 54560.4811963821 operations per second
It's pretty clear you want your UNIQUEIDENTIFIER columns NONCLUSTERED...
Change to Code:
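The commenter's actual code change isn't included above. Purely as a rough illustration (made-up table names, T-SQL assumed), the difference being measured is whether the GUID primary key is also the clustering key:

    -- Illustration only, not the commenter's benchmark code.
    -- Clustered PK on the GUID: rows are physically ordered by a random key.
    CREATE TABLE dbo.ItemsClustered (
        Id      UNIQUEIDENTIFIER NOT NULL,
        Payload NVARCHAR(400)    NULL,
        CONSTRAINT PK_ItemsClustered PRIMARY KEY CLUSTERED (Id)
    );

    -- Nonclustered PK on the GUID: the table stays a heap, so the random
    -- key only churns a narrow index rather than the row order itself.
    CREATE TABLE dbo.ItemsNonclustered (
        Id      UNIQUEIDENTIFIER NOT NULL,
        Payload NVARCHAR(400)    NULL,
        CONSTRAINT PK_ItemsNonclustered PRIMARY KEY NONCLUSTERED (Id)
    );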
Marcel, Bulk insert is simply an easy way to generate a lot of load on the db, so you can see just the cost of making this change, without the costs of network, query parsing, etc.
Regarding queries, just running 1,000 queries isn't enough. You need to run about half a million and look at the results, but even so you can see that the int is MUCH faster.
Kyle, Typically indexes in databases use a B+Tree; using a hash index is a rather unusual situation. Note that a hash doesn't allow range queries, which can sometimes be very useful, and it also has atypical growth costs: when you need to increase the hash table size, you may be paying a LOT of I/O.
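A quick way to see the range query limitation for yourself, assuming MySQL's MEMORY engine (which actually supports hash indexes; InnoDB will quietly give you a B-tree instead) and a made-up table:

    -- MEMORY engine table with an explicit hash index (hypothetical).
    CREATE TABLE lookup_demo (
        id  INT NOT NULL,
        val VARCHAR(50),
        PRIMARY KEY (id) USING HASH
    ) ENGINE = MEMORY;

    -- An equality probe can use the hash index directly...
    EXPLAIN SELECT val FROM lookup_demo WHERE id = 42;

    -- ...but a range predicate cannot be answered by a hash structure,
    -- so this one ends up scanning.
    EXPLAIN SELECT val FROM lookup_demo WHERE id BETWEEN 40 AND 50;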
Sander, See the links in the post. A sequential guid helps, sure, but it is still slower than an int.
Rich, Inserts are easy to measure, but you'll actually feel this in every single query. When your B+Tree is fragmented, you are forcing more nesting, and that means more work and more I/O.