The Guts n’ Glory of Database Internals: Understanding durability with hard disks

architecture (629) rss
bugs (451) rss
challenges (137) rss
community (391) rss
databases (482) rss
design (905) rss
development (673) rss
hibernating-practices (75) rss
miscellaneous (593) rss
performance (399) rss
programming (1125) rss
raven (1492) rss
ravendb.net (582) rss
reviews (184) rss

2026
- June (2)
- May (2)
- April (5)
- February (4)
- January (5)
2025
- December (8)
- November (4)
- October (4)
- September (10)
- August (6)
- July (7)
- June (7)
- May (10)
- April (10)
- March (10)
- February (7)
- January (12)
2024
- December (3)
- November (2)
- October (1)
- September (3)
- August (5)
- July (10)
- June (4)
- May (6)
- April (2)
- March (8)
- February (2)
- January (14)
2023
- December (4)
- October (4)
- September (6)
- August (12)
- July (5)
- June (15)
- May (3)
- April (11)
- March (5)
- February (5)
- January (8)
2022
- December (5)
- November (7)
- October (7)
- September (9)
- August (10)
- July (15)
- June (12)
- May (9)
- April (14)
- March (15)
- February (13)
- January (16)
2021
- December (23)
- November (20)
- October (16)
- September (6)
- August (16)
- July (11)
- June (16)
- May (4)
- April (10)
- March (11)
- February (15)
- January (14)
2020
- December (10)
- November (13)
- October (15)
- September (6)
- August (9)
- July (9)
- June (17)
- May (15)
- April (14)
- March (21)
- February (16)
- January (13)
2019
- December (17)
- November (14)
- October (16)
- September (10)
- August (8)
- July (16)
- June (11)
- May (13)
- April (18)
- March (12)
- February (19)
- January (23)
2018
- December (15)
- November (14)
- October (19)
- September (18)
- August (23)
- July (20)
- June (20)
- May (23)
- April (15)
- March (23)
- February (19)
- January (23)
2017
- December (21)
- November (24)
- October (22)
- September (21)
- August (23)
- July (21)
- June (24)
- May (21)
- April (21)
- March (23)
- February (20)
- January (23)
2016
- December (17)
- November (18)
- October (22)
- September (18)
- August (23)
- July (22)
- June (17)
- May (24)
- April (16)
- March (16)
- February (21)
- January (21)
2015
- December (5)
- November (10)
- October (9)
- September (17)
- August (20)
- July (17)
- June (4)
- May (12)
- April (9)
- March (8)
- February (25)
- January (17)
2014
- December (22)
- November (19)
- October (21)
- September (37)
- August (24)
- July (23)
- June (13)
- May (19)
- April (24)
- March (23)
- February (21)
- January (24)
2013
- December (23)
- November (29)
- October (27)
- September (26)
- August (24)
- July (24)
- June (23)
- May (25)
- April (26)
- March (24)
- February (24)
- January (21)
2012
- December (19)
- November (22)
- October (27)
- September (24)
- August (30)
- July (23)
- June (25)
- May (23)
- April (25)
- March (25)
- February (28)
- January (24)
2011
- December (17)
- November (14)
- October (24)
- September (28)
- August (27)
- July (30)
- June (19)
- May (16)
- April (30)
- March (23)
- February (11)
- January (26)
2010
- December (29)
- November (28)
- October (35)
- September (33)
- August (44)
- July (17)
- June (20)
- May (53)
- April (29)
- March (35)
- February (33)
- January (36)
2009
- December (37)
- November (35)
- October (53)
- September (60)
- August (66)
- July (29)
- June (24)
- May (52)
- April (63)
- March (35)
- February (53)
- January (50)
2008
- December (58)
- November (65)
- October (46)
- September (48)
- August (96)
- July (87)
- June (45)
- May (51)
- April (52)
- March (70)
- February (43)
- January (49)
2007
- December (100)
- November (52)
- October (109)
- September (68)
- August (80)
- July (56)
- June (150)
- May (115)
- April (73)
- March (124)
- February (102)
- January (68)
2006
- December (95)
- November (53)
- October (120)
- September (57)
- August (88)
- July (54)
- June (103)
- May (89)
- April (84)
- March (143)
- February (78)
- January (64)
2005
- December (70)
- November (97)
- October (91)
- September (61)
- August (74)
- July (92)
- June (100)
- May (53)
- April (42)
- March (41)
- February (84)
- January (31)
2004
- December (49)
- November (26)
- October (26)
- September (6)
- April (10)

RavenDB Workshops - Deep dive into practical use of Document Data Modeling

Jun 30 2016

The Guts n’ Glory of Database InternalsUnderstanding durability with hard disks

time to read 6 min | 1012 words

One of the most important aspect of a database engine is that it needs to support durability. Now, that is a personal opinion as a database author, but I consider this to be a pretty important consideration when selecting a database. Other database engines disagree, from pure in memory databases that lose all the data on restart to databases that makes “best effort” and will work as long as they don’t crash in the wrong place.

Because there are several different metrics for what durable means, I’ll provide several levels of possible durability requirements.

None – if the application / server restarts, all the data is lost. Typically is used for in memory databases, explicitly giving up durability for performance.
Try – the data is written to disk, but no attempt is made to make sure that it is coherent / up to date. Typically is used to startup an in memory / near in memory database from cold state.
Crash – if the database has confirmed a write, and then immediately crashed, the data is still there. Most databases try to get to this level.
Power loss – if the database has confirmed a write, even complete power loss of the entire machine will still keep the written data. This is where you wanna be.

Note that in all those cases, I’m talking about single node databases, distributed stuff is a lot more complex, so I’ll not touch it here.

This is when talking about durability, but there is also the notion of atomicity, in other words, a transaction that is composed of multiple operations should either be there complete (and certain if confirmed to the client) or not be there at all (rollback), there should never be a situation where some of the records went in, and some didn’t.

Finally, there is the paranoid mode for durability, in which you don’t trust the hardware, so you write to multiple locations, hash it and validate. Personally, at that level, I think that this is the purpose of the file system to verify that, and this is where the responsibility of the database ends, but if you are stuck with a poor file system choice (like FAT a decade or two ago), that was certainly something that you’ll have to consider.

At any rate, one of the major problems with gaining durability is that you have to pay so much for it. In order to actually be durable, you have to write your changes to the actual disk platter, and that is something that might actually require physical parts to move, so that is something that is very costly. How costly? A high end (15,000 RPM) hard disk can do a theoretical maximum of 250 such writes per second, and that is an extremely theoretical number. In most cases, even on high end hard disks, you’ll see a maximum of a 100 – 150 per second. You can probably double or triple that for high end SSD drive, but those are still very poor numbers, compared to the number of operations you can do in memory and in the CPU in that time frame.

That puts a pretty hard limit on the number of times you can hit the disk, but that is not something that we can tolerate, so the process of making a write in most operation systems, looks like this:

Notice the number of buffers in the middle? Each of them is going to gather the I/O as it can before issuing a (large) write to the layer below them. With the idea that this way, we can amortize the cost of going to the physical disk among many different writes. It works, it works quite well, in fact, to the point where most of the time, you don’t really think how slow the hardware is really is.

But for databases, this is a horrendous problem. To start with, if I’m issuing a write to the disk, I have no idea when this actually hit the platter. For fun, all through this chain (which is actually simplified), each component is free to re-order the writes in any way it feels like it, so we can’t even rely on the order of the writes to ensure consistency. No, in order to protect itself from data loss / corruption in the presence of power loss, we have to tell the operating system to actually skip all those layers and flush directly to disk.

As I said, this is expensive to do. In particular, the normal way to do to make your writes, and then to to call fsync(fd) in order to flush those changes down the chain, all the way to the disk. This has a few issues, however. In particular, note that in the gap between the file system and the disk driver, we’ll lose the correlation between writes made to a particular file and any other writes made to that device. That end result is that the fsync command forces us to flush the entire disk driver buffer (and the disk buffers, etc). In production systems, that can be hundreds of MB that were innocently sitting there, slowly being written to disk, and suddenly you have this disruptive fsync that need to be written, so everything is flushed to disk, and the fsync (which you expect to be short, because you wrote on only a few dozen KB) now takes a minute, because it is actually flushing 100 MB writes from totally different process.

This post is getting long enough, so I’ll defer the actual discussion on how databases actually achieve durability to the next one, in the meantime, just consider the complexities involved in making sure that the data is on the disk, and how much of the design of modern databases is spent in optimizing just this part.

Tweet Share Share 16 comments

Tags:

Comments

30 Jun 2016
10:46 AM

mark

What do you think about the approach of always writing using WRITE_THROUGH? That way you can control durability at the individual IO level.

30 Jun 2016
10:48 AM

OmariO

When you tell the OS to write a buffer to disk is the write atomic? Ordered?

30 Jun 2016
10:52 AM

Oren Eini

Mark,
Just using WRITE_THROUGH doesn't actually give you the benefit of no buffering. You need both.
See here: https://msdn.microsoft.com/en-us/library/windows/desktop/aa363858(v=vs.85).aspx

This particular item is of interest:

The operating system also requests a write-through of the hard disk's local hardware cache to persistent media.

This is mentioned only when both are valid. There is also a mention of performance benefit when using both because we skip metadata updates.

There is also another issue, we don't actually want to read from the journal (unless we crashed, which is different), so there is no real point in filling the system cache with stuff that should never be read.

30 Jun 2016
10:54 AM

Oren Eini

OmariO,

Disk writes are atomic to a single sector (in other words, a 512 bytes write should be atomic), but in practice, you don't rely on that.
You force flushes (either via the flags or via fsync) to make sure that there is proper ordering between those operations.

Note that the file system is free to reorder and break apart I/O requests as it wishes, and only fsync / flags can cause it to do this directly in a way that an application can take advantage of it.

30 Jun 2016
11:22 AM

mark

Right, but what do you think about doing that (using both) as an IO strategy?

30 Jun 2016
11:27 AM

Oren Eini

Mark,
That is what we are using, yes. Wait for tomorrow's post with the full details :-)

30 Jun 2016
13:40 PM

OmariO

So when disk writes your buffer that spans a bunch of sectors it does so in the order - one sector after another?
Can it happen that it writes atomically the last sector first and then loses power in its capacitors?

30 Jun 2016
17:31 PM

Oren Eini

OmariO,
There is absolutely no guarantee with regards to what order the disk is handling the write.
In particular, while you can probably safely assume that a write that cover an entire sector is sort of atomic, that isn't always the case (consider remote file systems), and you can't assume that the writes will go in any particular direction.

When you have multi sector write, for example, the drive is actually encouraged to write it in a different order then the one you specified (maybe the disk head is in the proper position to start from a particular location, so it will start writing there), so you can have a write to sectors 1,2,3 and 5, but not 4 (in the same multi sector write).
That would be strange, but depending on the actual layout of the data on disk, that is possible.

Fragmentation plays into that a lot, too. So even though you think you issues a three sector writes that are right next to each other, they may have been allocated very far away, so they can be written in any order.

Basically, the short answer is that until you get a fsync() result back, you can't really assume anything about the data on disk. And you need to be suspicious of fsync under certain systems (sometimes admin / kernel / driver just turn it off for whatever reason).

30 Jun 2016
22:39 PM

OmariO

Thanks for the detailed answer.
Looking forward for the next post!

30 Jun 2016
23:13 PM

Ross

Argh! This is worse than Game of Thrones. You set up a fascinating discussing, and then say "more on this in the next post". I want details!

30 Jun 2016
23:33 PM

Oren Eini

Ross,
There are currently 15 posts in this series, talking about everything in detail.
Trying to get it all in one post means that people skip a lot of the details.

01 Jul 2016
09:56 AM

Craig

Hi Oren, this might be slightly off topic, but there is a really interesting paper by the late Jim Grey examining disk failure rates. If you haven't seen it before, it's a great read http://arxiv.org/abs/cs/0701166

01 Jul 2016
10:36 AM

Oren Eini

Craig,
Yeah, this is what happens when you can't trust the disk. It is a bit high than the numbers I'm familiar with, but he is talking about decades old low end hardware.
Note that at this point, the entity responsible for figuring out is the file system, not the database itself, there are some things that a database can do to recover from them, but they are quite expensive, and a good file system should already be doing them.

And you are supposed to run your db on something a bit better than a single low end hard disk.

01 Jul 2016
11:31 AM

Craig

agreed, what surprised me though was the failure rates in the SAN firmware, which I didn't expect. it's admittedly a rather old article. it would be interesting to see a follow up study...

01 Jul 2016
12:21 PM

Daniel

I have two questions related to the use of fsync:

How much performance does it cost to wait for the writes to complete (i.e. moving from the "Try" to the "Crash" or "Power loss" level)?
How do you deal with admins disabling fsync or hard drives ignoring fsync?

01 Jul 2016
14:18 PM

Oren Eini

Daniel,
With fsync & friends, we are seeing on one machine steady state of 65MB/sec writes to disk, with peeks of about 550MB/sec.

Without it, we are seeing steady state of about 20MB /sec with peeks of about 550MB/sec (but usually the peeks are around 330MB/sec).

To compare, writing 2 million 2KB items with fsync costs gives us a rate of about 7.4 items per ms. Without fsync, we are talking about 7.8 items per ms.

Now, to be fair, this is just by removing the fsync all, the rest of the code still make sure that the preparation for actually writing with fsync. If we didn't have it at all, we would see a major difference in the architecture of the solution.
Because we would be able to rely on different guarantees from the OS.

In other words, while I have some numbers here, they are very suspect.

Another project of ours, writing a log file, can do roughly 400 MB / second sustained without putting too much pressure on the disk, by explicitly utilizing the buffered nature of such calls.

In general, I would say that the cost of fsync and friends is an order of magnitude. If you are calling fsync directly, this is really expensive. If you are doing it in the optimized fashion we do, it is still expensive, but much less so.

As for admins that turn this off. They sometimes have valid reason (battery backup exists), and there is no way the database itself can discover whatever this is the case or not.
Both that and lying drives are stuff that you just have to live with, document carefully and trust the administrator to make the right decisions.

Comment preview

Comments have been closed on this topic.

Markdown turns plain text formatting into fancy HTML formatting.

Phrase Emphasis

*italic*   **bold**
_italic_   __bold__

Links

Inline:

An [example](http://url.com/ "Title")

Reference-style labels (titles are optional):

An [example][id]. Then, anywhere
else in the doc, define the link:
  [id]: http://example.com/  "Title"

Images

Inline (titles are optional):

![alt text](/path/img.jpg "Title")

Reference-style:

![alt text][id]
[id]: /url/to/img.jpg "Title"

Headers

Setext-style:

Header 1
========
Header 2
--------

atx-style (closing #'s are optional):

# Header 1 #
## Header 2 ##
###### Header 6

Lists

Ordered, without paragraphs:

1.  Foo
2.  Bar

Unordered, with paragraphs:

*   A list item.
    With multiple paragraphs.
*   Bar

You can nest them:

*   Abacus
    * answer
*   Bubbles
    1.  bunk
    2.  bupkis
        * BELITTLER
    3. burper
*   Cunning

Blockquotes

> Email-style angle brackets
> are used for blockquotes.
> > And, they can be nested.
> #### Headers in blockquotes
> 
> * You can quote a list.
> * Etc.

Horizontal Rules

Three or more dashes or asterisks:

---
* * *
- - - -

Manual Line Breaks

End a line with two or more spaces:

Roses are red,   
Violets are blue.

Fenced Code Blocks

Code blocks delimited by 3 or more backticks or tildas:

```
This is a preformatted
code block
```

Header IDs

Set the id of headings with {#<id>} at end of heading line:

## My Heading {#myheading}

Tables

Fruit    |Color
---------|----------
Apples   |Red
Pears	 |Green
Bananas  |Yellow

Definition Lists

Term 1
: Definition 1
Term 2
: Definition 2

Footnotes

Body text with a footnote [^1]
[^1]: Footnote text here

Abbreviations

MDD <- will have title
*[MDD]: MarkdownDeep

Oren Eini

Oren Eini

CEO of RavenDB

The Guts n’ Glory of Database InternalsUnderstanding durability with hard disks

More posts in "The Guts n’ Glory of Database Internals" series:

Comments

Comment preview

FUTURE POSTS

RECENT SERIES

RECENT COMMENTS

Syndication

Main feed
Comments feed

Oren Eini

CEO of RavenDB

Related posts that you may find interesting:

More posts in "The Guts n’ Glory of Database Internals" series:

Comments

Comment preview

Markdown formatting

Phrase Emphasis

Links

Images

Headers

Lists

Blockquotes

Horizontal Rules

Manual Line Breaks

Fenced Code Blocks

Header IDs

Tables

Definition Lists

Footnotes

Abbreviations

FUTURE POSTS

RECENT SERIES

RECENT COMMENTS

Syndication