The Guts n’ Glory of Database InternalsManaging concurrency
This is an interesting post, because I’m going to lay down some of the options that we have for concurrency inside the database engine, but not really discuss them in depth. In particular, even with concurrency control, you get into… strange situations with databases (see transaction isolation levels and what problems each is trying to solve) because of the nature of the beast.
The ideal is, of course, that each client feels like they are the only client touching the database, and no one can touch the database while it is running. The easiest way to do that is to allow just a single client at a time to enter the database. That is actually something that happens frequently, but in practice, that isn’t something that you really want to do. Even embedded databases allow at least multiple readers, so that isn’t really something that we’ll deal with.
Before we get into actually concurrency discussion, we first need to figure out what we are talking about with regards to concurrency inside the database engine. The holy grail is writers that don’t hold readers, and readers that don’t hold up writers, allowing both reads and writes to proceed without blocking.
Before we get to the actual complexities involved in implementing it, let us see what kind of problems we have after we solved this. In particular, the issue is when we have multiple clients reading / writing to the same value concurrently. What are we supposed to do then? If we have W1 and W2 both trying to mutate the same record, which one will win? Do we serialize the accesses? What happens if we have a W1 and R2 both modifying and reading from the same record? Until the write transaction is completed, do we give the reader the new value, make it wait?
The classic model in databases used to be that the database would take locks. You can think about them as Reader/Writer locks whenever it read / wrote a particular value. The release schedule for those locks would impact the transaction isolation level, but that isn’t really important for this discussion. Note that a lock per value is pretty expensive, so one of the things that a database engine would do will be to escalate the lock, if it noted that there are too many locks in a particular page, it will escalate to a page lock (and onward until the entire tree / table were locked). That exposed a pretty nasty issue to users, if you had a hotspot in your database (recently modified records), it was easy to get into a situation where all the clients were waiting for the same lock, effectively causing a train. Note that in this case, both readers and writers are waiting for each other, and the idea is that we gain concurrency by spreading the locks around in the hope that they won’t contend so much.
Another way to handle this is called MVCC (Multi Versioning Concurrency Control), in this manner, instead of overwriting a record immediately on change, we keep the old value, so readers don’t have to wait for the end of the transaction to get the value, they can get the old value (which we can give without waiting). Writers still need to wait for each other if they modify the same record, but we just ensures that writers and readers don’t need to wait for one another. MVCC is a bit complex, because you need to maintain multiple concurrent versions, but it is a very common choice today.
But a lot of the complexity around writers and writers locks is actually embedded in the notion of having a conversation with the database. The ability to issue multiple statements to the database in the same connection, with potentially human reactions behind that makes for a much more complex system, because you have to hold all the locks for the duration of the connection. In most newer databases, there is no such concept, a write transaction is held for the duration of a single command (or a batch of commands), which is processed independently and then commit immediately.
Under that scenario, it actually make a lot more sense to skip the notion of concurrency writers, and move to concurrent readers/single writer mode. In that mode, there can be only a single write transaction, and the only concurrency that you have to deal with is with the readers (which can be solved efficiently with MVCC), so that makes for a much simpler database design. Combine that with the serial nature of I/O which databases depend on for durability (more on that in a future post), this actually make a lot of sense, since it removes a lot of the complexity from the database code.
RavenDB, LMDB, MongoDB, CouchDB are all built with a single concurrent writer in mind. In fact, even databases such LevelDB or RocksDB are effectively single writer (they just do concurrent transaction merges).
Let us talk about transaction merging for a while, shall we? LevelDB in particular is an interesting case, because you can use the notion of WriteBatch to write to it, and multiple threads can submit WriteBatch at the same time, giving us concurrent writes. The way it is implemented, though, is quite interesting. All those threads submitting all those WriteBatch instances will add their batch to a queue, then compete on a lock. The first one that wins will run through all the WriteBatch in the queue and commit them all.
It is a simple, and attractive model, but I seriously dislike it. The problem is that it is you might two WriteBatch that modify the same record, and you don’t really have a good way of knowing about that. So you can’t really reason about it. Since a single lock is taken anyway, I much prefer the single writer model, in which the write transaction know that it is the only one that can modify things, so it doesn’t need to worry about concurrency with other transactions that might have different idea about what is supposed to be in the database.
When we implemented transaction merging, we did this with explicit optimistic concurrency in mind, and it worked, but it was a complex model, and switching to a single writer made the whole thing much simpler.
More posts in "The Guts n’ Glory of Database Internals" series:
- (08 Aug 2016) Early lock release
- (05 Aug 2016) Merging transactions
- (03 Aug 2016) Log shipping and point in time recovery
- (02 Aug 2016) What goes inside the transaction journal
- (18 Jul 2016) What the disk can do for you
- (15 Jul 2016) The curse of old age…
- (14 Jul 2016) Backup, restore and the environment…
- (11 Jul 2016) The communication protocol
- (08 Jul 2016) The enemy of thy database is…
- (07 Jul 2016) Writing to a data file
- (06 Jul 2016) Getting durable, faster
- (01 Jul 2016) Durability in the real world
- (30 Jun 2016) Understanding durability with hard disks
- (29 Jun 2016) Managing concurrency
- (28 Jun 2016) Managing records
- (16 Jun 2016) Seeing the forest for the trees
- (14 Jun 2016) B+Tree
- (09 Jun 2016) The LSM option
- (08 Jun 2016) Searching information and file format
- (07 Jun 2016) Persisting information
Have you considered Wired Tiger engine? It has MVCC, concurrent writes and separate checkpoint() method for durability. For our task (indexing of about 3 billion 40 byte-length keys, 16 gb mem cache, with 'not strict' durability - we can afford reindexing from scratch on failure, but this is undesirable of course) it was the best among LMDB, ESENT and Voron in our benchmarks both for read and write perf.
Petr, See the "not strict" durability issue, that is a non starter.
I know that there are scenarios where that is desirable (in fact, we built a "checkpoint" like mechanism that you can use to optimize certain operations if you can tolerate data loss, ensuring that there is no corruption even if there is an error), but for the most part, when we talk about a database, when we put data it, it stays in, come hell or high water.
A lot of databases cheat. WiredTiger does flushes in the background, and amortize them among many operations. That make it faster, but it also expose it to data loss issues with certain scenarios.