The Guts n’ Glory of Database Internals: Managing concurrency
This is an interesting post, because I’m going to lay down some of the options that we have for concurrency inside the database engine, but not really discuss them in depth. In particular, even with concurrency control, you get into… strange situations with databases (see transaction isolation levels and what problems each is trying to solve) because of the nature of the beast.
The ideal is, of course, that each client feels like they are the only client touching the database, and no one can touch the database while it is running. The easiest way to do that is to allow just a single client at a time to enter the database. That is actually something that happens frequently, but in practice, that isn’t something that you really want to do. Even embedded databases allow at least multiple readers, so that isn’t really something that we’ll deal with.
Before we get into actually concurrency discussion, we first need to figure out what we are talking about with regards to concurrency inside the database engine. The holy grail is writers that don’t hold readers, and readers that don’t hold up writers, allowing both reads and writes to proceed without blocking.
Before we get to the actual complexities involved in implementing it, let us see what kind of problems we have after we solved this. In particular, the issue is when we have multiple clients reading / writing to the same value concurrently. What are we supposed to do then? If we have W1 and W2 both trying to mutate the same record, which one will win? Do we serialize the accesses? What happens if we have a W1 and R2 both modifying and reading from the same record? Until the write transaction is completed, do we give the reader the new value, make it wait?
The classic model in databases used to be that the database would take locks. You can think about them as Reader/Writer locks whenever it read / wrote a particular value. The release schedule for those locks would impact the transaction isolation level, but that isn’t really important for this discussion. Note that a lock per value is pretty expensive, so one of the things that a database engine would do will be to escalate the lock, if it noted that there are too many locks in a particular page, it will escalate to a page lock (and onward until the entire tree / table were locked). That exposed a pretty nasty issue to users, if you had a hotspot in your database (recently modified records), it was easy to get into a situation where all the clients were waiting for the same lock, effectively causing a train. Note that in this case, both readers and writers are waiting for each other, and the idea is that we gain concurrency by spreading the locks around in the hope that they won’t contend so much.
Another way to handle this is called MVCC (Multi Versioning Concurrency Control), in this manner, instead of overwriting a record immediately on change, we keep the old value, so readers don’t have to wait for the end of the transaction to get the value, they can get the old value (which we can give without waiting). Writers still need to wait for each other if they modify the same record, but we just ensures that writers and readers don’t need to wait for one another. MVCC is a bit complex, because you need to maintain multiple concurrent versions, but it is a very common choice today.
But a lot of the complexity around writers and writers locks is actually embedded in the notion of having a conversation with the database. The ability to issue multiple statements to the database in the same connection, with potentially human reactions behind that makes for a much more complex system, because you have to hold all the locks for the duration of the connection. In most newer databases, there is no such concept, a write transaction is held for the duration of a single command (or a batch of commands), which is processed independently and then commit immediately.
Under that scenario, it actually make a lot more sense to skip the notion of concurrency writers, and move to concurrent readers/single writer mode. In that mode, there can be only a single write transaction, and the only concurrency that you have to deal with is with the readers (which can be solved efficiently with MVCC), so that makes for a much simpler database design. Combine that with the serial nature of I/O which databases depend on for durability (more on that in a future post), this actually make a lot of sense, since it removes a lot of the complexity from the database code.
RavenDB, LMDB, MongoDB, CouchDB are all built with a single concurrent writer in mind. In fact, even databases such LevelDB or RocksDB are effectively single writer (they just do concurrent transaction merges).
Let us talk about transaction merging for a while, shall we? LevelDB in particular is an interesting case, because you can use the notion of WriteBatch to write to it, and multiple threads can submit WriteBatch at the same time, giving us concurrent writes. The way it is implemented, though, is quite interesting. All those threads submitting all those WriteBatch instances will add their batch to a queue, then compete on a lock. The first one that wins will run through all the WriteBatch in the queue and commit them all.
It is a simple, and attractive model, but I seriously dislike it. The problem is that it is you might two WriteBatch that modify the same record, and you don’t really have a good way of knowing about that. So you can’t really reason about it. Since a single lock is taken anyway, I much prefer the single writer model, in which the write transaction know that it is the only one that can modify things, so it doesn’t need to worry about concurrency with other transactions that might have different idea about what is supposed to be in the database.
When we implemented transaction merging, we did this with explicit optimistic concurrency in mind, and it worked, but it was a complex model, and switching to a single writer made the whole thing much simpler.