This post asked an interesting question, why are hash table so prevalent for in memory usage and (relatively) rare in the case of databases. There is some good points in the post, as well as in the Hacker News thread.
Given that I just did a spike of persistent hash table and have been working on database engines for the past decade, I thought that I might throw my own two cents into the ring.
B+Tree is a profoundly simple concept. You can explain it in 30 minutes, and it make sense. There are some tricky bits to a proper implementation, for sure, but they are more related to performance than correctness.
Hash tables sounds simple, but the moment you have to handle collisions gracefully, you are going to run into real challenges. It is easy to get into nasty bugs with hash tables, the kind that silently corrupt your state without you realizing it.
For example, consider the following code:
This is a hash table using linear addressing. Collisions are handled by adding them to the next available node. And in this case, we have a problem. We want to put “ghi” in position zero, but we can’t, because it is already full. We move it to the first available location. That is well understood and easy. But when we delete “def”, we remove the entry from the array, but we forgot to do fixups for the relocated “ghi”, that value is now gone from the table, effectively. This is the kind of bug you need the moon to be in a certain position while a cat sneeze to figure out.
A B+Tree also maps very nicely to persistent model, but it is entirely non obvious how you can go from the notion of a hash table in memory to one on disk. Extendible hashing exists, and has for a very long time. Literally for more time than I’m alive, but it is not very well known / generically used. It is a beautiful algorithm, mind you. But just mapping the concept to a persistence model isn’t enough, typically, you also had a bunch of additional requirements from disk data structure. In particular, concurrency in database systems is frequently tied closely to the structure of the tree (page level locks).
There is also the cost issue. When talking about disk based data access, we are rarely interested in the actual O(N) complexity, we are far more interested in the number of disk seeks that are involved. Using extendible hashing, you’ll typically get 1 – 2 disk seeks. If the directory is in memory, you have only one, which is great. But with a B+Tree, you can easily make sure that the top levels of the tree will also be memory resident (similar to the extendible hash directory), that leads to typical 1 disk access to read the data, so in many cases, they are roughly the same performance for either option.
Related to the cost issue, you have to also consider security risks. There have been a number of attacks against hash tables that relied on generating hash collisions. The typical in memory fix is to randomize the hash to avoid this, but if you are persistent, you have to use the same hash function forever. That means that an attacker can very easily kill your database server, by generating bad keys.
But these are all relatively minor concerns. The key issue is that B+Tree is just so much more useful. A B+Tree can allow me to:
- Store / retrieve my data by key
- Perform range queries
- Index using a single column
- Index using multiple columns (and then search based on full / partial key)
- Iterate over the data in specified order
Hashes allow me to:
- Store / retrieve my data by key
And that is pretty much it. So B+Tree can do everything that Hashes can, but also so much more. They are typically as fast where it matters (disk reads) and more than sufficiently fast regardless.
Hashes are only good for that one particular scenario of doing lookup by exact key. That is actually a lot more limited than what you’ll consider.
Finally, and quite important, you have to consider the fact that B+Tree has certain access patterns that they excel at. For example, inserting sorted data into a B+Tree is going to be a joy. Scanning the B+Tree in order is also trivial and highly performant.
With hashes? There isn’t an optimal access pattern for inserting data into a hash. And while you can scan a hash at roughly the same cost as you would a B+Tree, you are going to get the data out of order. That means that it is a lot less useful than it would appear to upfront.
All of that said, hashes are still widely used in databases. But they tend to be used as specialty tools. Deployed carefully and for very specific tasks. This isn’t the first thing that you’ll reach to, you need to justify its use.
More posts in "re" series:
- (27 Oct 2020) Investigating query performance issue in RavenDB
- (27 Dec 2019) Writing a very fast cache service with millions of entries
- (26 Dec 2019) Why databases use ordered indexes but programming uses hash tables
- (12 Nov 2019) Document-Level Optimistic Concurrency in MongoDB
- (25 Oct 2019) RavenDB. Two years of pain and joy
- (19 Aug 2019) The Order of the JSON, AKA–irresponsible assumptions and blind spots
- (10 Oct 2017) Entity Framework Core performance tuning–Part III
- (09 Oct 2017) Different I/O Access Methods for Linux
- (06 Oct 2017) Entity Framework Core performance tuning–Part II
- (04 Oct 2017) Entity Framework Core performance tuning–part I
- (26 Apr 2017) Writing a Time Series Database from Scratch
- (28 Jul 2016) Why Uber Engineering Switched from Postgres to MySQL
- (15 Jun 2016) Why you can't be a good .NET developer
- (12 Nov 2013) Why You Should Never Use MongoDB
- (21 Aug 2013) How memory mapped files, filesystems and cloud storage works
- (15 Apr 2012) Kiip’s MongoDB’s experience
- (18 Oct 2010) Diverse.NET
- (10 Apr 2010) NoSQL, meh
- (30 Sep 2009) Are you smart enough to do without TDD
- (17 Aug 2008) MVC Storefront Part 19
- (24 Mar 2008) How to create fully encapsulated Domain Models
- (21 Feb 2008) Versioning Issues With Abstract Base Classes and Interfaces
- (18 Aug 2007) Saving to Blob
- (27 Jul 2007) SSIS - 15 Faults Rebuttal
- (29 May 2007) The OR/M Smackdown
- (06 Mar 2007) IoC and Average Programmers
- (19 Sep 2005) DLinq Mapping