The interesting thing about this problem is that I was very careful in how I phrased things. I said what I wanted to happen, but didn’t specify what needs to be done. That was quite intentional. For that matter, the fact that I am posting about what is going to be our acceptance criteria is also intentional. The idea is to have a non trivial task, but something that should be very well understood and easy to research. It also means that the candidate needs to be able to write some non trivial code. And I can tell a lot about a dev from such a project. At the same time, this is a very self contained scenario. The idea is that this is something that you can do in a short amount of time.
The reason that this is an interesting exercise is that this is actually at least two totally different but related problems. First, in a 15TB file, we obviously cannot rely on just scanning the entire file. That means that we have to have an index. And that mean that we have to build it. Interestingly enough, an index being a sorted structure, that means that we have to solve the problem of sorting more data than can fit in main memory.
The second problem is probably easier, since it is just an implementation of external sort, and there are plenty of algorithms around to handle that. Note that I am not really interested in actual efficiencies for this particular scenario. I care about being able to see the code. See that it works, etc. My solution, for example, is a single threaded system that make no attempt at parallelism or I/O optimizations. It clocks at over 1 GB / minute and the memory consumption is at under 150MB. Queries for a unique value return the result in 0.0004 seconds. Queries that returned 153K results completed in about 2 seconds.
When increasing the used memory to about 650MB, there isn’t really any difference in performance, which surprised me a bit.
Then again, the entire code is probably highly inefficient. But that is good enough for now.
The process is kicked off with indexing:
I am actually using the Chicago crime data for this. This is a 1GB file that I downloaded from the Chicago city portal in CSV format. This is what the data looks like:
The ExternalSorter will read and parse the file, and start reading it into a buffer. When it gets to a certain size (about 64MB of source data, usually), it will sort the values in memory and output them into temporary files.
Those file looks like this:
Initially, I tried to do that with binary data, but it turns out that that was too complex to be easy, and writing this in a human readable format made it much easier to work with. The format is pretty simple, you have the value of the left, and on the right you have start position of the row for this value.
We generate about 17 such temporary files for the 1GB file. One temporary file per each 64 MB of the original file. This lets us keep our actual memory consumption very low, but for larger data sets, we’ll probably want to actually do the sort every 1 GB or maybe more. Our test machine has 16 GB of RAM, so doing a sort and outputting a temporary file every 8 GB can be a good way to handle things. But that is beside the point.
The end result is that we have multiple sorted files, but they aren’t sequential. In other words, in file #1 we have values 1,4,6,8 and in file #2 we have 1,2,6,7. We need to merge all of them together. Luckily, this is easy enough to do. We basically have a heap that we feed entries from the files into. And that pretty much takes care of this. See merge sort if you want more details about this.
The end result of merging all of those files is… another file, just like them, that contains all of the data sorted. Then it is time to actually handle the other issue, actually searching the data.
We can do that using simple binary search, with the caveat that because this is a text file, and there is no fixed size records or pages, it is actually a big hard to figure out where to start reading.
In effect, what I am doing is to select an arbitrary byte position, then walk backward until I find a ‘\n’. Once I found the new line character, I can read the full line, check the value, and decide where I need to look next. Assuming that I actually found my value, I can now go to the byte position of the value in the original file and read the original line, giving it to the user.
Assuming an indexing rate of 1 GB / minute a 15 TB file would take about 10 days to index. But there are ways around that as well, but I’ll touch on them in my next post.What all of this did was bring home just how much we usually don’t have to worry about such things. But I consider this research well spent, we’ll be using this in the future.