Big Data Search: The index format is horrible
I have completed my own exercise, and while I did want to try it with the "few allocations" rule, it is interesting to see just how far off the mark the code is.
This isn’t something that you can really use for anything except as a baseline to see how badly you are doing. Let us start with the index format. It is just a CSV file with the value and the position in the original file.
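To make that concrete, here is a minimal sketch of how such an index could be built (in Python just to keep it short; the key_of extractor is a hypothetical placeholder, and the sketch assumes the small test file's keys fit in memory for sorting):

```python
# Sketch of the index format described above: one "value,offset" line per
# record, sorted by value so it can be binary searched later.
def build_index(data_path, index_path, key_of):
    """key_of: hypothetical function that extracts the search key (a string) from one record."""
    entries = []
    with open(data_path, "rb") as data:
        offset = 0
        for line in data:
            entries.append((key_of(line.decode()), offset))  # byte offset of the record
            offset += len(line)
    entries.sort()  # the index must be sorted for binary search to work
    with open(index_path, "w") as index:
        for key, pos in entries:
            index.write(f"{key},{pos}\n")
```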
That means that any search we want to do on the file is actually a binary search, as discussed in the previous post. But doing a binary search like that is an absolute killer for performance.
Let us consider our 15TB data set. In my tests, a 1GB file with 4.2 million rows produced roughly an 80MB index. Assuming the same ratio holds for the larger file, that gives us a 1.2TB index file.
In my small index, we have to do 24 seeks to get to the right position in the file. And as you should know, disk seeks are expensive. They are on the order of 10ms or so, so the cost of actually searching the index is close to a quarter of a second. Now, to be fair, there are going to be a lot of caching opportunities here, but probably not that many if we have a lot of queries to deal with.
Of course, the fun thing about this is that even with a 1.2TB file, we are still talking about less than 40 seeks (the beauty of O(log N) in action), but that is still pretty expensive. Even worse, this is what happens when we are running a single query at a time.
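The back-of-the-envelope math, if you want to check those numbers (assuming roughly 10ms per seek):

```python
import math

SEEK_MS = 10  # rough cost of a single HDD seek

def binary_search_cost(entries):
    seeks = math.ceil(math.log2(entries))  # worst-case probes for a plain binary search
    return seeks, seeks * SEEK_MS

print(binary_search_cost(4_200_000))       # ~23 seeks -> roughly the quarter of a second above
print(binary_search_cost(64_500_000_000))  # ~36 seeks -> the "less than 40 seeks" case
```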
What do you think will happen if we are actually running this with multiple threads generating queries? Now we will have a lot of seeks (effectively random) that would create a big performance sink. This is especially true if we consider that any storage solution big enough to store the data is going to be composed of an aggregate of HDDs. Sure, we get multiple spindles, so we get better performance overall, but still…
Obviously, there are multiple solutions for this issue.
B+Trees solve the problem by packing multiple keys into a single page, so instead of doing O(log2 N), you are usually doing O(log36 N) or O(log100 N). With those fan outs, we will have 6 – 8 seeks to get to our data. Much better than the nearly 40 seeks required using plain binary search. It would actually be better than that in the common case, since the first few levels of the tree are likely to reside in memory (and probably in L1, if we are speaking about that).
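A quick way to see what the fan out buys us (the exact tree heights depend on page size and key size, so treat these as estimates):

```python
import math

def tree_seeks(entries, fanout):
    # height of the tree = worst-case number of page reads (seeks) per lookup
    return math.ceil(math.log(entries, fanout))

entries = 64_500_000_000
print(tree_seeks(entries, 2))    # 36 - plain binary search
print(tree_seeks(entries, 36))   # 7
print(tree_seeks(entries, 100))  # 6
```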
However, given that we are storing sorted strings here, one must give some attention to Sorted String Tables (SSTs). The way those work, you have the sorted strings in the file, and the footer contains two important bits of information. The first is the bloom filter, which allows you to quickly rule out missing values, but the more important part is that the footer also contains the positions of (by default) every 16th entry in the file. This means that for our 15TB data file (with 64.5 billion entries), we will use about 15GB just to store pointers to the different locations in the index file (which will be about 1.2TB). Note that these numbers are probably overestimates. Because SSTs (note that when talking about SSTs I am talking specifically about the leveldb implementation) utilize many forms of compression, the actual file size will likely be smaller (although, since the “value” we use is just a byte position in the data file, we won’t benefit from compression there). Key compression is probably a lot more important here.
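The footer index idea itself is simple enough to sketch (heavily simplified, and following the description above rather than the actual leveldb format, which also has compressed blocks, restart points, etc.):

```python
SPARSE_EVERY = 16  # record a footer entry for every 16th key, per the description above

def build_sparse_index(sorted_file_path):
    """Collect (key, byte offset) for every 16th entry of the sorted index file."""
    sparse = []
    with open(sorted_file_path, "rb") as f:
        offset = 0
        for i, line in enumerate(f):
            if i % SPARSE_EVERY == 0:
                key = line.split(b",", 1)[0].decode()
                sparse.append((key, offset))
            offset += len(line)
    return sparse  # this is the part that adds up to ~15GB for 64.5 billion entries
```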
However, note that this is a pretty poor way of doing things. Sure, the actual data format is better, in the sense that we don’t store as much, but in terms of the number of operations required? Not so much. We still need to do a binary search over the entire file. In particular, the leveldb implementation utilizes memory mapped files. What this ends up doing is relying on the OS to keep the midway points in the file in RAM, so we don’t have to do so much seeking. Without that, the cost of actually seeking every time would make SSTs impractical. In fact, you would pretty much have to introduce another layer on top of this, but at that point, you are basically doing trees, and a B+Tree is a better friend here.
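Assuming the sparse offsets are already cached in memory (which is exactly what the memory mapped file relies on the OS to do for the hot parts; at ~15GB that is not a given), a lookup looks roughly like this: a binary search over the cached offsets, then a single seek and a short forward scan:

```python
import bisect

SPARSE_EVERY = 16

def lookup(sorted_file_path, sparse, key):
    """sparse: list of (key, offset) pairs, e.g. from build_sparse_index above."""
    keys = [k for k, _ in sparse]            # in practice this would be kept precomputed
    i = bisect.bisect_right(keys, key) - 1   # last sparse entry whose key <= the search key
    if i < 0:
        return None
    with open(sorted_file_path, "rb") as f:
        f.seek(sparse[i][1])                 # the single disk seek
        for _ in range(SPARSE_EVERY):        # the target is within the next 16 entries
            line = f.readline()
            if not line:
                break
            k, pos = line.rstrip(b"\n").decode().split(",", 1)
            if k == key:
                return int(pos)              # byte position in the original data file
    return None
```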
This leads to an interesting question. SSTs are probably so popular inside Google because they deal with a lot of data, and the file format is very friendly to compression of various kinds. It is also a pretty simple format, which makes it much nicer to work with. On the other hand, a B+Tree implementation is a lot more complex, and it would probably be several orders of magnitude more complex if it had to do the same compression tricks that SSTs do. Another factor that is probably just as important is that, as I understand it, a lot of the time SSTs are used for actual sequential access (map/reduce stuff) and not necessarily for the random reads that are done in leveldb.
It is interesting to think about this in this fashion, at least, even if I don’t know what I’ll be doing with it.
More posts in "Big Data Search" series:
- (24 Jan 2014) Sorting randomness
- (23 Jan 2014) Sorting Optimizations
- (22 Jan 2014) The index format is horrible
- (20 Jan 2014) Binary Search of Textual Data
- (17 Jan 2014) Setting up
Comments
You mentioned one interesting number. I wonder if one can get a better than 100 fan out with a B+Tree, even with a 4KB page. Since you were asking about querying against emails and zip codes, the first can be greatly compressed by changing the notation from [user]@[domain] to [domain]@[user] and using prefixes in B+Trees.
What about SSDs and disk seeks, as mentioned in the comments on your link: https://gist.github.com/jboner/2841832
Maybe a hierarchical structure for the index file: memory (map) cache, SSD, and finally ordinary hard disks.
Is it possible to get rid of some disk seeks if you read a block larger than 4K, e.g. when you are near the target position in your binary search?
If you point to a large block in your index, you end up with a smaller index file, since all the values can be divided by xK.
@Carsten Hansen, the operation that you are describing is called a lookahead read, and many db engines do it (MS SQL, as far as I know). The only small problem here is that to do a forward read you need to position yourself in the binary search index at a point that will let you read forward, but that's actually not a big deal.
@Ayende, I wonder if the binary search in the SST example could be improved by using cache oblivious binary search trees, but that itself is not easy and would require an extra tree structure with a special layout, thus making the implementation complex.
@Ayende, OK, actually cache oblivious binary search trees wouldn't help at all, as there would be very little to no benefit from using them. The story might have been different, however, if the data could be added or modified (but then again, a test would be needed).
Scooletz, You can do that with key compression, sure. But I am not sure how much that would be worth it. The key issue is that a B+Tree with prefix compression is really hard. If I were using Voron, I would probably go for a multi tree per domain. That would give me domain "compression", since I would only store the domain once, but without all the complexity.
Carsten, A B+Tree is already able to take major advantage of the hierarchical nature of memory. You have the top branches of the tree cached in memory, and then the rest flows from there. You usually don't want to try managing it yourself; you would run into a hell of a lot of complexity trying.
Note that in Windows, at least, you are actually usually reading 32 KB at once from the disk, then the rest from the fs cache.
Bartosz, The main problem is that you would still need to do a lot of effectively random seeks, and potentially quite a lot of them. The key point with SSTs is that the format is simple and easily compressed. If most of what you do is reading the whole file in sequence, that is a very good format.
@Ayende, I suppose this is true. We could, however, try to keep a partial binary tree in cache (oblivious or not) that contains (X)KB pages, do one seek given by the tree page offset and issue a forward read, then cache the page so it could potentially serve other threads (so random seeks would be less frequent). But since we are doing that (introducing lots of complexity and memory management issues), it's very close from here to having a full B+Tree, so I see what you are saying.
Ayende, that's also an option. What I meant was to use the specific data format to lower the overhead of storing repeated fields.
Ayende, do you prefer MemoryMappedViewStream over MemoryMappedViewAccessor? Have you found a difference in performance for random access of a very large memory mapped file?
Tyler, I prefer to access the memory directly. Both the stream and the accessor have a very high overhead.