Big Data Search: Setting up
The interesting thing about this problem is that I was very careful in how I phrased things. I said what I wanted to happen, but didn't specify what needs to be done. That was quite intentional. For that matter, the fact that I am posting about what is going to be our acceptance criteria is also intentional. The idea is to have a non-trivial task, but something that should be very well understood and easy to research. It also means that the candidate needs to be able to write some non-trivial code, and I can tell a lot about a dev from such a project. At the same time, this is a very self-contained scenario. The idea is that this is something that you can do in a short amount of time.
The reason that this is an interesting exercise is that this is actually at least two totally different but related problems. First, in a 15TB file, we obviously cannot rely on just scanning the entire file. That means that we have to have an index, and that means that we have to build it. Interestingly enough, since an index is a sorted structure, that means that we have to solve the problem of sorting more data than can fit in main memory.
The second problem is probably easier, since it is just an implementation of external sort, and there are plenty of algorithms around to handle that. Note that I am not really interested in actual efficiencies for this particular scenario. I care about being able to see the code, see that it works, etc. My solution, for example, is a single-threaded system that makes no attempt at parallelism or I/O optimizations. It clocks in at over 1 GB / minute, and memory consumption stays under 150MB. Queries for a unique value return the result in 0.0004 seconds. Queries that returned 153K results completed in about 2 seconds.
When increasing the memory used to about 650MB, there isn't really any difference in performance, which surprised me a bit.
Then again, the entire code is probably highly inefficient. But that is good enough for now.
The process is kicked off with indexing:
```csharp
var options = new DirectoryExternalStorageOptions("/path/to/index/files");
var input = File.OpenRead(@"/path/to/data/Crimes_-_2001_to_present.csv");
var sorter = new ExternalSorter(input, options, new int[]
{
    1, // case number
    4, // ICHR
});

sorter.Sort();
```
I am actually using the Chicago crime data for this. This is a 1GB file that I downloaded from the Chicago city portal in CSV format. This is what the data looks like:
The ExternalSorter will read and parse the file, accumulating the entries into a buffer. When the buffer gets to a certain size (about 64MB of source data, usually), it will sort the values in memory and output them into temporary files.
Those files look like this:
Initially, I tried to do that with binary data, but it turns out that that was too complex to be easy, and writing this in a human readable format made it much easier to work with. The format is pretty simple: you have the value on the left, and on the right you have the start position of the row for this value.
We generate about 17 such temporary files for the 1GB file, one temporary file for each 64 MB of the original file. This lets us keep our actual memory consumption very low, but for larger data sets we'll probably want to do the sort every 1 GB or maybe more. Our test machine has 16 GB of RAM, so doing a sort and outputting a temporary file every 8 GB can be a good way to handle things. But that is beside the point.
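For reference, here is a minimal sketch of that chunk-and-spill phase. This is not the actual ExternalSorter code; the names, the naive CSV handling, the single key column, and the "value,position" line format are assumptions for illustration:

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

// Sketch of the spill phase: accumulate ~64MB of (key, row position) pairs,
// sort them in memory, and write each sorted chunk to its own temporary file.
class SpillSorter
{
    const long ChunkSize = 64 * 1024 * 1024;

    public static List<string> SpillSortedChunks(string csvPath, int keyColumn, string tempDir)
    {
        var tempFiles = new List<string>();
        var buffer = new List<(string Key, long Position)>();
        long position = 0, bufferedBytes = 0;

        foreach (var line in File.ReadLines(csvPath))
        {
            var key = line.Split(',')[keyColumn]; // naive CSV parsing, illustration only
            buffer.Add((key, position));

            var lineBytes = Encoding.UTF8.GetByteCount(line) + 1; // assumes '\n' line endings
            position += lineBytes;
            bufferedBytes += lineBytes;

            if (bufferedBytes >= ChunkSize)
            {
                tempFiles.Add(WriteSortedChunk(buffer, tempDir, tempFiles.Count));
                buffer.Clear();
                bufferedBytes = 0;
            }
        }

        if (buffer.Count > 0)
            tempFiles.Add(WriteSortedChunk(buffer, tempDir, tempFiles.Count));

        return tempFiles;
    }

    static string WriteSortedChunk(List<(string Key, long Position)> buffer, string tempDir, int index)
    {
        buffer.Sort((a, b) => string.CompareOrdinal(a.Key, b.Key)); // in-memory sort of this chunk
        var path = Path.Combine(tempDir, $"chunk-{index}.txt");
        using (var writer = new StreamWriter(path))
        {
            foreach (var (key, pos) in buffer)
                writer.WriteLine($"{key},{pos}"); // value on the left, row start position on the right
        }
        return path;
    }
}
```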
The end result is that we have multiple sorted files, but they aren’t sequential. In other words, in file #1 we have values 1,4,6,8 and in file #2 we have 1,2,6,7. We need to merge all of them together. Luckily, this is easy enough to do. We basically have a heap that we feed entries from the files into. And that pretty much takes care of this. See merge sort if you want more details about this.
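For completeness, here is a minimal sketch of that heap-based k-way merge, under the same assumed "value,position" format. I'm using .NET's PriorityQueue here, which is much newer than this post, but the idea is the same: seed the heap with the first line from every file, then repeatedly pop the smallest entry and refill from the file it came from.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Sketch of merging N sorted temporary files into one sorted output file.
class ChunkMerger
{
    public static void Merge(IEnumerable<string> sortedChunkPaths, string outputPath)
    {
        var readers = new List<StreamReader>();
        var heap = new PriorityQueue<(StreamReader Reader, string Line), string>(StringComparer.Ordinal);
        try
        {
            foreach (var path in sortedChunkPaths)
            {
                var reader = new StreamReader(path);
                readers.Add(reader);
                var first = reader.ReadLine();
                if (first != null)
                    heap.Enqueue((reader, first), KeyOf(first));
            }

            using (var writer = new StreamWriter(outputPath))
            {
                while (heap.TryDequeue(out var entry, out _))
                {
                    writer.WriteLine(entry.Line); // globally smallest remaining entry
                    var next = entry.Reader.ReadLine();
                    if (next != null)
                        heap.Enqueue((entry.Reader, next), KeyOf(next)); // refill from the same file
                }
            }
        }
        finally
        {
            foreach (var reader in readers)
                reader.Dispose();
        }
    }

    // The sort key is everything before the last comma (the row position comes after it).
    static string KeyOf(string line) => line.Substring(0, line.LastIndexOf(','));
}
```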
The end result of merging all of those files is… another file, just like them, that contains all of the data sorted. Then it is time to actually handle the other issue, actually searching the data.
We can do that using simple binary search, with the caveat that because this is a text file, and there are no fixed-size records or pages, it is actually a bit hard to figure out where to start reading.
In effect, what I am doing is to select an arbitrary byte position, then walk backward until I find a ‘\n’. Once I find the newline character, I can read the full line, check the value, and decide where I need to look next. Assuming that I actually found my value, I can now go to the byte position of the value in the original file and read the original line, giving it to the user.
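Here is a minimal sketch of that lookup over the merged index file, again assuming "value,position" lines and '\n' line endings; a real implementation would also need to handle duplicate values and range scans:

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

// Sketch of binary searching a sorted, newline-delimited text index:
// jump to a byte position, snap back to the start of the line, compare keys.
class TextIndexSearcher
{
    public static long? FindPosition(string indexPath, string value)
    {
        using (var stream = File.OpenRead(indexPath))
        {
            long low = 0, high = stream.Length;
            while (low < high)
            {
                long mid = low + (high - low) / 2;
                long lineStart = SeekToLineStart(stream, mid);
                string line = ReadLine(stream, lineStart);
                if (line == null)
                    return null;

                int comma = line.LastIndexOf(',');
                string key = line.Substring(0, comma);

                int cmp = string.CompareOrdinal(value, key);
                if (cmp == 0)
                    return long.Parse(line.Substring(comma + 1)); // row position in the original file
                if (cmp < 0)
                    high = lineStart; // value sorts before this line: search the left half
                else
                    low = lineStart + Encoding.UTF8.GetByteCount(line) + 1; // skip past this line and its '\n'
            }
            return null; // not found
        }
    }

    // Walk backward from an arbitrary byte position until we hit a '\n' (or the start of the file).
    static long SeekToLineStart(FileStream stream, long position)
    {
        while (position > 0)
        {
            stream.Position = position - 1;
            if (stream.ReadByte() == '\n')
                break;
            position--;
        }
        return position;
    }

    // Read one line (up to '\n' or end of file) starting at the given byte offset.
    static string ReadLine(FileStream stream, long lineStart)
    {
        stream.Position = lineStart;
        var bytes = new List<byte>();
        int b;
        while ((b = stream.ReadByte()) != -1 && b != '\n')
            bytes.Add((byte)b);
        return bytes.Count == 0 ? null : Encoding.UTF8.GetString(bytes.ToArray());
    }
}
```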
Assuming an indexing rate of 1 GB / minute, a 15 TB file would take about 10 days to index (15 TB is roughly 15,360 GB, so about 15,360 minutes, or a bit over 10 days). There are ways around that as well, but I'll touch on them in my next post.
What all of this did was bring home just how much we usually don't have to worry about such things. But I consider this research time well spent; we'll be using this in the future.

More posts in the "Big Data Search" series:
- (24 Jan 2014) Sorting randomness
- (23 Jan 2014) Sorting Optimizations
- (22 Jan 2014) The index format is horrible
- (20 Jan 2014) Binary Search of Textual Data
- (17 Jan 2014) Setting up
Comments
@Ayende, Why merge the files? Since you have 1 GB files that are sorted, you just need to preserve that information, and in case of a query all you need to do is a lookup to pick the correct file and read from it (doing binary search or anything you want). I believe that this is called a split index or something like that (but those have a link to the next file at the end). Now, I know that merging has its benefits (one file instead of hundreds of files, and the merge time complexity is laughable as it's just O(N + M)), and it was stated that the prepare stage can be almost infinite in time complexity, but it just seems an unnecessary step given that the constraints are very fixed in their nature.
Bartosz, Assume a 15 TB file, which we split into 15,360 x 1 GB files that contain just the index. That means that in order to do a proper search, you would have to read 15,360 files. That alone is going to take you beyond the 30 second limit.
@Ayende, Since I hold 15,360 sorted entries in RAM describing the data ranges, I could easily know in O(1) which file to open and read.
Bartosz, It is very likely that you'll have to open all of them. Assuming that you have a uniform distribution, you are going to have emails starting with a and emails starting with z in every file.
@Ayende, Actually O(1) would be hard to do, but I could even do a linear search on those data ranges and it would still find the file to read in a very short time, < 1s. Binary search would also be possible.
@Ayende, Ok, my bad, you are correct, the merge operation would be required.
@Bartosz, Only under the condition that those 15,360 files have an empty intersection of data ranges. According to what Ayende said, that is not so. :-) You can have the same (or very near/similar) data range in e.g. 1,000 files.
@Jan, Yes, now I'm aware of that; for some bizarre reason I did not take that into consideration :)
Thanks for pointing that out.
This is the same approach as I would take.
"In effect, what I am doing is to select an arbitrary byte position, then walk backward until I find a ‘\n’."
Why arbitrary? I assumed you could store the byte position with index.
"Initially, I tried to do that with binary data, but it turns out that that was too complex to be easy, "
You have an integer and a string; just use a hash of the string and handle the low probability of collision (to make it even lower, use two hashes, 64 bit). We do this in Event Store. Your code will actually be far more complex for NOT doing this, as you can't figure out where midpoints are due to the variable record size.
Another trick we use is to cache the midpoints of the binary search, say 2^16 worth of them.
Cheers,
Greg
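As an aside, here is a rough sketch of what Greg's midpoint cache might look like: sample every Nth entry of the sorted, fixed-width index up front, and use the samples to narrow the range before touching the file. This is a guess at the idea, not Event Store's actual code:

```csharp
using System;

// Sketch of a midpoint cache for binary search over a sorted file of fixed-width
// records: _keys holds every Nth key (e.g. 2^16 samples), _entries the matching
// record indices. Narrow() returns a [low, high) record range to search on disk.
class MidpointCache
{
    readonly ulong[] _keys;     // sampled keys (e.g. 64-bit hashes), in sorted order
    readonly long[] _entries;   // record index of each sampled key in the index file

    public MidpointCache(ulong[] sampledKeys, long[] sampledEntries)
    {
        _keys = sampledKeys;
        _entries = sampledEntries;
    }

    public (long Low, long High) Narrow(ulong key, long totalEntries)
    {
        int index = Array.BinarySearch(_keys, key);
        if (index >= 0)
            return (_entries[index], _entries[index] + 1); // exact hit in the cache

        int insertAt = ~index; // index of the first sampled key greater than the target
        long low = insertAt == 0 ? 0 : _entries[insertAt - 1];
        long high = insertAt == _keys.Length ? totalEntries : _entries[insertAt];
        return (low, high); // binary search the file only within this window
    }
}
```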
@Greg,
Collision is the key here. I'm going to answer by pasting (and editing) a comment I wrote regarding a different post (as I'm lazy like that).
Hash tables are time efficient, not space efficient, so as soon as you load them into memory you can end up with a very bad distribution that is purely dependent on the data and the hashing function. The hashing function needs to be custom made, not built in, to have good distribution, so whatever you pick you need to test it, e.g. with this: http://code.google.com/p/smhasher/. MurmurHash is considered one of the best hashes, but v2 has a bug and v3 isn't in .NET, and even if it were you would still need to validate its correctness, otherwise the collision-to-bucket ratio will be catastrophic.
Another thing is to determine what kind of hash table it is going to be (the standard .NET one is rather poor for big data sets). Ideally you would need a space-efficient structure (you are in RAM now) with a low collision-to-bucket ratio as well as a low bucket-miss ratio. So you could use cuckoo, chained, double chained, weighted; each has its pros and cons, but the decision is actually not easy (cuckoo, being a double-hashed table, can have infinite loop cycles if not implemented wisely).
Unless you had something else in mind than storing a hash table in memory.
I think that your second suggestion is way better, since while we merge we can store midpoint offsets, or just store the offsets from the merged files, so we could just do seeks and one lookahead forward read once we find the smallest offset where the data that we want should be.
@Ayende I was just wondering how you arrived at this 10-day estimate? You assumed that it scales linearly? Would that imply that you have an algorithm for sorting in O(n)?
@AG, Merge sort for two sorted data sets is O(N + M) = O(N) :)
@Bartosz Adamczewski Let's put it another way: sorting 1 GB takes 1 minute, so 15,360 GB takes 15,360 minutes (about 10 days). I would assume that you're going to tell me that 10,000 GB would take 10,000 minutes?
In that case I'm really curious about that sorting algorithm :) Maybe @ayende can share it?
@AG,
My time complexity applies only to the step after you have all of the files; only then is it applicable. But like you said, "10000GB would take 10000 minutes"; the times would still be off, as we are dealing with lots of moving parts (disk, cache, etc.).
As for sorting the 1 GB chunks, we could use a Red-Black Tree that has O(n) insertion time, but then again O(log n) read time; this is not counting the write overhead, so it would be very hard to do in O(n).
I'm also curious about that now :)
@AG, One idea to speed up a Red-Black tree is to hold an extra structure (array) and, when doing insertions, just compare red nodes with black ones and then swap them accordingly; this should enable us to read a sorted set in O(N) time. Take that with a huge grain of salt, as this is an idea off the top of my head (maybe I'm gonna take a stab at it to check if this is doable).
This is soooo computer science 101 second term. What do you do when sort data doesn't fit into memory?? Call the police!!!
Surya, By arbitrary byte position I meant that I am going to go to the middle of the file, but that might be in the middle of a line, so we need to go back to find the start of the line.
Greg, I don't think that this is very costly in terms of perf, and it greatly eases debugging.
AG, The sorting algorithm is whatever List.Sort is using.
Sam, Yup. That is the idea. You'll be surprised when you see how many people just don't get how to solve this.
@Ayende, many people don't know how to solve FizzBuzz, so this is in another galaxy regarding problem complexity.
Can you share some recruitment statistics (provided that you collect them) regarding the problem in the blog post? I'm interested in just how many people struggle, and where.
Bartosz
I did not suggest using a hash table, but a sorted file of fixed-width keys that you binary search, storing hashes of the strings rather than the strings themselves.
@ayende
Why not try what I discuss it's much simpler than what you have.
Greg
@Greg,
Oh I see then I understood you wrong.
If you store hashes of strings in a sorted file, then the problem of choosing the proper hash function still exists. If you pick the wrong one (e.g. SuperFastHash has been found to generate many collisions for certain data sets), then you might end up scanning the whole file.
Don't get me wrong, as I think that fixed keys are way better than non-fixed ones if we are talking about a production system, but since this is supposed to be a recruitment assignment, I think that other ways would be simpler, since finding a good hash function is not trivial and I would not dare use the built-in one for a 15TB data set.
Since we are on the topic, could you share what kind of hashing function you use in EventStore?
Guys, just stop it. The right way to solve this is with an external merge sort. Anything else is just avoiding doing the right thing in the first place.
There is a structured way of solving issues algorithmically, not just random brainstorming. And this is not a forum; don't spam it.
You should honestly try to judge your abilities given Oren's example. There shouldn't even be any time needed to think about this issue for the right person; it's just obvious, straight in the face.
It is a good test, because if you don't see it then you shouldn't work there anyway. I like the question now because it's easy and obviously filters people fast (though I am actually surprised at how badly people seem to be educated).
Greg, Let us imagine that we have two different string hashing algorithms that generate a 32-bit int each. That means that we can get away with 16 bytes per entry (two 32-bit hashes + a 64-bit file position). We create a sorted file for this, and now we can very quickly search for a particular value. That is assuming that we don't have a double collision, of course, but we'll just leave that. However! This structure only allows for searching using exact matches. It cannot do range searches, and it cannot do prefix searches.
If we store the key itself in the index, we have a LOT more options available for us.
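For illustration, here is a sketch of what searching such a fixed 16-byte entry format might look like, assuming the file is sorted by (hash1, hash2); as noted above, this only gives exact-match lookups, and the names here are illustrative only:

```csharp
using System;
using System.IO;

// Sketch of binary searching a sorted index of fixed 16-byte entries:
// two 32-bit hashes of the key plus the 64-bit position of the row.
class HashedIndex
{
    const int EntrySize = 16; // 4 + 4 + 8 bytes

    public static long? Find(string indexPath, uint hash1, uint hash2)
    {
        using (var stream = File.OpenRead(indexPath))
        using (var reader = new BinaryReader(stream))
        {
            long low = 0, high = stream.Length / EntrySize;
            while (low < high)
            {
                long mid = low + (high - low) / 2;
                stream.Position = mid * EntrySize; // fixed-size records: no newline hunting needed
                uint h1 = reader.ReadUInt32();
                uint h2 = reader.ReadUInt32();
                long rowPosition = reader.ReadInt64();

                int cmp = hash1 != h1 ? hash1.CompareTo(h1) : hash2.CompareTo(h2);
                if (cmp == 0)
                    return rowPosition; // exact match only: no range or prefix searches possible
                if (cmp < 0)
                    high = mid;
                else
                    low = mid + 1;
            }
            return null;
        }
    }
}
```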