Big Data Search
I got tired of the old questions that we were asking candidates, so I decided to add a new one. This one is usually something that we’ll give the candidates to do at home, at their leisure. Let us imagine the following file:
1: "first_name","last_name","company_name","address","city","county","state","zip","phone1","phone2","email","web"2: "James","Butt","Benton, John B Jr","6649 N Blue Gum St","New Orleans","Orleans","LA",70116,"504-621-8927","504-845-1427","jbutt@gmail.com","http://www.bentonjohnbjr.com"3: "Josephine","Darakjy","Chanay, Jeffrey A Esq","4 B Blue Ridge Blvd","Brighton","Livingston","MI",48116,"810-292-9388","810-374-9840","josephine_darakjy@darakjy.org","http://www.chanayjeffreyaesq.com"4: "Art","Venere","Chemel, James L Cpa","8 W Cerritos Ave #54","Bridgeport","Gloucester","NJ","08014","856-636-8749","856-264-4130","art@venere.org","http://www.chemeljameslcpa.com"5: "Lenna","Paprocki","Feltz Printing Service","639 Main St","Anchorage","Anchorage","AK",99501,"907-385-4412","907-921-2010","lpaprocki@hotmail.com",http://www.feltzprintingservice.com
As you can see, this is a pretty trivial CSV file. However, let us assume that it is a small example of a CSV file that is 15 TB in size. The requirement is to be able to query that file. We need to be able to query by email, or find all the people within a particular zip code. Because of the size, the solution can be composed of two parts: a prepare part (which can run for as long as it needs) and a query-answering part. The maximum time to answer any query must be under 30 seconds.
- You can assume that the file never changes, and that once the prepare part is done, it will never need to be run again.
- The answer to a query is the full CSV row.
- You can assume a single machine with a 100 TB disk, 16 GB of RAM and 8 CPU cores.
- The solution cannot use any existing databases.
- The solution needs to include an explanation of the various options that were available and why this specific solution was chosen.
- After the prepare phase is done, the solution has to take up less than 30 TB of disk space (including the original file).
- The solution should be easy to apply to a different CSV file.
I decided that it wouldn’t be fair to ask candidates to do something like that without doing it myself. Mostly because the fact that I have a good idea about how to do something doesn’t mean that I understand the actual implementation issues that might pop up.
I actually gave myself a somewhat harder task: do the above-mentioned task, but without access to any library other than the BCL, and with a minimal amount of memory usage. The entire thing took less than a day, and it solves the problem quite a bit more efficiently than I anticipated.
But I’ll discuss the details of this in my next post.
Comments
How long did it take you without the no-libraries constraint?
I think expecting a candidate to spend a full day rules it out as a worthwhile code test.
If range queries are not required then we could use two indexes.
Radix tree lookup time is O(WordLen), so it's fast and space optimized, and it can be further compacted if needed. To further increase read performance and enable sequential read-ahead, one could split the tree into multiple files representing subtree parts, but since the structure of this tree is not split friendly this part would actually be tricky.
Actually, a better solution for zip codes (since they are just numbers) is to create a stripped-down version of a B-Tree, or even better, create several files that are partitioned by range (similar to log structured merge but without the merge step). Emails could use this idea too, but for strings Radix could be better IMHO, and it does permit this approach as well.
One issue is that this is a 15 TB file, and as far as I know streams can only read up to 8 TB, so that would be something to consider.
All in all this is a very fun and interesting exercise to do.
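To make the data structure discussion concrete, here is a minimal Python sketch of an uncompressed prefix trie mapping a key to a CSV row offset. It is not the path-compressed, disk-resident radix tree Bartosz describes (and the subtree-splitting idea is left out entirely), but it shows the lookup cost that depends only on the key length.

```python
class TrieNode:
    __slots__ = ("children", "offset")

    def __init__(self):
        self.children = {}   # char -> TrieNode
        self.offset = None   # byte offset of the CSV row, if a key ends here


class Trie:
    """Uncompressed prefix trie: lookup cost grows with key length, not with row count."""

    def __init__(self):
        self.root = TrieNode()

    def insert(self, key, offset):
        node = self.root
        for ch in key:
            node = node.children.setdefault(ch, TrieNode())
        node.offset = offset

    def lookup(self, key):
        node = self.root
        for ch in key:
            node = node.children.get(ch)
            if node is None:
                return None
        return node.offset


# Hypothetical usage: map an email to the byte offset of its CSV row.
index = Trie()
index.insert("jbutt@gmail.com", 123)
assert index.lookup("jbutt@gmail.com") == 123
assert index.lookup("missing@example.com") is None
```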
Flukus, Overall, about 6 - 8 hours, I think. I did mention that I did it in a way that candidates probably would not (I was trying to reduce string allocation overhead). We had a candidate that solved it quite elegantly with far less complexity & code.
And if range queries are indeed required and we are stubborn like that and still use Radix, then we could create a Radix+Tree where we link the leaves. This is not ideal for various reasons, but I think that it would work.
Bartosz, I think that you made it a lot more complex than it needs to be, though.
What do you mean by the 8 TB limit? I am not aware of that.
@Ayende, Now that I think about it the Log Structured Merge (without merge) would be the easiest to implement and still fast enough. I find SST tables the easiest to manage/code and handle :P.
Radix, on the other hand, is just perfect for strings: very simple and direct. But creating the tree is not very easy, granted.
As for the 8 TB limit: the stream in .NET operates on Int64, so the maximum length is 9,223,372,036,854,775,807 bytes (I could be wrong), and the pointer and seek operations can only go up to that length. I'm not sure if you can just read beyond that, but it would be strange to go beyond the Length. Take this with a grain of salt as I never tested it; it's just a hunch :)
Bartosz,
Unless my math is wrong:
9,223,372,036,854,775,808 bytes
9,007,199,254,740,992 kilobytes
8,796,093,022,208 megabytes
8,589,934,592 gigabytes
8,388,608 terabytes
8,192 petabytes
8 exabytes
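A quick sanity check of the unit conversion above:

```python
value = 2 ** 63   # one past Int64.MaxValue
for unit in ("bytes", "kilobytes", "megabytes", "gigabytes", "terabytes", "petabytes", "exabytes"):
    print(f"{value:,} {unit}")
    value //= 1024
# Ends at 8 exabytes -- far beyond the 15 TB file.
```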
@Ayende, your math is OK, mine is off.
Sorry about that.
The problem with streams is therefore non-existent :P
Prepare: perform an external sort and store the CSV sorted by email. Then traverse it once and store every 100000th email you find (in sorted order) with the file offset where it was found. Store that into an index file.
At query time you only need to scan the index which is 100000 times smaller than the original data set, and perform a lookup in the original CSV. Should be doable in 30sec.
We don't need no fancy data structures at all for these requirements.
The same for zip code. The data is stored twice, which is just about 30 TB. You obviously chose the numbers so that this strategy would work out.
Fun little problem.
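To illustrate tobi's sparse-index idea, here is a rough Python sketch. It assumes a hypothetical sorted_by_email.csv already produced by the external sort (no header row), with the email in column 10 as in the sample file; the index stores one (email, offset) pair per 100,000 rows, and a query scans the small index and then at most one 100,000-row block of the sorted file.

```python
import csv
import io

EVERY_N = 100_000   # sample one index entry per 100,000 rows, as suggested above
EMAIL_COLUMN = 10   # position of the "email" column in the sample CSV


def build_sparse_index(sorted_csv_path, index_path):
    """Walk the email-sorted CSV once, recording (email, byte offset) every N rows."""
    with open(sorted_csv_path, "rb") as data, open(index_path, "w", newline="") as out:
        writer = csv.writer(out)
        row_number = 0
        while True:
            offset = data.tell()
            line = data.readline()
            if not line:
                break
            if row_number % EVERY_N == 0:
                fields = next(csv.reader(io.StringIO(line.decode("utf-8"))))
                writer.writerow([fields[EMAIL_COLUMN], offset])
            row_number += 1


def query_by_email(sorted_csv_path, index_path, email):
    """Find the last sampled email <= the target, then scan forward from its offset."""
    start_offset = 0
    with open(index_path, newline="") as idx:
        for sampled_email, offset in csv.reader(idx):
            if sampled_email > email:
                break
            start_offset = int(offset)
    with open(sorted_csv_path, "rb") as data:
        data.seek(start_offset)
        for line in data:
            fields = next(csv.reader(io.StringIO(line.decode("utf-8"))))
            if fields[EMAIL_COLUMN] == email:
                return fields            # the full CSV row, as the requirements ask
            if fields[EMAIL_COLUMN] > email:
                break                    # sorted order: we have passed where it would be
    return None
```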
@Ayende we do something similar at Corvalius. We give our promising applicants (the ones that pass the first interview) a coding problem. Usually on graphics and GPGPU, because we look for stuff that we are sure they haven't done before.
They are required to tell us when to send the exercise and they have 1 week for a test that would take around 8 to 12 hours (for the candidates that we want).
The end result for us: we moved the rejection rate from 1-in-5 to 1-in-10 (many never finish the problem), but we won absurdly in the quality of the hires.
No one will ever need more than 8 exa bytes......
Assuming that "query by email" means exact match.
Prepare:
1. Split the file into partitions
2. For each partition create hash tables and save them to the disk (row by email, row by zip code)
3. Create bloom filters (partition by email, partition by zip code)
Query:
1. Use the bloom filter to determine the partition
2. Load the hash table into memory (the hash table must fit into memory, so the number of partitions should be calculated correspondingly)
3. Query the hash table
4. Load the row
In addition, we could also try to compress the hash tables on disk to meet the 15 TB space requirement.
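Here is a toy sketch of Sergey's partition-plus-Bloom-filter scheme, shrunk to in-memory dictionaries so it stays readable. In the real solution each partition's hash table would be serialized to disk and only the Bloom filters (plus one table at a time) would live in RAM; the partition count, filter size, and hash count below are made-up numbers.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter: k hashed probes into a bit array; false positives are possible."""

    def __init__(self, num_bits=8192, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, key):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key))


NUM_PARTITIONS = 16  # made up; the real count depends on what fits in 16 GB of RAM

partition_tables = [{} for _ in range(NUM_PARTITIONS)]        # stand-ins for on-disk hash tables
partition_filters = [BloomFilter() for _ in range(NUM_PARTITIONS)]
_row_counter = 0


def add_row(email, row):
    """Prepare: assign rows to partitions by file position, index them, update the filter."""
    global _row_counter
    partition = _row_counter % NUM_PARTITIONS
    _row_counter += 1
    partition_tables[partition][email] = row
    partition_filters[partition].add(email)


def query(email):
    """Query: consult the Bloom filters first, then probe only the candidate partitions."""
    for partition in range(NUM_PARTITIONS):
        if partition_filters[partition].might_contain(email):
            row = partition_tables[partition].get(email)
            if row is not None:
                return row
    return None
```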
@Sergey Shumov
Bloom filters, to be useful and not give too many false positives, need to occupy lots of bytes; there are actually papers suggesting that the formula everyone uses to calculate the size is dead wrong and you need much more.
Hash tables are time efficient, not space efficient, so as soon as you load them into memory you can end up with a very bad distribution that is purely dependent on the data and the hashing function. The hashing function needs to be custom, not built in, to have good distribution, so whatever you pick you need to test it, for example with this: http://code.google.com/p/smhasher/. MurMur is considered one of the best hashes, but v2 has a bug and v3 isn't in .NET, and even if it were you would still need to validate its correctness, otherwise the collision-to-bucket ratio will be catastrophic.
Another thing is to determine what kind of hash table it is going to be (the standard .NET one is rather poor for big data sets). Ideally you need one that is space efficient (you are in RAM now), with a low collision-to-bucket ratio as well as a low bucket-miss ratio. So you could use cuckoo, chained, double chained, weighted, etc. The implementation will not be easily replaceable if the hash or hash table you used isn't the best one, as it represents a large portion of your code.
This is a lot of work for those two requirements :)
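For reference, the textbook Bloom filter sizing formula Bartosz alludes to (and argues understates the real requirement) is m = -n·ln(p)/(ln 2)² bits with k = (m/n)·ln 2 hash functions. Applied to a data set of this size it already produces a filter far larger than the 16 GB of RAM, which is why per-partition filters matter:

```python
import math


def bloom_size(n_items, false_positive_rate):
    """Classic sizing: m bits and k hash functions for n items at false positive rate p."""
    m = -n_items * math.log(false_positive_rate) / (math.log(2) ** 2)
    k = (m / n_items) * math.log(2)
    return math.ceil(m), round(k)


# Roughly 100 billion emails at a 1% false positive rate.
bits, hashes = bloom_size(100_000_000_000, 0.01)
print(bits / 8 / 1024 ** 3, "GiB,", hashes, "hash functions")  # ~112 GiB, 7 hashes
```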
I'm guessing that it's the same idea that Google has with its BigTable (I could be wrong).
I would solve the problem using classical algorithms: merge and binary search. I would also use the List<T>.Sort method.
The preparation phase would create two indexes Email and ZipCode. An index is a file containing a sequence of ordered fixed length records. An index record contains two fields: Key and Offset. Index records are ordered by Key field.
I would build an index in two phases. The first phase reads the CSV file in 1,000,000-record sets, sorts each set in memory and writes each set to a file. The second phase merges the files from the first phase to produce the index.
To search, I just need to binary search in the index to obtain the offset, then read the record in the CSV file at the offset position.
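A sketch of the fixed-length index record and the binary search over it, assuming (arbitrarily) a 64-byte zero-padded key plus an 8-byte offset, and an index file already sorted by the padded key bytes; the two-phase build that produces the sorted index is omitted here.

```python
import struct

KEY_SIZE = 64                        # assumed fixed key width, zero-padded
RECORD_FORMAT = f"<{KEY_SIZE}sq"     # key bytes + 64-bit offset, no padding between fields
RECORD_SIZE = struct.calcsize(RECORD_FORMAT)


def pack_record(key, offset):
    """Build one fixed-length index record (Key, Offset)."""
    return struct.pack(RECORD_FORMAT, key.encode("utf-8"), offset)


def search_index(index_path, key):
    """Binary search over fixed-length records sorted by the zero-padded key bytes."""
    target = key.encode("utf-8").ljust(KEY_SIZE, b"\0")
    with open(index_path, "rb") as f:
        f.seek(0, 2)                          # file size tells us how many records there are
        lo, hi = 0, f.tell() // RECORD_SIZE
        while lo < hi:
            mid = (lo + hi) // 2
            f.seek(mid * RECORD_SIZE)
            record_key, offset = struct.unpack(RECORD_FORMAT, f.read(RECORD_SIZE))
            if record_key == target:
                return offset                 # seek here in the original CSV to read the row
            if record_key < target:
                lo = mid + 1
            else:
                hi = mid
    return None
```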
Can you post the CSV file generator? It sounds like a nice exercise to do.
Catalin
@JesusLopez, this wouldn't work because the CSV file wouldn't fit in memory. The file is 15TB, and you only have 16GB of RAM.
I'll be honest, I went straight to creating an inverted index and was working that out. But then I cheated and read a couple of replies. Specifically where Oren mentioned complexity.
So rethinking it, you could have three files. One sorted by zip code with pointers to the original data file locations. The second would be domains. The third would be emails with the domain stripped, but grouped by domain. We'd be able to look up the domain name first using a binary search, then reduce our search space on the email index and repeat the binary search technique.
To construct the files I'd scan through the original data file, constructing zip code and email files, in original order. Then use an external sort on each file - emails in reverse string order. Once sorted, I could construct the third file of domain names pointing to the ranges in the email file.
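A compressed, in-memory Python sketch of Mark's two-level lookup (domain first, then the local part within that domain's group); on disk these would be the sorted files he describes, searched with the same two binary searches.

```python
from bisect import bisect_left


def build_two_level_index(rows):
    """rows: iterable of (email, offset). Returns a sorted domain list and per-domain groups."""
    by_domain = {}
    for email, offset in rows:
        local, _, domain = email.partition("@")
        by_domain.setdefault(domain, []).append((local, offset))
    domains = sorted(by_domain)
    grouped = {d: sorted(by_domain[d]) for d in domains}   # the "grouped by domain" email file
    return domains, grouped


def lookup(domains, grouped, email):
    local, _, domain = email.partition("@")
    i = bisect_left(domains, domain)              # first binary search: the domain table
    if i == len(domains) or domains[i] != domain:
        return None
    entries = grouped[domain]
    j = bisect_left(entries, (local,))            # second binary search: within the domain group
    if j < len(entries) and entries[j][0] == local:
        return entries[j][1]                      # offset of the row in the original CSV
    return None


domains, grouped = build_two_level_index([("jbutt@gmail.com", 123), ("art@venere.org", 456)])
assert lookup(domains, grouped, "art@venere.org") == 456
assert lookup(domains, grouped, "nobody@nowhere.test") is None
```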
Mark, 15 TB is about 100 billion rows, assuming about 160 bytes per row. It takes 37 lookups to perform a binary search on such an index. Therefore a simple index file with [email, offset] pairs should be searchable in about 4 seconds on a standard HDD, well within the task limits.
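Rafal's back-of-the-envelope numbers check out:

```python
import math

rows = (15 * 10**12) // 160          # 15 TB at roughly 160 bytes per row
print(f"{rows:,} rows")              # 93,750,000,000 -- about 100 billion
print(math.ceil(math.log2(rows)), "binary search probes")   # 37
```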
@macrohard:
I don't load the entire CSV into memory. I load the first million records, sort by key, save to indexfragment.0. Load the second million records, sort by key, save to the indexfragment.1 file. Load the third million records, sort by key, save to the indexfragment.2 file, and so on. I just need enough memory to hold and sort one million records; I can do it with 1 GB of RAM.
Once I have all the index fragment files, I merge them in one single pass. I would use a SortedDictionary with an entry for each index fragment file to implement the merge.
I will not be able to merge the index fragment files in one single pass because I cannot have an unlimited number of open files. Therefore I will need to perform the merge in several rounds.
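A sketch of the multi-round merge Jesús describes, assuming an arbitrary cap on simultaneously open fragment files; heapq.merge stands in for the SortedDictionary he mentions, and each fragment file is assumed to hold one sorted, newline-terminated key per line.

```python
import heapq
import os
import tempfile

MAX_OPEN_FILES = 64   # assumed cap on fragment files merged at once


def merge_once(fragment_paths, output_path):
    """Merge a batch of sorted text files (one key per line) into one sorted file."""
    files = [open(path) for path in fragment_paths]
    try:
        with open(output_path, "w") as out:
            out.writelines(heapq.merge(*files))
    finally:
        for f in files:
            f.close()


def merge_all(fragment_paths):
    """Merge in rounds of at most MAX_OPEN_FILES until a single sorted file remains."""
    while len(fragment_paths) > 1:
        next_round = []
        for i in range(0, len(fragment_paths), MAX_OPEN_FILES):
            batch = fragment_paths[i:i + MAX_OPEN_FILES]
            fd, merged_path = tempfile.mkstemp(suffix=".merged")
            os.close(fd)
            merge_once(batch, merged_path)
            for path in batch:
                os.remove(path)       # a fragment is no longer needed once merged
            next_round.append(merged_path)
        fragment_paths = next_round
    return fragment_paths[0]
```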
On one hand, since this is filed under "architecture", if an interview candidate did not at least mention using a database (not promoting it as the solution, since that was not asked for), I would end the interview early. Not putting big data in time-tested systems explicitly designed for it is almost like stealing money from your employer; an architect should have "feet in the technology but head in the business", and excluding a DB from a real production system is against the business.
Having spewed all that, I think that the answer by Jesús López is great: segment, sort, save, repeat. Using a Stream to read and write makes it "easy" to go between disk and memory. Plus, segmenting as you go also helps mitigate issues with GC and/or large memory allocation problems that will kill the whole .NET runtime when the thing actually runs.
SteveC, we are a company that builds databases. It is not a valid answer to the question that we ask, since we need to provide those facilities.