Big Data Search

time to read 4 min | 773 words

I got tired of the old questions that we were asking candidates, so I decided to add a new one. This one is usually something that we’ll give the candidates to do at home, at their leisure. Let us imagine the following file:

   1: "first_name","last_name","company_name","address","city","county","state","zip","phone1","phone2","email","web"
   2: "James","Butt","Benton, John B Jr","6649 N Blue Gum St","New Orleans","Orleans","LA",70116,"504-621-8927","504-845-1427","jbutt@gmail.com","http://www.bentonjohnbjr.com"
   3: "Josephine","Darakjy","Chanay, Jeffrey A Esq","4 B Blue Ridge Blvd","Brighton","Livingston","MI",48116,"810-292-9388","810-374-9840","josephine_darakjy@darakjy.org","http://www.chanayjeffreyaesq.com"
   4: "Art","Venere","Chemel, James L Cpa","8 W Cerritos Ave #54","Bridgeport","Gloucester","NJ","08014","856-636-8749","856-264-4130","art@venere.org","http://www.chemeljameslcpa.com"
   5: "Lenna","Paprocki","Feltz Printing Service","639 Main St","Anchorage","Anchorage","AK",99501,"907-385-4412","907-921-2010","lpaprocki@hotmail.com",http://www.feltzprintingservice.com

As you can see, this is a pretty trivial CSV file. However, let assume that it is a small example of a CSV file that is 15 TB in size. The requirement is to be able to query on that file. We need to be able to query by email or all the people with in a particular zip code. Because of the size, the solution can be composed of two parts, a prepare part (which can run for as long as it is needed) and answer to queries part. Maximum time to answer any query must be under 30 seconds.

  • You can assume that the file never changes, and that once the prepare part is done, it will never need to be run again.
  • The answer to a query is the full CSV row.
  • You can assume a machine with a single machine 100TB disk, 16 GB RAM and 8 CPU cores.
  • The solution cannot use any existing databases.
  • The solution needs to include explanation of the various options that were available and why this specific solution was chosen.
  • After the prepare phase is done, the solution has to take less than 30TB of data (including the original file).
  • The solution should be easy to apply to different CSV file.

I decided that it wouldn’t be fair to ask candidates to do something like that without doing it myself. Mostly because the fact that I have a good idea about how to do something doesn’t meant that I understand the actual implementation issues that might pop up.

I actually gave myself a somewhat harder task, do the above mention task, but do it without access to any library other than the BCL and do so with a minimal amount of memory usage. The entire thing took less than a day, and it solves the problem quite a bit more efficiently than I actually anticipated.

But I’ll discuss the details of this in my next post.