Ayende @ Rahien

My name is Oren Eini
Founder of Hibernating Rhinos LTD and RavenDB.
You can reach me by phone or email:


+972 52-548-6969


Posts: 6,125 | Comments: 45,488


Question 6 is a trap, a very useful one

time to read 3 min | 599 words

In my interview questions, I give candidates a list of 6 questions. They can either solve 3 questions from 1 to 5, or they can solve question 6.

Stop for a moment and ponder that. What do you assume is the relative complexity of those questions?







Questions 1–5 should take anywhere between 10–15 minutes and an hour and a half, max. Question 6 took me about 8 hours to do, although that included some blogging time about it.

Question 6 requires that you create an index for a 15 TB CSV file and allow efficient searching on it.

Questions 1–5 are basically gatekeeper questions: if you answer them correctly, we have a high opinion of you and you get an interview. Answering question 6 correctly pretty much says that we are past the “do we want you?” stage and into the “how do we get you?” stage.

But people don’t answer question 6 correctly. In fact, by this point, if you pick question 6, you have pretty much ruled yourself out, because you are going to show that you don’t understand something pretty fundamental.

Here are a couple of examples from the current crop of candidates. Remember, we are talking about a 15 TB CSV file here, containing about 400 billion records.

Candidate 1’s code looked something like this:

foreach(var line in File.EnumerateAllLines("big_data.csv"))
{
    var fields = line.Split(',');
    var email = line[2];
    File.WriteAllText(Md5(email), line);
}

On the plus side, this doesn’t load the entire data set into memory, and you can sort of do quick searches. Of course, it also generates 400 billion files, whose contents take up more space than the original file itself. And on NTFS you have a maximum of about 4 billion files per volume; other file systems have similar limitations.

So that isn’t going to work, but at least he had some idea about what is going on.

Candidate 2’s code, however, was:

// prepare
string[] allData = File.ReadAllLines("big_data.csv");
var users = new List<User>();
foreach(var line in allData)
    users.Add(ParseUser(line));
new XmlSerializer(typeof(List<User>)).Serialize(File.Create("big_data.xml"), users);

// search by:

var users = new XmlSerializer(typeof(List<User>)).Deserialize(File.OpenRead("big_data.xml")) as List<User>;
users.AsParallel().Where(x => x.Email == "the@email.wtf");

So: take the 15 TB file and load it all into memory (fail #1), convert all 400 billion records to entity instances (fail #2), write it all back out as XML (fails #3, #4, #5), read the entire now-greater-than-15 TB XML file back into memory (fail #6), then try a parallel brute force search over a dataset of 400 billion records (fails #7 – #400,000,000,000).

So, dear candidates 1 & 2, thank you for making it easy to say, thanks, but no thanks.



Mocking these people publicly is pretty harsh, but I have no defence for candidate #2, that's hilarious!


chunking through MemoryMappedFile


Where do you get these guys? Do they not know who you are and the kind of questions you'll ask?

One of the first questions on the interview should be "what do you know about us?" and surely these candidates have no idea about your company.


Pretty sure you could do this with just tee, sed, and sort (or parallel). It would be my first attempt before even attempting to write any C#.

I don't have 15TB of disk space free to test :)

Ayende Rahien

Jason, No, you can't do it like that. Sort requires access to the entire data set to actually be able to sort.


Sort (BASH) can sort very large files. parallel (same but parallelised) can do the same thing faster.

They both use very little memory, as they spill to temp files in /tmp.

Whether they can sort a 15TB file I don't know.... but certainly 100s of GBs

Ayende Rahien

Jason, I didn't know about that, very interesting, thanks.

Petar Repac

File.WriteAllText - If the target file already exists, it is overwritten.

So, Code 1 will lose data if some emails repeat in the file.
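Petar's point is easy to demonstrate. The snippet below is my illustration, not the candidate's actual code; the file name and records are made up, and stand in for two CSV lines whose emails hash to the same file name:

```csharp
using System;
using System.IO;

// File.WriteAllText silently replaces an existing file, so two records
// that map to the same hash-derived file name leave only the last one
// on disk. Path and record contents here are invented for illustration.
var path = Path.Combine(Path.GetTempPath(), "0cc175b9c0f1b6a831c399e269772661");
File.WriteAllText(path, "1,Alice,dup@example.com");
File.WriteAllText(path, "2,Alicia,dup@example.com"); // overwrites record 1
Console.WriteLine(File.ReadAllText(path)); // only the second record survives
```

To keep both records, the writer would have to append (e.g. `File.AppendAllText`) rather than overwrite.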

Ben Fulton

The problem appears to be a tradeoff between disk space, memory, and speed. Do you place limitations on any of those, or have a preference for which resource is minimized?

Ayende Rahien

Ben, In the question, I specify what machine this is running on, then let the candidate work from there.


How about this: sort the file using a disk-based merge sort, then memory-map the sorted file and do a binary search for the required key.

Ayende Rahien

Isaiah, Great, that is the expected solution. You can even skip the mmap, you don't need that.
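Isaiah's approach can be sketched in C#. This is my illustration, not Ayende's reference solution: the `CsvIndexer` class, the chunk size, and the assumption that the email is the third comma-separated field are all invented, and a real 15 TB run would need multi-pass merging with a heap, tuned buffer sizes, and care with quoted commas in the CSV.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

static class CsvIndexer
{
    // Assumption: the email is the third CSV field, no quoted commas.
    static string Key(string line) => line.Split(',')[2];

    // Phase 1: stream the file (never fully in memory), sort fixed-size
    // chunks in RAM, and write each chunk out as a sorted "run".
    public static List<string> SortChunks(string csvPath, int linesPerChunk)
    {
        var runs = new List<string>();
        var chunk = new List<string>(linesPerChunk);
        foreach (var line in File.ReadLines(csvPath)) // lazy, line by line
        {
            chunk.Add(line);
            if (chunk.Count == linesPerChunk)
                runs.Add(WriteRun(chunk));
        }
        if (chunk.Count > 0)
            runs.Add(WriteRun(chunk));
        return runs;
    }

    static string WriteRun(List<string> chunk)
    {
        chunk.Sort((a, b) => string.CompareOrdinal(Key(a), Key(b)));
        var run = Path.GetTempFileName();
        File.WriteAllLines(run, chunk);
        chunk.Clear();
        return run;
    }

    // Phase 2: k-way merge of the sorted runs into one sorted file.
    public static void Merge(List<string> runs, string sortedPath)
    {
        var readers = runs.Select(r => new StreamReader(r)).ToList();
        var heads = readers.Select(r => r.ReadLine()).ToList();
        using (var output = new StreamWriter(sortedPath))
        {
            while (true)
            {
                int min = -1; // index of the smallest pending line
                for (int i = 0; i < heads.Count; i++)
                    if (heads[i] != null && (min == -1 ||
                        string.CompareOrdinal(Key(heads[i]), Key(heads[min])) < 0))
                        min = i;
                if (min == -1)
                    break; // all runs exhausted
                output.WriteLine(heads[min]);
                heads[min] = readers[min].ReadLine();
            }
        }
        readers.ForEach(r => r.Dispose());
    }

    // Phase 3: binary search by byte offset. Seek to the middle, discard
    // the partial line we landed in, and compare the next full line.
    public static string Search(string sortedPath, string email)
    {
        using var stream = File.OpenRead(sortedPath);
        long lo = 0, hi = stream.Length;
        while (hi - lo > 4096) // O(log n) seeks down to a small window
        {
            long mid = (lo + hi) / 2;
            var line = FirstLineAfter(stream, mid);
            if (line != null && string.CompareOrdinal(Key(line), email) < 0)
                lo = mid; // everything up to this line sorts too low
            else
                hi = mid;
        }
        // Finish with a short sequential scan over the remaining window.
        stream.Position = lo;
        var reader = new StreamReader(stream, leaveOpen: true);
        if (lo > 0)
            reader.ReadLine(); // discard the partial line we landed in
        string candidate;
        while ((candidate = reader.ReadLine()) != null)
        {
            int cmp = string.CompareOrdinal(Key(candidate), email);
            if (cmp == 0) return candidate;
            if (cmp > 0) break; // sorted, so we have passed it
        }
        return null;
    }

    static string FirstLineAfter(FileStream stream, long offset)
    {
        stream.Position = offset;
        var reader = new StreamReader(stream, leaveOpen: true);
        if (offset > 0)
            reader.ReadLine(); // we probably landed mid-line
        return reader.ReadLine();
    }
}
```

The key property is that nothing here is proportional to the 15 TB input except disk usage: the chunk sort is bounded by `linesPerChunk`, the merge holds one line per run, and the search touches O(log n) disk pages.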


Do you mean "var email = fields[2]" (candidate #1 line 4)?

Ayende Rahien

Mandawah, yes, sorry about that.


Thanks. I am so anxious about missing the whole point ;-)


When I hire developers I find that the old trusty Fizz Buzz test eliminates about 90% of our candidates. See http://blog.codinghorror.com/why-cant-programmers-program/.

Ayende Rahien

Mike, FizzBuzz is great, but that just gets you into the "can or cannot" program. I'm actually interested in "can program REALLY well"


For me, instead of questions there are only empty paragraphs. Is this interview test taken from the movie "Exam"? :)

Kijana Woodard

The interesting thing to note is that you are on your own for these exercises. Full access to the internet.

Google. It works.


Do the candidates get to see the questions before they make their decision?

Ayende Rahien

Ade, They have all the questions, yes.


Empty lines here too. :(

But I'd fail any candidate who'd try writing a line in C#.


Push the thing into a serious DBMS and let it do the job? Hmm... I suspect that wasn't the expected answer...


Comments have been closed on this topic.

