Question 6 is a trap, a very useful one

In my interview questions, I give candidates a list of 6 questions. They can either solve 3 questions from 1 to 5, or they can solve question 6.

Stop for a moment and ponder that. What do you assume is the relative complexity of those questions?

 

 

 

 

 

 

Questions 1 to 5 should take anywhere from 10 or 15 minutes to an hour and a half, max. Question 6 took me about 8 hours to do, although that included some blogging time about it.

Question 6 requires that you create an index for a 15 TB CSV file and allow efficient searching on it.

Questions 1 to 5 are basically gatekeeper questions: if you answer them correctly, we have a high opinion of you and you get an interview. Answering question 6 correctly pretty much says that we are past the "do we want you?" stage and into the "how do we get you?" stage.

But people don’t answer question 6 correctly. In fact, so far, choosing question 6 has pretty much ruled you out, because you end up showing that you don’t understand something pretty fundamental.

Here are a couple of examples from the current crop of candidates. Remember, we are talking about a 15 TB CSV file here, containing about 400 billion records.

Candidate 1’s code looked something like this:

foreach (var line in File.ReadLines("big_data.csv"))
{
    var fields = line.Split(',');
    var email = fields[2];
    File.WriteAllText(Md5(email), line);
}

On the plus side, this doesn’t load the entire data set into memory, and you can sort of do quick searches. Of course, it also generates 400 billion files and takes up more disk space than the original file did. Also, on NTFS you have a maximum of about 4 billion files per volume, and other file systems have similar limitations.

So that isn’t going to work, but at least he had some idea about what is going on.

Candidate 2’s code, however, was:

// prepare
string[] allData = File.ReadAllLines("big_data.csv");
var users = new List<User>();
foreach (var line in allData)
{
    users.Add(User.Parse(line));
}
using (var output = File.Create("big_data.xml"))
{
    new XmlSerializer(typeof(List<User>)).Serialize(output, users);
}

// search by:

var users = (List<User>)new XmlSerializer(typeof(List<User>))
    .Deserialize(File.OpenRead("big_data.xml"));
var matches = users.AsParallel().Where(x => x.Email == "the@email.wtf");

So, take the 15 TB file, load it all into memory (fail #1), convert all 400 billion records to entity instances (fail #2), write it all back out as XML (fails #3, #4, #5), read the entire (greater than) 15 TB XML file back into memory (fail #6), and try to do a parallel brute force search on a dataset of 400 billion records (fails #7 through #400,000,000,000).

So, dear candidates 1 & 2, thank you for making it easy to say, thanks, but no thanks.

Comments

slang, 08/20/2014 09:45 AM

Mocking these people publicly is pretty harsh, but I have no defence for candidate #2, that's hilarious!

noone, 08/20/2014 11:13 AM

chunking through MemoryMappedFile
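For readers unfamiliar with the API, here is a rough sketch of what chunking through MemoryMappedFile could look like in C#. This is only an illustration of the suggestion, not necessarily what the commenter had in mind: map the CSV and walk it one fixed-size window at a time, so only that window is paged in. The 256 MB window size and file name are assumptions, and a record that straddles a window boundary would still need stitching together.

using System;
using System.IO;
using System.IO.MemoryMappedFiles;

class MmapChunking
{
    const long ChunkSize = 256L * 1024 * 1024; // 256 MB window per view (illustrative)

    static void Main()
    {
        long fileLength = new FileInfo("big_data.csv").Length;
        using (var mmf = MemoryMappedFile.CreateFromFile(
            "big_data.csv", FileMode.Open, null, 0, MemoryMappedFileAccess.Read))
        {
            for (long offset = 0; offset < fileLength; offset += ChunkSize)
            {
                long size = Math.Min(ChunkSize, fileLength - offset);
                using (var view = mmf.CreateViewStream(offset, size, MemoryMappedFileAccess.Read))
                using (var reader = new StreamReader(view))
                {
                    string line;
                    while ((line = reader.ReadLine()) != null)
                    {
                        // index-building work would go here; note that a record
                        // straddling the window boundary gets cut in half and
                        // would need to be stitched with the next window
                    }
                }
            }
        }
    }
}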

Max, 08/20/2014 09:56 AM

Where do you get these guys? Do they not know who you are and the kind of questions you'll ask?

One of the first questions in the interview should be "what do you know about us?" and surely these candidates have no idea about your company.

Jason, 08/20/2014 11:49 AM

Pretty sure you could do this with just tee, sed, and sort (or parallel). It would be my first attempt before even attempting to write any C#.

I don't have 15TB of disk space free to test :)

Ayende Rahien, 08/20/2014 12:16 PM

Jason, No, you can't do it like that. Sort requires access to the entire data set to actually be able to sort.

Jason, 08/20/2014 12:19 PM

Sort (BASH) can sort very large files. parallel (same but parallelised) can do the same thing faster.

They both use very little memory as they create temp files to /tmp.

Whether they can sort a 15TB file I don't know.... but certainly 100s of GBs

Ayende Rahien, 08/20/2014 01:09 PM

Jason, I didn't know about that, very interesting, thanks.

Petar Repac, 08/20/2014 01:14 PM

File.WriteAllText - If the target file already exists, it is overwritten.

So, Code 1 will lose data if some emails repeat in the file.

Ben Fulton, 08/20/2014 01:25 PM

The problem appears to be a tradeoff between disk space, memory, and speed. Do you place limitations on any of those, or have a preference for which resource is minimized?

Ayende Rahien, 08/20/2014 01:27 PM

Ben, In the question, I specify what machine this is running on, and then let the candidate try it out.

isaiah, 08/20/2014 02:38 PM

How about this: sort the file using a disk-based merge sort, then memory map the file and do a binary search for the required key.

Ayende Rahien, 08/20/2014 02:39 PM

Isaiah, Great, that is the expected solution. You can even skip the mmap, you don't need that.
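For readers following along, here is a minimal sketch of that approach in C# (an illustration, not the actual exam answer). It assumes the 15 TB file has already been sorted by email using an external, disk-based merge sort, and that the email is the third CSV field, as in the candidates' snippets above; the file name and helper names are made up. Because records are variable length, every probe seeks to a byte offset and skips ahead to the next newline before comparing a full line.

using System;
using System.IO;
using System.Text;

static class SortedCsvSearch
{
    // Finds the line whose third field equals the given email, or null if absent.
    public static string FindByEmail(string path, string email)
    {
        using (var file = File.OpenRead(path))
        {
            long lo = 0, hi = file.Length;
            while (lo < hi)
            {
                long mid = lo + (hi - lo) / 2;
                string line = FirstFullLineAfter(file, mid);
                // No record after mid, or its key is >= the target:
                // the answer (if any) starts at or before mid.
                if (line == null || string.CompareOrdinal(KeyOf(line), email) >= 0)
                    hi = mid;
                else
                    lo = mid + 1;
            }
            string candidate = FirstFullLineAfter(file, lo);
            return candidate != null && KeyOf(candidate) == email ? candidate : null;
        }
    }

    static string KeyOf(string line)
    {
        return line.Split(',')[2]; // email assumed to be the third field
    }

    // Returns the first complete line that starts at or after the given offset.
    static string FirstFullLineAfter(FileStream file, long offset)
    {
        file.Position = offset;
        if (offset > 0)
            ReadLine(file); // discard the partial record we landed in
        return ReadLine(file);
    }

    // Byte-by-byte line reader (assumes ASCII); good enough for a sketch.
    static string ReadLine(FileStream file)
    {
        var sb = new StringBuilder();
        int b;
        while ((b = file.ReadByte()) != -1 && b != '\n')
            sb.Append((char)b);
        return (b == -1 && sb.Length == 0) ? null : sb.ToString().TrimEnd('\r');
    }
}

With the file sorted, each lookup costs O(log n) line reads, on the order of 40 seeks for 400 billion records, instead of a scan over 15 TB.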

Mandawah, 08/20/2014 02:45 PM

Do you mean "var email = fields[2]" (candidate #1 line 4)?

Ayende Rahien, 08/20/2014 02:46 PM

Mandawah, yes, sorry about that.

Mandawah, 08/20/2014 03:10 PM

Thanks. I am so anxious about missing the whole point ;-)

Mike, 08/20/2014 05:14 PM

When I hire developers I find that the old trusty Fizz Buzz test eliminates about 90% of our candidates. See http://blog.codinghorror.com/why-cant-programmers-program/.

Ayende Rahien, 08/21/2014 05:55 AM

Mike, FizzBuzz is great, but that just gets you into the "can or cannot" program. I'm actually interested in "can program REALLY well"

Giedrius, 08/21/2014 01:46 PM

For me, instead of questions there are only empty paragraphs. Is this interview test taken from the movie "Exam"? :)

Kijana Woodard, 08/21/2014 11:43 PM

The interesting thing to note is that you are on your own for these exercises. Full access to the internet.

Google. It works.

Ade, 08/22/2014 03:45 PM

Do the candidates get to see the questions before they make their decision?

Ayende Rahien, 08/22/2014 08:25 PM

Ade, They have all the questions, yes.

InuYasha, 08/28/2014 08:02 AM

Empty lines here too. :(

But I'd fail any candidate who'd try writing a line in C#.

Simon, 11/15/2014 08:22 AM

Push the thing into a serious DBMS and let it do the job? Hmm... I suspect that wasn't the expected answer...