Ayende @ Rahien

My name is Oren Eini
Founder of Hibernating Rhinos LTD and RavenDB.
You can reach me by phone or email:


+972 52-548-6969


Posts: 6,125 | Comments: 45,488


Question 6 is a trap, a very useful one

time to read 3 min | 599 words

In my interview questions, I give candidates a list of 6 questions. They can either solve 3 questions from 1 to 5, or they can solve question 6.

Stop for a moment and ponder that. What do you assume is the relative complexity of those questions?







Questions 1–5 should take anywhere between 10–15 minutes and an hour and a half, max. Question 6 took me about 8 hours to do, although that included some blogging time about it.

Question 6 requires that you create an index for a 15 TB CSV file and allow efficient searching on it.

Questions 1–5 are basically gatekeeper questions: if you answer them correctly, we have a high opinion of you and you get an interview. Answering question 6 correctly pretty much says that we are past the “do we want you?” stage and into the “how do we get you?” stage.

But people don’t answer question 6 correctly. In fact, by this point, if you pick question 6, you have pretty much ruled yourself out, because you are going to show that you don’t understand something pretty fundamental.

Here are a couple of examples from the current crop of candidates. Remember, we are talking about a 15 TB CSV file here, containing about 400 billion records.

Candidate 1’s code looked something like this:

foreach(var line in File.EnumerateAllLines("big_data.csv"))
{
    var fields = line.Split(',');
    var email = line[2];
    File.WriteAllText(Md5(email), line);
}

On the plus side, this doesn’t load the entire data set into memory, and you can sort of do quick searches. Of course, it also generates 400 billion files, whose contents take up more space than the original file itself. And on NTFS you have a maximum of about 4 billion files per volume; other file systems have similar limitations.

So that isn’t going to work, but at least he had some idea about what is going on.

Candidate 2’s code, however, was:

// prepare
string[] allData = File.ReadAllLines("big_data.csv");
var users = new List<User>();
foreach(var line in allData)
    users.Add(ParseUser(line));
new XmlSerializer(typeof(List<User>)).Serialize(File.Create("big_data.xml"), users);

// search by:

var users = new XmlSerializer(typeof(List<User>)).Deserialize(File.OpenRead("big_data.xml")) as List<User>;
users.AsParallel().Where(x => x.Email == "the@email.wtf");

So: take the 15 TB file and load it all into memory (fail #1), convert all 400 billion records to entity instances (fail #2), write it all back out as XML (fails #3, #4, #5), read the entire now-greater-than-15 TB XML file back into memory (fail #6), then try a parallel brute force search over a dataset of 400 billion records (fails #7 – #400,000,000,000).

So, dear candidates 1 & 2, thank you for making it easy to say, thanks, but no thanks.



Mocking these people publicly is pretty harsh, but I have no defence for candidate #2, that's hilarious!


chunking through MemoryMappedFile


Where do you get these guys? Do they not know who you are and the kind of questions you'll ask?

One of the first questions on the interview should be "what do you know about us?" and surely these candidates have no idea about your company.


Pretty sure you could do this with just tee, sed, and sort (or parallel). It would be my first attempt before even attempting to write any C#.

I don't have 15TB of disk space free to test :)

Ayende Rahien

Jason, No, you can't do it like that. Sort requires access to the entire data set to actually be able to sort.


Sort (BASH) can sort very large files. parallel (same but parallelised) can do the same thing faster.

They both use very little memory, as they spill to temp files in /tmp.

Whether they can sort a 15TB file I don't know.... but certainly 100s of GBs

Ayende Rahien

Jason, I didn't know about that, very interesting, thanks.

Petar Repac

File.WriteAllText - If the target file already exists, it is overwritten.

So, Code 1 will lose data if some emails repeat in the file.
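Petar's point is easy to demonstrate. The snippet below is my illustration, not the candidate's actual code; the file name and records are made up, and stand in for two CSV lines whose emails hash to the same file name:

```csharp
using System;
using System.IO;

// File.WriteAllText silently replaces an existing file, so two records
// that map to the same hash-derived file name leave only the last one
// on disk. Path and record contents here are invented for illustration.
var path = Path.Combine(Path.GetTempPath(), "0cc175b9c0f1b6a831c399e269772661");
File.WriteAllText(path, "1,Alice,dup@example.com");
File.WriteAllText(path, "2,Alicia,dup@example.com"); // overwrites record 1
Console.WriteLine(File.ReadAllText(path)); // only the second record survives
```

To keep both records, the writer would have to append (e.g. `File.AppendAllText`) rather than overwrite.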

Ben Fulton

The problem appears to be a tradeoff between disk space, memory, and speed. Do you place limitations on any of those, or have a preference for which resource is minimized?

Ayende Rahien

Ben, In the question, I specify what machine this is running on, then let the candidate work from there.


How about this: sort the file using a disk-based merge sort, then memory-map the sorted file and do a binary search for the required key.

Ayende Rahien

Isaiah, Great, that is the expected solution. You can even skip the mmap, you don't need that.
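Isaiah's approach can be sketched in C#. This is my illustration, not Ayende's reference solution: the `CsvIndexer` class, the chunk size, and the assumption that the email is the third comma-separated field are all invented, and a real 15 TB run would need multi-pass merging with a heap, tuned buffer sizes, and care with quoted commas in the CSV.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

static class CsvIndexer
{
    // Assumption: the email is the third CSV field, no quoted commas.
    static string Key(string line) => line.Split(',')[2];

    // Phase 1: stream the file (never fully in memory), sort fixed-size
    // chunks in RAM, and write each chunk out as a sorted "run".
    public static List<string> SortChunks(string csvPath, int linesPerChunk)
    {
        var runs = new List<string>();
        var chunk = new List<string>(linesPerChunk);
        foreach (var line in File.ReadLines(csvPath)) // lazy, line by line
        {
            chunk.Add(line);
            if (chunk.Count == linesPerChunk)
                runs.Add(WriteRun(chunk));
        }
        if (chunk.Count > 0)
            runs.Add(WriteRun(chunk));
        return runs;
    }

    static string WriteRun(List<string> chunk)
    {
        chunk.Sort((a, b) => string.CompareOrdinal(Key(a), Key(b)));
        var run = Path.GetTempFileName();
        File.WriteAllLines(run, chunk);
        chunk.Clear();
        return run;
    }

    // Phase 2: k-way merge of the sorted runs into one sorted file.
    public static void Merge(List<string> runs, string sortedPath)
    {
        var readers = runs.Select(r => new StreamReader(r)).ToList();
        var heads = readers.Select(r => r.ReadLine()).ToList();
        using (var output = new StreamWriter(sortedPath))
        {
            while (true)
            {
                int min = -1; // index of the smallest pending line
                for (int i = 0; i < heads.Count; i++)
                    if (heads[i] != null && (min == -1 ||
                        string.CompareOrdinal(Key(heads[i]), Key(heads[min])) < 0))
                        min = i;
                if (min == -1)
                    break; // all runs exhausted
                output.WriteLine(heads[min]);
                heads[min] = readers[min].ReadLine();
            }
        }
        readers.ForEach(r => r.Dispose());
    }

    // Phase 3: binary search by byte offset. Seek to the middle, discard
    // the partial line we landed in, and compare the next full line.
    public static string Search(string sortedPath, string email)
    {
        using var stream = File.OpenRead(sortedPath);
        long lo = 0, hi = stream.Length;
        while (hi - lo > 4096) // O(log n) seeks down to a small window
        {
            long mid = (lo + hi) / 2;
            var line = FirstLineAfter(stream, mid);
            if (line != null && string.CompareOrdinal(Key(line), email) < 0)
                lo = mid; // everything up to this line sorts too low
            else
                hi = mid;
        }
        // Finish with a short sequential scan over the remaining window.
        stream.Position = lo;
        var reader = new StreamReader(stream, leaveOpen: true);
        if (lo > 0)
            reader.ReadLine(); // discard the partial line we landed in
        string candidate;
        while ((candidate = reader.ReadLine()) != null)
        {
            int cmp = string.CompareOrdinal(Key(candidate), email);
            if (cmp == 0) return candidate;
            if (cmp > 0) break; // sorted, so we have passed it
        }
        return null;
    }

    static string FirstLineAfter(FileStream stream, long offset)
    {
        stream.Position = offset;
        var reader = new StreamReader(stream, leaveOpen: true);
        if (offset > 0)
            reader.ReadLine(); // we probably landed mid-line
        return reader.ReadLine();
    }
}
```

The key property is that nothing here is proportional to the 15 TB input except disk usage: the chunk sort is bounded by `linesPerChunk`, the merge holds one line per run, and the search touches O(log n) disk pages.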


Do you mean "var email = fields[2]" (candidate #1 line 4)?

Ayende Rahien

Mandawah, yes, sorry about that.


Thanks. I am so anxious about missing the whole point ;-)


When I hire developers I find that the old trusty Fizz Buzz test eliminates about 90% of our candidates. See http://blog.codinghorror.com/why-cant-programmers-program/.

Ayende Rahien

Mike, FizzBuzz is great, but that just gets you into the "can or cannot" program. I'm actually interested in "can program REALLY well"


For me, instead of questions there are only empty paragraphs. Is this interview test taken from the movie "Exam"? :)

Kijana Woodard

The interesting thing to note is that you are on your own for these exercises. Full access to the internet.

Google. It works.


Do the candidates get to see the questions before they make their decision?

Ayende Rahien

Ade, They have all the questions, yes.


Empty lines here too. :(

But I'd fail any candidate who'd try writing a line in C#.


Push the thing into a serious DBMS and let it do the job? Hmm... I suspect that wasn't the expected answer...


Comments have been closed on this topic.

