Fail, fail, fail
Sometimes, reading a candidate's answer is just something that I know is going to piss me off.
We have a question that goes something like this (the actual question is much more detailed):
We have a 15TB CSV file that contains a web log; the entries are sorted by date (since that is the order in which they were written). Find all the log entries within a given date range. You may not read more than 32 MB.
A candidate replied with an answer that included the following code:
string line = string.Empty;
StreamReader file;

try
{
    file = new StreamReader(filename);
}
catch (FileNotFoundException ex)
{
    Console.WriteLine("The file is not found.");
    Console.ReadLine();
    return;
}

while ((line = file.ReadLine()) != null)
{
    var values = line.Split(',');
    DateTime date = Convert.ToDateTime(values[0]);
    if (date.Date >= startDate && date.Date <= endDate)
        output.Add(line);

    // Results size in MB
    double size = (GetObjectSize(output) / 1024f) / 1024f;
    if (size >= 32)
    {
        Console.WriteLine("Results size exceeded 32MB, the search will stop.");
        break;
    }
}
My reply was:
The data file is 15TB in size, if the data is beyond the first 32MB, it won't be found.
The candidate then fixed his code. It now includes:
var lines = File.ReadLines(filename);
Yep, this is on a 15TB file.
Now I’m going to have to lie down for a bit, I am not feeling so good.
Comments
one of my nightmares is seeing my code on your blog with the title "fail"...
@Ayende, how much time did he have to actually do it? Was it a stress-free environment, or did you watch his every step of the way? Was it on-site or off?
Not that I'm defending or judging anyone here, but those factors can heavily influence the output (excluding the fix that he did later, as it's total nonsense).
You'd think he'd at least have put the reader in a using block... or closed the file in some other way... or quit when the search range was exceeded... or yielded or streamed the output... or paid attention to the date format... I could go on all day. Where do you find these folks?
Of course the better solution would be to intelligently find a starting position, but that must be for the "advanced" folks.... lol.
What do you think about a binary search over the file to find the start date, and from there just iterating to the end date?
Bartosz, It was offsite, his machine, full net access. He had several days.
Mike, That is what I expect people to do.
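For anyone wondering what that looks like in practice, here is a minimal sketch of the general shape of such an answer (illustrative only; it is not the candidate's code, it assumes every line starts with a parseable timestamp followed by a comma, and the class and helper names are invented for the example): binary-search the byte offsets to land near the first matching line, re-sync to a line boundary, then stream forward until the end date is passed.

using System;
using System.IO;
using System.Text;

static class LogSearch
{
    // Narrow down the start position with a binary search over byte offsets,
    // then stream forward. Only a handful of 64KB probes plus the matching
    // region are ever read - never the whole 15TB file.
    public static void FindRange(string filename, DateTime startDate, DateTime endDate, TextWriter output)
    {
        using (var fs = new FileStream(filename, FileMode.Open, FileAccess.Read))
        {
            long lo = 0, hi = fs.Length;
            while (hi - lo > 64 * 1024)                  // stop once the window is a single 64KB chunk
            {
                long mid = lo + (hi - lo) / 2;
                DateTime? probe = DateOfFirstFullLineAfter(fs, mid);
                if (probe.HasValue && probe.Value < startDate)
                    lo = mid;                            // the range begins after mid
                else
                    hi = mid;                            // the range begins at or before mid
            }

            fs.Seek(lo, SeekOrigin.Begin);
            using (var reader = new StreamReader(fs, Encoding.UTF8, false, 64 * 1024))
            {
                if (lo != 0) reader.ReadLine();          // discard the possibly partial first line
                string line;
                while ((line = reader.ReadLine()) != null)
                {
                    DateTime date = ParseDate(line);
                    if (date > endDate) break;           // the file is sorted, so we can stop here
                    if (date >= startDate) output.WriteLine(line);
                }
            }
        }
    }

    // Seek to 'offset', discard the partial line, and return the date of the next full line.
    static DateTime? DateOfFirstFullLineAfter(FileStream fs, long offset)
    {
        fs.Seek(offset, SeekOrigin.Begin);
        var reader = new StreamReader(fs, Encoding.UTF8, false, 64 * 1024, leaveOpen: true);
        reader.ReadLine();
        string line = reader.ReadLine();
        return line == null ? (DateTime?)null : ParseDate(line);
    }

    static DateTime ParseDate(string line)
    {
        return DateTime.Parse(line.Substring(0, line.IndexOf(',')));
    }
}

The 64KB figures are arbitrary; the point is that the total I/O is a few probes plus the matching region, comfortably inside the 32MB budget unless the result set itself is bigger than that.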
Well, at least he didn't use File.ReadAllLines ;)
@Ayende, In that case this is a bit shocking. Having a few days (a weekend, maybe) he should have provided at least 3-4 different solutions to the problem (binary search, range index, probabilistic) just for fun, and written all of them in literate style :)
This sort of thing isn't a problem in itself, but how often does it happen? If it's frequent, then it's scary.
@Bartosz,
A stress-free environment? So, when production servers start to "burn", I'm allowed to read 15TB into memory? :))
Joke aside, stress is almost always present in some form sooner or later.
Now, if Ayende is nearby holding an axe, that's a different story.......
Talking about stressful interviews.... http://www.youtube.com/watch?v=rUY8HysBzsE
Bartosz, We tend to get very different results. Either we get stuff like this, where you can see that they don't really understand the problem set, or we get people who solve this in the way I expected. And we sometimes get people who solve this in really crazy innovative ways.
@Ayende, It would be interesting and inspiring to read about these 'crazy innovative ways' too.
I find it amazing that people with this level of skill apply for a job with someone like Ayende! I'm pretty sure I'd have realised that loading a 15TB file into memory would not be the correct answer, but I still don't think I'm anywhere near good enough to work for Ayende (unfortunately!).
Dunning-Kruger effect in action I guess.
Mostly I feel bad for this person's current employer.
Since you are indexing on a date field, maybe it is possible to use a Newton–Raphson search instead of a binary search. It might be faster.
Carsten, The data is sorted by date, so a binary search would probably be fastest. I don't think that you can do better than O(log N) on sorted data.
The real shocker would be to realize how much prod code is actually written like this these days.
@Ayende, apart from binary search, maybe you can guess and so improve results on average.
The file is a web log, so we can guess that the entries are evenly distributed, and instead of jumping to the middle (as binary search does) we can jump closer to the expected position of our search date.
Thoughts?
BTW, that Newton–Raphson is interesting ... http://mozartreina.com/binary-search.html
Perhaps I am missing something here, but your first answer: "The data file is 15TB in size, if the data is beyond the first 32MB, it won't be found." is not correct. What he is doing is gathering all the lines which fall within the specified date range in a list (I guess; the definition of the 'output' variable is somewhere outside the pasted code). Once he gathers more than 32MB of lines he will abort the search (again, I guess that GetObjectSize returns the number of bytes that an object should consume in memory). This means that if there are fewer than 32MB of lines in the log which fall within the specified range, his code will work correctly. Also, if we take a look at your prerequisite "You may not read more than 32 MB" (which I assume was originally stated as "Do not use more than 32MB of memory to read the file"), then his initial code can also be considered valid, since once he has used up more than 32MB to store the results he should stop (difficult to say without seeing how the question was originally stated).
In any case I guess he misinterpreted your question, and he should have asked more questions if he did not fully understand the assignment.
His attempt to fix the issue, however, was terrible.
I'm a software tester and even I thought that binary search should be the best choice... Is the number of lines known? The solution he gave you for searching beyond 32MB was really funny though :))
I agree with Petar. I think interpolation search is the winner to solve this problem.
http://en.wikipedia.org/wiki/Interpolation_search
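For illustration, the only change from a plain binary search over the file is how the probe offset is chosen. A sketch of that choice, assuming the timestamps are spread roughly evenly through the file (lowDate and highDate are the dates read near the current lo and hi offsets; all names here are made up for the example):

// Pick the next probe offset by interpolating on the dates instead of halving.
// Falls back to the midpoint when the window's dates are degenerate.
static long ProbeOffset(long lo, long hi, DateTime lowDate, DateTime highDate, DateTime target)
{
    if (highDate <= lowDate)
        return lo + (hi - lo) / 2;
    double fraction = (target - lowDate).Ticks / (double)(highDate - lowDate).Ticks;
    fraction = Math.Max(0.0, Math.Min(1.0, fraction));   // clamp against skewed data
    return lo + (long)((hi - lo) * fraction);
}

On roughly uniform data this needs O(log log n) probes instead of O(log n), which matters mostly when each probe (a seek into a 15TB file) is expensive.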
I think you're being a little hard on the guy. As Dalibor mentioned, he probably just misinterpreted "You may not read more than 32 MB".
Joe, The spec says a 15 TB file. That alone should tell you that you can't read it all.
I hope you mean no disrespect to this candidate, because this post reads a bit that way. If not, apologies. Yet you can't blame the person for not knowing. Maybe he/she never learned. There's no shame in not knowing; there is in not wanting to learn. I had a similar question in a job interview when I was young. I hadn't learned it in school (or didn't remember) and had never needed it. Now I know. I'm also at a point where I'd ask why you have such a large log file in the first place. And I'd recommend Splunk or some other log management tool.
Peter, If you are applying for a job in my company, you are expected to know how to deal with such things. And yes, I consider such a task a very basic measure of competence.
What's wrong with ReadLines? I ask seriously. It reads the file line by line, with a 1024 kB internal buffer.
I mean - 1024 bytes. And of course there are a lot of other possible optimizations; there is no need to read all the lines. But at least it is not ReadAllLines :)
Konrad, It forces you to read each line. The key problem here is that you are reading a 15 TB file. That is going to take over a day or so.
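To put a rough number on that (assuming something like 150 MB/s of sustained sequential read, which is purely an assumption about the hardware): 15 TB is about 15,000,000 MB, and 15,000,000 MB ÷ 150 MB/s ≈ 100,000 seconds, or roughly 28 hours - regardless of whether File.ReadLines or a raw StreamReader does the reading. A seek-based search touches a few kilobytes per probe instead.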
I'd question the genius who decided to put all the web logs into a single file that was 15TB in size.
Steve, Why? It is efficient, easy to work with, and fast.
@Oren, because it's sysadmin 101 not to have huge log files, so they can be rotated, expired, and moved between servers when needed (or debugged).
You need to chill out or fix your process. If you have no screen before this, then you have no right to be pissed off. It whiffs of intellectual snobbishness. Conversely, if this candidate got past a pre-screen, be angry at whoever pre-screened them. Maybe this guy was just out of college, or maybe he has never worked in a high-performing team. Programming isn't all intellect; it's experience and learning. Don't be pissed at a guy who hasn't had the opportunity to learn. And don't confuse skill with potential. Actually, this post pissed me off. Superiority is not a pleasant or laudable personality trait.
In the interest of learning: can somebody explain to me how you do a binary search when the record length is variable (i.e. every line in the log is not necessarily the same length, is it?)
Are you supposed to just "seek" to the halfway point, step backwards/forwards to the start of the previous/next line, and read the date in order to determine whether the current line is in the date range you're looking for?
Never mind: found the answer here:
http://ayende.com/blog/165124/big-data-search-binary-search-of-textual-data
Note to self: catch up on Ayende's blog before asking questions :-)
1 - Read a single line
2 - Parse the line
3 - Check whether the date is in range - if yes, write the line using StreamWriter
4 - Go to step 1
I think this candidate is just ahead of their time. Give it a couple more years until smart watches come with 16Tb RAM as standard and their proposed solution will make a lot more sense.
Donkeymagic, This is how we pre-screen.
oguzh4n, Great, you have a 15TB file, see you tomorrow.
Oren, although I agree that the solution is certainly not one which wins you a dev position, I would argue that your observations are not correct either.
As someone pointed out, it's not true that "if the data is beyond the first 32MB, it won't be found". He's adding to the output only the lines that match the date range and using that collection to check the size of the output; therefore it will eventually find something and not stop eagerly. Of course it will still stop after accumulating approximately 32MB of results, which is a wrong interpretation of the question, but still.
Second, as far as I am aware, File.ReadLines is no different from pulling lines out of a StreamReader. Not that the proposed fix is good, but it is no worse than the original solution. Maybe you thought he had used ReadAllLines?
This reminds me this question on StackOverflow: http://stackoverflow.com/questions/21099339/read-efficiently-chunks-of-a-very-large-file Sounds familiar? :)
Konrad, You are correct. The nice thing about the question is that even when you point things out to people, you still have to understand the fundamentals to actually answer it.
@Sven: I would have gone for the solution you proposed. Get the size of the file, jump into the middle, re-sync to the next (or previous) line ending, then decide whether to go for the upper or lower section.
The assumption would be that the dates are guaranteed to be linear in the file. Out in the field all sorts of madness can happen (like servers changing their system time and then logging in the past or in the future). If there is not a 100% guarantee that time is linear in the log file, I would introduce some kind of sanity check like "if the last line was from date X && we're travelling back in time && dates from the future appear, then report a problem" (a small sketch of such a check follows below).
BTW, I love the idea of guessing the starting point of the search. I hadn't thought about that, but in practice it could be very worthwhile, especially if every single random access is expensive (like when the file is stored off-site on a tape library or something).
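A minimal sketch of that kind of sanity check, as a wrapper around the forward scan (the parsing delegate and the reporting are assumptions for illustration):

// Yield lines while verifying that timestamps never go backwards, since the
// whole offset-based search relies on the file really being sorted by date.
static IEnumerable<string> ReadLinesCheckingOrder(TextReader reader, Func<string, DateTime> parseDate)
{
    DateTime lastSeen = DateTime.MinValue;
    string line;
    while ((line = reader.ReadLine()) != null)
    {
        DateTime date = parseDate(line);
        if (date < lastSeen)
            Console.Error.WriteLine("Timestamp goes backwards at: " + line);
        lastSeen = date;
        yield return line;
    }
}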
@Dalibor, @Simone
You guys are partially correct: yes, this will look beyond the first 32MB of the file. However, if the requested date range was very wide (say, the start of the file through to the end, given a 15TB file), then this method will indeed fail to look beyond roughly the first 32MB. It would be more correct to say that you won't be able to find more than 32MB of results...