Ayende @ Rahien

My name is Oren Eini
Founder of Hibernating Rhinos LTD and RavenDB.

Fail, fail, fail

time to read 9 min | 1737 words

Sometimes, reading a candidate's answer is just something that I know is going to piss me off.

We have a question that goes something like this (the actual question is much more detailed):

We have a 15TB CSV file that contains web logs; the entries are sorted by date (since this is how they were entered). Find all the log entries within a given date range. You may not read more than 32 MB.

A candidate replied with an answer that had the following code:

    string line = string.Empty;
    StreamReader file;
    try
    {
        file = new StreamReader(filename);
    }
    catch (FileNotFoundException ex)
    {
        Console.WriteLine("The file is not found.");
        Console.ReadLine();
        return;
    }
    while ((line = file.ReadLine()) != null)
    {
        var values = line.Split(',');
        DateTime date = Convert.ToDateTime(values[0]);
        if (date.Date >= startDate && date.Date <= endDate)
            output.Add(line);

        // Results size in MB
        double size = (GetObjectSize(output) / 1024f) / 1024f;
        if (size >= 32)
        {
            Console.WriteLine("Results size exceeded 32MB, the search will stop.");
            break;
        }
    }

My reply was:

The data file is 15TB in size, if the data is beyond the first 32MB, it won't be found.

The candidate then fixed his code. It now includes:

    var lines = File.ReadLines(filename);

Yep, this is on a 15TB file.

Now I’m going to have to lie down for a bit, I am not feeling so good.


hilton smith

one of my nightmares is seeing my code on your blog with the title "fail"...

Bartosz Adamczewski

@Ayende, how much time did he have to actually do it? Was it a stress-free environment, or did you watch his every step of the way? Was it on-site or off?

Not that I'm defending or judging anyone here, but those factors can heavily influence the output (excluding the fix that he did later, as it's total nonsense).

Matt Johnson

You'd think he'd at least have put the reader in a using block... or closed the file in some other way... or quit when the search range was exceeded... or yielded or streamed the output... or paid attention to the date format... I could go on all day. Where do you find these folks?

Of course the better solution would be to intelligently find a starting position, but that must be for the "advanced" folks.... lol.


Mike

What do you think about a binary search over the file to find the start date, and from there just iterate to the end date?

Ayende Rahien

Bartosz, It was offsite, his machine, full net access. He had several days.

Ayende Rahien

Mike, That is what I expect people to do.

Thomas Levesque

Well, at least he didn't use File.ReadAllLines ;)

Bartosz Adamczewski

@Ayende, In that case this is a bit shocking. Having a few days (a weekend, maybe), he should provide at least 3-4 different solutions to the problem (binsearch, range index, probabilistic) just for fun, and write all of them in literate style :)

This sort of thing isn't the problem in itself, but how often does such a thing happen? If it's frequent, then it's scary.

Petar Repac


A stress-free environment? So, when production servers start to "burn" I'm allowed to read 15TB into memory? :))

Joke aside, stress is almost always present in some form sooner or later.

Now, if Ayende is nearby holding an axe, that's a different story.......

Talking about stressful interviews.... http://www.youtube.com/watch?v=rUY8HysBzsE

Ayende Rahien

Bartosz, We tend to get very different results. Either we get stuff like this, where you can see that they don't really understand the problem set, or we get people who solve this in the way I expected. And we sometimes get people who solve this in really crazy innovative ways.


@Ayende, It would be interesting and inspiring to read about this 'crazy innovative ways' too.


I find it amazing that people with this level of skill apply for a job with someone like Ayende! I'm pretty sure I'd have realised that loading a 15TB file into memory would not be the correct answer, but I still don't think I'm anywhere near good enough to work for Ayende (unfortunately!).

Dunning-Kruger effect in action I guess.


Mostly I feel bad for this person's current employer.

Carsten Hansen

Since you are indexing on a date field, maybe it is possible to use a Newton–Raphson search instead of a binary search. It might be faster.

Ayende Rahien

Carsten, The data is sorted by date, a binary search would probably be fastest. I don't think that you can do better on sorted data than O(log N)

Pavel Dvorak

The real shocker would be to realize how much prod code is actually written like this these days.

Petar Repac

@Ayende, apart from binary search, maybe you can guess and so improve results on average.

The file is a web log, so we can guess that entries are evenly distributed and instead of jumping in the middle (as binary search does) we can jump closer to expected position of our search date.

Thoughts ?

BTW, that Newton–Raphson is interesting ... http://mozartreina.com/binary-search.html
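Petar's "guess the position instead of bisecting" idea is interpolation search. A minimal sketch of the position estimate, assuming log entries are roughly evenly distributed over time (the helper name and that assumption are mine, not from the thread):

```python
from datetime import date

def estimate_offset(target: date, first: date, last: date, size: int) -> int:
    """Estimate the byte offset of `target` by linear interpolation
    between the first and last dates in the file."""
    span = (last - first).days
    if span <= 0:
        return 0
    fraction = (target - first).days / span
    # Clamp so out-of-range targets map to the file's edges.
    return int(max(0.0, min(1.0, fraction)) * size)
```

A full interpolation search re-probes from the estimate and narrows the window from there; on uniformly distributed keys it averages O(log log n) probes versus binary search's O(log n), though it can degrade badly when the distribution is skewed.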

Dalibor Carapic

Perhaps I am missing something here, but your first answer: "The data file is 15TB in size, if the data is beyond the first 32MB, it won't be found." is not correct. What he is doing is gathering all lines which fall within the specific date range in a list (I guess; the definition of the 'output' variable is somewhere outside the pasted code). Once he gathers more than 32MB of lines he will abort the search (again, I guess that the GetObjectSize method returns the number of bytes that an object should consume in memory). This means that if there are less than 32MB of lines in the log which fall within the specified range, his code will work correctly. Also, if we look at your prerequisite "You may not read more than 32 MB" (which I assume was originally stated as "Do not use more than 32MB of memory to read the file"), then his initial code can also be considered valid, since once he used up more than 32MB to store the results he should stop (difficult to say without seeing how the question was originally stated).

In any case, I guess he misinterpreted your question, and he should have really asked more questions if he did not fully understand the assignment.

His attempt to fix the issue, however, was terrible.


I'm a software tester, and even I thought that binary search should be the best choice... Is the number of lines known? The solution he gave you for searching more than 32MB was really funny though :))

Jesús López

I agree with Petar. I think interpolation search is the winner to solve this problem.


Joe King

I think you're being a little hard on the guy. As Dalibor mentioned, he probably just misinterpreted "You may not read more than 32 MB".

Ayende Rahien

Joe, The spec says a 15 TB file. That alone should tell you that you can't read it all.

Peter Morlion

I hope you mean no disrespect to this candidate, because this post seems a bit that way. If not, apologies. Yet you can't blame the person for not knowing. Maybe he/she never learned. There's no shame in not knowing; there is in not wanting to learn. I had a similar question in a job interview when I was young. I hadn't learned it in school (or didn't remember) and had never needed it. Now I know. I'm also at a point where I'd ask why you have such a large log file in the first place. And I'd recommend Splunk or some other log management tool.

Ayende Rahien

Peter, If you are applying for a job in my company, you are expected to know how to deal with such things. And yes, I consider such a task a very basic measure of competence

Konrad Kokosa

What's wrong with ReadLines? I ask seriously. It reads the file line by line, with a 1024 kB internal buffer.

Konrad Kokosa

I mean - 1024 bytes. And of course there are a lot of other optimizations; there is no need to read all lines. But at least it is not ReadAllLines :)

Ayende Rahien

Konrad, It forces you to read each line. The key problem here is that you are reading a 15 TB file. That is going to take over a day or so.
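The "over a day" figure checks out with back-of-envelope arithmetic, assuming something like 150 MB/s of sequential read throughput (the throughput number is my assumption, not from the post):

```python
bytes_total = 15 * 10**12                 # a 15 TB file
throughput = 150 * 10**6                  # assumed: ~150 MB/s sequential reads
hours = bytes_total / throughput / 3600   # about 27.8 hours for one full pass
```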

Steve S

I'd question the genius who decided to put all the web logs into a single file that was 15TB in size.

Ayende Rahien

Steve, Why? It is efficient, easy to work with, and fast.

Itamar Syn-Hershko

@Oren, because it's sysadmin 101 not to have huge log files, so they can be rotated, expired and moved between servers when needed (or debugged).


You need to chill out or fix your process. If you have no screen before this, then you have no right to be pissed off. It whiffs of intellectual snobbishness. Conversely, if this candidate got past a pre-screen, be angry at whoever pre-screened them. Maybe this guy was just out of college, or maybe he never worked in a high-performing team. Programming isn't all intellect; it's experience and learning. Don't be pissed at a guy who hasn't had the opportunity to learn. And don't confuse skill with potential. Actually, this post pissed me off. Superiority is not a pleasant or laudable personality trait.


In the interest of learning: can somebody explain to me how you do a binary search when the record length is variable (i.e. every line in the log is not necessarily the same length, is it?)

Are you supposed to just "seek" to the halfway point, step backwards/forwards to the start of the previous/next line, and read the date in order to determine whether the current line is in the date range you're looking for?


Never mind: found the answer here:


Note to self: catch up on Ayende's blog before asking questions :-)


1. Read a single line
2. Parse the line
3. Check whether the date is in range; if yes, write the line using StreamWriter
4. Go to step 1

Ferret Chere

I think this candidate is just ahead of their time. Give it a couple more years until smart watches come with 16TB RAM as standard and their proposed solution will make a lot more sense.

Ayende Rahien

Donkeymagic, This is how we pre-screen.

Ayende Rahien

oguzh4n, Great, you have a 15TB file, see you tomorrow.


Oren, although I agree that the solution is certainly not one which wins you a dev position I would argue that your observations are not correct either.

As someone pointed out, it's not true that "if the data is beyond the first 32MB, it won't be found". He's adding to the output only the lines that match the date range and using that collection to check the size of the output, therefore it will eventually find something and not stop eagerly. Of course it still will stop after reading approximately 32MB, which is a wrong interpretation of the question, but still.

Second, as far as I am aware File.ReadLines is no different than pulling lines out of a StreamReader. Not that the proposed fix is good, but no worse than the original solution. Maybe you thought he had used ReadAllLines?

Konrad Kokosa

This reminds me this question on StackOverflow: http://stackoverflow.com/questions/21099339/read-efficiently-chunks-of-a-very-large-file Sounds familiar? :)

Ayende Rahien

Konrad, You are correct. The nice thing about the question is that even with asking around, you have to understand the fundamentals to actually answer it.


@Sven: I would have gone for the solution you proposed. Get the size of the file, jump into the middle, re-sync to the next (or previous) line ending, then decide whether to go for the upper or lower section.

Assumption would be that the dates are guaranteed to be linear in the file. Out in the field all sort of madness can happen (like servers changing their system time and then logging in the past or in the future). If there is not a 100% guarantee that time is linear in the log file I would introduce some kind of sanity check like "If last line was from date X && travelling back in time && dates from the future appear, then report a problem".

BTW, I love the idea about guessing the starting point of the search. Hadn't thought about that but in practice it could be very worthwhile especially if every single random-access is expensive (like when the file is stored off-site on a tape library or something).


@Dalibor, @Simone

You guys are partially correct. Yes, this will look beyond the first 32MB of the file; however, if the requested date range was very wide (say, the start of the file through to the end, assuming a 15TB file), then yes, this method will indeed fail to look beyond the first 32MB. It would be more correct to say that you won't be able to find more than 32MB of results...
