Ayende @ Rahien

It's a girl

Fail, fail, fail

Sometimes, reading a candidate's answer is just something that I know is going to piss me off.

We have a question that goes something like this (the actual question is much more detailed):

We have a 15TB CSV file that contains web log entries, sorted by date (since this is the order in which they were written). Find all the log entries within a given date range. You may not read more than 32 MB.

A candidate replied with an answer that included the following code:

    string line = string.Empty;
    StreamReader file;

    try
    {
        file = new StreamReader(filename);
    }
    catch (FileNotFoundException ex)
    {
        Console.WriteLine("The file is not found.");
        Console.ReadLine();
        return;
    }

    while ((line = file.ReadLine()) != null)
    {
        var values = line.Split(',');
        DateTime date = Convert.ToDateTime(values[0]);
        if (date.Date >= startDate && date.Date <= endDate)
            output.Add(line);

        // Results size in MB
        double size = (GetObjectSize(output) / 1024f) / 1024f;
        if (size >= 32)
        {
            Console.WriteLine("Results size exceeded 32MB, the search will stop.");
            break;
        }
    }

My reply was:

The data file is 15TB in size, if the data is beyond the first 32MB, it won't be found.

The candidate then fixed his code. It now includes:

    var lines = File.ReadLines(filename);

Yep, this is on a 15TB file.

Now I’m going to have to lie down for a bit, I am not feeling so good.

Comments

hilton smith
01/28/2014 09:38 AM

one of my nightmares is seeing my code on your blog with the title "fail"...

Bartosz Adamczewski
01/28/2014 09:44 AM

@Ayende, how much time did he have to actually do it? Was it a stress-free environment, or did you watch his every step of the way? Was it on-site or off?

Not that I'm defending or judging anyone here, but those factors can heavily influence the output (excluding the fix he did later, as that's total nonsense).

Matt Johnson
01/28/2014 09:51 AM

You'd think he'd at least have put the reader in a using block... or closed the file in some other way... or quit when the search range was exceeded... or yielded or streamed the output... or paid attention to the date format... I could go on all day. Where do you find these folks?

Of course the better solution would be to intelligently find a starting position, but that must be for the "advanced" folks.... lol.

Mike
01/28/2014 09:52 AM

What do you think about a binary search over the file to find the start date, and from there just iterating to the end date?

Ayende Rahien
01/28/2014 09:59 AM

Bartosz, It was offsite, his machine, full net access. He had several days.

Ayende Rahien
01/28/2014 09:59 AM

Mike, That is what I expect people to do.
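
For the record, the expected shape of the solution is roughly: seek to the middle of the file, skip the partial line you landed in, read one full line, compare its date to the target, and halve the range. Here is a minimal sketch, assuming the date is the first CSV field on each line (the class and helper names are illustrative, not from the actual question):

    using System;
    using System.IO;
    using System.Text;

    static class LogSearch
    {
        // Returns a byte offset at or before the first entry whose date is
        // >= target; the caller scans forward from there. A sketch only.
        public static long FindStartOffset(string filename, DateTime target)
        {
            using (var file = new FileStream(filename, FileMode.Open, FileAccess.Read))
            {
                long lo = 0, hi = file.Length;
                while (lo < hi)
                {
                    long mid = lo + (hi - lo) / 2;
                    file.Position = mid;
                    SkipPartialLine(file);      // re-sync to the next line start
                    string line = ReadLine(file);
                    if (line == null || EntryDate(line) >= target)
                        hi = mid;               // first match is not after this probe
                    else
                        lo = mid + 1;           // first match starts after this line
                }
                return lo;
            }
        }

        static void SkipPartialLine(Stream s)
        {
            int b;
            while ((b = s.ReadByte()) != -1 && b != '\n') { }
        }

        static string ReadLine(Stream s)
        {
            var sb = new StringBuilder();
            int b;
            while ((b = s.ReadByte()) != -1 && b != '\n')
                sb.Append((char)b);             // assumes a single-byte encoding
            return sb.Length > 0 ? sb.ToString().TrimEnd('\r') : null;
        }

        static DateTime EntryDate(string line)
        {
            return DateTime.Parse(line.Split(',')[0]); // date is the first CSV field
        }
    }

From the offset this returns, you back up one byte, re-sync to a line start, and read forward until the dates pass the end of the requested range. On a 15TB file that is around 44 probes instead of a full scan.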

Thomas Levesque
01/28/2014 10:05 AM

Well, at least he didn't use File.ReadAllLines ;)

Bartosz Adamczewski
01/28/2014 10:10 AM

@Ayende, In that case this is a bit shocking; having a few days (a weekend, maybe), he should have provided at least 3-4 different solutions to the problem (binary search, range index, probabilistic) just for fun, and written all of them in literate style :)

This sort of thing isn't the problem in itself, but how often does such a thing happen? If it's frequent, then it's scary.

Petar Repac
01/28/2014 10:11 AM

@Bartosz,

A stress-free environment? So, when production servers start to "burn", am I allowed to read 15TB into memory? :))

Joke aside, stress is almost always present in some form sooner or later.

Now, if Ayende is nearby holding an axe, that's a different story.......

Talking about stressful interviews.... http://www.youtube.com/watch?v=rUY8HysBzsE

Ayende Rahien
01/28/2014 10:12 AM

Bartosz, We tend to get very different results. Either we get stuff like this, where you can see that they don't really understand the problem set, or we get people who solve this in the way I expected. And we sometimes get people who solve this in really crazy innovative ways.

krlm
01/28/2014 10:19 AM

@Ayende, It would be interesting and inspiring to read about these 'crazy innovative ways' too.

Simon
01/28/2014 10:37 AM

I find it amazing that people with this level of skill apply for a job with someone like Ayende! I'm pretty sure I'd have realised that loading a 15TB file into memory would not be the correct answer, but I still don't think I'm anywhere near good enough to work for Ayende (unfortunately!).

Dunning-Kruger effect in action, I guess.

Tyler
01/28/2014 10:43 AM

Mostly I feel bad for this person's current employer.

Carsten Hansen
01/28/2014 10:51 AM

Since you are indexing on a date field, maybe it is possible to use a Newton–Raphson-style search instead of a binary search. It might be faster.

Ayende Rahien
01/28/2014 10:53 AM

Carsten, The data is sorted by date; a binary search would probably be fastest. I don't think that you can do better than O(log N) on sorted data.

Pavel Dvorak
01/28/2014 11:03 AM

The real shocker would be to realize how much prod code is actually written like this these days.

Petar Repac
01/28/2014 11:11 AM

@Ayende, apart from binary search, maybe you can guess, and so improve results on average.

The file is a web log, so we can guess that entries are evenly distributed; instead of jumping to the middle (as binary search does), we can jump closer to the expected position of our search date.

Thoughts?

BTW, that Newton–Raphson is interesting ... http://mozartreina.com/binary-search.html

Dalibor Carapic
01/28/2014 12:12 PM

Perhaps I am missing something here, but your first answer, "The data file is 15TB in size, if the data is beyond the first 32MB, it won't be found.", is not correct. What he is doing is gathering in a list all the lines which fall within the specified date range (I guess; the definition of the 'output' variable is somewhere outside the pasted code). Once he gathers more than 32MB of lines, he will abort the search (again, I guess that the GetObjectSize method returns the number of bytes that an object should consume in memory). This means that if there are less than 32MB of matching lines in the log, his code will work correctly. Also, if we look at your prerequisite "You may not read more than 32 MB" (which I assume was originally stated as "Do not use more than 32MB of memory to read the file"), then his initial code can also be considered valid, since once he has used up more than 32MB to store the results he should stop (difficult to say without seeing how the question was originally stated).

In any case, I guess he misinterpreted your question, and he should really have asked more questions if he did not fully understand the assignment.

His attempt to fix the issue, however, was terrible.

Michael
01/28/2014 12:32 PM

I'm a software tester and even I thought that binary search would be the best choice... Is the number of lines known? The solution he gave you for searching more than 32MB was really funny though :))

Jesús López
01/28/2014 01:38 PM

I agree with Petar. I think interpolation search is the winner for solving this problem.

http://en.wikipedia.org/wiki/Interpolation_search
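
The only difference from plain bisection is how the probe offset gets picked. A sketch of the probe step, assuming the entry dates at the current range boundaries are known (all names here are illustrative):

    // Pick the next probe by interpolating on the date instead of bisecting.
    // firstDate/lastDate are the entry dates at offsets lo and hi.
    static long InterpolateProbe(long lo, long hi,
                                 DateTime firstDate, DateTime lastDate,
                                 DateTime target)
    {
        double span = (lastDate - firstDate).TotalSeconds;
        if (span <= 0)
            return lo + (hi - lo) / 2;               // degenerate range: bisect
        double f = (target - firstDate).TotalSeconds / span;
        f = Math.Max(0.0, Math.Min(1.0, f));         // clamp against skewed data
        return lo + (long)((hi - lo) * f);
    }

On evenly spaced timestamps this averages O(log log n) probes instead of O(log n); on skewed data it can degrade badly, which is why the clamp (or an occasional fallback to plain bisection) matters.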

Joe King
01/28/2014 01:59 PM

I think you're being a little hard on the guy. As Dalibor mentioned, he probably just misinterpreted "You may not read more than 32 MB".

Ayende Rahien
01/28/2014 02:06 PM

Joe, The spec says a 15 TB file. That alone should tell you that you can't read it all.

Peter Morlion
01/28/2014 02:44 PM

I hope you mean no disrespect to this candidate, because this post comes across that way a bit. If not, apologies. Yet you can't blame the person for not knowing. Maybe he/she never learned. There's no shame in not knowing. There is in not wanting to learn. I had a similar question in a job interview when I was young. I hadn't learned it in school (or didn't remember) and had never needed it. Now I know. I'm also at a point where I'd ask why you have such a large log file in the first place. And I'd recommend Splunk or some other log management tool.

Ayende Rahien
01/28/2014 04:25 PM

Peter, If you are applying for a job in my company, you are expected to know how to deal with such things. And yes, I consider such a task a very basic measure of competence.

Konrad Kokosa
01/28/2014 05:07 PM

What's wrong with ReadLines? I ask seriously. It reads the file line by line, with a 1024 kB internal buffer.

Konrad Kokosa
01/28/2014 05:10 PM

I mean 1024 bytes. And of course there are a lot of other possible optimizations; there is no need to read all the lines. But at least it is not ReadAllLines :)

Ayende Rahien
01/28/2014 06:34 PM

Konrad, It forces you to read each line. The key problem here is that you are reading a 15 TB file. That is going to take over a day.

Steve S
01/28/2014 06:53 PM

I'd question the genius who decided to put all the web logs into a single file that was 15TB in size.

Ayende Rahien
01/28/2014 06:55 PM

Steve, Why? It is efficient, easy to work with, and fast.

Itamar Syn-Hershko
01/28/2014 07:13 PM

@Oren, because it's sysadmin 101 not to have huge log files, so that they can be rotated, expired, and moved between servers (or debugged) when needed.

donkeymagic
01/28/2014 07:18 PM

You need to chill out or fix your process. If you have no screen before this, then you have no right to be pissed off; it whiffs of intellectual snobbery. Conversely, if this candidate got past a pre-screen, be angry at whoever pre-screened them. Maybe this guy was just out of college, or maybe he never worked in a high-performing team. Programming isn't all intellect; it's experience and learning. Don't be pissed at a guy who hasn't had the opportunity to learn. And don't confuse skill with potential. Actually, this post pissed me off. Superiority is not a pleasant or laudable personality trait.

Sven
01/28/2014 08:38 PM

In the interest of learning: can somebody explain to me how you do a binary search when the record length is variable (i.e., every line in the log is not necessarily the same length)?

Are you supposed to just "seek" to the halfway point, step backwards/forwards to the start of the previous/next line, and read the date in order to determine whether the current line is in the date range you're looking for?

Sven
01/28/2014 08:47 PM

Never mind: found the answer here:

http://ayende.com/blog/165124/big-data-search-binary-search-of-textual-data

Note to self: catch up on Ayende's blog before asking questions :-)

oguzh4n
01/28/2014 10:16 PM

1. Read a single line
2. Parse the line
3. Check whether the date is in range; if yes, write the line using StreamWriter
4. Go to step 1

Ferret Chere
01/29/2014 04:48 AM

I think this candidate is just ahead of their time. Give it a couple more years until smart watches come with 16TB of RAM as standard, and their proposed solution will make a lot more sense.

Ayende Rahien
01/29/2014 06:27 AM

Donkeymagic, This is how we pre-screen.

Ayende Rahien
01/29/2014 06:27 AM

oguzh4n, Great, you have a 15TB file, see you tomorrow.

Simone
01/29/2014 11:44 AM

Oren, although I agree that the solution is certainly not one which wins you a dev position, I would argue that your observations are not correct either.

As someone pointed out, it's not true that "if the data is beyond the first 32MB, it won't be found". He's adding to the output only the lines that match the date range and using that collection to check the size of the output, so it will eventually find something rather than stopping eagerly. Of course, it will still stop after collecting approximately 32MB of results, which is a wrong interpretation of the question, but still.

Second, as far as I am aware, File.ReadLines is no different from pulling lines out of a StreamReader. Not that the proposed fix is good, but it's no worse than the original solution. Maybe you thought he had used ReadAllLines?

Konrad Kokosa
01/29/2014 12:40 PM

This reminds me of this question on Stack Overflow: http://stackoverflow.com/questions/21099339/read-efficiently-chunks-of-a-very-large-file Sound familiar? :)

Ayende Rahien
01/29/2014 12:45 PM

Konrad, You are correct. The nice thing about the question is that even if you ask people online, you have to understand the fundamentals to actually answer it.

Thorsten
01/29/2014 11:17 PM

@Sven: I would have gone for the solution you proposed. Get the size of the file, jump into the middle, re-sync to the next (or previous) line ending, then decide whether to go for the upper or lower section.

The assumption would be that the dates are guaranteed to be linear in the file. Out in the field, all sorts of madness can happen (like servers changing their system time and then logging in the past or in the future). If there is not a 100% guarantee that time is linear in the log file, I would introduce some kind of sanity check like "if the last line was from date X && we're travelling back in time && dates from the future appear, then report a problem".
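
Schematically, the check could look like this while scanning the match range (linesInRange is a stand-in for however the entries get enumerated):

    // Sanity check: flag timestamps that run backwards while scanning,
    // since that would break the sorted-by-date assumption.
    DateTime previous = DateTime.MinValue;
    foreach (var line in linesInRange)               // hypothetical enumerator
    {
        DateTime current = DateTime.Parse(line.Split(',')[0]);
        if (current < previous)
            Console.Error.WriteLine("Non-monotonic timestamp: " + line);
        previous = current;
        // ... process the entry ...
    }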

BTW, I love the idea of guessing the starting point of the search. I hadn't thought about that, but in practice it could be very worthwhile, especially if every single random access is expensive (like when the file is stored off-site on a tape library or something).

Robert
01/30/2014 05:18 AM

@Dalibor, @Simone

You guys are partially correct. Yes, this will look beyond the first 32MB of the file; however, if the date range requested was very wide (say, the start of the file through to the end, on a 15TB file), then this method will indeed fail to look much beyond the first 32MB. It would be more correct to say that you won't be able to find more than 32MB of results...

Comments have been closed on this topic.