Fail, fail, fail
Sometimes, reading a candidate's answer is just something that I know is going to piss me off.
We have a question that goes something like this (the actual question is much more detailed):
We have a 15TB CSV file that contains a web log; the entries are sorted by date (since that is the order in which they were written). Find all the log entries within a given date range. You may not read more than 32 MB.
A candidate replied with an answer that included the following code:
string line = string.Empty;
StreamReader file;

try
{
    file = new StreamReader(filename);
}
catch (FileNotFoundException ex)
{
    Console.WriteLine("The file is not found.");
    Console.ReadLine();
    return;
}

while ((line = file.ReadLine()) != null)
{
    var values = line.Split(',');
    DateTime date = Convert.ToDateTime(values[0]);
    if (date.Date >= startDate && date.Date <= endDate)
        output.Add(line);

    // Results size in MB
    double size = (GetObjectSize(output) / 1024f) / 1024f;
    if (size >= 32)
    {
        Console.WriteLine("Results size exceeded 32MB, the search will stop.");
        break;
    }
}
My reply was:
The data file is 15TB in size, if the data is beyond the first 32MB, it won't be found.
The candidate then fixed his code. It now includes:
var lines = File.ReadLines(filename);
Yep, this is on a 15TB file.
Now I’m going to have to lie down for a bit, I am not feeling so good.
Comments
one of my nightmares is seeing my code on your blog with the title "fail"...
@Ayende, how much time did he have to actually do it? Was it a stress-free environment, or did you watch his every step of the way? Was it on-site or off?
Not that I'm defending or judging anyone here, but those factors can heavily influence the output (excluding the fix that he did later, as it's total nonsense).
You'd think he'd at least have put the reader in a using block... or closed the file in some other way... or quit when the search range was exceeded... or yielded or streamed the output... or paid attention to the date format... I could go on all day. Where do you find these folks?
Of course the better solution would be to intelligently find a starting position, but that must be for the "advanced" folks.... lol.
What do you think about a binary search over the file to find the start date, and from there just iterating to the end date?
Bartosz, It was offsite, his machine, full net access. He had several days.
Mike, That is what I expect people to do.
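For anyone wondering what that looks like in practice, here is a minimal sketch of the general shape of such an answer (illustrative only; it is not the candidate's code, it assumes every line starts with a parseable timestamp followed by a comma, and the class and helper names are invented for the example): binary-search the byte offsets to land near the first matching line, re-sync to a line boundary, then stream forward until the end date is passed.

using System;
using System.IO;
using System.Text;

static class LogSearch
{
    // Narrow down the start position with a binary search over byte offsets,
    // then stream forward. Only a handful of 64KB probes plus the matching
    // region are ever read - never the whole 15TB file.
    public static void FindRange(string filename, DateTime startDate, DateTime endDate, TextWriter output)
    {
        using (var fs = new FileStream(filename, FileMode.Open, FileAccess.Read))
        {
            long lo = 0, hi = fs.Length;
            while (hi - lo > 64 * 1024)                  // stop once the window is a single 64KB chunk
            {
                long mid = lo + (hi - lo) / 2;
                DateTime? probe = DateOfFirstFullLineAfter(fs, mid);
                if (probe.HasValue && probe.Value < startDate)
                    lo = mid;                            // the range begins after mid
                else
                    hi = mid;                            // the range begins at or before mid
            }

            fs.Seek(lo, SeekOrigin.Begin);
            using (var reader = new StreamReader(fs, Encoding.UTF8, false, 64 * 1024))
            {
                if (lo != 0) reader.ReadLine();          // discard the possibly partial first line
                string line;
                while ((line = reader.ReadLine()) != null)
                {
                    DateTime date = ParseDate(line);
                    if (date > endDate) break;           // the file is sorted, so we can stop here
                    if (date >= startDate) output.WriteLine(line);
                }
            }
        }
    }

    // Seek to 'offset', discard the partial line, and return the date of the next full line.
    static DateTime? DateOfFirstFullLineAfter(FileStream fs, long offset)
    {
        fs.Seek(offset, SeekOrigin.Begin);
        var reader = new StreamReader(fs, Encoding.UTF8, false, 64 * 1024, leaveOpen: true);
        reader.ReadLine();
        string line = reader.ReadLine();
        return line == null ? (DateTime?)null : ParseDate(line);
    }

    static DateTime ParseDate(string line)
    {
        return DateTime.Parse(line.Substring(0, line.IndexOf(',')));
    }
}

The 64KB figures are arbitrary; the point is that the total I/O is a few probes plus the matching region, comfortably inside the 32MB budget unless the result set itself is bigger than that.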
Well, at least he didn't use File.ReadAllLines ;)
@Ayende, In that case this is a bit shocking. Having a few days (a weekend, maybe) he should have provided at least 3-4 different solutions to the problem (binary search, range index, probabilistic) just for fun, and written all of them in literate style :)
This sort of thing isn't a problem in itself, but how often does it happen? If it's frequent, then it's scary.
@Bartosz,
A stress-free environment? So, when production servers start to "burn", I'm allowed to read 15TB into memory? :))
Joke aside, stress is almost always present in some form sooner or later.
Now, if Ayende is nearby holding an axe, that's a different story.......
Talking about stressful interviews.... http://www.youtube.com/watch?v=rUY8HysBzsE
Bartosz, We tend to get very different results. Either we get stuff like this, where you can see that they don't really understand the problem set, or we get people who solve this in the way I expected. And we sometimes get people who solve this in really crazy innovative ways.
@Ayende, It would be interesting and inspiring to read about these 'crazy innovative ways' too.
I find it amazing that people with this level of skill apply for a job with someone like Ayende! I'm pretty sure I'd have realised that loading a 15TB file into memory would not be the correct answer, but I still don't think I'm anywhere near good enough to work for Ayende (unfortunately!).
Dunning-Kruger effect in action I guess.
Mostly I feel bad for this person's current employer.
Since you are indexing on a date field, maybe it is possible to use a Newton–Raphson search instead of a binary search. It might be faster.
Carsten, The data is sorted by date, so a binary search would probably be fastest. I don't think that you can do better than O(log N) on sorted data.
The real shocker would be to realize how much prod code is actually written like this these days.
@Ayende, apart from binary search, maybe you can guess and so improve results on average.
The file is a web log, so we can guess that the entries are evenly distributed, and instead of jumping to the middle (as binary search does) we can jump closer to the expected position of our search date.
Thoughts?
BTW, that Newton–Raphson is interesting ... http://mozartreina.com/binary-search.html
Perhaps I am missing something here, but your first answer: "The data file is 15TB in size, if the data is beyond the first 32MB, it won't be found." is not correct. What he is doing is gathering all the lines which fall within the specified date range in a list (I guess; the definition of the 'output' variable is somewhere outside the pasted code). Once he gathers more than 32MB of lines he will abort the search (again, I guess that GetObjectSize returns the number of bytes that an object should consume in memory). This means that if there are fewer than 32MB of lines in the log which fall within the specified range, his code will work correctly. Also, if we take a look at your prerequisite "You may not read more than 32 MB" (which I assume was originally stated as "Do not use more than 32MB of memory to read the file"), then his initial code can also be considered valid, since once he has used up more than 32MB to store the results he should stop (difficult to say without seeing how the question was originally stated).
In any case I guess he misinterpreted your question, and he should have asked more questions if he did not fully understand the assignment.
His attempt to fix the issue, however, was terrible.
I'm a software tester and even I thought that binary search should be the best choice... Is the number of lines known? The solution he gave you for searching beyond 32MB was really funny though :))
I agree with Petar. I think interpolation search is the winner to solve this problem.
http://en.wikipedia.org/wiki/Interpolation_search
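For illustration, the only change from a plain binary search over the file is how the probe offset is chosen. A sketch of that choice, assuming the timestamps are spread roughly evenly through the file (lowDate and highDate are the dates read near the current lo and hi offsets; all names here are made up for the example):

// Pick the next probe offset by interpolating on the dates instead of halving.
// Falls back to the midpoint when the window's dates are degenerate.
static long ProbeOffset(long lo, long hi, DateTime lowDate, DateTime highDate, DateTime target)
{
    if (highDate <= lowDate)
        return lo + (hi - lo) / 2;
    double fraction = (target - lowDate).Ticks / (double)(highDate - lowDate).Ticks;
    fraction = Math.Max(0.0, Math.Min(1.0, fraction));   // clamp against skewed data
    return lo + (long)((hi - lo) * fraction);
}

On roughly uniform data this needs O(log log n) probes instead of O(log n), which matters mostly when each probe (a seek into a 15TB file) is expensive.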
I think you're being a little hard on the guy. As Dalibor mentioned, he probably just misinterpreted "You may not read more than 32 MB".
Joe, The spec says a 15 TB file. That alone should tell you that you can't read it all.
I hope you mean no disrespect to this candidate, because this post reads a bit that way. If not, apologies. Yet you can't blame the person for not knowing. Maybe he/she never learned. There's no shame in not knowing; there is in not wanting to learn. I had a similar question in a job interview when I was young. I hadn't learned it in school (or didn't remember) and had never needed it. Now I know. I'm also at a point where I'd ask why you have such a large log file in the first place. And I'd recommend Splunk or some other log management tool.
Peter, If you are applying for a job in my company, you are expected to know how to deal with such things. And yes, I consider such a task a very basic measure of competence.
What's wrong with ReadLines? I ask seriously. It reads the file line by line, with a 1024 kB internal buffer.
I mean - 1024 bytes. And of course there are a lot of other possible optimizations; there is no need to read all the lines. But at least it is not ReadAllLines :)
Konrad, It forces you to read each line. The key problem here is that you are reading a 15 TB file. That is going to take over a day or so.
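To put a rough number on that (assuming something like 150 MB/s of sustained sequential read, which is purely an assumption about the hardware): 15 TB is about 15,000,000 MB, and 15,000,000 MB ÷ 150 MB/s ≈ 100,000 seconds, or roughly 28 hours - regardless of whether File.ReadLines or a raw StreamReader does the reading. A seek-based search touches a few kilobytes per probe instead.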
I'd question the genius who decided to put all the web logs into a single file that was 15TB in size.
Steve, Why? It is efficient, easy to work with, and fast.
@Oren, because it's sysadmin 101 not to have huge log files, so they can be rotated, expired, and moved between servers when needed (or debugged).
You need to chill out or fix your process. If you have no screen before this, then you have no right to be pissed off. It whiffs of intellectual snobbishness. Conversely, if this candidate got past a pre-screen, be angry at whoever pre-screened them. Maybe this guy was just out of college, or maybe he has never worked in a high-performing team. Programming isn't all intellect; it's experience and learning. Don't be pissed at a guy who hasn't had the opportunity to learn. And don't confuse skill with potential. Actually, this post pissed me off. Superiority is not a pleasant or laudable personality trait.
In the interest of learning: can somebody explain to me how you do a binary search when the record length is variable (i.e. every line in the log is not necessarily the same length, is it?)
Are you supposed to just "seek" to the halfway point, step backwards/forwards to the start of the previous/next line, and read the date in order to determine whether the current line is in the date range you're looking for?
Never mind: found the answer here:
http://ayende.com/blog/165124/big-data-search-binary-search-of-textual-data
Note to self: catch up on Ayende's blog before asking questions :-)
1 - Read a single line
2 - Parse the line
3 - Check whether the date is in range - if yes, write the line using StreamWriter
4 - Go to step 1
I think this candidate is just ahead of their time. Give it a couple more years until smart watches come with 16Tb RAM as standard and their proposed solution will make a lot more sense.
Donkeymagic, This is how we pre-screen.
oguzh4n, Great, you have a 15TB file, see you tomorrow.
Oren, although I agree that the solution is certainly not one which wins you a dev position, I would argue that your observations are not correct either.
As someone pointed out, it's not true that "if the data is beyond the first 32MB, it won't be found". He's adding to the output only the lines that match the date range and using that collection to check the size of the output; therefore it will eventually find something and not stop eagerly. Of course it will still stop after accumulating approximately 32MB of results, which is a wrong interpretation of the question, but still.
Second, as far as I am aware, File.ReadLines is no different from pulling lines out of a StreamReader. Not that the proposed fix is good, but it is no worse than the original solution. Maybe you thought he had used ReadAllLines?
This reminds me this question on StackOverflow: http://stackoverflow.com/questions/21099339/read-efficiently-chunks-of-a-very-large-file Sounds familiar? :)
Konrad, You are correct. The nice thing about the question is that even when you point things out to people, you still have to understand the fundamentals to actually answer it.
@Sven: I would have gone for the solution you proposed. Get the size of the file, jump into the middle, re-sync to the next (or previous) line ending, then decide whether to go for the upper or lower section.
The assumption would be that the dates are guaranteed to be linear in the file. Out in the field all sorts of madness can happen (like servers changing their system time and then logging in the past or in the future). If there is not a 100% guarantee that time is linear in the log file, I would introduce some kind of sanity check like "if the last line was from date X && we're travelling back in time && dates from the future appear, then report a problem" (a small sketch of such a check follows below).
BTW, I love the idea of guessing the starting point of the search. I hadn't thought about that, but in practice it could be very worthwhile, especially if every single random access is expensive (like when the file is stored off-site on a tape library or something).
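A minimal sketch of that kind of sanity check, as a wrapper around the forward scan (the parsing delegate and the reporting are assumptions for illustration):

// Yield lines while verifying that timestamps never go backwards, since the
// whole offset-based search relies on the file really being sorted by date.
static IEnumerable<string> ReadLinesCheckingOrder(TextReader reader, Func<string, DateTime> parseDate)
{
    DateTime lastSeen = DateTime.MinValue;
    string line;
    while ((line = reader.ReadLine()) != null)
    {
        DateTime date = parseDate(line);
        if (date < lastSeen)
            Console.Error.WriteLine("Timestamp goes backwards at: " + line);
        lastSeen = date;
        yield return line;
    }
}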
@Dalibor, @Simone
You guys are partially correct: yes, this will look beyond the first 32MB of the file. However, if the requested date range was very wide (say, the start of the file through to the end, given a 15TB file), then this method will indeed fail to look beyond roughly the first 32MB. It would be more correct to say that you won't be able to find more than 32MB of results...