Rate this interview question, please
I ask candidates to answer the following question. Sometimes at home, sometimes during an interview with a whiteboard.
You need to create an executable that would manage a phone book. The commands you need to support are:
- phone-book.exe /path/to/file add [name] [phone]
- phone-book.exe /path/to/file list [skip], [limit]
The output of the list operation must be the phone book records in lexical order. You may not sort the data during the list operation, however. All such work must be done in the add operation.
You may keep any state you’ll like in the file system, but there are separate invocations of the program for each step. This program need to support adding 10 million records.
Feel free to constrain the problem in any other way that would make it easier for you to implement it. We’ll rate the solution on how much it cost in terms of I/O.
A reminder, we are a database company, this sort of question is incredibly relevant to the things that we do daily.
I give this question to candidates with no experience, fresh graduates, etc. How would you rate its difficulty?
Comments
Some context is needed, the most important piece being: what is the level/position that is being interviewed for (junior software engineer, senior software architect)?
The answer to that would shape my answer about my expectations dramatically.
Apologies, ignore that comment.
Nicholas,
This is for a junior position. No experience at all assumed
I think it is a good question, right level for junior, but IMHO most candidates will re-sort the entire file on each add.
Jenya,
That is why we rate in the terms of file I/O.
It is not said, are candidates allowed to use any database engine?
If are candidates supposed to create his own database, it leads to file with sequential records and some kind of index. Not easy for me witch 25 years of programming. I am using databases, not creating them.
Trivial solution is to keep a fixed length record (for easy skip computation) file in lexical order, read & rewrite on add and insert new entry/line at the correct place; many comparisons, but since only I/O is rated...
Separate index file might be an option and just append to data file, but since index file must be re-read on each separate invocation, this comes down to data size of name & phone number (20 chars/10 byte BCD) vs. size of BTree node, which has to contain key (name) anyway.
Good question to argue about and requires some basic understanding of data structures.
Overall index solution would win when balancing (combined) I/O of add & list op
The index file can contain record sizes and lexicographically sorted record indexes (without actual keys) and be sparse to allow cheap on I/O inserts and be chunked on some page size (64KB) to align with actual storage h/w characteristics.
On insert if overflow (no gap) it can grow to next page boundary size and ensure enough gaps. If the index is missing (a kind of catastrophic event) it can be rebuild from data file.
Simple ADT and no error-prone B-tree implementation to maintain by a junior.
The question is a bit on the edge since it's difficult to quantify what is good enough and I could see an inexperienced developer stressing out because of that. Personally I would probably use a file structure where records are appended at the end. However each record is prefixed with red-black tree node which points to the next/previous record in lexical order (offset or absolute position in the file). Adding new entry would mean going through the existing fields via tree nodes and updating one field tree node to point to the new field. Red-black tree is also self-balanced if necessary so performance would be ok i guess.
It seems this question splits candidates in two groups:
Karel
Sure, if they can write one during the interview, sure. An off the shelf one, nope
Karel,
We are a database company, this is literally what we do. And that kind of task, especially if you are doing that on a while board, isn't really that hard if you can skip a lot of steps in the middle (like durability, concurrency, etc).
Felix,
If you are doing fixed length records and do in lexical order, you are going to be facing O(N^2) rewrite cost when adding records in reversed order.
A separate index file is probably easiest, to be honest. That means that you can very easily append the values to the main file, and the index just stores sorted offsets.
For bonus points, compress the index file on disk and reduce this even further.
Dalibor,
Yes, that is probably the best reasonable scale (in term of code) solution that you can come up with. Not _ideal_, but that would require a b+tree, and that isn't something that you write in an hour.
Garcha,
People who use Sqlite don't answer this question correctly, however.
@Oren Well since I'm a LOB application developer which does not normally work with algorithms and data structures I'll take that as a high praise :)
this question would guarantee you dont hire anyone, especially a junior
I think it's a good question, especially in your side of business. It also can lead to a long, interesting discussion on trade offs, concerns, edge cases, etc. allowing you to assess the candidate's thinking, even if initial solution somewhat misses the point. At that point in my career I would definitely try to think of some simplistic B+ tree implementation with enough fill factor to allow for not too many splits. All-those-years-ago me, fresh out of university? I would probably start towards some heap structure or balanced tree, as they naturally keep the sorting order, while trying not bending myself too much about file. Maybe several files, 100k entries each, given that we know the size limit more or less here. I wouldn't worry too much about wasting space - if the file(s) are slightly bigger than the sum of characters but we get better performance, so be it.
I think it is a very difficult question and very few candidates will present something reasonable if you expect them to implement an efficient on disk data structure like B+ tree or log structured merge tree. On the other hand, it would be more realistic to allow candidates to use some existing disk storage engine library such as Voron, RocksDb, SQLite or even Extensible Store Engine. You used ESE to build RanvenDB at the begining, so I think It would not be reasonable to expect the candidate to create an storage engine from scratch in an interview question.
@Jesús López The problem with this approach is to expect a junior candidate to have experience with some of the mentioned libraries, also correctly using library calls his will not tell you much about the candidate; the purpose is still an interview.
Most junior developers will not be able to answer this question, simply because they were never confronted with this topic. In algorithms or computability class you will learn abstract concepts and proofs. So they may know the complexity of a binary search. What you don't cover in class is how this has any real-life application, especially with regards to real I/O.
It is also highly unlikely that they ever encountered thinking about this in their previous jobs - they probably had enough to do wrapping their heads around all the different aspects and languages of web programming, not how db engines internals work and for sure not how to implement efficient indexing mechanisms. You cannot expect these people to come up with any good idea in this regard in an hour long interview.
Instead, I would suggest you give them some abstract and much more tightly scoped questions with much more background information, s.a. "How do you perform an efficient search in a sorted list?", "In what ways is appending to a file on disk more efficient than random writes within the file?", "Given that appending to a file is much less costly than random writes, could you think of a way to organize a sorted list in a file that is write efficient (the list can be an array or a linked list). " This way you point the candidate in the right direction, but still give him enough opportunity to prove whether or not they have the right way of thinking for the job.
Rafal,
I answered such questions for homework when I was in high school. I seriously doubt that this is "not hire anyone" question.
Tomasz, Someone who can discuss the alternative of various implementations? They are _hired_.
I'm not expecting them to come up with the full details on the spot, but I am expecting that they can apply algorithms they run into in memory to disk structures.
Jesus,
There is a BIG difference from LSM and B+Tree and the kind of work you need to solve this problem. And using an existing storage engine defeats the purpose of this question. I want to see if they can understand what is going on behind the storage engine.
Manu,
"Searching in sorted list" is the sort of thing that I expect any dev to be able to answer without thinking, ever. It tests nothing.The rest are actually traps, because you can get deep enough to show that there isn't actually such a thing as "write to the middle of a file".
As for not running into this, that is natural, but that is the _point_. I want to be able to see that they can take data structures that they run into in one context and apply them in the next
B+ Tree and LSM are different data structures, of course. But both can solve the problem of the question. I can think on a very naive LSM implementation that can pass your interview question, however even the most naive B+ tree is away more difficult to implement. But I think it is more difficult you find a candidate that present something reasonably efficient than you win the lottery.
This isn't an easy question, and in general I think many would struggle unless they had a lot of experience.
BUT, I disagree with those saying it's a "no-hire" question, because context is important here - for candidates applying to a database company, you'd expect them to have done some research and learning beforehand, even if it's at a basic level.
Something else - this question has so many different solutions that it's a great one for generating discussion, and that's where you really find out about a candidate. I mean, even for juniors you could guide them to a particular solution and see how they manage.
Yes, i believe the high-school you would be a great hire for Hibernating Rhinos. And it's great that you happen to find such people.
Just out of curiosity, is this very naive LSM tree good enough to pass this interview question:
https://github.com/jesuslpm/phone-book/blob/main/PhoneBook.NaiveLsmTree/NaiveLsmTreePhoneBookStore.cs
It inserts 10.000.000 entries in 45 seconds and list them all in 5 seconds.
Jesus,
Implementing an AVL tree to solve this would actually be a viable strategy, and be simple enough to do. B+Tree on a whiteboard is easy, too.
Rafal,
To be fair, I don't think that I was a good programmer in high school. For that matter, that was a task that was given to all the students in the class. Considered to be a routine part of homework.
Jesus,
Yes, that would be a great thing to see. For that matter, there is so much to go on from that implementation that I would be overjoyed.
It took me many many years to build that knowledge, confidence in myself that i'm a competent programmer and can handle pretty demanding tasks. Although i always had some talent for understanding technical stuff handling complexity, and started basic programming at pretty young age, i didnt have any real achievements and had couldn't brag about casual late night patching of OS with a hex editor. At high school i realized how little i knew. So i couldn't either call myself a good programmer at that time. But yep, the algorithms and data structures and all theory behind were a major subject thru first 3 years and that's why i find it difficult to understand why there are so many people with CS diploma who seem to have skipped that part. On the other side, there were many areas that were kind of left out by my university - lots of theory but very little practice, for example i had almost no knowledge of databases and didnt even realize how all the theory applies to real software.
Based on my experience with hiring developers, this is a difficult question, especially to solve at the whiteboard for a junior developer, unless he was doing special preparation for this specific position - I don't see a way for a regular developer with 2 years of website development (backend api with sql persistence) to answer it on the whiteboard using b-trees (there are exceptions, but those are rare).
In my opinion hiring junior developers is the hardest - as usually you need to make decision not on what they know today, but what they might be able to know after a year working with you - so if you allow to solve this question at home, it should work just fine - it should prove that candidate can read/understand stuff and it will be easy to prove that he/she just stole the code without actually understanding it. Also doing it at home saves your time, as you need to interview only those, that were able to write something sensible.
Overall I think all depends on amount of candidates you have - for example in my country's market we don't have huge pool to choose from and deciding from number of email's I get with offers to relocate it should be difficult in other countries too.
I would say this is a nice question for a junior programmer, just to see how they will attack it. I would just like to have 2 additions to the problem : - What kind of storage are we talking about (On hdd large sequential writes are different then small writes, just as with network storage) - Is it allowed to have eventual consistency or does it require immediate consistency (and additional, is it required to be ACID compliant (much more I/O) or am I allowed to keep the data in memory for 5 seconds to batch-write it then to lower I/O's)
A nice addition for a senior position would be I think to also have a look up by phone-number requirement for legal reasons
But and this is a pretty big but imho, I think this would only be a junior question for somebody working on the RavenDB core, for somebody working on RavenDB clients for example I would say it is overkill
Giedrius,
In my opinion, someone with 2 years experience will usually have a harder time here than a true junior.And b-tree are not actually required, they are simply one of the most efficient methods. A tree can be implemented fairly simply with some loss of efficiency.
Right now, when hiring junior devs, my issue is literally that I'm drowning in candidates and need to filter them out.
Chirstiaan,
If the candidate asks about details relating to disk performance, they are 95% hired already :-) If they are asking about ACID properties, they are 99% hired.
We also don't really have a distinction of client vs. backend work. And for that matter, a lot of those details are transferable across the barrier.
Comment preview