The low level interview question
The following is likely to end up in the list of questions we’ll ask candidates to answer when they apply to Hibernating Rhinos.
We need to store information about keys and their location on disk. We need to implement a trie. We want the trie to store int64 size values and unbounded UTF8 string keys.
Given that, we have the following interface:
We need to implement that with the following properties:
- The trie will be represented by single 32 kilobytes byte array.
- You cannot store any additional information about the trie outside of the array.
- The costs of searching in the trie in proportional to the size of the key.
- There are no duplicates.
- This is a single thread implementation.
- If the array is full, it is fine to return false from the TryWrite()
- Unsafe code is fine (but not required).
- You cannot use any in memory structure other than the byte array. But it is fine to allocate memory during the processing of the operations (for example, to turn the string key into a byte array).
We will be looking at the following aspect in the implementation:
- Correctness
- Performance
- Space used
The idea is that we can pack as much as possible into as small a space as possible, but at the same time, we can get great performance.
Comments
Do you ever specify test cases/data up front to let the interviewee focus on implementation instead of testing data?
One question,
What is an "unbounded" UTF8 string?
Like a typical UTF8 string: System.Text.Encoding.UTF8.GetString(yourByteArray), how do you make it unbounded?
Oren, I think you should specify what performance you wish to optimize. I think there could be diffrent solution optimizing diffrent methods, I think you care most about the try read, am I right? Would a binary search per char would be considered proportional to the key length? Saving all possible unbound UTF8 edges per char will waste too much storage... Even breaking the keys per byte will mean having 2^8 edges per byte in the key and with the size limit it is unacceptable...
Tal, The idea is to optimize the reads, yes. And binary search per char would something that you can avoid for the most part. Only when you have different values would you need that.
Jeremy, Unbounded means a string that has no size limit. For example, you can have a string whose size is 5KB. Obviously, if you tried to use a string whose length is over 32KB, it would fail because of lack of space
Harry, No, I don't. It is more interesting to see what they come up with
A tree structure with branches consisting of different letters (UTF8 bytes), levels of the tree being the letter position. These nodes can either be an internal node containing a matching letter with a next link to the next letter branch and a link to the next node if match, or simply a leaf node containing the 64-bit value itself.
Assuming each node is 64-bit, or 8 bytes, then 32KB is 4096 slots, with each address being 12 bits. If each node is 32-bit, then each address is 13 bits -- but then I'll have to keep an allocation table to tell which one is double and which one is single. Probably not worth it.
Each internal node is 8 bits of matching letter, plus a link => 8+12 = 20 bits. Packing two into each node ==> 40 bits, plus line "next" link ==> 40 + 12 = 52 bits. Put in a "prev" link ==> 52 + 12 = 64 bits. So each internal node matches two possible letters, and chained in a doubly-linked list.
A free slot can be all zero --> Zero letters are not UTF8. However, I need a bitmap to tell which node is a leaf and which node is internal, if all 64-bit of value must be held.
This is fun.
Stephen, A node can be both internal and leaf. For example, assume the following sequence:
TryWrite("foo", 15); TryWrite("foobar", 13);
But that is great, yeah
A nice challenge is irresistible :) I have come up with an all-C# solution that is 'just' 10 times slower on reads than a standard Dictionary<string, long> (but a lot more space-efficient, of course). Depending on the test dataset, some 1200 filenames or 750 guids will fit (with values). So that's between 27 and 43 bytes per key, of which 8 are needed for the value. A write is quite slow though; ~20 times slower than reading. Being an interview question, I guess it is not appreciated to share a github link?
Comment preview