Proposed solution to the low level interview question
For the actual question, see the original post.
So the first thing that we need to decide is what will be the data format on the tire. Since we have only 32KB to work with, we need to consider the actual storage.
32KB is small enough to fit in a unsigned short, so all the references we’ll used will be shorts. We also need to store a bit of metadata, so we’ll use the first 4 bytes as the header for just that.
- ushort SpaceUsed;
- ushort LastAllocation;
Now that we have this, we need to decide how to store the actual data. To make things easy, we are going to define the following way to allocate memory:
This is about the simplest way that you can go about doing things, note that we use a length prefix value, and we limit allocations to a max of 127 bytes each. We use a negative size to indicate a delete marker.
So basically, now we have a pretty trivial way to allocate memory, and we can implement the trie as we would normally do. There are a few wrinkles, however.
Deleting the memory doesn’t actually make it eligible for reuse, and it is quite likely to get fragmented easily. In order to handle that, we will track the amount of space that is used, and if we got to the end of the space, we’ll check the UsedSpace value. If this is still too little, we can abort, there is no available space here. However, if we go to the end of the buffer, but we have free space available, we can do the following:
- Scan the buffer for available spots (find available locations that have negative size).
- Failing that, we will copy the data to a temporary buffer, then re-add everything to the buffer from scratch. In other words, we defrag it.
Another issue we have is that the maximum size we can allocate is 127. This value is big enough so most actual strings can fit into it nicely, but a trie already has the property that a large string might be broken into pieces, we’ll just cut each node in the trie to a max size of 127. Actually, the max size is likely to be less than that, because there is also some information that we need to keep track per entry.
- byte NumberOfChildren;
- byte Flags; // node type, (internal, leaf or both)
- ushort ChildrenPosition;
So in practice we have about 123 bytes to work with for the length. Note that we don’t store the string value of the node’s length (we can get that from the allocation information), and that we store the actual children in an array that is stored separately. This allows us to easily add items to the trie as child nodes. If the node is a leaf node, we also need to store the actual value (which is 8 bytes), we store that information at the end of the value (giving us 115 bytes for that section of the value).
All in all, there is going to be a bit of pointer arithmetic and bit counting, but is likely to be a pretty simple implementation.
Note that additional optimizations would be to try align everything so it would fit into a cache line, trying to place nodes near their children (which are more likely to be followed), etc.
Comments
"...and that we store the actual children in an array that is stored separately. This allows us to easily add items to the trie as child nodes. ..." Aha! You use more than 32KB? I had the impression everything had to be stored into the 32Kb array, no?
// Ryan
Ryan, Where did you get that impression? We store the children of the node in a separate _allocation_. That way, in the common case where we need to grow the children, we can just allocate a bigger location for them, and then use that, instead of moving the whole node.
Comment preview