Building a phone book, Part III
In the past two posts in this series, I talked about ways to store phone book records in a file. While reviewing the candidates' submissions, I noticed that many of them failed to make their lives significantly easier by placing limits on themselves.
For example:
- Using variable length records.
- Using a single file.
- Choosing a single algorithm to do the whole task.
If we force fixed-length records, either directly or via record splitting (if each record is 64 bytes, a record that is bigger than that would occupy some multiple of the record size), the task becomes much easier. I've mostly ignored that in my code so far because I'm using binary offsets, but it can really make the code a lot simpler.
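To make the arithmetic concrete, here is a minimal sketch of why fixed-size records help. The `RECORD_SIZE` constant and both function names are my own illustration, not taken from the post's code:

```python
RECORD_SIZE = 64  # assumed fixed record size, as in the post's example


def record_offset(record_number: int) -> int:
    # With fixed-size records, locating record N is pure arithmetic,
    # with no need to scan the file decoding per-record lengths.
    return record_number * RECORD_SIZE


def slots_needed(payload_len: int) -> int:
    # A record bigger than RECORD_SIZE is split across a whole
    # multiple of record-sized slots (rounding up).
    return (payload_len + RECORD_SIZE - 1) // RECORD_SIZE
```

A 100-byte record would occupy two 64-byte slots, and seeking to any record is a single multiplication.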
Using a single file leads to complications, because you have to do internal space management (where do the records live, where is the metadata?). It also makes it much harder to reclaim used space in many cases.
The last one is probably the most interesting limitation, and not something that I would expect a junior developer to figure out. Sticking to a single algorithm typically limits you to whatever that particular algorithm provides, but you can extend on that significantly by combining techniques.
Let’s see another approach to building a persistent phone book. I’m going to effectively build an LSM here. You can see the code here.
I called it a pretty horrible LSM (Log-Structured Merge), but all the relevant pieces are there. It is just horribly inefficient. The key problem, by the way, is the number of times it opens a file handle. That can be really slow on Windows and ends up being a real issue at any significant size.
There are probably a lot of other bugs in there as well, but there is enough to figure out how this is actually built.
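To show the shape of the idea, here is a toy LSM sketch: recent writes go to an in-memory table, which is periodically flushed to immutable sorted segment files; reads check memory first, then segments newest-first. This is my own illustration (class and file names are made up), not the code linked in the post, and it deliberately shows the file-handle-per-lookup inefficiency mentioned above:

```python
import json
import os


class TinyLsm:
    """A toy LSM: an in-memory memtable plus immutable sorted segments.
    Hypothetical sketch; the real code linked in the post differs."""

    def __init__(self, directory: str, flush_at: int = 4):
        self.dir = directory
        self.flush_at = flush_at
        self.memtable = {}  # recent writes; newest value wins
        os.makedirs(directory, exist_ok=True)
        # Segments are numbered; a higher number means newer data.
        self.segments = sorted(
            f for f in os.listdir(directory) if f.endswith(".seg"))

    def set(self, name: str, number: str):
        self.memtable[name] = number
        if len(self.memtable) >= self.flush_at:
            self._flush()

    def _flush(self):
        # Write the memtable out as a sorted, immutable segment file.
        path = os.path.join(self.dir, f"{len(self.segments):06d}.seg")
        with open(path, "w") as f:
            for key in sorted(self.memtable):
                f.write(json.dumps([key, self.memtable[key]]) + "\n")
        self.segments.append(os.path.basename(path))
        self.memtable = {}

    def get(self, name: str):
        # Check the memtable first, then segments newest-first.
        if name in self.memtable:
            return self.memtable[name]
        for seg in reversed(self.segments):
            # Opening a file handle per segment on every lookup is
            # exactly the kind of slowness the post calls out.
            with open(os.path.join(self.dir, seg)) as f:
                for line in f:
                    key, value = json.loads(line)
                    if key == name:
                        return value
        return None
```

A real LSM would also merge (compact) older segments together to drop superseded values; that step is omitted here for brevity.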
And with this post, I can say that I have explicitly scratched this itch.
A fun task to take this further, by the way, is to try to implement a persistent trie for the phone book.
Comments
I get that fixed-length records make moving through the file a bit easier (since you don't need to keep decoding record lengths), at the expense of extra disk space. But wouldn't supporting larger records via record-splitting complicate things quite a bit? Presumably then you need some kind of record type identifier, e.g. "FULL-RECORD, PARTIAL-RECORD", and to handle reconstituting split records?
P.S. I'm really enjoying this phone book series!
cocowalla,
I'm happy you enjoy that. And the key about full / partial records is that you have an easy way to index into the file. You can go to record #8123 and just multiply that by 64 to know what the file offset is. But yes, you'll need an indicator here, which kind of sucks. Eventually, you're led to a page-based model and from there, to B+Trees and friends.
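As a sketch of the indicator idea from the comment above: reserve one byte per 64-byte slot as a marker, so a reader can tell a full record from the pieces of a split one. The marker values and function names here are my own invention for illustration:

```python
RECORD_SIZE = 64
# Hypothetical one-byte markers: full record, start of a split
# record, continuation of a split record.
FULL, FIRST, CONT = b"F", b"S", b"C"


def offset_of(record_number: int) -> int:
    # Record #8123 lives at 8123 * 64 = 519,872 -- pure arithmetic.
    return record_number * RECORD_SIZE


def split_payload(payload: bytes) -> list:
    """Split a payload into 64-byte slots, each led by a marker byte,
    so a reader can reconstitute records that span multiple slots."""
    room = RECORD_SIZE - 1  # one byte reserved for the marker
    if len(payload) <= room:
        return [FULL + payload.ljust(room, b"\0")]
    chunks = [payload[i:i + room] for i in range(0, len(payload), room)]
    return [(FIRST if i == 0 else CONT) + chunk.ljust(room, b"\0")
            for i, chunk in enumerate(chunks)]
```

The cost is one byte of overhead per slot plus padding, which is the trade-off the comment points at: cheap indexing in exchange for wasted space and a bit of reassembly logic.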
@Oren
Ah, I'd love to see another part of this series that introduces page-based storage with a look at the benefits, along with a minimal working code sample!
cocowalla,
For that, you might want to look at Gavran's codebase. It explores this in detail.
Page based storage requires significantly more code, I'm afraid. Not sure that I'll get to that.