Looking back at this series, I have the strong feeling that I’m being unfair to Resin, I’m judging it using the same criteria I would use to judge our own production, highly optimized code. The projects have very different goals, maturity and environments. That said, I think that a lot of the comments I have to the project are at the implementation level. That is, they can be fixed (except maybe the analyzer / tokenizer pipeline) by simply optimizing one method at a a time. Even the architectural change with analyzing the text isn’t very big. What is important is that the code is quite clear, easy to follow and have a well defined structure. That means that it is actually possible to make this changes as the project mature.
And now that this is out of the way, let me cover some of the things that I would have done differently in the codebase. A lot of them are around I/O related. In particular, the usage of all those different files and the way this is done is decidedly non optimal. In particular, opening and closing of the files constantly, reading and seeking all over the place, etc. The actual design seems to be based around LSM, even if this isn’t state explicitly. And that have pretty good semantics already for writes, but reads currently are probably leaning very heavily on the file system cache, but that won’t work as the data grows beyond a certain scope.
When running on a Unix system, you also need to consider the fact that there is a limit to the number of open files you have, so smaller number of files are generally preferred. I would go with merging all those files into a single large one, similar to the compound format that Lucene uses.
Once that is done, I would also memory map the entire file to memory and use directly memory accesses to handle all I/O. This has several very important advantages. First, I’m being a lot more explicit about using the file system cache, and that would allow us to avoid a lot of system calls. Second, the data is already mostly structured as arrays, so it would be very natural to do so. This also avoid the need to manually buffer things in our own memory, which is always nice.
Next, there is the need to consider consistency checks. Resin as it stands now (I’m not sure if this is an explicit design decision) takes the position that it is not its job to ensure file consistency. Lucene make some attempt to ensure consistency, and usually fails at that horribly at the most inconvenient moments. Adding a hash to the file will allow to ensure that the data is okay, but it means having to read the entire file when you open it, which is probably too expensive.
The other aspect that need attention is the data structure used. In particular LcrsTrie is a good idea to save space and might work well for in memory usage, but it isn’t a good choice for persistent data structures. B+Tree or SST are the common choices, and need to be evaluated for the job.
As part of this, and quite important, I would recommend getting a look at the full I/O status. That means:
- How you write to disk?
- How you update data?
- Do you have write amplification (merges)?
- Are you trying for consistency / ACID?
- Can you explain how your data is persisted, using algorithms / approaches that are well known and trusted?
The latter is good both for external users (who can then reason about your system) and for yourself. If you are using LSM, you know that you have a set of problems (compactions, write amplifications) and solutions (auto optimize over time, etc) that is well known and you can make use of that. If you are using B+Trees, then the problems and solutions space are different, but there is even more information about them.
If you are using consistency, are you using WAL or append only? What are your consistency guarantees, etc?
Those are all questions that needs answers, and they have an impact on the design of the project as a whole.
And with this, this series is over. I have to say that I didn’t think that I would have so much to write about. It is a very interesting project.