Reviewing Noise Search EngineSummary
After the last three parts, I think that I have a good high level understanding of what is going on in the Noise codebase, as a reminder, I’m reviewing 48d808375e541eca383826ea928272976bce064d.
I love reading code, especially stuff that I’m not familiar with, such as a new platform or doing something novel. The Noise codebase is quite interesting, and it is a very early stage in the project. That said, when looking at what it is doing and comparing to what I’m doing in my day job, a few things popped up.
I couldn’t help but think that this is a very similar model to the default Azure DocumentDB. All the data is indexed, and the values are shredded as they go in. The model you use was very similar. The nice thing about this kind of model is that it gives you a lot of freedom and flexibility, but the downside is that if you need more, there aren’t any alternatives. Storing large JSON documents inside Noise is likely to cause problems, especially if you are typically working with the full document. If you are just throwing things into it and then pulling back pieces, especially if the documents you write are relatively small, this is likely to be much more suited.
Noise is making some clever usage of the LSM nature of RocksDB to do cleanups, and I really like the elegance of the indexing engine. The query model is also interesting, in that it gives you a lot of freedom to operate. Select the right fields, define the appropriate aggregation, etc. But queries are limited (no computation) and aggregation queries are going to run with cost of O(N) with respect to their results. Given that this is intended to be embedded, and apparently into low end devices, that might very well be the right decision.
A side affect of the shredding is that you need to reconstruct documents all the time, since they were broken into component parts. Things like scanning results by one field while returning another looks like they can generate O(N * log N) number of queries in RocksDB. I’m not sure how much that can scale with regards to queries. On the one hand, there are very clear layers between the pieces of the code, which is great, but that also have some built in issues with regards to how much performance you can squeeze out of the system. I don’t have a feeling for Rust code performance, nor do I think that the project is ready at this early stage for performance analysis, but it is a concern. Then again, I did just spent the past two years head down in a major performance work, so that might very well be the reason. And it is entirely possible that the code will perform significantly better when the project is done.
One last thing to note, when I tried building a replacement to Lucene one of the annoying thing to pop up was that it is really hard to replace Lucene efficiently and still be better in term of time / space / complexity. Noise in particular is storing all the data at least twice (once as terms to be searched, once as the actual document data). This make is very easy to do certain operations (such as update a document), but it also means that the cost of updating a document is now doubled. We first need to index the new document, then we need to re-index old document, compare between the two and then write the changes. I guess that it is possible that reducing the amount work to be written would help matters, but the cost of indexing is likely to dominate.
I would expect a Noise index to be significantly larger than the Lucene equivalent, because Lucene does a lot to reduce file size (in quite evil & nasty ways, to be fair).
Another thing that I don’t think that is good is the reliance on a query language in an embedded solution. A much better idea in my option would be to expose the raw documents out and let the client handle it. If it is running in the same memory space, there is no extra cost. This is because the moment that this open up a huge number of features / issues. I’m talking specifically about the notion of “return” here. The idea is really cool, you can project exactly what you want from the document, and that allows the engine to only pull the relevant data.
At any rate, this is really getting late now, so I’ll close with the note that Noise is a really cool codebase to read. Even if you don’t know Rust, going ahead and reading something non trivial is a really good exercise in learning new things. And I’m really interested in what Noise is going to do next.
I had the same feeling. Still, the Azure whitepaper describes their reasoning and trade offs in a more in-depth manner: