Reviewing ResinPart V
In the previous part, I looked at how indexing and queries are handled in Resin. This post is mostly about the pieces I haven’t talked about so far. We’ll start with the query parser and move to the trie.
Queries in Resin looks like this:
This is sort of looking like the Lucene syntax, but it looks like it keeps the same context until a new field comes along.
Range queries looks better, sort of:
I had a hard time figuring this one out, until I realized that this is not an XML tag in the middle.
The problem is that the Lucene query syntax kinda sucks. Actually, it sucks a lot. It is complex and ambiguous to parse and it is full of all those little things like the ~ over there that is not very obvious but is very important to the query. I would actually suggest something more like SQL. Sure, that wouldn’t be what you’ll put in the search box, but programmers will appreciate you for that.
Looking at the parser code, there aren’t any surprises there. It is using a hard rolled system using regex and split, which can be vastly improved. One thing to note is that because of the simplicity of the parser, it isn’t really able to process things like a search for a token with a colon in it, so it can’t process this query:
url:http://ayende.com
Anyway, the query parser isn’t really the most important thing here. The core of Resin, and what I haven’t looked at so far at all is the trie…
LcrsTrie stands for Left child Right sibling, there is a good discussion on the reasons why you’ll want to use this here. At this point, I’m not really sure why the choice of Lcrs was used. In general, they are used to reduce space and simplify the representation, but I don’t think that this is a good idea for a persistent structure. I’ll look at that later. Right now I’m reading the code, and it is mostly pretty obvious code. But then you get to this:
This pattern of using IEnumerable to return a single value is something that I’ve seen in other places in the codebase, and I don’t really get it.
I like the use of the Levenshtein distance in fuzzy search, mostly because we don’t need to store a lot of data to get fuzzy search working. In particular, it looks like suggestion style queries are pretty easy, and would be much cheaper then it would be in Lucene.
Probably the core operation you always perform on a trie is the search, and the core of that in this case is the TryFindPath method:
There is nothing surprising in this code, but it is a pure in memory implementation, which is a very different environment then a persistent data structure.
The persistent data structure is actually the MappedTrieReader, so let us examine it. Looking at it, there is some reference to the notions of segments within the file, but I’m not seeing where it is used. This is what the “*.six” file is used for, it seems. I think that this might be related to merging, but I don’t really know.
And here is the reason for the IsWord design:
When using a single LcrsTrie, it isn’t needed. But when using a possibly segmented reader, we might have multiple results for the same word.
What is worrying here is that the same access pattern for the trie that is used in memory is also using in the persistent mode. That means that each time we need to traverse the trie, we’ll need to seek. Actually, it looks like that might only be needed when we aren’t on the right path, but that is actually pretty common, so there are going to be a lot of seeks.
That is enough for now, my next post will be more detailed analysis of the Resin I/O structure and what I would probably do instead.
Comments
Comment preview