The Corax Experiment: API
I posted before about design practice and how I would approach building a search engine library. I decided to bite the bullet and actually try to do this. Using Voron, that turned out to be a really simple thing to do. Of course, this doesn’t do a tenth of what Lucene does, but it actually does quite a lot. The code is available here, and I want to emphasize again, this is purely an experimental / research project.
The entire thing comes to less than 500 lines of code. And it is pretty functional even at this stage.
Corax is composed of:
- Analysis
- Indexing
- Querying
Analysis of the documents is handled via analyzers:
public interface IAnalyzer
{
    ITokenSource CreateTokenSource(string field, ITokenSource existing);
    bool Process(string field, ITokenSource source);
}
An analyzer creates a token source, which accepts a TextReader and produces tokens. For each token, the Process method is called, and it is used to do things to the relevant token. For example, here is the default analyzer:
public class DefaultAnalyzer : IAnalyzer
{
    readonly IFilter[] _filters =
    {
        new LowerCaseFilter(),
        new RemovePossesiveSuffix(),
        new StopWordsFilter(),
    };

    public ITokenSource CreateTokenSource(string field, ITokenSource existing)
    {
        return existing ?? new StringTokenizer();
    }

    public bool Process(string field, ITokenSource source)
    {
        for (int i = 0; i < _filters.Length; i++)
        {
            if (_filters[i].ProcessTerm(source) == false)
                return false;
        }
        return true;
    }
}
The idea here is to match, fairly closely, what Lucene is doing, but hopefully with clearer code. This analyzer will take a stream of text, break it up into discrete tokens, lower case them, remove the possessive ‘s suffix and clear stop words. Note that each of the filters actually modifies the token in place. And the tokenizer is pretty simple, but it does the job for now.
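To make the in-place filtering concrete, here is a minimal, self-contained sketch of that pipeline. Note that this is purely illustrative: the TokenSource class, the whitespace tokenization, and the stop word list below are my own simplified stand-ins, not Corax’s actual ITokenSource or filter implementations.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for a token source; the real ITokenSource
// wraps a TextReader, which is omitted here for brevity.
public class TokenSource
{
    public string Buffer = "";
}

public interface IFilter
{
    // Returns false to drop the current token entirely.
    bool ProcessTerm(TokenSource source);
}

public class LowerCaseFilter : IFilter
{
    public bool ProcessTerm(TokenSource source)
    {
        source.Buffer = source.Buffer.ToLowerInvariant(); // modify in place
        return true;
    }
}

public class StopWordsFilter : IFilter
{
    // A toy stop word list, just for the demo.
    static readonly HashSet<string> StopWords =
        new HashSet<string> { "the", "a", "is" };

    public bool ProcessTerm(TokenSource source)
    {
        return StopWords.Contains(source.Buffer) == false;
    }
}

public static class Demo
{
    // Naive whitespace tokenization; the real StringTokenizer is smarter.
    public static List<string> Analyze(string text)
    {
        var filters = new IFilter[] { new LowerCaseFilter(), new StopWordsFilter() };
        var tokens = new List<string>();
        foreach (var word in text.Split(' '))
        {
            var source = new TokenSource { Buffer = word };
            var keep = true;
            foreach (var filter in filters)
            {
                if (filter.ProcessTerm(source) == false)
                {
                    keep = false; // a filter rejected the token
                    break;
                }
            }
            if (keep)
                tokens.Add(source.Buffer);
        }
        return tokens;
    }

    public static void Main()
    {
        Console.WriteLine(string.Join(",", Analyze("The Corax Experiment is Fun")));
        // prints: corax,experiment,fun
    }
}
```

The key point is that filters mutate the token buffer directly rather than allocating new token objects, which is what keeps the hot indexing path free of garbage.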
Now, let us move to indexing. With Lucene, the recommendation is that you reuse your document and field instances, to avoid creating garbage for the GC. With Corax, I took it a step further:
using (var indexer = fullTextIndex.CreateIndexer())
{
    indexer.NewDocument();
    indexer.AddField("Name", "Oren Eini");

    indexer.NewDocument();
    indexer.AddField("Name", "Ayende Rahien");

    indexer.Flush();
}
There are a couple of things to note here. An index can create indexers; it is intended to have multiple concurrent indexers running at the same time. Note that unlike Lucene, we don’t have Document or Field classes. Instead, we call methods on the indexer to create a new document and then add fields to the current document. When you are done with a document, you start a new one, or flush to complete the entire operation. For long running indexing, the indexer will flush itself automatically for you.
I think that this API gives us the best approach. It guides you toward using a single instance, with internal optimizations to make it memory efficient. Multiple instances can be used concurrently to speed up indexing time. And it knows when to flush itself for you, so you don’t have to worry about that. Although you do have to complete the operation by calling Flush() at the end.
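The concurrent-indexers design can be sketched with a toy in-memory stub. To be clear, everything below except the method names (CreateIndexer, NewDocument, AddField, Flush) is my own invention for illustration; the real Corax indexer writes to Voron and auto-flushes on a size threshold rather than appending to a queue.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;

// Hypothetical in-memory stand-in for the index; thread-safe storage
// so several indexers can flush into it concurrently.
public class FullTextIndex
{
    public readonly ConcurrentQueue<KeyValuePair<string, string>> Terms =
        new ConcurrentQueue<KeyValuePair<string, string>>();

    public Indexer CreateIndexer()
    {
        return new Indexer(this);
    }
}

public class Indexer : IDisposable
{
    private readonly FullTextIndex _index;
    private readonly List<KeyValuePair<string, string>> _pending =
        new List<KeyValuePair<string, string>>();

    public Indexer(FullTextIndex index)
    {
        _index = index;
    }

    public void NewDocument()
    {
        // A real indexer would track document boundaries here, and
        // auto-flush once enough documents had accumulated.
    }

    public void AddField(string name, string value)
    {
        _pending.Add(new KeyValuePair<string, string>(name, value));
    }

    public void Flush()
    {
        foreach (var term in _pending)
            _index.Terms.Enqueue(term);
        _pending.Clear();
    }

    public void Dispose()
    {
        Flush(); // completing the scope completes the operation
    }
}

public static class Program
{
    public static void Main()
    {
        var index = new FullTextIndex();
        // Two indexers running concurrently against the same index.
        Parallel.For(0, 2, i =>
        {
            using (var indexer = index.CreateIndexer())
            {
                indexer.NewDocument();
                indexer.AddField("Name", "User " + i);
                indexer.Flush();
            }
        });
        Console.WriteLine(index.Terms.Count); // 2
    }
}
```

Each indexer batches its own pending work privately, so the only synchronization cost is paid at Flush() time, which is what makes running several indexers in parallel attractive.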
How about searching? That turned out to be pretty similar as well. All you have to do is create a searcher:
using (var searcher = fti.CreateSearcher())
{
    QueryResults queryResults = searcher.QueryTop(new TermQuery("Name", "Arava"), 10);
    Console.WriteLine(queryResults.TotalResults);
    foreach (var match in queryResults.Results)
    {
        Console.WriteLine(match.DocumentId + " - " + match.Score);
    }
}
We create a searcher, and then we can use it to perform queries.
So far, this has been all about the API we have, I’ll talk about the actual implementation in my next post.
Comments
Does this mean you're thinking about making the indexers be a provider model that you could theoretically use lucene, corax, or roll your own?
voron power supreme
Chris, No, if we go that way, and that is a big if, it would be a port, not a provider model thingie.
What do you mean port? As in you rewrite lucene.net itself using lucene as reference?
Chris, No, I mean that if we decide to use Corax, we'll be using that, not Lucene.NET. And we'll not be using any sort of a provider model anywhere. It makes no sense to try.
I sure hope you'd never replace Lucene.NET with Corax unless that allowed you to make LINQ exactly match RavenDB and that there would never be an instance you have to quit LINQ and goto Advanced.Corax....
Learning Lucene.NET & RavenDB was a lot of effort. You firmly need to understand both to excel. That would be a big deal if you throw away all existing knowledge to switch to Corax with a perfect harmony of integration for 1 single api.
to switch to Corax withOUT* a perfect harmony of integration for 1 single api.
Chris, Corax is meant as a research project, please remember that. Now, assuming a hypothetical in which we are replacing Lucene with Corax, from an external point of view, not much would change.
My argument is more of you can't merely "replace" Lucene.NET. That would be a terrible experience. However if you created a single unified RavenDB LINQ experience for both querying and creating indexes. That there is zero impedance mismatch anywhere. That could be justified. But if you trade merely integration with Lucene.NET for corax or anything else, that would be terrible to existing users.
Chris, Again, _ hypothetical _ here. You are getting mixed up in this for no reason.
And no, Corax doesn't have, nor will it have, a Linq API. It will have string queries, like Lucene.