The Corax Experiment: API


I posted before, as design practice, about how I would approach building a search engine library. I decided to bite the bullet and actually try to do this. Using Voron, that turned out to be a really simple thing to do. Of course, this doesn’t do a tenth of what Lucene does, but it actually does quite a lot. The code is available here, and I want to emphasize again, this is purely an experimental / research project.

The entire thing comes to less than 500 lines of code. And it is pretty functional even at this stage.

Corax is composed of:

  • Analysis
  • Indexing
  • Querying

Analysis of the documents is handled via analyzers:

    public interface IAnalyzer
    {
        ITokenSource CreateTokenSource(string field, ITokenSource existing);
        bool Process(string field, ITokenSource source);
    }

An analyzer creates a token source, which accepts a TextReader and produces tokens. For each token, the Process method is called, and it is used to do things to the relevant token. For example, here is the default analyzer:

    public class DefaultAnalyzer : IAnalyzer
    {
        readonly IFilter[] _filters =
        {
            new LowerCaseFilter(),
            new RemovePossesiveSuffix(),
            new StopWordsFilter(),
        };

        public ITokenSource CreateTokenSource(string field, ITokenSource existing)
        {
            return existing ?? new StringTokenizer();
        }

        public bool Process(string field, ITokenSource source)
        {
            for (int i = 0; i < _filters.Length; i++)
            {
                if (_filters[i].ProcessTerm(source) == false)
                    return false;
            }
            return true;
        }
    }

The idea here is to match, fairly closely, what Lucene is doing, but hopefully with clearer code. This analyzer will take a stream of text, break it up into discrete tokens, lowercase them, remove the possessive ‘s suffix and drop stop words. Note that each of the filters actually modifies the token in place. The tokenizer is pretty simple, but it does the job for now.
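To make the "modify the token in place" idea concrete, here is a minimal sketch. It is not the actual Corax code: `TokenBuffer`, `FilterSketch`, the whitespace splitting, and the tiny stop word list are all assumptions made up for illustration. The point it tries to show is that a filter can lowercase a term or trim its suffix by touching a reusable buffer and its length, and that any filter returning false drops the token.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for a token: a reusable char buffer plus the length in use.
public sealed class TokenBuffer
{
    public char[] Buffer = new char[256];
    public int Size;

    public void Set(string text)
    {
        // Grow the buffer only when a term does not fit; otherwise reuse it.
        if (text.Length > Buffer.Length)
            Buffer = new char[text.Length];
        text.CopyTo(0, Buffer, 0, text.Length);
        Size = text.Length;
    }

    public override string ToString() => new string(Buffer, 0, Size);
}

public static class FilterSketch
{
    // Assumed stop word list, far smaller than a real one.
    static readonly HashSet<string> StopWords = new HashSet<string> { "the", "a", "of", "and" };

    // Lowercases the term in place; never rejects it.
    public static bool LowerCase(TokenBuffer token)
    {
        for (int i = 0; i < token.Size; i++)
            token.Buffer[i] = char.ToLowerInvariant(token.Buffer[i]);
        return true;
    }

    // Removes a trailing 's by shrinking Size; the buffer itself is untouched.
    public static bool RemovePossessive(TokenBuffer token)
    {
        if (token.Size >= 2 &&
            token.Buffer[token.Size - 2] == '\'' &&
            token.Buffer[token.Size - 1] == 's')
            token.Size -= 2;
        return true;
    }

    // Returns false for stop words. This allocates a string for the lookup,
    // which a real implementation would probably avoid.
    public static bool NotStopWord(TokenBuffer token) => !StopWords.Contains(token.ToString());

    // Runs the filters in order; the first one to return false drops the token.
    public static bool Process(TokenBuffer token) =>
        LowerCase(token) && RemovePossessive(token) && NotStopWord(token);

    // Splits on spaces (a simplification) and returns the surviving terms.
    public static List<string> Analyze(string text)
    {
        var terms = new List<string>();
        var token = new TokenBuffer();
        foreach (var word in text.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries))
        {
            token.Set(word);
            if (Process(token))
                terms.Add(token.ToString());
        }
        return terms;
    }
}
```

With this sketch, `FilterSketch.Analyze("The Dog's Bone")` yields the terms `dog` and `bone`: "The" is lowercased and then rejected as a stop word, and "Dog's" loses its suffix.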

Now, let us move to indexing. With Lucene, the recommendation is that you reuse your document and field instances, to avoid creating garbage for the GC. With Corax, I took it a step further:

    using (var indexer = fullTextIndex.CreateIndexer())
    {
        indexer.NewDocument();
        indexer.AddField("Name", "Oren Eini");

        indexer.NewDocument();
        indexer.AddField("Name", "Ayende Rahien");

        indexer.Flush();
    }

There are a couple of things to note here. An index can create indexers, and it is intended to have multiple concurrent indexers running at the same time. Note that unlike Lucene, we don’t have Document or Field classes. Instead, we call methods on the indexer to create a new document and then add fields to the current document. When you are done with a document, you start a new one, or flush to complete the entire operation. For long running indexing, the indexer will flush itself automatically for you.

I think that this API gives us the best approach. It guides you toward using a single instance, with internal optimizations to make it memory efficient. Multiple instances can be used concurrently to speed up indexing. And it knows when to flush itself for you, so you don’t have to worry about that, although you do have to complete the operation by calling Flush() at the end.
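As a rough illustration of that shape (no Document or Field objects, automatic flushing, and an explicit Flush() to finish), here is a small in-memory sketch. Everything in it is hypothetical: `SketchIndexer`, the `autoFlushThreshold` parameter, the choice to auto-flush at document boundaries, and the choice to discard unflushed entries on Dispose are assumptions for the example, not how Corax is actually implemented.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical sketch of a document-less indexer; not Corax's implementation.
public sealed class SketchIndexer : IDisposable
{
    private readonly int _autoFlushThreshold; // assumed knob: pending entries before an automatic flush
    private readonly List<(int DocId, string Field, string Value)> _pending =
        new List<(int DocId, string Field, string Value)>();
    private readonly List<(int DocId, string Field, string Value)> _committed =
        new List<(int DocId, string Field, string Value)>();
    private int _currentDoc = -1;

    public int FlushCount { get; private set; }
    public int CommittedCount => _committed.Count;

    public SketchIndexer(int autoFlushThreshold = 1024)
    {
        _autoFlushThreshold = autoFlushThreshold;
    }

    // Starting a new document closes the previous one, so this is a safe
    // point to flush automatically if enough work has piled up.
    public void NewDocument()
    {
        if (_pending.Count >= _autoFlushThreshold)
            Flush();
        _currentDoc++;
    }

    // Fields always go to the current document; there is no Field object to allocate.
    public void AddField(string field, string value)
    {
        if (_currentDoc < 0)
            throw new InvalidOperationException("Call NewDocument() first.");
        _pending.Add((_currentDoc, field, value));
    }

    // Moves pending entries to the committed list. A real indexer would
    // write them to storage here instead.
    public void Flush()
    {
        if (_pending.Count == 0)
            return;
        _committed.AddRange(_pending);
        _pending.Clear();
        FlushCount++;
    }

    // Assumption: anything not flushed is discarded, which is why the
    // caller must end with an explicit Flush().
    public void Dispose() => _pending.Clear();
}
```

Because each instance only holds its own pending list, several such indexers could run side by side, each flushing independently. How the real thing coordinates concurrent writers is not covered by this sketch.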

How about searching? That turned out to be pretty similar as well. All you have to do is create a searcher:

    using (var searcher = fti.CreateSearcher())
    {
        QueryResults queryResults = searcher.QueryTop(new TermQuery("Name", "Arava"), 10);
        Console.WriteLine(queryResults.TotalResults);
        foreach (var match in queryResults.Results)
        {
            Console.WriteLine(match.DocumentId + " - " + match.Score);
        }
    }

We create a searcher, and then we can utilize it to perform queries.
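One plausible reading of that result shape is that TotalResults counts every matching document while Results holds only the best N by score. The sketch below illustrates that contract over plain arrays of terms. It is an assumption-laden toy: `TermQuerySketch`, the term frequency score, and the in-memory document list are invented for the example and say nothing about how Corax actually stores or scores matches.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical sketch of "query top N for a term"; not Corax's implementation.
public static class TermQuerySketch
{
    // docs[i] holds the already-analyzed terms of one field for document i.
    public static (int Total, List<(int DocumentId, float Score)> Results) QueryTop(
        IReadOnlyList<string[]> docs, string term, int take)
    {
        var matches = new List<(int DocumentId, float Score)>();
        for (int docId = 0; docId < docs.Count; docId++)
        {
            int freq = docs[docId].Count(t => t == term);
            if (freq > 0)
            {
                // Assumed score: share of the field's terms that equal the query term.
                matches.Add((docId, (float)freq / docs[docId].Length));
            }
        }

        // Total counts every match; Results keeps only the highest scoring ones.
        var top = matches.OrderByDescending(m => m.Score).Take(take).ToList();
        return (matches.Count, top);
    }
}
```

For three documents with the terms `oren eini`, `arava arava dog`, and `arava eini`, asking for the top 1 match on `arava` reports a total of 2 and returns only document 1, since two of its three terms match.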

So far, this has been all about the API we have. I’ll talk about the actual implementation in my next post.