Design practice: Building a search engine library

time to read 11 min | 2174 words

Note: This is done purely as a design practice. We don’t have any current plans to implement this, but I find that it is a good exercise in general.

How would I go about building a search engine for RavenDB to replace Lucene. Well, we have Voron as the basis for storage, so from the get go, we have several interesting changes. To start with, we inherit the transactional properties of Voron, but more importantly, we don’t have to do merges, so we don’t have any of those issues. In other words, we can actually generate stable document ids.

But I’m probably jumping ahead a bit. We’ll start with the basics. Analysis / indexing is going to be done in very much the same way. Users will provide an analyzer and a set of documents, like so:

   1: var index = new Corax.Index(storageOptions, new StandardAnalyzer());
   2:  
   3: var scope = index.CreateIndexingScope();
   4:  
   5: long docId = scope.Add(new Document
   6: {
   7:     {"Name", "Oren Eini", Analysis.Analyzer},
   8:     {"Name", "Ayende Rahien", Analysis.Analyzer},
   9:     {"Email", "ayende@ayende.com", Analyzed.AsIs, Stored.Yes}
  10: });
  11:  
  12: docId = scope.Add(new Document
  13: {
  14:     {"Name", "Arava Eini", Analysis.Analyzer},
  15:     {"Email", "arava@doghouse.com", Analyzed.AsIs, Stored.Yes}
  16: });
  17:  
  18: index.Sumbit(index);

Some notes about this API. It is modeled quite closely after the Lucene API, and it would probably need additional work. The idea here is that you are going to need to get an indexing scope, which is single threaded. And you can have multiple indexing scopes running a the same time. You can batch multiple writes into a single scope, and it behaves like a transaction.

The idea is to deal with all of the work associated with indexing the document into a single threaded work, so that make it easier for us. Note that we immediately get the generated document id, but that the document will only be available for searching when you have submitted the scope.

Under the hood, this does all of the work at the time of calling Add(). The analyzer will run on the relevant fields, and we will create the appropriate entries. How does that work?

Every document has a long id associated with it. In Voron, we are going to have a ‘Documents’ tree, with the long id as the key, and the value is going to be all the data about the document we need to keep. For example, it would have the stored fields for that documents, or the term positions, if required, etc. We’ll also have a a Fields tree, which will have a mapping of all the field names to a integer value.

Of more interest is how we are going to deal with the terms and the fields. For each field, we are going to have a tree called “@Name”, “@Email”, etc. Those are multi trees, with the keys in that tree being the terms, and the multi values being the document ids that has those threes. In other words, the code above is going to generate the following data in Voron.

  • Documents tree:
    • 0 – { “Fields”: { 1: “ayende@ayende.com” } }
    • 1 – { “Fields”: { 1: “arava@doghouse.com” } }
  • Fields tree
    • 0 – Name
    • 1 – Email
  • @Name tree
    • ayende – [ 0 ]
    • arava – [ 1 ]
    • oren – [ 0 ]
    • eini – [ 0, 1 ]
    • rahien – [ 0 ]
  • @Email tree
    • ayende@ayende.com – [ 0 ]
    • arava@doghouse.com – [ 1 ]

Given this model, we can now turn to the actual querying part of the business. Let us assume that we have the following query:

   1: var queryResults = index.Query("Name: Oren OR Email: ayende@ayende.com");
   2: foreach(var result in queryResult.Matches)
   3: {
   4:     Console.WriteLine("{0}: {1}", result.Id, result["Email"] )
   5: }

How does querying works? The query above would actually be something like:

   1: new BooleanQuery(
   2:     Match.Or,
   3:     new TermQuery("Name", "Oren"),
   4:     new TermQuery("Email", "ayende@ayende.com")
   5:     )

The actual implementation of this query would just access the relevant field tree and load the values in the particular key. The result is a set of ids for both parts of the query, and since we are using an OR here, we will just union them and return the list of results back.

Optimizations here can be to run those queries in parallel, or to just cache the results of particular term query, so we can speed things even more.  This looks simple, and it is, because a lot of the work has already been done in Voron. Searching is pretty much complete, we only need to tell it what to search with.