Corax: The problem of space

Corax is a research project of ours, exploring how we can build a full text search library on top of Voron. Along the way, we take the chance to find out how Lucene does things and what we can do better. Pretty much from the get go, Corax is likely to use more disk space than Lucene, probably significantly so. I would be happy if we could get away with a mere 50% increase over Lucene.

The reason is that Lucene goes to great lengths to save disk space: storing all integers in variable-length format, prefix compression, implicitly referencing data in other files. For example, you can see that when you try reading term positions:

TermPositions are ordered by term (the term is implicit, from the .tis file).

Positions entries are ordered by increasing document number (the document number is implicit from the .frq file).
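To make the variable-length integer point concrete, here is a minimal sketch of the VInt format that Lucene's file format docs describe: seven payload bits per byte, with the high bit set when more bytes follow. This is my own reimplementation for illustration, not Lucene's actual code:

using System.Collections.Generic;

public static class VInt
{
    // Encode a non-negative integer using 7 bits per byte; the high bit
    // of each byte signals that another byte follows.
    public static byte[] Write(int value)
    {
        var bytes = new List<byte>();
        var v = (uint)value;
        while (v >= 0x80)
        {
            bytes.Add((byte)(v | 0x80));
            v >>= 7;
        }
        bytes.Add((byte)v);
        return bytes.ToArray();
    }

    // Decode a value previously written by Write, advancing pos past it.
    public static int Read(byte[] buffer, ref int pos)
    {
        int result = 0, shift = 0;
        byte b;
        do
        {
            b = buffer[pos++];
            result |= (b & 0x7F) << shift;
            shift += 7;
        } while ((b & 0x80) != 0);
        return result;
    }
}

A value like 300, for example, takes two bytes this way instead of four.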

The downside of saving every little bit is that it is a lot more complex to read the data, requiring multiple caches and complex code paths to actually get it out properly. It made a lot of sense when Lucene was created; disk space was at a premium. I won't go as far as to say that disk space doesn't matter, but given a trade-off of using more disk space vs. using more memory / complexity, it is much easier to justify disk space usage today*.

* The caveat here is that you need to be careful, because just accessing the disk can be very slow.

One of the major things that we wanted to deal with in Corax is reducing index corruption issues, and seeing if we can simplify things into a transactional system. As a side effect of that, we don't need to have index segments, and we don't need to do merges to free disk space. The problem is that in order to handle this, we need to track additional information that Lucene doesn't need to.

Let us look at the actual data we keep. Here is a very simple index:

using (var fullTextIndex = new FullTextIndex(StorageEnvironmentOptions.CreateMemoryOnly(), new DefaultAnalyzer()))
{
    using (var indexer = fullTextIndex.CreateIndexer())
    {
        // First index entry
        indexer.NewIndexEntry();
        indexer.AddField("Name", "Oren Eini");
        indexer.AddField("Email", "Ayende@ayende.com");

        // Second index entry
        indexer.NewIndexEntry();
        indexer.AddField("Name", "Arava Eini");
        indexer.AddField("Email", "Arava@houseof.dog");

        indexer.Flush();
    }
}

For each field, we are going to create a multi tree, and for each unique term in the field we keep a list of (index entry id, term frequency, boost) entries:

  • @fld_Name
    • arava
      • { 2, 1, 1.0 }  (index entry id 2, frequency 1, boost 1.0)
    • eini
      • { 1, 1, 1.0 }
      • { 2, 1, 1.0 }
    • oren
      • { 1, 1, 1.0 }
  • @fld_Email
    • arava@houseof.dog
      • { 2, 1, 1.0 }
    • ayende@ayende.com
      • { 1, 1, 1.0 }

This layout is pretty much the equivalent of the way Lucene stores things. Possible space optimizations here include not storing default values (a term frequency or boost of 1), storing index entry ids as variable-length ints, etc.
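Voron specifics aside, the logical shape of the listing above is easy to model in plain C#. The sketch below is an in-memory stand-in with made-up names, not the actual Corax multi tree API, and its default arguments hint at the default-values optimization just mentioned:

using System.Collections.Generic;

// One posting: which index entry contains the term, how often, and at what boost.
public record Posting(long IndexEntryId, int Frequency, float Boost);

public class FieldTrees
{
    // Field name -> term -> postings, mirroring the @fld_* multi trees.
    private readonly Dictionary<string, SortedDictionary<string, List<Posting>>> _fields = new();

    // Frequency and boost default to 1 / 1.0, exactly the values we could avoid storing.
    public void Add(string field, string term, long entryId, int freq = 1, float boost = 1.0f)
    {
        if (!_fields.TryGetValue(field, out var terms))
            _fields[field] = terms = new SortedDictionary<string, List<Posting>>();
        if (!terms.TryGetValue(term, out var postings))
            terms[term] = postings = new List<Posting>();
        postings.Add(new Posting(entryId, freq, boost));
    }
}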

The problem is that while this is actually enough for the way Lucene does things, it is not enough for the way Corax does things. Let us consider the case of deleting a document. How would you go about doing this using the information above?

Lucene does this by marking a document id as deleted, and will purge its details on the next segment merge. That works, but only because a segment merge actually reads & writes all of the relevant segments' data. Without segment merges, deleting a document would require us to scan all the data in the entire database, which is not really practical. Therefore, we need to store additional data so we can delete it later on. In this case, we have the Docs tree, whose keys are (index entry id, field id, term num). It looks like this:

  • Docs
    • [1,1,1]: oren
    • [1,1,2]: eini
    • [1,2,1]: ayende@ayende.com
    • [2,1,1]: arava
    • [2,1,2]: eini
    • [2,2,1]: arava@houseof.dog

Using this information, we can now remove all traces of a document when it is deleted. The problem is that we also need to keep the terms per document in the index, which obviously blows up the index size.
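To see how the Docs tree makes deletion tractable, here is a sketch of the delete flow over a plain in-memory model. The types and names here are made up for illustration, and a real implementation would seek to the entry id prefix in the tree rather than scan all keys:

using System.Collections.Generic;
using System.Linq;

public class IndexModel
{
    // Field name -> term -> index entry ids holding it (frequencies/boosts elided).
    public Dictionary<string, Dictionary<string, HashSet<long>>> Fields = new();

    // (index entry id, field id, term num) -> term, mirroring the Docs tree.
    public SortedDictionary<(long EntryId, int FieldId, int TermNum), string> Docs = new();

    // Field id -> field name, so Docs keys can be resolved back to a field tree.
    public Dictionary<int, string> FieldNames = new();

    public void Delete(long entryId)
    {
        // Keys sort by entry id first, so a document's terms are contiguous.
        var keys = Docs.Keys.Where(k => k.EntryId == entryId).ToList();
        foreach (var key in keys)
        {
            var term = Docs[key];
            var field = FieldNames[key.FieldId];
            Fields[field][term].Remove(entryId);    // drop the posting
            if (Fields[field][term].Count == 0)
                Fields[field].Remove(term);         // drop a now-empty term
            Docs.Remove(key);                       // drop the bookkeeping entry
        }
    }
}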

The reason for storing the document fields in this peculiar manner is that we also want to reuse this information for sorting. When Lucene needs to sort data, it has to read all of the data from the fields, then recreate the values for all relevant documents. Corax can just serve the data that is already there.
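Roughly what that buys us, sketched over the same made-up Docs mapping, and assuming for simplicity that the first term of a field (term num 1) serves as the sort key:

using System.Collections.Generic;
using System.Linq;

public static class Sorting
{
    // docs maps (index entry id, field id, term num) -> term, like the Docs tree.
    public static List<long> SortByField(
        SortedDictionary<(long EntryId, int FieldId, int TermNum), string> docs,
        IEnumerable<long> entryIds, int fieldId)
    {
        // Each document's sort key is already stored; sorting is a lookup per id,
        // not a re-read of all the field data.
        return entryIds
            .OrderBy(id => docs[(id, fieldId, 1)])
            .ToList();
    }
}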

A pretty obvious step to save space would be to track the terms separately and use an id in the Docs tree, not the full term. That leads to an interesting problem, because we would need to be able to go from term -> id and from id -> term, which pretty much requires storing them twice, unfortunately.
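The double bookkeeping would look something like this, sketched with hypothetical names and both directions kept in sync:

using System.Collections.Generic;

public class TermDictionary
{
    private readonly Dictionary<string, long> _termToId = new();
    private readonly Dictionary<long, string> _idToTerm = new();
    private long _nextId = 1;

    // term -> id, needed when indexing new entries.
    public long GetOrAdd(string term)
    {
        if (_termToId.TryGetValue(term, out var id))
            return id;
        id = _nextId++;
        _termToId[term] = id;
        _idToTerm[id] = term;   // id -> term, needed for deletes and sorting
        return id;
    }

    public string GetTerm(long id) => _idToTerm[id];
}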

Final note: Corax is a research project.