
Corax: The problem of space

Corax is a research project that we have, to see how we can build a full text search library on top of Voron. Along the way, we take the chance to find out how Lucene does things, and what we can do better. Pretty much from the get go, Corax is likely to use more disk space than Lucene, probably significantly so. I would be happy if we could get away with merely a 50% increase over Lucene.

The reason for this is that Lucene goes to great lengths to save disk space. From storing all integers in variable length format, to prefix compression, to implicitly referencing data in other files. For example, you can see that when you try reading term positions:

TermPositions are ordered by term (the term is implicit, from the .tis file).

Positions entries are ordered by increasing document number (the document number is implicit from the .frq file).

The downside of saving every little bit is that it is a lot more complex to read the data, requiring multiple caches and complex code paths to actually get it properly. It made a lot of sense when Lucene was created: disk space was at a premium. I won’t go as far as to say that disk space doesn’t matter, but given a trade off of using more disk space vs. using more memory / complexity, it is much easier to justify disk space usage today*.

* The caveat here is that you need to be careful, because just accessing the disk can be very slow.

One of the major things that we wanted to deal with in Corax is reducing index corruption issues, and seeing if we can simplify things into a transactional system. As a side effect of that, we don’t need to have index segments, and we don’t need to do merges to free disk space. The problem is that in order to handle this, we need to track additional information that Lucene doesn’t need to.

Let us look at the actual data we keep. Here is a very simple index:

using (var fullTextIndex = new FullTextIndex(StorageEnvironmentOptions.CreateMemoryOnly(), new DefaultAnalyzer()))
{
using (var indexer = fullTextIndex.CreateIndexer())
{
indexer.NewIndexEntry();
indexer.AddField("Name", "Oren Eini");
indexer.AddField("Email","Ayende@ayende.com");

indexer.NewIndexEntry();
indexer.AddField("Name", "Arava Eini");
indexer.AddField("Email","Arava@houseof.dog");

indexer.Flush();
}
}

For each field, we are going to create a multi tree. And for each unique term in the field we have a list of (Index Entry Id, term frequency, boost).

  • @fld_Name
    • arava
      • { 2, 1, 1.0 }  (index id 2, freq 1, boost 1.0)
    • eini
      • { 1,1,1.0 }
      • { 2,1,1.0 }
    • oren
      • { 1,1,1.0 }
  • @fld_Email
    • arava@houseof.dog
      • { 2,1,1.0 }
    • ayende@ayende.com
      • { 1,1,1.0 }

This is pretty much the equivalent of the way Lucene stores things. Possible space optimizations here include not storing default values (term freq or boost of 1), storing index entry ids as variable length ints, etc.
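Just to make the variable length idea concrete, here is a minimal sketch of the kind of 7-bit variable length integer encoding (Lucene calls it VInt) that such an optimization would use. This is illustrative only, it isn’t code from Corax or Lucene:

public static class VarInt
{
    // Writes the value 7 bits at a time, setting the high bit on every byte
    // except the last one to mean "more bytes follow".
    public static int Write(byte[] buffer, int offset, ulong value)
    {
        var start = offset;
        while (value >= 0x80)
        {
            buffer[offset++] = (byte)(value | 0x80);
            value >>= 7;
        }
        buffer[offset++] = (byte)value;
        return offset - start; // number of bytes actually used (1 byte for values < 128)
    }

    public static ulong Read(byte[] buffer, ref int offset)
    {
        ulong value = 0;
        int shift = 0;
        byte b;
        do
        {
            b = buffer[offset++];
            value |= (ulong)(b & 0x7F) << shift;
            shift += 7;
        } while ((b & 0x80) != 0);
        return value;
    }
}

An index entry id like 2 takes a single byte this way, instead of a full 8 bytes.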

The problem is that while this is actually enough for the way Lucene does things, it is not enough for the way Corax does things. Let us consider the case of deleting a document. How would you go about doing this using the information above?

Lucene does this by marking a document id as deleted, and will purge its details on the next segment merge. That works, but only because a segment merge actually reads & writes all of the relevant segments’ data. Without a segment merge, deleting a document would actually require us to scan all the data in the entire database. This is not really practical. Therefore, we need to store additional data so we can delete it later on. In this case, we have the Docs tree, which has keys of (index entry id, field id, term num). This looks like this:

  • Docs
    • [1,1,1]: oren
    • [1,1,2]: eini
    • [1,2,1]: ayende@ayende.com
    • [2,1,1]: arava
    • [2,1,2]: eini
    • [2,2,1]: arava@houseof.dog

Using this information, we can now remove all traces of a document when it is deleted. However, the problem here is that we need to also keep the terms per document in the index. That really blows up the index size, obviously.

The reason for storing the document fields in this peculiar manner is that we also want to reuse this information for sorting. When Lucene needs to sort data, it has to read all of the data from the fields, then recreate the values for all relevant documents. Corax can just serve the data already there.

A pretty obvious step to save space would be to track the terms separately, and use an id in the Docs tree, not the full term. That leads to an interesting problem, because we are going to need to be able to go from term –> id and id –> term, which pretty much requires storing them twice, unfortunately.
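To illustrate what that double storage looks like, here is an in-memory sketch of the term to id mapping. In Corax those would be two Voron trees rather than dictionaries, which is exactly why the terms end up stored twice:

using System.Collections.Generic;

public class TermDictionary
{
    private readonly Dictionary<string, long> _idByTerm = new Dictionary<string, long>();
    private readonly Dictionary<long, string> _termById = new Dictionary<long, string>();
    private long _nextId;

    public long GetOrAddId(string term)
    {
        long id;
        if (_idByTerm.TryGetValue(term, out id))
            return id;
        id = ++_nextId;
        _idByTerm[term] = id; // term -> id, needed when indexing & querying
        _termById[id] = term; // id -> term, needed when deleting & sorting
        return id;
    }

    public string GetTerm(long id)
    {
        return _termById[id];
    }
}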

Final note, Corax is a research project.

Success: From Opening a Champagne Bottle To Hiding Under the Bed with Said Bottle

During the RavenDB courses* in the past few weeks, I was talking with one of the attendees and I came up with what I think is a great analogy.

* Yes, courses, in the past 4 weeks, I’ve given the RavenDB course 3 times. That probably explains why I don’t remember which course it was.

What are your success metrics? From Opening a Champagne Bottle To Hiding Under the Bed with Said Bottle?

The first success metric is when you have enough users (and, presumably, revenue) to cross the threshold to the Big Boys League. Let us call this the 25,000 users range. That is the moment when you throw a party, go to the store, grab a whole case of champagne bottles and make fancy speeches. Of course, the problem with success is that you can have too much of it. A system that does just fine (maybe creaks a little) with 25,000 users is going to behave pretty differently when you have 100,000 users. That is the moment when you find your engineers under the bed, with a half empty bottle of champagne, muttering things about Out Of Capacity errors and refusing to come out until we fire all the users.

In just about any system, you need to define the success points. Twitter was very lucky that it managed to grow even though it had so many problems when its user base exploded. It is far more likely that users will figure out that your service is… well, that your engineers are drunk and hiding under the bed, and the service looks accordingly.

And yes, you can try to talk to people about SLAs, and metrics, and capacities. But I have found that an image like that tends to give you a lot more focused answers. Even if a lot of the time the answer is “I don’t know”. That is a place to start, because it makes things a lot more acute than just “how many req/sec do we need to support?”.


The nice thing about working in a large team…

There is a lot of stuff that is hard to do when you are working on a large team. But the really nice thing is the velocity with which you can move.

I just started the morning with the following commands:

image

And I have another PR pending for a different branch that I’m going to have to look at. Overall, I think that this is a pretty cool thing to have. We can push forward in many directions at once, and it can be pretty awesome to look at all the good things that are coming our way.


Optimizing Voron and the cost of Multi Trees

One of the nice features that Voron got from LMDB is the notion of multi trees. If I recall correctly, LMDB calls them duplicate items, or something like that. Basically, it is the ability to store multiple values for a single key.

Update: The behavior described in this post has been in LMDB for over 3 years. It is just something that Voron didn't have. Please note that this post discusses Voron's limitation and its solution, not LMDB, which has had the solution for a long time.

Those items are actually stored as a nested tree, which effectively gives us a great way to handle duplicates. From an API standpoint, this looks like this:

// store
tree.MultiAdd(tx, "Eini", "users/1");
tree.MultiAdd(tx, "Eini", "users/3");
tree.MultiAdd(tx, "Eini", "users/5");



//read 
using(var it = tree.MultiRead(tx, "Eini"))
{
    if(it.Seek(Slice.BeforeAllKeys) == false)
        yield break;
    do
    {
        yield return it.CurrentKey; // "users/1", "users/3", "users/5"
    }while(it.MoveNext());
}

Internally, we handle this in the following fashion:

  • If a multi add operation is the very first such operation, we’ll add it as a simple key/value pair in the tree.
  • If a multi add operation is the 2nd such operation, we’ll create a new tree, and add both values to the new tree. The original tree will store the key / nested tree reference.

That leads to a very easy to use API, which is quite useful in many cases. However, it also has space usage implications. In particular, a tree means that we have to allocate at least one page to it. And that means that we have to allocate 4KB to hold any multi tree with more than a single value. Since there are actually a lot of scenarios where we would want to store a small number of small values, that can often lead to a lot of waste on our part. We use 4KB to store data that we could have stored in just 64 bytes, for example. That means that when we want to store things that might be duplicated, we need to take the space consideration into account as well.

I think that this is a pretty bad idea, because it leads us to do things with composite keys in some scenarios. That is something that Voron should be able to just handle for me. So we changed that; in fact, what we did was change the way we approach multi trees in general. Now, for every multi value, we can store up to 1KB of items in the same page as the key, before we devolve into a full blown multi tree.

The idea is that we can use the same Page mechanism we use elsewhere, and just have a nested page inside the parent page. As long as the nested page is small enough, we can just store it embedded. That results in some nice space savings, since usually we have items in the following counts: zero, one, few, lots.
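Here is a self contained sketch of that decision, with a List<byte[]> standing in for the values embedded in the parent page and a SortedSet standing in for the full nested tree. The real Voron code obviously works with pages and slices, so treat this as the shape of the idea only:

using System;
using System.Collections.Generic;
using System.Linq;

public class MultiValueEntry
{
    private const int MaxEmbeddedSize = 1024; // the 1KB threshold mentioned above

    private List<byte[]> _embedded = new List<byte[]>();
    private SortedSet<string> _nestedTree; // stand-in for a real nested tree

    public void Add(byte[] value)
    {
        if (_nestedTree != null)
        {
            _nestedTree.Add(Convert.ToBase64String(value));
            return;
        }

        var currentSize = _embedded.Sum(v => v.Length);
        if (currentSize + value.Length <= MaxEmbeddedSize)
        {
            _embedded.Add(value); // small enough, stays embedded next to the key
            return;
        }

        // too big to stay embedded, promote everything into a full blown nested tree
        _nestedTree = new SortedSet<string>(_embedded.Select(v => Convert.ToBase64String(v)));
        _nestedTree.Add(Convert.ToBase64String(value));
        _embedded = null;
    }
}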

We need this to help with Corax, and in general, that would reduce the amount of space we need to use in many cases.


The Corax Experiment: API

I posted before about design practice for how I would approach building a search engine library. I decided to bite the bullet and actually try to do this. Using Voron, that turned out to be a really simple thing to do. Of course, this doesn’t do a tenth of what Lucene does, but it actually does quite a lot. The code is available here, and I want to emphasize again, this is a purely experimental / research project.

The entire thing comes to less than 500 lines of code. And it is pretty functional even at this stage.

Corax is composed of:

  • Analysis
  • Indexing
  • Querying

Analysis of the documents is handled via analyzers:

public interface IAnalyzer
{
    ITokenSource CreateTokenSource(string field, ITokenSource existing);
    bool Process(string field, ITokenSource source);
}

An analyzer creates a token source, which accepts a TextReader and produces tokens. For each token, the Process method is called, and it is used to do things to the relevant token. For example, here is the default analyzer:

public class DefaultAnalyzer : IAnalyzer
{
    readonly IFilter[] _filters =
    {
        new LowerCaseFilter(), 
        new RemovePossesiveSuffix(), 
        new StopWordsFilter(), 
    };

    public ITokenSource CreateTokenSource(string field, ITokenSource existing)
    {
        return existing ?? new StringTokenizer();
    }

    public bool Process(string field, ITokenSource source)
    {
        for (int i = 0; i < _filters.Length; i++)
        {
            if (_filters[i].ProcessTerm(source) == false)
                return false;
        }
        return true;
    }
}

The idea here is to match, fairly closely, what Lucene is doing, but hopefully with clearer code. This analyzer will take a stream of text, break it up into discrete tokens, lower case them, remove the possessive ‘s suffix and clear stop words. Note that each of the filters actually modifies the token in place. And the tokenizer is pretty simple, but it does the job for now.
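For what it is worth, this is roughly what I would expect a custom filter for this pipeline to look like. The exact shape of ITokenSource is my assumption here (I’m assuming it exposes the current token size), so take this as a sketch rather than working Corax code:

public class MinimumLengthFilter : IFilter
{
    private readonly int _minLength;

    public MinimumLengthFilter(int minLength)
    {
        _minLength = minLength;
    }

    public bool ProcessTerm(ITokenSource source)
    {
        // returning false drops the token entirely, the same way StopWordsFilter would
        return source.Size >= _minLength; // Size is assumed, adjust to the real ITokenSource
    }
}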

Now, let us move to indexing. With Lucene, the recommendation is that you reuse your document and field instances, to avoid creating garbage for the GC. With Corax, I took it a step further:

using (var indexer = fullTextIndex.CreateIndexer())
{
    indexer.NewDocument();
    indexer.AddField("Name", "Oren Eini");

    indexer.NewDocument();
    indexer.AddField("Name", "Ayende Rahien");

    indexer.Flush();
}

There are a couple of things to note here. An index can create indexers, and it is intended to have multiple concurrent indexers running at the same time. Note that unlike Lucene, we don’t have Document or Field classes. Instead, we call methods on the indexer to create a new document and then add fields to the current document. When you are done with a document, you start a new one, or flush to complete the entire operation. For long running indexing, the indexer will flush itself automatically for you.

I think that this API gives us the best approach. It guides you toward using a single instance, with internal optimizations to make it memory efficient. Multiple instances can be used concurrently to speed up indexing time. And it knows when to flush itself for you, so you don’t have to worry about that. Although you do have to complete the operation by calling Flush() at the end.

How about searching? That turned out to be pretty similar as well. All you have to do is create a searcher:

using (var searcher = fti.CreateSearcher())
{
    QueryResults queryResults = searcher.QueryTop(new TermQuery("Name", "Arava"), 10);
    Console.WriteLine(queryResults.TotalResults);
    foreach (var match in queryResults.Results)
    {
        Console.WriteLine(match.DocumentId + " - " + match.Score);
    }
}

We create a searcher, and then we can utilize it to perform queries.

So far, this has been all about the API we have, I’ll talk about the actual implementation in my next post.

Finding a port in a storm: Or how to get a consistent build experience

RavenDB is a server product; as such, we are obviously going to have to talk over the network. That is great, except that it means we need a system wide resource, a TCP port. And that isn’t so great, because it turns out that there are a lot of things that can try to grab a port and use it.

And I’m not talking about things like Skype taking over port 80. I’m talking about the use of dynamic or ephemeral ports. That means that during the build, we can have something (for example, Chrome, or Dropbox, or anything that uses the web) make a call, and suddenly that port is busy and you can’t bind to it.

The default ports we use for tests are in the 8070 – 8090 range, but we have had consistent build failures because something is taking the ports. Finally I took the time to investigate, and I discovered that:

C:\Work\ravendb-3.0 [master]> netsh int ipv4 show dynamicport tcp

Protocol tcp Dynamic Port Range
---------------------------------
Start Port      : 1025
Number of Ports : 64510

So, on the two machines I tested this on, we had the dynamic port range cover everything from port 1025 and up. This sucks. And probably explains the issue. This is how we fixed it (I hope):

netsh int ipv4 set dynamicportrange tcp startport=10000 numberofports=55535

Basically, everything below port 10,000 is free for us to use. And yes, this post is here mostly so I’ll not have to remember it in the future.
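For completeness, here is a small sketch (not from the RavenDB build itself) of how a test harness could probe for a free port in that fixed range before binding to it:

using System;
using System.Net;
using System.Net.Sockets;

public static class PortFinder
{
    public static int FindAvailablePort(int start = 8070, int end = 8090)
    {
        for (var port = start; port <= end; port++)
        {
            try
            {
                var listener = new TcpListener(IPAddress.Loopback, port);
                listener.Start();
                listener.Stop();
                return port; // free right now, at least until something else grabs it
            }
            catch (SocketException)
            {
                // something (Chrome, Dropbox, ...) already took this port, try the next one
            }
        }
        throw new InvalidOperationException("No free port between " + start + " and " + end);
    }
}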


Design practice: Building a search engine library

Note: This is done purely as a design practice. We don’t have any current plans to implement this, but I find that it is a good exercise in general.

How would I go about building a search engine for RavenDB to replace Lucene? Well, we have Voron as the basis for storage, so from the get go, we have several interesting changes. To start with, we inherit the transactional properties of Voron, but more importantly, we don’t have to do merges, so we don’t have any of those issues. In other words, we can actually generate stable document ids.

But I’m probably jumping ahead a bit. We’ll start with the basics. Analysis / indexing is going to be done in very much the same way. Users will provide an analyzer and a set of documents, like so:

var index = new Corax.Index(storageOptions, new StandardAnalyzer());

var scope = index.CreateIndexingScope();

long docId = scope.Add(new Document
{
    {"Name", "Oren Eini", Analysis.Analyzer},
    {"Name", "Ayende Rahien", Analysis.Analyzer},
    {"Email", "ayende@ayende.com", Analyzed.AsIs, Stored.Yes}
});

docId = scope.Add(new Document
{
    {"Name", "Arava Eini", Analysis.Analyzer},
    {"Email", "arava@doghouse.com", Analyzed.AsIs, Stored.Yes}
});

index.Submit(scope);

Some notes about this API. It is modeled quite closely after the Lucene API, and it would probably need additional work. The idea here is that you are going to need to get an indexing scope, which is single threaded. And you can have multiple indexing scopes running at the same time. You can batch multiple writes into a single scope, and it behaves like a transaction.

The idea is to confine all of the work associated with indexing a document to a single thread, which makes things easier for us. Note that we immediately get the generated document id, but the document will only be available for searching when you have submitted the scope.

Under the hood, this does all of the work at the time of calling Add(). The analyzer will run on the relevant fields, and we will create the appropriate entries. How does that work?

Every document has a long id associated with it. In Voron, we are going to have a ‘Documents’ tree, with the long id as the key, and the value is going to be all the data about the document we need to keep. For example, it would have the stored fields for that document, or the term positions, if required, etc. We’ll also have a Fields tree, which will have a mapping of all the field names to an integer value.

Of more interest is how we are going to deal with the terms and the fields. For each field, we are going to have a tree called “@Name”, “@Email”, etc. Those are multi trees, with the keys in each tree being the terms, and the multi values being the document ids that have those terms. In other words, the code above is going to generate the following data in Voron.

  • Documents tree:
    • 0 – { “Fields”: { 1: “ayende@ayende.com” } }
    • 1 – { “Fields”: { 1: “arava@doghouse.com” } }
  • Fields tree
    • 0 – Name
    • 1 – Email
  • @Name tree
    • ayende – [ 0 ]
    • arava – [ 1 ]
    • oren – [ 0 ]
    • eini – [ 0, 1 ]
    • rahien – [ 0 ]
  • @Email tree
    • ayende@ayende.com – [ 0 ]
    • arava@doghouse.com – [ 1 ]

Given this model, we can now turn to the actual querying part of the business. Let us assume that we have the following query:

var queryResults = index.Query("Name: Oren OR Email: ayende@ayende.com");
foreach(var result in queryResults.Matches)
{
    Console.WriteLine("{0}: {1}", result.Id, result["Email"]);
}

How does querying work? The query above would actually be something like:

new BooleanQuery(
    Match.Or,
    new TermQuery("Name", "Oren"),
    new TermQuery("Email", "ayende@ayende.com")
    )

The actual implementation of this query would just access the relevant field tree and load the values for the particular key. The result is a set of ids for both parts of the query, and since we are using an OR here, we will just union them and return the list of results back.
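The set algebra part is trivial; assuming each term query has already resolved to the document ids stored under its key in the field tree, the boolean evaluation is just this:

using System.Collections.Generic;
using System.Linq;

public static class BooleanEvaluator
{
    // e.g. "Name: Oren" resolves to [ 0 ], "Email: ayende@ayende.com" resolves to [ 0 ]
    public static IEnumerable<long> Or(IEnumerable<long> leftMatches, IEnumerable<long> rightMatches)
    {
        return leftMatches.Union(rightMatches).OrderBy(id => id);
    }

    public static IEnumerable<long> And(IEnumerable<long> leftMatches, IEnumerable<long> rightMatches)
    {
        return leftMatches.Intersect(rightMatches).OrderBy(id => id);
    }
}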

Optimizations here could be to run those queries in parallel, or to just cache the results of a particular term query, so we can speed things up even more. This looks simple, and it is, because a lot of the work has already been done in Voron. Searching is pretty much complete; we only need to tell it what to search with.

The dark sides of Lucene

I’ve been using Lucene for the past six or seven years, and after my last post, I thought it would be a good idea to talk a bit about the kind of things that it isn’t doing well. We’ve been using it extensively in RavenDB for the past 5 years, and I think that I have a pretty good understanding of it. We used to have one of the Lucene.NET committers working at Hibernating Rhinos, so I have a high level of confidence that I’m not just stupidly using it improperly, too.

Probably the part that caused us the most pain with Lucene was the fact that it isn’t transactional. That is, it is quite easy to get into situations where the indexes are corrupted. That makes it… challenging to use in a database that needs to ensure consistency. The problem is that this is really not a use case that Lucene is well suited for. In order to ensure that data is saved, we have to commit often; the problem is that in order to ensure good performance, we want to commit less often, but then we will lose the changes if we crash. For that matter, Lucene doesn’t make any attempt to actually flush the data properly, relying on the OS to do that, so a system crash can cause you to lose data even though you “committed” it.

Fun times, I can tell you that.

Next, we have the issue of what Lucene calls updates. Updates in Lucene are actually just delete/add, and they don’t maintain the same document id (more on that later). Because of that, you usually have to have an additional field in the index that acts as your primary key, and you handle updates by first deleting, then adding things. That is quite strange, to be fair, and it means that you can’t “extend” an index entry, you have to build it from scratch every time.
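In practice, the delete/add dance looks roughly like this with Lucene.NET (assuming an already opened IndexWriter called writer); the “Id” field is the artificial primary key you have to maintain yourself:

var doc = new Document();
doc.Add(new Field("Id", "users/1", Field.Store.YES, Field.Index.NOT_ANALYZED));
doc.Add(new Field("Name", "Oren Eini", Field.Store.NO, Field.Index.ANALYZED));

// UpdateDocument is just a delete-by-term followed by an add, so the entry gets a
// brand new internal document id every time.
writer.UpdateDocument(new Term("Id", "users/1"), doc);
writer.Commit();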

Speaking of this, let us talk a bit about deletes. Ignoring for the moment the absolutely horrendous decision to do deletes through the reader, let us talk about how they are actually done. Deletes are recorded in a separate file, and that means that the moment you have any deletes (or, as I mentioned, updates), all the internal statistics are wrong. We run into this quite often with RavenDB when we are doing things like facets or suggestions. For example, if you request a suggestion for a user name, it will happily give you suggestions for deleted users, even though we deleted them in Lucene.

It will go away eventually, when Lucene is ready to optimize the index by merging all the files, but in the meantime, it makes for interesting bug reports. Speaking of merging, that is another common issue that you have to deal with. In order to ensure optimal performance, you have to be on top of the merge policy. This results in some interesting issues. For RavenDB’s purposes, we do a writer commit after every indexing batch. That means that if you are writing to RavenDB slowly enough, we do a commit after every document write. That results in a lot of segments, and the merge policy would have to do a lot of merges. The problem here is that merges have two distinct costs associated with them.

First, and obviously, you are going to need to write (again) all of the documents in all of the segments you are merging. That is very similar to doing merges in LevelDB (indeed, in general Lucene’s file format is remarkably similar to SST). Next, and arguably more interesting / problematic from our point of view, is the fact that it also kills all of the caches. Let me try to explain: Lucene uses a lot of caches to speed things up; in fact, most of the sorting is done by using the caches, for example. That works really well when we are querying normally, because segments are immutable, which makes for great caching. But on a merge, not only have we just invalidated all of our caches, we now need to read, again, all of the data that we just wrote, so we would be able to use it. That can be… costly. And both things can introduce stalls into the system.

The major problem externally with merges is that the document id changes, and that means that you cannot rely on them. It would be much easier if you could send an id out into the world, and get it back later and do something with it, but that isn’t possible with Lucene.

Next, and not really an operational issue like the rest, Lucene’s multi threaded behavior is… a hammer to an egg, in most cases. By that I mean code like this:

image

I mean, it is certainly functional, but it is pretty ugly.

Now, don’t get me wrong, I think that Lucene is pretty neat. But there are some really dark corners there. For example, the actual searching: go ahead and try to find where that is done in Lucene. It is very easy to get lost between all of the different aspects: weights, sorters, queries and various enumerators. For fun, a lot of that runs at hard to predict times, making the actual query run time interesting to try to figure out.

As a good example, let us take the simplest possible query, TermQuery. Go ahead, try to find where it is actually doing the query for matching terms in this code: https://github.com/apache/lucene/blob/LUCENE_2_1/src/java/org/apache/lucene/search/TermQuery.java

That actually happens here: https://github.com/apache/lucene/blob/LUCENE_2_1/src/java/org/apache/lucene/search/TermScorer.java#L79, and it is effectively a side effect of calling reader.termDocs(term) that limits the matches only to those with the same term. Trying to track down where exactly things happen can be… interesting.

Anyway, this post is getting too long, and I want to get back to figuring out how Lucene does its thing without dwelling too much in the dark…


What Lucene does, a look under the hood

Lucene is a search engine library, which is great. But as it turns out, there is a lot going on there. After working with it for several years, I can say with confidence that it is a pretty awesome library. But surprisingly, a lot of the effort that went into it doesn’t seem to be talked about / visible to people not trolling through the code. I think that this is a pretty good testament to how successful it is. That, and the fact that it is now the baseline against which all other search libraries & engines are compared.

What I wanted to talk about today was the kind of things that Lucene does that don’t seem to get much publicity. I think that Spolsky said it best:

Back to that two page function. Yes, I know, it's just a simple function to display a window, but it has grown little hairs and stuff on it and nobody knows why. Well, I'll tell you why: those are bug fixes. One of them fixes that bug that Nancy had when she tried to install the thing on a computer that didn't have Internet Explorer. Another one fixes that bug that occurs in low memory conditions. Another one fixes that bug that occurred when the file is on a floppy disk and the user yanks out the disk in the middle. That LoadLibrary call is ugly but it makes the code work on old versions of Windows 95.

I remember just how much impact that article made on me at the time. And Lucene’s codebase bears out these words. Lucene is a search engine library, which basically means that it does:

  • Indexing
  • Querying
  • Maintenance

One of the major areas of maturity in Lucene is how it has optimized indexing. You can see it in the code. For example, Lucene goes to a great deal of trouble to avoid allocating memory willy nilly. Instead, pretty much everything there is done via object pools. This helps reduce the memory pressure when doing a lot of indexing and can save a lot of GC cycles.

Another is the concept of multiple threads for indexing. A lot of Lucene is built around this idea; it has a lot of per thread state that is meant to ensure that you don’t have to deal with concurrency yourself. The idea is that you can take an IndexWriter and write to it concurrently, then call commit. A lot of the work to do with indexing is CPU intensive, so that makes a lot of sense, and Lucene nicely isolates you from all of that work. There is DocumentWriterPerThread, so you can see really nice scaling effects as you throw more threads & hardware at the problem.
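The usage pattern this enables is roughly the following (directory, analyzer and documentsToIndex are placeholders here): a single IndexWriter shared between threads, with a commit at the end.

var writer = new IndexWriter(directory, analyzer, IndexWriter.MaxFieldLength.UNLIMITED);

Parallel.ForEach(documentsToIndex, document =>
{
    // AddDocument is safe to call from multiple threads; the per thread state inside
    // Lucene keeps the threads from stepping on each other most of the time.
    writer.AddDocument(document);
});

writer.Commit();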

Usually, when people start messing with Lucene, they do that by writing analyzers, and you are sort of exposed to the memory constraints by being encouraged to use ReusableTokenStream, etc. It also has a nice pipeline architecture for doing the indexing work with filters.

On the querying side, Lucene does a lot of work to ensure that things just work. It has a Boolean Model for searches, and a Vector Space Model for ranking. Writing your own Query classes is pretty easy too, once you understand how things work, and again, this is another common place for people to extend Lucene. But there is a lot going on behind the scenes. Lucene does a lot of caching on a segment basis, and it is quite nice; since segments are immutable, it means that you can get pretty good usage out of that.

That gives it a lot of its speed, and it means that over time, things are actually going to be faster, because more parts of the segments are in memory and cached.

Finally, we have all of the other work that Lucene does. In practice, it means things like merging segments (hopefully in the background), and keeping the overall system humming along. Unfortunately, that is also one of the most common places for people to start tinkering when they run into perf problems. That is anything but trivial, and optimizing it requires a lot of expertise and understanding about the specific scenario you have.

And on top of that, you have everything else that already works on top of Lucene. Which is quite a lot.

As I said earlier, that is a very impressive piece of technology. That doesn’t mean that it doesn’t have its own set of problems, but that is something that I’ll discuss in detail in my next post.


Sorting with Lucene

I talked about the Lucene formula and how it calculates things using tf-idf for best matches. Now I want to talk about the actual sorting implementation. As it turns out, the default sorting (by relevancy) is really simple. All you need is to get the relevant score for a query, then shove the results through a heap with a specified size. The heap will take care of maintaining the top results.

So far, that is pretty simple to understand. But the question is, how do you do sorting on a field value? The answer is, not easily.

image

GetStringIndex() does something very interesting. It returns a string index, which gives us:

  • A string array containing all the distinct (sorted) values for this index.
  • An int array with all the documents in the index, holding the position of that document’s field value in the string value array.

Now we can compare documents by the position of their field value in that array, which gives us pretty good sorting. Unfortunately, this also requires us to load all the values into memory. Let us see another example, which would probably be easier to follow:

Sorting by an integer is done like this:

image

We get an array (whose size matches the number of documents). We can then sort things easily, because accessing the relevant field value only requires the document id to index into the array.

The reason Lucene does this is that it uses an inverted index, and it has no easy way of going from the field values to the list of documents that have them. So it is easier to read all the values into memory and work with them there. I don’t like it, but off hand, I can’t think of a better way to handle this.
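The idea itself is easy to show in isolation; once you have an array of values indexed by document id, sorting the matching documents is just an indirect comparison. A self contained sketch (this mirrors the approach, not Lucene’s actual code):

using System.Linq;

public static class FieldCacheSortSketch
{
    // valueByDocId[docId] holds the field value for that document, all of it in memory
    public static int[] SortByIntField(int[] matchingDocIds, int[] valueByDocId)
    {
        return matchingDocIds.OrderBy(docId => valueByDocId[docId]).ToArray();
    }

    // for strings, comparing ordinals into the sorted distinct values array is
    // equivalent to comparing the strings themselves, but far cheaper
    public static int[] SortByStringField(int[] matchingDocIds, int[] ordinalByDocId)
    {
        return matchingDocIds.OrderBy(docId => ordinalByDocId[docId]).ToArray();
    }
}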


RavenDB Conf: Success!

Now that the conference is over, and I am merely doing 3 back to back RavenDB courses in 3 cities and 2 continents, I can sit back and look back at it.

It was, in a word, a blast. We had eight speakers giving 14 talks, about topics that ranged from date and time handling in RavenDB to operational concerns to architectural how-tos. But, to be honest, I think that I found the attendees to be the main attraction. It was really good to finally meet so many of the people we have been talking with over the past few years. And what people are building with RavenDB is flat out amazing, and just a little bit scary. It is a funny thing when I see people take our tool and put it to uses that we have never even imagined.

Another great part was the hackathon. We had, in the space of a few hours, produced a major new feature for RavenDB (distributed counters), with everything wired up: storage, wire protocol and even the UI. We still need to complete the replication bits, but those are coming.

We have recorded the sessions, and we hope to have the edited videos soon. I wanted to take the time and thank the speakers and the attendees for such a great conference.

We’ll be seeing you again soon…


RavenConf, day 1

I started the day at 5:30 AM, because we had to go to the venue and get everything ready. It is now 1:30 AM the next day, and I am still stoked.

I took this picture at around 8 AM, when people started to queue up to go into the conference:

image

For that matter, you can see a few other photos from the conference here. I’ll do a retrospective about the conference when I have time to breathe. But for now, I can report that the RavenDB Hackathon was quite a success; we have a cool new feature mostly implemented. You can see it here: https://github.com/ayende/ravendb/tree/duco

And yes, I’ll talk about that in the future as well. But right now, I’m going to have a db capable of managing 100 billion records tomorrow, live on stage, so I probably need to get some sleep…


The Lucene formula: TF & IDF

The basis of how Lucene works is tf–idf, short for term frequency–inverse document frequency. You can read about it here. It is pretty complex, but it is explained in detail in the Lucene docs. It all boils down to:

image

For now, I want to look at the tf() function. The default implementation is in DefaultSimilarity.Tf, and is defined as:

image

That isn’t really helpful, however. Not without knowing what freq is.  And I’m pretty sure that Sqrt isn’t cheap, which probably explains this:

image

So it caches the score function, and it appears that the tf() is purely based on the count, not on anything else. Before I go and find out what is done with the score cache, let’s look at the weight value. This is calculated here:

image

And idf stands for inverse document frequency.  That is defined by default to be:

image

And the actual implementation is:

image

And the caller for that is:

image

So, we get the document frequency of a term, that is, in how many documents this term shows up, and the total number of documents, and that is how we calculate the idf.
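From memory, the default implementations boil down to something like the following (see DefaultSimilarity in the Lucene / Lucene.NET sources for the authoritative version):

using System;

public class DefaultSimilaritySketch
{
    public float Tf(float freq)
    {
        // diminishing returns as a term repeats inside a single document
        return (float)Math.Sqrt(freq);
    }

    public float Idf(int docFreq, int numDocs)
    {
        // rare terms (small docFreq relative to numDocs) get a bigger weight
        return (float)(Math.Log(numDocs / (double)(docFreq + 1)) + 1.0);
    }
}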

So far, so good. But how about that doc frequency? This is one of the things that we store in the terms info file. But how does that get calculated?

I decided to test this with:

image

This should generate 3 terms, two with a frequency of 1 and one with a frequency of 2. Now, to figure out how this is calculated. That part is a real mess.

During indexing, there is a manually maintained hash table that contains information about each unique term, and when we move from one document to another, we write the number of times each term appeared in the document. For fun, this is written to what I think is an in memory buffer for safe keeping, but it is really hard to follow.

Anyway, I now know enough about how that works for simple cases, where everything happens in a single segment. Now let us look at what happens when we use multiple segments. It is actually quite trivial. We just need to sum the term frequency of each term across all segments. This gets more interesting when we involve deletes. Because of the way Lucene handles deletes, it can’t really handle this scenario, and deleting a document does not remove its frequency counts for the terms that it had. That is the case until the index does a merge, and fixes everything that way.

So now I have a pretty good idea about how it handles the overall term frequency. Note that this just gives you the number of times this term has been seen across all documents. What about another important quantity, the number of times this term appears in a specific document? Those are stored in the frq file, and they are accessible during queries. This is then used in conjunction with the overall term frequency to generate the boost factor per result.


Peeking into Lucene indexing

Continuing my trip into the Lucene codebase, now I’m looking into how indexing happens. Interestingly enough, that is something that we never really had to look at before.

It is quite clear that Lucene is heavily oriented toward utilizing additional threads to improve overall indexing speed. You can see it in the amount of per thread state that exists all over the indexing code. That is also meant to reduce memory consumption, as far as I can see.

image

The real stuff happens in the ProcessDocument() where we have a chain of DocFieldConsumerPerThread and TermsHashConsumerPerThread which actually do the work.

Then the real work happens on a per field level in DocInverterPerField, where the analyzer is actually called. The process in which the analyzers return values for the text is interesting. There are “attributes” that are added to the stream, per token, and they are used to get the relevant values. I assume that this allows for different levels of analyzers.

image

This way, you can have analyzers that don’t deal with offsets or positions, etc. And the actual processing of this is done in a chain that appears to be:

  • Freq Prox
  • Term Vectors

But that isn’t something that I really care about now. I want to see how Lucene writes the actual terms. This is being written in the TermsInfoWriter, which took some time to find.

image

Terms are stored in a prefix compressed mode (sorted, obviously), and there are the actual terms, and an index into the terms, which allows for faster seeking into the file. This is actually done here:

image

This is a single term written to the file. A lot of the stuff Lucene does (prefixes, VInt, etc) is done in the name of conserving space, and it reminds me greatly of LevelDB’s SST. In fact, the way terms are stored is pretty much an SST, except that this happens to be spread over multiple files. Pretty much the entire behavior and all the implications are the same.

It also means that searching on this is fast, because the data is sorted, but pretty complex. And it also explains a lot about the actual exposed API that it has. I think that I have a pretty good idea on how things work now. I want to now go back up and look at specific parts of how it works… but that is going to be the next post.


RavenDB 3.0 milestone

Well, today we have reached a really big milestone for RavenDB 3.0. It is now the master branch in our development mode, and 2.5 is in maintenance mode.

That means that you can go here: http://github.com/ayende/ravendb and see all the cool stuff that we are doing. Official builds will be up for the conference, so you can play with the new stuff.


Last RavenDB Conference tickets

We still have a couple of tickets available for the RavenDB conference. And the sale ends tomorrow. You can buy them here.

I’m already in the States (writing this from JFK right now), and getting ready to show you some very cool stuff next week.


My distributed build system

Yes, I know that you are probably getting geared up to hear about some crazy setup, and in some manner, it is crazy. My distributed build system is this:

IMG_20140402_103433

Yep, that is me manually distributing the build to do a cross check in a reasonable time frame.

I’ve mentioned before that our build time is atrocious. With over 3,500 tests, it is no longer feasible to just run them normally. So we started parallel efforts (pun intended) to reduce the time it takes to build the system and to reduce individual test times, as well as to improve our ability to parallelize things.

We are actually at the point where we can run tests concurrently, even those we previously had to distribute. And we can even do that with the replicated tests, which are the ones that usually take the longest. But what we’ll probably have in the end is just a bunch of test projects (currently we have ~5) that we will run on different machines at the same time.

We are using TeamCity as our build system, and I know that it has capabilities in this regard, but I’ve never actually looked into that. We’re going to be pretty busy in the next couple of weeks, so I thought that this would be a good time to ask.

My current thinking is:

  • One build that would actually do the compilation, etc.
  • Then branch off that to a build per test project, which can run on different agents
  • Then a build on top of that that can actually deploy the build.

Any recommendations on that? We are going to need 5 – 7 agents to run the tests in parallel. Any recommendations on how to get those? Ideally I would like to avoid having to pay for those machines to sit idle 60% – 80% of the time.

Any other alternatives?

Hibernating Rhinos: The Movie

Well, it is finally here; it has been ten years since I started blogging. My first blog post was on April 1st, 2004, and somehow, we are a decade later.

Usually it is time for introspection, and careful thought about what was and what will be, but I have no time for that. Instead, I wanted to let you know that we just finished signing with Lucas Film (well, it said Disney on the contract, but I’m a geek, so I’ll go with the more impressive title) for a movie about Hibernating Rhinos. The idea is basically similar to The Social Network, but much better.

I’m trying to see if I can get Jesse Eisenberg to play there.

And I got a budget for 999 ravens to be released in a flock at one of the scenes!

We’ll show a trailer for the movie (purely computer generated, so far) at our Conference in a week.


The Lucene disk format

I realized lately that I wanted to know a lot more about exactly how Lucene is storing data on disk. Oh, I know the general stuff about segments and files, etc. But I wanted to know the actual bits & bytes. So I started tracing into Lucene and trying to figure out what it is doing.

And, by the way, the only thing that the Lucene.NET codebase is missing is this sign:

image

At any rate, this is how Lucene writes the segment file. Note that this is done in a CRC32 signed file:

image

And the info write method is:

image

Today, I would probably use a JSON file for something like that (bonus points: you know if it is corrupted, and it is human readable), but this code was written in 2001, so that explains it.

This is the format of a segment file, and the segments.gen file is generated using:

image

Moving on to actually writing data, I created ten Lucene documents and wrote them. Then just debugged through the code to see what will happen. It started by creating _0.fdx and _0.fdt files. The .fdt is for fields, the fdx is for field indexes.

Both of those files are used when writing the stored fields. This is the empty operation, writing an unstored field.

image

This is how fields are actually stored:

image

And then it ends up in:

image

Note that this particular data goes in the fdt file, while the fdx appears to be a quick way to go from a known document id to the relevant position in the fdt file.
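If I’m reading the format correctly, the fdx file is just a fixed width table of 8 byte pointers into the fdt file, one per document, so the lookup is a single seek. Roughly (ignoring any header bytes, so treat this as a sketch of the idea):

using System;
using System.IO;

public static class FieldsIndexSketch
{
    public static long GetFieldsDataPosition(Stream fdxFile, int docId)
    {
        fdxFile.Seek(docId * 8L, SeekOrigin.Begin); // fixed width record per document
        var buffer = new byte[8];
        fdxFile.Read(buffer, 0, 8);
        if (BitConverter.IsLittleEndian)
            Array.Reverse(buffer); // Lucene writes its longs big endian
        return BitConverter.ToInt64(buffer, 0); // offset of this document's fields in the .fdt file
    }
}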

As I was going through the code, I did some searches, and found a very detailed explanation of the actual file format in the docs. That is really nice and quite informative; however, just seeing how the “let us take the documents and make them searchable” part works is quite interesting. Lucene has a lot of chains of responsibility running through it. And it is also quite interesting to see the design choices that were made.

Unfortunately, Lucene is very much wedded to its file format, and making changes to it isn’t going to be possible, which is a shame, since it impacts quite a lot of the way Lucene works in general.


Reduce ^ 2 in RavenDB

An interesting question that keeps popping up is how to re-reduce the results of a map/reduce. That is a really nice feature on the surface, but it has a lot of implications, for example: when / how you run the 2nd reduce, can you chain only once, or multiple times, what happens when there are a lot of reduce results, etc.

But most of the time, what people want is to be able to do aggregation on the map/reduce results without too much hassle, and they don’t have a lot of aggregated results or they are fine with waiting for them if they are very large. And we have a really nice solution for that scenario.

You start by defining the base map/reduce operation, like so:

image

Note that we need to also output the fields that we care about reducing further. In this case, we start by reducing to postal code, but we keep the city, region and country options as well.
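For reference, a map/reduce index along those lines would look roughly like this in code (the Order and Result classes here are invented for illustration, they are not the exact ones from the screenshot above):

using System.Linq;
using Raven.Client.Indexes;

public class Order
{
    public string PostalCode { get; set; }
    public string City { get; set; }
    public string Region { get; set; }
    public string Country { get; set; }
}

public class Orders_ByLocation : AbstractIndexCreationTask<Order, Orders_ByLocation.Result>
{
    public class Result
    {
        public string PostalCode { get; set; }
        public string City { get; set; }
        public string Region { get; set; }
        public string Country { get; set; }
        public int Count { get; set; }
    }

    public Orders_ByLocation()
    {
        Map = orders => from order in orders
                        select new
                        {
                            order.PostalCode,
                            order.City,
                            order.Region,
                            order.Country,
                            Count = 1
                        };

        Reduce = results => from result in results
                            group result by new { result.PostalCode, result.City, result.Region, result.Country }
                            into g
                            select new
                            {
                                g.Key.PostalCode,
                                g.Key.City,
                                g.Key.Region,
                                g.Key.Country,
                                Count = g.Sum(x => x.Count)
                            };
    }
}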

Then, we define a transformer. Note that this is a special transformer, in that it has a group by in it, and it takes some parameters from outside.

image

Using those two together, we can now get the following results…

Raw map/reduce output:

image

With loc = City, we get:

image

With loc = Country, we get:

image

Tada, we have further reduced the result of a map/reduce operation. Now, this is subject to the usual limitations of RavenDB paging, in that it will only go through the first 1,024 results. That can be a problem, but that is why RavenDB has the Streaming API.

You can use streaming on a map/reduce index with a transformer (and even apply parameters on top of that). That ends up giving you the ability to run a re-reduction on top of a map/reduce index regardless of size.

Of course, on very large result sets, that can take quite a while, but that is expected and usually fine. For that matter, if you need to, you can chain the stream into a bulk insert, and get the re-reduction in that manner.


“Incremental” map/reduce in MongoDB isn’t

Rafal and Ben Foster commented on my previous post with some ideas on how to deal with incremental updates to map/reduce indexes. Rafal said:

Actually, it's quite simple if you can 'reverse' the mapping operation (for given key find all documents matching that key): you just delete aggregate record with specified key and run incremental map-reduce on all matching documents. In today's example, you would delete the aggregate with key='oren' and then run map reduce with a query:

db.items.mapReduce(map,reduce, { out: {reduce: ‘distinct_item_names’}, query: {name: 'oren' } });

And Ben said:

It's worth mentioning that I was able to get the MongoDB map-reduce collections updating automatically (insert/update/delete) by monitoring the MongoDB OpLog …

…and listen for new documents in the OpLog which could then be used to re-execute an incremental Map-Reduce.

And while this looks right, it actually can’t possibly work. I’ll start with Rafal’s suggestion first. He suggests just issuing the following set of commands whenever we delete something from the database:

db.distinct_item_names.remove({name: 'oren'});
db.items.mapReduce(map, reduce, { out: {reduce: 'distinct_item_names'}, query: {name: 'oren'} });

And yes, that will actually work, as long as you are careful to never do this concurrently. Because if you do run this concurrently… well, the best you can hope for is no data, but the likelier scenario is data corruption.

But this actually gets better. Deletes are annoying, but they are a relatively simple case to process. You have updates to deal with too. We’ll assume that we are watching the oplog to get notified when this happens. Here is a MongoDB oplog entry:

{
  "ts": {
    "t": 1286821984000,
    "i": 1
  },
  "h": "1633487572904743924",
  "op": "u",
  "ns": "items",
  "o2": {
    "_id": "4cb35859007cc1f4f9f7f85d"
  },
  "o": {
    "$set": {
      "Name": "Eini"
    }
  }
}

As you can see, we have an update operation (op: u) on a specific document (o2._id) with the specified update (o.$set). That is really great, and it is utterly useless for our purposes. In this case, we updated the name from Oren to Eini, so we would like to be able to run this:

db.distinct_item_names.remove({name: 'oren'});
db.distinct_item_names.remove({name: 'eini'});
db.items.mapReduce(map, reduce, { out: {reduce: 'distinct_item_names'}, query: {name: 'oren'} });
db.items.mapReduce(map, reduce, { out: {reduce: 'distinct_item_names'}, query: {name: 'eini'} });

Except that we don’t have any way to get the old value out from the oplog. And this still isn’t going to work concurrently.

But let us say that we decided to have a watcher process monitor the oplog somehow, and that it will ensure no concurrency of those requests. Now you have to deal with fun issues like: “what happens if the watcher process recycles?” How do you keep your place in the oplog (and remember, the oplog is capped, so stuff you haven’t seen might be removed if it is beyond the specified size)?

And… to be frank, once we have done all of that, this is still the easy part. One of the reasons that you want to do this work in the first place is to deal with large amounts of data. But you cannot assume that you’ll have an even distribution of the data.

One bug report that came in against the RavenDB map/reduce implementation was a map/reduce index on the US Census data. That is ~300 million documents, and the index the user wanted to build was a map/reduce grouping by the state. You have states like California, with more than 30 million people in it, and you realize that you don’t want to have to re-do the map/reduce over the entire 30+ million documents that you have there. In RavenDB, under this scenario, you’ll have to issue about 3,073 operations, by the way. Versus the 30 million you would need with this approach.

So yeah, “incremental” map/reduce can’t handle concurrent work, can’t handle deletes, can’t handle updates, and definitely shouldn’t be used on large data sets. And that is after you went to the trouble of setting up the watcher process, monitoring the oplog, etc.

Or, you can use RavenDB and you get a true incremental map/reduce without having to worry about any of that.


Differences in Map/Reduce between RavenDB & MongoDB

Ben Foster has a really cool article showing some of the similarities and differences between MongoDB & RavenDB with regards to their map/reduce implementation.

However, there is a very important distinction that was missed. Map/reduce operations run online in MongoDB, which means that for large collections, map/reduce is going to be very expensive. MongoDB has the option of taking the result of a map/reduce operation and writing it to a collection, so you don’t need to run map/reduce jobs all the time. However, that is a snapshot view of the data, not a live view. Ben mentioned that you can do something called incremental map/reduce, but that isn’t actually a really good idea at all.

Let us look at the following sequence of operations:

db.items.insert({name: 'oren', ts: 1 });
db.items.insert({name: 'ayende', ts: 2});

var map = function Map() { emit(this.name, null); };
var reduce = function(key, val) { return key; };

db.items.mapReduce(map, reduce, { out: 'distinct_item_names' });

This creates two items, and gives me the distinct names in a separate collection. Now, let us see how that works with updates…

db.items.insert({name: 'eini', ts: 3 });

db.items.mapReduce(map, reduce, { out: {reduce: 'distinct_item_names'}, query: {ts: {$gt: 2} } });

This is actually nice; Mongo is able to merge the previous results with the new results, so you only have to do the work on the new data. But this has several implications:

  • You have to keep something like a ‘ts’ property around to check for new stuff. And you have to _update_ that ts property on every update.
  • You have to run this on a regular basis yourself, mongo won’t do that for you.
  • It can’t work with deletes.

It is the last part that is really painful:

db.items.remove({name: 'oren'});

Now, there is just no way for you to construct a map/reduce job that would remove the name when it is gone.

This sort of thing works very nicely when what you want is to just append stuff. That is easy. It is a PITA when we are talking about actually using it for live data, that can change and be modified.

Contrast that with the map/reduce implementation in RavenDB:

  • No need to manually maintain state, the database does it for you.
  • No need to worry about updates & deletes, the database does it for you.
  • No need to schedule map/reduce job updates, database does it for you.
  • Map/reduce queries are very fast, regardless of data size.

To be frank, the map/reduce implementation in RavenDB is complex, and pretty much all of it comes down to the fact that we don’t do stupid stuff like run a map/reduce operation on a large database on every query, and that we support edge case scenarios like data that is actually updated or deleted.

Naturally I’m biased, but it seems to me that trying to use map/reduce in Mongo just means that you have to do a lot of hand holding yourself, while with RavenDB, we take care of everything and leave you to actually do stuff.


My poor little blameless Voron

I’m currently working on a project using Voron (although only incidentally), and I was horrified to get the FatalExecutionEngineException that was discussed previously. Naturally, I assumed that this was something that I did wrong in Voron.

image

After a lot of work, I managed to narrow it down to… not Voron. To be rather more exact, it is Voron that is causing it, but it isn’t Voron’s fault.

The actual issue is the fact that I was accidentally using Json.NET to serialize a Stream that I got from Voron. And Voron is giving us an UnmanagedMemoryStream. I kept thinking that I was doing something wrong with releasing memory, but as it turns out, here is a very small repro:

unsafe static void Main(string[] args)
{
    JsonConvert.SerializeObject(new Foo { Ptr = (byte*)0 });
}

public unsafe class Foo
{
    public byte* Ptr { get; set; }
}

And that is enough to get the above mentioned error.

What actually happens is that Json.NET is using dynamic IL generation to optimize accessing properties. And it just doesn’t know how to handle pointer properties. What it ended up doing was generating invalid IL, which resulted in a crash when we tried to actually use it.

Nothing to do with Voron at all, just a class that has a pointer property that we were attempting to serialize. Nasty bug, and very annoying to try to figure out.
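Once you know what is going on, the simplest way around it is to keep Json.NET away from the pointer property; marking it with [JsonIgnore] (from Newtonsoft.Json) does the trick:

using Newtonsoft.Json;

public unsafe class Foo
{
    [JsonIgnore]
    public byte* Ptr { get; set; }

    // if you actually need the value in the output, expose it as something serializable
    public long PtrValue
    {
        get { return (long)Ptr; }
    }
}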


Today’s work is…

It isn’t enough to have just an ExecutionEngineException, I had to have this:

image
