Ayende @ Rahien

Hi!
My name is Oren Eini
Founder of Hibernating Rhinos LTD and RavenDB.
You can reach me by phone or email:

ayende@ayende.com

+972 52-548-6969


Fixing the index, solutions

time to read 4 min | 650 words

In my previous post, I showed a pretty trivial index and asked how to update it efficiently, in terms of both time and memory.

The easiest approach is to use a reverse lookup. Of course, that means that we actually need to store about twice as much data as before.

Given the following documents:

  • users/21 – Oren Eini
  • users/42 – Hibernating Rhinos
  • users/13 – Arava Eini

Previously, we had:

Term         Documents
Oren         users/21
Eini         users/21, users/13
Hibernating  users/42
Rhinos       users/42
Arava        users/13

And with the reverse lookup, we have:

Term         Documents
Oren         users/21
Eini         users/21, users/13
Hibernating  users/42
Rhinos       users/42
Arava        users/13

Document     Terms
users/21     Oren, Eini
users/42     Hibernating, Rhinos
users/13     Arava, Eini

And each update to the index would first do a lookup for the document id, then remove the document id from all the matching terms.

The downside of that is that we need about twice as much room. The upside is that all the work is done during indexing time, and space is pretty cheap.

It isn’t that cheap, though. So we want to try something better.

Another alternative is to introduce a level of indirection, like so:

Term         Documents
Oren         1
Eini         1, 3
Hibernating  2
Rhinos       2
Arava        3

Num   Id
1     users/21
2     users/42
3     users/13

Now, let us say that we want to update users/13 to be Phoebe Eini. We will end up with:

Term         Documents
Oren         1
Eini         1, 3, 4
Hibernating  2
Rhinos       2
Arava        3
Phoebe       4

Num   Id
1     users/21
2     users/42
4     users/13

We removed the mapping for number 3, and didn’t touch the terms except to add to them.

That gives us a very fast way to add to the system, and if someone searches for Arava, we will see that the number no longer exists, so we’ll return no results for the query.

Of course, this means that we have to deal with garbage in the index, and have some way to clean it up periodically. It also means that we don’t have a way to really support Update, instead we have just Add and Delete operations.
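To make the indirection concrete, here is a minimal C# sketch of the scheme in the tables above. This is not RavenDB’s actual implementation, just an illustration: terms hold numbers, a side table maps the live numbers to document ids, and an update simply retires the old number, leaving garbage postings behind for a later cleanup.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public class IndirectIndexer
{
    private readonly Dictionary<string, List<long>> _terms =
        new Dictionary<string, List<long>>(StringComparer.OrdinalIgnoreCase);
    private readonly Dictionary<long, string> _numToId = new Dictionary<long, string>();
    private readonly Dictionary<string, long> _idToNum = new Dictionary<string, long>();
    private long _next;

    public void Index(string docId, string text)
    {
        // "Update" is really delete + add: retire the old number (if any)
        // and assign a fresh one. The old postings become garbage.
        if (_idToNum.TryGetValue(docId, out var old))
            _numToId.Remove(old);

        var num = ++_next;
        _idToNum[docId] = num;
        _numToId[num] = docId;

        foreach (var term in text.Split())
        {
            if (_terms.TryGetValue(term, out var postings) == false)
                _terms[term] = postings = new List<long>();
            postings.Add(num);
        }
    }

    public void Delete(string docId)
    {
        if (_idToNum.TryGetValue(docId, out var num))
        {
            _numToId.Remove(num);
            _idToNum.Remove(docId);
        }
    }

    public List<string> Query(string term)
    {
        if (_terms.TryGetValue(term, out var postings) == false)
            return new List<string>();
        // Numbers whose document was deleted or re-indexed are skipped;
        // a periodic compaction could remove them from the postings lists.
        return postings
            .Where(n => _numToId.ContainsKey(n))
            .Select(n => _numToId[n])
            .Distinct()
            .ToList();
    }
}
```

Note how cheap the write path is: indexing never reads or modifies existing term lists, it only appends to them.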

Interview question: fix the index

time to read 4 min | 630 words

This is something that goes into the “what to ask a candidate”.

Given the following class:

public class Indexer
{
    private Dictionary<string, List<string>> terms = 
        new Dictionary<string, List<string>>(StringComparer.OrdinalIgnoreCase);

    public void Index(string docId, string text)
    {
        var words = text.Split();
        foreach (var term in words)
        {
            List<string> val;
            if (terms.TryGetValue(term, out val) == false)
            {
                val = new List<string>();
                terms[term] = val;
            }
            val.Add(docId);
        }
    }

    public List<string> Query(string term)
    {
        List<string> val;
        terms.TryGetValue(term, out val);
        return val ?? new List<string>();
    }
}

This class has the following tests:

public class IndexTests
{
    [Fact]
    public void CanIndexAndQuery()
    {
        var index = new Indexer();
        index.Index("users/1", "Oren Eini");
        index.Index("users/2", "Hibernating Rhinos");

        Assert.Contains("users/1", index.Query("eini"));
        Assert.Contains("users/2", index.Query("rhinos"));
    }

    [Fact]
    public void CanUpdate()
    {
        var index = new Indexer();
        index.Index("users/1", "Oren Eini");
        //updating
        index.Index("users/1", "Ayende Rahien");

        Assert.Contains("users/1", index.Query("Rahien"));
        Assert.Empty(index.Query("eini"));
    }
}

The first test passes, but the second fails.

The task is to get the CanUpdate test to pass, while keeping memory utilization and CPU costs as small as possible. You can change the internal implementation of the Indexer as you see fit.

After CanUpdate is passing, implement a Delete(string docId) method.
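One possible direction (a sketch using the reverse lookup option, certainly not the only valid answer): track, per document, which terms it was indexed under, so an update can remove the stale postings first.

```csharp
using System;
using System.Collections.Generic;

public class Indexer
{
    private readonly Dictionary<string, List<string>> terms =
        new Dictionary<string, List<string>>(StringComparer.OrdinalIgnoreCase);
    // Reverse lookup: document id -> the terms it was indexed under.
    private readonly Dictionary<string, List<string>> docTerms =
        new Dictionary<string, List<string>>();

    public void Index(string docId, string text)
    {
        Delete(docId); // an update is just delete + add

        var words = text.Split();
        docTerms[docId] = new List<string>(words);
        foreach (var term in words)
        {
            if (terms.TryGetValue(term, out var val) == false)
                terms[term] = val = new List<string>();
            val.Add(docId);
        }
    }

    public void Delete(string docId)
    {
        if (docTerms.TryGetValue(docId, out var old) == false)
            return;
        foreach (var term in old)
        {
            if (terms.TryGetValue(term, out var val))
            {
                val.Remove(docId);
                if (val.Count == 0)
                    terms.Remove(term); // drop empty postings lists
            }
        }
        docTerms.Remove(docId);
    }

    public List<string> Query(string term)
    {
        terms.TryGetValue(term, out var val);
        return val ?? new List<string>();
    }
}
```

The cost is roughly doubling the memory, in exchange for doing all the cleanup work at indexing time.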

Timeouts, TCP and streaming operations

time to read 3 min | 499 words

We got a bug report on the RavenDB mailing list that was interesting to figure out. The code in question was:

var i = 0;
foreach (var product in GetAllProducts(session)) // GetAllProducts is implemented using streaming
{
  ++i;
  if (i > 1000)
  {
    i = 0;
    Thread.Sleep(1000);
  }
}

This code would cause a timeout error to occur after a while. The question is why? We can assume that this code is running in a console application, and it can take as long as it wants to process things.

And the server is not impacted by what the client is doing, so why do we have a timeout error here? The quick answer is that we are filling up the buffers.

GetAllProducts is using the RavenDB streaming API, which pushes the results of the query to the client as soon as we have anything. That lets us parallelize work on both server and client, and avoids having to hold everything in memory.

However, if the client isn’t processing things fast enough, we run into an interesting problem. The server is sending the data to the client over TCP. The client machine will receive the results, buffer them, and hand them to the client application. The client will read them from the TCP buffers, then do some work (in this case, just sleeping). Because the rate at which the client processes items is much lower than the rate at which we send them, the TCP buffers become full.

At this point, the client machine starts dropping TCP packets. It doesn’t have any more room to put the data in, and the server will retransmit them anyway. And that is exactly what the server does, assuming there is packet loss on the network. However, that only holds up for a while, because if the client doesn’t recover quickly, the server will decide that it is down and close the TCP connection.

At this point, there isn’t any more data coming from the server, so the client catches up with the buffered data and then waits for the server to send more. That isn’t going to happen, because the server already considers the connection lost. Eventually the client times out with an error.

A streaming operation requires us to process the results quickly enough not to jam the network.

RavenDB also has the notion of subscriptions. With those, we require an explicit confirmation from the client before sending the next batch, so a slow client isn’t going to cause issues.
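The dynamic is easier to see with an in-process analogy. This is not RavenDB code, just a minimal sketch: a bounded BlockingCollection plays the role of the TCP buffers, and when it fills up, the producer blocks instead of data being lost. The difference in the real scenario is that the server will not wait forever: it eventually declares the client dead and closes the connection.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;
using System.Threading.Tasks;

public static class BackpressureDemo
{
    // Push 'count' items through a buffer bounded at 'capacity'.
    // When the buffer is full, Add() blocks -- the in-process
    // equivalent of the TCP window closing on a slow reader.
    public static int Run(int count, int capacity)
    {
        var buffer = new BlockingCollection<int>(capacity);

        var producer = Task.Run(() =>
        {
            for (var i = 0; i < count; i++)
                buffer.Add(i); // blocks while the buffer is full
            buffer.CompleteAdding();
        });

        var consumed = 0;
        foreach (var item in buffer.GetConsumingEnumerable())
        {
            consumed++;
            Thread.Sleep(1); // simulate a slow consumer
        }
        producer.Wait();
        return consumed;
    }
}
```

Here nothing is ever dropped, because the producer and consumer share a process; over a network, the same backpressure eventually turns into retransmits, and then into a closed connection.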

Merge related entities using Multi Map/Reduce

time to read 5 min | 995 words

A question came up on the mailing list regarding searching across related entities. In particular, the scenario is the notion of players and characters in an MMORPG.

Here is what a Player document looks like:

{
  "Id": "players/bella@dona.self",
  "Name": "Bella Dona",
  "Billing": [ { ... }, { ... }],
  "Adult": false,
  "LastLogin": "2015-03-11"
}

And a player has multiple character documents:

{
  "Id": "characters/1234",
  "Name": "Black Dona",
  "Player": "players/bella@dona.self",
  "Race": "DarkElf",
  "Level": 24,
  "XP": 283831,
  "HP": 438,
  "Skills": [ { ... } , { ... } ]
}
{
  "Id": "characters/1321",
  "Name": "Blue Bell",
  "Player": "players/bella@dona.self",
  "Race": "Halfling",
  "Level": 2,
  "XP": 2831,
  "HP": 18,
  "Skills": [ { ... } , { ... } ]
}
{
  "Id": "characters/1143",
  "Name": "Brown Barber",
  "Player": "players/bella@dona.self",
  "Race": "WoodElf",
  "Level": 44,
  "XP": 983831,
  "HP": 718,
  "Skills": [ { ... } , { ... } ]
}

And what we want is an output like this:

{
    "Id" : "players/bella@dona.self",
    "Adult": false,
    "Characters" : [
        { "Id": "characters/1234",  "Name": "Black Dona" },
        { "Id": "characters/1321",  "Name": "Blue Bell" },
        { "Id": "characters/1143",  "Name": "Brown Barber" }
    ]
}

Now, a really easy way to do that would be to issue two queries: one to find the player, and another to find its characters. That is actually the much preferred method to do this. But let us say that we need to do something that uses both document types.

Give me all the players who aren’t adults that have a character over level 40, for example. In order to do that, we are going to use a multi-map/reduce index to merge the two together. Here is how it is going to look:

// map - Players

from player in docs.Players
select new 
{
  Player = player.Id,
  Adult = player.Adult,
  Characters = new object[0]
}

// map - Characters

from character in docs.Characters
select new
{
   character.Player,
   Adult = false,
   Characters = new [] 
   { 
     new { character.Id, character.Name }
   }
}

// reduce

from result in results
group result by result.Player into g
select new
{
   Player = g.Key,
   Adult = g.Any(x=>x.Adult),
   Characters = g.SelectMany(x=>x.Characters)
}

This gives you all the details, in a single place. And you can start working on queries from there.

Taking full dumps for big IIS apps

time to read 2 min | 313 words

If your application is running on IIS, you are getting quite a lot for free. To start with, monitoring and management tools are right there out of the box. You are also getting some… other effects.

In particular, we had RavenDB running inside IIS, exhibiting a set of performance problems on a couple of nodes (and just on those nodes). We suspected that this might be related to memory usage, and we wanted to take a full process dump so we could analyze it offline.

Whenever we tried doing that, however, the process would just restart. The problem was that to reproduce this we had to wait for a particular load pattern to happen after the database had been live for about 24 hours, so taking the dump at the right time was crucial. Initially we thought we had used the wrong command, or something like that. The maddening thing was, when we tried it on the same machine, using the same command, without the performance issue present, it just worked (and told us nothing).

Eventually we figured out that the problem was in IIS. Or, to be rather more exact, IIS was doing its job.

When the performance problem happened, the memory usage was high. We then needed to take a full process dump, which meant that we had to write a lot. IIS didn’t hear from the worker process during that time (since it was currently being dumped), and it killed it, creating a new one.

The solution was to ask IIS not to do that; the configuration is available in the advanced settings for the application pool. Note that just changing it forces IIS to restart the process, which was another major annoyance.
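For reference, the same setting can also be changed from the command line via appcmd. This assumes the default inetsrv location and an application pool named "RavenDB"; adjust both to your setup:

```shell
# Stop IIS from pinging (and recycling) the worker process while the
# full memory dump is being written. Changing this recycles the pool.
%windir%\system32\inetsrv\appcmd.exe set apppool "RavenDB" /processModel.pingingEnabled:false
```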

[image: application pool advanced settings]

QCon London and In The Brain talk – Performance Optimizations in the wild

time to read 1 min | 135 words

The RavenDB Core Team is going to be at the QCon London conference this week, so if you are there, stop by our booth. We have a lot of cool swag to give out and some really cool demos.

In addition to that, on Thursday I’m going to be giving an In The Brain talk about Performance Optimizations in the Wild, talking about the kind of performance work we have been doing recently.

The results of this work can be shown on the following graph:

[image: performance graph]

Come to the talk to hear all about the details and what we did to get things working.

That ain’t going to take you anywhere

time to read 4 min | 659 words

As part of our usual work routine, we field customer questions and inquiries. A pretty common one is a request to take a look at their system to make sure that they are making good use of RavenDB.

Usually, this involves going over the code with the team, and making some specific recommendations. Merge those indexes, re-model this bit to allow for this widget to operate more cleanly, etc.

Recently we had such a review in which what I ended up saying was: “Buy a bigger server, hope this works, and rewrite this from scratch as fast as possible”.

The really annoying thing is that someone who was quite talented has obviously spent a lot of time doing a lot of really complex things to end up where they are now. It strongly reminded me of this image:

[image]

At this point, you can have the best horse in the world, but the only thing that will happen if it runs is that you are going to be messed up.

What was so bad? Well, to start with, the application was designed to work with a dynamic data model. That is probably also why RavenDB was selected, since that is a great choice for dynamic data.

Then the designers sat down and created the following system of classes:

public class Table
{
	public Guid TableId {get;set;}
	public List<FieldInformation> Fields {get;set;}
	public List<Reference> References {get;set;}
	public List<Constraint> Constraints {get;set;}
}

public class FieldInformation
{
	public Guid FieldId {get;set;}
	public string Name {get;set;}
	public string Type {get;set;}
	public bool Required {get;set;}
}

public class Reference
{
	public Guid ReferenceId {get;set;}
	public string Field {get;set;}
	public Guid ReferencedTableId {get;set;}
}

public class Instance
{
	public Guid InstanceId {get;set;}
	public Guid TableId {get;set;}
	public List<Guid> References {get;set;}
	public List<FieldValue> Values {get;set;}
}

public class FieldValue
{
	public Guid FieldId {get;set;}
	public string Value {get;set;}
}

I’ll let you draw your own conclusions about what the documents looked like, or just how many calls you needed to load a single entity instance.
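To make that concrete, here is a hypothetical sketch of what a single Instance document might have looked like under this model (all ids and values are invented for illustration):

```json
{
  "InstanceId": "5b6e7f2a-0000-0000-0000-000000000001",
  "TableId": "9c1d4e3b-0000-0000-0000-000000000002",
  "References": [
    "5b6e7f2a-0000-0000-0000-000000000003"
  ],
  "Values": [
    { "FieldId": "0a1b2c3d-0000-0000-0000-000000000004", "Value": "Bella Dona" },
    { "FieldId": "0a1b2c3d-0000-0000-0000-000000000005", "Value": "2015-03-11" }
  ]
}
```

Every field name and type lives in a separate Table document, so even rendering this one instance means loading its Table, chasing every referenced Table, and resolving each FieldId against the FieldInformation list.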

For that matter, it wasn’t possible to query such a system directly, obviously, so they created a set of multi-map/reduce indexes that took this data and translated it into something resembling a real entity, then queried that.

But the number of documents, indexes and the sheer travesty going on meant that actually:

  • Saving something to RavenDB took a long time.
  • Querying was really complex.
  • The number of indexes was high.
  • Just figuring out what is going on in the system was nigh impossible without a map, a guide and a lot of good luck.

Just to cap things off, this is a .NET project, and in order to connect to RavenDB they used direct REST calls via HttpClient, blithely ignoring all the man-decades that were spent creating a good client-side experience and integration. For example, they made no use of ETags or 304 Not Modified responses, so a lot of the things that RavenDB can do (even under such… hardship) to make things better weren’t available, because the client code wouldn’t cooperate.

I don’t generally say things like “throw this all away”, but there is no mid or long term approach that could possibly work here.
