RavenDBMulti Maps / Reduce indexes

time to read 18 min | 3538 words

badge1If you thought that map/reduce was complex, wait until we introduce the newest feature in RavenDB:

Multi Maps / Reduce Indexes

Okay, to be frank, they aren’t complex at all, they are actually quite simple, when you sit down to think about them. Again, I have to credit to Frank Schwieterman, who came up with the idea.

Wait! Let us back track a bit and try to explain what the actual problem is that we are trying to solve. The problem with Map/Reduce is that you can only gather information from a single set of documents. Let us look at the following documents as an example:

{// users/ayende 
   "Name": "Ayende Rahien" 
} 

{ // posts/1234 
  "Title": "Why RavenDB?", 
  "Author": "users/ayende" 
} 
{ // posts/1235 
  "Title": "It is awesome!", 
  "Author": "users/ayende" 
} 

We want to get an list of users with the count of posts that they made. That is trivially easy, as shown in the following map/reduce index:

from post in docs.Posts
select new { post.Author, Count = 1 }

from result in results
group result by result.Author into g
select new
{
   Author = g.Key,
   Count = g.Sum(x=>x.Count)
}

The output of this index would be something like this:

{ Author: "users/ayende", Count: 2 }

And we can load it efficiently using Includes:

session.Query<AuthorPostStats>("Posts/ByUser/Count")
     .Include(x=>x.Author)
     .ToList();

This will load all the users statistics, and also load all of the associated users, in a single query to the database. So far, fairly simple and routine.

badge5The problem begins when we want to be able to query this index using the user’s name. As you can deduce from the documents shown previously, the user name isn’t available on the post document, which means that we can’t index it. That, in turn, means that we can’t search it.

We are left with several bad options:

  • De-normalize the User’s Name property to the Post document, solely for indexing purposes.
  • Don’t implement this feature.
  • Write the following scary query:
from doc in docs.WhereEntityIs("Users","Posts") 
let user = doc.IfEntityIs("Users") 
let post = doc.IfEntityIs("Posts") 
select new 
{ 
  Count = user == null ? 1 : 0, 
  Author = user.Name, 
  UserId = user.Id ?? post.Author 
} 

from result in results 
group result by result.UserId into g 
select new 
{ 
   Count = g.Sum(x=>x.Count), 
   Author = g.FirstNotNull(x=>x.Author), 
   UserId = g.Key 
} 

This is actually pretty straightforward, when you sit down and think about it. But there is a whole lot of ceremony involved, and it is actually more than a bit hard to figure out what is going on in more complex scenarios.

This is where Frank’s suggestion came in:

…if I were try to support linq-based indexes that can map multiple types, it might look like:

public class OverallOpinion : AbstractIndexCreationTask<?>
{
   public OverallOpinion()
   {
       Map<Foo>(docs => from doc in docs select new { Id = doc.Id, LastUpdated = doc.LastUpdated }
       Map<OpinionOfFoo>(docs => from doc in docs select new { Id = Doc.DocId, Rating = doc.Rating, Count = 1}
       Reduce = docs => from doc in docs
                        group doc by doc.Id into g
                        select new {
                           Id = g.Key,
                           LastUpdated = g.Values.Where(f => f.LastUpdated != null).FirstOrDefault(),
                           Rating = g.Values.Rating.Sum(),
                           Count = g.Values.Count.Sum()
                        }
   }
}

It seems like some clever code could combine the different map expressions into one.

badge7This is part of a longer discussion, but basically, it got me thinking about how we can implement multi maps, and I came up with the following:

// Map from posts
from post in docs.Posts
select new { UserId = post.Author, Author = (string)null, Count = 1 }

// Map from users
from user in docs.Users
select new { UserId = user.Id, Author = user.Name, Count = 0 }

// Reduce takes results from both maps
from result in results
group result by result.UserId into g
select new
{
   Count = g.Sum(x=>x.Count),
   Author = g.Where(x=>x!=null).First(),
   UserId = g.Key
}

The only thing to understand now is that we have multiple map functions, getting data from multiple sources. We can then take those sources and reduce them together. The only requirements that we have is that the output of all of the map functions would be identical (and obviously, match the output of the reduce function). Then we can just treat this information as normal map/reduce index, which means that all of the usual RavenDB features kick in. Let us see what this actually means, using code. We have the following classes:

public class User
{
    public string Id { get; set; }
    public string Name { get; set; }
}

public class Post
{
    public string Id { get; set; }
    public string Title { get; set; }
    public string AuthorId { get; set; }
}

public class UserPostingStats
{
    public string UserName { get; set; }
    public string UserId { get; set; }
    public int PostCount { get; set; }
}

And we have the following index:

public class PostCountsByUser_WithName : AbstractMultiMapIndexCreationTask<UserPostingStats>
{
    public PostCountsByUser_WithName()
    {
        AddMap<User>(users => from user in users
                              select new
                              {
                                  UserId = user.Id,
                                  UserName = user.Name,
                                  PostCount = 0
                              });

        AddMap<Post>(posts => from post in posts
                              select new
                              {
                                  UserId = post.AuthorId,
                                  UserName = (string)null,
                                  PostCount = 1
                              });

        Reduce = results => from result in results
                            group result by result.UserId
                            into g
                            select new
                            {
                                UserId = g.Key,
                                UserName = g.Select(x => x.UserName).Where(x => x != null).First(),
                                PostCount = g.Sum(x => x.PostCount)
                            };

        Index(x=>x.UserName, FieldIndexing.Analyzed);
    }
}

badge8As you can see, we are getting the values from two different collections. We need to make sure that they are actually using the same output, which is what caused us the null casting for posts and the filtering that we need to do on the reduce.

But that is it! It is ridiculously easy compared to the previous alternative. Moreover, it follows quite naturally from both the exposed API and the internal implementation inside RavenDB. It took me roughly half a day to make it work, and some of that was dedicated to lunch Smile. In truth, most of that time is actually just handling the error conditions nicely, but… anyway, you get the point.

Even more interesting than the rest is the fact that for all intents and purposes, what we have done here is a join between two different collections. We were never able to really resolve the problems associated with joins before, update notifications were always too complex to figure out, but going the route of multi map makes things so easy.

Just for fun, you might have noticed that we marked the UserName property as analyzed, which means that we can issue full text queries against it. Let us assume that we want to provide users with the following UI:

image

It is now just a matter of writing the following code:

using (var session = store.OpenSession())
{
    var ups= session.Query<UserPostingStats, PostCountsByUser_WithName>()
        .Where(x => x.UserName.StartsWith("rah"))
        .ToList();

    Assert.Equal(1, ups.Count);

    Assert.Equal(5, ups[0].PostCount);
    Assert.Equal("Ayende Rahien", ups[0].UserName);
}

So you can do a cheap full text search over joins quite easily. For that matter, joins are cheap now, because they are computed on the background and queried directly from the pre-computed index.

Okay, enough blogging for now, going to implement all the proper error handling and then push an awesome new build.

Oh, and a final thought, Multi Map was shown in this blog only in the context of Multi Maps/Reduce, but we also support just the ability to use multi map on its own. This is quite useful if you want to enable search over a large number of entities that reside in different collections. I’ll just drop a bit of code here to show how it works:

public class CatsAndDogs : AbstractMultiMapIndexCreationTask
{
    public CatsAndDogs()
    {
        AddMap<Cat>(cats => from cat in cats
                         select new {cat.Name});

        AddMap<Dog>(dogs => from dog in dogs
                         select new { dog.Name });
    }
}

[Fact]
public void CanQueryUsingMutliMap()
{
    using (var store = NewDocumentStore())
    {
        new CatsAndDogs().Execute(store);

        using(var documentSession = store.OpenSession())
        {
            documentSession.Store(new Cat{Name = "Tom"});
            documentSession.Store(new Dog{Name = "Oscar"});
            documentSession.SaveChanges();
        }

        using(var session = store.OpenSession())
        {
            var haveNames = session.Query<IHaveName, CatsAndDogs>()
                .Customize(x => x.WaitForNonStaleResults(TimeSpan.FromMinutes(5)))
                .OrderBy(x => x.Name)
                .ToList();

            Assert.Equal(2, haveNames.Count);
            Assert.IsType<Dog>(haveNames[0]);
            Assert.IsType<Cat>(haveNames[1]);
        }
    }
}

All together, a great day’s work.

More posts in "RavenDB" series:

  1. (25 May 2016) Got anything to declare, ya smuggler?
  2. (23 May 2016) I'm no longer conflicted about this
  3. (19 May 2016) What did you subscribe to again?
  4. (17 May 2016) See here, I got a contract, I say!
  5. (13 May 2016) Deeper insights to indexing
  6. (11 May 2016) Digging deep into the internals
  7. (09 May 2016) I'll have the 3+1 goodies to go, please
  8. (04 May 2016) I’ll find who is taking my I/O bandwidth and they SHALL pay
  9. (02 May 2016) You want all the data, you can’t handle all the data
  10. (29 Apr 2016) A large cluster goes into a bar and order N^2 drinks
  11. (27 Apr 2016) I’m the admin, and I got the POWER
  12. (25 Apr 2016) Can you spare me a server?
  13. (21 Apr 2016) Configuring once is best done after testing twice
  14. (19 Apr 2016) Is this a cluster in your pocket AND you are happy to see me?