Ayende @ Rahien

Refunds available at head office

RavenDB: Multi Maps / Reduce indexes

badge1If you thought that map/reduce was complex, wait until we introduce the newest feature in RavenDB:

Multi Maps / Reduce Indexes

Okay, to be frank, they aren’t complex at all, they are actually quite simple, when you sit down to think about them. Again, I have to credit to Frank Schwieterman, who came up with the idea.

Wait! Let us back track a bit and try to explain what the actual problem is that we are trying to solve. The problem with Map/Reduce is that you can only gather information from a single set of documents. Let us look at the following documents as an example:

{// users/ayende 
   "Name": "Ayende Rahien" 
} 

{ // posts/1234 
  "Title": "Why RavenDB?", 
  "Author": "users/ayende" 
} 
{ // posts/1235 
  "Title": "It is awesome!", 
  "Author": "users/ayende" 
} 

We want to get an list of users with the count of posts that they made. That is trivially easy, as shown in the following map/reduce index:

from post in docs.Posts
select new { post.Author, Count = 1 }

from result in results
group result by result.Author into g
select new
{
   Author = g.Key,
   Count = g.Sum(x=>x.Count)
}

The output of this index would be something like this:

{ Author: "users/ayende", Count: 2 }

And we can load it efficiently using Includes:

session.Query<AuthorPostStats>("Posts/ByUser/Count")
     .Include(x=>x.Author)
     .ToList();

This will load all the users statistics, and also load all of the associated users, in a single query to the database. So far, fairly simple and routine.

badge5The problem begins when we want to be able to query this index using the user’s name. As you can deduct from the documents shown previously, the user name isn’t available on the post document, which means that we can’t index it. That, in turn, means that we can’t search it.

We are left with several bad options:

  • De-normalize the User’s Name property to the Post document, solely for indexing purposes.
  • Don’t implement this feature.
  • Write the following scary query:
from doc in docs.WhereEntityIs("Users","Posts") 
let user = doc.IfEntityIs("Users") 
let post = doc.IfEntityIs("Posts") 
select new 
{ 
  Count = user == null ? 1 : 0, 
  Author = user.Name, 
  UserId = user.Id ?? post.Author 
} 

from result in results 
group result by result.UserId into g 
select new 
{ 
   Count = g.Sum(x=>x.Count), 
   Author = g.FirstNotNull(x=>x.Author), 
   UserId = g.Key 
} 

This is actually pretty straightforward, when you sit down and think about it. But there is a whole lot of ceremony involved, and it is actually more than a bit hard to figure out what is going on in more complex scenarios.

This is where Frank’s suggestion came in:

…if I were try to support linq-based indexes that can map multiple types, it might look like:

public class OverallOpinion : AbstractIndexCreationTask<?>
{
   public OverallOpinion()
   {
       Map<Foo>(docs => from doc in docs select new { Id = doc.Id, LastUpdated = doc.LastUpdated }
       Map<OpinionOfFoo>(docs => from doc in docs select new { Id = Doc.DocId, Rating = doc.Rating, Count = 1}
       Reduce = docs => from doc in docs
                        group doc by doc.Id into g
                        select new {
                           Id = g.Key,
                           LastUpdated = g.Values.Where(f => f.LastUpdated != null).FirstOrDefault(),
                           Rating = g.Values.Rating.Sum(),
                           Count = g.Values.Count.Sum()
                        }
   }
}

It seems like some clever code could combine the different map expressions into one.

badge7This is part of a longer discussion, but basically, it got me thinking about how we can implement multi maps, and I came up with the following:

// Map from posts
from post in docs.Posts
select new { UserId = post.Author, Author = (string)null, Count = 1 }

// Map from users
from user in docs.Users
select new { UserId = user.Id, Author = user.Name, Count = 0 }

// Reduce takes results from both maps
from result in results
group result by result.UserId into g
select new
{
   Count = g.Sum(x=>x.Count),
   Author = g.Where(x=>x!=null).First(),
   UserId = g.Key
}

The only thing to understand now is that we have multiple map functions, getting data from multiple sources. We can then take those sources and reduce them together. The only requirements that we have is that the output of all of the map functions would be identical (and obviously, match the output of the reduce function). Then we can just treat this information as normal map/reduce index, which means that all of the usual RavenDB features kick in. Let us see what this actually means, using code. We have the following classes:

public class User
{
    public string Id { get; set; }
    public string Name { get; set; }
}

public class Post
{
    public string Id { get; set; }
    public string Title { get; set; }
    public string AuthorId { get; set; }
}

public class UserPostingStats
{
    public string UserName { get; set; }
    public string UserId { get; set; }
    public int PostCount { get; set; }
}

And we have the following index:

public class PostCountsByUser_WithName : AbstractMultiMapIndexCreationTask<UserPostingStats>
{
    public PostCountsByUser_WithName()
    {
        AddMap<User>(users => from user in users
                              select new
                              {
                                  UserId = user.Id,
                                  UserName = user.Name,
                                  PostCount = 0
                              });

        AddMap<Post>(posts => from post in posts
                              select new
                              {
                                  UserId = post.AuthorId,
                                  UserName = (string)null,
                                  PostCount = 1
                              });

        Reduce = results => from result in results
                            group result by result.UserId
                            into g
                            select new
                            {
                                UserId = g.Key,
                                UserName = g.Select(x => x.UserName).Where(x => x != null).First(),
                                PostCount = g.Sum(x => x.PostCount)
                            };

        Index(x=>x.UserName, FieldIndexing.Analyzed);
    }
}

badge8As you can see, we are getting the values from two different collections. We need to make sure that they are actually using the same output, which is what caused us the null casting for posts and the filtering that we need to do on the reduce.

But that is it! It is ridiculously easy compared to the previous alternative. Moreover, it follows quite naturally from both the exposed API and the internal implementation inside RavenDB. It took me roughly half a day to make it work, and some of that was dedicated to lunch Smile. In truth, most of that time is actually just handling the error conditions nicely, but… anyway, you get the point.

Even more interesting than the rest is the fact that for all intents and purposes, what we have done here is a join between two different collections. We were never able to really resolve the problems associated with joins before, update notifications were always too complex to figure out, but going the route of multi map makes things so easy.

Just for fun, you might have noticed that we marked the UserName property as analyzed, which means that we can issue full text queries against it. Let us assume that we want to provide users with the following UI:

image

It is now just a matter of writing the following code:

using (var session = store.OpenSession())
{
    var ups= session.Query<UserPostingStats, PostCountsByUser_WithName>()
        .Where(x => x.UserName.StartsWith("rah"))
        .ToList();

    Assert.Equal(1, ups.Count);

    Assert.Equal(5, ups[0].PostCount);
    Assert.Equal("Ayende Rahien", ups[0].UserName);
}

So you can do a cheap full text search over joins quite easily. For that matter, joins are cheap now, because they are computed on the background and queried directly from the pre-computed index.

Okay, enough blogging for now, going to implement all the proper error handling and then push an awesome new build.

Oh, and a final thought, Multi Map was shown in this blog only in the context of Multi Maps/Reduce, but we also support just the ability to use multi map on its own. This is quite useful if you want to enable search over a large number of entities that reside in different collections. I’ll just drop a bit of code here to show how it works:

public class CatsAndDogs : AbstractMultiMapIndexCreationTask
{
    public CatsAndDogs()
    {
        AddMap<Cat>(cats => from cat in cats
                         select new {cat.Name});

        AddMap<Dog>(dogs => from dog in dogs
                         select new { dog.Name });
    }
}

[Fact]
public void CanQueryUsingMutliMap()
{
    using (var store = NewDocumentStore())
    {
        new CatsAndDogs().Execute(store);

        using(var documentSession = store.OpenSession())
        {
            documentSession.Store(new Cat{Name = "Tom"});
            documentSession.Store(new Dog{Name = "Oscar"});
            documentSession.SaveChanges();
        }

        using(var session = store.OpenSession())
        {
            var haveNames = session.Query<IHaveName, CatsAndDogs>()
                .Customize(x => x.WaitForNonStaleResults(TimeSpan.FromMinutes(5)))
                .OrderBy(x => x.Name)
                .ToList();

            Assert.Equal(2, haveNames.Count);
            Assert.IsType<Dog>(haveNames[0]);
            Assert.IsType<Cat>(haveNames[1]);
        }
    }
}

All together, a great day’s work.

Tags:

Posted By: Ayende Rahien

Published at

Originally posted at

Comments

John Landheer
09/12/2011 09:25 AM by
John Landheer

I think the line in your name/posts example should read:

select new { UserId = user.Id, Author = user.Name, Count = 0 }

Otherwise your count will always be one too many.

tobi
09/12/2011 11:07 AM by
tobi

This is a pretty specialized and nasty way to do materialized views over join queries.

I don't know why you resist joins so much. You just implemented materialized joins, but in a totally involved way because it had to be forced into the map/reduce paradigm. There is nothing in joins that is inherently unscalable or unperformant. Look at how SQL Server does incremental and efficient maintainance of indexed views with joins in them. Querying them has zero runtime overhead yet you get all the denormalization benefits and efficient updates.

I totally don't understand why all map/reduce implementations in the entire industry just cannot admit that joins are useful and should be supported. DryadLINQ is the welcome exception.

Louis
09/12/2011 01:21 PM by
Louis

@John: I think Count = 1 is just a way to initialize the property Count of the anonymous type to an int32.

Ayende Rahien
09/12/2011 01:32 PM by
Ayende Rahien

Tobi, Try to implement that, and tell me where it leads you. We have spent a LOT of time on that, and this is by far the simplest solution

tobi
09/12/2011 02:05 PM by
tobi

Ayende, the implementation problem will probably be a) the change tracking and b) the need for a more efficient join than loop-joins for view initialization. Difficulty is in this order.

Implementing an external merge sort + merge-join is easy, I have done it to process "big data" (more than RAM would hold). The change tracking will be difficult to work out but easy to test. A reference implementation exists in SQL Server and can be analyzed by looking at the execution plans. They are detailed enough to reverse-engineer a solution.

tobi
09/12/2011 02:28 PM by
tobi

Needless to say that simple standard joins would remove the need to denormalize in RavenDB and be a great convenience. I expect the NoSql fad to end when somebody finally imlements a relational database with relaxed guarantees. RavenDB could be that database right now. Basically, only convenient joins and a reasonably query planner are missing.

Chanan Braunstein
09/12/2011 02:48 PM by
Chanan Braunstein

Somewhat related question... Do map/reduce always return a key and a count?

Or in other words, for example, I want to return the first letter of the last name of the authors and group of authors:

B: Braunstein, Chanan Bova, Ben R: Rahien, Ayende

fschwiet
09/12/2011 07:59 PM by
fschwiet

Chanan, I gave it a try. I only had a hour or so to try it out, it seems possible but I couldn't quite get it to work: https://github.com/fschwiet/ravendb/commit/597c8630360e6a28331e9f738ac5e753807f4a46

One observation, to group by a field it needs to be included in all types that are mapped, (for Ayende's test, I had to add UserName field to Post in order to group by first letter of username)

Tobi, you just have to write a distributed map/reduce function in linq. Thats much better than say, Erlang. :) What would the joins look like? Trying multimaps, there is room to make it more DRY when coalescing, but being able to merge different types like this does open some possibilities for composite views, very typical on the web.

tobi
09/13/2011 07:51 AM by
tobi

fschwiet, the ability to use multiple sources is a union-all. It is clearly useful especially for search. It has nothing to do with joins inherently, it is just that you can assemble/hack a join from it by doing a reduce stage afterwards. If you want to know how joins would look like, google DryadLINQ. It is a data-warehouse query solution, not OLTP. Yet it is very applicable and instructive to this discussion.

Ayende Rahien
09/13/2011 08:11 AM by
Ayende Rahien

Tobi, No, the actual problem is how do you find what you join ON. Remember, we don't have a column for that join condition, this can be anywhere you want it to. And just getting the data to do the join would be incredibly expensive.

Ayende Rahien
09/13/2011 08:13 AM by
Ayende Rahien

Tobi, You don't relaly need to denormalize often in RavenDB

Ayende Rahien
09/13/2011 08:13 AM by
Ayende Rahien

Chanan, You can do that, certainly. It is a little different, but it is entirely feasible

Richard Poole
09/14/2011 03:00 PM by
Richard Poole

Includes and live projections are useful for retrieving related data, but they don't help if you need to query or sort on the related data. Denormalised references are an option, but they're not always appropriate, e.g. if the related data has a high rate of change. That's why I'm really looking forward to this feature. It perfectly fills the gap left by denormalised references, includes and live projections. Well done to Frank for coming up with the idea!

Comments have been closed on this topic.