RavenDB: Multi Maps / Reduce indexes
If you thought that map/reduce was complex, wait until we introduce the newest feature in RavenDB:
Multi Maps / Reduce Indexes
Okay, to be frank, they aren’t complex at all; they are actually quite simple, when you sit down to think about them. Again, I have to give credit to Frank Schwieterman, who came up with the idea.
Wait! Let us backtrack a bit and explain the actual problem that we are trying to solve. The problem with Map/Reduce is that you can only gather information from a single set of documents. Let us look at the following documents as an example:
{ // users/ayende
  "Name": "Ayende Rahien"
}
{ // posts/1234
  "Title": "Why RavenDB?",
  "Author": "users/ayende"
}
{ // posts/1235
  "Title": "It is awesome!",
  "Author": "users/ayende"
}
We want to get a list of users with the count of posts that they made. That is trivially easy, as shown in the following map/reduce index:
// Map
from post in docs.Posts
select new { post.Author, Count = 1 }

// Reduce
from result in results
group result by result.Author into g
select new { Author = g.Key, Count = g.Sum(x => x.Count) }
The output of this index would be something like this:
{ Author: "users/ayende", Count: 2 }
And we can load it efficiently using Includes:
session.Query<AuthorPostStats>("Posts/ByUser/Count")
.Include(x=>x.Author)
.ToList();
This will load all the users statistics, and also load all of the associated users, in a single query to the database. So far, fairly simple and routine.
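The AuthorPostStats class that the query projects into isn’t shown here; a minimal sketch that matches the index output above (the property names are inferred from that output, not taken from the post) could be:

public class AuthorPostStats
{
    public string Author { get; set; } // the user document id, e.g. "users/ayende"
    public int Count { get; set; }
}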
The problem begins when we want to be able to query this index using the user’s name. As you can deduce from the documents shown previously, the user name isn’t available on the post document, which means that we can’t index it. That, in turn, means that we can’t search it.
We are left with several bad options:
- De-normalize the User’s Name property to the Post document, solely for indexing purposes.
- Don’t implement this feature.
- Write the following scary query:
// Map
from doc in docs.WhereEntityIs("Users", "Posts")
let user = doc.IfEntityIs("Users")
let post = doc.IfEntityIs("Posts")
select new
{
    Count = user == null ? 1 : 0,
    Author = user.Name,
    UserId = user.Id ?? post.Author
}

// Reduce
from result in results
group result by result.UserId into g
select new
{
    Count = g.Sum(x => x.Count),
    Author = g.FirstNotNull(x => x.Author),
    UserId = g.Key
}
This is actually pretty straightforward, when you sit down and think about it. But there is a whole lot of ceremony involved, and it is actually more than a bit hard to figure out what is going on in more complex scenarios.
This is where Frank’s suggestion came in:
…if I were to try to support linq-based indexes that can map multiple types, it might look like:
public class OverallOpinion : AbstractIndexCreationTask<?>
{
    public OverallOpinion()
    {
        Map<Foo>(docs => from doc in docs
                         select new { Id = doc.Id, LastUpdated = doc.LastUpdated });

        Map<OpinionOfFoo>(docs => from doc in docs
                                  select new { Id = doc.DocId, Rating = doc.Rating, Count = 1 });

        Reduce = docs => from doc in docs
                         group doc by doc.Id into g
                         select new
                         {
                             Id = g.Key,
                             LastUpdated = g.Values.Where(f => f.LastUpdated != null).FirstOrDefault(),
                             Rating = g.Values.Rating.Sum(),
                             Count = g.Values.Count.Sum()
                         };
    }
}

It seems like some clever code could combine the different map expressions into one.
This is part of a longer discussion, but basically, it got me thinking about how we can implement multi maps, and I came up with the following:
// Map from posts
from post in docs.Posts
select new
{
    UserId = post.Author,
    Author = (string)null,
    Count = 1
}

// Map from users
from user in docs.Users
select new
{
    UserId = user.Id,
    Author = user.Name,
    Count = 0
}

// Reduce takes results from both maps
from result in results
group result by result.UserId into g
select new
{
    Count = g.Sum(x => x.Count),
    Author = g.Select(x => x.Author).Where(x => x != null).First(),
    UserId = g.Key
}
The only thing to understand now is that we have multiple map functions, getting data from multiple sources, and we can then take those sources and reduce them together. The only requirement is that the outputs of all of the map functions are identical (and, obviously, match the output of the reduce function). Then we can treat this as a normal map/reduce index, which means that all of the usual RavenDB features kick in. Let us see what this actually means, using code. We have the following classes:
public class User
{
    public string Id { get; set; }
    public string Name { get; set; }
}

public class Post
{
    public string Id { get; set; }
    public string Title { get; set; }
    public string AuthorId { get; set; }
}

public class UserPostingStats
{
    public string UserName { get; set; }
    public string UserId { get; set; }
    public int PostCount { get; set; }
}
And we have the following index:
public class PostCountsByUser_WithName : AbstractMultiMapIndexCreationTask<UserPostingStats>
{
    public PostCountsByUser_WithName()
    {
        AddMap<User>(users => from user in users
                              select new
                              {
                                  UserId = user.Id,
                                  UserName = user.Name,
                                  PostCount = 0
                              });

        AddMap<Post>(posts => from post in posts
                              select new
                              {
                                  UserId = post.AuthorId,
                                  UserName = (string)null,
                                  PostCount = 1
                              });

        Reduce = results => from result in results
                            group result by result.UserId into g
                            select new
                            {
                                UserId = g.Key,
                                UserName = g.Select(x => x.UserName).Where(x => x != null).First(),
                                PostCount = g.Sum(x => x.PostCount)
                            };

        Index(x => x.UserName, FieldIndexing.Analyzed);
    }
}
As you can see, we are getting the values from two different collections. We need to make sure that the maps produce the same output shape, which is why we need the null cast in the posts map and the filtering that we do in the reduce.
But that is it! It is ridiculously easy compared to the previous alternative. Moreover, it follows quite naturally from both the exposed API and the internal implementation inside RavenDB. It took me roughly half a day to make it work, and some of that was dedicated to lunch. In truth, most of that time actually went into handling the error conditions nicely, but… anyway, you get the point.
Even more interesting is the fact that, for all intents and purposes, what we have done here is a join between two different collections. We were never able to really resolve the problems associated with joins before; update notifications were always too complex to figure out. Going the multi map route makes things easy.
Just for fun, you might have noticed that we marked the UserName property as analyzed, which means that we can issue full text queries against it. Let us assume that we want to provide users with a UI for searching authors by name as they type.
It is now just a matter of writing the following code:
using (var session = store.OpenSession())
{
    var ups = session.Query<UserPostingStats, PostCountsByUser_WithName>()
        .Where(x => x.UserName.StartsWith("rah"))
        .ToList();

    Assert.Equal(1, ups.Count);
    Assert.Equal(5, ups[0].PostCount);
    Assert.Equal("Ayende Rahien", ups[0].UserName);
}
So you can do a cheap full text search over joins quite easily. For that matter, joins are cheap now, because they are computed in the background and queried directly from the pre-computed index.
Okay, enough blogging for now, going to implement all the proper error handling and then push an awesome new build.
Oh, and a final thought: Multi Map was shown in this post only in the context of Multi Maps/Reduce, but we also support using multi map on its own. This is quite useful if you want to enable search over a large number of entities that reside in different collections. I’ll just drop a bit of code here to show how it works:
public class CatsAndDogs : AbstractMultiMapIndexCreationTask
{
    public CatsAndDogs()
    {
        AddMap<Cat>(cats => from cat in cats
                            select new { cat.Name });

        AddMap<Dog>(dogs => from dog in dogs
                            select new { dog.Name });
    }
}

[Fact]
public void CanQueryUsingMultiMap()
{
    using (var store = NewDocumentStore())
    {
        new CatsAndDogs().Execute(store);

        using (var documentSession = store.OpenSession())
        {
            documentSession.Store(new Cat { Name = "Tom" });
            documentSession.Store(new Dog { Name = "Oscar" });
            documentSession.SaveChanges();
        }

        using (var session = store.OpenSession())
        {
            var haveNames = session.Query<IHaveName, CatsAndDogs>()
                .Customize(x => x.WaitForNonStaleResults(TimeSpan.FromMinutes(5)))
                .OrderBy(x => x.Name)
                .ToList();

            Assert.Equal(2, haveNames.Count);
            Assert.IsType<Dog>(haveNames[0]);
            Assert.IsType<Cat>(haveNames[1]);
        }
    }
}
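For completeness, the Cat, Dog, and IHaveName types that the test relies on aren’t shown in the post; a minimal sketch consistent with the test code would be:

public interface IHaveName
{
    string Name { get; }
}

public class Cat : IHaveName
{
    public string Name { get; set; }
}

public class Dog : IHaveName
{
    public string Name { get; set; }
}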
Altogether, a great day’s work.
Comments
I think the line in your name/posts example should read:
select new { UserId = user.Id, Author = user.Name, Count = 0 }
Otherwise your count will always be one too many.
This is a pretty specialized and nasty way to do materialized views over join queries.
I don't know why you resist joins so much. You just implemented materialized joins, but in a totally involved way, because it had to be forced into the map/reduce paradigm. There is nothing in joins that is inherently unscalable or unperformant. Look at how SQL Server does incremental and efficient maintenance of indexed views with joins in them. Querying them has zero runtime overhead, yet you get all the denormalization benefits and efficient updates.
I totally don't understand why all map/reduce implementations in the entire industry just cannot admit that joins are useful and should be supported. DryadLINQ is the welcome exception.
@John: I think Count = 1 is just a way to initialize the property Count of the anonymous type to an int32.
John, thanks, fixed
Tobi, Try to implement that, and tell me where it leads you. We have spent a LOT of time on that, and this is by far the simplest solution
Ayende, the implementation problem will probably be a) the change tracking and b) the need for a more efficient join than loop-joins for view initialization. Difficulty is in this order.
Implementing an external merge sort + merge-join is easy, I have done it to process "big data" (more than RAM would hold). The change tracking will be difficult to work out but easy to test. A reference implementation exists in SQL Server and can be analyzed by looking at the execution plans. They are detailed enough to reverse-engineer a solution.
Needless to say, simple standard joins would remove the need to denormalize in RavenDB and be a great convenience. I expect the NoSql fad to end when somebody finally implements a relational database with relaxed guarantees. RavenDB could be that database right now. Basically, only convenient joins and a reasonable query planner are missing.
Somewhat related question... Does map/reduce always return a key and a count?
Or, in other words: for example, I want to return the first letter of the authors' last names, with the authors grouped under each letter:
B: Braunstein, Chanan
   Bova, Ben
R: Rahien, Ayende
Chanan, I gave it a try. I only had an hour or so to try it out; it seems possible, but I couldn't quite get it to work: https://github.com/fschwiet/ravendb/commit/597c8630360e6a28331e9f738ac5e753807f4a46
One observation: to group by a field, it needs to be included in all types that are mapped (for Ayende's test, I had to add a UserName field to Post in order to group by the first letter of the username).
Tobi, you just have to write a distributed map/reduce function in linq. That's much better than, say, Erlang. :) What would the joins look like? Trying multi maps, there is room to make it more DRY when coalescing, but being able to merge different types like this does open some possibilities for composite views, very typical on the web.
fschwiet, the ability to use multiple sources is a union-all. It is clearly useful, especially for search. It has nothing to do with joins inherently; it is just that you can assemble/hack a join from it by doing a reduce stage afterwards. If you want to know what joins would look like, google DryadLINQ. It is a data-warehouse query solution, not OLTP, yet it is very applicable and instructive to this discussion.
Tobi, No, the actual problem is how you find what you join ON. Remember, we don't have a column for that join condition; it can be anywhere you want it to be. And just getting the data to do the join would be incredibly expensive.
Tobi, You don't really need to denormalize often in RavenDB.
Chanan, You can do that, certainly. It is a little different, but it is entirely feasible
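A rough sketch of what such an index might look like (just an illustration, assuming an Authors collection with FirstName and LastName fields):

// Map
from author in docs.Authors
select new
{
    FirstLetter = author.LastName.Substring(0, 1),
    Names = new[] { author.LastName + ", " + author.FirstName }
}

// Reduce
from result in results
group result by result.FirstLetter into g
select new
{
    FirstLetter = g.Key,
    Names = g.SelectMany(x => x.Names).ToArray()
}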
Includes and live projections are useful for retrieving related data, but they don't help if you need to query or sort on the related data. Denormalised references are an option, but they're not always appropriate, e.g. if the related data has a high rate of change. That's why I'm really looking forward to this feature. It perfectly fills the gap left by denormalised references, includes and live projections. Well done to Frank for coming up with the idea!