Ayende @ Rahien

It's a girl

Self “joins” in RavenDB

A customer had an interesting problem in the mailing list. He had the following model:

public class Case
{
    public string Id { get; set; }
    public string[] CasesCitedByThisCase { get; set; }
    public string Name { get; set; }
    // other interesting properties
}

And he needed to be able to query and sort by the case’s properties, but also by its importance. A case important is determined by how many other cases are citing it.

This seems hard to do, because you cannot access other documents during standard indexing, so you have to use map reduce to make this work. The problem is that it is really hard to just map/reduce to make this work, you need two disparate sets of information from the documents.

This is why we have multi map, and we can create the appropriate index like this:

public class Cases_Search : AbstractMultiMapIndexCreationTask<Cases_Search.Result>
{
    public class Result
    {
        public string Id { get; set; }
        public int TimesCited { get; set; }
        public string Name { get; set; }
    }

    public Cases_Search()
    {
        AddMap<Case>(cases =>
                        from c in cases
                        select new
                        {
                        c.Id,
                        c.Name,
                        TimesCited = 0
                        }
            );

        AddMap<Case>(cases =>
                        from c in cases
                        from cited in c.CasesCitedByThisCase
                        select new
                        {
                            Id = cited,
                            Name = (string)null,
                            TimesCited = 1
                        }
            );

        Reduce = results =>
                    from result in results
                    group result by result.Id
                    into g
                    select new
                    {
                    Id = g.Key,
                    Name = g.Select(x => x.Name).FirstOrDefault(x => x != null),
                    TimesCited = g.Sum(x => x.TimesCited)
                    };
    }
}

We are basically doing two passes on the cases, one to get the actual case information, the next to get the information about the cases it cite. We then take all of that information together and reduce it together, resulting in the output that we wanted.

The really nice thing about this? This being a RavenDB index, all of the work is done once, and queries on this are going to be really cheap.

Tags:

Posted By: Ayende Rahien

Published at

Originally posted at

Comments

Andreas Kroll
08/06/2012 10:23 AM by
Andreas Kroll

Hi Ayende,

cool stuff. And published totally at the right time. I just have this "problem" ahead of me.

One interesting question would be if it is possible to count not only the immediate childs, but all childs that have the current item as parent. There are a few problems within that approach, because this only works if you have a directed and loop free graph. But if you take an org-chart as example, it is not possible that one person is above and below another person in the hierarchy.

Do you also have a solution to count all descendants?

Thanks Andreas

Ayende Rahien
08/06/2012 10:47 AM by
Ayende Rahien

Andreas, If you have all of the parents in each docs, that is easy. Otherwise, it gets complex and probably not worth it.

Andreas Kroll
08/06/2012 11:21 AM by
Andreas Kroll

Ayende, I was not trying to have a node have multiple parents, but rather leave the structure as you showed it, but instead of counting only immediate childs I'd like to count all childs below a node.

A --- B | |---C----D |-----E

So B, D and E would have a child count of 0 C has a child count of 2 A has a child count of 4

Would that be possible to achieve with map/reduce?

Thanks Andreas

Andreas Kroll
08/06/2012 11:22 AM by
Andreas Kroll

Sorry, the comment engine reformatted the picture of the nodes

A has childs B and C C has childs D and E B, D and E have no childs

configurator
08/06/2012 03:08 PM by
configurator

Here's another interesting case. Say I have some kind of hierarchical data, and I want to show its tree. So when accessing a page /a/b/c/d we'd show a tree that shows all the parentage (a, b, and c), and d and its children.

Suppose we model our documents so that they only have the Children property and not a Parent (or Ancestry), could you create a Map/Reduce index that returns the ancestry for an object?

Ayende Rahien
08/06/2012 04:01 PM by
Ayende Rahien

Andreas, You would have to have all of the parents available in the document. So:

e-> a, b,c,d d->a,b,c c->a,b b-> a a

If you don't have this, then no, you would have to do multiple passes, and we don't allow it.

Ayende Rahien
08/06/2012 04:02 PM by
Ayende Rahien

Configurator, No need for map/reduce here, you could do that, but you would require multiple queries (O(H), where H is the height).

configurator
08/06/2012 04:09 PM by
configurator

The question is, could you prepare a map/reduce index so that it can be done in a single query, e.g. the results would be { document: "docs/17", parents: [ "docs/4", "docs/9", "docs/13" ] }

Ayende Rahien
08/06/2012 04:10 PM by
Ayende Rahien

Configurator, If the Children property on the parents was recursive (So it contains ALL children, not just direct ones), yes, easily. Otherwise, no.

Ajai
08/06/2012 08:08 PM by
Ajai

Just curious on how re indexing works in RavenDB.

Say I delete a Case, or update a case (add/remove cases CitedByThisCase) - do you re-run map/reduce on the whole case document collection?

Ajai

Andreas Kroll
08/06/2012 10:29 PM by
Andreas Kroll

Hi Ayende,

thanks for the clarification. You are - as most of the time - right that this requirement would indicate a needed change in the model itself. A change in a single node could otherwise trigger a lot of changes in the index and that really would not be performant at all.

As Configurator was also interested in some sort of hierarchical information.... how would you tackle such a task so it could be implemented with RavenDB?

The classical example in SQL is having an org-chart (person and his/her manager) and the usage of a hierarchical function. You can query for a single node and have all its children or all its parents in the "flat" output.

How would that translate to RavenDB? A recursive function in SQL is no "index" of any sort, but rather a function running on the server to "iterate" over the hierarchical layers and aggregate the data.

How would you perform that sort of task in RavenDB?

Thanks Andreas

Ayende Rahien
08/07/2012 04:06 AM by
Ayende Rahien

Ajai, We would re-run things for the specific case that was updated/deleted.

Ayende Rahien
08/07/2012 04:08 AM by
Ayende Rahien

Andreas, RavenDB contains builtin support for recursion. Here is an example on one such index:

from ex in examples from ov in ex.Overrides let ancestry = Recurse(ov, x => x.Parent) select new { ex.Id, ov.OwnerId, Name = ov.OverriddenValues["Name"] ?? (ancestry.Any(x => x.OverriddenValues["Name"] != null) ? ancestry.First(x => x.OverriddenValues["Name"] != null).OverriddenValues["Name"] : ex.Name), Description = ov.OverriddenValues["Description"] ?? (ancestry.Any(x => x.OverriddenValues["Description"] != null) ? ancestry.First(x => x.OverriddenValues["Description"] != null).OverriddenValues["Description"] : ex.Description) });

You could do this in the transform results and load the entire hierarchy along with the child object.

Ajai
08/07/2012 04:43 AM by
Ajai

Hi Ayende

Not understanding this correctly.

Say:

C1 cites C2 & C3 C4 cites C2

Run map/reduce:

C2 -> TimesCited = 2 C3 -> TimesCited = 1

When C1 is deleted how can indexing work just by reprocessing just the deleted record?

Do we not need to rerun map/reduce on whole collection to recompute TimesCited for all then cases? Or are somehow intermediate results also stored?

Any pointers on the implementation or link to relevant source is much appreciated...

Thanks

Ajai

Ayende Rahien
08/07/2012 04:46 AM by
Ayende Rahien

Ajai, We store some of the intermediate results, exactly for this purpose. It basically works by saying:

Delete C1, figure out what part of the map/reduce would be affected, re-run those parts.

Ajai
08/07/2012 04:56 AM by
Ajai

Thanks Ayende

That is super cool! Will try dig into the implementation...

Ajai

Andreas Kroll
08/07/2012 01:54 PM by
Andreas Kroll

Hi Ayende,

man.... I must admit you never cease to amaze with RavenDB. I never thought that recursive actions would be part of the concept.

Awesome!!!

Is there any source I could read more about those things?

Thanks Andreas

Ayende Rahien
08/07/2012 01:55 PM by
Ayende Rahien

Andreas, The mailing list is the best place for that. We have a BIG documentation push coming up.

Jack Jones
08/08/2012 03:21 AM by
Jack Jones

This sort of stuff is trivially simple with graph databases. If the relationships between records are equally as important as the record contents, I would recommend you check out graph db's and some of the algorithms that they make simple.

Oren: I know you were looking at implementing a graph db on top of Raven at some point, but I understand that you dropped it due to market size? I wish you had continued with that.

Ayende Rahien
08/08/2012 04:01 AM by
Ayende Rahien

Jack, I could have create a bad impl of a graph db, or created a really awesome doc db.

Comments have been closed on this topic.