Self “joins” in RavenDB
A customer had an interesting problem on the mailing list. He had the following model:
public class Case
{
    public string Id { get; set; }
    public string[] CasesCitedByThisCase { get; set; }
    public string Name { get; set; }
    // other interesting properties
}
And he needed to be able to query and sort by the case's properties, but also by its importance. A case's importance is determined by how many other cases cite it.
This seems hard to do, because you cannot access other documents during standard indexing, so you have to use map/reduce to make this work. The problem is that a plain map/reduce is not enough here; you need two disparate sets of information from the documents.
This is why we have multi-map, and we can create the appropriate index like this:
public class Cases_Search : AbstractMultiMapIndexCreationTask<Cases_Search.Result>
{
    public class Result
    {
        public string Id { get; set; }
        public int TimesCited { get; set; }
        public string Name { get; set; }
    }

    public Cases_Search()
    {
        AddMap<Case>(cases =>
            from c in cases
            select new
            {
                c.Id,
                c.Name,
                TimesCited = 0
            });

        AddMap<Case>(cases =>
            from c in cases
            from cited in c.CasesCitedByThisCase
            select new
            {
                Id = cited,
                Name = (string)null,
                TimesCited = 1
            });

        Reduce = results =>
            from result in results
            group result by result.Id into g
            select new
            {
                Id = g.Key,
                Name = g.Select(x => x.Name).FirstOrDefault(x => x != null),
                TimesCited = g.Sum(x => x.TimesCited)
            };
    }
}
We are basically doing two passes on the cases: one to get the actual case information, the other to get the information about the cases it cites. We then take all of that information and reduce it together, resulting in the output that we wanted.
The really nice thing about this? This being a RavenDB index, all of the work is done once, and queries on this are going to be really cheap.
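To make this concrete, here is a minimal sketch of how such an index might be queried from the client, sorting cases by how often they are cited. The store variable and the exact sorting behavior are assumptions (older client versions may require explicitly configuring sorting for the TimesCited field), so treat this as illustrative rather than canonical:

using (var session = store.OpenSession())
{
    // Query the map/reduce results directly, ordering by the computed citation count.
    var mostCited = session.Query<Cases_Search.Result, Cases_Search>()
        .OrderByDescending(x => x.TimesCited)
        .Take(10)
        .ToList();

    foreach (var result in mostCited)
    {
        Console.WriteLine("{0} ({1}) was cited {2} times", result.Name, result.Id, result.TimesCited);
    }
}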
Comments
Hi Ayende,
cool stuff. And published totally at the right time. I just have this "problem" ahead of me.
One interesting question would be if it is possible to count not only the immediate children, but all children below the current item. There are a few problems with that approach, because this only works if you have a directed, loop-free graph. But if you take an org chart as an example, it is not possible that one person is both above and below another person in the hierarchy.
Do you also have a solution to count all descendants?
Thanks Andreas
Andreas, If you have all of the parents in each doc, that is easy. Otherwise, it gets complex and probably not worth it.
Ayende, I was not trying to have a node have multiple parents, but rather to leave the structure as you showed it, only instead of counting only immediate children I'd like to count all children below a node.
A
|-- B
\-- C
    |-- D
    \-- E
So B, D and E would have a child count of 0.
C has a child count of 2.
A has a child count of 4.
Would that be possible to achieve with map/reduce?
Thanks Andreas
Sorry, the comment engine reformatted the picture of the nodes
A has children B and C.
C has children D and E.
B, D and E have no children.
Here's another interesting case. Say I have some kind of hierarchical data, and I want to show its tree. So when accessing a page /a/b/c/d we'd show a tree that shows all the parentage (a, b, and c), and d and its children.
Suppose we model our documents so that they only have the Children property and not a Parent (or Ancestry), could you create a Map/Reduce index that returns the ancestry for an object?
Andreas, You would have to have all of the parents available in the document. So:
e -> a, b, c, d
d -> a, b, c
c -> a, b
b -> a
a
If you don't have this, then no, you would have to do multiple passes, and we don't allow it.
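For what it's worth, if the model were changed so that each document carried the full list of its ancestors, counting all descendants becomes the same kind of multi-map/reduce as the citation count above. The Node class and its Ancestors property here are hypothetical, purely to illustrate the shape:

public class Node
{
    public string Id { get; set; }
    public string Name { get; set; }
    public string[] Ancestors { get; set; } // all ancestors, not just the direct parent (hypothetical model)
}

public class Nodes_DescendantCount : AbstractMultiMapIndexCreationTask<Nodes_DescendantCount.Result>
{
    public class Result
    {
        public string Id { get; set; }
        public int Descendants { get; set; }
    }

    public Nodes_DescendantCount()
    {
        // Each node contributes a zero count for itself...
        AddMap<Node>(nodes => from n in nodes
                              select new { n.Id, Descendants = 0 });

        // ...and a count of one to every one of its ancestors.
        AddMap<Node>(nodes => from n in nodes
                              from ancestor in n.Ancestors
                              select new { Id = ancestor, Descendants = 1 });

        Reduce = results => from result in results
                            group result by result.Id into g
                            select new { Id = g.Key, Descendants = g.Sum(x => x.Descendants) };
    }
}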
Configurator, No need for map/reduce here, you could do that, but you would require multiple queries (O(H), where H is the height).
The question is, could you prepare a map/reduce index so that it can be done in a single query, e.g. the results would be { document: "docs/17", parents: [ "docs/4", "docs/9", "docs/13" ] }
Configurator, If the Children property on the parents was recursive (So it contains ALL children, not just direct ones), yes, easily. Otherwise, no.
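As a rough sketch of what that could look like under that assumption (a recursive Children property containing all descendants, which is not part of the original model), something along these lines would produce the { document, parents } shape Configurator asked for:

public class HierarchyNode
{
    public string Id { get; set; }
    public string[] Children { get; set; } // all descendants, not just direct children (hypothetical model)
}

public class Nodes_Ancestry : AbstractIndexCreationTask<HierarchyNode, Nodes_Ancestry.Result>
{
    public class Result
    {
        public string Document { get; set; }
        public string[] Parents { get; set; }
    }

    public Nodes_Ancestry()
    {
        // Each node emits one entry per (recursive) child, naming itself as a parent.
        Map = nodes => from n in nodes
                       from child in n.Children
                       select new { Document = child, Parents = new[] { n.Id } };

        // Grouping by child collects all of its ancestors into a single array.
        Reduce = results => from result in results
                            group result by result.Document into g
                            select new
                            {
                                Document = g.Key,
                                Parents = g.SelectMany(x => x.Parents).ToArray()
                            };
    }
}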
Just curious about how re-indexing works in RavenDB.
Say I delete a Case, or update a Case (add/remove CasesCitedByThisCase entries) - do you re-run map/reduce on the whole Case document collection?
Ajai
Hi Ayende,
thanks for the clarification. You are right, as usual, that this requirement would indicate a needed change in the model itself. A change in a single node could otherwise trigger a lot of changes in the index, and that really would not be performant at all.
Since Configurator was also interested in some sort of hierarchical information... how would you tackle such a task so it could be implemented with RavenDB?
The classical example in SQL is having an org-chart (person and his/her manager) and the usage of a hierarchical function. You can query for a single node and have all its children or all its parents in the "flat" output.
How would that translate to RavenDB? A recursive function in SQL is no "index" of any sort, but rather a function running on the server to "iterate" over the hierarchical layers and aggregate the data.
How would you perform that sort of task in RavenDB?
Thanks Andreas
Ajai, We would re-run things for the specific case that was updated/deleted.
Andreas, RavenDB contains builtin support for recursion. Here is an example of one such index:
from ex in examples
from ov in ex.Overrides
let ancestry = Recurse(ov, x => x.Parent)
select new
{
    ex.Id,
    ov.OwnerId,
    Name = ov.OverriddenValues["Name"] ??
           (ancestry.Any(x => x.OverriddenValues["Name"] != null)
                ? ancestry.First(x => x.OverriddenValues["Name"] != null).OverriddenValues["Name"]
                : ex.Name),
    Description = ov.OverriddenValues["Description"] ??
           (ancestry.Any(x => x.OverriddenValues["Description"] != null)
                ? ancestry.First(x => x.OverriddenValues["Description"] != null).OverriddenValues["Description"]
                : ex.Description)
}
You could do this in the transform results and load the entire hierarchy along with the child object.
Hi Ayende
Not understanding this correctly.
Say:
C1 cites C2 & C3
C4 cites C2
Run map/reduce:
C2 -> TimesCited = 2
C3 -> TimesCited = 1
When C1 is deleted, how can indexing work by reprocessing just the deleted record?
Do we not need to rerun map/reduce on the whole collection to recompute TimesCited for all the cases? Or are intermediate results somehow also stored?
Any pointers on the implementation or link to relevant source is much appreciated...
Thanks
Ajai
Ajai, We store some of the intermediate results, exactly for this purpose. It basically works by saying:
Delete C1, figure out what part of the map/reduce would be affected, re-run those parts.
Thanks Ayende
That is super cool! Will try to dig into the implementation...
Ajai
Hi Ayende,
man.... I must admit you never cease to amaze with RavenDB. I never thought that recursive actions would be part of the concept.
Awesome!!!
Is there any source I could read more about those things?
Thanks Andreas
Andreas, The mailing list is the best place for that. We have a BIG documentation push coming up.
This sort of stuff is trivially simple with graph databases. If the relationships between records are equally as important as the record contents, I would recommend you check out graph db's and some of the algorithms that they make simple.
Oren: I know you were looking at implementing a graph db on top of Raven at some point, but I understand that you dropped it due to market size? I wish you had continued with that.
Jack, I could have created a bad impl of a graph db, or created a really awesome doc db.