Graphs in RavenDB: The overall design
Note: This series of posts is about a planned feature, exploring how we go about building it. This is meant to solicit feedback and get more eyes on the idea. Things aren’t set in stone and we don’t have a firm release date for this.
We have been wanting to add graph queries to RavenDB for several years now, but we always had more important things get in the way. That didn’t prevent us from discussing this internally and sketching out a few options. We are now looking at this more seriously, and I thought that sharing the details of our deliberations would be interesting and likely to garner us some valuable feedback. I’m going to assume that the reader is at least somewhat familiar with the notion of graph data and graph queries.
Probably the most well known graph database is Neo4J, which provides the notion of nodes and edges, both of which have a type and a set of (flat) properties. This allows you to define a model of arbitrary complexity. This works if your model is purely graph based, but it doesn’t work for RavenDB, whose users are used to the document model. On the surface, this looks like a minor detail. RavenDB has documents, which can have any shape, including embedded values and collections inside them. Neo4J, on the other hand, models things differently. The simplest example that I can think of is Orders and Order Lines, where you’ll have the following models:
Both models have the same information, but each element in the Neo4J graph is an independent node that is linked to the others. With RavenDB, on the other hand, we have a single document that embeds a lot of the information directly. Note that what the image doesn’t show is that RavenDB has other documents as well. The products, for example, are separate documents.
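To make the contrast concrete, here is a minimal sketch in plain Python. All the ids, field names and edge types below are invented for illustration (they are assumptions, not the content of the original image), and neither structure is meant to be the actual storage format of RavenDB or Neo4J.

```python
# Hypothetical sample data, invented for this sketch.

# Document model: a single order document embeds its lines
# and references the (separate) product documents by id.
order_document = {
    "id": "orders/2",
    "lines": [
        {"product": "products/1", "quantity": 2},
        {"product": "products/7", "quantity": 1},
    ],
}

# Property-graph model: every element is its own node with flat
# properties, and relationships are separate, typed edges.
nodes = {
    "order-2": {"type": "Order"},
    "line-1": {"type": "OrderLine", "quantity": 2},
    "line-2": {"type": "OrderLine", "quantity": 1},
    "product-1": {"type": "Product"},
    "product-7": {"type": "Product"},
}
edges = [
    ("order-2", "HAS_LINE", "line-1"),
    ("order-2", "HAS_LINE", "line-2"),
    ("line-1", "OF_PRODUCT", "product-1"),
    ("line-2", "OF_PRODUCT", "product-7"),
]


def document_order_lines(doc):
    """Read (product, quantity) pairs straight out of the document."""
    return [(line["product"], line["quantity"]) for line in doc["lines"]]


def graph_order_lines(order_node):
    """Rebuild the same pairs by walking two hops of edges."""
    result = []
    for src, kind, line in edges:
        if src == order_node and kind == "HAS_LINE":
            for src2, kind2, product in edges:
                if src2 == line and kind2 == "OF_PRODUCT":
                    result.append((product, nodes[line]["quantity"]))
    return result
```

Both functions return the same information, but the document version reads it in one place while the graph version has to follow edges between independent nodes.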
Graph databases are often used as the basis of recommendation engines, fraud detection, etc. But they are usually used to augment the capabilities of the system, rather than as the primary data store of applications. RavenDB, on the other hand, is most frequently deployed as the primary data store. We want to give our users the ability to perform graph operations, but we don’t want to lose anything that makes RavenDB useful and easy to use.
We initially thought about having the following definition:
- Each document is (implicitly) a node in the graph.
- You can call Link(src, dest, type, attributes) to create an edge between any two documents.
- Provide the usual graph queries on top of that.
We started exploring this implementation, but it quickly led to mounting complexity. From the point of view of the user, it meant additional work: you’d have to maintain your document model and the edges at the same time. This allows you to do some interesting things, but it is also likely to cause complications down the line, and very likely to cause issues when the document model and graph model disagree with one another. Other issues relate to how you handle graphs in a distributed manner. How do you deal with the creation of an edge between two documents on one node when one of them was deleted on another?
We pushed in that direction for a while, because that was the obvious thing to do, but it really turned out to be a bad idea that didn’t play well with the rest of RavenDB. The worst part was the fact that you might modify the document properties but not define the edge, which leads to inconsistency. This was very easy to do.
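Here is a small sketch of how easily that drift happens. It is plain Python with an in-memory dictionary and set standing in for the database; `put`, `link` and the helper functions are hypothetical names for this illustration, loosely following the Link(src, dest, type, attributes) shape above, and are not a RavenDB API.

```python
# In-memory stand-ins for the document store and the edge store (assumptions).
documents = {}
edges = set()  # each edge is (src, dest, type)


def put(doc_id, doc):
    """Store or overwrite a document."""
    documents[doc_id] = doc


def link(src, dest, edge_type):
    """Explicitly create an edge between two documents."""
    edges.add((src, dest, edge_type))


def referenced_products(doc_id):
    """Products the document itself says it refers to."""
    return {line["product"] for line in documents[doc_id]["lines"]}


def linked_products(doc_id):
    """Products the explicit edges say it refers to."""
    return {dest for src, dest, _ in edges if src == doc_id}


# The user creates the order and remembers to create the edge.
put("orders/2", {"lines": [{"product": "products/1"}]})
link("orders/2", "products/1", "product")

# Later, the order is changed, but nobody updates the edges.
put("orders/2", {"lines": [{"product": "products/7"}]})

stale_edges = linked_products("orders/2") - referenced_products("orders/2")
missing_edges = referenced_products("orders/2") - linked_products("orders/2")
```

After the second `put`, the edge store still points at products/1 while the document points at products/7, and nothing in the system forces the two to agree.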
The next thing we played with was to remove the Link() call and allow the user to define a background operation that would go and create the links between documents automatically whenever they were updated. This would allow us to avoid having any inconsistencies between the data in the documents and the links between them. After thinking about this for a while, we went ahead with this approach, but removed the requirement for a background operation.
RavenDB will be able to use your existing document model as the graph model as well. In other words, in the model above, you have the orders/2 document, which has two links, one for each of the products. This gives us both a well defined document model, with its well known Domain Driven architecture, and the ability to hop across all the pre-existing links that we have in the model.
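One way to picture "the document model is the graph model" is a function that derives the edges from the documents instead of storing them separately. The sketch below is plain Python and rests on a loud assumption: it treats any string value that happens to be the id of another document as a link. How RavenDB will actually detect or declare links is not specified here, and the sample documents are invented.

```python
def find_links(value, known_ids, path=""):
    """Yield (path, target_id) for every string that is a known document id."""
    if isinstance(value, dict):
        for key, child in value.items():
            child_path = f"{path}.{key}" if path else key
            yield from find_links(child, known_ids, child_path)
    elif isinstance(value, list):
        for child in value:
            yield from find_links(child, known_ids, f"{path}[]")
    elif isinstance(value, str) and value in known_ids:
        yield (path, value)


# Hypothetical documents, keyed by id.
documents = {
    "orders/2": {
        "lines": [
            {"product": "products/1", "quantity": 2},
            {"product": "products/7", "quantity": 1},
        ],
    },
    "products/1": {"name": "Chai"},
    "products/7": {"name": "Pears"},
}

order_links = list(find_links(documents["orders/2"], documents.keys()))
```

Because the edges are computed from the document itself, there is no second copy to fall out of sync when the document changes.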
I’ll discuss the querying model and how it all plays together in a future post. For now, I want to show you what this looks like when we want to do a typical graph operation, friends of friends:
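To make the operation itself concrete, here is a minimal sketch in plain Python. It is not RavenDB query syntax. The `users` documents and their `friends` field are hypothetical, and the traversal simply follows the ids already embedded in each document.

```python
# Hypothetical user documents; each one lists its friends by document id.
users = {
    "users/1": {"name": "Ann", "friends": ["users/2", "users/3"]},
    "users/2": {"name": "Bob", "friends": ["users/1", "users/4"]},
    "users/3": {"name": "Cid", "friends": ["users/1", "users/4", "users/5"]},
    "users/4": {"name": "Dee", "friends": ["users/2", "users/3"]},
    "users/5": {"name": "Eve", "friends": ["users/3"]},
}


def friends_of_friends(user_id):
    """Return ids two hops away, excluding the user and their direct friends."""
    direct = set(users[user_id]["friends"])
    second_hop = set()
    for friend_id in direct:
        second_hop.update(users[friend_id]["friends"])
    return second_hop - direct - {user_id}
```

For example, `friends_of_friends("users/1")` walks from users/1 to users/2 and users/3, and then on to their friends, without any edges having been defined outside the documents.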
More details will come in the next post…