Ayende @ Rahien

Hi!
My name is Oren Eini
Founder of Hibernating Rhinos LTD and RavenDB.
You can reach me by email or phone:

ayende@ayende.com

+972 52-548-6969

, @ Q j

Posts: 6,726 | Comments: 48,713

filter by tags archive

Graphs in RavenDBGraph modeling vs. document modeling

time to read 3 min | 445 words


One of the most important design decisions we made with RavenDB is not forcing users to explicitly create edges between documents. Instead, the edges are actually just normal properties on the documents and can be used as-is. This means that pretty much any existing RavenDB database can immediately start using graph operations, you don’t need to do anything.

The image below shows an order, using the RavenDB’s sample Northwind dataset. The highlighted portions mark the edges from this document. You can use these to traverse the graph by hopping from document to document.

image

For example, using:

image

This is easy to do, migrates well (zero cost to do so, yeah!) and usually matches nicely with what you already have. It does lead to an interesting observation. Typically, in a graph DB, you’ll model everything as a graph. But with RavenDB, you don’t need to do that.

In particular, let’s take a look at the Neo4J’s rendition of Northwind:

image

As you can see, everything is modeled as a node / edge. This is the only thing you could model it as. With RavenDB, you would typically use a domain driven model. In this case, it means that a value object, like an OrderLIne, will not have its own concrete existence. Either as a node or an edge. Instead, it will be embedded inside its root aggregate (the order).

Note that this is actually quite interesting, because it means that we need to be able to provide the ability to query on complex edges, such as the order lines. Here is how works:

image

This will give us all the discount products sold in London as well as their discount rate.

Note that in here, unlike previous queries, we use an named alias for the edge. In this case, it gives us the ability to access it properties and project the line’s Discount property to the user. This means that you can have a domain model with strong cohesion and locality, following the domain driven design principles while still being able to run arbitrary graph queries on it.  Combining this with the ability to pull data from indexes (including map/reduce) ones, you have a lot of things that you can do that used to be very hard but now are easy.

Graphs in RavenDBPre-processing the queries

time to read 4 min | 744 words

A query has two audiences: the users and the query engine. Ideally, you need to come up with a query language that would serve both. One of the early decisions that we made with the query language is that we want to be:

  • Very flexible for the user, giving them several ways to express themselves.
  • Be very rigid in the query engine, with only one way to do something.

These two requirements are directly contradicting one another, which is indeed somewhat of a problem. The key here is that we don’t want to produce multiple ways to do the same thing in the query engine. That is a great way to introduce:

  • Different actual execution plans.
  • Features that only work with a specific syntax.
  • More complexity overall.

Anyone who ever worked with the internals of Linq can attest to the complexity that is involved here.

Let’s take the simple query that we have been inspecting so far:

image

Now, let’s ask RavenDB to spit it back out for us, shall we? Here is how RavenDB thinks about this query:

image

In other words, the way RavenDB sees the query and the way the user sees the query are very different. You can see that we have the with edges clauses here, defining the edges on the query.

In other words, all of the query definitions are happening in the with and with edges clauses. When we need to actually perform the matches, the match clause only defines the graph pattern that we need to match on.  It is the responsibility of the query parser to arrange the query from the multiple ways that the user may want to define it to the single representation that is actually going to be executed by the query runner.

This may seem like a lot of ceremony, but that is only because we have a very simple query. Let’s change the “friends of friends who aren’t my friends” to something a bit more interesting: “Close friends of my close friends who aren’t my friends”. We are also going to want to limit the friends that we follow only to Users (so, for example, we’ll not follow a FriendOf link to a Pet).

Here is what the query looks like, when we use more concise syntax, and how RavenDB translates it:

image

You’ll note that even for the query above, I still used a separate with clause to make things easier, the following query is exactly the same:

image

The basic idea is that for trivial filtering, you’ll probably want to do that inline, inside the match clause. But anything more complex should go to the with clause where you can more easily express your logic.  Also note that aliases matter. The f1 and f2 here are not duplicated for no reason, part of processing the query is to bind a value to each of the aliases, and you cannot bind a single result to multiple aliases.

Another key aspect of this mode is that while this is pretty easy to follow, a with clause can contain any query. That means that you can use indexes as well, including Map/Reduce indexes. Here is one such example:

image

In this case, I”m not sure how good a graph query this is, I’ll admit, but it does a good job of demonstrating what you can do. We are taking a few queries, mixing them together and then mashing the results to find London companies who didn’t order as much as they used to.

This means that the source information for graph queries can be things like spatial queries, full text search, map/reduce, etc. A lot of the complexity in graphs queries is just getting to do the start of the graph pattern matching. With RavenDB, you have a very strong query language and facilities to help you get past that and directly into the graph operations.

This is enough about the pre-processing the query, in my next post, I’m going to go into depth into how graph queries work with document models.

Graphs in RavenDBThe query language

time to read 4 min | 781 words

Pretty much all our early discussions about graphs in RavenDB focused on how to build the actual graph implementation. How to allow fast traversal, etc. When we started looking at the actual implementation, we realized that we seriously neglected a very important piece of the puzzle, the query interface for the graphs.

This is important for several reasons. First, ergonomics matter, if we end up with a query language that is awkward, it won’t see much use and complicate the users’ lives (and our own). Second, the query language effectively dictate how the user think about the model, so making low level decisions that would have impact on how the user is actually using this feature is probably not a good idea yet. We need to start from the top, what do we give to the user, and then see how we can make that a reality.

The most common use case of graph queries is the friends of friends query. Let’s see how this query is handled in various existing implementation, shall we?

Neo4J, using Cypher:

image

OrientDB doesn’t seem to have an easy way to do this. The following shows how you can find the 2nd degree friends, but it doesn’t exclude friends of friends who are already your friends. StackOverflow questions on that show scary amount of code, so I’m going to skip them.

image

Gremlin, which is used in a wide variety of databases:

image

We looked at other options, but it seems that graph query languages fall into the following broad categories:

  • ASCII art to express the relationship between the nodes.
  • SQL extensions that express the relationships as nested queries.
  • Method calls to express the traversal.

Of the three options, we found the first option, using ASCII Art / Cypher as the easier one to work with. This is true both in terms of writing the query and actually executing it.

Let’s look at how friends of friends query will look like in RavenDB:

image

Graph queries are composed of two portions:

  • With clauses, which determine source point for the graph traversal.
  • Match clause (singular) that contain the graph pattern that we need to match on.

In the case, above, we are starting the graph traversal from start, this is defined as a with clause. A query can have multiple with clauses, each defining an alias that can be used in the match clause. The match clause, on the other hand, uses these aliases to decide how to process the query.

You can see that we have two clauses in the above query, and the actual processing is done by pattern matching (to me, it make sense to compare it to regular expressions or Prolog). It would probably be easier to show this with an example. Here is the relationship graphs among a few people:

image

We’ll set the starting point of the graph as Arava and see how this will be processed in the query.

For the first clause, we’ll have:

  • start (Arava) –> f1 (Oscar) –> f2 (Phoebe)
  • start (Arava) –> f1 (Oscar) –> f2 (Sunny)
  • start (Arava) –> f1 (Sunny) –> f2 (Phoebe)
  • start (Arava) –> f1 (Sunny) –> f2 (Oscar)

For the second clause, of the other hand, have:

  • start (Arava) –> f2 (Oscar)
  • start (Arava) –> f2 (Sunny)

These clauses are joined using and not operator. What this means is that we need to exclude from the first clause anything that matches on the second cluase. Match, in this case, means the same alias and value for any existing alias.

Here is what we need up with:

  • start (Arava) –> f1 (Oscar) –> f2 (Phoebe)
  • start (Arava) –> f1 (Oscar) –> f2 (Sunny) 
  • start (Arava) –> f1 (Sunny) –> f2 (Phoebe)
  • start (Arava) –> f1 (Sunny) –> f2 (Oscar)

We removed two entries, because they matched the entries from the second clause. The end result being just friends of my friends who aren’t my friends.

The idea with behind the query language is that we want to be high level and allow you to express what you want, and we’ll be in charge of actually making this work properly.

In the next post, I’ll talk a bit more about the query language, what scenarios it enables and how we are going to go about processing queries.

Graphs in RavenDBThe overall design

time to read 5 min | 863 words

Note: These series of posts are about a planned feature, exploring how we go about building it. This is meant to solicit feedback and get more eyes on the idea, things aren’t set in stone and we don’t have a firm release date on this.

We have been wanting to add graph queries to RavenDB for several years now, but we always had more important things get in the way. That didn’t prevent us from discussing this internally and sketch up a few options. We are now looking at this more seriously and I thought that sharing the details of our deliberations would be interesting and likely to garner us some valuable feedback. I’m going to assume that the reader is at least somewhat familiar with the notion of graph data and graph queries.

Probably the most well known graph database is Neo4J, which provides the notion of nodes and edges, both of which have a type and a set of (flat) properties. This allow you to define a model of any arbitrary complexity. This works if you model is purely graph based, but it doesn’t work for RavenDB, whose users are used to the document model. On the surface, this looks like a minor detail. RavenDB has documents, which can have any shape, including containing embedded values and collections inside them. Neo4J, on the other hand, model things differently. The simplest example that I can think of is Orders and Order Lines, where you’ll have the following models:

Neo4J

RavenDB

image image

Both models have the same information, but each element in the Neo4J graph is an independent node that is linked to the others. On the other hand, with RavenDB, we have a single document that embeds a lot of the information directly.  Note that what we haven’t shown in the image is that in RavenDB as well, you have other documents as well. The products, for example, are separate documents. 

Graph databases are often used to handle the basis of recommendation engines, fraud detection, etc. But they are usually used to augment the capabilities of the system, rather than as the primary data store of applications. RavenDB, on the other hand, is most frequently deployed as the primary data store. We want to give our users the ability to perform graph operations, but we don’t want to lose anything that make RavenDB useful and easy to use.

We initially thought about having the following definition:

  • Each document is (implicitly) a node in the graph.
  • You can call Link(src,dest,type, attributes) to create an edge between any two documents.
  • Provide the usual graph queries on top of that.

We started exploring this implementation, but it quickly led to mounting complexity. From the point of view of the user, it led to having to do additional work, you’ll have to maintain your document model and the edges at the same time. This allow you to do some interesting things, but it also likely to cause complications down the line and very likely to cause issues when the document model and graph model disagree with one another. Other issues relates to how do you handle graphs in a distributed manner. How do you deal with the creation on an edge between two documents on one node when one of them was deleted on another?

We pushed in that direction for a while, because that was the obvious thing to do, but it really turned up to be a bad idea which didn’t play well with the rest of RavenDB. The worst part was the fact that you might modify the document properties but not define the edge, which lead to inconsistency. This was very easy to do.

The next thing we played with was to remove the Link() call and allow the user to define a background operation that would go and create the links between documents automatically whenever they were updated. This would allow us to avoid having any inconsistencies between the data in the documents and the links between then. After thinking about this for a while, we went ahead with this approach, but removed the requirement for a background operations.

RavenDB will be able to use your existing document model as the graph model as well. In other words, in the model above, you have the orders/2 document, which has two links, for each of the products. This give us both the ability to have a well define document model, with its well known Domain Driven architecture and the ability to hop off all the pre-existing links that we have in the model.

I’ll discuss the querying model and how it all plays together in a future post. For now, I want to show you how this looks like when we want to do a typical graph operation, friends of friends:

image

More details will come in the next post…

FUTURE POSTS

No future posts left, oh my!

RECENT SERIES

  1. Challenge (54):
    28 Sep 2018 - The loop that leaks–Answer
  2. Graphs in RavenDB (4):
    21 Sep 2018 - Graph modeling vs. document modeling
  3. Reviewing FASTER (9):
    06 Sep 2018 - Summary
  4. RavenDB 4.1 features (12):
    22 Aug 2018 - MongoDB & CosmosDB Migration Wizards
  5. Reading the NSA’s codebase (7):
    13 Aug 2018 - LemonGraph review–Part VII–Summary
View all series

Syndication

Main feed Feed Stats
Comments feed   Comments Feed Stats