Oren Eini

CEO of RavenDB

a NoSQL Open Source Document Database

time to read 6 min | 1032 words

I talked a lot about how graph queries in RavenDB will work, but one missing piece of the puzzle is how they are going to be used. I’m going to use this post to discuss some of the options that the new feature enables. We’ll take the following simple model: issue tracking. Let’s imagine that a big (secret) project is going on and we need to track down the tasks for it. On the right, you have an image of the permissions graph that we are interested in.

The permission model is pretty standard, I think. We have the notion of users and groups. A user can be associated with one or more groups. Group memberships are hierarchical. An issue’s access is controlled either by giving access to specific users or to a group. The act of assigning a group will also allow access to all the group’s parents and any user that is associated with any of them.

Here is the issue in question:

image

Given the graph on the right, let’s see, for each user, how we can find out what issues they have access to.

We’ll start with Sunny, who is named directly in the issue as allowed access. Here is the query to find the issues on which we are directly named.

image

This was easy, I must admit. We could also write this without any need for graph queries, because it is so simple:

image

It gets a bit trickier when we have to look at Max. Unlike Sunny, Max isn’t directly named. Instead, he is a member of a group that is directly named in the issue. Let’s see how we can query on issues that Max has access to:

image

This is where things start to get interesting. If you look at the match clause, you’ll see that we have arrows that go both left and right. We are telling RavenDB that we want to find issues that have the same group (denoted as g in the query) as the user. You can’t express this kind of query in RavenDB right now without the graph query syntax. Another way to write this query is to use explicit clauses, like so:

image

In this case, all the arrows go in a single direction, but we have two clauses that are joined with an and. These two queries are exactly the same. In fact, RavenDB will translate the one with arrows going both ways to the one with the and.

The key here is that between any two clauses with an and, RavenDB will match the same alias to the same value.
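
To make the alias rule concrete, here is a minimal Python sketch of the idea (this is not RavenDB code). Each clause is assumed to produce a list of alias-to-document bindings, and the and keeps only the combinations that agree on every alias they share. The document ids and the `join_on_shared_aliases` helper are made up for illustration.

```python
def join_on_shared_aliases(left, right):
    """Combine two lists of alias -> document bindings.

    A pair of rows is kept only when every alias that appears in both
    rows is bound to the same document.
    """
    results = []
    for l_row in left:
        for r_row in right:
            shared = l_row.keys() & r_row.keys()
            if all(l_row[alias] == r_row[alias] for alias in shared):
                results.append({**l_row, **r_row})
    return results


# Hypothetical bindings for the clause "user -> group".
user_groups = [{"u": "users/max", "g": "groups/project-x"}]

# Hypothetical bindings for the clause "issue -> group".
issue_groups = [
    {"i": "issues/1", "g": "groups/project-x"},
    {"i": "issues/2", "g": "groups/other"},
]

# Only issues/1 shares the group bound to "g" with the user.
matches = join_on_shared_aliases(user_groups, issue_groups)
print(matches)
```

The shared alias g is what ties the two clauses together: issues/2 is dropped because its g is bound to a different group.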

So we now can find issues for Max and Sunny, but what about all the rest of the people? We can, of course, do this manually. Here is how we can find issues for people a group removed from the issue.

image

This query gives us all the issues for Nati that are one group removed from him. That works, but it isn’t how we want to do things. It is a good exercise in setting out the structure, though. Because instead of hard coding the pattern, we are now going to use recursion.

image

The key for this query is the recursive element in the third line. We moved from the issue to its groups, and then we recurse from the issues’ groups to their parents. We allow empty recursion and we follow all the paths in the graph.

On the other side, we go from the user and try to find a match from the user’s group to any of the groups that end the query. In Nati’s case, we went from project-x group to team-nati, which is a match, so we can return this issue.

Here is the final query that we can use. It is a bit much, even if it is short, so I will deconstruct it below:

image

We use a parameterized query here, to make things easier. We start from a user and find an issue if:

  • The issue directly names the user as allowed access. This is the case for Sunny.
  • The issue’s groups (or their parents) match the groups that the user belongs to (non recursively, mind).

In the case of Max, for example, we have a match because the recursive element allows a zero length path, so the project-x group is allowed access and Max is a member of it.

In the case of Nati, on the other hand, we have to go from project-x to team-nati to get a match.

If we set $uid to users/23, which is Phoebe, all the way to the left in the graph, we’ll also have a match. We’ll go from project-x to execs to board and then find a match.

Snoopy, on the other hand, doesn’t have access. Look carefully at the direction of the arrows in the graph. Snoopy belongs to the r-n-d group, and that group is a child of execs, but the query we specified only goes up to the parents, not the children, so Snoopy is not included.
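
The same walkthrough can be sketched in plain Python (not RQL). The data below is reconstructed from the text of this post rather than taken from the actual documents, so treat the group layout and the `has_access` helper as assumptions; in particular, I’m assuming Sunny has no group memberships that matter here.

```python
# Assumed layout, reconstructed from the walkthrough: each group lists its parents.
GROUP_PARENTS = {
    "project-x": ["team-nati", "execs"],
    "execs": ["board"],
    "r-n-d": ["execs"],
    "team-nati": [],
    "board": [],
}

# Assumed direct (non recursive) group memberships of each user.
USER_GROUPS = {
    "Sunny": [],
    "Max": ["project-x"],
    "Nati": ["team-nati"],
    "Phoebe": ["board"],
    "Snoopy": ["r-n-d"],
}

# The issue names Sunny directly and grants access to the project-x group.
ISSUE = {"users": ["Sunny"], "groups": ["project-x"]}


def groups_with_access(issue):
    """Start from the issue's groups and walk up through all of their parents."""
    seen = set()
    pending = list(issue["groups"])
    while pending:
        group = pending.pop()
        if group in seen:
            continue
        seen.add(group)  # a zero length path keeps the issue's own groups
        pending.extend(GROUP_PARENTS.get(group, []))
    return seen


def has_access(user, issue):
    """A user has access if named directly or if one of their groups is reachable."""
    if user in issue["users"]:
        return True
    allowed = groups_with_access(issue)
    return any(group in allowed for group in USER_GROUPS.get(user, []))


for name in USER_GROUPS:
    print(name, has_access(name, ISSUE))
```

Note that the traversal only ever follows parent links, which is why r-n-d (a child of execs) is never reached.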

I hope that this post gave you some ideas about what kind of use cases are enabled by graph queries in RavenDB. I would love to hear about any scenarios you have for graph queries so we can play with them and see how they are answered by RavenDB.

time to read 3 min | 565 words

Graph queries, as I have discussed them so far, give you the ability to search for patterns. On the right, you can see the family tree of the royal family of Great Britain going back a few hundred years. That makes for an interesting subject for practicing graph queries.

A good example of a question we might want to ask is: who is the royal grandparent of Elizabeth II? We can do that using:

image

This is great, and nicely demonstrates how we can scan for specific patterns in the graph. However, it is limited by its rigidity. For example, what if I want to find someone in the family tree and I’m not sure about the exact nature of the relationship?

“We are not amused” comes to mind, but off the top of my head and without consulting the chart, I don’t think that I would be able to figure it out. Luckily, I don’t have to, I can ask RavenDB to be so kind and tell me.

image

Note the use of the recursive element here. We are asking RavenDB to start in a particular document and go up the parents, trying to find an unamused royal. The recursion portion of the query can be zero to six steps in size and should abort as soon as we have any match. Following the zero to six parents, there should be a parent that is both a royal and unamused.

The Cypher syntax for what they call variable length queries is reminiscent of regular expressions, and I don’t mean that in a complimentary manner. Looking at the query above, you might have noticed that there is a distinct difference between it and the first one. The recursive query will go up the Parents link, regardless of whether that parent is royal or not. RavenDB Graph Queries have what I believe to be a unique feature. The recursive pattern isn’t limited to a single step and can be as complex as you like.

For example, let’s ensure that we are only going to go up the chain of the royal parents.

image

The recursive element has a few knobs that you can tweak. The minimum and maximum distance are the obvious examples, but the results criteria for the recursion are also interesting. In this query, we use shortest instead of lazy. This will make RavenDB work a bit harder and find the shortest recursive path that matches the query, whereas lazy stops on the first one that matches. The following options are available:

  • Lazy – stop on the first pattern that matches. Good for: “Am I related to Victoria?”
  • Shortest – find the shortest path that matches the pattern. Good for: “How am I related to Victoria?”
  • Longest – find the longest path that matches the pattern. Good for: “For how many generations has Victoria’s family been royals?”
  • All – find all the paths that match the pattern. Good for if you have multiple paths in your ancestry to Victoria.
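
The difference between the four options can be sketched in Python. This is a simplified model of the idea, not RavenDB’s implementation: it enumerates paths up the parent links within a minimum and maximum number of steps and then picks a result according to the chosen mode. The tiny family tree is invented for the example and is not the real royal genealogy.

```python
def recursive_paths(start, parents, is_match, min_steps=0, max_steps=6):
    """Yield every path from start, going up the parent links, that ends in a match."""

    def walk(path):
        steps = len(path) - 1
        if steps >= min_steps and is_match(path[-1]):
            yield list(path)
        if steps < max_steps:
            for parent in parents.get(path[-1], []):
                if parent not in path:  # guard against cycles in bad data
                    yield from walk(path + [parent])

    yield from walk([start])


def pick(paths, mode):
    """Select results from an iterable of paths according to the recursion mode."""
    if mode == "lazy":
        return next(iter(paths), None)  # stop at the first match
    all_paths = list(paths)
    if mode == "all":
        return all_paths
    if not all_paths:
        return None
    if mode == "shortest":
        return min(all_paths, key=len)
    if mode == "longest":
        return max(all_paths, key=len)
    raise ValueError(f"unknown mode: {mode}")


# Hypothetical tree: two different routes lead from "me" to "victoria".
PARENTS = {
    "me": ["mother", "father"],
    "mother": ["victoria"],
    "father": ["grandfather"],
    "grandfather": ["victoria"],
}


def is_victoria(person):
    return person == "victoria"


shortest = pick(recursive_paths("me", PARENTS, is_victoria), "shortest")
longest = pick(recursive_paths("me", PARENTS, is_victoria), "longest")
every = pick(recursive_paths("me", PARENTS, is_victoria), "all")
print(shortest, longest, len(every))
```

Because `recursive_paths` is a generator, the lazy mode does no more work than it takes to reach the first match, while the other modes have to exhaust the search.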
time to read 3 min | 462 words

In my previous post, I discussed some options for changing the syntax of graph queries in RavenDB from Cypher to be more in line with the rest of the RavenDB Query Language. We have now completed that part and can see the real impact it has on the overall design.

In one of the design reviews, one of the devs (who has built non trivial applications using Neo4J) complained that the syntax is now much longer. Here are the before and after queries to compare:

image

The key, from my perspective, is that the new form is more explicit and easier to read after the fact. Queries tend to grow more complex over time, and they are read a lot more often than they are written. As such, I absolutely want to lean toward being readable over being terse.

The example above just shows the extra characters that you need to write. Let’s talk about something that is a bit more complex:

image

Now we have a lot more text, but it is a lot easier to understand what is going on. Focus especially on the Lines edge, where we can very clearly separate what constitutes the selection on the edge, the filter on the edge and the property that contains the actual linked document id.

The end result is that we now have a syntax that is a lot more consistent and approachable. There are other benefits, but I’ll show them off in the next post.

A major source of annoyance for me with this syntax was how to allow anonymous aliases. In the Cypher syntax we used, you could do something like:

image

There is a problem with how to express anonymous aliases with the Collection as alias mode. I initially tried to make it work by saying that we’ll look at the rest of the query and figure it out. But that just felt wrong. I didn’t like this inconsistency. I want a parse tree that I can look at in isolation and know what is going on. Simplifying the language is something that pays dividends over time, so I eventually decided that the query above will look like this with the new syntax:

image

There is a lot of precedent for using underscore as the “I don’t care” marker, so that works nicely and resolves any ambiguities in the syntax.

time to read 5 min | 981 words

When we started building support for graph queries inside RavenDB, we looked at the state of the market in this regard. There seem to be two major options: Cypher and Gremlin. Gremlin is basically a fluent interface that represents a specific graph pattern, while Cypher is a more abstract way to represent the graph query. I don’t like Gremlin, and it doesn’t fit into the model we have for RQL, so we went for the Cypher syntax. Note the distinction between went for Cypher and went for Cypher syntax.

One of our major requirements is fitting into the pre-existing Raven Query Language, but the first concern we had was just getting started and getting some idea about our actual scenarios. We are now at the point where we have written a bunch of graph queries and gained a lot more experience in how they mesh with the overall environment. And at this point, I can really feel that there is an issue in meshing Cypher syntax into RQL. They don’t feel the same at all. There are a lot of good ideas there, make no mistake, but we want to create something that would flow as a cohesive whole.

Let’s look at some of our queries and how we can better express them. The one I talked about the most is this:

image

Let’s see what we have here:

  • match is the overall clause that applies a graph pattern query to the dataset.
  • () – is an indication of a node in the graph.
  • [] – is an indication of an edge.
  • a:Dogs, l:Likes and b:Dogs – this is an alias and a path specification.
  • -[]-> – is an indication of an edge between two nodes
  • (expression) – is a filter on a node or an edge

I’m ignoring the select statement here because it is just the usual RQL select statement.

The first thing that keeps biting us is the filter in (a:Dogs (id() = 'dogs/arava')). I keep getting tripped up by missing the closing ), so that has got to go. Luckily, it is very obvious what to do here:

image

We use an explicit where clause, instead of the () to express the inline filter. This fits a lot more closely with how the rest of RQL works.

Now, let’s look at the aliases: (b:Dogs). The alias:Collection syntax is pretty foreign to RQL; we tend to use the Collection as alias syntax. Let’s see what that would look like, shall we?

image

This looks a lot more natural to me, and it is a good fit into RQL in general. This syntax does bring a few things to the table. In particular, look at the edge. In Cypher, an anonymous edge would be: [:Likes], and using this method, we will have just [Likes].

However, as nice as this syntax is, we still run into a problem. The query above is actually just a shorthand way to write the full query, which looks like so:

image

In fact, we have two queries here, to show off the actual problem we have in parsing. In the first case, we have a match clause that only refers to explicit with statements. In the second case, we have a couple of explicit with statements, but also an implicit with edges expression (the Likes).

From the point of view of the parser, we can’t distinguish those two. Now, we can absolutely say that if the edge expression contains a single name, we’ll simply look for an edge with that name and otherwise assume that this is the path that will be used.

But this seems to be error prone, because you might have a small typo or remove an edge statement and get a completely different (and unexpected) meaning. I thought about adding some sort of prefix to help tell an alias from an implicit definition, but that looks very ugly, see:

image 

And on the other hand, I really like the –[Likes]-> syntax in general. It is a lot cleaner and easier to read.

At this point, I don’t have a solution for this. I think we’ll go with the mode in which we can’t tell what the query is meant to say just from the parser, and look at the explicit with statements to figure it out (with the potential for mistakes that I pointed out earlier) until we can figure out something better.

One thing that I’m thinking about is that the () and [] which help distinguish between nodes and edges, aren’t actually required for us if we have an explicit statement. So we can write it like so:

image

In this manner, we can tell, quite easily, if you meant to define an implicit edge / node or refer to an explicitly defined alias. I’m not sure whether this would be a good idea, though.

Another issue we have to deal with is:

image

Note that in this case, we have a filter expression on the edge as well. Applying the same process we have done so far, we get:

image

The advantage here is that this is very clear and obvious about what is going on. The disadvantage is that this takes quite a bit longer to express.

time to read 2 min | 279 words

An interesting challenge with implementing graph queries is that you sometimes get into situations where the correct behavior is counter intuitive.

Consider the case of the graph on the right and the following query:

image

This will return:

  • Source: Arava, Destination: Oscar

But what would be the value of the Edge property? The answer to that is… complicated.  What we actually return is the edge itself. Let’s see what I mean by that.

image

And, indeed, the value of Edge in this query is going to be dogs/oscar.

image

This isn’t very helpful if we are talking about a simple edge like this. After all, we can deduce this from the Src –> Destination pair. This gets more interesting when the edge is more complex. Consider the following query:

image

What do you think should be the output here? In this case, the edge isn’t the Product property, it is the specific line that matches the filter on the edge. Here is what the result looks like:

image

As you can imagine, knowing exactly what edge led you from one document to another can be very useful when you look at the query results.
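
A rough Python sketch of this behavior, under assumptions of my own: the order document, its field names and the discount filter are illustrative stand-ins for the query shown in the images, and `follow_lines` is a hypothetical helper. The point is only that the value returned for the edge is the whole matching line, not just the id it points to.

```python
# Hypothetical order document with an embedded collection of lines.
ORDER = {
    "Id": "orders/1",
    "Lines": [
        {"Product": "products/1", "Quantity": 2, "Discount": 0.0},
        {"Product": "products/2", "Quantity": 1, "Discount": 0.25},
    ],
}


def follow_lines(order, line_filter):
    """Traverse the Lines edge, returning the matching line itself as the edge."""
    return [
        {
            "Source": order["Id"],
            "Edge": line,                   # the full line that matched the filter
            "Destination": line["Product"],  # the id the edge points to
        }
        for line in order["Lines"]
        if line_filter(line)
    ]


results = follow_lines(ORDER, lambda line: line["Discount"] > 0)
print(results)
```

With the edge available in the result, you can see the quantity and discount that applied on the way from the order to the product.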

time to read 2 min | 258 words

I was busy working on the implementation of filtering in graph queries, as discussed in my previous post. What I ended up implementing is a way for the user to tell us exactly how to handle the results. The actual query we ended up with is this:

image

And the key part here is the where clause, where we state that a and c cannot be the same dog. This also matches the behavior of SQL, and for that reason alone (predictability), it is a good idea.

However, I didn’t just implement inequality, I implemented full filtering capabilities, and you can access anything in the result. Which means that this query is now also possible:

image

I’ll give you a moment to analyze this query in peace. Try to de-cypher it (pun intended).

What this query does is compare the actual sale price and the regular price of a product on a particular order, for products that match a particular set of categories.

This is a significant query because, for the first time in RavenDB, you have the ability to perform such a query (previously, you would have had to define a specific index for this query).

In other words, what graph query filtering brings to the table is joins. And I did not set out to build this feature and I’m feeling very strange about it.

time to read 2 min | 381 words

We ran into an interesting design issue when building graph queries for RavenDB. The problem statement is fairly easy. Should a document be allowed to be bound to multiple aliases in the query results, or just one? However, without context, the problem statement is not meaningful, so let’s talk about what the actual problem is. Consider the graph on the right. We have three documents, Arava, Oscar and Phoebe, and the following edges:

  • Arava Likes Oscar
  • Phoebe Likes Oscar

We now run the following query:

image

This query asks for a dog that likes another dog that is liked by a dog. Another way to express the same sentiment (indeed, how RavenDB actually considers this type of query) is to write it as follows:

image

When processing the and expression, we require that documents that match to the same alias will be the same. Given the graph that we execute this on, what would you consider the right result?

Right now, we have the first option, in which a document can be matched to multiple different aliases in the same result, which would lead to the following results:

image

Note that in this case, the first and last entries match A and C to the same document.

The second option is to ensure that a document can only be bound to a single alias in the result, which would remove the duplicate results above and give us only:

image

Note that in either case, position matters, and the minimum number of results this query will generate is two, because we need to consider different starting points for the pattern match on the graph.
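
The two options are easy to compare in a small Python sketch. This is a model of the semantics described above rather than RavenDB code; the `match_pattern` helper and its `distinct_aliases` flag are names I made up for the example.

```python
from itertools import product

# The two edges from the graph above.
LIKES = [("Arava", "Oscar"), ("Phoebe", "Oscar")]


def match_pattern(likes, distinct_aliases):
    """Match a -Likes-> b <-Likes- c by joining two Likes edges on the shared b.

    When distinct_aliases is True, a document may be bound to only one alias
    in any single result.
    """
    results = []
    for (a, b_left), (c, b_right) in product(likes, repeat=2):
        if b_left != b_right:
            continue  # both clauses must bind b to the same document
        if distinct_aliases and len({a, b_left, c}) < 3:
            continue  # some document is bound to more than one alias
        results.append((a, b_left, c))
    return results


option_one = match_pattern(LIKES, distinct_aliases=False)
option_two = match_pattern(LIKES, distinct_aliases=True)
print(option_one)
print(option_two)
```

The first option yields four rows, with the first and last binding a and c to the same dog; the second option keeps only the two rows in the middle.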

What do you think we should do in such a case? Are there reasons to want one behavior or the other, and should it be something that the user selects?

time to read 3 min | 445 words


One of the most important design decisions we made with RavenDB is not forcing users to explicitly create edges between documents. Instead, the edges are actually just normal properties on the documents and can be used as-is. This means that pretty much any existing RavenDB database can immediately start using graph operations, you don’t need to do anything.

The image below shows an order, using the RavenDB’s sample Northwind dataset. The highlighted portions mark the edges from this document. You can use these to traverse the graph by hopping from document to document.

image

For example, using:

image

This is easy to do, migrates well (zero cost to do so, yeah!) and usually matches nicely with what you already have. It does lead to an interesting observation. Typically, in a graph DB, you’ll model everything as a graph. But with RavenDB, you don’t need to do that.

In particular, let’s take a look at the Neo4J’s rendition of Northwind:

image

As you can see, everything is modeled as a node / edge. This is the only thing you could model it as. With RavenDB, you would typically use a domain driven model. In this case, it means that a value object, like an OrderLine, will not have its own concrete existence, either as a node or as an edge. Instead, it will be embedded inside its root aggregate (the order).

Note that this is actually quite interesting, because it means that we need to be able to provide the ability to query on complex edges, such as the order lines. Here is how it works:

image

This will give us all the discount products sold in London as well as their discount rate.

Note that here, unlike in previous queries, we use a named alias for the edge. In this case, it gives us the ability to access its properties and project the line’s Discount property to the user. This means that you can have a domain model with strong cohesion and locality, following the domain driven design principles, while still being able to run arbitrary graph queries on it. Combining this with the ability to pull data from indexes (including map/reduce ones), you have a lot of things that you can do that used to be very hard but now are easy.
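
As a sketch of what such a query computes, here is the same idea over plain Python dictionaries. The documents are made up, and I’m assuming that “sold in London” means the order’s ShipTo.City is London and that a discounted line is one with Discount greater than zero; the `discounted_products_in` helper is hypothetical.

```python
# Hypothetical orders: the lines are embedded in the order, not separate nodes.
ORDERS = [
    {
        "Id": "orders/1",
        "ShipTo": {"City": "London"},
        "Lines": [
            {"Product": "products/1", "Discount": 0.0},
            {"Product": "products/2", "Discount": 0.15},
        ],
    },
    {
        "Id": "orders/2",
        "ShipTo": {"City": "Paris"},
        "Lines": [{"Product": "products/3", "Discount": 0.2}],
    },
]

# Hypothetical product documents, keyed by id.
PRODUCTS = {
    "products/1": {"Name": "Product A"},
    "products/2": {"Name": "Product B"},
    "products/3": {"Name": "Product C"},
}


def discounted_products_in(city, orders, products):
    """Filter the source orders, filter the embedded edge, then project from both."""
    rows = []
    for order in orders:
        if order["ShipTo"]["City"] != city:
            continue  # filter on the source node
        for line in order["Lines"]:
            if line["Discount"] > 0:  # filter on the edge itself
                product = products[line["Product"]]  # hop to the linked document
                rows.append({"Product": product["Name"], "Discount": line["Discount"]})
    return rows


rows = discounted_products_in("London", ORDERS, PRODUCTS)
print(rows)
```

The Discount in the output comes from the edge (the order line), while the Name comes from the destination document, which is what the named edge alias makes possible.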

time to read 4 min | 744 words

A query has two audiences: the users and the query engine. Ideally, you need to come up with a query language that would serve both. One of the early decisions that we made with the query language is that we want to be:

  • Very flexible for the user, giving them several ways to express themselves.
  • Very rigid in the query engine, with only one way to do something.

These two requirements are directly contradicting one another, which is indeed somewhat of a problem. The key here is that we don’t want to produce multiple ways to do the same thing in the query engine. That is a great way to introduce:

  • Different actual execution plans.
  • Features that only work with a specific syntax.
  • More complexity overall.

Anyone who ever worked with the internals of Linq can attest to the complexity that is involved here.

Let’s take the simple query that we have been inspecting so far:

image

Now, let’s ask RavenDB to spit it back out for us, shall we? Here is how RavenDB thinks about this query:

image

In other words, the way RavenDB sees the query and the way the user sees the query are very different. You can see that we have the with edges clauses here, defining the edges on the query.

In other words, all of the query definitions are happening in the with and with edges clauses. When we need to actually perform the matches, the match clause only defines the graph pattern that we need to match on.  It is the responsibility of the query parser to arrange the query from the multiple ways that the user may want to define it to the single representation that is actually going to be executed by the query runner.

This may seem like a lot of ceremony, but that is only because we have a very simple query. Let’s change the “friends of friends who aren’t my friends” to something a bit more interesting: “Close friends of my close friends who aren’t my friends”. We are also going to want to limit the friends that we follow only to Users (so, for example, we’ll not follow a FriendOf link to a Pet).

Here is what the query looks like, when we use more concise syntax, and how RavenDB translates it:

image

You’ll note that even for the query above, I still used a separate with clause to make things easier, the following query is exactly the same:

image

The basic idea is that for trivial filtering, you’ll probably want to do that inline, inside the match clause. But anything more complex should go to the with clause where you can more easily express your logic.  Also note that aliases matter. The f1 and f2 here are not duplicated for no reason, part of processing the query is to bind a value to each of the aliases, and you cannot bind a single result to multiple aliases.

Another key aspect of this mode is that while this is pretty easy to follow, a with clause can contain any query. That means that you can use indexes as well, including Map/Reduce indexes. Here is one such example:

image

In this case, I’m not sure how good a graph query this is, I’ll admit, but it does a good job of demonstrating what you can do. We are taking a few queries, mixing them together and then mashing the results to find London companies who didn’t order as much as they used to.

This means that the source information for graph queries can be things like spatial queries, full text search, map/reduce, etc. A lot of the complexity in graph queries is just getting to the start of the graph pattern matching. With RavenDB, you have a very strong query language and facilities to help you get past that and directly into the graph operations.

This is enough about pre-processing the query. In my next post, I’m going to go into depth on how graph queries work with document models.

time to read 4 min | 781 words

Pretty much all our early discussions about graphs in RavenDB focused on how to build the actual graph implementation. How to allow fast traversal, etc. When we started looking at the actual implementation, we realized that we seriously neglected a very important piece of the puzzle, the query interface for the graphs.

This is important for several reasons. First, ergonomics matter. If we end up with a query language that is awkward, it won’t see much use and will complicate the users’ lives (and our own). Second, the query language effectively dictates how the user thinks about the model, so making low level decisions that would have an impact on how the user is actually using this feature is probably not a good idea yet. We need to start from the top, with what we give to the user, and then see how we can make that a reality.

The most common use case of graph queries is the friends of friends query. Let’s see how this query is handled in various existing implementations, shall we?

Neo4J, using Cypher:

image

OrientDB doesn’t seem to have an easy way to do this. The following shows how you can find the 2nd degree friends, but it doesn’t exclude friends of friends who are already your friends. StackOverflow questions on that show a scary amount of code, so I’m going to skip them.

image

Gremlin, which is used in a wide variety of databases:

image

We looked at other options, but it seems that graph query languages fall into the following broad categories:

  • ASCII art to express the relationship between the nodes.
  • SQL extensions that express the relationships as nested queries.
  • Method calls to express the traversal.

Of the three options, we found the first option, using ASCII Art / Cypher, to be the easiest one to work with. This is true both in terms of writing the query and actually executing it.

Let’s look at what the friends of friends query will look like in RavenDB:

image

Graph queries are composed of two portions:

  • With clauses, which determine source point for the graph traversal.
  • Match clause (singular) that contains the graph pattern that we need to match on.

In the case above, we are starting the graph traversal from start, which is defined as a with clause. A query can have multiple with clauses, each defining an alias that can be used in the match clause. The match clause, on the other hand, uses these aliases to decide how to process the query.

You can see that we have two clauses in the above query, and the actual processing is done by pattern matching (to me, it makes sense to compare it to regular expressions or Prolog). It would probably be easier to show this with an example. Here is the relationship graph among a few people:

image

We’ll set the starting point of the graph as Arava and see how this will be processed in the query.

For the first clause, we’ll have:

  • start (Arava) –> f1 (Oscar) –> f2 (Phoebe)
  • start (Arava) –> f1 (Oscar) –> f2 (Sunny)
  • start (Arava) –> f1 (Sunny) –> f2 (Phoebe)
  • start (Arava) –> f1 (Sunny) –> f2 (Oscar)

For the second clause, on the other hand, we’ll have:

  • start (Arava) –> f2 (Oscar)
  • start (Arava) –> f2 (Sunny)

These clauses are joined using the and not operator. What this means is that we need to exclude from the first clause anything that matches the second clause. Match, in this case, means the same alias and value for any existing alias.

Here is what we end up with:

  • start (Arava) –> f1 (Oscar) –> f2 (Phoebe)
  • start (Arava) –> f1 (Oscar) –> f2 (Sunny) (removed)
  • start (Arava) –> f1 (Sunny) –> f2 (Phoebe)
  • start (Arava) –> f1 (Sunny) –> f2 (Oscar) (removed)

We removed two entries, because they matched the entries from the second clause. The end result being just friends of my friends who aren’t my friends.
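
The whole walkthrough fits in a few lines of Python, if you want to play with it. This is a sketch of the and not semantics as described in this post, not of how RavenDB executes the query; the friendship data mirrors the bullet lists above and the `and_not` helper is a name I picked for the example.

```python
# Friendship edges, as implied by the clause results listed above.
FRIENDS = {
    "Arava": ["Oscar", "Sunny"],
    "Oscar": ["Phoebe", "Sunny"],
    "Sunny": ["Phoebe", "Oscar"],
}

start = "Arava"

# First clause: start -> f1 -> f2
first = [
    {"start": start, "f1": f1, "f2": f2}
    for f1 in FRIENDS[start]
    for f2 in FRIENDS.get(f1, [])
]

# Second clause: start -> f2
second = [{"start": start, "f2": f2} for f2 in FRIENDS[start]]


def and_not(left, right):
    """Drop rows of left that agree with any row of right on all shared aliases."""

    def matches(l_row, r_row):
        shared = l_row.keys() & r_row.keys()
        return all(l_row[alias] == r_row[alias] for alias in shared)

    return [l_row for l_row in left if not any(matches(l_row, r_row) for r_row in right)]


result = and_not(first, second)
for row in result:
    print(row)
```

Only the rows ending in Phoebe survive, since Oscar and Sunny are already direct friends of Arava.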

The idea behind the query language is that we want to be high level and allow you to express what you want, and we’ll be in charge of actually making this work properly.

In the next post, I’ll talk a bit more about the query language, what scenarios it enables and how we are going to go about processing queries.
