Ayende @ Rahien

Hi!
My name is Oren Eini
Founder of Hibernating Rhinos LTD and RavenDB.
You can reach me by email or phone:

ayende@ayende.com

+972 52-548-6969

, @ Q j

Posts: 6,738 | Comments: 48,777

filter by tags archive

Graphs in RavenDBReal world use cases

time to read 6 min | 1032 words

imageI talked a lot about how graph queries in RavenDB will work, but one missing piece of the puzzle is how they are going to be used. I’m going to use this post to discuss some of the options that the new features enables. We’ll take the following simple model, issue tracking. Let’s imagine that a big (secret) project is going on and we need to track down the tasks for it. On the right, you have an image of the permissions graph that we are interested in.

The permission model is pretty standard I think. We have the notion of users and groups. A user can be associated to one or more groups. Groups memberships are hierarchical. An issue’s access is controlled either by giving access to a specific users or to a group. The act of assigning a group will also allow access to all the group’s parents and any user that is associated to any of them.

Here is the issue in question:

image

Given the graph on the right, let’s see, for each user, how we can find out what issues they have access to.

We’ll start with Sunny, who is named directly in the issue as allowed access. Here is the query to find the issues on which we are directly named.

image

This was easy, I must admit. We could also write this without any need for graph queries, because it is so simple:

image

It gets a bit trickier when we have to look at Max. Unlike Sunny, Max isn’t directly named. Instead, he is a member in a group that is directly named in the issue. Let’s see how we can query on issues that Max has access to:

image

This is where things gets start to get interesting. If you’ll look into the match clause, you’ll see that we have arrows that go both left and right. We are telling RavenDB that we want to find issues that have the same group (denoted as g in the query) as the user. This kind of query you already can’t express with RavenDB right now without the graph query syntax. Another way to write this query is to use explicit clauses, like so:

image

In this case, all the arrows go in a single direction, but we have two clauses that are being and together. These two queries are exactly the same, in fact, RavenDB will translate the one with arrows going both ways to the one with the and.

The key here is that between any two clauses with an and, RavenDB will match the same alias to the same value.

So we now can find issues for Max and Sunny, but what about all the rest of the people? We can, of course, do this manually. Here is how we can find issues for people a group removed from the issue.

image

This query gives us all the issues for Nati that are one group removed from him. That works, but it isn’t how we want to do things. It is a good exercise in setting out the structure, though. Because instead of hard coding the pattern, we are now going to use recursion.

image

The key for this query is the recursive element in the third line. We moved from the issue to its groups, and then we recurse from the issues’ groups to their parents. We allow empty recursion and we follow all the paths in the graph.

On the other side, we go from the user and try to find a match from the user’s group to any of the groups that end the query. In Nati’s case, we went from project-x group to team-nati, which is a match, so we can return this issue.

Here is the final query that we can use, it is a bit much, even if it is short, so I will deconstruct it below:

image

We use a parameterized query here, to make things easier. We start from a user and find an issue if:

  • The issue directly name the user as allowed access. This is the case for Sunny.
  • The issue’s groups (or their parents) match the groups that the user belong to (non recursively, mind).

In the case of Max, for example, we have a match because the recursive element allows a zero length path, so the project-x group is allowed access and Max is a member of it.

In the case of Nati, on the other hand, we have to go from project-x to team-nati to get a match.

If we’ll set $uid to users/23, which is Pheobe, all the way to the left in the graph, we’ll also have a match. We’ll go from project-x to execs to board and then find a match.

Snoopy, on the other hand, doesn’t have access. Look carefully at the direction of the arrows in the graph. Snoopy belongs to the r-n-d group, and that groups is a child of exces, but the query we specified only go up to the parents, not the children, so Snoopy is not included.

I hope that this post gave you some ideas about what kind of use cases are enabled by graph queries in RavenDB. I would love to hear about any scenarios you have for graph queries so we can play with them and see how they are answered by RavenDB.

The fear of an empty source file

time to read 4 min | 606 words

imageI have been writing software at this point for over twenty years, and I want to believe that I have learned a few things during that timeframe.

And yet, probably the hardest thing for me is to start writing from scratch. If there is no code already there, it is all too easy to get lost in the details and not actually be able to get anywhere.

An empty source file is full of so many options, and any decision that I’ll make is going to have very long lasting impact. Sometimes I look at the keyboard and just freeze, unable to proceed because I know, with a 100% certainty, that whatever I’ll produce isn’t going to be up to my own standards. In fact, it is going to suck, for sure.

I think that about 90% of the things I have written so far are stuff that I couldn’t write today. Not because I lack the knowledge, but because I have far greater understanding of the problem space and I know that trying to solve it all is such a big task that it is not possible for me to do so. What I need reminding, sometimes, is that I have written those things, and eventually, those things were able to accomplish all that was required of them.

A painter doesn’t just start by throwing paint on canvas, and a building doesn’t grow up by people putting bricks where they feel like. In pretty much any profession, you need to iterate several times to get things actually done. With painters, you’ll typically do a drawing before actually putting paint on canvas. With architects will build a small scale model, etc.

For me, the hardest thing to do when I’m building something new is to actually allow myself to write it out as is. That means, lay out the general structure of the code, and ignore all the other stuff that you must have in order to get to real production worthy code. This means flat our ignoring:

  • Error handling
  • Control of allocations and memory used
  • Select the underlying data structures and algorithms
  • Yes, that means that O(N^2) is just fine for now
  • Logging, monitoring and visibility
  • Commenting and refactoring the code for maintainability over time

All of these are important, but I literally can’t pay these taxes and build something new in the same time.

I like to think about the way I work as old style rendering passes. When I’m done with the overall structure, I’ll go back and add these details. Sometimes that can be a lot of work, but at that point, I actually have something to help me. At a minimum, I have tests that verify that things still work and now I have a good understanding of the problem (and my solution) so I can approach things without having so many unknown to deal with.

A large part of that is that the fact that I didn’t pay any of the taxes for development. This usually means that the new thing is basically a ball of mud, but it is a small ball of mud, which means that if I need to change things around, I have to touch fewer moving parts. A lot fewer, actually. That allow me to explore, figure out what works and doesn’t.

It is also going directly against all of my instincts and can be really annoying. I really want to do a certain piece of code properly, but focusing on perfecting a single door knob means that the whole structure will never see the light of day.

Graphs in RavenDBRecursive queries

time to read 3 min | 565 words

imageGraph queries as I discussed them so far gives you the ability to search for patterns. On the right, you can see the family tree of the royal family of Great Britain going back a few hundred years. That make for an interesting subject for practicing graph queries.

A good example we might want to ask is who is the royal grand parent of Elizabeth II. We can do that using:

image

This is great, and nicely demonstrate how we can scan for specific patterns in the graph. However, it is limited by its rigidity. For example, let’s say that I want to find someone in the family tree and I’m not sure about the exact nature of the relationship?

“We are not amused” comes to mind, but off the top of my head and without consulting the chart, I don’t think that I would be able to figure it out. Luckily, I don’t have to, I can ask RavenDB to be so kind and tell me.

image

Note the use of the recursive element here. We are asking RavenDB to start in a particular document and go up the parents, trying to find an unamused royal. The recursion portion of the query can be zero to six steps in size and should abort as soon as we have any match. Following the zero to six parents, there should be a parent that is both a royal an unamused.

The Cypher syntax for what they call variable length queries is reminiscent of regular expressions, and I don’t mean that in a complimentary manner. Looking at the query above, you might have noticed that there is a distinct difference between it and the first one. The recursive query will go up the Parents link, regardless of whatever that parent is royal or not. RavenDB Graph Queries has what I believe to be a unique feature. The recursive pattern isn’t limited to a single step and can be as complex as you like.

For example, let’s ensure that we are only going to go up the chain of the royal parents.

image

The recursive element has a few knows that you can tweak. The minimum and maximum distance, for example, are obvious examples, but the results criteria for the recursion is also interesting. In this query, we use the shortest, instead of the lazy. This will make RavenDB work a bit harder and find the shortest recursive path that matches the query, where as lazy stops on the first one that matches. The following options are available:

  • Lazy – stop on the first pattern that matches. Good for: “Am I related to Victoria?”
  • Shortest – find the shortest path that match the pattern. Good for: “How am I related to Victoria?”
  • Longest – find the longest path that match the pattern. Good for: “For how many generations has Victoria’s family been royals?”
  • All – find all the paths that match the pattern. Good for if you have multiple paths in your ancestry to Victoria.

Graphs in RavenDBInconsistency abhorrence

time to read 3 min | 462 words

In my previous post, I discussed some options for changing the syntax of graph queries in RavenDB from Cypher to be more in line with the rest of the RavenDB Query Language. We have now completed that part and can see the real impact it has on the overall design.

In one of the design review, one of the devs (who have built non trivial applications using Neo4J) complained that the syntax is now much longer. Here are the before and after queries to compare:

image

The key, from my perspective, is that the new form is more explicit and easier to read after the fact. Queries tend to grow more complex over time, and they are being read a lot more often than written). As such, I absolutely want to lean toward being readable over being terse.

The example above just show the extra characters that you need to write. Let’s talk about something that is a bit more complex:

image

Now we have a lot more text, but it is a lot easier to understand what is going on. Focus especially on the Lines edge, where we can very clearly separate what constitute the selection on the edge, the filter on the edge and what is the property that contains the actual linked document id.

The end result is that we now have a syntax that is a lot more consistent and approachable. There are other benefits, but I’ll show them off in the next post.

A major source of annoyance for me with this syntax was how to allow anonymous aliases. In the Cypher syntax we used, you could do something like:

image

There is a problem with how to express this kind of syntax of anonymous aliases with the Collection as alias mode. I initially tried to make it work by saying that we’ll look at the rest of the query and figure it out. But that just felt wrong. I didn’t like this inconsistency. I want a parse tree that I can look at in isolation and know what is going on. Simplifying the language is something that pays dividends over time, so I eventually decided that the query above will look this with the next syntax:

image

There is a lot of precedence of using underscore as the “I don’t care” marker, so that works nice and resolves any ambiguities in the syntax.

Graphs in RavenDBI didn’t mean to build this feature!

time to read 2 min | 258 words

I was busy working on the implementation on filtering in graph queries, as discussed in my previous post. What I ended up implementing is a way for the user to tell us exactly how to handle the results. The actual query we ended up with is this:

image

And the key part here is the where clause, were we state that a and c cannot be the same dog. This also matches the behavior of SQL, and for that reason allow (predictably), that’s a good idea.

However, I didn’t just implement inequity, I implement full filtering capabilities, and you can access anything in the result. Which means that this query is now also possible:

image

I’ll let you a moment to analyze this query in peace. Try to de-chyper it (pun intended).

What this  query is doing is to compare the actual sale price and the regular price of product on a particular order, for products that match a particular set of categories.

This is a significant query because, for the first time in RavenDB, you have the ability to perform such a query (previous, you would have had to define a specific index for this query).

In other words, what graph query filtering brings to the table is joins. And I did not set out to build this feature and I’m feeling very strange about it.

Graphs in RavenDBQuery results

time to read 2 min | 381 words

imageWe run into an interesting design issue when building graph queries for RavenDB. The problem statement is fairly easy. Should a document be allowed to be bound to multiple aliases in the query results, or just one? However, without context, the problem statement in not meaningful, so let’s talk about what the actual problem is. Consider the graph on the right. We have three documents, Arava, Oscar and Phoebe and the following edges:

  • Arava Likes Oscar
  • Phoebe Likes Oscar

We now run the following query:

image

This query asks for a a dog that likes another dog that is liked by a dog. Another way to express the same sentiment (indeed, how RavenDB actually considers this type of query) is to write it as follows:

image

When processing the and expression, we require that documents that match to the same alias will be the same. Given the graph that we execute this on, what would you consider the right result?

Right now, we have the first option, in which a document can be match to multiple different alias in the same result, which would lead to the following results:

image

Note that in this case, the first and last entries match A and C to the same document.

The second option is to ensure that a document can only be bound to a single alias in the result, which would remove the duplicate results above and give us only:

image

Note that in either case, position matters, and the minimum number of results this query will generate is two, because we need to consider different starting points for the pattern match on the graph.

What do you think should we do in such a case? Are there reasons to want this behavior or that and should it be something that the user select?

RavenDB C++ client: Laying the ground work

time to read 2 min | 371 words

The core concept underlying the RavenDB client API is the notion of Unit of Work. This provide core features such as change tracking and identity map. In all our previous clients, that was pretty easy to deal with, because the GC solved memory ownership and reflection gave us a lot of stuff basically for free.

Right now, what I want to achieve is the following:

It seems pretty simple, right? Because both the session and the caller code are going to share ownership on the passed User. Notice that we modify the user after we call store() but before save_changes(). We expect to see the modification in the document that is being generated.

The memory ownership is handled here by using shared_ptr as the underlying mode in which we accept and return data to the session. Now, let’s see how we actually deal with serialization, shall we? I have chosen to use nlohmann’s json for the project, which means that the provided API is quite nice. As a consumer, you’ll need to write your JSON serialization code, but it is fairly obvious how to do so, check this out:

Given that C++ doesn’t have reflection, I think that this represent a really nice handling of the issue. Now, how does this play with everything else? Here is what the skeleton of the session looks like:

There is a whole bunch of stuff that is going on here that we need to deal with.

First, we have the IEntityDetails interface, which is non generic. It is implemented by the generic class EntityDetails, which has the actual type that we are using and can then use the json serialization we defined to convert the entity to JSON. The rest are just details, we need to use a vector of shared_ptr, instead of the abstract class, because the abstract class has no defined size.

The generic store() method just capture the generic type and store it, and the rest of the code can work with the non generic interface.

I’m not sure how idiomatic this code is, or how performant, but at least as a proof of concept, it works to show that we can get a really good interface to our users in C++.

The perils of full system resource utilization

time to read 3 min | 471 words

The following quotes (or something very similar) came from our interactions with customers:

“We paid a lot of money for this hardware, why isn’t your database making full use of it?”

“The machine is peaking at 100% CPU, the sky is falling, help, NOW!”

This is a problem, because I can empathize with both sides. On the one hand, having just put a five or six figure sum into new hardware, it can be depressing to see is “going to waste”. On the other hand, seeing the system under high load gives you that sinking feeling that the boat is going to overturn at any moment and production will go down.

Balancing resource consumption is a really hard problem, mostly because we don’t have any control over our work intake. We can’t control how many requests we accept nor do we control what kind of work is being asked of us. Actually, that isn’t true. We could control that, but in most cases, that is a false distinction.

At some point, RavenDB had a max number of concurrent request limit, and users have hit that in the past. This resulted in angry calls from customers about RavenDB refusing requests. The fact that we did that to maintain the overall health of the system was immaterial. Refusing requests meant that the system (or some portion of it) was down. In those cases, it was actually better, from the customer’s perspective, for the whole thing to slow down a bit, as long as there were no errors.

Inside RavenDB, we attempt to manage our CPU consumption using separation of concerns. First, we have the processing of requests. The assumption is that such requests end up being waited on by an actual human, directly or indirectly, so we process them first, prioritizing them above almost everything else. The only thing that has higher priority is the cluster health and monitoring system, which ensure that all nodes are up, running and in the same state.

As it turns out, RavenDB have a lot of additional processes internally that can be given a lower priority under load. For example, indexing, which are something that RavenDB runs in the background, are something that we can increase the latency of to give more resources for request processing.

We have a lot of experience in balancing the overall needs, and I’m still not sure that I have a good answer here. The reason for this post is that I just analyzed a dump file where it looked like requests were waiting for indexing to complete, but they were actually starving the indexes from the CPU time that they needed to actually run.  The system progressed, but not fast enough for the user to not notice things.

Actually, that is the primary criteria that we use. If the system is slow, but no one notices, the system ain’t slow.

The design & challenges of a RavenDB C++ client

time to read 3 min | 468 words

When I wrote the first version of RavenDB, I was coming off about six years of intensive work on NHibernate. I wanted the same level of convenience that I had with a world class OR/M with non of the relational constraints (pun intended).

Given that I was working in a managed language, features such as change tracking, unit of work, etc. Since then, we created clients for: C#, Java, Python, Node.JS, Ruby and Go. A common feature of all these languages is that they all have automatic memory management. Go, in particular, has been interesting, because while it deals with explicit pointers, there is no need to deal with manually freeing memory.

We are now looking at what it would take to bring the same level of experience to a C++ client. For example, here is about the simplest CRUD scenario that I can think of:

This code isn’t showing something special, until you realize that when you want to translate it to C++, you’ll need to take into account the explicit memory ownership. Another issue to deal with is how we can implement seamless integration between business objects and JSON documents.

I looked at how this is handled in other similar databases, and the results seems to be, pretty badly.

At least, when I compare it to how much higher the level of the code is in C++. Now, it is possible that C++ developers like working at this level. And certainly, the RavenDB client APIs actually have user exposed layers that are similar to this, but this is something that you’ll usually not need. Ideally, I want to be able to give the same level of experience to the C++ client as well.

The issue of JSON serialization actually seems to be already well taken for already.  A user will need to define to_json and from_json functions to make this work, but given that C++ has no reflection, that seems reasonable to request. It also gives the user complete control over the serialization / deserialization process and avoid the process of “customizing” the JSON serialization, which you sometimes have to do.

The issue of memory ownership, though, it a bit more complex. I was thinking about exposing this via the following interface:

The idea is that the RavenDB C++ client will only deal with shared_ptr, with the idea that we can accept that the entities we manage may live longer than the lifetime of the session.

I’m no longer able to consider myself a C++ developer, and the dev we have started working on the C++ stuff is currently busy learning RavenDB itself, so I thought this would be a good time to ask for feedback.

Both on the kind of interface that you’ll like to see for C++ client and whatever this approach is going to work.

Graphs in RavenDBGraph modeling vs. document modeling

time to read 3 min | 445 words


One of the most important design decisions we made with RavenDB is not forcing users to explicitly create edges between documents. Instead, the edges are actually just normal properties on the documents and can be used as-is. This means that pretty much any existing RavenDB database can immediately start using graph operations, you don’t need to do anything.

The image below shows an order, using the RavenDB’s sample Northwind dataset. The highlighted portions mark the edges from this document. You can use these to traverse the graph by hopping from document to document.

image

For example, using:

image

This is easy to do, migrates well (zero cost to do so, yeah!) and usually matches nicely with what you already have. It does lead to an interesting observation. Typically, in a graph DB, you’ll model everything as a graph. But with RavenDB, you don’t need to do that.

In particular, let’s take a look at the Neo4J’s rendition of Northwind:

image

As you can see, everything is modeled as a node / edge. This is the only thing you could model it as. With RavenDB, you would typically use a domain driven model. In this case, it means that a value object, like an OrderLIne, will not have its own concrete existence. Either as a node or an edge. Instead, it will be embedded inside its root aggregate (the order).

Note that this is actually quite interesting, because it means that we need to be able to provide the ability to query on complex edges, such as the order lines. Here is how works:

image

This will give us all the discount products sold in London as well as their discount rate.

Note that in here, unlike previous queries, we use an named alias for the edge. In this case, it gives us the ability to access it properties and project the line’s Discount property to the user. This means that you can have a domain model with strong cohesion and locality, following the domain driven design principles while still being able to run arbitrary graph queries on it.  Combining this with the ability to pull data from indexes (including map/reduce) ones, you have a lot of things that you can do that used to be very hard but now are easy.

FUTURE POSTS

No future posts left, oh my!

RECENT SERIES

  1. Graphs in RavenDB (11):
    08 Nov 2018 - Real world use cases
  2. Challenge (54):
    28 Sep 2018 - The loop that leaks–Answer
  3. Reviewing FASTER (9):
    06 Sep 2018 - Summary
  4. RavenDB 4.1 features (12):
    22 Aug 2018 - MongoDB & CosmosDB Migration Wizards
  5. Reading the NSA’s codebase (7):
    13 Aug 2018 - LemonGraph review–Part VII–Summary
View all series

Syndication

Main feed Feed Stats
Comments feed   Comments Feed Stats