Ayende @ Rahien

Oren Eini, aka Ayende Rahien, is the CEO of Hibernating Rhinos LTD, which develops RavenDB, a NoSQL open source document database.

You can reach me by:

oren@ravendb.net

+972 52-548-6969

time to read 4 min | 613 words

In my last post on the topic, I showed how we can define a simple computation during the indexing process. That was easy enough, for sure, but it turns out that there are quite a few use cases for this feature that go well beyond what you would expect. For example, we can use this feature as part of defining and working with business rules in our domain.

For example, let’s say that we have some logic that determines whether a product is offered with a warranty (and for how long that warranty is valid). This is an important piece of information, obviously, but it is the kind of thing that changes on a fairly regular basis. For example, consider the following feature description:

As a user, I want to be able to see the offered warranty on the products, as well as to filter searches based on the warranty status.

Warranty rules are:

  • For new products made in house, full warranty for 24 months.
  • For new products from 3rd parties, parts only warranty for 6 months.
  • Refurbished products by us, full warranty, for half of new warranty duration.
  • Refurbished 3rd party products, parts only warranty, 3 months.
  • Used products, parts only, 1 month.

Just from reading the description, you can see that this is a business rule, which means that it is subject to many changes over time. We can obviously create a couple of fields on the document to hold the warranty information, but that means that whenever the warranty rules change, we’ll have to go through all of them again. We’ll also need to ensure that any business logic that touches the document will re-run the logic to apply the warranty computation (to be fair, these sorts of things are usually done as a subscription in RavenDB, which alleviates that need).

Without further ado, here is the index to implement the logic above:
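
Something along the following lines would do. This is a sketch only: the collection and field names (Condition, Origin) are assumptions, and the actual index could just as easily be written as a C# LINQ index.

    // Sketch of a RavenDB JavaScript index that encodes the warranty rules.
    // The Condition and Origin fields are illustrative assumptions.
    map("Products", function (product) {
        var warranty;
        if (product.Condition === "New") {
            warranty = product.Origin === "InHouse"
                ? { Type: "Full", DurationMonths: 24 }
                : { Type: "PartsOnly", DurationMonths: 6 };
        } else if (product.Condition === "Refurbished") {
            warranty = product.Origin === "InHouse"
                ? { Type: "Full", DurationMonths: 12 } // half of the new product duration
                : { Type: "PartsOnly", DurationMonths: 3 };
        } else { // Used
            warranty = { Type: "PartsOnly", DurationMonths: 1 };
        }
        return {
            WarrantyType: warranty.Type,
            WarrantyDurationMonths: warranty.DurationMonths
        };
    })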

You can now query over the warranty type and its duration, project them from the index, etc. Whenever a document is updated, we’ll re-compute the warranty status and update the index.
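
For example, assuming the index above is deployed under the (hypothetical) name Products/Warranty, a query that filters on the computed fields could look like this:

    from index 'Products/Warranty'
    where WarrantyType = 'Full' and WarrantyDurationMonths >= 12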

This saves you from having additional fields in your model and greatly diminishes the cost of queries that need to filter on the warranty or its duration (since you don’t need to do this computation during the query, only once, during indexing).

If the business rule definition changes, you can update the index definition and RavenDB will effectively roll out your change to the entire dataset. That is nice, but even though I’m writing about cool RavenDB features, there are some words of caution that I want to mention.

Putting queryable business rules in the database can greatly ease your life, but be wary of putting too much business logic in there. In general, you want your business logic to reside right next to the rest of your application code, not running on a different server in a mode that is much harder to debug, version and diagnose. And if the level of complexity involved in the business rule exceeds some threshold (hard to define, but easy to know when you hit it), you should probably move from defining the business rules in an index to a subscription.

A RavenDB subscription allows you to get all changes to documents and apply your own logic in response. This is a reliable way to process data in RavenDB; it runs in your own code, under your own terms, so it can enjoy all the usual benefits of… well, being your code, and not mine. You can read more about them in this post and, of course, the documentation.

time to read 3 min | 403 words

(Image: the permissions graph)

Consider the graph on the right. I already talked about this graph when I wrote about permission based graph queries.

In this post, I want to show off another way to deal with the same problem, but without using graph queries, and using only the capabilities that we have in RavenDB 4.1.

The idea is that, given a user, I want to be able to issue a query for all the issues that this user has access to, either directly (like Sunny in the graph), via a group (like Max, via the project-x group) or via a recursive group (like Nati, via the project-x -> team-nati groups).

As you can imagine from the name of this post, this requires recursion. You can read the documentation about this, but I thought I would spice things up and use several features all at once.

Let’s look at the following index (Issues/Permissions):
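
Here is a sketch of what such an index can look like. The document field names (Users and Groups on the issue, Parents on the group) are assumptions based on the description below:

    // Sketch of the Issues/Permissions JavaScript index.
    map("Issues", function (issue) {
        // Walk from the issue's groups up through their parents, collecting every
        // group along the way, so the index ends up with a flat, searchable list.
        function recurse_groups(issue) {
            var result = [];
            var queue = (issue.Groups || []).slice();
            while (queue.length > 0) {
                var groupId = queue.pop();
                if (result.indexOf(groupId) !== -1)
                    continue; // already visited
                result.push(groupId);
                // load() also registers a reference, so changing a group re-indexes the issue
                var group = load(groupId, "Groups");
                var parents = (group && group.Parents) || [];
                for (var i = 0; i < parents.length; i++)
                    queue.push(parents[i]);
            }
            return result;
        }

        return {
            Users: issue.Users,
            Groups: recurse_groups(issue)
        };
    })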

This is a JS index, which has a map() function over the Issues collection. For each of the issues, we index the Users for the issue and the groups (recursively) that are allowed to access it.

Here is what the output of this index will be for our issue in the graph:

(Image: the output of the index for the issue in the graph)

Now, let’s look at how we can query this, shall we?

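A query along the following lines would do the trick. This is a sketch: the parameter names are assumptions, and the user’s own group memberships are passed in as a parameter.

    from index 'Issues/Permissions'
    where Users = $uid or Groups in ($userGroups)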

This query has two clauses: either we are assigned directly, or via a group. The key here is the recurse_groups() function in the index and, inside it, the load() call. It scans upward through the defined groups and their parents until we have a simple structure in the index that is easily searchable.

RavenDB will ensure that whenever a document that is referenced by a load() in the index is updated, all the documents that are referencing it will be re-indexed. In the case we have here, whenever a group is updated, we’ll re-index all the relevant issues to match the new permissions structure.

One of the core principles of RavenDB is that you can push more work to the indexing and keep your queries fast and simple. This is a good example of how we can arrange the data in such a way that we can push work to background indexing in a pretty elegant manner.

time to read 5 min | 944 words

We test RavenDB as thoroughly as we can. Beyond just tests, longevity runs, load tests and in general beating it with a stick and seeing if it bleats, we also routinely push early builds of RavenDB to our own production systems to see how they behave under real load.

This has been invaluable in catching several hard to detect bugs. A few days after we pushed a new build to production, we started getting errors about running out of disk space. The admin looked, and the problem seemed to have fixed itself. Then it repeated.

It took a bit of time to figure out what was going on. A recent change caused RavenDB to hang on to what are supposed to be transient files until an idle moment, where an idle moment is defined as 5 seconds without any writes. Our production systems don’t have an idle moment; here is a typical load on our production system:

(Image: requests per second on our production system)

We range between lows in the high 30s of requests per second and peaks of 200+ requests per second.

So we never have an idle moment to actually clean up those files, and they accumulate. Eventually, we run out of disk space. At this point, several interesting things happen. Here is what our topology looks like:

(Image: the cluster topology)

We have three nodes in the cluster, and currently, node B is the leader of the cluster as a whole. Most of our operations relate to a single database, and for that database, we have the following topology:

(Image: the database topology)

Note that in RavenDB, the cluster as a whole and a particular database may have different topologies. In particular, the primary node for the database and the leader of the cluster can be, and often are, different. In this case, Node A is the primary for the database and node B is the leader of the cluster.

Under this scenario, we have Node A accepting all the writes for this database, and then replicating the data to the other nodes. Here is what this looks like for node C.

(Image: the request traffic on node C)

In other words, Node A is very busy handling requests and writes. Node C is just handling replicated writes. Just for reference, here is what Node A’s CPU looks like, the one that is busy:

(Image: CPU usage on Node A)

In other words, it is busy handling requests. But it isn’t actually busy. We have a lot of spare capacity at hand.

This makes sense; our benchmark scenarios start at tens of thousands of requests per second. A measly few hundred per second isn’t a meaningful load. The servers in question, by the way, are t2.large instances with 2 cores and 8GB of RAM.

So we have a bug in releasing files when we aren’t “idle”. The files just pile up and eventually we run out of disk space. Here is what this looks like in the studio:

(Image: the low disk space warning in the studio)

And clicking on details gives us:

(Image: the warning details)

So this looks bad, but let’s see how this actually plays out in practice.

We have a failure on one node, which causes the database to be unloaded. RavenDB is meant to run in an environment where failure is likely and is built to handle that. Both the clients and the cluster as a whole know that when such things happen, either to the whole node or to one particular database on it, they should respond accordingly.

The clients automatically fail over to a secondary node. In the case of the database in question, we can see that if Node A fails, the next in line would be Node C. The cluster will also detect that and re-arrange the responsibilities of the nodes to account for the failed one.

The node on which the database has failed will attempt to recover from the error. At this point, the database is idle, in the sense that it doesn’t process any requests, and will be able to clean up all the files and delete them. In other words, the database goes down, restarts and recovers.

Because of the different write patterns on the different nodes, we’ll run into the out of disk space error at different times. We have been running in this mode for a few days now. The actual problem, by the way, has been identified and resolved. We just aren’t under any kind of pressure to push the fix to production.

Even under constant load, the only way we can detect this problem is through its secondary effects: the disk space on the machines, which is being monitored. None of our production systems has reported any failures and monitoring looks great across the board. This is a great testament to how expecting failure and preparing for it at all levels of the stack really pays off.

My production database is routinely running out of space? No worries, I’ll fix that on the regular schedule, nothing is actually externally visible.

time to read 9 min | 1610 words

“If a tree falls in a forest and no one is around to hear it, does it make a sound?” is a well known philosophical question. The technological equivalent of this is this story. We got a report that RavenDB was failing in the field. But the details around the failure were critical.

The failure happened in the field, literally. This is a system that is running an industrial robot using a custom ARM board. The failure would only happen on the robot in the field; it would not reproduce in the user’s test environment or on our own systems. Initially, that was all the information that we had: “This particular robot works fine for a while, but as soon as there is a break, RavenDB dies and needs to be restarted”. That was the first time I ran into a system that would crash when it went idle, instead of dying under load, I have to say.

My recommendation that they just keep the robot busy at all times was shot down, but for a while, we were in the dark. It didn’t help that this was literally a custom ARM machine that we had no access to. We finally managed to figure out that the crash was some variant of SIGSEGV or SIGABRT. That was concerning. The ARM machine in question is running on 32 bits, and the worry was that our 32 bits code was somehow doing an out of bounds read. This is a crash in production, so we allocated a couple of people to investigate and try to figure out what was going on.

We started by doing a review of all our 32 bits memory management code and in parallel attempted to reproduce this issue on a Raspberry Pi (the nearest machine we had to what was actually running). We got a lucky break when someone managed to kill the RavenDB process in our own lab somehow. The exit code was 139 (segmentation fault), but we weren’t sure what was actually going on. We were trying all sorts of stuff on the machine, seeing what would cause this. We basically fed it all sorts of data that we had lying around and saw if it would choke on that. One particular data export would sometimes cause a crash. Sometimes. I really really hate this word. That meant that we were stuck with trying to figure out something by repeatedly trying and relying on the law of averages.

It took several more days, but we figured out that a certain sequence of operations would reliably cause a crash within 5 – 30 minutes. As you can imagine, this made debugging pretty hard. The same sequence of operations on Intel machines, either 32 bits or 64 bits, worked without issue, regardless of how many times we repeated them.

We followed several false trails in our investigation of RavenDB’s 32 bits memory management code. We had a few cases where we thought that we had something, but nothing popped up. We instrumented the code and verified that everything seemed kosher, and it certainly did, but the system still crashed on occasion.

RavenDB usually relies on mmap() to access the data on disk, but on 32 bits, we couldn’t do that. With an addressable memory of just 2 GB, we cannot map the whole file to memory if it is too large. Because of that, we map portions of the file to memory as needed for each transaction. That led us to suspect that we were somehow unmapping memory while it was still in use, or something like that. But we have gone through the code with a fine tooth comb and got nothing. We used strace to try to point out what was going on and we could see that there were no surprise calls to munmap() that shouldn’t be there.

What was really nasty was the fact that when we failed with SIGSEGV, the error was always on an address just past the area of memory that we mapped. This led us to suspect that we had an out of bounds write, and led to a chase for that rogue pointer operation. We instrumented our code ever more heavily, but weren’t able to find any such operation. All our reads and writes were in bounds, and that was incredibly frustrating. RavenDB is a CoreCLR application. As such, debugging it on an ARM device is… challenging. We tried lldb and gdb. Both allow unmanaged debugging, but even with lldb, we couldn’t debug managed code or even just pull the managed stack properly on ARM. Eventually we found this extension which allows SSH debugging on the Raspberry Pi from a Windows machine.

That helped, and we finally figured out where in our managed code the error happened. This always happened during a copy of memory from a document write to a scratch buffer in a memory mapped file. The entire thing was wrapped in boundary checks and everything was good.

We went back to the drawing board and attempted to set it on fire, because it was no good for us. Once we put the fire out, we looked at what remained and had a Eureka! moment. One of the differences between ARM and x86/x64 machines is in how they treat alignment. On x64/x86, alignment is pretty much a non-issue for most operations. On ARM, however, an unaligned operation will cause a CPU fault. That led us to suspect that the SIGABRT error we got was indeed an alignment issue. Most of our code is already aligned in memory because, while it isn’t mandatory on x64/x86, it can still get better perf in certain cases, but it is certainly possible that we missed something.

We discovered a horrifying problem:

(Image: the code in question, using CopyBlock)

We were using the CopyBlock method, and obviously that was the issue, right? We wrote a small test program that simulated what we were doing and used unaligned CopyBlock and it just worked. But maybe our situation is different?

Using CopyBlockUnaligned on x86 led to a 40% performance drop (we call this method a lot) and initially it looked like it fixed the problem on ARM. Except that on the third or fourth attempt to reproduce the problem, we ran into our good old SIGSEGV again, so that wasn’t it. This time we went to the drawing board and broke it.

During this time, we managed to capture the error inside the debugger several times; here is what this looked like:

(Image: the faulting assembly, captured in lldb)

Reading ARM assembly is not something that I’m used to doing, so I looked at the manual, and it looks like this instruction is to store multiple registers in descending order and… no clue beyond that. It didn’t make any sort of sense to us.

At this point, we were several weeks and four or five people into this investigation (we consider such issues serious). We had instrumented our code to the point where it barely ran, we could manage to reproduce the error in a relatively short time and we were fairly convinced that we were doing things properly. Going over the kernel code for memory mapping and unmapping several times, stracing, debugging, everything. We were stumped. But we also had enough data at this point to be able to paint a fairly clear picture of what was going on. So we opened an issue for the CoreCLR about this, suspecting that the issue was in the implementation of CopyBlockUnaligned.

We got a strange response, though: “This assembly code doesn’t make any sense”. I did mention that I have no idea about ARM assembly, right?  We tried reproducing the same thing in gdb, instead of lldb and got the following assembly code:

(Image: the same assembly, as shown by gdb)

This looked a lot more readable, to be sure. And it was extremely suspicious. Let me explain why:

(Image: the suspicious portion of the assembly)

The faulting instruction is: ldr r3, [r0, #0]

What this says is basically, read a word from the address pointed to by r0 (with 0 offset) into r3.

Now, r0 in this case has this value: 0x523b3ffd. Note the last three characters, ffd.

We are running this on a 32 bits machine, so a word is 4 bytes in size. 0xFFD + 4 = 0x1001, which is past the 4KB page boundary at 0x1000.

In other words, we had a read beyond the current page boundary. In most cases, the next page is mapped, so everything goes smoothly. In some cases, the next page is not mapped, so you are going to get an access violation trying to read a byte from the next page.

The fix for this is here:

(Image: the fix)

This is literally a single character change. And probably the worst lines-of-code to time-invested ratio that I have ever seen for any bug. Actually, there wasn’t even any code change in RavenDB’s codebase, so that is 0 lines of code for 4 people x 4 weeks.

The good thing is that at least we have proven that the 32 bits memory code is rock solid, and we have a much better understanding of how to resolve the next issue.

time to read 4 min | 603 words

The flagship feature for RavenDB 4.2 is graph queries, but there are a lot of other features that also deserve attention. One of the more prominent second-string features is pull replication.

A very common deployment pattern for RavenDB is to have it deployed to the edge. A great example is shown in this webinar, which talks about deploying RavenDB to 36,000 locations and over half a million instances. To my knowledge, this is one of the largest single deployments of RavenDB, and this deployment model is frequent among our users.

In the past few months I talked with users who use RavenDB on the edge for the following purposes:

  • Ships at sea, where RavenDB is used to track cargo and ongoing manifest updates. The ships do not have any internet connection while at sea, but connect to headquarters when they dock.
  • Clinics of healthcare providers, where each clinic has a RavenDB instance and can operate completely independently if the network is down, but communicates with the central data center during normal operations.
  • Industrial robots, where each robot holds its own data and communicates occasionally with a central location.
  • Using RavenDB as the backend for an application running on tablets used out in the field, which only has a connection to the central database when back in the office.

We call such deployments the hub & spoke model and distinguish between the types of nodes that we have: edge nodes and the central node.

Now, to be clear, both the edge and the central can be either a single node or a full cluster; it doesn’t matter for our discussion.

Pull replication in RavenDB allows you to define the replication definition on the central node once. On each of the edge nodes, you define the pull replication definition and that is pretty much it. Each edge node will connect to the central location and start pulling all the data from that database. On the face of it, it seems like a pretty simple process and not much different from external replication, which we already have in RavenDB.

The difference is that external replication is defined on the central node, for each of the nodes on the edges. Pull replication is defined once on the central node and then defined on each of the edges. The idea here is that deploying a new edge node shouldn’t have any impact on the central database. It is pretty common for users to deploy a new location, and you don’t want to have to go and update the central server whenever that happens.

There are a few other aspects of this feature that matter greatly. The most important of them is that it is the edge that initiates the connection to the central node, not the central to the edge. This means that the edge can be behind NAT and you don’t have to worry about tunneling, etc.

The second is about security. Pull replication has its own security measures. When you define a pull replication on the central node, you also set up the certificates that are allowed to use it. Those certificates are completely separate from the certificates that are used to access the database in general. So your edge nodes don’t have any access to the database at all; all they can do is set up the channel for the central node to send them the data.

This is going to make edge deployments and topologies a lot easier to manage and work with in the future.

time to read 1 min | 85 words

The node.js RavenDB client had a milestone release last month, which I only now got around to talking about. The most important aspect of this release is that node.js users can now use subscriptions to process data from RavenDB.

Here is how you can do this:
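
Here is a minimal sketch using the node.js client. The URL, database name and subscription query are assumptions, and the exact API shape may differ a bit between client versions:

    const { DocumentStore } = require("ravendb");

    const store = new DocumentStore("http://localhost:8080", "Northwind");
    store.initialize();

    async function processOrders() {
        // Create the subscription once; the returned name identifies it on the server.
        const subscriptionName = await store.subscriptions.create({ query: "from Orders" });

        // Open a worker that continuously receives batches of matching documents.
        const worker = store.subscriptions.getSubscriptionWorker({ subscriptionName });

        worker.on("batch", (batch, callback) => {
            for (const item of batch.items) {
                console.log("processing order", item.id);
            }
            callback(); // acknowledge the batch so the server can send the next one
        });

        worker.on("error", err => console.error("subscription error", err));
    }

    processOrders();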

This is one of the most powerful abilities RavenDB has: you will continuously get new orders in your script and be able to process them. This makes it very easy to build features such as data pipelines, background processing, etc.

time to read 6 min | 1032 words

I talked a lot about how graph queries in RavenDB will work, but one missing piece of the puzzle is how they are going to be used. I’m going to use this post to discuss some of the options that the new feature enables. We’ll take the following simple model, issue tracking. Let’s imagine that a big (secret) project is going on and we need to track down the tasks for it. On the right, you have an image of the permissions graph that we are interested in.

The permission model is pretty standard, I think. We have the notion of users and groups. A user can be associated with one or more groups. Group memberships are hierarchical. An issue’s access is controlled either by giving access to specific users or to a group. Assigning a group will also allow access to all of the group’s parents and to any user that is associated with any of them.

Here is the issue in question:

(Image: the issue document)

Given the graph on the right, let’s see, for each user, how we can find out what issues they have access to.

We’ll start with Sunny, who is named directly in the issue as allowed access. Here is the query to find the issues on which we are directly named.

(Image: the graph query for issues that name the user directly)

This was easy, I must admit. We could also write this without any need for graph queries, because it is so simple:

(Image: the equivalent non-graph query)

It gets a bit trickier when we have to look at Max. Unlike Sunny, Max isn’t directly named. Instead, he is a member of a group that is directly named in the issue. Let’s see how we can query for issues that Max has access to:

(Image: the graph query for issues that Max has access to)

This is where things start to get interesting. If you look at the match clause, you’ll see that we have arrows that go both left and right. We are telling RavenDB that we want to find issues that have the same group (denoted as g in the query) as the user. This kind of query is something you can’t express in RavenDB today without the graph query syntax. Another way to write this query is to use explicit clauses, like so:

(Image: the same query, written with explicit clauses)

In this case, all the arrows go in a single direction, but we have two clauses that are and-ed together. These two queries are exactly the same; in fact, RavenDB will translate the one with arrows going both ways into the one with the and.

The key here is that between any two clauses with an and, RavenDB will match the same alias to the same value.

So we can now find issues for Max and Sunny, but what about all the rest of the people? We can, of course, do this manually. Here is how we can find issues for people one group removed from the issue.

(Image: the query for issues one group removed from the user)

This query gives us all the issues for Nati that are one group removed from him. That works, but it isn’t how we want to do things. It is a good exercise in setting out the structure, though, because instead of hard coding the pattern, we are now going to use recursion.

(Image: the recursive version of the query)

The key to this query is the recursive element in the third line. We move from the issue to its groups, and then we recurse from the issue’s groups to their parents. We allow empty recursion and we follow all the paths in the graph.

On the other side, we go from the user and try to find a match from the user’s group to any of the groups that end the query. In Nati’s case, we went from project-x group to team-nati, which is a match, so we can return this issue.

Here is the final query that we can use. It is a bit much, even if it is short, so I will deconstruct it below:

(Image: the final, parameterized query)

We use a parameterized query here, to make things easier. We start from a user and find an issue if:

  • The issue directly names the user as allowed access. This is the case for Sunny.
  • The issue’s groups (or their parents) match the groups that the user belongs to (non recursively, mind).

In the case of Max, for example, we have a match because the recursive element allows a zero length path, so the project-x group is allowed access and Max is a member of it.

In the case of Nati, on the other hand, we have to go from project-x to team-nati to get a match.

If we set $uid to users/23, which is Pheobe, all the way to the left in the graph, we’ll also have a match. We’ll go from project-x to execs to board and then find a match.

Snoopy, on the other hand, doesn’t have access. Look carefully at the direction of the arrows in the graph. Snoopy belongs to the r-n-d group, and that group is a child of execs, but the query we specified only goes up to the parents, not down to the children, so Snoopy is not included.

I hope that this post gave you some ideas about what kind of use cases are enabled by graph queries in RavenDB. I would love to hear about any scenarios you have for graph queries so we can play with them and see how they are answered by RavenDB.

time to read 3 min | 565 words

Graph queries, as I have discussed them so far, give you the ability to search for patterns. On the right, you can see the family tree of the royal family of Great Britain going back a few hundred years. That makes for an interesting subject for practicing graph queries.

A good example of a question we might want to ask is: who is the royal grandparent of Elizabeth II? We can do that using:

(Image: the graph query for the royal grandparent of Elizabeth II)

This is great, and nicely demonstrates how we can scan for specific patterns in the graph. However, it is limited by its rigidity. For example, let’s say that I want to find someone in the family tree but I’m not sure about the exact nature of the relationship.

“We are not amused” comes to mind, but off the top of my head and without consulting the chart, I don’t think that I would be able to figure it out. Luckily, I don’t have to, I can ask RavenDB to be so kind and tell me.

(Image: the recursive query, going up the parents looking for an unamused royal)

Note the use of the recursive element here. We are asking RavenDB to start at a particular document and go up the parents, trying to find an unamused royal. The recursive portion of the query can be zero to six steps in size and should abort as soon as we have any match. Following the zero to six parents, there should be a parent that is both a royal and unamused.

The Cypher syntax for what they call variable length queries is reminiscent of regular expressions, and I don’t mean that in a complimentary manner. Looking at the query above, you might have noticed that there is a distinct difference between it and the first one. The recursive query will go up the Parents link regardless of whether that parent is royal or not. RavenDB graph queries have what I believe to be a unique feature: the recursive pattern isn’t limited to a single step and can be as complex as you like.

For example, let’s ensure that we are only going to go up the chain of the royal parents.

(Image: the recursive query, restricted to royal parents only)

The recursive element has a few knobs that you can tweak. The minimum and maximum distance are obvious examples, but the result criteria for the recursion is also interesting. In this query, we use shortest, instead of lazy. This will make RavenDB work a bit harder and find the shortest recursive path that matches the query, whereas lazy stops on the first one that matches. The following options are available:

  • Lazy – stop on the first path that matches the pattern. Good for: “Am I related to Victoria?”
  • Shortest – find the shortest path that matches the pattern. Good for: “How am I related to Victoria?”
  • Longest – find the longest path that matches the pattern. Good for: “For how many generations has Victoria’s family been royals?”
  • All – find all the paths that match the pattern. Good for: finding all the paths in your ancestry to Victoria.
time to read 3 min | 462 words

In my previous post, I discussed some options for changing the syntax of graph queries in RavenDB from Cypher to be more in line with the rest of the RavenDB Query Language. We have now completed that part and can see the real impact it has on the overall design.

In one of the design reviews, one of the devs (who has built non trivial applications using Neo4J) complained that the syntax is now much longer. Here are the before and after queries to compare:

(Image: the before and after queries)

The key, from my perspective, is that the new form is more explicit and easier to read after the fact. Queries tend to grow more complex over time, and they are read a lot more often than they are written. As such, I absolutely want to lean toward being readable over being terse.

The example above just shows the extra characters that you need to write. Let’s talk about something that is a bit more complex:

(Image: a more complex query in the new syntax)

Now we have a lot more text, but it is a lot easier to understand what is going on. Focus especially on the Lines edge, where we can very clearly separate what constitutes the selection on the edge, the filter on the edge, and the property that contains the actual linked document id.

The end result is that we now have a syntax that is a lot more consistent and approachable. There are other benefits, but I’ll show them off in the next post.

A major source of annoyance for me with this syntax was how to allow anonymous aliases. In the Cypher syntax we used, you could do something like:

(Image: a Cypher-style query using anonymous aliases)

There is a problem with how to express this kind of anonymous alias syntax with the collection-as-alias mode. I initially tried to make it work by saying that we’ll look at the rest of the query and figure it out. But that just felt wrong. I didn’t like the inconsistency. I want a parse tree that I can look at in isolation and know what is going on. Simplifying the language is something that pays dividends over time, so I eventually decided that the query above will look like this with the new syntax:

(Image: the same query in the new syntax, using underscore)

There is a lot of precedent for using underscore as the “I don’t care” marker, so that works nicely and resolves any ambiguity in the syntax.
