Ayende @ Rahien


Graphs in RavenDB: Selecting the syntax

time to read 5 min | 981 words

When we started building support for graph queries inside RavenDB, we looked at the state of the market in this regard. There seem to be two major options: Cypher and Gremlin. Gremlin is basically a fluent interface that represents a specific graph pattern, while Cypher is a more abstract way to represent the graph query. I don’t like Gremlin, and it doesn’t fit into the model we have for RQL, so we went for the Cypher syntax. Note the distinction between went for Cypher and went for the Cypher syntax.

One of the major requirements that we have is fitting into the pre-existing Raven Query Language, but the first concern we had was just getting started and getting some idea about our actual scenarios. We are now at the point where we have written a bunch of graph queries and gained a lot more experience in how they mesh into the overall environment. And at this point, I can really feel that there is an issue in meshing the Cypher syntax into RQL. They don’t feel the same at all. There are a lot of good ideas there, make no mistake, but we want to create something that flows as a cohesive whole.

Let’s look at some of our queries and how we can better express them. The one I have talked about the most is this:

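The original post showed the query as an image; reconstructing it from the elements described below (the select projection is illustrative), it looked roughly like this:

    match (a:Dogs (id() = 'dogs/arava'))-[l:Likes]->(b:Dogs)
    select a.Name, b.Name // illustrative projection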

Let’s see what we have here:

  • match is the overall clause that applies a graph pattern query to the dataset.
  • () – is an indication of a node in the graph.
  • [] – is an indication of an edge.
  • a:Dogs, l:Likes and b:Dogs – these are alias and path specifications.
  • -[]-> – is an indication of an edge between two nodes.
  • (expression) – is a filter on a node or an edge.

I’m ignoring the select statement here because it is just the usual RQL select statement.

The first thing that keeps biting us is the filter in (a:Dogs (id() = 'dogs/arava')). I keep being tripped up by missing the closing ), so that has got to go. Luckily, it is very obvious what to do here:

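Reconstructed the same way, the query becomes something along these lines:

    match (a:Dogs where id() = 'dogs/arava')-[l:Likes]->(b:Dogs)
    select a.Name, b.Name // illustrative projection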

We use an explicit where clause, instead of the (), to express the inline filter. This fits a lot more closely with how the rest of RQL works.

Now, let’s look at the aliases: (b:Dogs). The alias:Collection syntax is pretty foreign to RQL; we tend to use the Collection as alias syntax. Let’s see how that would look, shall we?

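A sketch of the same query in the Collection as alias style:

    match (Dogs as a where id() = 'dogs/arava')-[Likes as l]->(Dogs as b)
    select a.Name, b.Name // illustrative projection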

This looks a lot more natural to me, and it is a good fit with RQL in general. This syntax does bring a few things to the table. In particular, look at the edge. In Cypher, an anonymous edge would be [:Likes]; using this method, we will have just [Likes].

However, as nice as this syntax is, we still run into a problem. The query above is actually just a shorthand way to write the full query, which looks like so:

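A sketch of the two variants; the exact shape of the with { ... } as alias and with edges(...) clauses here is illustrative. The first spells everything out, the second leaves the Likes edge implicit:

    with { from Dogs where id() = 'dogs/arava' } as a
    with { from Dogs } as b
    with edges(Likes) as l
    match (a)-[l]->(b)

    with { from Dogs where id() = 'dogs/arava' } as a
    with { from Dogs } as b
    match (a)-[Likes]->(b)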

In fact, we have two queries here, to show off the actual problem we have in parsing. In the first case, we have a match clause that only refers to explicit with statements. In the second case, we have a couple of explicit with statements, but also an implicit with edges expression (the Likes).

From the point of view of the parser, we can’t distinguish those two. Now, we can absolutely say that if the edge expression contains a single name, we’ll simply look for an edge with that name and otherwise assume that this is the path that will be used.

But this seems to be error-prone, because you might have a small typo or remove an edge statement and get a completely different (and unexpected) meaning. I thought about adding some sort of prefix to help tell an alias from an implicit definition, but that looks very ugly, see:

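As a purely hypothetical illustration, with $ standing in for whatever prefix character would mark an explicitly defined alias:

    with { from Dogs where id() = 'dogs/arava' } as a
    with { from Dogs } as b
    with edges(Likes) as l
    match ($a)-[$l]->($b) // the $ prefix is hypothetical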

And on the other hand, I really like the -[Likes]-> syntax in general. It is a lot cleaner and easier to read.

At this point, I don’t have a solution for this. I think we’ll go with the mode in which we can’t tell what the query is meant to say just from the parser, and look at the explicit with statements to figure it out (with the potential for mistakes that I pointed out earlier) until we can figure out something better.

One thing that I’m thinking about is that the () and [], which help distinguish between nodes and edges, aren’t actually required for us if we have an explicit statement. So we can write it like so:

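A sketch of that idea; anything written without () or [] would have to refer to an explicit with statement:

    with { from Dogs where id() = 'dogs/arava' } as a
    with { from Dogs } as b
    with edges(Likes) as l
    match a-l->b // bare names refer to the explicit aliases above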

In this manner, we can tell, quite easily, whether you meant to define an implicit edge / node or to refer to an explicitly defined alias. I’m not sure whether this would be a good idea, though.

Another issue we have to deal with is:

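A reconstruction in the original inline style (the Strength property on the edge is hypothetical):

    match (a:Dogs (id() = 'dogs/arava'))-[l:Likes (Strength > 3)]->(b:Dogs) // Strength is a hypothetical edge property
    select a.Name, b.Name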

Note that in this case, we have a filter expression on the edge as well. Applying the same process we have used so far, we get:

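And expanded into the explicit form (again, Strength is hypothetical, and the exact placement of the edge filter is a guess):

    with { from Dogs where id() = 'dogs/arava' } as a
    with { from Dogs } as b
    with edges(Likes where Strength > 3) as l // edge filter placement is illustrative
    match (a)-[l]->(b)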

The advantage here is that this is very clear and obvious about what is going on. The disadvantage is that it takes quite a bit longer to express.

Graphs in RavenDB: I didn’t mean to build this feature!

time to read 2 min | 258 words

I was busy working on the implementation of filtering in graph queries, as discussed in my previous post. What I ended up implementing is a way for the user to tell us exactly how to handle the results. The actual query we ended up with is this:

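The query was shown as an image; based on the description, it was something along these lines (the projection is illustrative):

    match (Dogs as a)-[Likes]->(Dogs as b)-[Likes]->(Dogs as c)
    where a != c
    select a.Name, b.Name, c.Name // illustrative projection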

And the key part here is the where clause, where we state that a and c cannot be the same dog. This also matches the behavior of SQL, and for predictability alone, that’s a good idea.

However, I didn’t just implement inequality; I implemented full filtering capabilities, and you can access anything in the result. Which means that this query is now also possible:

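A very rough reconstruction, assuming Northwind-style sample data (Orders, Products, Categories) and illustrative property names; the original image may well have differed:

    match (Orders as o)-[Lines.Product as l]->(Products as p)-[Category]->(Categories as c)
    where c.Name in ('Beverages', 'Produce') and l.PricePerUnit != p.PricePerUnit // edge syntax and property names are illustrative
    select id(o), p.Name, l.PricePerUnit, p.PricePerUnit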

I’ll give you a moment to analyze this query in peace. Try to de-cypher it (pun intended).

What this query is doing is comparing the actual sale price and the regular price of a product on a particular order, for products that match a particular set of categories.

This is a significant query because, for the first time in RavenDB, you have the ability to perform such a query (previously, you would have had to define a specific index for it).

In other words, what graph query filtering brings to the table is joins. I did not set out to build this feature, and I’m feeling very strange about it.

Stabilization stories: Slow TCP under Linux

time to read 3 min | 578 words

RavenDB is pretty big: over 600,000 lines of C# code and over 220,000 lines of TypeScript code. In a codebase that large, there are unexpected interactions between different components (written by us, by third parties and even by the operating systems we use).

Given how important the stability of RavenDB is, we spend quite a bit of time (as in, the majority of it) not writing new features, but ensuring that the system is stable, predictable and observable. One part of that is a large suite of tests, which are being run on a variety of machines and conditions.

Some of these tests fail, in which case we fix them. A failing test is wonderful, because it tells us that something is wrong. A predictably failing test is a pleasure, because it states, in unambiguous terms, what is going on and what the trouble is. I love getting a failing test; there is usually a pretty straightforward way to figure out what went wrong and then to actually fix it.

Then there are tests that fail occasionally, and I really hate them. They almost always relate to some sort of race condition. Sometimes the race is in the test itself, but sometimes the problem is in the actual code. The problem is that tracking down such an issue is pretty hard and annoying. The more frequently we can induce the failure, the faster we can actually get to resolving it.

We recently had a test that failed, very rarely, and only on Linux.

The debugging landscape* on Linux is dramatically poorer compared to Windows, so that adds another hurdle.

* Yes, we have JetBrains’ Rider, and it is great. But it is still quite far from the debugging capabilities of Visual Studio, especially for non-trivial debugging.

The test failed because of a timeout waiting for a cluster to fully disseminate changes among all the members in the cluster. That means that we had a test that would spin up three to five independent nodes, combine them into a cluster, create a database that is shared among all these nodes, write documents to one of the nodes and then validate that the documents are indeed on all the nodes. A failure there, and a timeout failure at that, means that we have to inspect pretty much the whole system.

Luckily, we had some good people on this issue, and they managed to come up with a minimal reproduction. All it took was to spin up a TcpListener and a TcpClient and have them talk to one another, then do the same using SSL. We got some really interesting results because of that.

                               Windows      Linux           Diff
  Single Threaded – Plain      192.8        200.8           104%
  Single Threaded – SSL        5,762.3      667,549.8       11,584%
  Concurrent (200) – Plain     11,377.5     932,487.9       8,195%
  Concurrent (200) – SSL       145,494.8    35,283,175.3    24,250%

As you can see, there is a minor discrepancy in the performance of TCP connection times. All the tests were run on the same machine, testing over localhost.

We opened an issue for this problem, and for now we deal with it by accepting that the connection time can be very long and adjusting the timeout for the test.

Don’t shove that (cool) feature down my throat, please

time to read 2 min | 220 words

“I just found out that you can do Dancing Rhinos in 4D if you use FancyDoodad 2.43.3” started a conversation at the office. That is pretty cool, I’ll admit; getting Rhinos to dance at all is nice, and in 4D is probably nicer. I wasn’t aware that FancyDoodad had this feature at all. Great tidbit and something to discuss over lunch or coffee.

The problem is that the follow-up was something along the lines of: “Now I wonder how we can use FancyDoodad’s cool feature for us. Do you think it can solve the balance issue for this problem?”

Well, this problem has nothing to do with Rhinos, wildlife, dancing or (hopefully) dimensional math. So while I can see that if you had a burning enough desire and only a hammer, you would be able to use FancyDoodad to try to solve this thing, I don’t see the point.

The fact that something is cool doesn’t mean that it:

  • Is useful.
  • Ought to go into our codebase.
  • Solves our actual problem.

So broaden your horizons as much as possible; learn as much as you can ingest. But remember that everything starts at negative one hundred points, and coolness on its own doesn’t affect that math.

Paranoid decisions and OMG customers

time to read 3 min | 427 words

I used to be a consultant for a long while, and that meant that I worked on a lot of customer projects. That led to me seeing and acting in some really strange ways.

Sometimes you go into a codebase and you can’t really believe what you see there. I think that this is similar to how an archeologist feels, seeing just remnants of something and having to deduce the forces that drove the people who built it. In some cases, what looks like bad code is actually a reaction to a bad policy that people are trying to work around.

I think that the strangest of these cases was when I was working for a customer that refused to let external consultants use their internal source control system. Apparently, they had sensitive stuff there that they couldn’t isolate or something like that. They were using either Team Foundation Server or Visual SourceSafe, and I didn’t really want to use their source control anyway, so I didn’t push. I did worry about source control, so we had a shared directory being used as a Subversion repository; this was over a decade ago, mind.

So far, so good, and nothing really interesting to talk about. What killed me was that their operations team flat out refused to back up the Subversion folder. That folder was hosted on a shared server that belonged to the consulting company (but resided at the customer site), and they were unwilling either to back up a “foreign” computer or to provide us with a shared space to host Subversion that they would back up.

For a while, I would back up the Subversion repository every few days to my iPod, then take a copy of the entire source code history home with me. That wasn’t sustainable, and I was deeply concerned about the future of the project over time, so I also added a twist. As part of the build process, we packed the entire source directory of the codebase as an embedded resource into the binary. In this way, if the code was ever lost, which I considered to be a real possibility, I would have a way to recover it all back.

After we handed off the project, I believe they moved the source to their own repository, so we never actually needed that, but I slept a lot better knowing that I had a second string in my bow.

What is your craziest story?

Managing a multi-version project

time to read 4 min | 707 words


As I’m writing this, we have the following branches in the main repository of RavenDB. Looking at their history, we have:

  Branch    Last Commit     Number of commits this year
  v1.0      Feb 3, 2013     0
  v2.0      Oct 14, 2016    0
  v2.5      Oct 18, 2018    14
  v3.0      Aug 14, 2018    10
  v3.5      Oct 11, 2018    45
  v4.0      Oct 18, 2018    2,270
  v4.1      Oct 18, 2018    3,214
  v4.2      Oct 18, 2018    95

The numbers are actually really interesting. Branches v1.0 and v2.0 are legacy and no longer supported. Branch v2.5 is also legacy, but we have a few customers with support contracts that are still using it, so there are still minor bug fixes going on there occasionally. Most of the people on the 3.x line are using 3.5, which is now in maintenance mode, so you can see that there is very little work on the v3.0 branch and a bit of ongoing bug fixing for customers.

The bulk of the work is on the 4.x line. We released v4.0 in February of this year, and then switched to working on v4.1, which was released a couple of months ago. We actively started working on v4.2 this month. We are going to close down the v4.0 branch for new features at the end of this month and move it to maintenance mode as well.

In practical terms, we very rarely need to do cross-major-version work, but we do have a lot of previous / current / next parallel work. In other words, the situation right now is that a bug fix has to go to at least v4.1 and v4.2, and usually to v4.0 as well. We have been dealing with several different ways to handle this task.

For v4.0 and v4.1 work, which went on in parallel for most of this year, we had the developers submit two pull requests for their changes, one for v4.0 and one for v4.1. This increased the amount of work each change took, but the cost was usually just a few minutes at PR submission time, since we could usually just cherry-pick the relevant changes and be done with it. The reason we did it this way was to avoid big merges as we moved work between actively worked-on branches. That would have required having someone dedicated just to handling that, and it was easier to do it inline, rather than in a big bang fashion.

For the v4.2 branch, we are experimenting with something else. Most of the work is going on in the v4.1 branch at this point, mostly minor features and updates, while the v4.2 branch is seeing a much larger scope of changes. It doesn’t make sense to ask the team to send three PRs, and we are going to close down v4.0 this month anyway. What we are currently doing is designating a person who is in charge of merging the v4.1 changes to v4.2 on a regular basis. So far, we are still pretty close and there hasn’t been a big amount of changes. Depending on how it goes, we’ll either keep doing the dual PRs once v4.0 is retired from active status or see if the regular merges can keep going.

For feature branches, the situation is more straightforward. We typically ask the owner of the feature to rebase on a regular basis on top of whatever the baseline is, and the responsibility to do that is on them.

A long feature branch for us can last up to a month or so, but we had a few that took three months when the change was big. I tend to really dislike those, and we are trying to get them down to much shorter timeframes. Most of the work doesn’t happen in a feature branch; we’ll accept partial solutions (if they don’t impact anything else), and we tend to collaborate a lot more closely on code that is already merged rather than in independent branches.

The mental weight of open pull requests

time to read 1 min | 108 words

I just merged two PRs into RavenDB, and for the first time in a while, I got this beautiful number:

[screenshot: zero open pull requests]

For the past few months, we have been working on several long-running features, graph queries being the most obvious example. We are now at a stage where we are ready to pull all this work together, which means that all the long-running feature branches (and the discussions about them in the PRs) are merged into the next release branch.

And for a while, I can luxuriate in that wonderful feeling.

The perils of full system resource utilization

time to read 3 min | 471 words

The following quotes (or something very similar) came from our interactions with customers:

“We paid a lot of money for this hardware, why isn’t your database making full use of it?”

“The machine is peaking at 100% CPU, the sky is falling, help, NOW!”

This is a problem, because I can empathize with both sides. On the one hand, having just put a five or six figure sum into new hardware, it can be depressing to see it “going to waste”. On the other hand, seeing the system under high load gives you that sinking feeling that the boat is going to overturn at any moment and production will go down.

Balancing resource consumption is a really hard problem, mostly because we don’t have any control over our work intake. We can’t control how many requests we accept nor do we control what kind of work is being asked of us. Actually, that isn’t true. We could control that, but in most cases, that is a false distinction.

At some point, RavenDB had a limit on the maximum number of concurrent requests, and users have hit that limit in the past. This resulted in angry calls from customers about RavenDB refusing requests. The fact that we did that to maintain the overall health of the system was immaterial. Refusing requests meant that the system (or some portion of it) was down. In those cases, it was actually better, from the customer’s perspective, for the whole thing to slow down a bit, as long as there were no errors.

Inside RavenDB, we attempt to manage our CPU consumption using separation of concerns. First, we have the processing of requests. The assumption is that such requests end up being waited on by an actual human, directly or indirectly, so we process them first, prioritizing them above almost everything else. The only thing that has higher priority is the cluster health and monitoring system, which ensures that all nodes are up, running and in the same state.

As it turns out, RavenDB has a lot of additional processes internally that can be given a lower priority under load. For example, indexing, which RavenDB runs in the background, is something whose latency we can increase in order to free up resources for request processing.

We have a lot of experience in balancing the overall needs, and I’m still not sure that I have a good answer here. The reason for this post is that I just analyzed a dump file where it looked like requests were waiting for indexing to complete, but they were actually starving the indexes of the CPU time that they needed to actually run. The system progressed, but not fast enough for the user not to notice.

Actually, that is the primary criteria that we use. If the system is slow, but no one notices, the system ain’t slow.

The design & challenges of a RavenDB C++ client

time to read 3 min | 468 words

When I wrote the first version of RavenDB, I was coming off about six years of intensive work on NHibernate. I wanted the same level of convenience that I had with a world-class OR/M, with none of the relational constraints (pun intended).

Given that I was working in a managed language, features such as change tracking, unit of work, etc. came naturally. Since then, we have created clients for: C#, Java, Python, Node.JS, Ruby and Go. A common feature of all these languages is that they all have automatic memory management. Go, in particular, has been interesting, because while it deals with explicit pointers, there is no need to manually free memory.

We are now looking at what it would take to bring the same level of experience to a C++ client. For example, here is about the simplest CRUD scenario that I can think of:

This code isn’t showing anything special, until you realize that when you want to translate it to C++, you’ll need to take explicit memory ownership into account. Another issue to deal with is how we can implement seamless integration between business objects and JSON documents.

I looked at how this is handled in other, similar databases, and the answer seems to be: pretty badly.

At least, that is when I compare it to how much higher-level code can be in C++. Now, it is possible that C++ developers like working at this level. And certainly, the RavenDB client APIs actually have user-exposed layers that are similar to this, but that is something you’ll usually not need. Ideally, I want to be able to give the same level of experience to the C++ client as well.

The issue of JSON serialization actually seems to be well taken care of already. A user will need to define to_json and from_json functions to make this work, but given that C++ has no reflection, that seems a reasonable thing to ask. It also gives the user complete control over the serialization / deserialization process and avoids having to “customize” the JSON serialization, which you sometimes have to do.

The issue of memory ownership, though, is a bit more complex. The interface I was considering would have the RavenDB C++ client deal only in shared_ptr, with the idea that we can accept that the entities we manage may live longer than the lifetime of the session.

I’m no longer able to consider myself a C++ developer, and the dev we have working on the C++ stuff is currently busy learning RavenDB itself, so I thought this would be a good time to ask for feedback.

Both on the kind of interface that you would like to see for the C++ client and on whether this approach is going to work.

Answering the web developer task

time to read 1 min | 101 words

In my previous post, I talked about a task we give candidates who interview for the web developer position. They need to implement the following:

Given that I don’t like handing out tasks that I haven’t done myself, I took a few minutes to answer my own question. Here is how this can be implemented:

I believe that I mentioned that my JavaScript skills are from the last decade, if that, so I’m probably committing quite a few sins against JavaScript (if that is even possible), but this code ran the first time I tried it and gave the proper result.
