Ayende @ Rahien

Hi!
My name is Oren Eini, founder of Hibernating Rhinos LTD and RavenDB.

RavenDB 4.1 Release Candidate is out

time to read 1 min | 128 words

Last Friday, without much fanfare, we released the Release Candidate of RavenDB 4.1. This release concludes about six months of work to bring some new and awesome features to the table:

  1. Cluster wide transactions
  2. JavaScript indexes
  3. SQL Migration Wizard
  4. Distributed Counters
  5. RavenDB Embedded (for .Net, Python, etc)
  6. MongoDB & CosmosDB migration
  7. Results highlighting
  8. Subscription includes
  9. Explain query plan
  10. Explain query execution
  11. Query timing
  12. Better setup wizard and reducing deployment friction

And these are just the first dozen big features. We made a lot of improvements, supporting wildcard certificate generation in the setup process, better spatial units support and plenty of other stuff that isn’t sexy or attractive on its own but adds up to a major jump in what you can do with RavenDB.

Transactional Patterns: Conversation vs. Batch

time to read 6 min | 1136 words

When I designed RavenDB, I had a very particular use case at the forefront of my mind. That scenario was a business application talking to a database, usually as a web application.

These kinds of applications have a particular style of communication with the user. As you can see below, there are two very distinct operations: show the user the data, followed by some “think time” (seconds at minimum, but it can be much longer), and then an action.

image

This shouldn’t really be a surprise for anyone who has developed any kind of application in the last decade or two, so why do I mention this explicitly? I mention this because of the nature of communication between the application and the database.

Some databases use a conversation pattern with the application. In terms of API, this will look something like this:

  • BeginTransaction()
  • Update()
  • Insert()
  • Commit()

This is a very natural model and should be quite familiar to most developers. The alternative to this method is to use batches:

  • SaveChanges( [Update, Insert] )

I want to use this post to talk about the difference between the two styles and how that impacts your work. Relational databases use the conversation style, while RavenDB uses the batch style. On the surface, it looks like it would be more complex to use RavenDB to achieve the same task, but there is very little difference in the API as far as the user is concerned. In both cases, the code looks very much the same:
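To make the comparison concrete, here is a minimal sketch of the two styles in Python; the db object, the Shipment type and the session calls are illustrative rather than an exact client API.

# Conversation style: every call below is a separate trip to the relational server.
tx = db.begin_transaction()
tx.execute("UPDATE Orders SET Status = 'Shipped' WHERE Id = @id", {"id": order_id})
tx.execute("INSERT INTO Shipments (OrderId, ShippedAt) VALUES (@id, @at)",
           {"id": order_id, "at": shipped_at})
tx.commit()

# Batch style: the session tracks the changes locally and ships them all at once.
with store.open_session() as session:
    order = session.load(order_id)
    order.status = "Shipped"
    session.store(Shipment(order_id=order_id, shipped_at=shipped_at))
    session.save_changes()   # a single request carrying the whole transaction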

Behind the scenes, however, the RavenDB code will send just a single request to the server, while a relational database will need four separate commands to execute the transaction. In many cases, you can send all of these commands to the server in a single roundtrip, but that is an optimization that doesn’t always work and often isn’t applied even when it is possible.

Sidebar: Reducing server roundtrips

Why is the reduction in server roundtrips so important? Because it has a lot of implications for the overall performance of the system. In many cases the cost of making a remote query from the application to the database far outstrips the cost of actually executing the query. This ties closely to the Fallacies of Distributed Computing. Latency isn’t zero, even though it certainly seems that way when you develop locally.

The primary goal of this design in RavenDB was to reduce the number of network roundtrips that your application must endure. Because in the vast majority of cases your application is going to follow the “show data” / “modify data” pattern as two separate operations (often separated by a long idle time), there is a lot of value in having the database interaction model match what you will actually be doing.

As it turned out, there are some additional advantages (and disadvantages, which I’ll cover a bit later) to this approach, beyond just the obvious reduction in the number of server roundtrips.

When the server gets all the operations that need to be done in a single request, it can apply all of them at once. For that matter, it can choose how to apply them in the most optimal order. This gives the database server a lot more chances for optimization. It is similar to going to the supermarket with a list of items to purchase vs. a treasure hunt. When you have the full list, you can decide to pick things up based on how close they are on the shelves. If you only get the next instruction after you complete the previous one, you have no option for optimization.

When using the conversation style, durability and state management become more complex as well. Relational databases typically use some variation of ARIES for their journals. This is because they need to record information about ongoing transactions that haven’t yet been committed. This adds significant complexity to the amount of work that is required from the database engine. Furthermore, when running in a distributed system, you need to share this transaction state (which hasn’t yet been committed!) across the nodes to allow failover of the transaction if the server fails. With the conversation style, you need to support concurrent transactions all operating at the same time and potentially reading and modifying the same data. This leads to a great deal of code that is required to properly manage locking and latching inside the database engine.

On the other hand, batch mode gives the server all the operations in the transaction in a single go. This means that failover can simply be sending the batch of operations to another node, without the need to share complex state between them. It means that the database server has all the required information and can make decisions based on it. For example, if there are no data dependencies, it can execute the operations in the transaction in whatever order it desires, leading to more optimal execution time. The database can also mix & match operations from different transactions into a single batch (as long as it keeps the externally visible behavior consistent, of course) to optimize things even further.

There are two major disadvantages to batch mode. The first is that there is usually a strict separation of reads from writes. That means that you usually can’t get a single consistent read/modify operation that stays in the same transaction. The second issue is similar: because you need to generate all the operations ahead of time, you can’t make decisions about what operations to execute based on the data you read, at least not in the same transaction. The typical solution for that is to send a script in the batch. This script can then read / modify data in the same context, apply logic, etc. The important thing here is that this script runs inside the server, already inside the transaction. This means that you don’t pay network round trip time for such operations.
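RavenDB, for example, lets you send a JavaScript patch script as part of the batch; the server runs it inside the transaction. A rough sketch of the shape of such a batch, with hypothetical command and helper names:

# The script runs on the server, inside the transaction, so the
# read-check-modify logic costs no extra network round trips.
patch_script = """
    if (this.Status === 'Active') {
        this.Credits -= args.cost;
    }
"""

# Hypothetical batch construction - the exact client API differs, but the idea
# is that the script is just one more operation inside the single request.
batch = [
    {"op": "PATCH", "id": "customers/1-A",
     "script": patch_script, "args": {"cost": 10}},
    {"op": "PUT", "id": "invoices/1-A",
     "document": {"Customer": "customers/1-A", "Amount": 10}},
]
send_batch(batch)   # send_batch() is a stand-in for the client's batch call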

On the other hand, it means that you need to write potentially complex logic in the database’s scripting language, rather than your own platform, which you’ll likely prefer.

Luckily, for most scenarios, especially with web applications, you don’t need to execute complex logic on the server side. You can usually just send the commands you need in a single batch and be done with it. Often, just having optimistic concurrency is enough to get you the consistency you want, with scripting reserved for more exceptional cases.

RavenDB’s usage scenario was meant to make the common operations easy and the hard stuff possible. I think that we got it right and ended up with an API that is functional, highly performant and one that has withstood the test of time very well.

The iterative design process: Query parameters example

time to read 4 min | 660 words

When we start building a feature, we often have a pretty good idea of what we want to have and how to get there. And then we actually start building it and we often end up with something that is quite different (and usually much better). It has gotten to the point where we aren’t even trying to do hard specs and detailed design at anything beyond the exploratory levels. For example, in the design of RavenDB 4.0, there was not even a mention of RQL. That ended up being a very late addition to the codebase, but it improved RavenDB significantly. On the other hand, the low level mechanisms of zero copy documents from Voron all the way to the network were designed up front, but only at a fairly high level.

In this post, I want to talk about query parameters in RavenDB. Actually, let me be more specific: we have query parameters, but what we don’t have (or rather, didn’t have, because that will be merged in by the time you read this post) is the ability to run parameterized queries from the studio. We always meant to have that capability, but we ran out of time with the 4.0 release. As we are gearing up to the 4.1 release, we are clearing the major-minor issues off the table (major in terms of impact, minor in terms of the amount of work required). Query parameters in the studio are one such example. Here is what this looks like:

image

My first thought was to just build something like this:

image

Give the user the ability to define arguments and be done with it. The task was assigned to one of our developers and I expected to get a PR in a short while.

This particular developer has a tendency to consider not just the task at hand but also other aspects of the problem. He didn’t want the user to have to manually specify each argument, since that has poor ergonomics. Instead, he wanted the studio to figure it out on its own and help the user. So the first thing he did was detect the arguments (regex: “\$\w+”) and present them in the grid. Then there was the issue of how to deal with edits, etc. Then he ran into another problem: types. Query parameters can be more than just strings, they can be any JSON data type.

Here is what he came up with:

image

Instead of having to define the query parameters in a separate location, just put them right in. Having the parameters grid involves pointing and clicking with the mouse, entering possibly complex values (such as long arrays) and, in general, much more work than just having them right above the query.

Note that this is a studio-only feature; queries from the client API already have ways to specify arguments properly. So the next question is how we are going to handle passing the arguments to the server. Remember, this is only in the studio, so we can take quite a few shortcuts. In this case, we’ll simply snip the entire first section of the query text (which contains the query parameters). We can do that by going from the start of the query to the first from or declare keyword. We do some basic pre-processing to turn “$name = …“ into “results.$name = …“ and then just execute this code in the browser, giving us a JS object with all the parameters that we can then send to the server.
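Roughly what that detection and pre-processing step looks like (my own approximation, in Python; the studio does this in JavaScript in the browser, and the query text here is illustrative):

import re

query_text = """$status = "Shipped"
$minTotal = 500

from Orders
where Status = $status and Total >= $minTotal"""

# Detect the parameters used anywhere in the query text.
params_used = set(re.findall(r"\$\w+", query_text))

# Snip everything before the first 'from' or 'declare' keyword; that prefix
# holds the parameter assignments, the rest is the query itself.
m = re.search(r"^\s*(from|declare)\b", query_text, re.IGNORECASE | re.MULTILINE)
prefix, query = query_text[:m.start()], query_text[m.start():]

# Turn "$name = ..." into "results.$name = ..." so the prefix can run as a
# script that fills a results object with the parameter values.
script = re.sub(r"^(\s*)(\$\w+)\s*=", r"\1results.\2 =", prefix, flags=re.MULTILINE)
print(params_used)
print(script)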

The next stage is to make this discoverable, by detecting parameters whose value is not provided and giving the user a quick fix to add them.

Subtle corruption and the debugger

time to read 1 min | 172 words

We had a bug. If a certain method was called, we would do something very bad. Here is the fix for this issue:

image

Basically, we assumed that the passed pointer is a char pointer and not a UTF8 byte pointer. That led to a horrible mess down the line, including the fact that the length that was passed to the constructor is twice the size of the actual allocated memory.

In rare cases, that would be enough to push us to the next page of memory. If that page of memory wasn’t mapped, we would die with an access violation.
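Roughly the arithmetic involved, as a Python illustration of the mismatch (the real code is C# working with raw pointers):

text = "a perfectly ordinary ascii string"
utf8_buffer = text.encode("utf-8")

bytes_allocated = len(utf8_buffer)
# The buggy assumption: treat the UTF-8 byte count as a count of 2-byte chars,
# so the constructor is told the buffer is twice its real size and may read
# right past the end of the allocation, possibly onto an unmapped page.
bytes_the_code_thought_it_had = len(utf8_buffer) * 2

print(bytes_allocated, bytes_the_code_thought_it_had)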

There is just one problem with this scenario. We never called this method in our codebase. This method was implicitly called by the debugger to show nice strings. Which meant that during debugging, sometimes, we would corrupt our own state and end up killing ourselves. For fun, this is the kind of error that literally cannot be debugged.

The candidate’s portfolio

time to read 1 min | 136 words

This is a screen shot from a CV I just read:

image

The CV itself is kind of boring. Just graduated, did so and so course and was excellent in foo and bar.

We see dozens of CVs like that on a regular basis. But the portfolio link was very nice. It linked to a Google Drive folder with a bunch of games that the candidate made, in various languages.

I didn’t actually go and read all the code, but I skimmed through a bunch of the projects there. I actually like the portfolio a lot better than a GitHub link. A portfolio is explicitly about showing a potential employer what you can do. A GitHub link can be used for many things.

The perf optimization that cost us

time to read 4 min | 609 words

There is a lock deep inside RavenDB that has been bugging me for a while. In essence, this is a lock that prevents a read transaction from opening during a particular section of the commit process of a write transaction. The section under lock does no I/O and only manipulates in-memory data structures. Existing read transactions aren’t impacted; it is only the opening of a read transaction that is blocked. It isn’t usually visible in production systems. The only scenario you might expect to see it is if you have a very high degree of writes at the same time as you have a lot of read requests. In that case, RavenDB’s lock contention would rise and you would be able to process only about 5,000 – 10,000 read requests per second at the same time as you can process 30,000 write requests per second.  In practice, because of the way we process requests, you’ll very rarely be able to hit that level of concurrency in anything but a benchmark. But it bugged me. It was one of those scenarios that stood out like a sore thumb when we looked at the performance metrics. In any other case, we are orders of magnitude faster, but here we hit a speed bump.

So I fixed it. I changed the code so there would be no sharing of data between the write and read transactions, which eliminated the need for the lock. The details are fairly simple: I moved the structures being modified into a single class and then atomically replaced the pointer to this class. Easy, simple and much, much cheaper.
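Conceptually, the change looks something like the following sketch (in Python for illustration; the real code lives in RavenDB’s C# storage layer and the names here are made up):

import threading

class Snapshot:
    """An immutable view of the shared in-memory state."""
    def __init__(self, data):
        self.data = data

class StateHolder:
    def __init__(self):
        self._current = Snapshot({})
        self._write_lock = threading.Lock()   # writers still serialize among themselves

    def open_read_transaction(self):
        # Readers just grab the current snapshot reference; no shared lock is
        # needed because the reference swap below is atomic.
        return self._current

    def commit_write_transaction(self, changes):
        with self._write_lock:
            # Build a brand new snapshot instead of mutating the shared one...
            new_data = dict(self._current.data)
            new_data.update(changes)
            # ...and atomically replace the pointer to it. Read transactions
            # opened before this line keep seeing the old snapshot.
            self._current = Snapshot(new_data)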

This also had the side effect of slowing us down by a factor of 4. Here are the profiler results after the change:

image

That… is not a desirable property for a performance optimization.

Luckily, I was able to figure out from the profiler what was causing the slowdown:

image

As it turned out, there was another lock there. This one was used to avoid starving a particular portion of the code under a high write scenario. Until we removed the first lock, this worked beautifully. But when the first lock was removed, that actually sped up the rest of the system, which meant that we would acquire the second lock a lot more often. That, in turn, resulted in a rare occurrence becoming very common, and the contention on the lock caused a major slowdown.

We were able to figure out that the lock is now no longer needed, due to some other changes. We just removed this lock entirely, giving us over 10% better performance than the original code, without even getting to the actual scenario we are trying to test.

image

Now, you might have noticed that this benchmark I’m running is for pure writes, right? Let’s see what happens when we have more complex workloads, and are testing things without the profiler:

            Reads      Writes    2/3 Reads & 1/3 Writes
Original    385,510    85,688    91,984
Optimized   416,468    90,380    92,323

This table shows the scenarios and the requests per second for each. We can see just under a 10% improvement for reads and over a 5% improvement for writes (which are excellent numbers). However, the mixed workload, where we expected to see the biggest performance boost, has a difference that is pretty much a sampling error.

I’m not sure how I feel about that.

Reading the NSA’s codebase: LemonGraph review–Part VII–Summary

time to read 2 min | 344 words

As the final post in this series, I decided to see how I can create a complex query. Given the NSA’s statement, I decided to see if I can use LemonGraph to find a dog of interest. In particular, given our graph, I wanted to start with a particular dog and find another dog that likes this dog and that also likes a dog that dislikes the original.

As a reminder, here is the graph:

image

And starting from Arava, I want to find a dog that likes Arava that also likes a dog that isn’t liked by Arava.

The LemonGraph query language isn’t very expressive, or at least I’m not familiar enough with it to make it work properly. I decided on the following query:

n(type="dog", value="arava")->
               @e(type="likes", value="yes")->
n(type="dog")->
               @e(type="likes", value="yes")->
n(type="dog")->
               @e(type="likes", value="no")->
@N(type="dog", value="arava")

This is a bit of a brute force method to do this. It encodes the path directly. There are a few minor things that might not be obvious here. The @ prefix means don’t return this to the user, and the N() indicates that we shouldn’t filter out already-seen values. I can certainly see how this can be useful for certain things. I wonder if LemonGraph has a better way to express such a query.

This is the first time I actually reviewed this kind of codebase, where some things are done in C and some in Python. It was quite interesting to see the interaction between them. The codebase itself is really interesting, but I found it really hard to follow at times. The love affair with tuples and dynamic behavior made the code non-trivial and is likely going to cause maintenance issues down the line. It is also quite obvious that this is intended for internal consumption, with very little time or effort spent on “productization”. By that I mean things like better query errors and more obvious things to do.

It has an interesting approach to solving the problem of graph queries and I’ve learned quite a few things from it.

Reading the NSA’s codebase: LemonGraph review–Part VI–Executing queries

time to read 7 min | 1300 words

After going over many layers inside LemonGraph, I think we are ready to get to the real deal. We know how LemonGraph starts to execute a query. It finds the clause that is likely to have the smallest number of items and starts from there. These are the seeds of the query, but how is it going to actually process them?

Here is my graph:

image

And here is the query that I want to run on it: n()->e(type="likes", value="yes")->n()

LemonGraph detects that the cheapest source of seeds for this query is the edge and provides these to the MatchCTX class, which does the actual processing. Let’s explore how it works on a single seed.

The actual behavior starts from the matches() method, which starts here:

image

As usual for this codebase, the use of tuples for controlling behavior is prevalent and annoying. In this case, the idx argument controls the position of the target in the query, whether this is the start, the end or somewhere in the middle. The deltas define in what direction the matches should go, and the stops define where you must stop searching, I think.

Now that we setup the ground rules, the execution is as follows:

image

In this case, because the seed of our query is the edge (which is in the middle), it determines that deltas are (1, –1) and stops are (2, 0). We’ll also go to the first clause in the if statement. The link variable there controls whether we should follow links going in or out.

There is a lot of logic actually packed into the link initialization. The self.link is an array that has ‘next’ and ‘prev’ as its values, so the idea is that it finds what property to look at, then uses the dictionary syntax to get the relevant property and decides in what direction it should go.

We’ll focus on the _recurse() method for now. In this case, we are being passed:

  • idx = 1
  • delta = 1
  • stop = 2

Here is the actual method:

image

As you can see, we first validate that the match is valid (this is where we mostly check other properties of the value that we are currently checking, filtering them in place). Usually the do_seen will be set to True, which will ensure that we only traverse each node once. The main iteration logic is done in combination with the _next() method, shown below:

image

The first line there shows usage of delta, which controls in what direction to move. I’m not quite sure why the filter is set in the if statements, since it is also being set immediately after. This looks like redundant code that snuck in somehow.

The bulk of the work is shelled out to iterlinks, so let’s check what is going on there… I know that we are running on an edge here, so we need to look at the Edge.iterlinks() method:

image

There isn’t really much to tell here, it checks the filters and returns the incoming or outgoing edges, nothing more. On the other hand, the Node.iterlinks() implementation is a bit simpler:

image

We’ll explore exactly how the edges are loaded for a node in a bit. However, I do want to note right now that the _next() method isn’t passing the types of the edge, even though it looks to me like it has this information. Passing them would give a performance boost, because it could filter out a lot of edges in busy graphs.

Actually, I already looked at how iterators work in a previous post, so I won’t go over all of that again. This is basically calling this method:

image

The graph_node_edges_in() and graph_node_edges_out() methods are pretty simple, basically just scanning the DB_TGTNODE_IDX or DB_SRCNODE_IDX indexes, respectively. The graph_node_edges() method, however, is really strange.

Here it is:

image

It took me a few reads to figure out that this is probably a case of someone unpacking a statement for debugging but forgetting to remove the packed statement. The second return statement is ignored and likely stripped out as unreachable, but it is pretty confusing to read.

The graph_iter_concat() answers a question I had about how graph_iter_t is used; here is what it looks like:

image

This is using C’s support for passing an unknown number of parameters to a method. This basically builds a simple linked list, which also explains the usage of the _blarf() function and its behavior from a few posts ago.

So we are back where we started. We understand how the data flows into the Python code, so now let’s go back and look at the _recurse() and _next() calls.

Now I know a lot more about the behavior of the code, so this makes sense. The stop argument to _recurse() controls the depth, the number of links that will be traversed in a particular query. Now that I know how this particular clause works, I understand how LemonGraph goes from the edge to the node in e()->n(), but I don’t know how it goes back to find the full path.

The key to that is the calls to push() and pop() in the MatchCtx._recurse() method. These update the chain, like so:

image

In processing the query, we first append the edge to the chain:

image

The next step goes from the edge to its destination, giving us Oscar:

image

Remember that in the matches(), we got into this if clause:

image

This is when we start scanning an edge, which first adds things to the right, then scans things to the left in the _next() method. Look at the push() method there. By the time we get to the result(), we have iterated over both sides of the connection, giving us:

image

That is a very clever way of doing things.
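To make that concrete, here is a toy re-creation of the idea (my own reading, in Python, not LemonGraph’s actual code): start from the seed edge, walk toward one end of the pattern, then the other, building up the chain as you go.

edges = [  # (src, type, value, dst)
    ("arava", "likes", "yes", "oscar"),
    ("oscar", "likes", "yes", "arava"),
    ("oscar", "likes", "yes", "pheobe"),
]

def match(edge_type, edge_value):
    """Toy version of n()->e(type=..., value=...)->n(), seeded from the edge."""
    for src, etype, value, dst in edges:
        if etype != edge_type or value != edge_value:
            continue
        chain = [(src, etype, value, dst)]   # the seed edge is pushed first
        chain.append(dst)                    # expand to the right (delta = +1)
        chain.insert(0, src)                 # then expand to the left (delta = -1)
        yield tuple(chain)                   # result() sees the full path

for result in match("likes", "yes"):
    print(result)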

Let’s try doing things a little bit differently. I’m going to check a slightly different query: n(type="dog", value="arava")->e(type="likes", value="yes")->n()

If I understand things correctly, this is going to select the node as the starting point, because that is a lot more efficient. That will allow me to figure out the other side of this operation. I just ran this through the debugger, and this is indeed how it actually works.

The usage of yield to pass control midway is not trivial to follow, but it ends up being quite a nice implementation. This is enough for this post. In my next one, I want to explore a few of the possible operations that are exposed by LemonGraph.

Reading the NSA’s codebase: LemonGraph review–Part V–Query parsing

time to read 6 min | 1089 words

I said before that I don’t want to get into the details of how LemonGraph is dealing with parsing the queries. Unfortunately, I can’t avoid that. There seems to be a lot of logic, magic and mystery in the MatchLGQL() class, which is critical to understanding how queries work.

The problem is that either my Python-fu is lacking or it is just really hard to figure out a non-trivial codebase’s behavior in a dynamic language like Python. I find it hard to figure out what data is stored where and how it is manipulated. Therefore, I decided to break with my usual custom and actually run the code in the debugger to try to follow what is going on there. I tried to run this on WSL, but it crashed horribly, so I had to spin up a VM and set up PyCharm on it. This is the first time that I’m actually using it, and the experience is pretty nice so far. Being able to inspect things directly means that it is much easier to figure out the behavior of the code.

In order to explore how queries work in LemonGraph, I created the following graph, which represents the relationships between my dogs:

image

Here is how this looks like in code:

This tells us to find all the dogs that like each other. And it finds:

  • Arava –> Oscar
  • Oscar –> Arava
  • Oscar –> Pheobe

Now that we have a query that we can sink our teeth into, let’s figure out how this works, shall we? Inside the dreaded MatchLGQL() class, there are all sorts of regular expressions running to parse this thing, but eventually we get to the partially processed parsed query:

image

This screenshot might explain why I wasn’t happy with the code structure for figuring out what is going on without debugging. The number of tuples here is quite amazing, and they are used everywhere. This makes static analysis (as in, just reading the code) too hard for me. But with the debugger, it is much easier. If you are familiar with ASTs, this should be pretty easy to figure out.

Here is a piece of code that we already looked at (and criticized). This is in the munge_obj() method, where it is deciding how to optimize the query:

image

This piece of code is critical for the performance of the system. And it is really hard to understand. Here is what is going on.

The accel array tells a later piece of code how to accelerate the query, using the type or the type and value to start from a particular source. The info is used to carry state about a particular clause in the query. Before this code runs there is some code that builds the dictionary d, which is used to figure out the filters on the particular clause. This is fun, because it is using a missing key lookup in the dictionary for control flow.

Let’s follow the logic (a sketch of my reading follows the list):

  • Line 2 - If the clause operates on a node, rank it as 6. If it is an edge, rank it as 7.
  • Line 6 – If the clause has a type specified, rank it as 4 if it is a node, 5 if it is an edge. Otherwise, abort the optimization.
    • You might not see the “abort this optimization” in line 6, because it relies on the dictionary to throw if the key isn’t found. This is a common pattern in this code and something that I personally greatly dislike.
  • Line 8 – it uses the length of the type as a metric for secondary ranking. I’m not quite sure why this is the case. I guess the code needed a tie breaker, but I can’t imagine why the length of a type would have any impact on performance.
    • Unless, of course, the code assumes that shorter types are more common, and therefore will prefer to use the rare longer types?
  • Line 10 – If there is a type and a value defined, that is even better. Note again that the ranking is node (2) and edge (3), which I find non-obvious.
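Putting my reading of that logic into code, as a sketch only (the real munge_obj() works on tuples and leans on dictionary misses for control flow):

def rank_clause(kind, types=(), values=()):
    """Lower rank = better seed. kind is 'node' or 'edge'; types/values are the
    candidate values given in the clause, e.g. e(type="likes", value="yes")."""
    is_edge = 1 if kind == "edge" else 0
    if types and values:
        rank = 2 + is_edge          # type + value: node -> 2, edge -> 3
    elif types:
        rank = 4 + is_edge          # type only:    node -> 4, edge -> 5
    else:
        rank = 6 + is_edge          # bare clause:  node -> 6, edge -> 7
    # Secondary ranking: the number of candidate values in the clause,
    # not the length of the strings themselves.
    rank2 = len(values) or len(types)
    return rank, rank2

# For a query like n()->e(type="likes", value="yes")->n(), the edge clause
# is ranked best and becomes the seed of the query.
clauses = [
    ("node", (), ()),
    ("edge", ("likes",), ("yes",)),
    ("node", (), ()),
]
print(min(clauses, key=lambda c: rank_clause(*c)))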

Here are the results of the matches after they have been munged, I marked the ranking:

image

Looking at this, something seems very strange: the rank2 value is 1 in the second element, but I expected it to be the length of the string. As it turns out, this is not working directly on the string, it is working on the tuple of possible values, so the secondary ranking here is not based on the length of the type or the value but on the number of possible types and values that were specified for each clause.

The code judges that the best place to start this query is with the second entry, since it is the most specific option. This in turn takes us to the seeds() method that we previously covered. In this case, the code is going to hit this branch:

image

This means that it is going to be iterating over all the edges of a particular type and filtering them in Python code. This is strange, because the on-disk indexes actually support querying on (type, value) directly, which would probably be much cheaper in the case where you have many values for a particular edge type.

In fact, just that is implemented for querying nodes by (type, value):

image

I’m guessing that they either don’t have a lot of queries on (type, value) for edges or don’t have a lot of different values for edge types that they could optimize in this manner.

That is enough for now. I have a pretty good grasp of how queries are parsed and how they fetch data from the underlying storage. The next post will talk about how LemonGraph takes the seeds of the query and executes the actual graph operations on them. The code that does this is tight and will require a full post to explore properly.

Reading the NSA’s codebase: LemonGraph review–Part IV–Compressed, sortable integers

time to read 2 min | 244 words

Before going over the actual query implementation, I wanted to talk about something that I just realized. I said previously that I don’t understand why LemonGraph is using its integer encoding method, because it is less efficient than using variable-sized integers (varints). What I didn’t take into account is that the method LemonGraph is using gives short, but sortable, integers.

Here is the encoding method:

image

Now, let’s see what kind of binary output this will generate when given a few numbers:

image

The key here is that the number of bytes is stored as the first item. This means that when we compare two numbers using memcmp(), the number with more bytes is considered greater. This is indeed the case, because if you need more bytes to store a number, it is obviously larger.

What about two numbers that have the same number of bytes? This is handled by simple binary comparison of the values. If they are the same size, the fact that encode() outputs them in big-endian format means that we can compare them using memcmp() and be done with it.

This is a really nice way to both keep the values sorted and to avoid storing the full 8 bytes for numbers that are usually much smaller.
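A small sketch of the idea (not LemonGraph’s actual encode(), but the same length-prefixed, big-endian scheme):

def encode(n):
    """Encode a non-negative integer so the encoded bytes sort like the numbers."""
    if n == 0:
        return b"\x00"                      # length prefix of 0, no body
    body = n.to_bytes((n.bit_length() + 7) // 8, "big")
    return bytes([len(body)]) + body        # length prefix first, then big-endian bytes

# memcmp()-style ordering is just plain byte comparison:
values = [0, 1, 255, 256, 70000, 2**40]
encoded = [encode(v) for v in values]
assert sorted(encoded) == encoded           # byte order matches numeric order
for v, e in zip(values, encoded):
    print(v, e.hex())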
