Ayende @ Rahien

Hi! My name is Oren Eini, founder of Hibernating Rhinos LTD and RavenDB. You can reach me by email at ayende@ayende.com or by phone at +972 52-548-6969.


Solving a problem in a way that doesn’t generate dozens more

time to read 2 min | 393 words

Yesterday I showed off some of the new features in RQL, in particular two very cool ones: the ability to declare functions and the ability to project an object literal directly from the select.

Both of these features are shown here, including a usage demonstration:

[image]
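
Roughly, that kind of query looks like this (a sketch; the collection and field names here are made up):

    declare function fullName(u) {
        return u.FirstName + ' ' + u.LastName;
    }
    from Users as u
    where u.Age > 21
    select {
        Name: fullName(u),
        Nicks: u.Nicks.map(function (n) { return n.toUpperCase(); })
    }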

Take a minute to read the query, then consider what we are doing here.

One of the things we want to enable is powerful query capabilities. We started with a SQL-like syntax and very quickly hit a wall. Leaving aside the issue of working with the rectangular format of SQL, there is a huge problem here.

Something like N1QL or the Noise syntax shows that you can deal with non-rectangular formats with enough effort. But the problem is that there is no really good way to express complex imperative stuff easily. For example, N1QL has about 30 functions just to deal with arrays, and 40 just for dates. And those are just the first ones I looked up that were easy to count.

Going with a full SQL syntax, including subqueries, complex nesting of expressions, etc., would have led to a really complex language, both from the point of view of the implementation and from the point of view of the users.

On the other hand, with RQL, you get the familiar query language, and if you just want to get some data to display, you can do that. When you need to do more, you move to the object literal syntax, and if you need to apply more logic, you write a small function for that and call it.

Imagine writing the query above in SQL (just the fact that the Nicks array would be in a separate table means that you'd have a correlated subquery there), and this is a pretty simple query. Choosing this approach means that we actually have very little expected support burden.

That is important, because with the other option I could absolutely see a lot of requests for “can you add a function to do X” and “how do I add my own function” popping up all the time in the support channels. This way, we give it all to you already, and the solution is self contained and very easily extensible as needed.

Breaking the language barrier

time to read 5 min | 916 words

In RavenDB 4.0, we decided to finally bite the bullet and write our own query language. That led to a lot of really complex decisions that we had to make. I already posted about the language and you saw some first drafts. RQL is meant to be instantly familiar to anyone who used SQL before.

At the same time, this led to issues. In particular, the major issue is that SQL as a language is meant to handle rectangular tables, and RavenDB isn't storing data in tables. For that matter, we also want to give our users the ability to express themselves fully in the query language; that means support for complex queries and complex projections.

For a while, we explored the option of supporting nested selects as the way to express these semantics, but that was pretty horrible, both in terms of the complexity of the implementation and in terms of the complexity of the language. Instead, we decided to take only the good parts out of SQL.

What do I mean by that? Well, here is a pretty simple query:

[image]
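
Roughly, the simplest form is just a from clause (the collection name is made up):

    from Orders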

And here is how we can ask for specific fields:

[image]
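
Something along these lines (field names made up):

    from Orders
    select Company, OrderedAt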

You'll note that we moved the select clause to the end of the query. The idea is that the syntax hints to the user about the order of operations in the pipeline.

Next we add some filtering, giving us:

[image]
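
A sketch of what that looks like:

    from Orders
    where Company = 'companies/1'
    select Company, OrderedAt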

This isn’t really interesting except for the part where implementing this was a lot of fun. Things become both more interesting and more obvious when we look at a full query, with almost everything there:

[image]
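
A sketch of such a query (again, made up names):

    from Orders
    where Company = 'companies/1' and Freight > 10
    order by OrderedAt desc
    select Company, OrderedAt, Freight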

And again, this is pretty boring, because except for the location of the clauses, you might as well write SQL. So let us see what happens when we start mixing in some of RavenDB's own operations.

Here is how we can traverse the object graph on the database side:

[image]
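
A sketch of the idea, using a made up Company reference:

    from Orders as o
    where o.Freight > 10
    load o.Company as c
    select o.OrderedAt, c.Name, c.Address.City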

This is an interesting example, because you can see that we have traversed the path of the relation and are able to fetch not just the data from the documents we are looking for, but also the related documents.

It is important to note that this happens after the where, so you can't filter on that (you can plug this into the index, but that is a story for another time). I like this syntax, because it is very clear about what is going on, but at the same time, it is also very limited. If I want anything that isn't rectangular, I need to jump through hoops. Instead, let us try something else…

[image]
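
A sketch of what such a projection can look like (made up names):

    from Orders as o
    load o.Company as c
    select {
        Customer: c.Name + ' (' + c.Address.City + ')',
        Contact: c.Contact.Title + ' ' + c.Contact.Name,
        Summary: c.Contact.Title + ' ' + c.Contact.Name + ' at ' + c.Name + ' (' + c.Address.City + ')'
    }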

The idea is that we can use a JavaScript object literal in place of the select. Instead of fighting with the language, we have a very natural way to express projections from RavenDB.

The cool part is that you can apply logic as well, so things like string concatenation or logic in the select clause are absolutely permitted. However, take a look at the example: we have code duplication here, in the formatting for customer and contact names. We can do something about it, though:

[image]
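
A sketch of the same projection, with the common formatting pulled into declared functions:

    declare function customerName(c) {
        return c.Name + ' (' + c.Address.City + ')';
    }
    declare function contactName(c) {
        return c.Contact.Title + ' ' + c.Contact.Name;
    }
    from Orders as o
    load o.Company as c
    select {
        Customer: customerName(c),
        Contact: contactName(c),
        Summary: contactName(c) + ' at ' + customerName(c)
    }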


The idea is that we also have a way to define functions to hold common logic and handle the more complex details if we need to. In this manner, instead of having to define and deploy transformers, you can define that directly in your query.

Naturally, if you want to calculate the Mandelbrot set inside the query, that is… suspect, but the idea is that having JavaScript applied on the results of the query gives us a lot more freedom and ease of use.

The actual client side in C# is going to remain unchanged (it has already been updated to talk to the new backend), but in our dynamic language clients, I think we'll work to expose this more, since it is a much better fit in such environments.

Finally, here is the full query, with everything that we can do:

[image]
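
A sketch of such a full query, with declare, load, an object literal select, and include all in one place (names made up):

    declare function contactName(c) {
        return c.Contact.Title + ' ' + c.Contact.Name;
    }
    from Orders as o
    where o.Freight > 10
    order by o.OrderedAt desc
    load o.Company as c
    select {
        Customer: c.Name,
        Contact: contactName(c),
        OrderedAt: o.OrderedAt
    }
    include o.Employee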

Don't focus too much on the actual content of the queries, they are mostly there to show off the syntax. The last query now has the notion of include directly in the query. As an aside, you can now use load and include to handle tertiary includes. I didn't actually set out to build them, but they came naturally from the syntax, and they are probably going to stay.

I have a lot more to say about this, but I'm so excited about this feature that I just had to get it out there. And yes, as you can see, you can declare several functions, and you can also call them from one another, so there is a lot of power on the table right now.

Optimizing JavaScript and solving the halting problem: Part II

time to read 3 min | 544 words

In the previous post I talked about the issues we had with wanting to run untrusted code and wanting to use Jurassic to do so. The problem is that when the IL code is generated, it is then JITed and run on the machine directly; we have no control over what it is doing. And that is a pretty scary thing to have. A simple mistake causing an infinite loop is going to take down a thread, which requires us to Abort it, which means that we are now in a funny state. I don't like funny states in production systems.

So we were stuck, and then I realized that Jurassic is actually generating IL for the scripts. If it generates the IL, maybe we can put something in the IL? Indeed we can; here is the relevant PR for the Jurassic code. Here is the most important piece of code:

The key here is that we can provide Jurassic with a delegate that will be called at specific locations in the code. Basically, we do that whenever you evaluate a loop condition or just before you jump using goto. That means that our code can be called, which means that we can then check whether limits such as time, number of iterations, or memory used have been exceeded and abort.

Now, the code is written in a pretty funny way. Instead of invoking the delegate we gave to the engine, we are actually embedding the invocation directly in each loop prolog. Why do we do that? Calling a delegate is something that has a fair amount of cost when you are doing that a lot. There is a bit of indirection and pointer chasing that you must do.

The code above, however, is actually statically embedding the call to our methods (providing the instance pointer if needed). If we are careful, we can take advantage of that. For example, if we mark our method with aggressive inlining, that will be taken into account. That means that the JavaScript code will turn into IL and then into machine code and our watchdog routine will be there as well. Here is an example of such a method:
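
A minimal sketch of the shape of such a method (the names and the limits here are hypothetical, not the actual RavenDB code):

    using System;
    using System.Runtime.CompilerServices;

    public class ScriptWatchdog
    {
        private readonly DateTime _start = DateTime.UtcNow;
        private readonly TimeSpan _maxDuration;
        private int _ticks;

        public ScriptWatchdog(TimeSpan maxDuration)
        {
            _maxDuration = maxDuration;
        }

        // Stage 1: small enough to be inlined into every loop prolog.
        [MethodImpl(MethodImplOptions.AggressiveInlining)]
        public void OnLoopIteration()
        {
            if ((++_ticks & 1023) == 0)
                Stage2();
        }

        // Stage 2: the expensive checks, taken only once every 1024 iterations.
        // This is also where something like GC.GetAllocatedBytesForCurrentThread()
        // could be checked.
        [MethodImpl(MethodImplOptions.NoInlining)]
        private void Stage2()
        {
            if (DateTime.UtcNow - _start > _maxDuration)
                throw new TimeoutException("Script exceeded its time limit");
        }
    }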

We register the watchdog method with the engine. You’ll notice that the code smells funny. Why do we have a watchdog and then a stage 2 watchdog? The reason for that is that we want to inline the watchdog method, so we want it to be small. In the case above, here are the assembly instructions that will be added to a loop prolog.

These are 6 machine instructions that are going to have great branch prediction and in general be extremely lightweight. Once in a while we'll do a more expensive check. For example, we can check the time or use GC.GetAllocatedBytesForCurrentThread() to see that we didn't allocate too much. The stage 2 check is still expected to be fast, but by running it only rarely, we can afford to do this in each and every place without worrying about the performance of it.

This means that the road is now open to move to Jurassic fully, since the major limiting factor, trusting the scripts, is now handled.

Optimizing JavaScript and solving the halting problem: Part I

time to read 3 min | 537 words

RavenDB is a JSON document database, and the natural way to process such documents is with JavaScript. Indeed, there is quite a lot of usage of JS scripts inside RavenDB. They are used for ETL, Subscription filtering, patching, resolving conflicts, managing administration tasks and probably other stuff that I’m forgetting.

The key problem that we have here is that some of the places where we need to call out to JavaScript are called a lot. A good example would be a patch request applied to a query. The result can be a single modified document, or 50 million of them. If it is the latter case, given our current performance profile, it turns out that the cost of evaluating JavaScript is killing our performance.

We are using Jint as the environment that runs our scripts. It works, it is easy to understand and debug, and it is an interpreter. That means that it is more focused on correctness than performance. Over the years, we were able to extend it a bit to do all sorts of evil things to our scripts, but the most important thing is that it isn't actually executing machine code directly; it is always Jint code running and handling everything.

Why is that important? Well, these scripts that we are talking about can be quite evil. I have seen anything from 300 KB of script (yes, that is 300 KB to modify a document that was considerably smaller) to just plain O(N^6) algorithms (a document has 6 collections, and the script iterates on each of them while iterating on each of the others). These are just the complex ones. The evil ones do things like spinning in a tight loop that never terminates.

We have extended Jint to count the number of operations and abort after a certain limit has passed, as well as to prevent stack overflow attacks. This means that it is much easier to deal with such things. They are just going to run for a while and then abort.

Of course, there is the perf issue of running an interpreter. We turned our eyes to Jurassic, which took a very different approach: it generates IL on the fly and then executes it. That means that as far as we are concerned, most operations are going to end up running as fast as the rest of our code. Indeed, benchmarks show that Jurassic is significantly faster (as in, an order of magnitude or more). In our own scenarios, we saw anything from simply doubling the speed to an order of magnitude performance improvement.

However, that doesn't help very much. The way Jurassic works, code like that is going to hang or die. In fact, the Jurassic documentation calls this out explicitly as an issue and recommends dedicating a thread or a thread pool for this and calling Thread.Abort if needed. That is not acceptable for us. Unfortunately, trying to fix this in a smart way takes us to the halting problem, and I think we already solved that mess.

This limiting issue was the reason why we kept using Jint for a long time. Today we finally had a breakthrough and were able to fix this issue. Here is the PR, but it tells only part of the story; the rest I'll leave for tomorrow's post.

Where do I put the select?

time to read 1 min | 137 words

We have a design issue with the RavenDB Query Language. Consider the following queries:
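
Roughly, the two shapes under consideration are (with made up fields):

    Option 1 (what we have now, modeled after SQL):

        select Name, Age from Users where Age > 21

    Option 2 (select moved to the end):

        from Users where Age > 21 select Name, Age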

There are two different ways to express the same concept. The first version is what we have now, and it is modeled after SQL. The problem with that is that it makes it very hard to build good IntelliSense for it.

The second option is much easier, because the format of the query also follows the flow of writing, and by the time you get to the select, you already know what you are querying on. This is the reason why C# & VB.Net have their LINQ syntax in this manner.

The problem is that it is pretty common to want to define aliases for fields in the select, and then refer to them afterward. That now becomes awkward in the second option.

Any thoughts?

RavenDB 4.0: Maintaining transaction boundary integrity in a distributed cluster

time to read 3 min | 463 words

Transactions are important for a database, even if it feels strange to talk about it. Sometimes it feels like taking out this ad:

[image]

We pretty much treat RavenDB's transactional nature as a baseline, same as the safe assumption that any employee we hire will have a pulse. (Sorry, we discriminate against Zombies and Vampires because they create a hostile work environment, see here for details).

Back to transactions, and why I'm bringing up a basic requirement like that. Consider the case when you need to pay someone. That operation is composed of two distinct operations. First the bank debits your account, and then the bank credits the other account. You generally want these to happen as a transactional unit: either both of them happened, or none of them did. In practice, that isn't how banks work at all, but that is the simplest way to explain transactions, so we'll go with that.

With RavenDB, that is how it always worked. You can save multiple documents in a single transaction, and we guarantee that we are either able to save all of them, or none. There is a problem, however, when we start talking about distributed clusters. When you save data to RavenDB, it goes into a single node, and then it is replicated from there. In RavenDB 4.0 we are now also guaranteeing that replicating the data between nodes will respect the transaction boundary, so across the cluster you can now rely on the fact that we'll never break apart transactions.

Let us consider the following transfer:

  • Deduct 5$ from accounts/1234
  • Credit  5$ to accounts/4321

These changes will also be replicated as a single unit, so the total amount of money in the system remains the same.
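
From the client side, such a transfer is just two in-memory changes and a single SaveChanges call. A minimal sketch (assuming an already initialized document store and a made up Account class with a Balance property):

    using (var session = store.OpenSession())
    {
        var payer = session.Load<Account>("accounts/1234");
        var payee = session.Load<Account>("accounts/4321");

        payer.Balance -= 5;
        payee.Balance += 5;

        // Both modifications are committed as a single transaction:
        // either both documents are saved, or neither is.
        session.SaveChanges();
    }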

But a more complex set of operations can be:

  • Deduct 5$ from accounts/1234
  • Credit  5$ to accounts/4321
  • Deduct 5$ from accounts/1234
  • Credit  5$ to accounts/5678

In this case, we have a document that is involved in multiple transactions. When we need to replicate to another node, we’ll replicate the current state and we do that in batches. So it is possible that we’ll replicate just:

  • Credit  5$ to accounts/4321

We don't replicate accounts/1234 yet, because it has changed in a later transaction. That means that on one server, we suddenly, magically, have more money. While in general that would be a good thing, I'm told that certain parties aren't in favor of such things, so we have another feature that interacts with this. You can enable document revisions, in which case even if your documents are modified multiple times in different transactions, we'll always send the transactions across the wire as they were saved.

This gives you transaction boundaries on a distributed database, and allows you to reason about your data in a saner fashion.


Public Service Announcement: ConcurrentDictionary.Count is locking

time to read 2 min | 233 words

During a performance problem investigation, we discovered that the following innocent looking code was killing our performance.

This is part of the code that allows users to subscribe to changes in the database using a WebSocket. Such subscriptions are pretty rare, so we check that there aren't any and skip all the work.

We had a bunch of code that runs on many threads and ended up calling this method. Since there are no subscribers, this should be very cheap, but it wasn't. The problem was that the call to Count was locking, and that created a convoy that killed our performance.
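
The shape of the offending code was roughly like this (a sketch with hypothetical names, not the actual RavenDB code):

    using System.Collections.Concurrent;

    public class ChangesNotifier
    {
        // Hypothetical field: one entry per connected WebSocket client.
        private readonly ConcurrentDictionary<long, object> _clients =
            new ConcurrentDictionary<long, object>();

        public void NotifyDocumentChanged(string docId)
        {
            // Looks innocent, but Count acquires all of the dictionary's locks,
            // so many threads hitting this at once form a lock convoy.
            if (_clients.Count == 0)
                return;

            // ... build and send the change notification ...
        }
    }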

We did some trawling through our code base and through the framework code and came up with the following conclusions.

ConcurrentQueue:

  • Clear locks
  • CopyTo, GetEnumerator, ToArray, Count create a snapshot (consumes more memory)
  • TryPeek, IsEmpty are cheap

Here is a piece of problematic code; we are trying to keep the last ~25 items that we looked at:
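
A sketch of that kind of code (hypothetical names):

    using System.Collections.Concurrent;

    public class RecentQueries
    {
        private readonly ConcurrentQueue<string> _recent = new ConcurrentQueue<string>();

        public void Record(string query)
        {
            _recent.Enqueue(query);

            // Per the list above, Count is one of the members that snapshots
            // the queue, so trimming this way on every call produces a lot of
            // short lived snapshots and drives up memory usage.
            while (_recent.Count > 25)
                _recent.TryDequeue(out _);
        }
    }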

The problem is that this kind of code ensures that there will be a lot of snapshots and increases memory utilization.

ConcurrentDictionary:

  • Count, Clear, IsEmpty, ToArray  - lock the entire thing
  • TryAdd, TryUpdate, TryRemove – lock the bucket for this entry
  • GetEnumerator does not lock
  • Keys, Values both lock the table and force the creation of a temporary collection

If you need to iterate over the keys of a concurrent dictionary, there are two options:
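
Roughly (reusing the hypothetical _clients dictionary from the sketch above, with a hypothetical Handle method):

    // Option 1: Keys locks the whole table and materializes a temporary collection.
    foreach (var key in _clients.Keys)
        Handle(key);

    // Option 2: enumerating the dictionary itself does not lock; just use the
    // key from each pair and ignore the value.
    foreach (var kvp in _clients)
        Handle(kvp.Key);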

Iterating over the entire dictionary is much better than iterating over just the keys.

The innocuous code that tripped me

time to read 1 min | 155 words

When building a cache, I needed a way to generate a hash code from a query. A query is a complex object that has many properties. My first attempt to do so looked like this:

And it failed, disastrously. All queries always had the same hash code, and I couldn’t figure out why until I actually debugged through the code.

Here is a hint: this only happened if WaitForNonStaleResultsTimeout was null. It might be better to also look at the way the compiler sees this:

Basically, if WaitForNonStaleResultsTimeout is null, the entire expression evaluated to null, so we returned zero, losing all the previous hash values that we had put in.

The solution was to parenthesize the expression, but this is something that passed a couple of rounds of scrutiny and didn't show up at all, because it looked fine, and I don't keep the precedence rules for operators in C# in my head.
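
To make the precedence trap concrete, here is a tiny standalone example (not the actual query type, just the same shape):

    using System;

    class PrecedenceDemo
    {
        static void Main()
        {
            TimeSpan? timeout = null; // stands in for WaitForNonStaleResultsTimeout
            var hashCode = 42;        // pretend this already mixes in other fields

            // Without parentheses, ^ binds tighter than ??, so this is parsed as
            // ((hashCode * 397) ^ timeout?.GetHashCode()) ?? 0. When timeout is
            // null, the whole XOR is null and the accumulated hash collapses to 0.
            var broken = hashCode * 397 ^ timeout?.GetHashCode() ?? 0;

            // Parenthesizing makes only the possibly-null part fall back to 0.
            var fixedHash = hashCode * 397 ^ (timeout?.GetHashCode() ?? 0);

            Console.WriteLine(broken);    // 0
            Console.WriteLine(fixedHash); // 16674 (42 * 397), previous hash preserved
        }
    }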
