Ayende @ Rahien

Hi!
My name is Oren Eini
Founder of Hibernating Rhinos LTD and RavenDB.
You can reach me by phone or email:

ayende@ayende.com

+972 52-548-6969


RavenDB Retrospective: Unbounded result sets

time to read 4 min | 750 words

We spent some time recently looking into a lot of our old design decisions. Some of them make very little sense today (JSON vs. blittable is a good example), but they made perfect sense at the time, and were essential to actually getting the product out.

Some of those design decisions, however, are still something that I very firmly believe in.  This series of posts is going to explore those decisions, their background and how they played out in the real world. So, without further ado, let us talk about unbounded result sets.

The design of RavenDB was heavily influenced by my experience as That NHibernate Guy (I got started with NHibernate over a decade ago, if you can believe that), where I saw the same types of errors, repeated over and over again. I then read Release It!, and I suddenly discovered that I wasn’t alone fighting those kinds of demons. When I designed RavenDB, I set out explicitly to prevent as many of those as I possibly could.

One of the major issues that I wanted to address was Unbounded Result Sets. Simply put, this is when you have:

SELECT * FROM OrderLines WHERE OrderID = 1555

And you don’t realize that this order has three million line items (or, worse, that most of your orders have a few thousand line items each, so you are generating a lot of load on the database, only to throw most of them away).

In order to prevent this type of issue, RavenDB has the notion of mandatory page sizes; a short sketch follows the list below.

  • On the client side, if you don’t specify a limit, we’ll implicitly add one (by default set to 128).
  • On the server side, there is a database wide maximum page size (by default set to 1024). The server will trim all page sizes to the max if they are larger.
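
To make this concrete, here is a minimal sketch against the 3.x-era client API (the OrderLine entity and server URL are illustrative, not taken from an actual application):

using System;
using System.Linq;
using Raven.Client;
using Raven.Client.Document;

public class OrderLine // hypothetical entity, for illustration only
{
    public string Id { get; set; }
    public string OrderId { get; set; }
}

public static class PageSizeExample
{
    public static void Main()
    {
        using (var store = new DocumentStore { Url = "http://localhost:8080" }.Initialize())
        using (var session = store.OpenSession())
        {
            // No Take(): the client implicitly adds a page size of 128,
            // so a three million line item order cannot flood the client.
            var implicitPage = session.Query<OrderLine>()
                .Where(line => line.OrderId == "orders/1555")
                .ToList();

            // Explicit Take(): you get exactly the page you asked for,
            // trimmed by the server to its maximum (1024 by default).
            var explicitPage = session.Query<OrderLine>()
                .Where(line => line.OrderId == "orders/1555")
                .Take(512)
                .ToList();
        }
    }
}

Either way, a single query can never silently materialize millions of results on the client.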

I think that this is one of the more controversial decisions in RavenDB’s design, and one that generated a lot of heated discussion. But I still think that it is a good idea, because I have seen what happens when you don’t do it. The arguments are mostly along the lines of “RavenDB should trust developers to know what they are doing”. A particularly irate guy once called me while I was out shopping to complain that I broke the sacred contract of Linq with regards to “queries should return everything by default, even if that is ten billion results”. I pointed out that this is actually configurable, and that if he wanted to set the default to any size he liked, he could do that, but apparently it was supposed to be a “shoot my own foot first, then think” kind of deal.

Even though I still think that this is a really good idea, we have added some features over the years to make it easy for people to access the entire dataset when they need it. Streaming has been around since 2.5 or so, giving you a dedicated API to stream unbounded results. Streams were built to make it efficient to process large sets of data, and they allow both client & server to process the data in parallel, instead of batching huge responses on the server, then consuming ridiculous amounts of memory on the client before giving you the full result set. Instead, you get each result as soon as it arrives from the server, and you can process it and send it onward.
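
For example, a sketch of the streaming usage (3.x-era API; exact signatures may vary, and OrderLine is the same illustrative entity as above):

using System;
using System.Linq;
using Raven.Client;

public static class StreamingExample
{
    // Stream every matching result, one at a time, with no page size cap.
    public static void StreamOrderLines(IDocumentStore store)
    {
        using (var session = store.OpenSession())
        {
            var query = session.Query<OrderLine>()
                .Where(line => line.OrderId == "orders/1555");

            using (var enumerator = session.Advanced.Stream(query))
            {
                while (enumerator.MoveNext())
                {
                    // Each document is handled as soon as it arrives from
                    // the server, so memory use stays flat on both sides.
                    Console.WriteLine(enumerator.Current.Document.OrderId);
                }
            }
        }
    }
}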

In 4.0, we are going to change the behavior of the paging limits as follows (see the sketch after the list):

  • If you don’t specify a limit, we’ll supply a limit clause of 25 items. If there are more than 25 items, we’ll throw an exception (unless you ask otherwise in the conventions).
  • If you supply a limit explicitly, it will work as expected and page through the data.
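
In code, the planned behavior would look roughly like this (a hypothetical illustration against the same session as above; the convention property named here is invented, not a shipped API):

// Implicit limit of 25; throws if a 26th result exists,
// unless the conventions say otherwise.
var firstPage = session.Query<OrderLine>().ToList();

// Explicit limit: pages through the data as expected.
var bigPage = session.Query<OrderLine>().Take(100).ToList();

// Hypothetical convention name, for illustration only:
// store.Conventions.ThrowIfImplicitPageSizeExceeded = false;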

The idea is that we want to reduce the surprise for users, and give them experience to draw upon early on. Another thing we’ll do is make sure that the operations people can also change this, likely with an environment variable or something like that. If you need to modify the conventions on the fly, you usually have a hard time deploying a new version, and immediate action is needed.

In this manner, we can help users avoid expensive requests to the server, and they can be explicit about what they need to do.

The tale of the intermittently failing test

time to read 2 min | 350 words

We recently started seeing a failing test in our RavenDB 4.0 test suite. This test was a relatively simple multi-map/reduce test.  Here it is:

[image: the multi-map/reduce test code]

I checked the history, and this test has been part of our test suite (and never failed us) since 2012. So I was a bit concerned when it started failing. Of course, it would only fail sometimes, which is the worst kind of failure.

After taking a deep breath and diving directly into the map/reduce implementation and figuring out all the parts that were touched by this test, I was stumped. Then I actually sat down, read through the test, and tried to figure out what it was doing. This particular test is one that was sent by a user, so there was business logic to penetrate too.

The strange thing is that this test can never pass; it is inherently flawed, on several levels. To start with, it isn’t waiting for non-stale results, which was the obvious racy issue. But once we fixed that, the test always failed. The problem is probably a copy/paste error. There are supposed to be two lines for clients/1 and two lines for clients/2, but there are three lines for clients/1 and only one for clients/2. So this test should always fail.

But, because we didn’t have WaitForNonStaleResults, the query would always return no results (the indexing that started at SaveChanges didn’t have time to finish before the query ran), and the test would pass with an empty result set.
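
The fix looks roughly like this (a sketch using the standard query customization; the store, entity, and index names are illustrative):

using (var session = store.OpenSession())
{
    // Wait for indexing to catch up before asserting on the results,
    // instead of silently passing on an empty, stale result set.
    var results = session.Query<ClientResult, ClientsIndex>()
        .Customize(x => x.WaitForNonStaleResults())
        .ToList();
}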

This has been the case since 2012(!), mind you.

I fixed the copy/paste issue and added the WaitForNonStaleResults call, and the test passes consistently now.

The most interesting observation that I have here is that RavenDB is now able to run a full map/reduce cycle in the time it takes the test to move from the SaveChanges line to the query itself. And that is a damn impressive way to find bugs in your tests.

Voron Internals: The diff is the way

time to read 3 min | 486 words

I talked about the goals of using diffs for the journals in a previous post, but in this one, I want to talk about what it actually did. To start with, it turns out that using diffs in the journal invalidates quite a few optimizations that we had. We had a whole bunch of tricks to avoid writing data to the data file when there were going to be additional modifications to it. When using diffs, we can’t do that, because the next diff builds on the previous version. That actually ended up improving the code quality, because a bunch of pretty tricky code had to go away.

It was tricky code because we tried to reduce the data file I/O, but only in the cases where that was allowed, which wasn’t always. Now we always write the latest version of all the modified pages, which might add some additional I/O in some cases, but in practice, this didn’t show up in our benchmarks. And the reduction of complexity might have been worth it even if it had.

We actually have two diff implementations: one that compares two different versions of a page and finds the differences, and one that compares a page against zeros. That, in addition to collapsing runs of zeros, is the entire implementation. We are actually not as small as we could get, because we insert some metadata into the diff stream to allow easier debugging and error detection. But the nice thing about the diff output is that we are still compressing it, so working extra hard to reduce the uncompressed data isn’t that important if compression will cover it anyway.
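
Here is a minimal sketch of the idea (not the actual Voron implementation): compare the current page against its previous version and keep only the byte runs that changed.

using System;
using System.Collections.Generic;

public static class PageDiff
{
    // Returns the changed runs as (offset, bytes) pairs; unchanged runs
    // are skipped. Assumes both versions have the same length, since
    // pages are fixed size. Diffing against zeros is this same routine
    // with `previous` being an all-zero page.
    public static List<(int Offset, byte[] Data)> Diff(byte[] previous, byte[] current)
    {
        var runs = new List<(int Offset, byte[] Data)>();
        var i = 0;
        while (i < current.Length)
        {
            if (previous[i] == current[i]) { i++; continue; }

            var start = i;
            while (i < current.Length && previous[i] != current[i])
                i++;

            var data = new byte[i - start];
            Array.Copy(current, start, data, 0, data.Length);
            runs.Add((start, data));
        }
        return runs;
    }
}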

We tested before & after using diffs for journal writes, and we found the following:

Journal size (inserts) 37% reduction in size
Journal size (updates) 80% reduction in size
Bulk insert speed 5% – 10% speed improvement *
Writes / sec (inserts) 12% improvement
Writes / sec (updates) 18% improvement

* If the sizes went down so significantly, why didn’t the insert & update speeds improve by a comparable amount?

The answer is that, for the most part, reducing the amount of data we write is great, but the major limiting factor is the number of times we write to the journal, rather than the number of bytes. And we have to write once per transaction, so that remains a limiting factor.

However, the current numbers come from a benchmark showing roughly 30,000 write requests / second, and we are unable to get anything higher, because we have saturated the local network. We are currently setting up a dedicated testing environment to see how far we can push this.

RavenDB 3.5 RC2 is released

time to read 2 min | 224 words

You can now get RavenDB 3.5 RC2 on the download page. This is going to be the last RC before the RTM release, which we expect in the next week or two.

Please take it for a spin; we appreciate any feedback you have. While the RC build is out there, we are running internal tests, stressing RavenDB under load in various configurations, to see if we can break it.

The changes since the last RC are:

  • Reduced the number of threads being used in the system.
  • Support for dynamic documents in More Like This queries.
  • Display the number of alerts in the studio.
  • Reduced studio load times.
  • Warn about configuration values that are suspect.
  • Reduced allocations when reading a large number of documents.
  • Better optimistic concurrency control on the client; it can now be specified on a per document basis.
  • Better handling for clusters; resolved a race condition in how the client decides who the leader is, and added better logging and operational support.
  • Fixed index & side-by-side index replication in master-master scenarios.
  • Alert if the paging file is on an HDD while there is an SSD on the system.
  • Clusters now only select an up-to-date node as the leader.
  • Fixed a perf issue when starting out and loading a large number of dbs concurrently.
  • Better CSV import / export semantics.

Production postmortem: The insidious cost of managed memory

time to read 3 min | 542 words

A customer reported that on a memory constrained system, a certain operation was taking all the memory and swapping hard. On a machine with just a bit more memory, the operation completed very quickly. It didn’t take long to figure out what was going on: we were reading too much, we started swapping, and everything went to hell after that.

The problem is that we have code specifically to prevent that: it checks that the size we load from disk isn’t too big, and that we aren’t doing something foolish. But something broke here.

Here is a sample document; it is simple JSON (without indentation), and it isn’t terribly large:

[image: the sample JSON document]

The problem happens when we convert it to a .NET object:

[image: the size of the resulting .NET object]

Yep, once deserialized, it takes close to 13 times more space than the text format.

For fun, let us take the following JSON:

[image: a deeply nested JSON document]

This generates a string whose size is less than 1KB.

But when parsing it:

[image: the in-memory size after parsing]

The reason, by the way, is the structure of the document:

[image: the per-object allocation breakdown]

For every two bytes of object structure in the JSON (the {}), we are allocating 116 bytes. No wonder this blows up so quickly.

This behavior is utterly dependent on the structure of the document, by the way, and is very hard to protect against, because you don’t really have a way of knowing, up front, how much you are going to allocate.
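
If you want to observe this effect yourself, here is a small sketch using JSON.NET (the exact numbers will vary by runtime, but the shape of the result won’t):

using System;
using System.Linq;
using Newtonsoft.Json.Linq;

public static class NestedJsonCost
{
    public static void Main()
    {
        // ~600 bytes of text that is almost entirely object structure:
        // {"a":{"a":{"a": ... 0 ... }}}
        const int depth = 100;
        var json = string.Concat(Enumerable.Repeat("{\"a\":", depth))
                   + "0" + new string('}', depth);

        var before = GC.GetTotalMemory(forceFullCollection: true);
        var parsed = JObject.Parse(json);
        var after = GC.GetTotalMemory(forceFullCollection: true);

        // Managed size of the parsed tree vs. the size of the raw text.
        Console.WriteLine($"text: {json.Length:N0} bytes, managed: {after - before:N0} bytes");
        GC.KeepAlive(parsed);
    }
}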

We resolved it by watching not only the size of the documents that we are reading, but also the amount of free memory available on the machine (aborting if it gets too low), though that is a really awkward way of doing it. I’m pretty sure that this is also something that you can use to attack a server, forcing it to allocate a lot of memory by sending it very little data.

I opened an issue on the CoreCLR about this, and we’ll see if there is something that can be done.

In RavenDB 4.0, we resolved that entirely by using the blittable format, and we have a one-to-one mapping between the size of the document on disk and the allocated size (actually, since we memory map the file, there is no allocation of the data at all; we just access it directly).

RavenDB 3.5 Release Candidate

time to read 2 min | 365 words

I’m very happy to announce that the Release Candidate of RavenDB is now available. We have been working hard for the past few months to incorporate feedback that we got from customers who started using the beta, and we are finally ready.

There have been quite a lot of fixes, improvements and enhancements, but in this post I want to focus on the most exciting three.

Waiting for write assurance

Take a look at the following code:
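
A sketch of what this looks like (parameter names are assumptions, and User is an illustrative entity):

using System;
using Raven.Client;

public class User { public string Name { get; set; } } // illustrative entity

public static class WriteAssuranceExample
{
    public static void Save(IDocumentStore store)
    {
        using (var session = store.OpenSession())
        {
            // Block SaveChanges until at least 2 other servers confirm
            // the write, waiting up to 30 seconds.
            session.Advanced.WaitForReplicationAfterSaveChanges(
                timeout: TimeSpan.FromSeconds(30),
                replicas: 2);

            session.Store(new User { Name = "John" });
            session.SaveChanges();
        }
    }
}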

It asks RavenDB to save a new user, and then wait (on the server) for the newly saved changes to be replicated to at least 2 other servers, up to a maximum of 30 seconds.

This gives you the option of choosing how safe you want to be on saves. For high value documents (a million dollar order, for example), we will accept the additional delay.

Waiting for indexes

In the same vein, we have this:
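
Again as a sketch (API names assumed, reusing the store and User from above):

using (var session = store.OpenSession())
{
    // Block SaveChanges until the indexes affected by this write have
    // caught up, so an immediate query (the Get after the Redirect)
    // will see the new document.
    session.Advanced.WaitForIndexesAfterSaveChanges(
        timeout: TimeSpan.FromSeconds(15),
        throwOnTimeout: false);

    session.Store(new User { Name = "John" });
    session.SaveChanges();
}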

This feature is meant specifically for PRG (Post/Redirect/Get) scenarios, where you add a new document and immediately query for it. In this case, you can ask the server to wait until the indexes are caught up with this particular write.

You can also set a timeout and choose whether to throw on timeout or not. But more interestingly, you can specify the particular indexes that you want to wait for. If you don’t specify anything, RavenDB will automatically select just the indexes that are impacted by this write.

Dynamic leader selection

At the RavenDB Conference, we showed how you can define a cluster of RavenDB servers and have them vote on a leader dynamically, then route all writes to the leader. In the presence of failure, we will select a new leader and stick with it for as long as it is able.

Some of the feedback we got was related to how we select the leader. We now take into account the state of the databases on the server, and will not select a leader whose databases aren’t up to date with regards to replication. We also reduced the cost of replication in general, by significantly reducing the amount of metadata that we need to track.

On why RavenDB is written in C#

time to read 4 min | 664 words

Greg Young has an interesting reply to my post here. I’m going to reply to it here.

RavenDB nor EventStore should be written in C#.

That may be true for the EventStore, but it isn’t true for RavenDB. Being able to work with the .NET framework makes a lot of things a lot simpler. Let us take a simple example: every day, I have to clear the unused auto indexes, but only if there have been queries to the collection in question. In C#, here is the code to give me the last query time for all collections in the database:

[image: the C# code]
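
A hedged reconstruction of the idea (the types and property names are illustrative, not the actual RavenDB internals):

using System;
using System.Collections.Generic;
using System.Linq;

public class IndexInfo // illustrative stand-in for an index definition
{
    public string Collection { get; set; }
    public DateTime LastQueryTime { get; set; }
}

public static class AutoIndexCleanup
{
    // One LINQ expression: the most recent query time per collection.
    public static Dictionary<string, DateTime> LastQueryTimes(IEnumerable<IndexInfo> indexes) =>
        indexes
            .GroupBy(index => index.Collection)
            .ToDictionary(g => g.Key, g => g.Max(i => i.LastQueryTime));
}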

Now, if we wanted to do this in C, as the go-to example for system-level programming, I would need to select which hash function to use. RavenDB being a non-trivial project, I’d probably already have one; for simplicity’s sake, I selected this one, which gives us the following:

[image: the C version of the code]

I’ve elided all error handling code, and I’ve introduced a silent potential “use after free” memory issue.

This is something that runs once a day, but the amount of code, effort, and potential things to manage is quite large, compared to the simplicity we get in C#.

Now, this is somewhat of a straw man argument, because the code would be much simpler in C++ or in Go (no idea about Rust, I’m afraid), but it demonstrates the disparity between the two options.

But Greg isn’t actually talking about code.

A different environment such as C/C++/Go/Rust would be far superior to haven been written in C#. Cross compiling C# is a pain in the ass to say the least whether we talk about supporting linux via mono or .netcore, both are very immature solutions compared to the alternatives. The amount of time saved by writing C# code is lost on the debugging of issues and having unusual viewpoints of them. The amount of time saved by writing in C# is completely lost from an operational perspective of having something that is not-so-standard. We have not yet talked about the amount of time spent in dealing with things like packaging for 5+ package managers and making things idiomatic for varying distributions.

I have one statement in response to this. Mono sucks, a lot.

I suspect that a lot of Greg’s pain has involved Mono directly. And while I have some frustrations with .NET Core, I don’t agree with lumping it together with Mono. Greg’s primary market was Linux admins; while I would love for that to be a major market for us, we see a lot more usage in Windows and enterprise shops. The rules seem to be simpler there. I can deploy .NET Core along with my application, no separate install necessary, so that makes things easier. What we have found is that customers want Docker instances, so they can just throw RavenDB up there and be happy.

A lot of the pain Greg has experienced isn’t relevant any longer.

What is more, he keeps considering just the core product. There are quite a lot of additional things that need to happen to take a project to a product. Some of them are external (documentation, for example) and some are internal (a high quality management studio, good monitoring and debugging, etc.). I have seen what “system level guys” typically consider sufficient in those regards. Thank you, but no.

A lot of our time is actually invested in making sure the whole thing is better, spreading oil over the machinery, and the more lightweight and easy we can make it, the better. 

RavenDB Conference 2016–Slides

time to read 1 min | 145 words

The slides from the RavenDB Conference are now available; we’ll post the videos in a few weeks, once we are done post-processing them.

Day 1:

Day 2:

The design of RavenDB 4.0: The client side

time to read 5 min | 926 words

We didn’t plan to do a lot of changes on the client side for RavenDB 4.0. We wanted to do most of the changes on the server side, and just adapt the client when we did something differently.

However, we ran into an interesting problem. A while ago we asked Idan, a new guy at the company, to write a Python client for RavenDB as part of the onboarding process. Unlike with the JVM client, we didn’t want an exact duplication with just the changes needed for the new platform. We basically told him: go ahead and do this. And he went, and he did. But along the way he got to choose his own approach for the implementation, and he didn’t copy the same internal architecture. The end result is that the Python client is significantly simpler than the C# one.

There are a few reasons for that. To start with, the Python client doesn’t need to implement features such as async or Linq. But for the most part, it is because Idan was able to look at the entire system, grasp it, and then just implement the whole thing in one go.

The RavenDB network layer on the client side has gone through many changes over the years. In particular, we have the following concerns handled in this layer:

  • Actual sending of requests to the server.
  • High availability, failover, load balancing and SLA matching.
  • Authentication of the requests.
  • Caching server responses.

I’m probably ignoring a few that snuck in there, but those should be the main ones. The primary responsibility of the network layer in RavenDB is to send requests to the server (with all the concerns mentioned above) and give us the JSON object back.

The problem is that right now we have those responsibilities scattered all over the place. Each was added at a different time, in several different ways, and the handling is done in very different locations in the code. This leads to complexity, and seeing everything in one place in the Python client is a great motivation to simplify things. So we are going to do that.


Having a single location where all of those concerns are handled will make things simpler for us. But we can do better. Instead of just returning a JSON object, we can solve a few additional issues: performance, stability and ease of deployment.

We can do that by removing the internal dependency on JSON.Net. A while ago we got tired of conflicting JSON.Net versions, and we decided to just internalize it in our code. That led to a much simpler deployment story, but it does add some complexity if you want to customize how your entities go over the wire and into RavenDB (because you have to use our own copy of JSON.Net for that). And it complicates getting new JSON.Net updates, which we want.

So we are going to do something quite interesting here. We are going to drop the dependency on JSON.Net entirely. We already have a JSON parser, and one that is extremely efficient, using the blittable format. More importantly, it can write directly to native memory. That gives us a very important property: it makes it very easy to store the data, already parsed, with a well known size. Let me try to explain my reasoning.

A very important feature of RavenDB is the idea that it can cache requests for you. When you make a request to the server, the server returns the reply as well as an etag. The next time you make this request, you’ll send the same etag, and if the response hasn’t changed, the server can just tell you to use the value in the cache. Right now we are storing the full string of the response in the cache. That leads to a few issues; in particular, while we save on the I/O of resending the response, we still need to parse the JSON, and we need to keep relatively large (sometimes very large) .NET strings around for a long time.

But if we use blittable data in the client cache, then we have no need to do additional parsing, and we don’t need to involve the GC on the client side to collect things; we can manage that memory explicitly and directly. So the second time you make a request, we’ll hit the server, learn that the value is still relevant, and then just use it.
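
The etag handshake itself is plain HTTP; here is a minimal sketch of the idea (not the actual RavenDB client code):

using System;
using System.Collections.Generic;
using System.Net;
using System.Net.Http;
using System.Threading.Tasks;

public class EtagCache
{
    private readonly HttpClient client = new HttpClient();
    private readonly Dictionary<string, (string Etag, string Body)> cache
        = new Dictionary<string, (string, string)>();

    public async Task<string> GetAsync(string url)
    {
        var request = new HttpRequestMessage(HttpMethod.Get, url);
        if (cache.TryGetValue(url, out var entry))
            request.Headers.TryAddWithoutValidation("If-None-Match", entry.Etag);

        var response = await client.SendAsync(request);
        if (response.StatusCode == HttpStatusCode.NotModified)
            return entry.Body; // still fresh: no body sent, nothing to re-parse

        var body = await response.Content.ReadAsStringAsync();
        var etag = response.Headers.ETag?.Tag ?? string.Empty;
        cache[url] = (etag, body);
        return body;
    }
}

The win described above is the step beyond this sketch: keeping the cached body in blittable form, so the 304 path involves neither string allocation nor re-parsing.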

That is great for us, except for one thing: JSON.Net does a really great job of translating between JSON and .NET objects, and that isn’t something we want to reimplement. Instead, we are going to handle blittable objects throughout the RavenDB codebase. We don’t actually need to deal with the user’s .NET types until the very top layers of the system. And we can handle that with a convention that calls into JSON.Net (the only place where that will happen) and translates .NET objects to JSON and back. The nice thing about this is that since there is just a single location where it happens, we can do it dynamically, without a hard dependency on a specific JSON.Net version.

That, in turn, also exposes the ability to translate between JSON and .NET objects using other serializers, such as Jil, for example, or whatever the end user decides.
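
A hypothetical sketch of that single hand-off point (the type and property names are invented for illustration):

using System;
using Newtonsoft.Json;

// The only place in the codebase that touches a concrete JSON library:
// a pair of pluggable delegates. Swap in Jil (or anything else) here.
public class SerializationConventions
{
    public Func<object, string> EntityToJson { get; set; }
        = entity => JsonConvert.SerializeObject(entity);

    public Func<string, Type, object> JsonToEntity { get; set; }
        = (json, type) => JsonConvert.DeserializeObject(json, type);
}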
