Ayende @ Rahien

Hi!
My name is Oren Eini
Founder of Hibernating Rhinos LTD and RavenDB.
You can reach me by email or phone:

ayende@ayende.com

+972 52-548-6969


Posts: 6,633 | Comments: 48,370


RavenDB 4.1 features: SQL Migration Wizard

time to read 2 min | 234 words

One of the new features coming in 4.1 is the SQL Migration Wizard. Its purpose is very simple: to get you started faster and with less work. In many cases, when you start using RavenDB for the first time, you’ll first need to put some data in to play with. We have the sample data, which is great to start with, but you’ll want to use your own data and work with that. This is what the SQL Migration Wizard is for.

You start it by pointing it at your existing SQL database, like so:

[screenshot]

The wizard will analyze your schema and suggest a document model based on it. You can see how this looks here:

[screenshot]

In this case, you can see that we are taking a linked table (employee_privileges) and turning it into an embedded collection. You also have additional options, and you’ll be able to customize it all.

The point of the migration wizard is not so much to do the real production migration as to make it easier for you to start playing around with RavenDB using your own data. This way, the first step of “what do I want to use it for” is much easier.

Looking for Go / Node.JS / Ruby developers for a short stint

time to read 1 min | 107 words

We are looking to finish work on our client API for RavenDB in Go, Node.JS and Ruby. The current state is that the code is mostly there, and we need help to give it the final push (and spit & polish) and drive it to release status.

For each of these, I estimate that there are about 6–8 weeks of work, after which we’ll be managing the rest internally. You can see the current state of each client API here:

If you are interested, please send an email to jobs@ravendb.net. This is applicable to both local (Hadera, Israel) and remote work.

Production postmortem: The unavailable Linux server

time to read 6 min | 1019 words

As part of our 4.0 deployment, we have a cluster made of mixed machines, some running Windows, some running Linux. In one particular configuration, we have two Windows machines and a single Linux machine. We set up the cluster and started directing production traffic to it, and all was good. We left for the weekend and came back the following week to see what the results were. Our Linux machine was effectively down: it was not responding to queries and seemed entirely non-responsive.

That was good and bad. It was good because the rest of the cluster just worked, and there was no interruption in service. This is as designed, but it is always nice to see it hold up in real life. It was bad because it isn’t supposed to happen. What was really frustrating was that we were able to log into the machine, and there was nothing wrong there.

No high CPU or memory, no outstanding I/O or anything of the kind that we are used to.

The first clue came from trying to diagnose the problem from inside the machine, where we were able to reproduce it: accessing the server from the local machine resulted in the exact same issue as observed externally. Namely, the server would accept the connection and just leave it hanging, eventually timing out.

That was an interesting discovery, since it meant that we could rule out anything in the middle; this was a problem on this machine. But the problem remained quite hard to figure out. We deployed to production in a manner similar to what we expect our users will do, so we used Let’s Encrypt as the certificate authority with auto generated certificates.

So we started by trying to see where the problem was, whether on the TCP side or the SSL side, and issued the following command:

openssl s_client -connect b.cluster-name.dbs.local.ravendb.net:443

This command showed an immediate connection to the server, with the client sending the ClientHello properly, but then just hanging there. What was really interesting is that if we waited about two minutes, the SSL connection would complete successfully. But we couldn’t figure out any reason why that would be the case. It occurred to me that it might be related to the system’s handling of reverse DNS lookups. The two minute timeout was very suspicious, and I assumed that it might be trying to look up the client certificate and somehow resolve that. That isn’t how it works in general, although the fact that some SSH (and not SSL/TLS) configuration options directly relate to this led us on a merry chase.

Eventually we pulled out strace and looked into what was actually going on. We focused on the following calls:

sudo strace -fp 1017 -s 128 -Tfe open,socket,connect,sendto,recvfrom,write

The interesting bits from there are shown here:

As you can see, we are looking at some DNS resolution, as we can tell from the /etc/resolv.conf and /etc/hosts open() calls. Then we have a connect() to 172.31.0.2 which is an AWS name server. Note that this is done over UDP, as you can see from the SOCK_DGRAM option in the preceding socket() call.

We got some data back, and we can see identrust there. And then we see something really suspicious: a TCP socket call that goes to 192.35.177.64 on port 80. In other words, this is an HTTP call. What is an HTTP call doing in the middle of an SSL handshake?

As it turned out, our firewall configuration was blocking outbound connections to port 80. We tested removing that rule and everything came back online and the server was running just fine.

Further inspection revealed that we were calling: http://apps.identrust.com/roots/dstrootcax3.p7c

And this is where things started to gel. We are using Let’s Encrypt certificates, and in order to ensure trust, we need to send the full chain to the user. SSL certificates have the notion of Authority Information Access, which is basically a URL registered in the certificate that points to where you can find the certificate that signed this one.
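You can inspect that Authority Information Access URL on a certificate directly. Here is a minimal C# sketch; the certificate file name is a placeholder, not the actual file from this incident:

```csharp
using System;
using System.Security.Cryptography.X509Certificates;

class ShowAia
{
    static void Main()
    {
        // "server.crt" is a placeholder for your actual certificate file
        var cert = new X509Certificate2("server.crt");
        foreach (var ext in cert.Extensions)
        {
            // 1.3.6.1.5.5.7.1.1 is the OID of the Authority Information
            // Access extension, which carries the CA Issuers URL
            if (ext.Oid.Value == "1.3.6.1.5.5.7.1.1")
                Console.WriteLine(ext.Format(true));
        }
    }
}
```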

Why is this using HTTP? Because the data that will be fetched is already signed, and it is not a secret. And trying to use HTTPS to fetch it might get us into a loop.

So whenever we had a new SSL connection, we would try to connect to IdenTrust to get the full chain to send to the client. The kicker is that if we fail to do so, we’ll send the certificate chain we have (without the missing root), and it will still work, since the other side usually already has this root installed. On Windows, this certificate is installed by default, so we didn’t see the issue. On Linux, we didn’t have that certificate installed, so we had to look it up every single time.

The gory details, including dives into the source code, are in the GitHub issue. And I do think they are gory. In this case, once we realized what was going on, we were able to take steps to handle it. We needed to pre-register the entire chain on the local machine, so any lookup finds it locally instead of making a network call per SSL connection.
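Pre-registering the chain boils down to loading the missing intermediate into the machine’s certificate store. A sketch of that, assuming you have the intermediate certificate as a file (the file name here is illustrative):

```csharp
using System.Security.Cryptography.X509Certificates;

class RegisterChain
{
    static void Main()
    {
        // The file name is a placeholder - use whichever intermediate
        // certificate your chain is missing.
        var intermediate = new X509Certificate2("lets-encrypt-intermediate.crt");
        using (var store = new X509Store(StoreName.CertificateAuthority,
                                         StoreLocation.CurrentUser))
        {
            store.Open(OpenFlags.ReadWrite);
            // Adding a certificate that is already present is a no-op
            store.Add(intermediate);
        }
    }
}
```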

But beyond mutating the global certificate store, there is no real way to prevent that remote call.

Note that this is also true for Windows, although there it seems to be implemented much deeper in the stack, and not in managed code, so I wasn’t able to trace where it actually happens. The really bad thing is that from the outside there is no way for us to control or disable this, so it is just something that you have to remember to do when you use certificates: make sure that the entire chain is registered on each machine. Otherwise you might get a remote call per connection, or a very long (and synchronous!) hang until the request times out if you are blocking outgoing access.

Getting fatal out of memory errors because we are managing memory too well

time to read 4 min | 621 words

We hit a serious situation in one of our test cases. We put the system through a lot, pushing it to the breaking point and beyond. And it worked; in fact, it worked beautifully. Up until the point where we started to use too many resources and crashed. While normally that would be expected, it really bugged us, because we had provisions in place to protect us against exactly that. Bulkheads were supposed to kick in, operations to be rolled back, etc. We were supposed to react properly, reduce the cost of operations, prefer being up to being fast, the works.

That did not happen. From the outside, what happened is that we got to the point where we would trigger the “sky is about to fall, let’s conserve everything we can” mode, but we didn’t see the reaction that we expected from the system. Oh, we started to use a lot less resources, but the resources that we weren’t using? They weren’t going back to the OS; they were still held.

It’s easiest to talk about memory in this regard. We hold buffers in place to handle requests, and in order to avoid fragmentation, we typically make them large buffers that are resident on the large object heap.

When RavenDB detects that there is a low memory situation, it starts to scale back. It releases any held buffers, completes ongoing work, starts working on much smaller batches, etc. We saw that behavior, and we certainly saw the slowdown as RavenDB was willing to take less upon itself. But what we didn’t see was the actual release of resources as a result of this behavior.

And as it turned out, that was because we were too good at managing ourselves. A large part of the design of RavenDB 4.0 was about reducing the cost of garbage collections by reducing allocations as much as possible. This means that we run very few GCs. In fact, Gen 2 GC collections are rare in our environment. However, we need these Gen 2 collections to be able to clean up what is in the finalizer queue. In fact, we typically need two such runs before the GC can be certain that the memory is not in use and actually collect it.

In this particular situation, we were careful to code so that we would get very few GC collections, and that led us to crash, because we would run out of resources before the GC could realize that we were not actually using them at that point.

The solution, by the way, was to change the way we respond to low memory conditions. We are now less insistent on keeping all the memory around: if it isn’t being used, we start discarding it a lot sooner, so the GC has a better chance to realize that it isn’t in use and recover the memory. And instead of throwing the buffers away all at once when we are low on memory and hoping that the GC will be fast enough to collect them, we keep them around and reuse them, avoiding the additional allocations that processing more requests would have required.

Since the GC isn’t likely to be able to actually free them in time, we aren’t affecting the total memory consumed in this scenario, but we are able to reduce allocations by serving requests from the buffers that are already allocated. These two actions, being less rigorous about policing our memory and not freeing things when we get low on memory, confusingly enough both reduce the chance of getting into low memory and reduce the chance of actually using too much memory in such a case.

Rejection, dejection and resurrection, oh my!

time to read 4 min | 607 words

Regardless of how good your software is, there is always a point where we can put more load on the system than it is capable of handling.

One such case is firing about a hundred requests per second, regardless of whether the previous requests have completed, while at the same time throttling the I/O so we can’t complete the requests fast enough.

What happens then is known as a convoy. Requests start piling up; as more and more work is waiting to be done, we fall further and further behind. The typical way this ends is when you run out of resources completely. If you are using a thread per request, you end up with all your threads blocked on some lock. If you are using async operations, you consume more and more memory as you hold the async state of each request until it is completed.

We put a lot of pressure on the system, and we want to know that it responds well. And the way to do that is to recognize that there is a convoy in progress and handle it. But how can you do that?

The problem is that you are currently in the middle of processing a set of operations in a transaction. We can obviously abort it, and roll back everything, but the problem is that we are now in the second stage. We have a transaction that we wrote to the disk, and we are waiting for the disk to come back and confirm that the write is successful while already speculatively executing the current transaction. And we can’t abort the transaction that we are currently writing to disk, because there is no way to know at what stage the write is. 

So we now need to decide what to do, and we chose the following set of behaviors. When running a speculative transaction (a transaction that is run while the previous transaction is being committed to disk), we observe the amount of memory that is used by this transaction. If the amount of memory being used is too high, we stop processing incoming operations and wait for the previous transaction to come back from the disk.

At the same time, we might still be getting new operations to complete, but we can’t process them. At that point, after we have waited long enough to be worried, we start proactively rejecting requests, telling the client immediately that we are in a timeout situation and that they should failover to another node.
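As a rough illustration of this policy; all the names and thresholds here are invented, not RavenDB’s actual code:

```csharp
using System;
using System.Threading;

// Sketch of the back-pressure policy described above.
class BackPressure
{
    // Invented threshold: how big a speculative transaction may grow
    const long MaxSpeculativeTxBytes = 64 * 1024 * 1024;
    static readonly TimeSpan WorryingTimeout = TimeSpan.FromSeconds(5);
    // Signaled when the previous transaction's disk write completes
    public static readonly ManualResetEventSlim PreviousTxFlushed =
        new ManualResetEventSlim(false);

    public static void CheckSpeculativeTx(long txSizeInBytes)
    {
        if (txSizeInBytes <= MaxSpeculativeTxBytes)
            return; // keep processing normally

        // Too much has piled up: wait for the previous transaction's
        // write to complete before taking on more work.
        if (PreviousTxFlushed.Wait(WorryingTimeout) == false)
            // The disk is badly stalled; reject outright and tell the
            // client to failover to another node.
            throw new TimeoutException("Overloaded, failover to another node");
    }
}
```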

The key problem is that I/O is, by its nature, highly unpredictable, and may be impacted by many things. On the cloud, you might hit your IOPS limits and see a drastic drop in performance all of a sudden. We considered a lot of ways to manage it ourselves, by limiting what kind of I/O operations we issue at any given time, queuing and optimizing things, but we can only control the things that we do. So we decided to just measure what is going on and react accordingly.

Beyond being proactive about incoming requests, we also make sure to surface these kinds of details to the user:

[screenshot]

Knowing that the I/O system may be giving us this kind of response can be invaluable when you are trying to figure out what is going on. And we made sure that this is very clearly displayed to the admin.

The curse of memory fragmentation

time to read 2 min | 393 words

We took a memory dump of a production server that was exhibiting high memory usage. Here are the relevant parts:

You can already see that there is a lot of fragmentation going on. In this case, there are a few things that we want to pay special attention to. First, there are about 3GB of free space, and we are seeing a lot of fragmented blocks.

[screenshot]

Depending on your actual GC settings, you might expect some of this. We typically run with Server mode and RetainVM, which means that the GC will delay releasing memory to the operating system, so in some cases a high amount of memory held by the process isn’t an issue; but you need to look at the order of magnitude. If you are looking at the WinDBG output and seeing hundreds of thousands of fragments, it means that the GC will need to work that much harder when allocating. It also means that it can’t really compact memory and optimize things for higher locality, prevent objects’ promotion to a higher GC generation, etc.

This is also usually the result of pinned memory, typically for I/O or interop. Pinning can leave small buffers stuck all over the heap, but most I/O systems are well aware of that and use various tricks to avoid it, typically by allocating buffers large enough that they reside in the Large Object Heap, which doesn’t get compacted very often (if ever). If you are seeing something like this in your application, the first thing to check is the number of pinned buffers and instances.

In our case, we intentionally made a change to the system that had the side effect of pinning small buffers in memory for a long time, mostly to see how bad that would be. This was to see if we could simplify buffer management somewhat. The answer was that it is quite bad, so we had to manage the buffers more proactively. We allocate a large buffer on the large object heap, then slice it into multiple segments and pool those segments. This way we get small buffers that aren’t wasting a lot of memory, but avoid high memory fragmentation even though they have to be pinned for longish periods.
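The slicing approach can be sketched like so; the names are illustrative, not RavenDB’s actual types:

```csharp
using System;
using System.Collections.Generic;

// Sketch: allocate one buffer on the Large Object Heap and slice it
// into pooled segments, so small buffers never pin small objects
// scattered across the heap.
public class SegmentPool
{
    // > 85,000 bytes, so it lands on the Large Object Heap
    private readonly byte[] _large = new byte[1024 * 1024];
    private readonly Stack<ArraySegment<byte>> _free =
        new Stack<ArraySegment<byte>>();

    public SegmentPool(int segmentSize = 4096)
    {
        for (int offset = 0; offset < _large.Length; offset += segmentSize)
            _free.Push(new ArraySegment<byte>(_large, offset, segmentSize));
    }

    public ArraySegment<byte> Rent() => _free.Pop();

    public void Return(ArraySegment<byte> segment) => _free.Push(segment);
}
```

Pinning the single large array (or segments of it) leaves the small-object heaps free to compact normally.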

Memory management as the operating system sees it

time to read 7 min | 1370 words

About 15 years ago I got a few operating systems books and started reading them cover to cover. They were quite interesting to someone who was just starting to learn that there is something under the covers. I remember thinking that this was a pretty complex topic and that the operating system had to do a lot to make everything seem to just work.

The discussion of memory was especially enlightening, since the details of what goes on behind the scenes of the flat memory model we usually take for granted are fascinating. In this post, I’m going to lay out a few terms and try to explain how the operating system sees things and how that impacts your application.

The operating system needs to manage the RAM, and typically there is also some swap space available to spill things to. There are also memory mapped files, which come with their own backing store, but I’m jumping ahead a bit.

Physical memory – The amount of RAM on the device. This is probably the simplest to grasp here.

Virtual memory – The amount of virtual memory each process can access. This is different for each process and quite different from how much memory is actually in use.

  • Reserved virtual memory – a section of virtual memory that was reserved by the process. The only thing that the operating system needs to do is not allocate anything within this range of memory. It comes with no other costs. Trying to access this memory without first committing it will cause a fault.
  • Committed virtual memory – a section of virtual memory that the process has told the operating system it intends to use. The operating system commits to having this memory available when the process actually uses it. The system can also refuse to commit memory if it chooses to (for example, because it doesn’t have enough memory for that).
  • Used virtual memory – a memory section that was previously committed and is actually in use. Committing memory doesn’t actually do anything by itself; only when you access the memory will the OS assign a physical memory page to back it. The distinction between the last two is quite important. It is very common to commit far more memory than is actually in use, and by not taking up any space until it is used, the OS can save a lot of work.

Memory mapped files – a section of the virtual address space that uses a particular file as its backing store.

Shared memory – a named piece of memory that may be mapped into more than a single process.

All of these interact with one another in non-trivial ways, so it can sometimes be hard to figure out what is going on.

The interesting case happens when the amount of memory we want to access is higher than the amount of physical RAM on the machine. At this point, the operating system needs to start juggling things around and actually making decisions.

Reserving virtual memory is mostly a very cheap operation. It can be used when you want a contiguous memory space but don’t need all of it right now. On 32 bits, the address space is quite constrained, so reservation can fail, but on 64 bits you typically have enough address space that you don’t have to worry about it.

Committing virtual memory is where we start getting into interesting issues. We ask the operating system to ensure that we can access this memory, and it will typically say yes. But in order to make that commitment, the OS needs to look at its global state: how many other commitments has it made? In general, the amount of memory the OS can safely commit is limited to the size of the RAM plus the size of the swap. Windows will simply refuse to commit more (though it can dynamically increase the size of the swap as load grows), while Linux will happily ignore the limit and rely on the fact that applications rarely make use of all the memory they commit.

So committed memory is counted against the limit, but it isn’t memory that is actually in use. Only when a process accesses the memory will the OS allocate it; until then, it is just a ledger entry.
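On Windows, the reserve / commit / use distinction maps directly onto VirtualAlloc flags. A small illustrative sketch (Windows only, via P/Invoke):

```csharp
using System;
using System.Runtime.InteropServices;

class CommitDemo
{
    const uint MEM_RESERVE = 0x2000, MEM_COMMIT = 0x1000, PAGE_READWRITE = 0x04;

    [DllImport("kernel32.dll", SetLastError = true)]
    static extern IntPtr VirtualAlloc(IntPtr lpAddress, UIntPtr dwSize,
        uint flAllocationType, uint flProtect);

    static void Main()
    {
        // Reserving 1GB of address space is nearly free - nothing backs it yet,
        // and it does not count against the commit limit.
        IntPtr reserved = VirtualAlloc(IntPtr.Zero, (UIntPtr)(1L << 30),
            MEM_RESERVE, PAGE_READWRITE);

        // Committing the first 64KB counts against the system commit limit,
        // but still assigns no physical page.
        IntPtr committed = VirtualAlloc(reserved, (UIntPtr)(64 * 1024),
            MEM_COMMIT, PAGE_READWRITE);

        // Only on first touch does the OS hand us an actual page of RAM.
        Marshal.WriteByte(committed, 0, 1);
    }
}
```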

But the memory on your machine is not just what processes have allocated. There is a bunch of other stuff that may make use of the physical memory. There are I/O bound devices, which we’ll ignore because they don’t matter for us at this point.

Of much more interest to us at this point is the notion of memory mapped files. These are most certainly memory resident, but they aren’t counted against the commit size of the system. Why is that? Because when we use a memory mapped file, by definition we are also supplying a file that will be the backing store for this memory. That, in turn, means that we don’t need to worry about where to put this memory if we need to evict some of it for other purposes; we have the actual file.

All of this, of course, revolves around the issue of what will actually reside in physical memory. And that leads us to another very important term:

Active working set – The portion of the process memory that resides in physical RAM. Some portions of the process memory may have been paged to disk (if the system is overloaded, or if the process has just mapped a file and hasn’t accessed the contents yet). The term actually refers to the amount of memory that the process has recently been using; under load, the working set may be higher than the amount of physical memory available, leading to thrashing. The OS will keep evicting pages to the page file and then loading them again, in a vicious cycle that typically kills the machine.

Now that we know all these terms, let’s take a look at what RavenDB reports in one case:

[screenshot]

The total system memory is 8 GB (about 200MB are reserved for the hardware). RavenDB is using 5.96GB, while the machine’s entire memory usage is 1.95GB. How can a single process use more memory than the entire machine?

The reason is that we aren’t always talking about the same thing. Here is the pertinent memory information from this machine (cat /proc/meminfo):

[screenshot]

You can see that we have a total of 8GB of memory, but only 140MB are free. 2.2GB are in active use, and a lot of memory is inactive.

There is also the MemAvailable field, which says that we have 6.2GB available. But what does this mean? It is a guesstimate of how much memory we can start using without starting to swap. Taking the values from top, it might be easier to understand:

[screenshot]

There are about 6GB of cached data, but what is being cached? The answer is that RavenDB makes use of memory mapped files, so in effect we gave the system extra swap space. Here is what this looks like when looking at the RavenDB process:

[screenshot]

In other words, large parts of our working set are composed of memory mapped files, and we don’t want to count those against the actual memory in use in the system. It is very common for us to operate with almost no free memory, because that memory is being used (by the memory mapped files) and the OS knows that it can just repurpose this memory if new demand comes in.

Handling resource disposal on constrained machines

time to read 3 min | 427 words

A user reported that when running a set of unit tests against a RavenDB server running on a 1 CPU, 512MB Docker machine instance, they were able to reliably reproduce an out of memory exception that would kill the server.

They were able to do that by simply running a long series of pretty simple unit tests. In some cases, this crashed the server.

It took a while to figure out the root cause. RavenDB unit tests run on isolated databases, with each test having its own db. Running a lot of these tests, especially short ones, will effectively create and delete a lot of databases on the server.

So we were able to reproduce this independently by just running create/delete database in a loop, and the problem became obvious.

Spinning up and tearing down a database are pretty big operations. Unit tests aside, this is not something that you’ll typically do very often. But with unit tests, you may be doing it thousands of times very rapidly. And it looked like that caused a problem, but why?

Well, we had hanging references to these databases. There were two major sources:

  • Timers of various kinds that might hang around for a while, waiting to fire. We had a few cases where we weren’t actually stopping the timer on db teardown, just checking whether the db was disposed when the timer fired.
  • Queues for background operations, which might be delayed because we want to optimize I/O globally. In particular, flushing data to disk is expensive, so we defer it as late as possible. But we didn’t remove the db entry from the flush queue on shutdown, relying on the flusher to check whether the db was disposed.
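The timer case can be sketched like so; the names are invented for illustration, not RavenDB’s actual code:

```csharp
using System;
using System.Threading;

// Sketch: dispose the timer eagerly on teardown instead of letting the
// pending callback keep the database instance reachable.
public class DatabaseInstance : IDisposable
{
    private readonly Timer _idleTimer;

    public DatabaseInstance()
    {
        _idleTimer = new Timer(_ => OnIdle(), null,
            TimeSpan.FromMinutes(5), TimeSpan.FromMinutes(5));
    }

    private void OnIdle()
    {
        // background maintenance work...
    }

    public void Dispose()
    {
        // Stop the timer now, rather than letting it fire later and
        // discover that the db is disposed - the scheduled callback
        // would otherwise hold a reference to this instance.
        _idleTimer.Dispose();
    }
}
```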

None of these are really leaks. In both cases, everything would get cleaned up eventually. But while that happens, they keep a reference to the database instance and prevent us from fully releasing all the resources associated with it.

Being more explicit about freeing all these resources immediately, rather than waiting a few seconds for them to be released automatically, made a major difference in this scenario. This was only an issue on such small machines because the amount of memory they had was small enough that they ran out before they could start clearing the backlog. On machines with a bit more resources, everything hums along nicely, because there is enough spare capacity.

Production crash root cause analysis step by step

time to read 3 min | 470 words

One of our production servers crashed. They do that a lot, because we make them do evil things in harsh environments so they will crash on our end and not when running on your systems. This one is an interesting crash.

We run the system with auto dump enabled, so a crash triggers a mini dump that we can analyze. Here is the root cause of the error, an IOException:

    SP               IP               Function
    00000024382FF0D0 00007FFD8313108A System_Private_CoreLib!System.IO.FileStream.WriteCore(Byte[], Int32, Int32)+0x3cf7ea
    00000024382FF140 00007FFD82D611A2 System_Private_CoreLib!System.IO.FileStream.FlushWriteBuffer(Boolean)+0x62
    00000024382FF180 00007FFD82D60FA1 System_Private_CoreLib!System.IO.FileStream.Dispose(Boolean)+0x41
    00000024382FF1D0 00007FFD82D60631 System_Private_CoreLib!System.IO.FileStream.Finalize()+0x11

And with this, we are almost done analyzing the problem, right? This is a crash because a finalizer has thrown, so we now know that we are leaking a file stream and that it throws in the finalizer. We are almost, but not quite, done. We are pretty careful about such things, and most if not all of our usages are wrapped in a using statement.

The reason for the error was quite apparent when we checked: we ran out of disk space. But that is something that we extensively tested, and we are supposed to handle it fine. What is going on?

Running out of disk space is an interesting I/O error, because it is:

  • transient, in the sense that the admin can clear some space, and does not expect you to crash.
  • persistent, in the sense that the application can’t really do something to recover from it.

As it turns out, we have been doing everything properly. Here is the sample code that will reproduce this error:
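The original sample isn’t reproduced here, but based on the description that follows, it was something along these lines; the path is a placeholder for a volume with almost no free space:

```csharp
using System.IO;

class Repro
{
    static void Main()
    {
        // Assume the volume holding this path has only 10 bytes free.
        var fs = new FileStream("/mnt/full-volume/test.bin", FileMode.Create);
        for (int i = 0; i < 12; i++)
            fs.WriteByte(1); // goes to the in-memory buffer, not the disk
        fs.Dispose(); // the buffer is flushed here - and throws IOException
    }
}
```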

For the moment, ignore things like file metadata, etc. We are writing 12 bytes, and there is only room for 10. Note that calling WriteByte does not go to the disk, extend the file, etc. Instead, it just writes to the in-memory buffer, and the buffer is flushed when you call Flush or Dispose.

Pretty simple, right?

Now let us analyze what will happen in this case, shall we?

We call Dispose on the FileStream, and that ends up throwing, as expected. But let us dig a little further.

Here is the actual Dispose code:
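The code isn’t reproduced here, but it boils down to the classic dispose pattern; a simplified sketch:

```csharp
using System;

public abstract class StreamSketch : IDisposable
{
    public void Dispose()
    {
        // If Dispose(true) throws, the SuppressFinalize call below never
        // runs, the finalizer stays registered, and it will hit the very
        // same failure again - this time on the finalizer thread.
        Dispose(true);
        GC.SuppressFinalize(this);
    }

    protected abstract void Dispose(bool disposing);
}
```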

This is in Stream.Dispose. Now, consider the case of a class that:

  • Implements a finalizer, such as FileStream.
  • Had a failure during the Dispose(true); call.
  • Will have the same failure the next time you call Dispose(), even if with Dispose(false);

In this case, Dispose(true) fails, so SuppressFinalize is never called. Dispose will then be invoked again from the finalizer, which will also fail, bringing our entire system down.

The issue was reported here, and I hope it will be fixed by the time you read this.

The cost of finalizers

time to read 2 min | 231 words

A common pattern we use is to wrap objects that require disposing of resources with a dispose call and a finalizer. Here is a simple example:
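A minimal sketch of the shape being discussed; the real class is more involved, and the size here is invented:

```csharp
using System;

public class SecretItem : IDisposable
{
    public byte[] Buffer = new byte[64 * 1024];

    public void Dispose()
    {
        // ... release the unmanaged resource here ...
        // Note: Buffer is still referenced after Dispose - this is the
        // problem discussed below.
        GC.SuppressFinalize(this);
    }

    ~SecretItem()
    {
        Dispose();
    }
}
```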

This is a pretty simple case, and obviously we try to avoid these allocations, so we try to reuse the SecretItem class.

However, when digging into a high memory utilization issue, we ran into a really strange situation. We had a lot of buffers in memory, and most of them were held by items that were in turn held by the finalization queue.

I started to write up what I had figured out about finalizers in .NET, but this post does a great job of explaining everything. The issue here is that we retain a reference to the byte array after the dispose. Because the object is finalizable, the GC isn’t going to be able to remove it on the spot. That means that it is reachable, and anything it holds is also reachable.

Now, we are trying very hard to be nice to the GC and not allocate too much, which means that the finalizer thread only wakes up occasionally; that, in turn, means that the buffer is held in memory for a much longer duration.

The fix was to set the Buffer to null during the Dispose call, which means that the GC can pick up that it isn’t actually being used much faster.
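In sketch form, with the class shape kept illustrative:

```csharp
using System;

public class SecretItem : IDisposable
{
    public byte[] Buffer = new byte[64 * 1024];

    public void Dispose()
    {
        // Drop the reference so the GC can reclaim the array right away,
        // even while this object lingers on the finalization queue.
        Buffer = null;
        GC.SuppressFinalize(this);
    }

    ~SecretItem()
    {
        Dispose();
    }
}
```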
