Ayende @ Rahien

Oren Eini aka Ayende Rahien CEO of Hibernating Rhinos LTD, which develops RavenDB, a NoSQL Open Source Document Database.

Get in touch with me:

oren@ravendb.net

+972 52-548-6969

Posts: 7,408 | Comments: 50,842

Privacy Policy Terms
filter by tags archive
time to read 1 min | 183 words

RavenDB has the public live test instance, and we have recently upgraded that to version 6.0.  That means that you can start playing around with RavenDB 6.0 directly, including giving us feedback on any issues that you find.

Of particular interest, of course, is the sharding feature, it is right here:

image

And once enabled, you can see things in more details:

image

If we did things properly, the only thing you’ll notice that indicates that you are running in sharded mode is:

image

Take a look, and let us know what you think.

As a reminder, at the top right of the page, there is the submit feedback option:

image

Use it, we are waiting for your insights.

time to read 2 min | 262 words

Trevor Hunter from Kobo Rakuten is going to be speaking about Kobo’s usage of RavenDB in a webinar next Wednesday.

When I started at Kobo, we needed to look beyond the relational and into document databases. Our initial technology choice didn't work out for us in terms of reliability, performance, or flexibility, so we looked for something new and set on a journey of discovery, exploratory testing, and having fun pushing contender technologies to their limits (and breaking them!). In this talk, you'll hear about our challenges, how we evaluated the options, and our experience since widely adopting RavenDB. You'll learn about how we use it, areas we're still a bit wary of, and features we're eager to make more use of. I'll also dive into the key aspects of development - from how it affects our unit testing to the need for a "modern DBA" role on a development team.

About the speaker: Trevor Hunter: "I am a leader and coach with a knack for technology. I’m a Chief Technology Officer, a mountain biker, a husband, and a Dad. My curiosity to understand how things work and my desire to use that understanding to help others are the things I hope my kids inherit from me. I am currently the Chief Technology Officer of Rakuten Kobo. Here I lead the Research & Development organization where our mission is to deliver the best devices and the best services for our readers. We innovate, create partnerships, and deliver software, hardware, and services to millions of users worldwide."

You can register to the webinar here.

time to read 2 min | 246 words

RavenDB Sharding is now running as a production replication in our backend systems and we are stepping up our testing in a real-world environment.

We are now also publishing nightly builds of RavenDB 6.0, including Sharding support.

image

There are some known (minor) issues in the Studio, which we are busy fixing, but it is already possible to create and manage sharded clusters. As usual, we would love your feedback.

Here are some of the new features that I’m excited about, you can see that we have an obese bucket here, far larger than all other items:

image

And we can drill down and find why:

image

We have quite a few revisions of a pretty big document, and they are all obviously under a single bucket.

You can now also get query timings information across the entire sharded cluster, like so:

image

And as you can see, this will give you a detailed view of exactly where the costs are.

You can get the nightly build and start testing it right now. We would love to hear from you, especially as you test the newest features.

time to read 2 min | 399 words

Let’s assume that you want to make a remote call to another server. Your code looks something like this:

var response = await httpClient.GetAsync("https://api.myservice.app/v1/create-snap", cancellationTokenSource.Token);

This is simple, and it works, until you realize that you have a problem. By default, this request will time out in 100 seconds. You can set it to a shorter timeout using HttpClient.Timeout property, but that will lead to other problems.

The problem is that internally, inside HttpClient, if you are using a Timeout, it will call CancellationTokenSource.CancelAfter(). That is... what we want to do, no?

Well, in theory, but there is a problem with this approach. Let's sa look at how this actually works, shall we?

It ends up setting up a Timer instance, as you can see in the code. The problem is that this will modify a global value (well, one of them, there are by default N timers in the process, where N is the number of CPUs that you have on the machine.

What that means is that in order to register a timeout, you need to take a look. If you have a high concurrency situation, setting up the timeouts may be incredibly expensive.

Given that the timeout is usually a fixed value, within RavenDB we solved that using a different manner. We set up a set of timers that will go off periodically and then use this instead. We can request a task that will be completed on the next timeout duration. This way, we'll not be contending on the global locks, and we'll have a single value to set when the timeout happens.

The code we use ends up being somewhat more complex:

var sendTask = httpClient.GetAsync("https://api.myservice.app/v1/create-snap", cancellationTokenSource.Token);
var waitTask = TimeoutManager.WaitFor(TimeSpan.FromSeconds(15), cancellationTokenSource.Token);

if (Task.WaitAny(sendTask, waitTask) == 1)
{
        throw new TimeoutException("The request to the service timed out.");
}

Because we aren't spending a lot of time doing setup for a (rare) event, we can complete things a lot faster.

I don't like this approach, to be honest. I would rather have a better system in place, but it is a good workaround for a serious problem when you are dealing with high-performance systems.

You can see how we implemented the TimeoutManager inside RavenDB, the goal was to get roughly the same time frame, but we are absolutely fine with doing roughly the right thing, rather than pay the full cost of doing this exactly as needed. For our scenario, roughly is more than accurate enough.

time to read 2 min | 393 words

A customer reported a scenario where RavenDB was using stupendous amounts of memory. In the orders of tens of GB on a system that didn’t have that much load.

Our first suspicion was that this is an issue with reading the metrics, since RavenDB will try to keep as much of the data in memory, which sometimes leads users to worry. I spoke about this at length in the past.

In this case, that wasn’t the case. We were able to drill down into the exact cause of the memory usage and we found out that RavenDB was using an abnormally high amount of memory. The question was why that was, exactly.

We looked into the common operations on the server, and we found a suspicious query, it looked something like this:

from index 'Sales/Actions'
where endsWith(WorkflowStage, '/Final')

The endsWith query was suspicious, so we looked into that further. In general, endsWith requires us to scan all the unique terms for a particular field, but in most cases, there aren’t that many unique values for a field. In this case, however, that wasn’t the case, here are some of the values for WorkflowStage:

  • Workflows/3a1af12a-b5d2-4c96-9348-177ebaacab6c/Step-2
  • Workflows/6aacc86c-2f28-4b8b-8dee-1024314d5add/Final

In total, there were about 250 million sales in the database, each one of them with a unique WorflowStage value.

What does this mean, in terms of RavenDB query execution? Well, the fields are indexed, but we need to effectively do:

This isn’t the actual code, but it will show you what is going on.

In other words, in order to process this query, we have to scan (and materialize) all 250 million unique terms for this field. Obviously that is going to consume a lot of memory.

But what is the solution to that? Instead of doing an expensive endsWith query, we can move the computation from the query time to the index time.

In other words, instead of indexing the WorkflowStage field  as is, we’ll extract the information we want from it. The index would have one of those:

IsFinalWorkFlowStage = doc.WorkflowStage.EndsWith(“/Final”),

WorkflowStagePostfix = doc.WorkflowStage.Split(‘/’).Last()

The first one will check whether the value is final or not, while the second just gets the (one of hopefully a few) postfixes for the field. We can then query using equality instead of endsWith, leading to far better performance and greatly reduced memory usage, since we don’t need to materialize any values during the query.

time to read 2 min | 277 words

image A user of ours called us, quite frantic. They are running a lot of systems on RavenDB, and have been for quite some time.

However, very recently they started to run into severe issues. RavenDB would complain that there isn’t sufficient memory to run.

The system metrics, however, said that there are still gobs of GBs available (I believe that this is the appropriate technical term).

After verifying the situation, the on-call engineer escalated the issue. The problem was weird. There was enough memory, for sure, but for some reason RavenDB would be unable to run properly.

An important aspect is that this user is running a multi-tenant system, with each tenant being served by its own database. Each database has a few indexes as well.

Once we figured that out, it was actually easy to understand what is going on.

There are actually quite a few limits that you have to take into account. I talked about them here. In that post, the issue was the maximum number of tasks defined by the system. After which, you can no longer create new threads.

In this case, the suspect was: vm.max_map_count.

Beyond just total memory, Linux has a limit on the number of memory mappings that a process may have. And RavenDB uses Voron, which is based on mmap(), and each database and each index typically have multiple maps going on.

Given the number of databases involved…

The solution was to increase the max_map_count and add a task for us, to give a warning to the user ahead of time when they are approaching the system's limits.

time to read 11 min | 2005 words

imageA user reported that they observed nodes in the cluster “going dark”. Basically, they would stop communicating with the rest of the cluster, but would otherwise appear functional. Both the internal and external metrics were all fine, the server would just stop responding to anything over the network. The solution for the problem was to restart the service (note, the service, not the whole machine), but the problem would happen every few days.

As you can imagine, we are taking this sort of thing very seriously, so we looked into the problem. And we came up short. The problem made absolutely no sense. The problem occurred on a (minor) version migration, but there was absolutely nothing related to this that could cause it. What was really weird was that the service itself continue to work. We could see log entries being written and it was able to execute scheduled backups, for example. It would just refuse to talk to us over the network.

That was super strange, since the network itself was fine. All the monitoring systems were green, after all. For that matter, the user was able to SSH into the system to restart the service. This didn’t match with any other issue we could think of. Since the user worked around the problem by restarting the server, we didn’t have a lead.

Then we noticed the exact same problem in one of our cloud instances, and there we have much better diagnostic capabilities. Once we had noticed a problematic server, we were able to SSH into that and try to figure out what was going on.

Here is what we found out:

  • The server will not respond to HTTP(s) communication either from outside the machine or by trying to connect from inside the machine.
  • The server will respond to SNMP queries both from inside the machine and outside of it (which is how we typically monitor the system).

When we designed RavenDB, we implemented a “maintenance hatch” for such scenarios, in addition to using HTTP(s) for communication, RavenDB also exposes a named pipe that allows you to connect to the server without going through the network at all. This ensures that if you have administrator privileges on the server, you are able to connect even if there are network issues, certificate problems, etc.

Here is the kicker. Under this particular situation, we could not activate this escape hatch. That is not supposed to be possible. Named pipes on Linux, where we run into the problem, are basically Unix Sockets. A network issue such as a firewall problem or something similar isn’t going to affect them.

At the same time, we were able to communicate with the process using SNMP. What is the problem?

Lacking any other options, we dumped the process, restarted the service, and tried to do the analysis offline. We couldn’t find any problem. All the details we looked at said that everything was fine, the server was properly listening to new connections and it should work. That was… weird.

And then it happened again, and we did the same analysis, and it came back the same. We were clueless. One of the things that we updated between versions was the .NET runtime that we were using, so we opened an issue to see if anyone ran into the same problem.

And then it happened again. This time, we knew that just looking at the dump wouldn’t help us, so we tried other avenues. Linux has a pretty rich set of knobs and dials that you can look at to see what was going on. We suspected that this may be an issue with running out of file descriptors, running out of memory, etc.

We tried looking into what is going on inside the process using strace, and everything was fine. The trace clearly showed that the server was processing requests and was able to send and receive data properly.

Wait, go through that statement again please!

It is fine? But the reason we are using strace is that there is a problem. It looks like the problem fixed itself. That was annoying, because we were hoping to use the trace to figure out what is going on. We added more monitoring along the way, which would let us know if the server found itself isolated. And we waited.

The next time we ran into the problem, the first thing we did was run strace, we needed to get the root cause as soon as possible, and we were afraid that it would fix itself before we had a chance to get to the root cause. The moment we used strace, the server got back online, continuing as if there was never any issue.

Over the next few instances of this issue, we were able to confirm the following observations:

  1. The service would stop responding to TCP and Unix Sockets entirely.
  2. There were no firewall or network issues.
  3. The service was up and functional, tailing the log showed activity.
  4. We could query the server state using SNMP.
  5. Running strace on the service process would fix the problem.

There are a few more things, the actual trigger for the fix wasn’t strace itself. It was the ptrace() call, which it uses. That would cause the service to start responding again. The ptrace() call is basically the beginning and the end of debugging under Linux. Everything uses it.

If you want to dump a memory process, you start with ptrace(). You want to trace the calls, ptrace(). You want to debug the process? GDB will start by calling ptrace(), etc.

And doing that would alleviate the problem.

That was… quite annoying.

We still had absolutely no indication of what the root cause even was.

We suspected it may be something inside Kestrel that was causing a problem. But that wouldn’t affect the named pipes / Unix sockets that we also saw.

Networking worked, because SNMP did. We thought that this may be because SNMP uses UDP instead of TCP, and looked into that, but we couldn’t figure out how that would be any different.

Looking at this further, we found that we have this in the code dumps:

      ~~~~ 5072
         1 Interop+Sys.WaitForSocketEvents(IntPtr, SocketEvent*, Int32*)
         1 System.Net.Sockets.SocketAsyncEngine.EventLoop()
         1 System.Net.Sockets.SocketAsyncEngine+<>c.ctor>b__14_0(Object)

As you can see, we are waiting for this in the .NET Sockets thread. The SNMP, on the other hand, looked like:

Thread (0x559):
   [Native Frames]
   System.Net.Sockets!System.Net.Sockets.SocketPal.SysReceive()
   System.Net.Sockets!System.Net.Sockets.SocketPal.TryCompleteReceiveFrom()
   System.Net.Sockets!System.Net.Sockets.SocketAsyncContext.ReceiveFrom()
   System.Net.Sockets!System.Net.Sockets.SocketPal.ReceiveFrom()
   System.Net.Sockets!System.Net.Sockets.Socket.ReceiveFrom()
   SharpSnmpLib.Engine!Lextm.SharpSnmpLib.Pipeline.ListenerBinding.AsyncReceive()

That was really interesting, since it meant that for sockets (both HTTP and Unix), we were always using async calls, but for SNMP, we were using the synchronous API. We initially suspected that this may be something related to the thread pool. Maybe we had something that blocked it, but it turns out to be a lot more interesting. Here is the code that is actually handling the SNMP:

var count = _socket.ReceiveFrom(buffer, ref remote);

Task.Factory.StartNew(() => HandleMessage(buffer, count, (IPEndPoint)remote));

In other words, we are actually reading from the socket in a blocking manner, but then processing the actual message using the thread pool. So being able to get results via SNMP meant the thread pool was well.

At this point we resulted to hair pulling, rubber ducking and in some instances, shaking our fists to heaven.

I reminded myself that I’m an adult with a bit of experience solving problems, and dug deeper. We started looking into how .NET is actually handling sockets in async mode. This end up here, doing a system call:

while ((numEvents = epoll_wait(port, events, *count, -1)) < 0 && errno == EINTR);

Reading through the man page for epoll_wait() I learned how epoll() works, that it is complex and that we need to be aware of level-triggered and edge-triggered options. Since .NET uses edge-triggered events (EPOLLET, which I keep reading as electronic chicken), we focused on that.

There are a lot of edge cases and things to cover, but everything we checked was handled properly. We finally had a good smoking gun. For some reason, we weren’t getting notifications from epoll(), even though we should. Using strace() or friends somehow fixes that.

We actually found the exact scenario we saw in StackOverflow, but without any idea what the issue was. Truly, there is an XKCD for everything.

Our current understanding of the issue:

  • All async sockets in .NET are going through the same socket engine, and are using epoll() under the covers.
  • SNMP is using synchronous calls, so it wasn’t using epoll().

That covers both of the weird things that we are seeing. So what is the issue?

It is not in .NET. Given the size & scope of .NET, we wouldn’t be the only ones seeing that. Below .NET, there is the kernel, so we looked into that. The machines we were running that on were using kernel 5.4.0-azure-1095, so we looked into that.

And it looked like it is a kernel bug, which was fixed in the next updated kernel. A race condition inside the kernel would cause us to miss wakeups, and then we would basically just stall without anything to wake us up.

We dug deeper to understand a bit more about this situation, and we got this:

       Some system calls return with EINTR if a signal was sent to a
       tracee, but delivery was suppressed by the tracer.  (This is very
       typical operation: it is usually done by debuggers on every
       attach, in order to not introduce a bogus SIGSTOP).  As of Linux
       3.2.9, the following system calls are affected (this list is
       likely incomplete): epoll_wait(2), and read(2) from an inotify(7)
       file descriptor.  The usual symptom of this bug is that when you
       attach to a quiescent process with the command

           strace -p <process-ID>

       then, instead of the usual and expected one-line output such as

           restart_syscall(<... resuming interrupted call ...>_

       or

           select(6, [5], NULL, [5], NULL_

       ('_' denotes the cursor position), you observe more than one
       line.  For example:

               clock_gettime(CLOCK_MONOTONIC, {15370, 690928118}) = 0
               epoll_wait(4,_

       What is not visible here is that the process was blocked in
       epoll_wait(2) before strace(1) has attached to it.  Attaching
       caused epoll_wait(2) to return to user space with the error
       EINTR.  In this particular case, the program reacted to EINTR by
       checking the current time, and then executing epoll_wait(2)
       again.  (Programs which do not expect such "stray" EINTR errors
       may behave in an unintended way upon an strace(1) attach.)

And.. that is exactly what is happening. On attaching, the epoll_wait() will return with EINTR, which will cause .NET to retry the command, and that “fixes” the issue.

It makes total sense now, and concludes the discovery process of a pretty nasty bug.

Now, if you’ll excuse me, I need to go and apologize to a rubber duck.

image

time to read 1 min | 108 words

The recording of my webinar showing off the new Sharding feature in RavenDB 6.0 is now live. I’m showcasing the new technology preview of RavenDB 6.0 and we have a nightly build already available for it. I think the webinar was really good, and I’m super excited to discuss all the ins & out of how this works.

Please take a look, and take the software for a spin. We have managed to get sharding down to a simple, obvious and clear process. And we are very much looking for your feedback.

FUTURE POSTS

No future posts left, oh my!

RECENT SERIES

  1. Recording (8):
    17 Feb 2023 - RavenDB Usage Patterns
  2. Production postmortem (48):
    27 Jan 2023 - The server ate all my memory
  3. Answer (12):
    05 Jan 2023 - what does this code print?
  4. Challenge (71):
    04 Jan 2023 - what does this code print?
  5. RavenDB Indexing (2):
    20 Oct 2022 - exact()
View all series

RECENT COMMENTS

Syndication

Main feed Feed Stats
Comments feed   Comments Feed Stats