Ayende @ Rahien

time to read 2 min | 277 words

A user of ours called us, quite frantic. They are running a lot of systems on RavenDB, and have been for quite some time.

However, very recently they started to run into severe issues. RavenDB would complain that there isn’t sufficient memory to run.

The system metrics, however, said that there are still gobs of GBs available (I believe that this is the appropriate technical term).

After verifying the situation, the on-call engineer escalated the issue. The problem was weird. There was enough memory, for sure, but for some reason RavenDB would be unable to run properly.

An important aspect is that this user is running a multi-tenant system, with each tenant being served by its own database. Each database has a few indexes as well.

Once we figured that out, it was actually easy to understand what was going on.

There are actually quite a few limits that you have to take into account. I talked about them here. In that post, the issue was the maximum number of tasks defined by the system, after which you can no longer create new threads.

In this case, the suspect was: vm.max_map_count.

Beyond just total memory, Linux also limits the number of memory mappings that a process may hold. RavenDB uses Voron, which is based on mmap(), and each database and each index typically has multiple mappings going on.

Given the number of databases involved…

The solution was to increase max_map_count, and we also added a task for ourselves: warn the user ahead of time when they are approaching the system’s limits.
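For reference, here is a minimal sketch (not RavenDB’s actual code) of what such a check can look like on Linux: compare the number of memory mappings the current process holds (one line per mapping in /proc/self/maps) against the vm.max_map_count limit.

using System;
using System.IO;
using System.Linq;

class MapCountCheck
{
    static void Main()
    {
        // The per-process limit on memory mappings.
        long limit = long.Parse(File.ReadAllText("/proc/sys/vm/max_map_count").Trim());

        // Each line in /proc/self/maps represents one mapping held by this process.
        long current = File.ReadLines("/proc/self/maps").LongCount();

        Console.WriteLine($"mappings: {current:N0} / limit: {limit:N0}");

        if (current > limit * 0.8)
            Console.WriteLine("Warning: approaching vm.max_map_count, further mmap() calls may fail.");
    }
}

The limit itself is raised with sysctl (vm.max_map_count) and persisted in /etc/sysctl.conf or a drop-in under /etc/sysctl.d so it survives a reboot.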

time to read 11 min | 2005 words

A user reported that they observed nodes in the cluster “going dark”. Basically, they would stop communicating with the rest of the cluster, but would otherwise appear functional. Both the internal and external metrics were all fine, the server would just stop responding to anything over the network. The solution for the problem was to restart the service (note, the service, not the whole machine), but the problem would happen every few days.

As you can imagine, we take this sort of thing very seriously, so we looked into the problem. And we came up short. The problem made absolutely no sense. It occurred after a (minor) version migration, but there was absolutely nothing related to that which could cause it. What was really weird was that the service itself continued to work. We could see log entries being written, and it was able to execute scheduled backups, for example. It would just refuse to talk to us over the network.

That was super strange, since the network itself was fine. All the monitoring systems were green, after all. For that matter, the user was able to SSH into the system to restart the service. This didn’t match any other issue we could think of. Since the user worked around the problem by restarting the service, we didn’t have a lead.

Then we noticed the exact same problem in one of our cloud instances, where we have much better diagnostic capabilities. Once we noticed a problematic server, we were able to SSH into it and try to figure out what was going on.

Here is what we found out:

  • The server would not respond to HTTP(S) communication, either from outside the machine or when trying to connect from inside the machine.
  • The server would respond to SNMP queries, both from inside the machine and from outside it (which is how we typically monitor the system).

When we designed RavenDB, we implemented a “maintenance hatch” for exactly such scenarios. In addition to using HTTP(S) for communication, RavenDB also exposes a named pipe that allows you to connect to the server without going through the network at all. This ensures that if you have administrator privileges on the server, you are able to connect even if there are network issues, certificate problems, etc.

Here is the kicker. In this particular situation, we could not activate this escape hatch. That is not supposed to be possible. Named pipes on Linux, where we ran into the problem, are basically Unix sockets. A network issue such as a firewall problem or something similar isn’t going to affect them.

At the same time, we were able to communicate with the process using SNMP. What is the problem?

Lacking any other options, we dumped the process, restarted the service, and tried to do the analysis offline. We couldn’t find any problem. Everything we looked at said that the server was fine: it was properly listening for new connections and it should work. That was… weird.

And then it happened again, and we did the same analysis, and it came back the same. We were clueless. One of the things that we updated between versions was the .NET runtime that we were using, so we opened an issue to see if anyone ran into the same problem.

And then it happened again. This time, we knew that just looking at the dump wouldn’t help us, so we tried other avenues. Linux has a pretty rich set of knobs and dials that you can look at to see what is going on. We suspected that this might be an issue with running out of file descriptors, running out of memory, etc.

We tried looking into what was going on inside the process using strace, and everything was fine. The trace clearly showed that the server was processing requests and was able to send and receive data properly.

Wait, go through that statement again please!

It is fine? But the reason we were using strace was that there was a problem. It looked like the problem had fixed itself. That was annoying, because we were hoping to use the trace to figure out what was going on. We added more monitoring along the way, which would let us know if the server found itself isolated. And we waited.

The next time we ran into the problem, the first thing we did was run strace; we needed to get to the root cause as soon as possible, and we were afraid that the problem would fix itself before we had a chance to investigate. The moment we used strace, the server got back online, continuing as if there had never been any issue.

Over the next few instances of this issue, we were able to confirm the following observations:

  1. The service would stop responding to TCP and Unix Sockets entirely.
  2. There were no firewall or network issues.
  3. The service was up and functional; tailing the log showed activity.
  4. We could query the server state using SNMP.
  5. Running strace on the service process would fix the problem.

There are a few more details. The actual trigger for the fix wasn’t strace itself, it was the ptrace() call that strace uses. That call would cause the service to start responding again. The ptrace() call is basically the beginning and the end of debugging under Linux. Everything uses it.

If you want to dump a process’s memory, you start with ptrace(). You want to trace the system calls? ptrace(). You want to debug the process? GDB will start by calling ptrace(), etc.

And doing that would alleviate the problem.

That was… quite annoying.

We still had absolutely no indication of what the root cause even was.

We suspected it might be something inside Kestrel that was causing the problem. But that wouldn’t explain the named pipes / Unix sockets that were also affected.

Networking worked, because SNMP did. We thought that this might be because SNMP uses UDP instead of TCP and looked into that, but we couldn’t figure out how that would make any difference.

Looking at this further, we found that we have this in the process dumps:

      ~~~~ 5072
         1 Interop+Sys.WaitForSocketEvents(IntPtr, SocketEvent*, Int32*)
         1 System.Net.Sockets.SocketAsyncEngine.EventLoop()
         1 System.Net.Sockets.SocketAsyncEngine+<>c.ctor>b__14_0(Object)

As you can see, the .NET sockets thread is waiting for socket events. The SNMP handling, on the other hand, looked like this:

Thread (0x559):
   [Native Frames]
   System.Net.Sockets!System.Net.Sockets.SocketPal.SysReceive()
   System.Net.Sockets!System.Net.Sockets.SocketPal.TryCompleteReceiveFrom()
   System.Net.Sockets!System.Net.Sockets.SocketAsyncContext.ReceiveFrom()
   System.Net.Sockets!System.Net.Sockets.SocketPal.ReceiveFrom()
   System.Net.Sockets!System.Net.Sockets.Socket.ReceiveFrom()
   SharpSnmpLib.Engine!Lextm.SharpSnmpLib.Pipeline.ListenerBinding.AsyncReceive()

That was really interesting, since it meant that for sockets (both HTTP and Unix), we were always using async calls, but for SNMP, we were using the synchronous API. We initially suspected that this might be something related to the thread pool. Maybe we had something that blocked it, but it turned out to be a lot more interesting. Here is the code that actually handles the SNMP requests:

var count = _socket.ReceiveFrom(buffer, ref remote);

Task.Factory.StartNew(() => HandleMessage(buffer, count, (IPEndPoint)remote));

In other words, we are actually reading from the socket in a blocking manner, but then processing the actual message using the thread pool. So being able to get results via SNMP meant that the thread pool was fine as well.
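For contrast, here is a minimal sketch (not RavenDB’s actual code) of the two paths involved. The async path is multiplexed through .NET’s shared, epoll-based socket engine, while a blocking ReceiveFrom() issues the recvfrom() system call directly on its own thread.

using System.Net;
using System.Net.Sockets;
using System.Threading.Tasks;

static class SocketPaths
{
    // The HTTP / Unix socket side: every await here depends on an epoll_wait() wakeup
    // in the SocketAsyncEngine event loop. Miss that wakeup and the accept never
    // completes, even though the listening socket itself is perfectly healthy.
    public static async Task AcceptLoopAsync(Socket listener)
    {
        while (true)
        {
            Socket client = await listener.AcceptAsync();
            client.Dispose(); // hand off to request handling in real code
        }
    }

    // The SNMP side: a synchronous receive that blocks inside recvfrom(), bypassing
    // epoll entirely - which is why SNMP kept answering while everything else hung.
    public static int BlockingReceive(Socket udpSocket, byte[] buffer)
    {
        EndPoint remote = new IPEndPoint(IPAddress.Any, 0);
        return udpSocket.ReceiveFrom(buffer, ref remote);
    }
}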

At this point we resorted to hair pulling, rubber ducking and, in some instances, shaking our fists at the heavens.

I reminded myself that I’m an adult with a bit of experience solving problems, and dug deeper. We started looking into how .NET actually handles sockets in async mode. This ends up here, doing a system call:

while ((numEvents = epoll_wait(port, events, *count, -1)) < 0 && errno == EINTR);

Reading through the man page for epoll_wait(), I learned how epoll() works, that it is complex, and that we need to be aware of the level-triggered and edge-triggered options. Since .NET uses edge-triggered events (EPOLLET, which I keep reading as electronic chicken), we focused on that.

There are a lot of edge cases and things to cover, but everything we checked was handled properly. We finally had a good smoking gun: for some reason, we weren’t getting notifications from epoll(), even though we should. Using strace or friends somehow fixed that.

We actually found the exact scenario we were seeing described on Stack Overflow, but without any idea of what the issue was. Truly, there is an XKCD for everything.

Our current understanding of the issue:

  • All async sockets in .NET are going through the same socket engine, and are using epoll() under the covers.
  • SNMP is using synchronous calls, so it wasn’t using epoll().

That covers both of the weird things that we are seeing. So what is the issue?

It is not in .NET. Given the size & scope of .NET, we wouldn’t be the only ones seeing this. Below .NET there is the kernel, so we looked into that. The machines we were running on were using kernel 5.4.0-azure-1095, so that is where we focused.

And it looked like it was a kernel bug, one that was fixed in the next kernel update. A race condition inside the kernel would cause us to miss wakeups, and then we would basically just stall, with nothing to wake us up.

We dug deeper to understand a bit more about this situation, and we got this:

       Some system calls return with EINTR if a signal was sent to a
       tracee, but delivery was suppressed by the tracer.  (This is very
       typical operation: it is usually done by debuggers on every
       attach, in order to not introduce a bogus SIGSTOP).  As of Linux
       3.2.9, the following system calls are affected (this list is
       likely incomplete): epoll_wait(2), and read(2) from an inotify(7)
       file descriptor.  The usual symptom of this bug is that when you
       attach to a quiescent process with the command

           strace -p <process-ID>

       then, instead of the usual and expected one-line output such as

           restart_syscall(<... resuming interrupted call ...>_

       or

           select(6, [5], NULL, [5], NULL_

       ('_' denotes the cursor position), you observe more than one
       line.  For example:

               clock_gettime(CLOCK_MONOTONIC, {15370, 690928118}) = 0
               epoll_wait(4,_

       What is not visible here is that the process was blocked in
       epoll_wait(2) before strace(1) has attached to it.  Attaching
       caused epoll_wait(2) to return to user space with the error
       EINTR.  In this particular case, the program reacted to EINTR by
       checking the current time, and then executing epoll_wait(2)
       again.  (Programs which do not expect such "stray" EINTR errors
       may behave in an unintended way upon an strace(1) attach.)

And… that is exactly what is happening. On attach, the epoll_wait() call returns with EINTR, which causes .NET to retry the call, and that “fixes” the issue.

It makes total sense now, and concludes the discovery process of a pretty nasty bug.

Now, if you’ll excuse me, I need to go and apologize to a rubber duck.


time to read 1 min | 128 words

I posted this code previously:
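The original snippet isn’t shown here, so here is a minimal reconstruction of the kind of code in question (my sketch, not necessarily the exact original):

using System;
using System.Collections.Generic;

class Program
{
    static void Main()
    {
        var list = new List<int> { 1, 2, 3 };
        List<int>.Enumerator? enumerator = list.GetEnumerator();

        // Every access to .Value returns a *copy* of the struct enumerator,
        // so MoveNext() only ever advances that temporary copy.
        while (enumerator.Value.MoveNext())
        {
            // Yet another copy, still positioned before the first element,
            // so Current is default(int), i.e. 0 - forever.
            Console.WriteLine(enumerator.Value.Current);
        }
    }
}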

And asked what it prints. This is actually an infinite loop that will print an endless amount of zeros to the console. The question is why.

The answer is that we are running into two separate features of C# that interact with each other in a surprising way.

The issue is that we are using a nullable struct enumerator here and accessing it through the Value property. The problem is that the enumerator is a struct, and going through the property returns a copy of it.

So the way it works, the code actually runs:
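Continuing the sketch above, the expansion behaves roughly like this (my approximation of the lowering, not the compiler’s exact output):

while (true)
{
    // a fresh copy of the enumerator is made just to call MoveNext() on it
    List<int>.Enumerator temp1 = enumerator.Value;
    if (!temp1.MoveNext())
        break;

    // and another fresh copy - still at the start - is made to read Current
    List<int>.Enumerator temp2 = enumerator.Value;
    Console.WriteLine(temp2.Current); // default(int) == 0, every time
}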

And now you can more easily see the temporary copies that are created; because we are using a value type here, we end up with a different instance each time.

time to read 2 min | 366 words

I asked the following question, about code that uses AsyncLocal as well as async calls. Here is the code again:
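The snippet isn’t reproduced here, so here is a minimal reconstruction of the kind of code under discussion (my sketch, not necessarily the exact original):

using System;
using System.Threading;
using System.Threading.Tasks;

class Program
{
    static readonly AsyncLocal<bool> Flag = new AsyncLocal<bool>();

    static async Task Main()
    {
        await Start();
        Console.WriteLine(Flag.Value);   // False

        await Start();
        Console.WriteLine(Flag.Value);   // False again
    }

    // Note: there is no await inside Start(), yet the value still doesn't survive the call.
    static async Task Start()
    {
        Flag.Value = true;
    }
}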

This code prints False twice, and the question is why. I would expect the AsyncLocal value to remain the same after the call to Start(), since that is obviously the point of AsyncLocal. It turns out that this isn’t the case.

AsyncLocal is good if you are trying to pass a value down to child tasks, but it won’t be applicable to other tasks that are called at the same level. In other words, it works for child tasks, not sibling tasks. This is actually even more surprising in the code above, since we don’t do any awaits in the Start() method.

The question is why? Looking at the documentation, I couldn’t see any reason for that. Digging deeper into the source code, I figured out what is going on.

We can use SharpLab.io to lower the high level C# code to see what is actually going on here, which gives us the following code for the Start() method:
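SharpLab’s exact output isn’t reproduced here, but the relevant shape of the lowering looks roughly like this (a heavily simplified sketch, not the real generated code):

using System.Runtime.CompilerServices;
using System.Threading.Tasks;

struct StartStateMachine : IAsyncStateMachine
{
    public int state;
    public AsyncTaskMethodBuilder builder;

    public void MoveNext()
    {
        // ... the original body of Start() runs here, including the AsyncLocal assignment ...
        builder.SetResult();
    }

    public void SetStateMachine(IAsyncStateMachine stateMachine) =>
        builder.SetStateMachine(stateMachine);
}

static class Lowered
{
    public static Task Start()
    {
        var sm = new StartStateMachine
        {
            builder = AsyncTaskMethodBuilder.Create(),
            state = -1,
        };

        // AsyncTaskMethodBuilder.Start() captures the current ExecutionContext, runs
        // MoveNext(), and then restores the captured context - undoing the AsyncLocal
        // change made inside the method body.
        sm.builder.Start(ref sm);
        return sm.builder.Task;
    }
}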

Note that we call the AsyncTaskMethodBuilder.Start() method, which ends up in AsyncMethodBuilderCore.Start(). There we have a bunch of interesting code; in particular, we remember the current thread execution context before we execute user code, here. After the code is done running, we restore it if needed, as you can see here.

That looks fine, but why would the execution context change here? It turns out that one of the few places that interacts with it is AsyncLocal itself, which ends up in ExecutionContext.SetLocalValue. The way it works, each time you set an async local, it creates a new layer in the async stack. And when you exit an async call, it will reset the async stack to where it was before the async call started.

In other words, the “local” in the name AsyncLocal isn’t analogous to ThreadLocal; it is more similar to a local variable, which goes out of scope on function exit.

This isn’t a new thing, and there are workarounds, but it was interesting enough that I decided to dig deep and understand what is actually going on.

time to read 2 min | 229 words

A user contacted us to tell us that RavenDB does not work in his environment. As you can imagine, we didn’t really like to hear that, so we looked deeper into the issue. The report included the actual error, which looked something like this:

{
    "Url": "/auth/",
    "Type": "Raven.Client.Exceptions.Routing.RouteNotFoundException",
    "Message": "There is no handler for path: GET /auth/",
    "Error": "Raven.Client.Exceptions.Routing.RouteNotFoundException: There is no handler for path: GET /auth/\n"
}

My reaction to that was… huh?!

That is a really strange error, since RavenDB does not have an “/auth/” endpoint. The problem isn’t with RavenDB, it is with something else.

In this case, the user ran RavenDB on port 8080 (which is the normal thing to do) and then tried to access RavenDB in the browser.

The problem was that they had previously run some other software on that port, and that software had the following interaction:

* Connected to 127.0.0.1 port 8080
> GET / HTTP/1.1
> Host: 127.0.0.1
> User-Agent: curl/7.85.0
> Accept: */*
>
< HTTP/1.1 301 Moved Permanently
< Location: http://127.0.0.1:8080/auth/
< Content-Type: text/html; charset=UTF-8

In other words, it redirected the browser to the “/auth/” endpoint. It’s critical to understand what a 301 response means: Moved Permanently. Such redirects are cached by the browser. In this case, the user was reusing the same port for a different software product, and the browser’s cached redirect meant that we got strange results.

time to read 5 min | 969 words

One of the big changes in RavenDB is the new search engine, Corax. We want to replace Lucene with a purpose-built search engine, capable of doing everything that we can do with Lucene, but far faster.

In this post, I want to discuss a particular detail of the implementation: managing the posting list. In information retrieval, the idea of a posting list is that we have a list of document ids that match a particular term. I’m ignoring the details of how we create or manage the list. The basic idea is that we need to store a potentially large number of document ids, update them on the fly, as well as query them. Lucene manages that by creating immutable segments, but that design decision doesn’t match the way we want to do things in Corax. The problem with immutable segments is that you’ll eventually need to run compaction, which can be really expensive.

As a database, we already have a really good data structure that matches most of those requirements, it turns out. A B+Tree can do a pretty good approximation of a posting list, but it does have a problem: it’s not efficient in terms of space usage. A document id is a 64-bit integer, and we can make a number of assumptions about it. Therefore, I created a dedicated B+Tree-like structure to hold the posting list. This B+Tree is called a Set inside of Voron, and it can hold any number of uint64 values. Internally, this is managed as two separate types of pages: the Branch pages (which are fairly standard B+Tree branches) and the Leaf pages.

The first version of the Set Leaf Page implementation works as follows.


The page has a notion of a base: all the document IDs inside a single page share the same upper 33 bits. In other words, inside a single page, all the IDs are in the same 2 billion range. That means that even though the document ids are 64 bits, in the context of a single page, we can treat them as 32-bit integers. That turns out to be important, since most integer compression routines work on 32-bit integers.

How does that work? We have a section in the page that is dedicated to raw values, and we insert values into that section until it is full. Whenever that happens, we compress the raw values section using PForDelta compression. The page will then contain multiple compressed segments and a single raw values segment. Each compressed segment is non-overlapping with the other compressed segments (but may overlap with the raw values). PForDelta is really good at compressing integers, especially if it is able to figure out patterns in your data. And document IDs in Corax are explicitly designed to have common patterns, so it is able to take advantage of this behavior. When we read the entries from the page, we merge the compressed segments with the raw values and return the full details.
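To make that concrete, here is a rough sketch of what such a leaf page layout could look like. The field names and exact sizes are illustrative assumptions on my part, not Voron’s actual code.

using System.Runtime.InteropServices;

// A leaf page holds ids that all share the same upper 33 bits (the baseline), so
// inside the page each entry fits in 32 bits: a 31-bit offset from the baseline,
// plus a top bit used as a tombstone for removals (see below).
[StructLayout(LayoutKind.Sequential, Pack = 1)]
struct SetLeafPageHeader
{
    public long PageNumber;
    public ulong Baseline;                    // the shared upper 33 bits of every id in this page
    public ushort NumberOfCompressedSegments; // PForDelta-compressed, non-overlapping segments
    public ushort NumberOfRawValues;          // entries still sitting in the uncompressed tail
}

static class SetLeafPageEncoding
{
    const uint TombstoneBit = 0x8000_0000u;

    public static uint EncodeAddition(ulong docId, ulong baseline) =>
        (uint)(docId - baseline);

    public static uint EncodeRemoval(ulong docId, ulong baseline) =>
        (uint)(docId - baseline) | TombstoneBit;
}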

The code isn’t particularly simple, but has a fairly straightforward approach to the problem once you understand the design.

One thing that I haven’t touched on is the notion of removals. That is an important concept, and we handle it in an interesting way. Remember that I said that the baseline for the page is the upper 33 bits? That is because the numbers inside the page are 31 bits in length; we reserve the top bit to mark a value as a tombstone. In other words, when we write to the raw values, we can have either a new value or a removed value, distinguished using the uppermost bit.

When we fill the raw values segment, we compress it alongside the relevant compressed segments. At that point, we’ll filter out the removed values. This is very similar to the segments that Lucene is using, but we do that on a page boundary (8KB), not across all values.

We are able to push a lot of values into a single page. We typically see thousands to tens of thousands of document IDs fitting into a single 8KB page. That is pretty amazing, since even if you have a posting list that has millions of entries, the actual disk reads are minimal.

The design was with us throughout most of the development of Corax, but we ran into a serious issue with the way it works when we started to build the query optimizer for Corax.

That is an interesting statement, I think you’ll agree. What is the relation between a (very) low-level design of the on-disk data format and the result of the query optimizer?

Well, it turns out that one of the things that we need to know for the query optimizer is: “How big is this posting list?”

This question is actually really important for generating optimal queries. And given the structure we have, we can provide a good answer most of the time, but not always.

Why is that?

The problem is what happens when we want to remove a value from the set, or add an existing value. If the value already exists inside a compressed segment, we don’t open the compressed segment (which would require re-writing it from scratch), so we record an addition that is actually spurious. Conversely, if we try to remove a value that isn’t in the set, we’ll wrongly decrement the number of entries in the posting list, leading to a mismatch between the recorded number of entries and the actual number we have in the posting list.
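Here is a toy illustration of that bookkeeping drift (the names and structure are mine, not Corax’s actual code): additions and removals only append to the raw values section and never look inside the compressed segments, so the running count is adjusted even when the operation doesn’t actually change the set.

using System;
using System.Collections.Generic;

class ToyPostingList
{
    // Pretend ids 1, 2 and 3 already live inside a compressed segment.
    readonly List<(ulong Id, bool Removed)> _rawValues = new List<(ulong, bool)>();

    public long NumberOfEntries = 3;

    public void Add(ulong docId)
    {
        _rawValues.Add((docId, false));
        NumberOfEntries++;   // spurious if docId already exists in a compressed segment
    }

    public void Remove(ulong docId)
    {
        _rawValues.Add((docId, true));
        NumberOfEntries--;   // wrong if docId was never in the set to begin with
    }

    static void Main()
    {
        var list = new ToyPostingList();

        list.Add(2);       // already present: recorded count becomes 4, real size is still 3
        list.Remove(42);   // never existed: recorded count drops to 3, hiding the earlier drift

        Console.WriteLine(list.NumberOfEntries);
    }
}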

That was very confusing to figure out, I’ll admit. It was also really hard to fix, but I’ll address that in the next post in the series.

time to read 5 min | 998 words

A long while ago, I was involved in a project that dealt with elder home care. The company I was working for was contracted by the government to send care workers to the homes of elderly people to assist in routine tasks. Depending on the condition of the patient in question, they would spend anything from 2 – 8 hours a day a few days a week with the patient. Some caretakers would have multiple patients that they would be in charge of.

There were a lot of regulations and details that we had to get right. A patient may be allowed a certain amount of hours a week by the government, but may also pay for additional hours from their own pocket, etc. Quite interesting to work on, but not the topic of this post.

A key aspect of the system was actually paying the workers, of course. The problem was the way it was handled. At the end of each month, they would designate “office hours” for the care workers to come and submit their time sheets. Those had to be signed by the patients or their family members and were submitted in person at the office. Based on the number of hours worked, the workers would be paid.

The time sheets were handwritten forms, listing the start and end times for each visit.

A single worker would typically have 4 – 8 of those, and the office would spend a considerable amount of time deciphering them, computing the total hours, and handing over payment.

The idea of the system I was working on was to automate all of that. Here is more or less what I needed to do:


For each one of the timesheet entries, the office would need to input the start & end times, who the care worker took care of, whether the time was approved (actually, whether this was paid by the government, privately, by the company, etc), and whether it was counted as overtime, etc.

The rules for all of those were… interesting, but we got it working and tested. And then we got to talk with the actual users.

They took one look at the user interface they had to work with and absolutely rebelled. The underlying issue was that during the end of the month period, each branch would need to handle hundreds of care workers, typically within four to six hours. They didn’t have the time to do that (pun intended). Their current process was to review the submitted time sheet with the care worker, stamp it with approved, and put just the total hours worked into the payroll system.

Having to spend so much time on data entry was horrendous for them, but the company really wanted to have that level of granularity, to be able to properly track how many hours were worked and handle its own billing more easily.

Many of the care workers were also… non-technical, and the timesheet had to be approved by the patient or a family member. Having a signed piece of paper was easy to handle; trying to get them to go to a website and enter hours was a non-starter. That was also before everyone had a smartphone (and even today I would assume that would be difficult in that line of business).

As an additional twist, it turns out that the manual process allowed the office employees to better manage the care workers. For example, they might give them a 0.5 / hour adjustment for a particular month to manage a difficult client or deal with some specific issue, or approve (at their discretion) overtime pay when it wasn’t quite “proper” to do so.

One of the reasons the company wanted to move to a modern system was to avoid this sort of “initiative”, but these practices turned out to be actually quite important for the proper management of the care workers and for getting things done. For an additional twist, the company had multiple branches, and each of those had a different methodology for how this was handled, all of them different from what HQ thought it should be.

The process turned out to be even more complex than we initially understood, because there was a lot of flexibility in the system that was actually crucial for the proper running of the business.

If I recall properly, we ended up having a two-stage approach. During the end of the month rush, they would fill in the gross hours and payment details. That part of the system was intentionally simplified to the point where we did almost no data validation and trusted them to put the right values.

After payroll was processed, they had to actually put in all those hours, and we would run a check of the expected amounts vs. what was actually paid. That gave the company the insight it needed into what was going on, the office people were able to keep handling things the way they were used to, and discrepancies could be raised and handled more easily.

Being there in the “I’m just a techie” role, I got to sit on the sidelines and watch the tug-of-war between the different groups. It was quite interesting to see. I described the process above in a bit of a chaotic manner, but there were proper procedures and processes in the offices. They were just something that the company HQ never even realized were there.

It also taught me quite a lot about the importance of accepting “invalid” data. In many cases, you’ll see computerized solutions flat out refuse to accept values that they consider to be wrong. The problem is that often, you need to record reality, which may not agree with the validation rules in your systems. And in most cases, reality wins.

time to read 1 min | 169 words

We have a lot of tests for RavenDB, and we run them in plenty of environments. We semi-frequently get a build failure when running on the “macOS latest” runner on GitHub.

The problem is that the information that I have is self-contradicting. Here is the most relevant piece:

Here you can see the failure itself and what is causing it.

Note that the debug message is showing that all three variables here have the same numeric value. The address and the current variables are also held on the stack, so there is no option for race conditions, or something like that.

I can’t figure out any reason why this would be triggered in this case. About the only thing that comes to mind is whether there is some weirdness going on with pointer comparisons on macOS, but I don’t have a lead to follow.

We haven’t investigated it properly yet; I thought I’d throw this out to the blog and see if you have any idea what may be going on here.
