Ayende @ Rahien


Public Service Announcement: ConcurrentDictionary.Count is locking

time to read 2 min | 233 words

During a performance problem investigation, we discovered that the following innocent-looking code was killing our performance.
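The snippet itself isn't reproduced here, but it boiled down to roughly the following shape (a minimal sketch; the class, field and delegate types are illustrative, not our actual ones):

```csharp
using System;
using System.Collections.Concurrent;

public class ChangesNotifier
{
    // Stand-in for the WebSocket subscribers collection.
    private readonly ConcurrentDictionary<long, Action<string>> _subscribers =
        new ConcurrentDictionary<long, Action<string>>();

    public void Notify(string change)
    {
        // Looks like a cheap early-out, but ConcurrentDictionary.Count
        // acquires all of the dictionary's internal locks, so many threads
        // calling this concurrently form a convoy even with zero subscribers.
        if (_subscribers.Count == 0)
            return;

        foreach (var subscriber in _subscribers)
            subscriber.Value(change);
    }
}
```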

This is part of the code that allows users to subscribe to changes in the database using a WebSocket. Such subscriptions are pretty rare, so we check that there aren’t any and skip all the work.

We had a bunch of code that runs on many threads and ended up calling this method. Since there are no subscribers, this should be very cheap, but it wasn’t. The problem was that the call to Count was locking, and that created a convoy that killed our performance.

We did some trawling through our code base and through the framework code and came up with the following conclusions.

ConcurrentQueue:

  • Clear locks
  • CopyTo, GetEnumerator, ToArray, Count create a snapshot (consuming more memory)
  • TryPeek, IsEmpty are cheap

Here is a piece of problematic code; we are trying to keep the last ~25 items that we looked at:
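The code isn't shown here, but it had roughly this shape (a sketch; the type and the 25-item limit are illustrative of the pattern):

```csharp
using System.Collections.Concurrent;

public class RecentlyLookedAt
{
    private readonly ConcurrentQueue<string> _items = new ConcurrentQueue<string>();

    public void Record(string item)
    {
        _items.Enqueue(item);

        // Each call to Count here forces the queue to create a snapshot,
        // so trimming on a hot path like this produces a lot of short-lived
        // garbage and extra memory traffic.
        while (_items.Count > 25)
            _items.TryDequeue(out _);
    }
}
```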

The problem is that this kind of code ensures that there will be a lot of snapshots and increases memory utilization.

ConcurrentDictionary:

  • Count, Clear, IsEmpty, ToArray – lock the entire thing
  • TryAdd, TryUpdate, TryRemove – lock the bucket for this entry
  • GetEnumerator does not lock
  • Keys, Values both lock the table and force the creation of a temporary collection

If you need to iterate over the keys of a concurrent dictionary, there are two options:
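The two snippets are not included here, but the comparison looks roughly like this (a sketch with an illustrative string-to-int dictionary):

```csharp
using System.Collections.Concurrent;
using System.Collections.Generic;

public static class IterationOptions
{
    // Option 1: Keys locks the whole table and materializes a temporary
    // collection of every key before you see the first one.
    public static IEnumerable<string> ViaKeys(ConcurrentDictionary<string, int> dic)
    {
        foreach (var key in dic.Keys)
            yield return key;
    }

    // Option 2: enumerating the dictionary itself does not lock and streams
    // the entries, so we just take the Key from each pair.
    public static IEnumerable<string> ViaEntries(ConcurrentDictionary<string, int> dic)
    {
        foreach (var kvp in dic)
            yield return kvp.Key;
    }
}
```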

Iterating over the entire dictionary is much better than iterating over just the keys.

The innocuous code that tripped me

time to read 1 min | 155 words

When building a cache, I need a way to generate a hash code from a query. A query is a complex object that has many properties. My first attempt to do so looked like this:
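That attempt isn't reproduced here, but it had the shape of the usual 397-multiplier hash combine, something like this (a hypothetical reconstruction; everything except WaitForNonStaleResultsTimeout is an illustrative name):

```csharp
using System;

public class CachedQueryKey
{
    public string Query { get; set; }
    public int PageSize { get; set; }
    public TimeSpan? WaitForNonStaleResultsTimeout { get; set; }

    public override int GetHashCode()
    {
        unchecked
        {
            var hashCode = Query?.GetHashCode() ?? 0;
            hashCode = (hashCode * 397) ^ PageSize;
            // The troublesome line: ?? binds more loosely than ^.
            hashCode = (hashCode * 397) ^ WaitForNonStaleResultsTimeout?.GetHashCode() ?? 0;
            return hashCode;
        }
    }
}
```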

And it failed, disastrously. All queries always had the same hash code, and I couldn’t figure out why until I actually debugged through the code.

Here is a hint: this only happened if WaitForNonStaleResultsTimeout was null. It helps to look at the expression the way the compiler does:
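Here is the troublesome line again, with the parentheses the compiler effectively inserts (continuing the sketch above):

```csharp
// What was written:
hashCode = (hashCode * 397) ^ WaitForNonStaleResultsTimeout?.GetHashCode() ?? 0;

// What the compiler sees: ?? has lower precedence than ^, so the null
// coalescing applies to the result of the entire XOR expression:
hashCode = ((hashCode * 397) ^ WaitForNonStaleResultsTimeout?.GetHashCode()) ?? 0;
```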

Basically, if WaitForNonStaleResultsTimeout is null, the entire expression evaluates to null, so we return zero, losing all the previous hash values that we put in.

The solution was to parenthesize the expression, but this is something that passed a couple of rounds of scrutiny and didn’t show up at all, because it looked fine, and I don’t keep the C# operator precedence rules in my head.
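For completeness, the parenthesized form of that line looks like this (again continuing the sketch above):

```csharp
// Now ?? only applies to the nullable GetHashCode() call, so a null timeout
// contributes 0 instead of wiping out the accumulated hash.
hashCode = (hashCode * 397) ^ (WaitForNonStaleResultsTimeout?.GetHashCode() ?? 0);
```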

Production postmortem: The case of the 99.99% percentile

time to read 2 min | 326 words

A customer called us with a problem: for the most part, RavenDB was very well behaved, but they were worried about the 99.99% percentile of request duration. While the 99.9% percentile was excellent (around a few milliseconds), the 99.99% percentile was measured in seconds.

Usually, this means that a particularly slow request happens occasionally, and we can handle that by figuring out what is slow on that particular request and fixing it. A common cause for such an issue is a request that returns a lot of unnecessary data, such as large full documents, when it needs just a few fields.

In this case, however, our metrics told us that the problem was pretty widespread. There wasn’t one particularly slow request; rather, at what appeared to be random times, certain requests would be slow. We also realized that it wasn’t a particular request that was slow, but all requests within a given time period.

That hinted quite strongly at the problem: it was very likely a GC issue that caused a long pause. We still had to do some work to rule out other factors, such as I/O contention or noisy neighbors, but we narrowed it down to GC as the root cause.

The problem in this case was that the customer was running multiple databases on the same RavenDB process, and each of them was fairly large. The total amount of managed memory that the RavenDB process was using was around 60GB. At that level, anything that triggers a GC can cause a significant pause and impact operations.

The solution in this case was to break that into multiple separate processes, one for each database. In this manner, we didn’t have a single managed heap that the GC had to traverse, each heap was much smaller, and the GC pause times were both greatly reduced and spaced much further apart.

Production postmortem: The lightly loaded thrashing server

time to read 3 min | 522 words

We got a puzzle. A particular customer was seeing very high latency in certain operations on a fairly frequent basis. The problem was that when this happened, the server was practically idle, serving around 5% of the usual requests/sec. What was even stranger was that during the times when we reached the peak requests/sec, we’d see no such slowdowns. The behavior was annoyingly consistent: no slowdown at all during high load, but after a period of relatively light load, the server would appear to choke.

That one took a lot of time to figure out, because it was so strange. The immediate cause was pretty simple to find: the server was busy paging a lot of data into memory. But why would it need to do this? The server was just sitting there doing nothing much, yet it was thrashing like crazy, and that was affecting the entire system.

I’ll spare you the investigation, because it was mostly grunting and frustration, but the sequence of events as we pieced it together was something like this:

  • The system makes heavy use of caching, with a cache duration set to 15 minutes or so. Most pages would hit the cache first and, on a miss, generate the content and save it back. The cached documents were set up with the RavenDB expiration bundle.
  • During periods of high activity, we’d typically have very few cache expirations (because we kept using the cached data) and we’d fill up the cache quite heavily (the cache db was around 100GB or so).
  • That would work just fine, and the system would be able to rapidly serve a high number of requests.
  • And then came the idle period…
  • During that time, we had other work (by a different process) going on in the server, which we believe would give the OS reason to page the now unused memory to disk.

So far, everything goes as predicted, but then something happens: the expiration timer fires, and we now have a lot of items that need to be expired. RavenDB expiration is coarse, and it runs every few minutes, so each run had a lot of stuff to delete. Most of it was on disk, and we needed to access all of it in order to delete it. And that caused us to thrash, affecting the overall server performance.

As long as we were active, we wouldn’t expire so much at once, and we had a lot more of the db in memory, so the problem wasn’t apparent.

The solution was to remove the expiration usage and handle the cache invalidation in the client: when you fetch a cached value, you check its age, and then you can make a policy decision about whether you want to refresh it or not. This actually turned out to be a great feature in general for that particular customer, since they had a lot of data that can effectively be cached for much longer periods, and this gave them the ability to express that policy.
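A sketch of what such a client-side policy can look like (the types and the helper are illustrative, not the customer’s actual code):

```csharp
using System;

public class CachedValue<T>
{
    public T Value { get; set; }
    public DateTime StoredAtUtc { get; set; }
}

public static class CachePolicy
{
    // Instead of letting the server expire documents, the caller checks the
    // age of what it fetched and decides, per call, whether it is still good
    // enough or needs to be regenerated.
    public static bool IsFreshEnough<T>(CachedValue<T> cached, TimeSpan maxAge)
    {
        return DateTime.UtcNow - cached.StoredAtUtc <= maxAge;
    }
}
```

A page that can live with day-old data passes a large maxAge, while a page that can’t passes a small one, which is exactly the kind of policy flexibility mentioned above.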

Bad bugs make for self-assigning issues

time to read 1 min | 144 words

One of our developers just added the following bug:

[Image: the bug report as submitted]

This is in an area of the code that this particular developer is not regularly traversing*. The image above includes the full contents of the bug. And that caused me to immediately assign it back to its opener.

The problem? If you say that you got an error, include the error. In many cases, you can save a lot of time and guessing.

For an internal bug, where the person who opened it is available, we have a much lower bar for bug report quality. Most bugs are closed relatively quickly anyway. But a lower bar for bug report quality still means there is a bar.

* I started to say, not responsible for, but we don’t have code ownership, so that wouldn’t have been right.

We won’t be fixing this race condition

time to read 2 min | 345 words

During the work on restoring backups, the developer in charge came up with the following problematic scenario.

  • Start restoring a backup of database Northwind on node A, which can take quite some time for a large database
  • Create a database named Northwind on node B while the restore is taking place.

The problem is that during the restore the database doesn’t exist in a proper form in the cluster until the restore is done. During that time, if an administrator attempts to create a database with that name, it will look like it is working, but it will actually create a new database on all the other nodes and fail on the node where the restore is going on.

When the restore completes, it will either remove the previously created database or join it and replicate the restored data to the rest of the nodes, depending on exactly when the restore and the new db creation happened.

Now, trying to resolve this issue involves coordinating the restore process around the cluster. However, that also means that we need to send heartbeats during the restore (to the entire cluster), handle timeouts and recovery, and effectively take on a pretty big burden of complicated code. Indeed, the first draft of the fix for this issue suffered from the weakness that it would only work when running on a single node, and would only work in cluster mode in very specific cases.

In this case, it is a very rare scenario that requires an admin (not just a standard user) to do two things that you wouldn’t usually expect to happen together, and the outcome is a bit confusing even if you manage it, but there isn’t any data loss.

The solution was to document that during the restore process you shouldn’t create a database with the same name, but instead let RavenDB complete the restore and then let the database span additional nodes. That is a much simpler alternative than getting into distributed-mode reasoning just for something that is an operator error in the first place.

Bug stories: The memory ownership in the timeout

time to read 3 min | 442 words

We are running a lot of tests right now on RavenDB, in all sorts of interesting configurations. Some of the more interesting results came from testing wildly heterogeneous systems. Put a node on a fast Windows machine, connect it to a couple of Raspberry Pis, a cheap Windows tablet over WiFi and a slow Linux machine, and see how that kind of cluster handles high load.

This has turned up a number of bugs; the issue with the TCP read buffer corruption is one such example, but another is the reason for this post. In one of our test runs, the RavenDB process crashed with an invalid memory access. That was interesting to see. Tracking down the issue led us to the piece of code that handles incoming replication. In particular, the issue was possible if the following happened:

  • Node A is significantly slower than node B, primarily with regards to disk I/O.
  • Node B is being hit with enough load that it sends large requests to node A.
  • There is a node C that is also trying to replicate the same information (because it noticed that node A isn’t catching up fast enough and is trying to help).

The root cause was that we had a bit of code that looked like this:
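The actual code isn’t shown here, but the problematic shape was roughly the following (a sketch using ArrayPool, with illustrative names for the merger and the timeout):

```csharp
using System;
using System.Buffers;
using System.IO;
using System.Threading.Tasks;

public class IncomingReplicationHandler
{
    private readonly Stream _networkStream;
    private readonly Func<byte[], int, Task> _enqueueToTransactionMerger;
    private readonly TimeSpan _timeout = TimeSpan.FromSeconds(30);

    public IncomingReplicationHandler(Stream networkStream, Func<byte[], int, Task> enqueue)
    {
        _networkStream = networkStream;
        _enqueueToTransactionMerger = enqueue;
    }

    public async Task HandleBatchAsync()
    {
        byte[] buffer = ArrayPool<byte>.Shared.Rent(64 * 1024);
        try
        {
            int read = await _networkStream.ReadAsync(buffer, 0, buffer.Length);

            // Hand the buffer over to the transaction merger to apply.
            Task mergerTask = _enqueueToTransactionMerger(buffer, read);

            // Under heavy load the merger may not get to us in time, so we
            // give up, which tears down the connection further up the stack.
            if (await Task.WhenAny(mergerTask, Task.Delay(_timeout)) != mergerTask)
                throw new TimeoutException("Transaction merger did not respond in time");
        }
        finally
        {
            // The bug: the buffer goes back to the pool on the timeout path
            // as well, even though the merger may still read from it later.
            // Once the pool hands it to someone else, the merger sees garbage,
            // or we get an access violation.
            ArrayPool<byte>.Shared.Return(buffer);
        }
    }
}
```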

Basically, we read the data from the network into a buffer, and then we hand it off to the transaction merger to run. However, if there is a lot of load on the server, it is possible that the transaction merger will not have a chance to return in time. We try to abort the connection here, since something is obviously wrong, and we do just that. The problem is that we sent a buffer to the transaction merger, and while it might not have gotten to processing our request yet (or hasn’t completed it, at least), there is no way for us to actually pull the request out (it might have already started executing, after all).

The code didn’t consider that, and what happened when we did get a timeout is that the buffer was returned to the pool, and if it was freed in time, we would get an access violation exception if we were lucky, or just garbage in the buffer (that we previously validated, so we didn’t check again) that would likely also cause a crash.

The solution was to wait for the task to complete, but ping the other host to let it know that we are still alive, and that the connection shouldn’t be aborted.
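In sketch form, the fixed waiting loop looks something like this (SendHeartbeatAsync stands in for whatever keep-alive message the protocol uses):

```csharp
using System;
using System.Threading.Tasks;

public static class ReplicationTimeouts
{
    // Never abandon the buffer: keep waiting for the merger to finish, and
    // periodically tell the other side that we are still alive so it doesn't
    // abort the connection.
    public static async Task WaitForMergerAsync(Task mergerTask, Func<Task> sendHeartbeatAsync)
    {
        while (await Task.WhenAny(mergerTask, Task.Delay(TimeSpan.FromSeconds(5))) != mergerTask)
        {
            await sendHeartbeatAsync();
        }

        await mergerTask; // observe any failure from the merger itself
    }
}
```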

Bug stories: How do I call myself?

time to read 3 min | 522 words

This bug is actually one of the primary reasons we had a Beta 2 release for RavenDB 4.0 so quickly.

The problem is easy to state: in any non-trivial deployment setup, clients would be utterly unable to connect to us. Let us examine what I mean by a non-trivial setup, shall we?

A trivial setup is when you are running locally, binding to “http://localhost:8080”. In this case, everything is simple, and you can bind to the appropriate interface and when a client connects to you, you let it know that your URL is “http://localhost:8080”.

Hm… this doesn’t make sense. If a client just connected to us, why do we need to let it know what is the URL that it need to connect to us?

Well, if there is just a single node, we don’t. But RavenDB 4.0 allows you to connect to any node in the cluster and ask it where a particular database is located. So the first thing that happens when you connect to a RavenDB server is that you find out where you really need to go. In the case of a single node, the answer is “you are going to talk to me”, but in the case of a cluster, it might be some other node entirely. And this is where things begin to be a bit problematic. The problem is that we need to know what to call ourselves when a client connects to us.

That isn’t as easy as it might sound. Consider the case where the user configure the server url to be “http://0.0.0.0:8080”. We can’t give that to the client, so we default to sending back the host name in that case. And this is where things started to get tricky. In many cases, the host name is not something that make sense.

Oh, for internal deployments, you can usually rely on it, but if you are deploying to AWS, for example, the machine host name is of very little use in routing to that particular machine. Or, for that matter, a docker container host name isn’t particularly useful when you consider it from the outside.

The problem is that with RavenDB, we had a single configuration value that was used both for the binding to the network and for letting the user know how to connect to us. That didn’t work when you had routers in the middle. For example, if my public docker IP is 10.0.75.2, that doesn’t mean that this is the IP that I can bind to inside the container. And the same is true whenever you have any complex network topology (putting nginx in front of the server, for example).

The resolution for that was pretty simple: we added a new configuration value that separates the host we bind to from the host we report to the outside world. In this manner, you can bind to one IP but let the world know that you should be reached via another.
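If memory serves, in the released RavenDB 4.0 this shows up as two settings in settings.json, roughly as below; treat the exact key names as my recollection rather than a quote from this post, the point is the split between the bind address and the advertised address:

```json
{
    "ServerUrl": "http://0.0.0.0:8080",
    "PublicServerUrl": "http://rvn-node-1.example.com:8080"
}
```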

Bug stories: The data corruption in the cluster

time to read 5 min | 986 words

The bug started as pretty much all others: “We have a problem when replicating from a Linux machine to a Windows machine, I’m seeing some funny values there”. This didn’t raise any alarm bells; after all, that was the point of checking what was going on in a mixed-mode cluster. We didn’t expect any issues, but it wasn’t surprising that they happened.

The bug in question showed up as an invalid database id in some documents. In particular, it meant that we might have node A, node B and node C in the cluster, and running a particular scenario suddenly started also reporting node Ω, node Σ and other fun stuff like that.

And so the investigation began. We were able to reproduce this error once we put enough load on the cluster (typically around the 20th million document write or so), and it was never consistent.

We looked at how we save the data to disk, we looked at how we read it, and we scanned all the incoming and outgoing data. We sniffed raw TCP sockets and we looked at everything from the threading model to random corruption of data on the wire to our own code reading the data, all the way to a manual review of the TCP code in the Linux kernel.

The latter might require some explanation: it turned out that setting TCP_NODELAY on Linux would make the issue go away. That only made things a lot harder to figure out. What was worse, this corruption only ever happened in this particular location, never anywhere else. It was maddening, and about three people worked on this particular issue for over a week with the sole result being: “We know roughly where it is happening, but no idea why or how”.

That in itself was a very valuable thing to have, and along the way we were able to fix a bunch of other stuff that was found under this level of scrutiny. But the original problem persisted, quite annoyingly.

Eventually, we tracked it down to the ReadExactly method, which moves the unconsumed bytes to the front of the read buffer before reading more data from the network.

We had been there before, and we had looked at the code, and it looked fine. Except that it wasn’t. In particular, there is a problem when the range we want to move overlaps with the range we want to move it to.

For example, consider that we have a buffer of 32KB, and we read 7 bytes from the network. We then consumed 2 of those bytes. In the image below, you can see that as the Origin, with the consumed bytes shown as ghosts.

[Image: the Origin buffer, and how its contents change with each iteration of the copy]

What we need to do now is to move the “Joyou” to the beginning of the buffer, but note that we need to move it from 2 – 7 to 0 – 5, which are overlapping ranges. The issue is that we want to be able to fully read “Joyous”, which requires us to do some work to make sure that we can. This ReadExactly piece of code was written with the knowledge that at most it will be called with 16 bytes to read, and the buffer size is 32KB, so there was an implicit assumption that those ranges can’t overlap.

When they do… well, you can see in the image how the data is changed with each iteration of the loop. The end result is that we corrupted our buffer and messed everything up. The Linux TCP stack had no issue; it was all in our code. The problem is that, while it is rare, it is perfectly fine to fragment the data you send into multiple packets, each with a very small length. The reason why TCP_NODELAY “fixed” the issue was that it probably didn’t trigger multiple small buffers one after another in that particular scenario. It is also worth noting that we tracked this down to a specific load pattern that would cause the sender to split packets in this way and generate this error condition.

That didn’t actually fix anything, since it could still happen, but I traced the code, and I think that this happened with more regularity since we hit the buffer just right to send a value over the buffer size in just the wrong way. The fix for this, by the way, is to avoid the manual buffer copying and to use memove(), which is safe to use for overlapped ranges.

That leaves us with the question: why did it take us so long to find this out? For that matter, how could this error surface only in this particular case? There is nothing really special about the database id, and this particular method is called a lot by the code.

Figuring this out took even more time. Basically, this bug was hidden by the way our code validates the incoming stream. We don’t trust data from the network, and we run it through a set of validations to ensure that it is safe to consume. When this error happened in the normal course of things, higher level code would typically detect it as corruption and close the connection. The other side would retry and, since this is timing dependent, would very likely be able to proceed. The issue with database ids is that they are opaque binary values (they are guids, so there is no structure that is meaningful to the application). That means that only when we got this particular corruption on that particular field, and on no other field, would the data pass validation and actually surface the error.

The fix was annoyingly simple given the amount of time we spent finding it, but we have been able to root out a significant bug as a result of the real-world tests we run.

The case of the hanging request in the cache

time to read 2 min | 351 words

A user just reported that after a short time of using RavenDB, all requests started to hang and eventually timed out. We are talking about making several dozen requests in the space of a minute or so, as a user operating against a web application.

That was a cause of concern, obviously, and something that was very strange. Luckily, the user was able to send us a stripped down version of his application and we were able to reproduce this state locally. On that particular application only. We couldn’t reproduce this anywhere else.

In fact, we fixed the problem and still don’t know how to reproduce it. The underlying issue is that we are making a GET request using HttpClient and specifying HttpCompletionOption.ResponseHeadersRead in the SendAsync method. Now, the server might return a 304 response, in which case we’ll use the cached value we have. If we make a few requests like that, all future requests will hang. In order to fix this problem, we need to dispose of the response, or read the (empty) content of the response.
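Here is a minimal sketch of the pattern and the fix (the URL, ETag handling and cache value are illustrative, not the user’s actual application):

```csharp
using System;
using System.Net;
using System.Net.Http;
using System.Threading.Tasks;

public static class CachedGet
{
    private static readonly HttpClient Client = new HttpClient();

    public static async Task<string> GetAsync(string url, string cachedValue, string etag)
    {
        var request = new HttpRequestMessage(HttpMethod.Get, url);
        request.Headers.TryAddWithoutValidation("If-None-Match", etag);

        // ResponseHeadersRead ties the connection to the response until the
        // response is disposed or its content is fully read...
        var response = await Client.SendAsync(request, HttpCompletionOption.ResponseHeadersRead);

        if (response.StatusCode == HttpStatusCode.NotModified)
        {
            // ...so on a 304 we must dispose it. Otherwise every cache hit
            // leaves a request "open", until we exhaust the connection limit
            // to the server and all subsequent calls appear to hang.
            response.Dispose();
            return cachedValue;
        }

        using (response)
        {
            return await response.Content.ReadAsStringAsync();
        }
    }
}
```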

Now, the crazy thing is that I can probably speculate pretty accurately on why this is going on; we have fixed it, but I can’t reproduce this problem except within the application the user sent us. So no independent reproduction.

The likely issue is that the HttpClient considers the request open until we read all its content (even if there is none), and as we make more requests that get the same 304 and therefore never touch the content of the response, we accumulate more and more requests that are considered to be open, up to the maximum number of concurrent requests allowed against the server. At that point, we are waiting for a timeout, and it looks like we are hanging.

All very reasonable and perfectly understandable (if annoying, because “you only need to dispose of it sometimes” is a trap waiting to happen), but I don’t know why we couldn’t reproduce it.
