Oren Eini

CEO of RavenDB

a NoSQL Open Source Document Database

Get in touch with me:

oren@ravendb.net +972 52-548-6969

Posts: 7,546
|
Comments: 51,161
Privacy Policy · Terms
filter by tags archive
time to read 2 min | 345 words

During the work on restoring backup, the developer in charge came up with the following problematic scenario.

  • Start restoring backup of database Northwind on node A, which can take quite some time for large database
  • Create a database named Northwind on node B while the restore is taking place.

The problem is that during the restore the database doesn’t exists in a proper form in the cluster until it is done restoring. During that time, if an administrator is attempting to create a database it will look like it is working, but it will actually create a new database on all the other nodes and fail on the node where the restore is going on.

When the restore will complete, it will either remove the previously created database or it will join it and replicate the restored data to the rest of the nodes, depending exactly on when the restore and the new db creation happened.

Now, trying to resolve this issue involve us coordinating the restore process around the cluster. However, that also means that we need to do heartbeats during the restore process (to the entire cluster), handle timeouts and recovery and effectively take upon us a pretty big burden of pretty complicated code. Indeed, the first draft of the fix for this issue suffered from the weakness that it would only work when running on a single node, and only work in a cluster mode in very specific cases.

In this case, it is a very rare scenario that require an admin (not just a standard user) to do two things that you’ll not usually expect them together, and the outcome of this is a bit confusing even if you managed, but there isn’t any data loss.

The solution was to document that during the restore process you shouldn’t create a database with the same name but instead let RavenDB complete and then let the database span additional nodes. That is a much simpler alternative to going in to a distributed mode reasoning just for something that is an operator error in the first place.

time to read 3 min | 442 words

imageWe are running a lot of tests right now on RavenDB, in all sort of interesting configurations. Some of the more interesting results came from testing wildly heterogeneous systems. Put a node on a fast Windows machine, connect it to a couple of Raspberry PIs, a cheap Windows tablet over WiFi and a slow Linux machine and see how that kind of cluster is handling high load.

This has turned out a number of bugs, the issue with the TCP read buffer corruption is one such example, but another is the reason for this post. In one of our test runs, the RavenDB process crashed with invalid memory access. That was interesting to see. Tracking down the issue led us to the piece of code that is handling incoming replication. In particular, the issue was possible if the following happened:

  • Node A is significantly slower than node B, primarily with regards to disk I/O.
  • Node B is being hit with enough load that it send large requests to node A.
  • There is a Node C that is also trying to replicate the same information (because it noticed that node A isn’t catching fast enough and is trying to help).

The root cause was that we had a bit of code that looked like this:

Basically, we read the data from the network into a buffer, and now we hand it off to the transaction merger to run. However, if there is a lot of load on the server, it is possible that the transaction merger will not have a chance to return in time.  We try to abort the connection here, since something is obviously wrong, and we do just that. The problem is that we sent a buffer to the transaction merger, and while it might not have gotten to processing our request yet (or haven’t completed it, at least), there is no way for us to actually be able to pull the request out (it might have already started executing, after all).

The code didn’t consider that, and what happened when we did get a timeout is that the buffer was returned to the pool, and if it was freed in time, we would get an access violation exception if we were lucky, or just garbage in the buffer (that we previously validated, so we didn’t check again) that would likely also cause a crash.

The solution was to wait for the task to complete, but ping the other host to let it know that we are still alive, and that the connection shouldn’t be aborted.

time to read 2 min | 205 words

The following is the opening paragraphs for discussion RavenDB 4.0 clustering and distribution model in the Inside RavenDB 4.0 book.

You might be familiar with the term "murder of crows" as a way to refer to a group for crows[1]. It has been used in literature and arts many times. Of less reknown is the group term for ravens, which is "unkindness". Personally, in the name of all ravens, I'm torn between being insulted and amused.

Professionally, setting up RavenDB as a cluster on a group of machines is a charming exercise (however, that term is actually reserved for finches) that bring a sense of exaltation (taken too, by larks) by how pain free this is. I'll now end my voyage into the realm of ornithology's etymology and stop speaking in tongues.

On a more serious note, the fact that RavenDB clustering is easy to setup is quite important, because it means that it is much more approachable.


[1] If you are interested in learning why, I found this answer fascinating

It was amusing to write, and it got me to actually start writing that part of the book. Although I’m not sure if this will survive editing and actually end up in the book.

time to read 3 min | 522 words

imageThis bug is actually one of the primary reasons we had a Beta 2 release for RavenDB 4.0 so quickly.

The problem is easy to state, we had a problem in any non trivial deployment setup where clients would be utterly unable to connect to us. Let us examine what I mean by non trivial setup, shall we?

A trivial setup is when you are running locally, binding to “http://localhost:8080”. In this case, everything is simple, and you can bind to the appropriate interface and when a client connects to you, you let it know that your URL is “http://localhost:8080”.

Hm… this doesn’t make sense. If a client just connected to us, why do we need to let it know what is the URL that it need to connect to us?

Well, if there is just a single node, we don’t. But RavenDB 4.0 allows you to connect to any node in the cluster and ask it where a particular database is located. So the first thing that happens when you connect to a RavenDB server is that you find out where you really need to go. In the case of a single node, the answer is “you are going to talk to me”, but in the case of a cluster, it might be some other node entirely. And this is where things begin to be a bit problematic. The problem is that we need to know what to call ourselves when a client connects to us.

That isn’t as easy as it might sound. Consider the case where the user configure the server url to be “http://0.0.0.0:8080”. We can’t give that to the client, so we default to sending back the host name in that case. And this is where things started to get tricky. In many cases, the host name is not something that make sense.

Oh, for internal deployments, you can usually rely on it, but if you are deploying to AWS, for example, the machine host name is of very little use in routing to that particular machine. Or, for that matter, a docker container host name isn’t particularly useful when you consider it from the outside.

The problem is that with RavenDB, we had a single configuration value that was used both for the binding to the network and for letting the user know how to connect to us. That didn’t work when you had routers in the middle. For example, if my public docker IP is 10.0.75.2, that doesn’t mean that this is the IP that I can bind to inside the container. And the same is true whenever you have any complex network topology (putting nginx in front of the server, for example).

The resolution for that was pretty simple, we added a new configuration value that will separate the host that we bind to from the host that we report to the outside world. In this manner, you can bind to one IP but let the world know that you should be reached via another. 

time to read 5 min | 986 words

imageThe bug started as pretty much all others. “We have a problem when replicating from a Linux machine to a Windows machine, I’m seeing some funny values there”. This didn’t raise any alarm bells, after all, that was the point of checking what was going on in a mixed mode cluster. We didn’t expect any issues, but it wasn’t surprising that they happened.

The bug in question showed up as an invalid database id in some documents. In particular, it meant that we might have node A, node B and node C in the cluster, and running a particular scenario suddenly started also reporting node Ω, node Σ and other fun stuff like that.

And so the investigation began. We were able to reproduce this error once we put enough load on the cluster (typically around the 20th million document write or so), and it was never consistent.

We looked at how we save the data to disk, we looked at how we ready it, we scanned all the incoming and outgoing data. We sniffed raw TCP sockets and we looked at everything from the threading model to random corruption of data on the wire to our own code reading the data to manual review of the TCP code in the Linux kernel.

The later might require some explanation, it turned out that setting TCP_NODELAY on Linux would make the issue go away. That only made things a lot harder to figure out. What was worse, this corruption only ever happened in this particular location, never anywhere else. It was maddening, and about three people worked on this particular issue for over a week with the sole result being: “We know where it roughly happening, but no idea why or how”.

That in itself was a very valuable thing to have, and along the way we were able to fix a bunch of other stuff that was found under this level of scrutiny. But the original problem persisted, quite annoyingly.

Eventually, we tracked it down to this method:

We were there before, and we looked at the code, and it looked fine. Except that it wasn’t. In particular, there is a problem when the range we want to move is overlapped with the range we want to move it to.

For example, consider that we have a buffer of 32KB, and we read from the network 7 bytes. We then consumed 2 of those bytes. In the image below, you can see that as the Origin, with the consumed bytes shown as ghosts.

image

What we need to do now is to move the “Joyou” to the beginning of the buffer, but note that we need to move it from 2 – 7 to 0 – 5, which are overlapping ranges. The issue is that we want to be able to fully read “Joyous”, which require us to do some work to make sure that we can do that. This ReadExactly piece of code was written with the knowledge that at most it will be called with 16 bytes to read, and the buffer size is 32KB, so there was an implicit assumption that those ranges can’t overlap.

when they do… Well, you can see in the image how the data is changed with each iteration of the loop. The end result is that we have corrupted our buffer and mess everything up. The Linux TCP stack had no issue, it was all in our code. The problem is that while it is rare, it is perfectly fine to fragment the data you send into multiple packets, each with very small length. The reason why TCP_NODELAY “fixed” the issue was that it probably didn’t trigger the multiple small buffers one after another in that particular scenario. It is also worth noting that we tracked this down to specific load pattern that would cause the sender to split packets in this way to generate this error condition.

That didn’t actually fix anything, since it could still happen, but I traced the code, and I think that this happened with more regularity since we hit the buffer just right to send a value over the buffer size in just the wrong way. The fix for this, by the way, is to avoid the manual buffer copying and to use memove(), which is safe to use for overlapped ranges.

That leave us with the question, why did it take us so long to find this out? For that matter, how could this error surface only in this particular case? There is nothing really special with the database id, and this particular method is called a lot by the code.

Figuring this out took even more time, basically, this bug was hidden by the way our code validate the incoming stream. We don’t trust data from the network, and we run it through a set of validations to ensure that it is safe to consume. When this error happened in the normal course of things, higher level code would typically detect that as corruption and close the connection. The other side would retry and since this is timing dependent, it will very likely be able to proceed. The issue with database ids is that they are opaque binary values (they are guids, so no structure at all that is meaningful for the application). That means that only when we got this particular issue on that particular field (and other field at all) will we be able to pass validation and actually get the error.

The fix was annoyingly simply given the amount of time we spent finding it, but we have been able to root out a significant bug as a result of the real world tests we run.

time to read 2 min | 398 words

imageWe have a feature in RavenDB that may leave behind some traces when a document is gone. The actual details aren’t really important for the story. Those traces are there for a reason, and a user have a good reason to want to see them in the UI.

That meant that we needed to come up with a name for them. After a short pause, we selected Zombies, because they are the remnants of real documents that are hanging around. That seem to mesh well with the technical terminology already in use (zombie processes, for example) and a reference to the current popularity of zombies in culture (books, movies, etc) which many of our guys enjoy.

Note that in this case, I’m specifically using the term guys to refer to our male developers. One of our female developers didn’t like the terminology. Because Zombies are creepy, and we don’t want that in our UI.

There was a discussion on the terminology we’ll use that was very interesting, because it was on clearly defined gender lines. None of the guys had any issue with the term, and that included a few that considered zombie movies to be yucky as well. All the women, on the other hand, thought (to varying degrees) that zombies isn’t the appropriate term to use.

We threw a few other ones around, such as orphans, but one of the features we wanted to have is the ability to wipe those traces, and “kill all orphans” is not something that I think would go well in our UI.

Eventually the idea to use the term ghosts was brought up, and it was liked by all. It has all the connotations desired to explain what this is (the remnants of a deleted document), but the images it evoked was Casper the Friendly Ghost and Pacman, apparently.

Given that while none of the guys thought there was a problem with zombies, but no one was also particularly attached to the name, and on the other hand we had strong opposition to the term and an alternative that made everyone happy, we switched to that terminology.

Fun fact, I was telling my wife this story and I wasn’t able to complete the description of the debate before she suggested using the Pacman image.

time to read 5 min | 961 words

Related imageI’m reading MongoDB in Action right now. It is an interesting book and I wanted to learn more about the approach to using MongoDB, rather then just be familiar with the feature set and what it can do. But this post isn’t about the book, it is about something that I read, and as I was reading it I couldn’t help but put down the book and actually think it through.

More specifically, I’m talking about this little guy. This is a small Ruby class that was presented in the book as part of an inventory management system. In particular, this piece of code is supposed to allow you to sell limited inventory items and ensure that you won’t sell stuff that you don’t have. The example is that if you have 10 rakes in the stores, you can only sell 10 rakes. The approach that is taken is quite nice, by simulating the notion of having a document per each of the rakes in the store and allowing users to place them in their cart. In this manner, you prevent the possibility of a selling more than you actually have.

What I take strong issue with is the way this is implemented. MongoDB doesn’t have multi document transactions, but the solution presented requires it. Therefor, the approach outlined in the book is to try to build transactional semantics from the client side. I write databases for a living, and I find that concept utterly baffling. Clients shouldn’t try to do stuff like that, not only would they most likely get it wrong, but they’ll do that extremely inefficiently.

Let us consider the following tidbit of code:

The idea here is that the fetcher is supposed to be able to atomically add the products to the order. If there aren’t enough available products to be added, the entire thing is supposed to be rolled back. As a business operation, this make a lot of sense. The actual implementation, however, made me wince.

What it does, if it was SQL, is the following:

I intentionally used SQL here, both to simplify the issue for people who aren’t familiar with MongoDB and to explain the major dissonance that I have with this approach. That little add_to_cart call that we had earlier resulted in no less than eight network roundtrips. That is in the happy case.  There is also the failure mode to consider, which involved resetting all the work done so far.

The thing that really bothers me is that I can’t believe that this is something that you’ll actually want to do except as an intellectual exercise. I mean, sure, how we can pretend to get transactions from non transactional store is interesting, but given the costs of doing this or the possibility of failure or the fact that this is a non atomic state transition or… you get my point, right?

In the case of this code, the whole process is non atomic. That means that outside observers can see the changes as they are happening. It also opens you up for a lot of Bad Stuff in terms of abusing the system. If the user is malicious, they can use the fact that this “transaction” is going to be running back and forth to the database (and thus taking a lot of time) and just open another tab to initiate an action while this is going on, resulting in operations on invalid state. In the example that the book give, we can use that to force purchases of invalid items.

If you think that this is unrealistic, consider this page, which talks about doing things like making money appear from thin air using just this sort of approaches.

Another thing that really bugged me about this code is that it has “error handling” I use that in quotes because it is like a security blanket for a 2 years old. Having it there might calm things up, but it doesn’t actually change anything. In particular, this kind of error handling looks right, but it is horribly broken if you consider what kind of actual errors can happen here. If the process running this code failed for any reason, the “transaction” is going to stay in an invalid state. It is possible that one of your rake will just disappear into thin air, as a result. It is supposed to be in someone’s cart, but it isn’t.  The same can be the case if the server had an issue midway or just a regular network hiccup.

I’m aware that this is code that was written explicitly to be readable and easy to explain, rather then be able to withstand the vagaries of production, but still, this is a very dangerous thing to do.

As an aside, not quite related to the topic of this post, but one thing that really bugged me in the book so far is the number of remote requests that are commonly required to do things. Is there an assumption that the database in question is nearby or very cheap to access, because the entire design philosophy I use is to assume that going over the network is expensive, so let us give the users a lot of ways to reduce that cost. In contrast, at least in the book, there is a lot of stuff that is just making remote calls like there is a fire sale that will close in 5 minutes.

To be fair to the book, it notes that there is a possibility of failure here and explain how to handle one part of it (it missed the error conditions in the error handling) and call this out explicitly as something that should be done with consideration.

time to read 2 min | 228 words

During code review I run into these two sections, which raised a flag. Can you tell why?

image

image

The problem with this type of code is two fold. First, we add optional parameters, to reduce the number of breaking changes we have. The problem with that is that we already have parameters on the call, and eventually you’ll get to something like this:

image

Which is the queen of optional parameters method, and you can probably guess how it looks internally.

In the first case, we can add the new optional parameter to the… options variable that we are already sending this method. This way, we don’t have to worry about breaking changes, and we already have a way to setup options, determine defaults, etc.

In the second case, we are passing two bools to the method, and there isn’t a preexisting parameters object. Instead of creating one, we can use a Flags enum, whose bits we can set to determine what exactly the behavior of this method should be. That is generally much easier to maintain in the long run.

time to read 1 min | 189 words

imageThe RavenDB 4.0 book is going really well, this week I have managed to write about 20,000 words and the current page count is at 166. At this rate, it is going to turn into a monster in terms of how big it is going to be.

The book so far covers the client API, how to model data in a document database and how to use batch processing in RavenDB with subscriptions. The full drafts are available here, and I would really appreciate any feedback you have.

Next topic is to start talking about clustering and this is going to be real fun.

I’m also looking for a technical editor for the book. Someone who can both tell me that I missed a semi column in a code listing and tell me why my phrasing / spelling / grammar is poor (and likely all three at once and some other stuff that I don’t even know). If you know someone, or better yet, interested and capable, send me a line.

time to read 2 min | 202 words

I run into the following in a PR I recently reviewed. Take a look and try to figure out why I’m pointing this bits out:

image_thumb[2]

image_thumb[3]

Those are bad errors. In the sense that they are hiding information. Sure, in 90% of the cases the user just put it the backup location or the restore location, so they know what the application is referring to. But the other 10% is looking at the exception from the logs, having “I got an error” support call or all sort of weird stuff.

In particular, one of the most annoying things is when the user and the application disagree on a relative path. If the user believes that the relative path is based on one location and the application another, hilarity ensues. But not in a fun way.

A better way to prevent all of those would be to just include the actual locations (fully resolved, mind) in the error. It doesn’t prevent support calls, but it does make them a lot easier to solve.

FUTURE POSTS

  1. Partial writes, IO_Uring and safety - about one day from now
  2. Configuration values & Escape hatches - 5 days from now
  3. What happens when a sparse file allocation fails? - 7 days from now
  4. NTFS has an emergency stash of disk space - 9 days from now
  5. Challenge: Giving file system developer ulcer - 12 days from now

And 4 more posts are pending...

There are posts all the way to Feb 17, 2025

RECENT SERIES

  1. Challenge (77):
    20 Jan 2025 - What does this code do?
  2. Answer (13):
    22 Jan 2025 - What does this code do?
  3. Production post-mortem (2):
    17 Jan 2025 - Inspecting ourselves to death
  4. Performance discovery (2):
    10 Jan 2025 - IOPS vs. IOPS
View all series

Syndication

Main feed Feed Stats
Comments feed   Comments Feed Stats
}