Ayende @ Rahien

Hi!
My name is Oren Eini
Founder of Hibernating Rhinos LTD and RavenDB.
You can reach me by email or phone:

ayende@ayende.com

+972 52-548-6969




Inside RavenDB 4.0: Chapter 6 is done

time to read 1 min | 73 words

I’ve just completed writing chapter 6 (distributed RavenDB) and pushed a preview up. This puts the page count at over 200 pages so far, with another two-thirds or so left.

This chapter was really hard to write, and I would really appreciate any feedback that you have on the text and on the distributed nature of RavenDB 4.0 in general. It is at once very similar to 3.x and an entirely different beast.

RavenDB 4.0: The admin’s backdoor is piping hot

time to read 5 min | 805 words


We take security very seriously. With the move to X509 certificates only for authentication (on all RavenDB editions) I feel that we have a really good story around securing RavenDB and controlling access to it.

Almost. One of the more annoying things about security is that you also need to consider the hard cases, such as the administrators messing up badly. As in, losing the credentials that allow you to administer RavenDB. This can happen because the database has just run without issue for so long that no one can remember where the keys are. That isn’t supposed to happen, but RavenDB has been in production use for close to a decade now, which means that we have seen our fair share of mess-ups (both our own and our customers’).

In some cases, we have had to help a customer manage a third system handover between different hosting providers, which felt half like forensics and half like hacking. In short, when we design a system now, we also consider the fact that, as secure as we want the system to be, there must be a way for an authorized person to get in.

If this made you cringe, you are in good company. I both love and hate this feature. I love it because it is going to be very useful; I hate it because it was a headache to get it right. But I’m jumping ahead of myself. What is this backdoor that I’m talking about?

A properly configured RavenDB server will require a client certificate (one that was registered in the cluster) to access it. However, in addition to listening over HTTPS, RavenDB will also listen for commands on standard input. An admin can use the standard input / output as a way to talk with RavenDB without requiring any authentication. Basically, we expose a mini shell that you can use to enter commands and to inspect and change the server’s state.
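To make the mechanics concrete, here is a minimal sketch of the idea in C#. This is not RavenDB’s actual command set or implementation, just an illustration of a process that serves a tiny admin shell over its own standard input:

```csharp
using System;

// Illustration only: a server process that, besides its normal network
// listeners, also reads admin commands from its own standard input.
class AdminConsoleSketch
{
    static void Main()
    {
        string line;
        while ((line = Console.ReadLine()) != null)   // ends when stdin is closed
        {
            switch (line.Trim())
            {
                case "help":
                    Console.WriteLine("commands: help, stats, quit");
                    break;
                case "stats":
                    Console.WriteLine("uptime, memory, open databases ...");
                    break;
                case "quit":
                    return;
                default:
                    Console.WriteLine("unknown command: " + line);
                    break;
            }
        }
    }
}
```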

Here is how it looks when running in console mode:

image

From a security point of view, if a user is able to access my standard input, that usually means that they are the one who started this process or are able to do so. RavenDB obviously won’t have any setuid bits turned on, so there is no need to worry about a user tricking us into doing something that they don’t have permission to do.

So using the console is a really nice way for us to offer the administrator an escape hatch to start messing with the internals of RavenDB in interesting ways. However, that only works if you are running RavenDB in interactive mode. What about when running as a service or daemon? They don’t have a standard input that is available to the admin. In fact, in most production deployments, you won’t have an easy time at all trying to connect to the console.

So that option is out, sadly. Or is it?

The nice thing about operating systems is that we can lean on them. In this case, we expose the exact same console that we have for stdin / stdout using named pipes (actually, Unix sockets on Linux / Mac, but pretty much the same idea). Both are methods for inter-process communication that are local to the machine and can be secured by the operating system directly. In this case, we make sure that the pipe is only accessible to the RavenDB user (and to root / Administrator, obviously). That means that an admin can log into the box, run a single command and land in the RavenDB admin shell, where they can manage the server. For example, by registering a new certificate in the server.

Because only the user running the RavenDB process or an administrator / root can access the pipe (ensured by setting the proper ACL on the pipe during creation) we know that there isn’t any security risk here. An admin can already override any security in the box, and the permissions are always on the user level, not the process level, so if you are running as the same user as the RavenDB process you can already do anything that RavenDB can do.
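To illustrate the ACL part: this is not RavenDB’s code, the pipe name is made up, and it uses the .NET Framework overload that accepts a PipeSecurity (newer runtimes expose the same thing through NamedPipeServerStreamAcl.Create). Restricting a named pipe to the owning user and the local Administrators group looks roughly like this:

```csharp
using System.IO.Pipes;
using System.Security.AccessControl;
using System.Security.Principal;

class AdminPipeSketch
{
    static void Main()
    {
        // Only the user running this process and local Administrators get access.
        var security = new PipeSecurity();
        security.AddAccessRule(new PipeAccessRule(
            WindowsIdentity.GetCurrent().User,
            PipeAccessRights.FullControl,
            AccessControlType.Allow));
        security.AddAccessRule(new PipeAccessRule(
            new SecurityIdentifier(WellKnownSidType.BuiltinAdministratorsSid, null),
            PipeAccessRights.FullControl,
            AccessControlType.Allow));

        // "my-admin-channel" is a made-up pipe name for the example.
        using (var pipe = new NamedPipeServerStream(
            "my-admin-channel", PipeDirection.InOut, 1,
            PipeTransmissionMode.Byte, PipeOptions.None, 4096, 4096, security))
        {
            pipe.WaitForConnection();
            // ... hand the stream to the same command loop that serves stdin ...
        }
    }
}
```

On Linux / Mac the same idea applies with a Unix domain socket whose file permissions are restricted to the service user and root.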

After we ensured that our security isn’t harmed by this option, we can relax knowing that we have an easy (and safe) way for the administrator to manage the server in an emergency.

In fact, the most obvious usage of this feature is during initial cluster setup, when you don’t have anything yet. This allows you to enter the system as a trusted party and do the initial configuration.

RavenDB 4.0: Securing the keys to the kingdom

time to read 7 min | 1238 words

A major design goal for RavenDB is that it should be easy and convenient to use. A major constraint is that it must be secure. As you can imagine, those two quite often work against one another. Security is often anything but easy to use, and it is rarely convenient.

Previously, we have used Windows Authentication and OAuth to secure access to RavenDB. That works and has been deployed in the wild for quite some time. It is also a major pain whenever there is an issue. If the connection to the domain controller drops, we might have authentication delays of many seconds, and trying to debug Active Directory issues in production deployments can be… a bit of a pain, in the same way that an audit by the IRS that starts with a SWAT team bashing down your door is mildly annoying. OAuth, on the other hand, is much better, since it is under our control, and we can figure out exactly what is going on with it if need be.

Since RavenDB 4.0 runs on Windows, Linux & Mac, we decided to drop the Windows Authentication support and just use OAuth. The problem is that if we choose to support HTTP, we have to rely on extremely complex protocols that attempt to secure authentication over plain text, but don’t usually deliver good results and are typically a pain to debug and support. Or, we can use HTTPS and just let SSL/TLS handle it all for us. A good example of the difference can be seen in OAuth 1.0 vs OAuth 2.0.

When we built RavenDB 1.0, roughly around 2009, the operating environment was quite different. In 2017, not using HTTPS is pretty much a sin unto itself. As we started the security modeling for RavenDB 4.0, it became obvious that we couldn’t really support any security on top of HTTP without effectively having to implement most of the properties of HTTPS ourselves. I’m many things, but I’m not a security expert, not by a long shot. Given the chance to implement my own security protocol, I would gladly do that, for a toy project or a weekend hackfest. But there is no way I would trust my own security in production against serious attacks. That pretty much led us to the realization that we have to require HTTPS for anything that requires security.

That includes running inside the organization, exposed to the public internet, running inside the cloud or in a shared datacenter, etc. Pretty much, unless you have HTTPS, there is no real point in talking about security. Given that, it meant that we could shift our baseline approach to security. If we are always going to require HTTPS for security, it means that we are operating in an environment that is much nicer for us to apply security.

Now, you can choose to run HTTP only, and avoid the need for certificate management, etc. However, at that point, you aren’t running a secure system, or you are already running it in a trusted and secured environment. In that case, we want to be clear that there isn’t any point in trying to apply a security policy (such as who can access what). Any network sniffer can figure out the access tokens and pretend to be whoever they want, if you are using HTTP.

With HTTPS required, we now move to the realm of having the admin take care of the certificates: securing them, renewing them, etc. That is the part where it isn’t as easy or convenient as we could wish for. However, once we had that as a baseline, it opened an interesting path for security. Instead of relying on our own solution, we can use the built-in one and use x509 client certificates for authentication. This has the advantage that it is widely supported, standardized and secure. It is a bit less convenient than just a password, but the advantage is that any security system already in place knows how to deal with, store, authorize and manage access to certificates.

The idea is that you can go to RavenDB and either register or generate an x509 certificate. To that certificate an administrator can assign permissions (such as which databases it is allowed to access). From that point on, a client (RavenDB, browser, curl, etc.) can connect to RavenDB and just issue REST requests. There is no need to do anything else for the system to work. Contrast that with how you would typically have to deal with authentication using OAuth: sending the token, keeping it fresh manually, etc.

Using x509 also has the distinct advantage that it is widely trusted. We intend to provide this level of security to all editions of RavenDB (so the Community Edition will also be able to use it).

A nice accidental feature of this decision is that we are going to be able to apply authentication at the connection level, and connection pooling means that we are likely going to have connections live for a long time. That means that we only need to pay the authentication cost once per connection, instead of per request as with OAuth.

To simplify matters, we’ll likely just use the client certificates for authenticating the client, so we won’t care whether they come from a trusted root, etc. We’ll just require that the admin register the valid certificates with the cluster so they will be recognized. If you need to stop using a certificate, you can delete its registration or generate a new certificate to take its place. On the client side, it means that the DocumentStore will expose an X509Certificate property that you can set (or the equivalent in other clients). That means that you can use your own policies on the client to determine how to store the certificate.
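As a minimal sketch of the client side (the property name follows the post; the file name, password and URL are made up, and the shipped client API may differ slightly):

```csharp
using System.Security.Cryptography.X509Certificates;
using Raven.Client.Documents;

class ClientCertificateSketch
{
    static void Main()
    {
        // Load the client certificate according to whatever policy you use.
        var clientCert = new X509Certificate2("northwind-client.pfx", "pfx-password");

        var store = new DocumentStore
        {
            Urls = new[] { "https://a.ravendb.example.com" },
            Database = "Northwind",
            Certificate = clientCert   // presented during the TLS handshake
        };
        store.Initialize();

        using (var session = store.OpenSession())
        {
            // Every request on this store is authenticated by the certificate;
            // there is no token to acquire, refresh or attach per request.
        }
    }
}
```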

On the server side, by the way, we’ll expose an extension point that will allow you to retrieve the certificate using your own policies. For example, if you are using Azure Key Vault or Hashicorp Vault or even your own HSM. This is done by invoking a process you specify, so you can write your own scripts / mini programs and apply whatever logic you need. This creates a clean separation between RavenDB and the secret store in use.

Authentication between servers is also done using SSL and certificates. We expect that we’ll commonly have all the servers running the same wildcard certificate, in which case they will obviously trust each other. Alternatively, you can also specify additional certificates that will be treated as servers. This is useful when you are running with a separate certificate for each server, but it is also a critical part of certificate rotation. When your certificate is about to expire, the admin will register the new certificate as trusted, and then start replacing the certificates of each of the nodes in turn. This allows us to run with both old and new certificates concurrently during this process.

We considered relying on some properties of the certificate itself, but it seemed like an error-prone process. It is better to have the admin explicitly state, for both client and server certificates, which ones we should actually trust, and at what level.

I would really appreciate any commentary you have about this feature, in terms of ease of use, acceptability and, obviously, its security.

The ghost of the zombie of revisions past

time to read 3 min | 433 words

I talked about difficult naming decisions, and this one was certainly one of the more lively ones.

We bounced between zombies, orphans and ghosts, with a bunch of crazy stuff going in between. At one point it was suggested that we just make up a word, but my suggestion to use Welchet was sadly declined by all, including a rather rude comment by the author of this blog about what kind of jokes are appropriate for the workplace.

After we settled the discussion on ghosts, there was another discussion about whether we should use Inky, Blinky, Pinky and Clyde. I tell you, when we aren’t building distributed databases, the office is a hotbed of nerd references.

And then an idea came along, which I really liked, so we talked about it in the morning and I’m showing screenshots in a blog post a bit before midnight. The feature is called the revisions bin.

In the UI, you can see it as one of the top level elements.

image

In essence, this is a recycle bin for revisions. RavenDB can be configured to keep revisions of documents as they change, and even keep track of them after they were deleted. However, that presented a problem. If you deleted a document that had revisions, how would you tell that it was there in the first place? Just knowing the document id and looking for that wouldn’t work very well. So we created the revisions bin, whose content looks like this:

image

And from there you can go to:

image

For that matter, if we recreate this document, you’ll be able to see its entire history, including across deletes.

image

Now admittedly this is a nice looking UI, and the skull on the menu is a nice touch, if a bit morbid. However, why make such a noise about such a feature?

The answer is that the revisions bin isn’t that important by itself, but keeping track of document deletes using revisions is quite important, since it allows subscriptions and ETL to handle them in a clean and easy-to-grok manner. And in order to actually explain that, we needed to be able to show users what we are talking about.

RavenDB 4.0: Unbounded result sets

time to read 3 min | 503 words

Unbounded result sets are a pet peeve of mine. I have seen them destroy application performance more than once. With RavenDB, I decided to cut that problem off at the knees and placed a hard limit on the number of results that you can get from the server. Unless you configured it differently, you couldn’t get more than 1,024 results per query. I was very happy with this decision, and there have been numerous cases where it has been able to save an application from serious issues.

Unfortunately, users hated it. Even though it was configurable, and even though you could effectively turn it off, just the fact that it was there was enough to make people angry.

Don’t get me wrong, I absolutely understand some of the issues raised. In particular, if the data goes over a certain size we suddenly show wrong results or an error, leaving the app in “we need to fix this NOW” territory. It is an easy mistake to make. In fact, on this blog, I noticed a few months back that I couldn’t get entries from 2014 to show up in the archive. The underlying reason was exactly that: I query for the number of items per month, I’ve been blogging for more than 128 months, and the data got truncated.

In RavenDB 4.0 we removed the limit. If you don’t specify a limit in a query, you’ll get exactly as many results as there are in the database. You can ask RavenDB to raise an error if you didn’t specify a limit clause, which is a way for you to verify that you won’t run into this issue in production, but it is off by default and the default will probably better match new users’ expectations.
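As a rough sketch of what that looks like from the client side (assuming an Order document class made up for the example; the convention name in the last comment is approximate, so check the client documentation):

```csharp
using System.Linq;
using Raven.Client.Documents;

public static class PagingSketch
{
    private class Order { }

    public static void Run(IDocumentStore store)
    {
        using (var session = store.OpenSession())
        {
            // In 4.0 this returns every matching document - no implicit cap.
            var everything = session.Query<Order>().ToList();

            // State the page size you actually want to keep the result set bounded.
            var page = session.Query<Order>()
                              .Skip(0)
                              .Take(128)
                              .ToList();
        }

        // Opt-in safety net: fail queries that don't specify a page size.
        // (Convention name may differ in the released client.)
        // store.Conventions.ThrowIfQueryPageSizeIsNotSet = true;
    }
}
```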

The underlying issue of loading too many results is still there, of course. And we still want to do something about it. What we did was raise alerts.

I have made a query on a large set (160,000 results, about 400 MB in all) and the following popped up in the RavenDB Studio:

image

This tells the admin that there is some information that needs to be looked at. This is intentionally unobtrusive.

When you click on the notifications, you’ll get the following message.

image

And if you click on the details, you’ll see the actual details of the operations that triggered this warning.

image

I actually created an issue so we’ll supply you with more information (such as the index, the query, the duration and the total size that went over the network).

I think that this gives the admin enough information to act upon, but will not cause hardship to the application. This makes it something that falls under “We Should Fix This” instead of “Get the On-Call Guy”.

Batch processing with subscriptions in RavenDB 4.0

time to read 3 min | 425 words

Subscriptions are a somewhat neglected feature in RavenDB. They were created to handle a specific customer need and grew from there, but they had relatively little traction and were a bit of a pain to use. When we looked at the things we wanted to do in RavenDB 4.0, re-working how people use subscriptions was high enough on the list that it got a dedicated dev for about a year.

Here is how a subscription looks in RavenDB 3.x.

It is only available from code, and the model used is heavily influenced by Reactive Extensions. It gives you a reliable subscription to the data: even if the client or server went down, it could recover on restart. But it was complex to do the more advanced things. There are events that you can register for to respond to things that are happening, but there isn’t a complete story. Other things, such as automatic failover or responding to deletes, were flat out impossible.

With RavenDB 4.0, we decided to do things differently. I have talked about this before several times, but recently we completed a major restructuring and simplification of the user-visible behavior that I’m really happy about. To start with, we ditched the Reactive Extensions and IObservable model. This is just not the right fit for the kind of things we want to do. Instead, we are going with full-blown batch processing.

Instead of being called once per item, we are going to call you once per batch. This is actually how things go over the wire, and exposing it directly to the user makes our life a lot easier. It also means that you have a much better model for actually doing things in batch mode, such as applying a modification to all the items in the batch and saving them back in a single operation.

Subscriptions in RavenDB 4.0 are also fault tolerant and highly available (on both client & server), allow access to versioned and deleted snapshots, allow complex filtering and transformations to be applied on the server side, and are in general a lot more suitable for the tasks we intend them for.

Perhaps what is more exciting is that subscriptions are available to all the clients, and in some cases, it just makes more sense to write them as a batch processing script. Consider:
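The original post embeds a code snippet here that isn’t reproduced in this text. As a rough sketch of the batch model (the API names approximate the 4.0 C# client and may differ slightly; the Order class and the “not yet shipped” condition are made up for the example):

```csharp
using System;
using System.Threading.Tasks;
using Raven.Client.Documents;

public class Order
{
    public string Id { get; set; }
    public DateTime? ShippedAt { get; set; }
}

public static class ShippingBackfill
{
    public static async Task RunAsync(IDocumentStore store)
    {
        // Create a subscription over every order that hasn't shipped yet.
        var name = store.Subscriptions.Create<Order>(o => o.ShippedAt == null);

        using (var worker = store.Subscriptions.GetSubscriptionWorker<Order>(name))
        {
            await worker.Run(batch =>
            {
                // One session per batch: touch every document in the batch,
                // then save them all back in a single round trip.
                using (var session = batch.OpenSession())
                {
                    foreach (var item in batch.Items)
                        item.Result.ShippedAt = DateTime.UtcNow;

                    session.SaveChanges();
                }
            });
        }
    }
}
```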

This is the kind of thing that can really make the operations team happy, because they can do targeted jobs with very little friction. I spend the whole of Chapter 5 talking about subscriptions, and I think it is well worth it.

We won’t be fixing this race condition

time to read 2 min | 345 words

During the work on restoring backups, the developer in charge came up with the following problematic scenario:

  • Start restoring a backup of the Northwind database on node A, which can take quite some time for a large database.
  • Create a database named Northwind on node B while the restore is taking place.

The problem is that during the restore, the database doesn’t exist in a proper form in the cluster until the restore is done. During that time, if an administrator attempts to create a database with that name, it will look like it is working, but it will actually create a new database on all the other nodes and fail on the node where the restore is going on.

When the restore completes, it will either remove the previously created database or join it and replicate the restored data to the rest of the nodes, depending on exactly when the restore and the new database creation happened.

Now, trying to resolve this issue involves coordinating the restore process across the cluster. However, that also means that we need to do heartbeats during the restore process (to the entire cluster), handle timeouts and recovery, and effectively take on a pretty big burden of complicated code. Indeed, the first draft of the fix for this issue suffered from the weakness that it would only work when running on a single node, and would only work in cluster mode in very specific cases.

In this case, it is a very rare scenario that requires an admin (not just a standard user) to do two things that you wouldn’t usually expect together, and the outcome is a bit confusing even if you manage it, but there isn’t any data loss.

The solution was to document that during the restore process you shouldn’t create a database with the same name, but instead let RavenDB complete the restore and then let the database span additional nodes. That is a much simpler alternative to getting into distributed-coordination reasoning just for something that is an operator error in the first place.

Bug stories: How do I call myself?

time to read 3 min | 522 words

This bug is actually one of the primary reasons we had a Beta 2 release for RavenDB 4.0 so quickly.

The problem is easy to state: in any non-trivial deployment setup, clients would be utterly unable to connect to us. Let us examine what I mean by a non-trivial setup, shall we?

A trivial setup is when you are running locally, binding to “http://localhost:8080”. In this case, everything is simple, and you can bind to the appropriate interface and when a client connects to you, you let it know that your URL is “http://localhost:8080”.

Hm… this doesn’t make sense. If a client just connected to us, why do we need to let it know the URL it needs to use to connect to us?

Well, if there is just a single node, we don’t. But RavenDB 4.0 allows you to connect to any node in the cluster and ask it where a particular database is located. So the first thing that happens when you connect to a RavenDB server is that you find out where you really need to go. In the case of a single node, the answer is “you are going to talk to me”, but in the case of a cluster, it might be some other node entirely. And this is where things begin to be a bit problematic. The problem is that we need to know what to call ourselves when a client connects to us.

That isn’t as easy as it might sound. Consider the case where the user configures the server URL to be “http://0.0.0.0:8080”. We can’t give that to the client, so we default to sending back the host name in that case. And this is where things started to get tricky. In many cases, the host name is not something that makes sense.

Oh, for internal deployments, you can usually rely on it, but if you are deploying to AWS, for example, the machine host name is of very little use in routing to that particular machine. Or, for that matter, a docker container host name isn’t particularly useful when you consider it from the outside.

The problem is that with RavenDB, we had a single configuration value that was used both for the binding to the network and for letting the user know how to connect to us. That didn’t work when you had routers in the middle. For example, if my public docker IP is 10.0.75.2, that doesn’t mean that this is the IP that I can bind to inside the container. And the same is true whenever you have any complex network topology (putting nginx in front of the server, for example).

The resolution was pretty simple: we added a new configuration value that separates the host we bind to from the host we report to the outside world. In this manner, you can bind to one IP but let the world know that you should be reached via another, along the lines of the sketch below.
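A rough sketch of such a split in a settings file (treat the key names as approximations of the configuration option the post describes and check the configuration reference; the public host name is made up):

```json
{
    "ServerUrl": "http://0.0.0.0:8080",
    "PublicServerUrl": "http://ravendb-1.example.com:8080"
}
```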

Bug stories: The data corruption in the cluster

time to read 5 min | 986 words

The bug started as pretty much all the others do. “We have a problem when replicating from a Linux machine to a Windows machine, I’m seeing some funny values there.” This didn’t raise any alarm bells; after all, that was the point of checking what was going on in a mixed-mode cluster. We didn’t expect any issues, but it wasn’t surprising that they happened.

The bug in question showed up as an invalid database id in some documents. In particular, it meant that we might have node A, node B and node C in the cluster, and running a particular scenario suddenly started also reporting node Ω, node Σ and other fun stuff like that.

And so the investigation began. We were able to reproduce this error once we put enough load on the cluster (typically around the 20 millionth document write or so), and it was never consistent.

We looked at how we save the data to disk, we looked at how we read it, we scanned all the incoming and outgoing data. We sniffed raw TCP sockets and we looked at everything from the threading model to random corruption of data on the wire to our own code reading the data to a manual review of the TCP code in the Linux kernel.

The latter might require some explanation: it turned out that setting TCP_NODELAY on Linux would make the issue go away. That only made things a lot harder to figure out. What was worse, this corruption only ever happened in this particular location, never anywhere else. It was maddening, and about three people worked on this particular issue for over a week with the sole result being: “We know roughly where it is happening, but have no idea why or how.”

That in itself was a very valuable thing to have, and along the way we were able to fix a bunch of other stuff that was found under this level of scrutiny. But the original problem persisted, quite annoyingly.

Eventually, we tracked it down to this method:

We had been there before, and we looked at the code, and it looked fine. Except that it wasn’t. In particular, there is a problem when the range we want to move overlaps with the range we want to move it to.

For example, consider that we have a buffer of 32KB, and we read 7 bytes from the network. We then consumed 2 of those bytes. In the image below, you can see that as the Origin, with the consumed bytes shown as ghosts.

image

What we need to do now is move the “Joyou” to the beginning of the buffer, but note that we need to move it from 2 – 7 to 0 – 5, which are overlapping ranges. The issue is that we want to be able to fully read “Joyous”, which requires us to do some work to make sure that we can do that. This ReadExactly piece of code was written with the knowledge that at most it will be called with 16 bytes to read, and the buffer size is 32KB, so there was an implicit assumption that those ranges can’t overlap.

When they do… well, you can see in the image how the data is changed with each iteration of the loop. The end result is that we corrupted our buffer and messed everything up. The Linux TCP stack had no issue; it was all in our code. The problem is that while it is rare, it is perfectly fine to fragment the data you send into multiple packets, each with a very small length. The reason TCP_NODELAY “fixed” the issue was that it probably didn’t trigger the multiple small buffers one after another in that particular scenario. It is also worth noting that we tracked this down to a specific load pattern that would cause the sender to split packets in this way and generate this error condition.

That didn’t actually fix anything, since it could still happen, but I traced the code, and I think that this happened with more regularity since we hit the buffer just right to send a value over the buffer size in just the wrong way. The fix for this, by the way, is to avoid the manual buffer copying and to use memmove(), which is safe to use for overlapping ranges.
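As a tiny repro of that class of bug (this is not the RavenDB code, and the copy here shifts the bytes up rather than down as in the post’s figure, but the failure mode is the same): a copy loop that assumes the ranges never overlap, versus Array.Copy, which, like memmove(), handles overlapping ranges correctly:

```csharp
using System;
using System.Text;

class OverlappingCopyDemo
{
    static void Main()
    {
        byte[] buggy = Encoding.ASCII.GetBytes("Joyous..");
        byte[] safe  = Encoding.ASCII.GetBytes("Joyous..");

        // Naive copy that assumes the ranges never overlap: shift 6 bytes from
        // offset 0 to offset 2. Later iterations read bytes that earlier
        // iterations have already overwritten.
        for (int i = 0; i < 6; i++)
            buggy[2 + i] = buggy[i];
        Console.WriteLine(Encoding.ASCII.GetString(buggy));   // "JoJoJoJo" - corrupted

        // Array.Copy behaves like memmove(): overlapping ranges are handled as
        // if the source bytes were copied to a temporary buffer first.
        Array.Copy(safe, 0, safe, 2, 6);
        Console.WriteLine(Encoding.ASCII.GetString(safe));    // "JoJoyous" - intact
    }
}
```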

That leaves us with the question: why did it take us so long to find this out? For that matter, how could this error surface only in this particular case? There is nothing really special about the database id, and this particular method is called a lot by the code.

Figuring this out took even more time. Basically, this bug was hidden by the way our code validates the incoming stream. We don’t trust data from the network, and we run it through a set of validations to ensure that it is safe to consume. When this error happened in the normal course of things, higher level code would typically detect it as corruption and close the connection. The other side would retry, and since this is timing dependent, it would very likely be able to proceed. The issue with database ids is that they are opaque binary values (they are GUIDs, so there is no structure at all that is meaningful to the application). That means that only when we got this particular issue on that particular field (and not on any other field) would we pass validation and actually surface the error.

The fix was annoyingly simple given the amount of time we spent finding it, but we have been able to root out a significant bug as a result of the real-world tests we run.
