Ayende @ Rahien

Oren Eini, aka Ayende Rahien, is the CEO of Hibernating Rhinos LTD, which develops RavenDB, a NoSQL Open Source Document Database.

You can reach me by:

oren@ravendb.net

+972 52-548-6969



time to read 6 min | 1072 words

RavenDB stores (critical) data for customers. We have customers in pretty much every field imaginable: healthcare, finance, insurance and defense. They do very different things with RavenDB; some run a single cluster, some deploy to tens of thousands of locations. The one thing that they all have in common is that they put their data into RavenDB, and they really don't want to put that data in the hands of an unknown third party.

Some of my worst nightmares are articles such as these:

That is just for the last six months, and just one site that I checked.

To be fair, none of these cases are because of a fault in MongoDB. It wasn’t some clever hack or a security vulnerability. It was someone who left a production database accessible over the public Internet with no authentication.

  1. Production database + Public Internet + No authentication
  2. ?
  3. Profit (for someone else, I assume)

When we set out to design the security model for RavenDB, we didn’t account only for bad actors and hostile networks. We had to account for users who did not care.

Using MongoDB as the example, by default it will only listen on localhost, which sounds like it is a good idea. Because no one external can access it. Safe by default, flowers, parade, etc.

And then you realize that the first result for searching: “mongodb remote connection refused” will lead to this page:


Where you’ll get a detailed guide on how to change what IPs MongoDB will listen to. And guess what? If you follow that article, you’ll fix the problem. You would be able to connect to your database instance, as would everything else in the world!

There is even a cool tip in the article, talking about how to enable authentication in MongoDB. Because everyone reads that, right?


Except maybe the guys at the beginning of this post.

So our threat model had to include negligent users. And that leads directly to the usual conundrum of security.

I’ll now pause this post to give you some time to reflect on the Wisdom of Dilbert:

In general, I find that the best security for a computer is to disconnect it from any power sources. That does present some challenges for normal operations, though. So we had to come up with something better.

In RavenDB, security is binary. You are either secured (encrypted communication and mutual authentication) or you are not (everything is plain text and everyone is admin). Because the Getting Started scenario is so important, we have to account for it, so you can get RavenDB started without security. However, that will only work when you set RavenDB to bind to localhost.

How is that any different than MongoDB? Well, the MongoDB guys have a pretty big set of security guidelines. At one point I took a deep look at that and, excluding the links for additional information, the MongoDB security checklist consisted of about 60 pages. We decided to go a very different route with RavenDB.

If you try to change the binding address of RavenDB from localhost, it will work, and RavenDB will happily start up and serve an error page to all and sundry. That error page is very explicit about what is going on: you are doing something wrong, you don't have security and you are exposed. So the only thing that RavenDB is willing to do at that point is to tell you what is wrong, and how to fix it.
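To make this concrete, here is a sketch of what the relevant settings.json entries look like (assuming the 4.x configuration keys ServerUrl, Security.UnsecuredAccessAllowed and Security.Certificate.Path; check the documentation for your exact version). The unsecured Getting Started mode only works while you stay on localhost:

{
    "ServerUrl": "http://127.0.0.1:8080",
    "Security.UnsecuredAccessAllowed": "Local"
}

While a secured setup binds to a public address and supplies a certificate:

{
    "ServerUrl": "https://0.0.0.0:443",
    "Security.Certificate.Path": "/etc/ravendb/cluster-cert.pfx"
}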

That led us to the actual security mechanism in RavenDB. We use TLS 1.2, but it is usually easier to just talk about it as HTTPS. That gives us encrypted data over the wire and it allows for mutual authentication at the highest level. It is also something that you can configure on your own, without requiring an administrator to intervene. The person setting up RavenDB is unlikely to have Domain Admin privileges or the ability to change organization-wide settings. Nor should this be required. HTTPS relies on certificates, which can be deployed, diagnosed and debugged without any special requirements.

Certificates may not require you to have a privileged access level, but they are complex. One of the reasons we chose X509 certificates as our primary authentication system is that they are widely used. Many places already have policies and expertise on how to deal with them. And for the people who don't know how to deal with them, we could automate a lot of that and still get the security properties that we wanted.
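From the application side, working with this is not much harder than working with a connection string. Here is a minimal sketch using the .NET client (the URL, database name and file names are made up for the example):

using System.Security.Cryptography.X509Certificates;
using Raven.Client.Documents;

var store = new DocumentStore
{
    Urls = new[] { "https://a.db.example.com" },                      // hypothetical cluster URL
    Database = "Orders",                                              // hypothetical database
    Certificate = new X509Certificate2("client.pfx", "pfx-password")  // client certificate for mutual auth
};
store.Initialize();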

In fact, Let's Encrypt integration allowed us to get to the point where we can set up a cluster from scratch, with security, in a few minutes. I actually got it on video, because it was so cool to be able to do this.

Using certificates also meant that we could get integration with pretty much anything. We got good support from browsers, we got command line integration, great tools, etc.

This isn’t a perfect system. If you need something that our automated setup doesn’t provide, you’ll need to understand how to work with certificates. That isn’t trivial, but it is also not a waste, it is both interesting and widely applicable.

The end result of RavenDB's security design is a system that is meant to be deployed in a hostile environment, prevent information leakage on the wire and allow strong mutual authentication of clients and servers. It is also a system that was designed to prevent abuse. If you really want to, you can get an unsecured instance on the public internet. Here is one such example: http://live-test.ravendb.net

In this case, we did it intentionally, because we wanted to get this in the browser:


But the easy path? The path that we expect most users to follow? That one ends up with a secured and safe system, without showing up on the news because all your data got away from you.

time to read 4 min | 786 words

Krzysztof has been working on our RavenDB Go Client for almost a year, and we are at the final stretch (docs, tests, deployment, etc). He has written a blog post detailing the experience of porting over 50,000 lines of code from Java to Go.

I wanted to point out a few additional things about the porting effort and the Go client API that he didn’t get to.

From the perspective of RavenDB, we want to have as many clients as possible, because the more clients we have, the more approachable we are for developers. There are over a million Go developers, so that is certainly something that we want to enable. More importantly, Go is a great language for server-side work and is primarily used for just the kind of applications that can benefit from using RavenDB.

RavenDB currently has clients for:

  1. .NET / CLR – C#, VB.Net, F#, etc.
  2. JVM – Java, Kotlin, Clojure, etc.
  3. Node.js
  4. Python
  5. Go – finalization stage
  6. C++ – alpha stage

We also have a Ruby client under wraps and wouldn’t object to having a PHP one.

We used to only run on Windows and really only pay attention to the C# client. That changed toward the end of 2015, when we started the work on the 4.0 release of RavenDB. We knew that we were going to be cross platform and we knew that we were going to target additional languages and runtimes. That meant that we had to deal with a pretty tough choice.

Previously, when we had just a single client, we could do quite a lot in it. That meant that a lot of the functionality and the smarts could reside in the client. But we now have 6+ clients that we need to maintain, which means that we are in a very different position.

For reference, the RavenDB Server alone is 225 KLOC, the .NET client is 62 KLOC and the other clients are about 50 KLOC each (Linq support is quite costly for .NET, in terms of LOC and overall complexity).

One of the design guidelines for RavenDB 4.0 was that we want to move, as much as possible, responsibility from the client side to the server side. We have done a lot of stuff to make this happen, but the RavenDB client is still a pretty big chunk of code. With 50 KLOC, you can do quite a lot, so what is actually going on in there?

The RavenDB client core responsibilities are:

  • Commands on the server / documents – About 12 KLOC. This provides strongly typed access to commands, including command-specific error handling.
  • Caching, Failover & request processing – About 3 KLOC. Handles failover and recovery, topology management and the client-side portion of RavenDB's High Availability features by implementing transparent failover when a node fails. Also handles request caching as well as aggressive caching.
  • JSON handling – About 3 KLOC. Type converters, serialization helpers and other stuff related to handling JSON that we need client side.
  • Exceptions – 1.5 KLOC. Type-safe exceptions for various errors take quite a bit of code, mostly because we try hard to give good errors to the user.

But by far, the most complex part of the RavenDB client is the session. The session is the typical API you have for working with RavenDB and it is how you'll usually interact with it. A typical use of the session in the Go client is to store a document and save it to the database.
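Since the original screenshot isn't reproduced here, here is a rough sketch of that flow using the .NET client (the Go client exposes the same session concepts; the Order class and ids are placeholders for the example):

using Raven.Client.Documents;

using (var session = store.OpenSession())            // store is an initialized DocumentStore
{
    var order = new Order { Company = "companies/1-A" };
    session.Store(order);                             // unit of work: the entity is now tracked, id assigned client side
    session.SaveChanges();                            // one round trip sends all pending changes

    var same = session.Load<Order>(order.Id);         // identity map: returns the already tracked instance
    same.Company = "companies/2-A";
    session.SaveChanges();                            // change tracking sends only what actually changed
}

public class Order                                    // trivial placeholder entity
{
    public string Id { get; set; }
    public string Company { get; set; }
}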

The session is about 20 KLOC or so, by far the biggest single component that we have.

But why is it so big? Especially since I just told you that we spent a lot of time moving responsibilities away from the client.

Because the session implements a lot of really important behaviors for the client. In no particular order, and off the top of my head, we have:

  • Unit of Work
  • Change Tracking
  • Identity Map
  • Queries
  • Patching
  • Lazy operations

The surface area of RavenDB's client API is very important to me. I think that giving you a high-level API is quite important to reduce the complexity that you have to deal with and to make it easy for you to get things done. And that ends up taking quite a lot of code to implement.

The good news is that once we have a client, keeping it up to date is relatively simple. And having taken the onus of complexity upon ourselves, we free you from having to manage it. The overall experience of building applications using RavenDB is much better, to the point where you can pretty much ignore the database, because it will Just Work.

time to read 1 min | 148 words

RavenDB 4.x uses X509 certificates for authentication. We got a feedback question from a customer about that; they would much rather use API keys instead.

We actually considered this as part of the design process for 4.x and we concluded that we can make certificates work in just the same manner as API keys. Here is how you can make it work.

You take the certificate file (usually a PFX) and convert it to a Base64 string, like so:


[System.Convert]::ToBase64String( (gc "cert.pfx" -Encoding byte ) )

You can take the resulting string and store it like an API key, because that is effectively how it is treated. In your application startup, you can use:
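The original snippet isn't reproduced in this text, so here is a sketch of the idea with the .NET client: the Base64 string is stored wherever you would have kept an API key (environment variable, configuration, a vault) and is turned back into a certificate at startup. The URL and database name are illustrative:

using System;
using System.Security.Cryptography.X509Certificates;
using Raven.Client.Documents;

var certBase64 = Environment.GetEnvironmentVariable("RAVEN_CLIENT_CERT");  // the "API key"

var store = new DocumentStore
{
    Urls = new[] { "https://a.db.example.com" },
    Database = "Orders",
    Certificate = new X509Certificate2(Convert.FromBase64String(certBase64))
};
store.Initialize();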

And this is it. For all intents and purposes, you can now use the certificate as an API key.

time to read 1 min | 185 words

Last week we had a couple of interesting milestones. The first of which is that we reached the End Of Life for RavenDB 3.0. If you are still running on RavenDB 3.0 (or any previous version), be aware that this marks the end of the support cycle for that version. You are strongly encouraged to upgrade to RavenDB 3.5 (which still has about 1.5 years of support).

I got an email today from a customer talking about maybe considering upgrading from the RavenDB version that was released in Dec 2012, so I'm very familiar with slow upgrade cycles.

End of Life for 3.0 means that we no longer offer support for it. If your operations team is dragging their feet on the upgrade, please hammer this point home. We really want to see people running on at least 3.5.

The other side of the news is that the new bits for the RavenDB 4.2 Release Candidate are out. This release moves features such as Cluster Wide Transactions and Counters out of the experimental phase and introduces Graph Queries support. As usual, I would really love your feedback.

time to read 2 min | 244 words

I don't usually talk about what we call the Studio Features. This set of features impacts the RavenDB Studio and is meant to make the lives of developers and administrators of RavenDB easier. We actually spend a lot of effort on the Studio, because making features accessible and usable is considered to be a core part of the feature itself. We ran the metrics, and about 30% – 35% of the time spent building RavenDB 4.0 was spent on the Studio. We are still investing > 10% of our time and effort in continuously improving the Studio. I tend not to write about the Studio explicitly because we usually use it to showcase the server features.

This feature, on the other hand, is purely a Studio feature. Yesterday I talked about Revisions in RavenDB and was reminded that we also have a major new feature for Revisions that lives only in the Studio. Now you can diff different revisions directly in the Studio.

Here is how you invoke this feature:


When you click on the compare button, you’ll get the following screen:


You can compare a revision to the current document, or one revision to another.

This can make it very easy to see what changed between two versions of a document.

time to read 2 min | 394 words

RavenDB allows you to tune, per transaction, what level of safety you want your changes to have. At the most basic level, you can select between a single node transaction and a cluster wide transaction.

We get questions from customers about the usage scenario for each mode. It seems obvious that we really want to always have the highest level of safety for our data, so why not make sure that all the data is using cluster wide transactions at all times?

I like to answer this question with the lottery example. In the lottery example, there are two very distinct phases for the system. First, you record lottery tickets as they are purchased. Second, you run the actual lottery and select the winning numbers. (Yes, I know that this isn’t exactly how it works, bear with me for the sake of a clear example).

While I'm recording purchased lottery tickets, I always want to succeed in recording my writes. Even if there is a network failure of some kind, I never want to lose a write. It is fine if only one node accepts this write, since it will propagate the data to the rest of the cluster once communication is restored. In this case, you can use the single node transaction mode, and rely on RavenDB replication to distribute the data to the rest of the cluster. This is also the most scalable approach, since we can operate on each node separately.

However, for selecting the winning numbers (and tickets), you never want any doubt or any possibility of concurrency issues. In this case, I want to ensure that there is just one lottery winner selection transaction that actually commits, and for those purposes, I'm going to use the cluster transaction mode. In this way, we ensure that a quorum of the cluster will confirm the transaction for it to go through. This is the right thing to do for high value, low frequency operations.
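Here is a sketch of what picking each mode looks like with the .NET client (the LotteryTicket and WinnerSelection classes are made up for the example; store is an initialized DocumentStore):

using Raven.Client.Documents.Session;

// Recording a ticket: single node transaction, replication spreads it later.
using (var session = store.OpenSession())
{
    session.Store(new LotteryTicket { Numbers = new[] { 4, 8, 15, 16, 23, 42 } });
    session.SaveChanges();
}

// Selecting the winner: cluster wide transaction, a quorum must confirm the commit.
using (var session = store.OpenSession(new SessionOptions
{
    TransactionMode = TransactionMode.ClusterWide
}))
{
    session.Store(new WinnerSelection { Draw = "2019-05-18", WinningNumbers = new[] { 4, 8, 15, 16, 23, 42 } });
    session.SaveChanges();
}

public class LotteryTicket { public int[] Numbers { get; set; } }
public class WinnerSelection { public string Draw { get; set; } public int[] WinningNumbers { get; set; } }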

We also have additional settings available, beyond single node / full cluster quorum: write to a single node and wait for the transaction to propagate the write to some additional nodes. I don't really have a good analogy for this use case using the lottery example, though. Can you think of one?

time to read 4 min | 754 words

The feature outlined in this post is hidden behind a small button in a relatively obscure part of the RavenDB Studio (Database > Settings > Document Revisions). Despite its unassuming appearance, this is a pretty important feature. Revision Revert is a feature that we wish no one would ever use, though, which makes it an interesting one.

Revision Revert allows you to take your entire database back to a particular moment in time. Document changes will be undone, deleted documents will be restored, new documents will be removed, etc.


So far, this isn't a surprising feature; being able to restore to a point in time is a feature that many other databases have. How is this feature different? In most systems, a point in time restore requires you to… well, restore. In a large database, that can take a lot of time. Revision Revert is an alternative to that. Instead of restoring from scratch, it utilizes the revisions feature in RavenDB to allow you to just hit the time machine button and go back to the desired time.

The common use case for that is immediately after the "Oops" moment. You have run a query without specifying a where clause, deployed a bad version of your app that removed important fields, etc.

Revision Revert is an online operation; you don't need to take the database down. In fact, you can still serve reads while the process is going on. Since you'll typically need to go back in time only a relatively short period, this is a very quick process.

In a distributed system, the admin will invoke this process on one of the nodes in the system and the reverts will be applied on that node and then replicated from there to all the other nodes in the system. We have made every attempt to make what is likely to be a pretty stressful event as easy as possible.

You might have noticed the Window configuration in the revert screen. What is that about?

To be honest, this is something that we expect most users to never really care about. It is there for correctness’ sake in a distributed environment. Let’s dig a little deeper into this feature.

The first thing we need to talk about is time. The point in time that we'll restore to is the user's local time. This is converted into UTC internally and used to compute the cutoff point for the revert. In a distributed system, it is possible (even likely) that different machines on the network will have different clocks. (Note that while RavenDB will work just fine and do the Right Thing if your nodes have different timezones, we have found that really confusing. Better to keep all nodes on the same timezone and clock sync system.)

This means that one problem for this feature is that changes happening on two machines at the same time may have different time stamps (in UTC; the local time is not relevant). You need to take that into account when using Revision Revert, since that is what RavenDB uses to decide what stays and what goes.

The second problem is that just because two updates happened at the same time, it doesn't mean that we learned about them at the same time. What I mean here is that two changes that happened at the same time on different machines may have reached a particular node at very different times. That is where the Window option comes into play. We scan the revisions log for all changes to the system, and we scan them in the order that we learned about them. By default, we'll go back an extra 4 days, until we are sure that there aren't any revisions that we got out of order and missed.

A few additional things about this feature. Obviously, it requires that you have revisions enabled (and have enough revisions capacity to go back far enough in time, naturally). It supports live restores and operates nicely in a distributed environment. Note that if you are doing a Revision Revert and not all your documents have revisions enabled, only those that have revisions will be reverted.
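For completeness, here is a sketch of enabling revisions from the client so that Revision Revert has history to work with (the operation and class names follow my understanding of the 4.x .NET client; the numbers are arbitrary):

using System;
using Raven.Client.Documents.Operations.Revisions;

store.Maintenance.Send(new ConfigureRevisionsOperation(new RevisionsConfiguration
{
    Default = new RevisionsCollectionConfiguration
    {
        Disabled = false,
        MinimumRevisionsToKeep = 100,                     // keep enough history...
        MinimumRevisionAgeToKeep = TimeSpan.FromDays(14)  // ...to be able to revert far enough back
    }
}));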

Currently we apply this revert globally. We are considering allowing you to select specific collections to revert, but I'm not sure how useful that would be in practice.

time to read 3 min | 405 words

The most common network topology for RavenDB replication is a full mesh. For example, if you have three nodes in your cluster and a database that resides on all three nodes, you'll have a replication topology that looks like this:


This works great when the number of nodes that you have in your cluster is reasonably small. However, we recently got a customer question about a different kind of topology. They have a bunch of nodes, on the order of a few dozen, which cooperate to perform some non-trivial task. A key part of this is that the nodes are transient and identical. So a new node may pop up, live for a while (days, weeks, months) and then go away. At any given time you might have a few dozen nodes. That kind of environment won't really work with a full mesh topology. If we tried, it would look something like this (a fully connected network with 40 nodes):


This has a total of 780 connections(!) in it. You can create a topology like that, but a lot of the processing power in the network is going to be dedicated to just maintaining these connections. And you don't actually need it. RavenDB's replication algorithm is actually a gossip algorithm, and as you grow the number of nodes that take part in the replication, the fewer connections you need between nodes. In this case, we can take each of the live nodes and connect each of them to four other (random) nodes. The result would look like so:


Remember, each of the nodes is actually connected to a random four other nodes. RavenDB’s replication will ensure that a change to any document in any of the nodes under these conditions will propagate to all the other nodes efficiently.

This approach will also transparently handle any intermediate failures and be robust to nodes joining and leaving on the fly. RavenDB doesn't implement gossip membership, mostly because that is very heavily dependent on the application and deployment pattern, but once you tell a node who its neighbors are, everything will proceed on its own.
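The fan-out idea itself is simple. This isn't RavenDB's code, just a sketch of how a node could pick the four random neighbors it will replicate to:

using System;
using System.Collections.Generic;
using System.Linq;

static List<string> PickNeighbors(string self, IReadOnlyList<string> liveNodes, int fanOut = 4)
{
    var rng = new Random();
    return liveNodes
        .Where(node => node != self)     // never pick yourself
        .OrderBy(_ => rng.Next())        // shuffle the candidates
        .Take(fanOut)                    // four neighbors is plenty for a few dozen nodes
        .ToList();
}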

time to read 2 min | 334 words

I'm happy to let you know that as of last week, RavenHQ has been updated with full general availability of managed RavenDB 4.1 in the cloud.

If you aren't familiar with RavenHQ, this update gives you the ability to get a managed RavenDB cluster in minutes. In this case, you get all the usual benefits of RavenDB as well as the peace and quiet of someone else managing all the other details for you.

We have been steadily working on making RavenDB a self-managed database, but there are still things that it can't do on its own. RavenHQ closes the loop. If your database is large and is about to run out of disk space, RavenHQ can automatically increase the allocated storage, for example. A machine went down? RavenHQ will transparently replace it and allow the cluster to recover without having any impact on your client code.

If you have used RavenHQ in the past, there are important changes. Previously, you would use RavenHQ to provision a database. That has changed: you now provision a cluster of nodes and you can create as many databases as you want. If you need a test database for CI, you can just create one on the fly, with no extra charges or need to use additional APIs over the RavenDB native client.

All the usual management features of RavenDB are available as well, including the ability to extend RavenDB on the fly using index extensions and analyzers. We went over all the feedback that we got from the community and users and have done the same thing for RavenHQ that was done in the move from RavenDB 3.5 to RavenDB 4.0. Everything should be better. In the case of RavenDB, we had a rule that things should be at least 10x better, and I believe that you'll find the RavenHQ experience similar.

As always, we would really love your feedback.

time to read 4 min | 735 words

About a month ago I wrote about a particular issue that we wanted to resolve. RavenDB is using X509 certificates for authentication. These are highly secure and are a good answer for our clients who need to host sensitive information or are working in highly regulated environments. However, certificates have a problem: they expire. In particular, if you are following common industry best practices, you'll replace your certificates every 2 – 3 months. In fact, the common setup of using RavenDB with Let's Encrypt will do just that. Certificates will be replaced on the fly by RavenDB without the need for administrator involvement.

If you are running inside a single cluster, that isn’t something you need to worry about. RavenDB will coordinate the certificate update between the nodes in such a way that it won’t cause any disruption in service. However, it is pretty common in RavenDB to have multi cluster topologies. Either because you are deployed in a geo-distributed manner or because you are running using complex topologies (edge processing, multiple cooperating clusters, etc). That means that when cluster A replaces its certificate, we need to have a good story for cluster B still allowing it access, even though the certificate has changed.

I outlined our thinking in the previous post, and I got a really good hint: 13xforever suggested that we look at HPKP (HTTP Public Key Pinning) as another way to handle this. HPKP is a security technology that was widely used, ran into issues and was replaced (mostly by certificate transparency). With this hint, I started to investigate this further. Here is what I learned:

  • A certificate is composed of some metadata, the public key and the signature of the issuer (skipping a lot of stuff here, obviously).
  • Keys for certificates can be either RSA or ECDSA. In both cases, there is a 1:1 relationship between the public and private keys (in other words, each public key has exactly one private key).

Given these facts, we can rely on that to avoid the issues with certificate expiration, distributing new certificates, etc.

Whenever a cluster needs a new certificate, it will use the same private/public key pair to generate the new certificate. Because the public key is the same (and we verify that the client has the private key during the handshake), even if the certificate itself changed, we can verify that the other side knows the actual secret, the private key.

In other words, we slightly changed the trust model in RavenDB. From trusting a particular certificate, we trust that certificate’s private key. That is what grants access to RavenDB. In this way, when you update the certificate, as long as you keep the same key pair, we can still authenticate you.
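Conceptually, the check boils down to comparing public keys rather than whole certificates. A sketch of the idea (an illustration of the concept, not RavenDB's actual code):

using System.Linq;
using System.Security.Cryptography.X509Certificates;

static bool SameKeyPair(X509Certificate2 trusted, X509Certificate2 presented)
{
    // Same public key means the presenter holds the same private key we already trust,
    // even if the certificate itself (expiry, serial, issuer signature) was re-issued.
    return trusted.GetPublicKey().SequenceEqual(presented.GetPublicKey());
}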

This feature means that you can drastically reduce the amount of work that an admin has to do, and it leads to a system that you set up once and that just keeps working.

There are some fine details that we still had to deal with, of course. An admin may issue a certificate and want it to expire, so just having the user re-generate a new certificate with the same private key isn't really going to work for us. Instead, RavenDB validates that the chain of signatures on the certificate is the same. Actually, to be more exact, it verifies that the original (trusted by the admin) certificate and the new certificate that was just presented to us are signed by the same chain of public key hashes.

In this way, if the original issuer gave you a new certificate, it will just work. If you generate a new certificate on your own with the same key pair, we’ll reject that. The model that we have in mind here is trusting a driver’s license. If you have an updated driver’s license from the same source, that is considered just as valid as the original one on file. If the driver license is from Toys R Us, not so much.

Naturally, all such automatic certificate updates are going to be logged to the audit log, and we’ll show the updated certificates in the management studio as well.

As usual, we welcome your feedback; the previous version of this post got us a great feature, after all.
