Ayende @ Rahien

My name is Oren Eini
Founder of Hibernating Rhinos LTD and RavenDB.

RavenDB 4.0: Data subscriptions, Part II

time to read 3 min | 493 words

In my previous post I introduced data subscriptions and mentioned that there is more to them than just being able to get a push based, reliable stream of documents as they change. The problem is that RavenDB will send us the current document, and if the document has been modified multiple times quickly enough, we’ll only get it once. What is more, we are getting the document in our client code, but all we know is that it was changed, not what changed.

With RavenDB 4.0 we now have versioned subscriptions, working alongside the versioning feature. First, we define that a particular collection will have versioning enabled:


And now we can make use of versioned subscriptions.

In this case, you can see that we make use of the Versioned<User> type, which indicates to RavenDB that we are interested in a versioned subscription. Instead of sending us just the modified document, we’ll get both the Previous and Current version of the document. In fact, we’ll be called with the Previous / Current version of the document on each change. You might have noticed the null checks in the subscription code. This is because when a document is created, we’ll get it with a null Previous value, and when a document is deleted, we’ll get it with a null Current value. If the document has been deleted and recreated, we’ll be called first with a Previous instance and a null Current, and then with a null Previous and a Current instance.
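To make the Previous / Current semantics concrete, here is a minimal Python sketch. It is not the RavenDB client API; the `versioned_pairs` helper and the sample `history` list are hypothetical and only model the order in which the pairs arrive, including the null (here `None`) cases for create and delete.

```python
from typing import Iterator, List, Optional, Tuple

# Hypothetical revision log for a single document id: each entry is the
# document state after a change, and None marks a delete.
Revision = Optional[dict]


def versioned_pairs(revisions: List[Revision]) -> Iterator[Tuple[Revision, Revision]]:
    """Yield (previous, current) for every change, in the order it happened."""
    previous = None
    for current in revisions:
        if previous is None and current is None:
            continue  # nothing existed before or after, so nothing changed
        yield previous, current
        previous = current


history = [
    {"name": "oren", "banned": False},  # created
    {"name": "oren", "banned": True},   # modified
    None,                               # deleted
    {"name": "oren", "banned": False},  # recreated
]

pairs = list(versioned_pairs(history))
for previous, current in pairs:
    if previous is None:
        print("created:", current)
    elif current is None:
        print("deleted:", previous)
    else:
        print("changed:", previous, "->", current)
```

The delete followed by the recreate shows up as two separate calls, which is why the handler needs both null checks.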

In other words, you are now able to track the entire history of a document inside RavenDB, and make decisions based on that. Like regular subscriptions, we have the ability to script a lot of the logic, like so:

What this subscription will do is analyze all the changes on a user, and then send us the user document as it was at the point the user was banned.
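The logic of such a script can be sketched in Python as follows. This is only a model of the decision being made, not RavenDB script syntax; the `banned` field name and the `became_banned` helper are assumptions made for illustration.

```python
def became_banned(previous, current):
    """Return the document to send if this change is the one that banned the user."""
    if current is None or not current.get("banned"):
        return None  # deleted, or not banned after the change
    if previous is not None and previous.get("banned"):
        return None  # was already banned, so this change is something else
    return current


# Hypothetical (previous, current) pairs for one user, in the order they happened.
changes = [
    (None, {"name": "oren", "banned": False}),
    ({"name": "oren", "banned": False}, {"name": "oren", "banned": True}),
    ({"name": "oren", "banned": True}, {"name": "oren e.", "banned": True}),
    ({"name": "oren e.", "banned": True}, None),
]

sent = [doc for doc in (became_banned(p, c) for p, c in changes) if doc is not None]
```

Only the second change ends up in `sent`, since it is the only one where the user went from not banned to banned.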

It is important to note that this doesn’t require you to be monitoring the subscription as it happens. You can do this at any point, and you’ll get the historical data. For that matter, this is also a highly available solution. If a client goes down, it (or another client) can resume from where it left off, and if the server goes down, the client will transparently be able to failover to a replica without any user or admin involvement, running from where it left off.

We have only started looking into the implications of this feature, but the potential for analytics on the changes is already quite obvious. We are going to send you the data in the order it was generated, so you can build a model of changes as it makes sense in your domain, without having to go to the trouble of manually keeping track of everything.

RavenDB 4.0: Data subscriptions, Part I

time to read 3 min | 516 words

I’ll be talking about this feature more once the UI for it is complete, but this feature just landed in our v4.0 branch and it is so awesome that I can’t help talking about it right away.

In RavenDB 4.0 we have taken the idea of subscriptions and pushed it up a few notches. Data subscriptions give you a reliable, push based method to get documents from RavenDB. You set up a subscription, then you open it, and RavenDB will stream to you all the documents that are relevant to your subscription. New documents will be sent immediately, and failures are handled and retried automatically. Subscriptions are a great way to build all sorts of background jobs.

In RavenDB 3.x their main strength was that they gave you a reliable push based stream of documents, but in RavenDB 4.0, we decided that we want more. Let us take it in stages. Here is the most basic subscription usage I can think of:

This is subscribing to all User documents. RavenDB will first go through all the User documents, sending them to us, and then keep the connection alive and send us the document whenever a User document is updated. Note that we aren’t talking about just a one time thing. If I modify a document once an hour, I’ll be getting a notification on each change. That allows us to hook this up to jobs, analytics, etc.

The really fun thing here is that this is resilient to failure. If the client maintaining the subscription goes down, it can reconnect and resume from where it left off. Or another client can take over the subscription and continue processing the documents. In RavenDB 4.0, we now also have high availability subscriptions. That means that if a server goes down, the client will simply reconnect to a sibling node and continue operating normally, with no interruption in service.
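The "resume from where it left off" behavior can be modeled with a cursor that only advances after a document was handled successfully. The Python below is a rough sketch under that assumption; `SubscriptionState` and `run_worker` are hypothetical names, and the real protocol between client and server is not shown in this post.

```python
class SubscriptionState:
    """Cursor kept outside the worker: how many changes were acknowledged."""

    def __init__(self) -> None:
        self.acked = 0


def run_worker(changes, state, handle, crash_after=None):
    """Process changes from the last acknowledged position onward."""
    processed = 0
    for position in range(state.acked, len(changes)):
        if crash_after is not None and processed == crash_after:
            raise ConnectionError("worker went down")
        handle(changes[position])
        state.acked = position + 1  # acknowledge only after successful handling
        processed += 1


changes = ["users/1", "users/2", "users/3", "users/4"]
state = SubscriptionState()
seen = []

try:
    run_worker(changes, state, seen.append, crash_after=2)  # first client dies midway
except ConnectionError:
    pass

run_worker(changes, state, seen.append)  # same or another client takes over
```

Because the cursor lives with the subscription rather than with the worker, it does not matter whether the original client or a different one picks up the work.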

But you aren’t limited to just blindly getting all the documents in a collection. You can apply a filter, like so:

In this manner, we’ll now only get notified about active users, not all of them. This filtering allows you to handle some really complex scenarios. If you want to apply logic to the stream of changed documents, you can, getting back only the documents that match whatever logic you have in your script.

But the script can do more than just filter, it can also transform. Let us say that we want to get all the active users, but we don’t need the full document (which may be pretty big), we just want a few fields from it.

In this manner, you can select just the right documents, and just the right values you need from the document and process them in your subscription code.
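As a rough Python model of the filter plus transform idea (not the actual server side script, which isn’t shown here), a single function can both decide whether a document is relevant and pick the fields to send. The `active`, `id` and `email` field names are assumptions.

```python
def active_user_projection(doc):
    """Return a small projection for active users, or None to skip the document."""
    if not doc.get("active"):
        return None  # filtered out, nothing is sent to the subscriber
    # Send only the fields the subscriber needs, not the full document.
    return {"id": doc["id"], "email": doc["email"]}


users = [
    {"id": "users/1", "email": "a@example.com", "active": True, "bio": "x" * 1000},
    {"id": "users/2", "email": "b@example.com", "active": False, "bio": "y" * 1000},
    {"id": "users/3", "email": "c@example.com", "active": True, "bio": "z" * 1000},
]

batch = [p for p in map(active_user_projection, users) if p is not None]
```

The subscriber receives two small projections instead of three full documents.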

There is actually more, which I’ll post in the next post, but I’m so excited about this feature that I’m not even going to wait for the next publishing schedule and push this immediately. You can read it here.

Migration strategies considerations for Dev -> UAT -> Production

time to read 5 min | 866 words

Part of the reason for RavenDB was that I wanted a database that actually took into account how it is being used, and provided good support for common usage scenarios. Making the process of moving between Dev -> UAT -> Production easier is a big part of that.

Typically, databases don’t handle that directly, but let the users figure it out. You can see the plethora of SQL Schema deploying and versioning options that you have to deal with.

With RavenDB, for indexes in particular, we made the process very easy. Indexes are defined in code, deployed alongside the application and are versioned in the exact same manner, in the exact same system.

But as the feature set of RavenDB grows, we need to consider the deployment scenario in additional places. We recently started talking about the development cycle of ETL processes, Subscriptions, Backups and external replication. The last two are fairly rare in development / UAT scenarios, so we’ll ignore them for now. They are typically only ever set up and used in production. Sometimes you test them on a dedicated instance, but it doesn’t make sense to deploy a backup configuration in most cases. External replication is basically just destination + credentials, so there isn’t really all that much to track or deploy.

ETL Processes and Subscriptions, on the other hand, can contain quite a bit of logic in them. An ETL process that feeds into a reporting database might be composed of several distinct pieces, each of them feeding some part of the data to the reporting db. If the reporting needs change, we’ll likely need to update the ETL process as well, which means that we need to consider exactly how we’ll do that. Ideally, we want a developer to be able to start working on the ETL process on their own machine, completely isolated. Once they are done working, they can check their work into the code repository and move on to other tasks. At some future time, this code will get deployed, which will set up the right ETL process in production.

That is a really nice story, and it is how we are dealing with indexes, but it doesn’t actually work for ETL processes. The problem is that ETL is typically not the purview of the application developer. It is in the hands of the operations team, or maybe it is owned by the team that owns the reports. Furthermore, changes to the ETL process are pretty common and typically happen outside the release cycle of the application itself. That means that we can’t tie this behavior to the code. Unlike indexes, which have a pretty tight integration with the code that is using them, ETL is a background kind of operation, with little direct impact on the application. Even with indexes, we have measures in place that lock the index definition, so an administrator can update the index definition on the fly without the application overwriting it with the old version of the index.

Subscriptions are more of a middle ground. A subscription is composed of a client side application that processes the data and some server side logic related to filtering and shaping it. On the one hand, it makes a lot of sense for the subscribing application to control its subscription, but an admin who wants to update the subscription definition is a very likely scenario, maybe as a result of a data change or new input from the business. We can update the server side code without re-deployment, and that is usually a good idea.

To make matters a bit more complex, we also have to consider secrets management. ETL processes, in particular, can contain sensitive information (connection strings). So we need to figure out a way to have the connection string, but not have the connection string :-). In other words, if I write a new ETL process and deploy it to production, I need to be sure that I don’t need to remember to update the connection string from my local machine to the production database. Or, much worse, if I’m taking the ETL from production, I don’t want to accidentally also get the production connection string. That means that we need to use named connection strings, and rely on the developer / admin to set them up properly across environments.
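A minimal sketch of the named connection string idea, in Python. The definition format, the `connection_string_name` key and the environment table are all hypothetical; the point is only that the exported definition carries a name, and each environment resolves that name to its own secret.

```python
import json

# Hypothetical ETL definition: it refers to a connection string by name only.
etl_definition = {
    "name": "orders-to-reporting",
    "connection_string_name": "reporting-db",
}

# Each environment keeps its own value for that name (assumed values).
environments = {
    "dev": {"reporting-db": "Server=localhost;Database=reports_dev"},
    "production": {"reporting-db": "Server=reports.internal;Database=reports"},
}


def resolve(definition: dict, environment: str) -> str:
    """Look up the actual connection string in the target environment."""
    named = environments[environment]
    name = definition["connection_string_name"]
    if name not in named:
        raise KeyError(f"connection string {name!r} is not defined in {environment}")
    return named[name]


# Moving the definition between servers never moves the secret itself.
exported = json.dumps(etl_definition)
imported = json.loads(exported)
dev_connection = resolve(imported, "dev")
```

Exporting from production and importing into dev gives you the dev connection string, because the secret was never part of the exported JSON.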

I would really appreciate any feedback you have about how to handle this scenario.

Both ETL processes and Subscriptions are just JSON documents of not too much complexity, so actually moving them around between servers isn’t hard, it is the process of doing so that we are trying to flesh out. I should also mention that we are probably just going to make sure that there is a process to handle that, not something that is mandated, because some companies have very different deployment models that we need to fit into. This is us trying to think about the best way to handle the most common scenario with as little friction as possible.

Inside RavenDB 4.0: Chapter 3 is done

time to read 1 min | 96 words

I have just completed writing the 3rd chapter for Inside RavenDB 4.0, and the full book is available here (all three chapters of it). It is just under 100 pages in size, and I’m afraid that at this rate this is going to be a massive tome.

The content so far covers setting up RavenDB 4.0 and doing basic CRUD operations, then goes on to deal with modeling documents inside RavenDB.

Any and all feedback is welcome.

RavenDB 4.0 Licensing & Pricing

time to read 4 min | 772 words

Let us start with the biggest news. We are offering RavenDB 4.0 FREE for production usage for single node deployments.

In addition to that, the license for the RavenDB Client APIs is going to change from AGPL to MIT. We currently have clients for .NET, JVM, Python and Node.JS, with Go and Ruby clients coming (hopefully by the time we hit RC). All of which will be available under the MIT license. The server license will remain AGPL / Commercial, with a Community deployment option that will come free of charge. The Community Edition is fully functional on a single node.

You can use this edition to deploy production systems without needing to purchase a license, completely FREE and unencumbered.

We have also decided to do somewhat of a shakeup in the way we split features among editions, primarily by moving features down the scale, so features that used to be Enterprise only are now more widely available. The commercial edition will be available as either Professional or Enterprise. For our current subscribers, we are very happy that you’ve been with us for all this time, and we want to assure you that as long as your subscription is valid, the pricing for it will stay as is.

Both Professional and Enterprise Editions are going to offer clustering, offsite backups, replication and ETL processes to SQL and RavenDB databases. The Enterprise edition also offers full database encryption, automatic division of work between the nodes in the cluster (including failover & automatic recovery), snapshot backups and SNMP integration among other things. Commercial support (including 24x7 production) will be available for Professional and Enterprise Editions.

Since I’m pretty sure that what you are most interested in is the pricing information and the feature matrix, here it is in an easy to digest form.







Price: $749 per core* (Professional) | $1,319 per core* (Enterprise)
Up to 4 | Up to 32
Up to 6 GB | Up to 64 GB
Cluster Size: Single node | Up to 5
Community only
Backups: Local full backups | Full & incremental, local / remote / cloud | Full & incremental, local / remote / cloud
Full database snapshot (1)
Tasks distribution (2) (backups, ETL, subscriptions): Single node
Highly available tasks (3)
Highly available databases & automatic failover
ETL Support: SQL & RavenDB | SQL & RavenDB
Full database encryption
Client authentication via




* To save you the math, a 6 core server with Professional edition will be about $4,494 and with Enterprise edition about $7,914.

  1. Snapshots capture the database state and reduce restore time on large databases, at the expense of disk space. Snapshots can work together with incremental backups to get point in time recovery.
  2. Tasks (such as backups, ETL processes, subscriptions, updating new nodes, etc) are assigned to a specific node dynamically, to spread load fairly in the cluster.
  3. When a node responsible for a task goes down, the cluster will automatically re-assign that task without interruption in service.
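The per core footnote is simple multiplication, which can be checked with a few lines of Python. The prices are the ones quoted above; the `license_cost` helper is just an illustration.

```python
# Per core prices as quoted in the matrix above.
PRICE_PER_CORE = {"Professional": 749, "Enterprise": 1319}


def license_cost(edition: str, cores: int) -> int:
    """Price for a server with the given number of cores."""
    return PRICE_PER_CORE[edition] * cores


professional_six_cores = license_cost("Professional", 6)  # 4494
enterprise_six_cores = license_cost("Enterprise", 6)      # 7914
```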

Task distribution and failover are best explained via a concrete example. We have a database with 3 nodes, and we define an ETL process to send some of the data to a reporting database. That work will be dynamically assigned to a node in the cluster, and balanced with all the other work that the cluster needs to do. For Enterprise Edition, if the node that was assigned that task fails, the cluster will automatically transfer all such tasks to a new node until the failed node recovers.
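Here is a small Python sketch of that example. The "least loaded node wins" rule is an assumption made for illustration; the post only says that work is spread fairly and moved on failure, not how the cluster actually decides.

```python
def least_loaded(load: dict) -> str:
    """Pick the node with the fewest tasks, breaking ties by node name."""
    return min(sorted(load), key=lambda node: len(load[node]))


def assign(tasks, nodes) -> dict:
    """Spread tasks over the nodes, one at a time."""
    load = {node: [] for node in nodes}
    for task in tasks:
        load[least_loaded(load)].append(task)
    return load


def fail_over(load: dict, failed: str) -> dict:
    """Move the tasks of a failed node to the remaining nodes."""
    remaining = {node: list(tasks) for node, tasks in load.items() if node != failed}
    for task in load[failed]:
        remaining[least_loaded(remaining)].append(task)
    return remaining


cluster = assign(["etl", "backup", "subscription"], ["A", "B", "C"])
after_failure = fail_over(cluster, "B")
```

With three tasks and three nodes each node gets one task, and when node B goes down its backup task lands on one of the two survivors.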

The new model comes with a significant performance boost, all the features that I mentioned earlier and multi-platform support. But we are not forgetting about newcomers, small enterprises and individual clients. For those we have introduced a Community version, a completely FREE license that should suit their needs.

Again, for all existing subscribers, we assure you that while your subscription is valid, its pricing will stay as is. In fact, given that we will grandfather all existing subscriptions at the current price point, and you can purchase a subscription now, before the official release of 4.0, you have a nice arbitrage option available now.

The beta release represents over two years of work by the RavenDB Core Team to bring you a top of the line database that is fast, safe and easy to use. It is chock-full of features, to the point where I don’t know where to start blogging about some of the cool stuff that we do (don’t worry, those posts are coming).

RavenDB 4.0 Beta1 is now available

time to read 4 min | 671 words

I’m really proud to announce that we have released the RavenDB 4.0 beta. It has been a bit over six months or so since the alpha release (I can’t believe it has been that long), and we have been working hard on completing the feature set that we want to have for the final release. We are now ready to unveil RavenDB 4.0 and show you what we can do with it.

Since the alpha release, we completed the refactoring of the clustering infrastructure, so a RavenDB node is always in a cluster (even if just a single one), added attachments, full database encryption and subscriptions, improved indexing performance and performance in general, and did a lot of work to make the database more accessible and easier to use. Cluster scenarios are easier and more robust. You can now take a database and span it across multiple nodes at will, and you get the usual RavenDB safety, reliability and fault tolerance.

Users coming from the RavenDB 3.x version will notice (immediately after the studio theme) that everything is much faster. Our internal testing shows anywhere between 10x and 100x improvement in speed over the previous version.

RavenDB is now capable of handling over 100K req/second for writes (that is over a hundred thousand requests per second), and much more for reads, with the caveat that we always hit the network capacity before we hit RavenDB’s capacity. That is per node, so as you scale the number of nodes in the cluster, you also scale the amount of reads and writes you can handle.

RavenDB 4.0 also comes with a much smarter query optimizer, allowing it to generate optimal queries for aggregation in addition to simple queries. And map/reduce in general has been heavily reworked to make it much faster and more responsive.

You can download the RavenDB 4.0 beta from our download page, and the NuGet package is available by running:

Install-Package RavenDB.Client -Version 4.0.0-beta-40014 -Source https://www.myget.org/F/ravendb/api/v3/index.json

You can run the beta on Windows, Linux, Raspberry PI and Docker, and you can access the live test instance to check it out.

Known issues include:

  • Identities aren’t exported / imported
  • Cluster operations sometimes stall and timeout.
  • The studio currently assumes that the user is administrator.
  • Highlighting & spatial searches aren’t supported.
  • Deployment in hardened environment is awkward.
  • Upgrading from previous versions is only possible via smuggler.
  • Custom analyzers are not currently supported.

If you are upgrading from 3.x or the 4.0 Alpha, you’ll need to export the data and import it again. Automatic upgrades will be part of the RC.

Please remember, this is a beta. I included some of the known issues in this post to make sure that you remember that. We expect users to start developing with RavenDB 4.0 from the beta, and API and behavior are pretty fixed now. But you shouldn’t be deploying to production with this, and you should be aware that upgrading from the beta to RC is going to be done using smuggler as well.

We’ll likely have a few beta releases on the road to RC, fixing all the issues that will pop up along the way. Your feedback is crucial at this stage, since subjecting RavenDB to real world conditions is the only way to battle test it.

In the studio, you can use the feedback icon to send us your immediate feedback, and there is also the issue tracker.


Take RavenDB for a spin, setup a node, setup a cluster, see how it all works together. If we did our job right, you should be able to figure out everything on your own, which is good, because the docs are still TBD. Feedback on bugs and issues is important, but I’m also very interested in feedback on the flow. How easy it is to do things, deploy, setup, connect from the client, etc.

Distributed work in RavenDB 4.0

time to read 4 min | 660 words

I talked about the new clustering mode in RavenDB 4.0 a few days ago. I realized shortly afterward that I didn’t explain a crucial factor. RavenDB has several layers of distributed work.

At the node level, all nodes are (always) part of a cluster. In some cases, it may be a cluster that is a single node, but in all cases, a node is part of a cluster. Clusters form a consensus between all the nodes (using Raft), and all cluster wide operations go through Raft.

Cluster wide operations are things like creating a new database, or assigning a new node to the database and other things that are obviously cluster wide related. But a lot of other operations are also handled in this manner. Creating an index goes through Raft, for example. And so do high availability subscriptions and backup information. The idea is that the cluster holds the state of its databases, and all such state flows through Raft.

A database can reside on multiple nodes, and we typically call that a database group (to distinguish from the cluster as a whole). Data written to the database does not go out over Raft. Instead, we use a multi master distributed mesh to replicate all data (documents, attachments, revisions, etc) between the different nodes in the database. Why is that?

The logic that guides us is simple. Cluster wide operations happen a lot less often and require a lot more resiliency to operate properly. In particular, not doing consensus resulted in having to deal with potential conflicting changes, which was a PITA. On the other hand, common operations such as document writes tend to have a lot more stringent latency requirements, and what is more, we want to be able to accept writes even in the presence of failure. Consider a network split in a 3 node cluster. Even though we cannot make modifications to the cluster state on the side with the single node, we are still able to accept and process write and read requests. When the split heals, we can merge all the changes between the nodes, potentially generating (and resolving) conflicts as needed.

The basic idea is that for data that is stored in the database, we will always accept the write, because it is too important to let it just go poof. But for data about the database, we will ensure that we have a consensus for it, since almost all such operations are admin based and repeatable.

Those two modes end up creating an interesting environment. At the admin level, they can work with the cluster and be sure that their changes are properly applied cluster wide. At the database level, each node will always accept writes to the database and distribute them across the cluster in a multi master fashion. A client can choose to accept a write to a single node or to a particular number of nodes before considering a write successful, but even with network splits, we can still remain up and functioning.
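The difference between the two modes can be sketched in a few lines of Python. This is a toy model, not RavenDB’s implementation: `has_majority` stands in for the consensus requirement, and the "higher version wins" merge is a placeholder assumption, since the actual conflict resolution is not described here.

```python
def has_majority(reachable_nodes: int, cluster_size: int) -> bool:
    """Cluster wide operations need more than half of the nodes to agree."""
    return reachable_nodes > cluster_size // 2


class Node:
    def __init__(self, name: str) -> None:
        self.name = name
        self.docs = {}  # doc id -> (version, value)

    def write(self, doc_id: str, value: str, version: int) -> None:
        # Document writes are always accepted locally, split or not.
        self.docs[doc_id] = (version, value)


def merge(left: Node, right: Node) -> None:
    """Exchange documents once the split heals; the higher version wins here."""
    for doc_id in set(left.docs) | set(right.docs):
        candidates = [d for d in (left.docs.get(doc_id), right.docs.get(doc_id)) if d is not None]
        winner = max(candidates, key=lambda d: d[0])
        left.docs[doc_id] = winner
        right.docs[doc_id] = winner


# A 3 node cluster splits into a side with two nodes and a side with one.
majority_side, isolated = Node("A"), Node("C")
can_change_cluster_on_isolated = has_majority(1, 3)
can_change_cluster_on_majority = has_majority(2, 3)

isolated.write("users/1", "written during the split", version=1)
majority_side.write("users/2", "written on the majority side", version=1)
merge(majority_side, isolated)
```

The isolated node cannot change cluster state, but its document write is accepted and shows up on the other side once the split heals.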

A database group has multiple nodes, and all of them are setup to replicate to one another in master/master setup as well as distribute whatever work is required of the database group (backups, ETL, etc). What about master/slave setups?

We have the notion of adding an outgoing only connection to a database group. One of the nodes in the cluster will take ownership of that connection and replicate all data to it. That allows you to get master/slave, but we’ll not failover to the other node, only to a node inside our own database group. Typical reasons for such a scenario are wanting a remote offsite node, or needing to run complex / expensive operations on the system and wanting to split that work away from the database group entirely.

Artificial documents in RavenDB 4.0

time to read 2 min | 228 words

Artificial documents are a really interesting feature. They allow you to define an index, and specify that the result of the index will be… documents as well.

Let us consider the following index, running on the Northwind dataset.


We can ask RavenDB to output the result of this index to a collection, in addition to the normal indexing. This is done in the following manner:


And you can see the result here:


The question here is, what is the point? Don’t we already have the exact same data indexed and available as the result of the map/reduce index? Why store it twice?

The answer is quite simple. With the output of the index going into documents, we can now define additional indexes on top of them, which gives us the option to very easily create recursive map/reduce operations. So you can do daily/monthly/yearly summaries very cheaply. We can also apply all the usual operations on documents (subscriptions and ETL processes come to mind immediately). That gives you a lot of power, without incurring a high complexity overhead.
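The recursive map/reduce idea can be illustrated outside of RavenDB with plain Python. In this sketch each level consumes the output of the previous one, the way an index over artificial documents would; the `reduce_by` helper and the sample orders are made up for the example.

```python
from collections import defaultdict


def reduce_by(records, key_of):
    """Group records by a key and sum their totals, like a reduce step."""
    totals = defaultdict(int)
    for record in records:
        totals[key_of(record)] += record["total"]
    return [{"period": period, "total": total} for period, total in sorted(totals.items())]


orders = [
    {"date": "2016-12-31", "total": 3},
    {"date": "2017-05-30", "total": 10},
    {"date": "2017-05-30", "total": 5},
    {"date": "2017-06-01", "total": 7},
]

# First level: reduce the raw orders into daily summaries.
daily = reduce_by([{"period": o["date"], "total": o["total"]} for o in orders],
                  lambda r: r["period"])
# Each further level runs on the output of the previous one, not on the orders.
monthly = reduce_by(daily, lambda r: r["period"][:7])
yearly = reduce_by(monthly, lambda r: r["period"][:4])
```

The yearly summary never touches the raw orders, which is what makes the higher level summaries cheap.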

Clustering in RavenDB 4.0

time to read 5 min | 855 words

This week or early next week, we’ll have the RavenDB 4.0 beta out. I’m really excited about this release, because it finalizes a lot of our work for the past two years. In the alpha version, we were able to show off some major performance improvements and a few hints of the things that we had planned, but it was still at the infrastructure stage. Now we are talking about unveiling almost all of our new functionality and design.

The most obvious change you’ll see is that we made a fundamental change in how we handle clustering. In prior versions of RavenDB, clusters were created by connecting together database instances running on independent nodes. In RavenDB 4.0, each node is always a member of a cluster, and databases are distributed among those nodes. That sounds like a small distinction, but it completely reversed how we approach distributed work.

Let us consider three nodes that form a RavenDB cluster in RavenDB 3.x. Each database in RavenDB 3.x is an independent entity. You can setup replication between different databases and out of the cooperation of the different nodes and some client side help, we get robust high availability and failover. However, there is a lot of work that you need to do on all the nodes (setting up master/master between all the nodes on each can grow very tedious). And while you get high availability for reads and writes, you don’t get that for other tasks in the database.

Let us see how this works in RavenDB 4.0, shall we? The first thing we need to do is to spin up 3 nodes.


As you can see, we have three nodes, and Node A has been selected as the leader. To simplify things for ourselves, we just assign arbitrary letters to the nodes. That allows us to refer to them as Node A, Node B, etc., instead of something like WIN-MC2B0FG64GR. We also expose this information directly in the browser.


Once the cluster has been created, we can create a database, and when we do that, we can either specify what the replication factor should be, or manually control what nodes this database will be on.



Creating this database will create it on both A and C, but it will do a bit more than that. Those aren’t independent databases that are hooked together. This is actually the same database, running on two different nodes. I created the sample data on Node C, and this is what I see when I look on Node A.


We can see that the data (indexes and documents) has been replicated. Now, let us see how we can work with this database:

You might notice that this looks almost exactly like how you would use RavenDB 3.x. And you are correct, but there are some important differences. Instead of specifying a single server url, you can now specify several. And the actual url we provided doesn’t make any sense at all. We are pointing it to Node B, running on port 8081. However, that node doesn’t have the Northwind database. That is another important change. We can now go to any node in the cluster and ask for the topology of any database, and we’ll get the current database topology to use.

That makes it much simpler to work in a clustered environment. You can bring in additional nodes without having to update any configuration, and mix and match the topology of databases in the cluster freely.
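A toy Python model of that lookup is below. It is not the client API: the node table, the ports for Node A and Node C, and the function names are all assumptions (only Node B on port 8081 comes from the text above). It only shows that a seed node which doesn’t host the database can still answer the topology question.

```python
# Hypothetical cluster state, known to every node in the cluster.
NODE_URLS = {
    "A": "http://localhost:8080",  # assumed port
    "B": "http://localhost:8081",
    "C": "http://localhost:8082",  # assumed port
}
DATABASE_TOPOLOGY = {"Northwind": ["A", "C"]}  # Node B does not host Northwind


def topology_from(seed_url: str, database: str) -> list:
    """Ask any cluster member where a database lives."""
    if seed_url not in NODE_URLS.values():
        raise ConnectionError(f"{seed_url} is not reachable")
    return [NODE_URLS[tag] for tag in DATABASE_TOPOLOGY[database]]


def pick_node(seed_urls: list, database: str) -> str:
    """Try the configured urls in order and return a node that hosts the database."""
    for seed in seed_urls:
        try:
            return topology_from(seed, database)[0]
        except ConnectionError:
            continue  # try the next configured url
    raise ConnectionError("none of the configured urls could be reached")


target = pick_node(["http://localhost:8081"], "Northwind")
```

The client was configured with Node B only, yet ends up talking to a node that actually has the database.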

Another aspect of this behavior is the notion of database tasks. Here are a few of them.


Those are tasks (looks like we need to update the icon for backup) that are defined at the database level, and they are spread over all the nodes in the database automatically. So if we defined an ETL task and a scheduled backup, we’ll typically see one node handling the backups and another handling the ETL. If there is a failure, the cluster will notice that and redistribute the work transparently.

We can also extend the database to additional nodes, and the cluster will setup the database on the new node, transfer all the data to it (by assigning a node to replicate all the data to the new node), wait until all the data and indexing is done and only then bring it up as a full fledged member of the database, available for failover and for handling all the routine tasks.

The idea is that you don’t work with each node independently, but the cluster as a whole. You can then define a database on the cluster, and the rest is managed for you. The topology, the tasks, failover and client integration, the works.

The pain of HTTPS

time to read 4 min | 652 words

A few weeks ago we started looking into what it would take to run RavenDB 4.0 over HTTPS.

Oh, not the actual mechanics of that, we had that covered a long time ago, and pretty much everything worked as expected. No, the problem that we set out to solve was whether we could get RavenDB to Just Work over HTTPS without requiring the admin to jump through hoops. Basically, what I really wanted was a way to just spin up the server and have it running on HTTPS by default.

That turned out to be a lot harder than I wished it would be.

HTTPS has two very distinct goals:

  • To encrypt communication between two parties.
  • To ensure that the site you visited is actually the site you thought you visited.

The first portion can be handled by generating the certificate yourself, and the communication between client & server would be encrypted. So far so good, but the second portion is probably more important. If my communication with ThisIsNotPayPal.com is encrypted, that doesn’t really help me all that much, I’m afraid.

Verifying who you are is a very important component of HTTPS, and that is something that we can’t just ignore. Well, technically speaking I guess that RavenDB could have installed a root CA into the system during installation, but the mere thought of doing that is giving me pause, so I really don’t want to try and do that.

And without doing that, we can’t really support HTTPS. Remember that things like Let’s Encrypt won’t work here. RavenDB is often deployed on closed networks, without a publicly visible domain to run on. My RavenDB is running on oren-pc.hrhinos.local, for example, and I think you’ll find that it is a bit hard to get a Let’s Encrypt certificate for this.

So we can’t just magically get a certificate and have it work.

While I wish there was a way to just have encryption over the wire, without validation of identity, that would be pretty pointless with such things as man in the middle attacks.

So what do we do in RavenDB 4.0 with regards to HTTPS?

We rely on the admin (shocking, I know). They can either generate a self signed certificate and trust it (a matter of a few shell commands on any platform) or use their organization’s certificate (either trusted internally or externally obtained). RavenDB doesn’t care about that, but if you provide a certificate, it will ensure that all communication is SSL encrypted.

The client API exposes a method that lets you control certificate validation, which makes it easier if you need to customize the authentication policy. On the server side, however, we do things differently. Instead of letting the user configure trust policies for certificates, we decided to ignore the issue completely. Or, to be rather more exact, to specify that RavenDB is going to lean on the operating system for such decisions. A simple scenario is an administrator who defines a cluster of servers and generates self signed certificate(s) for them to use. The administrator needs to make sure that the certificate(s) in question are trusted by all nodes in the cluster. RavenDB will refuse to connect over HTTPS to an untrusted source.

Yes, I’m aware of all the horrible things that this can do (certificate expiration kills the system, for example), but we couldn’t think of any way where not doing this wouldn’t result in even worse situations.

RavenDB has support for encrypted databases, but we don’t allow them to be accessed from non secured connections, or to connect to non secure destinations. So the data is encrypted at rest and over the wire, and the admin is responsible for making sure that the certs are up to date and valid (or at least trusted by the machines in question).
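The "lean on the operating system" policy has a direct analogue in most TLS libraries. As an illustration only (RavenDB is not written in Python, and this is not its code), here is how the same stance looks with Python’s standard `ssl` module: the default context verifies certificates against the system trust store, and a self signed certificate has to be explicitly trusted. The `strict_client_context` helper and its `extra_ca_file` parameter are names made up for this sketch.

```python
import ssl


def strict_client_context(extra_ca_file=None) -> ssl.SSLContext:
    """Build a TLS context that refuses untrusted servers."""
    # The default context loads the system trust store and turns on
    # certificate and hostname verification.
    context = ssl.create_default_context()
    if extra_ca_file is not None:
        # Explicitly trust an additional certificate, e.g. a self signed
        # one that the administrator generated for the cluster.
        context.load_verify_locations(cafile=extra_ca_file)
    return context


context = strict_client_context()
```

A connection made with this context to a server presenting an untrusted certificate fails during the handshake, which mirrors the refusal described above.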

