Oren Eini

CEO of RavenDB

a NoSQL Open Source Document Database

time to read 1 min | 98 words

And now the book is another big step closer to actually being completed. All editing has been completed, and we did a full pass through the book. All the content is written and there isn’t much left to do at all.

We are now sending this for production work, and once that is done, I can announce this project complete. Of course, by that time, I’ll have to start writing about the new features in RavenDB 4.1, but that is a story for another day.

You can get the updated bits here. As usual, I would really appreciate any feedback.

time to read 3 min | 523 words

RavenDB’s subscriptions give you the ability to run batch processing easily and robustly. In other words, you specify a query and subscribe to its results. RavenDB will send you all the documents matching the query. So far, that is pretty obvious, but what is important with subscriptions is the fact that they will keep sending you results. As long as your subscription is open, you’ll get any changed document that matches your query. That gives you a great way to implement event pipelines and batch processes, and in general opens up some interesting options.

In this case, I want to talk about handling failures with subscriptions. Not failures in the sense of a server going down or a client crashing; these are already handled by the subscription mechanism itself. A server going down will cause the cluster to change the ownership of the subscription, and your client code will not even notice. A client going down can fail over to another client. Alternatively, upon restart, the client will pick up right from where it dropped things. No, all this is handled.

What requires attention is what happens if there is an error during the processing of a batch of documents. Imagine that we want to do some background processing. We could do that in many ways, such as introducing a queuing system and a task queue, but in many cases, the overhead of that is quite high. A simpler approach is to just write the tasks out as documents and use a subscription to process them. In this case, let’s imagine that we want to send emails. A subscription will run over the EmailToSend collection, doing whatever processing is required to actually send each email. Once we are done processing a batch, we’ll delete all the items that we processed. Whenever there are new emails to send, the subscription will get them for us immediately.

But what happens if there is a failure to send one particular email in a batch? Well, we can ignore this (and not delete the document), but that will require some admin involvement to resolve. Subscriptions will not revisit documents that they have already seen, unless those documents are changed. Here is one way to handle this scenario:
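Here is a minimal sketch of that pattern; the EmailToSend class, the SendEmail helper and the subscription name "EmailsToSend" are all hypothetical stand-ins, not the post’s original code:

```csharp
using System;
using System.Threading.Tasks;
using Raven.Client.Documents;

public static class EmailSender
{
    public static async Task ProcessEmails(IDocumentStore store)
    {
        var worker = store.Subscriptions
            .GetSubscriptionWorker<EmailToSend>("EmailsToSend");

        await worker.Run(async batch =>
        {
            using (var session = batch.OpenAsyncSession())
            {
                foreach (var item in batch.Items)
                {
                    var metadata = session.Advanced.GetMetadataFor(item.Result);
                    metadata.TryGetValue("Retries", out object value);
                    var retries = Convert.ToInt32(value ?? 0);
                    if (retries >= 5)
                        continue; // exceeded the allowed retries, leave it for an admin

                    try
                    {
                        await SendEmail(item.Result); // the actual work
                        session.Delete(item.Id);      // processed, remove the task
                    }
                    catch (Exception)
                    {
                        // Modifying the metadata changes the document, so the
                        // subscription will hand it back to us for another try.
                        metadata["Retries"] = retries + 1;
                    }
                }
                await session.SaveChangesAsync();
            }
        });
    }

    private static Task SendEmail(EmailToSend email) => Task.CompletedTask; // stand-in

    public class EmailToSend { /* whatever the email task needs */ }
}
```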

In short, we’ll try to process each document, sending the email, etc. If we fail to do so, we won’t delete the document; instead, we’ll patch it to increment a Retries property in the metadata. This operation has two interesting effects. First, it means that we can keep track of how often we retried a particular document. But as a side effect of modifying the document, we’ll get it back in the subscription again. In other words, this piece of code will give a document 5 retries before it gives up.

As an admin, you can then peek into your database, see all the documents that have exceeded the allowed retries and make a decision on what to do with them. But anything that failed because of some transient error will just work.

time to read 4 min | 754 words

RavenDB uses a consensus protocol to manage much of its distributed state. The consensus protocol is used to ensure consistency in a distributed system, and it is open to users as well. You can use this feature to enable some interesting scenarios.

The idea is that you can piggyback on RavenDB’s existing consensus engine, which gives you the ability to create robust and consistent distributed operations. RavenDB exposes these operations using a pretty simple interface: compare-exchange.

At the most basic level, you have a key/value interface on which you can perform distributed atomic operations, knowing that they are completely consistent. This is great in the abstract, but it is a bit hard to grasp without a concrete example.

Consider the following scenario. We have a bunch of support engineers, ready and willing to take on any support call that comes in. At the same time, an engineer can only handle a certain number of support calls. In order to handle this, we allow engineers to register when they are available to take a new support call. How would we handle this in RavenDB, assuming that we want absolute consistency? An engineer may never be assigned too much work, and work may never be lost. Assume that we need this to be robust in the face of network and node failures.

Here is how an engineer can register in the pool of available engineers.
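A minimal sketch of that registration, assuming a compare-exchange key named "engineers/available" (the key name is an illustration, not from the post):

```csharp
using System.Collections.Generic;
using Raven.Client.Documents;
using Raven.Client.Documents.Operations.CompareExchange;

public static class EngineerPool
{
    public static void RegisterEngineerAvailability(IDocumentStore store, string engineer)
    {
        while (true)
        {
            // Read the current value (and its index) for the key.
            var current = store.Operations.Send(
                new GetCompareExchangeValueOperation<List<string>>("engineers/available"));

            var engineers = current?.Value ?? new List<string>();
            engineers.Add(engineer);

            // Atomically swap the old value for the new one. The index must
            // match what we read, otherwise the write is rejected.
            var result = store.Operations.Send(
                new PutCompareExchangeValueOperation<List<string>>(
                    "engineers/available", engineers, current?.Index ?? 0));

            if (result.Successful)
                return;
            // Someone else won the race; read the fresh value and retry.
        }
    }
}
```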


The code above is very similar to how you would write multi-threaded code. You first get the value, then attempt to do an atomic operation to swap the old value with the new one. If we are successful, the operation is done. If not, then we retry. Concurrent calls to RegisterEngineerAvailability will race each other. One of them will succeed and the others will have to retry.

The actual data that we store in the compare exchange value in this case is an array. You can see an example of how that would look here:

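A hypothetical value, using made-up engineer IDs:

```json
["engineers/1-A", "engineers/3-A", "engineers/7-A"]
```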


Compare exchange values can be simple values (numbers, strings), arrays or even objects. Any value that can be represented as JSON is valid there. However, the only operation that is allowed on a compare exchange value is a wholesale replacement.

The code above is only doing half of the job. We still need to be able to get an engineer to help us handle a support call. The code to complete this task is shown below:
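A sketch of the matching pull operation, under the same assumptions as before:

```csharp
using System.Collections.Generic;
using System.Threading;
using Raven.Client.Documents;
using Raven.Client.Documents.Operations.CompareExchange;

public static string PullAvailableEngineer(IDocumentStore store)
{
    while (true)
    {
        var current = store.Operations.Send(
            new GetCompareExchangeValueOperation<List<string>>("engineers/available"));

        if (current == null || current.Value.Count == 0)
        {
            Thread.Sleep(500); // no one is available; wait a bit and try again
            continue;
        }

        var engineer = current.Value[0];

        if (current.Value.Count == 1)
        {
            // Our engineer is the last one: delete the value entirely.
            var deleted = store.Operations.Send(
                new DeleteCompareExchangeValueOperation<List<string>>(
                    "engineers/available", current.Index));
            if (deleted.Successful)
                return engineer;
        }
        else
        {
            var remaining = new List<string>(current.Value);
            remaining.RemoveAt(0);
            var updated = store.Operations.Send(
                new PutCompareExchangeValueOperation<List<string>>(
                    "engineers/available", remaining, current.Index));
            if (updated.Successful)
                return engineer;
        }
        // The value changed under us; loop around and try again.
    }
}
```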


The code for pulling an engineer from the pool is a bit more complex. Here we read the available engineers from the server. If there are none, we'll wait a bit and try again. If there are available engineers, we'll remove the first one and then try to update the value. This can happen for multiple clients at the same time, so we check whether our update was successful and only return the engineer if our change was accepted.

Note that in this case we use two different modes to update the value. If there are still more engineers in the available pool, we'll just remove our engineer and update the value. But if our engineer is the last one, we'll delete the value entirely. In either case, this is an atomic operation that will first check the index of the pre-existing value before performing the write.

It is important to note that when using compare exchange values, you'll typically not act on read. In other words, in PullAvailableEngineer, even if we have an available engineer, we'll not use that knowledge until we have successfully written the new value. The whole idea with compare exchange values is that they give you an atomic operation primitive in the cluster. So a typical usage is to keep trying to do something on write until it is accepted, and only then use whatever value you read.

The acceptance of the write indicates the success of your operation and the ability to rely on whatever values you read. However, it is important to note that compare exchange operations are atomic and independent. That means an operation that modifies a compare exchange value and then does something else needs to take into account that these will run in separate transactions.

For example, if a client pulls an engineer from the available pool but doesn't provide any work (maybe because the client crashed), the engineer will not magically return to the pool. In such cases, the idle engineer should periodically check that the pool still contains their username and add it back if it is missing.

time to read 4 min | 685 words

I have talked before about RavenDB’s MapReduce indexes and their ability to output results to a collection, as well as RavenDB’s ETL processes and how we can use them to push some data to another database (a RavenDB database or a relational one).

Bringing these two features together can be surprisingly useful when you start talking about global distributed processing. A concrete example might make this easier to understand.

Imagine a shoe store (we’ll go with Gary’s Shoes) that needs to track sales across a large number of locations. Because sales must be processed regardless of the connection status, each store hosts a RavenDB server to record its sales. Here is the geographic distribution of the stores:

[Image: map of the geographic distribution of the stores]

To properly manage this chain of stores, we need to be able to look at data across all stores. One way of doing this is to set up external replication from each store location to a central server. This way, all the data is aggregated into a single location. In most cases, this would be the natural thing to do. In fact, you would probably want two-way replication of most of the data so you could figure out if a given store has a specific shoe in stock by just looking at the local copy of its inventory. But for the purpose of this discussion, we’ll assume that there are enough shoe sales that we don’t actually want to have all the sales replicated.

We just want some aggregated data. But we want this data aggregated across all stores, not just at one individual store. Here’s how we can handle this: we’ll define an index that would aggregate the sales across the dimensions that we care about (model, date, demographic, etc.). This index can answer the kind of queries we want, but it is defined on the database for each store so it can only provide information about local sales, not what happens across all the stores. Let’s fix that. We’ll change the index to have an output collection. This will cause it to write all its output as documents to a dedicated collection.
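A sketch of what such an index can look like with the C# client; the Sale model, the aggregation dimensions and the DailySummaries collection name are all assumptions for illustration:

```csharp
using System;
using System.Linq;
using Raven.Client.Documents.Indexes;

public class Sale
{
    public string Model;
    public DateTime SoldAt;
    public decimal Price;
}

public class Sales_DailySummary : AbstractIndexCreationTask<Sale, Sales_DailySummary.Result>
{
    public class Result
    {
        public string Model;
        public DateTime Date;
        public int Count;
        public decimal Total;
    }

    public Sales_DailySummary()
    {
        Map = sales => from s in sales
                       select new { s.Model, Date = s.SoldAt.Date, Count = 1, Total = s.Price };

        Reduce = results => from r in results
                            group r by new { r.Model, r.Date } into g
                            select new
                            {
                                g.Key.Model,
                                g.Key.Date,
                                Count = g.Sum(x => x.Count),
                                Total = g.Sum(x => x.Total)
                            };

        // Write the aggregated results out as documents in their own collection.
        OutputReduceToCollection = "DailySummaries";
    }
}
```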

Why does this matter? These documents will be written to solely by the index, but given that they are documents, they obey all the usual rules and can be acted upon like any other document. In particular, this means that we can apply an ETL process to them. Here is what this ETL script would look like.

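A hypothetical sketch of defining such an ETL task through the C# client; the task name, connection string, collection and the added static fields are all assumptions:

```csharp
using System.Collections.Generic;
using Raven.Client.Documents.Operations.ETL;

// `store` is an initialized IDocumentStore; the "central-server"
// connection string must already be defined on the database.
var etl = new RavenEtlConfiguration
{
    Name = "Sales to HQ",
    ConnectionStringName = "central-server",
    Transforms = new List<Transformation>
    {
        new Transformation
        {
            Name = "PushDailySummaries",
            Collections = new List<string> { "DailySummaries" },
            Script = @"
                var summary = this;
                // Static fields so HQ can tell which store the data came from.
                summary.Store = 'stores/nyc-5th-ave';
                summary.City = 'New York';
                loadToDailySummaries(summary);"
        }
    }
};
store.Maintenance.Send(new AddEtlOperation<RavenConnectionString>(etl));
```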

The script sends the aggregated sales (the collection generated by the MapReduce index) to a central server. Note that we also added some static fields that will be helpful on the remote server, so as to be able to tell which store each aggregated sale came from. At the central server, you can work with these aggregated sales documents to look at each store’s details, or you can aggregate them again to see the state across the entire chain.

The nice thing about this approach is the combination of features and their end result. At the local level, you have independent servers that can work seamlessly with an unreliable network. They also give store managers a good overview of their local state and what is going on inside their own stores.

At the same time, across the entire chain, we have ETL processes that will update the central server with details about sales statuses on an ongoing basis. If there is a network failure, there will be no interruption in service (except that the sales details for a particular store will obviously not be up to date). When the network issue is resolved, the central server will accept all the missing data and update its reports.

The entire process relies on features that already exist in RavenDB and are easily accessible. The end result is a distributed, highly reliable and fault tolerant MapReduce process that gives you an aggregated view of sales across the entire chain at very little cost.

time to read 2 min | 238 words

This is actually an old feature that didn’t make the cut to enter 4.0. It is now back, and it is roaring. This is the kind of feature that is useful if you are utilizing RavenDB’s search capabilities. Let us assume that you want to search for something, but instead of querying for “give me all the active users” you want to actually… search. For example, you want to search for all employees with a BA in their bio. However, you don’t want to just get the matches, you want to show the user why each of these was a match.

That is the problem that highlighting is meant to solve. Consider the following query:

[Image: the search query]

Which returns the following results:

[Image: the query results]

Why did we get these particular employees? Let’s find out:

[Image: the query, now requesting highlighting]

Now we are asking the server to highlight for us the reason for the match. You can see this in the studio directly, in the Highlight tab:

[Image: the Highlight tab in the studio]

Using this approach, you can enrich the search results and provide a nicer experience for your users.
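Issuing such a query from the C# client might look like this sketch; the Employees/Search index, the Employee class and the Notes field are assumptions, and the field needs to be indexed for full-text search:

```csharp
using System.Linq;
using Raven.Client.Documents;
using Raven.Client.Documents.Queries.Highlighting;

public class Employee { public string Notes; } // stand-in for the sample entity

// `store` is an initialized IDocumentStore.
using (var session = store.OpenSession())
{
    var employees = session.Advanced
        .DocumentQuery<Employee>("Employees/Search") // hypothetical index
        .Highlight("Notes", 128, 1, out Highlightings highlightings)
        .Search("Notes", "BA")
        .ToList();

    foreach (var employee in employees)
    {
        // The returned fragments show why each document matched.
        var fragments = highlightings.GetFragments(
            session.Advanced.GetDocumentId(employee));
    }
}
```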

time to read 1 min | 66 words

Just to let you know, the book is pretty much edited, which means that you won’t have to suffer through my horrible sentence structure.

You can read this here.

What remains to be done now is for me to go over the book again, verify that there aren’t any issues, and we are done.

In other words, we are now “Done, Done” in the “Done, Done, Done” scale.

time to read 1 min | 107 words

Kamran Ayub did a great DotNetRocks show about RavenDB 4.0. Kamran is also behind the RavenDB 4.0 course on PluralSight, so he knows his stuff.

I’ve got to say, it is… strange to listen to a podcast about RavenDB. I found myself nodding along quite often, and the outside perspective is pretty awesome.

Kamran also tested the same application on RavenDB 3.5 and RavenDB 4.0, seeing a 20x performance improvement. Best quote from the show as far as I’m concerned:

So fast you aren’t sure it actually worked.

Kamran also has a follow-up post with some numbers and more details here.

Listen to the show here.

time to read 1 min | 136 words

In addition to the book and the documentation, we are also working on making it more accessible to get started with RavenDB. The RavenDB Bootcamp is a self-directed course meant to give you an easy way to start using RavenDB.

This is a guided tour, walking you through the fundamentals of getting RavenDB up and running, how to put data in and query it, and how you can use indexing and MapReduce. These are short lessons, providing practical experience and guidance on how to start using RavenDB.

You can also register to get a lesson a day.

This is now updated to RavenDB 4.0, smoothing the learning curve and making it even simpler to get started.

time to read 3 min | 501 words

Documents are awesome; they allow you to model your data in a very natural way. At the same time, there are certain things that just don’t fit into the document model.

Consider the simple case of counting. This seems like it would be very obvious, right? As simple as 1+1. However, you need to also consider concurrency and distribution. Look at the image on the right. What you can see there is a document describing a software release. In addition to tracking the features that are going into the release, we also want to count various statistics about the release. In this example, you can see how many times a release was downloaded, how many times it was rated, etc.

I’ll admit that the stars rating is a bit cheesy, but it looks good and actually tests that we have good Unicode support.

Except for a slightly nicer way to show numbers on the screen, what does this feature give you? It means that RavenDB now natively understands how to count things. This means that you can increment (or decrement) a value without modifying the whole document. It also means that RavenDB will be able to automatically handle concurrency on the counters, even when running in a distributed system. This makes the feature suitable for cases where you:

  • want to increment a value
  • don’t mind (and usually explicitly desire) concurrent updates
  • may need to handle a very large number of operations

The case of the download counter or the rating votes is a classic example. Two separate clients may increment either of these values at the same time that a third user is modifying the parent document. All of that is handled by RavenDB: the data is updated and distributed across the cluster, and the final counter values are tallied.

Counters cannot cause conflicts and the only operation that you are allowed to do to them is to increment / decrement the counter value. This is a cumulative operation, which means that we can easily handle concurrency at the local node or cluster level by merging the values.

Other operations (deleting a counter, deleting the parent document) are of course non-cumulative, but they are much rarer and don’t typically need any sort of cooperative concurrency.

Counters are not standalone values but are strongly associated with their owning document. Much like the attachments feature, this means that you have a structured way to add additional data types to your documents. Use counters to, well… count. Use attachments to store binary data, etc. You are going to see a lot more of this in the future, since there are a few things in the pipeline that we are already planning to add.

You can use counters as a single operation (incrementing a value) or in a batch (incrementing multiple values, or even modifying counters and documents together). In all cases, the operation is transactional and will ensure full ACIDity.
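A hypothetical sketch of what this can look like from the client, using the session CountersFor API; the document ID and counter names are made up, and the final shape of the API may differ from this sketch:

```csharp
using Raven.Client.Documents;

// `store` is an initialized IDocumentStore.
using (var session = store.OpenSession())
{
    // Counter changes ride along with the rest of the session;
    // SaveChanges applies everything as a single transaction.
    var counters = session.CountersFor("releases/4.2-alpha");
    counters.Increment("Downloads");   // +1
    counters.Increment("Stars", 5);    // increment by an arbitrary delta

    session.SaveChanges();

    // Reading the tallied value back:
    long? downloads = session.CountersFor("releases/4.2-alpha").Get("Downloads");
}
```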

time to read 3 min | 600 words

Note: This feature is an experimental one. It will be included in 4.1, but it will be behind an experimental feature flag. It is possible that this will change before full inclusion in the product.

RavenDB now supports multiple operating systems, and we have spent a lot of effort to bring the RavenDB client APIs to more platforms. C#, JVM and Python are already done; Go, Node.JS and Ruby are in various beta stages. One of the things this brought up was our indexing structure. Right now, if you want to define a custom index in RavenDB, you use C# Linq syntax to do so. When RavenDB was primarily focused on .NET, that was a perfectly fine decision. However, as we push for more platforms, we want to avoid forcing users to learn the C# syntax when they create indexes.

Without further ado, here is a JavaScript index in RavenDB 4.1:
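A minimal sketch of what one can look like, defined through the C# client; the collection and field names are assumptions:

```csharp
using System.Collections.Generic;
using Raven.Client.Documents.Indexes;

public class Employees_ByName : AbstractJavaScriptIndexCreationTask
{
    public Employees_ByName()
    {
        // The map function is plain JavaScript instead of C# Linq.
        Maps = new HashSet<string>
        {
            @"map('Employees', function (e) {
                  return { FirstName: e.FirstName, LastName: e.LastName };
              })"
        };
    }
}
```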

As you can see, this is a pretty simple translation between the two. It does make a certain set of operations easier, since the JavaScript option is a lot more imperative. Consider the case of this more complex index:
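Again a sketch, with Northwind-style names that are all assumptions; it loads a related document and aggregates inside the map function:

```csharp
using System.Collections.Generic;
using Raven.Client.Documents.Indexes;

public class Orders_Totals : AbstractJavaScriptIndexCreationTask
{
    public Orders_Totals()
    {
        Maps = new HashSet<string>
        {
            @"map('Orders', function (order) {
                  // Load a related document as part of the indexing function.
                  var company = load(order.Company, 'Companies');
                  return {
                      Company: company ? company.Name : null,
                      // Aggregate the order lines with a plain JS reduce.
                      Total: order.Lines.reduce(function (acc, l) {
                          return acc + l.Quantity * l.PricePerUnit;
                      }, 0)
                  };
              })"
        };
    }
}
```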

You can see here the interplay of a few features. First, instead of just selecting a value to index, we can use a full fledged function. That means that you can run your complex computations during indexing more easily. Features such as loading related documents are there, and you can see how we use reduce to aggregate information as part of the indexing function.

JavaScript’s dynamic nature gives us a lot of flexibility. If you want to index fields dynamically, just do so, as you can see here:
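A hypothetical sketch; the Products collection and its Attributes property are assumptions. Since the returned object is just JavaScript, its fields can be built up at runtime:

```csharp
using System.Collections.Generic;
using Raven.Client.Documents.Indexes;

public class Products_ByAttributes : AbstractJavaScriptIndexCreationTask
{
    public Products_ByAttributes()
    {
        Maps = new HashSet<string>
        {
            @"map('Products', function (p) {
                  var result = {};
                  // Each attribute becomes an index field, decided at runtime.
                  Object.keys(p.Attributes).forEach(function (key) {
                      result[key] = p.Attributes[key];
                  });
                  return result;
              })"
        };
    }
}
```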

MapReduce indexes work along the same lines. Here is a good example:
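A sketch of a JavaScript MapReduce index (hypothetical names again), using the groupBy/aggregate form that JavaScript indexes use for the reduce step:

```csharp
using System.Collections.Generic;
using Raven.Client.Documents.Indexes;

public class Products_CountByCategory : AbstractJavaScriptIndexCreationTask
{
    public Products_CountByCategory()
    {
        Maps = new HashSet<string>
        {
            @"map('Products', function (p) {
                  return { Category: p.Category, Count: 1 };
              })"
        };

        Reduce = @"groupBy(x => ({ Category: x.Category }))
                       .aggregate(g => ({
                           Category: g.key.Category,
                           Count: g.values.reduce((acc, v) => acc + v.Count, 0)
                       }))";
    }
}
```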

The indexing syntax is the only thing that changed. The rest is all the same. All the capabilities and features that you are used to are still there.

JavaScript is used extensively in RavenDB, not surprisingly. That is how you patch documents, do projections and manage subscriptions. It is also a very natural language for handling JSON documents. I think it is pretty fair to assume that anyone who uses RavenDB will have at least a passing familiarity with JavaScript, so that makes it easier to get how indexing works.

There is also the security aspect. JavaScript is much easier to control and handle in an embedded fashion. The C# indexes allow users to write their own code that RavenDB will run. That code can, in theory, do anything. This is why index creation is an admin level operation. With JavaScript indexes, we can allow users to run their computations without worrying that they will do something that they shouldn’t. Hence, the access level required for creating JavaScript indexes is much lower.

Using JavaScript for indexing does have some performance implications. The C# code is faster, generally, but not much faster. The indexing function isn’t where we usually spend a lot of time when indexing, so adding a bit of additional work there (interpreting JavaScript) doesn’t hurt us too badly. We are able to get to speeds of over 80,000 documents / second using JavaScript indexes, which should be sufficient. The C# indexes aren’t going anywhere, of course. They are still there and can provide additional flexibility / power as needed.

Another feature that might be very useful is the ability to attach additional sources to an index. For example, you may really like computing a sum using lodash. You can add the lodash.js file as an additional file to the index, and that would expose the library to the indexing functions.
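A hypothetical sketch of wiring that up; the index name, file path and field names are assumptions, using the AdditionalSources part of the index definition:

```csharp
using System.Collections.Generic;
using System.IO;
using Raven.Client.Documents.Indexes;
using Raven.Client.Documents.Operations.Indexes;

// `store` is an initialized IDocumentStore.
var index = new IndexDefinition
{
    Name = "Orders/Totals",
    Maps = new HashSet<string>
    {
        // _.sum comes from the attached lodash source below.
        @"map('Orders', o => ({
              Total: _.sum(o.Lines.map(l => l.Quantity * l.PricePerUnit))
          }))"
    },
    AdditionalSources = new Dictionary<string, string>
    {
        ["lodash.js"] = File.ReadAllText("lodash.js")
    }
};
store.Maintenance.Send(new PutIndexesOperation(index));
```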
