Ayende @ Rahien

Refunds available at head office

Azure DocumentDB

On Friday, Microsoft came up with Azure DocumentDB. You might say that I have a small interest in such things, so I headed over there to see what I can learn about this project.

Aside from being somewhat annoyed with the name, this seems to be a very different animal from RavenDB, and something that was built to serve a different niche. One of the things that we put first with RavenDB is ease of use, development and deployment for business applications. The ADB design appears to be built around a different goal, around very big datasets.

Nitpicker corner: Yes, I know this is a preview, and I know that they are going to be changes. And I repeat, I have no knowledge about this project beyond the documentation and several hours of playing with it.

That said, I do have a fair bit of experience in this area. So I feel that I can speak with confidence about the topic.

ADB is supposed to be an highly scalable system that store documents. So far, so good, I can certainly understand that need. But it has made drastically different design choices, some of which I feel very strongly about. I'll try to explore the issues that I have issues with, and contrast that with what you can do with RavenDB.

This post has two parts, the first talks about conceptual issues. The second talk about the currently published limits, and their implications for general use for ADB.


  • No sorting option, or a good paging story
  • SQL Injection, without any other alternative
  • Hard to deploy and to keep current with your codebase
  • Poor development story & no testing story
  • Poor client API
  • Lots of table scans
  • Limited queries and few optimization options
  • Single document transactions (from the client)
  • No cross collection transactions at all
  • Very small document sizes allowed

Also see the “What is this for?” section below.

For a document database platform that doesn’t have any of those issues, and run in Azure, see RavenHQ on Azure.

Transactions – ADB say that it has transactions, and for a very limited meaning of the word, I believe it means it. Transactions in ADB means a single document only can be saved with a guarantee it will either be saved or not. That is great, in the sense that at least you won’t have data corruption, but that isn’t really something that mean much. Even MongoDB can satisfy that bar.

Oh, sure, you can get actual transactions if you write JS code that run as a “stored procedure” inside ADB. This means that you can send data to the server and have your JS Stored Procedure make multiple operations in a single transaction. Which is just slightly better (although see my comments on those stored procedures later), but that is still limited to only operations inside the same collections.

A trivial example for transactions in a document database would be to add a new comment, and update the comment count. You cannot do that in ADB. Not in a single transaction. I don’t know about you, but most of the interesting use cases happen when you are working with multiple document types. Sure, you can put all your documents inside the same collection, but have fun trying to work with that in the long term.

In contrast, RavenDB fully support actual transactions that can span multiple documents (even on different collections, which I would never believe would be an accomplishment). RavenDB can even support DTC and transactions that spans multiple interactions with the server. You know, the kind of transactions you actually want to use. For more, see the documentation on RavenDB transactions.

Management – it honestly feels like someone missed all the points that made people want to ditch SQL in the first place. ADB has the concepts of triggers, user defined functions (more on that travesty later, when I discuss queries) and stored procedures. You can define them in JS, and you create something that looks like this:


Let me count the ways that this is going to cause problems for you.

  • Business logic in the database, because we haven’t learned anything about that in the past.
  • Code that you cannot run or test independently. Just debugging something like that is going to be hard.
  • No way to actually manage deployment or make sure that this code is in sync with the rest of your codebase.
  • Didn’t we already learn that triggers are a source for a lot of pain? Are they really necessary for you to do things?

Yes, you have a DB that is schema less, but those kind of things are actually important. They define what you can do with the database, and not having a good way to move those around, and most importantly, not having a way to tie them to the source control system you are using is going to be a giant PITA.

Sorry, that isn’t actually something that you can delay doing for later. You need a good development story, and as I see it, the entire development story all around here is just going to be hard. You would have to manually schlep things around between development and production. And that isn’t just about the SP or UDFs. There are a lot of settings that you’re going to have to deal with. For example, the configuration per collection, which you’ll want to make sure is the same (otherwise you get some very subtle and hard to understand bugs).

For that matter, there doesn’t seem to be a development story. You are probably expected to just run another ADB instance on Azure and use that. This means a lot of latency in development, and that also means that you can’t have separate databases per developer, which is a standard practice. This means having to write a lot of code just to manage those things, and you are right back again at the good old days of “who didn’t update the schema script” and failed deployments.

In contrast, RavenDB make is very easy to handle your indexes & transformers in your code and deploying them as a single step. That also means that they are versioned in the same place as your code, so you don’t have to worry about moving between dev & prod. We spent a lot of time thinking and working around this specific area, because this is a common pain point in relational databases, and we weren’t willing to accept that being the case in our database. For more information, please see the documentation about index management in RavenDB.

Indexing – there are several things here that bother me. By default, everything is indexed, and in the same transaction. This is a great decision, for a demo system. But in a real world system, the overhead of indexing everything is prohibitive, especially in a high write system. So ADB is allowing to specify the paths that you will include or exclude from indexing, as well as whatever indexing should be within the same transaction or lazy.

The problem with that is that those are per collection settings and there doesn’t appear to be any way to modify them after the fact. So you start running your system in production, realize that the cost of indexing is high, so you need to change the indexing strategy for a collection. The only way to do that is to create a new collection, with a new indexing strategy, move all the data there, then delete the old one. For even more fun, consider the case where you have a production and development environments. In production, you have a different indexing strategy then in development (where the ‘index everything’ mode is still on). That means that when you push things to production, your system will fail silently, because you won’t be indexing the fields you though were indexed.

This need re-iteration, the way this currently work, you start running with the default indexing option, which is expensive. As long as you don’t have any performance requirements (for example, during development), that is just fine. But when you actually have a lot of data there, or have a lot of writes, that is when you’ll figure out that those things need to be changed. At that point, you are are pretty much screwed, because you need to pull all the data out, create a new collection with the new indexing options, and write it all back. That is a horrible experience, especially because you’ll likely need to do that under pressure with users breathing down your necks and management complaining about the performance.

For that matter, indexing in general scares me. Again, I don’t actually have any knowledge of the internal operations, but there are a lot of stuff there that just doesn’t make sense. It looks like the precision of the indexes used are up to 3 characters (by default) per value. I’m guessing that this is done to reduce the amount of space used by the indexing, at least that is what the docs says. The problem is that when you do that, you do a lookup by the first 3 characters, then you have to do a flat search over all the other values. That is going to be causing problems.

It is also indicated that you cannot do any range searches except on numeric values. Which has interesting implications if you want to do searches on something like a date range, or time spans, an incredibly common operation.

In contrast, RavenDB indexes are always using the full value, so you are getting an O(logN) search behavior, and not a fallback to O(N) behavior. Range searches are possible on any value, numeric, date time, time span, string, etc. For more information, see the RavenDB documentation about searching with RavenDB.

Queries – Speaking of problems. Let me talk for a moment on ADB SQL. It looks nice on the surface, it is certainly would be familiar to most people. It is also contain a lot of hidden traps.

For example, the docs talk about being able to do joins, but you are only actually able to do “joins” into the sub documents, not into other collections, or even documents in the same collection. A query such as:


SELECT c.Name as CustomerName, o.Total, o.Date
FROM Orders o
JOIN Customers c ON c.Id = o.CustomerId

Can’t be executed on ADB. So the whole notion of “joins” is actually limited to what you can do in a single document and the sub documents it contains. That make it very limited.

The options for filtering (where clause) is also interesting. Mostly because of the wide range they allow. It is very easy to create queries that cannot be computed using indexes. That means that your query is now running table scans. Lots & lots of table scans. Sure, you don’t have tables, but O(N) is still O(N), and when N is large, as it is apparently the expected case here, you are going to be pretty much dead in the water.

Another thing that I can’t wrap my head around is the queries shown. There is no way to pass parameters to the query. None.  This appears to be the case because 30+ years of working with SQL has shown that there is absolutely no issue with putting user’s input directly into the query. And since complex queries require you to use the raw ADB SQL, you are pretty much have guaranteed that you’ll have SQL Injection attacks.

Sure, you might no get caught by Little Bobby Tables (you can’t modify data via SQL), but you are still exposed and can leak important data. This query works just fine, and will return all products:

SELECT * FROM Products p WHERE p.Name = "testing" OR 1 = 1 -- "

I’ll assume that you understand how I got there. This is a brand new database engine, but ADB is bringing very old issues back into the future. Not only that, we don’t have anyway around that. I guess you are going to have to write your on parameter scrubbing code, and make sure to use it everywhere.

In general, queries are limited. Severely limited, actually. Take a look at the following query:

SELECT * FROM Products p 
WHERE p.Type = "Beer"
AND p.Maker = "Guinness"
AND p.Discontinued = false 
AND p.Price > 10 AND p.Price < 100

You can’t run it in ADB. It is too complex to run. Note that this is about as trivial a query as you can get, in pretty much any reasonable business system.

Continuing on with the problems for business apps theme, there doesn’t appear to any good way to do things like paging. When you issue a query, you can specify the number of items to take and you can repeat a query by passing a continuation. But that doesn’t really help when you need to actually page with the user. So you show the data to the user, then want to go to the next page… you have to pass the continuation token all the way around, and hope that it will remain valid for the duration. For that matter, the current client API does paging at the server level, but it will fetch all the results for a query, even if it take it hours to do so.

There is no way to actually get the total number of items that match the query. So you can’t show the user something like: “You have 250 new emails”, nor can you show them “Page 1 … 50”.

Another troubling omission is the total lack of anything that would allow you to actually query your documents in a particular order. If I want to get the latest orders in descending order (or in fact, in any well defined order), I am out of luck. There is no way of doing that. This is a huge deal, because this isn’t just something that you can try papering over. This is a core functionality that you need in pretty much any application. And it is just not there. There is some indication that this is being worked on, but I’m surprised that this isn’t here already. Distributed sorting is a non trivial problem, of course, so I’ll reserve further judgment until I see what they have.

ADB’s queries are highly limited, so I expect a workaround for that is going to be to push functionality into the UDF. Note that UDF don’t have access to any context, so it can’t load additional documents. What it can do it utterly destroy any chance you’ll ever have for optimizing a query. The moment that a UDF is involved, you don’t really have a choice about how to execute a query, you pretty much have to go to a table scan. Maybe filtering some stuff based on the other filters in the query, but in many cases, that means that you’ll have to run your UDF over millions of records. Because UDFs are permitted to perform non pure operations (like the current time), you can’t even cache its values, or do anything smart around that. You’ll always have to execute the UDF, regardless of the amount of data you have to go through. I don’t expect that to perform very well.

In contrast, RavenDB was explicitly designed to give you both flexibility and performance in queries. There are no table scans in RavenDB, and complex queries are expected, encouraged and are handled properly. Queries across multiple documents (and in other collections) are possible, and quite easy to do. Common operations, like paging or sorting are part of the core functionality, and are both very easy to use and come with no additional costs. Complex things like full text search, spatial queries, facets and many more are right there for you to use.  For more information, see the RavenDB documentation about querying in RavenDB, spatial searches in RavenDB and how RavenDB actually index the data to allow complex operations.

Data types – ADB data types are the ones defined in the JSON spec. In particular, it doesn’t have native support for date times. The ADB documentation suggest that you’ll do custom serialization to handle that. Rendering things like asking: “Give me all the orders for this customer for 2014” very hard, leaving aside the issues of querying for orders in a particular month, which is not possible either, since you can only do range searches on numeric data. Dates, in particular, are a very complex topic, and not actually handling this in the database is going to open you up for a lot of issues down the road. And dates are kinda important type to have.

In contrast, RavenDB handles complex (including user defined) types in a well defined manner. And has full support for dates, operations on dates, etc. It seems silly to mention, to be fair, because it seems so basic to have that. For more information, you can read the documentation about dates in RavenDB.

Aggregation – this one is simple, you don’t have any. That means that you cannot get the total number of unread emails, or the total sum of orders per customer, or the maximum order per this month . This whole functionality just isn’t there.

In contrast, RavenDB has explicit support for counting the number of results for a query as well as map/reduce indexes. Those give you powerful aggregation framework, which execute the work in the background. When you query, you get the pre-computed results very quickly, without having to do any work at query time. For more information, you can read about Map/Reduce in RavenDB and dynamic aggregation queries.

Set operations – another easy one, it is just not there. You can do some operations in a stored procedure, but you have 5 seconds to run, and that is it. If you need to do something like: Split FullName to FirstName and LastName, get ready to write a lot of code, and wait for a long time for this to complete. For that matter, something as simple as “delete all inactive users” is very hard to do as well.

In contrast, RavenDB has explicit support for set based updates and deletes. You can specify a query that match a set of results that would either be deleted or patched using a JS script. For more operations, read the documentations about Set Based Operations.

Client API – this is still a preview, so that is somewhat unfair, but the client API is very primitive. Basically, it is a very thin wrapper around the REST API, and it does a poor job at that. The REST API support paging, but the C# client API does not, for example. There is no concept of unit of work, change tracking, client side behavior or anything at all that would actually make this work nicely. There is also an interesting design decision to go async for all operations except queries.

With queries, you actually issue an async REST call, but you are going to be waiting on that query synchronously. This is probably because of the IQueryable interface and its assumption that the query is sync. But that is a very bad thing to do in terms of mixing sync and async work. It is easy to get into problems such as deadlocks, self lock and just plain weirdness.

In contrast, RavenDB has a carefully designed client APIs (for .NET, JVM, etc), which fully expose the power of RavenDB. They have been designed to be intuitive, easy to use and guide you into the pit of success, RavenDB also have separate sync and async API, including fully async queries. For more information, read the documentation about the client API.

Self links – when issuing any operation whatsoever to the database, you have to use something call the object link, or self link. For example, in my test database, the Products collection link is: dbs/frETAA==/colls/frETANSmEAA=/

You have to use links like that whenever you make any operation what so ever. For fun, those are going to be unique per database, so creating a Products collection in another database would result in a different collection link. That means that I can’t just store them in configuration. So you’ll probably have to read them from the database every time you need to use them (maybe with some caching?). This is just silly. This is making it very hard to look at what is going on and see what the system is doing (for example, by watching what is going on in Fiddler).

In contrast, RavenDB applies human readable names whenever possible. For more information, see the documentation about the efforts to make sure that everything in RavenDB in human readable and easily debuggable. One such place is the id generation strategy.

Development and testing – in this day and age, people are connected to the internet through most of their day to day life. That doesn’t mean that they are always connected, or that you can actually rely on the network, or that the latency is acceptable. There is no current development story for ADB. No way to run your own database and develop while you are offline (on the train or at 30,000 feet in the air). That means that every call to ADB has to go over the internet, and that means, in turn, that there is no local development story at all. It means a lot more waiting from the point of view of the developer (also see next point), it means that there is just no testing story.

If you want to run code to test your ADB usage, you have to setup (and pay) a whole new ADB instance, have to make sure that it is setup exactly the same way as your production instance, and run it against that. It means that test not only have to go outside your process, but across the internet to a remote server. This pretty much kills the notion of fast tests.

In contrast, RavenDB has an excellent development and testing story. You don’t pay for development or CI instances, and you can run tests against RavenDB using an in memory mode embedded inside your process. This has been heavily optimize to allow fast running tests. We are developers, and we care to make other developers’ life easy. It shows. For more information, see the documentation about unit testing RavenDB.

Joins are for your code – because ADB doesn’t actually support joins beyond the document scope, or any other option like that, it means that if you want to do something trivial, like show a customer a list of their orders, you are actually going to have to do the join in your own code, not in the database. In fact, let us take a silly scenario, let us say that we want to show a list of new employees as well as their managers, so we can have a chat with them about how they are settling in.

If we were using SQL, we would be using something like this:

SELECT emp.Id as EmpId, emp.Name as EmpName, mngr.Id as ManagerId, mngr.Name as ManagerName
FROM Employees emp
JOIN Managers mngr where emp.ManagerId = mngr.Id
WHERE emp.JoinedAt > '2014-06-01'

That is pretty easy, right? How do you do something like that in ADB? Well, you start with the first query:

SELECT emp.Id as EmpId, emp.Name as EmpName, emp.ManagerId as ManagerId
FROM Employees emp
WHERE emp.JoinedAt > '2014-06-01'

And then, for each of the returned managers’ ids, we have to issue a separate query (ADB doesn’t have support for IN). This pattern of usage is called SELECT N+1, and it is a very well known anti pattern, even leaving aside the fact that you have to manually do the join in your own code, with all that this implies. This sort of operations will effectively kill the performance of any application, because you are very chatty with the database.

In contrast, RavenDB contains several ways to load related items. From including a related document to projecting it via a transformer, you can very easily and efficiently get all the data you need, in a single query to RavenDB. In fact, RavenDB applies a Safe By Default approach and limit the number of times you can call the server (configurable) to prevent just this case. We’ll error if you go over the budget of remote calls you are allowed to make. This gives you an early chance to catch performance problems. For more information, see the documentation about includes, transformers and  the Safe By Default approach practiced by RavenDB.

Limits - reading the limits for ADB makes for some head scratching. Yes, I know that we are talking about the preview mode only. I’m aware that you can ask to increase those limits. Nevertheless, those limits likely reflect real trade offs made in the system. So increasing those limits for a particular use case means that you’ll have to pay the price for that elsewhere.

For example, let us take this blog post as an example. It is over 22KB in size. But I can’t store this blog post in ADB. That is because documents are limited to 16KB in size. This is utterly ridiculous. I just checked a few of our databases, an common size for documents is 4 – 8 KB, this is true. But larger documents appear all the time. Even if you exclude blog posts as BLOB of text, we have order documents that have with  multiple order lines that are easily past that size. In our users, we see every document size possible, from hundreds of KB to several MB.

I reached out to Codealike, one of our customers, who were also featured in one of Azure’s case studies, to hear from them what their situation was. Out of 1.6 million documents in one of their databases, about 90% are in the 500Kb range.

I’m assuming that a large part of this limitation is the fact that by default, everything is indexed. You can’t index everything and have large documents and have reasonable performance. So this limit was introduced. I’m also assuming that there are other issues here (to be able to fit into pages? low level technical stuff?). Regardless, this is just utterly ridiculous low limit. The problem is that even raising this limit by x5 or x10, that is still not enough. And I’m assuming that they didn’t chose this limit out of thin air, that there is a technical reason for it.

Other issues is the number of stored procedure and UDF that you have available. You get 5 of each, and that is it. So you don’t get to actually express anything complex there. You also get to use only a single UDF per query, and to use a maximum of 3 AND / OR clauses in a query. I’m assuming that the reasoning here is that the more clauses you have, the more complex it is to run the query, especially in a distributed environment. So they put a hard limit on that.

Those limits together, along with not supporting sorting basically render ADB into an interesting curiosity, but not a real contender for a generally applicable database.

What is this for?

After going over the documentation, there is one thing that I couldn’t find. What is the primary use case for ADB? 

It looks more like a solution in search of a problem than the other way around. It appears that this is used by several MS systems to store 100s of TB of data, and process millions of queries. Sheer data size isn’t really interesting, we have customers that have multiple TB data. And millions of queries per day isn’t really something to brag about (10 million queries per day translate to about 115 queries per second, or about 20 – 30 queries per second per node).

What interests me is what sort of data do you put there? The small size limitation make it pretty much unsuitable for storing actual complex documents. You have to always be aware of the size you are storing, and that put a serious crimp in how you can work with this. The limited queries and the inability to sort also lead me to believe that this is a very purpose built tool.

OneNote’s server side is apparently one such use case, but from the look of things, I would expect that this is the other way around. That ADB is actually the backend of OneNote that Microsoft has decided to make public (like Dynamo’s in Amazon’s case).

Some of those limitations are probably going to be alleviated by using additional Microsoft tools. So the new Search Server (presumably that one has complex searching & sorting available) would allow you to do some proper queries, and HDInsight might be used for doing aggregation.

You aren’t going to be able to get the “show me the count of unread emails for this user” from Hadoop, not when the data is constantly changing. And using a secondary search server will introduce high latencies for the indexing. That is leaving aside the additional operational complexity of having to manage multiple systems (and the communication between them) just to get things done.

Here are a few things that would be hard to build in ADB, as it stands today:

  • This blog – the posts are too big, can’t sort posts by date, can’t do “complex” queries (tag & date & published & not deleted)
  • Logging – I actually thought that this would be a great use case, but we actually need to show logs by date. As well as be able to search using multiple fields (more than 3) or do contains queries.
  • Orders system –  important orders with a lot of line items will be rejected because of the size limitation.

In fact, I don’t know what would work there. What kind of data are you putting there? This isn’t good for bulk data work, because the ingest rate is really small (~500 writes / second? The debug version RavenDB does 2,500 writes per sec that on my dev laptop without even using the bulk insert API) and there isn’t a good way to work with large amount of data at once. It isn’t good for business applications, for the reasons outlined above.

I guess that if you patched this and the search server and Hadoop together you would get something that might be able to serve. But I think that the complexity involved is going to be very high, and I just don’t see where this would be a great solution.

In short, what is the problem that this is trying to solve? What application would be a perfect fit for this?

With RavenDB, the answer is simple, it is a general purpose database focused on OTLP applications. Until you have an answer, you can use RavenDB on Azure today using RavenHQ on Azure.

On site Architecture & RavenDB consulting availabilities: Malmo & New York City

I’m going to have availability for on site consulting in Malmo, Sweden  (17 Sep) and in New York City, NY (end of Sep – beginning of Oct).

If you want me to come by and discuss what you are doing (architecture, nhibernate or ravendb), please drop me a line.

I’m especially interested in people who need to do “strange” things with data and data access. We are building a set of tailored database solutions for customers now, and we have seem customers show x750 improvement in performance when we gave them a database that was designed to fit their exact needs, instead of having to contort their application and their database to a seven dimensional loop just to try to store and read what they needed.

Your customer isn’t a single entity

An interesting issue came up in the comments for my modeling post.  Urmo is saying:

…there are no defined processes, just individual habits (even among people with same set of obligations) with loose coupling on the points where people need to interact. In these companies a software can be a boot that kicks them into more defined and organized operating mode.

This is part of discussion of software modeling and the kind of thinking you have to do when you approach a system. The problem with Urmo’s approach is that there is a set implicit assumptions, and that is that the customer is speaking with a single voice, that they actually know what they are doing and that they have the best interests. Yes, it is really hard to create software (or anything, actually) without those, but that happens more frequently than one might desire.

A few years ago I was working on a software to manage what was essentially long term temp workers. Long term could be 20 years, and frequently was a number of years. The area in question was caring for invalids,  and most of the customers for that company were the elderly. That meant that a worker might not be required on a pretty sudden basis (the end customer died, care no longer required).

Anyway, that is the back story. The actual problem we run into was that by the time the development team got into place there was already a very detailed spec, written by a pretty good analyst after many sessions at a luxury hotel conference room. In other words, the spec cost a lot of money to generate, and involved a lot of people from the company’s management.

What it did not include, however, was feedback from the actual people who had to place the workers at particular people’s homes, and eventually pay them for their work. Little things like the 1st of the month (you have 100s of workers coming in to get their hours approved and get paid) weren’t taken into account. The software was very focused on the individual process, and there were a lot of checks to validate input.

What wasn’t there were things like: “How do I efficiently handle many applicants at the same time?’'

The current process was paper form based, and they were basically going over the hours submitted, ask minimal questions, and provisionally approve it. Later on, they would do a more detailed scan of the hours, and do any fixups needed. That would be the time that they would also input the data to their old software. In other words, there was an entire messy process going on that the higher ups didn’t even realize was happening.

This include decisions such as “you need an advance, we’ll register that as 10 extra hours you worked this month, and we’ll deduct it next month” and “you weren’t supposed to go to Mrs. Xyz, you were supposed to go to Mr. Zabc! We can’t pay for all your hours there” , etc.

When we started working on the software, we happened to do a demo to some of the on site people, and they were horrified by what they saw. The new & improved software would end up causing them much more issues, and it would actually result in more paperwork that they have to manage just so they can make the software happy.

Modeling such things was tough, and at some point (with the client reluctant agreement) we essentially threw aside the hundreds of pages of well written spec, and just worked directly with the people who would end up using our software. The solution in the end was to codify many of the actual “business processes” that they were using. Those business processes made sense, and they were what kept the company working for decades. But management didn’t actually realize that they were working in this manner.

And that is leaving aside the “let us change the corporate structure through software” endeavors, which are unfortunately also pretty common.

To summarize, assuming that your client is a single entity, which speaks with one voice and actually know what they are talking about? Not going to fly for very long. In another case, I had to literally walk a VP of Sales through the process of how a sale is actually happening in his company versus what he thought was happening.

Sometimes this job is likely playing a shrink, but for corporations.

Modeling exercise: Flights & Travelers

I just got a really interesting customer inquiry, and I got their approval to share it. The basic problem is booking flights, and how to handle that.

The customer suggested something like the following:

{   //customers/12345
    "Name" : "John Doe",
    "Bookings" : [{
        "FlightId": "flights/1234",
        "BookingId": "1asifyupi",
        "Flight": "EA-4814",
        "From": "Iceland",
        "To" : "Japan", 
        "DateBooked" : "2012/1/1"

{ // flight/1234
   "PlaneId": "planes/1234"// centralized miles flown, service history
           "Seat": "F16"
           "BookedBy": "1asifyupi"

But that is probably a… suboptimal way to handle this. Let us go over the type of entities that we have here:

  • Customers / Passengers
  • Flights
  • Planes
  • Booking

The key point in here is that each of those is pretty independent. Note that for simplicity’s sake, I’m assuming that the customer is also the passenger (not true in many cases, a company may pay for your flight, so you the company in the customer and you the passenger).

The actual problem the customer is dealing with is that they have thousands of flights, tens or hundreds of thousands of seats and millions of customers competing for those seats.

Let us see if we can breaking it down to a model that can work for this scenario.  Customers deserve its own document, but I wouldn’t store the bookings directly in the customer document. There are many customers that fly a lot, and they are going to have a lot of booking there. At the same time, there are many bookings that are made for a lot of people at the same time (an entire family flying).

That leaves the Customer’s document with data about the customer (name, email, phone, passport #, etc) as well as details such as # of miles traveled, the frequent flyer status, etc.

Now, we have the notion of flights and bookings. A flight is a (from, to, time, plane), which contains the available seats number. Note that we need to explicitly allow for over booking, since that is a common practice for airlines.

There are several places were we have contention here:

  • When ordering, we want to over book up to a certain limit.
  • When seating (usually 24 – 48 hours before the flight) we want to reserve seats.

The good thing about it is that we actually have a relatively small contention on a particular flight. And the way the airline industry works, we don’t actually need a transaction between creating the booking and taking a seat on the flight.

The usual workflows goes something like this:

  • A “reservation” is made for a particular itinerary.
  • That itinerary is held for 24 – 48 hours.
  • That itinerary is sent to the customer for approval.
  • Customer approve and a booking is made, flight reservations are turned into actual booked seats.

The good thing about this is that because a flight can have up to ~600 seats in it, we don’t really have to worry about contention on a single flight. We can just use normal optimistic concurrency and avoid more complex models. That means that we can just retry on concurrency errors and see where that leads us. The breaking of the actual order into reservation and booking also helps, since we don’t have to coordinate between the actual charge and the reservation on the flight.

Overbooking is handled by setting a limit of how much we allow overbooking, and managing the number of booked seats vs. reserved seats. When we look at the customer data, we show the customer document, along with the most recent orders and the stats. When we look at a particular flight, we can get pretty much all of its state from the flight document itself.

And the plane’s stats are usually just handled via a map/reduce index on all the flights for that plane.

Now, in the real world, the situation is a bit more complex. We might give out 10 economy seats and 3 business seats to Expedia for a 2 months period, so they manage that, and we have partnership agreements with other airlines, and… but I think that this is a good foundation to start building this on.

Design practice: Building a search engine library

Note: This is done purely as a design practice. We don’t have any current plans to implement this, but I find that it is a good exercise in general.

How would I go about building a search engine for RavenDB to replace Lucene. Well, we have Voron as the basis for storage, so from the get go, we have several interesting changes. To start with, we inherit the transactional properties of Voron, but more importantly, we don’t have to do merges, so we don’t have any of those issues. In other words, we can actually generate stable document ids.

But I’m probably jumping ahead a bit. We’ll start with the basics. Analysis / indexing is going to be done in very much the same way. Users will provide an analyzer and a set of documents, like so:

   1: var index = new Corax.Index(storageOptions, new StandardAnalyzer());
   3: var scope = index.CreateIndexingScope();
   5: long docId = scope.Add(new Document
   6: {
   7:     {"Name", "Oren Eini", Analysis.Analyzer},
   8:     {"Name", "Ayende Rahien", Analysis.Analyzer},
   9:     {"Email", "ayende@ayende.com", Analyzed.AsIs, Stored.Yes}
  10: });
  12: docId = scope.Add(new Document
  13: {
  14:     {"Name", "Arava Eini", Analysis.Analyzer},
  15:     {"Email", "arava@doghouse.com", Analyzed.AsIs, Stored.Yes}
  16: });
  18: index.Sumbit(index);

Some notes about this API. It is modeled quite closely after the Lucene API, and it would probably need additional work. The idea here is that you are going to need to get an indexing scope, which is single threaded. And you can have multiple indexing scopes running a the same time. You can batch multiple writes into a single scope, and it behaves like a transaction.

The idea is to deal with all of the work associated with indexing the document into a single threaded work, so that make it easier for us. Note that we immediately get the generated document id, but that the document will only be available for searching when you have submitted the scope.

Under the hood, this does all of the work at the time of calling Add(). The analyzer will run on the relevant fields, and we will create the appropriate entries. How does that work?

Every document has a long id associated with it. In Voron, we are going to have a ‘Documents’ tree, with the long id as the key, and the value is going to be all the data about the document we need to keep. For example, it would have the stored fields for that documents, or the term positions, if required, etc. We’ll also have a a Fields tree, which will have a mapping of all the field names to a integer value.

Of more interest is how we are going to deal with the terms and the fields. For each field, we are going to have a tree called “@Name”, “@Email”, etc. Those are multi trees, with the keys in that tree being the terms, and the multi values being the document ids that has those threes. In other words, the code above is going to generate the following data in Voron.

  • Documents tree:
    • 0 – { “Fields”: { 1: “ayende@ayende.com” } }
    • 1 – { “Fields”: { 1: “arava@doghouse.com” } }
  • Fields tree
    • 0 – Name
    • 1 – Email
  • @Name tree
    • ayende – [ 0 ]
    • arava – [ 1 ]
    • oren – [ 0 ]
    • eini – [ 0, 1 ]
    • rahien – [ 0 ]
  • @Email tree
    • ayende@ayende.com – [ 0 ]
    • arava@doghouse.com – [ 1 ]

Given this model, we can now turn to the actual querying part of the business. Let us assume that we have the following query:

   1: var queryResults = index.Query("Name: Oren OR Email: ayende@ayende.com");
   2: foreach(var result in queryResult.Matches)
   3: {
   4:     Console.WriteLine("{0}: {1}", result.Id, result["Email"] )
   5: }

How does querying works? The query above would actually be something like:

   1: new BooleanQuery(
   2:     Match.Or,
   3:     new TermQuery("Name", "Oren"),
   4:     new TermQuery("Email", "ayende@ayende.com")
   5:     )

The actual implementation of this query would just access the relevant field tree and load the values in the particular key. The result is a set of ids for both parts of the query, and since we are using an OR here, we will just union them and return the list of results back.

Optimizations here can be to run those queries in parallel, or to just cache the results of particular term query, so we can speed things even more.  This looks simple, and it is, because a lot of the work has already been done in Voron. Searching is pretty much complete, we only need to tell it what to search with.

Reviewing go-raft, part II

In my previous post, I started to go over the go-raft implementation, after being interrupted by the need to sleep, I decided to go on with this, but I wanted to expand a bit first about the issue we discussed earlier, not checking the number of bytes read in log_entry’s Decode.


Let us assume that we actually hit that issue, what would happen?


The process goes like this, we try to read a value, but the Read method only return some of the information. We explicitly ignore that, and try to use the buffer anyway. Best case scenario, we are actually getting an error, so we bail early. At that point, we detect the error and truncate the file. Hello data loss, nice to see you. For fun, this is the best case scenario. It is worse if we marshal the partial data without an error. Then we have not the case of “oh, we have a node that is somehow way behind”, we have the case of “this node actually applied different commands than anyone else”.

I reported this issue, and I’m interested to know if my review is in any way correct. With that said, let us move on…

getEntriesAfter gives us all the in memory entries. That is quite similar to how RavenDB handled indexing, for that matter, so it is amusing. But this applies only to in memory stuff, and it is quite interesting to see how this will interact with other parts of the codebase.

setCommitIndex is interesting. In my head, committing something means flushing them to disk. But in Raft’s term. Committing something means applying the commands. But the reason it is interesting is that it has some interesting comments on edge cases. So far, I haven’t see actually writing to disk, mind.

And this one gives me a headache:


Basically, this mean that we need to write the commit index to the beginning of the file. It is also an extremely unsafe operation. What happens if you crash immediately after? Did you change go through, or not? For that matter, there is nothing that prevents the OS from first writing the changes you made to the beginning of the file then whatever else you wrote at the end. So a crash might actually leave you with the commit pointer pointing at corrupted data. Luckily, I don’t see anything there actually calling this, though.

The truncate method makes my head ache, mostly because it does things like delete data, which makes my itchy. This is called from the server code as part of normal processing of the append entries request. What this does, in effect, is to say something like: I want you to apply this log entry, make sure that your previous log entry is this, and if it isn’t, revert it back to this entry.  This is how Raft ensure that all the logs are the same across the cluster.

Then we have this:

   1:  // Appends a series of entries to the log.
   2:  func (l *Log) appendEntries(entries []*protobuf.LogEntry) error {
   3:      l.mutex.Lock()
   4:      defer l.mutex.Unlock()
   6:      startPosition, _ := l.file.Seek(0, os.SEEK_CUR)
   8:      w := bufio.NewWriter(l.file)
  10:      var size int64
  11:      var err error
  12:      // Append each entry but exit if we hit an error.
  13:      for i := range entries {
  14:          logEntry := &LogEntry{
  15:              log:      l,
  16:              Position: startPosition,
  17:              pb:       entries[i],
  18:          }
  20:          if size, err = l.writeEntry(logEntry, w); err != nil {
  21:              return err
  22:          }
  24:          startPosition += size
  25:      }
  26:      w.Flush()
  27:      err = l.sync()
  29:      if err != nil {
  30:          panic(err)
  31:      }
  33:      return nil
  34:  }

This seems pretty easy to follow, all told. But note the call to sync() there in line 27. And the fact that this translate down to an fsync, which is horrible for performance.

There is also appendEntry, which appears to be doing the exact same thing as appendEntries and writeEntry. I’m guessing that the difference is that appendEntries is called for a follower, and appendEntry is for a leader.

The last thing to go through in the log.go file is the compact function, which is… interesting:

   1:  // compact the log before index (including index)
   2:  func (l *Log) compact(index uint64, term uint64) error {
   3:      var entries []*LogEntry
   5:      l.mutex.Lock()
   6:      defer l.mutex.Unlock()
   8:      if index == 0 {
   9:          return nil
  10:      }
  11:      // nothing to compaction
  12:      // the index may be greater than the current index if
  13:      // we just recovery from on snapshot
  14:      if index >= l.internalCurrentIndex() {
  15:          entries = make([]*LogEntry, 0)
  16:      } else {
  17:          // get all log entries after index
  18:          entries = l.entries[index-l.startIndex:]
  19:      }
  21:      // create a new log file and add all the entries
  22:      new_file_path := l.path + ".new"
  23:      file, err := os.OpenFile(new_file_path, os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0600)
  24:      if err != nil {
  25:          return err
  26:      }
  27:      for _, entry := range entries {
  28:          position, _ := l.file.Seek(0, os.SEEK_CUR)
  29:          entry.Position = position
  31:          if _, err = entry.Encode(file); err != nil {
  32:              file.Close()
  33:              os.Remove(new_file_path)
  34:              return err
  35:          }
  36:      }
  37:      file.Sync()
  39:      old_file := l.file
  41:      // rename the new log file
  42:      err = os.Rename(new_file_path, l.path)
  43:      if err != nil {
  44:          file.Close()
  45:          os.Remove(new_file_path)
  46:          return err
  47:      }
  48:      l.file = file
  50:      // close the old log file
  51:      old_file.Close()
  53:      // compaction the in memory log
  54:      l.entries = entries
  55:      l.startIndex = index
  56:      l.startTerm = term
  57:      return nil
  58:  }

This code can’t actually run on Windows. Which is interesting. The issue here is that it is trying to rename a file that is open on top of another file which is open. Windows does not allow it.

But the interesting thing here is what this does. We have the log file, which is the persisted state of the in memory entries collection. Every now and then, we compact it by creating a snapshot, and then we create a new file, with only the entries after the newly created snapshot position.

So far, so good, and that gives me a pretty good feeling regarding how the whole thing is structured. Next in line, the peer.go file. This represent a node’s idea about what is going on in the another node in the cluster. I find the heartbeat code really interesting:

// Starts the peer heartbeat.
func (p *Peer) startHeartbeat() {
    p.stopChan = make(chan bool)
    c := make(chan bool)
    go p.heartbeat(c)

// Stops the peer heartbeat.
func (p *Peer) stopHeartbeat(flush bool) {
    p.stopChan <- flush

// Listens to the heartbeat timeout and flushes an AppendEntries RPC.
func (p *Peer) heartbeat(c chan bool) {
    stopChan := p.stopChan

    c <- true

    ticker := time.Tick(p.heartbeatInterval)

    debugln("peer.heartbeat: ", p.Name, p.heartbeatInterval)

    for {
        select {
        case flush := <-stopChan:
            if flush {
                // before we can safely remove a node
                // we must flush the remove command to the node first
                debugln("peer.heartbeat.stop.with.flush: ", p.Name)
            } else {
                debugln("peer.heartbeat.stop: ", p.Name)

        case <-ticker:
            start := time.Now()
            duration := time.Now().Sub(start)
            p.server.DispatchEvent(newEvent(HeartbeatEventType, duration, nil))

Start heartbeat starts a new heartbeat, and then wait under the heartbeat function notify it that it has started running.

What is confusing is the reference to the peer’s server. Peer is defined as:

// A peer is a reference to another server involved in the consensus protocol.
type Peer struct {
    server            *server
    Name              string `json:"name"`
    ConnectionString  string `json:"connectionString"`
    prevLogIndex      uint64
    mutex             sync.RWMutex
    stopChan          chan bool
    heartbeatInterval time.Duration

And it seems logical to think that this is a remote peer’s server, but this is actually the local server reference, not the remote one. Note that it is actually the flush method that does the remote call.

Flush is defined as:

func (p *Peer) flush() {
    debugln("peer.heartbeat.flush: ", p.Name)
    prevLogIndex := p.getPrevLogIndex()
    term := p.server.currentTerm

    entries, prevLogTerm := p.server.log.getEntriesAfter(prevLogIndex, p.server.maxLogEntriesPerRequest)

    if entries != nil {
        p.sendAppendEntriesRequest(newAppendEntriesRequest(term, prevLogIndex, prevLogTerm, p.server.log.CommitIndex(), p.server.name, entries))
    } else {
        p.sendSnapshotRequest(newSnapshotRequest(p.server.name, p.server.snapshot))

The interesting thing here is that the entries collection might be empty (in which case this serve as just a heartbeat). Another thing that pops to mind is that this has an explicitly leader instructing follower to generate snapshots. The Raft paper suggested that this is something that would happen locally on each server on an independent basis.

There is a lot of interesting behavior in sendAppendEntriesRequest(), not so much in what it does, as in how it handles replies. There is a lot of state going on there. It’s very well commented, so I’ll let you read it, there isn’t anything that is actually going on that is complex.

What is fascinating is that while the transport layer for go-raft is HTTP, which is inherently request/response. It actually handles this in an interesting fashion:

  • Requests are synchronous
  • On reply, the in memory state of the peer is updated immediately
  • The response from the peer is queued to be handled by the server event loop

The end result is that a lot of the handling is centralized into a really pretty state machine. The rest of what is going on there is not very interesting, except for snapshots, but those are covered elsewhere.

And now, we are ready to actually go and look at the server code, but… not yet. It is over thousand lines of code, so I think that I’ll go over other stuff first. In particular, snapshotting looks interesting.


This is actually quite depressing. Note the State properties here. There is an implicit assumption that it is possible / advisable to go with the entire in memory state like that. I know that I am sensitive to such things, but that seems like an aweful lot of waste when talking about large systems.

Here is one such issue:


Let us assume that our state is big, hundreds of MB or maybe a few GB in size.

We currently hold it in memory inside the Snaphsot.STate, then we marshal that to json. Now, I actually had to go and check, but Go’s json package actually does the usual thing and encode a byte array as a base 64 formatted string. What that means, in turn, is that you have an overhead of about 25% that you have to deal with, and this is all allocated in main memory. And then you write it to a file.

This is…. quite insane, to be frank.

Assuming that I have a state that is 100 MB in size, I’m going to hold all of that in memory, then allocate another 125MB just to hold the json state, then write it to a file. Why not write it to a file directly in the first place? (You could do CRC along the way).

The whole thing appear to be assuming small sizes of data. Throughout the entire codebase, actually.

And now, I have no other ways to avoid it, we are going into the server.go itself…

There is a lot of boilerplate stuff there, but the first interesting thing happens when we look at how to apply the log:


This says, when we need to apply it, execute the command method on my state. A lot of the other methods are some variant of:


Nothing to see here at all.

And we finally get to the key part, the event loop:


Let us look in detail on the followerLoop. Inside that function, we have a loop that waits for:

  1. Stop signal, which would lead to us shutting down…
  2. We got an event on our queue…
  3. The timeout for an event has expired…

There is one part there that puzzles me:


promotable will return true if the log has any entries at all. I’m not really sure why that is the case, to be honest. In particular, what about the case when we start with an empty server. I’m going to go on reading the code, and we’ll see where it leads us. And it leads us to:


So next we need to figure out what is this self join stuff. I am not sure if that is something comes from Raft or from an external source. I found this issue that discusses this, but it isn’t very helpful in terms of understanding who issue the self join command. I tried looking at the etcd codebase, but I didn’t find anything so far. I’ll leave it for now.

The rest of the operations are basically just forwarding the calls to the appropriate methods if they are in the allowed state.

The caniddateLoop method isn’t anything special, it follows the Raft paper pretty nicely, although I have to admit the “candidate becomes follower upon Append Entries command” is buried deep. The same is true for the other behaviors. The appropriate state based responses sometimes are hard to figure out, because you have the state loop, then you have the same apparent behavior everywhere. For example, we need to become follower if we get an Append Entries request. That happens in processAppendEntriesRequest(), but it would actually be easier to see this if we had code duplication. This is a case where getting familiar with the codebase would help understanding it, and I don’t think that this would be a change worth doing, anyway.

Probably the most interesting behavior is in the leadershipLoop when we process a command. A command is added to the server queue using a Do(Command) method. It is then processed in processCommand.

The problem here is that commands are actually appended to the log, and then sent to peers using the heartbeat interval. By default, that stand at 50 ms.

This is great and all, but it does mean that the latency for requests is going to suffer. This doesn’t matter that much for something like etcd. The assumption here is that the requests are all going to be on different things, so we can queue a lot of commands and get pretty good speed overall. It is a problem if in our system, we have to process sequential operations. In that mode, we can’t wait until the heartbeat, and we want to process this right away. I’ll discuss this later,I think. It is a very important property of this implementation (but not for Raft in general).

Looking at the snapshot state, this happens when a follower get this a SnapshotRequest, but I don’t see anywhere that send it. Maybe it is another caller originated thing?

I just looked at the etcd source, and I think that I confirmed that both behaviors are there in the etcd source. So I think that that explains it.

And this is it, basically. I have some thoughts about the implementation of this and of etcd, but I think that this is enough for now… I’ll post them in my next post.

Reviewing go-raft, part I

After going over the etcd codebase, I decided that the raft portion of this is deserving a much stronger look. The project is here, and I am reviewing commit: 30f261bfe873561c2c75b6206ba1f62a42dbc8d6

Again, I strong recommend reading the Raft paper. It is quite good. At any rate, assuming that you understand Raft, let us get cracking. This time, I’m reading this in Sublime Text. As usual, I’m reading in lexicographical order, and I’m starting from append_entires.go

AppendEntries is at the very heart of Raft, so I was pleased to see it here:

// The request sent to a server to append entries to the log.
type AppendEntriesRequest struct {
    Term         uint64
    PrevLogIndex uint64
    PrevLogTerm  uint64
    CommitIndex  uint64
    LeaderName   string
    Entries      []*protobuf.LogEntry

// The response returned from a server appending entries to the log.
type AppendEntriesResponse struct {
    pb     *protobuf.AppendEntriesResponse
    peer   string
    append bool

However, I didn’t really understand this code. It seemed circular, at least until I realized that we also have a whole lot of generated files. See:


The actual protobuf semantics are (excluding a lot of stuff, of course):

message LogEntry {
    required uint64 Index=1;
    required uint64 Term=2;
    required string CommandName=3;
    optional bytes Command=4; // for nop-command

message AppendEntriesRequest {
    required uint64 Term=1;
    required uint64 PrevLogIndex=2;
    required uint64 PrevLogTerm=3;
    required uint64 CommitIndex=4;
    required string LeaderName=5;
    repeated LogEntry Entries=6;

message AppendEntriesResponse {
    required uint64 Term=1;
    required uint64 Index=2;
    required uint64 CommitIndex=3;
    required bool   Success=4;

So, goraft (which I always read as graft) is using protocol buffers as its wire format. Note in particular that the LogEntry contain the full content of a command. That AppendEntriesRequest has an array of them, and that the AppendEntriesResponse is setup separately. That means that it is very natural to use a one way channel for communication. Even though we do request response, there is a high degree of separation between the request & reply. Indeed, from reading the code in etcd, I thought that was the case.

There is something that really bothers me, though. I noticed that in etcd’s codebase as well. This is things like this:

// Encodes the AppendEntriesRequest to a buffer. Returns the number of bytes
// written and any error that may have occurred.
func (req *AppendEntriesRequest) Encode(w io.Writer) (int, error) {
    pb := &protobuf.AppendEntriesRequest{
        Term:         proto.Uint64(req.Term),
        PrevLogIndex: proto.Uint64(req.PrevLogIndex),
        PrevLogTerm:  proto.Uint64(req.PrevLogTerm),
        CommitIndex:  proto.Uint64(req.CommitIndex),
        LeaderName:   proto.String(req.LeaderName),
        Entries:      req.Entries,

    p, err := proto.Marshal(pb)
    if err != nil {
        return -1, err

    return w.Write(p)

I’m not sure about the actual semantics of memory allocations in Go, but let us assume that we had a single ,log entry with 1KB for the command data. This means that we would have the command data:

  • Once in the LogEntry inside the AppendEntriesRequest
  • Once in the protocol buffers byte array returned from Marshal

There doesn’t appear to be any way to directly stream things. Maybe it is usually dealing with small amounts of data, maybe they didn’t notice, or maybe something in Go make this very efficient, but I doubt it.

The next interesting part is Command handling. Raft is all about reaching a consensus on the order of executing a set of commands in a cluster. So it is really interesting to see it being handled with Go’s interfaces.

// Command represents an action to be taken on the replicated state machine.
type Command interface {
    CommandName() string

// CommandApply represents the interface to apply a command to the server.
type CommandApply interface {
    Apply(Context) (interface{}, error)

type CommandEncoder interface {
    Encode(w io.Writer) error
    Decode(r io.Reader) error

We have some additional things about serializing commands and reading them back, but nothing beyond this. The Commands.go file, however, is of a little bit more interest. Let us look at the join command:

// Join command interface
type JoinCommand interface {
    NodeName() string

// Join command
type DefaultJoinCommand struct {
    Name             string `json:"name"`
    ConnectionString string `json:"connectionString"`

// The name of the Join command in the log
func (c *DefaultJoinCommand) CommandName() string {
    return "raft:join"

func (c *DefaultJoinCommand) Apply(server Server) (interface{}, error) {
    err := server.AddPeer(c.Name, c.ConnectionString)

    return []byte("join"), err

func (c *DefaultJoinCommand) NodeName() string {
    return c.Name

I’m not sure when we have an interface for JoinCommand, then a default implementation like that. I saw that elsewhere in etcd, it might be a Go pattern. Note that the JoinCommand is an interface that embeds another interface (Command, in this case, obviously).

Note that you have the Apply function to actually handle the real work, in this case, add a peer.  There is nothing interesting in config.go, debug.go or context.go but event.go is puzzling. To be fair, I am really at a loss to explain this style:

// Event represents an action that occurred within the Raft library.
// Listeners can subscribe to event types by using the Server.AddEventListener() function.
type Event interface {
    Type() string
    Source() interface{}
    Value() interface{}
    PrevValue() interface{}

// event is the concrete implementation of the Event interface.
type event struct {
    typ       string
    source    interface{}
    value     interface{}
    prevValue interface{}

// newEvent creates a new event.
func newEvent(typ string, value interface{}, prevValue interface{}) *event {
    return &event{
        typ:       typ,
        value:     value,
        prevValue: prevValue,

// Type returns the type of event that occurred.
func (e *event) Type() string {
    return e.typ

// Source returns the object that dispatched the event.
func (e *event) Source() interface{} {
    return e.source

// Value returns the current value associated with the event, if applicable.
func (e *event) Value() interface{} {
    return e.value

// PrevValue returns the previous value associated with the event, if applicable.
func (e *event) PrevValue() interface{} {
    return e.prevValue

Why go to all this trouble to define things this way? It seems like a lot of boiler plate code. It would be easier to just expose a struct directly. I am assuming that this is done so you can send other things than the event struct, with additional information as well. In C# you’ll do that by subsclassing the event, but you cannot do that in Go. A better alternative might have been to just have a tag / state field in the struct and let it go that way, though.

event_dispatcher.go is just an implementation of a dictionary of string to events, nothing much beyond that. A lot of boiler plate code, too.

http_transporter.go is next, and is a blow to my hope that this will do a one way messaging system. I’m thinking about doing Raft over ZeroMQ or NanoMSG. Here is the actual process of sending data over the wire:

// Sends an AppendEntries RPC to a peer.
func (t *HTTPTransporter) SendAppendEntriesRequest(server Server, peer *Peer, req *AppendEntriesRequest) *AppendEntriesResponse {
    var b bytes.Buffer
    if _, err := req.Encode(&b); err != nil {
        traceln("transporter.ae.encoding.error:", err)
        return nil

    url := joinPath(peer.ConnectionString, t.AppendEntriesPath())
    traceln(server.Name(), "POST", url)

    t.Transport.ResponseHeaderTimeout = server.ElectionTimeout()
    httpResp, err := t.httpClient.Post(url, "application/protobuf", &b)
    if httpResp == nil || err != nil {
        traceln("transporter.ae.response.error:", err)
        return nil
    defer httpResp.Body.Close()

    resp := &AppendEntriesResponse{}
    if _, err = resp.Decode(httpResp.Body); err != nil && err != io.EOF {
        traceln("transporter.ae.decoding.error:", err)
        return nil

    return resp

This is very familiar territory for me, I have to say Smile. Although, again, there is a lot of wasted memory here by encoding the data multiple times, instead of streaming it directly.

And here is how it receives information:

// Handles incoming AppendEntries requests.
func (t *HTTPTransporter) appendEntriesHandler(server Server) http.HandlerFunc {
    return func(w http.ResponseWriter, r *http.Request) {
        traceln(server.Name(), "RECV /appendEntries")

        req := &AppendEntriesRequest{}
        if _, err := req.Decode(r.Body); err != nil {
            http.Error(w, "", http.StatusBadRequest)

        resp := server.AppendEntries(req)
        if _, err := resp.Encode(w); err != nil {
            http.Error(w, "", http.StatusInternalServerError)

Really there is nothing much to write home about, to be frank. All of the operations are like that, just encoding/decoding and forwarding the code to the right function. I’m skipping log.go in favor of going to log_entry.go for a moment. The log is really important in Raft, so I want to focus on small chewables first.

If the user don’t provide an encoder for a command, it will be converted using json, then serialized to a writer using protocol buffers format.

One thing that I did notice that was interesting was a bug in decoding from a ptorocol buffer stream:

// Decodes the log entry from a buffer. Returns the number of bytes read and
// any error that occurs.
func (e *LogEntry) Decode(r io.Reader) (int, error) {

    var length int
    _, err := fmt.Fscanf(r, "%8x\n", &length)
    if err != nil {
        return -1, err

    data := make([]byte, length)
    _, err = r.Read(data)

    if err != nil {
        return -1, err

    if err = proto.Unmarshal(data, e.pb); err != nil {
        return -1, err

    return length + 8 + 1, nil

Do you see the bug?

It is in the reading of the data from the reader. A reader may decide to read less than the data that was requested. In this case, I’m assuming that it is always sending fully materialized readers to the Decode method, not surprising given how often it will create in memory buffers for the entire dataset. Still.. that isn’t nice to do, and it can create the most subtle and hard to understand bugs.

And now, into the Log!

// A log is a collection of log entries that are persisted to durable storage.
type Log struct {
    ApplyFunc   func(*LogEntry, Command) (interface{}, error)
    file        *os.File
    path        string
    entries     []*LogEntry
    commitIndex uint64
    mutex       sync.RWMutex
    startIndex  uint64 // the index before the first entry in the Log entries
    startTerm   uint64

// The results of the applying a log entry.
type logResult struct {
    returnValue interface{}
    err         error

There are a few things that we can notice right now. First, ApplyFunc is how we control the application of stuff to the in memory state, I am assuming. Given that applying the log can only happen after we have a consensus and probably fsynced to disk, it makes sense to invoke it from here.

Then, we also have a file, so that is where we are actually doing a lot of the interesting stuff, like actual storage IO and things like that. The in memory events array is also interesting, mostly because I wonder just how big it is, and when it is getting truncated. I think that the way it works, we have the log properties, which likely represent the flushed to disk state, and the entries represent the yet to be flushed state.

Things get interesting in the open method, which is called to create new log or recover an existing one. The interesting parts (recovery) is here:

// Read the file and decode entries.
for {
    // Instantiate log entry and decode into it.
    entry, _ := newLogEntry(l, nil, 0, 0, nil)
    entry.Position, _ = l.file.Seek(0, os.SEEK_CUR)

    n, err := entry.Decode(l.file)
    if err != nil {
        if err == io.EOF {
            debugln("open.log.append: finish ")
        } else {
            if err = os.Truncate(path, readBytes); err != nil {
                return fmt.Errorf("raft.Log: Unable to recover: %v", err)
    if entry.Index() > l.startIndex {
        // Append entry.
        l.entries = append(l.entries, entry)
        if entry.Index() <= l.commitIndex {
            command, err := newCommand(entry.CommandName(), entry.Command())
            if err != nil {
            l.ApplyFunc(entry, command)
        debugln("open.log.append log index ", entry.Index())

    readBytes += int64(n)

This is really interesting, because it is actually sending the raw file to the Decode function, unlike what I expected. The reason this is surprising is that there is a strong likelihood that the OS  will actually return less data than requested. As it turns out, on Windows, this will never be the case, but it does appear (at least from the contract of the API it ends up calling) that at least on Linux, that is possible. Now, I ended up going all the way to the sys call interface in linux, so I’m pretty sure that this can’t happen there either, but still…

At any rate, the code appear to be pretty clear. We read the log entry, decode it, (truncating the file if there are any issues) and if need be, we apply it.

And I think that this is enough for now… it is close to 9 PM, and I need to do other things as well. I’ll get back to this in my next post.

Usage conventions for using Voron

As we are gearing up to do more & more stuff in Voron, it occurred to me that while we have settled on a good technological system for it, we haven’t settled on a real set of conventions for real use. We’re probably going to see a lot of use in Voron, and we want to see some consistency and best practices there.

Voron is a key/value store that expose a sorted tree abstraction. You can have as many trees as you would like. And the keys & values are both arbitrary byte strings. Given that, let us try to bring some order to the mix.

Don’t use the root tree

The root tree should reserved for handling of other trees, and not for the use of data of any kind. The only case where using the root tree is fine is if you don’t have any other trees. That tend to be a rare occasion, though. See next topic.

Have a $metadata tree

Always have a $metadata tree, that gives you information about the actual database you are using. For example, you’ll want to have things like the storage version, the database id (always a guid, to be able to tell dbs from their backups), etc.

Alpha numeric values only for tree names, please

You can use any value as a tree name, but you really want to limit yourself to the printable ASCII set. This is recommended because dumping the data to any other format (think debugging) would be greatly enhanced if you can actually read the tree names.

Use @tree for super trees

It is very common to have trees where all their data is handled via MutiAdd, MultiRead, etc. We call such trees super trees (they are trees that contains trees, also see big table). While they are usually used for indexing, there are many cases where you want to do that for things like queues, general run of the meal data (this is great for holding edges in a graph, for example), etc.

Prefix ix_ for all indexing trees

I’m not so sure about this, but it is worth mentioning. The purpose here is to distinguish between standard trees and trees that can be rebuild from scratch if needed. That can be of help in diagnostics mode, for example.

Dynamic trees should be explained

It is very easy to create trees in Voron. But like files, you don’t just create some around for no purpose. A tree cost 4KB (minimum), and more importantly, if you are looking at your storage, you want to be able to make sense of things. If you are using a tree as a queue for a particular destination, make sure that you name it appropriately.

Alternatively, use the $metadata tree to keep track of what goes where.

Avoid non printable key names

Just like in tree names, you can use any byte string in a key name, but you want to be able to read debug data or run diagnostics, it would really help if you could actually look at the data .

Remember, sequential writes are best

Voron will deal with random writes just fine, but it would be far  faster to write if you’ll arrange things so they are sequential. It is fine if you have once in a while a random write. But try to keep things sequential if at all possible. On that node, sequential for us means increasing, all of our optimizations assumes that. Decreasing sequential data is currently not as optimized.

Write and end, delete at start

This is a common operation you need for queues. It is generally better to do that with writing of the data at the end and removing from start. If you can, avoid just deleting stuff all over the place. Again, that works just fine, but we’ve optimized Voron to handle this scenario very well.

Keep the data simple

Voron does a lot with memory mapping. If the data you can use can be read directly, you can literally just access it off our own buffer, and have no copy required at all.

And that is all I have for now…


Published at

Originally posted at

Distributed counters feature design

This is another experiment with longer posts.

Previously, I used the time series example as the bed on which to test some ideas regarding feature design, to explain how we work and in general work out the rough patches along the way. I should probably note that these posts are purely fiction at this point. We have no plans to include a time series feature in RavenDB at this time. I am trying to work out some thoughts in the open and get your feedback.

At any rate, yesterday we had a request for Cassandra style counters at the mailing list. And as long as I am doing feature design series, I thought that I could talk about how I would go about implementing this. Again, consider this fiction, I have no plans of implementing this at this time.

The essence of what we want is to be able to… count stuff. Efficiently, in a distributed manner, with optional support for cross data center replication.

Very roughly, the idea is to have “sub counters”, unique for every node in the system. Whenever you increment the value, we log this to our own sub counter, and then replicate it out. Whenever you read it, we just sum all the data we have from all the sub counters.

Let us outline the various parts of the solution in the same order as the one I used for time series.


A counter is just a named 64 bits signed integer. A counter name can be any string up to 128 printable characters. The external interface of the storage would look like this:

   1: public struct CounterIncrement
   2: {
   3:     public string Name;
   4:     public long Change;
   5: }
   7: public struct Counter
   8: {
   9:     public string Name;
  10:     public string Source;
  11:     public long Value;
  12: }
  14: public interface ICounterStorage
  15: {
  16:     void LocalIncrementBatch(CounterIncrement[] batch);
  18:     Counter[] Read(string name);
  20:     void ReplicatedUpdates(Counter[] updates);
  21: }

As you can see, this gives us very simple interface for the storage. We can either change the data locally (which modify our own storage) or we can get an update from a replica about its changes.

There really isn’t much more to it, to be fair. The LocalIncrementBatch() increment a local value, and Read() will return all the values for a counter. There is a little bit of trickery involved in how exactly one would store the counter values.

For now, I think we’ll store each counter as two step values. We’ll have a tree of multi tree values that will carry each value from each source. That means that a counter will take roughly 4KB or so. This is easy to work with and nicely fit the model Voron uses internally.

Note that we’ll outline additional requirement for storage (searching for counter by prefix, iterating over counters, addresses of other servers, stats, etc) below. I’m not showing them here because they aren’t the major issue yet.

Over the wire

Skipping out on any optimizations that might be required, we will expose the following endpoints:

  • GET /counters/read?id=users/1/visits&users/1/posts <—will return json response with all the relevant values (already summed up).
    { “users/1/visits”: 43, “users/1/posts”: 3 }
  • GET /counters/read?id=users/1/visits&users/1/1/posts&raw=true <—will return json response with all the relevant values, per source.
    { “users/1/visits”: {“rvn1”: 21, “rvn2”: 22 } , “users/1/posts”:  { “rvn1”: 2, “rvn3”: 1 } }
  • POST /counters/increment <– allows to increment counters. The request is a json array of the counter name and the change.

For a real system, you’ll probably need a lot more stuff, metrics, stats, etc. But this is the high level design, so this would be enough.

Note that we are skipping the high performance stream based writes we outlined for time series. We’ll probably won’t need them, so that doesn’t matter, but they are an option if we need them.

System behavior

This is where it is really not interesting, there is very little behavior here, actually. We only have to read the data from the storage, sum it up, and send it to the user. Hardly what I’ll call business logic.

Client API

The client API will probably look something like this:

   1: counters.Increment("users/1/posts");
   2: counters.Increment("users/1/visits", 4);
   4: using(var batch = counters.Batch())
   5: {
   6:     batch.Increment("users/1/posts");
   7:     batch.Increment("users/1/visits",5);
   8:     batch.Submit();
   9: }

Note that we’re offering both batch and single API. We’ll likely also want to offer a fire & forget style, which will be able to offer even better performance (because they could do batching across more than a single thread), but that is out of scope for now.

For simplicity sake, we are going to have the client just a container for all of endpoints that it knows about. The container would be responsible for… updating the client visible topology, selecting the best server to use at any given point, etc.

User interface

There isn’t much to it. Just show a list of counter values in a list. Allow to search by prefix, allow to dive into a particular counter and read its raw values, but that is about it. Oh, and allow to delete a counter.

Deleting data

Honestly, I really hate deletes. They are very expensive to handle properly the moment you have more than a single node. In this case, there is an inherent race condition between a delete going out and another node getting an increment. And then there is the issue of what happens if you had a node down when you did the delete, etc.

This just sucks. Deletion are handled normally, (with the race condition caveat, obviously), and I’ll discuss how we replicate them in a bit.

High availability / scale out

By definition, we actually don’t want to have storage replication here. Either log shipping or consensus based. We actually do want to have different values, because we are going to be modifying things independently on many servers.

That means that we need to do replication at the database level. And that leads to some interesting questions. Again, the hard part here is the deletes. Actually, the really hard part is what we are going to do with the New Server Problem.

The New Server Problem dictates how we are going to bring a new server into the cluster. If we could fix the size of the cluster, that would make things a lot easier. However, we are actually interested in being able to dynamically grow the cluster size.

Therefor, there are only two real ways to do it:

  • Add a new empty node to the cluster, and have it be filled from all the other servers.
  • Add a new node by backing up an existing node, and restoring as a new node.

RavenDB, for example, follows the first option. But it means that in needs to track a lot more information. The second option is actually a lot simpler, because we don’t need to care about keeping around old data.

However, this means that the process of bringing up a new server would now be:

  1. Update all nodes in the cluster with the new node address (node isn’t up yet, replication to it will fail and be queued).
  2. Backup an existing node and restore at the new node.
  3. Start the new node.

The order of steps is quite important. And it would be easy to get it wrong. Also, on large systems, backup & restore can take a long time. Operationally speaking, I would much rather just be able to do something like, bring a new node into the cluster in “silent” mode. That is, it would get information from all the other nodes, and I can “flip the switch” and make it visible to clients at any point in time.  That is how you do it with RavenDB, and it is an incredibly powerful system, when used properly.

That means that for all intents and purposes, we don’t do real deletes. What we’ll actually do is replace the counter value with delete marker. This turns deletes into a much simple “just another write”. It has the sad implication of not free disk space on deletes, but deletes tend to be rare, and it is usually fine to add a “purge” admin option that can be run on as needed basis.

But that brings us to an interesting issue, how do we actually handle replication.

The topology map

To simplify things, we are going to go with one way replication from a node to another. That allows complex topologies like master-master, cluster-cluster, replication chain, etc. But in the end, this is all about a single node replication to another.

The first question to ask is, are we going to replicate just our local changes, or are we going to have to replicate external changes as well? The problem with replicating external changes is that you may have the following topology:


Now, Server A got a value and sent it to Server B. Server B then forwarded it to Server C. However, at that point, we also have a the value from Server A replicated directly to Server C. Which value is it supposed to pick? And what about a scenario where you have more complex topology?

In general, because in this type of system, we can have any node accept writes, and we actually desire this to be the case, we don’t want this behavior. We want to only replicate local data, not all the data.

Of course, that leads to an annoying question, what happens if we have a 3 node cluster, and one node fails catastrophically. We can bring a new node in, and the other two nodes will be able to fill in their values via replication, but what about the node that is down? The data isn’t gone, it is still right there in the other two nodes, but we need a way to pull it out.

Therefor, I think that the best option would be to say that nodes only replicate their local state, except in the case of a new node. A new node will be told the address of an existing node in the cluster, at which point it will:

  • Register itself in all the nodes in the cluster (discoverable from the existing node). This assumes a standard two way replication link between all servers, if this isn’t the case, the operators would have the responsibility to setup the actual replication semantics on their own.
  • New node now starts getting updates from all the nodes in the cluster. It keeps them in a log for now, not doing anything yet.
  • Ask that node for a complete update of all of its current state.
  • When it has all the complete state of the existing node, it replays all of the remembered logs that it didn’t have a chance to apply yet.
  • Then it announces that it is in a valid state to start accepting client connections.

Note that this process is likely to be very sensitive to high data volumes. That is why you’ll usually want to select a backup node to read from, and that decision is an ops decision.

You’ll also want to be able to report extensively on the current status of the node, since this can take a while, and ops will be watching this very closely.

Server Name

A node requires a unique name. We can use guids, but those aren’t readable, so we can use machine name + port, but those can change. Ideally, we can require the user to set us up with a unique name. That is important for readability and for being able to alter see all the values we have in all the nodes. It is important that names are never repeated, so we’ll probably have a guid there anyway, just to be on the safe side.

Actual Replication Semantics

Since we have the New Server Problem down to an automated process, we can choose the drastically simpler model of just having an internal queue per each replication destination. Whenever we make a change, we also make a note of that in the queue for that destination, then we start an async replication process to that server, sending all of our updates there.

It is always safe to overwrite data using replication, because we are overwriting our own data, never anyone else.

And… that is about it, actually. There are probably a lot of details that I am missing / would discover if we were to actually implement this. But I think that this is a pretty good idea about what this feature is about.

Published at

Originally posted at

Time series feature design: Storage replication & the bee’s knees

Being able to handle replication at the storage level is a really nice feature to have. More than that, it is a feature that can be broadly applied. But… a database is a lot more than just storage. And being able to just move the data around between machines is nice, but there are other things we have to take into account.

In particular, when we replicate via storage changes, we don’t have a good way to take actions on changes. Most of the time, that means that we can’t actually rely on internal caches, and would probably have to deal with that somehow in another fashion. But there are usually secondary processing that is done on a node that would have to be accounted for.

For example, let us assume that we had the ability to replicate RavenDB (docs) changes between machines using storage replication. The problem here is that we would be replicating the documents, but not the indexes, and when we do that, we would need to index the changed documents on the destination node. However, that would actually require two data stores, one for the actual documents data, and one for all of the non replicated data (indexing, stats, etc).

In other words, I think that such a database would have to be designed specifically for that scenario.

In addition to that, it would probably be best for the storage replication to also be annotated with information for higher level code. So if you modify this range in the file, you’ll also know that you need to drop the following documents from the cache.

Published at

Originally posted at

Time series feature design: The Consensus has dRafted a decision

So, after reaching the conclusion that replication is going to be hard, I went back to the office and discussed those challenges and was in general pretty annoyed by it. Then Michael made a really interesting suggestion. Why not put it on RAFT?

And once he explained what he meant, I really couldn’t hold my excitement. We now have a major feature for 4.0. But before I get excited about that (we’ll only be able to actually start working on that in a few months, anyway), let us talk about what the actual suggestion was.

Raft is a consensus algorithm. It allows a distributed set of computers to arrive into a mutually agreed upon set of sequential log records. Hm… I wonder where else we can find sequential log records, and yes, I am looking at you Voron.Journal.

The basic idea is that we can take the concept of log shipping, but instead of having a single master/slave relationship, we change things so we can put Raft in the middle. When committing a transaction, we’ll hold off committing the transaction until we have a Raft consensus that it should be committed. The advantage here is that we won’t be constrained any longer by the master/slave issue. If there is a server down, we can still process requests (maybe need to elect a new cluster leader, but that is about it).

That means that from an architectural standpoint, we’ll have the ability to process write requests for any quorum (N/2+1). That is a pretty standard requirement for distributed databases, so that is perfectly fine.

That is a pretty awesome thing to have, to be honest, and more importantly, this is happening at the low level storage layer. That means that we can apply this behavior not just to a single database solution, but to many database solutions.

I’m pretty excited about this.

Time series feature design: Replication

Armed with the knowledge about replication strategies from my previous post, we can now consider this in the context of the time series database.

We actually have two distinct pieces of data that we track. We have the actual time data, the timestamp and value that we keep track of, and we have the series information (tags, mostly).  We can consider using log shipping here, and that would give us a great way to get a read replica. But the question is why we would want to do that. It is nice to get a replica, but that replica would have to be read only. Is that useful?  It could take over, if the master is down, but that would mean that the master would have to stay down (or converted to a slave). And divergent writes are a problem.

While attractive as a cheap way to handle replication, I don’t like this very much.

So that leaves us with using a multi write partners situation. In this case, we can allow the two servers to operate in tandem. We need to have some way to resolve conflicts, and this is where things gets a bit messy.

For series data, it is trivial to just use some form of last write wins. This assumes a synchronized clock between the servers, but we’ll leave that requirement for now.

The problem is with the actual time data. Conceptually, we intend to store the information like this:


The problem is how do you detect conflicts. And are they really even possible. Let us assume that we want to update a particular value at time T on both servers. Server A replicates to server B, and now we need to decide how to deal with it. Ignore the value? Overwrite the value?

The important thing is that we need some predictable way to handle this that will end up with all the nodes in the cluster having the same agreed upon value. The simplest scenario, assuming a clock sync, is to use the write timestamp. But that would require us to keep the write time stamp. Currently we can use just 16 bytes for each time record. But recording the write timestamp will increase our usage to 24 bytes. That is a 50% increase just to handle conflicts. I don’t want to pay that.

The good thing about time series data is that a single value isn’t that important, and the likelihood that they will be changed it relatively small. We can just decide to say: We’ll choose a value, for example, we will choose the maximum value for that time, and be done with it. That has its own set of problems, but we’ll deal with that in a bit. We need to discuss how we deal with replication in general, first.

Let us imagine that we have 3 servers:

  • Server A replicates to B
  • Server B replicates to C
  • Server C replicates to A

We have concurrent writes to the same time value on both server A and B. For the purpose of the discussion, let us assume that we have a way to resolve the conflict.

Server A notifies Server B about the change, but server B already have a different value for that. Conflict resolution is run, and we have a new value .That value need to be replicated down stream. It goes to Server C, who then replicate it to Server A, who then replicates it to Server B? Ad infinitum?

I intentionally chose this example, but the same thing can happen with just two servers replicating to one another (master/master). And the problem here is that in order to be able to actually track this properly, we are going to need to keep a lot of metadata around, per value. While I can sort of accept the need to keep the write time (thus requiring 50% more space per value), the idea of holding many times more metadata for replication purposes than the actual data we want to replicate seems… silly at best.

Log shipping replication it is, at least until someone can come up with a better way to resolve the issues above.

On replication strategies, or the return of the long article

Meta note: I’ve been doing short series of blog posts for a while. I thought that this would be a good time to change. I am not sure how big this blog post is going to be, but it is going to be big. Please let me know about which approach you find better, and your reasoning.

I have been thinking about this quite a lot in the past few days. I am trying to see if there is a common solution to replication in general that we can utilize across a number of solutions. If we can do that, we can provide much better feature set for a wide variety of scenarios.

But before we can talk about how to actually implement replication, we need to talk about what type of replication we are talking about. We are assuming a single database (non sharded, running on multiple nodes). In general, there appears to be the following options:

  • Master / slaves
  • Primary / secondaries
  • Multi write partners
  • Multi master

Those are just designations that I’ll use for this series of blog posts. For the purpose of those posts, they are very different beast indeed.

The master/ slaves approach is talking specifically for a scenario where you have a single write master and one or more slaves. A key aspect of this strategy is that you can never (at least under normal operations) make any change whatsoever to the slaves. They are pure reads, and they can not be changed to become writeable without severing their ties to the master or risking data corruption.

A common example of such an approach is log shipping. I’ll discuss it in detail later on, but you can look at the docs for any such system, changing a slave to be writable is a decidedly non trivial process. And for a good reason.

The primary / secondaries mode is very similar to the master / slaves approach, however, here we have an explicit option for a secondary to become the primary. There can be only one primary, but the idea is that we allow a much easier way to switch the primary node. MongoDB uses such a system.

Multi write partners systems allow any node to accept a write, and it will take care of distributing the change to the other nodes. It also need, unlike the other options so far, to deal with conflicts. The ability of two users to write to the same value on multiple nodes at the same time. However, multi write partners usually make assumptions about their partners. For example, that they are relatively in sync, and that there is a separate protocol for bringing a new node online into the partnership that is outside the usual replication metric.

Multi master systems allow, accept and encourages nodes to come and go as they please, they assume that writes can and will conflict, and the need to resolve that on an ongoing basis. There are no expectations from the other nodes about being relatively in sync, and it is common to “re-seed” a new node by just starting replication to it, which means that you need to replicate all the data from the beginning of time to it. It is also common to have a node pop up once in a blue moon, expect to get all changes that happened while it was gone, and then drop off again.

Let us look at the actual implementation details of each, including some examples, and hopefully it’ll be clearer what I am talking about.


Log Shipping

Master / slaves is usually implemented via log shipping. The easiest way to think about log shipping is that the master database will send (magically, we don’t really care much how at this point) to the slaves instructions on how to directly modify the database files. In other words, conceptually, it is sending them the following:

   1: writep(fd1, 1024, new[]{ 17,85,124,13,86}, 5);
   2: writep(fd1, 18432, new[]{ 12,95,34,83,76,32,59}, 7);b

Those are very low level modifications, as you can imagine. The advantage here is that it is very easy to capture and replay those changes. The disadvantage is that you cannot really do anything else. Because the changes are happening at the very bottom of the stack, there is no chance to run any sort of logic. We are just writing to the file, same as the master server did.

This is the key reason why it is so hard for a slave to allow writes. The moment it makes any independent write, it opens itself up to the risk that the master would also do a write, that would generate data corruption. That is why you have to do the major song & dance if you want to switch the master & the slave. You have to go through all of this trouble to ensure that you don’t ever have a scenario where you have a write happening on both ends.

Once that happens, you can never ever get those two in sync again. It is just happening at too low a level.

Generating a new node, however, is very easy. Make sure to keep the journal around, do a full backup of the database and move it to another node. Then start shipping the logs over. Because they started at the same point, they can be safely applied.

Note that this version is very sensitive to versioning issues. You cannot have even a tiny change in the versions of working with the low level storage, because then all hell might break lose. This method is very good for generating read replicas. Indeed, this is what this is used for most of the time.

In theory, you can even get it to do failovers, because while the master is down, the slave can write. The problem is how do you handle a case where the slave think that the master is down, and the master think that everything is fine. At that point, you might have both of them accept writes, resulting in an unmergable situation.

In theory, since they share a common root, you can decide that one of them is the victor, and go with that, but that would result in losing data from the loser server, and probably data that you have no actual way of getting back. The changes we keep track of here are very small, and likely too granular to allow you to actually do something meaningful to extract the changed information.


This is actually quite similar to the log shipping method, but instead of sending the very low level file I/O operations, we’re actually sending higher level commands. This leads to a quite a few benefits as far as we are concerned. The primary server can send its log as:

   1: set("users/1", {"name": "oren" });
   2: set("users/2", {"name": "ayende" });
   3: del("users/1");

Executing this set of instruction on the secondary will result in identical state on the secondary.  Unlike Log Shipping option, this actually require the secondary server to perform work, so it is more expensive than just apply the already computed file updates.

However, the upside of this is that you can have a far more readable log. It is also much easier to turn a secondary into a primary. Mostly, this is silly. The actual operation is the exact same thing. But because you are working at the protocol level, rather than the file level. You can get some interesting benefits.

Let us assume that you have the same split brain issue, when both primary & secondary think that they are the primary. In the Log Shipping case, we had no way to reconcile the differences. In the case of Oplog, we can actually do this.  The key here is that we can:

  • Dump one of the servers rejected operations into a recoverable state.
  • Attempt to apply both severs logs, hoping that they didn’t both work on the same document.

This is the replication mode used by MongoDB. And it has chosen the first approach for handling such conflicts. Indeed, that is pretty much the only choice that it can safely make. Two servers making modifications to the same object is always going to require manual resolution, of course. And it is usually better to have to do this in advance and explicitly rather than “sometimes it works”.

You can see some discussion on how merging back divergent writes works in MongoDB here. In fact, continuing to use the same source, you can see the internal oplog in MongoDB here:

   1: // Operations
   3: > use test
   4: switched to db test
   5: > db.foo.insert({x:1})
   6: > db.foo.update({x:1}, {$set : {y:1}})
   7: > db.foo.update({x:2}, {$set : {y:1}}, true)
   8: > db.foo.remove({x:1})
  10: // Op log view
  12: > use local
  13: switched to db local
  14: > db.oplog.rs.find()
  15: { "ts" : { "t" : 1286821527000, "i" : 1 }, "h" : NumberLong(0), "op" : "n", "ns" : "", "o" : { "msg" : "initiating set" } }
  16: { "ts" : { "t" : 1286821977000, "i" : 1 }, "h" : NumberLong("1722870850266333201"), "op" : "i", "ns" : "test.foo", "o" : { "_id" : ObjectId("4cb35859007cc1f4f9f7f85d"), "x" : 1 } }
  17: { "ts" : { "t" : 1286821984000, "i" : 1 }, "h" : NumberLong("1633487572904743924"), "op" : "u", "ns" : "test.foo", "o2" : { "_id" : ObjectId("4cb35859007cc1f4f9f7f85d") }, "o" : { "$set" : { "y" : 1 } } }
  18: { "ts" : { "t" : 1286821993000, "i" : 1 }, "h" : NumberLong("5491114356580488109"), "op" : "i", "ns" : "test.foo", "o" : { "_id" : ObjectId("4cb3586928ce78a2245fbd57"), "x" : 2, "y" : 1 } }
  19: { "ts" : { "t" : 1286821996000, "i" : 1 }, "h" : NumberLong("243223472855067144"), "op" : "d", "ns" : "test.foo", "b" : true, "o" : { "_id" : ObjectId("4cb35859007cc1f4f9f7f85d") } }

You can actually see the chain on command to oplog entry. The upsert command in line 7 was turned into an insert in line 18, for example. There appears to also be a lot of work done to avoid having to do any sort of computable work, in favor of resolving things to a simple idempotent operation.

For example, if you have a doc that looks like {counter:1} and you do an update like {$inc:{counter:1}} on the primary, you’ll end up with {counter:2} and the oplog will store {$set:{counter:2}}. The secondaries will replicate that instead of the $inc.

That is pretty nice feature, since it mean that you can much apply changes multiple times and end with the same result. But it all leads to the end result, in which you can’t merge divergent writes.

You do get a much better approach for actually going over the data and doing the fixup yourself, but still.. I don’t really like it.

Multi write partners

In this mode, we have a set of servers, each of which is familiar with their partners. All the writes coming are accepted, and logged. Replication happen from the source server contacting all of the destination servers and asking them: What is the last you heard from me? Here are all of my changes since then. Critically, it is at this point that we can trim the log for all of the actions that were already replicated to all of the servers.

A server being down means that the log of changes to go there is going to increase in size until the partner is up again, or we remove the entry for that server from our replication destination.

So far, this is very similar to how you would structure an oplog. The major difference is how you structure the actual data you log. In the oplog scenario, you’re going to write the changes that happens to the system. And the only way to act on this is to actually apply the op log in the same sequence as it was generated. This leads to a system where you can always have just a single primary node. And that leads to situations when split brains will result in data loss or manual merge steps.

In MWP case, we are going to keep enough context (usually full objects) so that we can give the user a better option to resolve the conflict. This also gives us the option of replaying the log in non sequential manner.

Note, however, that you cannot just bring a new server online and expect it to start playing nicely. You have to start from a known state, usually a db backup of an existing node. Like the log shipping scenario, the process is essentially, start replicating (to the currently non existent server), that will ensure that the log will be there when we actually have the new server. Backup the database and restore on a secondary server. Configure to accept replication from the source server.

The complexities here are that you need to deal with operations that you might already have. That is why this is usually paired with vector clocks, so you can automatically resolve such conflicts. When you cannot resolve such conflicts, this falls down to manual user intervention.

Multi Master

Multi master systems are quite similar to multi write partners, but they are designed to operate independently. It is common for servers to be able communicate with one another only rarely. For example, a mobile system that is only able to get connected just a few hours a week. As such, we cannot just cap the size of the operations to replicate. In fact, the common way to bring a new server up to speed is just to replicate to it. That means that we need to be able to replicate, essentially from any point in the server history, to a new server.

That works great, as long as you don’t have deletes. Those do tend to make things harder, because you need to keep track of those, and replicate them. RavenDB and CouchDB are both multi master systems, for example. Conflicts works the same way, pretty much, and we use a vector clock to determine if a value is in conflict or not.


Divergent writes

I mentioned this a few times, but I didn’t fully explain. For my purposes, we assume that we are using 2 servers (and yes, I know all about quorums, etc. Not relevant for this discussion) running in master/slave mode.

At some point, the slave think that the master is down and takes over, and the master doesn’t notice this and still think it is the master. During this time, both server accept the following writes:

Server A Server B
write users/1 wrier users/2
write users/3 write users/3
delete users/4 delete users/5
delete users/6 write users/6
write users/7 delete all users
set all users to active write users/8

After those operation happen, we restore communication between the two servers and they need to decide how to resolve those changes

Getting down to business

Okay, that is enough talking about what those terms mean. Let us consider the implications of using them. Log shipping is by far the simplest method to use. Well, assuming that you actually have a log mechanism, but most dbs do. It is strictly one writer model, and there is absolutely no way to either resolve divergent writes or even to find out what they were. The good thing about log shipping is that it is quite easy to get this working without actually needing to care anything about the actual data involved. We work directly at the file level, we don’t care at all about what the data is. The problem is that we can’t even solve simple conflicts, like writes to the different objects. This is because we are actually working at the file level, and all the changes are there. Attempting to merge changes from multiple logs would likely result in file corruption. The up side is that it is probably the most efficient way to go about doing this.

Oplog is a step above log shipping, but not a big one. It doesn’t resolve the divergent writes issues. This is now an application level protocol. The log needs to contain information specific to the actual type of data that we store. And you need to write explicit code to handle this. That is nice, but it also require strict sequence of all operations. Now, you can try to merge things between different logs. However, you need to worry about conflicts, and more to the point, there is usually nothing in the data itself that will help you even detect conflicts.

Multi write partners are meant to take this up a notch. They do keep track of the version history (usually via vector clocks). Here, the situation is more complex, because we need to explicitly decide how to deal with conflicts (either resolve automatically or defer to user decision), but also how to handle distribution of updates. Usually they are paired with some form of logic that tells you how to direct your writes. So all writes for a particular piece of data would go to a preferred node, to avoid generating multiple versions. The data needs to contains some information about that, so we keep vector clock information around. Once we sent the changes to all our partners, we can drop them, saving in space.

Multi master is meant to ensure that you can have partners that might only see one another occasionally, and it makes no assumptions about the topology. It can handle a node that comes on, get some data replicated, and drop off for a while (or forever). Each node is fully independent, and while they would collaborate with others, they don’t need them. The downside here is that we need to keep track of some things forever. In particular, we need to keep track of deletes, to ensure that we can get them to the remote machines.

What about set operations?

Interesting enough, that is probably the hardest issue to resolve. Consider the case when you have the following operations happen:

Server A Server B
write users/7 delete all users
set all users to active write users/8 (active: false)

What should be the result of this? There isn’t a really good answer. Should users/8 be set to active: true? What about users/7, should it be deleted or kept?

It gets hard because you don’t have good choices. The hard part here is actually figuring out that you have a conflict. And there isn’t a really good way to handle set operations nicely with conflicts. The common solution is to translate this to the actual operations made (delete users/1,user/2, users/3 – writer users/8, users/5) and leave it at that. The set based operation is translated to the actual individual operations that actually happened. And on that we can detect conflicts much more easily.

Log shipping is easiest to work with, operationally speaking. You know what you get, and you deal with that. Oplog is also simple, you have a single master, and that works. Multi master and multi write partners requires you to take explicit notice of conflicts, selection of the appropriate node to reduce conflicts, etc.

In practice, at least in the scenarios relating to RavenDB, the ability to take a server offline for weeks or months doesn’t seem to be used that often. The common deployment model is of servers running as steady partners. There are some optimizations that you can do for multi write partners that are hard/impossible to do with multi master.

My current personal preference at this point would like to go with either log shipping or multi write master. I think that either one of them would be fairly simple to implement and support operationally. I think that I’ll discuss actual design for the time series topic using either option in my next posts.

Logging & Production systems

I got a question from Dominic about logging:

Jeff Atwood wrote a great blog post about over-using logging, where stack traces should be all a developer needs to find the root cause of a problem. Therefore ...

When building an enterprise level system, what rules do you have to deem a log message 'useful' to a developer or support staff?

This is the relevant part in Jeff’s post:

So is logging a giant waste of time? I'm sure some people will read about this far and draw that conclusion, no matter what else I write. I am not anti-logging. I am anti-abusive-logging. Like any other tool in your toolkit, when used properly and appropriately, it can help you create better programs. The problem with logging isn't the logging, per se -- it's the seductive OCD "just one more bit of data in the log" trap that programmers fall into when implementing logging. Logging gets a bad name because it's so often abused. It's a shame to end up with all this extra code generating volumes and volumes of logs that aren't helping anyone.

We've since removed all logging from Stack Overflow, relying exclusively on exception logging. Honestly, I don't miss it at all. I can't even think of a single time since then that I'd wished I'd had a giant verbose logfile to help me diagnose a problem.

I don’t really think that I can disagree with this more vehemently. This might be a valid approach if/when you are writing what is essentially a single threaded, single use, code. It just so happens that most web applications are actually composed of something like that. The request code very rarely does threading, and it is usually just dealing with its own stuff. For system where most of the code is actually doing ongoing work, there really isn’t any alternative to logging.

You cannot debug multi threaded code efficiently. The only way to really handle that is to do printf debugging. In which you write what happens, and then construct the actual execution from the traces. And that is leaving aside one very important issue. It isn’t the exceptions that will get you, it is when your system is subtly wrong. Maybe it missed an update, or skipped a validation, or something just doesn’t look right. And you need to figure out what is going on.

And then you have distributed system, when you might have things happening concurrently in multiple systems, and good luck trying to get a good grip about how to resolve problems without using logging.

Speaking of which, here is a reply to a customer from one of our developers:


There is absolutely no way this would have been found without logging. The actual client  visible issue happened quite a bit later than when the actual bug was, and no exception was thrown.

Of course, this is all just solving problems on the developer machine. When you go to production, the rules are different, usually the only thing that you have are the logs, and you need to be able to figure out what was wrong and how to fix it, when the system is running.

I find that I don’t really sweat Debug vs. Info and Warn vs. Error debates. The developers will write whatever they consider to be relevant on each case. And you might get Errors that show up in the logs that are error for that feature, but are merely warnings for the entire product.  I don’t care, all of that can be massages later, using log configuration & filtering. But the very first thing that has to happen is to have the logs. Most of the time you don’t care, but it is pretty much the same as in saying: “We removed all the defibrillators from the building, because they were expensive and took up space. We rely exclusively on CPR in the event of a heart failure. Honestly, I don’t miss it at all. I can’t think of a single time since then that I’d wished I’d a machine to send electricity into someone’s heart to solve a problem.”

When you’ll realize that you need it, it is likely going to be far too late.

Time series feature design: Querying over large data sets

What happens when you want to do an aggregation query over very large data set? Let us say that you have 1 million data points within the range you want to query, and you want to get a rollup of all the data in the range of a week.

As it turns out, there is a relatively simple solution for optimizing this and maintaining a relatively constant query speed, regardless of the actual data set size.

Time series data has the great benefit of being easily aggregated. Most often, the data looks like this:


The catch is that you have a lot of it.

The set of aggregation that you can do over the data is also relatively simple. You have mean, max, min, std deviation, etc.

The time ranges are also pretty fixed, and the nice thing about time series data is that the bigger the range you want to go over, the bigger your rollup window is. In other words, if you want to look at things over a week, you would probably use a day or hour rollup. If you want to see things over a month, you will use a week or a day, over a year, you’ll use a week or a month, etc.

Let us assume that the cost of aggregation is 10,000 operations per second (just some number I pulled because it is round and nice to work with, real number is likely several orders of magnitude higher). So if we have to run this over a set that is 1 million data points in size, with the data being entered on per minute basis. With 1 million data points, we are going to wait almost two minutes for the reply. But there is really no need to actually check all those data points manually.

What we can do is actually prepare, ahead of time, the rollups on an hourly basis. That gives us a summary on a per hour basis of just over 16 thousand data points, and will result in a query that runs in under 2 seconds. If we also do a daily rollup, we move from having a million data points to less than a thousand.

Actually maintaining those computed rollups would probably be a bit complex, but it won’t be any more complex than how we are computing map/reduce indexes in RavenDB (this is a private, and simplified, case of map/reduce). And the result would be instant query times, even on very large data sets.

Time series feature design: Scale out / high availability

I expect (at a bare minimum), to be able to do about 25,000 sustained writes per second on a single machine. I actually expect to be able to do significantly more. But let us go with that amount for now as something that if it drops below that value, we are in trouble. Over a day, that is 2.16 billion records. I don’t think this likely, but that is a lot of data to work with.

That leads to interesting questions on scale out story. Do we need one? Well, probably not for performance reasons (said the person who haven’t written the code, much less benchmarked it) at least not if my hope for the actual system performance comes about.

However, just writing the data isn’t very interesting, we also need to read and work with it. One of the really nice thing about time series data is that it is very easily sharded, because pretty much all the operations you have are naturally accumulative. You have too much data for a single server, go ahead and split it. On query time, you can easily merge it all back up again and be done with it.

However, a much more interesting scenario for us is the high availability / replication story. This is where things gets… interesting. With RavenDB, we do async server side replication. With time series data, we can do that (and we have the big advantage of not having to worry about conflicts), but the question is how.

The underlying mechanism for all the replication in RavenDB is the notion of etags, of a way to track the order of changes to the database. In time series data, that means tracking the changes that happens in some form of a sane fashion.

It means having to track, per write, at least another 16 bytes. (8 bytes for a monotonically increasing etag number, another 8 for the time that was changed). And we haven’t yet spoken of things like replication of deletes, replication of series tag changes, etc. I can tell you that dealing with replication of deletes in RavenDB, for example, was no picnic.

A much simpler alternative would be to not handle this at the data level. One nice thing about Voron is that we can actually do log shipping. That would trivially solve pretty much the entire problem set of replication, because it would happen at the storage layer, and take care of all of it.

However, it does come with its own set of issues. In particular, it means that the secondary server has to be read only, it cannot accept any writes. And by that I mean that it would physically reject them. That leads to a great scale out story with read replicas, but it means that you have to carefully consider how you are dealing the case of the primary server going down for a duration, or any master/master story.

I am not sure that I have any good ideas about this at this point in time. I would love to hear suggestions.

Time series feature design: User interface

We need one.  But that is pretty much all I have to say about it. Most of the ways you’ve to look at time series data end up something like this:

But before you can get here, you need to handle:

  • Looking at series
  • Inspecting raw series data
  • Tagging / searching series
  • Looking at the roll up values across different dates for different series
  • Delete data (full series or range of data)

Probably other stuff, but those are the things that I can think of.

Probably a good place to do a lot of graphs, too, just to let the users play with the data. But that is what I have in mind here so far.

Oh, and quite obviously, we want to be able to output everything to CSV so users can look at that in Excel, of course.

Time series feature design: Client API

We have gone over the system behavior, the wire protocol and how we actually store the data on disk. Now, let us talk about the actual client API. The entry point is going to the TimeSeries class, which will have the following behavior:

Stateless operations:

  • Queries:
    • timeSeries.Query(“sensor1.heat”, “sensor1.flow”)
         .Aggergation(AggergateBy.Max, AggergateBy.Min, AggergateBy.Mean);
    • timeSeries.SeriesBy(“temp:C”);
  • Operations:
    • timeSeries.Delete(“sensor1.heat”, start, end);
    • timeSeries.Tag(“sensor1.heat”, “temp:C”);

Those types of operations have no state, require nothing beyond just knowing where the server is located and can be immediately executed without requiring any external state. The returned results aren’t tracked or managed  by us in any way, so there is no need for a session. 

Stateful operation - The only stateful operation we have (at least so far) is adding data to the database. We do that using the connection abstraction. This is very close to the actual on the wire representation, which is always good. We have something like:

   1: using(var con = timeSeries.OpenConnection(waitForServerFlush: true))
   2: {
   3:     using(var series = con.AddToSeries("sensor1.heat"))
   4:     {
   5:         for(var i = 0; i < 100; i++) 
   6:         {
   7:             series.Add(time.AddMinutes(i), value + i);
   8:         }
   9:     }
  10: }

This is a bit of an awkward API, but it serves a purpose, it is very close to the way the on-wire format is, and it is optimized for performance, not for being nice.

We can also have:

con.Add(“sensor1.heat”, time, value);

But if you are mixing things up (add sensor1.heat, sensor1.flow and then sensor1.heat again, etc), it probably won’t be as efficient. (It is important to be able to expose those optimizations all the way from the disk to the wire to the client API. Most times, they don’t matter, which is why we have the higher level API, but when they do, they really do.

And… this is pretty much it.

The API will probably be an async one, to keep up with the times, but those are pretty much the high level things that we have here.

Time series feature design: System behavior

It is easy to forget that a database isn’t just about storing and retrieving data. A lot of work goes into the actual behavior of the system beyond the storage aspect.

In the case of time series database, just storing the data isn’t enough, we very rarely actually want to access all the data. If we have a sensor that send us a value once a minute, that comes to 43,200 data points per month. There is very little that we actually want to do for that amount of data. Usually we want to do things over some rollup of the data. For example, we might want to see the mean per day, or the standard deviation on a weakly basis, etc.

We might also want to do some down sampling. By that I mean that we take a series whose value is stored on a per minute / second basis and we want to store just the per day totals and delete the old data to save space.

The reason that I am using time series data for this series of blog posts is that there really isn’t all that much that you can do for a time series data, to be honest. You store it, aggregate over it, and… that is about it. Users might be able to do derivations on top of that, but that is out of scope for a database product.

Can you think about any other behaviors that the system needs to provide?

Time series feature design: The wire format

Now that we actually have the storage interface nailed down, let us talk about how we are going to actually expose this over the wire. The storage interface we defined is really low level and not something that I would really care to try working with if I was writing user level code.

Because we are in the business of creating databases, we want to allow this to run as a server, not just as a library. So we need to think about how we are going to expose this over the wire.

There are several options here. We can expose this as a ZeroMQ service, sending messages around. This looks really interesting, and would certainly be a great research to do, but it is a foreign concept for most people, and it would have to be exposed as such through any API we use, so I am afraid we won’t be doing that.

We can write a custom TCP protocol, but that has its own issues. Chief among which, just like ZeroMQ, we would have to deal, from scratch with things like:

  • Authentication
  • Tracing
  • Debugging
  • Hosting
  • Monitoring
  • Easily “hackable” format
  • Replayability

And probably a whole lot more on top of that.

I like building the wire interface over some form of HTTP interface (I intentionally don’t use the overly laden REST word). We get a lot of things for free, in particular, we get easy ability to do things like pipe stuff through Fiddler so we can easily debug them. That also influence decisions such as how to design the wire format itself (hint, human readable is good).

I want to be able to easily generate requests and see them in the browser. I want to be able to read the fiddler output and figure out what is going on

We actually have several different requirements that we need to do.

  • Enter data
  • Query data
  • Do operations

Entering data is probably the most interesting aspect. The reason is that we need to handle inserting many entries in as efficient a manner as possible. That means reduce the number of server round trips. At the same time, we want to keep other familiar aspects, such as the ability to easily predict when the data is “safe” and the storage transaction has been committed to disk.

Querying data and handling one off operations is actually much easier. We can handle it via a simple REST interface:

  • /timeseries/query?start=2013-01-01&end=2014-01-01&id=sensor1.heat&id=sensor1.flow&aggregation=avg&period=weekly
  • /timeseries/delete?id=sensor1.heat&id=sensor1.flow&start=2013-01-01&end=2014-01-01
  • /timeseries/tag?id=sensor1.heat&tag=temp:C

Nothing much to it, so far. We expose it as JSON endpoints, and that is… it.

For entering data, we need something a bit more elaborate. I chose to use websockets for this, mostly because I want two way communication and that pretty much force my hand. The reason that I want to use two way communication mechanism is that I want to enable the following story:

  • The client send the data to the server as fast as it is able to.
  • The server will aggregate the data and will decide, base on its own consideration, when to actually commit the data to disk.
  • The client need some way of knowing when the data it just sent is actually flushed so it can rely on it.
  • the connection should stay open (potentially for a long period of time) so we can keep sending new stuff in without having to create a new connection.

As far as the server is concerned, by the way, it will decide to flush the pending changes to disk whenever:

  • over 50 ms passed waiting for the next value to arrive, or
  • it has been at least 500 ms from the last flush, or
  • there are over 16,384 items pending, or
  • the client indicated that it wants to close the connection, or
  • the moon is full on a Tuesday

However, how does the server communicate that to the client?

The actual format we’ll use is something like this (binary, little endian format):

   1: enter-data-stream =  
   2:    data-packet+
   4: data-packet = 
   5:    series-name-len [4 bytes]
   6:    series-name [series-name-len, utf8]
   7:    entries
   8:    no-entries
  10: entries =
  11:    count [2 bytes]
  12:    entry[count]
  14: no-entries =
  15:    count [2 bytes] = 0
  17: entry =
  18:    date [8 bytes]   
  19:    value [8 bytes]

The basic idea is that this format should be pretty good in conveying just enough information to send the data we need, and at the same time be easy to parse and work with. I am not sure if this is a really good idea, because this might be premature optimization and we can just send the data as json strings. That would certainly be more human readable. I’ll probably go with that and leave the “optimized binary approach” for when/if we actually see a real need for it. The general idea is that computer time is much cheaper than human time, so it is worth the time to make things human readable.

The server keep tracks of the number of entries that were sent to it for each connection, and whenever it is flushing the buffered data to disk, it will send notification to that effect to the client. The client can then decide “okay, I can notify my caller that everything that I have sent has been properly saved to disk”. Or it can just ignore it (usual case if they have a long running connection).

And that is about it as far as the wire protocol is concerned. Next, we will take a look at some of the operations that we need.

Time series feature design: Storage

In this post, we are going to talk about the actual design for a time series feature, from a storage perspective. For the purpose of the discussion, I am going to assume the low level Voron as the underlying storage engine.

We are going to need to store the data in a series. And the actual data is going to be relatively ordered (time based, usually increasing).

As such, I am going to model the data itself as a set of Voron trees, one tree per each series that is created. The actual data in each tree would be composed of key and value, that are each 8 bytes long.

  • Key (8 bytes): 100-nanosecond interval between Jan 1, 0001 to Dec 31, 9999.
  • Val (8 bytes): 64 bits double precision floating point number.

There can be do duplicate values for a specific time in a series. Since this represent one ten millionth of a second, that should be granular enough, I think.

Reading from the storage will always be done on a series basis. The idea is to essentially use the code from this post, but to simplify to a smaller key.  The idea is that we can store roughly 250 data points in each Voron page. Which should give us some really good performance for both reads & writes.

Note that we need to provide storage API to do bulk writes, since there are some systems that would require it. Those can either be systems with high refresh rates (a whole set of sensors with very low refresh rates) or, more likely, import scenarios. The import scenario can be either a one time (moving to a new system), or something like a nightly batch process where we get the aggregated data from multiple sources.

For our purposes, we aren’t going to distinguish between the two.

We are going to provide and API similar to:

    • void Add(IEnumerable<Tuple<string,IEnumerable<Tuple<DateTime, double>>> data);

This ugly method can handle updates to multiple series at once. In human speak, this is an enumerable of updates to a series where each update is the time and value for that time. From the storage perspective, this creates a single transaction where all the changes are added at once, or not at all. It is the responsibility of higher level code to make sure that we optimize number of calls vs. in flight transaction data vs. size of transactions.

Adding data to a series that doesn’t exists will create it.

We also assume that series names is up to printable Unicode characters of up to 256 bytes (UTF8).

The read API is going to be composed of:

  • IEnumerable<Tuple<DateTime, double>> ScanRange(string series, DateTime start, DateTime end);
  • IEnumerable<Tuple<DateTime, double[]>> ScanRanges(string []series, DateTime start, DateTime end);

Anything else would have to be done at a higher level.

There is no facility for updates. You can just add again on the same time with a new value, and while this is supported, this isn’t something that is expected.

Deleting data can be done using:

  • void Delete(string series, DateTime start, DateTime end);
  • void DeleteAll(string series);

The first option will delete all the items in range. The second will delete the entire tree. The second is probably going to be much faster. We are probably better off checking to see if the max / min ranges for the tree are beyond the items for this series and falling to DeleteAll if we can. Explicit DeleteAll will also delete all the series tags. While implicit Delete(“series-1”, DateTime.MinValue, DateTime.MaxValue) for example will delete the series’ tree, but keep the series tags.

Series can have tags attached to it. Those can be any string up to 256 bytes (UTF8). By conventions, they would usually be in the form of “key:val”.

Series can be queried using:

  • IEnumerable<string> GetSeries(string start = null);
  • IEnumerable<string> GetSeriesBy(string tag, string start = null);
  • IEnumerable<string> GetSeriesTags(string series);

Series can be tagged using:

  • void TagSeries(string series, string []tags);

There can be no duplicate tags.

In summary, the interface we intend to use for storage would look roughly like the following:

public interface ITimeSeriesStorage
    void Add(IEnumerable<Tuple<string,IEnumerable<Tuple<DateTime, double>>> data);
    IEnumerable<Tuple<DateTime, double>> ScanRange(string series, DateTime start, DateTime end);
    IEnumerable<Tuple<DateTime, double[]>> ScanRanges(string []series, DateTime start, DateTime end);
    void Delete(string series, DateTime start, DateTime end);
    void DeleteAll(string series);
    IEnumerable<string> GetSeries(string start = null);
    IEnumerable<string> GetSeriesBy(string tag, string start = null);
    IEnumerable<string> GetSeriesTags(string series);
    void TagSeries(string series, string []tags);

Data sizes – assume 1 value per minute per series, that gives us an update rate of 1,440 updates per day or 525,600 per year. That means that for 100,000 sensors (not an uncommon amount) we need to deal with 52,560,000,000 data items per year. This would probably end up being just over 3 GB or so. Assuming 1 value per second, that gives us 86,400 values per day, 31,536,000 per year and 3,153,600,000,000 values per year for the 100,000 sensors will user about 184 GB or so. Those seems to be eminently reasonable values for the data size that we are talking about here. 

Next, we’ll discuss how this is all going to look like over the wire…

Time series feature design

I mentioned before that there is actually quite a lot that goes into a major feature, usually a lot more than would seem on the surface. I thought that this could serve as a pretty good example of a series of blog posts that we can use to show how we are actually sitting down and doing high level design of features in RavenDB.

I’ve actually already spoken, and we have seen an implementation of the backend, about time series data. But that was just the basic storage, and that was mostly to explore how to deal with Voron for more scenarios, to be honest.

Now, allow me to expand that into a full pledged feature design.

What we want: the ability to store time series data and work with it.

Time series data is a value (double) that is recorded at a specific time for a specific series. A good explanation on that can be found here: https://tempo-db.com/docs/modeling-time-series-data/

We have a bunch of sensors that report data, and we want to store it, and get it back, and get rollups of the information, and probably do more, but that is enough for now.

At first glance, we need to support the following operations:

  • Add data (time, value) to a series.
  • Add bulk data to a set of series.
  • Read data (time,value) from a series or a set of them in a given range.
  • Read rollup (avg, mean, max, min) per period (minutes, hour, day, month) in a given range for a set of series.

In the next few posts, I’ll discuss storage, network, client api, user interface and distribution design for this. I would truly love to hear from you about any ideas you have.

The cost of select count(*) from tbl

I was always somewhat baffled that something that was so obviously required so often can be so expensive with RDBMSes. When diving deep into B-Tree implementation I suddenly found it obvious why that is the case.

Let us look at the following B-Tree:

Now, if I asked you how many entries you had in this tree, you would have to visit each and every page. And that can be very expensive when you have large number of items.

But why can’t we just keep track of the total number ourselves? There is counting B-Trees that we can use that will give us the answer in O(1) time.

The answer to that is that it is too expensive to maintain this. Imagine a big tree, with 10 million records, and a fan out of 16 keys per page. That means that it has a depth of 6. Let us say that we want to add a new record. We can do that by modifying and saving the page. That would cost us 4KB of writes.

However, if we use a counting B-Tree, we have to update all of the parent pages. With a depth of 6, that means that instead of writing 4KB to disk we will have to write 24KB to disk. And probably have to do multiple seeks to actually do the write properly. That is an unacceptable performance cost, and the reason why you have to manually check all the pages when you want the full count.


Published at

Originally posted at

Comments (14)

Async API considerations

I have a method that does a whole bunch of work, then submit an async write to the disk. The question here is how should I design it?

In general, all of the method’s work is done, and the caller can continue doing whatever they want to do with they life, all locks are released, all the data structures are valid again, etc.

However, it isn’t until the async work complete successfully that we are actually done with the work.

I could do it like this:

   1: public async Task CommitAsync();

But this implies that you have to wait for the task to complete, which isn’t true. The commit function looks like this:

   1: {
   2:     // do work
   3:     return SubmitWriteAsync();
   4: }

So by the time you have the task from the method, all the work has been done. You might want to wait of the task to ensure that the data is actually durable on disk, but in many cases, you can get away with not waiting.

I am thinking about doing it like this:

   1: public Task Commit();

To make it explicit that there isn’t any async work done that you have to wait for.



Published at

Originally posted at

Comments (19)

re: Why You Should Never Use MongoDB

I was pointed at this blog post, and I thought that I would comment, from a RavenDB perspective.

TL;DR summary:

If you don’t know what how to tie your shoes, don’t run.

The actual details in the posts are fascinating, I’ve never heard about this Diaspora project. But to be perfectly honest, the problems that they run into has nothing to do with MongoDB or its features. They have a lot to do with a fundamental lack of understanding on how to model using a document database.

In particular, I actually winced in sympathetic pain when the author explained how they modeled a TV show.

They store the entire thing as a single document:

Hell, the author even talks about General Hospital, a show that has 50+ sessions and 12,000 episodes in the blog post. And at no point did they stop to think that this might not be such a good idea?

A much better model would be something like this:

  • Actors
    • [episode ids]
  • TV shows
    • [seasons ids]
  • Seasons
    • [review ids]
    • [episode ids]
  • Episodes
    • [actor ids]

Now, I am not settled in my own mind if it would be better to have a single season document per season, containing all episodes, reviews, etc. Or if it would be better to have separate episode documents.

What I do know is that having a single document that large is something that I want to avoid. And yes, this is a somewhat relational model. That is because what you are looking at is a list of independent aggregates that have different reasons to change.

Later on in the post the author talks about the problem when they wanted to show “all episodes by actor”, and they had to move to a relational database to do that. But that is like saying that if you stored all your actors information as plain names (without any ids), you would have hard time to handle them in a relational database.

Well, duh!

Now, as for the Disapora project issues.  They appear to have been doing some really silly things there. In particular, let us store store the all information multiple times:

You can see for example all the user information being duplicated all over the place.  Now, this is a distributed social network, but that doesn’t call for it to be stupid about it.  They should have used references instead of copying the data.

Now, to be fair, there are very good reasons why you’ll want to duplicate the data, when you want a point in time view of it. For example, if I commented on a post, I probably want my name in that post to remain frozen. It would certainly make things much easier all around. But if changing my email now requires that we’ll run some sort of a huge update operation on the entire database… well, it ain’t the db fault. You are doing it wrong.

Now, when you store references to other documents, you have a few options. If you are using RavenDB, you have Include support, so you can get the associated documents easily enough. If you are using mongo, you have an additional step in that you have to call $in(the ids), but that is about it.

I am sorry, but this is blaming the dancer blaming the floor.