Ayende @ Rahien

filter by tags archive

architecture (618) rss
bugs (451) rss
challanges (123) rss
community (381) rss
databases (481) rss
design (896) rss
development (647) rss
hibernating-practices (72) rss
miscellaneous (592) rss
performance (397) rss
programming (1093) rss
raven (1459) rss
ravendb.net (545) rss
reviews (184) rss

2025
- August (6)
- July (7)
- June (7)
- May (10)
- April (10)
- March (10)
- February (7)
- January (12)
2024
- December (3)
- November (2)
- October (1)
- September (3)
- August (5)
- July (10)
- June (4)
- May (6)
- April (2)
- March (8)
- February (2)
- January (14)
2023
- December (4)
- October (4)
- September (6)
- August (12)
- July (5)
- June (15)
- May (3)
- April (11)
- March (5)
- February (5)
- January (8)
2022
- December (5)
- November (7)
- October (7)
- September (9)
- August (10)
- July (15)
- June (12)
- May (9)
- April (14)
- March (15)
- February (13)
- January (16)
2021
- December (23)
- November (20)
- October (16)
- September (6)
- August (16)
- July (11)
- June (16)
- May (4)
- April (10)
- March (11)
- February (15)
- January (14)
2020
- December (10)
- November (13)
- October (15)
- September (6)
- August (9)
- July (9)
- June (17)
- May (15)
- April (14)
- March (21)
- February (16)
- January (13)
2019
- December (17)
- November (14)
- October (16)
- September (10)
- August (8)
- July (16)
- June (11)
- May (13)
- April (18)
- March (12)
- February (19)
- January (23)
2018
- December (15)
- November (14)
- October (19)
- September (18)
- August (23)
- July (20)
- June (20)
- May (23)
- April (15)
- March (23)
- February (19)
- January (23)
2017
- December (21)
- November (24)
- October (22)
- September (21)
- August (23)
- July (21)
- June (24)
- May (21)
- April (21)
- March (23)
- February (20)
- January (23)
2016
- December (17)
- November (18)
- October (22)
- September (18)
- August (23)
- July (22)
- June (17)
- May (24)
- April (16)
- March (16)
- February (21)
- January (21)
2015
- December (5)
- November (10)
- October (9)
- September (17)
- August (20)
- July (17)
- June (4)
- May (12)
- April (9)
- March (8)
- February (25)
- January (17)
2014
- December (22)
- November (19)
- October (21)
- September (37)
- August (24)
- July (23)
- June (13)
- May (19)
- April (24)
- March (23)
- February (21)
- January (24)
2013
- December (23)
- November (29)
- October (27)
- September (26)
- August (24)
- July (24)
- June (23)
- May (25)
- April (26)
- March (24)
- February (24)
- January (21)
2012
- December (19)
- November (22)
- October (27)
- September (24)
- August (30)
- July (23)
- June (25)
- May (23)
- April (25)
- March (25)
- February (28)
- January (24)
2011
- December (17)
- November (14)
- October (24)
- September (28)
- August (27)
- July (30)
- June (19)
- May (16)
- April (30)
- March (23)
- February (11)
- January (26)
2010
- December (29)
- November (28)
- October (35)
- September (33)
- August (44)
- July (17)
- June (20)
- May (53)
- April (29)
- March (35)
- February (33)
- January (36)
2009
- December (37)
- November (35)
- October (53)
- September (60)
- August (66)
- July (29)
- June (24)
- May (52)
- April (63)
- March (35)
- February (53)
- January (50)
2008
- December (58)
- November (65)
- October (46)
- September (48)
- August (96)
- July (87)
- June (45)
- May (51)
- April (52)
- March (70)
- February (43)
- January (49)
2007
- December (100)
- November (52)
- October (109)
- September (68)
- August (80)
- July (56)
- June (150)
- May (115)
- April (73)
- March (124)
- February (102)
- January (68)
2006
- December (95)
- November (53)
- October (120)
- September (57)
- August (88)
- July (54)
- June (103)
- May (89)
- April (84)
- March (143)
- February (78)
- January (64)
2005
- December (70)
- November (97)
- October (91)
- September (61)
- August (74)
- July (92)
- June (100)
- May (53)
- April (42)
- March (41)
- February (84)
- January (31)
2004
- December (49)
- November (26)
- October (26)
- September (6)
- April (10)

Couchbase vs RavenDB Performance at Rakuten Kobo Whitepaper

Feb 28 2014

Time series feature designThe Consensus has dRafted a decision

time to read 2 min | 317 words

Tweet Share Share 3 comments

Tags:

So, after reaching the conclusion that replication is going to be hard, I went back to the office and discussed those challenges and was in general pretty annoyed by it. Then Michael made a really interesting suggestion. Why not put it on RAFT?

And once he explained what he meant, I really couldn’t hold my excitement. We now have a major feature for 4.0. But before I get excited about that (we’ll only be able to actually start working on that in a few months, anyway), let us talk about what the actual suggestion was.

Raft is a consensus algorithm. It allows a distributed set of computers to arrive into a mutually agreed upon set of sequential log records. Hm… I wonder where else we can find sequential log records, and yes, I am looking at you Voron.Journal.

The basic idea is that we can take the concept of log shipping, but instead of having a single master/slave relationship, we change things so we can put Raft in the middle. When committing a transaction, we’ll hold off committing the transaction until we have a Raft consensus that it should be committed. The advantage here is that we won’t be constrained any longer by the master/slave issue. If there is a server down, we can still process requests (maybe need to elect a new cluster leader, but that is about it).

That means that from an architectural standpoint, we’ll have the ability to process write requests for any quorum (N/2+1). That is a pretty standard requirement for distributed databases, so that is perfectly fine.

That is a pretty awesome thing to have, to be honest, and more importantly, this is happening at the low level storage layer. That means that we can apply this behavior not just to a single database solution, but to many database solutions.

I’m pretty excited about this.

Feb 27 2014

Shiny features in the depth: New index optimization

time to read 2 min | 305 words

Tweet Share Share 8 comments

Tags:

raven

One of the nice features in RavenDB 3.0 is optimizing the process of creating a new index. In particular, we want to optimize it when you create a new index on a small collection in a large database.

If you have a small database, you don’t care, it is going to complete quickly anyway. If you are creating an index on a collection that compose a significant amount of the documents in the database, you don’t care, you are going to have to do a lot of work anyway. But the common case for a big database is that you usually have one very big collection, and much smaller collections for everything else.

In RavenDB 2.x, you still have to pay the full price for indexing everything, but that isn’t the case in RavenDB 3.0. What we have done is to effectively optimize the process so that in this case, we will preload all of the documents taking part in the relevant collection, and send them directly to be indexed.

We do this by utilizing the Raven/DocumentsByEntityName index. Which has already indexed everything in the database anyway. This is a nice little feature, because it allows us to really take advantage of the work we already did long ago. Using one index to pre-populate another is a neat trick, and one that I am very happy about.

Because this is a new code path, it also means that it is actually executed outside of standard indexing. And that in turn means that adding a new index will not impact other indexes at all.

This is a small feature, but it does address a common pain point our users have with working in RavenDB in production.

Reminder, we have our upcoming RavenDB Conference in April, where we’ll discuss other stuff in the 3.0 release.

Feb 26 2014

RavenDB 3.0 Status Update

time to read 7 min | 1260 words

Tweet Share Share 9 comments

Tags:

raven

We are gearing up for the RavenDB Conference in April and we just released a private alpha preview sneak peek to a few external people. But we have been working on RavenDB 3.0 for the past 18 months or so, some bits in it are actually dated from 2011(!) that we only now are able to actually put into production. That is a lot of work that is going on and it is easy to actually get lost in what is going on there.

So, without further ado, here are the major highlights of RavenDB 3.0…

Indexing

Yes, we did more work on improving indexing performance. But that is actually secondary. What we really focused on this release are operational indexing concerns.

What does this means? Well, to start with, internally we don’t use the index name any longer. That means that we can do silly things like make an index delete async. Users that have large indexes, especially map/reduce or indexes with LoadDocument calls found that deleting an index can take a very long while. Now this is no longer the case, we are now able to immediately delete the index, and actually do the cleanup in the background.

For that matter, another operational concern people has is the introduction of new indexes. Especially if the index in question covers just small part of a very large database, that used to take a very long time. The index would have to go through all the documents in the database to complete indexing. Now, we are able to make several optimizations that means that we can take just the relevant documents, and complete indexing much more quickly in this scenario.

Introducing a new index would split it entirely from the other indexes while it is running, so you won’t have to deal with a new index slowing down other indexes. And neither will big deletes impact indexing performance in this manner any longer, we have much better interleaving of that now.

Still on an operational bent, we now have much better reporting on what is actually being indexed. You can see what the indexes are doing, and act accordingly.

From a development point of view, we have added some nice things. The ability to index an attachment, so you can index big text without it residing in documents. We have also added some better situational awareness in the indexing code. We have some people who were doing… funny things there. Indexes that were producing hundreds of index entries per every document they indexed. Then we had to deal with the associated performance problems. We now can detect and warn about that, and we let the user specific the valid limits on a per index case.

There are other stuff in indexing, but I want to go over the rest of what we did for 3.0…

RavenFS

RavenFS was created because RavenDB’s attachments are nice, but they aren’t nearly good enough. We have a lot of users that want to use attachments, then they find that they can’t see them, search on them in meaningful ways, etc. More importantly, they are limited because we intended them to be of relatively small size. Users really asked us for something better, and RavenFS is the answer.

We are talking about a replicated file system, which supports very large files as well as all the facilities to work and manage them. It was explicitly designed to be geo-distributed and it can drastically decrease the network load on systems that need to share very large files that change frequently.

I’m going to talk about RavenFS quite a bit in my keynote in the RavenDB Conference, but I think that it is really cool and there are some very nice use cases for it. Just for fun, it has been used in production by several customers for the past 2 years, so we already have some great experience with that.

JVM Client API

A fully functional JVM client opens us for more fronts with regards to who can make use of RavenDB. We already have people building applications using that. And we intend to have more clients for additional platforms after 3.0 is shipped.

Internal changes

There have been a lot of that, actually. But the one of the most important ones is that we are now hosting RavenDB on top of OWIN and Web API. The change, from our own HTTP server, was done in order to help our users have a better foundation to understand how RavenDB works internally, and to encourage contributions. It also allow us to do some nice things, like have an end point that documents all the endpoints in a RavenDB server.

Another important change is that we are moving away from the Silverlight studio in favor of a brand new HTML5 studio. That is quite exciting, especially because the performance and responsiveness of the system as a whole become so much better. And it doesn’t hurt that we don’t have to deal with the complexities of Silverlight.

Voron

This one probably got the most attention, but hopefully it will be least noticed once you actually get the bits. Voron is our new storage engine. In fact, it is more than that, it is a new way for us to store data, which we use in RavenDB to store our transactional data. The reason it is more than that is that it isn’t limited to what RavenDB currently needs to do. It can do much more, and we are already in the process of doing quite a lot more with it than one might suspect from outside. I’ll leave that for later, and just talk about what we have already done.

RavenDB now have the option of running on Voron. In fact, we have tested it with the entire RavenDB test suite (over 3,000 tests) and it passes with flying colors.

Voron will allow us to run on Linux, at some point, but it is a lot more important that it allows us to very carefully tune our storage usage and get a much better appreciation for how we are actually doing things. We expect to be able to do some really nice things with it, and it has already shown itself to be competitive with regards to performance against Esent.

Operations

There is always that, isn’t it. And there is a reason why is is last, but never least, in this list. (Try to say that repeatedly, fast Smile ).

We kicked off performance counters, which caused no end of operational headaches (corrupted counters, permission issues, hanging, etc) in favor of an internal metrics library. Because it is internal, we are able to add a lot more metrics and a lot more meaningful metrics to the system.

We have new endpoints that expose even more internal states. We improved periodic backup support so it would be much nicer to work with (we now allow to define: full export every week, periodic export every day). There are quite a few goodies there available for the ops people to get insight into what is going on.

And… those are the highlights, and the code aspect of things.

We have a team of about a dozen people working on RavenDB at this time, and we keep growing. This is quite exciting, and I’m really looking forward to getting to meet our users in the conference…

Here is a hint, there are going to be surprises…

Feb 25 2014

Time series feature designReplication

time to read 4 min | 706 words

Tweet Share Share 0 comments

Tags:

Armed with the knowledge about replication strategies from my previous post, we can now consider this in the context of the time series database.

We actually have two distinct pieces of data that we track. We have the actual time data, the timestamp and value that we keep track of, and we have the series information (tags, mostly). We can consider using log shipping here, and that would give us a great way to get a read replica. But the question is why we would want to do that. It is nice to get a replica, but that replica would have to be read only. Is that useful? It could take over, if the master is down, but that would mean that the master would have to stay down (or converted to a slave). And divergent writes are a problem.

While attractive as a cheap way to handle replication, I don’t like this very much.

So that leaves us with using a multi write partners situation. In this case, we can allow the two servers to operate in tandem. We need to have some way to resolve conflicts, and this is where things gets a bit messy.

For series data, it is trivial to just use some form of last write wins. This assumes a synchronized clock between the servers, but we’ll leave that requirement for now.

The problem is with the actual time data. Conceptually, we intend to store the information like this:

The problem is how do you detect conflicts. And are they really even possible. Let us assume that we want to update a particular value at time T on both servers. Server A replicates to server B, and now we need to decide how to deal with it. Ignore the value? Overwrite the value?

The important thing is that we need some predictable way to handle this that will end up with all the nodes in the cluster having the same agreed upon value. The simplest scenario, assuming a clock sync, is to use the write timestamp. But that would require us to keep the write time stamp. Currently we can use just 16 bytes for each time record. But recording the write timestamp will increase our usage to 24 bytes. That is a 50% increase just to handle conflicts. I don’t want to pay that.

The good thing about time series data is that a single value isn’t that important, and the likelihood that they will be changed it relatively small. We can just decide to say: We’ll choose a value, for example, we will choose the maximum value for that time, and be done with it. That has its own set of problems, but we’ll deal with that in a bit. We need to discuss how we deal with replication in general, first.

Let us imagine that we have 3 servers:

Server A replicates to B
Server B replicates to C
Server C replicates to A

We have concurrent writes to the same time value on both server A and B. For the purpose of the discussion, let us assume that we have a way to resolve the conflict.

Server A notifies Server B about the change, but server B already have a different value for that. Conflict resolution is run, and we have a new value .That value need to be replicated down stream. It goes to Server C, who then replicate it to Server A, who then replicates it to Server B? Ad infinitum?

I intentionally chose this example, but the same thing can happen with just two servers replicating to one another (master/master). And the problem here is that in order to be able to actually track this properly, we are going to need to keep a lot of metadata around, per value. While I can sort of accept the need to keep the write time (thus requiring 50% more space per value), the idea of holding many times more metadata for replication purposes than the actual data we want to replicate seems… silly at best.

Log shipping replication it is, at least until someone can come up with a better way to resolve the issues above.

Feb 24 2014

On replication strategies, or the return of the long article

time to read 24 min | 4624 words

Tweet Share Share 0 comments

Tags:

Meta note: I’ve been doing short series of blog posts for a while. I thought that this would be a good time to change. I am not sure how big this blog post is going to be, but it is going to be big. Please let me know about which approach you find better, and your reasoning.

I have been thinking about this quite a lot in the past few days. I am trying to see if there is a common solution to replication in general that we can utilize across a number of solutions. If we can do that, we can provide much better feature set for a wide variety of scenarios.

But before we can talk about how to actually implement replication, we need to talk about what type of replication we are talking about. We are assuming a single database (non sharded, running on multiple nodes). In general, there appears to be the following options:

Master / slaves
Primary / secondaries
Multi write partners
Multi master

Those are just designations that I’ll use for this series of blog posts. For the purpose of those posts, they are very different beast indeed.

The master/ slaves approach is talking specifically for a scenario where you have a single write master and one or more slaves. A key aspect of this strategy is that you can never (at least under normal operations) make any change whatsoever to the slaves. They are pure reads, and they can not be changed to become writeable without severing their ties to the master or risking data corruption.

A common example of such an approach is log shipping. I’ll discuss it in detail later on, but you can look at the docs for any such system, changing a slave to be writable is a decidedly non trivial process. And for a good reason.

The primary / secondaries mode is very similar to the master / slaves approach, however, here we have an explicit option for a secondary to become the primary. There can be only one primary, but the idea is that we allow a much easier way to switch the primary node. MongoDB uses such a system.

Multi write partners systems allow any node to accept a write, and it will take care of distributing the change to the other nodes. It also need, unlike the other options so far, to deal with conflicts. The ability of two users to write to the same value on multiple nodes at the same time. However, multi write partners usually make assumptions about their partners. For example, that they are relatively in sync, and that there is a separate protocol for bringing a new node online into the partnership that is outside the usual replication metric.

Multi master systems allow, accept and encourages nodes to come and go as they please, they assume that writes can and will conflict, and the need to resolve that on an ongoing basis. There are no expectations from the other nodes about being relatively in sync, and it is common to “re-seed” a new node by just starting replication to it, which means that you need to replicate all the data from the beginning of time to it. It is also common to have a node pop up once in a blue moon, expect to get all changes that happened while it was gone, and then drop off again.

Let us look at the actual implementation details of each, including some examples, and hopefully it’ll be clearer what I am talking about.

Log Shipping

Master / slaves is usually implemented via log shipping. The easiest way to think about log shipping is that the master database will send (magically, we don’t really care much how at this point) to the slaves instructions on how to directly modify the database files. In other words, conceptually, it is sending them the following:

   1: writep(fd1, 1024, new[]{ 17,85,124,13,86}, 5);

   2: writep(fd1, 18432, new[]{ 12,95,34,83,76,32,59}, 7);b

Those are very low level modifications, as you can imagine. The advantage here is that it is very easy to capture and replay those changes. The disadvantage is that you cannot really do anything else. Because the changes are happening at the very bottom of the stack, there is no chance to run any sort of logic. We are just writing to the file, same as the master server did.

This is the key reason why it is so hard for a slave to allow writes. The moment it makes any independent write, it opens itself up to the risk that the master would also do a write, that would generate data corruption. That is why you have to do the major song & dance if you want to switch the master & the slave. You have to go through all of this trouble to ensure that you don’t ever have a scenario where you have a write happening on both ends.

Once that happens, you can never ever get those two in sync again. It is just happening at too low a level.

Generating a new node, however, is very easy. Make sure to keep the journal around, do a full backup of the database and move it to another node. Then start shipping the logs over. Because they started at the same point, they can be safely applied.

Note that this version is very sensitive to versioning issues. You cannot have even a tiny change in the versions of working with the low level storage, because then all hell might break lose. This method is very good for generating read replicas. Indeed, this is what this is used for most of the time.

In theory, you can even get it to do failovers, because while the master is down, the slave can write. The problem is how do you handle a case where the slave think that the master is down, and the master think that everything is fine. At that point, you might have both of them accept writes, resulting in an unmergable situation.

In theory, since they share a common root, you can decide that one of them is the victor, and go with that, but that would result in losing data from the loser server, and probably data that you have no actual way of getting back. The changes we keep track of here are very small, and likely too granular to allow you to actually do something meaningful to extract the changed information.

Oplog

This is actually quite similar to the log shipping method, but instead of sending the very low level file I/O operations, we’re actually sending higher level commands. This leads to a quite a few benefits as far as we are concerned. The primary server can send its log as:

   1: set("users/1", {"name": "oren" });

   2: set("users/2", {"name": "ayende" });

   3: del("users/1");

Executing this set of instruction on the secondary will result in identical state on the secondary. Unlike Log Shipping option, this actually require the secondary server to perform work, so it is more expensive than just apply the already computed file updates.

However, the upside of this is that you can have a far more readable log. It is also much easier to turn a secondary into a primary. Mostly, this is silly. The actual operation is the exact same thing. But because you are working at the protocol level, rather than the file level. You can get some interesting benefits.

Let us assume that you have the same split brain issue, when both primary & secondary think that they are the primary. In the Log Shipping case, we had no way to reconcile the differences. In the case of Oplog, we can actually do this. The key here is that we can:

Dump one of the servers rejected operations into a recoverable state.
Attempt to apply both severs logs, hoping that they didn’t both work on the same document.

This is the replication mode used by MongoDB. And it has chosen the first approach for handling such conflicts. Indeed, that is pretty much the only choice that it can safely make. Two servers making modifications to the same object is always going to require manual resolution, of course. And it is usually better to have to do this in advance and explicitly rather than “sometimes it works”.

You can see some discussion on how merging back divergent writes works in MongoDB here. In fact, continuing to use the same source, you can see the internal oplog in MongoDB here:

   1: // Operations

2:

   3: > use test

   4: switched to db test

   5: > db.foo.insert({x:1})

   6: > db.foo.update({x:1}, {$set : {y:1}})

   7: > db.foo.update({x:2}, {$set : {y:1}}, true)

   8: > db.foo.remove({x:1})

9:

  10: // Op log view

11:

  12: > use local

  13: switched to db local

  14: > db.oplog.rs.find()

  15: { "ts" : { "t" : 1286821527000, "i" : 1 }, "h" : NumberLong(0), "op" : "n", "ns" : "", "o" : { "msg" : "initiating set" } }

  16: { "ts" : { "t" : 1286821977000, "i" : 1 }, "h" : NumberLong("1722870850266333201"), "op" : "i", "ns" : "test.foo", "o" : { "_id" : ObjectId("4cb35859007cc1f4f9f7f85d"), "x" : 1 } }

  17: { "ts" : { "t" : 1286821984000, "i" : 1 }, "h" : NumberLong("1633487572904743924"), "op" : "u", "ns" : "test.foo", "o2" : { "_id" : ObjectId("4cb35859007cc1f4f9f7f85d") }, "o" : { "$set" : { "y" : 1 } } }

  18: { "ts" : { "t" : 1286821993000, "i" : 1 }, "h" : NumberLong("5491114356580488109"), "op" : "i", "ns" : "test.foo", "o" : { "_id" : ObjectId("4cb3586928ce78a2245fbd57"), "x" : 2, "y" : 1 } }

  19: { "ts" : { "t" : 1286821996000, "i" : 1 }, "h" : NumberLong("243223472855067144"), "op" : "d", "ns" : "test.foo", "b" : true, "o" : { "_id" : ObjectId("4cb35859007cc1f4f9f7f85d") } }

You can actually see the chain on command to oplog entry. The upsert command in line 7 was turned into an insert in line 18, for example. There appears to also be a lot of work done to avoid having to do any sort of computable work, in favor of resolving things to a simple idempotent operation.

For example, if you have a doc that looks like {counter:1} and you do an update like {$inc:{counter:1}} on the primary, you’ll end up with {counter:2} and the oplog will store {$set:{counter:2}}. The secondaries will replicate that instead of the $inc.

That is pretty nice feature, since it mean that you can much apply changes multiple times and end with the same result. But it all leads to the end result, in which you can’t merge divergent writes.

You do get a much better approach for actually going over the data and doing the fixup yourself, but still.. I don’t really like it.

Multi write partners

In this mode, we have a set of servers, each of which is familiar with their partners. All the writes coming are accepted, and logged. Replication happen from the source server contacting all of the destination servers and asking them: What is the last you heard from me? Here are all of my changes since then. Critically, it is at this point that we can trim the log for all of the actions that were already replicated to all of the servers.

A server being down means that the log of changes to go there is going to increase in size until the partner is up again, or we remove the entry for that server from our replication destination.

So far, this is very similar to how you would structure an oplog. The major difference is how you structure the actual data you log. In the oplog scenario, you’re going to write the changes that happens to the system. And the only way to act on this is to actually apply the op log in the same sequence as it was generated. This leads to a system where you can always have just a single primary node. And that leads to situations when split brains will result in data loss or manual merge steps.

In MWP case, we are going to keep enough context (usually full objects) so that we can give the user a better option to resolve the conflict. This also gives us the option of replaying the log in non sequential manner.

Note, however, that you cannot just bring a new server online and expect it to start playing nicely. You have to start from a known state, usually a db backup of an existing node. Like the log shipping scenario, the process is essentially, start replicating (to the currently non existent server), that will ensure that the log will be there when we actually have the new server. Backup the database and restore on a secondary server. Configure to accept replication from the source server.

The complexities here are that you need to deal with operations that you might already have. That is why this is usually paired with vector clocks, so you can automatically resolve such conflicts. When you cannot resolve such conflicts, this falls down to manual user intervention.

Multi Master

Multi master systems are quite similar to multi write partners, but they are designed to operate independently. It is common for servers to be able communicate with one another only rarely. For example, a mobile system that is only able to get connected just a few hours a week. As such, we cannot just cap the size of the operations to replicate. In fact, the common way to bring a new server up to speed is just to replicate to it. That means that we need to be able to replicate, essentially from any point in the server history, to a new server.

That works great, as long as you don’t have deletes. Those do tend to make things harder, because you need to keep track of those, and replicate them. RavenDB and CouchDB are both multi master systems, for example. Conflicts works the same way, pretty much, and we use a vector clock to determine if a value is in conflict or not.

Divergent writes

I mentioned this a few times, but I didn’t fully explain. For my purposes, we assume that we are using 2 servers (and yes, I know all about quorums, etc. Not relevant for this discussion) running in master/slave mode.

At some point, the slave think that the master is down and takes over, and the master doesn’t notice this and still think it is the master. During this time, both server accept the following writes:

Server A Server B

write users/1 wrier users/2

write users/3 write users/3

delete users/4 delete users/5

delete users/6 write users/6

write users/7 delete all users

set all users to active write users/8

After those operation happen, we restore communication between the two servers and they need to decide how to resolve those changes

Getting down to business

Okay, that is enough talking about what those terms mean. Let us consider the implications of using them. Log shipping is by far the simplest method to use. Well, assuming that you actually have a log mechanism, but most dbs do. It is strictly one writer model, and there is absolutely no way to either resolve divergent writes or even to find out what they were. The good thing about log shipping is that it is quite easy to get this working without actually needing to care anything about the actual data involved. We work directly at the file level, we don’t care at all about what the data is. The problem is that we can’t even solve simple conflicts, like writes to the different objects. This is because we are actually working at the file level, and all the changes are there. Attempting to merge changes from multiple logs would likely result in file corruption. The up side is that it is probably the most efficient way to go about doing this.

Oplog is a step above log shipping, but not a big one. It doesn’t resolve the divergent writes issues. This is now an application level protocol. The log needs to contain information specific to the actual type of data that we store. And you need to write explicit code to handle this. That is nice, but it also require strict sequence of all operations. Now, you can try to merge things between different logs. However, you need to worry about conflicts, and more to the point, there is usually nothing in the data itself that will help you even detect conflicts.

Multi write partners are meant to take this up a notch. They do keep track of the version history (usually via vector clocks). Here, the situation is more complex, because we need to explicitly decide how to deal with conflicts (either resolve automatically or defer to user decision), but also how to handle distribution of updates. Usually they are paired with some form of logic that tells you how to direct your writes. So all writes for a particular piece of data would go to a preferred node, to avoid generating multiple versions. The data needs to contains some information about that, so we keep vector clock information around. Once we sent the changes to all our partners, we can drop them, saving in space.

Multi master is meant to ensure that you can have partners that might only see one another occasionally, and it makes no assumptions about the topology. It can handle a node that comes on, get some data replicated, and drop off for a while (or forever). Each node is fully independent, and while they would collaborate with others, they don’t need them. The downside here is that we need to keep track of some things forever. In particular, we need to keep track of deletes, to ensure that we can get them to the remote machines.

What about set operations?

Interesting enough, that is probably the hardest issue to resolve. Consider the case when you have the following operations happen:

Server A Server B

write users/7 delete all users

set all users to active write users/8 (active: false)

What should be the result of this? There isn’t a really good answer. Should users/8 be set to active: true? What about users/7, should it be deleted or kept?

It gets hard because you don’t have good choices. The hard part here is actually figuring out that you have a conflict. And there isn’t a really good way to handle set operations nicely with conflicts. The common solution is to translate this to the actual operations made (delete users/1,user/2, users/3 – writer users/8, users/5) and leave it at that. The set based operation is translated to the actual individual operations that actually happened. And on that we can detect conflicts much more easily.

Log shipping is easiest to work with, operationally speaking. You know what you get, and you deal with that. Oplog is also simple, you have a single master, and that works. Multi master and multi write partners requires you to take explicit notice of conflicts, selection of the appropriate node to reduce conflicts, etc.

In practice, at least in the scenarios relating to RavenDB, the ability to take a server offline for weeks or months doesn’t seem to be used that often. The common deployment model is of servers running as steady partners. There are some optimizations that you can do for multi write partners that are hard/impossible to do with multi master.

My current personal preference at this point would like to go with either log shipping or multi write master. I think that either one of them would be fairly simple to implement and support operationally. I think that I’ll discuss actual design for the time series topic using either option in my next posts.

Feb 21 2014

Logging & Production systems

time to read 4 min | 735 words

Tweet Share Share 2 comments

Tags:

I got a question from Dominic about logging:

Jeff Atwood wrote a great blog post about over-using logging, where stack traces should be all a developer needs to find the root cause of a problem. Therefore ...
When building an enterprise level system, what rules do you have to deem a log message 'useful' to a developer or support staff?

This is the relevant part in Jeff’s post:

So is logging a giant waste of time? I'm sure some people will read about this far and draw that conclusion, no matter what else I write. I am not anti-logging. I am anti-abusive-logging. Like any other tool in your toolkit, when used properly and appropriately, it can help you create better programs. The problem with logging isn't the logging, per se -- it's the seductive OCD "just one more bit of data in the log" trap that programmers fall into when implementing logging. Logging gets a bad name because it's so often abused. It's a shame to end up with all this extra code generating volumes and volumes of logs that aren't helping anyone.
We've since removed all logging from Stack Overflow, relying exclusively on exception logging. Honestly, I don't miss it at all. I can't even think of a single time since then that I'd wished I'd had a giant verbose logfile to help me diagnose a problem.

I don’t really think that I can disagree with this more vehemently. This might be a valid approach if/when you are writing what is essentially a single threaded, single use, code. It just so happens that most web applications are actually composed of something like that. The request code very rarely does threading, and it is usually just dealing with its own stuff. For system where most of the code is actually doing ongoing work, there really isn’t any alternative to logging.

You cannot debug multi threaded code efficiently. The only way to really handle that is to do printf debugging. In which you write what happens, and then construct the actual execution from the traces. And that is leaving aside one very important issue. It isn’t the exceptions that will get you, it is when your system is subtly wrong. Maybe it missed an update, or skipped a validation, or something just doesn’t look right. And you need to figure out what is going on.

And then you have distributed system, when you might have things happening concurrently in multiple systems, and good luck trying to get a good grip about how to resolve problems without using logging.

Speaking of which, here is a reply to a customer from one of our developers:

There is absolutely no way this would have been found without logging. The actual client visible issue happened quite a bit later than when the actual bug was, and no exception was thrown.

Of course, this is all just solving problems on the developer machine. When you go to production, the rules are different, usually the only thing that you have are the logs, and you need to be able to figure out what was wrong and how to fix it, when the system is running.

I find that I don’t really sweat Debug vs. Info and Warn vs. Error debates. The developers will write whatever they consider to be relevant on each case. And you might get Errors that show up in the logs that are error for that feature, but are merely warnings for the entire product. I don’t care, all of that can be massages later, using log configuration & filtering. But the very first thing that has to happen is to have the logs. Most of the time you don’t care, but it is pretty much the same as in saying: “We removed all the defibrillators from the building, because they were expensive and took up space. We rely exclusively on CPR in the event of a heart failure. Honestly, I don’t miss it at all. I can’t think of a single time since then that I’d wished I’d a machine to send electricity into someone’s heart to solve a problem.”

When you’ll realize that you need it, it is likely going to be far too late.

Feb 20 2014

Time series feature designQuerying over large data sets

time to read 3 min | 447 words

Tweet Share Share 0 comments

Tags:

What happens when you want to do an aggregation query over very large data set? Let us say that you have 1 million data points within the range you want to query, and you want to get a rollup of all the data in the range of a week.

As it turns out, there is a relatively simple solution for optimizing this and maintaining a relatively constant query speed, regardless of the actual data set size.

Time series data has the great benefit of being easily aggregated. Most often, the data looks like this:

The catch is that you have a lot of it.

The set of aggregation that you can do over the data is also relatively simple. You have mean, max, min, std deviation, etc.

The time ranges are also pretty fixed, and the nice thing about time series data is that the bigger the range you want to go over, the bigger your rollup window is. In other words, if you want to look at things over a week, you would probably use a day or hour rollup. If you want to see things over a month, you will use a week or a day, over a year, you’ll use a week or a month, etc.

Let us assume that the cost of aggregation is 10,000 operations per second (just some number I pulled because it is round and nice to work with, real number is likely several orders of magnitude higher). So if we have to run this over a set that is 1 million data points in size, with the data being entered on per minute basis. With 1 million data points, we are going to wait almost two minutes for the reply. But there is really no need to actually check all those data points manually.

What we can do is actually prepare, ahead of time, the rollups on an hourly basis. That gives us a summary on a per hour basis of just over 16 thousand data points, and will result in a query that runs in under 2 seconds. If we also do a daily rollup, we move from having a million data points to less than a thousand.

Actually maintaining those computed rollups would probably be a bit complex, but it won’t be any more complex than how we are computing map/reduce indexes in RavenDB (this is a private, and simplified, case of map/reduce). And the result would be instant query times, even on very large data sets.

Feb 19 2014

Time series feature designScale out / high availability

time to read 3 min | 513 words

Tweet Share Share 0 comments

Tags:

I expect (at a bare minimum), to be able to do about 25,000 sustained writes per second on a single machine. I actually expect to be able to do significantly more. But let us go with that amount for now as something that if it drops below that value, we are in trouble. Over a day, that is 2.16 billion records. I don’t think this likely, but that is a lot of data to work with.

That leads to interesting questions on scale out story. Do we need one? Well, probably not for performance reasons (said the person who haven’t written the code, much less benchmarked it) at least not if my hope for the actual system performance comes about.

However, just writing the data isn’t very interesting, we also need to read and work with it. One of the really nice thing about time series data is that it is very easily sharded, because pretty much all the operations you have are naturally accumulative. You have too much data for a single server, go ahead and split it. On query time, you can easily merge it all back up again and be done with it.

However, a much more interesting scenario for us is the high availability / replication story. This is where things gets… interesting. With RavenDB, we do async server side replication. With time series data, we can do that (and we have the big advantage of not having to worry about conflicts), but the question is how.

The underlying mechanism for all the replication in RavenDB is the notion of etags, of a way to track the order of changes to the database. In time series data, that means tracking the changes that happens in some form of a sane fashion.

It means having to track, per write, at least another 16 bytes. (8 bytes for a monotonically increasing etag number, another 8 for the time that was changed). And we haven’t yet spoken of things like replication of deletes, replication of series tag changes, etc. I can tell you that dealing with replication of deletes in RavenDB, for example, was no picnic.

A much simpler alternative would be to not handle this at the data level. One nice thing about Voron is that we can actually do log shipping. That would trivially solve pretty much the entire problem set of replication, because it would happen at the storage layer, and take care of all of it.

However, it does come with its own set of issues. In particular, it means that the secondary server has to be read only, it cannot accept any writes. And by that I mean that it would physically reject them. That leads to a great scale out story with read replicas, but it means that you have to carefully consider how you are dealing the case of the primary server going down for a duration, or any master/master story.

I am not sure that I have any good ideas about this at this point in time. I would love to hear suggestions.

Feb 18 2014

Time series feature designUser interface

time to read 1 min | 143 words

Tweet Share Share 0 comments

Tags:

We need one. But that is pretty much all I have to say about it. Most of the ways you’ve to look at time series data end up something like this:

But before you can get here, you need to handle:

Looking at series
Inspecting raw series data
Tagging / searching series
Looking at the roll up values across different dates for different series
Delete data (full series or range of data)

Probably other stuff, but those are the things that I can think of.

Probably a good place to do a lot of graphs, too, just to let the users play with the data. But that is what I have in mind here so far.

Oh, and quite obviously, we want to be able to output everything to CSV so users can look at that in Excel, of course.

Feb 17 2014

Time series feature designClient API

time to read 5 min | 868 words

Tweet Share Share 0 comments

Tags:

We have gone over the system behavior, the wire protocol and how we actually store the data on disk. Now, let us talk about the actual client API. The entry point is going to the TimeSeries class, which will have the following behavior:

Stateless operations:

Queries:

timeSeries.Query(“sensor1.heat”, “sensor1.flow”)
   .Range(start,end)
   .Rollup(Rollup.Weekly)
   .Aggergation(AggergateBy.Max, AggergateBy.Min, AggergateBy.Mean);
timeSeries.SeriesBy(“temp:C”);

Operations:

timeSeries.Delete(“sensor1.heat”, start, end);
timeSeries.Tag(“sensor1.heat”, “temp:C”);

Those types of operations have no state, require nothing beyond just knowing where the server is located and can be immediately executed without requiring any external state. The returned results aren’t tracked or managed by us in any way, so there is no need for a session.

Stateful operation - The only stateful operation we have (at least so far) is adding data to the database. We do that using the connection abstraction. This is very close to the actual on the wire representation, which is always good. We have something like:

   1: using(var con = timeSeries.OpenConnection(waitForServerFlush: true))

   2: {

   3:     using(var series = con.AddToSeries("sensor1.heat"))

   4:     {

   5:         for(var i = 0; i < 100; i++)

   6:         {

   7:             series.Add(time.AddMinutes(i), value + i);

   8:         }

   9:     }

  10: }

This is a bit of an awkward API, but it serves a purpose, it is very close to the way the on-wire format is, and it is optimized for performance, not for being nice.

We can also have:

con.Add(“sensor1.heat”, time, value);

But if you are mixing things up (add sensor1.heat, sensor1.flow and then sensor1.heat again, etc), it probably won’t be as efficient. (It is important to be able to expose those optimizations all the way from the disk to the wire to the client API. Most times, they don’t matter, which is why we have the higher level API, but when they do, they really do.

And… this is pretty much it.

The API will probably be an async one, to keep up with the times, but those are pretty much the high level things that we have here.

Oren Eini

Oren Eini

CEO of RavenDB

Time series feature designThe Consensus has dRafted a decision

Shiny features in the depth: New index optimization

RavenDB 3.0 Status Update

Time series feature designReplication

On replication strategies, or the return of the long article

Logging & Production systems

Time series feature designQuerying over large data sets

Time series feature designScale out / high availability

Time series feature designUser interface

Time series feature designClient API

FUTURE POSTS

RECENT SERIES

RECENT COMMENTS

Syndication

Server A	Server B
write users/1	wrier users/2
write users/3	write users/3
delete users/4	delete users/5
delete users/6	write users/6
write users/7	delete all users
set all users to active	write users/8

Main feed
Comments feed