Ayende @ Rahien

Oren Eini, aka Ayende Rahien, CEO of Hibernating Rhinos LTD, which develops RavenDB, a NoSQL Open Source Document Database.

You can reach me by:

oren@ravendb.net

+972 52-548-6969


Posts: 6,812 | Comments: 49,041

time to read 4 min | 761 words

The feature outlined in this post is hidden behind a small button in a relatively obscure part of the RavenDB Studio (Database > Settings > Document Revisions). You can see how it looks on the right. Despite its unassuming appearance, this is a pretty important feature. Revision Revert is a feature that we hope no one will ever need to use, though, which makes it an interesting one.

Revision Revert allows you to take your entire database back to a particular moment in time. Document changes will be undone, deleted documents will be restored, new documents will be removed, etc.


So far, this isn’t a surprising feature; being able to restore to a point in time is something that many other databases have. How is this feature different? In most systems, a point in time restore requires you to… well, restore. In a large database, that can take a lot of time. Revision Revert is an alternative to that. Instead of restoring from scratch, it utilizes the revisions feature in RavenDB to let you just hit the time machine button and go back to the desired time.

The common use case for that is immediately after the “Oops” moment. You have run a query without specifying a where clause, deployed a bad version of your app that removed important fields, etc.

Revision Revert is an online operation; you don’t need to take the database down. In fact, you can still serve reads while the process is going on. Since you’ll typically need to go back only a relatively short period, this is a very quick process.

In a distributed system, the admin will invoke this process on one of the nodes, and the revert will be applied on that node and then replicated from there to all the other nodes in the system. We have made every attempt to make what is likely to be a pretty stressful event as easy as possible.

You might have noticed the Window configuration in the screen above. What is that about?

To be honest, this is something that we expect most users to never really care about. It is there for correctness’ sake in a distributed environment. Let’s dig a little deeper into this feature.

The first thing we need to talk about is time. The point in time that we’ll restore to is the user’s local time. This is converted into UTC internally and used to compute the cutoff point for the revert. In a distributed system, it is possible (even likely) that different machines on the network will have different clocks. (Note that while RavenDB will work just fine and do the Right Thing if your nodes have different timezones, we have found that really confusing. Better to keep all nodes on the same timezone and clock sync system.)

This means that one problem for this feature is that changes happening on two machines at the same time may have different timestamps (in UTC; the local time is not relevant). You need to take that into account when using Revision Revert, since that is what RavenDB uses to decide what stays and what goes.

The second problem is that just because two updates happened at the same time, it doesn’t mean that we learned about them at the same time. Two changes that happened at the same time on different machines may have reached a particular node at very different times. That is where the Window option comes into play. We scan the revisions log for all changes to the system, and we scan them in the order in which we learned about them. By default, we’ll keep scanning an extra 4 days past the cutoff, until we are sure that there aren’t any revisions that we received out of order and missed.
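
To make the window concrete, here is a hedged sketch (not RavenDB’s actual implementation) of that scan: revisions are walked in the order the node learned about them, and the scan continues for an extra window beyond the cutoff to catch revisions that arrived out of order. The types and the newest-first ordering here are assumptions for illustration.

using System;
using System.Collections.Generic;

// Illustrative only: a revision as the scan sees it.
public record Revision(string DocumentId, DateTime CreatedUtc);

public static class RevertScan
{
    // revisions must be ordered newest-first by the order the node learned about them.
    public static IEnumerable<Revision> RevisionsToRevert(
        IEnumerable<Revision> revisions, DateTime cutoffUtc, TimeSpan window)
    {
        foreach (var revision in revisions)
        {
            if (revision.CreatedUtc > cutoffUtc)
            {
                // Changed after the point in time we are reverting to: undo it.
                yield return revision;
            }
            else if (revision.CreatedUtc < cutoffUtc - window)
            {
                // Far enough before the cutoff that an out-of-order revision
                // can no longer show up: we can safely stop scanning.
                yield break;
            }
            // Otherwise: before the cutoff but still inside the window, so keep
            // scanning in case a late-arriving revision is still ahead of us.
        }
    }
}

With the default of TimeSpan.FromDays(4), the scan only stops once it is four days’ worth of revisions past the requested point in time.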

A few additional things about this feature. Obviously, it requires that you have revisions enabled (and enough revisions capacity to go back far enough in time, naturally). It supports live restores and operates nicely in a distributed environment. Note that if you are doing Revision Revert and not all your documents have revisions enabled, only those that have revisions will be reverted.

Currently we apply this revert globally. We are considering allowing you to select specific collections to revert, but I’m not sure how useful that would be in practice.

time to read 3 min | 405 words

The most common network topology for RavenDB replication is a full mesh. For example, if you have three nodes in your cluster and a database that resides on all three nodes, you’ll have a replication topology in which each node replicates directly to the other two.


This works great when the number of nodes that you have in your cluster is reasonably small. However, we recently got a customer question about a different kind of topology. They have a bunch of nodes, on the order of a few dozen, which cooperate to perform some non-trivial task. A key part of this is that the nodes are transient and identical. So a new node may pop up, live for a while (days, weeks, months) and then go away. At any given time you might have a few dozen nodes. That kind of environment won’t really work with a full mesh topology. If we tried, we would end up with a fully connected network of 40 nodes.


Such a network has a total of 780 connections (40 × 39 / 2) in it. You can create a topology like that, but a lot of the processing power in the network is going to be dedicated to just maintaining these connections. And you don’t actually need it. RavenDB’s replication algorithm is actually a gossip algorithm, and as you grow the number of nodes that take part in the replication, the fewer connections you need between nodes. In this case, we can take each of the live nodes and connect each of them to four other (random) nodes.


Remember, each of the nodes is connected to just four other, randomly chosen, nodes. RavenDB’s replication will ensure that a change to any document on any of the nodes under these conditions will propagate to all the other nodes efficiently.

This approach will also transparently handle any intermediary failures and be robust to nodes coming and going on the fly. RavenDB doesn’t implement gossip membership, mostly because that is very heavily dependent on the application and deployment pattern, but once you tell a node who its neighbors are, everything will proceed on its own.
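
The “tell a node who its neighbors are” part is up to you. A minimal C# sketch of that membership step might look like the following; picking the peers is the whole trick, and actually wiring them up (external replication tasks, connection strings) is left to whatever deployment automation you already use:

using System;
using System.Collections.Generic;
using System.Linq;

// A sketch of the membership side RavenDB deliberately leaves to the application:
// every live node picks a handful of random peers to replicate to.
public static class GossipTopology
{
    public static IReadOnlyList<string> PickPeers(
        string self, IReadOnlyList<string> liveNodes, int fanout = 4)
    {
        var random = new Random();
        return liveNodes
            .Where(node => node != self)      // never pick ourselves
            .OrderBy(_ => random.Next())      // cheap shuffle
            .Take(fanout)                     // four random neighbors, per the post
            .ToList();
    }
}

With 40 nodes and a fanout of 4, each node maintains only a handful of outgoing connections instead of 39, yet a change still gossips across the whole network in a few hops.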

time to read 2 min | 334 words

I’m happy to let you know that as of last week, RavenHQ has been updated with full general availability of managed RavenDB 4.1 in the cloud.

If you aren’t familiar with RavenHQ, this update gives you the ability to get a managed RavenDB cluster in minutes. In this case, you get all the usual benefits of RavenDB as well as the peace and quiet of someone else managing all the other details for you.

We have been steadily working on making RavenDB a self-managed database, but there are still things that it can’t do on its own. RavenHQ closes the loop. If your database is large and about to run out of disk space, RavenHQ can automatically increase the allocated storage, for example. A machine went down? RavenHQ will transparently replace it and allow the cluster to recover without having any impact on your client code.

If you have used RavenHQ in the past, there are important changes. Previously, you would use RavenHQ to provision a database. That has changed: you now provision a cluster of nodes, and you can create as many databases as you want. If you need a test database for CI, you can just create one on the fly, with no extra charges and no need to use additional APIs beyond the RavenDB native client.

All the usual management features of RavenDB are available as well, including the ability to extend RavenDB on the fly using index extensions and analyzers. We went over all the feedback that we got from the community and users and have done the same thing for RavenHQ that was done in the move from RavenDB 3.5 to RavenDB 4.0. Everything should be better. In the case of RavenDB, we had a rule that things should be at least 10x better, and I believe that you’ll find the RavenHQ experience similar.

As always, we would really love your feedback.

time to read 4 min | 735 words

About a month ago I wrote about a particular issue that we wanted to resolve. RavenDB uses X509 certificates for authentication. These are highly secure and are a good answer for our clients who need to host sensitive information or are working in highly regulated environments. However, certificates have a problem: they expire. In particular, if you are following common industry best practices, you’ll replace your certificates every 2–3 months. In fact, the common setup of using RavenDB with Let’s Encrypt will do just that. Certificates will be replaced on the fly by RavenDB without the need for administrator involvement.

If you are running inside a single cluster, that isn’t something you need to worry about. RavenDB will coordinate the certificate update between the nodes in such a way that it won’t cause any disruption in service. However, it is pretty common in RavenDB to have multi-cluster topologies, either because you are deployed in a geo-distributed manner or because you are running complex topologies (edge processing, multiple cooperating clusters, etc.). That means that when cluster A replaces its certificate, we need to have a good story for cluster B still allowing it access, even though the certificate has changed.

I outlined our thinking in the previous post, and I got a really good hint: 13xforever suggested that we look at HPKP (HTTP Public Key Pinning) as another way to handle this. HPKP is a security technology that was widely used, ran into issues and was replaced (mostly by certificate transparency). With this hint, I started to investigate further. Here is what I learned:

  • A certificate is composed of some metadata, the public key and the signature of the issuer (skipping a lot of stuff here, obviously).
  • Keys for certificates can be either RSA or ECDSA. In both cases, there is a 1:1 relationship between the public and private keys (in other words, each public key has exactly one private key).

Given these facts, we can rely on the key pair to avoid the issues with certificate expiration, distributing new certificates, etc.

Whenever a cluster needs a new certificate, it will use the same private/public key pair to generate the new certificate. Because the public key is the same (and we verify that the client has the private key during the handshake), even if the certificate itself has changed, we can verify that the other side knows the actual secret: the private key.

In other words, we slightly changed the trust model in RavenDB. Instead of trusting a particular certificate, we trust that certificate’s private key. That is what grants access to RavenDB. In this way, when you update the certificate, as long as you keep the same key pair, we can still authenticate you.
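
This isn’t RavenDB’s actual code, but a minimal C# sketch of the pinning idea; since each public key has exactly one private key, comparing the public keys of the old and new certificates is enough to know that the same key pair is behind both:

using System.Linq;
using System.Security.Cryptography.X509Certificates;

public static class KeyPinning
{
    // The certificates may differ in subject, validity dates and serial number,
    // but if the encoded public keys are identical, whoever holds one private
    // key holds the other as well.
    public static bool SameKeyPair(X509Certificate2 trusted, X509Certificate2 presented)
        => trusted.GetPublicKey().SequenceEqual(presented.GetPublicKey());
}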

This feature means that you can drastically reduce the amount of work that an admin has to do, and it leads to a system that you set up once and that just keeps working.

There are some fine details that we still had to deal with, of course. An admin may issue a certificate and want it to expire, so just having the user re-generate a new certificate with the same private key isn’t really going to work for us. Instead, RavenDB validates that the chain of signatures on the certificate is the same. To be rather more exact, it verifies that the original (trusted by the admin) certificate and the new certificate that was just presented to us are signed by the same chain of public key hashes.

In this way, if the original issuer gave you a new certificate, it will just work. If you generate a new certificate on your own with the same key pair, we’ll reject that. The model that we have in mind here is trusting a driver’s license. If you have an updated driver’s license from the same source, that is considered just as valid as the original one on file. If the driver’s license is from Toys R Us, not so much.
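
Again as a hedged sketch rather than the actual implementation, the chain comparison could be expressed like this: hash the public key of every certificate that signed the leaf, and require the two chains of hashes to match.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Security.Cryptography;
using System.Security.Cryptography.X509Certificates;

public static class ChainPinning
{
    // SHA-256 hashes of the public keys of the certificates that signed the
    // given certificate (issuer, intermediates, root), leaf excluded.
    public static List<string> SignerKeyHashes(X509Certificate2 cert)
    {
        using var chain = new X509Chain();
        chain.ChainPolicy.RevocationMode = X509RevocationMode.NoCheck; // sketch only
        chain.Build(cert);

        using var sha256 = SHA256.Create();
        return chain.ChainElements
            .Cast<X509ChainElement>()
            .Select(e => e.Certificate)
            .Where(c => c.Thumbprint != cert.Thumbprint)   // skip the leaf itself
            .Select(c => Convert.ToBase64String(sha256.ComputeHash(c.GetPublicKey())))
            .ToList();
    }

    // The freshly presented certificate is acceptable only if it was signed by
    // the same chain of public keys as the certificate the admin trusted.
    public static bool SignedBySameChain(X509Certificate2 trusted, X509Certificate2 presented)
        => SignerKeyHashes(trusted).SequenceEqual(SignerKeyHashes(presented));
}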

Naturally, all such automatic certificate updates are going to be logged to the audit log, and we’ll show the updated certificates in the management studio as well.

As usual, we welcome your feedback; the previous version of this post got us a great feature, after all.

time to read 4 min | 655 words

This post really annoyed me. Feel free to go ahead and read through it; I’ll wait. The gist of the post, titled “WAL usage looks broken in modern Time Series Databases?”, is that time series databases that use a Write Ahead Log are broken, and that their system, which isn’t using a WAL (but uses Log-Structured Merge, LSM) is also broken, but no more than the rest of the pack.

This post annoyed me greatly. I build databases for a living, and for over a decade or so I have been focused primarily on building a distributed, transactional (ACID) database. A key part of that is actually knowing what is going on in the hardware beneath my software and how to best utilize it. This post was annoying because it makes quite a few really bad assumptions and then builds upon them. I particularly disliked the outright dismissal of direct I/O, mostly because they seem to be doing that based on very partial information.

I’m not familiar with Prometheus, but doing fsync() every two hours basically means that it isn’t on the same plane of existence as far as ACID and transactions are concerned. Cassandra is usually deployed in cases where you either don’t care about some data loss or, if you do, you use multiple replicas and rely on that. So I’m not going to touch that one either.

InfluxDB is doing the proper thing and issuing an fsync after each write. Because fsync is slow, they very reasonably recommend batching writes. I consider this to be something that the database should do, but I do see where they are coming from.

Postgres, on the other hand, I’m quite familiar with, and the description in the post is inaccurate. You can configure Postgres to behave in this manner, but you shouldn’t if you care about your data. Usually, when using Postgres, you’ll not get a confirmation on your writes until the data has been safely stored on the disk (after some variant of fsync was called).

What really got me annoyed was the repeated insistence on “data loss or corruption”, which shows a remarkable lack of understanding of how WALs actually work. Because of the very nature of a WAL, the people who build them all have to consider the possibility of a partial WAL write, and there are mechanisms in place to handle it (usually by considering the affected transaction invalid and rolling it back).

The solution proposed in the post is to use SSTables (sorted string tables), which are usually a component in LSM systems. Basically, buffer the data in memory (they use 1 second intervals to write it to disk) and then write it in one go. I’ll note that they make no mention of actually writing to disk safely. So no direct I/O or calls to fsync. In other words, a system crash may leave you a lot worse off than merely 1 second of lost data. In fact, it is possible that you’ll have some data there, and some not, and not necessarily in the order of arrival.

A proper database engine will:

  • Merge multiple concurrent writes into a single disk operation. In this way, we can handle > 100,000 separate writes per second (document writes, so significantly larger than the typical time series drops) on commodity hardware. See the sketch after this list.
  • Ensure that if any write was confirmed, it actually hit durable storage and can never go away.
  • Properly handle partial writes or corrupted files, in such a way that none of the invariants of the system is violated.
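
Here is a minimal, hedged sketch of the first two bullets (write merging plus a durable flush); it is not RavenDB’s journal code, and a real engine would also add checksums and recovery logic to cover the third bullet:

using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.IO;
using System.Threading;
using System.Threading.Tasks;

// Many concurrent writers enqueue entries; one background thread appends the whole
// batch to the journal, flushes it to durable storage, and only then completes the
// callers' tasks. No caller is told "ok" before its bytes are actually on disk.
public sealed class JournalWriter : IDisposable
{
    private readonly FileStream _journal;
    private readonly BlockingCollection<(byte[] Data, TaskCompletionSource<bool> Done)> _pending =
        new BlockingCollection<(byte[] Data, TaskCompletionSource<bool> Done)>();
    private readonly Thread _flusher;

    public JournalWriter(string path)
    {
        _journal = new FileStream(path, FileMode.Append, FileAccess.Write, FileShare.Read);
        _flusher = new Thread(FlushLoop) { IsBackground = true };
        _flusher.Start();
    }

    // The returned task completes only after the entry is durable.
    public Task WriteAsync(byte[] entry)
    {
        var tcs = new TaskCompletionSource<bool>(TaskCreationOptions.RunContinuationsAsynchronously);
        _pending.Add((entry, tcs));
        return tcs.Task;
    }

    private void FlushLoop()
    {
        foreach (var first in _pending.GetConsumingEnumerable())
        {
            var batch = new List<(byte[] Data, TaskCompletionSource<bool> Done)> { first };
            while (_pending.TryTake(out var more))
                batch.Add(more);                       // merge everything that piled up

            foreach (var (data, _) in batch)
                _journal.Write(data, 0, data.Length);  // one sequential append for the batch

            _journal.Flush(flushToDisk: true);         // the fsync: confirmed writes never go away

            foreach (var (_, done) in batch)
                done.SetResult(true);
        }
    }

    public void Dispose()
    {
        _pending.CompleteAdding();
        _flusher.Join();
        _journal.Dispose();
    }
}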

I’m leaving aside major issues with LSM and SSTables, such as write amplification and the inability to handle sustained high loads (because there is never a break in which you can do the bookkeeping). Just the portions on WAL usage (which show broken and inefficient use) being used to justify another broken implementation are quite enough for me.

time to read 2 min | 246 words

One of the primary reasons why businesses choose to use workflow engines is that they get pretty pictures that explain what is going on and look like they are easy to deal with. The truth is anything but that, but pretty sells.

My recommended solution for workflow has a lot going for it, if you are a developer. But if you try to show this code to a business analyst, they are likely to just throw their hands up in the air and give up. Where are the pretty pictures?

One of the main advantages of this kind of approach is that it is very rigid. You are handling things in the event handlers, registering the next step in the workflow, etc. All of which is very regimented. This is so for a reason. First, it makes it very easy to look at the code and understand what is going on. Second, it allows us to process the code in additional ways.

Consider the following AST visitor, which operates over the same code.

This took me about twenty minutes to write, mostly to figure out the Graphviz notation. It takes advantage of the fact that the structure of the code is predictable to generate the actual flow of actions from the code.
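
The visitor itself appears in the post only as an image, so the following is just a hedged sketch of the last step of that idea: once the (state, event, next state) transitions have been extracted from the script’s AST, emitting Graphviz is little more than string building. The types and the transition shape here are assumptions.

using System.Collections.Generic;
using System.Text;

public static class WorkflowDot
{
    // Turn extracted workflow transitions into Graphviz DOT text; render it with
    // something like: dot -Tpng workflow.dot -o workflow.png
    public static string ToDot(IEnumerable<(string From, string Event, string To)> transitions)
    {
        var dot = new StringBuilder();
        dot.AppendLine("digraph workflow {");
        dot.AppendLine("  node [shape=box];");
        foreach (var (from, evt, to) in transitions)
            dot.AppendLine($"  \"{from}\" -> \"{to}\" [label=\"{evt}\"];");
        dot.AppendLine("}");
        return dot.ToString();
    }
}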

You get to use readable code and maintainable practices and show pretty pictures to the business people.

time to read 3 min | 407 words

In my previous post, I talked about the driving forces toward a scripting solution for workflow behavior, and I presented a code example of such a solution. In this post, I want to focus on the non-obvious aspects of such a design.

The first thing to note about this code is that it is very structured. You are working in an event-based system, and as such, the input/output of the system is highly visible. It also means that we have straightforward ways to deal with complexity. We can break some part of the behavior into a different file, or even into a different workflow that we’ll call into.

The second thing to note is that workflows tend to be long-running processes. In the code above, we have a pretty obvious way to handle state. We get passed a state object, which we can freely modify. Changes to the state object are persisted between event invocations. That is actually a pretty important point, because if we store that state inside RavenDB, we also get the ability to do a bunch of other really interesting stuff:

  • You can query ongoing workflows and check their state.
  • You can use the revisions feature inside of RavenDB to track the state changes between invocations.

The input to the events is also an object, which means that you can store it natively as well, giving you full tracing capabilities.

The third important thing to note is that the script is just code, and even in complex cases, it is going to be pretty small. That means that you can run version-resistant workflows. What do I mean by that?

Once a workflow process has started, you want to keep it on the same workflow script that it started with. This makes versioning decisions much nicer, and it is very easy for you to deal with changes over time. On the other hand, sometimes you need to fix the script itself (there was a bug that allowed negative APR), in which case you can change it for just the ongoing workflows.

Actual storage of the script can be in Git, or as a separate document inside the database. Alternatively, you may actually want to include the script itself in every workflow. That is usually reserved for industries where you have to be able to reproduce exactly what happened, and I wouldn’t recommend doing this in general.
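
To make the “script as a document” option concrete, here is a hedged sketch of what the stored shapes might look like; the class and property names are assumptions, not an existing schema:

using System.Collections.Generic;

// The workflow behavior itself, versioned as a document.
public class WorkflowScript
{
    public string Id { get; set; }         // e.g. "scripts/mortgage-approval/7"
    public string JavaScript { get; set; } // the policy code
}

// One running workflow instance, pinned to the script version it started with.
public class WorkflowInstance
{
    public string Id { get; set; }                          // e.g. "workflows/loans/1234"
    public string ScriptId { get; set; }                    // which script (and version) drives it
    public string CurrentStep { get; set; }
    public Dictionary<string, object> State { get; set; }   // persisted between event invocations
}

Because the instance records its ScriptId, upgrading the script for new workflows doesn’t disturb the ones already in flight, and fixing a bug in place only requires repointing (or patching) the ongoing instances.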

time to read 2 min | 301 words

Reddit’s front page contains a list of recent posts from all communities. In most cases, you want to show posts from communities that the user is subscribed to, but at the same time, you want to avoid flooding the front page with posts from any single community. You also need this to be really fast.

It turns out that doing this in RavenDB is actually very easy. We are going to create a map/reduce index that will aggregate the few most recent posts per community.


What this index will do is provide us with the five most recent posts in each community, as well as their date. This is an interesting example of a map/reduce index, because we are using both aggregation and fanout in the index.
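
The index itself appears in the post only as an image, so the following is a hedged sketch of what such an index could look like using the C# client; the Posts collection, the field names and the cutoff of five are assumptions:

using System;
using System.Linq;
using Raven.Client.Documents.Indexes;

public class Post
{
    public string Id { get; set; }
    public string Community { get; set; }
    public DateTime PostedAt { get; set; }
}

public class RecentPostsByCommunity : AbstractIndexCreationTask<Post, RecentPostsByCommunity.Result>
{
    public class Result
    {
        public string Community { get; set; }
        public PostRef[] Posts { get; set; }
    }

    public class PostRef
    {
        public string PostId { get; set; }
        public DateTime PostedAt { get; set; }
    }

    public RecentPostsByCommunity()
    {
        // Map: one entry per post, wrapped in a single-element array so the
        // reduce can merge arrays coming from different batches.
        Map = posts => from post in posts
                       select new Result
                       {
                           Community = post.Community,
                           Posts = new[] { new PostRef { PostId = post.Id, PostedAt = post.PostedAt } }
                       };

        // Reduce: per community, keep only the five most recent posts.
        Reduce = results => from result in results
                            group result by result.Community into g
                            select new Result
                            {
                                Community = g.Key,
                                Posts = g.SelectMany(r => r.Posts)
                                         .OrderByDescending(p => p.PostedAt)
                                         .Take(5)
                                         .ToArray()
                            };
    }
}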

The nice thing about this index is that we can project the results directly from it to the user. Let’s look at the query next.


This is a simple query that does quite a lot. It gives us the most recent 15 posts across all the communities that the user cares about, with no single community able to contribute more than 5 posts. It sorts them by the posted date and fetches the actual posts in the same query. This is going to give you consistent performance regardless of how much data you have and how many updates you experience. The actual Reddit front page is a lot more complex, I’m sure, but this serves as a nice example of how you can do non-trivial stuff in RavenDB’s indexes that simplifies your life by a lot.
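
The original query is likewise shown in the post only as an image; a hedged sketch of the same idea against the index above could look like this (the page size of 15 and the fanout of 5 match the description, the rest is assumption). The post’s actual query does the ordering and projection server-side; this sketch keeps the final flattening on the client for clarity, while still using a single round trip.

using System.Collections.Generic;
using System.Linq;
using Raven.Client.Documents;
using Raven.Client.Documents.Linq;

public static class FrontPage
{
    public static List<Post> Load(IDocumentStore store, string[] subscribedCommunities)
    {
        using (var session = store.OpenSession())
        {
            // One round trip: the index results for the user's communities,
            // with the referenced post documents included alongside them.
            var recent = session.Query<RecentPostsByCommunity.Result, RecentPostsByCommunity>()
                .Include(r => r.Posts.Select(p => p.PostId))
                .Where(r => r.Community.In(subscribedCommunities))
                .ToList();

            // Flatten to the 15 most recent posts overall; Load() is served from
            // the session's included documents, so no extra requests are made.
            return recent
                .SelectMany(r => r.Posts)
                .OrderByDescending(p => p.PostedAt)
                .Take(15)
                .Select(p => session.Load<Post>(p.PostId))
                .ToList();
        }
    }
}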

time to read 4 min | 760 words

I talked about some of the requirements for proper workflow design in my previous post. As a reminder, the top ones are:

  • Cater for developers, not the business analysts. (More on this later).
  • Source control isn’t optional, meaning:
    • Multiple branches
    • Can diff & review changes
    • Merging
    • Multiple people can work at the same time
  • Encapsulate complexity

This may seem like a pretty poor list, because if you are a developer, you might be taking all of these for granted. Because of that, I wanted to show a small taste of what used to be Microsoft’s primary workflow engine.


A small hint… this kind of system is not going to be useful for anything relating to source control, change management, collaborative work, understanding what is going on, etc.

A better solution for this would be to use a tool that can work with source control, that developers are familiar with and can handle the required complexity.

That tool is called… code.

It checks all the boxes required, naturally. But it does have a distinct disadvantage. One of the primary reasons you want to use a workflow engine of some kind is to decouple the implementation of your business from the policies of the business. Coming back to the mortgage example, how you calculate late fee payments is fixed (in the contract itself, but usually also by law and many regulations), but figuring out whether late fees should be waived, on the other hand, is subject to the whims of the business.

That is a pretty simple example, but in most businesses, these kinds of workflows add up. You can easily end up with dozens to hundreds of different workflows without the business being particularly big or complex.

There is another issue, though. Code is pretty good when you need to handle straightforward tasks. A set of if statements (which is pretty much all most workflows are) is trivial to handle. But workflows have another property: they tend to be long. Not long on a computer scale (seconds), but long on a people scale (months and years).

The typical process of getting a loan may involve an initial submission, review by a doctor, asking for follow-up documentation (rinse and repeat a few times), getting the doctor’s appraisal and only then being able to generate a quote for the customer. Then we have a period of time in which the customer can accept, a qualifying period, etc. That can last for a good long while.

Trying to code long-running processes like that requires a very unnatural approach to coding, especially since you are likely to need to handle software updates while the workflows are running.

In short, we are in a strange position: we want to use code, because it is clear, supports software development practices that are essential and can scale up in complexity as needed. On the other hand, we don’t want to use our usual codebase for that, because we’ll have very different deployment strategies, the manner of working is very different and there is a lot more involvement of the business in what is going on there.

The way to handle that is to create a proper boundary between parts of the system. We’ll have the workflow behavior, defined in scripts, that describes the policy of the system. These tend to be fairly high-level concepts and are designed explicitly for the role of expressing business policy behaviors. The infrastructure for that, on the other hand, is just a standard application using normal software practices, driven by the workflow scripts.

And by a script, I meant literally a script. As in, JavaScript.

I want to give you a sneak peek into how I envision this kind of system, but I’ll defer full discussion of what is involved to my next post.



The idea is that we use the script to define our policy, and then we use that to make decisions and invoke the next stage in the process. You might notice that we have the state variable, which is persisted between invocations. That allows us to use a programming model that is fairly common and obvious to developers. We can usually also show this, as is, to a business analyst and get them to understand what is going on easily enough. All the actual actions are abstracted. For example, life insurance setup is a completely different workflow that we invoke.
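
To show how the persisted state and the script fit together on the host side, here is a hedged C# sketch. IScriptEngine is a hypothetical wrapper around whatever JavaScript engine you embed, and the WorkflowInstance / WorkflowScript shapes are the assumed documents sketched earlier on this page; none of this is the actual implementation.

using System.Collections.Generic;
using Raven.Client.Documents;

// Hypothetical: runs the named event handler from the workflow script, letting it
// read and mutate the state dictionary, and returns the next step to register.
public interface IScriptEngine
{
    string InvokeHandler(string script, string eventName,
                         IDictionary<string, object> state, object eventData);
}

public class WorkflowHost
{
    private readonly IDocumentStore _store;
    private readonly IScriptEngine _engine;

    public WorkflowHost(IDocumentStore store, IScriptEngine engine)
    {
        _store = store;
        _engine = engine;
    }

    public void Dispatch(string workflowId, string eventName, object eventData)
    {
        using (var session = _store.OpenSession())
        {
            var workflow = session.Load<WorkflowInstance>(workflowId);
            var script = session.Load<WorkflowScript>(workflow.ScriptId);

            // The script makes the policy decision; the host only records the outcome.
            workflow.CurrentStep = _engine.InvokeHandler(
                script.JavaScript, eventName, workflow.State, eventData);

            // Whatever the script did to the state object survives until the next event.
            session.SaveChanges();
        }
    }
}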

In my next post, I’m going to drill down a bit into the details of this approach and what kind of features we need there.

time to read 5 min | 815 words

One of the most common themes I run into when talking to customers, users and sundry people in tech is the repeated desire to fire developers.

Actually, that is probably too loaded a statement. It actually comes in two parts:

  • Developers usually want to focus on the interesting bits, and the business logic portions aren’t that much fun.
  • The business analysts usually want to get things done and having to get developers to do that is considered inefficient.

If only there was a tool, or a pattern, or a framework, or something that would allow the business analysts to directly specify the behavior of the system… Why, we could cut the developers from the process entirely! And speaking as a developer, that would be a huge relief.

I think the original name for that was CASE tools, and that flopped. In fact, literally every single one of the attempts to replace developers with a tool has flopped. They got such a bad rap that people keep trying to implement them under different names. Some stuff can be done fairly easily, though. WYSIWYG for GUIs is well established, and WordPress and Wix, to name the two examples that come to mind immediately, show that you can have a non-techie build a proper website. In fact, you can even plug in some pretty sophisticated functionality without burdening the user with too much.

But all that takes you to a point, and past that point, the drop-off is harsh. Let’s take another common tool that is used to reduce the dependency on developers: SharePoint.

You pay close to double for actual developer time on SharePoint, mostly because it is so painful to work with it.

At a recent conference, I got into a conversation about business workflows and how to best implement them. You can look at the image on the right to get a good idea of what kind of process they were talking about.

To make things real, I want to take a “simple” example: accepting a life insurance policy. Here is what the (extremely simplified) workflow for issuing a life insurance policy looks like.


This looks good, and it certainly should make sense to a business analyst. However, even after I pretty much reduced the process to its bare bones, and filed even those away, this is still pretty complex. The process of actually getting a policy is a lot more complex. Some questions don’t require doctor evaluation (for example, smoking) and some require supplemental documentation (oh, you were hospitalized? Gimme all those records). The doctor may recommend different rates, rejecting the application entirely, some exceptions in the policy, etc. All of which need to be in the workflow. Actuarial tables need to be consulted for each of those cases, etc., etc., etc.

But something like the diagram above isn’t going to be able to handle this level of complexity. You are going to get lost very quickly if you try to put so many boxes on the screen.

So you need encapsulation.

And you’ll probably want to have a way to develop these business workflows, which means that they aren’t static.

So you need source control.

And if you have a complex business process, you likely have different people working on it.

So you need to be able to review changes, and merge them.

Note that this is explicitly distinct from being able to store the data in source control. Being able to meaningfully diff two versions of such a process is anything but trivial. Usually you are left with diffing the raw XML/JSON that stores the structure. Good luck with that.

If the workflow is complex, you need to be able to understand what is going on under various conditions.

So you need a debugger.

In fact, pretty soon you’ll realize that you’ll need quite a lot of the things that developers do. Except that your tool of choice doesn’t do that, or if it does, it does it poorly.

Another issue is that even if you somehow managed to bypass all of those details, you are going to face the same drop-off that you see elsewhere with tools that attempt to get rid of developers. At some point, the complexity grows too large, and you’ll call your development team and hand the system off to them. At that point they will be stuck with a very clunky tool that attempts to be quite clever and easy to use. It is also horribly limiting for a developer, mostly because all of the “complexity” involved is in the business process itself, not in the actual complexity of what is going on.

There are better ways of handling that, and the easiest among them is to just use code. That can be… surprisingly versatile.


