Ayende @ Rahien

Oren Eini, aka Ayende Rahien, CEO of Hibernating Rhinos LTD, which develops RavenDB, a NoSQL Open Source Document Database.

You can reach me by:

oren@ravendb.net

+972 52-548-6969

time to read 1 min | 182 words

I ran into this very interesting blog post and decided to see if I could implement it on my own, without requiring any runtime support. This turned out to be surprisingly easy, if you are willing to accept some caveats.

I’m going to assume that you have read the linked blog post, and here is the code that implements it:

Here is what it gives you:

  • You can have any number of queues.
  • You can associate an instance with a queue at any time, even long after it was constructed.
  • You can unregister from the queue at any time.
  • You can wait (or easily change it to do async awaits, of course) for updates about phantom references.
  • You can process such events in parallel.

What about the caveats?

This utilizes a finalizer internally to inform us when the associated value has been forgotten, so it takes longer than one would wish for. The implementation relies on ConditionalWeakTable to do its work, creating a weak association between the instances you pass in and the PhantomReference class holding the handle that we’ll send to you once that value has been forgotten.
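
Here is a minimal sketch of that approach (type and member names are mine, not the original implementation’s): a finalizer-backed queue tied to your instances through a ConditionalWeakTable.

using System;
using System.Collections.Concurrent;
using System.Runtime.CompilerServices;

public class PhantomReferenceQueue<THandle>
{
    private readonly BlockingCollection<THandle> _queue = new BlockingCollection<THandle>();
    private readonly ConditionalWeakTable<object, PhantomReference> _table =
        new ConditionalWeakTable<object, PhantomReference>();

    // Associate an instance with this queue. 'handle' is what you'll get back
    // once the instance has been garbage collected.
    public void Register(object instance, THandle handle)
    {
        _table.Add(instance, new PhantomReference(this, handle));
    }

    // Stop tracking the instance; its handle will never be delivered.
    public void Unregister(object instance)
    {
        if (_table.TryGetValue(instance, out var reference))
        {
            reference.Cancel();
            _table.Remove(instance);
        }
    }

    // Blocks until some registered instance has been collected and finalized.
    public THandle WaitForNext() => _queue.Take();

    private class PhantomReference
    {
        private readonly PhantomReferenceQueue<THandle> _owner;
        private readonly THandle _handle;
        private bool _cancelled;

        public PhantomReference(PhantomReferenceQueue<THandle> owner, THandle handle)
        {
            _owner = owner;
            _handle = handle;
        }

        public void Cancel()
        {
            _cancelled = true;
            GC.SuppressFinalize(this);
        }

        // The ConditionalWeakTable keeps this object alive exactly as long as the
        // key, so this finalizer runs only after the associated instance is gone.
        ~PhantomReference()
        {
            if (!_cancelled)
                _owner._queue.Add(_handle);
        }
    }
}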

time to read 4 min | 734 words

The title of this post is pretty strange, I admit. Usually, when we think about modeling, we think about our data. If it is a relational database, this mostly means the structure of your tables and the relations between them. When using a document database, this means the shape of your documents. But in both cases, indexes are there merely to speed things up. Oh, a particularly important query may need an index, and that may impact how you lay out the data, but these are relatively rare cases. In relational databases and most non-relational ones, indexes do not play any major role in data modeling decisions.

This isn’t the case for RavenDB. In RavenDB, an index doesn’t exist merely to organize the data in a way that makes it easier for the database to search for it. An index is actually able to modify and transform the data, whether on the current document or by pulling in data from related documents. A map/reduce index is even able to aggregate data from multiple documents as part of the indexing process. I’ll touch on that last one in more depth later in this series; first, let’s tackle the more obvious parts. Because I want to show off some of the new features, I’m going to use JS for most of the index definitions in these posts, but you can do the same using Linq / C# as well, obviously.

When brainstorming for this post, I got so many good ideas about the kind of non-obvious things that you can do with RavenDB’s indexes that a single post turned into a series, and I have two pages of notes to go through. Almost all of those ideas are basically some form of computation during indexing, but applied in novel manners to give you a lot of flexibility and power.

RavenDB prefers to have more work to do during indexing (which is batched and happens in the background) than during query time. This means that we can push a lot more work into the background and just let RavenDB handle it for us. Let’s start with what is probably the most basic example of computation during query: the Order’s Total. Consider the following document:

[Image: a sample Order document, with its list of line items but no Total field]

As you can see, we have the Order document and the list of the line items in this order. What we don’t have here is the total order value.

Now, actually computing how much you are going to pay for an order is complex. You have to take into account taxation, promotions, discounts, shipping costs, etc. That isn’t something that you can do trivially, but it does make an excellent simple example, and similar requirements exist in many fields.

We could obviously add a Total field to the order, but then we have to make sure that we update it whenever we update the order. This is possible, but if we have multiple clients consuming the data, it can be fairly hard to do. Instead, we can place the logic to compute the property in the index itself. Here is how it would look:

[Image: the C# Linq index definition that computes the order total]
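
A C# sketch of such an index, assuming the Northwind-style Order / OrderLine fields from RavenDB’s sample data:

using System.Linq;
using Raven.Client.Documents.Indexes;

public class Orders_ByTotal : AbstractIndexCreationTask<Order>
{
    public Orders_ByTotal()
    {
        // Total is computed once, at indexing time, and can then be queried on,
        // sorted by and (if stored) projected back.
        Map = orders => from order in orders
                        select new
                        {
                            order.Company,
                            Total = order.Lines.Sum(l => l.Quantity * l.PricePerUnit * (1 - l.Discount))
                        };
    }
}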

The same index in JavaScript is almost identical:

In this case, they are very similar, but as the complexity grows, I find it is usually easier to express logic as a JavaScript index rather than use a single (complex) Linq expression.

Such an index gives us a computed field, Total, that has the total value of an order. We can query on this field, sort by it and even project it back (if the field is stored). It allows us to always have the most up-to-date value and have the database take care of computing it.

This basic technique can be applied in many different ways and affects the way we shape and model our data. Currently I have at least three more posts planned for this series, and I would love to hear your feedback, both on the kind of stuff you would like me to talk about and on the kind of indexes you are using with RavenDB and how they have impacted your data modeling.

time to read 13 min | 2462 words

Most developers have been weaned on relational modeling and have to make a non-trivial mental leap when the time comes to model data in a non-relational manner. That is hard enough on its own, but what happens when the data store you use actually has multi-model capabilities? As an industry, we are seeing more and more databases that take this path and offer multiple models to store and query the data as part of their core nature. For example, ArangoDB, CosmosDB, Couchbase, and of course, RavenDB.

RavenDB, for example, gives you the following models to work with:

  • Documents (JSON) – Multi master with any node accepting reads and writes.
    • ACID transactions over multiple documents.
    • Simple / full text queries.
    • Map/Reduce and aggregation queries.
  • Binary data – Attachments to documents.
  • Counters (Map<string, int64>) – CRDT multi master distributed counters.
  • Key/Value – strong distributed consistency via Raft protocol.
  • Graph queries – on top of the document model.
  • Revisions – built-in audit trail for documents.

With such a wealth of options, it can be confusing to select the appropriate tool for the job when you need to model your data. In this post, I aim to make sense of the options RavenDB offers and guide you toward making the optimal choices.

The default and most common model you’ll use is going to be the document model. It is the one most appropriate for business data, and you’ll typically follow the Domain Driven Design approach for modeling your data and entities. In other words, we are talking about Aggregates, where each document is a whole aggregate. References between entities are either purely local to an aggregate (and document) or only between aggregates: a value in one document cannot point to a value in another document. It can only point to another document as a whole.

Most of your business logic will be focused at the aggregate level. Even when a single transaction modifies multiple documents, most of the business logic is done on each aggregate independently. A good way to handle that is using Domain Events. This allows you to compose independent portions of your domain logic without tying it all into one big knot.

We have talked about modifying documents so far, but a large part of what you’ll do with your data is query it and present it to users. In these cases, you need to make a conscious and explicit decision: whether your display model is going to be based on your documents or on a different source. You can use RavenDB ETL to project the data out to a different database, changing its shape to the appropriate view model along the way. RavenDB ETL allows you to replicate a portion of the data in your database to another location, with the ability to modify the results as they are being sent. This can be a great tool to aid you in bridging the domain model and the view model without writing a lot of code.

This is useful for applications that have a high degree of complexity in their domain and business rules. For most applications, you can simply project the relevant data out at query time, but for more complex systems, you may want to have a strict physical separation between your read model and the domain model. In such a scenario, RavenDB ETL can greatly simplify the task of moving (and transforming) the data from the write side to the read side.

When it comes to modeling, we also need to take into account RavenDB’s map/reduce indexes. These allow you to define aggregations that will run in the background; in other words, at query time the work has already been done. This in turn leads to blazing fast aggregation queries and can be another factor in the design of the system. It is common to use map/reduce indexes to aggregate the raw data into a more usable form and work with the results, either directly from the index or by using the output collection feature to turn the results of the map/reduce index into real documents (which can be further indexed, aggregated, etc.).
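
For example, here is a sketch of a map/reduce index that aggregates order totals per company in the background (entity and field names are assumed from the sample data; the output collection line is the optional feature mentioned above):

using System.Linq;
using Raven.Client.Documents.Indexes;

public class Orders_Totals_ByCompany : AbstractIndexCreationTask<Order, Orders_Totals_ByCompany.Result>
{
    public class Result
    {
        public string Company { get; set; }
        public decimal Total { get; set; }
    }

    public Orders_Totals_ByCompany()
    {
        Map = orders => from order in orders
                        select new Result
                        {
                            Company = order.Company,
                            Total = order.Lines.Sum(l => l.Quantity * l.PricePerUnit)
                        };

        Reduce = results => from result in results
                            group result by result.Company into g
                            select new Result
                            {
                                Company = g.Key,
                                Total = g.Sum(x => x.Total)
                            };

        // Optionally materialize the aggregated results as real documents.
        OutputReduceToCollection = "CompanyTotals";
    }
}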

Thus far, we have only touched on document modeling, mind. There are a bunch of other options as well. We’ll start with the simplest option, attachments. At its core, an attachment is just that: a way to attach some binary data to the document. As simple as it sounds, it has some profound implications from a modeling point of view. The need to store binary data somewhere isn’t new, obviously, and there have been numerous ways to resolve it. In a relational database, a varbinary(max) column is used. In a document database, I’ve seen the raw binary data stored directly in the document (either as raw binary data or as a BASE64 encoded value). In most cases, this isn’t a really good idea. It blows up the size of the document (and the table) and complicates the management of the data. Storing the data on the file system leads to a different set of problems: coordinating transactions between the database and the file system, organizing the data, securing against paths such as “../../etc/passwd”, backups and restores, and many more.

Attachments

These are all things that you want your database to handle for you. At the same time, binary data is related to, but not part of, the document. For those reasons, we use the attachment model in RavenDB. This is meant to be viewed just like attachments in email. The binary data is not stored inside the document, but it is strongly related to it. Common use cases for attachments include the profile picture for a user’s document, the signatures on a lease document, the Excel spreadsheet with details about a loan for a payment plan document or the associated pictures from a home inspection report. A document can have any number of attachments, and an attachment can be of any size. This gives you a lot of freedom to attach (pun intended) additional data to your documents without any hassle. Like documents, attachments also work in multi-master mode and will be replicated across the cluster with the document.
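
From the client side, this is a single session call. A sketch, assuming an open document store named store and the sample User entity:

using (var session = store.OpenSession())
using (var picture = File.OpenRead("profile.png"))
{
    var user = session.Load<User>("users/1-A");

    // The stream is stored as an attachment named "profile.png" on the user
    // document, with the given content type; it never becomes part of the document itself.
    session.Advanced.Attachments.Store(user, "profile.png", picture, "image/png");
    session.SaveChanges();
}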

Counters

While attachments can be any raw binary data and have only a name (and an optional MIME type) for structure, counters are far more strictly defined. A counter is… well, a counter. It counts things. At the most basic level, it is just a named 64-bit integer that is associated with a document. And like attachments, a document may have any number of such counters. But why is it important to have a 64-bit integer attached to the document? How could something so small be important enough that we would need a whole new concept for it? After all, couldn’t we just store the same counter more simply as a property inside the document?

To understand why RavenDB has counters, we need to understand what they aren’t. They are related to the document, but not of the document. That means that an update to a counter is not going to modify the document as a whole. This, in turn, means that operations on counters can be very cheap, regardless of how many counters you have in a document or how often you modify them. Having the counters separate from the document allows us to do several important things:

  • Cheap updates
  • Distributed modifications

In a multi-master cluster, where any node can accept any write, you need to be aware of conflicts: two updates to the same value made on two disjoint servers. In the case of documents, RavenDB detects and resolves them according to the pre-defined policy. In the case of counters, there is no such need. A counter in RavenDB is stored using a CRDT, a format that allows us to handle concurrent modifications to the same value without losing data or taking expensive locks. This makes counters suitable for values that change often. A good example is tracking views on a page or an ad: you can distribute the operations across a number of servers and still reach the correct final tally. This works both for increments and decrements, obviously.

Given that counters are basically just a map<string, int64>, you might expect that there isn’t any modeling work to be done here, right? But it turns out that there is actually quite a bit that can be done even with that simple an interface. For example, when tracking views on a page or downloads for a particular package, I’m interested not only in the total number of downloads, but also in the downloads per day. In such a case, whenever we want to note another download, we’ll increment both the counter for overall downloads and another counter for downloads on that particular day. In other words, the name of the counter can hold meaningful information.
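
A sketch of that pattern, assuming an open session (the document id and counter names are mine):

var counters = session.CountersFor("packages/ravendb-client");

// Record a download both as an overall tally and as a per-day tally; the
// counter name itself carries the date.
counters.Increment("downloads");
counters.Increment($"downloads/{DateTime.UtcNow:yyyy-MM-dd}");

session.SaveChanges();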

Key/Value

So far, all the data we have talked about was stored and accessed in a multi-master manner. In other words, we could choose any node in the cluster and make a write to it, and it would be accepted. Data that is modified on multiple nodes at the same time will either be merged (counters), stored (attachments) or resolved (documents). This is great when you care about the overall availability of your system: we are always accepting writes and always moving forward. But that isn’t always what you want; there are situations where you might need a higher degree of consistency in your operations. For example, if you are selling a fixed number of items, you want to be sure that two buyers hitting “Purchase” at the same time don’t cause you problems just because their requests used different database servers.

In order to handle this situation, RavenDB offers the Cmp Xcng (compare exchange) model. This is a cluster-wide key/value store for your database; it allows you to store named values (integers, strings or JSON objects) in a consistent manner. This feature allows you to ensure consistent behavior for high-value data. In fact, you can combine it with cluster-wide transactions.

Cluster-wide transactions allow you to combine operations on documents with Cmp Xcng ops to create a single consistent transaction. This mode enables you to perform conditional operations, modifying your documents based on the globally consistent Cmp Xcng values.

Good examples of Cmp Xcng values include pessimistic locks and their owners, used to generate a cluster-wide lock that is guaranteed to be consistent and safe regardless of what is going on with the cluster. Another example is storing global configuration for your system across all the nodes in the cluster.
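
Here is a sketch of combining a cluster-wide transaction with a compare exchange value, for example to reserve a unique username (the entity and key names are mine; User is assumed to be a sample entity and store an open document store):

using (var session = store.OpenSession(new SessionOptions
{
    TransactionMode = TransactionMode.ClusterWide
}))
{
    session.Store(new User { Name = "ayende" }, "users/ayende");

    // SaveChanges fails if "usernames/ayende" already exists anywhere in the
    // cluster, so two concurrent registrations cannot both succeed.
    session.Advanced.ClusterTransaction.CreateCompareExchangeValue("usernames/ayende", "users/ayende");
    session.SaveChanges();
}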

Graph Queries

Graph data stores are used to hold data about nodes and the edges between them. They are a great tool for handling tasks such as social networks, data mining and finding patterns in large datasets. RavenDB, as of release 4.2, has support for graph queries, but it does so in a novel manner. A node in RavenDB is a document, quite naturally, but unlike other features, such as attachments and counters, edges don’t have a separate physical existence. Instead, RavenDB is able to use the document structure itself to infer the edges between documents. This dynamic nature means that when the time comes to apply graph queries on top of your existing database, you don’t have to do a lot of prep work. You can start issuing graph queries directly, and RavenDB will work behind the scenes to make sure that all the data is found, and quickly, too.

The ability to perform graph queries on your existing document structure is a powerful one, but it doesn’t alleviate the need to model your data properly to best take advantage of it. So what does it mean to model your data to be usable both in document form and for graph operations? Usually, when you need to model your data in a graph manner, you think mostly in terms of the connections between the nodes.

One way of looking at graph modeling in RavenDB is to be explicit about the edges, but I find this awkward and limiting. It is usually better to express the domain model naturally and allow the edges to pop up from the underlying data as you work with it. Edges in RavenDB are properties (or nested objects) that contain a reference to another document. If the edge is a nested object, then all the properties of that object are also properties of the edge and can be filtered on.

For best results, you want to model your edge properties as a single nested object that can be referred to explicitly. This is already a best practice when modeling your data, for better cohesiveness, but graph queries make this a requirement.

Unlike other graph databases, RavenDB isn’t limited to just the graph representation. A graph query in RavenDB is able to utilize the full power of RavenDB queries, which means that you can start your graph operation with a spatial query and then proceed to the rest of the graph pattern match. You should aim to do most of the work in the preparatory queries and not spend most of the time in graph operations.

A common example of a graph operation is fraud detection, with graph queries used to detect multiple orders made using many different credit cards for the same address. Instead of trying to do the matches using just graph operations, we can define a source query on a map/reduce index that aggregates all the results for orders on the same address. This would dramatically cut down on the amount of work that the database is required to do to answer your queries.

Revisions

The final topic that I want to discuss in this (already very long) post is the notion of Revisions. RavenDB allows the database administrator to define a revisions policy, in which case RavenDB will maintain, automatically and transparently, an immutable log of all changes to documents. This means that you have a built-in audit trail ready for use if you need it. Beyond the audit trail, revisions are also a very important feature in several key capabilities in RavenDB.

For example, using revisions, you can get the tuple of (previous, current) versions of any change made in the database using subscriptions. This allows you to define some pretty interesting backend processes, with full visibility into all the changes that happen to a document over time. This can be very useful for regression analysis, applying business rules and seeing how the data changes over time.
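
A sketch of what such a subscription might look like, assuming the sample Order entity and an open document store named store:

var name = store.Subscriptions.Create(new SubscriptionCreationOptions<Revision<Order>>());
var worker = store.Subscriptions.GetSubscriptionWorker<Revision<Order>>(name);

await worker.Run(batch =>
{
    foreach (var item in batch.Items)
    {
        Order before = item.Result.Previous; // null for the first revision
        Order after = item.Result.Current;
        // Compare before/after here: apply business rules, feed regression analysis, etc.
    }
});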

Summary

I tried to keep this post at a high level and not get bogged down in the details. I’m probably going to have a few more posts about modeling in general and I would appreciate any feedback you may have or any questions you can raise.

time to read 4 min | 625 words

This post was written because of this tweet, asking whether RavenDB has something similar to MongoDB’s capped collections. RavenDB doesn’t have this feature, by design. It is something that would be trivial to add and would likely lead to horrible consequences down the line.

The basic idea with a capped collection is that you set a limit on the size of the collection, and for any write beyond that limit, you delete the oldest documents in the collection to maintain that limit. You can also specify a maximum number of elements, but you must always specify the size (in bytes) as well. Redis has a similar feature, with capped lists, but that feature only counts the number of items, not their size. Both MongoDB and Redis have the notion of expiring data (which is, incidentally, how RavenDB would handle a similar case).

I can follow the reasoning behind having a limit based on the number of items, even if I think that in most cases that is going to be an issue. But using the overall size of the data for the limit is a recipe for disaster. You need to plan ahead for the capacity you’ll get, and in that case, the actual size of the data isn’t really meaningful (assuming that you aren’t going to run out of disk space). In fact, even a small change in the access patterns to your system can cause you to lose documents because they have reached the limit in the capped collection.

For example, if you use a capped collection to process events (which seems like a fairly common scenario), you’ll set the maximum size of the collection as a factor of how quickly you can process all the events in the collection, plus some padding. But if your event processing is running a bit slow (or has just plain crashed), you are racing against the clock to keep up with the incoming events before they are removed by the size limit.

A change in the usage pattern can also mean that users are sending more data: instead of sending 140 chars, they are now sending 280 chars, to keep the example focused on Twitter. That can really mess up any calculation you made based on expected sizing.

In practice, you will usually set the maximum number of elements and set the maximum size of the collection as something that is very high, hoping to never hit that limit.

I mentioned that I can follow the logic behind having a collection that is capped to a certain number of items. I can follow the logic, but I disagree with it. In most scenarios where you need this feature, you are looking at stuff like “I would like to retain the last 100 transactions”. The problem is that in the business world, you almost never think in this way; you don’t think in counts, you think in time. What you’ll likely be asked is: “I would like to retain last month’s transactions”. And if you have fewer than 100 transactions a month, these two options end up being very close to one another.

The way we handle it with RavenDB is through document expiration. You can specify a certain point in time when a document will be automatically deleted and RavenDB will take care of that for you. However, we also have a way to globally disable this feature when you need to extend the expiration for some reason. Working on top of time means that the feature is more closely aligned with the actual business requirements and that you don’t throw away perfectly good data too soon.
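
A sketch of setting an expiration from the client side, assuming an open session and that the expiration feature is enabled on the database (the document id is mine):

var doc = session.Load<object>("events/1-A");

// The @expires metadata tells RavenDB when this document should be deleted automatically.
var metadata = session.Advanced.GetMetadataFor(doc);
metadata["@expires"] = DateTime.UtcNow.AddMonths(1);

session.SaveChanges();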

time to read 3 min | 536 words

In my previous post I talked about an interesting challenge, distributing data among many different nodes, each of which can act independently on this data. Sadly, the best example for this scenario is distributing ads, but I hope you’ll excuse me on that.

The first thing to realize about this task is that it is basically a synchronization problem above anything else. We define a certain location as the primary one, the one that will accept all the modifications to the data. Each of the nodes will then simply need to connect to that location every now and then and get all the updates.

“Basically a synchronization problem” is like saying that getting a P1 emergency bug on Friday night just before the movie starts is a “tad annoying”. The problem is that to do synchronization properly, you have to model your data properly, make sure that you keep track of changes and be able to send partial changes down the wire efficiently. That is not a simple task at all.

In this post, I want to offer another option for handling this: using Raft. This is a strange use case for a consensus algorithm, I’ll admit. I guess that technically you could run a Raft consensus over 10,000 nodes. Just don’t expect it to be making any sort of decisions. So why am I offering Raft for a scenario where we have that many nodes? Because Raft, at its core, is a way to achieve consensus on a distributed log, that is all. And no one says that you must get that distributed log only via Raft directly.

The idea is basically to have a single source of truth. This can be a single server or it can be a Raft cluster with 3 – 7 nodes in it. All writes in the system are going to go there. The actual process is very well understood and there are multiple ways to do that. The simplest one to consume is likely rqlite. The log, in the case of rqlite, is going to be the SQL statements that are going to be applied to a sqlite database.

But how does this solve the problem of distributing the data? The answer is simple: we already have a way to disseminate distributed state, the log itself. What is going to happen is that each of the nodes at the edge will connect to the cluster and ask for a copy of the log as of the last committed entry that they have. When they get that, they can apply these statements to their own local copy of sqlite and know that they are now up to date with the state of the system as of that point in time.
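
To make that concrete, here is a hypothetical sketch of the edge-node side in C#, assuming an already-open SqliteConnection; LogEntry and the fetch delegate stand in for whatever API the central cluster actually exposes:

using System;
using System.Collections.Generic;
using System.Threading.Tasks;
using Microsoft.Data.Sqlite;

public sealed record LogEntry(long Index, string Sql);

public static class EdgeSync
{
    // Pull every log entry after the last one we applied and replay it into the
    // local sqlite copy, remembering how far we got.
    public static async Task CatchUpAsync(
        SqliteConnection local,
        Func<long, Task<IReadOnlyList<LogEntry>>> fetchEntriesAfter)
    {
        EnsureStateTable(local);
        long lastApplied = GetLastApplied(local);
        var entries = await fetchEntriesAfter(lastApplied);

        using var tx = local.BeginTransaction();
        foreach (var entry in entries)
        {
            using var apply = local.CreateCommand();
            apply.Transaction = tx;
            apply.CommandText = entry.Sql; // the log is just SQL statements
            apply.ExecuteNonQuery();

            using var mark = local.CreateCommand();
            mark.Transaction = tx;
            mark.CommandText = "UPDATE sync_state SET last_applied = $idx";
            mark.Parameters.AddWithValue("$idx", entry.Index);
            mark.ExecuteNonQuery();
        }
        tx.Commit(); // now up to date as of the last entry we received
    }

    private static void EnsureStateTable(SqliteConnection c)
    {
        using var cmd = c.CreateCommand();
        cmd.CommandText =
            "CREATE TABLE IF NOT EXISTS sync_state (last_applied INTEGER NOT NULL); " +
            "INSERT INTO sync_state (last_applied) SELECT 0 WHERE NOT EXISTS (SELECT 1 FROM sync_state);";
        cmd.ExecuteNonQuery();
    }

    private static long GetLastApplied(SqliteConnection c)
    {
        using var cmd = c.CreateCommand();
        cmd.CommandText = "SELECT last_applied FROM sync_state";
        return (long)cmd.ExecuteScalar();
    }
}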

This approach skips over the need to architect your data for sync (which is hard) and pushes all of that complexity down the stack to your infrastructure. If the number of nodes you have is large enough, you might need to introduce mirrors to reduce the load. But that fits very nicely into the architecture without really needing to change anything.

time to read 4 min | 648 words

Before I start discussing this topic, I want to talk a bit about the speed of light. That pesky limit basically means that there is an inherent lag in passing information between any two points in space. In your daily life, you can mostly ignore it. The human brain is far too slow to perceive it, and even if you are working with computers, you can usually ignore the speed of light for anything less than about 500 miles.

But the speed of light is merely the hard upper limit of our ability to send information from one location to another. In practice, the lag time between any two computers connected to a network is much higher. In fact, if you are a gamer, you are very well acquainted with that fact.

Let’s assume that we have a need to do the following:

  • Hold a mutable state of some kind.
  • Modify it in a consistent manner.
  • Distribute it to many locations, where many is at least hundreds and potentially tens of thousands of locations.

Let’s break this up a bit into its component parts. A mutable state basically means that we can modify it over time, and that these modifications show up in all locations. The speed of light (and network lag) ensures that we can’t have this happen instantaneously. And, of course, the killer requirement here is that we need to do this in a consistent manner.

I find it really hard to talk about abstract problems concretely, so let’s use an example. We have a set of tax rules that determine how certain products should be taxed. The ruleset is big, changes on a regular basis and it is quite important to get things right. Oh, and we also need to push those rules to tens of thousands of locations that would independently compute taxes for purchases.

In this case, the solution is quite easy. We have a single source of truth that all the locations can pull the tax ruleset from (directly or through mirrors). We’ll also ensure that tax rules are applied only at some future date from their creation, to allow time for all these locations to be updated. If a location hasn’t had an updated ruleset for over a certain amount of time, it can refuse to process any payments until it gets an update. This example leaves really very little for us to do in terms of design.

A better example would have multiple actors changing the data as we work with it. An actual example of when this is required is an ad provider that needs to run ads in many (physical) locations. Each location is actually an IoT device that includes a computer, a screen and a camera. Each such device decides independently what ads should be shown (for example, if the camera sees a baby stroller, it shows ads for baby toys).

I actually had to search hard for an example that would work for this scenario, but I think this is a good one. We obviously have a lot of activity on the ad provider side, with many people registering and modifying ad settings. We need that portion of our business to be consistent (this is what we are charging people for, after all). However, given the probably high number of ad devices spread all over, we can’t rely on them always being connected to a central server, or even on them being connected at all.

And even if there is no connectivity and nothing new to show, we must show something. Even if we aren’t getting paid for showing the ad, not showing something or (actually worse) showing an error is something that we can’t tolerate.

Given all of these constraints, how would you build such a system?

time to read 4 min | 760 words

You might be familiar with Moore’s law, which states that the number of transistors in a dense integrated circuit doubles about every two years. In effect, performance doubles every 24 months. For many years, that has certainly held true. But that hasn’t been the case for the past 10 years or so. Even when Moore’s law held true, there was a snag. Wirth’s law is also in effect (as an aside, read his article “A Plea for Lean Software”; it is 23 years old and it holds true today), and Wirth’s law states that software gets slower more rapidly than hardware becomes faster. It’s good to be a software developer, because even when the CPU clock speed doesn’t jump all the time, we still get more CPUs to play with. A common approach to handling performance issues today is just to throw more parallelism at the problem until it shuts up.

To a certain extent, this makes perfect economic sense. In the calculus between developers’ salaries and the cost of hardware, you’ll usually find that buying a couple of extra servers is drastically cheaper than spending another 6 months improving the performance of the system. Jeff Atwood wrote about the topic a decade ago and I think that he is still very much correct, to a degree. There are other factors to consider, though: the ongoing overhead over time, in particular in the cloud. One of the major factors to take into account here is that when you are running in the cloud, you aren’t running on your own servers and you are charged by usage. That can change the math by quite a bit.

For example, if I bought a couple of servers, the number of IO operations that I make is pretty much meaningless to me. Whether I am hitting the disk ten times a second or ten thousand times a second, I’m paying the same cost. Oh, sure, I might need to buy a better hard disk to get 10,000 IOPS, but that is a one-time cost, and usually not that meaningful in the grand scheme of things. But when you are in the cloud, getting higher IOPS costs more, and it costs more over time. In the same sense, in my data center, the cost of querying a database is zero. In the cloud, you will typically be charged some (minuscule) amount on a per-query basis. Nothing to worry about, except that your software still works according to Wirth’s law, so you are making more queries than you should, which means that you are charged for each and every one of them.

I used to make my living by being a database performance consultant. I would go to a customer, look at how they are using their database and optimize the access pattern. It was common to see 90%+ savings in the number of queries for common operations. I was doing that because that directly translated to better responsiveness of the application. Applications that respond faster are more pleasant to use and there are numerous studies about faster applications generating higher revenues. I remember talking to clients and explaining to them why they should invest in the overall performance of the application before they hit the total resource depletion that would take them down.

Today, in the cloud, I would have a much simpler task. Let’s assume a simple application using CosmosDB, as an example. With 200 page views / sec on the site, and each page view generating 80 requests to the database, that gives us a total of 16,000 requests a second, which translates to an end-of-month bill of about $10,000. I’m using the 80 queries / page view as a reasonable lowball estimate, mind. Drupal, for example, does 300 – 400 queries per page view, and it is easy to get to these numbers without paying attention. Dropping the number of queries per page to 10 (which is usually pretty easy to do with proper queries, attention to detail and some caching) gives you a database bill that is around $1,000.

Over the span of a year, that is enough to pay a full-time developer who can go and find other places where your software can be improved. And unlike before, you don’t need to justify it with studies or any indirect causation. You can point directly to the bottom line in an invoice and show how much money is saved.

time to read 3 min | 460 words

“I’m getting a 403 Forbidden error” is one of the more annoying things to debug. Something, somewhere, in a distributed system, has decided that a request is not authorized and blocked it.

In RavenDB 3.5, we supported OAuth and Windows Authentication to authenticate clients talking to RavenDB. That meant that we had to field support questions exactly like the one above. That was not fun. First of all, there seems to be an inclination in the security community to hide errors really well. “I don’t like you because of the difference between our clocks, so let’s send an EPERM to this socket and close it, and if I’m feeling nice, log it to /var/logs/obscure/dev/null”.

I have scars from trying to work out “Why doesn’t Windows Auth work for this user” that involved talking to a DBA who was neither the domain admin nor even aware of the domain topology in the organization (fallacies of distributed computing: there is one administrator). At one point, we had a production issue that had RavenDB refusing all access because Windows couldn’t validate credentials: a linked domain was down for scheduled maintenance and the credential cache had expired.

When we designed RavenDB 4.0, we decided that for security, we need something that would be debuggable. That means:

  • Secured – that goes without saying; if it is a security measure that is debuggable but isn’t actually secure…
  • Rely only on two parties – the client & server in each connection.
  • Provide enough information to solve the problem

We selected TLS / SSL for this, with client certificates as the authentication mechanism. This answers the first requirement, because TLS has been analyzed enough that I’m certain that it is secure. It is also well known and familiar to administrators.

TLS uses PKI, so it isn’t technically a two-party solution; you may have to deal with certificate revocation, trust chains, etc. But those are well understood and produce really good errors. On the client certificate authentication side, however, we require that we have the actual list of trusted certificates, so there is no need to check anywhere else.

We have also taken debuggable security a couple of steps further. For example, we’ll accept an unsecured connection and let it establish itself, then send a single message down the line explaining why we aren’t going to use it.

In practice, this works great. Take a look at this stack overflow question, which shows the following error:

[Image: the error message from RavenDB shown in the Stack Overflow question]

Given just the information in the post and the error from RavenDB, it was easy to figure out that the client certificate hadn’t been registered and that this is why RavenDB was refusing access.

I consider this a major success.

time to read 10 min | 1862 words

I spent some idle time thinking about the topic lately, and I think that this can make a pretty good blog post. The goal is to have a generic application level network protocol for client/server and server/server communication. There are a lot of them out there, and this isn’t actually design work that I intend to implement. It is a thought exercise that runs through a lot of the reasoning behind the design.

There are network protocols that are specific to a purpose, and that is reflected in a lot of hidden assumptions they have. I’m trying to conceive of a protocol that would be generic enough to be used for a wide variety of cases. Here are the key points in my thinking.

  • Don’t invent the wheel
  • Security is a must
  • RPC is the most common scenario
  • Don’t cause problems for the client / server by messing up the protocol
  • Debuggability can’t be bolted on
  • Push model should also work

By not reinventing the wheel I mean that this should be relatively simple to implement. That pretty much limits us to TCP as the underlying mechanism. I’m actually going to specify a stream-based communication protocol instead, though. With the advent of QUIC, HTTP/3, etc., that might actually be useful. But the whole idea is that the underlying abstraction we want to rely on is a connection between two nodes that is a stream. All the issues of packet ordering, retries, congestion, etc. are to be handled at that level.

In this day and age, security is not an optional requirement, and it should be incorporated into the design of the system from the get-go. I absolutely adore TLS, and it solves a whole bunch of problems for us at the same time. It gives us a secure channel, it handles authentication on both ends and it is both widely understood and commonly used. This means that by selecting TLS as the security mechanism, we aren’t limiting any clients. So the raw protocol we rely on is TLS/TCP, with authentication done using client certificates.

By far the most common usage for a network protocol is the request/reply model. You can see it in HTTP, SMTP, POP3 and most other network protocols. There is a problem with this model, though. A simple request/reply protocol is going to cause scalability and management issues for its users. What do I mean by that? Look at HTTP as a great example. It is a simple request/reply protocol, and that fact has caused a lot of complexity for users. If you want to send several requests in parallel, you need multiple connections, and head-of-line blocking is a real problem. This impacts both clients and servers and can cause a great deal of hardship for all. Indeed, this is why HTTP/2 uses framing and allows sending multiple requests without specifying the order in which the server replies to them.

A better model would be to break that kind of dependency, and I’m likely going to be modeling at least some of that on the design of HTTP/2.

Speaking of which, HTTP/2 is a binary protocol, which is great if you have the entire internet behind you. If you are designing a network protocol that isn’t going to be natively supported by all and sundry, you are going to need to take into account the debuggability of the solution. The protocol I have in mind is a text-based protocol and should be usable from the command line by using something like:

openssl s_client -connect my_server:4833

This will give you what is effectively a shell into the server, and you should be able to write commands there and get their results. I used to play around a lot with network protocols, and being able to telnet to a server and manually play with the commands is an amazing experience. As a side effect of this, it also means that having a trace of the communication between client and server will be amazingly useful for diagnostics down the line. For that matter, for certain industries, being able to capture the communication trace might be an absolute requirement for auditing purposes (who did what and when).

So, here is what we have so far:

  • TLS/TCP as the underlying protocol
  • Text based, so we can work with it manually

What we are left with, though, is the question of what the actual data on the wire is going to look like.

I’m not going to be too fancy, and I want to stick closely to stuff that I know that works. The protocol will use messages as the indivisible unit of communication. A message will have the following structure (using RavenDB as the underlying model):

GET employees/1-A employees/2-B
Timeout: 30
Sequence: 293
Include: ReportsTo

PUT “document with spaces”
Sequence: 294
Body: chunked

39
<<binary data 39 bytes in len>>
 

So, basically, we have a line-oriented protocol (each line separated by \r\n, and limited to a well-known maximum size). A message starts with a command line, which has the following structure:

cmd (token) args[] (token)

Where token is either a sequence of characters without whitespace or a quoted string if it contains whitespace.

Following the command line, you have the headers, which pretty much follow the design of HTTP headers. They are used to pass additional information, such as command-specific data (like the Include header on the first command) or protocol details (like specifying a timeout for that particular command, or that the second command has a body and how to read it). A command ends with an empty line, and then you have an optional body.

The headers here serve a very important role. As you can see, they are key for protocol flexibility and enabling versioning. They give us a good way to add additional data after the first deployment without breaking everything.

Because we want to support people typing this manually, we’ll probably need some way to specify message bodies that a human can type on their own without having to compute sizes upfront. This is something that will likely be needed only for human input, so we can probably define a terminating token that would work. I’m not focusing on this because it isn’t a mainline feature, but I wanted to mention it because debuggability isn’t a secondary concern.

You might have noticed a repeated header in the commands I sent: the Sequence header. This one is optional (when humans write it) but will be very useful for tracing, so tools will always add it. A stream connection is actually composed of two channels, the read and the write. On the read side of this protocol, we read a full command from the network, hand it off to something else to process it and go right back to reading from the network. This design is suitable for the event-based systems that have proven so useful in scaling the amount of work a server can handle. Because we can start reading the next command while we process the current ones, we greatly reduce the number of connections we require and enable a lot more parallel work.

Once a message has been processed, the reply is sent back to the client. Note that there is no requirement that the replies will be sent in the same order as the requests that initiated them. That means that an expensive operation on the server side isn’t going to block cheaper operations that came after it, which is, again, important for the overall speed of the system. It also lends itself quite nicely to event-loop-based processing. The sequence number of the request is used in the reply to ensure that the client can properly correlate the reply to the relevant request.

On the client side, you write a command to the network. When reading from the network, you need to keep track of the sequence number you sent and route it back to the right caller. The idea here is that on the client side, you may have a single connection that is shared among several threads, reducing the number of overall connections you need and getting better overall utilization from the network.
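
A hypothetical sketch of that client-side bookkeeping in C#: a single shared connection, a sequence number per request, and one read loop that routes replies back to the awaiting callers (the actual TLS/TCP wire handling is stubbed out):

using System.Collections.Concurrent;
using System.Threading;
using System.Threading.Tasks;

public sealed class ProtocolClient
{
    private long _sequence;
    private readonly ConcurrentDictionary<long, TaskCompletionSource<Reply>> _pending =
        new ConcurrentDictionary<long, TaskCompletionSource<Reply>>();

    // Send a command over the single shared connection and await its reply.
    public async Task<Reply> SendAsync(string command)
    {
        long seq = Interlocked.Increment(ref _sequence);
        var tcs = new TaskCompletionSource<Reply>(TaskCreationOptions.RunContinuationsAsynchronously);
        _pending[seq] = tcs;

        await WriteMessageAsync(command, seq); // adds the Sequence header on the wire
        return await tcs.Task;                 // completed by the read loop below
    }

    // Single background read loop for the connection: replies may arrive in any
    // order, and the Sequence header routes each one to the right caller.
    public async Task ReadLoopAsync(CancellationToken token)
    {
        while (!token.IsCancellationRequested)
        {
            Reply reply = await ReadMessageAsync(token);
            if (_pending.TryRemove(reply.Sequence, out var tcs))
                tcs.TrySetResult(reply);
        }
    }

    // Placeholders for the actual TLS/TCP stream handling.
    private Task WriteMessageAsync(string command, long sequence) => Task.CompletedTask;
    private Task<Reply> ReadMessageAsync(CancellationToken token) => Task.FromResult(new Reply(0, ""));
}

public sealed record Reply(long Sequence, string Body);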

A nice property of this design is that you don’t have to do things this way. If you don’t want to support concurrent requests / replies, just have a single connection and wait to read the reply from the server whenever you make a request. That gives you the option of a simple stateful approach, but also an easy upgrade path down the line if / when you need it. The fact that the mental model of the user is request/reply is a great help, to be honest, even if this isn’t what is actually going on. This greatly reduces the amount of complexity that a user needs to keep in their head.

Some details of the protocol would need gentle massaging, to ensure that a human on the command line can type reasonable commands, but that is fairly straightforward. The text-based nature of the communication also lends itself nicely to tracing / audits. At the client or server level, we can write a <connection-id>.trace file that will have all the reads and writes on that connection. During debugging, you can just tail the right file and see exactly what is going on, or just zip the files up to archive them for auditing.

Speaking of zipping, let’s consider the following command:

OPTIONS gzip

This command on the connection can do things like change how we encode the data, in this case asking the server to use gzip for all reads and writes from now on. The server can reply with a message (uncompressed) that it is now switching compression, and everything from that point forward will be compressed. Note that unlike HTTP compression, we get the benefits of compression across multiple requests, and given that most requests / replies share a lot of the same structure, that is likely to benefit us quite a bit.

The last topic I listed is the notion of push operations and how they should be handled. Given that we don’t have a strict request/reply model, there is an obvious way for the server to send additional data “out of band”. A client can request to be notified by the server of certain things, and the server will make a note of that and just send the data back at some later time. There is obviously a need to correlate the push notification with the original request, but that is what we have the headers for. A simple CorrelationId header on the original request and the push notification will be sufficient for the client side to be able to route it to the right callback.

I think that this is pretty much it. This should cover enough to give you a clear idea about what is required, and I believe that it is enough for a thought exercise. There are a lot of other details that should probably be answered, for example, how do you deal with very large responses (break them into multiple messages, I would assume, to avoid holding up the connection for other requests)? But that should be the gist of it.

time to read 6 min | 1032 words

I talked a lot about how graph queries in RavenDB will work, but one missing piece of the puzzle is how they are going to be used. I’m going to use this post to discuss some of the options that the new features enable. We’ll take the following simple model, issue tracking. Let’s imagine that a big (secret) project is going on and we need to track down the tasks for it. On the right, you have an image of the permissions graph that we are interested in.

The permission model is pretty standard, I think. We have the notion of users and groups. A user can be associated with one or more groups. Group memberships are hierarchical. An issue’s access is controlled either by giving access to specific users or to a group. Assigning a group will also allow access to all of the group’s parents and to any user that is associated with any of them.

Here is the issue in question:

[Image: the issue document in question]

Given the graph on the right, let’s see, for each user, how we can find out what issues they have access to.

We’ll start with Sunny, who is named directly in the issue as allowed access. Here is the query to find the issues on which we are directly named.

[Image: graph query returning issues where the user is named directly]

This was easy, I must admit. We could also write this without any need for graph queries, because it is so simple:

[Image: the same query written without the graph syntax]

It gets a bit trickier when we have to look at Max. Unlike Sunny, Max isn’t directly named. Instead, he is a member of a group that is directly named in the issue. Let’s see how we can query for issues that Max has access to:

[Image: graph query for issues Max can access through his group membership]

This is where things start to get interesting. If you look at the match clause, you’ll see that we have arrows that go both left and right. We are telling RavenDB that we want to find issues that have the same group (denoted as g in the query) as the user. This is the kind of query you can’t express in RavenDB today without the graph query syntax. Another way to write this query is to use explicit clauses, like so:

[Image: the same query written with explicit clauses and-ed together]

In this case, all the arrows go in a single direction, but we have two clauses that are and-ed together. These two queries are exactly the same; in fact, RavenDB will translate the one with arrows going both ways into the one with the and.

The key here is that between any two clauses with an and, RavenDB will match the same alias to the same value.

So we can now find issues for Max and Sunny, but what about the rest of the people? We can, of course, do this manually. Here is how we can find issues for people one group removed from the issue:

[Image: query for issues that are one group removed from the user]

This query gives us all the issues for Nati that are one group removed from him. That works, but it isn’t how we want to do things. It is a good exercise in setting out the structure, though, because instead of hard-coding the pattern, we are now going to use recursion.

[Image: the recursive version of the query]

The key to this query is the recursive element in the third line. We move from the issue to its groups, and then we recurse from the issue’s groups to their parents. We allow empty recursion and we follow all the paths in the graph.

On the other side, we go from the user and try to find a match from the user’s group to any of the groups that end the query. In Nati’s case, we went from project-x group to team-nati, which is a match, so we can return this issue.

Here is the final query that we can use. It is a bit much, even though it is short, so I will deconstruct it below:

[Image: the final parameterized query]

We use a parameterized query here, to make things easier. We start from a user and find an issue if:

  • The issue directly names the user as allowed access. This is the case for Sunny.
  • The issue’s groups (or their parents) match the groups that the user belongs to (non-recursively, mind).

In the case of Max, for example, we have a match because the recursive element allows a zero-length path, so the project-x group is allowed access and Max is a member of it.

In the case of Nati, on the other hand, we have to go from project-x to team-nati to get a match.

If we set $uid to users/23, which is Phoebe, all the way to the left in the graph, we’ll also have a match. We’ll go from project-x to execs to board and then find a match.

Snoopy, on the other hand, doesn’t have access. Look carefully at the direction of the arrows in the graph. Snoopy belongs to the r-n-d group, and that group is a child of execs, but the query we specified only goes up to the parents, not down to the children, so Snoopy is not included.

I hope that this post gave you some ideas about what kind of use cases are enabled by graph queries in RavenDB. I would love to hear about any scenarios you have for graph queries so we can play with them and see how they are answered by RavenDB.
