Ayende @ Rahien

Oren Eini aka Ayende Rahien CEO of Hibernating Rhinos LTD, which develops RavenDB, a NoSQL Open Source Document Database.

time to read 6 min | 1072 words

RavenDB stores (critical) data for customers. We have customers in pretty much every field imaginable: healthcare, finance, insurance and defense. They do very different things with RavenDB; some run a single cluster, some deploy to tens of thousands of locations. The one thing that they all have in common is that they put their data into RavenDB, and they really don’t want to put that data in the hands of an unknown third party.

Some of my worst nightmares are articles such as these:

That is just for the last six months, and just one site that I checked.

To be fair, none of these cases are because of a fault in MongoDB. It wasn’t some clever hack or a security vulnerability. It was someone who left a production database accessible over the public Internet with no authentication.

  1. Production database + Public Internet + No authentication
  2. ?
  3. Profit (for someone else, I assume)

When we set out to design the security model for RavenDB, we didn’t account only for bad actors and hostile networks. We had to account for users who did not care.

Using MongoDB as the example: by default it will only listen on localhost, which sounds like a good idea, because no one external can access it. Safe by default, flowers, parade, etc.

And then you realize that the first result for searching “mongodb remote connection refused” leads to a page with a detailed guide on how to change which IPs MongoDB will listen on. And guess what? If you follow that article, you’ll fix the problem. You’ll be able to connect to your database instance, and so will everyone else in the world!

There is even a cool tip in the article, talking about how to enable authentication in MongoDB. Because everyone reads that, right?


Except maybe the guys at the beginning of this post.

So our threat model had to include negligent users. And that leads directly to the usual conundrum of security.

I’ll now pause this post to give you some time to reflect on the Wisdom of Dilbert:

In general, I find that the best security for a computer is to disconnect it from any power sources. That does present some challenges for normal operations, though. So we had to come up with something better.

In RavenDB, security is binary. You are either secured (encrypted communication and mutual authentication) or you are not (everything is plain text and everyone is admin). Because the Getting Started scenario is so important, we have to account for it, so you can run RavenDB without security. However, that will only work when you set RavenDB to bind to localhost.

How is that any different than MongoDB? Well, the MongoDB guys have a pretty big set of security guidelines. At one point I took a deep look at that and, excluding the links for additional information, the MongoDB security checklist consisted of about 60 pages. We decided to go a very different route with RavenDB.

If you change RavenDB’s binding from localhost to a public address without setting up security, it will work: RavenDB will happily start up and serve an error page to all and sundry. That error page is very explicit about what is going on: you are doing something wrong, you don’t have security, and you are exposed. So the only thing that RavenDB is willing to do at that point is to tell you what is wrong and how to fix it.

That leads us to the actual security mechanism in RavenDB. We use TLS 1.2, but it is usually easier to just talk about it as HTTPS. That gives us encrypted data over the wire and allows for mutual authentication at the highest level. It is also something that you can configure on your own, without requiring an administrator to intervene. The person setting up RavenDB is unlikely to have Domain Admin privileges or the ability to change organization-wide settings, nor should this be required. HTTPS relies on certificates, which can be deployed, diagnosed and debugged without any special requirements.
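To make the mutual authentication part concrete, here is a minimal sketch in .NET (not RavenDB’s client code, just the general shape of certificate-based authentication from the client side; the URL and certificate file are placeholders):

```csharp
using System;
using System.Net.Http;
using System.Security.Cryptography.X509Certificates;
using System.Threading.Tasks;

class Client
{
    static async Task Main()
    {
        // Load the client certificate (path and password are placeholders).
        var clientCert = new X509Certificate2("client.pfx", "pfx-password");

        var handler = new HttpClientHandler();
        handler.ClientCertificates.Add(clientCert); // presented during the TLS handshake

        using var http = new HttpClient(handler);
        // The server authenticates us by our certificate; we authenticate the server
        // by its HTTPS certificate. Both directions, one mechanism, no domain admin needed.
        var response = await http.GetAsync("https://your-server.example/databases");
        Console.WriteLine(response.StatusCode);
    }
}
```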

Certificates may not require a privileged access level, but they are complex. One of the reasons we chose X509 certificates as our primary authentication mechanism is that they are widely used. Many places already have policies and expertise on how to deal with them. And for the people who don’t know how to deal with them, we could automate a lot of that and still get the security properties that we wanted.

In fact, Let’s Encrypt integration allowed us to get to the point where we can set up a cluster from scratch, with security, in a few minutes. I actually got it on video, because it was so cool to be able to do this.

Using certificates also meant that we could get integration with pretty much anything. We got good support from browsers, we got command line integration, great tools, etc.

This isn’t a perfect system. If you need something that our automated setup doesn’t provide, you’ll need to understand how to work with certificates. That isn’t trivial, but it is also not a waste: the knowledge is both interesting and widely applicable.

The end result of RavenDB’s security design is a system that is meant to be deployed in a hostile environment, prevent information leakage on the wire and allow strong mutual authentication of clients and servers. It is also a system that was designed to prevent abuse. If you really want to, you can get an unsecured instance on the public internet. Here is one such example: http://live-test.ravendb.net

In this case, we did it intentionally, because we wanted to be able to show what an unsecured instance looks like in the browser.

But the easy path? The path that we expect most users to follow? That one ends up with a secured and safe system, without showing up in the news because all your data got away from you.

time to read 2 min | 361 words

I read this post about using an Object Relational Mapper in Go with great interest. I spent about a decade immersed deeply in the NHibernate codebase, and I worked on a bunch of OR/Ms in .NET and elsewhere. My first reaction when looking at this post could be summed up in a single picture.

This is a really bad modeling decision, and it is a really common one when people are using an OR/M. The problem is that this kind of model fails to capture a really important aspect of the domain: its size.

Let’s consider the Post struct. We have a couple of collections there: Tags and Comments. It is reasonable to assume that you’ll never have a post with more than a few tags, but a popular post can easily have a lot of comments. Using Reddit as an example, it took me about 30 seconds to find a post that had over 30,000 comments on it.

On the other side, the Tag.Posts collection may contain many posts. The problem with such a model is that these properties are a trap. If you hit something that has a large number of results, you are going to use a lot of memory and put a lot of pressure on the database.

The good thing about Go is that it is actually really hard to play the usual tricks with lazy loading and proxies behind your back. So the GORM API, at least, is pretty explicit about what you load there. The problem with this, however, is that developers will explicitly call the “gimme the Posts” collection and it will work when tested locally (small dataset, no load on the server). It will fail in production in a very insidious way (slightly slower over time until the whole thing capsizes).

I would much rather move those properties outside the entities and into standalone queries, ones that come with explicit paging that you have to take into account. That reflects the actual costs behind the operations much more closely.
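To make that concrete, here is a minimal sketch of what I mean, using the RavenDB C# client for illustration (the Comment/PostId model and the page size are assumptions, not from the post being discussed); the same shape works in Go just as well:

```csharp
using System.Collections.Generic;
using System.Linq;
using Raven.Client.Documents.Session;

public class Comment
{
    public string Id { get; set; }
    public string PostId { get; set; }
    public string Text { get; set; }
}

public static class Comments
{
    // Comments are loaded by an explicit, paged query - not by walking a
    // post.Comments collection that may hold 30,000 items.
    public static List<Comment> GetPage(IDocumentSession session, string postId,
        int page, int pageSize = 25)
    {
        return session.Query<Comment>()
            .Where(c => c.PostId == postId)
            .Skip(page * pageSize)   // the caller cannot pretend paging doesn't exist
            .Take(pageSize)
            .ToList();
    }
}
```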

time to read 1 min | 72 words

I’m going to be in London at the beginning of June. I’ll be giving a keynote at Skills Matters as well as visiting some customers.

I have half-day and full-day slots available for consulting (RavenDB, databases and overall architecture). Drop me a line if you are interested.

I also should have an evening or two free if there is anyone who wants to sit over a beer and chat.

time to read 2 min | 287 words

In a previous post about authorization in a microservice environment, I wrote that one option is to generate an authorization token and have it hold the relevant claims for the application. I was asked how I would handle a scenario in which the security claims are over individual categories of orders and a user may have too many categories to fit in the token.

This is a great question, because it showcases a really important part of such a design: an inherent limit to complexity. The fact that having a user with a thousand individual security claims is hard isn’t a bug in the system, it is a feature.

For many such cases, it really doesn’t make sense to set up security in such a manner. How can you ever audit or reason about such a system? It just doesn’t work this way in the real world. An agent may be authorized for a dozen customers, and her manager will be allowed access to them as well. But attaching each individual customer to the manager doesn’t work. Instead, you would create a group and attach the customers to the group, then allow the manager to access the group. Such a system is much easier to work with and review. It also matches a lot more closely how the real world works.
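As an illustration (hypothetical types, not any particular framework’s API), the manager’s token carries a single group claim, and the group-to-customer mapping lives in the service that owns it, where it can be audited and changed in one place:

```csharp
using System.Collections.Generic;
using System.Linq;

// Instead of a thousand "customers/1234" claims baked into the token, the token
// carries a handful of group claims and the owning service resolves membership.
class AccessCheck
{
    // group -> customers, owned and audited by the relevant service
    static readonly Dictionary<string, HashSet<string>> GroupCustomers = new()
    {
        ["north-region-agents"] = new HashSet<string> { "customers/1", "customers/2", "customers/3" }
    };

    public static bool CanAccessCustomer(IEnumerable<string> groupClaims, string customerId) =>
        groupClaims.Any(group =>
            GroupCustomers.TryGetValue(group, out var customers) && customers.Contains(customerId));
}
```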

Some of the problems here are derived from the fact that it seems like, when we use a computer, we can build such a system. But in most cases, this is a false premise. Not because of actual technical limitations, but because of management overhead.

Building the system upfront so the things that should be hard are actually hard is going to be a lot better in the long run.

time to read 5 min | 808 words

I talked a bit about microservices architecture in the past few weeks, but I think that there is a common theme to those posts that is missed in the details.

A microservices architecture, just like Domain Driven Design or Event Sourcing and CQRS, is an architectural pattern that is meant to manage complexity. In the realm of operations, Kubernetes is another good example of a tool that is meant to manage complexity.

I feel that this is a part that all too often gets lost. The law of leaky abstractions means that you can’t really reduce complexity, you can only manage it. This means that tools and architectures that are meant to deal with complexity are themselves complex, by necessity. The problem is when you try to take a solution that was successfully applied to solve a complex problem and apply it to something that isn’t of equal complexity.

Keep the following formula in mind:

Solution Complexity = Architecture Complexity + ( Problem Complexity / Architecture Factor )

Let’s try to solve this formula for a couple of projects. One would be managing a little league soccer website and the other would be the standard online shop. Here are the results:

Cost / Benefit of Architecture    Little League    Online Shop
Architecture Complexity                10               10
Problem Complexity                      2               20
Architecture Factor                     3                3
Solution Complexity                  10.6             16.6

By the way, the numbers are arbitrary; I’m trying to show a point, and showing it with numbers makes it easier to get the point across. The formula is real, though, based on my experience.

The idea behind the formula and the table above is simple. Every architecture you choose can be ranked along two axes. One is the architecture complexity and the second is the architecture factor. The architecture complexity is a (usually) fixed number that ranks how complex it is to use the architecture. The architecture factor is how much this architecture helps you deal with the overall problem complexity.

You can see above that applying the same architecture to two different problems can produce very different results. The overall solution complexity for the little league website is less than for the online shop, as expected. But you can also see that there are huge fixed costs here that drive the overall complexity far higher.

Using a different architecture, with a much smaller architecture factor but also much lower fixed complexity, will allow you to deliver a solution that has much lower complexity (and get it done faster, with fewer bugs, etc).

Choosing a microservice architecture implies that you are going to have a net benefit here. The additional complexity of using microservices is offset by the fact that the architectural factor is going to reduce your overall complexity. Otherwise, it just doesn’t make sense.

An 18 wheeler is a great thing to have if you need to ship a whole bunch of stuff. It is the Wrong Tool For The Job if you need to commute to work.

In most cases, people select the architecture that sounds right for their project, mostly because they focus on the architecture factor, without taking into account the fixed complexity cost. When they run into that cost, they either re-evaluate or press forward regardless. Let’s assume you run into a project where they chose the microservice architecture, then realized that some parts of it are complex, so they cut some corners. I’m thinking about something like what is shown here. Let’s analyze what you end up with.

Architecture Complexity = 10, Architecture Factor = 1, Problem Complexity = 8 → Solution Complexity = 10 + (8 / 1) = 18

And that is for the good case where your architecture factor isn’t actually below 1, which I would argue is going to be the case for the kind of architecture these solutions end up with. A Distributed Monolith has an architecture complexity of 10 and a factor of 0.75. So trying to solve a problem that has a complexity of 8 will result in an overall complexity of 20.6.
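For the sake of the arithmetic, here is the formula as a tiny C# program with the numbers from the table and from this example plugged in (the exact decimals differ from the figures above only by rounding):

```csharp
using System;

class ArchitectureCost
{
    // Solution Complexity = Architecture Complexity + (Problem Complexity / Architecture Factor)
    static double Solution(double architecture, double problem, double factor) =>
        architecture + problem / factor;

    static void Main()
    {
        Console.WriteLine(Solution(10, 2, 3));    // little league website    ~10.7
        Console.WriteLine(Solution(10, 20, 3));   // online shop              ~16.7
        Console.WriteLine(Solution(10, 8, 1));    // corner-cut microservices  18
        Console.WriteLine(Solution(10, 8, 0.75)); // distributed monolith     ~20.7
    }
}
```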

I don’t actually have real numbers to evaluate different architectures and solution complexities. That would probably require rigorous study, but empirical evidence can give good off the cuff numbers for most of the common architectures. I’m going to leave it up to the comments, if someone wants to take up this challenge.

Keep this in mind when you are choosing your architecture, for both greenfield and brownfield projects. That can save you a lot of trouble.

time to read 3 min | 520 words

They just aren’t. And I’m talking as someone who has actually implemented multiple distributed transaction systems. People moving to microservices are now discovering a lot of the challenges and hurdles of distributed systems and it is only natural to want to go back to the cozy transactional world, where you can reason about things properly.

This post is in response to this article: Microservices and distributed transactions, which I read with interest, because it isn’t often that a post will refute its own premise with the very first statement.

The two-phase commit protocol was designed in the epoch of the “big iron” systems like mainframes and UNIX servers; the XA specification was defined in 1991 when the typical deployment model consisted of having all the software installed in a single server.

That is a really important observation, because in that setting one big factor is removed from the distributed transaction: the distributed part. Note that this was almost 30 years ago; distributed transactions and the two-phase commit protocol aren’t running on a single node any longer, but the architecture is still rooted in the same concept. And it doesn’t work. I wrote a blog post explaining the core issues with two-phase commit about 5 years ago. Nothing has changed so far.

From a technical perspective, the approach shown in the article is interesting. It is really nice that you can have a “transaction” that spans multiple services and databases. It is a problem that this isn’t going to result in atomic behavior (you can observe some of the transactions being committed before others), it is a problem that this has really bad failure modes (hanging / timeouts / inconsistencies) under fairly common scenarios, and finally, it is a really bad approach because your microservices shouldn’t be composed using transactions.

Leaving aside all the technical details about why two-phase commit is a bad idea, there is still the core architectural issue: you are tying together the services in your system. If service A is stalled for whatever reason, your service B is now impacted because it is waiting for a transaction to close.

Have fun trying to debug something like that, especially because your actual state is hidden away in some transaction manager and not readily visible. It means adding a tricky layer of complexity that will break, and will cause issues, and will create silent dependencies between your services. Silent ones, invisible ones, and they will come to haunt you.

The whole point of a microservice architecture is separation of concerns into independently managed, deployed and provisioned systems. If you actually need cross-service transactions, you have either modelled things wrong or are doing something very wrong. Go back to a monolith with a single database backend and use that as the transactional store. You’ll be much happier.

Remember: Microservices. Are. Separated.

That isn’t a bug, that isn’t a hurdle to overcome. That is the point. Tying them closely together is a mistake, but you’ll usually only see it after a few months in production. So take a measure of prevention before you need a metric ton of cure.

time to read 4 min | 722 words

This post was triggered by this post. Mostly because I got people looking strangely at me when I shouted DO NOT DO THAT when I read the post.

We’ll use the usual Users and Orders example, because that is simple to work with. We have the usual concerns about users in our application:

  • Authentication
    • Password reset
    • Two factor auth
    • Unusual activity detection
    • Etc, etc, etc.
  • Authorization
    • Can the user perform this particular operation?
    • Can the user perform this action on this item?
    • Can the user perform this action on this item on behalf of this user?

Authentication itself is a fairly well understood process. Don’t build it yourself; go and use a built-in solution. Authentication is complex, but the good side of it is that there is rarely any business specific logic around it. You need to authenticate a user, and that is such a common concern that you can take an off-the-shelf solution and go with that.

Authorization is a lot more interesting. Note that we have three separate ways to ask the same question. It might be better to give concrete examples about what I mean for each one of them.

Can the user create a new order? Can they check the recent product updates, etc? Note that in this case, we aren’t operating on a particular entity, but performing global actions.

Can the user view this order? Can they change the shipping address?  Note that in this case, we have both authorization rules (you should be able to view your own orders) and business rules (you can change the shipping address on your order if the order didn’t ship and the shipping cost is the same).

Can the helpdesk guy check the status of an order for a particular customer? In this case, we have a user that is explicitly doing an action on behalf of another user. We might allow it (or not), but we almost always want to make a special note of this.

The interesting thing about this kind of system is that there are very different semantics for each of those operations. One of the primary goals of a microservice architecture is the separation of concerns; I don’t want to keep pinging the authorization service on each operation. That is important, and not just for the architectural purity of the system: one of the most common causes of performance issues in systems is the cost of authorization checks. If you make that go over the network, that is going to kill your system.

Therefore, we need to consider how to enable proper encapsulation of concerns. An easy way to do that is to have the client hold that state. In other words, as part of the authentication process, the client is going to get a token, which it can use for the next calls. That token contains the list of allowed operations / enough state to compute the authorization status for the actual operations. Naturally, that state is not something that the client can modify, and it is protected with cryptography. A good example of that would be JWT. The authorization service generates a token with a key that is trusted by the other services. You can verify most authorization actions without leaving your service boundary.
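Here is a minimal sketch of that verification in .NET, using the System.IdentityModel.Tokens.Jwt package (the issuer name, claim type and claim values are assumptions for illustration). The service only needs the authorization service’s public key; no network call is involved for routine checks:

```csharp
using System.IdentityModel.Tokens.Jwt;
using System.Security.Claims;
using System.Security.Cryptography.X509Certificates;
using Microsoft.IdentityModel.Tokens;

static class TokenChecker
{
    // authCert holds only the *public* key of the authorization service.
    public static ClaimsPrincipal Validate(string token, X509Certificate2 authCert)
    {
        var parameters = new TokenValidationParameters
        {
            ValidIssuer = "https://auth.example.local", // hypothetical issuer
            ValidateAudience = false,
            IssuerSigningKey = new X509SecurityKey(authCert)
        };
        return new JwtSecurityTokenHandler()
            .ValidateToken(token, parameters, out SecurityToken _);
    }

    // A global operation: no entity involved, the claim alone is enough.
    public static bool CanCreateOrders(ClaimsPrincipal principal) =>
        principal.HasClaim("operations", "orders/create"); // hypothetical claim type/value
}
```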

This is easy for operations such as creating a new order, but how do you handle authorization on a specific entity? You aren’t going to be able to encode all the allowed entities in the token, at least not in most reasonable systems. Instead, you combine the allowed operations and the semantics of the operation itself. In other words, when loading an order, you check whether the user has the “orders/view/self” operation and that the order belongs to that same user id.
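Continuing the sketch, the entity-level check might look like this (the Order shape and claim names are again assumptions): the claim grants the operation, the entity itself supplies the ownership test, and nothing leaves the service boundary:

```csharp
using System.Security.Claims;

public class Order
{
    public string Id { get; set; }
    public string UserId { get; set; }
}

public static class OrderAccess
{
    public static bool CanView(ClaimsPrincipal principal, Order order)
    {
        // "orders/view/self" says: you may view orders, but only your own.
        bool hasOperation = principal.HasClaim("operations", "orders/view/self");
        bool isOwner = order.UserId == principal.FindFirst(ClaimTypes.NameIdentifier)?.Value;
        return hasOperation && isOwner;
    }
}
```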

A more complex process is required when you have operations on behalf of someone else. You don’t want the helpdesk people to start sniffing into what <Insert Famous Person Name Here> ordered last night, for example. Instead of complicating the entire system with “on behalf of” operations, a much better approach is to go back to the authorization service. You can ask that service to generate a special “on behalf of” token, with the user id of the required user. This creates an audit trail of such actions and allows the authorization service to decide if a particular user should have the authority to act on behalf of another user.

time to read 4 min | 797 words

This post is in reply to this one: Is a Shared Database in Microservices Actually an Anti-pattern?

The author does a great job outlining the actual problem. Given two services that need to share some data, how do you actually manage that in a microservice architecture? The author uses the Users and Orders example, which is great, because it is pretty simple and requires very little domain knowledge.

The first question to ask is: Why?

Why use microservices? Wikipedia says:

The benefit of decomposing an application into different smaller services is that it improves modularity. This makes the application easier to understand, develop, test, and become more resilient to architecture erosion.

I always like an application that is easier to understand, develop and test. Being resilient to architecture erosion is a nice bonus.

The problem is that this kind of architecture isn’t free. A system that is composed of microservices is a system that needs to communicate between these services, and that is usually where most of the complexity in such a system resides.

In the Orders service, if we need to access some details about the User, how do we do that?

We can directly call the Users service, but that creates a strong dependency between the services. If Users is down, then Orders is down. That sort of defeats the purpose of the architecture. It also means that we don’t actually have separate services; we have just exchanged the call assembly instruction for RPC and distributed debugging. All the costs, none of the benefits.

The post above rightly calls this problematic, and asks whether async integration between the services would work, using streams. I’m not quite sure what was meant there. My usual method of integrating different microservices is to not do that directly. Either we need to send a command to a different service (which is async) or we need to publish some data from a service (also async). Both of these options assume an unreliable channel and need to be resistant to failure. In other words, if I send a command to another service and I need to handle failure, I set up a timer so that I know to handle the case where I am never called back.
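Here is a minimal, in-memory sketch of that “command plus a timer” shape (hypothetical types; a real system would persist the pending record instead of keeping it in a dictionary):

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

// Send an async command to another service and schedule a local follow-up for
// the case where we are never called back.
class PendingCommands
{
    private readonly ConcurrentDictionary<Guid, bool> _confirmed = new();

    public async Task SendWithFollowUp(
        Guid commandId, Func<Task> send, TimeSpan deadline, Func<Task> onNoReply)
    {
        _confirmed[commandId] = false;
        await send();                                 // queue, HTTP, whatever transport you use
        _ = FollowUp(commandId, deadline, onNoReply); // fire-and-forget watchdog
    }

    public void MarkConfirmed(Guid commandId) => _confirmed[commandId] = true;

    private async Task FollowUp(Guid commandId, TimeSpan deadline, Func<Task> onNoReply)
    {
        await Task.Delay(deadline);
        if (_confirmed.TryGetValue(commandId, out var done) && !done)
            await onNoReply();                        // compensate, alert, retry - your call
    }
}
```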

Even if you just need some data published from another service, and can use a feature such as RavenDB ETL to share that data, you still need to take into account issues such as network failures causing you to have a laggy view of the data.

This is not an accident.

That is not your data; you have a (potentially stale) copy of data published by another service. You can use that for reference, but you cannot count on it. If you need to rely on that data, you need to send a command to the owning service, which can then make the actual decision.

In short, this is not a trivial matter. Even if the actual implementation can be done pretty easily.

The fact that each service owns a particular portion of the system is a core principle of the microservice architecture.

Having a shared database is like having a back stage pass. It’s great, in theory, but it is also open for abuse. And it will be abused. I can guarantee that with 100% confidence.

If you blur the lines between services, they are no longer independent. Have fun trying to debug why your Users’ login time spiked (Orders is running the monthly report). Enjoy breaking the payment processing system (you added a new type of user that the Orders system can’t process). And these are the good parts. I haven’t started to talk about what happens when the Orders service actually attempts to write to the Users’ tables.

The article suggests using DB ACLs to control that, but you already have something better: a different database, because it is a different service.

It might be better to think about the situation like a joint bank account. It’s reasonable to have a joint bank account with your spouse. It is not so reasonable to have a joint bank account with Mary from accounting, because that makes it easier to direct deposit your payroll. There is separation there for a reason, and yes, that does make things harder.

That’s the point, it is not an accident.

The whole point is that integration between services is going to be hard, so you’ll have less of it, and you’ll have it along very well defined boundaries. That means that we can have proper boundaries and contracts between different areas, which leads to better modularity, thus allowing easier development, deployment and management.

If that isn’t something you want, that is fine, just don’t go into the microservice architecture. Because a monolith architecture is just fine, but a Frankenstein creation of a microservice architecture with shared database is not. Just ask Mary from accounting…

time to read 1 min | 148 words

RavenDB 4.x is using X509 certificates for authentication. We got a question from a customer about that; they would much rather use API keys instead.

We actually considered this as part of the design process for 4.x and we concluded that we can make this work in just the same manner as API Keys. Here is how you can make it work.

You have the certificate file (usually PFX) and convert that to a Base64 string, like so:


[System.Convert]::ToBase64String( (gc "cert.pfx" -Encoding byte ) )

You can take the resulting string and store it like an API key, because that is effectively how it is treated. In your application startup, you can use:
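A minimal sketch of that startup code, assuming the RavenDB C# client (the URL, database name and environment variable are placeholders):

```csharp
using System;
using System.Security.Cryptography.X509Certificates;
using Raven.Client.Documents;

// The Base64 string comes from configuration or a secret store - exactly where
// you would have put an API key.
var certAsBase64 = Environment.GetEnvironmentVariable("RAVEN_CLIENT_CERT");

var store = new DocumentStore
{
    Urls = new[] { "https://your-cluster.example" }, // placeholder URL
    Database = "Orders",                             // placeholder database
    Certificate = new X509Certificate2(Convert.FromBase64String(certAsBase64))
};
store.Initialize();
```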

And this is it. For all intents and purposes, you can now use the certificate as an API key.

time to read 4 min | 671 words

A decade(!) ago I wrote that you should avoid soft deletes. Today I ran into a question on the mailing list and I remembered writing about this; it turned out that there was quite a discussion on this at the time.

The context of the discussion at the time was deleting data from relational systems, but the same principles apply. The question I just fielded asked how you can translate a Delete() operation inside the RavenDB client to a soft delete (IsDeleted = true) operation. The RavenDB client API supports a few ways to interact with how we are talking to the underlying database, including some pretty interesting hooks deep into the pipeline.

What it doesn’t offer, though, is a way to turn a Delete() operation into an update (or an update into a delete). We do have facilities in place that allow you to detect (and abort) invalid operations. For example, invoices should never be deleted. You can tell the RavenDB client API that it should throw whenever an invoice is about to be deleted, but you have no way of saying that we should take the Delete(invoice) and turn that into a soft delete operation.

This is quite intentional, by design.

Having a way to transform basic operations (like delete –> update) is a good way to be pretty confused about what is actually going on in the system. It is better to allow the user to enforce the required behavior (invoices cannot be deleted) and let the calling code handle this differently.
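One way to wire up that kind of enforcement is through the client’s OnBeforeDelete event; a sketch (the Invoice class is illustrative):

```csharp
using System;
using Raven.Client.Documents;

public class Invoice
{
    public string Id { get; set; }
}

public static class StoreSetup
{
    public static IDocumentStore Configure(IDocumentStore store)
    {
        // Abort the delete outright - there is deliberately no hook that would
        // silently turn it into an update.
        store.OnBeforeDelete += (sender, args) =>
        {
            if (args.Entity is Invoice)
                throw new InvalidOperationException(
                    $"Invoices are never deleted, they are cancelled: {args.DocumentId}");
        };
        return store;
    }
}
```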

The natural response here, of course, is that this places a burden on the calling code. Surely we want to follow DRY and not write conditionals when the user clicks the delete button. But this isn’t a case of extra duplicated code.

  • An invoice is never deleted, it is cancelled. There are tax implications on that, you need to get it correct.
  • A payment is never removed, it is refunded.

You absolutely want to block deletions of those types of documents, and you need to treat them (very) differently in code.

In the ensuing decade since the blog posts at the top of this post were written, there have been a number of changes. Some of them are architecturally minor, such as the database technology of choice or the guiding principles for maintainable software development. Some of them are pretty significant.

One such change is the GDPR.

“Huh?!” I can imagine you thinking. How does the GDPR apply to an architectural discussion of soft deletes vs. business operations? It turns out that it is very relevant. One of the things that the GDPR mandates (and there are similar laws elsewhere, such as the CCPA) is the right to be forgotten. So if you are using soft deletes, you might actually run into real problems down the line. “I asked to be deleted, they told me they did, but they secretly kept my data!” The one thing that I keep hearing about the GDPR is that no one ever found it humorous. Not with the kind of penalties that are attached to it.

So when thinking about deletes in your system, you need to consider quite a few factors:

  • Does it make sense, from a business perspective, to actually lose that data? Deleting a note from a customer’s record is probably just fine. Removing the customer’s record entirely? Probably not.
  • Do I need to keep this data? Invoices are one thing that pops to mind.
  • Do I need to forget this data? That is the other direction, and what you can forget, and how, can be really complex.

At any rate, for all but the simplest scenarios, just marking IsDeleted = true is likely not going to be sufficient. And all the other arguments that have been raised (which I’m not going to repeat; read the posts, they are good ones) are still in effect.
