Ayende @ Rahien


Inventory management in MongoDB: A design philosophy I find baffling

time to read 5 min | 961 words

I’m reading MongoDB in Action right now. It is an interesting book and I wanted to learn more about the approach to using MongoDB, rather than just be familiar with the feature set and what it can do. But this post isn’t about the book, it is about something that I read, and as I was reading it I couldn’t help but put down the book and actually think it through.

More specifically, I’m talking about this little guy. This is a small Ruby class that was presented in the book as part of an inventory management system. In particular, this piece of code is supposed to allow you to sell limited inventory items and ensure that you won’t sell stuff that you don’t have. The example is that if you have 10 rakes in the store, you can only sell 10 rakes. The approach that is taken is quite nice: simulate the notion of having a document for each of the rakes in the store and allow users to place them in their cart. In this manner, you prevent the possibility of selling more than you actually have.

What I take strong issue with is the way this is implemented. MongoDB doesn’t have multi document transactions, but the solution presented requires them. Therefore, the approach outlined in the book is to try to build transactional semantics from the client side. I write databases for a living, and I find that concept utterly baffling. Clients shouldn’t try to do stuff like that; not only will they most likely get it wrong, they’ll also do it extremely inefficiently.

Let us consider the following tidbit of code:
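Roughly, the pattern looks something like this (a sketch using the MongoDB .NET driver, not the book’s Ruby code; the InventoryFetcher / AddToCart names and the sku / state / cartId fields are illustrative):

    using System;
    using System.Collections.Generic;
    using MongoDB.Bson;
    using MongoDB.Driver;

    public class InventoryFetcher
    {
        private readonly IMongoCollection<BsonDocument> _inventory;

        public InventoryFetcher(IMongoDatabase db)
        {
            _inventory = db.GetCollection<BsonDocument>("inventory");
        }

        public void AddToCart(string cartId, params string[] skus)
        {
            var taken = new List<string>();
            try
            {
                foreach (var sku in skus)
                {
                    // One roundtrip per item: atomically flip a single AVAILABLE
                    // document for this SKU into the IN_CART state.
                    var result = _inventory.UpdateOne(
                        Builders<BsonDocument>.Filter.Eq("sku", sku) &
                        Builders<BsonDocument>.Filter.Eq("state", "AVAILABLE"),
                        Builders<BsonDocument>.Update
                            .Set("state", "IN_CART")
                            .Set("cartId", cartId));

                    if (result.ModifiedCount == 0)
                        throw new InvalidOperationException("Not enough inventory for " + sku);

                    taken.Add(sku);
                }
            }
            catch
            {
                // "Rollback": yet more roundtrips to put back whatever we already took.
                foreach (var sku in taken)
                {
                    _inventory.UpdateOne(
                        Builders<BsonDocument>.Filter.Eq("sku", sku) &
                        Builders<BsonDocument>.Filter.Eq("cartId", cartId) &
                        Builders<BsonDocument>.Filter.Eq("state", "IN_CART"),
                        Builders<BsonDocument>.Update
                            .Set("state", "AVAILABLE")
                            .Unset("cartId"));
                }
                throw;
            }
        }
    }

Note that every UpdateOne call here is a separate roundtrip to the server, and the catch block is more roundtrips still.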

The idea here is that the fetcher is supposed to be able to atomically add the products to the order. If there aren’t enough available products to be added, the entire thing is supposed to be rolled back. As a business operation, this makes a lot of sense. The actual implementation, however, made me wince.

What it does, if it was SQL, is the following:

I intentionally used SQL here, both to simplify the issue for people who aren’t familiar with MongoDB and to explain the major dissonance that I have with this approach. That little add_to_cart call that we had earlier resulted in no less than eight network roundtrips. That is in the happy case. There is also the failure mode to consider, which involves resetting all the work done so far.

The thing that really bothers me is that I can’t believe that this is something that you’ll actually want to do except as an intellectual exercise. I mean, sure, how we can pretend to get transactions from a non transactional store is interesting, but given the costs of doing this, or the possibility of failure, or the fact that this is a non atomic state transition, or… you get my point, right?

In the case of this code, the whole process is non atomic. That means that outside observers can see the changes as they are happening. It also opens you up to a lot of Bad Stuff in terms of abusing the system. A malicious user can exploit the fact that this “transaction” is going to be running back and forth to the database (and thus taking a lot of time) and just open another tab to initiate an action while this is going on, resulting in operations on invalid state. In the example that the book gives, we can use that to force purchases of invalid items.

If you think that this is unrealistic, consider this page, which talks about doing things like making money appear from thin air using just this sort of approach.

Another thing that really bugged me about this code is that it has “error handling”. I put that in quotes because it is like a security blanket for a 2 year old. Having it there might calm things down, but it doesn’t actually change anything. In particular, this kind of error handling looks right, but it is horribly broken if you consider what kind of actual errors can happen here. If the process running this code fails for any reason, the “transaction” is going to stay in an invalid state. It is possible that one of your rakes will just disappear into thin air as a result. It is supposed to be in someone’s cart, but it isn’t. The same can happen if the server has an issue midway, or there is just a regular network hiccup.

I’m aware that this is code that was written explicitly to be readable and easy to explain, rather than to withstand the vagaries of production, but still, this is a very dangerous thing to do.

As an aside, not quite related to the topic of this post, but one thing that really bugged me in the book so far is the number of remote requests that are commonly required to do things. Is there an assumption that the database in question is nearby or very cheap to access? The entire design philosophy I use is to assume that going over the network is expensive, so let us give the users a lot of ways to reduce that cost. In contrast, at least in the book, there is a lot of stuff that is just making remote calls like there is a fire sale that will close in 5 minutes.

To be fair to the book, it notes that there is a possibility of failure here, explains how to handle one part of it (it misses the error conditions in the error handling) and calls this out explicitly as something that should be done with consideration.

Migration strategies considerations for Dev –> UAT –> Production

time to read 5 min | 866 words

Part of the reason for RavenDB was that I wanted a database that actually took how it is being used into account, and provided good support for common usage scenarios. Making the process of moving between Dev –> UAT –> Production easier is a big part of that.

Typically, databases don’t handle that directly, but let the users figure it out. You can see that in the plethora of SQL schema deployment and versioning options that you have to deal with.

With RavenDB, for indexes in particular, we made the process very easy. Indexes are defined in code, deployed alongside the application and are versioned in the exact same manner, in the exact same system.
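As a minimal sketch of what that looks like (the Order class and index name are illustrative, and the commented-out call assumes a store variable holding your IDocumentStore):

    using System.Linq;
    using Raven.Client.Documents.Indexes;

    public class Order
    {
        public string Id { get; set; }
        public string Company { get; set; }
    }

    public class Orders_ByCompany : AbstractIndexCreationTask<Order>
    {
        public Orders_ByCompany()
        {
            Map = orders => from order in orders
                            select new { order.Company };
        }
    }

    // At application startup, deploy every index defined in the assembly:
    // IndexCreation.CreateIndexes(typeof(Orders_ByCompany).Assembly, store);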

But as the feature set of RavenDB grows, we need to consider the deployment scenario in additional places. We recently started talking about the development cycle of ETL processes, Subscriptions, Backups and external replication. The last two are fairly rare in development / UAT scenarios, so we’ll ignore them for now. They are typically only ever set up & used in production. Sometimes you test them on a dedicated instance, but it doesn’t make sense to deploy a backup configuration in most cases. External replication is basically just a destination + credentials, so there isn’t really all that much to track or deploy.

ETL Processes and Subscriptions, on the other hand, can contain quite a bit of logic in them. An ETL process that feeds into a reporting database might be composed of several distinct pieces, each of them feeding some part of the data to the reporting db. If the reporting needs change, we’ll likely need to update the ETL process as well, which means that we need to consider exactly how we’ll do that. Ideally, we want a developer to be able to start working on the ETL process on their own machine, completely isolated. Once they are done working, they can check in their work into the code repository and move on to other tasks. At some future time, this code will get deployed, which will set up the right ETL process in production.

That is a really nice story, and it is how we are dealing with indexes, but it doesn’t actually work for ETL processes. The problem is that ETL is typically not the purview of the application developer; it is in the hands of the operations team, or maybe it is owned by the team that owns the reports. Furthermore, changes to the ETL process are pretty common and typically happen outside the release cycle of the application itself. That means that we can’t tie this behavior to the code. Unlike indexes, which have a pretty tight integration with the code that is using them, ETL is a background kind of operation, with little direct impact on the application, so it can’t be tied to the application code the way indexes are. Even with indexes, we have measures in place that can lock the index definition, so an administrator can update the index definition on the fly without the application overwriting it with the old version of the index.

Subscriptions are more of a middle ground. A subscription is composed of a client side application that processes the data and some server side logic related to filtering and shaping it. On the one hand, it makes a lot of sense for the subscribing application to control its subscription, but an admin that wants to update the subscription definition is a very likely scenario, maybe as a result of a data change, or input from the business. We can update the server side code without re-deployment, and that is usually a good idea.

To make matters a bit more complex, we also have to consider secrets management. ETL processes, in particular, can contain sensitive information (connection strings). So we need to figure out a way to have the connection string, but not have the connection string :-). In other words, if I write a new ETL process and deploy it to production, I need to be sure that I don’t need to remember to update the connection string from my local machine to the production database. Or, much worse, if I’m taking the ETL from production, I don’t want to accidentally also get the production connection string. That means that we need to use named connection strings, and rely on the developer / admin to set them up properly across environments.

I would really appreciate any feedback you have about how to handle this scenario.

Both ETL processes and Subscriptions are just JSON documents of not too much complexity, so actually moving them around between servers isn’t hard, it is the process of doing so that we are trying to flesh out. I should also mention that we are probably just going to make sure that there is a process to handle that, not something that is mandated, because some companies have very different deployment models that we need to fit into. This is us trying to think about the best way to handle the most common scenario with as little friction as possible.

Artificial documents in RavenDB 4.0

time to read 2 min | 228 words

Artificial documents are a really interesting feature. They allow you to define an index, and specify that the result of the index will be… documents as well.

Let us consider the following index, running on the Northwind dataset.

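As a sketch (not necessarily the exact index from the post), such a map/reduce index could look like this in C#, assuming the Northwind Order documents; the class and property names are illustrative:

    using System.Linq;
    using Raven.Client.Documents.Indexes;

    public class Orders_CountByCompany : AbstractIndexCreationTask<Order, Orders_CountByCompany.Result>
    {
        public class Result
        {
            public string Company { get; set; }
            public int Count { get; set; }
        }

        public Orders_CountByCompany()
        {
            // Map: one entry per order
            Map = orders => from order in orders
                            select new Result { Company = order.Company, Count = 1 };

            // Reduce: aggregate per company
            Reduce = results => from result in results
                                group result by result.Company into g
                                select new Result { Company = g.Key, Count = g.Sum(x => x.Count) };
        }
    }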

We can ask RavenDB to output the result of this index to a collection, in addition to the normal indexing. This is done in the following manner:

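In a C# index definition like the sketch above, this amounts to setting the OutputReduceToCollection property in the constructor (the collection name here is just an example):

    // Added inside the constructor of the index sketched above:
    OutputReduceToCollection = "CompanyOrderCounts";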

The result is a set of ordinary documents, one per reduce result, sitting in that collection alongside the rest of your data.

The question here is, what is the point? Don’t we already have the exact same data indexed and available as the result of the map/reduce index? Why store it twice?

The answer is quite simple: with the output of the index going into documents, we can now define additional indexes on top of them, which gives us the option to very easily create recursive map/reduce operations. So you can do daily/monthly/yearly summaries very cheaply. We can also apply all the usual operations on documents (subscriptions and ETL processes come to mind immediately). That gives you a lot of power, without incurring a high complexity overhead.

The pain of HTTPS

time to read 4 min | 652 words

A few weeks ago we started looking into what it would take to run RavenDB 4.0 over HTTPS.

Oh, not the actual mechanics of that, we had that covered a long time ago, and pretty much everything worked as expected. No, the problem that we set out to solve was whether we could get RavenDB to Just Work over HTTPS without requiring the admin to jump through hoops. Basically, what I really wanted was a way to just spin up the server and have it running on HTTPS by default.

That turned out to be a lot harder than I wished it would be.

HTTPS has two very distinct goals:

  • To encrypt communication between two parties.
  • To ensure that the site you visited is actually the site you thought you visited.

The first portion can be handled by generating the certificate yourself, and the communication between client & server would be encrypted. So far so good, but the second portion is probably more important. If my communication with ThisIsNotPayPal.com is encrypted, that doesn’t really help me all that much, I’m afraid.

Verifying who you are is a very important component of HTTPS, and that is something that we can’t just ignore. Well, technically speaking I guess that RavenDB could have installed a root CA into the system during installation, but the mere thought of doing that gives me pause, so I really don’t want to try and do that.

And without doing that, we can’t really support HTTPS. Remember that things like Let’s Encrypt won’t work here. RavenDB is often deployed on closed networks, without a publicly visible domain to run on. My RavenDB is running on oren-pc.hrhinos.local, for example, and I think you’ll find that it is a bit hard to get a Let’s Encrypt certificate for that.

So we can’t just magically get a certificate and have it work.

While I wish there was a way to just have encryption over the wire, without validation of identity, that would be pretty pointless with such things as man in the middle attacks.

So what do we do in RavenDB 4.0 with regards to HTTPS?

We rely on the admin (shocking, I know). They can either generate a self signed certificate and trust it (a matter of a few shell commands on any platform) or use their organization’s certificate (either trusted internally or externally obtained). RavenDB doesn’t care about that, but if you provide a certificate, it will ensure that all communication is SSL encrypted.

The client API exposes a method that lets you control certificate validation, which makes it easier if you need to customize the authentication policy. On the server side, however, we take things differently. Instead of letting the user configure trust policies in certificates, we decided to ignore the issue completely. Or, to be rather more exact, to specify that RavenDB is going to lean on the operating system for such decisions. A simple scenario is an administrator that defines a cluster of servers and generates a self signed certificate (or certificates) for them to use. The administrator needs to make sure that the certificates in question are trusted by all nodes in the cluster. RavenDB will refuse to connect over HTTPS to an untrusted source.
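To illustrate the client side part of that — this is not RavenDB’s specific API, just the general .NET shape of such a hook — a callback that decides whether to trust a given server certificate, here by pinning a made-up thumbprint:

    using System;
    using System.Net.Http;

    var handler = new HttpClientHandler
    {
        // Accept only our own self signed certificate, identified by thumbprint
        // (the thumbprint value here is obviously made up).
        ServerCertificateCustomValidationCallback = (request, cert, chain, errors) =>
            cert != null &&
            string.Equals(cert.Thumbprint, "A1B2C3D4E5F6...", StringComparison.OrdinalIgnoreCase)
    };

    using var client = new HttpClient(handler);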

Yes, I’m aware of all the horrible things that this can do (certificate expiration kills the system, for example), but we couldn’t think of any way where not doing this wouldn’t result in even worse situations.

RavenDB has support for encrypted databases, but we don’t allow them to be accessed from a non secured connection, or to connect to non secure destinations. So the data is encrypted at rest and over the wire, and the admin is responsible for making sure that the certs are up to date and valid (or at least trusted by the machines in question).

A thread per task keeps the headache away

time to read 3 min | 449 words

One of the very conscious choices we made in RavenDB 4.0 is to use threads. Not just to run code that is multi threaded, but to actively use the notion of a dedicated thread per task.

In fact, most of the work in RavenDB (beyond servicing requests) is done by a dedicated thread. Each index, for example, has its own thread; transactions are merged and pushed using (mostly) a single thread. Replication uses dedicated threads, and so does the cluster communication mechanism.

That runs contrary to advice that I have been given many times, that threads are an expensive resource and that we should not hold up a thread if we can use async operations to give it up as soon as possible. And to a certain extent, I still very much believe it. Certainly all our request processing is done in this manner. But as we got faster and faster we observed some really hard issues with using thread pools, since you can’t rely on them executing a particular request in a given time frame, and having a mixed bag of operations in a thread pool is a recipe for slowing the whole thing down.

Dedicated threads give us a lot of control and simplicity, from the ability to control the priority of certain tasks to the ability to debug and blame them properly.
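A minimal sketch of the idea (this is not RavenDB’s actual code; the loop body and thread name are illustrative):

    using System;
    using System.Threading;

    // Each long running task gets its own named thread, so it is easy to spot
    // in the debugger, can be given its own priority, and can be blamed for
    // the CPU time it uses.
    var indexingThread = new Thread(() =>
    {
        while (true)
        {
            // ... index the next batch of documents ...
            Thread.Sleep(100); // placeholder for real work
        }
    })
    {
        Name = "Index/Orders/ByCompany",
        IsBackground = true,
        Priority = ThreadPriority.BelowNormal // indexing yields to request threads
    };
    indexingThread.Start();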

For example, the threads on a running RavenDB server show up in the debugger with descriptive names, so you can tell at a glance which index or task each of them belongs to.

And one of our standard endpoints can give you thread timings for all the various tasks we do, so if there is something expensive going on, you can tell, and you can act accordingly.

All of this has also led to an architecture that is mostly about independent threads doing their jobs in isolation, which means that a lot of the backend code in RavenDB doesn’t have to worry about threading concerns. And debugging it is as simple as stepping through it.

That isn’t true for request processing threads, which are heavily async, but those are all doing mostly reads. Even when you are actually sending a write to RavenDB, what actually happens is that the request thread parses the request and then queues it up for the transaction merging thread to actually process. That means that there is usually little (if any) contention on locks throughout the system.

It is a far simpler model, and it has proven itself capable of making a very complex system understandable and maintainable for us.

Throw away the tests, I’m debugging this in production mode

time to read 2 min | 218 words

Today I had a really strange revelation. We had an issue that related to race conditions in a distributed group, and we could just not figure out what was going on from the tests.

Then we switched to production mode, where each node was a separate process, and we debugged one of them. And it was easy, and it was obvious, and everything just worked. That was a profound revelation.

What happened was that we built our system so we can service it in production; that includes the internal design and the instrumentation data that we expose. When we were trying to figure things out from the tests, we were running everything in a single process, and the act of trying to debug a single thread would cause all other threads to stop. That would trigger cascading behaviors (since timeouts would fire, and the cluster would move into recovery mode).

With us debugging just a single node in the cluster, the rest of the cluster just thought it was slow. That didn’t have any impact, but allowed us to observe the behavior of the system very easily. The solution was quite obvious once we got to that stage.

RavenDB 4.0: Managing encrypted databases

time to read 6 min | 1012 words

On the right you can see what the new database creation dialog looks like when you want to create a new encrypted database. I talked about the actual implementation of full database encryption before, but today’s post has a different focus.

I want to talk about managing encrypted databases. As an admin tasked with working with encrypted data, I need to not only manage the data in the database itself, but also handle a lot more failure points when using encryption. The most obvious of them is that if you have an encrypted database in the first place, then the data you are protecting is very likely to be sensitive in nature.

That raises the immediate question of who can see that information. For that matter, are you allowed to see that information? RavenDB 4.0 has support for time limited credentials, so you register to get credentials in the system, and using whatever workflow you have, the system generates a temporary API key for you that will be valid for a short time and then expire.

What about all the other things that an admin needs to do? The most obvious example is how do you handle backups, either routine or emergency ones. It is pretty obvious that if the database is encrypted, we also want the backups to be encrypted, but are they going to use the same key? How do you restore? What about moving the database from one machine to the other?

In the end, it all hangs on the notion of keys. When you create a new encrypted database, we’ll generate a key for you, and require that you confirm for us that you have persisted that information in some manner. You can print it, download it, etc. And you can see the key right there in plain text during the db creation. However, this is the last time that the database key will ever reside in plain text.

So what about this QR code, what is it doing there? Put simply, it is there to capture attention. It replicates the same information that you have in the key itself, obviously. But what for?

The idea is that users are often hurrying through the process (the “Yes, dear!” mode) and we want to encourage them to stop for a second and think. The QR code also makes it much more likely that the admin will print and save the key in an offline manner, which is likely to be safer than most methods.

So this is how we encourage administrators to safely remember the encryption key. This is useful because it gives the admin the ability to take a snapshot on one machine and then recover it on another, where the encryption key is not available, or just move the hard disk between machines if the old one failed. It is actually quite common in cloud scenarios to have a machine that has an attached cloud storage, and if the machine fails, you just spin up a new machine and attach the storage to the new one.

We keep the encryption keys secret by utilizing system specific keys (either DPAPI or a decryption key that only the specific user can access), so moving machines like that will require the admin to provide the encryption key so we can continue working.

The issue of backups is different. It is very common to have to store long term backups, and to need to restore them in a separate location / situation. At that point, we need the backup to be encrypted, but we don’t want it to use the same encryption key as the database itself. This is mostly about managing keys. If I’m managing multiple databases, I don’t want to record the encryption key for each as part of the restore process. That opens us up to a missing key and a useless backup that we can do nothing about.

Instead, when you set up backups (for encrypted databases it is mandatory, for regular databases, optional) we’ll give you the option to provide a public key that we’ll then use to encrypt the backup. That means that you can more safely store it in cloud scenarios, and regardless of how many databases you have, as long as you have the private key, you’ll be able to restore the backup.
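The general shape of that scheme — a sketch, not necessarily how RavenDB implements it — is to generate a random symmetric key per backup, encrypt the backup with it, and wrap that key with the admin supplied public key, so only the holder of the private key can ever restore:

    using System.IO;
    using System.Security.Cryptography;

    static void EncryptBackup(string backupPath, RSA adminPublicKey)
    {
        using var aes = Aes.Create(); // random key + IV for this backup only
        byte[] wrappedKey = adminPublicKey.Encrypt(aes.Key, RSAEncryptionPadding.OaepSHA256);

        using var output = File.Create(backupPath + ".enc");
        // Store what the restore side needs: wrapped key, IV, then the ciphertext.
        output.Write(wrappedKey, 0, wrappedKey.Length);
        output.Write(aes.IV, 0, aes.IV.Length);

        using var encryptor = aes.CreateEncryptor();
        using var crypto = new CryptoStream(output, encryptor, CryptoStreamMode.Write);
        using var input = File.OpenRead(backupPath);
        input.CopyTo(crypto);
    }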

Finally, we have one last topic to cover with regards to encryption: the overall system configuration. Each database can be encrypted, sure, but the management of the database (such as connection strings that it replicates to, API keys that it uses to store backups and a lot of other potentially sensitive information) is still stored in plain text. For that matter, even if the database shouldn’t be encrypted, you might still want to encrypt the full system configuration. That leads to somewhat of a chicken and egg problem.

On the one hand, we can’t encrypt the server configuration from the get go, because then the admin will not know what the key is, and they might need it if they have to move machines, etc. But once we have started, we are already using the server configuration, so we can’t just encrypt that on the fly. What we ended up using is a set of command line parameters, so if the admin wants to run an encrypted server configuration, they can stop the server, run a command to encrypt the server configuration and set up the appropriate encryption key retrieval process (DPAPI, for example, under the right user).

That gives us the chance to make the user aware of the key and allows them to save it in a secure location. All members in a cluster with an encrypted server configuration must also have an encrypted server configuration, which prevents accidental leaks.

I think that this is about it with regards to the operational details of managing encryption :-). Pretty sure that I missed something, but this post is getting long as it is.

RavenDB 4.0: Attachments

time to read 2 min | 251 words


I previously wrote about the new attachments feature in RavenDB 4.0. Now it is ready to be seen by outside eyes.

As you can see in the image on the right, documents can now have attached attachments (I’m sorry, I couldn’t think of a better way to phrase this). This gives you the ability to store binary data inside RavenDB, but not as some free floating value that has only a very loose connection to the rest of the system. Those attachments are strongly tied to their parent document, and allow you to store related information easily and right next to the document.

That also means that you can take advantage of RavenDB’s ACID nature and actually make modifications to both attachments and documents at the same time. For example, consider the following code:
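Roughly, such code could look like this with the RavenDB 4.0 session API (a sketch: the User class, the HasProfilePicture field and the GenerateThumbnail helper are illustrative, and store is assumed to be your IDocumentStore):

    using (var session = store.OpenSession())
    using (var picture = File.OpenRead("profile.png"))
    using (var thumbnail = GenerateThumbnail("profile.png")) // hypothetical helper returning a Stream
    {
        var user = session.Load<User>("users/1-A");

        // Both binaries are attached to the same document...
        session.Advanced.Attachments.Store(user, "picture.png", picture, "image/png");
        session.Advanced.Attachments.Store(user, "thumbnail.png", thumbnail, "image/png");

        // ...and the document itself is modified in the same unit of work.
        user.HasProfilePicture = true;

        // The document change and both attachments are saved as one transaction.
        session.SaveChanges();
    }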

Here we get the user’s profile picture, generate a thumbnail from it and then associate both the picture and the thumbnail with the document. We also update the status of the user to indicate that they have a profile picture, and then submit it all as one single transaction. That means that you don’t have to sync between different sources.

Attachments in RavenDB are also kept consistent through replication, so you won’t see partial results between nodes, and the attachments themselves are using de-duplication techniques to reduce the amount of storage that we take.

I’m really happy with this feature.

Storing secrets in Linux

time to read 3 min | 545 words

We need to store an encryption key on Linux & Windows. On Windows, the decision is pretty much trivial: you throw that into DPAPI and rely on the operating system to handle it for you. In particular, it is very easy to analyze key scenarios such as “someone stole the hard disk” and say that either the thief wouldn’t be able to get the plain text key, or we can blame Microsoft for that :-).

On Linux, the situation seems to be much more chaotic. There is libsecret, which seems to be much wider in scope than DPAPI. Whereas DPAPI has 2 methods (protect & unprotect), libsecret has a lot of moving pieces, which is quite scary. That is leaving aside the issue of having no managed implementation and having to dance around Gnome specific data types in the API (you need to pass GCancellable & GError into it), which increases the complexity.
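For comparison, the entire DPAPI surface we actually need looks something like this in .NET (the entropy value is illustrative):

    using System.Security.Cryptography;
    using System.Text;

    byte[] secret = Encoding.UTF8.GetBytes("the database encryption key");
    byte[] entropy = Encoding.UTF8.GetBytes("RavenDB"); // optional extra entropy, illustrative

    // Protect: only this user, on this machine, can get the plain text back.
    byte[] protectedBytes = ProtectedData.Protect(secret, entropy, DataProtectionScope.CurrentUser);

    // Unprotect: the reverse call, same scope.
    byte[] plain = ProtectedData.Unprotect(protectedBytes, entropy, DataProtectionScope.CurrentUser);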

Other options include using some sort of hardware / software security modules (such as HashiCorp Vault), which is great in theory, but requires us to either take a dependency on something that might not be there, or try to support a wide variety of options (Keywhiz, Chef, Puppet, CloudHSM, etc). That isn’t a really good alternative from our point of view.

Looking into how Mono implemented DPAPI on Linux, they did it by writing a master key to an XML file and relying on file system ACLs to prevent anyone from seeing that information. This ends up being this:

chmod(path, (S_IRUSR | S_IWUSR | S_IXUSR) );
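/* S_IRUSR | S_IWUSR | S_IXUSR == 0700: only the file's owner may read, write or execute it */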

Which has the benefit of only allowing that user to access it, but given that I’ve gotten the physical disk, I’m able to easily mount it on a machine that I control as root and access anything that I like. On Windows, by the way, the way this is prevented is that the user must have logged in, and a key that is derived from their password is used to decrypt all protected data as needed, so without the user logging in, you cannot decrypt anything. For that matter, even the administrator on the machine can’t recover the data if they want to, because resetting the user’s password will cause all such information to be lost.

There is the Gnome.Keyring project as well, which hasn’t been updated in 7 years, and obviously doesn’t support kwallet (which libsecret does). OWASP seems to be throwing in the towel there and just recommends relying on the file system ACL.

The Linux Kernel has a Key Retention API, but it seems to be targeted primarily toward giving file systems access to the secrets they need, and it doesn’t look like it will survive reboots (it is primarily a caching mechanism, apparently).

So after all this research, I can say that I don’t like libsecret. It seems too cumbersome and will need users to jump through some hoops in some cases (install additional packages, set up access, etc).

Setting up the permissions via the ACL seems to be the common way to handle this scenario, but it doesn’t sit well with me.

Any recommendations?

Tripping on the TPL

time to read 2 min | 364 words

I thought that I found a bug in the TPL, but it looks like it’s working (more or less) by design. Basically, when a task is completed, all awaiting tasks will be notified of that, which is pretty much what you would expect. What isn’t usually expected is that those tasks can interfere with one another. Consider the following code:
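What follows is a rough repro sketch of the kind of code being described, not the exact code from the post:

    using System;
    using System.Diagnostics;
    using System.Threading;
    using System.Threading.Tasks;

    static async Task BlockingChild(Task parent)
    {
        await parent;
        // Simulate a child that does a lot of synchronous work once the parent
        // completes. Its continuation runs inline on the completing thread.
        Thread.Sleep(10_000);
    }

    static async Task TimingChild(Task parent)
    {
        var sw = Stopwatch.StartNew();
        var timeout = Task.Delay(TimeSpan.FromSeconds(2));
        if (await Task.WhenAny(parent, timeout) == timeout)
            Console.WriteLine($"timeout after {sw.Elapsed}");
        else
            Console.WriteLine($"parent completed after {sw.Elapsed}");
    }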

We have two tasks, which accept a parent task and do something with it. What do you think will happen when we run the following code?
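And, continuing the sketch above, the code that runs them:

    var tcs = new TaskCompletionSource<object>();
    Task parent = tcs.Task;

    var first = BlockingChild(parent);   // registered first, so it is notified first
    var second = TimingChild(parent);

    await Task.Delay(100);               // much shorter than the 2 second timeout
    tcs.SetResult(null);                 // children are notified synchronously, in order

    await Task.WhenAll(first, second);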

Unless you are very quick on the draw, running this code will result in a timeout message, but how? We know that we have a much shorter duration for the task than the timeout, so what is going on?

Well, effectively what is going on is that the parent task has a list of children that it will notify, and by default, it will do so synchronously and sequentially. If a child task blocks for whatever reason (for example, it might be processing a lot of work), the other children of the parent task will not be notified.

If there is a timeout set up, it will be triggered, even though the parent task was already completed. It took us a lot of time to figure out the repro for this issue, and we were certain that this was some sort of race condition in the TPL. I had a blog post talking all about it, but the Microsoft team was fast enough to literally answer my issue before I had the time to complete my blog post. That is really impressive.

I should note that the suggestion, using RunContinuationsAsynchronously, works quite well for creating a new Task or using TaskCompletionSource, but there is no way to specify that when you are using Task.Run. What is worse for us is that since this is not the default (for perfectly good performance reasons, by the way), any code that we call into might trigger this. I would much rather have been able to specify that when waiting on the task, rather than when creating it.
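For completeness, the suggested fix looks like this when you control the creation of the task (there is, as noted, no equivalent for Task.Run):

    // Continuations are queued to the thread pool instead of running inline
    // on whichever thread completes the task.
    var tcs = new TaskCompletionSource<object>(
        TaskCreationOptions.RunContinuationsAsynchronously);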
