Ayende @ Rahien

Hi!
My name is Oren Eini
Founder of Hibernating Rhinos LTD and RavenDB.
You can reach me by email or phone:

ayende@ayende.com

+972 52-548-6969

, @ Q c

Posts: 6,373 | Comments: 47,292

filter by tags archive

RavenDB 4.0Managing encrypted databases

time to read 6 min | 1012 words

imageOn the right you can see how the new database creation dialog looks like, when you want to create a new encrypted database. I talked about the actual implementation of full database encryption before, but todays post’s focus is different.

I want to talk a out managing encrypted databases. As an admin tasked working with encrypted data, I need to not only manage the data in the database itself, but I also need to handle a lot more failure points when using encryption. The most obvious of them is that if you have an encrypted database in the first place, then the data you are protecting is very likely to be sensitive in nature.

That raise the immediate question of who can see that information. For that matter, are you allowed to see that information? RavenDB 4.0 has support for time limited credentials, so you register to get credentials in the system, and using whatever workflow you have the system generate a temporary API key for you that will be valid for a short time and then expire.

What about all the other things that an admin needs to do? The most obvious example is how do you handle backups, either routine or emergency ones. It is pretty obvious that if the database is encrypted, we also want the backups to be encrypted, but are they going to use the same key? How do you restore? What about moving the database from one machine to the other?

In the end, it all hangs on the notion of keys. When you create a new encrypted database, we’ll generate a key for you, and require that you confirm for us that you have persisted that information in some manner. You can print it, download it, etc. And you can see the key right there in plain text during the db creation. However, this is the last time that the database key will ever reside in plain text.

So what about this QR code, what is it doing there? Put simply, it is there to capture attention. It replicates the same information that you have in the key itself, obviously. But what for?

The idea is that users are often hurrying through the process, (the “Yes, dear!” mode) and we want to encourage them to stop of a second and think. The use of the QR code make it also much more likely that the admin will print and save the key in an offline manner, which is likely to be safer than most methods.

So this is how we encourage administrators to safely remember the encryption key. This is useful because that give the admin the ability to take a snapshot on one machine, and then recover it on another, where the encryption key is not available, or just move the hard disk between machines if the old one failed. It is actually quite common in cloud scenarios to have a machine that has an attached cloud storage, and if the machine fails, you just spin up a new machine and attach the storage to the new one.

We keep the encryption keys secret by utilizing system specific keys (either DPAPI or decryption key that only the specific user can access), so moving machines like that will require the admin to provide the encryption key so we can continue working.

The issue of backups is different. It is very common to have to store long term backups, and needing to restore them in a separate location / situation. At that point, we need the backup to be encrypted, but we don’t want it it use the same encryption key as the database itself. This is mostly about managing keys. If I’m managing multiple databases, I don’t want to record the encryption key for each as part of the restore process. That is opening us to a missed key and a useless backup that we can do nothing about.

Instead, when you setup backups (for encrypted databases it is mandatory, for regular databases, optional) we’ll give you the option to provide a public key that we’ll then use to encrypted the backup. That means that you can more safely store it in cloud scenarios, and regardless of how many databases you have, as long as you have the private key, you’ll be able to restore the backup.

Finally, we have one last topic to cover with regards to encryption, the overall system configuration. Each database can be encrypted, sure, but the management of the database (such as connection strings that it replicates to, API keys that it uses to store backups and a lot of other potentially sensitive information) is still stored in plain text. For that matter, even if the database shouldn’t be encrypted, you might still want to encrypted the full system configuration. That lead to somewhat of a chicken and egg problem.

On the one hand, we can’t encrypt the server configuration from the get go, because then the admin will not know what the key is, and they might need that if they need to move machines, etc. But once we started, we are using the server configuration, so we can’t just encrypt that on the fly. What we ended up using is a set of command line parameters, so if the admins wants to run encrypted server configuration, they can stop the server, run a command to encrypt the server configuration and setup the appropriate encryption key retrieval process (DPAPI, for example, under the right user).

That gives us the chance to make the user aware of the key and allow to save it in a secure location. All members in a cluster with an encrypted server configuration must also have encrypted server configuration, which prevents accidental leaks.

I think that this is about it, with regards to the operations details of managing encryption, Smile. Pretty sure that I missed something, but this post is getting long as it is.

Beginning the RavenDB 4.0 book

time to read 12 min | 2208 words

You might have seen me talking about how close we are to a RavenDB beta release. Today marked a very important step along the route to an actual release. I’ve shifted my focus. Instead of going head down in the code and pushing things forward and doing all the sort of crazy stuff that you have seen me talking about for the past year and a half, I got started on the Inside RavenDB 4.0 book.

I say started because just the rough table of contents took me almost the entire day to complete. I’m expecting that this will take the majority of my time for the next few months, which means that you’ll get all the drib and drabs from the raw drafts as they are composed.  I’m also using this as a pretty nice way to go over the entire product and see how it all comes together as a cohesive whole, instead of looking at just a single piece every time.

Given that the period of putting bugs in the code is almost over, I feel that I can safely let the rest of the team fish out all the oopsies hat I managed to get in and focus on the product, rather than the code. This is the second time that I have made such a shift (and the third time I’m writing a book), and it still feels awkward. On the other hand, there is a great sense of accomplishment when you see how things just click together and all that hard work is finally real in a way that no code review or artificial scenario can replicate.

Here is what I have planned so far for the book. Your comments are welcome as always.

One of the major challenges in writing this book came in considering how to structure it. There are so many concepts that relate to one another that it can be difficult to try to understand them in isolation. We can't talk about modeling documents before we understand the kind of features that we have available for us to work with, for example. Considering this, I'm going to introduce concepts in stages.

Part I - The basics of RavenDB

Focus: Developers

This part contains a practical discussion on how to build an application using RavenDB, and we'll skip theory and concepts in favor of getting things done. This is what you'll want new hires to read before starting to work with an application using RavenDB, we'll keep the theory and the background information for the next part.

  • Chapter 2 - Zero to RavenDB - focuses on setting you up with a RavenDB instance, introduce the studio and some key concepts and walk you through the Hello World equivalent of using RavenDB by building a very simple To Do app.
  • Chapter 3 - CRUD operations - discusses RavenDB the basics of connecting from your application to RavenDB and the basics that you need to know to do CRUD properly. Entities, documents, attachments, collections and queries.
  • Chapter 4 - The Client API - explores more advanced client scenarios, such as lazy requests, patching, bulk inserts, and streaming queries and using persistent subscriptions. We'll talk a bit about data modeling and working with embedded and related documents.

Part II - Ravens everywhere

Focus: Architects

This part focuses on the theory of building robust and high performance systems using RavenDB. We'll go directly to working with a cluster of RavenDB nodes on commodity hardware, discuss distribution of data and work across the cluster and how to best structure your systems to take advantage of what RavenDB brings to the table.

  • Chapter 5 - Clustering Setup - walks through the steps to bring up a cluster of RavenDB nodes and working with a clustered database. This will also discuss the high availability and load balancing features in RavenDB.
  • Chapter 6 - Clustering Deep Dive - takes you through the RavenDB clustering behavior, how it works and how the both servers & clients are working together to give you a seamless distributed experience. We'll also discuss error handling and recovery in a clustered environment.
  • Chapter 7 - Integrating with the Outside World - explores using RavenDB along side additional systems, for integrating with legacy systems, working with dedicated reporting databases, ETL process, long running background tasks and in general how to make RavenDB fit better inside your environment.
  • Chapter 8 - Clusters Topologies - guides you through setting up several different clustering topologies and their pros and cons. This is intend to serve as a set of blueprints for architects to start from when they begin building a system.

Part III - Indexing

Focus: Developers, Architects

This part discuss how RavenDB index data to allow for quick retrieval of information, whatever it is a single document or aggregated data spanning years. We'll cover all the different indexing methods in RavenDB and how you can should use each of them in your systems to implement the features you want.

  • Chapter 9 - Simple Indexes - introduces indexes and their usage in RavenDB. Even though we have performed queries and worked with the data, we haven't actually dealt with indexes directly so far. Now is the time to lift the curtain and see how RavenDB is searching for information and what it means for your applications.
  • Chapter 11 - Full Text Search - takes a step beyond just querying the raw data and shows you how you can search your entities using powerful full text queries. We'll talk about the full text search options RavenDB provides, using analyzers to customize this for different usages and languages.
  • Chapter 13 - Complex indexes - goes beyond simple indexes and shows us how we can query over multiple collections at the same time. We will also see how we can piece together data at indexing time from related documents and have RavenDB keep the index consistent for us.
  • Chapter 13 - Map/Reduce - gets into data aggregation and how using Map/Reduce indexes allows you to get instant results over very large data sets with very little cost. Making reporting style queries cheap and accessible at any scale. Beyond simple aggregation, Map/Reduce in RavenDB also allows you to reshape the data coming from multiple source into a single whole, regardless of complexity.
  • Chapter 14 - Facet and Dynamic Aggregation - steps beyond static aggregation provided by Map/Reduce and give you the ability to run dynamic aggregation queries on your data, or just facet your search results to make it easier for the user to find what they are looking for.
  • Chapter 15 - Artificial Documents and Recursive Map/Reduce - guides you through using indexes to generate documents, instead of the other way around, and then use that both for normal operations and to support recursive Map/Reduce and even more complex reporting scenarios.
  • Chapter 16 - The Query Optimizier - takes you into the RavenDB query optimizer, index management and how RavenDB is treating indexes from the inside out. We'll see the kind of machinery that is running behind the scenes to get everything going so when you make a query, the results are there at once.
  • Chapter 17 - RavenDB Lucene Usage - goes into (quite gory) details about how RavenDB is making use of Lucene to implement its indexes. This is intended mostly for people who need to know what exactly is going on and how does everything fit together. This is how the sausage is made.
  • Chapter 18 - Advanced Indexing Techniques - dig into some specific usages of indexes that are a bit... outside the norm. Using spatial queries to find geographical information, generating dynamic suggestions on the fly, returning highlighted results for full text search queries. All the things that you would use once in a blue moon, but when you need them you really need them.

Part IV - Operations

Focus: Operations

This part deals with running and supporting a RavenDB cluster or clusters in production. From how you spina new cluster to decommissioning a downed node to tracking down performance problems. We'll learn all that you need (and then a bit more) to understand what is going on with RavenDB and how to customize its behavior to fit your own environment.

  • Chapter 19 - Deploying to Production - guides you through several production deployment options, including all the gory details about how to spin up a cluster and keep it healthy and happy. We'll discuss deploying to anything from a container swarm to bare metal, the networking requirements and configuration you need, security implications and anything else that the operation teams will need to comfortably support a RavenDB cluster in that hard land called production.
  • Chapter 20 - Security - focuses solely on security. How you can control who can access which database, running an encrypted database for highly sensitive information and securing a RavenDB instance that is exposed to the wild world.
  • Chapter 21 - High Availability - brings failure to the table, repeatedly. We'll discuss how RavenDB handles failures in production, how to understand, predict and support RavenDB in keeping all of your databases accessible and high performance in the face of various errors and disasters.
  • Chapter 22 - Recovering from Disasters - covers what happens after disaster strikes. When machines melt down and go poof, or someone issues the wrong command and the data just went into the incinerator. Yep, this is where we talk about backups and restore and all the various reasons why operations consider them sacred.
  • Chapter 23 - Monitoring - covers how to monitor and support a RavenDB cluster in production. We'll see how RavenDB externalize its own internal state and behavior for the admins to look at and how to make sense out of all of this information.
  • Chapter 24 - Tracking Performance - gets into why a particular query or a node isn't performing up to spec. We'll discuss how one would track down such an issue and find the root cause responsible for such a behavior, a few common reasons why such things happen and how to avoid or resolve them.

Part V - Implementation Details

Focus: RavenDB Core Team, RavenDB Support Engineers, Developers who wants to read the code

This part is the orientation guide that we throw at new hires when we sit them in front of the code. It is full of implementation details and background information that you probably don't need if all you want to know is how to build an awesome system on top of RavenDB.

On the other hand, if you want to go through the code and understand why RavenDB is doing something in a particular way, this part will likely answer all those questions.

  • Chapter 25 - The Blittable Format - gets into the details of how RavenDB represents JSON documents internally, how we go to this particular format and how to work with it.
  • Chapter 26 - The Voron Storage Engine - breaks down the low-level storage engine we use to put bits on the disk. We'll walk through how it works, the various features it offers and most importantly, why it had ended up in this way. A lot of the discussion is going to be driven by performance consideration, extremely low-level and full of operating system and hardware minutiae.
  • Chapter 27 - The Transaction Merger - builds on top of Voron and comprise one of the major ways in which RavenDB is able to provide high performance. We'll discuss how it came about, how it is actually used and what it means in terms of actual code using it.
  • Chapter 28 - The Rachis Consensus - talks about how RavenDB is using the Raft consuensus protocol to connect together different nodes in the cluster, how they are interacting with each other and the internal details of how it all comes together (and fall apart and recover again).
  • Chapter 31 - Cluster State Machine - brings the discussion one step higher by talking about how the RavenDB uses the result of the distributed consensus to actually manage all the nodes in the cluster and how we can arrive independently on each node to the same decision reliably.
  • Chapter 30 - Lording over Databases - peeks inside a single node and explores how a database is managed inside that node. More importantly, how we are dealing with multiple databases on the same node and what kind of impact each database can have on its neighbors.
  • Chapter 31 - Replication - dives into the details of how RavenDB manages multi master distributed database. We'll go over change vectors to ensure conflict detection (and aid in its resolution) how the data is actually being replicated between the different nodes in a database group.
  • Chapter 32 - Internal Architecture - gives you the overall view of the internal architecture of RavenDB. How it is built from the inside, and the reasoning why the pieces came together in the way they did. We'll cover both high-level architecture concepts and micro architecture of the common building blocks in the project.

Part VI - Parting

This part summarizes the entire book and provide some insight about what our future vision for RavenDB is.

  • Chapter 33 - What comes next - discusses what are our (rough) plans for the next major version and our basic roadmap for RavenDB.
  • Chapter 34 - Are we there yet? Yes! - summarize the book and let you go and start actually using all of this awesome information.

RavenDB 4.0Working with attachments

time to read 3 min | 597 words

imageIn my previous post, I talked about attachments, how they look in the studio and how to work with them from code. In this post, I want to dig a little deeper into how they are actually working.

Attachments are basically blobs that can be attached to a document, a document can have any number of attachments attached to it, and the actual contents of the attachment is actually stored separately from the document. One of the advantages of this separate storage is that it also allows us to handle de-duplication.

The trivial example is needing to attach the same file to a lot of documents will result in just a single instance of that file being kept around. There are actually quite a lot of use cases that call for this (for example, imagine the default profile picture), but this really shines when you start working with revisions. Every time that document changes (which include modifications to attachments, of course), a new revision is created. Instead of having each revision clone all of the attachments, or not have attachments tracked by revisions, each revisions will simply reference the same attachment data.  That way, you can get a whole view of the document at a point in time, implement auditing and tracking, etc.

Another cool aspect of attachments being attached to documents is that they flow in the same manner over replication. So if you modified a document and added an attachment, that modification will be replicated at the same time (and in the same transaction) as that attachment.

In terms of actually working with the attachments, we keep track of the references between documents and attachments internally, and expose them via the document metadata.

This is done so you don’t need to make any additional server calls to get the attachments on a specific documents. You just need to load it, and you have it all there.

This looks like this, note that you can get all the relevant information about the attachment directly from the document, without having to go elsewhere. This is also how you can compare attachment changes across revisions, and this allow you to write conflict resolution scripts that operate on documents and attachments seamlessly.

image

Putting the attachment information inside the document metadata is a design decision that was made because for the vast majority of the cases, the number of attachments per document is pretty small, and even in larger cases (dozens or hundreds of attachments) it works very well. If you have a case where a single document has many thousands or tens of thousands of attachments, that will likely be a very high load on the metadata, and you should consider splitting the attachments into multiple documents (sub documents with some domain knowledge will work).

Let us consider a big customer, to whom we keep issuing invoices. A good problem to have is that eventually we’ll issue enough invoices to that customer that we start suffering from very big metadata load just because of the attachment tracking. We can handle that by using (customer/1234/invoices/2017-04, customer/1234/invoices/2017-05 ) as the documents we’ll use to hang the attachments on.

This was done intentionally, because it mimics the same way you’ll split a file that has an unbounded growth (keeping all invoicing data for a big customer in a single document is also not a good idea, and has the same solution).

RavenDB 4.0Attachments

time to read 2 min | 251 words

image

I previously wrote about the new attachments feature in RavenDB 4.0. Now it is ready to be seen by outside eyes.

As you can see in the image on the right, documents can now have attached attachments (I’m sorry, couldn’t think about a better way to phrase this). This give you the ability to store binary data inside RavenDB, but not as some free floating value that has only very loose connection to the rest of the system. Those attachments are strongly tied to their parent document, and allow you to store related information easily and right next to the document.

 

 

 

 

That also means that you can take advantage of RavenDB’s ACID nature and actually make modifications to both attachments and documents at the same time. For example, consider the following code:

Here we get the user’s profile picture, generate a thumbnail from it and then associate both picture and thumbnail with the document, we are also updating the status of the user to indicate that they have a profile picture and then submit it all as one single transaction. That means that you don’t have to sync between different sources.

Attachments in RavenDB are also kept consistent through replication, so you won’t see partial results between nodes, and the attachments themselves are using de-duplication techniques to reduce the amount of storage that we take.

I’m really happy with this feature.

RavenDB 4.0Full database encryption

time to read 8 min | 1403 words

imageLate last year I talked about our first thoughts about implementing encryption in RavenDB in the 4.0 version. I got some really good feedback and that led to this post, detailing the initial design for RavenDB encryption in the 4.0 version. I’m happy to announce that we are now pretty much done with regards to implementing, testing and banging on this, and we have working full database encryption in 4.0. This post will discuss how we implemented it and additional considerations regarding management and working with encryption databases and clusters.

RavenDB uses ChaCha20Poly1305 authenticated encryption scheme, with 256 bits keys. Each database has a master key, which on Windows is kept encrypted via DPAPI (on Linux we are still figuring out the best thing to do there), and we encrypt each page (or range of pages, if a value takes more than a single page) using its a key derived from the master key and the page number. We are also using a random 64 bits nonce for each page the first time it is encrypted, and then increment the nonce every time we need to encrypt the page again.

We are using libsodium as the encryption library, and in general it make it a pleasure to work with encryption, since it is very focused on getting things done and getting them done right. The number of decisions that we had to make by using it is minimal (which is good). The pattern of initial random generation of the nonce and then incrementing on each use is the recommended method for using ChaCha20Poly1305. The WAL is also encrypted on a per transaction basis, with a random nonce and a derived from the master key and the transaction id.

This encryption is done at the lowest possible level, so it is actually below pretty much anything in RavenDB itself. When a transaction need to access some data, it is decrypted on the fly for it, and then it is available for the duration of the transaction. When the transaction is closed, we’ll encrypt all modified data again, then wipe all the buffered we used to ensure that there is no leakage. The only time we have decrypted data in memory is during the lifetime of a transaction.

A really nice benefit of working at such low level is that all the feature of the database are automatically included. That means that adding our Lucene on Voron implementation, for example, is also part of this, and all indexes are encrypted as well, without having to take any special steps.

Encrypted database does poses some challenges. Not so much for the database author (that’s me) as to the database users (that’s you). If you take a backup of an encrypted database, restoring it pretty much requires that you’ll have the master key to enable actually accessing the data. When you are running in a cluster, that represent another challenge, since you need to make sure that all the nodes in the cluster running the database are encrypting it. Communication about the database should also be encrypted. Tasks such as periodic backup / export / etc also need special treatment. Key generation is also important, you want to make sure that the key isn’t “123456” or some such.

Here is what we came up with. When you create a new encrypted database, you’ll be presented with the following UI:
image

The idea is that you can either have the server generate the key for you (secured, cryptographically random 256 bits), and then we’ll show it to you and allow you to save it / print it / record it for later. This is the only chance you’ll have to get the key from the server.

After you have the key, you then select which nodes and how many this database is going to run on, which will cause it to create the master key on all of those for that particular database. In this manner, all the nodes for this database will use the same key, which simplify some operational tasks (recovery / backup / restore / etc). Alternatively, you can create the database on a single node, and then add additional nodes to it, which will give you the option to generate / provide a key.  I’m not sure whatever ops will be happy with either option, but we have the ability to either have all of them using the same key or having separate key for each node.

Encrypted databases can only reside on nodes that communicate with each other via HTTPS / SSL. The idea is that if you are running an encrypted database, requests between the different nodes should also be private, so you’ll have protection for data at rest and as the data is being moved around. Note that we verify that at the sender, not the origin side. Primarily, this is about typical deployment patterns. We expect to see users running RavenDB behind firewalls / proxy / load balancers, we expect people to use nginx as the SSL endpoint and then talk http to ravendb (which is reasonable if they are on the same machine or using unix sockets).  Regardless, when sending data over the wire, we are always talking RavenDB instance to RavenDB instance via TCP connection, and for encrypted databases, this will require SSL connection to work.

Other things, such as ETL tasks, will generate a suggestion to the user to ensure that they are also secured, but we’ll assume here that this is an explicit operation to move some of the data to another location, possibly filtering it, so we’ll not block it if it isn’t using https (leaving aside the fact that just figuring out whatever the connection to a relational database is using SSL or not is too complex to try), so we’ll warn about it and trust the user.

Finally, we have the issue of backups and exports. Those can be done on a regular basis, and frequently you’ll want them to be done to a remote location (cloud, S3, Dropbox, etc). In that case, all automated backup processes will require you to generate a public / private key pair (and only retain the public key, obviously). From then on, all the backups will be encrypted using the public key and uploaded to the cloud. That means that even if your cloud account is hacked, the hackers can’t do anything with the data, since they are missing the private key.

Again, to encourage users to actually do the right thing and save the private key, we’ll offer it in a form that is easily maintained safely. So you’ll be asked to print it and store it in some file folder somewhere (in addition to whatever digital backups you have), so you can restore it at a later point when / if you need it.

For manual operations, such as exporting an encrypted database, we’ll warn if you are trying to export without a key, but allow it (since forcing a key for a manual operation does nothing to security). Conversely, exporting a non encrypted database will also allow you to provide key pair and encrypt it, both for manual operations and automatic backup configurations.

Aside from those considerations, you can pretty much treat an encrypted database as a regular one (except, of course, that the data is encrypted at all times unless actively accessed). That means that all features would just work. The cost of actually encrypting and decrypting all the time is another concern, and we have seen about 60% additional cost in write speed and about 15% extra cost for reading.

Just to give you some idea about the performance we are talking about… Ingesting the entire Stack Overflow dataset, some 52GB in size and over 18 million documents can be done in 22 minutes with encryption. Without encryption we are faster, roughly 13 and a half minutes on that same machine, but that still gives us a rate of close to 2.5 GB per minute with encryption.

As you can tell from the mockup, while we have completed most of the encryption work around the engine, the actual UI and behavioral semantics are still in a bit of a flux. Your comments about those are welcome.

RavenDB 4.0 on Docker

time to read 1 min | 156 words

image[6]RavenDB 4.0 now has official Docker support. That means that if you have Docker installed, you can get RavenDB 4.0 running on your system with the following a single command:

This single command is enough to spin a new RavenDB instance that you can immediately start using.

If you are using the default Docker setup on Windows, you can immediately go to:

http://10.0.75.2:8080/

And you are pretty much done. That is a pretty awesome experience, I think.

We have support for Ubuntu and Windows Nano Server images, and you can run a more full pledge script that give you more exact control over what is going on.

Distributed task assignment with failover

time to read 5 min | 818 words

When building a distributed system, one of the more interesting aspects is how you are going to distribute tasks assignment. In other words, given that you have multiple nodes, how do you decide which node will do what? In some cases, that is relatively easy, you can say “all nodes will process read requests”, but in others, this is more complex. Let us take the case where you have several nodes, and you need to have a regular backup of a database that is replicated between all those nodes. You probably don’t want to run the backup across all the nodes, after all, they are pretty much the same and you don’t want to backup the exact same thing multiple times. On the other hand, you probably don’t want to assign this work statically, if you do, and if the node that is responsible for the backup is down, you got no backup.

Another example of the problem can be seen when you have other processes that you would like to be sticky if possible, and only jump around if there is a failure. Brining up a new node online is a common thing to do in a cluster, and the ideal scenario in that case is that a single node will feed it all the data that it needs. If we have multiple nodes doing that, they are likely to overlap and they might very well overload the poor new server. So we want just one node to update its state, but if that node goes down midway, we need someone else to pick up the slack.. For RavenDB, those sort of tasks includes things like ETL processes, Subscriptions, backup, bootstrapping new servers and more. We keep discovering new things that can use this sort of behavior.

But how do we actually make this work?

One way of doing this is to take advantage of the fact that RavenDB is using Raft and have the leader do task assignment. That works quite well, but it doesn’t scale. What do I meant by that? I mean that as the number of tasks that we need to manage grows, the complexity in the task assignment portion of the code grows as well. Ideally, I don’t want to have to consider twenty different variables and considerations before deciding what operation should go on which server, and trying to balance that sort of work in one place has proven to be challenging.

Instead, we have decided to allocate work in the cluster based on simple rules. Each task in the cluster has a key (which is typically generated by hashing some of its parameters), and that task can be assigned to a group of servers. Given those two details, we can use Jump Consistent Hashing to spread the load around. However, that doesn’t handle failover. We have a heartbeat process that can detect and notify nodes that a sibling has went down, so combining those two, we get the following code:

What we are doing here is rely on two different properties. Jump Consistent Hashing to let us know which node is responsible for what, and the Raft cluster leader that keep track of all the live nodes and let us know when a node goes down. When we need to assign a task, we use its hashed key to find its preferred owner, and if it is alive, that is that. But if it currently down, we do two things, we remove the downed node from the topology and re-hash the key with the new number of nodes in the cluster. That gives us a new preferred node, and so on until we find a live one.

The reason we rehash on failover is that Jump Consistent Hashing is going to usually point to the same position in the topology (that is why we choose it in the first place, after all), so we rehash to get a different position so it won’t all fall unto the next node in the list. All downed node tasks are fairly distributed among the remaining live cluster members.

The nice thing about this is that aside from keeping the live/down list up to date, the Raft cluster doesn’t really need to do something. This is a consistent algorithm, so different nodes operating on the same data can arrive at the same result, so a node going down will result in another node picking up on updating the new server up to spec and another will start a backup process. And all of that logic is right where we want it, right next to where the task logic itself is written.

This allow us to reason much more effectively about the behavior of each independent task, and also allow each node to let you know where each task is executing.

Overloading the Windows Kernel and locking a machine with RavenDB benchmarks

time to read 4 min | 781 words

During benchmarking RavenDB, we have run into several instances where the entire machine would freeze for a long duration, resulting in utter non responsiveness.

This has been quite frustrating to us, since a frozen machine make it kinda hard to figure out what is going on. But we finally figured it out, and all the details are right here in the screen shot.

image

What you can see is us running our current benchmark, importing the entire StackOverflow dataset into RavenDB. Drive C is the system drive, and drive D is the data drive that we are using to test our RavenDB’s performance.

Drive D is actually a throwaway SSD. That is, an SSD that we use purely for benchmarking and not for real work. Given the amount of workout we give the drive, we expect it to die eventually, so we don’t want to trust it with any other data.

At any rate, you can see that due to a different issue entirely, we are now seeing data syncs in excess of 8.5 GB. So basically, we wrote 8.55GB of data very quickly into a memory mapped file, and then called fsync. At the same time, we started increasing our  scratch buffer usage, because calling fsync ( 8.55 GB ) can take a while. Scratch buffers are a really interesting thing, they were born because of Linux crazy OOM design, and are basically a way for us to avoid paging. Instead of allocating memory on the heap like normal, which would then subject us to paging, we allocate a file on disk (mark it as temporary & delete on close) and then we mmap the file. That give us a way to guarantee that Linux will always have a space to page out any of our memory.

This also has the advantage of making it very clear how much scratch memory we are currently using, and on Azure / AWS machines, it is easier to place all of those scratch files on the fast temp local storage for better performance.

So we have a very large fsync going on, and a large amount of memory mapped files, and a lot of activity (that modify some of those files) and memory pressure.

That force the Kernel to evict some pages from memory to disk, to free something up. And under normal conditions, it would do just that. But here we run into a wrinkle, the memory we want to evict belongs to a memory mapped file, so the natural thing to do would be to write it back to its original file. This is actually what we expect the Kernel to do for us, and while for scratch files this is usually a waste, for the data file this is exactly the behavior that we want. But that is beside the point.

Look at the image above, we are supposed to be only using drive D, so why is C so busy? I’m not really sure, but I have a hypothesis.

Because we are currently running a very large fsync, I think that drive D is not currently process any additional write requests. The “write a page to disk” is something that has pretty strict runtime requirements, it can’t just wait for the I/O to return whenever that might be. Consider the fact that you can open a memory mapped file over a network drive, I think it very reasonable that the Kernel will have a timeout mechanism for this kind of I/O. When the pager sees that it can’t write to the original file fast enough, it shrugs and write those pages to the local page file instead.

This turns an  otherwise very bad situation (very long wait / crash) into a manageable situation. However, with the amount of work we put on the system, that effectively forces us to do heavy paging (in the orders of GBs) and that in turn lead us to a machine that appears to be locked up due to all the paging. So the fallback error handling is actually causing this issue by trying to recover, at least that is what I think.

When examining this, I wondered if this can be considered a DoS vulnerability, and after careful consideration, I don’t believe so. This issue involves using a lot of memory to cause enough paging to slow everything down, the fact that we are getting there in a somewhat novel way isn’t opening us to anything that wasn’t there already.

RavenDB Bootcamp

time to read 1 min | 105 words

image

We have RavenDB Bootcamp ready to go. If you want to learn about RavenDB, we have 18 parts series that take you through working with RavenDB in easily digestible pieces.

You can either go through them all or register to get them once a day via email so it doesn’t take too much time all at once.

They are available in our docs, and it is a great way to learn RavenDB from nothing.

Reducing the cost of occasionally needed information

time to read 3 min | 486 words

Consider the case when you need to call a function, and based on the result of the function, and some other information, do some work. That work require additional information, that can only be computed by calling the function.

Sounds complex? Let us look at the actual code, I tried to make it simple, but keep the same spirit.

This code has three different code paths. We are inserting a value that is already in the tree, so we update it and return. We insert a new value, but the page has enough space, so we add it and return.

Or, the interesting case for us, we need to split the page, which requires modifying the parent page (which may require split that page, etc). So the question is, how do we get the parent page?

This is important because we already found the parent page, we had to go through it to find the actual page we returned in FindPageFor. So ideally we’ll just return the list of parent pages whenever we call FindPageFor.

The problem is that in all read scenarios, which by far outweigh the  write operations, never need this value. What is more, even write operations only need it occasionally. On the other hand, when we do need it, we already did all the work needed to just give it to you, and it would be a waste to do it all over again.

We started the process (this particular story spans 4 years or so) with adding an optional parameter. If you passed a not null value, we’ll fill it with the relevant details. So far, so good, reads will send it null, but all the writes had to send it, and only a few of them had to actually use it.

The next step was to move the code to use an out lambda. In other words, when you called us, we’ll also give you a delegate that you can call, which will give you the details you need. This way, you’ll only need to pay the price when you actually needed it.

It looked like this:

However, that let to a different problem, while we only paid the price of calling the lambda when we needed it, we still paid the price  of allocating the labmda itself.

The next step was to create two overloads, one which is used only for reads, and one for writes. The writes one allows us to send a struct out parameter. The key here is that because it is a struct there is no allocation, and the method is written so we put the values that were previously captured on the lambda on the struct. Then if we need to actually generate the value, we do that, but have no additional allocations.

And this post is now about six times longer than the actual optimization itself.

FUTURE POSTS

  1. The pain of HTTPS - 10 hours from now

There are posts all the way to May 30, 2017

RECENT SERIES

  1. RavenDB 4.0 (4):
    19 May 2017 - Managing encrypted databases
  2. De-virtualization in CoreCLR (2):
    01 May 2017 - Part II
  3. re (17):
    26 Apr 2017 - Writing a Time Series Database from Scratch
  4. Performance optimizations (2):
    11 Apr 2017 - One step forward, ten steps back
  5. RavenDB Conference videos (12):
    03 Mar 2017 - Replication changes in 3.5
View all series

Syndication

Main feed Feed Stats
Comments feed   Comments Feed Stats