Ayende @ Rahien

Oren Eini aka Ayende Rahien CEO of Hibernating Rhinos LTD, which develops RavenDB, a NoSQL Open Source Document Database.

You can reach me by:

oren@ravendb.net

+972 52-548-6969

time to read 4 min | 797 words

In my last post, I talked about how to store and query time series data in RavenDB. You can query over the time series data directly, as shown here:
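
The query itself is shown as a screenshot in the original post and isn’t reproduced here. As a rough sketch only, assuming a Nodes collection with a Storage time series (per the metrics listed below) and using the C# client’s raw RQL support, it could look something like this; the document id, date range and exact RQL syntax are assumptions:

    // 'store' is assumed to be an initialized DocumentStore (Raven.Client.Documents)
    using (var session = store.OpenSession())
    {
        // project the Storage time series of one document, limited to a time range
        var result = session.Advanced
            .RawQuery<object>(@"
                from Nodes as n
                where id() = 'nodes/1-A'
                select timeseries(
                    from n.Storage
                    between '2020-01-01' and '2020-02-01'
                )")
            .ToList();
    }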

You’ll note that we project a query over a time range for a particular document. We could also query over all documents that match a particular query, of course. One thing to note, however, is that time series queries are done on a per time series basis and each time series belongs to a particular document.

In other words, if I want to ask a question about time series data across documents, I can’t just query for it, I need to do some prep work first. This is done to ensure that when you query, we’ll be able to give you the right results, fast.

As a reminder, we have a bunch of nodes that we record metrics of. The metrics so far are:

  • Storage – [Number of objects, Total size used, Total storage size]
  • Network – [Total bytes in, Total bytes out]

We record these metrics for each node at regular intervals. The query above can give us space utilization over time in a particular node, but there are other questions that we would like to ask. For example, given an upload request, we want to find the node with the most free space. Note that we record the total size used and the total storage available only as time series metrics. So how are we going to be able to query on it? The answer is that we’ll use indexes. In particular, a map/reduce index, like the following:
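
The index itself appears as a screenshot in the post; the following is a hedged reconstruction based on the description below, not the author’s exact definition. The index name, the output field names (NodeId, FreeSpace, Timestamp) and the segment’s DocumentId property are my own assumptions:

    var index = new IndexDefinition
    {
        Name = "Nodes/FreeSpace",
        Maps =
        {
            // the map runs over whole segments of the Storage time series on Nodes
            @"from segment in timeseries.Nodes.Storage
              let entry = segment.Entries.Last()
              select new
              {
                  NodeId = segment.DocumentId,
                  FreeSpace = entry.Values[2] - entry.Values[1],
                  Timestamp = entry.Timestamp
              }"
        },
        // the reduce keeps only the most recent measurement per node
        Reduce = @"from result in results
                   group result by result.NodeId into g
                   let latest = g.OrderByDescending(x => x.Timestamp).First()
                   select new
                   {
                       NodeId = g.Key,
                       FreeSpace = latest.FreeSpace,
                       Timestamp = latest.Timestamp
                   }"
    };
    store.Maintenance.Send(new PutIndexesOperation(index));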

This deserves some explanation, I think. Usually in RavenDB, the source of an index is docs.[Collection], such as docs.Users. In this case, we are using a time series index, so the source is timeseries.[Collection].[TimeSeries]. Here, we operate over the Storage time series on the Nodes collection.

When we create an index over a time series, we are exposed to some internal structural details. Each entry in a time series isn’t stored independently. That would be incredibly wasteful to do. Instead, we store time series entries together in segments. The details about how and why we do that don’t really matter, but what does matter is that when you create an index over a time series, you’ll be indexing the segment as a whole. You can see how the map accesses the Entries collection on the segment, gets the last entry (the most recent) and outputs it.

The other thing that is worth noticing in the map portion of the index is that we operate on the values recorded at each timestamp. In this case, Values[2] is the total amount of storage available and Values[1] is the size used. The reduce portion of the index, on the other hand, is identical to any other map/reduce index in RavenDB.

What this index does, essentially, is tell us the most up to date free space that we have for each particular node. As for querying it, let’s see how that works, shall we?

[screenshot: querying the index for a node with enough free space]

Here we are asking for the node with the least disk space that can still contain the data we want to write. This can reduce fragmentation in the system as a whole, by ensuring that we use a best fit method.
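
The query in the screenshot isn’t reproduced here; against an index shaped like the sketch above, it could look roughly like this (the parameter, the ordering and the index name are assumptions):

    using (var session = store.OpenSession())
    {
        // best fit: the smallest free space that is still big enough for the upload
        var bestFit = session.Advanced
            .RawQuery<object>(@"
                from index 'Nodes/FreeSpace'
                where FreeSpace > $size
                order by FreeSpace as double asc
                limit 1")
            .AddParameter("size", 50_000_000)
            .ToList();
    }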

Let’s look at a more complex example of indexing time series data, computing the total network usage for each node on a monthly basis. This is not trivial because we record network utilization on a regular basis, but need to aggregate that over whole months.

Here is the index definition:
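
Again, the actual definition is a screenshot; here is a hedged sketch based on the explanation that follows. The field names are assumptions, and I’m assuming the Network time series holds [bytes in, bytes out] as described earlier:

    var index = new IndexDefinition
    {
        Name = "Nodes/NetworkUsage/Monthly",
        Maps =
        {
            // group each segment's entries by year/month first, since one segment
            // may span more than one month
            @"from segment in timeseries.Nodes.Network
              from g in segment.Entries.GroupBy(e => new { e.Timestamp.Year, e.Timestamp.Month })
              select new
              {
                  NodeId = segment.DocumentId,
                  Year = g.Key.Year,
                  Month = g.Key.Month,
                  BytesIn = g.Sum(e => e.Values[0]),
                  BytesOut = g.Sum(e => e.Values[1])
              }"
        },
        // the reduce sums the partial totals from all segments for the same month
        Reduce = @"from result in results
                   group result by new { result.NodeId, result.Year, result.Month } into g
                   select new
                   {
                       NodeId = g.Key.NodeId,
                       Year = g.Key.Year,
                       Month = g.Key.Month,
                       BytesIn = g.Sum(x => x.BytesIn),
                       BytesOut = g.Sum(x => x.BytesOut)
                   }"
    };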

As you can see, the very first thing we do is to aggregate the entries based on their year and month. This is done because a single segment may contain data from multiple months. We then sum up the values for each month and compute the total in the reduce.

[screenshot: the monthly network usage results]

The nice thing about this feature is that we are able to aggregate large amounts of data and benefit from the usual advantages of RavenDB map/reduce indexes. We have already massaged the data into the right shape, so queries on it are fast.

Time series indexes in RavenDB allow us to merge time series data from multiple documents. For example, I could have aggregated the computation above across multiple nodes to get the total per customer, so I’ll know how much to charge them at the end of the month.

I would be happy to hear about any other scenarios that you can think of for using time series in RavenDB, and in particular, what kind of queries you’ll want to run on the data.

time to read 4 min | 633 words

RavenDB 5.0 is coming soon and the big news there is time series support. We have gotten to the point where we can actually show off what we can do, which makes me very happy. You can use the nightly builds to explore time series support in RavenDB 5.0. Client side packages for 5.0 are also available.


I went ahead and created a new database and created some documents:

[screenshot: the newly created documents in the Studio]

Time series are often used for monitoring, so I decided to go with the flow and see what kind of information we would want to store there. Here is how we can add some time series data to the documents:
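
The code in the post is a screenshot; a minimal sketch of appending time series values with the C# client might look like this. The document id, the time series name and the tag are assumptions, and the Append overloads may differ slightly in the 5.0 nightlies:

    using (var session = store.OpenSession())
    {
        var bytesIn = 2_500_000d;
        var bytesOut = 1_200_000d;

        session.TimeSeriesFor("nodes/1-A", "Network")
            .Append(DateTime.UtcNow,                 // the timestamp for the values
                    new[] { bytesIn, bytesOut },     // multiple values per timestamp
                    "net/eth0");                     // tag: the device / interface measured
        session.SaveChanges();
    }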

I want to focus on this for a bit, because it is important. A time series in RavenDB has the following details:

  • The timestamp to associate with the values – in the code above, this is the current time (UTC)
  • The tag associated with the timestamp – in the code above, we record which device and interface these measurements belong to.
  • The measurements themselves – RavenDB allows you to record multiple values for a single timestamp. We treat them as an array of values, and you can choose to put them in a single time series or to split them.

Let’s assume that we have quite a few measurements like this and that we want to look at the data. You can explore things in the Studio, like so:

[screenshot: exploring the time series in the Studio]

We have another tab in the Studio that you can look at which will give you some high level details about the timeseries for a particular document. We can dig deeper, too, and see the actual values:

[screenshot: the individual time series values]

You can also query the data to see the patterns and not just the individual values:
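
The query is a screenshot in the post; a hedged sketch of that kind of aggregation query follows. The grouping granularity, the names and the exact RQL syntax are assumptions and may differ in your build:

    using (var session = store.OpenSession())
    {
        var aggregated = session.Advanced
            .RawQuery<object>(@"
                from Nodes as n
                select timeseries(
                    from n.Network
                    group by '1 hour'
                    select min(), max(), avg()
                )")
            .ToList();
    }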

The output will look like this:

[screenshot: the query output]

And you can click on the eye to get more details in chart form. You can see a little bit of this here, but it is hard to do it justice with a small screenshot:

[screenshot: the time series chart]

Here is what the data you get back from this query:

The ability to store and process time series data is very important for monitoring, IoT and healthcare systems. RavenDB is able to do quite well in these areas. For example, to aggregate over 11.7 million heartrate details over 6 years at a weekly resolution takes less than 50 ms.

We have tested time series that contained over 150 million entries and we can aggregate results back over the entire data set in under three seconds. That is a nice number, but it doesn’t match what dedicated time series databases can do. It represents a rate of about 65 million rows / second. ScyllaDB recently published a benchmark in which they talk about a billion rows / sec. But they did that on 83 nodes, so they did just 12 million / sec per node. Less than a fifth of RavenDB’s speed.

But that is being unfair, to be honest. While time series queries are really interesting, we don’t really expect users to query very large amounts of data using raw queries. That is what we have indexes for, after all. I’m going to talk about this in depth in my next post.

time to read 6 min | 1100 words

When it comes to security, the typical question isn’t whether they are after you but how much. I love this paper on threat modeling, and I highly recommend it. But sometimes, you have information that you just don’t want to have. In other words, you want to store information inside of the database, but without the database or application being able to read said information without a key supplied by the user.

For example, let’s assume that we need to store the credit card information of a customer. We need to persist this information, but we don’t want to know it. We need something more from the user in order to actually use it.

The point of this post isn’t actually to talk about how to store credit card information in your database; instead, it is meant to walk you through an approach in which you can keep data about a user that you can only access in the context of the user.

In terms of privacy, that is a very important factor. You don’t need to worry about a rogue DBA trawling through sensitive records or be concerned about a data leak because of an unpatched hole in your defenses. Furthermore, if you are carrying sensitive information that a third party may be interested in, you cannot be compelled to give them access to that information. You literally can’t, unless the user steps up and provides the keys.

Note that this is distinctly different (and weaker) than end to end encryption. With end to end encryption the server only ever sees encrypted blobs. With this approach, the server is able to access the encryption key with the assistance of the user. That means that if you don’t trust the server, you shouldn’t be using this method. Going back to the proper threat model, this is a good way to ensure privacy for your users if you need to worry about getting a warrant for their data. Basically, consider this as one of the problems this is meant to solve.

When the user logs in, they have to use a password. Given that we aren’t storing the password, that means that we don’t know it. This means that we can use that as the user’s personal key for encrypting and decrypting the user’s information. I’m going to use Sodium as the underlying cryptographic library because that is well known, respected and audited. I’m using the Sodium.Core NuGet package for my code samples. Our task is to be able to store sensitive data about the user (in this case, the credit card information, but can really be anything) without being able to access it unless the user is there.

A user is identified using a password, and we use Argon2id to create the password hash. This ensures that you can’t brute force the password. So far, this is fairly standard. However, instead of asking Argon2 to give us a 16-byte key, we are going to ask it to give us a 48-byte key. There isn’t really any additional security in getting more bytes. Indeed, we are going to consider only the first 16 bytes that were returned to us as important for verifying the password. We are going to use the remaining 32 bytes as a secret key. Let’s see how this looks in code:
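
The code from the post isn’t reproduced here, so what follows is only a sketch of the approach using Sodium.Core, under the assumption that the ArgonHashBinary / SecretBox helpers are available with roughly these overloads (check the Sodium.Core documentation for the exact signatures and for how to select the Argon2id variant). The CryptoConfig shape is my own:

    using System;
    using System.Text;
    using Sodium; // Sodium.Core NuGet package

    public class CryptoConfig
    {
        public byte[] Salt;        // Argon2 salt
        public byte[] Verifier;    // first 16 bytes of the Argon2 output
        public byte[] Nonce;       // nonce used when wrapping the data key
        public byte[] WrappedKey;  // the random data key, encrypted with the derived key
    }

    public static class UserCrypto
    {
        public static (CryptoConfig Config, byte[] EncryptionKey) Create(string password)
        {
            var salt = PasswordHash.ArgonGenerateSalt();

            // 48 bytes out of Argon2: 16 for password verification, 32 as a key-wrapping key
            // NOTE: make sure the Argon2id variant is selected; the exact overload may differ
            var derived = PasswordHash.ArgonHashBinary(
                Encoding.UTF8.GetBytes(password), salt,
                PasswordHash.StrengthArgon.Moderate, 48);

            var verifier = derived.AsSpan(0, 16).ToArray();
            var wrappingKey = derived.AsSpan(16, 32).ToArray();

            // the actual data encryption key is random; we only ever store it wrapped
            var dataKey = SodiumCore.GetRandomBytes(32);
            var nonce = SecretBox.GenerateNonce();
            var wrapped = SecretBox.Create(dataKey, nonce, wrappingKey);

            return (new CryptoConfig
            {
                Salt = salt,
                Verifier = verifier,
                Nonce = nonce,
                WrappedKey = wrapped
            }, dataKey);
        }
    }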

Here is what we are doing: we get 48 bytes from Argon2id using the password. We keep the first 16 bytes to authenticate the user next time. Then we generate a random 256-bit key and encrypt it using the last part of the output of the Argon2id call. The function returns the generated config and the encryption key. You can now encrypt data using this key as much as you want. But while we assume that the CryptoConfig is written to persistent storage, we are not keeping the encryption key anywhere but memory. In fact, this code is pretty cavalier about its usage. You’ll typically store encryption keys in locked memory only, wipe them after use, etc. I’m skipping these steps here in order to get to the gist of things.

Once we forget about the encryption key, all the data we have about the user is effectively random noise. If we want to do something with it, we have to get the user to give us the password again. Here is what the other side looks like:
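
And a matching sketch for the other side, under the same assumptions as the snippet above (FixedTimeEquals comes from System.Security.Cryptography):

    public static byte[] GetEncryptionKey(CryptoConfig config, string password)
    {
        var derived = PasswordHash.ArgonHashBinary(
            Encoding.UTF8.GetBytes(password), config.Salt,
            PasswordHash.StrengthArgon.Moderate, 48);

        // the first 16 bytes authenticate the user
        if (System.Security.Cryptography.CryptographicOperations.FixedTimeEquals(
                derived.AsSpan(0, 16), config.Verifier) == false)
            throw new UnauthorizedAccessException("Invalid password");

        // the remaining 32 bytes unwrap the actual data encryption key
        return SecretBox.Open(config.WrappedKey, config.Nonce, derived.AsSpan(16, 32).ToArray());
    }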

We authenticate using the first 16 bytes, then use the other 32 to decrypt the actual encryption key and return that. Without the user’s password, we are blocked from using their data, great!

You’ll also notice that the actual key we use is random. We encrypt it using the key derived from the user’s password, but the data key itself is random. Why is that? This is to enable us to change passwords. If the user wants to change the password, they’ll need to provide the old password as well as the new one. That allows us to decrypt the actual encryption key using the key from the old password and encrypt it again with the new one.

Conversely, resetting a user’s password will mean that you can no longer access the encrypted data. That is actually a feature. Leaving aside the issue of warrants for data seizure, consider the case where we use this system to encrypt credit card information. If the user resets their password, they will need to re-enter their credit card. That is great, because it means that even if you managed to reset the password (for example, by gaining access to their email), you don’t get access to the sensitive information.

With this kind of system in place, there is one thing that you have to be aware of. Your code needs to (gracefully) handle the scenario of the data not being decryptable. So trying to get the credit card information and getting an error should be handled and not crash the payment processing system. It is a different mindset, because it may violate invariants in the system. Only users with a credit card may have a pro plan, but after a password reset, they “have” a credit card, in the sense that there is data there, but it isn’t useful data. And you can’t check, unless you had the user provide you with the password to get the encryption key.

It means that you need to pay more attention to the data model you have. I would suggest not trying to hide the fact that the data is encrypted behind a lazy decryption façade, but dealing with it explicitly.

time to read 4 min | 798 words

RavenDB has two separate APIs that allow you to get push notifications from the database. The first one is the Subscriptions API, which allows you to define a query such as:
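
The query from the post isn’t shown, so here is a rough sketch of creating such a subscription with the C# client; the Orders filter (and the Order class with a Company property) is an assumption:

    // 'store' is assumed to be an initialized DocumentStore
    var subscriptionName = store.Subscriptions.Create<Order>(
        order => order.Company == "companies/1-A");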

And then subscribe to it like so:
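
And a sketch of consuming it (again, not the post’s exact code; this would run inside an async method):

    var worker = store.Subscriptions.GetSubscriptionWorker<Order>(subscriptionName);

    await worker.Run(batch =>
    {
        foreach (var item in batch.Items)
        {
            // item.Result is the full Order document
            Console.WriteLine($"Processing {item.Id}");
        }
    });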

RavenDB will now push batches of orders that match your query to the client. This is done in a reliable manner. If the client fails for any reason, it can reconnect and resume from where it left off. If the server fails, the cluster will automatically reassign the work to another node and the client will pick up from where it left off. The subscription is also persistent, which means that whenever you connect to it, you don’t start from the beginning. After the subscription has caught up with all the documents that match the query, it isn’t over. Instead, the client will wait for new or updated documents to come in so the server can push them immediately. The typical latency between a document change and the subscription processing it is about twice the ping time between the client and server (so on the order of milliseconds). Only a single client at a time can have a particular subscription open, but multiple clients can contend on the subscription. One of them will win and the others will wait for the subscription to become available (when the first client stops, fails, crashes, etc.).

This makes subscriptions highly suitable for business processing. It is reliable, you already have high availability on the server side, and you can easily add that on the client side. You can use complex queries and do quite a bit of work on the database side, before it ever reaches your code. Subscriptions also allow you to run queries over revisions, so instead of getting the current state of the document, you’ll be called with the (prev, current) tuple on any document change. That gives you even more power to work with.

On the other hand, subscriptions require RavenDB to manage quite a bit of (distributed) state and as such consume resources at the cluster level.

The Changes API, on the other hand, has a very different model. Let’s look at the code first, and then discuss it in detail:
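
A minimal sketch of what that code can look like with the C# client; the collection name is an assumption, and depending on the client version you may need a small IObserver wrapper (or the Reactive Extensions) for the lambda Subscribe call:

    var subscription = store.Changes()
        .ForDocumentsInCollection("Orders")
        .Subscribe(change =>
        {
            // we only get the document id and the type of change, not the document itself
            Console.WriteLine($"{change.Type}: {change.Id}");
        });

    // dispose the subscription when you no longer care about the notifications
    // subscription.Dispose();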

As you can see, we can subscribe to changes on a document or a collection. We actually have quite a few events that we can respond to: a document change (by id, prefix or collection), an index (created / removed, indexing batch completed, etc.), an operation (created / status changed / completed), a counter (created / modified), etc.

There are some things that can be seen even from just this little bit of code. The Changes API is not persistent. That means that if you restart the client and reconnect, you won’t get anything that already happened. This is intended for ongoing usage, not for critical processing. You also cannot do any complex queries with changes. You have the filters that are available and that is it. Another important distinction is that with the Subscription API, you are getting the document (and can also include additional ones), but with the Changes API, you’re getting the document id only.

The most common scenario for the Changes API is to implement this:

[screenshot: notifying the user that the document they are editing has changed]

Whenever a user is editing a particular document, you’ll subscribe to the document and if it changed behind the scenes, you can notify the user about this so they won’t continue to edit the document and get an optimistic concurrency error on save.

The Changes API is also used internally by RavenDB to implement a lot of features in the Studio and for tracking long running operations from the client. It is lightweight and requires very few resources from the server (and none from the cluster). On the other hand, it is meant to be a best effort feature. If the Changes connection has failed, the client will transparently reconnect to the server and re-subscribe to all the pending subscriptions. However, any changes that happened while the client was not connected are lost.

The two APIs are very similar on the surface; both of them allow you to get push notifications from RavenDB, but their usage scenarios and features are very different. The Changes API is literally that: it is meant to allow you to watch for changes, probably because you have a human sitting there and looking at things. It is meant to be an additional feature, not a guarantee. The Subscriptions API, on the other hand, is a reliable system and can ensure that you won’t miss out on notifications that matter to you.

You can read more about Subscriptions in RavenDB in the book; I dedicated a whole chapter to it.

time to read 3 min | 469 words

I was talking with a developer about their system architecture and they mentioned that they are going through some complexity at the moment. They are changing their architecture to support higher scaling needs. Their current architecture is fairly simple (a single app talking to a database), but in order to handle future growth, they are moving to a distributed microservice architecture. After talking with the dev for a while, I realized that they were in a particular industry that had a hard barrier for scale.

I’m not sure how much I can say, so let’s say that they are providing a platform to set up parties for newborns in a particular country. I went ahead and checked how many babies are born in that country each year, and the number has been pretty stable for the past decade, sitting at around 60,000 babies per year.

Remember, this company provides a specific service for newborns. And that service is only applicable for that country. And there are about 60,000 babies per year in that country. In this case, this is the time to do some math:

  • We’ll assume that all those births happen on a single month
  • We’ll assume that 100% of the babies will use this service
  • We’ll assume that we need to handle them within business hours only
  • 4 weeks x 5 business days x 8 business hours = 160 hours to handle 60,000 babies
  • 375 babies to handle per hour
  • Let’s assume that each baby requires 50 requests to handle
  • 18,750 requests / hour
  • 312 requests / minute
  • 5 requests / second

In other words, given the natural limit of their scaling (number of babies per year), and using very pessimistic accounting for the load distribution, we get to a number of requests to process that is utterly ridiculous.

It would be hard to not handle this properly on any server you care to name. In fact, you can get a machine under $150 / month that has 8 cores. That gives you a core per request per second, with 3 to spare.

Even if we have to deal with spikes of 50 requests / second, any reasonable server (the < $150 / month machine I mentioned) should be able to easily handle this.

About the only way for this system to get additional load is if there is a population explosion, at which point I assume that the developers will be busy handling nappies, not watching the CPU utilization.

For certain types of applications, there is a hard cap on the load you can be expected to handle. And you should absolutely take advantage of this. The more stuff you can get away with not doing, the better off you are. And if you can make reasonable assumptions about your load, you don’t need to go crazy.

Simpler architecture means faster time to market, meaning that you can actually deliver value, rather than trying to prepare for the Babies’ Apocalypse.

time to read 4 min | 628 words

I posted about the @refresh feature in RavenDB, explaining why it is useful and how it can work. Now, I want to discuss a possible extension to this feature. It might be easier to show than to explain, so let’s take a look at the following document:
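
The example document from the post isn’t reproduced here; based on the description below, it might look something like this. Keep in mind that the whole feature is a proposal, so the shape and the property names are illustrative only:

    {
        "Amount": 1200,
        "Status": "Pending",
        "DueDate": "2020-02-01T00:00:00.0000000Z",
        "@metadata": {
            "@collection": "LeasePayments",
            "@refresh": [
                {
                    "At": "2020-02-04T00:00:00.0000000Z",
                    "Script": "this.Amount += 50; // three days late, add a late fee"
                },
                {
                    "At": "2020-02-15T00:00:00.0000000Z",
                    "Script": "this.Status = 'PastDue';"
                }
            ]
        }
    }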

The idea is that in addition to the data inside the document, we also specify behaviors that will run at specified times. In this case, if the user is three days late in paying the rent, they’ll have a late fee tacked on. If enough time has passed, we’ll mark this payment as past due.

The basic idea is that in addition to just having a @refresh timer, you can also apply actions. And you may want to apply a set of actions, at different times. I think that the lease payment processing is a great example of the kind of use cases we envision for this feature. Note that when a payment is made, the code will need to clear the @refresh array, to avoid it being run on a completed payment.

The idea is that you can apply operations to the documents at a future time, automatically. This is a way to enhance your documents with behaviors and policies with ease. The idea is that you don’t need to setup your own code to execute this, you can simply let RavenDB handle it for you.

Some technical details:

  • RavenDB will take the time from the first item in the @refresh array. At the specified time, it will execute the script, passing it the document to be modified. The @refresh item we are executing will be removed from the array. And if there are additional items, the next one will be scheduled for execution.
  • Only the first element in the @refresh array is considered. So if the items aren’t sorted by date, the first one will be executed and the document persisted again. The next one (which was earlier than the first one) will then already be ready for execution, so it will be run on the next tick.
  • Once all the items in the @refresh array have been processed, RavenDB will remove the @refresh metadata property.
  • Modifications to the document because of the execution of @refresh scripts are going to be handled as normal writes. It is just that they are executed by RavenDB directly. In other words, features such as optimistic concurrency, revisions and conflicts are all going to apply normally.
  • If any of the scripts cause an error to be raised, the following will happen:
    • RavenDB will not process any future scripts for this document.
    • The full error information will be saved into the document with the @error property on the failing script.
    • An alert will be raised for the operations team to investigate.
  • The scripts can do anything that a patch script can do. In other words, you can put(), load(), del() documents in here.
  • We’ll also provide a debugger experience for this in the Studio, naturally.
  • Amusingly enough, the script is able to modify the document, which obviously includes the @refresh metadata property. I’m sure you can imagine some interesting possibilities for this.

We also considered another option (look at the Script property):
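
Again a purely illustrative sketch: here the Script property references a script stored elsewhere instead of embedding it inline (the reference format is an assumption):

    "@refresh": [
        {
            "At": "2020-02-04T00:00:00.0000000Z",
            "Script": "scripts/lease-payments/late-fee"
        }
    ]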

The idea is that instead of specifying the script to run inline, we can reference a property on a document. The advantage is that we can apply changes globally much more easily. We can fix a bug in the script once. The disadvantage here is that you may be modifying a script for new values, but not accounting for the old documents that may be referencing it. I’m still in two minds about whether we should allow a script reference like this.

This is still an idea, but I would like to solicit your feedback on it, because I think that this can add quite a bit of power to RavenDB.

time to read 5 min | 921 words

An old adage about project managers is that they are people who believe that you can get 9 women together and get a baby in a single month. I told that to my father once and he laughed so much we almost had to call paramedics. The other side of this saying is that you can get nine women and get nine babies in nine months. This is usually told in terms of latency vs. capacity. In other words, you have to wait for 9 months to get a baby, but you can get any number of babies you want in 9 months. Baby generation is an embarrassingly parallel problem (I would argue that the key here is embarrassingly). Given a sufficient supply of pregnant women (a problem I’ll leave to the reader), you can get any number of babies you want.

We are in the realm of project management here, mind, so this looks like a great idea. We can set things up so we’ll have parallel work and get to the end with everything we wanted. Now, there is a term for nine babies, it seems: nonuplets.

I believe it is pronounced: No, NO, @!#($!@#.

A single baby is a lot of work, a couple of them is a LOT of work, three together is LOT^2. And I don’t believe that we made sufficient advances in math to compute the amount of work and stress involved in having nine babies at the same time. It would take a village, or nine.

This is a mostly technical blog, so why am I talking about babies? Let’s go back to the project manager for a bit. We can’t throw resources at the problem to shorten the time to completion (9 women, 1 month, baby). We can parallelize the work (9 women, 9 months, 9 babies), though. The key observation here, however, is that you probably don’t want to get nine babies all at once. That is a LOT of work. Let’s consider the point of view of the project manager. In this case, we have a sufficient supply of people to do the work, and we have 9 major features that we want done. We can’t throw all the people at one feature and get it done in 1 month. See The Mythical Man-Month for details, as well as pretty much any other research on the topic.

We can create teams for each feature, and given that we have no limit to the number of people working on this, we can deliver (pun intended) all the major features at the right time frame. So in nine months, we are able to deliver nine major features. At least, that is the theory.

In practice, in nine months, the situation for the project is going to look like this:

[image: the project at integration time]

In other words, you are going to spend as much time trying to integrate nine major features as you’ll be changing diapers for nine newborn babies. I assume that you don’t have experience with that (unless you are working in day care), but that is a lot.

Leaving aside the integration headache, there are also other considerations that the project manager needs to deal with. For example, documentation for all the new features (and their intersections).

Finally, there is the issue of marketing, release cadence and confusion. If you go with the nine babies / nine months option, you’ll have slower and bigger releases. That means that your customers will get bigger packages with more changes, making them more hesitant to upgrade. In terms of marketing, it also means that you have to try to push many new changes all at once, leading to major features just not getting enough time in the daylight.

Let’s talk about RavenDB itself. I’m going to ignore the RavenDB 4.0 release, because that was a major exception. We had to rebuild the project to match a new architecture and set of demands. Let’s look at RavenDB 4.1; the major features there were:

  1. JavaScript indexes
  2. Cluster wide transactions
  3. Migration from SQL, MongoDB and CosmosDB
  4. RavenDB Embedded
  5. Distributed Counters

For RavenDB 4.2, the major features were:

  1. Revisions Revert
  2. Pull Replication
  3. Graph queries
  4. Encrypted backups
  5. Stack trace capture on production

With five major features in each release (and dozens of smaller features), it is really hard to give a consistent message on a release.

In software, you don’t generally have the concept of inventory: stuff that you have already paid for but haven’t yet sold to customers. Unreleased features, on the other hand, are exactly that. Development has been paid for, but until the software has been released, you are not going to be able to see any benefits of it.

With future releases of RavenDB, we are going to reduce the number of major features that we are going to be working on per release. Instead of spreading ourselves across many such features, we are going to try to focus on one or two only per release. We’re also going to reduce the scope of such releases, so instead of doing a release every 6 – 8 months, we will try to do a release every 3 – 4.

For 5.0, for example, the major feature we are working on is time series. There are other things that are already in 5.0, but there are no additional major features, and as soon as we properly complete the time series work, we’ll consider 5.0 ready to ship.

time to read 2 min | 271 words

After a long journey, I have an actual data structure implemented. I only lightly tested it, and didn’t really do too much with it. In fact, as it currently stands, I didn’t even implement a way to delete the table. I relied on closing the process to release the memory.

It sounds like a silly omission, right? Something that is easily fixed. But I ran into a tricky problem with implementing this. Let’s write the simplest free method we can:

Simple enough, no? But let’s look at one setup of the table, shall we?

As you can see, I have a list of buckets, each of them pointing to a page. However, multiple buckets may point to the same page. The code above is going to double free address 0x00748000!

I need some way to handle this properly, but I can’t actually keep track of whether I already deleted a bucket. That would require a hash table, and I’m trying to delete one. I also can’t track it in the memory that I’m going to free, because I can’t access it after free() was called. So what to do?

I thought about this for a while, and I came up with the following solution.

What is going on here? Because we may have duplicates, we first sort the buckets. We want to sort them by the value of the pointer. Then we simply scan through the list and ignore the duplicates, freeing each page only once.

There is a certain elegance to it, even if the qsort() usage is really bad, in terms of ergonomics (and performance).

time to read 3 min | 582 words

The naïve overflow handling I wrote previously kept me up at night. I really don’t like it. I finally figured out what I could do to handle this in an elegant fashion.

The idea is to:

  • Find the furthest non overflow piece from the current one.
  • Read its keys and try to assign them to its natural location.
  • If we successfully moved all non-native keys, mark the previous piece as non-overflowing.
  • Go back to the previous piece and do it all over again.

Maybe it will be better to look at it in code?

There is quite a lot that is going on here, to be frank. We call this method after we delete a value and end up with a piece that is completely empty. At this point, we scan the next pieces to see how far we have to go to find the overflow chain. We then proceed from the end of the chain backward. We try to move all the keys in the piece that aren’t native to the piece to their proper place. If we are successful, we mark the previous piece as non-overflowing, and then go back one piece and continue working.

I intentionally scan more pieces than the usual 16 limit we use for put, because I want to reduce overflows as much as possible (to improve lookup times). To reduce the search costs, we only search within the current chain, and I know that the worst case scenario for that is 29 in truly random cases.

This should amortize the cost of fixing the overflows on deletes to a high degree, I hope.

Next, we need to figure out what to do about compaction. Given that we are already doing some deletion bookkeeping when we clear a piece, I’m going to also do compaction only when a piece is emptied. For that matter, I think it makes sense to only do a page level compaction attempt when the piece we just cleared is still empty after an overflow merge attempt. Here is the logic:

Page compaction is done by finding a page’s sibling and seeing if we can merge them together. A sibling page is the page that shares the same key prefix as the current page except for a single bit. We need to check that we can actually do the compaction, which means that there are enough leaf pages, that the sizes of the two pages are small enough, etc. There are a lot of scenarios we are handling in this code. We verify this because even if we have enough space theoretically, the key distribution may cause us to avoid doing this merge.

Finally, we need to handle the most complex parts. We re-assign the buckets in the hash, then we see if we can reduce the number of buckets and eventually the amount of memory that the directory takes. The code isn’t trivial, but it isn’t really complex, just doing a lot of things:

With this, I think that I tackled the most complex pieces of this data structure. I wrote the code in C because it is fun to get out and do things in another environment. I’m pretty sure that there are bugs galore in the implementation, but that is a good enough proof of concept to do everything that I wanted it to do.

However, writing this in C, there is one thing that I didn’t handle: actually destroying the hash table. As it turns out, this is actually tricky; I’ll handle that in my next post.

time to read 5 min | 955 words

In the world of design (be it software or otherwise), being able to make assumptions is a good thing. If I can’t assume something, I have to handle it. For example, if I can assume a competent administrator, I don’t need to write code to handle a disk full error. A competent admin will never let that scenario happen, right?

In some cases, such assumptions are critical to being able to design a system at all. In physics, you’ll often run into questions involving spherical objects in a vacuum, for example. That allows us to drastically simplify the problem. But you know what they say about assuming, right? I’m not a physicist, but I think it is safe to say most applied physics doesn’t involve spherical objects in a vacuum. I am a developer, and I can tell you that if you skip handling a disk full error due to an assumption of a competent admin, you won’t pass a code review for production code anywhere.

And that leads me to the trigger for this post. We have Howard Chu, who I have quite a bit of respect for, with the following statements:

People still don't understand that dynamically growing the DB is stupid. You store the DB on a filesystem partition somewhere. You know how much free space you want to allow for the DB. Set the DB maxsize to that. Done. No further I/O overhead for growth required.

Whether you grow dynamically or preallocate, there is a maximum size of free space on your storage system that you can't exceed. Set the DB maxsize in advance, avoid all the overhead of dynamically allocating space. Remember this is all about *efficiency*, no wasted work.

I have learned quite a lot from Howard, and I very strongly disagree with the above line of thinking.

Proof by contradiction: RavenDB is capable of handling dynamically extending the disk size of the machine on the fly. You can watch it here; it’s part of a longer video, but you just need to watch it for a single minute to see how I can extend the disk size on the system while it is running and immediately make use of this functionality. With RavenDB Cloud, we monitor the disk size on the fly and extend it automatically. It means that you can start with a small disk and have it grow as your data size increases, without having to figure out up front how much disk space you’ll need. And the best part, you have exactly zero downtime while this is going on.

Howard is correct that being able to set the DB max size at the time that you open it will simplify things significantly. There is a non-trivial amount of dancing about that RavenDB has to do in order to achieve this functionality. I consider the ability to dynamically extend the size required for RavenDB to be a mandatory feature, because it simplifies the life of the operators and makes it easier to use RavenDB. You don’t have to ask the user a question that they don’t have enough information to answer very early in the process. RavenDB will Just Work, and be able to use as much of your hardware as you have available. And as you can see in the video, it is able to take advantage of flexible hardware arrangements on the fly.

I have two other issues that I disagree with Howard on:

“You know how much free space you want to allow for the DB” – that is the key assumption that I disagree with. You typically don’t know that. I think that if you are deploying an LDAP server, which is one of Howard’s key scenarios, you’ll likely have a good idea about sizing upfront. However, for most scenarios, there is really no way to tell upfront. There is also another aspect. Having to allocate a chunk of disk space upfront is a hostile act toward the user. Leaving aside the fact that you are asking a question they cannot answer (which they will resent you for), having to allocate 10GB to store a little bit of data (because the user will not try to compute an optimal value) is going to give a bad impression of the database. “Oh, you need so much space to store so little data.”

In terms of efficiencies, that means that I can safely start very small and grow as needed, so I’m never surprising the user with unexpected disk utilization or forcing them to hit arbitrary limits. For doing things like tests, ad-hoc operations or just normal unpredictable workloads, that gives you a lot of advantages.

“…avoid the overhead of dynamically allocating space” – There is complexity involved in being able to dynamically grow the space, yes, but there isn’t really much (or any) overhead. Where Howard’s code will return an ENOSPC error, mine will allocate the new disk space, map it and move on. Only when you run out of the allocated space will you run into issues. And that turns out to be rare enough. Because it is an expensive operation, we don’t do this often. We double the size of the space allocated (starting from 256KB by default) on each hit, all the way to the 1 GB mark, after which we allocate a GB range each time. What this means is that in terms of the actual information we give to the file system, we do big allocations, allowing the file system to optimize the way the data is laid out on the physical disk.
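
To make that policy concrete, here is a small sketch of the growth logic described above; the numbers come from the text, but the function itself is just an illustration and not RavenDB’s actual code:

    // each new allocation doubles, starting at 256KB, until it reaches 1GB;
    // from there on we grow by a full GB at a time
    static long NextGrowthIncrement(long previousIncrement)
    {
        const long InitialSize = 256 * 1024;           // 256 KB
        const long MaxIncrement = 1024L * 1024 * 1024; // 1 GB

        if (previousIncrement == 0)
            return InitialSize;

        return Math.Min(previousIncrement * 2, MaxIncrement);
    }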

I think that the expected use case and deployment models are very different for my databases and Howard’s, and that leads to a very different world view about what acceptable assumptions you can make.
