Ayende @ Rahien

Oren Eini, aka Ayende Rahien, is the CEO of Hibernating Rhinos LTD, which develops RavenDB, a NoSQL open source document database.

You can reach me by:

oren@ravendb.net

+972 52-548-6969


time to read 4 min | 797 words

In my last post, I talked about how to store and query time series data in RavenDB. You can query over the time series data directly, as shown here:
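A sketch of what such a query looks like in RQL (the document id and date range here are made up for illustration):

```
from Nodes
where id() = 'nodes/1-A'
select timeseries(
    from Storage
    between '2020-01-01' and '2020-02-01'
)
```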

You’ll note that we project a query over a time range for a particular document. We could also query over all documents that match a particular query, of course. One thing to note, however, is that time series queries are done on a per time series basis, and each time series belongs to a particular document.

In other words, if I want to ask a question about time series data across documents, I can’t just query for it, I need to do some prep work first. This is done to ensure that when you query, we’ll be able to give you the right results, fast.

As a reminder, we have a bunch of nodes that we record metrics of. The metrics so far are:

  • Storage – [ Number of objects, Total size used, Total storage size].
  • Network – [Total bytes in, Total bytes out]

We record these metrics for each node at regular intervals. The query above can give us space utilization over time in a particular node, but there are other questions that we would like to ask. For example, given an upload request, we want to find the node with the most free space. Note that we record the total size used and the total storage available only as time series metrics. So how are we going to be able to query on it? The answer is that we’ll use indexes. In particular, a map/reduce index, like the following:
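A rough sketch of what such an index can look like; the segment fields (DocumentId, Entries, Timestamp) follow the description below, but treat the exact shape as an approximation rather than the original definition:

```csharp
// Map - runs over the Storage time series of documents in the Nodes collection
from segment in timeseries.Nodes.Storage
let last = segment.Entries.Last()   // the most recent entry in this segment
select new
{
    Node = segment.DocumentId,
    FreeSpace = last.Values[2] - last.Values[1],  // total storage - size used
    Start = last.Timestamp
}

// Reduce - keep only the most recent measurement per node
from result in results
group result by result.Node into g
let latest = g.OrderByDescending(x => x.Start).First()
select new
{
    Node = g.Key,
    FreeSpace = latest.FreeSpace,
    Start = latest.Start
}
```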

This deserves some explanation, I think. Usually in RavenDB, the source of an index is docs.[Collection], such as docs.Users. Here, however, we are using a timeseries index, so the source is timeseries.[Collection].[TimeSeries]. In this case, we operate over the Storage timeseries on the Nodes collection.

When we create an index over a timeseries, we are exposed to some internal structural details. Each timestamp in a timeseries isn’t stored independently. That would be incredibly wasteful to do. Instead, we store timeseries together in segments. The details about how and why we do that don’t really matter, but what does matter is that when you create an index over a timeseries, you’ll be indexing the segment as a whole. You can see how the map accesses the Entries collection on the segment, gets the last entry (the most recent) and outputs it.

The other thing that is worth noticing in the map portion of the index is that we operate on the values recorded at each timestamp. In this case, Values[2] is the total amount of storage available and Values[1] is the size used. The reduce portion of the index, on the other hand, is identical to any other map/reduce index in RavenDB.

What this index does, essentially, is tell us the most up-to-date free space for each particular node. As for querying it, let’s see how that works, shall we?

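In RQL, the query would be roughly along these lines (the index name and parameter are hypothetical):

```
from index 'Nodes/FreeSpace'
where FreeSpace > $sizeToUpload
order by FreeSpace as double asc
limit 1
```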

Here we are asking for the node with the least disk space that can still contain the data we want to write. This can reduce fragmentation in the system as a whole, by ensuring that we use a best-fit method.

Let’s look at a more complex example of indexing time series data, computing the total network usage for each node on a monthly basis. This is not trivial because we record network utilization on a regular basis, but need to aggregate that over whole months.

Here is the index definition:
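A sketch of the idea, grouping each segment’s entries by year and month in the map and summing them up in the reduce (an approximation, not the original definition):

```csharp
// Map - group the entries of each Network segment by year and month
from segment in timeseries.Nodes.Network
from g in segment.Entries.GroupBy(e => new { e.Timestamp.Year, e.Timestamp.Month })
select new
{
    Node = segment.DocumentId,
    Year = g.Key.Year,
    Month = g.Key.Month,
    BytesIn = g.Sum(e => e.Values[0]),
    BytesOut = g.Sum(e => e.Values[1])
}

// Reduce - sum the partial results into a monthly total per node
from result in results
group result by new { result.Node, result.Year, result.Month } into g
select new
{
    Node = g.Key.Node,
    Year = g.Key.Year,
    Month = g.Key.Month,
    BytesIn = g.Sum(x => x.BytesIn),
    BytesOut = g.Sum(x => x.BytesOut)
}
```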

As you can see, the very first thing we do is to aggregate the entries based on their year and month. This is done because a single segment may contain data from multiple months. We then sum up the values for each month and compute the total in the reduce.

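Querying the index could then look something like this (again, the index name and values are hypothetical):

```
from index 'Nodes/NetworkUsage/ByMonth'
where Node = 'nodes/1-A' and Year = 2020
```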

The nice thing about this feature is that we are able to aggregate large amounts of data and benefit from the usual advantages of RavenDB map/reduce indexes. We have already massaged the data into the right shape, so queries on it are fast.

Time series indexes in RavenDB allow us to merge time series data from multiple documents. I could have aggregated the computation above across multiple nodes to get the total per customer, for example, so I’ll know how much to charge them at the end of the month.

I would be happy to hear about any other scenarios you can think of for using time series in RavenDB, and in particular, what kind of queries you’ll want to run on the data.

time to read 6 min | 1100 words

When it comes to security, the typical question isn’t whether they are after you, but how much. I love this paper on threat modeling, and I highly recommend it. But sometimes, you have information that you just don’t want to have. In other words, you want to store information inside the database, but without the database or application being able to read said information without a key supplied by the user.

For example, let’s assume that we need to store the credit card information of a customer. We need to persist this information, but we don’t want to know it. We need something more from the user in order to actually use it.

The point of this post isn’t actually to talk about how to store credit card information in your database; instead, it is meant to walk you through an approach that lets you keep data about a user which you can only access in the context of that user.

In terms of privacy, that is a very important factor. You don’t need to worry about a rogue DBA trawling through sensitive records or be concerned about a data leak because of an unpatched hole in your defenses. Furthermore, if you are carrying sensitive information that a third party may be interested in, you cannot be compelled to give them access to that information. You literally can’t, unless the user steps up and provides the keys.

Note that this is distinctly different from (and weaker than) end to end encryption. With end to end encryption, the server only ever sees encrypted blobs. With this approach, the server is able to access the encryption key with the assistance of the user. That means that if you don’t trust the server, you shouldn’t be using this method. Going back to the proper threat model, this is a good way to ensure privacy for your users if you need to worry about a warrant for their data. Basically, consider that one of the problems this approach is meant to solve.

When the user logs in, they have to use a password. Given that we aren’t storing the password, that means that we don’t know it. This means that we can use that as the user’s personal key for encrypting and decrypting the user’s information. I’m going to use Sodium as the underlying cryptographic library because that is well known, respected and audited. I’m using the Sodium.Core NuGet package for my code samples. Our task is to be able to store sensitive data about the user (in this case, the credit card information, but can really be anything) without being able to access it unless the user is there.

A user is identified using a password, and we use Argon2id to create the password hash. This ensures that you can’t brute force the password. So far, this is fairly standard. However, instead of asking Argon2 to give us a 16-byte key, we are going to ask it to give us a 48-byte key. There isn’t really any additional security in getting more bytes. Indeed, we are going to consider only the first 16 bytes that were returned to us as important for verifying the password. We are going to use the remaining 32 bytes as a secret key. Let’s see how this looks in code:
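A minimal sketch of the idea using the Sodium.Core package; the exact ArgonHashBinary overload and enum names are assumptions that should be checked against the library’s documentation, and the CryptoConfig shape is just what this sketch needs:

```csharp
using System;
using System.Text;
using Sodium; // Sodium.Core NuGet package

public class CryptoConfig
{
    public byte[] Salt;          // salt for the Argon2id derivation
    public byte[] PasswordProof; // first 16 bytes of the Argon2id output, used to verify the password
    public byte[] Nonce;         // nonce used when wrapping the data key
    public byte[] WrappedKey;    // the random data key, encrypted with the derived key
}

public static class UserCrypto
{
    public static (CryptoConfig Config, byte[] DataKey) Register(string password)
    {
        var salt = PasswordHash.ArgonGenerateSalt();

        // Ask Argon2id for 48 bytes: 16 to verify the password later, 32 to wrap the data key.
        // The overload and enum values below are assumptions about Sodium.Core's API surface.
        var derived = PasswordHash.ArgonHashBinary(
            Encoding.UTF8.GetBytes(password), salt,
            PasswordHash.StrengthArgon.Interactive, 48,
            PasswordHash.ArgonAlgorithm.Argon_2ID13);

        var proof = derived.AsSpan(0, 16).ToArray();
        var wrappingKey = derived.AsSpan(16, 32).ToArray();

        // The key that actually encrypts the user's data is random;
        // the password-derived key only wraps it.
        var dataKey = SodiumCore.GetRandomBytes(32);
        var nonce = SecretBox.GenerateNonce();
        var wrapped = SecretBox.Create(dataKey, nonce, wrappingKey);

        var config = new CryptoConfig
        {
            Salt = salt,
            PasswordProof = proof,
            Nonce = nonce,
            WrappedKey = wrapped
        };
        return (config, dataKey);
    }
}
```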

Here is what we are doing: we get 48 bytes from Argon2id using the password. We keep the first 16 bytes to authenticate the user next time. Then we generate a random 256-bit key and encrypt that using the last part of the output of the Argon2id call. The function returns the generated config and the encryption key. You can now encrypt data using this key as much as you want. But while we assume that the CryptoConfig is written to persistent storage, we are not keeping the encryption key anywhere but memory. In fact, this code is pretty cavalier about its usage. You’ll typically store encryption keys in locked memory only, wipe them after use, etc. I’m skipping these steps here in order to get to the gist of things.

Once we forget about the encryption key, all the data we have about the user is effectively random noise. If we want to do something with it, we have to get the user to give us the password again. Here is what the other side looks like:
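Continuing the sketch above (with the same caveats about the exact Sodium.Core calls):

```csharp
using System;
using System.Security.Cryptography; // for CryptographicOperations.FixedTimeEquals
using System.Text;
using Sodium;

// Added to the UserCrypto class from the previous snippet.
public static byte[] Login(CryptoConfig config, string password)
{
    var derived = PasswordHash.ArgonHashBinary(
        Encoding.UTF8.GetBytes(password), config.Salt,
        PasswordHash.StrengthArgon.Interactive, 48,
        PasswordHash.ArgonAlgorithm.Argon_2ID13);

    // The first 16 bytes authenticate the user (constant time comparison).
    if (!CryptographicOperations.FixedTimeEquals(derived.AsSpan(0, 16), config.PasswordProof))
        throw new UnauthorizedAccessException("Invalid password");

    // The remaining 32 bytes unwrap the random data key.
    var wrappingKey = derived.AsSpan(16, 32).ToArray();
    return SecretBox.Open(config.WrappedKey, config.Nonce, wrappingKey);
}
```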

We authenticate using the first 16 bytes, then use the other 32 to decrypt the actual encryption key and return that. Without the user’s password, we are blocked from using their data, great!

You’ll also notice that the actual key we use is random. We encrypt it using the key derived from the user’s password, but the key itself is random. Why is that? This is to enable us to change passwords. If the user wants to change the password, they’ll need to provide the old password as well as the new one. That allows us to decrypt the actual encryption key using the key from the old password and encrypt it again with the new one.

Conversely, resetting a user’s password will mean that you can no longer access the encrypted data. That is actually a feature. Leaving aside the issue of warrants for data seizure, consider the case where we use this system to encrypt credit card information. If the user resets their password, they will need to re-enter their credit card. That is great, because it means that even if you managed to reset the password (for example, by gaining access to their email), you don’t get access to the sensitive information.

With this kind of system in place, there is one thing that you have to be aware of: your code needs to (gracefully) handle the scenario of the data not being decryptable. Trying to get the credit card information and getting an error should be handled and not crash the payment processing system. It is a different mindset, because it may violate invariants in the system. Only users with a credit card may have a pro plan, but after a password reset, they “have” a credit card, in the sense that there is data there, but it isn’t useful data. And you can’t check, unless you have the user provide you with the password to get the encryption key.

It means that you need to pay more attention to the data model you have. I would suggest not trying to hide the fact that the data is encrypted behind a lazy decryption façade, but dealing with it explicitly.

time to read 4 min | 628 words

I posted about the @refresh feature in RavenDB, explaining why it is useful and how it can work. Now, I want to discuss a possible extension to this feature. It might be easier to show than to explain, so let’s take a look at the following document:
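Something along these lines; all names and values are made up for illustration, with only the @refresh array and the Script property taken from the discussion below:

```json
{
    "Lease": "leases/1-A",
    "Amount": 1200.0,
    "DueDate": "2020-02-01T00:00:00.0000000Z",
    "Status": "Pending",
    "@metadata": {
        "@collection": "Payments",
        "@refresh": [
            {
                "DateTime": "2020-02-04T00:00:00.0000000Z",
                "Script": "this.Amount += 50; // three days late: add a late fee"
            },
            {
                "DateTime": "2020-02-15T00:00:00.0000000Z",
                "Script": "this.Status = 'PastDue';"
            }
        ]
    }
}
```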

The idea is that in addition to the data inside the document, we also specify behaviors that will run at specified times. In this case, if the user is three days late in paying the rent, they’ll have a late fee tacked on. If enough time has passed, we’ll mark this payment as past due.

The basic idea is that in addition to just having a @refresh timer, you can also apply actions. And you may want to apply a set of actions, at different times. I think that the lease payment processing is a great example of the kind of use cases we envision for this feature. Note that when a payment is made, the code will need to clear the @refresh array, to avoid it being run on a completed payment.

The idea is that you can apply operations to documents at a future time, automatically. This is a way to enhance your documents with behaviors and policies with ease. You don’t need to set up your own code to execute this; you can simply let RavenDB handle it for you.

Some technical details:

  • RavenDB will take the time from the first item in the @refresh array. At the specified time, it will execute the script, passing it the document to be modified. The @refresh item we are executing will be removed from the array. And if there are additional items, the next one will be scheduled for execution.
  • Only the first element in the @refresh array is considered. So if the items aren’t sorted by date, the first one will be executed and the document persisted again. The next one (which was earlier than the first one) is then already due for execution, so it will be run on the next tick.
  • Once all the items in the @refresh array have been processed, RavenDB will remove the @refresh metadata property.
  • Modifications to the document because of the execution of @refresh scripts are going to be handled as normal writes. It is just that they are executed by RavenDB directly. In other words, features such as optimistic concurrency, revisions and conflicts are all going to apply normally.
  • If any of the scripts cause an error to be raised, the following will happen:
    • RavenDB will not process any future scripts for this document.
    • The full error information will be saved into the document with the @error property on the failing script.
    • An alert will be raised for the operations team to investigate.
  • The scripts can do anything that a patch script can do. In other words, you can put(), load(), del() documents in here.
  • We’ll also provide a debugger experience for this in the Studio, naturally.
  • Amusingly enough, the script is able to modify the document, which obviously includes the @refresh metadata property. I’m sure you can imagine some interesting possibilities for this.

We also considered another option (look at the Script property):
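Presumably something like the following, inside the document metadata, where Script points at a shared script instead of holding the code inline (the reference format is a guess):

```json
"@refresh": [
    { "DateTime": "2020-02-04T00:00:00.0000000Z", "Script": "config/scripts/late-fee" }
]
```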

The idea is that instead of specifying the script to run inline, we can reference a property on a document. The advantage is that we can apply changes globally much more easily; we can fix a bug in the script once. The disadvantage is that you may be modifying a script for new values without accounting for the old documents that may be referencing it. I’m still in two minds about whether we should allow a script reference like this.

This is still an idea, but I would like to solicit your feedback on it, because I think that this can add quite a bit of power to RavenDB.

time to read 5 min | 921 words

An old adage about project managers is that they are people who believe that you can get 9 women together and get a baby in a single month. I told that to my father once and he laughed so much we almost had to call paramedics. The other side of this saying is that you can get nine women and get nine babies in nine months. This is usually told in terms of latency vs. capacity. In other words, you have to wait for 9 months to get a baby, but you can get any number of babies you want in 9 months. Baby generation is an embarrassingly parallel problem (I would argue that the key here is embarrassingly). Given a sufficient supply of pregnant women (a problem I’ll leave to the reader), you can get any number of babies you want.

We are in the realm of project management here, mind, so this looks like a great idea. We can set things up so we’ll have parallel work and get to the end with everything we wanted. Now, there is a term for nine babies, it seems: nonuplets.

I believe it is pronounced: No, NO, @!#($!@#.

A single baby is a lot of work, a couple of them is a LOT of work, three together is LOT^2. And I don’t believe that we made sufficient advances in math to compute the amount of work and stress involved in having nine babies at the same time. It would take a village, or nine.

This is a mostly technical blog, so why am I talking about babies? Let’s go back to the project manager for a bit. We can’t throw resources at the problem to shorten the time to completion (9 women, 1 month, baby). We can parallelize the work (9 women, 9 months, 9 babies), though. The key observation here, however, is that you probably don’t want to get nine babies all at once. That is a LOT of work. Let’s consider the point of view of the project manager. In this case, we have a sufficient supply of people to do the work, and we have 9 major features that we want done. We can’t throw all the people at one feature and get it done in 1 month. See The Mythical Man-Month for details, as well as pretty much any other research on the topic.

We can create teams for each feature, and given that we have no limit to the number of people working on this, we can deliver (pun intended) all the major features at the right time frame. So in nine months, we are able to deliver nine major features. At least, that is the theory.

In practice, in nine months, the situation for the project is going to look like this:

[image]

In other words, you are going to spend as much time trying to integrate nine major features as you’ll be changing diapers for nine newborn babies. I assume that you don’t have experience with that (unless you are working in day care), but that is a lot.

Leaving aside the integration headache, there are also other considerations that the project manager needs to deal with. For example, documentation for all the new features (and their intersections).

Finally, there is the issue of marketing, release cadence and confusion. If you go with the nine babies / nine months options, you’ll have slower and bigger releases. That means that your customers will get bigger packages with more changes, making them more hesitant to upgrade. In terms of marketing, it also means that you have to try to push many new changes all at once, leading to major features just not getting enough time in the daylight.

Let’s talk about RavenDB itself. I’m going to ignore the RavenDB 4.0 release, because that was a major exception. We had to rebuild the project to match a new architecture and set of demands. Let’s look at RavenDB 4.1; the major features there were:

  1. JavaScript indexes
  2. Cluster wide transactions
  3. Migration from SQL, MongoDB and CosmosDB
  4. RavenDB Embedded
  5. Distributed Counters

For RavenDB 4.2, the major features were:

  1. Revisions Revert
  2. Pull Replication
  3. Graph queries
  4. Encrypted backups
  5. Stack trace capture on production

With five major features in each release (and dozens of smaller features), it is really hard to give a consistent message on a release.

In software, you don’t generally have the concept of inventory: stuff that you have already paid for but haven’t yet sold to customers. Unreleased features, on the other hand, are exactly that. Development has been paid for, but until the software has been released, you are not going to see any benefit from it.

With future releases of RavenDB, we are going to reduce the number of major features that we are going to be working on per release. Instead of spreading ourselves across many such features, we are going to try to focus on one or two only per release. We’re also going to reduce the scope of such releases, so instead of doing a release every 6 – 8 months, we will try to do a release every 3 – 4.

For 5.0, for example, the major feature we are working on is time series. There are other things that are already in 5.0, but there are no additional major features, and as soon as we properly complete the time series work, we’ll consider 5.0 ready to ship.

time to read 2 min | 271 words

After a long journey, I have an actual data structure implemented. I only lightly tested it, and didn’t really do too much with it. In fact, as it currently stands, I didn’t even implement a way to delete the table. I relied on closing the process to release the memory.

It sounds like a silly omission, right? Something that is easily fixed. But I ran into a tricky problem with implementing this. Let’s write the simplest free method we can:
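A minimal sketch of that naive version, under assumed struct names and layout (the real implementation differs):

```c
#include <stdlib.h>

// Assumed layout, for illustration only: the directory holds an array of
// bucket pointers, and each bucket pointer refers to an allocated page.
typedef struct hash_bucket {
    char data[8192]; // page contents (pieces, metadata, ...); size is an assumption
} hash_bucket_t;

typedef struct hash_directory {
    size_t number_of_buckets;
    hash_bucket_t* buckets[]; // note: several entries may point to the same page
} hash_directory_t;

void hash_table_free(hash_directory_t* dir) {
    for (size_t i = 0; i < dir->number_of_buckets; i++) {
        free(dir->buckets[i]); // shared pages will be freed more than once!
    }
    free(dir);
}
```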

Simple enough, no? But let’s look at one setup of the table, shall we?

As you can see, I have a list of buckets, each of which points to a page. However, multiple buckets may point to the same page. The code above is going to double free address 0x00748000!

I need some way to handle this properly, but I can’t actually keep track of whether I already freed a bucket’s page. That would require a hash table, and I’m trying to delete one. I also can’t track it in the memory that I’m going to free, because I can’t access it after free() was called. So what to do?

I thought about this for a while, and I came up with the following solution.
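In code, reusing the assumed structures from the sketch above, the idea looks roughly like this:

```c
#include <stdint.h>
#include <stdlib.h>

// qsort comparator: order the bucket pointers by their address.
static int compare_bucket_ptrs(const void* a, const void* b) {
    uintptr_t x = (uintptr_t)*(hash_bucket_t* const*)a;
    uintptr_t y = (uintptr_t)*(hash_bucket_t* const*)b;
    return (x > y) - (x < y);
}

void hash_table_free(hash_directory_t* dir) {
    // Sort so that buckets sharing the same page end up next to each other.
    qsort(dir->buckets, dir->number_of_buckets,
          sizeof(hash_bucket_t*), compare_bucket_ptrs);

    hash_bucket_t* prev = NULL;
    for (size_t i = 0; i < dir->number_of_buckets; i++) {
        if (dir->buckets[i] == prev)
            continue; // this page was already freed through another bucket
        prev = dir->buckets[i];
        free(prev);
    }
    free(dir);
}
```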

What is going on here? Because we may have duplicates, we first sort the buckets by the value of the pointer. Then we simply scan through the list and ignore the duplicates, freeing each page only once.

There is a certain elegance to it, even if the qsort() usage is really bad, in terms of ergonomics (and performance).

time to read 3 min | 582 words

The naïve overflow handling I wrote previously kept me up at night. I really don’t like it. I finally figured out what I could do to handle this in an elegant fashion.

The idea is to:

  • Find the furthest non overflow piece from the current one.
  • Read its keys and try to assign them to its natural location.
  • If we successfully moved all non-native keys, mark the previous piece as non overflowing.
  • Go back to the previous piece and do it all over again.

Maybe it will be better to look at it in code?

There is quite a lot going on here, to be frank. We call this method after we deleted a value and got a piece to be completely empty. At this point, we scan the next pieces to see how far we have to go to find the end of the overflow chain. We then proceed from the end of the chain backward. We try to move all the keys in the piece that aren’t native to the piece to their proper place. If we are successful, we mark the previous piece as non overflowing, and then go back one piece and continue working.

I intentionally scan more pieces than the usual 16 limit we use for put, because I want to reduce overflows as much as possible (to improve lookup times). To reduce the search costs, we only search within the current chain, and I know that the worst case scenario for that is 29 in truly random cases.

This should amortize the cost of fixing the overflows on deletes to a high degree, I hope.

Next, we need to figure out what to do about compaction. Given that we are already doing some deletion bookkeeping when we clear a piece, I’m going to do compaction only when a piece is emptied. For that matter, I think it makes sense to only do a page level compaction attempt when the piece we just cleared is still empty after an overflow merge attempt. Here is the logic:

Page compaction is done by finding a page’s sibling and seeing if we can merge them together. A sibling page is the page that shares the same key prefix as the current page except for a single bit. We need to check that we can actually do the compaction, which means that there are enough leaf pages, that the sizes of the two pages are small enough, etc. There are a lot of scenarios we are handling in this code. Even if we theoretically have enough space, the key distribution may cause us to avoid doing this merge.

Finally, we need to handle the most complex parts. We re-assign the buckets in the hash, then we see if we can reduce the number of buckets and eventually the amount of memory that the directory takes. The code isn’t trivial, but it isn’t really complex, just doing a lot of things:

With this, I think that I tackled the most complex pieces of this data structure. I wrote the code in C because it is fun to get out and do things in another environment. I’m pretty sure that there are bugs galore in the implementation, but that is a good enough proof of concept to do everything that I wanted it to do.

However, writing this in C, there is one thing that I didn’t handle: actually destroying the hash table. As it turns out, this is actually tricky; I’ll handle that in my next post.

time to read 3 min | 579 words

Building data structures is fun, until you need to actually implement all the non core stuff. In the previous post, we covered iteration, but now we have to deal with the most annoying of features, deletions. In some data structures, implementing deletions can take significantly more time and effort than all other work combined. Let’s see what it takes to handle deletions in the hash table as it stands.

I started things out by just scanning for the right value and removing it verbatim. Here is what this looked like:

This works. The value is removed, future modifications or queries can run, and everything Just Works. Even overflow operations will just work: if we deleted all the data from a piece, it will still be marked as overflowing, and queries / modifications will proceed to get the right value from the right place.

However, we are missing handling for overflows and compaction. Overflow inside a page happens when we can’t fit a key value pair in its natural piece (a 64 byte boundary inside the page), so we place it on a nearby piece. Compaction happens when we removed enough data that we can merge sibling pages and free a page from the system.

Let’s handle the overflow case first, because it is easier. One option we have for handling overflows is to check if there is any overflow for a page and, after freeing some memory, check the next pieces for keys that we can move to our piece. That is actually quite complex, because there are two types of such keys. The first type is keys that belong directly to the piece we removed from; the second type is keys that overflow past this piece.

In other words, let’s say that we deleted a value from piece #17. We need to check pieces 18 – 33 for keys that belong on piece #17. That is the first type. The second type is to check the next pieces for keys whose native location is earlier than piece #17. The idea is that we’ll place that data nearer its ideal location.

The problem here is that we now have to do a lot of work on deletion, and that isn’t something that I’m a fan of. One of the common use cases for deletes is massive deletes, so we would spend time re-arranging the keys, only to have them deleted immediately afterward. Instead, I think that I’ll take advantage of the organization of pieces in the hash table. Instead of handling overflows whenever a delete is issued, we’ll handle them only when a piece is emptied. That also means that we can be sure that we’ll have space for the keys we want to move.

Here is what I came up with at 2:30 AM:

I’m not happy about this, though. It does the job, but you’ll note one thing it does not do: it doesn’t clear the overflow flag. This is because overflow is a transitive property. That is, I may have moved all the keys that belong to a piece back to that piece, with no other piece holding keys that belong to it. But keys that belong to previous pieces may still be located on pieces after it. If we want to clear the overflow flag, we need to be ready to do a whole lot more.

But that is something that I’ll do at a more reasonable hour.

time to read 3 min | 563 words

I ran perf tests and memory utilization tests on my implementation and finally got it to the right place. But the API I have is pretty poor. I can put a key and value, or get the value by key. But we probably want a few more features.

I changed the put implementation to be:
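A guess at the general shape of the revised signature (names and types are assumptions, and hash_directory_t is the assumed type from the earlier sketches):

```c
#include <stdbool.h>
#include <stdint.h>

// Puts key/value into the table. If the key already existed, the previous
// value is written to *old_value and *had_old_value is set to true, which
// gives the caller an atomic replace. Returns false if the entry could not
// be stored.
bool hash_table_put(hash_directory_t** dir, uint64_t key, uint64_t value,
                    uint64_t* old_value, bool* had_old_value);
```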

This allows me to do an atomic replace and get the old value from the table. That is nice low hanging fruit. But the key feature that I want to talk about today is iteration, as you might have figured out from the post title.

I’m writing this code in C, because I find it interesting to practice in different environments, and C doesn’t really have an iteration API. So here is what I came up with:
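A sketch of that kind of API; the field and function names are a reconstruction, not necessarily the original ones:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

// The caller supplies the state, so iteration never allocates and cannot fail
// because of an allocation. hash_directory_t is the assumed type from earlier.
typedef struct hash_iteration_state {
    hash_directory_t* dir;          // the table being iterated
    size_t current_bucket_idx;      // which bucket we are currently walking
    size_t current_piece_byte_pos;  // offset of the next key inside the current piece
} hash_iteration_state_t;

void hash_table_iterate_init(hash_directory_t* dir, hash_iteration_state_t* state);

// Writes the next key/value pair and returns true, or returns false when there
// is nothing more to return.
bool hash_table_get_next(hash_iteration_state_t* state, uint64_t* key, uint64_t* value);
```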

If this was a public API I was building, I would probably want to hide the implementation details of the hash_iteration_state. Right now, I get an allocation and failure free API, because the caller is responsible for supplying the space for the state.

Here is how we can iterate using this API:
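Using the sketched API above:

```c
#include <stdio.h>

void print_all(hash_directory_t* dir) {
    hash_iteration_state_t state;
    uint64_t key, value;

    hash_table_iterate_init(dir, &state);
    while (hash_table_get_next(&state, &key, &value)) {
        printf("%llu = %llu\n", (unsigned long long)key, (unsigned long long)value);
    }
}
```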

Not too bad, right? And this is basically what you’ll get when you use yield and such in languages that support native iterations. In C, you need to manage this yourself, but I don’t think that I got too lost here.

We store the current state in the state variable, and simply traverse the data in the buckets / pieces as they come.

Looking at this code, what is missing? Error handling…

But wait, I can hear you say, how can there be errors here? There are no moving pieces that can break, surely.

Well, the caller of our API may provide some moving pieces for us. For example, consider this code:
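For example, something like this (still using the sketched API), which modifies the table while iterating over it:

```c
void double_all_values(hash_directory_t* dir) {
    hash_iteration_state_t state;
    uint64_t key, value, old;
    bool had_old;

    hash_table_iterate_init(dir, &state);
    while (hash_table_get_next(&state, &key, &value)) {
        // Writing during iteration may move entries around under the iterator.
        hash_table_put(&dir, key, value * 2, &old, &had_old);
    }
}
```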

In other words, if we iterate and modify the data, what is going to happen? Well, we may change the position of values, which will lead us to skip some values, iterate over some values twice, etc. What is worse, this may violate invariants in the code. In particular, the invariant in question is that current_piece_byte_pos always points to the start of a new key. If the data moved because of the put, this doesn’t hold true any longer.

I added protection for that by adding a version field to the directory, which is incremented whenever we call a put / replace on the directory. Then we can check whether the version has changed. The issue is how we report this. Right now, I wrote:

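A guess at the shape of that check, inside hash_table_get_next():

```c
// Assuming both the directory and the iteration state gained a version field
// as just described. If the table changed since the iteration started,
// current_piece_byte_pos may no longer point at the start of a key, so we
// simply stop the iteration.
if (state->version != state->dir->version)
    return false;
```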

I guess I could have done better by changing the return value to an int and returning a proper error code directly. This would be a perfect case for an exception, I think, since this is an edge case that should never be hit in real code. The fact that modifying the hash table will invalidate the iterator and cause it to stop working, on the other hand, might not be immediately obvious to the caller. More likely than not, though, anyone trying to write mutating code such as the one above will quickly figure out that this isn’t working and check exactly why.

Because of that, I decided to keep the bool return value, to simplify the life of our callers.

The full code is here.

time to read 1 min | 111 words

I did a video cast with Chen a few weeks ago. I think it went quite well.

Oren shares his story on how he taught himself how to code in "Prison" and became an open-source developer, coding a prison management system and how he founded Hibernating Rhinos. If you are an open-source developer or entrepreneur, you'll find in this videocast some great ideas by Oren - about how to plan ahead and fail properly, promoting an open-source project, and giving advice to people who are beginning their journey in the open-source world.

You can watch it here:

Let me know what you think.

time to read 7 min | 1260 words

In the previous post, I wrote about how I changed the structure of the hash leaf page to increase data density. I managed to get it down to the 32MB range when I’m using random keys. That is a pretty great number for memory usage, but what is the cost in terms of performance?

Well, let’s figure it out, shall we?

I added some tracing code and got the first result:

3.124000 us/op with 32.007813 MB

That is not too shabby, right? Let’s see where we are spending most of our time, shall we? I opened the profiler and got:

[image: profiler output]

Okay, that is a good point, isn’t it? Changing to release mode gives us:

1.471000 us/op with 32.007813 MB

that is much nicer, but still, profiler please…

As a side note, it actually takes less time to run the profiler than for it to analyze its output. I was looking at this for a while.


The result was… stunning:

[image: profiler output]

What is this thing? And why did it take almost 50% of my runtime?

As it turns out, I was compiling for x86, and I’m using a lot of shifts on 64 bits numbers. This _allshl seems to be part of the x86 runtime. That means that what I expected to be a cheap instruction on a register was actually a method call.

That is interesting, but easy to fix. When running in Release/x64, we get the following results:

0.723 us/op with 32.007813 MB

Okay, so we are under a microsecond per op, and very reasonable memory, good to go, right?

Well, remember that I did absolutely zero optimizations so far? What does the profiler tell us now? Here is an interesting hotspot:

[image: profiler output]

That is reasonable, we are benching this method, after all. But inside that method, we see:

[image: profiler output]

This is the part where we scan an existing piece to see if the value is inside it or not. This tells us whether we need to add a new value or update an existing one. It makes sense that this will be hot; we have to do it on each put, against the data in the piece where we want to put the new key.

There are a few ways to deal with this. We can try to move from the simple varint encoding to a more complex (and more performant) scheme. StreamVByte would probably be a good solution in terms of raw performance, but it is meant for 32-bit numbers and doesn’t play nice with being able to remove and add values from the stream easily.

I could also try to play games: instead of calling this function twice, call it once and pass both k and v. However, that is almost assuredly a false play. The varint method is small enough that it doesn’t really matter; the compiler can inline it and play its own optimizations. Also, I tried it and there was no noticeable performance change, so that idea is out.

Another way to deal with it is to reduce the number of times we call this function. And here is where things get interesting. Why is this called so much? Because during the put process, we find a page to put a value in, then in that page, we find a piece (a 64 byte range) that we will put the key and value in. When we get to the piece, we need to check the existing data to see if the key is there or not. So far, so good, but there is another factor to consider: overflows.

A piece may overflow and spill into consecutive pieces. After all, that is what allowed us to reduce the memory usage from 147MB to just 32MB in the random integers scenario. However, that also means that we may need to scan a much larger portion of the page. That explains why we are seeing so much usage of the decoding function.

Let’s look at the previous behavior, where we have no overflow at all?

0.551000 us/op with 147.320313 MB

That is a much cheaper cost, but much higher memory usage. It looks like the typical compute vs. memory trade-off, but let’s look at the actual costs:

[image: profiler output]

You’ll notice that we spend most of our time on increasing the hash table size, allocating and moving memory, etc. So even though we are faster, that isn’t a good option for us.

One thing to note: we are looking for the same key each time, and decoding all the data to find it. But we don’t actually need to do that; we already have the key, and we have encoded it into its varint form. We can do a search on the raw encoded data to find it. It won’t be good enough for the positive case (we may have a value that was encoded to the same form), but it should help for the common case of inserting a new value. If we find something with memmem(), we still need to decode the data itself and see if the pattern we found is a key or a value, but that should help.
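As a rough illustration of that idea (memmem() is a GNU extension; the names here are made up, not the post’s code):

```c
#define _GNU_SOURCE   // memmem() is a GNU extension
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

// Quick rejection test: scan the piece's raw bytes for the already-encoded key.
// A hit still has to be verified by decoding, since the same byte pattern may
// appear as part of a value, but a miss lets us skip the full varint decode.
static bool piece_may_contain_key(const uint8_t* piece, size_t piece_len,
                                  const uint8_t* encoded_key, size_t key_len) {
    return memmem(piece, piece_len, encoded_key, key_len) != NULL;
}
```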

I tested it using GCC’s implementation, and the performance dropped by almost 50%, it took 1.3 us/op! Maybe if I was using a SIMD optimized implementation, that would help, but given the kind of data we are looking for, it didn’t pan out.

Another option is to reduce the number of times we’ll try to overflow a value. Right now, if we can’t put a value in its proper place, we’ll try putting it in any of the other locations. That means that we may probe as many as 127 pieces. It also means that during put, we have to scan overflow chains. As we saw in the previous post, that can add up to scanning up to 1.8 KB of data for a single put. What happens if we limit the overflow amount?

Let’s see if we limit the overflow to 32 probes. Now it only takes 0.403 us/op, which is a huge improvement. But what about the memory size? It’s easier to look things up as a table:

Max chain    Overall Time (sec)    us/op       Size (MB)
1            0.545000              0.545000    147.320313
2            0.359000              0.359000    75.156250
4            0.372000              0.372000    55.523438
8            0.322000              0.322000    36.882813
16           0.336000              0.336000    32.226563
32           0.448000              0.448000    32.007813
64           0.596000              0.596000    32.007813
128          0.770000              0.770000    32.007813

These numbers are interesting, but let’s look at them as a graph, shall we?

[image: graph of us/op and size (MB) vs. max chain length]

We can see that the size drops sharply, the performance is best between 8 and 16 probe attempts, and all we are left to choose is the memory cost.

If we go with 8 probe attempts, we’ll pay an additional 4.875 MB, but with 16 probe attempts, we’ll use just 224KB more with a cost of 0.044 us/op more than the optimal value.

We could go to 32, of course, which gives us optimal size, with about 60% of the cost of doing the full scan. However, by paying just 224KB more, we get down to 43% of the initial cost. And that certainly seems like it is worth it.

You can find the full source code (a little bit cleaned up) here.
