Ayende @ Rahien

Oren Eini aka Ayende Rahien CEO of Hibernating Rhinos LTD, which develops RavenDB, a NoSQL Open Source Document Database.

You can reach me by:

oren@ravendb.net

+972 52-548-6969

time to read 4 min | 700 words

There is a specific scenario that I run into that could be really helped by an O(1) lookup cost on a disk persistent data structure. Voron, our storage engine library, is built on top of a whole big pile of B+Trees, which has an O(logN) lookup cost. I could use that, but I wanted to see if we could do better.

The natural thing to do when you hear about O(1) costs is to go fetch the nearest hash table, so I spent some time thinking about how to build a hash table that would be persisted to disk. My needs are relatively simple, I believe:

  • O(1) lookup (at least in the common case).
  • Able to support data sets larger than memory.
  • Mutable (writes & deletes)
  • Keys and values are limited to int64.

It doesn’t sound hard, right?

But when I started thinking about it, I ran into a pretty hard limit. If the size of the data is greater than memory, then we have to take into account data access costs. A simple approach here would be to allocate a section in the file for the hash table and use a hash to get to the right location in the file. That works if you don’t need to support mutations. But when you do, you run into a big problem. At some point, the load factor of the hash table is going to increase to the point where you need to grow it. At that point, you may need to re-hash the entire thing.

Assume that the hash table size at this point is 4 GB, you need to re-hash it to 8 GB and you have just 1 GB available. That is going to take some time and be a pretty horrible process all around. That is as far as I got when I considered directly translating an in-memory hash table to a disk-based one. I’m pretty lucky that I don’t have to do that, because there is a wealth of research on the matter.

These go back to before I was born, although B+Trees predate them by a decade or so. The key here is to use extensible hashing. The Wikipedia article is pretty good, and I especially liked the Python code there showing how things work. The original paper on the topic is also quite interesting, particularly for people who care about the details of how storage engines work.
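To make the idea concrete, here is a minimal in-memory sketch of extensible hashing. It is not the Wikipedia code and not anything from Voron; the class and method names are made up for illustration, buckets are plain dictionaries standing in for disk pages, and the directory is indexed by the low bits of Python's built-in `hash`. The point it shows is that an overflow splits only one bucket (and at most doubles a small directory of pointers) instead of re-hashing the whole table.

```python
class Bucket:
    """Stands in for one fixed-size page on disk (hypothetical layout)."""
    def __init__(self, local_depth):
        self.local_depth = local_depth
        self.items = {}


class ExtendibleHash:
    def __init__(self, bucket_capacity=4):
        self.bucket_capacity = bucket_capacity
        self.global_depth = 0
        self.directory = [Bucket(0)]  # 2**global_depth slots

    def _index(self, key):
        # Use the low `global_depth` bits of the hash as the directory slot.
        return hash(key) & ((1 << self.global_depth) - 1)

    def get(self, key, default=None):
        return self.directory[self._index(key)].items.get(key, default)

    def delete(self, key):
        # No bucket merging in this sketch; we only drop the entry.
        self.directory[self._index(key)].items.pop(key, None)

    def put(self, key, value):
        while True:
            bucket = self.directory[self._index(key)]
            if key in bucket.items or len(bucket.items) < self.bucket_capacity:
                bucket.items[key] = value
                return
            self._split(bucket)  # only this one bucket is re-hashed

    def _split(self, bucket):
        if bucket.local_depth == self.global_depth:
            # Doubling the directory copies pointers only, not data.
            self.directory = self.directory + self.directory
            self.global_depth += 1
        bucket.local_depth += 1
        high_bit = 1 << (bucket.local_depth - 1)
        sibling = Bucket(bucket.local_depth)
        # Slots that have the newly significant bit set now point at the sibling.
        for i, b in enumerate(self.directory):
            if b is bucket and i & high_bit:
                self.directory[i] = sibling
        # Redistribute the entries of the overflowing bucket between the two.
        old_items, bucket.items = bucket.items, {}
        for k, v in old_items.items():
            target = sibling if hash(k) & high_bit else bucket
            target.items[k] = v


table = ExtendibleHash(bucket_capacity=4)
for key in range(1000):
    table.put(key, key * 2)
```

A lookup is one directory access plus one bucket access, which is where the O(1) behaviour in the common case comes from. A real on-disk version would also have to persist the directory and handle keys whose hashes collide in many low bits.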

I believe that my next step is going to be playing with some codebases that implement these ideas. I decided to look at how this is done with the DBM family of systems. They are old, some of them are probably direct implementations of the extensible hashing paper, but I’m mostly interested in seeing how things fit together at this point.

All of that said, I ran into a lot of red flags along the way.

Modern B-Tree Techniques discusses the issue of B-Trees vs. Hash Indexes and comes to the conclusion that you shouldn’t bother. It covers quite a few aspects of this issue, from complexity of implementation to usage scenarios.

The Berkeley DB documentation states that for anything requiring locality of reference, B-Trees are the way to go. However, for large amounts of data, their Hash implementation uses less metadata, so it might be better. That said, this doesn’t match my expectation for the way the system will behave. Looking at this StackOverflow answer, it seems very likely that if you have a working set that is greater than memory, the hash implementation will hit page faults all the time, while the B-Tree implementation will be able to keep at least most of its metadata in memory, benefiting greatly from that.

Indeed, we have this interesting quote from Berkeley DB as well:

Hash access method is appropriate for data sets so large that not even the Btree indexing structures fit into memory. At that point, it's better to use the memory for data than for indexing structures. This trade-off made a lot more sense in 1990 when main memory was typically much smaller than today.

All in all, this seems like a nice area for me to look into. I’ll go and read some code now, and maybe I’ll write about it.

time to read 1 min | 194 words

While tracing a bug, I ended up with the following minimum reproduction:

The error you’ll get is:

Unable to sort because the IComparer.Compare() method returns inconsistent results. Either a value does not compare equal to itself, or one value repeatedly compared to another value yields different results.

The nasty thing about this is that if we had just 16 items in the array, this code would work. So this would appear to successfully work most times, and then break.

The underlying issue is that Array.Sort will use different sorting algorithms based on the size of the array to be sorted. With 16 items or fewer, it’ll use an insertion sort, but over that, an introspective sort (introsort) will be used (up to a depth limit, and then heap sort. Go read the code.).

What is key here is that our comparison function is broken. It doesn’t understand that two values can be equal. Because of that, comparing two equal values results in each of them being reported as smaller than the other. .NET detects that inconsistency and raises this error. When you know what went wrong, the fix is pretty easy:

Now we properly handle this scenario, and everything will work.
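The original C# listings did not survive here, so the following is only a Python sketch of the kind of comparison function described above, not the actual reproduction. Python's own sort will not raise the .NET error, so the sketch checks the comparison contract directly; `broken_compare`, `fixed_compare` and `is_consistent` are hypothetical names.

```python
from functools import cmp_to_key


def broken_compare(x, y):
    # Never returns 0, so two equal values are each "smaller" than the other.
    return 1 if x > y else -1


def fixed_compare(x, y):
    if x == y:
        return 0  # equal values must compare as equal
    return 1 if x > y else -1


def is_consistent(compare, values):
    """Check that every value equals itself and that a<b implies b>a."""
    for a in values:
        if compare(a, a) != 0:
            return False
        for b in values:
            if compare(a, b) != -compare(b, a):
                return False
    return True


values = [3, 1, 2, 2, 5, 1]
ordered = sorted(values, key=cmp_to_key(fixed_compare))
```

A check like `is_consistent` is cheap to run in a unit test over a handful of values, including duplicates, and catches this class of bug regardless of which algorithm the sort picks.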

time to read 6 min | 1023 words

Following my posts about search, I wanted to narrow my focus a bit and look into the details of implementing a persistent data structure inside Voron.

Voron is RavenDB’s storage engine and forms the lowest layers of RavenDB. It is responsible for speed, safety, transactions and much more. It is also a very low level piece of code, which has a lot of impact on the design and implementation.

Some of the things that we worry about when writing Voron code are:

  • Performance – reduce computation / allocations (ideally to zero) for writes.
  • Zero copies – no cost for reads.
  • Safety – concurrent transactions can operate without interfering with one another.
  • Applicability – we tend to implement low level features that enable us to do a lot more on the higher tiers of the code.
  • Scale – handling data that may be very large, millions and billions of results.

In this case, I want to look into what it would take to implement a persistent set. If I was working in memory, I would be using Set<Int64>, but when using a persistent data structure, things are more interesting. The set we use will simply record Int64 values. This is important for a bunch of reasons.

First, Int64 is big. Such values are used as file pointers, artificial ids, etc. Even though it seems limiting, we can get a lot more functionality than expected.

Second, if we are using a set of Int64, we can implement that using a bitmap. A set bit indicates that the value is in the set, which allows us to do set union, intersection and exclusion cheaply. The only problem here is that a bitmap with Int64 values is… a problem. Imagine that I have the following code:

set.Add(82_100_447_308);

We would need about 76 gigabits, roughly 9.6GB(!), of memory to hold a plain bitmap for this set. That is obviously not going to be a workable solution for us. Luckily, there are other alternatives. Roaring Bitmaps are efficient in both time and space, so that is great. We just need to have an implementation that can work with a persistent model.
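As a quick check of the arithmetic, assuming a plain bitmap with one bit for every possible value from 0 up to the largest id in the set (the function name is made up for this sketch):

```python
def plain_bitmap_bytes(max_value):
    """Bytes needed for a bitmap with one bit per value in 0..max_value."""
    bits = max_value + 1
    return (bits + 7) // 8  # round up to whole bytes


size_in_bytes = plain_bitmap_bytes(82_100_447_308)
size_in_gib = size_in_bytes / 2**30          # about 9.6 GiB
size_in_gibibits = size_in_bytes * 8 / 2**30  # about 76 Gibit
```

Either way you count it, a single stored value costs gigabytes, which is the reason a plain bitmap is off the table.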

In order to understand how I’m going to go about implementing this feature, you need to understand how Voron is built. Voron is composed of several layers: the paging layer, which manages transactions and ACID, and the data structure layer, which manages B+Trees, tables, etc.

In this case, we are implementing something at the data structure layer. The first hurdle to clear is deciding what the data should look like. On the face of it, this is a fairly simple decision; most of the decisions have already been made and outlined in the previous post. We are going to have a sorted array of segment metadata, which will host individual segments with the set bits. This works if we have a single set, but in our case, we expect lots.

If we are going to use this for storing the posting lists, we have to deal with the following scenarios (talking about the specific documents matching the terms in the index):

  1. Many such lists that have a single item (unique id, date, etc)
  2. Lots of lists that have just a few values (Customer’s field in an order, for example)
  3. Few lists that have many values (OrderCompleted: true, for example, can be safely expected to be about 99% of the total results)
  4. Many lists that have a moderate amount of values (each of the Tags options, for example)

That means that we have to think very carefully about each scenario. The third and fourth options are relatively similar and can probably be best served by the roaring bitmap that we discussed. But what about the first two?

To answer that, we need to compute the metadata required to maintain the roaring set. At a minimum, we are going to have one SegmentMetadata involved, but we’ll also need an offset for that segment’s data, so that means that the minimum size involved has got to be 16 bytes (SegmentMetadata is 8 bytes, and a file offset is the same). There is also some overhead to store these values, which is 4 bytes each. So to store a single value using roaring set we’ll need:

  • 16 bytes for the segment metadata and actual segment’s offset
  • 4 bytes storage metadata for the previous line’s data
  • 2 bytes (single short value) to mark the single flipped bit
  • 4 bytes storage metadata for the segment itself

In short, we are getting to 26 bytes overhead if we just stored everything as a roaring set. Instead of doing that, we are going to try to do better and optimize the first two options (unique id and very few matches) as much as possible. We’ll set a limit of 28 bytes (which, together with the 4 bytes of storage metadata, rounds up to a nice 32 bytes). Up to that limit, we’ll simply store the document ids we have as delta encoded varints.

Let’s say that we need to store the following document id lists:

List                        Encoding
[12394]                     [234, 96]
[319333, 340981, 342812]    [229, 190, 19, 144, 169, 1, 167, 14]

You can see that the first list, which is 8 bytes in size, was encoded using merely 2 bytes. The second list, composed of three 8 byte values (24 bytes), was encoded to merely 8 bytes. Without delta encoding, that list would be encoded to: [229, 190, 19, 245, 231, 20, 156, 246, 20], an additional byte. This is because we subtract the previous number from each number, hopefully allowing us to pack the values in a much more compact manner.

With a size limit of 28 bytes, we can pack quite a few ids in the list. In my experiments, I could pack up to 20 document ids (so 160 bytes, without encoding) into that space in a realistic scenario. Of course, we may get a bad pattern, but that would simply mean that we have to build the roaring set itself.
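The encoding above can be reproduced with a few lines of Python. This is a sketch of delta encoded varints (7 bits of payload per byte, high bit set when more bytes follow), not the Voron implementation; the function names and the `INLINE_LIMIT` constant are made up here, and the ids are assumed to be sorted and non-negative.

```python
INLINE_LIMIT = 28  # bytes, the limit discussed above


def write_varint(value, out):
    # Emit 7 bits at a time, lowest bits first; 0x80 marks "more to come".
    while value >= 0x80:
        out.append((value & 0x7F) | 0x80)
        value >>= 7
    out.append(value)


def encode_ids(sorted_ids):
    out = bytearray()
    previous = 0
    for doc_id in sorted_ids:
        write_varint(doc_id - previous, out)  # store the gap, not the id
        previous = doc_id
    return bytes(out)


def decode_ids(buffer):
    ids, current, shift, previous = [], 0, 0, 0
    for byte in buffer:
        current |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            previous += current  # undo the delta
            ids.append(previous)
            current, shift = 0, 0
    return ids


def fits_inline(sorted_ids):
    return len(encode_ids(sorted_ids)) <= INLINE_LIMIT
```

For example, `list(encode_ids([319333, 340981, 342812]))` gives the 8 byte sequence from the second row above, and `decode_ids` turns it back into the original ids.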

I’m going to go ahead and do just that, and then write a post about the interesting tidbits of the code that I’ll encounter along the way.

time to read 8 min | 1407 words

I care about usability and the user experience for our users. We spend a lot of time on making sure that things are running smoothly. When we created RavenDB Cloud, I knew that it was important to create a good experience for our cloud offering.

One of the most important things that I did was to go and look at other people’s offerings and see where they failed to meet customer expectations. I recently ran into this article about AWS Elastic. Similar issues have been raised about it for a while. Avoiding them was one of my explicit design goals for our cloud offerings.

Summarizing the discussion, the following major issues seem to come up across the board.

  • Backups – Being able to have your own backup schedule and destinations. Retention policies based on your needs, etc. AWS Elastic uses hourly backups and you only get 14 days of that. Cosmos DB, to look at an Azure offering, is taking a backup every 4 hours and you have a maximum of two of them.

We have users that need to be able to go back weeks / months / years and look at the state of their database at a given point in time. The 8 hour backup window for Cosmos DB (two backups, 4 hours apart) is really short, but even 14 days on AWS Elastic is short enough that you probably need to roll some other solution for that.

With RavenDB Cloud, you have automatic backups (hourly) with a retention period that defaults to 14 days. The key here, by the way, is defaults. You are absolutely free to define your own backup policies (per database or per cluster), that means that you can set your own destinations (want to do cross cloud backups, no problem) and your own retention policies.

  • Visibility – This is a very common complaint. It showed up in pretty much all the resources that I checked and has been a common cause of complaints among the peers I sampled when we did the background research for RavenDB Cloud. In particular, having no logs or access to the debug endpoints is a killer for productivity. If you can’t tell what the problem is, you can’t fix it, so you’ll need to call support and have someone look things over. Here are a few choice quotes that I think go to the heart of things.

Here is Liz Bennett talking about things in 2017:

I feel equipped to deal with most Elasticsearch problems, given access to administrative Elasticsearch APIs, metrics and logging. AWS’s Elasticsearch offers access to none of that. Not even APIs that are read-only, such as the /_cluster/pending_tasks API.

Without access to logs, without access to admin APIs, without node-level metrics (all you get is cluster-level aggregate metrics) or even the goddamn query logs, it’s basically impossible to troubleshoot your own Elasticsearch cluster.

AWS’s Elasticsearch doesn’t provide access to any of those things, leaving you no other option but to contact AWS’s support team. But AWS’s support team doesn’t have the time, skills or context to diagnose non-trivial issues

And here is Nick Price, just a few days ago:

So your cluster resize job broke (on a service you probably chose so you wouldn’t have to deal with this stuff in the first place), so you open a top severity ticket with AWS support. Invariably, they’ll complain about your shard count or sizing and will helpfully add a link to the same shard sizing guidelines you’ve read 500 times by now. And then you wait for them to fix it. And wait. And wait. The last time I tried to resize a cluster and it locked up, causing a major production outage, it took SEVEN DAYS for them to get everything back online.

They couldn’t even tell if they’d fixed the problem and had to have me verify whether they had restored connectivity between their own systems.

I think that the reason this is the case is that AWS Elastic is a multi tenant service, so each instance is shared among multiple clients. That means that they have to limit the endpoints that you can access, to avoid leaking data from other clients.

With RavenDB Cloud, you get all the diagnostic features that you would usually get. We don’t want you to escalate things to support. The ideal scenario is that if you run into any trouble, you’ll be able to figure things out completely on your own. Hell, we do active monitoring and will contact our customers if we see outliers in the system behavior, to give them a heads up.

You can go to your RavenDB Cloud instance, pull the logs, watch ongoing traffic and even get a stack trace of the running production system. Everything is there, in the box.

  • Flexibility – The most commonly cited issue with AWS Elastic seems to be the inability to make changes in the environment. Or, rather, you can do that (increasing node count or changing the instance types), but when you do, you are going to have a full blown migration step. That means that you’ll double the number of nodes you’ll run and incur a really expensive operation to copy all the data. Given that Elastic already has this feature (add a node to an existing cluster), the decision not to support it is likely related to constraints on the AWS Elastic multi tenancy layer.

I’m not really sure what to say here, RavenDB Cloud has no such issue. To be rather more exact, our multi tenant architecture was specifically designed so the outward facing differences between how you’ll operate RavenDB on your own systems vs. how you’ll operate RavenDB on the Cloud will be minimal.

Adding and removing nodes is certainly possible. In fact, in my intro video to RavenDB Cloud I showed how I can upgrade a cluster along multiple axes while it is being actively written to, all with exactly zero interruptions in service. Client code that was busy reading from and writing to the cluster didn’t even notice. That was certainly not an easy feature to implement, but I considered this to be the baseline of what we had to offer.

  • Support – Both Nick and Liz had a poor experience with AWS Elastic support. You can read the quotes above, or read the full posts for the whole picture.

I don’t like support. We explicitly modeled the company so support is a cost center, not a source of revenue. That means that we want to close each support incident as soon as possible.  What this doesn’t mean is that we do an auto close of all issues on creation. That would give us fantastic closure rates, I imagine, but at a cost.

Instead, we have a multi layered system to deal with things. Consider the scenario Nick ran into, a full disk. That is certainly something that you might run into, no?

  1. Automatic recovery - With RavenDB,  a single full disk will simply cause that node to reject writes, nothing else. And the cluster as a whole will continue to operate normally. With RavenDB Cloud, we have a lot more control, and our monitoring systems will automatically alert on disk full and increase its size automatically behind the scenes with no impact on users.
  2. Diagnostics - Active monitoring means that you’ll be notified in the RavenDB Studio about suspicious issues (for example, you run out of IOPS) and be able to investigate them with full visibility. RavenDB does a lot of work to ensure that if something is broken, you’ll know about it.
  3. Front line support – if you need to call our support, the person answering the call is going to be able to help you. They would typically be an engineer that was involved either in the actual building / managing of RavenDB Cloud or (2nd tier) involved in the development of RavenDB itself.

My goal with a support call is to get you back to speed as soon as possible, and the usual metrics for that are measured in minutes, not days.

We are now several months post the launch of RavenDB Cloud and the pickup of customers has been great. What is more important from my end, however, is that we are seeing how this kind of investment in our architecture and setup is paying off.

time to read 5 min | 843 words

In the previous posts in this series, I explored a bit how to generate a full text index on top of the Enron data set. In particular, we looked at (rudimentary) analysis of text in the first post and looked into posting lists (list of matching documents for specific terms) in the second one. It occurred to me that we need to actually have a much better understanding of the kind of requirements that we have from posting lists in general, so let’s look at them, shall we?

  • Add to the list (increasing numbers only).
  • Iterate the list (all, or from starting point).
  • Reduce disk space and memory utilization as much as possible.

The fact that I want to be able to add to the list is interesting. The typical use case in full text search is to generate the full blown posting list from scratch every time. The typical model is to use LSM (Log-Structured Merge) and take advantage of the fact that we are dealing with sorted lists to merge them cheaply.

Iterating the list is something you’ll frequently do, to find all the matches or to merge two separate lists. Here is the kind of API that I initially had in mind:

As you can see, there isn’t much there, which is intentional. I initially thought about using this as the baseline for a couple of test implementations using StreamVByte, FastPFor as well as Gorilla compression. The problem is that there is the need to balance compression ratio with the cost of actually going over the list. Given that my test cases showed a big benefit for using Roaring Bitmaps, I decided to look at them first and see what I can get out of them.
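The original listing is not reproduced here, so the following is only a guess at the shape of such an API, based on the requirements listed earlier: add in increasing order only, and iterate either everything or from a starting point. `PostingList`, `add` and `iterate` are hypothetical names, and the in-memory list stands in for whatever compressed format ends up behind it.

```python
from bisect import bisect_left


class PostingList:
    """Minimal append-only posting list (in-memory stand-in)."""

    def __init__(self):
        self._ids = []

    def add(self, doc_id):
        # Only increasing ids are accepted, which keeps the list sorted.
        if self._ids and doc_id <= self._ids[-1]:
            raise ValueError("document ids must be added in increasing order")
        self._ids.append(doc_id)

    def iterate(self, start=0):
        # Binary search for the first id >= start, then walk forward.
        position = bisect_left(self._ids, start)
        for i in range(position, len(self._ids)):
            yield self._ids[i]


posting = PostingList()
for doc_id in (3, 7, 19, 42):
    posting.add(doc_id)
```

Keeping the surface this small is what makes it practical to swap different encodings underneath and compare them.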

Roaring Bitmaps are a way to (efficiently) store a set of bits, and they are very widely used in the industry. The default implementation is also entirely unsuitable for my purposes, mostly because it makes use of managed memory, and a hard requirement that I have placed on this series is that I want to be able to use persistent memory. In other words, I want to be able to write the data out, then be able to do everything on top of memory mapped data, without having to parse it.

Roaring Bitmaps work in the following manner. Each 64K range of integers gets its own segment of up to 8KB. Given that I’m using Voron as a persistence library, these numbers don’t work for my needs. Voron uses an 8KB page size, so we’ll drop these numbers by half. Each range will be 32K of integers and take a maximum of 4KB of disk space. This allows me to store it much more efficiently inside of Voron. Each segment, in turn, has a type. The types can be either:

  • Array – if the number of set bits in the segment is less than 2048, the data will use a simple sorted array implementation, with each value taking 2 bytes.
  • Bitmap – if the number of set bits in the segment is between 2048 and 30,720, the segment will use a total of 4096 bytes and be a standard bitmap.
  • Reversed array – if the number of set bits in the segment is higher than 30,720, we’ll store in the segment the unset bits as a sorted array.
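The three rules above translate directly into a small decision function. This is a sketch under the numbers given in the post (32K values per segment, 2 bytes per array entry); the constant and function names are made up for illustration.

```python
SEGMENT_RANGE = 32 * 1024                     # values covered by one segment
ARRAY_LIMIT = 2048                            # below this, a sorted array wins
REVERSED_LIMIT = SEGMENT_RANGE - ARRAY_LIMIT  # 30,720


def segment_type(set_bits):
    if set_bits < ARRAY_LIMIT:
        return "array"
    if set_bits <= REVERSED_LIMIT:
        return "bitmap"
    return "reversed array"


def segment_size_bytes(set_bits):
    kind = segment_type(set_bits)
    if kind == "array":
        return set_bits * 2                   # one 2 byte value per set bit
    if kind == "bitmap":
        return SEGMENT_RANGE // 8             # always 4096 bytes
    return (SEGMENT_RANGE - set_bits) * 2     # one 2 byte value per unset bit
```

Whatever the number of set bits, the segment never needs more than 4096 bytes, which is what lets it sit comfortably inside an 8KB Voron page.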

This gives us quite a few advantages:

  • It is straightforward to build this incrementally (remember that we only ever add items in the end).
  • It is quite efficient in terms of space saving in the case of sparse / busy usage.
  • It is cheap (computationally) to work with and process.
  • It is very simple to use from memory mapped file without having to parse / create managed objects.

The one thing that we still need to take into account is how to deal with the segment metadata. How do we know which segment belongs to which range? In order to handle that, we’ll define the following:

The idea is that we need to store two important pieces of information: the start location (which is always going to be a multiple of 32K) and the number of set bits (which has a maximum of 32K). Therefore, we can pack both of them into a single int64. The struct is merely there for convenience.
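The struct itself is not shown here, so the bit layout below is an assumption made for this sketch rather than the layout Voron uses: the segment index (start divided by 32K) goes in the high bits and the count of set bits in the low 16 bits. The function names are hypothetical.

```python
SEGMENT_SHIFT = 15  # segments start at multiples of 32K = 2**15
COUNT_BITS = 16     # enough for a count between 0 and 32,768 inclusive


def pack_metadata(start, set_bits):
    if start % (1 << SEGMENT_SHIFT) != 0 or not 0 <= start < 2**62:
        raise ValueError("start must be a non-negative multiple of 32K below 2**62")
    if not 0 <= set_bits <= 1 << SEGMENT_SHIFT:
        raise ValueError("set_bits must be between 0 and 32K")
    # The low 15 bits of start are always zero, so only the index is stored.
    return ((start >> SEGMENT_SHIFT) << COUNT_BITS) | set_bits


def unpack_metadata(packed):
    set_bits = packed & ((1 << COUNT_BITS) - 1)
    start = (packed >> COUNT_BITS) << SEGMENT_SHIFT
    return start, set_bits
```

One side effect of putting the start in the high bits is that sorting the packed values sorts the segments by range, which fits the sorted metadata array.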

In other words, in addition to the segments with the actual set bits, we are also going to have an array of all the segment’s metadata. In practice, we’ll also need another value here, the actual location of the segment’s data, but that is merely another int64, so that is still very reasonable.

As this is currently a mere exercise, I’m going to skip actually building the implementation, but it seems like it should be a fairly straightforward approach. I might do another post about how to actually implement this feature on Voron, because it is interesting. But I think that this is already long enough.

We still have another aspect to consider. So far, we talked only about the posting lists, but we also need to discuss the terms. But that is a topic for the next post in the series.

time to read 3 min | 551 words

In full text search terminology, a posting list is just a list of document ids. These are used to store and find matches for particular terms in the index.

I took the code from the previous post and asked it to give me the top 50 most frequent terms in the dataset and their posting lists. The biggest list had over 200,000 documents, and I intentionally use multiple threads to build things, so the actual list is going to be random from run to run (which adds a little more real-worldedness to the system*).

*Yes, I invented that term. It makes sense, so I’m sticking with it.

I took those posting lists and just dumped them to a file, in the simplest possible format. Here are the resulting files:

image

There are a few things to note here. As you can see, the file name is the actual term in the index, and the contents of the file are a sorted list of the document ids as int64 values (8 bytes each, little endian).

I’m using int64 here because Lucene uses int32 and thus has the ~2.1 billion document limit, which I want to avoid. It also makes it more fun to work with the data, because of the extra challenge.  The file sizes seem small, but the “from” file contains over 250,000 entries.

When dealing with posting lists, size matters, a lot. So let’s see what it would take to reduce the size here, shall we?

image

Simply zipping the file gives us a massive space reduction, so there is a lot left on the table, which is great.

Actually, I might have skipped a few steps:

  • Posting lists are sorted, because it helps do things like union / intersect queries.
  • Posting lists are typically only added to.
  • Removals are handled separately, with a merge step to clean this up eventually.

Because the value is sorted, the first thing I tried was to use a diff model with variable sized int. Here is the core code:

Nothing really that interesting, I have to admit, but it did cut the size of the file to 242KB, which is nice (and better than ZIP). Variable sized integers are used heavily by Lucene, so I’m very familiar with them. But there are other alternatives.
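The core code is not reproduced above, so here is a small Python sketch of the same idea applied to the file format described earlier (sorted int64 values, 8 bytes each, little endian). It is an illustration rather than the code that produced the 242KB figure, and `diff_varint_encode` is a made-up name.

```python
import struct


def diff_varint_encode(raw):
    """Re-encode sorted little endian int64 ids as varint encoded gaps."""
    out = bytearray()
    previous = 0
    for (doc_id,) in struct.iter_unpack("<q", raw):
        delta = doc_id - previous  # small for a dense, sorted list
        previous = doc_id
        while delta >= 0x80:
            out.append((delta & 0x7F) | 0x80)
            delta >>= 7
        out.append(delta)
    return bytes(out)


# Five ids stored the simple way take 40 bytes.
raw = struct.pack("<5q", 3, 10, 11, 500, 70_000)
encoded = diff_varint_encode(raw)
```

In this tiny example the gaps are 3, 7, 1, 489 and 69,500, which need 1, 1, 1, 2 and 3 bytes respectively, so 40 bytes shrink to 8. The denser the posting list, the smaller the gaps and the better this works.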

  • StreamVByte is a new one, with some impressive perf numbers, but only gets us to 282 KB (but it is possible / likely that my implementation of the code is bad).
  • FastPFor compresses the (diffed) data down to 108KB.
  • RoaringBitmap gives us a total of 64KB.

There are other methods, but they tend to go to the esoteric and not something that I can very quickly test directly.

It is important to note that there are several separate constraints here:

  • Final size on disk
  • Computational cost to generate that final format
  • Computation cost to go from the final format to the original values
  • How much (managed) memory is required during this process

That is enough for now, I believe. My next post will delve into the actual semantics that we need to implement to get good behavior from the system. This is likely going to be quite interesting.

time to read 4 min | 645 words

Full text search is a really interesting topic, which I have been dipping my toes into again and again over the years. It is a rich area of research, and there have been quite a few papers, books and articles about the topic. I have read the code of a bunch of full text search projects, and I have been using Lucene for a while.

I thought that I would write some code to play with full text search and see where that takes me. This is a side project, and I hope it will be an interesting one. The first thing that I need to do is to define the scope of work:

  • Be able to (eventually) do full text search queries
  • Compare and contrast different persistence strategies for this
  • Be able to work with multiple fields

What I don’t care about: Analysis process, actually implementing complex queries (I do want to have the foundation for them), etc.

Given that I want to work with real data, I went and got the Enron dataset. That is over 517,000 emails from Enron totaling more than 2.2 GB. This is one of the more commonly used test datasets for full text search, so that is helpful. The first thing that we need to do is to get the data into a shape that we can do something about it.

Enron is basically a set of MIME encoded files, so I’ve used MimeKit to speed up the parsing process. Here are the relevant bits of the code I’m using for getting the relevant data for the system:

As you can see, this is hardly a sophisticated approach. We are spawning a bunch of threads, processing all half million emails in parallel, selecting a few key fields and doing some very basic text processing. The idea is that we want to get to the point where we have enough information to do full text search, but without going through the real pipeline that this would take.

Here is an example of the output of one of those dictionaries:

As you can see, this is bare bones (I forgot to index the Subject, for example), but on my laptop (8 cores Intel(R) Core(TM) i7-6820HQ CPU @ 2.70GHz) with 16 GB of RAM, we can index this amount of data in under a minute and a half.

So far, so good, but this doesn’t actually get us anywhere. We need to construct an inverted index, so we can ask questions about the data and be able to find stuff out. We are already about half way there, which is encouraging. Let’s see how far we can stretch the “simplest thing that could possibly work”… shall we?

Here are the key data structures:

Basically, we have an array of fields, each of which holds a dictionary from each of the terms and a list of documents for the terms.
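The structures themselves are not reproduced here. As a rough Python equivalent of that description (the original is C#, and every name below is made up for this sketch), each field holds a dictionary from term to the list of document ids containing it. A plain dict keyed by field name stands in for the array of fields, and splitting on whitespace stands in for real analysis.

```python
from collections import defaultdict


class InvertedIndex:
    def __init__(self, field_names):
        # field name -> term -> list of document ids containing the term
        self.fields = {name: defaultdict(list) for name in field_names}

    def add(self, doc_id, field, text):
        # Very basic analysis: lowercase and split on whitespace.
        for term in set(text.lower().split()):
            self.fields[field][term].append(doc_id)

    def search(self, field, term):
        # A single dictionary lookup gives all matching documents.
        return self.fields[field].get(term.lower(), [])


index = InvertedIndex(["from", "body"])
index.add(1, "body", "Please send the HTML report")
index.add(2, "body", "The XML export failed")
index.add(3, "body", "html and xml attachments")
```

Searching is then just `index.search("body", "html")`, which is why a term lookup over the whole data set is so cheap once the index is built.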

For the full code for this stage, look at the following link, it’s less than 150 lines of code.

Indexing the full Enron data set now takes 1 minute, 17 seconds, and takes 2.5 GB in managed memory.

The key is that with this in place, if I want to search for documents that contain the term “HTML”, for example, I can do this quite cheaply. Here is how I can “search” over half a million documents to get all those that have the term HTML in them:

image

As you can imagine, this is actually quite fast.

That is enough for now, I want to start actually exploring persistence options now.

The final code bits are here. I ended up implementing stop words as well, so this is a really cool way to show off how you can do full text search in under 200 lines of code.

time to read 3 min | 474 words

RavenDB, as of 4.0, requires that the document identifier will be a string. In fact, that has always been the requirement, but in previous versions, we allowed you to pretend that this isn’t the case. That has led to… some complexities, because people had a number id in their model, but inside RavenDB that was represented as a string, always.

I just got the following question:

In my entities, can I have the Id property of any type instead string to avoid primitive obsession? I would use a generic Id<TEntity> type for ids. This type can be converted into string before saving in DB by calling ToString() and transformed from string into Id<TEntity> (when fetching from DB) by invocation of static method like public Id<TEntity> FromString(string id).

The short answer for this is that no, there is no way to do this. A document id in your model has to be a string.

The longer answer is that you can absolutely do this, but you have to understand the divergence of your entity model vs. the document model. The key is that RavenDB doesn’t actually require that your model would have an Id property. It is usually defined, because it makes things easier, but that isn’t required. RavenDB is perfectly happy managing the document key internally. Combine that with the ability to modify how documents are converted to entities, and you have a solution. Let’s look at the code…

And here is what it looks like:

image

The idea is that we customize a few things inside of RavenDB.

  • We tell the serializer that it should ignore the UserId property
  • We tell RavenDB that after creating an entity from the server, we should setup the Id property as we want it.
  • We do the same just before we store the entity in the server, just to be sure that we got the complete package.
  • We disable the usual identity generation logic for the documents we care about and tell RavenDB that it should ignore trying to set the identity property on the document on its own.

The end result is that we have an entity with a strongly typed identifier in our model. It took a bit of work, but not overly so.

That said, I would suggest that you either have a string identifier property in your model or not have one at all (either option takes no code in RavenDB). Having an identifier and jumping through hoops like that tends to make for an awkward experience. For example, RavenDB has no idea about this property, so if you need to support queries as well, you’ll need to extend the query support. It’s possible, but it shows that there is additional complexity that can be avoided.

time to read 2 min | 256 words

RavenDB is a highly concurrent distributed database. That means that we take the idea of race conditions, multiply that by network hiccups and then raise it to the power of hair pulling. Now, we have architectural structure to help with a lot of that, but sometimes you need to write and verify what happens when a particular sequence of events in a five node cluster happens. For fun, you may need to orchestrate a particular order of operations across multiple disparate processes (sometimes on different machines). As you can imagine, that is… challenging.

I wanted to give you a hint of some of the techniques that we use to handle this. We have code that looks like this, sprinkled throughout our code base (Rachis is the name of our Raft cluster implementation):

This is where a leader connects to a follower to set up their relationship:

image

This is called during leader election:

image

These methods are implemented in the following manner:

image

In other words, they will set a ManualResetEvent that we set up as part of our testing infrastructure. The code isn’t even run in a production release, but it allows us to very carefully structure the exact sequence of events that we need to expose specific behaviors in the system.
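To make the pattern concrete, here is a minimal sketch in Python, using `threading.Event` as a rough analogue of ManualResetEvent. This is not the Rachis code: `ClusterTestHooks`, `Node` and `run_scenario` are hypothetical names, and the node does nothing except record which steps ran. The point is only that the code under test signals an event at an interesting point, and the test waits on that event before driving the next step.

```python
import threading


class ClusterTestHooks:
    """Hypothetical test-only hooks; a production build would pass None."""

    def __init__(self) -> None:
        self.follower_connected = threading.Event()
        self.election_started = threading.Event()


class Node:
    """Toy node, not Rachis: it only records which steps ran."""

    def __init__(self, hooks=None) -> None:
        self._hooks = hooks
        self.log = []

    def connect_to_follower(self) -> None:
        self.log.append("connected")
        if self._hooks is not None:
            # Tell the test that this point was reached.
            self._hooks.follower_connected.set()

    def start_election(self) -> None:
        self.log.append("election")
        if self._hooks is not None:
            self._hooks.election_started.set()


def run_scenario():
    hooks = ClusterTestHooks()
    node = Node(hooks)
    worker = threading.Thread(target=node.connect_to_follower)
    worker.start()
    # The test blocks until the node reports the connection...
    reached = hooks.follower_connected.wait(timeout=5)
    worker.join()
    # ...and only then triggers the next step, so the order is deterministic.
    node.start_election()
    return reached, hooks.election_started.is_set(), node.log
```

With the hooks left as `None`, the same `Node` methods run without touching any event, which mirrors the idea that the signalling exists only for tests.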

time to read 2 min | 346 words

I ran into this post, in which the author describes how they got ERROR 1000294 from IBM DataPower Gateway as part of an integration effort. The underlying issue was that he sent JSON to the endpoint in an order that it didn’t expect.

After asking the team at the other end to fix it, the author got back an estimation of effort for 9 people for 6 months (4.5 man years!). The author then went and figured out that the fix for the error was somewhere deep inside DataPower:

Validate order of JSON? [X]

The author then proceeded to question the competency / moral integrity of the estimation.

I believe that the author was grossly unfair, at best, to the people doing the estimation. Mostly because he assumed that unchecking the box and running a single request is a sufficient level of testing for this kind of change. But also because it appears that the author never once considered why this setting might be in place.

  • The sort order of JSON has been responsible for Remote Code Execution vulnerabilities.
  • The code processing the JSON may not do that in a streaming fashion, and therefore expects the data in a particular order.
  • Worse, the code may just assume the order of the fields and access them by index. Change the order of the fields, and you may reverse the Creditor and Debtor fields.
  • The code may translate the JSON to another format and send it over to another system (likely, given the mentioned legacy system).
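The access-by-index risk from the list above is easy to reproduce. The following is a small Python sketch with made-up payloads and hypothetical function names (`parse_by_position`, `parse_by_name`); it says nothing about how DataPower or the system behind it actually works. It just shows a consumer that reads fields by position silently swapping Creditor and Debtor when the same data arrives in a different order.

```python
import json


def parse_by_position(payload: str):
    """Fragile consumer: assumes field 0 is the creditor and field 1 the debtor."""
    # object_pairs_hook=list keeps the fields as an ordered list of (name, value) pairs.
    fields = json.loads(payload, object_pairs_hook=list)
    creditor = fields[0][1]
    debtor = fields[1][1]
    return creditor, debtor


def parse_by_name(payload: str):
    """Order-independent consumer: looks the fields up by name."""
    doc = json.loads(payload)
    return doc["Creditor"], doc["Debtor"]


expected_order = '{"Creditor": "Alice", "Debtor": "Bob", "Amount": 100}'
changed_order = '{"Debtor": "Bob", "Creditor": "Alice", "Amount": 100}'
```

Here `parse_by_position(changed_order)` returns `("Bob", "Alice")` without raising any error, while `parse_by_name` returns `("Alice", "Bob")` for both payloads. Nothing fails loudly, which is exactly why every integration point has to be checked.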

The setting is there to protect the system, and unchecking that value means that you have to check every single one of the integration points (which may be several layers deep) to ensure that there isn’t explicit or implied ordering to the JSON.

In short, given the scope and size of the change: “Fundamentally alter how we accept data from the outside world”, I can absolutely see why they gave this number.

And yes, for 99% of the cases, there isn’t likely to be any difference, but you need to validate for that nasty 1% scenario.
