Ayende @ Rahien

Oren Eini, aka Ayende Rahien, is the CEO of Hibernating Rhinos LTD, which develops RavenDB, a NoSQL open source document database.

You can reach me by:

oren@ravendb.net

+972 52-548-6969

time to read 5 min | 970 words

A customer reported that on their system, they suffered from frequent cluster elections in some cases. That is usually an indication that the system resources are hit in some manner. From experience, that usually means that the I/O on the machine is capped (very common in the cloud) or that there is some network issue.

The customer was able to rule these issues out. The latency to the storage was typically well under a millisecond and the maximum latency never exceeded 5 ms. The network monitoring showed that everything was fine as well. The CPU was hovering around 7% and there was no apparent reason for the issue.

Looking at the logs, we saw very suspicious gaps in the servers' activity, but with no good reason for them. Furthermore, the customer was able to narrow the issue down to a single scenario: resetting the indexes would cause the cluster state to become unstable. And it did so with very high frequency.

“That is flat out impossible”, I said. And I meant it. Indexing workload is something that we have a lot of experience managing, and in RavenDB 4.0 we made some major changes to our indexing structure to handle this scenario. In particular, this means that indexing:

  • Runs in dedicated threads.
  • Is kept off certain cores, to leave the OS capacity to run other tasks.
  • Self-monitors and knows when to wind down, to avoid impacting system performance.
  • Runs its threads with lower priority.
  • The cluster state, on the other hand, runs with high priority.

The entire thing didn’t make sense. However… the customer did a great job of setting up an environment where they could show us: click on the reset button, and the cluster becomes unstable. So it is impossible, but it happens.

We explored a lot of things around this issue. The machines are big and run multiple NUMA nodes; maybe it was some interaction with that? It was highly unlikely, and eventually didn’t pan out, but that is one example of the things that we tried.

We set up a similar cluster on our end and gave it 10x more load than what the customer did, on a set of machines that had a fraction of the customer’s power. The cluster and the system state remained absolutely stable.

I officially declared that we were in a state of perplexation.

When we ran the customer’s own scenario on our system, we saw something, but nothing like what we saw on their end. One of the things that we usually do when we investigate resource constraint issues is to give the machines under test a lot less capability. Less memory and slower disks, for example, make it much easier to surface problems. But the worse we made the situation for the test cluster, the better the results became.

We changed things up. We gave the cluster machines with 128 GB of RAM and fast disks and tried it again. The situation immediately reproduced.

Cue facepalm sound here.

Why would giving more resources to the system cause instability in the cluster? Note that the other metrics also suffered, which made absolutely no sense.

We started digging deeper and we found the following index:
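The actual index isn’t reproduced here, but it was essentially of this shape. This is a sketch; the class, collection and property types below are illustrative, not the customer’s:

    using System.Linq;
    using Raven.Client.Documents.Indexes;

    // A sketch of the kind of index involved (names are illustrative).
    public class Items_ByState : AbstractIndexCreationTask<Item>
    {
        public Items_ByState()
        {
            // A trivial map index over a single property.
            Map = items => from item in items
                           select new
                           {
                               item.State
                           };
        }
    }

    public class Item
    {
        public string Id { get; set; }
        public object State { get; set; }   // the interesting part: what is actually stored here?
    }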

It is about as simple an index as you can imagine, and it should cause absolutely no issue for RavenDB. So what was going on? We then looked at the documents…


I would expect the State field to be a simple enum property. But it is an array that describes the state changes in the system, and it holds complex objects. The size of the array is about 450 items on average, and I saw it hit a maximum of 13,000 items.
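To give a rough idea of the shape involved, here is an illustrative sketch (the property names inside the state entries are my guesses, not the customer’s actual model):

    using System;
    using System.Collections.Generic;

    public class Item
    {
        public string Id { get; set; }

        // Not a simple enum: a full history of state changes,
        // ~450 entries on average, up to ~13,000 in the worst case.
        public List<StateChange> State { get; set; }
    }

    // A complex object per state change (fields are illustrative).
    public class StateChange
    {
        public string Status { get; set; }
        public DateTime ChangedAt { get; set; }
        public string ChangedBy { get; set; }
    }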

That helped clarify things. During indexing, we have to process the State property, and because it is an array, we index each of the elements inside it. That means that for each document, we’ll index between 400 and 13,000 items for the State field. What is more, each of those items is a complex object, which RavenDB indexes as a JSON string, so the indexing effectively generates a lot of strings. These strings are held in memory until the end of the indexing batch. So far, so good, but the problem in this case was that there were enough resources to have a big batch of documents.

That means that we would generate over 64 million string objects in one of those batches.

Enter GC, stage left.

The GC will be invoked based on how many allocations you have (in this case, a lot) and how many live objects you have (in this case, also a lot, until the indexing batch completes). However, because we ran the GC multiple times during the indexing batch, we had promoted a significant number of objects to the next generation, and Gen1 or Gen2 collections are far more expensive.

Once we knew what the problem was, it was easy to find a fix. Don’t index the State field. Given that the values that were indexed were JSON strings, it is unlikely that the customer actually queried on them (later confirmed by talking to the customer).

On the RavenDB side, we added monitoring for the allocation frequency and will close the indexing batch early to prevent handing the GC too much work all at once.

The reason we failed to reproduce this on the lower end machines was simple: RavenDB was already using enough memory that we closed the batch early, before we could gather enough objects to make the GC really work hard. When running on a big machine, it had time to get the ball rolling and then hand the whole big mess to the GC for cleanup.

time to read 2 min | 213 words

I was asked what the meaning of default(object) is in the following piece of code:
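The original snippet isn’t shown here, but it was along these lines; this is my reconstruction of the construct, not the actual code:

    // Creating an anonymous object where one of the fields should simply be null:
    var entry = new
    {
        Name = "users/1",
        Parent = default(object)   // the field exists, its type is object, its value is null
    };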

The code is something that you’ll see a lot in RavenDB indexes, but I understand why it is a strange construct. The default(object) is a way to express null. It asks the C# compiler to produce the default value of the object type, which is null.

So why not simply say null there?

Look at the code: we aren’t setting a field here, we are creating an anonymous object. When we set a field to null, the compiler can tell what the type of the field is from the class definition and check that the value is appropriate. You can’t set null into a Boolean property, for example.

With anonymous objects, the compiler needs to know what the type of the field is, and null doesn’t provide this information. You can use the (object)null construct, which has the same meaning as default(object), but I find the latter syntactically more pleasant to read.

It may make more sense if you’ll look at the following code snippet:
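Here is a small reconstruction of the kind of snippet the post had in mind; only the last two lines actually compile:

    // var a = new { Value = null };            // does not compile: the compiler cannot infer the field type from null
    var b = new { Value = default(object) };    // compiles: the field type is object, the value is null
    var c = new { Value = (object)null };       // also compiles, same meaning, just less pleasant to read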

This technique is probably only useful if you deal with anonymous objects a lot. That is something that you do frequently with RavenDB indexes, which is how I ran into this syntax.

time to read 3 min | 577 words

We got a feature request that we don’t intend to implement, but I thought the reasoning was interesting enough for a blog post. The feature request:

If there is a critical error or major issue with the current state of the database, for instance when the data is not replicated from Node C to Node A due to some errors in the database or network it should send out mail to the administrator to investigate on  the issue. Another example is, if the database not active due to some errors then it should send out mail as well.

On its face, the request is very reasonable. If there is an error, we want to let the administrator know about it, not hide it in some log file. Indeed, RavenDB has the concept of alerts for just that reason, to surface any issues directly to the admin ahead of time. We also have a mechanism in place that alerts the admin without requiring them to check in with the RavenDB Studio manually: SNMP. The Simple Network Management Protocol is designed specifically to enable this kind of monitoring, and RavenDB exposes a lot of state via SNMP that you can act upon in your monitoring system.

Inside your monitoring system, you can define rules that will alert you. Send an SMS if the disk space is low, or email on an alert from RavenDB, etc. The idea of actively alerting the administrator is something that you absolutely want to have.

Having RavenDB send those emails, not so much. RavenDB exposes monitoring endpoints and alerts; it doesn’t act or report on them. That is the role of your actual monitoring system. You can set up Zabbix or talk to your Ops team, which likely already has one installed.

Let’s talk about the reason that RavenDB isn’t a monitoring system.

Sending email is actually really hard. What sort of email provider do you use? What options are required to set up a connection? Do you need an X509 certificate or a user/pass combo? What happens if we can’t send the email? That is leaving aside the fact that actually getting the email delivered is hard enough. Spam, SPF, DKIM and DMARC are just where things start. In short, that is a lot of complications that we’ll have to deal with.

For that matter, what about SMS integration? Surely that would also help. But no one uses SMS today, we want WhatsApp integration, and Telegram, and … You get the point.

Then there are social issues. How will we decide if we need to send an email or not? There should be some policy, and ways to configure it. If we don’t have that, we’ll end up sending either too many emails (which will get flagged / ignored) or too few (why aren’t you telling me about XYZ issue?).

A monitoring system is built to handle those sorts of issues. It is able to aggregate reports, give you a single email with the current status, open issues for you to fix and do a whole lot more that is simply outside the purview of RavenDB. There is also the most critical alert of all: if RavenDB is down, it will not be able to report that it is down, because it is down.

The proper way to handle this is to set up integration with a monitoring system, so we’ll not be implementing this feature request.

time to read 2 min | 380 words

A sadly commonplace “attack” on applications is called “Web Parameter Tampering”. This is the case where you have a URL such as this:

https://secret.medical.info/?userid=823

And your users “hack” you using:

https://secret.medical.info/?userid=999

And get access to another user’s records.

As an aside, that might actually be considered hacking, legally speaking. Which makes me want to smash my head on the keyboard a few times.

Obviously, you need to run your security validation on parameters, but there are other reasons to avoid exposing raw identifiers to the user. If you are using an incrementing counter of some kind, creating two values might cause you to leak the rate at which your data changes. For example, a competitor might create an order once a week and track the order numbers. That will give them a good indication of how many orders there have been in that time frame.

Finally, there are other data leakage issues that you might want to take into account. For example, “users/321” means that you are likely to be using RavenDB, while “users/4383-B” means that you are using RavenDB 4.0 or higher, and “607d1f85bcf86cd799439011” means that you are using MongoDB.

A common reaction to this is to switch your ids to guids. I hate that option; it means that you are entering very unfriendly territory for the application. Guids convey no information to the developers working with the system, and they are hard to work with from a human point of view. They are also less nice for the database system to work with.

A better alternative is to simply mask the information when it leaves your system. Here is the code to do so:
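The original snippet was embedded in the post and isn’t reproduced here; what follows is a minimal sketch of the same idea. The Mask/Unmask names, the hand-rolled Base58 helper and the key/IV handling are simplifications of my own (a real implementation would, for example, embed a random IV in the output and handle leading zero bytes in the Base58 step):

    using System.Numerics;
    using System.Security.Cryptography;
    using System.Text;

    public static class IdMasker
    {
        private const string Base58Alphabet = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz";

        // AES-encrypt the id, then Base58-encode the ciphertext so it stays URL safe.
        public static string Mask(string id, byte[] key, byte[] iv)
        {
            using var aes = Aes.Create();
            aes.Key = key;
            aes.IV = iv;
            using var encryptor = aes.CreateEncryptor();
            var plain = Encoding.UTF8.GetBytes(id);
            var cipher = encryptor.TransformFinalBlock(plain, 0, plain.Length);
            return ToBase58(cipher);
        }

        // Any tampering with the masked value will typically fail here (bad length or padding).
        public static string Unmask(string masked, byte[] key, byte[] iv)
        {
            using var aes = Aes.Create();
            aes.Key = key;
            aes.IV = iv;
            using var decryptor = aes.CreateDecryptor();
            var cipher = FromBase58(masked);
            var plain = decryptor.TransformFinalBlock(cipher, 0, cipher.Length);
            return Encoding.UTF8.GetString(plain);
        }

        private static string ToBase58(byte[] data)
        {
            var value = new BigInteger(data, isUnsigned: true, isBigEndian: true);
            var sb = new StringBuilder();
            while (value > 0)
            {
                value = BigInteger.DivRem(value, 58, out var rem);
                sb.Insert(0, Base58Alphabet[(int)rem]);
            }
            return sb.ToString();
        }

        private static byte[] FromBase58(string s)
        {
            BigInteger value = 0;
            foreach (var c in s)
                value = value * 58 + Base58Alphabet.IndexOf(c);
            return value.ToByteArray(isUnsigned: true, isBigEndian: true);
        }
    }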

You can see that I’m actually using AES encryption to hide the data, and then encoding it in the Bitcoin format (Base58).

That means that an identifier such as "users/1123" will result in output such as this:

bPSPEZii22y5JwUibkQgUuXR3VHBDCbUhC343HBTnd1XMDFZMuok

The output is longer, but not overly so, and the id is even URL safe. In addition to hiding the identifier itself, we also ensure that the users cannot muck about with the value. Any change to the value will result in an error when unmasking it.

time to read 3 min | 490 words

In my previous post, I showed how you can search for a range of cleared bits in a bitmap, as part of an allocation strategy. I want to do better than just finding a range, I want to find the best range. Best is defined as:

  • Near a particular location (to increase locality).
  • Create the least amount of fragmentation.

Here is the bitmap in question: [0x80520A27102E21, 0xE0000E00020, 0x1A40027F025802C8, 0xEE6ACAE56C6C3DCC].

A visualization of the bitmap marks the 9th cell in red, the 85th in yellow and the 198th in green.

What we want is to find the following ranges:

  • single cell near the red marker (9th cell)
  • 2 cells range near the red marker (9th cell)
  • 4 cells range near the yellow marker (85th cell)
  • 7 cells range near the green marker (198th cell)

We have an existing primitive already, the find_next() function. Let’s see how we can use it to find the right range for us.

Here is how it goes: we scan forward from the beginning of the word containing the position you asked for, trying to find a range that matches the requirement. We’ll take the smallest range we can find, but as we go further from the starting point, we’ll loosen our requirements. If we can’t find anything higher, we’ll search lower than the given position. Regardless, if we find an exact match, one where we don’t need to fragment, we’ll take it if it is close enough.

Here are the results from the code (the code itself will follow shortly):

  • single cell near the 9th – 12 – as expected
  • 2 cells near the 9th – 27 – because there is a two cell hole there and we won’t fragment
  • 4 cells near 85 – 64 – I mentioned that we are scanning only forward, right? But we start the scan on a word boundary, so the 85th cell’s bit is in the 2nd word and the first 4 bits there are free.
  • 7 cells near 198 – 47 – we couldn’t find anything suitable higher, so we started to look from the bottom.

Here is the full code for this. I want to highlight a couple of functions:

Here you can see how I approach the issue. First, get the data from the start of the word containing the desired location, then see if I can get the best match. Or, if I can’t, just try to get anything.

The fact that I can play around with the state of the system and resume operations with ease makes it possible to write this code.

Note that there is a tension in allocation code between:

  • Finding the best possible match
  • The time that it takes to find the match
  • The resources consumed by the find

In particular, if I need to scan a lot of memory, I may run into issues. It is usually better to get something as soon as possible than to wait for the perfect match.
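To make that trade-off concrete, here is a deliberately simplified sketch of “best”: score every free run by its distance from the requested position plus the fragmentation it would leave behind, and take the lowest score. This is not the word-by-word scan described above (the weights are arbitrary, and it pays exactly the full-scan cost the list above warns about), but it shows the two competing concerns:

    using System;
    using System.Collections.Generic;

    static class BestFitSketch
    {
        // Enumerate every run of cleared (free) bits in the bitmap as (start, length).
        static IEnumerable<(int Start, int Length)> FreeRuns(ulong[] bitmap)
        {
            int runStart = -1;
            int total = bitmap.Length * 64;
            for (int i = 0; i < total; i++)
            {
                bool busy = (bitmap[i / 64] & (1UL << (i % 64))) != 0;
                if (!busy && runStart < 0) runStart = i;
                if (busy && runStart >= 0) { yield return (runStart, i - runStart); runStart = -1; }
            }
            if (runStart >= 0) yield return (runStart, total - runStart);
        }

        // Pick the free run that can hold `size` cells with the lowest combined cost of
        // distance from `position` and leftover fragmentation. Returns -1 if nothing fits.
        public static int FindBest(ulong[] bitmap, int position, int size)
        {
            int best = -1, bestCost = int.MaxValue;
            foreach (var (start, length) in FreeRuns(bitmap))
            {
                if (length < size) continue;
                int cost = Math.Abs(start - position) + (length - size) * 4;   // arbitrary weighting
                if (cost < bestCost) { bestCost = cost; best = start; }
            }
            return best;
        }
    }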

If you are interested in a nice mental workout, implement the find_prev function. That is somewhat of a challenge, because you need to manage the state in reverse. Hint, use: __builtin_ctzll().

time to read 2 min | 314 words

In my previous post I showed how you can find a range of cleared bits in a bitmap. I even got fancy and added features such as finding a range from a given location, etc. That was great to experiment with, but when I tried to take the next step, being able to ask questions such as “find me the next range from this location” or “get the smallest range available”, I ran into problems.

The issue was simple: I had too much state on my hands and I got lost trying to manage it all. Instead, I sat down and refactored things. I introduced an explicit notion of state:
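The structure itself isn’t reproduced here; conceptually it is something like the following sketch (the field names are my guesses, not the original):

    // The explicit search state that gets passed between calls.
    public struct RangeSearchState
    {
        public ulong[] Bitmap;     // the bitmap being scanned
        public int Index;          // which 64-bit word we are currently looking at
        public ulong Current;      // a working copy of that word, masked as bits are consumed
        public int RangeStart;     // where the range being accumulated starts (possibly in a previous word)
        public int RangeLength;    // how many cleared bits have been accumulated so far
    }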

This structure allows me to keep track of what is going on when I’m calling functions. Which means that I can set it up once, and then call find_next immediately, without needing to remember the state at the call site.

This approach also helped a lot in the actual implementation. I’m processing the bitmap one uint64 at a time, which means that I have to account for ranges that cross a word boundary. The core of the routine is the find_range_once call:

This function operates on a single word, but the state is continuous. That means that if I have a range that spans multiple words, I can manage that without additional complications. Here is how I’m using this:

We get all the ranges from the current word, then decide whether we should move to the next, etc.

Notice that I’m using the current member in the struct to hold a copy of the current word, so that I can mask the copy and move on.

Note that if the first cleared bit is at the start, I’m overflowing from ULONG_MAX back to zero. And with that, I’m able to do much more targeted operations on the bitmap, but I’ll discuss that in the next post.

time to read 3 min | 483 words

Consider the following scenario: I have a buffer and I need to allocate a certain size from it.

The buffer is divided into cells and I want to be able to find a free cell. A common way to handle this is to use a bitmap (also called a bit array or bit vector). In the case of the buffer above, we have 64 cells to deal with.

If I use a single bit to signal whether a cell is free or in use, I can handle all of that with a single uint64 value. Let’s assume that I have a big buffer and thus an array of uint64 values that represents the allocation bitmap. In the case of the buffer above, a set bit marks a busy cell, and the bitmap for it is:

  • 0x80520A27102E21
  • 0b1000010001110100000010001110010001010000010010100000000100000000
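To make the representation concrete, here is what testing and updating individual cells looks like on a single word (a C# illustration; cell 0 is the lowest bit):

    ulong bitmap = 0x80520A27102E21UL;             // the bitmap above, one bit per cell

    bool cell9Busy = (bitmap & (1UL << 9)) != 0;   // true: the 9th cell is in use
    bitmap |= 1UL << 2;                            // allocate cell 2 (set its bit)
    bitmap &= ~(1UL << 9);                         // free cell 9 (clear its bit)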

I’m going to base this code on this post by Lemire. In the post, he is finding the set bits, and I need a range of the inverse. It isn’t hard to convert Lemire’s code to a range search. The complexity is that I wanted to enable searching from an arbitrary location. Here is the code, I’ll discuss it below:
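The original code isn’t reproduced here; as a rough C# sketch of the idea (not the actual implementation): walk the set bits of a word and measure the gaps of cleared bits between them.

    using System.Collections.Generic;
    using System.Numerics;

    static class ClearedRanges
    {
        // Yield every run of cleared bits in a single 64-bit word as (start, length).
        public static IEnumerable<(int Start, int Length)> Find(ulong word)
        {
            int previousSetBit = -1;        // pretend there is a set bit just before bit 0
            ulong remaining = word;
            while (remaining != 0)
            {
                int setBit = BitOperations.TrailingZeroCount(remaining);
                int gap = setBit - previousSetBit - 1;    // cleared bits between two set bits
                if (gap > 0)
                    yield return (previousSetBit + 1, gap);
                previousSetBit = setBit;
                remaining &= remaining - 1;               // clear the lowest set bit
            }
            // A run of cleared bits after the last set bit continues into the next word,
            // so the caller deals with it when stitching words together.
        }
    }

Filtering these runs by the requested length tells you where a range of a given size can fit.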

The idea is that we use Lemire’s code to find the first set bit and see if the distance from the last set bit is enough to satisfy the range request we have. Using this function, we can tell that the 0x80520A27102E21 value has the following ranges:

  • 8 consecutive cells at: [47]
  • 7 consecutive cells at: [47]
  • 6 consecutive cells at: [14,47]
  • 5 consecutive cells at: [14,36,47]
  • 4 consecutive cells at: [1,14,36,47,51]
  • 3 consecutive cells at: [1,6,14,17,21,30,36,47,50]
  • 2 consecutive cells at: [1,3,6,14,16,18,21,27,30,36,38,42,47,49,51,53]
  • 1 consecutive cell at: [1,2,3,4,6,7,8,12,14,15,16,17,18,19,21,22,23,27,28,30,31,32,34,36,37,38,39,40,42,43,45,47,48,49,50,51,52,53,54]

Note that the function gives us quite a bit of information about the bitmap structure. We can tell how big a range we got back (so we can make a decision and get the optimal one). Here is an example of using this function to do best-fit, locality-aware allocations. We have the following buffer, composed of the following bitmap: [0x80520A27102E21, 0xE0000E00020, 0x1A40027F025802C8, 0xEE6ACAE56C6C3DCC].

What we want is to find the following ranges:

  • single cell near the red marker (9th cell)
  • 2 cells range near the red marker (9th cell)
  • 4 cells range near the yellow marker (85th cell)
  • 7 cells range near the green marker (198th cell)


We want to get the best cell range for each request, where best is defined as:

  • Near the requested range
  • Will create the least amount of fragmentation.

I’ll try to give an answer to this in my next post; see how you do in the meantime.

A word of caution: it took a while to refactor the code above to its final form, and I’ll have another post that discusses the changes as well. Consider this code a template; it hasn’t been well tested and I have rewritten it several times since this post.

time to read 1 min | 95 words

I’m going to be talking tomorrow about using RavenDB as a queuing infrastructure. You can register for the webinar here.

I’m excited about this webinar, because I get to show off multiple fun ways to use RavenDB that I don’t think people realize are possible.

RavenDB is not a queue, but it has a bunch of features that we added especially for this purpose which enable some really interesting scenarios.

See you tomorrow!

time to read 2 min | 297 words

The task today is to do some fraud detection. We have a set of a few million transaction records, each one holding details such as the customer, the merchant and the amount.

We want to check a new transaction against the history of data for this particular customer. The problem is that we have a very limited time frame to approve / reject a transaction. Trying to crawl through a lot of data about the customer may very well lead to the transaction timing out. So we prepare in advance. Here is the index I created, which summarizes information about a particular customer:
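The index itself isn’t reproduced here; it was presumably a map/reduce index along these lines. This is a sketch with illustrative class, property and aggregation choices, not the actual index from the post:

    using System.Linq;
    using Raven.Client.Documents.Indexes;

    public class Transaction
    {
        public string CustomerId { get; set; }
        public string Merchant { get; set; }
        public decimal Amount { get; set; }
    }

    // Summarize each customer's history ahead of time (names and fields are illustrative).
    public class Transactions_CustomerSummary
        : AbstractIndexCreationTask<Transaction, Transactions_CustomerSummary.Result>
    {
        public class Result
        {
            public string CustomerId { get; set; }
            public string[] Merchants { get; set; }   // the "good destinations" seen so far
            public decimal[] Amounts { get; set; }    // amounts seen so far
            public int Count { get; set; }
        }

        public Transactions_CustomerSummary()
        {
            Map = transactions => from t in transactions
                                  select new Result
                                  {
                                      CustomerId = t.CustomerId,
                                      Merchants = new[] { t.Merchant },
                                      Amounts = new[] { t.Amount },
                                      Count = 1
                                  };

            Reduce = results => from r in results
                                group r by r.CustomerId into g
                                select new Result
                                {
                                    CustomerId = g.Key,
                                    Merchants = g.SelectMany(x => x.Merchants).Distinct().ToArray(),
                                    Amounts = g.SelectMany(x => x.Amounts).Distinct().ToArray(),
                                    Count = g.Sum(x => x.Count)
                                };
        }
    }

Approving a new transaction then becomes a single, cheap query against this index, rather than a crawl over the customer's full history.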

I’ll have to admit that I’m winging all the actual logic here. I’m not sure what you’ll need for approving a transaction, but I pulled up a few things. The result of the index for a particular customer is a compact summary of their transaction history.

What is most important, however, is how quickly we can get this result back.

The idea is that we are able to gather information that can quickly determine whether the next transaction is valid. For example, we can check if the current merchant is in the list of good destinations. If so, that is a strong indication the transaction is good.

We can also check whether the amount of the transaction matches previous amounts for this customer, and run many other checks. The key here is that we can get RavenDB to do all the work in the background and immediately produce the right result for us when needed.

time to read 4 min | 796 words

There have been a couple of cases where I was working on a feature, usually a big and complex one, that made me realize that I’m just not smart enough to actually build it.

A long time ago (five or six years), I had to tackle free space handling inside of Voron. When an operation causes a page to be released, we need to register that page somewhere, so we’ll be able to reuse it later on. This doesn’t seem like too complex a feature, right? It is just a free list, what is the issue?

The problem was that the free list itself was managed as pages, and changes to the free list might cause pages to be allocated or de-allocated. This can happen while the free list operation is running. Basically, I had to make the free list code re-entrant. That was something that I tried to do for a long while, before just giving up. I wasn’t smart enough to handle that scenario properly. I could plug the leaks, but it was too complex, and it was obvious that I was going to run into additional cases where this would cause issues.

I had to use another strategy to handle this. Instead of allowing the free list to do dynamic allocations, I allocated the free list once, and then used a bitmap mode to store the data itself. That means that modifications to the free list couldn’t cause us to mutate the structure of the free list itself. The crisis was averted and the feature was saved.

I just checked, and the last meaningful change that happened for this piece of code was in Oct 2016, so it has been really stable for us.

The other feature where I realized I’m not smart enough came up recently. I was working on the documents compression feature. One of the more important aspects of this feature is the ability to retrain, which means that when compressing a document, we’ll use a dictionary to reduce the size, and if the dictionary isn’t well suited, we’ll create a new one. The first implementation used a dictionary per file range, so all the documents that are placed in a particular location would use the same dictionary. That correlates highly with documents written at the same time, and it had the benefit of ensuring that new updates to the same document would use its original dictionary. That is likely to result in good compression rates over time.

The problem was that during this process, we may find out that the dictionary isn’t suited to our needs and that we need to retrain. At this point, however, we had already computed the required space. But… the new dictionary may compress the data differently (the result could be either larger or smaller than expected). The fact that I could no longer rely on the size of the data during the save process led to a lot of heartache. The issue was made worse because we would first try to compress the value using a specific dictionary, find that we couldn’t place it in the right location, and need to put it in another place.

However, to find the new place, we need to know what size we need to allocate. And the new location may have a different dictionary, and there the compressed value is different…

Data may move inside RavenDB for a host of reasons: compaction, defrag, etc. Whenever we would move the data, we would need to recompress it, which led to a lot of complexity. I tried fighting that for a while, but it was obvious that I couldn’t manage the complexity.

I changed things around. Instead of having a dictionary per file range, I tagged the compressed value with a dictionary id. That way, each document could store the id of the dictionary that it was using. That simplified the problem space greatly, because I only needed to compress the data once, and afterward the size of the data remained the same. It meant that I had to keep a registry of the dictionaries, instead of placing a dictionary at the end of the specified file range, and it somewhat complicates recovery, but the overall system is ever so much simpler and it took a lot less effort and complexity to stabilize.

I’m not sure why just these problems showed themselves to be beyond my capabilities. But it does bring to mind a very important quote:

When I Wrote It, Only God and I Knew the Meaning; Now God Alone Knows

I also have to admit that I had no idea that this quote predates code and computing. The earliest known reference for it is from 1826.
