Oren Eini

CEO of RavenDB

a NoSQL Open Source Document Database

Get in touch with me:

oren@ravendb.net +972 52-548-6969

Posts: 7,546
|
Comments: 51,161
Privacy Policy · Terms
filter by tags archive
time to read 2 min | 333 words

RavenDB is really great in aggregations. Even if you have a stupendous amount of information, it will do really well in crunching through the data and summarizing it for you. This is due to the way RavenDB implements map/reduce operations, it allows us to instantly give you aggregation results, regardless of data size. However, this approach requires that you’ll tell RavenDB up front how you want to do the aggregation. This allows RavenDB to do the work ahead of time, as you modify the data, instead of each time you query for it.

A question was raised in the mailing list. How can you use RavenDB to do arbitrary ranges during aggregation. For example, let’s say that we want to be able to look at the total charges for a customer, but slice it so we’ll have:

  • Active – 7 days back
  • Recent  – 7 – 21 days back
  • History – 21 – 60 days back

And each user can define they own time period for Active / Recent / History ranges. That complicate things somewhat, but luckily, RavenDB has multiple ways to solve this issue. First, let’s create the appropriate index for this.

This isn’t really anything special. It will simply aggregate the data by company and by date. That isn’t enough for what we want to query. For that, we need to reach for another tool in our belt, facets.

Here is the relevant query:

image

And here is what the output looks like:

image

The idea is that we have two stages for the process. First, we use the map/reduce index to pre-aggregate the data at a daily level. Then we use facets to roll up the information to the level we desire. Instead of having to go through the raw data, we can operate on the partially aggregated data and dramatically reduce the overall cost.

time to read 2 min | 243 words

The question recently came up in a discussion with a customer. They have an existing binary storage solution that they want to migrate to RavenDB. No problems, right? RavenDB has attachments support for just this reason, after all.

The key from their perspective is that their current solution gives them a file system abstraction, and they want to keep the same concept when moving the data to RavenDB. In RavenDB, we tend to think about the data as binaries attached to documents, not as raw files. But the actual solution ended up being quite elegant.

We start by creating the following “document”.

At the moment, it is pretty bare bones, but you can add additional items here, such as owner, collaborators, etc. That depends entirely on your system and needs and doesn’t impact how we will be using this. With just this class, we can now build the API. I’ll first show the code, and then discuss it.

As you can see, there really isn’t much here. The idea is that the Folder is used to store the child folders, and the files inside the folder are attachments on the folder document. We expose the typical file system operations (put, list, get file, etc) in a simple interface.

This allows you to build transactional file system on top of RavenDB and expose a natural looking file system format. More advanced usages can be to get multiple levels of the folder tree, implementing permissions, ownership, etc.

time to read 1 min | 107 words

Webinar banner

Tomorrow I’m going to be giving a webinar about RavenDB Cloud, among the topics I’m going to cover are:

  • The type of work you can hand over to us while you put more time into your application
  • The different types of instances you can use and the resources you can provision
  • Setting up a distributed database instance and securing it in minutes
  • Provisioning a free instance to try it out
  • Ways to save money on the cloud
  • Getting BANG for your BUCK! How RavenDB performs fast on less expensive machines
  • Monitoring costs, performance, events

You can register to the webinar using the following link.

time to read 2 min | 272 words

We have a bunch of smart people whose job description does not include breaking RavenDB, but nevertheless they manage to do so on a regular basis. This is usually done while subjecting RavenDB to new and interesting ways to distress it. The latest issue was being able to crash the system with an impossible bug. That was interesting to go through, so it is worth a post.

Take a look at the following code:

This code will your application (you can discard multi threading as a source, by the way) in a particularly surprising way.

Let’s assume that we have a system that is running under low memory conditions. The following sequence of events is going to occur:

  • BufferedChannel is allocated.
  • The BufferedChannel constructor is being run.
  • The attempt to allocate the _destinations list fails.
  • An exception is thrown upward and handled.

So far, so good, right? Except that now your system is living on borrowed time.

You see, this class has a finalizer. And the fact that the constructor wasn’t called doesn’t matter to the finalizer. It will still call the Finalize method, but that method calls to Dispose, and Dispose expect to get a valid object. At this point, we are going to get a Null Reference Exception from the finalizer, which is considered to be a critical error and fail the entire process.

For additional fun, this kind of failure will happen only in under harsh conditions. When the OS refuses allocations, which doesn’t happen very often. We have been running memory starvation routines for a while, and it takes a specific set of failures to get it failing in just the right manner to cause this.

time to read 1 min | 190 words

We run into a hard failure during some diagnostics runs that was quite interesting. Here it the relevant code:

The code was meant to serve as a repository of commands so we can see in what order they run into the system. (Yes, we could use a queue, instead, but the code above isn’t what we actually run, it is just enough to reproduce the problem I wanted to talk about). And it failed, because there was already a value with the same key.

That was strange, because we didn’t expect that to happen, obviously. We wrote the following code:

And we got some really interesting results. On my laptop, I get:

Changes: 9388, Idle: 5442

We got widely variant results. Mostly, I believe, it related to the system configuration and the CPU frequency. But the basic problem is that on that particular machine, we were fast enough to be able to run commands faster than the machine could count Smile.

That is a good problem to have, I think. We fixed it by making sure our time source is monotonically increasing, which is good enough for debug code.

time to read 3 min | 564 words

We got a bug report recently about RavenDB messing up a query. The query in question was:

from index TestIndex where Amount < 0

The query, for what it is worth, is meant to find accounts that have pre-paid. The problem was that this query returned bad results. This was something very strange. Comparisons of numbers is a very well trodden field in RavenDB.

How can this be? Let’s look a bit deeper into things, shall we? Luckily the reproduction is pretty simple, let’s take a look:

With an index like this, we can be fairly certain that the query above will return no results, right? After all, this isn’t any kind of complex math. But the query will return results for this index. And that is strange.

You might have noticed that the numbers we use are decimals ( 1.5m is the C# postfix for decimal constant ). The problem repeats with decimal, double and float. When we started looking at this, we assumed that the problem was with floating point precision. After all, any comparison of floating point values will warn you about this. But using decimal, we are supposed to be protected from this.

When faced with such an issue, we dove deep into how RavenDB does numeric range comparisons. Sadly, this is not a trivial matter, but after reviewing everything, we were certain that the code is correct, it had to be something else.

I finally ended up with the following code:

And that gave me some interesting output:

00000000 - 00000000 - 00000000 - 00000000
00000000 - 00000000 - 00000000 - 80010000

0.0 == 0 ? True

And that was quite surprising.We can get the same using double of floats:

And this code gives us:

00-00-00-00-00-00-00-00
00-00-00-00-00-00-00-80
00-00-00-00
00-00-00-80

What is going on here?

Well, let’s look at the format of floating point numbers for a sec, shall we? There is a great discussion of this here, but the key observation for this particular issue can be seen by looking at the binary representation.

Here is 2.0:

 image

And is here -2.0:

image

As usual, the first bit is the sign marker. But unlike int32 or int64, with floating point, it is absolutely legal to have the following byte patterns:

image

image

The first one here is zero, and the second one is negative zero. They are equal to one another, but the binary representation is different. And that caused our issue.

Part of how RavenDB does range numeric queries on floating point it to translate them to a lexical value that can be compared using memcmp(), there is some magic involved, but the key observation was that we didn’t account for negative zero in the problem. The negative zero was translated to –9,223,372,036,854,775,808 (long.MinValue) and that obviously is smaller than zero.

The fix in our code was to handle this explicitly, like so:

And that is how I started out by chasing bugs and ended up with a sum total of negative zero.

time to read 2 min | 395 words

imageThis Sunday, our monitoring systems sent us an urgent ping. A customer instance in RavenDB Cloud was non responsive. Looking into the issue, we realized the following:

  • RavenDB on that instance was inaccessible from the outside world.
  • Azure’s metrics said that the system is fine.
  • Admin was unable to SSH to the machine.
  • Unable to run commands on the machine.

The instance was part of a cluster, and the cluster has automatically failed over all work to the rest of the actives nodes. I don’t believe that the customer using this cluster was even aware that this happened.

We reached out to Azure support, and we got the following messages:

  • There was a platform issue which has been resolved.
  • Engineers determined that instances of a backend service became unhealthy, preventing requests from completing.  Engineers determined that the issue was self-healed by the Azure platform.

There has been no outage in Azure in general on Sunday, this is just the normal status of things. Some backend thingamajig broke, and an instance went down. You might be looking at 99.999% numbers and consider them impressive, but remember that this applies to the whole system.

For one of our instances, we had a downtime of several hours. It all depends on your point of view. At scale, failure is not something that you try to avoid, it is absolutely going to happen, and you need to manage it.

In this case, RavenDB Cloud isn’t reliant on a single instance. Instead, a RavenDB cluster inside of RavenDB is distributed among multiple availability zones (or availability sets on Azure, if zones aren’t available in the region) to maximize survivability of the nodes in the cluster. And with the way RavenDB is designed, even a single node being up can mask most of the failures.

To be honest, I didn’t expect to run into such issues so soon after launching RavenDB Cloud, but we did planned for it, and I’m happy to say that in this instance, everything worked. Even under unpredictable failure scenario, everything kept on ticking.

Usually when I write such posts, I’m doing a postmortem to analyze what we did wrong. In this case, I wanted to highlight the same process that we run when stuff actually go the way we expect it to.

This makes me very happy.

time to read 4 min | 753 words

In my last post, I looked into the infrastructure of mimalloc, to be frank. A few details on how it is actually hooking the system allocator when used, mostly. I’m still using what is effectively a random scatter/gather approach to the codebase. Mostly because it is small. I’m trying to… grok it would be the best term, I guess. I’m going over the code because it also give me a good idea about practices in what seems to be a damn good C codebase.

I should mention that I drove right into the code, there is also the tech report, which I intend to read, but only after I got through enough of the code to get a good feeling for it.

I run into the code in the options.c file, for instance:

image

This is a really nice way to get configuration values from the user. What I find really interesting, to be frank, is not the actual options, which would be interesting later on, but the fact that this is such a nice way to represent things in a readable manner.

I’m doing similar things in C# (a much higher level language) to create readable options (typically using dictionary & dictionary initializer). I like how the code is able to express things so succinctly in a language with far fewer features.

However, the order of parameters is critical (is should match the mi_option_t enum values), and there is no way to express this in the code.

I also liked this code, which is part of reading configuration values from env variables:

image

I like that this is using strstr() in reverse in this manner. It is really elegant.

Going over the OS abstraction layer, on the other hand, show some granliness, take a look here:

image

I actually simplified the code abit, because it also had #if there for BSD, Linux, etc. I find it harder to follow this style, maybe adding indentation would help, but I have had to read this function multiple times, filtering for specific OSes to get it right.

I did find this tidbit, which is interesting:

image

This is attempting to do allocation with exclusive access. I wonder how this is actually used for. It looks like mimalloc is attempting to allocate in specific addresses, so that should be interesting.

Indeed, in _mi_os_reset() they will explicitly ask the OS to throw the memory away by calling MADV_FREE or MEM_RESET. I find this interesting, because this let the OS know that the memory can be thrown away, but the allocation still persists. I’m currently looking into some fragmentation issues in 32bits, which won’t be helped by this scenario. Then again, I don’t think that mimalloc is primarily intended for 32 bits systems (I can see code handling 32 bits, but I don’t think this is the primary use case or that 32 bits had a lot of attention).

The mi_os_alloc_aligned_ensured() method call is doing some interesting things. If I need a 16MB buffer, but aligned on 1MB boundary, I have no real way to ask this from the OS. So this is implemented directly by over-allocating. To be fair, I can’t think of a good reason why you’ll want to do something like that (you have no guarantees about what the actual physical memory layout would be after all, and that is the only scenario I can think this would be useful. Given that page aligned memory (which is what you get anyway from the OS) is usually sufficient, I wonder what is the use case for that here.

I get why mimalloc have to struggle with this, given that it is limited to just returning a pointer back (malloc interface), and doesn’t have a good way to tell that it played games with the alignment when you pass a value to free(). There seems to be a lot of code around here to deal with memory alignment requirements.  I think that I’ll need to go up the stack to figure out what kind of alignment requirements it has.

That is enough for now, I think. I’m going to get to the core of mimalloc in the next post.

time to read 4 min | 673 words

mimalloc is a memory allocator that is small and efficient, at least so the docs say. Which was interesting enough for me to take a look. We have had to do a lot of work in memory allocation inside RavenDB, and looking into how other people are doing that is always interesting. What was really attractive for me here was the fact that this is a small codebase, so I can go over that fairly quickly, and the amount of complexity involved is going to be limited.

As usual, I’m just going over the code, recording my impressions.

I started by looking at the API in mimalloc.h, and while it seems threatening, pretty much all of details here are around making it clear to the compiler what this code is doing, to enable additional optimizations.

image

If you ignore all of these, this is just a pretty normal function declaration with the usual malloc signature.

The code is very well commented, but it looks like it is going to take a serious inspection to actually figure out what is going on. I started by going over the header files, and they show some tantalizing details, but I’m missing context that I assume that I’ll get when I’ll go over the actual code.

The code talks about segments and blocks. I think that a segment is a chunk of memory that we allocated from the OS, and a block is a chunk of memory that we allocate to users of mimalloc. I like that there is an explicit model here of multi threading. There is a heap per thread, it seems, but you can free memory from another thread as well. This match very closely what we have done with RavenDB’s internal memory management.

I started to read the OS specific parts of the code, and I hit gold (as in, stuff that is really interesting to read) almost immediately:

image

There is a long discussion that details a lot of really fascinating details here, including certain level of “are you really going to go there?! OMG, you went there!”. For example, in order to patch malloc(), mimalloc need to patch atexit() to ensure that the process can shutdown normally.

I started reading the init.c file, and it isn’t about memory management at all, it is all about integrating mimalloc into the system, and that is quite fascinating on its own.

Here is how the memory is patched on ARM (similar code exists for X86 and X64):

image

What you see here is building of raw assembly instructions to do a task. I do wonder how this works, given that I would expect usual executable memory to be non writable, but I’ll look at that in a bit.

Then we have the actual list of functions to patch:

image

The last lines are scary, I have to admit.

And I found how the code deals with the memory protection:

image

I’m currently doing what is effectively random reads throughout the codebase, mostly because this is close to midnight and I’m not going to really be able to grok anything. I run into this function:

image

Nothing “special”, but it did lead me to some interesting articles about hashing, and what they are good for, how to build them, etc.

One of the reasons that I love doing these code reviews is that I learn so much more than what I expected to.

For example, the way mimalloc initialize its random is really quite interesting (see: _mi_random_init()) and elegant.

time to read 3 min | 441 words

The upgrade process from RavenDB 3.5 and earlier to RavenDB 4.x is not easy. This is because I made a conscious decision to not have backward compatibility between these versions. I made that decision because we had to be able to make massive changes internally in order to get to the targets that we set to ourselves. I actually discussed that decision in detail in a previous blog post and a talk.

Four years later, I still stand by that decision, but I also regret the spanner that it threw into the works. Migrating RavenDB applications to 4.x from previous versions is harder than it should. In retrospect, we probably should have invested the time in a compatibility layer that would make it easier to migrate.

I wanted to take a moment and talk about RavenDB 5.0, expected in 2020, and our plans for that release. We are going to be doing some minor cleanup of the API. Methods and classes that are marked as [Obsolete] will be removed. These tend to be at the very edge of the explored API and have been marked as such for quite some time. Beyond these change (which you’ll a clear and obvious alternative for), you aren’t going to need to do much at all.

Our goal for converting an application from RavenDB 4.x to 5.x is that the process is for 90% of the projects - Update NuGet packages, compile, you are done. For the 10%, it may mean that you need to make some minor changes. For example, change DisableEntitiesTracking to NoTracking if you are using the low level query API.

We also intend to allow at least the vast majority of operations to just work between 4.x client and 5.x server. In other words, even when you upgrade server versions, you aren’t going to have to upgrade the client version unless you want to use the new features.

There are also additional considerations that we have to take into account:

  • RavenDB now have official clients for: .NET, JVM, Go, Python, Node.JS, C++. As well as a unofficial clients.
  • RavenDB Cloud instances are maintained by us, and will be upgraded to newer versions on a regular schedule.

The cost of making a backward incompatible change at this point is too high for us to take lightly, and we are going to try very hard to avoid it. The move from 3.5 to 4.x was a one time thing that we had to do in order to continue evolving the product, not something that we plan again anytime soon.

We are also offering migration services for clients who want to move their applications from 3.x to 4.x.

FUTURE POSTS

  1. Partial writes, IO_Uring and safety - about one day from now
  2. Configuration values & Escape hatches - 5 days from now
  3. What happens when a sparse file allocation fails? - 7 days from now
  4. NTFS has an emergency stash of disk space - 9 days from now
  5. Challenge: Giving file system developer ulcer - 12 days from now

And 4 more posts are pending...

There are posts all the way to Feb 17, 2025

RECENT SERIES

  1. Challenge (77):
    20 Jan 2025 - What does this code do?
  2. Answer (13):
    22 Jan 2025 - What does this code do?
  3. Production post-mortem (2):
    17 Jan 2025 - Inspecting ourselves to death
  4. Performance discovery (2):
    10 Jan 2025 - IOPS vs. IOPS
View all series

Syndication

Main feed Feed Stats
Comments feed   Comments Feed Stats
}