Ayende @ Rahien

Hi!
My name is Oren Eini
Founder of Hibernating Rhinos LTD and RavenDB.
You can reach me by email or phone:

ayende@ayende.com

+972 52-548-6969

, @ Q c

Posts: 6,516 | Comments: 47,933

filter by tags archive

reEntity Framework Core performance tuning–Part III

time to read 1 min | 166 words

I mentioned in the previous post that I’ll take the opportunity to show of some interesting queries. The application itself is here, and you can see how to UI look in the following screenshot:

image

I decided to see what would be the best way to come up with the information we need for this kind of query. Here is what I got.

image

This is using a select object style to get a complex projection back from the server. Here are the results:

image

As you can see, we are able to get all the data we want, in a format that is well suited to just sending directly to the UI with very little work and with tremendous speed.

reDifferent I/O Access Methods for Linux

time to read 6 min | 1063 words

The “Different I/O Access Methods for Linux, What We Chose for Scylla, and Why” is quite fascinating. It is a pleasure to be able to read in depth into another database implementation strategy and design decisions. In particular where they don’t match what we are doing, because we can learn from the differences.

The article is good both in terms of discussing the general I/O approaches on Linux in general and the design decisions and implications for ScyllaDB in particular.

ScyllaDB chose to use AIO/DIO (async direct I/O) using a dedicated library called Seastar. And they are effectively managing their own memory and I/O scheduling internally, skipping the kernel entirely. Given how important I/O is for a database, I can say that I strongly resonate with this approach, and it is something that we have tried for a while with RavenDB. That included paying careful attention to how we are sending I/O, controlling our own caching, etc.

We are in a different position from Scylla, of course, since we are using managed code, which introduced a different set of problems (and benefits), but during the design of RavenDB 4.0, we chose to go in a very different direction.

But first, let me show you the single statement in the post that caused me to write this blog post:

The great advantage of letting the kernel control caching is that great effort has been invested by the kernel developers over many decades into tuning the algorithms used by the cache. Those algorithms are used by thousands of different applications and are generally effective. The disadvantage, however, is that these algorithms are general-purpose and not tuned to the application. The kernel must guess how the application will behave next, and even if the application knows differently, it usually has no way to help the kernel guess correctly.

This statement is absolutely correct. The kernel caching algorithms (and the kernel behavior in general) is tailored to suit a generic set of requirements, which means that if you deviate from the way the kernel expects, it is going to be during extra work and you can experience significant problems (performance and otherwise).

Another great resource that I want to point you to is the following article: “You’re Doing It Wrong” which had a major effect on the way RavenDB 4.0 is designed.

What we did, basically, is to look at how the kernel is doing things, and then see how we can fit our own behavior to what the kernel is expecting. Actually, I’m lying here, because there is no one kernel here. RavenDB runs on Windows, Linux and OSX (Darwin kernel). So we have three very different systems with wildly different optimizations that we need to run optimally on. Actually, to be fair, we consider Windows & Linux as the main targets for deployment, but that still give us very different stacks to work on.

The key here was to be predictable in all things and be sure that whatever operations we make, the kernel can successfully predict them. This can be a lot of work, but not something that you’ll usually see in the code. It involves laying out the data so it is nearby in the file and ensuring that we have hotspots that the kernel can recognize and optimize for us, etc. And it involves a lot of work with the guts of the system to make sure that we match what the kernel expects.

For example, consider this statement from the article:

…application-level caching allows us to cache not only the data read from disk but also the work that went into merging data from multiple files into a single cache item.

This is a good example of the different behavior. ScyllaDB is using LSM model, which means that in order to read data, they typically need to lookup in multiple files. RavenDB uses a different model (B+Tree with MVCC) which typically means that store all the data in a single file. Furthermore, the way we store the information, we can access it directly via memory map without doing any work to prepare it ahead of time. That means that we can lean entirely on the page cache and gain all the benefits thereof.

The ScyllaDB approach also limits them to running only on Linux and only on specific configurations. Because they rely on async direct I/O, which is a… fiddly beast in Linux at the best of times, you need to make sure that everything matches just so in order to be able to get it working. Just running this on a stock Ubuntu won’t work, since ext4 will block for many operations. Another problem in my view is that this assumes that they are the only player on the machine. If you need to run with additional software on the machine, that can cause fights over resources. For production, this is less of a problem, but for running on a developer machine, that is frequently something that you need to take into account. The kernel will already do that for you (which is useful even in production when people put SQL Server & RavenDB on the same box) so you don’t have to worry about it too much. I’m not sure that this concern is even valid for ScyllaDB, since they tend to be deployed in clusters of dedicated machines (or at least Docker instances) so they have better control over the environment. That certainly make it easier if you can dictate such things.

Another consideration for the RavenDB approach is that we want to be, as much as possible, friendly to the administrator. When we lean on what the kernel does, we usually get that for free. The administrator can usually dictate policies to the kernel and have it follow them, and good sys admins both know how and know when to do that. On the other hand, if we wrote it all ourselves, we would also need to provide the hooks to modify the behavior (and monitor it, and train users in how it works, etc).

Finally, it is not a small thing to remember that if you let the kernel cache your data, that means that that cache is still around if you restart the database (but not the machine), which means that your mostly alleviate the issue of slow cold start if you needed to do things like update configuration or the database binaries.

reEntity Framework Core performance tuning–Part II

time to read 4 min | 701 words

After looking at this post detailing how to optimize data queries in EF Core, I obviously decided that I need to test how RavenDB handles the same load.

To make things fair, I tested this on my laptop, running on battery mode. The size of the data isn’t that much, only 100,000 books and half a million reviews, so I decided to increase that by an order of magnitude.

image

The actual queries we make from the application are pretty simple and static. We can sort by votes / publication date / price (ascending /descending) and we can filter by number of votes and the publication year.

image

This means that we don’t have an explosion of querying options, so that simplify the kind of work we are doing. To make things simple for myself, I kept the same model of books / authors and reviews as separate collections. This isn’t the best model for document database, but it allows us to compare apples to apples against the work the EF Core based solution and the RavenDB solution need to do.

A major cost in Jon’s solution is the need to aggregate the reviews for a book (so the average for the review can be computed). In the end, the only way to get the solution required was to just manually calculate the average reviews for each book and store the computation in the book. We’ll discuss this a bit more in a few minutes, for now, I want to turn our eyes toward the simplest possible query in this page, getting 100 books sorted by the book id.

Because we aren’t running on the same machine, it is hard to make direct parallels, but on Jon’s machine he got 80 ms for this kind of query on 100,000 books. When increasing the data to half a million  books, the query time rose to 150ms. Running the same query gives us the results instantly (zero ms). Querying and sorting by the title, for example, give us the results in 19 ms for a page size of 100 books.

Now, let us look at the major complexity for this system, sorting and filtering by the number of votes in the system. This is hard because the reviews are stored separately from the books. With EF Core, there is the need to join between the tables, which is quite expensive and eventually led Jon to take upon himself the task of manually maintaining the values. With RavenDB, we can use a map/reduce index to handle this all for us. More specifically, we are going to use a multi map/reduce index.

Here is what the index definition looks like:

image

We map the results from both the Books and the BookReviews into the same shape, and then reduce them together into the final output, which contains the relevant aggregation.

Now, let us do some queries, shall we? Here is us querying over the entire dataset (an order of magnitude higher than the EF Core sample set), filtering by the published date and ordering by the computed votes average. In here, we get the first 100 items, and you can see that we got over 289,753 total results:

image

One very interesting feature of this query is that we are asking to include the book document for the results. This is handled after the query (so no need to do a join to the entire 289K+ results), and we are able to get everything we want in a very simple fashion.

Oh, and the total time? 17 ms. Compared to the 80ms result for EF with 1/10 of the data size. That is pretty nice (and yes, different machines, hard to compare, etc).

I’ll probably have another post on this topic, showing off some of the cool things that you can do with RavenDB and queries.

reEntity Framework Core performance tuning–part I

time to read 3 min | 501 words

I run into a really interesting article about performance optimizations with EF Core and I thought that it deserve a second & third look. You might have noticed that I have been putting a lot of emphasis on performance and I had literally spent years on optimizing relational database access patterns, including building a profiler dedicated for inspecting what an OR/M is doing. I got the source and run the application.

I have a small bet with myself, saying that in any application using a relational database, I’ll be able to find a SELECT N+1 issue within one hour. So far, I think that my rate is 92% or so. In this case, I found the SELECT N+1 issue on the very first page load.

image

Matching this to the code, we have:

image

Which leads to:

image

And here we can already tell that there is a problem, we aren’t accessing the authors. This actually happens here:

image

So we have the view that is generating 10 out of 12 queries. And the more results per page you have, the more this costs.

But this is easily fixed once you know what you are looking at. Let us look at something else, the actual root query, it looks like this:

image

Yes, I too needed a minute to recover from this. We have:

  1. One JOIN
  2. Two correlated sub queries

Jon was able to optimize his code by 660ms to 80ms, which is pretty awesome. But that is all by making modifications to the access pattern in the database.

Given what I do for a living, I’m more interested in what it does inside the database, and here is what the query plan tells us:

image

There are only a few tens of thousands of records and the query is basically a bunch of index seeks and nested loop joins. But note that the way the query is structured forces the database to evaluate all possible results, then filter just the top few. That means that you have to wait until the entire result set has been processed, and as the size of your data grows, so will the cost of this query.

I don’t think that there is much that can be done here, given the relational nature of the data access ( no worries, I’m intending to write another post in this series, you guess what I’m going to write there, right?Smile ).

reWriting a Time Series Database from Scratch

time to read 7 min | 1273 words

This is a fascinating post about building a time series database from scratch. The author explicitly warns that they have no background in databases, but given that this is my day job, I decided I might throw in my 2 cents about the way they do things.

Note: Large protion of the original post talk about their existing system (which they are replacing), and most of my criticisms are about that system. Where I'm talking about the new system (toward the end of this post) , I'm noting that explicitly.

I started writing this post about halfway through reading  the post, mostly because I got all sort of comments and wanted to keep a record of things as I’m reading it. It is possible that some of the things that I’m concerned about would be answered later in the post. Without further ado, my comments. It will probably won’t make sense without reading the original post.

They are storing their time series data using keys similar to:

{__name__="requests_total", path="/status", method="GET", instance=”10.0.0.1:80”}

And then they allow to do queries on both the keys and time (all GET requests on /status in the past week). I would have thought about this, but it it a very interesting way to handle queries on such an environment. They also seem to need to do queries that are both in the same series and across series, they call it vertical and horizontal queries.

Let's just consider the main take away: sequential and batched writes are the ideal write pattern for spinning disks and SSDs alike.

Yes, that is pretty much the key for good performance.

So ideally, samples for the same series would be stored sequentially so we can just scan through them with as few reads as possible. On top, we only need to know where this sequence starts to access all data points.

There's obviously a strong tension between the ideal pattern for writing collected data to disk and the layout that would be significantly more efficient for serving queries. It is the fundamental problem our TSDB has to solve.

Yes, being able both write efficiently and query efficiently are two very different problems that you need to handle, and often you need to select which one you’ll prefer.

We create one file per time series that contains all of its samples in sequential order.

What?! No, you can’t do that. At least, you can’t do that and get reasonable behavior. To start with, even though you are doing batch, you are actually guaranteeing that your write pattern would be random. Why is that?

Because if you are writing every KB (which is what they do) per file, you are basically ensuring that the OS will have to write to different sectors / pages on the physical drive. So if you have writes to 100 series, you are going to be writing to 100 different on disk locations. To make things worse, you aren’t writing in 4KB increments, which basically means that you are doing buffered writes. That information isn’t in the post, but it is a safe assumption, since if they weren’t doing buffered writes, there is no way that the operating system will be able to catch up with the kind of load. That in turn means that you don’t have any durability whatsoever.

In fact, this is mentioned explicitly, but only in the case of losing the writes made in the application buffer. I’m assuming that they either don’t care or have a different manner for avoiding / dealing with data loss / corruption in the case of machine failure. More to the point, given that they do compression, they need to be able to recover from partially written data to the files, and from the post, I don’t think that they are doing that.

But those issues are only just the start. On Windows, you don’t typically worry about the number of open files you have, but on Linux, there is a typical a limit on the number of open files you can have. It is common to have that around 64K max open files. Using a file per series means that you are very likely going to run into issues with the number of open files you have, in fact, it would be very easy to construct a query that would hit that limit, and that would have a global impact on the server.

Other issues with such large number of files is that file systems don’t really like it when you have so many files, this is discussed in the post as running out of inodes, but we have noticed performance degradation on directories with large number of files on a wide variety of file systems.

When you implement expiration of data, it also means that you have to move data from the middle of the file to the beginning, and then truncate it, that leads to a huge amount of additional I/O, likely blocking and is probably something that an operations guy is looking at.

Another issue is that because they have a file per series, they need to cache it aggressively, leading to competition between the database cache and the operation system page cache. It also means that the application is sensitive to allocation patterns, and on Linux, that is a really bad thing to do, because the silly OOM killer.

When querying data that is not cached in memory, the files for queried series are opened and the chunks containing relevant data points are read into memory. If the amount of data exceeds the memory available, Prometheus quits rather ungracefully by getting OOM-killed.

Yep, that was my first thought when I started reading this. I follow the reasoning on why OOM is there, but I still think it is a very silly choice.

The actual solution that they present is to have all the data for a particular time frame (2 hours, in their case) in a block directory. I’m not sure why they have a separate directory from block (with multiple chunks per block), and mmaping the whole thing. That is a far better solution, but they also implement compaction, which get you right back to write amplification, and I don’t quite get why. The example they give is that if you have a week long query, you don’t want to merge results across 80 blocks, but I would assume that this is surely better than having to write the same data over and over again.

If the series with IDs 10, 29, and 9 contain the label app="nginx", the inverted index for the label "nginx" is the simple list [10, 29, 9], which can be used to quickly retrieve all series containing the label.

This just screams at me that it is wrong. Oh, not the actual content, but look at the list, it should be [9, 10, 29], because working with the sorted data is going to enable so many more interesting scenarios. Oh, it seems like that is what they are doing in the new version, so that is good.

I think that I found the right location for the code, and this line scares me, basically, it means that this will issue an fsync once every 10 seconds. That means that there is a 10 seconds period of potential data loss. Given the data that is kept, that is probably not an issue, and losing 10 seconds of samples is likely not going to be an issue.

That means that you can basically do all writes to memory all the time, and that gives you a major performance boost all around, at the expense of safety.

This is an interesting enough topic that I’ll do another post, discussing how I’ll implement the same scenario with Voron.

reWhy Uber Engineering Switched from Postgres to MySQL

time to read 6 min | 1173 words

The Uber Engineering group have posted a really great blog post about their move from Postgres to MySQL. I mean that quite literally, it is a pleasure to read, especially since they went into such details as the on-disk format and the implications of that on their performance.

For fun, there is another great post from Uber, about moving from MySQL to Postgres, which also has interesting content.

Go ahead and read both, and we’ll talk when you are done. I want to compare their discussion to what we have been doing.

In general, Uber’s issue fall into several broad categories:

  • Secondary indexes cost on write
  • Replication format
  • The page cache vs. buffer pool
  • Connection handling

Secondary indexes

Postgres maintain a secondary index that points directly to the data on disk, while MySQL has a secondary index that has another level of indirection. The images show the difference quite clearly:

Postgres MySQL
Postgres_Tuple_Property_

MySQL_Index_Property_

I have to admit that this is the first time that I ever considered the fact that the indirection’s manner might have any advantage. In most scenarios, it will turn any scan on a secondary index into an O(N * logN) cost, and that can really hurt performance. With Voron, we have actually moved in 4.0 from keeping the primary key in the secondary index to keeping the on disk position, because the performance benefit was so high.

That said, a lot of the pain the Uber is feeling has to do with the way Postgres has implemented MVCC. Because they write new records all the time, they need to update all indexes, all the time, and after a while, they will need to do more work to remove the old version(s) of the record. In contrast, with Voron we don’t need to move the record (unless its size changed), and all other indexes can remain unchanged. We do that by having a copy on write and a page translation table, so while we have multiple copies of the same record, they are all in the same “place”, logically, it is just the point of view that changes.

From my perspective, that was the simplest thing to implement, and we get to reap the benefit on multiple fronts because of this.

Replication format

Postgres send the WAL over the wire (simplified, but easier to explain) while MySQL send commands. When we had to choose how to implement over the wire replication with Voron, we also sent the WAL. It is simple to understand, extremely robust and we already had to write the code to do that. Doing replication using it also allows us to exercise this code routinely, instead of it only running during rare crash recovery.

However, sending the WAL has issues, because it modify the data on disk directly, and issue there can cause severe problems (data corruption, including taking down the whole database). It is also extremely sensitive to versioning issues, and it would be hard if not impossible to make sure that we can support multiple versions replicating to one another. It also means that any change to the on disk format needs to be considered with distributed versioning in mind.

But what killed it for us was the fact that it is almost impossible to handle the scenario of replacing the master server automatically. In order to handle that, you need to be able to deterministically let the old server know that it is demoted and should accept no writes, and the new server that it can now accept writes and send its WAL onward. But if there is a period of time in which both can accept write, then you can’t really merge the WAL, and trying to is going to be really hard.  You can try using distributed consensus to run the WAL, but that is really expensive (about 400 writes / second in our benchmark, which is fine, but not great, and impose a high latency requirement).

So it is better to have a replication format that is more resilient to concurrent divergent work.

OS Page Cache vs Buffer Pool

From the post:

Postgres allows the kernel to automatically cache recently accessed disk data via the page cache. … The problem with this design is that accessing data via the page cache is actually somewhat expensive compared to accessing RSS memory. To look up data from disk, the Postgres process issues lseek(2) and read(2) system calls to locate the data. Each of these system calls incurs a context switch, which is more expensive than accessing data from main memory. … By comparison, the InnoDB storage engine implements its own LRU in something it calls the InnoDB buffer pool. This is logically similar to the Linux page cache but implemented in userspace. While significantly more complicated than Postgres’s design…

So Postgres is relying on the OS Page Cache, while InnoDB implements its own. But the problem isn’t with relying on the OS Page Cache, the problem is how you rely on it. And the way Postgres is doing that is by issuing (quite a lot, it seems) system calls to read the memory. And yes, that would be expensive.

On the other hand, InnoDB needs to do the same work as the OS, with less information, and quite a bit of complex code, but it means that it doesn’t need to do so many system calls, and can be faster.

Voron, on the gripping hand, relies on the OS Page Cache to do the heavy lifting, but generally issues very few system calls. That is because Voron memory map the data, so access it is usually a matter of just pointer dereference, the OS Page Cache make sure that the relevant data is in memory and everyone is happy. In fact, because we memory map the data, we don’t have to manage buffers for the system calls, or to do data copies, we can just serve the data directly. This ends up being the cheapest option by far.

Connection handling

Spawning a process per connection is something that I haven’t really seen since the CGI days. It seems pretty harsh to me, but it is probably nice to be able to kill a connection with a kill –9, I guess. Thread per connection is also something that you don’t generally see. The common situation today, and what we do with RavenDB, is to have a pool of threads that all manage multiple connections at the same time, often interleaving execution of different connections using async/await on the same thread for better performance.

reWhy you can't be a good .NET developer

time to read 3 min | 526 words

This post is in reply to Rob’s post, go ahead and read it, I’ll wait.

My answer to Rob’s post can be summarize in a single word:

In particular, this statement:

it is impossible to be a good .NET developer. To work in a development shop with a team is to continually cater for the lowest common denominator of that team and the vast majority of software shops using .NET have a whole lot of lowest common denominator to choose their bad development decisions for.

Um, nope. That only apply to places that are going for the lowest common denominator. To go from there to all .NET shops is quite misleading. I’ll give our own example, of building a high performance database in managed code, which has very little lowest common denominator anything anywhere, but that would actually be too easy.

Looking at the landscape, I can see quite a lot of people doing quite a lot of interesting things at the bleeding edge. Now, it may be that this blog is a self selecting crowd, but when you issue statements as “you can’t be a good .NET developers”, that is a pretty big statement to stand behind.

Personally, I think that I’m pretty good developer, and while I dislike the term “XYZ developer”, I do 99% of my coding in C#.

Now, some shops have different metrics, they care about predictability of results, so they will be extremely conservative in their tooling and language usage, the old “they can’t handle that, so we can’t use it” approach. This has nothing to do with the platform you are using, and all to do with the type of culture of the company you are at.

I can certainly find good reasons for that behavior, by the way, when your typical product lifespan is measured in 5 – 10 years, you have a very different mindset than if you aim at most a year or two away. Making decisions on brand new stuff is dangerous, we lost a lot when we decided to use Silverlight, for example. And the decision to go with CoreCLR for RavenDB was made with explicit back off strategy in case that was sunk too.

Looking at the kind of directions that people leave .NET for, it traditionally have been to the green green hills of Rails, then it was Node.JS, not I think it is Elixir, although I’m not really paying attention. That means that in the time a .NET developer (assuming that they investing in themselves and continuously learning) invested in their platform, learned a lot on how to make it work properly, the person who left for greener pastures has had to learn multiple new frameworks and platforms. If you think that this doesn’t have an impact on productivity, you are kidding yourself.

The reason you see backlash against certain changes (project.json coming, going and then doing disappearing acts worthy of Houdini) is that there is value in all of that experience.

Sure, sometimes change is worth it, but it needs to be measured against its costs. And sometimes there are non trivial.

reWhy You Should Never Use MongoDB

time to read 3 min | 570 words

I was pointed at this blog post, and I thought that I would comment, from a RavenDB perspective.

TL;DR summary:

If you don’t know what how to tie your shoes, don’t run.

The actual details in the posts are fascinating, I’ve never heard about this Diaspora project. But to be perfectly honest, the problems that they run into has nothing to do with MongoDB or its features. They have a lot to do with a fundamental lack of understanding on how to model using a document database.

In particular, I actually winced in sympathetic pain when the author explained how they modeled a TV show.

They store the entire thing as a single document:

Hell, the author even talks about General Hospital, a show that has 50+ sessions and 12,000 episodes in the blog post. And at no point did they stop to think that this might not be such a good idea?

A much better model would be something like this:

  • Actors
    • [episode ids]
  • TV shows
    • [seasons ids]
  • Seasons
    • [review ids]
    • [episode ids]
  • Episodes
    • [actor ids]

Now, I am not settled in my own mind if it would be better to have a single season document per season, containing all episodes, reviews, etc. Or if it would be better to have separate episode documents.

What I do know is that having a single document that large is something that I want to avoid. And yes, this is a somewhat relational model. That is because what you are looking at is a list of independent aggregates that have different reasons to change.

Later on in the post the author talks about the problem when they wanted to show “all episodes by actor”, and they had to move to a relational database to do that. But that is like saying that if you stored all your actors information as plain names (without any ids), you would have hard time to handle them in a relational database.

Well, duh!

Now, as for the Disapora project issues.  They appear to have been doing some really silly things there. In particular, let us store store the all information multiple times:

You can see for example all the user information being duplicated all over the place.  Now, this is a distributed social network, but that doesn’t call for it to be stupid about it.  They should have used references instead of copying the data.

Now, to be fair, there are very good reasons why you’ll want to duplicate the data, when you want a point in time view of it. For example, if I commented on a post, I probably want my name in that post to remain frozen. It would certainly make things much easier all around. But if changing my email now requires that we’ll run some sort of a huge update operation on the entire database… well, it ain’t the db fault. You are doing it wrong.

Now, when you store references to other documents, you have a few options. If you are using RavenDB, you have Include support, so you can get the associated documents easily enough. If you are using mongo, you have an additional step in that you have to call $in(the ids), but that is about it.

I am sorry, but this is blaming the dancer blaming the floor.

reHow memory mapped files, filesystems and cloud storage works

time to read 4 min | 641 words

Kelly has an interesting post about memory mapped files and the cloud. This is in response to a comment on my post where I stated that we don’t reserve space up front in Voron because we  support cloud providers that charge per storage.

From Kelly’s post, I assume she thinks about running it herself on her own cloud instances, and that is what here pricing indicates. Indeed, if you want to get a 100GB cloud disk from pretty much anywhere, you’ll pay for the full 100GB disk from day 1. But that isn’t the scenario that I actually had in mind.

I was thinking about the cloud providers. Imagine that you want to go to RavenHQ, and get a db there. You sigh up for a 2 GB plan, and all if great. Except that on the very first write, we allocate a fixed 10 GB, and you start paying overage charges. This isn’t what you pay when you run on your own hardware. This is what you would have to deal with as a cloud DBaaS provider, and as a consumer of such a service.

That aside, let me deal a bit with the issues of memory mapped files & sparse files. I created 6 sparse files, each of them 128GB in size in my E drive.

As you can see, this is a 300GB disk, but I just “allocated” 640GB of space in it.

image

This also shows that there has been no reservation of space on the disk. In fact, it is entirely possible to create files that are entirely too big for the volume they are located on.

image

I did a lot of testing with mmap files & sparseness, and I came to the conclusion that you can’t trust it. You especially can’t trust it in a cloud scenario.

But why? Well, imagine the scenario where you need to use a new page, and the FS needs to allocate one for you. At this point, it need to find an available page. That might fail, let us imagine that this fails because of no free space, because that is easiest.

What happens then? Well, you aren’t access things via an API, so there isn’t an error code it can return, or an exception to be thrown.

In Windows, it will use Standard Exception Handler to throw the error. In Linux, that will be probably generate a SIVXXX error. Now, to make things interesting, this may not actually happen when you are writing to the newly reserved page, it may be deferred by the OS to a later point in time (or if you call msync / FlushViewOfFile).  At any rate, that means that at some point the OS is going to wake up and realize that it promised something it can’t deliver, and in that point (which, again, may be later than the point you actually wrote to that page) you are going to find yourself in a very interesting situation. I’ve actually tested that scenario, and it isn’t a good one form the point of view of reliability. You really don’t want to get there, because then all bets are off with regards to what happens to the data you wrote. And you can’t even do graceful error handling at that point, because you might be past the point.

Considering the fact that disk full is one of those things that you really need to be aware about, you can’t really trust this intersection of features.

reKiip’s MongoDB’s experience

time to read 7 min | 1301 words

We got asked several times to respond to this post, about the reason Kiip moved away from MongoDB:

image

On the surface, RavenDB and MongoDB are really similar, looking at the Good parts of the Kiip post, we have schemalessness, easy replication, rich query langauge and we can be access from multiple languages.

But under the hood, RavenDB operates in a completely different way than MongoDB does. A vast majority of the issues that Kiip run into are actually low level (really low level, is some cases) issues that shouldn’t really be visible to the user.

Non-counting B-Trees

The fact that MongoDB uses non counting B-Trees? The only reason that the user care about that is that it actually impacts performance, but the Kiip blog mentions a bunch of other issues related to that.

In RavenDB, we use Lucene as the indexing format, and we really don’t care about the actual format of the indexes. We natively support Count() and limit / skip, because we feel that those are actually core parts of what most users need. In fact, our API allows us to get the total count of results of a paged query as a by product of actually making the query. There isn’t any additional cost for doing this.

Poor Memory Management

MongoDB relies on the OS to do the memory management, by letting the OS memory manager to do its work. That is actually quite a smart decision, because I can guarantee that more work has gone into optimizing the OS memory manager than could have been invested by the MongoDB project. But that is just part of the work.

In RavenDB, we are actually a managed application, so we don’t have directly control over memory. That doesn’t mean that we don’t actually manage it. We have several layers of caching in place, exactly because we know more than the OS about our own usage scenarios. In many cases, even if you are making a totally new request, it would never hit the disk, because we are keeping track on hot data and making sure that it resides in memory. This applies to both indexes and documents, mind. And during the indexing process we are very careful about memory management.

Sure, the OS memory manager is more optimized, but the database knows what is going on, and can predict its own usage patterns. That is how RavenDB does a lot of magic relating to auto configuration.

Uncompressed field names

In MongoDB, it is considered good practice to shorten field names for space optimization. But MongoDB doesn’t do it for you automatically.

RavenDB doesn’t compress field names, but at the same time, it isn’t a good practice to do so. In fact, I think that this is a horrible little mess. There are a lot of arguments against compressing field names, not the least of which is that it makes it pretty hard to figure out what it is that you are actually trying to do. Looking at the raw data, something that is done fairly frequently when debugging and troubleshooting becomes harder to work with and manage:

{
  "a2": "nathan ",
  "d3": "",
  "a2": "2012-05-17T00:00:00.0000000",
  "h3": "2012-04-15T00:00:00.0000000",
  "r2": "archanid@sample.com",
  "o2": "8169cd4a-babf-4015-a3c7-4d503642e021",
  "o1": "products/NHProf"
}

Anyone wants to figure out what this document is about? And at least in this one, the data itself tells you a lot about the actual content.

There are far better alternatives in place. In RavenDB, we do full response / request compression, and we allow to do document compression on disk as well. If we were ever to get to the point where this would be a serious problem (and so far, it isn’t, even on large data sets), it would be less than a week of work to implement string interning inside RavenDB, so we would use the same string references for field values.

Global write lock

MongoDB (as of the current version at the time of writing: 2.0), has a process-wide write lock. … At this point, all other operations including reads are blocked because of the write lock.

Now, to be fair, also have a write lock, but it isn’t nearly as bad as it is in MongoDB. RavenDB write lock is actually for… writes, and it doesn’t interfere with the either reads or indexes. It is on the list of things to remove, but the crazy part is. So far, and we have really demanding users, no one cares. The reason that no one cares is that this is really small lock, and it only affects writes, it is not Stop the World type of thing.

Safe off by default

I am just going to let Kiip’s words stand for themselves (emphasis mine):

This is a crazy default, although useful for benchmarks. As a general analogy: it’s like a car manufacturer shipping a car with air bags off, then shrugging and saying “you could’ve turned it on” when something goes wrong.

RavenDB entire philosophy is around Safe by Default. That is the only thing that really make sense, because otherwise… Well… here is what happenned at Kiip:

We lost a sizable amount of data at Kiip for some time before realizing what was happening and using safe saves where they made sense (user accounts, billing, etc.).

Offline table compaction

Every now and then, you need to take down MongoDB and let it compact its on disk data. This is another Stop the World operation, and the only way to keep up when you do so is to have a hot standby ready.

RavenDB does all maintenance task while the server is up and serving requests. You don’t need any downtime just because RavenDB need to arrange some data on disk, we take care of that live, and with no interruption in service.

Secondaries do not keep hot data in RAM

As Kiip explains it:

The primary doesn’t relay queries to secondary servers, preventing secondaries from maintaining hot data in memory. This severely hinders the “hot-standby” feature of replica sets, since the moment the primary fails and switches to a secondary, all the hot data must be once again faulted into memory.

RavenDB doesn’t do so either, but for a drastically different reason. As I mentioned earlier, the way RavenDB works is quite different. When you are running a hot standby node, it will get the new data from the server and index it. We keep the index open, so for a lot of the data, it is already going to be in memory. For the rest, as I mentioned, we have several layers of caches that would help prevent needing to page gigabytes on data into memory.

Conclusion

As an utterly unbiased observer (Smile), I can say that RavenDB rocks.

What we are actually seeing here is that RavenDB put different emphasis on different things. I really care for making the common application level scenarios easy and nice to work with. And I had enough time supporting production level apps that I tried very hard to make sure that RavenDB can take care of itself for most scenarios without any hand holding.

FUTURE POSTS

  1. NHibernate Profiler 5.0 Alpha has been released - 12 hours from now
  2. The best features are the ones you never knew were there: You can’t do everything - 3 days from now
  3. You are doing it REALLY wrong, the shortest code review ever - 4 days from now
  4. Carefully performing invalid operations to get the wrong error and the right result - 5 days from now
  5. If you have a finalizer, watch your ctor - 6 days from now

And 7 more posts are pending...

There are posts all the way to Dec 11, 2017

RECENT SERIES

  1. PR Review (9):
    08 Nov 2017 - Encapsulation stops at the assembly boundary
  2. API Design (9):
    27 Jul 2016 - robust error handling and recovery
  3. Production postmortem (21):
    07 Aug 2017 - 30% boost with a single line change
  4. The best features are the ones you never knew were there (5):
    21 Nov 2017 - Unsecured SSL/TLS
  5. RavenDB Setup (2):
    23 Nov 2017 - How the automatic setup works
View all series

Syndication

Main feed Feed Stats
Comments feed   Comments Feed Stats