Ayende @ Rahien

Oren Eini aka Ayende Rahien CEO of Hibernating Rhinos LTD, which develops RavenDB, a NoSQL Open Source Document Database.

You can reach me by:

oren@ravendb.net

+972 52-548-6969

time to read 2 min | 313 words

I ran into this blog post talking about how to handle optimistic concurrency in MongoDB, and it brought to mind a fundamental difference in design philosophy between RavenDB and MongoDB.

If you read the associated blog post, you'll see guidance on how to build simple optimistic concurrency using the MongoDB API. It looks relatively straightforward, but there is a lot of complexity hiding underneath.

With RavenDB, we have decided that the responsibility of such tasks is on us, and not our users. Here is how you’ll write the same thing in RavenDB:

session.Advanced.OptimisticConcurrency = true;

And you are done. There are also options to set it globally (for all actions), for a particular session (as shown above), or for a particular document or documents in a bigger transaction. About the only thing that we don't handle is retrying when the update fails, so that you can re-run your business logic yourself.
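To make the scopes concrete, here is a minimal sketch of what the three levels can look like from the .NET client. The URL, database name, the Order class and the exact member names are illustrative assumptions (they may vary between client versions), not a copy of the documentation:

using Raven.Client.Documents;

public class Order { public string Id { get; set; } public decimal Total { get; set; } }

public static class ConcurrencyScopes
{
    public static void Demo()
    {
        var store = new DocumentStore { Urls = new[] { "http://localhost:8080" }, Database = "Shop" };

        // Globally, for every session opened from this store (set before Initialize).
        store.Conventions.UseOptimisticConcurrency = true;
        store.Initialize();

        using var session = store.OpenSession();

        // For a particular session, as shown above.
        session.Advanced.UseOptimisticConcurrency = true;

        // For a particular document in a bigger transaction: store it with the
        // change vector we expect it to have, so only this document is checked.
        var order = session.Load<Order>("orders/1-A");
        order.Total += 10;
        session.Store(order, session.Advanced.GetChangeVectorFor(order), order.Id);

        session.SaveChanges(); // throws a concurrency exception if the document changed in the meantime
    }
}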

The reason I’m writing this is actually at the very end of the post:

This works just fine if I "remember" to include that Where clause correctly, but there's a better way if we want a general solution. For that, I'd do pretty much what I would have in the Life Beyond Distributed Transactions series - introduce a Repository, Unit of Work, and Identity Map.

This is exactly right. It looks trivial to do something like that when you are looking at a trivial scenario, but put it in a real application and the complexity sprouts. For example, try doing the same thing with multiple documents that need to change together. You have to implement quite a lot of code to do so (identity map, unit of work, and hopefully not a repository).

With RavenDB, all of that is just there and available for you. No need to do anything, It Just Works.

time to read 11 min | 2033 words

I was pointed to this blog post which talks about the experience of using RavenDB in production. I want to start by saying that I love getting such feedback from our users, for a whole lot of reasons, not the least of which is that it is great to hear what people are doing with our database.

Alex has been using RavenDB for a while, so he had the chance to use both RavenDB 3.5 and 4.2. That is good from my perspective, because it means that he got to see what changed and how those changes impacted his routine usage of RavenDB. I'm going to call out (and discuss) some of the points that Alex raises in the post.

Speaking about .NET integration:

Raven’s .NET client API trumps MongoDB .NET Driver, CosmosDB + Cosmonaut bundle and leaves smaller players like Cassandra (with DataStax C# Driver), CouchDB (with MyCouch) completely out of the competition.

When I wrote RavenDB 0.x, way before the 1.0 release, it took two months to build the core engine, and another three months to build the .NET client. Most of that time went on Linq integration, by the way. Yes, it literally took more time to build the client than the database core. We put a lot of effort into that. I was involved for years in the NHibernate project and I took a lot of lessons from there. I’m very happy that it shows.

Speaking about technical support:

RavenDB has a great technical support on Google Groups for no costs. All questions, regardless of the obtained license, get meaningful answers within 24 hours and quite often Oren Eini responds personally.

Contrary to the Google Groups, questions on StackOverflow are often neglected. It’s a mystery why Raven sticks to a such archaic style of tech support and hasn’t migrated to StackOverflow or GitHub.

I care quite deeply about the quality of our support, to the point where I'll often field questions directly, as Alex notes. I have an article on LinkedIn that talks about my philosophy in that regard, which may be of interest.

As to Alex's point about Stack Overflow vs. Google Groups, the key difference is the way we can discuss things. On Stack Overflow, the focus is on an answer, but that isn't our usual workflow when providing support. Here is a question that would fit Stack Overflow very well: a well defined problem with all the details, where we are able to provide an acceptable answer in a single exchange. That kind of interaction, however, is quite rare. It is a lot more common to need a lot of back and forth, and we tend to try to give a complete solution, not just answer the initial question.

Another issue is that Stack Overflow isn't moderated by us, which means that we would be subject to rules that we don't necessarily want to adhere to. For example, we get asked similar questions all the time, which would be marked as duplicates and closed on Stack Overflow, but we want to actually answer people.

GitHub issues are a good option for this kind of discussion, but they tend to push people toward filing issues, while one of the reasons we have the Google group is to create discussion. I guess it comes down to the different communities that spring up around each communication medium.

Speaking about documentation:

RavenDB does have the official docs, which are easily navigable and searchable. Works well for beginners and provides good samples to start with, but there are gaps here and there, and it has far less coverage of the functionality than many popular free and open source projects.

Ultimately, as a developer, I want to google my question and it’s acceptable to have the answer on a non-official website. But if you’re having a deep dive with RavenDB, it’s unlikely to find it in the official docs, nor StackOverflow, nor GitHub.

Documentation has always been a chore for me. The problem is that I know what the software does, so it can be hard to even figure out what we need to explain. This year we have hired a couple more technical writers specifically to address the missing pieces in our documentation. I think we are doing quite well at this point.

What wasn't available at the time of Alex's post but is available now is the book. All of the details about RavenDB that you could care about, and more, are covered there. It is also indexed by Google, so in many cases your search will point you to the right section in the book that may answer your question.


I hope that these actions cover the gaps that Alex noted in our documentation. And there is also this blog, of course.

Speaking about issues that he had run into:

It’s reliable and does work well. Unless it doesn’t. And then a fix gets promptly released (a nightly build could be available within 24 hours after reporting the issue). And it works well again.

All these bugs (and others I found) have been promptly fixed.

… stability of the server and the database integrity are sacred. Even a slight risk of losing it can keep you awake at night. So no, it’s a biggy, unless the RavenDB team convinces me otherwise.

I didn't include the list of issues that Alex pointed to on purpose. The actual issues don't matter that much, because he is correct: from Alex's perspective, RavenDB ought to Just Work, and anything else is our problem.

We spend a lot of time ensuring high quality for RavenDB. I had a two-part conversation with Jeffrey Palermo about just that, and you might be interested in this keynote that goes into some of the challenges involved in making RavenDB.

One of the issues that he raised was RavenDB crashing (or causing Windows to crash) because of a bug in the Windows kernel that was deployed in a hotfix. The hotfix was quietly patched by Microsoft some time later, but in the meantime, RavenDB would crash. And the user would blame us, because we crashed.

Another issue (a RavenDB upgrade failing), however, was an intentional choice on our part. We had a bug that could cause data corruption in some cases; we fixed it, but we had to deal with the potentially problematic state of existing databases. We chose to be conservative and ask the user to take an explicit action in this case, to prevent data loss. It isn't ideal, I'm afraid, but I believe that we did the best we could after fixing the underlying issue. When in doubt, we pretty much always fall on the side of preventing data loss over availability.

Speaking about Linq & JS support:

let me give you a sense of how often you’ll see LINQ queries throwing NotSupportedException in runtime.

But in front of us a special case — a database written in the .NET! There is no need in converting a query to SQL or JavaScript.

I believe that I mentioned already that the initial Linq support for RavenDB literally took more time than building RavenDB itself, right? Linq is an awesome feature, for the consumer. For the provider, it is a mess. I’m going to quote Frans Bouma on this:

Something every developer of an ORM with a LINQ provider has found out: with a LINQ provider you're never done. There are always issues popping up due to e.g. unexpected expressions in the tree.

Now, as Alex points out, RavenDB is written in .NET, so we could technically use something like Serialize.Linq and support any arbitrary expression easily, right?

Not really, I’m afraid, and for quite a few reasons:

  • Security – you are effectively allowing a user to send arbitrary code to be executed on the server. That is never going to end up well.
  • Compatibility – we want to make sure that we are able to change our internals freely. If we are forced to accept (and then execute) code from the client, that freedom is limited.
  • Performance – issuing a query in this manner means that we'll have to evaluate the query on each document in the relevant collection. A full table scan. That is not a feature that RavenDB even has, and for a very good reason.
  • Limited to .NET only – we currently have clients for .NET, the JVM, Go, Python, C++ and Node.js. Having features just for one client is not something that we want; it really complicates our lives.

We think about queries in terms of RQL, which is abstract in nature and doesn't tie us down with regard to how we implement them. That means that we can use features such as automatic indexes, build fast queries, etc.
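As a concrete (and hypothetical) illustration of what this looks like from the client side, the LINQ query below is translated by the client into an RQL query that the server is free to satisfy however it likes, typically with an automatic index; the Book class and its fields are made up for the sketch:

using System.Linq;
using Raven.Client.Documents.Session;

public class Book { public string Id { get; set; } public string Title { get; set; } public int Year { get; set; } }

public static class RqlSketch
{
    public static void Query(IDocumentSession session)
    {
        // Roughly equivalent RQL: from Books where Year >= 2015 order by Title
        var recent = session.Query<Book>()
            .Where(b => b.Year >= 2015)
            .OrderBy(b => b.Title)
            .Take(25)
            .ToList();
    }
}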

Speaking about RQL:

Alex points out some issues with RQL as well. The first issue relates to the difference between a field existing and having a null value. RavenDB makes a distinction between these states: a field can have a null value, or it can be missing entirely. This is similar to the behavior of NULL in SQL, which often creates the same kind of confusion. Because RavenDB doesn't require a schema, different documents can have different fields, so in our case there is an additional level: a field can have a value, be null, or not exist at all. We reflect that in our queries. Unfortunately, while the behavior is well defined and documented, just like NULL behavior in SQL it can be surprising to users.
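Here is a hedged sketch of how the distinction can show up in queries, using the lower-level DocumentQuery API as I understand it; the User class and the Nickname field are hypothetical:

using System.Linq;
using Raven.Client.Documents.Session;

public class User { public string Id { get; set; } public string Nickname { get; set; } }

public static class NullVersusMissing
{
    public static void Queries(IDocumentSession session)
    {
        // Documents where the field exists but holds an explicit null: { "Nickname": null, ... }
        var explicitNull = session.Advanced.DocumentQuery<User>()
            .WhereEquals("Nickname", (object)null)
            .ToList();

        // Documents where the field does not exist at all.
        var missingField = session.Advanced.DocumentQuery<User>()
            .Not
            .WhereExists("Nickname")
            .ToList();
    }
}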

Another issue that Alex brings up is that negation queries aren't supported directly. This is because of the way we process queries, and it is one of the ways we ensure that users are aware of a query's impact. With a negation query, we have to first match all documents and then exclude all those that match the negation. For a large number of documents, that can be expensive. Ideally, the user has a way to limit the scope of the matches that are being negated, which can really help performance.
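For example, here is a sketch of a scoped negation: instead of asking only for "everything that isn't closed" (which forces matching everything and then excluding), the query narrows to a single project first and negates within that scope. The Ticket class and field names are made up:

using System.Linq;
using Raven.Client.Documents.Session;

public class Ticket { public string Id { get; set; } public string Project { get; set; } public string Status { get; set; } }

public static class ScopedNegation
{
    public static void Query(IDocumentSession session)
    {
        var open = session.Query<Ticket>()
            .Where(t => t.Project == "projects/1-A" && t.Status != "Closed") // negation limited to one project
            .ToList();
    }
}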

Speaking about safe by default:

RavenDB is a lot less opinionated than it used to be. Alex rightfully points that out. As we got more users, we had to expand what you could do with RavenDB.  It still pains me to see people do things that are going to be problematic in the end (extremely large page sizes are one good example), but our users demanded that. To quote Alex:

I’d preferred a slowed performance in production and a warning in the server logs rather than a runtime exception.

Our issue with this approach is that no one looks at the logs, and that this usually comes to a head at 2 AM, resulting in a support call from the ops team about a piece of software that broke. Because of this, we have removed many such limits, turning them into alerts, and the very few that remain (mostly just the number of requests per session) can be controlled globally by the admin directly from the Studio. This ensures that the ops team can do something if you hit the wall, and of course, you can also configure this globally from the client side easily enough.
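For reference, a sketch of the client-side knobs involved; the convention and property names reflect my understanding of the client API and should be treated as assumptions:

using Raven.Client.Documents;

var store = new DocumentStore { Urls = new[] { "http://localhost:8080" }, Database = "App" };

// Globally, for every session opened from this store (set before Initialize).
store.Conventions.MaxNumberOfRequestsPerSession = 60;
store.Initialize();

using var session = store.OpenSession();

// Per-session override, if one particular piece of code legitimately needs more round trips.
session.Advanced.MaxNumberOfRequestsPerSession = 100;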

As a craftsman, it pains me to remove those limits, but I have to admit that it significantly reduced the number of support calls that we had to deal with.

Conclusion:

Overall, I can say RavenDB is a very good NoSQL database for .NET developers, but the ”good” is coming with a bunch of caveats. I’m confident in my ability to develop any enterprise application with RavenDB applying the Domain-driven design (DDD) philosophy and practices.

I think that this is really the best one could hope for. Alex's review is honest and to the point. Moreover, it is focused and detailed. That makes it very valuable, because he got to his conclusions not from a brief tour of RavenDB but from actually holding it up in the trenches.

Thanks, Alex.

time to read 2 min | 346 words

I ran into this post, in which the author describes how they got ERROR 1000294 from IBM DataPower Gateway as part of an integration effort. The underlying issue was that they sent JSON to the endpoint with the fields in an order that wasn't expected.

After asking the team at the other end to fix it, the author got back an effort estimate of 9 people for 6 months (4.5 man years!). The author then went and figured out that the fix for the error was somewhere deep inside DataPower:

Validate order of JSON? [X]

The author then proceeded to question the competency / moral integrity of the people behind the estimation.

I believe that the author was grossly unfair, at best, to the people doing the estimation. Mostly because he assumed that unchecking the box and running a single request is a sufficient level of testing for this kind of change, but also because it appears that the author never once considered why this setting may be in place.

  • The order of fields in JSON has been responsible for remote code execution vulnerabilities.
  • The code processing the JSON may handle it in a streaming fashion, and therefore expect the data in a particular order.
  • Worse, the code may just assume the order of the fields and access them by index (a sketch of that failure mode follows this list). Change the order of the fields, and you may reverse the Creditor and Debtor fields.
  • The code may translate the JSON to another format and send it over to another system (likely, given the legacy system mentioned).
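To illustrate the by-index bullet, here is a small hypothetical consumer that binds JSON fields by position instead of by name. It is not DataPower code, just a sketch of the failure mode:

using System;
using System.Linq;
using System.Text.Json;

public static class OrderSensitiveConsumer
{
    // Binds fields by position rather than by name: swap the order of the
    // incoming properties and the money silently flows the other way.
    public static (string Creditor, string Debtor) Parse(string payload)
    {
        using var doc = JsonDocument.Parse(payload);
        var props = doc.RootElement.EnumerateObject().ToArray();
        return (props[0].Value.GetString(), props[1].Value.GetString());
    }

    public static void Main()
    {
        Console.WriteLine(Parse("{\"creditor\":\"ACME\",\"debtor\":\"Wile E.\"}"));
        // Same data, different field order, reversed result:
        Console.WriteLine(Parse("{\"debtor\":\"Wile E.\",\"creditor\":\"ACME\"}"));
    }
}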

The setting is there to protect the system, and unchecking that box means that you have to check every single one of the integration points (which may be several layers deep) to ensure that there isn't an explicit or implied ordering dependency on the JSON.

In short, given the scope and size of the change:  “Fundamentally alter how we accept data from the outside world”, I can absolutely see why they gave this number.

And yes, for 99% of the cases, there isn't likely to be any difference, but you need to validate for that nasty 1% scenario.

time to read 1 min | 166 words

I mentioned in the previous post that I'll take the opportunity to show off some interesting queries. The application itself is here, and you can see how the UI looks in the following screenshot:

[image: the application UI]

I decided to see what would be the best way to come up with the information we need for this kind of query. Here is what I got.

[image: the query]

This is using a select object style to get a complex projection back from the server. Here are the results:

[image: the query results]

As you can see, we are able to get all the data we want, in a format that is well suited to just sending directly to the UI with very little work and with tremendous speed.
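Since the screenshots aren't reproduced here, the following is a hypothetical sketch of the "select object" style of projection (the query in the screenshot was richer); the Order class and fields are made up:

using System.Linq;
using Raven.Client.Documents.Session;

public class Order { public string Id { get; set; } public string Employee { get; set; } public decimal Total { get; set; } }

public static class ProjectionSketch
{
    public static void Query(IDocumentSession session)
    {
        // select new { ... } projects the shape we want on the server side,
        // so the client gets back exactly what the UI needs.
        var summary = (from o in session.Query<Order>()
                       where o.Total > 100
                       select new
                       {
                           o.Id,
                           o.Employee,
                           o.Total
                       }).ToList();
    }
}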

time to read 6 min | 1063 words

The post "Different I/O Access Methods for Linux, What We Chose for Scylla, and Why" is quite fascinating. It is a pleasure to be able to read in depth about another database's implementation strategy and design decisions, in particular where they don't match what we are doing, because we can learn from the differences.

The article is good both in discussing the I/O approaches available on Linux in general and in covering the design decisions and their implications for ScyllaDB in particular.

ScyllaDB chose to use AIO/DIO (async direct I/O) using a dedicated library called Seastar. And they are effectively managing their own memory and I/O scheduling internally, skipping the kernel entirely. Given how important I/O is for a database, I can say that I strongly resonate with this approach, and it is something that we have tried for a while with RavenDB. That included paying careful attention to how we are sending I/O, controlling our own caching, etc.

We are in a different position from Scylla, of course, since we are using managed code, which introduced a different set of problems (and benefits), but during the design of RavenDB 4.0, we chose to go in a very different direction.

But first, let me show you the single statement in the post that caused me to write this blog post:

The great advantage of letting the kernel control caching is that great effort has been invested by the kernel developers over many decades into tuning the algorithms used by the cache. Those algorithms are used by thousands of different applications and are generally effective. The disadvantage, however, is that these algorithms are general-purpose and not tuned to the application. The kernel must guess how the application will behave next, and even if the application knows differently, it usually has no way to help the kernel guess correctly.

This statement is absolutely correct. The kernel caching algorithms (and kernel behavior in general) are tailored to suit a generic set of requirements, which means that if you deviate from what the kernel expects, it is going to be doing extra work and you can experience significant problems (performance and otherwise).

Another great resource that I want to point you to is the following article: “You’re Doing It Wrong” which had a major effect on the way RavenDB 4.0 is designed.

What we did, basically, is look at how the kernel is doing things and then see how we can fit our own behavior to what the kernel is expecting. Actually, I'm lying here, because there is no single kernel here. RavenDB runs on Windows, Linux and OSX (the Darwin kernel), so we have three very different systems with wildly different optimizations that we need to run well on. To be fair, we consider Windows & Linux the main targets for deployment, but that still gives us very different stacks to work with.

The key here was to be predictable in all things and be sure that whatever operations we make, the kernel can successfully predict them. This can be a lot of work, but not something that you’ll usually see in the code. It involves laying out the data so it is nearby in the file and ensuring that we have hotspots that the kernel can recognize and optimize for us, etc. And it involves a lot of work with the guts of the system to make sure that we match what the kernel expects.

For example, consider this statement from the article:

…application-level caching allows us to cache not only the data read from disk but also the work that went into merging data from multiple files into a single cache item.

This is a good example of the different behavior. ScyllaDB uses an LSM model, which means that in order to read data, they typically need to look it up in multiple files. RavenDB uses a different model (a B+Tree with MVCC), which typically means that we store all the data in a single file. Furthermore, because of the way we store the information, we can access it directly via a memory map without doing any work to prepare it ahead of time. That means that we can lean entirely on the page cache and gain all the benefits thereof.
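As an illustration (and emphatically not Voron's actual code), this is roughly what leaning on the page cache through a memory map looks like in .NET:

using System.IO;
using System.IO.MemoryMappedFiles;

public static class MmapRead
{
    public static long ReadValueAt(string path, long position)
    {
        // Map the whole file; the OS page cache decides what is resident.
        using var mmf = MemoryMappedFile.CreateFromFile(
            path, FileMode.Open, mapName: null, capacity: 0, access: MemoryMappedFileAccess.Read);
        using var view = mmf.CreateViewAccessor(0, 0, MemoryMappedFileAccess.Read);

        // Reading is a memory access, not a read() system call; if the page is
        // already cached, the kernel never gets involved.
        return view.ReadInt64(position);
    }
}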

The ScyllaDB approach also limits them to running only on Linux and only on specific configurations. Because they rely on async direct I/O, which is a… fiddly beast in Linux at the best of times, you need to make sure that everything matches just so in order to get it working. Just running this on a stock Ubuntu won't work, since ext4 will block for many operations. Another problem in my view is that this assumes that they are the only player on the machine. If you need to run additional software on the machine, that can cause fights over resources. For production, this is less of a problem, but for running on a developer machine, that is frequently something that you need to take into account. The kernel will already handle that for you (which is useful even in production when people put SQL Server & RavenDB on the same box), so you don't have to worry about it too much. I'm not sure that this concern is even valid for ScyllaDB, since they tend to be deployed in clusters of dedicated machines (or at least Docker instances), so they have better control over the environment. That certainly makes it easier if you can dictate such things.

Another consideration for the RavenDB approach is that we want to be, as much as possible, friendly to the administrator. When we lean on what the kernel does, we usually get that for free. The administrator can usually dictate policies to the kernel and have it follow them, and good sys admins both know how and know when to do that. On the other hand, if we wrote it all ourselves, we would also need to provide the hooks to modify the behavior (and monitor it, and train users in how it works, etc).

Finally, it is no small thing that if you let the kernel cache your data, that cache is still around if you restart the database (but not the machine), which mostly alleviates the issue of a slow cold start when you need to do things like update configuration or the database binaries.

time to read 4 min | 701 words

After looking at this post detailing how to optimize data queries in EF Core, I obviously decided that I need to test how RavenDB handles the same load.

To make things fair, I tested this on my laptop, running on battery. The original data size isn't that large, only 100,000 books and half a million reviews, so I decided to increase that by an order of magnitude.


The actual queries we make from the application are pretty simple and static. We can sort by votes / publication date / price (ascending / descending) and we can filter by number of votes and the publication year.


This means that we don't have an explosion of querying options, which simplifies the kind of work we are doing. To make things simple for myself, I kept the same model of books / authors and reviews as separate collections. This isn't the best model for a document database, but it allows us to compare apples to apples between the work the EF Core based solution and the RavenDB solution need to do.

A major cost in Jon's solution is the need to aggregate the reviews for a book (so the average review score can be computed). In the end, the only way to get the required performance was to manually calculate the average reviews for each book and store the computation in the book. We'll discuss this a bit more in a few minutes; for now, I want to turn our eyes toward the simplest possible query on this page: getting 100 books sorted by the book id.

Because we aren't running on the same machine, it is hard to make direct parallels, but on Jon's machine he got 80 ms for this kind of query on 100,000 books. When increasing the data to half a million books, the query time rose to 150 ms. Running the same query on RavenDB gives us the results instantly (zero ms). Querying and sorting by the title, for example, gives us the results in 19 ms for a page size of 100 books.

Now, let us look at the major complexity in this system: sorting and filtering by the number of votes. This is hard because the reviews are stored separately from the books. With EF Core, there is the need to join between the tables, which is quite expensive and eventually led Jon to take upon himself the task of manually maintaining the values. With RavenDB, we can use a map/reduce index to handle this all for us. More specifically, we are going to use a multi map/reduce index.

Here is what the index definition looks like:

[image: the index definition]

We map the results from both the Books and the BookReviews into the same shape, and then reduce them together into the final output, which contains the relevant aggregation.
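Since the index definition is only shown as a screenshot, here is a hedged sketch of what such a multi map/reduce index can look like with the client API; the class names, field names and the averaging logic are my assumptions, not the exact index from the post:

using System.Linq;
using Raven.Client.Documents.Indexes;

public class Book { public string Id { get; set; } public string Title { get; set; } }
public class BookReview { public string Id { get; set; } public string BookId { get; set; } public int NumStars { get; set; } }

public class Books_ByVotes : AbstractMultiMapIndexCreationTask<Books_ByVotes.Result>
{
    public class Result
    {
        public string BookId { get; set; }
        public int ReviewsCount { get; set; }
        public double AverageVotes { get; set; }
    }

    public Books_ByVotes()
    {
        // Map books and reviews into the same shape...
        AddMap<Book>(books => from b in books
                              select new Result { BookId = b.Id, ReviewsCount = 0, AverageVotes = 0 });

        AddMap<BookReview>(reviews => from r in reviews
                                      select new Result { BookId = r.BookId, ReviewsCount = 1, AverageVotes = r.NumStars });

        // ...and reduce them together into one aggregate per book.
        Reduce = results => from r in results
                            group r by r.BookId into g
                            select new Result
                            {
                                BookId = g.Key,
                                ReviewsCount = g.Sum(x => x.ReviewsCount),
                                AverageVotes = g.Sum(x => x.ReviewsCount) == 0
                                    ? 0
                                    : g.Sum(x => x.AverageVotes * x.ReviewsCount) / g.Sum(x => x.ReviewsCount)
                            };
    }
}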

Now, let us do some queries, shall we? Here we are querying over the entire dataset (an order of magnitude larger than the EF Core sample set), filtering by the published date and ordering by the computed votes average. We get the first 100 items, and you can see that there are 289,753 total results:

[image: the query and its results]

One very interesting feature of this query is that we are asking to include the book documents for the results. This is handled after the query (so there is no need to join against the entire 289K+ results), and we are able to get everything we want in a very simple fashion.
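A sketch of the query side, reusing the types from the index sketch above; the filter field differs from the screenshot (which filtered on publication date), and the Include call reflects my understanding of the client API:

using System.Linq;
using Raven.Client.Documents;
using Raven.Client.Documents.Linq;
using Raven.Client.Documents.Session;

public static class VotesQuery
{
    public static void TopBooks(IDocumentSession session)
    {
        var page = session.Query<Books_ByVotes.Result, Books_ByVotes>()
            .Include(x => x.BookId)                     // fetch the book documents in the same round trip
            .Where(x => x.ReviewsCount >= 10)
            .OrderByDescending(x => x.AverageVotes)
            .Take(100)
            .ToList();

        foreach (var item in page)
        {
            var book = session.Load<Book>(item.BookId); // served from the session, no extra request
        }
    }
}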

Oh, and the total time? 17 ms. Compared to the 80ms result for EF with 1/10 of the data size. That is pretty nice (and yes, different machines, hard to compare, etc).

I’ll probably have another post on this topic, showing off some of the cool things that you can do with RavenDB and queries.

time to read 3 min | 501 words

I ran into a really interesting article about performance optimizations with EF Core and I thought that it deserves a second & third look. You might have noticed that I put a lot of emphasis on performance, and I have literally spent years optimizing relational database access patterns, including building a profiler dedicated to inspecting what an OR/M is doing. I got the source and ran the application.

I have a small bet with myself, saying that in any application using a relational database, I’ll be able to find a SELECT N+1 issue within one hour. So far, I think that my rate is 92% or so. In this case, I found the SELECT N+1 issue on the very first page load.

[image: the queries issued on the first page load]

Matching this to the code, we have:

[image: the application code]

Which leads to:

[image: the query code]

And here we can already tell that there is a problem: we aren't loading the authors. The access actually happens here:

[image: the view code]

So we have the view that is generating 10 out of 12 queries. And the more results per page you have, the more this costs.
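The screenshots aren't reproduced here, so the following is a generic sketch of the shape of the problem (and the usual EF Core fix), not Jon's actual code:

using System.Linq;
using Microsoft.EntityFrameworkCore;

public class Author { public int AuthorId { get; set; } public string Name { get; set; } }
public class Book
{
    public int BookId { get; set; }
    public string Title { get; set; }
    public int AuthorId { get; set; }
    public Author Author { get; set; }
}
public class BookContext : DbContext
{
    public DbSet<Book> Books { get; set; }
    public DbSet<Author> Authors { get; set; }
}

public static class SelectNPlusOne
{
    public static void Render(BookContext db)
    {
        // One query for the page of books...
        var books = db.Books.Take(10).ToList();
        foreach (var book in books)
        {
            // ...and another query per book the moment the view needs the author.
            var authorName = db.Authors
                .Where(a => a.AuthorId == book.AuthorId)
                .Select(a => a.Name)
                .Single();
        }

        // The usual fix: pull the authors in with the books, in one round trip.
        var withAuthors = db.Books.Include(b => b.Author).Take(10).ToList();
    }
}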

But this is easily fixed once you know what you are looking at. Let us look at something else: the actual root query, which looks like this:

[image: the root SQL query]

Yes, I too needed a minute to recover from this. We have:

  1. One JOIN
  2. Two correlated sub queries

Jon was able to optimize his code from 660 ms to 80 ms, which is pretty awesome. But that was all done by making modifications to the access pattern against the database.

Given what I do for a living, I’m more interested in what it does inside the database, and here is what the query plan tells us:

[image: the query plan]

There are only a few tens of thousands of records and the query is basically a bunch of index seeks and nested loop joins. But note that the way the query is structured forces the database to evaluate all possible results, then filter just the top few. That means that you have to wait until the entire result set has been processed, and as the size of your data grows, so will the cost of this query.

I don't think that there is much that can be done here, given the relational nature of the data access (no worries, I intend to write another post in this series; you can guess what I'm going to write there, right?).

time to read 7 min | 1273 words

This is a fascinating post about building a time series database from scratch. The author explicitly warns that they have no background in databases, but given that this is my day job, I decided I might throw in my 2 cents about the way they do things.

Note: a large portion of the original post talks about their existing system (which they are replacing), and most of my criticisms are about that system. Where I'm talking about the new system (toward the end of this post), I'm noting that explicitly.

I started writing this post about halfway through reading theirs, mostly because I had all sorts of comments and wanted to keep a record of things as I was reading. It is possible that some of the things I'm concerned about are answered later in the post. Without further ado, my comments. They probably won't make sense without reading the original post.

They are storing their time series data using keys similar to:

{__name__="requests_total", path="/status", method="GET", instance="10.0.0.1:80"}

And then they allow queries on both the keys and time (all GET requests on /status in the past week). I wouldn't have thought of this, but it is a very interesting way to handle queries in such an environment. They also seem to need to do queries both within the same series and across series, which they call vertical and horizontal queries.

Let's just consider the main take away: sequential and batched writes are the ideal write pattern for spinning disks and SSDs alike.

Yes, that is pretty much the key for good performance.

So ideally, samples for the same series would be stored sequentially so we can just scan through them with as few reads as possible. On top, we only need to know where this sequence starts to access all data points.

There's obviously a strong tension between the ideal pattern for writing collected data to disk and the layout that would be significantly more efficient for serving queries. It is the fundamental problem our TSDB has to solve.

Yes, being able to write efficiently and to query efficiently are two very different problems that you need to handle, and often you need to choose which one you'll favor.

We create one file per time series that contains all of its samples in sequential order.

What?! No, you can't do that. At least, you can't do that and get reasonable behavior. To start with, even though you are batching writes, you are actually guaranteeing that your write pattern will be random. Why is that?

Because if you are writing about a KB at a time per file (which is what they do), you are basically ensuring that the OS will have to write to different sectors / pages on the physical drive. So if you have writes to 100 series, you are going to be writing to 100 different on-disk locations. To make things worse, you aren't writing in 4KB increments, which basically means that you are doing buffered writes. That information isn't in the post, but it is a safe assumption, since if they weren't doing buffered writes, there is no way that the operating system would be able to keep up with this kind of load. That in turn means that you don't have any durability whatsoever.

In fact, this is mentioned explicitly, but only in the case of losing the writes made in the application buffer. I'm assuming that they either don't care or have a different way of avoiding / dealing with data loss / corruption in the case of machine failure. More to the point, given that they do compression, they need to be able to recover from partially written data in the files, and from the post, I don't think that they are doing that.

But those issues are just the start. On Windows, you don't typically worry about the number of open files you have, but on Linux, there is typically a limit on the number of open files you can have; it is common for that to be around 64K max open files. Using a file per series means that you are very likely going to run into issues with the number of open files. In fact, it would be very easy to construct a query that would hit that limit, and that would have a global impact on the server.

Another issue with such a large number of files is that file systems don't really like it when you have so many of them. This is discussed in the post as running out of inodes, but we have also noticed performance degradation in directories with a large number of files on a wide variety of file systems.

When you implement expiration of data, it also means that you have to move data from the middle of the file to the beginning and then truncate it. That leads to a huge amount of additional I/O, likely blocking, and is probably something that an ops person will notice.

Another issue is that because they have a file per series, they need to cache aggressively, leading to competition between the database cache and the operating system page cache. It also means that the application is sensitive to allocation patterns, and on Linux that is a really bad place to be, because of the silly OOM killer.

When querying data that is not cached in memory, the files for queried series are opened and the chunks containing relevant data points are read into memory. If the amount of data exceeds the memory available, Prometheus quits rather ungracefully by getting OOM-killed.

Yep, that was my first thought when I started reading this. I follow the reasoning on why OOM is there, but I still think it is a very silly choice.

The actual solution that they present is to have all the data for a particular time frame (2 hours, in their case) in a block directory. I'm not sure why they have a separate directory per block (with multiple chunks per block) and mmap the whole thing. That is a far better solution, but they also implement compaction, which gets you right back to write amplification, and I don't quite get why. The example they give is that if you have a week-long query, you don't want to merge results across 80 blocks, but I would assume that this is surely better than having to write the same data over and over again.

If the series with IDs 10, 29, and 9 contain the label app="nginx", the inverted index for the label "nginx" is the simple list [10, 29, 9], which can be used to quickly retrieve all series containing the label.

This just screams at me that it is wrong. Oh, not the actual content, but look at the list: it should be [9, 10, 29], because working with sorted data is going to enable so many more interesting scenarios. It seems like that is what they are doing in the new version, so that is good.
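To show why the sorted form matters, here is a small sketch: intersecting two sorted posting lists is a single linear merge, with no hashing or re-sorting needed:

using System.Collections.Generic;

public static class Postings
{
    // Intersect two sorted posting lists (series IDs) in one linear pass.
    public static List<long> Intersect(IReadOnlyList<long> left, IReadOnlyList<long> right)
    {
        var result = new List<long>();
        int i = 0, j = 0;
        while (i < left.Count && j < right.Count)
        {
            if (left[i] == right[j]) { result.Add(left[i]); i++; j++; }
            else if (left[i] < right[j]) i++;
            else j++;
        }
        return result;
    }
}

// Postings.Intersect(new long[] { 9, 10, 29 }, new long[] { 4, 9, 29, 35 }) -> [9, 29]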

I think that I found the right location in the code, and this line scares me. Basically, it means that this will issue an fsync once every 10 seconds, so there is a 10 second window of potential data loss. Given the data that is kept, that is probably acceptable; losing 10 seconds of samples is likely not going to be an issue.

That means that you can basically do all writes to memory all the time, and that gives you a major performance boost all around, at the expense of safety.

This is an interesting enough topic that I’ll do another post, discussing how I’ll implement the same scenario with Voron.

time to read 6 min | 1173 words

The Uber Engineering group has posted a really great blog post about their move from Postgres to MySQL. I mean that quite literally: it is a pleasure to read, especially since they went into such details as the on-disk format and its implications for their performance.

For fun, there is another great post from Uber, about moving from MySQL to Postgres, which also has interesting content.

Go ahead and read both, and we’ll talk when you are done. I want to compare their discussion to what we have been doing.

In general, Uber's issues fall into several broad categories:

  • Secondary indexes cost on write
  • Replication format
  • The page cache vs. buffer pool
  • Connection handling

Secondary indexes

Postgres maintains a secondary index that points directly to the data on disk, while MySQL has a secondary index with another level of indirection. The images show the difference quite clearly:

[images: the Postgres tuple property and MySQL index property diagrams]

I have to admit that this is the first time that I ever considered that the indirection might have any advantage. In most scenarios, it will turn any scan on a secondary index into an O(N * logN) cost, and that can really hurt performance. With Voron, in 4.0 we actually moved from keeping the primary key in the secondary index to keeping the on-disk position, because the performance benefit was so high.

That said, a lot of the pain Uber is feeling has to do with the way Postgres has implemented MVCC. Because it writes new records all the time, it needs to update all indexes, all the time, and after a while it needs to do more work to remove the old version(s) of the record. In contrast, with Voron we don't need to move the record (unless its size changed), and all other indexes can remain unchanged. We do that with copy-on-write and a page translation table, so while we have multiple copies of the same record, they are all in the same "place" logically; it is just the point of view that changes.
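Here is a highly simplified illustration of the idea (not Voron's actual implementation): writers record new page locations in a translation table, and only a commit makes them visible, so readers that started earlier keep seeing the previous copy:

using System.Collections.Generic;

public class PageTranslationTable
{
    // Committed location of each page, by page number.
    private readonly Dictionary<long, long> _committed = new();
    // Copy-on-write locations recorded by the transaction currently writing.
    private readonly Dictionary<long, long> _pendingWrites = new();

    // A writer never overwrites a page in place; it writes a copy elsewhere
    // and records the new location here.
    public void ModifyPage(long pageNumber, long newLocation) => _pendingWrites[pageNumber] = newLocation;

    // Readers resolve through the committed view, the writer sees its own changes,
    // so the "same" page can exist in two places without anyone moving records around.
    public long Resolve(long pageNumber, bool isWriter)
    {
        if (isWriter && _pendingWrites.TryGetValue(pageNumber, out var pending))
            return pending;
        return _committed[pageNumber];
    }

    public void Commit()
    {
        foreach (var kvp in _pendingWrites)
            _committed[kvp.Key] = kvp.Value;
        _pendingWrites.Clear();
    }
}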

From my perspective, that was the simplest thing to implement, and we get to reap the benefit on multiple fronts because of this.

Replication format

Postgres sends the WAL over the wire (simplified, but easier to explain) while MySQL sends commands. When we had to choose how to implement over-the-wire replication with Voron, we considered sending the WAL as well. It is simple to understand, extremely robust, and we already had to write the code for it. Doing replication with it would also let us exercise that code routinely, instead of it only running during rare crash recovery.

However, sending the WAL has issues, because it modifies the data on disk directly, and an issue there can cause severe problems (data corruption, including taking down the whole database). It is also extremely sensitive to versioning issues, and it would be hard if not impossible to make sure that we can support multiple versions replicating to one another. It also means that any change to the on-disk format needs to be considered with distributed versioning in mind.

But what killed it for us was the fact that it is almost impossible to handle the scenario of replacing the master server automatically. In order to handle that, you need to be able to deterministically let the old server know that it is demoted and should accept no writes, and the new server that it can now accept writes and send its WAL onward. But if there is a period of time in which both can accept writes, then you can't really merge the WAL, and trying to is going to be really hard. You can try using distributed consensus to run the WAL, but that is really expensive (about 400 writes / second in our benchmark, which is fine but not great, and imposes a high latency requirement).

So it is better to have a replication format that is more resilient to concurrent divergent work.

OS Page Cache vs Buffer Pool

From the post:

Postgres allows the kernel to automatically cache recently accessed disk data via the page cache. … The problem with this design is that accessing data via the page cache is actually somewhat expensive compared to accessing RSS memory. To look up data from disk, the Postgres process issues lseek(2) and read(2) system calls to locate the data. Each of these system calls incurs a context switch, which is more expensive than accessing data from main memory. … By comparison, the InnoDB storage engine implements its own LRU in something it calls the InnoDB buffer pool. This is logically similar to the Linux page cache but implemented in userspace. While significantly more complicated than Postgres’s design…

So Postgres relies on the OS page cache, while InnoDB implements its own. But the problem isn't with relying on the OS page cache; the problem is how you rely on it. The way Postgres does that is by issuing (quite a lot of, it seems) system calls to read the data. And yes, that would be expensive.

On the other hand, InnoDB needs to do the same work as the OS, with less information, and quite a bit of complex code, but it means that it doesn’t need to do so many system calls, and can be faster.

Voron, on the gripping hand, relies on the OS page cache to do the heavy lifting, but generally issues very few system calls. That is because Voron memory maps the data, so accessing it is usually just a pointer dereference; the OS page cache makes sure that the relevant data is in memory and everyone is happy. In fact, because we memory map the data, we don't have to manage buffers for system calls or do data copies; we can just serve the data directly. This ends up being the cheapest option by far.

Connection handling

Spawning a process per connection is something that I haven't really seen since the CGI days. It seems pretty harsh to me, but it is probably nice to be able to kill a connection with a kill -9, I guess. Thread per connection is also something that you don't generally see. The common situation today, and what we do with RavenDB, is to have a pool of threads that all manage multiple connections at the same time, often interleaving execution of different connections using async/await on the same thread for better performance.
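For contrast, here is a minimal sketch of that model (not RavenDB's networking code, just the general shape): a handful of pool threads multiplex many connections, because every await hands the thread back to the pool:

using System.Net;
using System.Net.Sockets;
using System.Threading.Tasks;

public static class ConnectionLoop
{
    public static async Task RunAsync(int port)
    {
        var listener = new TcpListener(IPAddress.Loopback, port);
        listener.Start();
        while (true)
        {
            var client = await listener.AcceptTcpClientAsync();
            _ = HandleAsync(client); // no process or thread per connection
        }
    }

    private static async Task HandleAsync(TcpClient client)
    {
        using (client)
        {
            var buffer = new byte[4096];
            var stream = client.GetStream();
            int read;
            while ((read = await stream.ReadAsync(buffer, 0, buffer.Length)) > 0)
            {
                await stream.WriteAsync(buffer, 0, read); // echo, just to have something to do
            }
        }
    }
}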

time to read 3 min | 526 words

This post is in reply to Rob's post; go ahead and read it, I'll wait.

My answer to Rob's post can be summarized in a single word:

In particular, this statement:

it is impossible to be a good .NET developer. To work in a development shop with a team is to continually cater for the lowest common denominator of that team and the vast majority of software shops using .NET have a whole lot of lowest common denominator to choose their bad development decisions for.

Um, nope. That only applies to places that are going for the lowest common denominator. To go from there to all .NET shops is quite misleading. I could give our own example, of building a high performance database in managed code, which has very little lowest-common-denominator anything anywhere, but that would actually be too easy.

Looking at the landscape, I can see quite a lot of people doing quite a lot of interesting things at the bleeding edge. Now, it may be that this blog attracts a self-selecting crowd, but when you issue statements like "you can't be a good .NET developer", that is a pretty big statement to stand behind.

Personally, I think that I'm a pretty good developer, and while I dislike the term "XYZ developer", I do 99% of my coding in C#.

Now, some shops have different metrics: they care about predictability of results, so they will be extremely conservative in their tooling and language usage, the old "they can't handle that, so we can't use it" approach. This has nothing to do with the platform you are using, and everything to do with the culture of the company you are at.

I can certainly find good reasons for that behavior, by the way. When your typical product lifespan is measured in 5 – 10 years, you have a very different mindset than if you aim at most a year or two out. Making decisions on brand new stuff is dangerous; we lost a lot when we decided to use Silverlight, for example. And the decision to go with CoreCLR for RavenDB was made with an explicit back-off strategy in case that sank too.

Looking at the kinds of directions that people leave .NET for, traditionally it was the green, green hills of Rails, then it was Node.js, now I think it is Elixir, although I'm not really paying attention. That means that in the time a .NET developer (assuming that they are investing in themselves and continuously learning) invested in their platform and learned how to make it work properly, the person who left for greener pastures has had to learn multiple new frameworks and platforms. If you think that this doesn't have an impact on productivity, you are kidding yourself.

The reason you see backlash against certain changes (project.json coming, going and then doing disappearing acts worthy of Houdini) is that there is value in all of that experience.

Sure, sometimes change is worth it, but it needs to be measured against its costs. And sometimes those are non-trivial.
