Ayende @ Rahien

Hi!
My name is Oren Eini
Founder of Hibernating Rhinos LTD and RavenDB.
You can reach me by phone or email:

ayende@ayende.com

+972 52-548-6969

RavenDB Conference videos: Know Thy Costs

time to read 1 min | 131 words

In this talk from the RavenDB conference, Federico Lois is discussing the kind of performance work and optimizations that go into RavenDB.

Performance happens. Whether you designed for it or not, it doesn't matter; she is always invited to the party (and you better find her in a good mood). Knowing the cost of every operation, and how it distributes over every subsystem, will ensure that when you are building that proof-of-concept (that always ends up in production) or designing the latest enterprise-grade application, you will know where those pesky performance bugs like to inhabit. In this session, we will go deep into the inner workings of every performance sensitive subsystem. From the relative safety of the client to the binary world of Voron.

Low level Voron optimizations: High data locality

time to read 3 min | 592 words

After talking about increasing the Voron page size, let us talk about another very important optimization: high data locality. The importance of locality comes up again and again in performance. The cost of getting the next bit of data can be so prohibitively expensive that it dominates everything else, including standard Big O time complexity metrics. A fun discussion of that is here.

Remember that Voron actually stores all data in pages, and that means that it needs some way to allocate new pages. By default, whenever you allocate a page, we use a page from the end of the file. In certain scenarios (pure sequential inserts), that generates a pretty good allocation pattern, but even there it can cause issues. Let us consider what the database file looks like after a while:

[Image: data file layout, with pages belonging to the same B+Tree (green) scattered across the file]

Imagine that the green sections are all pages that belong to the same B+Tree inside Voron. Traversing the B+Tree now means that we have a very high probability of having to jump around in the file a lot. Since we are memory mapped, we wouldn’t typically feel this, since we aren’t actually hitting the disk that often, but it has several interesting implications:

  • Startup time can increase rapidly, since we need to issue many I/O requests to different places in the file
  • Flush / sync time is also increased, because it needs to touch more of the disk

Trees are typically used for indexes in Voron, and a document collection would typically have a few different storage indexes (lookup by etag, lookup by name, etc.). Because they store different data, they have different growth patterns, so they are going to allocate pages at different rates, which means that the scattering of the pages across the data file is even more severe.

The change we just finished implementing is going to do several important things all at once:

  • Pages for all the storage indexes of a collection are going to be pre-allocated, and when they run out, more will be allocated in batches.
  • The indexes will ask the storage to allocate pages near the sibling page, to increase locality even further.
  • All indexes will use the same pre-allocation buffer, so they all reside in roughly the same place (see the sketch below).
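
To make that concrete, here is a minimal sketch of the idea, assuming a hypothetical CollectionPageAllocator type (the names and structure are illustrative only, not the actual Voron implementation): all indexes of a collection draw pages from a shared, batch-allocated section, and each request prefers the free page closest to a sibling page.

using System;
using System.Collections.Generic;

// Hypothetical sketch only - not the actual Voron allocator. It illustrates the
// strategy described above: reserve pages in batches, share the reserved section
// between all indexes of a collection, and hand out the free page closest to a
// given sibling page to keep related pages together in the file.
public class CollectionPageAllocator
{
    private const int BatchSize = 128; // pages reserved per pre-allocation batch
    private readonly SortedSet<long> _freePages = new SortedSet<long>();
    private readonly Func<int, long> _allocateSectionAtEndOfFile;

    public CollectionPageAllocator(Func<int, long> allocateSectionAtEndOfFile)
    {
        _allocateSectionAtEndOfFile = allocateSectionAtEndOfFile;
    }

    public long AllocatePage(long nearbyPage)
    {
        if (_freePages.Count == 0)
        {
            // Ran out of pre-allocated pages - reserve another contiguous batch.
            long start = _allocateSectionAtEndOfFile(BatchSize);
            for (int i = 0; i < BatchSize; i++)
                _freePages.Add(start + i);
        }

        // Prefer the free page closest to the sibling, so B+Tree traversals
        // stay within a small region of the data file.
        long best = -1;
        foreach (var page in _freePages)
        {
            if (best == -1 || Math.Abs(page - nearbyPage) < Math.Abs(best - nearbyPage))
                best = page;
        }

        _freePages.Remove(best);
        return best;
    }

    public void FreePage(long page) => _freePages.Add(page);
}

The real implementation obviously has to deal with transactions, concurrency and free space reuse, but the locality-driven choice of "closest free page to a sibling" is the heart of the idea.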

That also gives us some really interesting optimization opportunities. Since indexes are typically an order of magnitude smaller than the data they cover, it is possible to ask the operating system to prefetch the sections that we reserved for indexes for each collection in advance, leading to far less paging in the future and improving the startup time.

It also means that the operating system can issue a lot more contiguous reads and writes, which is perfectly in line with what we want.
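
On Windows, such a prefetch hint for a memory-mapped region can be issued with PrefetchVirtualMemory (on Linux the equivalent hint would be madvise with MADV_WILLNEED). The snippet below is only an illustration of the technique, not the actual RavenDB code:

using System;
using System.Diagnostics;
using System.Runtime.InteropServices;

// Illustration only (not the actual RavenDB code): hint Windows to prefetch the
// memory-mapped section reserved for a collection's indexes. On Linux the
// equivalent hint would be madvise(MADV_WILLNEED) on the same range.
public static class IndexSectionPrefetcher
{
    [StructLayout(LayoutKind.Sequential)]
    private struct WIN32_MEMORY_RANGE_ENTRY
    {
        public IntPtr VirtualAddress;
        public UIntPtr NumberOfBytes;
    }

    [DllImport("kernel32.dll", SetLastError = true)]
    private static extern bool PrefetchVirtualMemory(
        IntPtr hProcess,
        UIntPtr numberOfEntries,
        WIN32_MEMORY_RANGE_ENTRY[] virtualAddresses,
        uint flags);

    public static void Prefetch(IntPtr mappedAddress, long sizeInBytes)
    {
        var ranges = new[]
        {
            new WIN32_MEMORY_RANGE_ENTRY
            {
                VirtualAddress = mappedAddress,
                NumberOfBytes = (UIntPtr)(ulong)sizeInBytes
            }
        };

        // Best effort: if the call fails we simply fall back to normal demand paging.
        PrefetchVirtualMemory(Process.GetCurrentProcess().Handle,
                              (UIntPtr)(uint)ranges.Length, ranges, 0);
    }
}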

The new allocation strategy ends up looking like this:

[Image: data file layout under the new allocation strategy, with index pages grouped into pre-allocated sections]

In this case, we have enough data to fill the first pre-allocated section, and then we allocate a new one. So instead of 4 operations to load things, we can do this in 2.

Even without explicit prefetching on our end, this is going to be great, because the operating system is going to be able to recognize the access pattern and optimize the access itself.

RavenDB Conference videos: Lessons from the Trenches

time to read 1 min | 94 words

In this talk from the RavenDB conference, Dan Bishop is talking about lessons learned from running RavenDB in production for a very long time.

It's easy, fun, and simple to get a prototype application built with RavenDB, but what happens when you get to the point of shipping v1.0 into Production? Many of the subtle decisions made during development can have undesirable consequences in Production. In this session, Dan Bishop will explore some of the pain points that arise when building, deploying, and supporting enterprise-grade applications with RavenDB.

Leaning on the compiler with intentional compilation errors in refactoring

time to read 2 min | 393 words

Recently I had to make a major and subtle change in our codebase. We have a heavily used object that does something (allows us to read structured data from a buffer). That object is created at least once per document load, and it is very widely used. It also has a very short lifetime, and it is passed around a lot between various methods. The problem we have is that we don't want to pay the GC cycles for this object, since that takes away from real performance.

So we want to make it a struct, but at the same time, we don't want to pass it by value to methods, because it isn't a small struct. The solution was to make sure that it is a struct that is always passed by reference. But I did mention that it is widely used, right? Just changing it to a struct is going to have a high cost of copying it, and we want to make sure that we don't trade one cost for another.

The problem is that there is no difference in the code between passing a class argument to a function and passing a struct argument. So the idea is that we'll lean on the compiler. We'll rename the object to ObjStruct, and then start going over each usage scenario, fixing it in turn. At the same time, because of things like "var" automatic variables and the lack of ref returns in the current version of C#, we make sure that we change a method like this:

TableValueReader GetCurrentReaderFor(IIterator it);

Into:

void GetCurrentReaderFor(IIterator it, out TableValueReader result);

And that means that the callers of this method will break as well, and so on and so forth until we reach all usage locations. That isn’t a particularly fun activity, but it is pretty straightforward to do, and it allows you to make large scale changes easily and safely.
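
To make the mechanics clearer, here is roughly what a call site looks like before and after (illustrative code only; the variable names and surrounding code are made up, not taken from the RavenDB codebase):

// Before the change: the reader is returned by value, so once the type becomes
// a struct, this line would silently copy the (not so small) struct, and "var"
// hides the fact that a copy is being made.
var reader = GetCurrentReaderFor(it);

// After the change: the out parameter forces a compilation error at every call
// site, so each one is visited and fixed deliberately instead of copying silently.
TableValueReader reader;
GetCurrentReaderFor(it, out reader);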

Note that this requires two very important features:

  • Reasonable compilation timeframes – you probably wouldn’t use this approach in C++ if your build time was measured in multiple minutes
  • An actual compiler – I don’t know how you do large scale refactoring in dynamic languages. Tests can help, but the feedback cycle is much faster and more straightforward when you deliberately let the compiler know what you want it to do.

RavenDB Conference videos: RavenDB Embedded at Massive Scales

time to read 1 min | 117 words

In this talk from the RavenDB conference, Rodrigo Rosauro is talking about deploying RavenDB at massive scale, to over 36,000 locations and a total number of machines that exceeds half a million.

One particular (and often forgotten) use-case for RavenDB is its usage as an embedded database. This operation mode allows application providers to abstract the complexity of database administration from their end-users while, at the same time, providing a fully functional document store.

During this talk we will explore the challenges faced while deploying RavenDB in a massive number of machines throughout the globe (aiming at hundreds of thousands), and how RavenDB improved the capabilities of our application.

Low level Voron optimizations: The page size bump

time to read 5 min | 864 words

Explaining the usage of pages seems to be one of those things that is either hit or miss for me. Either people just get it, or they struggle with the concept. I have written extensively on this particular topic, so I'll refer you to that post for the details on what exactly pages in a database are.

Voron is currently using 4KB pages. That is pretty much the default setting, since everything else also works in units of 4KB. That means that we play nice with requirements for alignment, CPU page sizes, etc. However, 4KB is pretty small, and that leads to trees with higher depth. And the depth of the tree is one of the major reasons for concern for database performance (the deeper the tree, the more I/O we have to do).

We previously tested using different page sizes (8KB, 16KB and 32KB), and we saw that our performance decreased as a result. That was surprising and completely contrary to our expectations. But a short investigation revealed what the problem was. Whenever you modify a value, you dirty up the entire page. That means that we would need to write that entire page back to storage (which means making a bigger write to the journal, then applying a bigger write to the data file, etc.).

In effect, when increasing the page size to 8KB, we also doubled the amount of I/O that we had to deal with. That was a while ago, and we recently implemented journal diffing as a way to reduce the amount of unnecessary data that we write to disk. A side effect of that is that we no longer have a 1:1 correlation between a dirty page and a full page write to disk. That opened up the path to increasing the page size. There is still an O(PageSize) cost to doing the actual diffing, of course, but that is a memory-to-memory cost and negligible compared to the saved I/O.
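
The idea behind the diffing is simple: compare the dirty page against its previous version and write only the ranges that actually changed to the journal. The toy sketch below shows the concept; the actual Voron diff format and implementation are different.

using System;
using System.Collections.Generic;

// Toy illustration of journal diffing (not the actual Voron format): scan the
// old and new copies of a page and collect only the byte ranges that changed,
// so the journal write is proportional to the change, not to the page size.
// Assumes both copies have the same length (one page).
public static class PageDiff
{
    public static List<(int Offset, byte[] Data)> Diff(byte[] oldPage, byte[] newPage)
    {
        var changes = new List<(int Offset, byte[] Data)>();
        int i = 0;
        while (i < newPage.Length)
        {
            if (oldPage[i] == newPage[i])
            {
                i++;
                continue;
            }

            int start = i;
            while (i < newPage.Length && oldPage[i] != newPage[i])
                i++;

            var modified = new byte[i - start];
            Array.Copy(newPage, start, modified, 0, modified.Length);
            changes.Add((start, modified));
        }
        return changes;
    }
}

Note that the scan itself is still O(PageSize), which is the memory-to-memory cost mentioned above; the savings are purely on the I/O side.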

Actually making the change was both harder and easier than expected. The hard part was that we had to do a major refactoring to split a shared value. Both the journal and the rest of Voron used the notion of Page Size. But while we wanted the page size of Voron to change, we didn't want the journal write size to change. That led to a lot of frustration where we had to go over the entire codebase and look at each value and figure out whether it meant writes to the journal, or pages as they are used in the rest of Voron. I've got another post scheduled talking about how you can generate intentional compilation errors to make it easy to figure this out.

Once we were past the journal issue, the rest was mostly dealing with places that made silent assumptions about the page size. That can be anything from "the max value we allow here is 512 (because we need to fit at least so many entries in)" to tests that wrote 1,000 values and expected the resulting B+Tree to be of a certain depth.

The results are encouraging, and we can see them mostly in the system behavior with very large data sets. Those used to generate very deep trees, and this change reduced the depth significantly. To give some context, let us assume that we can fit 100 entries per page using 4KB pages.

That means that if we have as little as 2.5 million entries, we’ll have (in the ideal case):

  • 1 root page holding 3 entries
  • 3 branch pages holding 250 entries
  • 25,000 leaf pages holding the 2.5 million entries

With 8 KB pages, we’ll have:

  • 1 root page holding 63 entries
  • 12,500 leaf pages holding 2.5 million entries

That is a reduction of a full level. The nice thing about B+Trees is that in both cases, the branch pages are very few and usually reside in main memory already, so you aren't directly paying for their I/O.

What we are paying for is the search on them.

The cost of searching the 4KB tree is:

  • O(log2 of 3) for searching the root page
  • O(log2 of 100) for searching the relevant branch page
  • O(log2 of 100) for searching the leaf page

In other words, about 16 operations. For the 8 KB page, that would be:

  • O(log2 of 63) for searching the root page
  • O(log2 of 200) for searching the leaf page

It comes to 14 operations, which doesn't seem like a lot, but a lot of our time goes to key comparisons, so anything helps.
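
The arithmetic is simply the sum of the binary search cost (about log2 of the number of entries) in each page that has to be searched. A quick sanity check of the figures above, under the same assumptions of 100 entries per 4KB page and 200 per 8KB page:

using System;

// Quick sanity check of the comparison counts quoted above: a binary search in
// a page with N entries costs roughly log2(N) key comparisons per page visited.
class SearchCostCheck
{
    static void Main()
    {
        double fourKb  = Math.Log(3, 2) + Math.Log(100, 2) + Math.Log(100, 2);
        double eightKb = Math.Log(63, 2) + Math.Log(200, 2);

        Console.WriteLine($"4KB pages: ~{fourKb:F1} comparisons");  // ~14.9, roughly the 16 quoted above
        Console.WriteLine($"8KB pages: ~{eightKb:F1} comparisons"); // ~13.6, roughly the 14 quoted above
    }
}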

However, note that I said that the situation above was the ideal one; this can only happen if the data was inserted sequentially, which it usually isn't. Page splits can cause the tree depth to increase very easily (in fact, that is one of the core reasons why non sequential keys are so strongly discouraged in pretty much all databases).

But the larger page size allows us to pack many more entries into a single page, and that also reduces the risk of page splits significantly.

RavenDB Conference videos: RavenDB for the Modern Web Developer

time to read 1 min | 87 words

In this talk from the RavenDB conference, Judah is discussing using RavenDB for web development, including how to postpone the robot apocalypse by finding some love for young robots lost in the JavaScript maze.

Modern web development is uniquely fast-paced, demanding rapid development and responsiveness to change. But our databases have been stuck in the 1970s with rigid schemas and antiquated query languages. Enter RavenDB: flexible, fast, designed for the 21st century. It's the perfect side to your web development dish.

Why you should avoid graceful error handling like the plague that it is

time to read 3 min | 536 words

A while ago I was reviewing a pull request by a team member and I realized that I was looking at an attempt to negotiate graceful termination of a connection between two nodes. In particular, the code in question was invoked when one node was shutting down or had to tear down the connection for whatever reason.

That code was thrown out, and it made a very graceful arc all the way to the recycle bin.

But why? The underlying reason for this was to avoid needless error messages in the logs, which can trigger support calls and cost time & effort to figure out what is going on. That is an admirable goal, but at the same time, it is a false hope and a dangerous one at that.

Let us consider what it means that a node is shutting down. It means that it now needs to notify all its peers about this. It is no longer enough to just tear down all connections; it needs to talk to them, and that means that we have introduced network delays into the shutdown procedure. It also means that we now have to deal with error handling when we are trying to notify a peer that this node is shutting down, and that way lies madness.

On the other hand, we have the other node, which needs to handle its peer getting up in the middle of the conversation and saying "I'm going away now" mid sentence. For that matter, since the shutdown signal (which is the common case for this to be triggered) can happen at any time, we now need thread safety on shutdown so we can send a legible message to the other side, and the other side must be ready to accept the shutdown message at any time. (A "Do you have any new documents for me" request that expects a "There are N messages for you" response now also needs to handle a "G'dbye world" notification.)

Doing this properly complicates the code at every level, and you still need to handle the rude shutdown scenario.

Furthermore, what is the other side supposed to do with the information that this node is shutting down the connection voluntarily? Is it supposed to not connect to it again? If so, what policy should it use to decide if the other side is down for valid reasons or actually unavailable?

Assuming that there is actually a reason why there is a TCP connection between the two nodes, any interruption in service, for whatever reason, is not a valid state.

And if we ensure that we are always ending the connection in the same rude manner, we also gain a very valuable feature: we make sure that the error handling portion of the code gets exercised on a regular basis, so if there are any issues there, they will be discovered easily.

As for the original issue of reducing support calls because of transient / resolved errors: that can be solved by not logging the error immediately, but waiting a bit to verify that the situation actually warrants writing to the operations log (writing to the info log should obviously happen regardless).
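
A sketch of what that deferred escalation can look like, assuming a hypothetical DeferredErrorReporter helper (this is not RavenDB's actual logging code; the names and the grace period are made up for illustration):

using System;
using System.Threading.Tasks;

// Sketch only: log the failure to the info log immediately, but write to the
// operations log (the one that triggers support calls) only if the problem
// still exists after a grace period. Names and policy are illustrative.
public class DeferredErrorReporter
{
    private readonly TimeSpan _gracePeriod = TimeSpan.FromSeconds(15);

    public void ReportConnectionFailure(string peer, Exception error, Func<bool> stillFailing)
    {
        Console.WriteLine($"INFO: connection to {peer} failed: {error.Message}");

        Task.Delay(_gracePeriod).ContinueWith(_ =>
        {
            // If the peer came back in the meantime (restart, transient network
            // hiccup, etc.), there is nothing worth waking anyone up for.
            if (stillFailing())
                Console.WriteLine($"OPS: connection to {peer} still down after {_gracePeriod}: {error}");
        });
    }
}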

RavenDB Conference videos: Introducing RavenDB 4.0

time to read 1 min | 101 words

About 8 months ago we had the RavenDB conference. We recorded the sessions, but we had… technical difficulties with getting the videos out and with their quality. The details don't actually matter, and we'll do better next time. I'll be posting them for the next week or so.

This video is all about RavenDB 4.0, the design and thinking behind it, and a lot of low level details about what is going on with RavenDB 4.0. Warning: this is close to 2 hours in length, it goes pretty deep, and it covers a lot of material.

What facts should be resident in a developer’s head?

time to read 2 min | 374 words

I just finished reading this reddit thread about a guy that cheated to get a job (the original post doesn’t really shed a good light on anyone there) and the comments related to that.

That got me thinking about our own interview process and what kind of things I expect people to have in their head, for interviews, but also in general. In particular, this comment:

Honestly I wouldn't hire someone who couldn't code a search or sort algorithm on a whiteboard and tell me how efficient or inefficient it is.

Another example I heard of at a conference is literally examining people on Knuth. To the point where if you didn't know chapter & verse for Oscillating Sort (literally, in what chapter it is covered), for example, you are obviously worthless and there is no point in continuing with the interview process.

For myself, I don’t think that I bother keeping any actual details about algorithms in my head. Much more important is to know about the algorithms, and to be able to have just enough information to recognize that it might be a viable option for my use case and then read up on the details.

In interviews, and in their work, I would expect people to know about big O complexity and be able to tell what kind of complexity a specific piece of code has, as well as whether that is good or not.

As for implementing a sorting algorithm, I sort of remember the details of how quicksort works, for example. But actually writing that from scratch? There are far too many details for this to actually be a viable option:

  • Lomuto or Hoare partitioning?
  • What about choice of pivot?
  • How about parallelism?
  • How do you handle repeated elements?

I just pulled those from the wikipedia page, mostly because I remembered this post about how to answer such interview questions.

Implementation details shouldn't matter (beyond understanding costs & complexities), so they shouldn't take up valuable mental space in your head; that is why we can search for them. The important thing is that you know that you need to look up additional information, as well as know how to do so.
