Ayende @ Rahien

Hi!
My name is Oren Eini
Founder of Hibernating Rhinos LTD and RavenDB.
You can reach me by phone or email:

ayende@ayende.com

+972 52-548-6969




RavenDB Conference videos: Should I use a document database?

time to read 1 min | 196 words

In this talk from the RavenDB conference, Elemar Júnior is talking about the differences between relational and document databases, and how you can utilize RavenDB for best effect.

I’ll hint that the answer to the question in the title is: Yes, RavenDB.

For the last 40 years or so, we have used relational databases successfully in nearly all business contexts and systems of nearly all sizes. If you feel no pain using an RDBMS, you can stay with it. But if you constantly have to work around your RDBMS to get your job done, a document oriented database might be worth a look.

RavenDB is a 2nd generation document database that allows you to write a data-access layer with much more freedom and far fewer constraints. If you have to work with large volumes of data, thousands of queries per second, unstructured/semi-structured data, or event sourcing, you will find RavenDB particularly rewarding.

In this talk we will explore some document database usage scenarios. I will share some data modeling techniques and architectural criteria to help you decide where RavenDB is a safe and fitting choice.

RavenDB Conference videos: Know Thy Costs

time to read 1 min | 131 words

In this talk from the RavenDB conference, Federico Lois is discussing the kind of performance work and optimizations that goes into RavenDB.

Performance happens. Whether you designed for it or not doesn't matter; she is always invited to the party (and you'd better find her in a good mood). Knowing the cost of every operation, and how it is distributed over every subsystem, will ensure that when you are building that proof-of-concept (that always ends up in production) or designing the latest enterprise-grade application, you will know where those pesky performance bugs like to live. In this session, we will go deep into the inner workings of every performance-sensitive subsystem. From the relative safety of the client to the binary world of Voron.

Low level Voron optimizations: High data locality

time to read 3 min | 592 words

After talking about increasing the Voron page size, let us talk about another very important optimization: high data locality. The importance of locality comes up again and again in performance. The cost of getting the next bit of data can be so prohibitively expensive that it dominates everything else, including standard Big O time complexity metrics. A fun discussion of that is here.

Remember that Voron stores all data in pages, which means that it needs some way to allocate new pages. By default, whenever you allocate a page, we use a page from the end of the file. In certain scenarios (pure sequential inserts), that generates a pretty good allocation pattern, but even there it can cause issues. Let us consider what the database file looks like after a while:

[Image: a data file where pages of the same B+Tree (green) are scattered throughout the file]

Imagine that the green sections are all pages that belong to the same B+Tree inside Voron. Traversing the B+Tree now means that we have a very high probability of having to jump around in the file a lot. Since we are memory mapped, we wouldn’t typically feel this, since we aren’t actually hitting the disk that often, but it has several interesting implications:

  • Startup time can increase significantly, since we need to issue many I/O requests to different places in the file
  • Flush / sync time is also increased, because it needs to touch more of the disk

Trees are typically used for indexes in Voron, and a document collection would typically have a few different storage indexes (lookup by etag, lookup by name, etc). Because they store different data, they have different growth patterns, so they are going to allocate pages at different rates, which means that the scattering of the pages across the data file is even more severe.

The change we just finished implementing is going to do several important things all at once:

  • Pages for all the storage indexes of a collection are going to be pre-allocated, and when they run out, more will be allocated in batches.
  • The indexes will ask the storage to allocate pages nearby the sibling page, to increase locality even further.
  • All indexes will use the same pre-allocation buffer, so they all reside in roughly the same place (see the sketch below).
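
To make the idea concrete, here is a minimal sketch of batched pre-allocation. The type names and the batch size are assumptions for illustration, not the actual Voron implementation:

    using System;
    using System.Collections.Generic;

    // A toy allocator: each collection reserves contiguous runs of pages and hands
    // them out one at a time, so pages of all its trees end up next to each other.
    class CollectionPageAllocator
    {
        private readonly Queue<long> _reserved = new Queue<long>();
        private readonly Func<int, long> _reserveRange; // reserves N pages, returns the first page number
        private const int BatchSize = 32;               // assumed batch size, for illustration

        public CollectionPageAllocator(Func<int, long> reserveRange)
        {
            _reserveRange = reserveRange;
        }

        public long AllocatePage()
        {
            if (_reserved.Count == 0)
            {
                long first = _reserveRange(BatchSize); // e.g. extend the file by one contiguous run
                for (int i = 0; i < BatchSize; i++)
                    _reserved.Enqueue(first + i);
            }
            return _reserved.Dequeue(); // consecutive calls get adjacent pages
        }
    }

Because all the indexes of a collection draw from the same allocator, their pages cluster in the same region of the file even though the trees grow at different rates.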

That also gives us some really interesting optimization opportunities. Since indexes are typically orders of magnitude smaller than the data they cover, it is possible to ask the operating system to prefetch the sections that we reserved for indexes for each collection in advance, leading to far less paging in the future and improving the startup time.
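
As a sketch of what such a prefetch hint can look like at the OS level on Windows (on Linux, madvise with MADV_WILLNEED plays the same role); this illustrates the mechanism, not necessarily what Voron actually does:

    using System;
    using System.Runtime.InteropServices;

    static class PrefetchHint
    {
        [StructLayout(LayoutKind.Sequential)]
        struct WIN32_MEMORY_RANGE_ENTRY
        {
            public IntPtr VirtualAddress;
            public UIntPtr NumberOfBytes;
        }

        [DllImport("kernel32.dll")]
        static extern IntPtr GetCurrentProcess();

        [DllImport("kernel32.dll", SetLastError = true)]
        static extern bool PrefetchVirtualMemory(IntPtr process, UIntPtr numberOfEntries,
            WIN32_MEMORY_RANGE_ENTRY[] entries, uint flags);

        // Ask the OS to start paging in a mapped region (e.g. the reserved index sections).
        public static void Prefetch(IntPtr mappedAddress, long sizeInBytes)
        {
            var ranges = new[]
            {
                new WIN32_MEMORY_RANGE_ENTRY
                {
                    VirtualAddress = mappedAddress,
                    NumberOfBytes = (UIntPtr)(ulong)sizeInBytes
                }
            };
            PrefetchVirtualMemory(GetCurrentProcess(), (UIntPtr)1, ranges, 0); // best-effort hint
        }
    }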

It also means that the operating system can issue a lot more sequential reads and writes, which is perfectly in line with what we want.

The new allocation strategy ends up looking like this:

[Image: the data file under the new strategy, with each collection's pages grouped into pre-allocated sections]

In this case, we have enough data to fill the first pre-allocated section, and then we allocate a new one. So instead of 4 operations to load things, we can do this in 2.

Even without explicit prefetching on our end, this is going to be great because the operating system is going to be able to recognize the pattern of access and optimize the access itself.

RavenDB Conference videos: Lessons from the Trenches

time to read 1 min | 94 words

In this talk from the RavenDB conference, Dan Bishop is talking about lessons learned from running RavenDB in production for a very long time.

It's easy, fun, and simple to get a prototype application built with RavenDB, but what happens when you get to the point of shipping v1.0 into Production? Many of the subtle decisions made during development can have undesirable consequences in Production. In this session, Dan Bishop will explore some of the pain points that arise when building, deploying, and supporting enterprise-grade applications with RavenDB.

RavenDB Conference videos: RavenDB Embedded at Massive Scales

time to read 1 min | 117 words

In this talk from the RavenDB conference, Rodrigo Rosauro is talking about deploying RavenDB at massive scale, to over 36,000 locations and a total number of machines exceeding half a million.

One particular (and often forgotten) use case for RavenDB is its usage as an embedded database. This operation mode allows application providers to abstract the complexity of database administration away from their end users while, at the same time, providing a fully functional document store.

During this talk we will explore the challenges faced while deploying RavenDB in a massive number of machines throughout the globe (aiming at hundreds of thousands), and how RavenDB improved the capabilities of our application.

Low level Voron optimizations: The page size bump

time to read 5 min | 864 words

Explaining the use of pages seems to be one of those things that is either hit or miss for me. Either people just get it, or they struggle with the concept. I have written extensively on this particular topic, so I'll refer you to that post for the details of what exactly pages in a database are.

Voron is currently using 4KB pages. That is pretty much the default setting, since everything else also works in units of 4KB, which means that we play nice with requirements for alignment, CPU page sizes, etc. However, 4KB is pretty small, and that leads to trees with higher depth. And the depth of the tree is one of the major concerns for database performance (the deeper the tree, the more I/O we have to do).

We previously tested using different page sizes (8KB, 16KB and 32KB), and we saw that our performance decreased as a result. That was surprising and completely contrary to our expectations. But a short investigation revealed the problem. Whenever you modify a value, you dirty up the entire page. That means that we would need to write that entire page back to storage (which means making a bigger write to the journal, then applying a bigger write to the data file, etc).

In effect, when increasing the page size to 8KB, we also doubled the amount of I/O that we had to deal with. That was a while ago, and we recently implemented journal diffing as a way to reduce the amount of unnecessary data that we write to disk. A side effect of that is that we no longer have a 1:1 correlation between a dirty page and a full page write to disk. That opened up the path to increasing the page sizes. There is still an O(PageSize) cost to doing the actual diffing, of course, but that is a memory-to-memory cost, negligible compared to the saved I/O.
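
To illustrate the idea behind diffing (a naive byte-level sketch under the assumption that both copies are the same page size, not Voron's actual diffing code): instead of writing the whole dirty page to the journal, record just the runs of bytes that changed.

    using System;
    using System.Collections.Generic;

    static class PageDiff
    {
        // Compare the clean copy of a page with the dirty one and return the changed runs.
        public static List<(int Offset, byte[] Data)> Diff(byte[] original, byte[] modified)
        {
            var changes = new List<(int, byte[])>();
            int i = 0;
            while (i < modified.Length)
            {
                if (original[i] == modified[i]) { i++; continue; }
                int start = i;
                while (i < modified.Length && original[i] != modified[i])
                    i++;
                changes.Add((start, modified[start..i])); // one contiguous changed run
            }
            return changes;
        }
    }

Only the changed runs (plus their offsets) need to go to the journal, so an 8KB page where a single value was touched costs roughly the same journal write as before.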

Actually making the change was both harder and easier than expected. The hard part was that we had to do a major refactoring to split a shared value. Both the journal and the rest of Voron used the notion of Page Size, but while we wanted the page size of Voron to change, we didn't want the journal write size to change. That led to a lot of frustration where we had to go over the entire codebase, look at each value, and figure out whether it meant the size of writes to the journal or the size of pages as used in the rest of Voron. I've got another post scheduled talking about how you can generate intentional compilation errors to make this kind of change easier.

Once we were past the journal issue, the rest was mostly dealing with places that made silent assumptions about the page size. That can be anything from "the max value we allow here is 512 (because we need to fit at least that many entries in a page)" to tests that wrote 1,000 values and expected the resulting B+Tree to be of a certain depth.

The results are encouraging, and we can see them mostly in the system behavior with very large data sets. Those used to generate very deep trees, and this change reduced their depth significantly. To give some context, let us assume that we can fit 100 entries per page using 4KB pages.

That means that if we have as little as 2.5 million entries, we’ll have (in the ideal case):

  • 1 root page holding 3 entries
  • 3 branch pages holding 250 entries
  • 250 branch pages holding 25,000 entries
  • 25,000 leaf pages holding the 2.5 million entries

With 8 KB pages, we’ll have:

  • 1 root page holding 63 entries
  • 63 branch pages holding 12,500 entries
  • 12,500 leaf pages holding the 2.5 million entries

That is a reduction of a full level. The nice thing about B+Trees is that in both cases the branch pages are very few and usually already reside in main memory, so you aren't directly paying for their I/O.

What we are paying for is the search on them.

The cost of searching the 4KB tree is:

  • O(log2 of 3) for searching the root page
  • O(log2 of 100) for searching the first branch page
  • O(log2 of 100) for searching the second branch page
  • O(log2 of 100) for searching the leaf page

In other words, about 22 operations. For the 8 KB pages, that would be:

  • O(log2 of 63) for searching the root page
  • O(log2 of 200) for searching the branch page
  • O(log2 of 200) for searching the leaf page

It comes to about 21 operations, which doesn't seem like much of a saving, but a lot of our time goes into key comparisons, so anything helps, and we also touch one page less on every lookup.
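
A quick back-of-the-envelope check of those depths (a standalone sketch, not Voron code):

    using System;

    static int TreeDepth(long entries, int entriesPerPage)
    {
        int depth = 1;                                                // the leaf level
        long pages = (entries + entriesPerPage - 1) / entriesPerPage; // number of leaf pages
        while (pages > 1)
        {
            depth++;                                                  // another branch/root level on top
            pages = (pages + entriesPerPage - 1) / entriesPerPage;
        }
        return depth;
    }

    Console.WriteLine(TreeDepth(2_500_000, 100)); // 4 levels with 4KB pages
    Console.WriteLine(TreeDepth(2_500_000, 200)); // 3 levels with 8KB pages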

However, note that I said the situation above is the ideal one. It can only happen if the data was inserted sequentially, which it usually isn't. Page splits can cause the tree depth to increase very easily (in fact, that is one of the core reasons why non-sequential keys are so strongly discouraged in pretty much all databases).

But the larger page size allows us to pack many more entries into a single page, and that also reduces the risk of page splits significantly.

RavenDB Conference videos: RavenDB for the Modern Web Developer

time to read 1 min | 87 words

In this talk from the RavenDB conference, Judah is discussing using RavenDB for web development, including how to postpone the robot apocalypse by finding some love for young robots lost in the JavaScript maze.

Modern web development is uniquely fast-paced; it demands rapid development and responsiveness to change. But our databases have been stuck in the 1970s, with rigid schemas and antiquated query languages. Enter RavenDB: flexible, fast, designed for the 21st century. It's the perfect side to your web development dish.

RavenDB Conference videos: Introducing RavenDB 4.0

time to read 1 min | 101 words

About 8 months ago we had the RavenDB conference. We recorded the sessions, but we had… technical difficulties with the videos and their quality. The details don't actually matter, and we'll do better next time. I'll be posting the videos over the next week or so.

This video is all about RavenDB 4.0: the design and thinking behind it, plus a lot of low level details on what is going on inside RavenDB 4.0. Warning: it is close to 2 hours in length, goes pretty deep, and covers a lot of material.

RavenDB Conference videos: Introducing RavenDB 3.5

time to read 1 min | 114 words

About 8 months ago we had the RavenDB conference. We recorded the sessions, but we had… technical difficulties with the videos and their quality. The details don't actually matter, and we'll do better next time. I'll be posting the videos over the next week or so.

This video shows off RavenDB 3.5 in all its glory; you can see all the cool new features.

From the new consensus-based clustering to active data exploration, RavenDB 3.5 contains quite a lot of new features, improvements, and fixes. In this keynote Oren Eini will showcase RavenDB 3.5's new features, including SLAs, I/O monitoring, improved performance and stability, smarter replication, and more.

The crash at the Unicode text

time to read 4 min | 629 words

So a customer of ours has been putting RavenDB 4.0 through its paces. They have a large database with some pretty nasty indexes, and they reported that when importing their existing data, they managed to crash RavenDB. The problem was probably some interaction of indexing and documents, and it was pretty consistent.

That was lucky, because I was able to narrow down the problem to the following document.

[Image: the problematic document, containing an array of items with titles]

Literally just a week before getting this bug report, we found a bug with multi-byte Unicode characters (we were counting them in chars, but using that to index into a byte array), so that was our very first suspicion. In fact, reviewing the relevant portions of the code, we were able to identify another location with a similar problem and resolve it.

That didn’t solve the problem, however. I kept banging my head against the Unicode issue, until I finally just gave up and went to tracing the code very carefully. And then I found it.

The underlying issue was that the data here is part of an array inside a document, and we had an index that looked like this:

[Image: the index definition]

The idea is that we gather all the titles from the items in the action and allow searching on them. And somehow, the title from the second entry above was always pointing to invalid memory.
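
Since the image didn't survive, here is a guess at the general shape of such an index; the class and property names are hypothetical, not the customer's actual definition:

    using System.Linq;
    using Raven.Client.Documents.Indexes;

    public class ActionDoc
    {
        public ItemEntry[] Items { get; set; }
    }

    public class ItemEntry
    {
        public string Title { get; set; }
    }

    // Gathers the title of every item in the action, so users can search by any of them.
    public class Actions_ByItemTitle : AbstractIndexCreationTask<ActionDoc>
    {
        public Actions_ByItemTitle()
        {
            Map = actions => from action in actions
                             from item in action.Items
                             select new { item.Title };
        }
    }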

It took a bunch more investigation before I found out the root cause.

Here is where we found the problem:

[Image: the code where the problem was found]

But it is going to take a bit of explaining first. The second title is pretty long: when using UTF8, it is 176 bytes, which is long enough for the blittable format to try to compress it, and it was indeed successfully compressed to a smaller size.

So far, so good, and a pretty awesome feature that we have. But the problem was that when we index a document, we need to access its properties, and if a property is compressed, we need to decompress it somewhere. And at the time, using the native temp buffer seemed perfect; after all, it was exactly the kind of thing it was meant for.

But in this case, a bit further down the document we had another compressed field, one that was even larger, so we again asked for a temp buffer. Since this is a temp buffer, and the expectation is that its usage will be short-lived, we returned the old buffer and allocated a new one. However, we still maintained a reference to the old buffer, and when we tried to index that title, it spat out garbage (because the memory was reused) or just crashed, depending on how strict we set the memory protection mechanism.
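
Reduced to a runnable miniature, the failure mode looks something like this (an illustration of the pattern, not the actual RavenDB code):

    using System;
    using System.Runtime.InteropServices;

    class TempBufferBug
    {
        static IntPtr _temp = IntPtr.Zero;
        static int _size;

        // Returns the shared scratch buffer, replacing (and freeing) it if it is too small.
        static IntPtr GetTempBuffer(int size)
        {
            if (size > _size)
            {
                if (_temp != IntPtr.Zero)
                    Marshal.FreeHGlobal(_temp); // anyone still holding the old pointer now dangles
                _temp = Marshal.AllocHGlobal(size);
                _size = size;
            }
            return _temp;
        }

        static void Main()
        {
            IntPtr title = GetTempBuffer(176);    // the decompressed title lives here
            IntPtr payload = GetTempBuffer(4096); // a larger field evicts and frees the buffer
            // 'title' now points into freed memory; reading it gives garbage or an access violation.
            Console.WriteLine(title == payload ? "same buffer" : "title is dangling");
        }
    }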

It was actually something that we had already figured out and fixed, but we hadn't yet merged that branch, so it was quite annoying when we found the root cause.

We'll also be removing the temp buffer concept entirely; instead we'll have a buffer that you can check out and return, not something that can vanish behind your back.
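
The checkout/return shape might look something like this (an assumed sketch of the design, not the actual API):

    using System.Collections.Generic;

    class ScratchBufferPool
    {
        private readonly Stack<byte[]> _free = new Stack<byte[]>();

        // A checked-out buffer belongs to the caller until it is explicitly returned,
        // so it cannot be silently freed or reused behind the caller's back.
        public byte[] Checkout(int minSize)
        {
            if (_free.Count > 0 && _free.Peek().Length >= minSize)
                return _free.Pop();
            return new byte[minSize];
        }

        public void Return(byte[] buffer)
        {
            _free.Push(buffer);
        }
    }

With this design, two concurrent decompressions simply hold two different buffers; the second checkout cannot invalidate the first.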

But I have to say, that bug was certainly misleading in the extreme. I was sure it had to do with Unicode handling.
