Ayende @ Rahien

Hi!
My name is Oren Eini
Founder of Hibernating Rhinos LTD and RavenDB.
You can reach me by phone or email:

ayende@ayende.com

+972 52-548-6969

, @ Q c

Posts: 6,279 | Comments: 46,759

filter by tags archive

Protocol design implications: REST vs. TCP

time to read 3 min | 444 words

I was going over design documents today, and I noticed some common themes in the changes that we have between RavenDB 3.5 and RavenDB 4.0.

With RavenDB 3.5 (and all previous versions), we always had the communication layer as HTTP REST calls between nodes. When I designed RavenDB, REST was the thing to do, and it is reflected in the design of RavenDB itself. However, 8 years later, we sat down and considered whatever this is really appropriate for everything. The answer was a resounding no. In fact, while over 95% of RavenDB is still pure REST calls, we have moved certain key functions to using TCP directly.

Note that this goes in directly contrast to this post of mine from 2012: Why TCP is evil and HTTP is king.

The concerns in this post are still valid, but we have found that there are a few major reasons why we want to switch to TCP for certain stuff. In particular, the basic approach is that the a client will communicate with the server using HTTP calls, but servers communicate with one another using TCP. The great thing about TCP is that it is a stream oriented protocol, so I don’t need to carry state with me on every call.

With HTTP, each call is stateless, and I can’t assume anything about the other side. That means that I need to send the state, manage the state on the other side, and have to deal with potential issues such as concurrency in the same conversation, restarts of one side that the other side can’t easily detect, repeated validation on each call, etc.

With TCP, on the other hand, I can make a lot of assumptions about the conversation. I have state that I can carry between calls to the other side, and as long as the TCP connection is opened, I can assume that it is valid. For example, if I need to know what is the last item I sent to the remote end, I can query that at the beginning of the TCP connection, as part of the handshake, and then I can just assume that what I sent to the other side has arrived (since otherwise I’ll eventually get an error, requiring me to create a new TCP connection and do another handshake). On the other side, I can verify the integrity of a connection once, without requiring me to repeatedly verify our mutual state on each and every message being passed.

This has drastically simplified a lot of code on both the sending and receiving ends, and reduced the number of network roundtrips by a significant amount.

Getting the design ready for production troubleshooting

time to read 2 min | 339 words

The following is an excerpt from a design document for a major feature in RavenDB 4.0 that I’m currently reviewing, written by Tal.

One of the major problems when debugging such issues in production is the fact that most of the interesting information resides in memory and goes away when the server restarts, the sad thing is that the first thing an admin will do when having issues with the server is to recycle it, giving us very little to work with. Yes, we have logs, but debug level logs are very expensive and usually are not enabled in production (nor should they), we already have the ability to turn logs on, on a production system which is a great option but not enough. The root cause of a raft problem usually resides in the past so unless we have logs from the beginning of time there is not much use for them. The suggested solution is a persistent log for important events that indicate that things went south.

This is based on our experience (and frustration) from diagnosing production issues. By the time the admin see something is wrong, the problem already occurred, and in the process of handling the problem, the admin will typically focus on fixing it, rather than figuring out what exactly is going on.

Those kind of features, focusing explicitly on giving us enough information to find the root cause of the issue has been an on going effort for us. Yesterday they enabled us to get a debug package from a customer (a zip file that the server can generate with a lot of important information), go through it and figure out exactly what the problem was (the customer was running in 32 bits mode and running into virtual memory exhaustion) in one support roundtrip, rather than having to go back and forth multiple times to try to get a bunch of different data points to figure out the issue.

Also, go and read Release It, it has a huge impact on actual system design.

Business features vs. technical features

time to read 4 min | 613 words

A feature is something that your application/service does. Usually we don’t give it a lot more thought, but I recently had an interesting discussion about the exact distinctions between a business feature and a technical feature.

Let us imagine that we are talking about an application that allow to send snail mail, we already seen it before. A user will call the API and then a few days later a physical letter will show up at your door. So far, it is pretty simple. The question is, what can you offer in addition to expand the business.

For example, we might offer:

  • Mail tracking – providing a way to ensure that the recipient actually got the letter.
  • Snail mail to email – getting a physical email, and having that sent to the customer.

Those two are obvious extensions to the core business, and from the point of view of the business, it is great. From a technical perspective, that is a whole lot of complexity. You need to integrate with FedEx to handle the mail tracking, and you need to setup some sort of an automated system that will sort the mail, scan it and upload it to the customer’s account.

The problem is that at this point, you don’t really know what kind of reaction those features are going to have. They are both non trivial and in some cases require major capital expenditure to implement and are pretty hard to properly size upfront.

So you split it. Instead of doing this as a single feature, you have a business feature and a technical feature. A business feature means that your business offers this service, building that requires research to show that we can actually offer that, check whatever there are legal ramifications (some mail can be sensitive, privacy concerns, etc), check what kind of pricing we can charge, etc.  The technical feature is actually implementing all of that.

But the key observation here is that you don’t actually do the technical implementation, at least not just yet. You do the work around the business end of the feature, and then you announce this feature availability. As in, right now you can track the snail mail, or right now you can get your mail scanned and uploaded. This is done with minimal technical work in the backend, and with the caveat that this still experimental and pricing might change.

This isn’t cheating, mind you. Once you announced this feature, you wait to see what kind of reaction we’ll have. One of the options is that users will really love this feature, and start immediately using it. In this case, you have a good problem, people are flocking to give you money. In the meantime, you have Joe and Samantha, from the local high school working for minimum wage in the afternoon to manually do the work. So you can complete the customer expectations, as you are now working to complete the technical side and automate the whole thing (firing Joe and Samantha along the way).

The key here is that you don’t have to do any major upfront investment, in development or in facilities, before you can have this feature. And most of the time, even if it is a major feature, the ramp up time is enough for you to have a pretty good idea about what you actually need to do. And in the meantime, you have a micro service architecture, it is just that the services aren’t called FedExTrackingService and ScanAndSortPhysicalMailService but Joe and Samantha.

In other words, you have mechanical Turk the feature until you can teach you system to properly play chess.

HTTP benchmark and pipelining

time to read 29 min | 5795 words

Here is an interesting problem. If you want to load test a server, it is very hard to truly to do so. Put simply, after a while, the problem isn’t with your code, it is with the ability of the surrounding systems to actually get the requests to you fast enough.

In this case, let us talk about what is going on when you are actually doing an HTTP request.

We’ll start from the following code:

image

Seems pretty simple, right? And all we need to do is to actually send enough of those and we’ll be able to put enough load on the server to matter, right? Except that it doesn’t quite works like this. Let us see what the code above is actually doing by stripping away the HTTP later and dropping down to TCP, shall we?

image

 

And that looks good, right? Except that it is still hiding some details. I’m too lazy to go down to raw sockets and demonstrate the details, and anyway it would be way too much code to show here.

Here is a diagram that demonstrate what is going over the network for the two code sample above:

+---------+                      +---------+
| Client  |                      | Server  |
+---------+                      +---------+
     |                                |
     | [SYN]                          |
     |------------------------------->|
     |                                |
     |                      [SYN-ACK] |
     |<-------------------------------|
     |                                |
     | [SYN]                          |
     |------------------------------->|
     |                                | -----------------------------\
     |                                |-| Connection now established |
     |                                | |----------------------------|
     |                                |
     | [GET / HTTP 1.1]               |
     |------------------------------->|
     |                                | -------------------\
     |                                |-| The HTTP request |
     |                                | |------------------|
     |                                |
     |      [HTTP/1.1 302 Found ... ] |
     |<-------------------------------|
     |                                | --------------------\
     |                                |-| The HTTP response |
     |                                | |-------------------|
     |                                | -----------------------------------\
     |                                |-| Client now will close connection |
     |                                | |----------------------------------|
     |                                |
     | FIN                            |
     |------------------------------->|
     |                                |
     |                            ACK |
     |<-------------------------------|
     |                                |
     |                            FIN |
     |<-------------------------------|
     |                                |
     | ACK                            |
     |------------------------------->|
     |                                |

Note that this is for the simplest case, assuming that the response is just one packet, assume no packet drops, and ignore stuff like HTTPS, which adds another 4 packets to the initialization, and we are also accounting for the last 4 packets that are required to properly close a connection. This is important, because if you are trying to do high load benchmark, creating and not properly closing TCP connections means that you’ll soon run out of available ports (all your connections will be in CLOSE_WAIT or TIME_WAIT state).

Now, the problem is that this is really expensive. As in, wow expensive. So pretty much as soon as the web started to hit it off (mid 90s or so), people realized that this isn’t going to work, and the notion of Keep-Alive was born.

With Keep-Alive, you are going to reuse the same TCP connection to send multiple requests to the server. The idea is that once the connection is open, there is a strong likelihood that you’ll use it again soon, so why pay the 7 packets cost for opening & closing the TCP connection?

With that optimization, we then have:

+---------+                      +---------+
| Client  |                      | Server  |
+---------+                      +---------+
     |                                |
     | [SYN]                          |
     |------------------------------->|
     |                                |
     |                      [SYN-ACK] |
     |<-------------------------------|
     |                                |
     | [SYN]                          |
     |------------------------------->|
     |                                | -----------------------------\
     |                                |-| Connection now established |
     |                                | |----------------------------|
     |                                |
     | [GET / HTTP 1.1]               |
     |------------------------------->|
     |                                | -------------------\
     |                                |-| The HTTP request |
     |                                | |------------------|
     |                                |
     |      [HTTP/1.1 302 Found ... ] |
     |<-------------------------------|
     |                                | --------------------\
     |                                |-| The HTTP response |
     |                                | |-------------------|
     |                                |
     | [GET /index HTTP 1.1]          |
     |------------------------------->|
     |                                | -------------------\
     |                                |-| 2nd HTTP request |
     |                                | |------------------|
     |                                |
     |           [HTTP/1.1 200  ... ] |
     |<-------------------------------|
     |                                | --------------------\
     |                                |-| 2nd HTTP response |
     |                                | |-------------------|
     |                                | -----------------------------------\
     |                                |-| Client now will close connection |
     |                                | |----------------------------------|
     |                                |
     | FIN                            |
     |------------------------------->|
     |                                |
     |                            ACK |
     |<-------------------------------|
     |                                |
     |                            FIN |
     |<-------------------------------|
     |                                |
     | ACK                            |
     |------------------------------->|
     |                                |

And the more requests we make to the server, the better we are. Now, there is another trick that we can apply here. Remember that TCP is stream oriented, not packet oriented. That means that as far as the calling code is concerned, we aren’t actually seeing packets, just bytes arriving one after another.

So we can change the way things work to this:

+---------+                                                     +---------+
| Client  |                                                     | Server  |
+---------+                                                     +---------+
     |                                                               |
     | [SYN]                                                         |
     |-------------------------------------------------------------->|
     |                                                               |
     |                                                     [SYN-ACK] |
     |<--------------------------------------------------------------|
     |                                                               |
     | [SYN]                                                         |
     |-------------------------------------------------------------->|
     |                                                               | -----------------------------\
     |                                                               |-| Connection now established |
     |                                                               | |----------------------------|
     |                                                               |
     | [GET / HTTP 1.1, GET /data HTTP 1.1, GET /fast HTTP 1.1]      |
     |-------------------------------------------------------------->|
     |                                                               | -------------------------------------\
     |                                                               |-| 3 HTTP requests in a single apcket |
     |                                                               | |------------------------------------|
     |                                                               |
     |              [HTTP/1.1 302 Found ..., HTTP/1.1 200, HTTP 403] |
     |<--------------------------------------------------------------|
     |                                                               | ----------------------------------\
     |                                                               |-| All HTTP response in one packet |
     |                                                               | |---------------------------------|
     |                                                               | -----------------------------------\
     |                                                               |-| Client now will close connection |
     |                                                               | |----------------------------------|
     |                                                               |
     | FIN                                                           |
     |-------------------------------------------------------------->|
     |                                                               |
     |                                                           ACK |
     |<--------------------------------------------------------------|
     |                                                               |
     |                                                           FIN |
     |<--------------------------------------------------------------|
     |                                                               |
     | ACK                                                           |
     |-------------------------------------------------------------->|
     |                                                               |

What we did is pretty simple. Instead of waiting for the server to respond to the request, and only then reuse the connection to send the next one, we can send the requests immediately one after the other, without waiting.

In some cases, we can even package multiple requests into a single TCP packet. And the server (shouldn’t) care about that.

Here is what this looks like in practice:

Now, naïve server code will fail here, because it will read from the socket into a buffer, (including some part of the next request), and then forget about that.  But it isn’t hard to make sure that this work properly, and that is the key for all high performance servers.

Basically, the real problem is driving enough packets into the server to generate load. By pipelining requests like that, we reduce the number of packets we need to send while at the same time getting a lot higher load.

The cost of routing a packet is independent of its size, and while the size you send is important for bandwidth, the packet latency is much more important for actual speed (latency vs. bandwidth, again). So if we can pack the data into fewer packets, this is a net win. In other words, this is HTTP doing car pooling.

 

Image result for car pool lane

And now that you can drive enough requests into your server to actually stress it, you can work your way into actually handling this load.

Optimizing read transaction startup timeRacy data structures

time to read 5 min | 849 words

Finding the appropriate image for this post was hard, you try searching for “racy pictures” in Google Image Search, but you might not want to do it from work Smile.

Anyway, today at lunch we had a discussion about abstractions and at what level you should be working. The talk centered about the difference between working in low level C and working with a high level framework like C# and the relative productivity associated with it.

At one point the following argument was raised: “Well, consider the fact that you never need to implement List, for example”. To which my reaction was: “I did just that last week”.

Now, to forestall the nitpickers, pretty much any C developer will have an existing library of common data structures already in place, I know. And no, you shouldn’t be implementing basic data structures unless you have a really good reason.

In my case, I think I did. The issue is very simple. I need to have a collection of items that are safe for multi threaded reads, but they are mostly only ever accessed from a single thread, and are only ever modified by a single thread. Oh, and they are also extremely performance sensitive.

The reason we started looking into replacing them is that the concurrent data structures that we were using (ConcurrentDictionary & ConcurrentStack, in those cases) were too expensive. And a large part of that was because they gave us a lot more than what we actually needed (fully concurrent access).

So, how do we build a simple list that allow for the following:

  • Only one thread can write.
  • Multiple threads can read.
  • No synchronization on either end.
  • Stale reads allowed.

The key part here is the fact that we allow stale reads.

Here is the concrete scenario, we need to track all active transactions. A transaction is single threaded, but we allow thread hopping (because of async). So we define:

image

And then we have:

image

DynamicArray is just a holder for an array of Nodes. Whenever we need to add an item to the active transactions, we’ll get the local thread value, and do a linear search through the array. If we find a node that has a null Transaction value, we’ll use it. Otherwise, we’ll add a new Node value to the end of the array. If we run out of room in the array, we’ll double the array size.  All pretty standard stuff, so far. Removing a value from the array is also simple, all you need to do is to null the Transaction field on the relevant node.

Why all of this?

Well, only a single thread can ever register a transaction for a particular DynamicArray instance. That means that we don’t have to worry about concurrency here. However, we do need to worry about transactions that need to remove themselves from the list from other threads. That is why we don’t have any concurrency control here. Instead, removing the transaction is done by setting the node’s Transaction field to null. Since only the owning transaction can do that, this is safe.

Other threads, however, need to read this information. They do that by scanning through all the thread values, and then accessing the DynamicArray directly. Now, that means that we need to be safe for concurrent reading. This is done by having the array more or less static on most scenarios. After it get full enough, it will never grow again, and the values will remain there, so effectively other threads will be reading an array of Nodes. We do need to be careful when we expand the array to give more room. We do this by first creating the new array, copying the values to the new array, and only then setting it in the threaded instance.

This way, concurrent code may either see the old array or the new one, but never need to traverse both. And when traversing, it goes through the nodes and check their Transaction value.

Remember that the Transaction is only being set from the original thread, but can be reset from another thread if the transaction moved between threads. We don’t really care, the way it works, we read the node’s transaction field, and then check its value (once we have a stable reference). The idea is that we don’t worry about data races. The worst that can happen is that we’ll see an older view of the data, which is perfectly fine for our purposes.

This is pretty complex, but the code itself is simple enough, and the performance benefit justify it several times over.

RavenDB RetrospectiveThe governors

time to read 5 min | 860 words

imageRavenDB’s core philosophy is that It Just Works and that means that we try very hard to get things right. Conversely, that means that we are also trying to make it hard to do the wrong thing. Basically, we want to push you hard into the pit of success.

Part of that approach is what we call the governors. It is a set of features that will detect and abort known bad behavioral patterns.  I have already talked about Unbounded Result Sets and I recently run into this post, which shows how nasty a problem that can be, and how invisible.

Another governor we have in place is the session’s maximum request limit. A session is meant to be a scope, it has a very short duration and is typically used for a single request / processing a single message, etc. It is supposed to live as long as the business transaction. Because the session is scoped, we can reason that a single session that is making a lot of database operation is probably doing something pretty bad.

For example, it might be calling the database in a loop. Those kind of issues can be truly insidious. Let us look at the following code (taken from here):

image

image

This kind of thing is a silent performance killer. No one is likely to see this is happening, and it will silently increase the number of database operations that your application make, leading to increased DB load, higher page load times and all sort of problems associated with it.

In one particular case, I saw a single page load generate 17,000 queries to the database. The software in question grew over time, and people assumed that this was just it took to run the software. Their database server was a true monster (this was about a decade ago), with dedicated RAM disks, high CPU count and a truly ridiculous amount of memory. Just to explain, we are talking about something like this:

image

But a decade ago, and it had a quite a bit of space. Now, this kind of beasty can do 500K IOPS (I’m drooling just thinking about it), but it is damn expensive. Just to put things in perspective, I spent several weeks at that company working on this particular problem, the cost of those weeks of work didn’t even cover the cost of the drive on that machine.

And on that monster, we were seeing page load times in the tens of seconds, and extremely high system load. I was able to bring it down to about 70 queries per page load, and their database server has pretty much idled ever since (IIRC, they turn that machine into a VM host for all the rest of their software, actually).

This is something that can bite.

To avoid that, we have the max numbers of requests in the session, which will abort excessive amount of database chatter. This have two important effects:

  • It follow the “better let one bad request die rather than take down the entire application”.
  • It put a budget on the number of calls that you can make.

Now, that budget is actually really interesting. Because we have it, we need to think about how we can reduce the number of database calls that we have to process the request. That led to a whole bunch of features around that. Lazy requests, includes and transformers to name just a few.

That had a positive unintended consequence. RavenDB is fast,  really fast, but it is also typically deployed as a network database, that means that each database call actually go over the network, and we all remember our fallacies, right?

image

In our profiling, we found that most often, the real cost in a RavenDB application was the back & forth chatter with the database. Reducing the number of requests we make to the server has an immediate benefit. And RavenDB allows you to do that by pipelining requests with Lazy, predicting requests with Includes or running the whole thing on the server side with Transformers.

And, like all governors, you can control it, RavenDB allows you to decide what the limit should be (on that particular session or globally based on your actual needs and environment.

RavenDB RetrospectiveExplicit indexes & auto indexes

time to read 3 min | 492 words

imageRavenDB doesn’t provide any way for queries to do table scans*.

* That isn’t actually true, we have Data Exploration, which does just that, but we don’t provide an explicit API for it, and it is a DBA driven feature (I wanna get this report with a minimum of fuss without regards to how much it is going to cost me) than an API that is exposed.

What this means is that the cost of query operations in RavenDB is always going to be O(logN), instead of O(N). How does this relate to the topic of RavenDB retrospectives?

One of the things that I kept seeing over and over as a database consultant was that databases are complex, and that it is easy to write a query that works perfectly fine for a period of time, then fall over completely as the size of the data goes over a certain threshold. In particular, queries that use table scans are particularly vulnerable for this issue.

One of the design goals for RavenDB was to avoid that, completely. We did it by simply forbidding any query that doesn’t have an index. initially, that was a pretty annoying requirement, because every time that you needed a new query, you needed to go ahead and create an index. But early on we got the Auto Indexes feature.

Basically, it means that when you can query RavenDB without specifying which index you want to use, at which point the query optimizer will inspect the query and decide which index can serve it. The most interesting point here is that if there isn’t an index that can serve this query, the query optimizer is going to create one on the fly. See the previous post about BASE indexes and how we can afford to do that.

The fun part here is that the query optimizer is actually learning over time, and it will shape its indexes to best fit the kind of queries you are doing. It also makes RavenDB much more robust for New Version Degradation effects. NVD is what happens when you push a new version out, which have slightly different queries, which make previously used indexes ineffective, forcing all your queries to become full table scans. Here is an example of the kind of subtle issues that this can cause. With RavenDB, when you use auto indexes (in other words, when you don’t explicitly state which index to use), the query optimizer will take care of that, and it will create all the appropriate indexes (and retire the unused ones)  for you.

This in particular is a feature that I’m really proud of, it require very little from the user to work with, and it gets the Right Thing Done.

RavenDB RetrospectiveBASE Indexes

time to read 7 min | 1257 words

RavenDB was designed from the get go with ACID documents store, and BASE indexes. ACID stands for Atomic, Consistent, Isolated, Durable, and BASE stands for Basically Available, Soft state, Eventually consistent.

That design had been conceived by twin competing needs. First, and obvious, a database should never lose data. Second, we want to ensure that the system remains responsive even under load. It is quite common to have spike in production traffic, and we wanted to be able to be able to handle it with better aplomb.

In particular, the kind of promises that are made by RavenDB queries allow us to perform quite a few performance optimizations. In databases that require that all indexes will be up to date on transaction commit, you’ll find that there is a very high cost to adding indexes to the system, because each additional index means additional work is needed at query time. It also makes things such as aggregating indexes (map/reduce, in RavenDB terms) a lot harder to build.

By having BASE indexes, we gain the ability to batch multiple writes into a single index update operation. It also allows us to defer writing the indexes to the disk, avoiding costly I/O operations. But most importantly, by changing the kind of promise that we give to users, we are able to avoid a lot of locks, complexity and hardship inside RavenDB. This may seems like a small thing, but this is actually quite important. Take a look at this study:

image

In fact, there are a lot of studies on the overhead of locking in database systems, and that has been a hot research topic for many years. By choosing a different architecture, we can avoid a lot of those costs and complexities.

So far, that was the explanation from the point of view of the database creator. What about the users?

Here the tradeoff is more nuanced. On the one hand, there is a certain level of complexity that people have to deal with the notion that queries on just inserted data might not include it (stale queries), on the other hand, it means that queries are consistently faster and we can handle spikes in traffic and load much more easily and consistently.

But it is a mental model that can be hard to follow, even when you are familiar with it. Probably the most common issue with RavenDB’s BASE indexes is the case of Post / Redirect / Get. Let us look at how this may play out:

In here, we actually have two requests, one that adds a new order to the system, and the other that fetch the details. If you have redirected to the new order page, everything is going to work as expected, and you won’t notice anything even if the indexes are stale at the time of the request. But a pretty common scenario is to add the new order, and then go and look at the list of orders for this customer, and if the index didn’t have the chance to update between those two requests (which typically happen very quickly) then the customer will not see the new order.

That particular scenario is responsible for the vast majority the pain we have seen from our users around BASE indexes.

Now, one of the great things about BASE indexes is that the user get to choose whatever they want to wait for the up to date results or whatever they want whatever is there right now. And we have had mechanisms to control this at a very granular level (including options for personal consistency control, so different customers will have different waits depending on their own previous behavior). But we have found that this is something that puts a lot of responsibility on the developer to control the flow on their users on their applications.

So in RavenDB 3.5 we have changed things a bit. Now, instead of processing the write requests as soon as possible, you can ask for the server to wait until the relevant indexes has processed:

image

In other words, when you call SaveChanges, it will wait until the indexes has been updated, so when you return from the call, you can be certain that the results of any future queries will include all the changes on that transaction. This moves the responsibility to the  write side and make such scenarios much easier to handle.

Given all of that, and our experience with RavenDB for the past 8 years or so, we spiked how it would look like with ACID indexes, at least for certain things. The problem is that this pretty much takes out of the equation a lot of the power and flexibility that we get from Lucene (more on why you can’t do that in Lucene in a bit) and force us to offer what are essentially B+Tree indexes. Those are so limited that we would have to offer:

  • B+Tree indexes – ACID (simple property / range queries). With different indexes needed for different queries and ordering options.
  • Lucene indexes – BASE, full text, spatial, facets, etc queries. Much more flexible and easy to use.
  • Map/reduce indexes – BASE (because you aren’t going to run the full map/reduce during the original transaction).

The problem is that then we would have continuous burden of explaining when to use which index type, and how to deal with the different limitations. It will also make it much more complex if you have a query that can use multiple indexes, and there are problems associated with creating new ACID indexes on live systems. So it would generate a lot of confusion and complexity to users, for fairly small benefit that we can address already with the “wait on save” option.

As for why we can’t do it all via Lucene anyway, the problem is that this wouldn’t be sustainable. Lucene isn’t really meant for individual operations, it shines when you push large amount of data through it. It also doesn’t really have the facilities to be transactional, we have actually solved that particular problem in RavenDB 4.0, but it was neither pretty nor easy, and it doesn’t alleviate the issue of “we do best in large batches”. RavenDB’s BASE indexes are actually designed to take advantage of that particular aspect. Because under load, we’ll process bigger batches are reap the performance benefits that they bring.

BASE indexes also make for much simpler operations. You can define a new index without fearing locking the database, and it enables scenarios such as side by side indexing to update index definitions without impacting the running system.

Finally, a truly massive benefit of BASE indexes is that they allow us to change the following statement: more indexes means faster reads, slower writes. Fewer indexes means slower reads, faster writes. By movng the actual indexing work to a background task, we let the writes go though as fast as tehy possible can.

Indexes still have a cost, and the more indexes you have, the higher the cost (we still got to do some work here). But in the vast majority of the cases, we can squeeze this kind of work between writes, in times that the database would be idling. 

What that means is that you can have more indexes at the same cost, and that your queries are going to be using those indexes and are going to be fast.

 

Debug & Operations as a feature: Tracking allocations costs

time to read 2 min | 310 words

One of the things that we have learned from supporting RavenDB in production is that you by default, everything is a black box into which you have exactly zero input. And in order to figure out what the problems are, you need to use expert tools (WinDBG or VM MAP for example) that are typically more focused on developers, and not usually available in production.

In RavenDB 4.0, we have started from the get go with the notion that everything we do must be exposed, tracked and monitored. Here is the results of the latest effort in that direction.

image

And:

image

There are several important things here. First, you can see that we are tracking the managed and unmanaged allocations that are happening in the system. More than that, we are now able to track down exactly which part of the system is responsible for that.

In the screenshots above, you can see that the UsageIpAndQuantity index has allocated about 65 MB of unmanaged memory, and that we have a few memory mapped files storing the data for index #3.

The idea is that we can now glance at this endpoint and tell very quickly what is going on. And this is something that can be done in production. In fact, that is something that we’ll expose in the studio so you can see those value change over time.

We are also waiting for the CoreCLR to expose the managed allocations on  a per thread basis, which will give us even better metrics.

Stack arena memory allocation context

time to read 4 min | 705 words

Controlling memory allocations is something that any performant service need to do. In RavenDB 4.0 we have taken that to heart, and most of our memory usage is manual, instead of relying on the GC.

Now, there are a lot of really good reasons to want to avoid manual memory management. We have decades of usage telling us that this is a hard problem, with some nasty issues associated with it. But when we are talking about performance different of orders of magnitude, I’ll accept that burden.

The first thing that we did when making this decision is to see if we can change the way we approach unmanaged memory management to make it easier to avoid all of the common issues. We made several decisions:

  • All native allocations must go through a context.
  • A context is a unit of work that is single threaded.
  • A context is bounded in scope.
  • A context is reusable.

By placing those sort of limits on the native memory usage, we managed to gain a lot of simplicity. To start with, because the context is single threaded, we don’t have to implement an allocation strategy that is thread safe. Instead, we can just allocate a bunch of memory in the context, and parcel the memory from there. This looks like this:

image

The arena has allocated a certain size from the operation system. Allocating from that point is just a matter of bumping the next allocation pointer and handing that memory to the caller.

The nice thing about is that we don’t have the possibility for memory leaks, when the context scope ends, we can reset the next allocation pointer to the start, and we have “freed” all that memory. And we can start using it  again, without having to release it to the OS and get it back again. There are a whole bunch of edge cases (what happens when the arena is completely full, etc, but they are pretty obvious).

Not having to deal with multiple threads also present an interesting optimization. Once we are done with the memory, we can return it to the arena, but instead of maintaining free lists, etc, we do the following. If the returned memory is at the top of the arena (the last thing allocated), then we just bump the next allocation down, and the next allocation can reuse that memory. If that isn’t the case, we’ll do nothing, and rely on the context scope to handle this.

That, in turn, means that if I’m careful to free in the reverse order of allocations, I can get, within a single scope, memory reuse very cheaply. That is important, as we’ll see in a bit.

The context is also scoped to a certain operation, such as processing a request or an index batch. Because of that, even if we need to grow the arena, that is limited to however many operations we have to do inside that scope. And once the scope is over, we aren’t disposing of the context and releasing the memory, instead, we pool them so the next action will be able to use them. So a single OS level allocation is used across many actions, and the memory free cost is just setting a pointer.

But some scope need to do a lot of work, consider an index that is processing a big batch, in order to reduce the amount of memory that is being used, that is where the ability to free the top memory allocation comes in handy, because even inside the same scope, we can get memory reuse.

All in all, the allocation strategy was chosen to play to our strengths, and allow us to avoid doing multi threaded work in the hot paths, it also means that we get quite a lot of memory reuse, both within a scoped context and in different scopes, which reduce the overall amount of memory that we need to allocate. And the really great thing, it doesn’t come with a big pile of time spent in GC.

FUTURE POSTS

No future posts left, oh my!

RECENT SERIES

  1. Implementing low level trie (2):
    14 Dec 2016 - Part II
  2. The performance regression in the optimization (2):
    01 Dec 2016 - Part II
  3. Digging into the CoreCLR (4):
    25 Nov 2016 - Some bashing on the cost of hashing
  4. Making code faster (10):
    24 Nov 2016 - Micro optimizations and parallel work
  5. Optimizing read transaction startup time (7):
    31 Oct 2016 - Racy data structures
View all series

Syndication

Main feed Feed Stats
Comments feed   Comments Feed Stats