Ayende @ Rahien

Hi!
My name is Oren Eini
Founder of Hibernating Rhinos LTD and RavenDB.
You can reach me by email or phone:

ayende@ayende.com

+972 52-548-6969


Big data work on 32 bits

time to read 7 min | 1326 words

Ever put on a pair of shoes that were just a bit too small for you? You think it will be fine, but then time goes by, and it pinches. And then it hurts, and then you just can't walk any longer.

That is how it feels to work with any reasonable amount of data (even low hundreds of MB) in 32 bits. The virtual address space is so limited that it is incredibly easy to run out of it due to address space fragmentation and fail, and there is very little that you can actually do to recover from such a scenario. We have been working on making RavenDB 4.0 work reliably in 32 bits mode*, and it has been PITA after PITA. In particular, Voron was quite explicitly designed for an environment where it is hard to impossible to run out of address space, and our typical modus operandi is to map the entire data file (which can be hundreds of GB in size) as a single contiguous memory region.

* And yes, I can't imagine how much of a PITA it was to run in 16 bits. When I programmed on those sorts of platforms, I never actually used anything that needed oh so much memory (I was in middle school at the time, IIRC).

That allows us to do all sorts of tricks and optimizations, and it is utterly impossible to do on 32 bits. In fact, on most 32 bits systems, after the process has been running for a while, just doing a single 256MB allocation is probably going to fail. You might have that much memory free, but you don't have it in a contiguous run. So we had to make drastic changes, and at the same time, while 32 bits is an important requirement, it is mostly on the sidelines. I want to support it, and I'm willing to put in the effort to get there, but I'm not willing to pay for it if it costs me when running outside of 32 bits mode.

In other words, anything that would have a drastic impact on the "normal" mode of running in 64 bits was out, and we didn't want to re-architect the whole thing. Luckily, we already had the place to put the different behavior. In Voron, the Pager is the entity that is responsible for mapping the file to memory and handing out pointers into the file based on the page number requested. That meant that we had a well defined interface:

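It looks something like this (a simplified sketch with approximate names, not the actual Voron code):

using System;

// Stand-in for Voron's transaction type, just to make the sketch compile.
public class LowLevelTransaction { }

public unsafe abstract class AbstractPager : IDisposable
{
    // Hands out a pointer to the given page. The pointer is valid only
    // for the lifetime of the transaction that asked for it.
    public abstract byte* AcquirePagePointer(LowLevelTransaction tx, long pageNumber);

    // Makes sure the file (and the map) is big enough to contain the requested pages.
    public abstract void EnsureContinuous(long requestedPageNumber, int numberOfPages);

    // Flushes the data file to stable storage.
    public abstract void Sync();

    public abstract void Dispose();
}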

Because the file can grow, and the pager might hold multiple overlapping maps of the same file, we only allow pointers to the data file within the scope of a transaction. That ended up being a very good thing, since it enabled us to build a pager that could work in 32 bits mode. Instead of mapping the entire file, we'll map just the pages that are required, and only for the duration of the transaction, after which we can unmap everything. The idea is that we'll only have the data that we are actually using mapped, so we'll save a lot of address space.

That was the theory, at least. In practice, we ran into several interesting issues:

  • We want to reduce the number of system calls we make, and the allocation granularity in Windows is 64KB, so we always need to map at least that much, instead of mapping one page at a time.
  • A value may actually be longer than 64KB, or not be aligned on a 64KB boundary.
  • Transactions are concurrent, and typically need to access the same areas (the top branches in the B+Trees we use, for example).

The end result is that we map in multiples of 64KB (on both Windows & Linux), and we check, based on the actual data, whether we need to remap when a value extends beyond what is allocated in the current block. We are also sharing all the maps among all the concurrent transactions, to reduce the total amount of virtual address space we are using. This is hard to do concurrently & safely, so we are racing it. In most cases, we'll have a single mapping, but it is fine to map the same section twice from different transactions (there can only ever be a single write transaction, so all the others are just reading, and never from the same memory that the write tx is writing to). The alternative would have been more invasive locking, and the performance cost isn't worth it.
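Concretely, the arithmetic for deciding what to map looks something like this (a sketch, not the actual code):

using System;

static class MapMath
{
    const long AllocationGranularity = 64 * 1024; // Windows allocation granularity

    // Given the byte range a transaction needs, round the start down and the
    // size up so that we always map whole 64KB units covering the range.
    public static (long MapStart, long MapSize) GetMapRange(long offset, long size)
    {
        long start = offset & ~(AllocationGranularity - 1);
        long end = offset + size;
        long mapSize = (end - start + AllocationGranularity - 1)
                       / AllocationGranularity * AllocationGranularity;
        return (start, mapSize);
    }
}

A value that crosses a 64KB boundary simply ends up with a map of two (or more) units.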

Once we got this working, we were most of the way there, but there were still a lot of optimizations that we had to reverse. For example, in 64 bits mode, it makes a lot of sense to pre-allocate data in advance to maintain locality, and as the data we dealt with became larger, so would our pre-allocations. But those could end up being around 32MB of data that we pre-allocate (so we need to prepare and touch all of it), and under load, this was actually one of the more common causes of failures due to fragmentation of the address space, because it would allocate (1MB, 2MB, 4MB, 8MB, 16MB, 32MB), and we had many such operations cutting the address space into fine bits.

Another place that caused us problems was indexing, where we allocated a 16MB range for a bloom filter. That made perfect sense as an optimization in 64 bits, but in 32 bits mode that is a huge chunk of my address space that I'm giving up. In both cases, we drastically reduced the sizes involved (to 64KB in the case of the bloom filter, and a max pre-allocation of 1MB).

Another thing that actually worked in our favor is that our memory usage is already policed quite carefully. We didn't do that intentionally for 32 bits, we did that because memory allocation is so expensive; but because we rarely ask the OS for memory, and because we typically reuse buffers, we were pretty much already there in terms of proper behavior of the system in 32 bits mode. The CLR is also pretty good about allocating large sections from the OS and being able to move memory around to avoid internal fragmentation.

Overall, it works. We have tested it on the Stack Overflow dataset, and it is quite amazing to see (you can watch it chugging along, using anything between 300 MB – 600 MB as it is inserting 60+ GB of data). It is obviously less efficient than the 64 bits version, but given that we are so much faster in 4.0, I don't think that anyone will actually notice, and if they do… Well, that is why we have hardware built in this decade.

There were other places where we had to take the 32 bits nature of the system into account, in which we actively undermined previous optimizations. Instead of allowing transactions to merge as much as possible to reduce I/O, we placed a hard limit on how much data can be accessed or modified by a transaction before we force it to commit. The same goes for indexing: instead of letting batches run as long as we would usually like, address space considerations force us to reduce the batch size based on how much address space is used by the current batch, and close it early (freeing up the address space for other uses).

The overall process is very unpleasant, because things that I would consider obvious and easy are suddenly much harder than I would expect them to be.

Attachments, RavenFS and scoping out the market

time to read 3 min | 490 words

RavenFS is a pretty cool technology. It was designed to handle both very large files over a geographically distributed environment and a large number of small files in a single datacenter. It has some really cool features, such as the ability to run metadata searches, delta replication, etc. And yet, pretty much all our customers are using it primarily as a way to handle a small set of binaries, typically strongly related to the documents. We also got a lot of feedback / worries from customers about the deprecation of attachments.

This post is intended to lay out some of our thoughts regarding this feature. And the idea is that we are going with the market. We are going to merge RavenFS back into RavenDB.

Instead of having files with metadata, we'll reverse things: you'll have documents with attachments. Let us consider the simplest example that I could conceive: users and profile pictures.

You are going to store the user's information in the "users/1" document. And then you need to store the profile pic somewhere. You'll be able to do that by pushing it into RavenDB as an attachment. An attachment is always tied to a specific document, and if the document is deleted, all its attachments will be deleted as well. So in this case, we'll have a "profile.png" attachment on "users/1".
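In code, I expect usage to look roughly like this (a sketch only; store is an already initialized document store, User is a plain class with a Name property, and the final API may well differ):

using System.IO;

using (var session = store.OpenSession())
{
    session.Store(new User { Name = "Arava" }, "users/1");

    using (var profilePic = File.OpenRead("profile.png"))
    {
        // The attachment is tied to the users/1 document; deleting the
        // document will delete the attachment as well.
        session.Advanced.Attachments.Store("users/1", "profile.png", profilePic, "image/png");
        session.SaveChanges();
    }
}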

Of course, you don't have just a single profile picture, you also have a thumbnail of it. So after the user has uploaded their pic and you attached it to the user's document, we'll have an offline process generate the thumbnail and attach it to the document as well.

Documents will have a metadata flag that indicates whether they have attachments, and if they do, the metadata will contain the list of attachments they have. So loading the document will be enough to let you peek at all its attachments; loading an attachment, however, will be a separate operation. You'll always be able to access an attachment directly, naturally. Attachments won't have metadata or the ability to be searched; instead, you can define your indexes on documents, as you normally do, and go from there to the attachments you desire.

Adding / deleting / modifying an attachment will also update the etag of the document it is attached to (since it updates the document metadata). The attachments will receive the same etag as their document at the time of modification, and will be replicated in the same manner. Obviously, only new attachments will be replicated whenever the document is updated. A conflict on an attachment is also a conflict on the document, and will be resolved however the document conflict is resolved.

Because attachments reside in the same location as documents, we can now have a transaction that spans both a document and an attachment (not necessarily an attachment of the same document, mind), which will make things easier on our users.

Analyzing RavenDB 4.0 with NDepend

time to read 4 min | 648 words

I have a… difficult relationship with NDepend. On the one hand, it is an invaluable tool that provides good insight into your project. On the other hand, I don't always agree with its recommendations, and I have to teach it what I consider valuable. In my mind, it is the kind of tool that you reach for when you need Expert Mode.

We use it for exploring our API surface area and checking that it makes sense, for validating breaking changes, and in general for getting a detailed view into troublesome spots in the codebase.

For example, it pointed out this guy as having too many parameters. I can't say that I disagree, and it is a prime candidate for refactoring to make it simpler.

[image: a method flagged by NDepend for having too many parameters]

On the other hand, it didn't get that way by accident; it was added to slowly over time. Each new feature or bug fix needed just a bit more, and it grew and grew. At this point, however, this is extremely stable code that rarely changes, and modifying it would take a lot of work.

Just having tools around doesn't mean that you get to turn off your head.

One important thing is that if you are considering NDepend, you probably want to start doing so very early in the project lifecycle. One of the things that I find most interesting in the new version is the notion of estimated debt. Here is what this looks like for RavenDB 4.0:

[image: estimated technical debt for RavenDB 4.0]

And here is the value for RavenDB 3.5

[image: estimated technical debt for RavenDB 3.5]

This is composed from the estimated amount of effort that you would need to invest to fix things.

The really interesting part is that it does a pretty good job of finding troublesome locations even when you don’t have history / coverage information. For example, here is the list from RavenDB 3.5:

[image: the most troublesome types in RavenDB 3.5]

I find it highly amusing to see this, mostly because the client side implementation is more complex (and more frequently modified).

Digging a little bit deeper gives us this:

[image: debt details, pointing at AsyncServerClient and boxing / unboxing issues]

And this is where I start arguing with the tool. Or, to be rather more exact, I have more information than NDepend has about this usage. AsyncServerClient is how the entire client talks to the server, so it isn't cohesive and it is certainly too big, but that is pretty much by design.

The details about boxing / unboxing, however, are much more interesting, and this is where you can start doing a lot of interesting things. In particular, you can customize NDepend to give you a lot of details and to enforce rules about your codebase.

For example, given our recent need, here are all the big structures in our code:

[image: query results listing the big structures in the codebase]

You can do that for exploring, or you can add it as a rule (warnif).
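Such a rule would look something along these lines (CQLinq, roughly; the 64 bytes threshold here is arbitrary):

// <Name>Avoid big structures</Name>
warnif count > 0
from t in Application.Types
where t.IsStructure
      && t.SizeOfInst > 64 // instance size in bytes, pick your own threshold
orderby t.SizeOfInst descending
select new { t, t.SizeOfInst }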

Low level Voron optimizations: The page size bump

time to read 5 min | 864 words

Explaining the usage of pages seems to be one of those things that is either hit or miss for me. Either people just get it, or they struggle with the concept. I have written extensively on this particular topic, so I'll refer you to that post for the details on what exactly pages in a database are.

Voron is currently using 4KB pages. That is pretty much the default setting, since everything else also works in units of 4KB. That means that we play nice with requirements for alignment, CPU page sizes, etc. However, 4KB is pretty small, and that leads to trees with higher depth. And the depth of the tree is one of the major concerns for database performance (the deeper the tree, the more I/O we have to do).

We previously tested using different page sizes (8KB, 16KB and 32KB), and we saw that our performance decreased as a result. That was surprising and completely contrary to our expectations. But a short investigation revealed what the problem was. Whenever you modify a value, you dirty up the entire page. That means that we would need to write that entire page back to storage (which means making a bigger write to the journal, then applying a bigger write to the data file, etc.).

In effect, when increasing the page size to 8KB, we also doubled the amount of I/O that we had to deal with. That was a while ago, and we recently implemented journal diffing as a way to reduce the amount of unnecessary data that we write to disk. A side effect of that is that we no longer have a 1:1 correlation between a dirty page and a full page write to disk. That opened up the path to increasing the page size. There is still an O(PageSize) cost to doing the actual diffing, of course, but that is a memory to memory cost, negligible compared to the saved I/O.

Actually making the change was both harder and easier than expected. The hard part was that we had to do a major refactoring to split a shared value. Both the journal and the rest of Voron used the notion of Page Size. But while we wanted the page size of Voron to change, we didn't want the journal write size to change. That led to a lot of frustration where we had to go over the entire codebase, look at each value, and figure out whether it meant the journal write size, or pages as they are used in the rest of Voron. I've got another post scheduled talking about how you can generate intentional compilation errors to make this easy to figure out.
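The gist of the trick is to stop passing raw integers around and wrap each meaning in its own type, so that mixing them up becomes a compile time error. Something like this (a sketch):

// Two distinct types for two distinct meanings. Passing a PageSize where a
// JournalWriteSize is expected no longer compiles.
public struct JournalWriteSize
{
    public readonly int Value;
    public JournalWriteSize(int value) { Value = value; }
}

public struct PageSize
{
    public readonly int Value;
    public PageSize(int value) { Value = value; }
}

public static class Journal
{
    public static void Write(JournalWriteSize size)
    {
        // writes in units of size.Value, regardless of the Voron page size
    }
}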

Once we were past the journal issue, the rest was mostly dealing with places that made silent assumptions about the page size. That can be anything from "the max value we allow here is 512 (because we need to fit at least so many entries in)" to tests that wrote 1,000 values and expected the resulting B+Tree to be of a certain depth.

The results are encouraging, and we can see them mostly in the system behavior with very large data sets. Those used to generate very deep trees, and this change reduced their depth significantly. To give some context, let us assume that we can fit 100 entries per page using 4KB pages.

That means that if we have as little as 2.5 million entries, we’ll have (in the ideal case):

  • 1 root page holding 3 entries
  • 3 branch pages holding 250 entries
  • 250 branch pages holding 25,000 entries
  • 25,000 leaf pages holding the 2.5 million entries

With 8KB pages (which can fit about 200 entries each), we'll have:

  • 1 root page holding 63 entries
  • 63 branch pages holding 12,500 entries
  • 12,500 leaf pages holding the 2.5 million entries

That is a reduction of a full level. The nice thing about B+Trees is that in both cases, the branch pages are very few and usually reside in main memory already, so you aren't directly paying for their I/O.

What we are paying for is the search on them.

The cost of searching the 4KB tree is:

  • O(log2 of 3) for searching the root page
  • O(log2 of 100) for searching the first branch page
  • O(log2 of 100) for searching the second branch page
  • O(log2 of 100) for searching the leaf page

In other words, about 23 operations. For the 8KB pages, that would be:

  • O(log2 of 63) for searching the root page
  • O(log2 of 200) for searching the branch page
  • O(log2 of 200) for searching the leaf page

It comes to about 22 operations, which doesn't seem like much of a difference, but a lot of our time goes into key comparisons, so anything helps.
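If you want to check those numbers, here is a quick sketch that computes the depth and comparison count for an ideally packed tree:

using System;

class TreeCost
{
    static void Main()
    {
        Report(entries: 2_500_000, perPage: 100); // 4KB pages -> depth 4, ~23 comparisons
        Report(entries: 2_500_000, perPage: 200); // 8KB pages -> depth 3, ~22 comparisons
    }

    static void Report(long entries, int perPage)
    {
        int depth = 0;
        double comparisons = 0;
        long count = entries; // entries at the current level, starting at the leaves
        while (true)
        {
            depth++;
            // binary search inside the single page we visit at this level
            comparisons += Math.Ceiling(Math.Log(Math.Min(count, perPage), 2));
            long pages = (count + perPage - 1) / perPage;
            if (pages == 1)
                break; // this level is the root
            count = pages; // one entry per page of the level below
        }
        Console.WriteLine($"{perPage} entries/page: depth {depth}, ~{comparisons} comparisons");
    }
}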

However, note that I said that the situation above was the ideal one. That can only happen if the data was inserted sequentially, which it usually isn't. Page splits can cause the tree depth to increase very easily (in fact, that is one of the core reasons why non sequential keys are so strongly discouraged in pretty much all databases).

But the larger page size allows us to pack many more entries into a single page, and that also reduces the risk of page splits significantly.

Why you should avoid graceful error handling like the plague that it is

time to read 3 min | 536 words

A while ago I was reviewing a pull request by a team member, and I realized that I was looking at an attempt to negotiate the graceful termination of a connection between two nodes. In particular, the code in question was invoked when one node was shutting down or had to tear down the connection for whatever reason.

That code was thrown out, and it made a very graceful arc all the way to the recycle bin.

But why? The underlying reason for the change was to avoid needless error messages in the logs, which can trigger support calls and cost time & effort to figure out what is going on. That is an admirable goal, but at the same time, it is a false hope, and a dangerous one at that.

Let us consider what it means that a node is shutting down. It means that it now needs to notify all its peers about this. It is no longer enough to just tear down all connections; it needs to talk to them, and that means that we have introduced network delays into the shutdown procedure. It also means that we now have to deal with error handling when we are trying to notify a peer that this node is shutting down, and that way lies madness.

On the other hand, we have the other node, which now needs to handle its peer getting up in the middle of the conversation and saying "I'm going away now" mid-sentence. For that matter, since the shutdown signal (which is the common trigger for this) can happen at any time, we now need thread safety on shutdown so we can send a legible message to the other side, and the other side must be ready to accept the shutdown message at any time (a "Do you have any new documents for me?" request that expects a "There are N documents for you" response now also needs to handle a "G'dbye world" notification).

Doing this properly complicates the code at every level, and you still need to handle the rude shutdown scenario.

Furthermore, what is the other side supposed to do with the information that this node is shutting down the connection voluntarily? Is it supposed to not connect to it again? If so, what policy should it use to decide whether the other side is down for valid reasons or actually unavailable?

Assuming that there is actually a reason why there is a TCP connection between the two nodes, any interruption in service, for whatever reason, is not a valid state.

And if we ensure that we always end the connection in the same rude manner, we also gain a very valuable feature: we make sure that the error handling portion of the code gets exercised on a regular basis, so if there are any issues there, they will be discovered easily.

As for the original issue of reducing support calls because of transient / resolved errors: that can be solved by not logging the error immediately, but waiting a bit to verify that the situation actually warrants writing to the operations log (writing to the info log should obviously happen regardless).
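A sketch of that idea (the names and the grace period here are made up):

using System;
using System.Threading.Tasks;

static class DeferredErrorLog
{
    static readonly TimeSpan Grace = TimeSpan.FromSeconds(15);

    // Hypothetical hooks; wire them to the real logger / connection tracking.
    static void Info(string msg) => Console.WriteLine("INFO: " + msg);
    static void Operations(string msg) => Console.WriteLine("OPS:  " + msg);
    static bool StillDisconnected(string peer) => true;

    public static async Task OnConnectionError(string peer, Exception e)
    {
        Info($"Connection to {peer} failed: {e.Message}"); // info log always gets it

        await Task.Delay(Grace); // give the peer a chance to come back

        if (StillDisconnected(peer)) // only a persistent failure is ops-worthy
            Operations($"Connection to {peer} still down after {Grace}: {e}");
    }
}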

Scaffolding code as sign of maturity

time to read 3 min | 548 words

One of the clearest signs of maturity that I look for when reading code is the scaffolding that was used.

Just like in physical construction, it is often impossible to start by building everything properly from the get go, so you start by building temporary scaffolding to get you going. In some cases, those can be things that you need in order to actually build the software, but I have found that scaffolding is mostly useful for debugging and understanding issues.

For example, if I’m working on a complex data structure, it would be very useful to be able to dump it into a human readable format, so I can visually inspect it and understand how it works.

In the recent low level trie example, I have a file dedicated to doing just that. It contains some code to print the trie as well as validate its structure, and it was very useful for figuring certain things out.
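The shape of that sort of code is trivial, something like this (a sketch):

using System;
using System.Collections.Generic;

class TrieNode
{
    public char Key;
    public bool IsTerminal;
    public List<TrieNode> Children = new List<TrieNode>();
}

static class TrieDebug
{
    // Dump the trie as an indented listing, one node per line, terminals
    // marked with '*'. Useless in production, invaluable when debugging.
    public static void Dump(TrieNode node, int depth = 0)
    {
        Console.WriteLine(new string(' ', depth * 2) + node.Key +
                          (node.IsTerminal ? " *" : ""));
        foreach (var child in node.Children)
            Dump(child, depth + 1);
    }
}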

If the project is big enough, and mature enough, the scaffolding takes on a permanent nature. In RavenDB, for example, there are debug endpoints and live visualizations that can help an administrator figure out exactly what is going on with the system. The more complex the system, the more important the scaffolding becomes.

One very important consideration is what kind of scaffolding is being built. For example, if you throw a bunch of printf calls all over the place while you are debugging, that is helpful, but that isn't something that will remain over time. In many cases, the second or third time that you find yourself having to add code to help you narrow down a problem, you want to make this sort of code a permanent part of your codebase.

One of the more complex pieces in building Voron was the B+Tree behavior, in particular when dealing with page splits and variable size data. We built a lot of code that verifies the structure and its invariants for us, and it runs as part of our CI builds to ensure that stuff doesn't sneak in.

One of the things that we teach our new hires is that we need to have not just working code, but also all of the surrounding infrastructure: everything that I need to work with, diagnose and manage the system in production over long periods of time. I distinctly remember a debugging session where we had to add a whole bunch of diagnostics code to figure out some really gnarly issue. I was pairing with another dev on his machine, and we were working on it for a long time. We finally managed to figure out what the problem was and fix it, and I left and got the final PR with the fix later in the day.

None of the diagnostics code was in it, and when I asked why the answer was: “We fixed the problem, and we didn’t need it, so I removed it.”

That is not the kind of thing that you remove; that is the kind of thing that you keep, because you can bet your bottom dollar that we'll need it again when the next problem shows up.

Protocol design implications: REST vs. TCP

time to read 3 min | 444 words

I was going over design documents today, and I noticed some common themes in the changes that we have between RavenDB 3.5 and RavenDB 4.0.

With RavenDB 3.5 (and all previous versions), we always had the communication layer as HTTP REST calls between nodes. When I designed RavenDB, REST was the thing to do, and that is reflected in the design of RavenDB itself. However, 8 years later, we sat down and considered whether this was really appropriate for everything. The answer was a resounding no. In fact, while over 95% of RavenDB is still pure REST calls, we have moved certain key functions to using TCP directly.

Note that this stands in direct contrast to this post of mine from 2012: Why TCP is evil and HTTP is king.

The concerns in that post are still valid, but we have found that there are a few major reasons why we want to switch to TCP for certain stuff. In particular, the basic approach is that a client will communicate with the server using HTTP calls, but servers communicate with one another using TCP. The great thing about TCP is that it is a stream oriented protocol, so I don't need to carry state with me on every call.

With HTTP, each call is stateless, and I can't assume anything about the other side. That means that I need to send the state, manage the state on the other side, and deal with potential issues such as concurrency within the same conversation, restarts of one side that the other side can't easily detect, repeated validation on each call, etc.

With TCP, on the other hand, I can make a lot of assumptions about the conversation. I have state that I can carry between calls to the other side, and as long as the TCP connection is open, I can assume that it is valid. For example, if I need to know what the last item I sent to the remote end was, I can query that at the beginning of the TCP connection, as part of the handshake, and then I can just assume that what I sent to the other side has arrived (since otherwise I'll eventually get an error, requiring me to create a new TCP connection and do another handshake). In addition, I can verify the integrity of a connection once, without having to repeatedly verify our mutual state on each and every message being passed.
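Here is the shape of that kind of conversation (a sketch; the message format and names are made up, and the actual RavenDB protocol is different):

using System;
using System.IO;
using System.Net.Sockets;

class ReplicationClient
{
    // readBatchAfter returns the serialized documents that come after the
    // given etag, along with the new last etag (empty batch = caught up).
    public static void Run(string host, int port,
        Func<long, (byte[] Batch, long LastEtag)> readBatchAfter)
    {
        using (var client = new TcpClient(host, port))
        using (var stream = client.GetStream())
        using (var reader = new BinaryReader(stream))
        using (var writer = new BinaryWriter(stream))
        {
            writer.Write("GET-LAST-ETAG");      // ask once, as part of the handshake
            long lastEtag = reader.ReadInt64(); // the destination's last received etag

            while (true)
            {
                var (batch, newEtag) = readBatchAfter(lastEtag);
                if (batch.Length == 0)
                    break;
                // no state is exchanged per call; a failed write will surface
                // as an exception, and we'll redo the handshake on reconnect
                writer.Write(batch);
                lastEtag = newEtag;
            }
        }
    }
}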

This has drastically simplified a lot of code on both the sending and receiving ends, and reduced the number of network roundtrips by a significant amount.

Getting the design ready for production troubleshooting

time to read 2 min | 339 words

The following is an excerpt from a design document for a major feature in RavenDB 4.0 that I’m currently reviewing, written by Tal.

One of the major problems when debugging such issues in production is the fact that most of the interesting information resides in memory and goes away when the server restarts. The sad thing is that the first thing an admin will do when having issues with the server is to recycle it, giving us very little to work with. Yes, we have logs, but debug level logs are very expensive and usually are not enabled in production (nor should they be). We already have the ability to turn logs on in a production system, which is a great option but not enough. The root cause of a Raft problem usually resides in the past, so unless we have logs from the beginning of time, there is not much use for them. The suggested solution is a persistent log for important events that indicate that things went south.

This is based on our experience (and frustration) in diagnosing production issues. By the time the admin sees that something is wrong, the problem has already occurred, and in the process of handling it, the admin will typically focus on fixing it, rather than figuring out what exactly is going on.

These kinds of features, focusing explicitly on giving us enough information to find the root cause of an issue, have been an ongoing effort for us. Yesterday they enabled us to get a debug package from a customer (a zip file that the server can generate with a lot of important information), go through it, and figure out exactly what the problem was (the customer was running in 32 bits mode and running into virtual memory exhaustion) in one support roundtrip, rather than having to go back and forth multiple times to try to get a bunch of different data points to figure out the issue.

Also, go and read Release It; it has had a huge impact on actual system design.

Business features vs. technical features

time to read 4 min | 613 words

A feature is something that your application/service does. Usually we don't give it a lot more thought than that, but I recently had an interesting discussion about the exact distinction between a business feature and a technical feature.

Let us imagine that we are talking about an application that lets you send snail mail; we have seen it before. A user will call the API, and a few days later a physical letter will show up at the recipient's door. So far, it is pretty simple. The question is, what can you offer in addition, to expand the business?

For example, we might offer:

  • Mail tracking – providing a way to ensure that the recipient actually got the letter.
  • Snail mail to email – receiving physical mail on behalf of the customer, and having it scanned and sent to them as email.

Those two are obvious extensions to the core business, and from the point of view of the business, they are great. From a technical perspective, that is a whole lot of complexity. You need to integrate with FedEx to handle the mail tracking, and you need to set up some sort of automated system that will sort the mail, scan it and upload it to the customer's account.

The problem is that at this point, you don't really know what kind of reception those features are going to have. They are both non trivial, in some cases require major capital expenditure to implement, and are pretty hard to properly size upfront.

So you split it. Instead of doing this as a single feature, you have a business feature and a technical feature. The business feature means that your business offers this service; building it requires research to show that we can actually offer it, checking whether there are legal ramifications (some mail can be sensitive, privacy concerns, etc.), checking what kind of pricing we can charge, etc. The technical feature is actually implementing all of that.

But the key observation here is that you don't actually do the technical implementation, at least not just yet. You do the work around the business end of the feature, and then you announce the feature's availability. As in, right now you can track the snail mail, or right now you can get your mail scanned and uploaded. This is done with minimal technical work in the backend, and with the caveat that this is still experimental and pricing might change.

This isn't cheating, mind you. Once you have announced this feature, you wait to see what kind of reaction it gets. One of the options is that users will really love it, and start using it immediately. In that case, you have a good problem: people are flocking to give you money. In the meantime, you have Joe and Samantha from the local high school working for minimum wage in the afternoon to do the work manually. So you can meet the customers' expectations while you work to complete the technical side and automate the whole thing (firing Joe and Samantha along the way).

The key here is that you don't have to make any major upfront investment, in development or in facilities, before you can have this feature. And most of the time, even if it is a major feature, the ramp up time is enough for you to get a pretty good idea about what you actually need to do. And in the meantime, you have a microservice architecture, it is just that the services aren't called FedExTrackingService and ScanAndSortPhysicalMailService but Joe and Samantha.

In other words, you mechanical-Turk the feature until you can teach your system to properly play chess.

HTTP benchmark and pipelining

time to read 29 min | 5795 words

Here is an interesting problem. If you want to load test a server, it is very hard to truly do so. Put simply, after a while the problem isn't with your code, it is with the ability of the surrounding systems to actually get the requests to you fast enough.

In this case, let us talk about what is going on when you are actually doing an HTTP request.

We’ll start from the following code:

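Something along these lines (the URL is just a stand-in):

using System;
using System.Net.Http;

class Program
{
    static void Main()
    {
        using (var client = new HttpClient())
        {
            // one GET request; all the TCP work is hidden behind this call
            var result = client.GetStringAsync("http://localhost:8080/")
                               .GetAwaiter().GetResult();
            Console.WriteLine(result.Length);
        }
    }
}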

Seems pretty simple, right? And all we need to do is send enough of those, and we'll be able to put enough load on the server to matter, right? Except that it doesn't quite work like this. Let us see what the code above is actually doing by stripping away the HTTP layer and dropping down to TCP, shall we?

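Roughly, it is the equivalent of this (a sketch using TcpClient; a real HTTP client does quite a bit more):

using System;
using System.IO;
using System.Net.Sockets;
using System.Text;

class Program
{
    static void Main()
    {
        using (var client = new TcpClient("localhost", 8080))
        using (var stream = client.GetStream())
        {
            var request = Encoding.ASCII.GetBytes(
                "GET / HTTP/1.1\r\nHost: localhost\r\nConnection: close\r\n\r\n");
            stream.Write(request, 0, request.Length);

            using (var reader = new StreamReader(stream))
                Console.WriteLine(reader.ReadToEnd()); // status line, headers, body
        }
    }
}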

 

And that looks good, right? Except that it is still hiding some details. I'm too lazy to go all the way down to raw sockets to demonstrate them, and anyway it would be way too much code to show here.

Here is a diagram that demonstrates what goes over the network for the two code samples above:

+---------+                      +---------+
| Client  |                      | Server  |
+---------+                      +---------+
     |                                |
     | [SYN]                          |
     |------------------------------->|
     |                                |
     |                      [SYN-ACK] |
     |<-------------------------------|
     |                                |
     | [ACK]                          |
     |------------------------------->|
     |                                | -----------------------------\
     |                                |-| Connection now established |
     |                                | |----------------------------|
     |                                |
     | [GET / HTTP 1.1]               |
     |------------------------------->|
     |                                | -------------------\
     |                                |-| The HTTP request |
     |                                | |------------------|
     |                                |
     |      [HTTP/1.1 302 Found ... ] |
     |<-------------------------------|
     |                                | --------------------\
     |                                |-| The HTTP response |
     |                                | |-------------------|
     |                                | -----------------------------------\
     |                                |-| Client now will close connection |
     |                                | |----------------------------------|
     |                                |
     | FIN                            |
     |------------------------------->|
     |                                |
     |                            ACK |
     |<-------------------------------|
     |                                |
     |                            FIN |
     |<-------------------------------|
     |                                |
     | ACK                            |
     |------------------------------->|
     |                                |

Note that this is for the simplest case: we assume that the response is just one packet, assume no packet drops, and ignore stuff like HTTPS, which adds another 4 packets to the initialization; we are also accounting for the last 4 packets that are required to properly close a connection. This is important, because if you are trying to run a high load benchmark, creating and not properly closing TCP connections means that you'll soon run out of available ports (all your connections will be in the CLOSE_WAIT or TIME_WAIT state).

Now, the problem is that this is really expensive. As in, wow, expensive. So pretty much as soon as the web took off (mid 90s or so), people realized that this wasn't going to work, and the notion of Keep-Alive was born.

With Keep-Alive, you are going to reuse the same TCP connection to send multiple requests to the server. The idea is that once the connection is open, there is a strong likelihood that you'll use it again soon, so why pay the 7-packet cost of opening & closing the TCP connection?

With that optimization, we then have:

+---------+                      +---------+
| Client  |                      | Server  |
+---------+                      +---------+
     |                                |
     | [SYN]                          |
     |------------------------------->|
     |                                |
     |                      [SYN-ACK] |
     |<-------------------------------|
     |                                |
     | [ACK]                          |
     |------------------------------->|
     |                                | -----------------------------\
     |                                |-| Connection now established |
     |                                | |----------------------------|
     |                                |
     | [GET / HTTP 1.1]               |
     |------------------------------->|
     |                                | -------------------\
     |                                |-| The HTTP request |
     |                                | |------------------|
     |                                |
     |      [HTTP/1.1 302 Found ... ] |
     |<-------------------------------|
     |                                | --------------------\
     |                                |-| The HTTP response |
     |                                | |-------------------|
     |                                |
     | [GET /index HTTP 1.1]          |
     |------------------------------->|
     |                                | -------------------\
     |                                |-| 2nd HTTP request |
     |                                | |------------------|
     |                                |
     |           [HTTP/1.1 200  ... ] |
     |<-------------------------------|
     |                                | --------------------\
     |                                |-| 2nd HTTP response |
     |                                | |-------------------|
     |                                | -----------------------------------\
     |                                |-| Client now will close connection |
     |                                | |----------------------------------|
     |                                |
     | FIN                            |
     |------------------------------->|
     |                                |
     |                            ACK |
     |<-------------------------------|
     |                                |
     |                            FIN |
     |<-------------------------------|
     |                                |
     | ACK                            |
     |------------------------------->|
     |                                |

And the more requests we make to the server, the better off we are. Now, there is another trick that we can apply here. Remember that TCP is stream oriented, not packet oriented. That means that as far as the calling code is concerned, we aren't actually seeing packets, just bytes arriving one after another.

So we can change the way things work to this:

+---------+                                                     +---------+
| Client  |                                                     | Server  |
+---------+                                                     +---------+
     |                                                               |
     | [SYN]                                                         |
     |-------------------------------------------------------------->|
     |                                                               |
     |                                                     [SYN-ACK] |
     |<--------------------------------------------------------------|
     |                                                               |
     | [ACK]                                                         |
     |-------------------------------------------------------------->|
     |                                                               | -----------------------------\
     |                                                               |-| Connection now established |
     |                                                               | |----------------------------|
     |                                                               |
     | [GET / HTTP 1.1, GET /data HTTP 1.1, GET /fast HTTP 1.1]      |
     |-------------------------------------------------------------->|
     |                                                               | -------------------------------------\
     |                                                               |-| 3 HTTP requests in a single packet |
     |                                                               | |------------------------------------|
     |                                                               |
     |              [HTTP/1.1 302 Found ..., HTTP/1.1 200, HTTP 403] |
     |<--------------------------------------------------------------|
     |                                                               | -----------------------------------\
     |                                                               |-| All HTTP responses in one packet |
     |                                                               | |----------------------------------|
     |                                                               | -----------------------------------\
     |                                                               |-| Client now will close connection |
     |                                                               | |----------------------------------|
     |                                                               |
     | FIN                                                           |
     |-------------------------------------------------------------->|
     |                                                               |
     |                                                           ACK |
     |<--------------------------------------------------------------|
     |                                                               |
     |                                                           FIN |
     |<--------------------------------------------------------------|
     |                                                               |
     | ACK                                                           |
     |-------------------------------------------------------------->|
     |                                                               |

What we did is pretty simple. Instead of waiting for the server to respond to a request, and only then reusing the connection to send the next one, we send the requests immediately, one after the other, without waiting.

In some cases, we can even pack multiple requests into a single TCP packet. And the server shouldn't care about that.

Here is what this looks like in practice:
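A sketch of what the client side of pipelining can look like (three requests written back to back, before reading anything; the URLs are stand-ins):

using System;
using System.IO;
using System.Net.Sockets;
using System.Text;

class Pipelining
{
    static void Main()
    {
        using (var client = new TcpClient("localhost", 8080))
        using (var stream = client.GetStream())
        {
            // three requests, written in one go; small enough that the OS is
            // likely to send them as a single packet
            var requests = Encoding.ASCII.GetBytes(
                "GET / HTTP/1.1\r\nHost: localhost\r\n\r\n" +
                "GET /data HTTP/1.1\r\nHost: localhost\r\n\r\n" +
                "GET /fast HTTP/1.1\r\nHost: localhost\r\nConnection: close\r\n\r\n");
            stream.Write(requests, 0, requests.Length);

            // the responses come back on the same stream, in order
            using (var reader = new StreamReader(stream))
                Console.WriteLine(reader.ReadToEnd());
        }
    }
}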

Now, naïve server code will fail here, because it will read from the socket into a buffer (including some part of the next request) and then forget about it. But it isn't hard to make sure that this works properly, and that is the key for all high performance servers.

Basically, the real problem is driving enough packets into the server to generate load. By pipelining requests like that, we reduce the number of packets we need to send while at the same time generating a much higher load.

The cost of routing a packet is independent of its size, and while the size you send matters for bandwidth, packet latency is much more important for actual speed (latency vs. bandwidth, again). So if we can pack the data into fewer packets, it is a net win. In other words, this is HTTP doing carpooling.

 


And now that you can drive enough requests into your server to actually stress it, you can work your way into actually handling this load.
