Big data work on 32 bits


Ever put on a pair of shoes that were just a bit too small for you? You think that it would be fine, but then time goes by, and it pinches. And then it hurts, and then you just can’t walk any longer.

That is how it feels to work with any reasonable amount of data (even low hundreds of MB) in 32 bits. The virtual address space is so limited that it is incredibly easy to just run out due to address space fragmentation and fail, and there is very little that you can actually do to recover from such a scenario. We have been working on making RavenDB 4.0 work reliably in 32 bits mode*, and it has been PITA after PITA. In particular, Voron was quite explicitly designed for an environment where it is hard to impossible to run out of address space, and our typical modus operandi is to map the entire data file (which can be hundreds of GB in size) as a single contiguous memory region.

* And yes, I can’t imagine how much of a PITA it was to run in 16 bits. When I programmed on those sorts of platforms, I never actually built anything that needed that much memory (I was in middle school at the time, IIRC).

That allows us to do all sorts of tricks and optimizations, and it is utterly impossible to do in 32 bits. In fact, on most 32 bits systems, after the process has been running for a while, just doing a single 256MB allocation is probably going to fail. You might have that much memory free, but you don’t have it in a contiguous run. So we had to make drastic changes. At the same time, 32 bits is an important requirement, but mostly it is on the sidelines: I want to support it, and I’m willing to put in the effort to get there, but I’m not willing to pay for it if it costs me when running outside of 32 bits mode.

In other words, anything that would have a drastic impact on the “normal” mode of running in 64 bits was out, and we didn’t want to re-architect the whole thing. Luckily, we already had a place to put the different behavior. In Voron, the Pager is the entity that is responsible for mapping the file to memory and handing out pointers into the file based on the page number requested. That meant that we had a well defined interface:

[Image: the pager interface]
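To give a rough sense of the shape of that interface, here is a minimal sketch; the names, signatures and page size are illustrative assumptions on my part, not the actual Voron API:

```csharp
using System;

// A minimal sketch of the shape of such a pager; the names, signatures and
// page size here are illustrative assumptions, not the actual Voron API.
public abstract unsafe class AbstractPager : IDisposable
{
    public int PageSize { get; protected set; } = 8 * 1024; // assumed page size

    // A pointer into the data file for the given page, valid only for the
    // lifetime of the transaction that asked for it.
    public abstract byte* AcquirePagePointer(Transaction tx, long pageNumber);

    // Make sure the whole range [pageNumber, pageNumber + numberOfPages) is
    // mapped before the caller touches it.
    public abstract void EnsureMapped(Transaction tx, long pageNumber, int numberOfPages);

    // Grow the data file (and whatever maps back it) to the requested size.
    public abstract void AllocateMorePages(long newLength);

    public abstract void Dispose();
}

// Stand-in for the transaction type that scopes the returned pointers.
public class Transaction : IDisposable
{
    public void Dispose() { }
}
```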

Because the file can grow, and the pager might hold multiple overlapping maps to the same file, we only allow pointers to the data file in the scope of a transaction. That ended up being a very good thing, since this enabled us to build a pager that could work in 32 bits mode. Instead of mapping the entire file, we’ll map just the pages that are required, and only for the duration of the transaction, after which we can unmap everything. The idea is that we’ll only have the data that we are actually using mapped, so we’ll save a lot of address space.

That was the theory, at least. In practice, we ran into several interesting issues:

  • We want to reduce the number of system calls we make, and the allocation granularity on Windows is 64KB, so we always need to map at least that much, instead of mapping one page at a time.
  • A value may actually be longer than 64KB, or may not be aligned on a 64KB boundary.
  • Transactions are concurrent, and typically need to access the same areas (the top branches in the B+Trees we use, for example).

The end result is that we map in multiples of 64KB (on both Windows & Linux), and we check, based on the actual data, whether we need to remap because a value extends beyond what is mapped in the current block. We are also sharing all the maps among all the concurrent transactions, to reduce the total amount of virtual address space we are using. This is a bit hard to do concurrently & safely, so we are racing it. In most cases, we’ll have a single mapping, but it is fine to map the same section twice from different transactions (there can only ever be a single write transaction, so all the others are just reading, and never from the same memory that the write tx is writing to). The alternative would have been more invasive locking, and the performance cost isn’t worth it.
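Conceptually, the per-transaction mapping logic looks something like the sketch below. This is an illustration under assumed names, an assumed page size, and a simplified read-only view of things, not the real implementation:

```csharp
using System;
using System.Collections.Generic;
using System.IO.MemoryMappedFiles;

// Sketch only: maps the smallest range of 64KB units that covers the
// requested pages, reuses an existing map when it already covers them,
// and unmaps everything when the transaction goes away.
public sealed class TransactionMappings : IDisposable
{
    private const long AllocationGranularity = 64 * 1024;
    private const int PageSize = 8 * 1024; // assumed page size

    private readonly MemoryMappedFile _file;
    private readonly List<(long Start, long Size, MemoryMappedViewAccessor View)> _views =
        new List<(long Start, long Size, MemoryMappedViewAccessor View)>();

    public TransactionMappings(MemoryMappedFile file)
    {
        _file = file;
    }

    public MemoryMappedViewAccessor EnsureMapped(long pageNumber, int numberOfPages)
    {
        long offset = pageNumber * PageSize;
        long end = offset + (long)numberOfPages * PageSize;

        // Round the start down and the end up to the 64KB allocation granularity.
        long mapStart = offset & ~(AllocationGranularity - 1);
        long mapSize = ((end - mapStart) + AllocationGranularity - 1) & ~(AllocationGranularity - 1);

        // Reuse an existing map if it already covers the requested range.
        foreach (var (start, size, view) in _views)
        {
            if (start <= offset && end <= start + size)
                return view;
        }

        // Otherwise map the rounded range. Overlapping maps are fine; they all
        // get released together when the transaction ends.
        var newView = _file.CreateViewAccessor(mapStart, mapSize, MemoryMappedFileAccess.Read);
        _views.Add((mapStart, mapSize, newView));
        return newView;
    }

    // Called when the transaction completes: everything mapped on its behalf
    // is unmapped, returning the address space for other transactions to use.
    public void Dispose()
    {
        foreach (var (_, _, view) in _views)
            view.Dispose();
        _views.Clear();
    }
}
```

The rounding keeps the number of map/unmap system calls down, and disposing everything at the end of the transaction is what actually gives the address space back.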

Once we got this working, we were most of the way there, but there were still a lot of optimizations that we had to reverse. For example, in 64 bits mode, it makes a lot of sense to pre-allocate data in advance to maintain locality, and as the data we dealt with became larger, so would our pre-allocations. But those would end up being around 32MB of data that we pre-allocate (so we need to prepare and touch all of it), and under load, it was actually one of the more common causes of failure due to fragmentation of the address space: it would allocate 1MB, 2MB, 4MB, 8MB, 16MB, 32MB, and we had many such operations cutting the address space into fine bits.

Another place that caused us problems was the indexes, where we allocated a 16MB range for the bloom filter. That made perfect sense as an optimization in 64 bits, but in 32 bits mode that is a huge chunk of my address space that I’m giving up. In both cases, we drastically reduced the sizes involved (to 64KB in the case of the bloom filter, and a max pre-allocation of 1MB).
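Something along these lines is the kind of change involved. The sizes come from the numbers above, but the method shape, and the idea of keying the cap off the process bitness, are assumptions on my part:

```csharp
using System;

// Sketch of the kind of change involved. The sizes match the post; the method
// shape, and tying the cap to the process bitness, are assumptions.
public static class PreAllocationPolicy
{
    private const long MaxPreAllocation64Bits = 32L * 1024 * 1024; // 1MB doubling up to 32MB
    private const long MaxPreAllocation32Bits = 1L * 1024 * 1024;  // hard cap in 32 bits mode

    public static long GetNextPreAllocationSize(long previousSize)
    {
        long next = Math.Max(1024 * 1024, previousSize * 2); // grow by doubling
        long limit = Environment.Is64BitProcess
            ? MaxPreAllocation64Bits
            : MaxPreAllocation32Bits;
        return Math.Min(next, limit);
    }

    // The bloom filter gets the same treatment: 16MB is fine in 64 bits,
    // 64KB is all we are willing to give up in 32 bits.
    public static long GetBloomFilterSize()
    {
        return Environment.Is64BitProcess ? 16L * 1024 * 1024 : 64 * 1024;
    }
}
```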

Another thing that was actually in our favor is that our memory usage is already policed quite carefully. We didn’t do that intentionally for 32 bits; we did it because memory allocation is so expensive. But because we rarely ask for memory from the OS, and because we typically reuse buffers, we were pretty much already there in terms of proper behavior of the system in 32 bits mode. The CLR is also pretty good about allocating large sections from the OS and being able to move memory around to avoid internal fragmentation.

Overall, it works. We have tested it on the Stack Overflow dataset, and it is quite amazing to see (you can watch it chugging along, using anything between 300 MB – 600 MB as it inserts 60+ GB of data). It is obviously less efficient than the 64 bits version, but given that we are so much faster in 4.0, I don’t think that anyone will actually notice, and if they do… Well, that is why we have hardware built in this decade.

There were other places where we had to take the 32 bits nature of the system into account, and in which we actively undermined previous optimizations. Instead of allowing transactions to merge as much as possible to reduce I/O, we placed a hard limit on how much data can be accessed or modified by a transaction before we force it to commit. The same goes for indexing: instead of letting batches run as long as we usually would like, address space considerations force us to reduce the batch size, based on how much address space is used in the current batch, and close it early (freeing up the address space for other uses).
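As a sketch, the batch limit boils down to a check like this; the names and thresholds here are made up for illustration, not the values we actually use:

```csharp
using System;

// Illustrative only; the names and thresholds are assumptions. The point is
// that a batch is closed early once it has consumed too much address space.
public sealed class IndexingBatchLimits
{
    // How much mapped address space a single batch may consume before we
    // force it to complete and release its mappings.
    private readonly long _maxAddressSpacePerBatch;

    public IndexingBatchLimits()
    {
        _maxAddressSpacePerBatch = Environment.Is64BitProcess
            ? 2L * 1024 * 1024 * 1024   // plenty of room to let batches run long
            : 256L * 1024 * 1024;       // keep batches short in 32 bits mode
    }

    public bool ShouldCloseCurrentBatch(long addressSpaceUsedByBatch)
    {
        return addressSpaceUsedByBatch >= _maxAddressSpacePerBatch;
    }
}
```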

The overall process is very unpleasant, because things that I would consider to be obvious and easy are suddenly so much harder than I would expect them to be.