Ayende @ Rahien

filter by tags archive

architecture (624) rss
bugs (451) rss
community (383) rss
databases (481) rss
design (899) rss
development (658) rss
hibernating-practices (74) rss
miscellaneous (592) rss
performance (397) rss
programming (1113) rss
raven (1483) rss
ravendb.net (570) rss
reviews (184) rss

2025
- December (8)
- November (4)
- October (4)
- September (10)
- August (6)
- July (7)
- June (7)
- May (10)
- April (10)
- March (10)
- February (7)
- January (12)
2024
- December (3)
- November (2)
- October (1)
- September (3)
- August (5)
- July (10)
- June (4)
- May (6)
- April (2)
- March (8)
- February (2)
- January (14)
2023
- December (4)
- October (4)
- September (6)
- August (12)
- July (5)
- June (15)
- May (3)
- April (11)
- March (5)
- February (5)
- January (8)
2022
- December (5)
- November (7)
- October (7)
- September (9)
- August (10)
- July (15)
- June (12)
- May (9)
- April (14)
- March (15)
- February (13)
- January (16)
2021
- December (23)
- November (20)
- October (16)
- September (6)
- August (16)
- July (11)
- June (16)
- May (4)
- April (10)
- March (11)
- February (15)
- January (14)
2020
- December (10)
- November (13)
- October (15)
- September (6)
- August (9)
- July (9)
- June (17)
- May (15)
- April (14)
- March (21)
- February (16)
- January (13)
2019
- December (17)
- November (14)
- October (16)
- September (10)
- August (8)
- July (16)
- June (11)
- May (13)
- April (18)
- March (12)
- February (19)
- January (23)
2018
- December (15)
- November (14)
- October (19)
- September (18)
- August (23)
- July (20)
- June (20)
- May (23)
- April (15)
- March (23)
- February (19)
- January (23)
2017
- December (21)
- November (24)
- October (22)
- September (21)
- August (23)
- July (21)
- June (24)
- May (21)
- April (21)
- March (23)
- February (20)
- January (23)
2016
- December (17)
- November (18)
- October (22)
- September (18)
- August (23)
- July (22)
- June (17)
- May (24)
- April (16)
- March (16)
- February (21)
- January (21)
2015
- December (5)
- November (10)
- October (9)
- September (17)
- August (20)
- July (17)
- June (4)
- May (12)
- April (9)
- March (8)
- February (25)
- January (17)
2014
- December (22)
- November (19)
- October (21)
- September (37)
- August (24)
- July (23)
- June (13)
- May (19)
- April (24)
- March (23)
- February (21)
- January (24)
2013
- December (23)
- November (29)
- October (27)
- September (26)
- August (24)
- July (24)
- June (23)
- May (25)
- April (26)
- March (24)
- February (24)
- January (21)
2012
- December (19)
- November (22)
- October (27)
- September (24)
- August (30)
- July (23)
- June (25)
- May (23)
- April (25)
- March (25)
- February (28)
- January (24)
2011
- December (17)
- November (14)
- October (24)
- September (28)
- August (27)
- July (30)
- June (19)
- May (16)
- April (30)
- March (23)
- February (11)
- January (26)
2010
- December (29)
- November (28)
- October (35)
- September (33)
- August (44)
- July (17)
- June (20)
- May (53)
- April (29)
- March (35)
- February (33)
- January (36)
2009
- December (37)
- November (35)
- October (53)
- September (60)
- August (66)
- July (29)
- June (24)
- May (52)
- April (63)
- March (35)
- February (53)
- January (50)
2008
- December (58)
- November (65)
- October (46)
- September (48)
- August (96)
- July (87)
- June (45)
- May (51)
- April (52)
- March (70)
- February (43)
- January (49)
2007
- December (100)
- November (52)
- October (109)
- September (68)
- August (80)
- July (56)
- June (150)
- May (115)
- April (73)
- March (124)
- February (102)
- January (68)
2006
- December (95)
- November (53)
- October (120)
- September (57)
- August (88)
- July (54)
- June (103)
- May (89)
- April (84)
- March (143)
- February (78)
- January (64)
2005
- December (70)
- November (97)
- October (91)
- September (61)
- August (74)
- July (92)
- June (100)
- May (53)
- April (42)
- March (41)
- February (84)
- January (31)
2004
- December (49)
- November (26)
- October (26)
- September (6)
- April (10)

Jan 31 2025

NTFS has an emergency stash of disk space

time to read 5 min | 915 words

Tweet Share Share 2 comments

Tags:

I would really love to have a better understanding of what is going on here!

If you format a 32 MB disk using NTFS, you’ll get the following result:

So about 10 MB are taken for NTFS metadata. I guess that makes sense, and giving up 10 MB isn’t generally a big deal these days, so I wouldn’t worry about it.

I write a 20 MB file and punch a hole in it between 6 MB and 18 MB (12 MB in total), so we have:

And in terms of disk space, we have:

The numbers match, awesome! Let’s create a new 12 MB file, like so:

And the disk is:

And now I’m running the following code, which maps the first file (with the hole punched in it) and writes 4 MB to it using memory-mapped I/O:

HANDLE hMapFile = CreateFileMapping(hFile, NULL, PAGE_READWRITE, 0, 0, NULL);
if (hMapFile == NULL) {
    fprintf(stderr, "Could not create file mapping object: %x\n", GetLastError());
    exit(__LINE__);


}


char* lpMapAddress = MapViewOfFile(hMapFile, FILE_MAP_WRITE, 0, 0, 0);
if (lpMapAddress == NULL) {
    fprintf(stderr, "Could not map view of file: %x\n", GetLastError());
    exit(__LINE__);
}


for (i = 6 * MB; i < 10 * MB; i++) {
    ((char*)lpMapAddress)[i]++;
}


if (!FlushViewOfFile(lpMapAddress, 0)) {
    fprintf(stderr, "Could not flush view of file: %x\n", GetLastError());
    exit(__LINE__);
}


if (!FlushFileBuffers(hFile)) {
        fprintf(stderr, "Could not flush file buffers: %x\n", GetLastError());
    exit(__LINE__);
}

The end for this file is:

So with the other file, we have a total of 24 MB in use on a 32 MB disk. And here is the state of the disk itself:

The problem is that there used to be 9.78 MB that were busy when we had a newly formatted disk. And now we are using at least some of that disk space for storing file data somehow.

I’m getting the same behavior when I use normal file I/O:

moveAmount.QuadPart = 6 * MB;
SetFilePointerEx(hFile, moveAmount, NULL, FILE_BEGIN);


for (i = 6 ; i < 10 ; i++) {
    if (!WriteFile(hFile, buffer, MB, &bytesWritten, NULL)) {
        fprintf(stderr, "WriteFile failed on iteration %d: %x\n", i, GetLastError());
        exit(__LINE__);
    }
}

So somehow in this sequence of operations, we get more disk space. On the other hand, if I try to write just 22 MB into a single file, it fails. See:

hFile = CreateFileA("R:/original_file.bin", GENERIC_READ | GENERIC_WRITE, 0, NULL, CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL, NULL);
if (hFile == INVALID_HANDLE_VALUE) {
    printf("Error creating file: %d\n", GetLastError());
    exit(__LINE__);
}


for (int i = 0; i < 22; i++) {
    if (!WriteFile(hFile, buffer, MB, &bytesWritten, NULL)) {
        fprintf(stderr, "WriteFile failed on iteration %d: %x\n", i, GetLastError());
        exit(__LINE__);
    }
}

You can find the full source here. I would love to understand what exactly is happening and how we suddenly get more disk space usage in this scenario.

Jan 29 2025

What happens when a sparse file allocation fails?

time to read 18 min | 3531 words

Tweet Share Share 0 comments

Tags:

Today I set out to figure out an answer to a very specific question. What happens at the OS level when you try to allocate disk space for a sparse file and there is no additional disk space?

Sparse files are a fairly advanced feature of file systems. They allow you to define a file whose size is 10GB, but that only takes 2GB of actual disk space. The rest is sparse (takes no disk space and on read will return just zeroes). The OS will automatically allocate additional disk space for you if you write to the sparse ranges.

This leads to an interesting question, what happens when you write to a sparse file if there is no additional disk space?

Let’s look at the problem on Linux first. We define a RAM disk with 32MB, like so:

sudo mkdir -p /mnt/ramdisk
sudo mount -t tmpfs -o size=32M tmpfs /mnt/ramdisk

And then we write the following code, which does the following (on a disk with just 32MB):

Create a file - write 32 MB to it
Punch a hole of 8 MB in the file (range is 12MB - 20MB)
Create another file - write 4 MB to it (there is now only 4MB available)
Open the original file and try to write to the range with the hole in it (requiring additional disk space allocation)

#define _GNU_SOURCE


#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>
#include <linux/falloc.h>
#include <errno.h>
#include <string.h>
#include <sys/random.h>


#define MB (1024 * 1024)


void write_all(int fd, const void *buf, size_t count)
{
    size_t bytes_written = 0;
    const char *ptr = (const char *)buf;


    while (bytes_written < count)
    {
        ssize_t result = write(fd, ptr + bytes_written, count - bytes_written);


        if (result < 0)
        {
            if (errno == EINTR)
                continue;


            fprintf(stderr, "Write error: errno = %d (%s)\n", errno, strerror(errno));
            exit(EXIT_FAILURE);
        }


        if (result == 0)
        {


            fprintf(stderr, "Zero len write is bad: errno = %d (%s)\n", errno, strerror(errno));
            exit(EXIT_FAILURE);
        }


        bytes_written += result;
    }
}


int main()
{
    int fd;
    char buffer[MB];


    unlink("/mnt/ramdisk/fullfile");
    unlink("/mnt/ramdisk/anotherfile");


    getrandom(buffer, MB, 0);


    ssize_t bytes_written;


    fd = open("/mnt/ramdisk/fullfile", O_RDWR | O_CREAT | O_TRUNC, 0644);
    if (fd == -1)
    {
        fprintf(stderr, "open full file: errno = %d (%s)\n", errno, strerror(errno));
        exit(EXIT_FAILURE);
    }
    for (int i = 0; i < 32; i++)
    {
        write_all(fd, buffer, MB);
    }
    close(fd);


    fd = open("/mnt/ramdisk/fullfile", O_RDWR);
    if (fd == -1)
    {
        fprintf(stderr, "reopen full file: errno = %d (%s)\n", errno, strerror(errno));
        exit(EXIT_FAILURE);
    }


    if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, 12 * MB, 8 * MB) == -1)
    {
        fprintf(stderr, "fallocate failure: errno = %d (%s)\n", errno, strerror(errno));
        exit(EXIT_FAILURE);
    }
    close(fd);


    fd = open("/mnt/ramdisk/anotherfile", O_RDWR | O_CREAT | O_TRUNC, 0644);
    if (fd == -1)
    {
        fprintf(stderr, "open another file: errno = %d (%s)\n", errno, strerror(errno));
        exit(EXIT_FAILURE);
    }


    for (int i = 0; i < 4; i++)
    {
        write_all(fd, buffer, MB);
    }
    close(fd);


    // Write 8 MB to the hole in the first file
    fd = open("/mnt/ramdisk/fullfile", O_RDWR);
    if (fd == -1)
    {
        fprintf(stderr, "reopen full file 2: errno = %d (%s)\n", errno, strerror(errno));
        exit(EXIT_FAILURE);
    }


    // Seek to the start of the hole
    if (lseek(fd, 12 * MB, SEEK_SET) == -1)
    {
        fprintf(stderr, "seek full file: errno = %d (%s)\n", errno, strerror(errno));
        exit(EXIT_FAILURE);
    }
    for (int i = 0; i < 8; i++)
    {
        write_all(fd, buffer, MB);
    }
    close(fd);


    printf("Operations completed successfully.\n");
    return 0;
}

As expected, this code will fail on the 5th write (since there is no disk space to allocate in the disk). The error would be:

Write error: errno = 28 (No space left on device)

Here is what the file system reports:

$ du -h /mnt/ramdisk/*
4.0M    /mnt/ramdisk/anotherfile
28M     /mnt/ramdisk/fullfile


$ ll -h /mnt/ramdisk/
total 33M
drwxrwxrwt 2 root   root     80 Jan  9 10:43 ./
drwxr-xr-x 6 root   root   4.0K Jan  9 10:30 ../
-rw-r--r-- 1 ayende ayende 4.0M Jan  9 10:43 anotherfile
-rw-r--r-- 1 ayende ayende  32M Jan  9 10:43 fullfile

As you can see, we have a total of 32 MB of actual size reported, but ll is reporting that we actually have files bigger than that (because we have hole punching).

What would happen if we were to run this using memory-mapped I/O? Here is the code:

fd = open("/mnt/ramdisk/fullfile", O_RDWR);


char *mapped_memory = mmap(NULL, 32 * MB, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
if (mapped_memory == MAP_FAILED)
{
    fprintf(stderr, "fail mmap: errno = %d (%s)\n", errno, strerror(errno));
    exit(EXIT_FAILURE);
}


for (size_t i = (12 * MB); i < (20 * MB); i++)
{
    mapped_memory[i] = 1;
}
munmap(mapped_memory, 32 * MB);
close(fd);

This will lead to an interesting scenario. We need to allocate disk space for the memory, and we’ll do so (note that we are writing into the hole), and this code will fail with a segmentation fault.

It will fail in the loop, by the way, as part of the page fault to bring the memory in, the file system needs to allocate the disk space. If there is no such disk space, it will fail. The only way for the OS to behave in this case is to fail the write, which leads to a segmentation fault.

I also tried that on Windows. I defined a virtual disk like so:

$ diskpart
create vdisk file="D:\ramdisk.vhd" maximum=32
select vdisk file=D:\ramdisk.vhd"
attach vdisk
create partition primary
format fs=NTFS quick label=RAMDISK
assign letter=R
exit

This creates a 32MB disk and assigns it the letter R. Note that we are using NTFS, which has its own metadata, we have roughly 21MB or so of usable disk space to play with here.

Here is the Windows code that simulates the same behavior as the Linux code above:

#include <stdio.h>
#include <windows.h>


#define MB (1024 * 1024)


int main() {
    HANDLE hFile, hFile2;
    DWORD bytesWritten;
    LARGE_INTEGER fileSize, moveAmount;
    char* buffer = malloc(MB);
    
    int i;


        DeleteFileA("R:\\original_file.bin");
        DeleteFileA("R:\\another_file.bin");


    hFile = CreateFileA("R:/original_file.bin", GENERIC_READ | GENERIC_WRITE, 0, NULL, CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL, NULL);
    if (hFile == INVALID_HANDLE_VALUE) {
        printf("Error creating file: %d\n", GetLastError());
        exit(__LINE__);
    }


    for (int i = 0; i < 20; i++) {
        if (!WriteFile(hFile, buffer, MB, &bytesWritten, NULL)) {
            fprintf(stderr, "WriteFile failed on iteration %d: %x\n", i, GetLastError());
            exit(__LINE__);
        }
        if (bytesWritten != MB) {
            fprintf(stderr, "Failed to write full buffer on iteration %d\n", i);
            exit(__LINE__);
        }
    }


    FILE_ZERO_DATA_INFORMATION zeroDataInfo;
    zeroDataInfo.FileOffset.QuadPart = 6 * MB; 
    zeroDataInfo.BeyondFinalZero.QuadPart = 18 * MB; 


    if (!DeviceIoControl(hFile, FSCTL_SET_SPARSE, NULL, 0, NULL, 0, NULL, NULL) || 
        !DeviceIoControl(hFile, FSCTL_SET_ZERO_DATA, &zeroDataInfo, sizeof(zeroDataInfo), NULL, 0, NULL, NULL)) {
                printf("Error setting zero data: %d\n", GetLastError());
        exit(__LINE__);
        }




    // Create another file of size 4 MB
    hFile2 = CreateFileA("R:/another_file.bin", GENERIC_READ | GENERIC_WRITE, 0, NULL, CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL, NULL);
    if (hFile2 == INVALID_HANDLE_VALUE) {
        printf("Error creating second file: %d\n", GetLastError());
        exit(__LINE__);
    }




    for (int i = 0; i < 4; i++) {
        if (!WriteFile(hFile2, buffer, MB, &bytesWritten, NULL)) {
            fprintf(stderr, "WriteFile 2 failed on iteration %d: %x\n", i, GetLastError());
            exit(__LINE__);
        }
        if (bytesWritten != MB) {
            fprintf(stderr, "Failed to write full buffer 2 on iteration %d\n", i);
            exit(__LINE__);
        }
    }


        moveAmount.QuadPart = 12 * MB;
    SetFilePointerEx(hFile, moveAmount, NULL, FILE_BEGIN);
    for (i = 0; i < 8; i++) {
        if (!WriteFile(hFile, buffer, MB, &bytesWritten, NULL)) {
            printf("Error writing to file: %d\n", GetLastError());
            exit(__LINE__);
        }
    }


    return 0;
}

And that gives us the exact same behavior as in Linux. One of these writes will fail because there is no more disk space for it. What about when we use memory-mapped I/O?

HANDLE hMapFile = CreateFileMapping(hFile, NULL, PAGE_READWRITE, 0, 0, NULL);
if (hMapFile == NULL) {
    fprintf(stderr, "Could not create file mapping object: %x\n", GetLastError());
    exit(__LINE__);


}


char* lpMapAddress = MapViewOfFile(hMapFile, FILE_MAP_WRITE, 0, 0, 0);
if (lpMapAddress == NULL) {
    fprintf(stderr, "Could not map view of file: %x\n", GetLastError());
    exit(__LINE__);
}


for (i = 0; i < 20 * MB; i++) {
    ((char*)lpMapAddress)[i]++;
}

That results in the expected access violation:

I didn’t bother checking Mac or BSD, but I’m assuming that they behave in the same manner. I can’t conceive of anything else that they could reasonably do.

You can find my full source here.

Jan 27 2025

Configuration values & Escape hatches

time to read 2 min | 394 words

Tweet Share Share 0 comments

Tags:

RavenDB is meant to be a self-managing database, one that is able to take care of itself without constant hand-holding from the database administrator. That has been one of our core tenets from the get-go. Today I checked the current state of the codebase and we have roughly 500 configuration options that are available to control various aspects of RavenDB’s behavior.

These two statements are seemingly contradictory, because if we have so many configuration options, how can we even try to be self-managing? And how can a database administrator expect to juggle all of those options?

Database configuration is a really finicky topic. For example, RocksDB’s authors flat-out admit that out loud:

Even we as RocksDB developers don't fully understand the effect of each configuration change. If you want to fully optimize RocksDB for your workload, we recommend experiments and benchmarking.

And indeed, efforts were made to tune RocksDB using deep-learning models because it is that complex.

RavenDB doesn’t take that approach, tuning is something that should work out of the box, managed directly by RavenDB itself. Much of that is achieved by not doing things and carefully arranging that the environment will balance itself out in an optimal fashion. But I’ll talk about the Zen of RavenDB another time.

Today, I want to talk about why we have so many configuration options, the vast majority of which you, as a user, should neither use, care about, nor even know of.

The idea is very simple, deploying a database engine is a Big Deal, and as such, something that users are quite reluctant to do. When we hit a problem and a support call is raised, we need to provide some mechanism for the user to fix things until we can ensure that this behavior is accounted for in the default manner of RavenDB.

I treat the configuration options more as escape hatches that allow me to muddle through stuff than explicit options that an administrator is expected to monitor and manage. Some of those configuration options control whether RavenDB will utilize vectored instructions or the compression algorithm to use over the wire. If you need to touch them, it is amazing that they exist. If you have to deal with them on a regular basis, we need to go back to the drawing board.

Jan 24 2025

Partial writes, IO_Uring and safety

time to read 14 min | 2742 words

Tweet Share Share 0 comments

Tags:

In my previous post, I discussed how Linux will silently truncate a big write (> 2 GB) for you. That is expected by the interface of write(). The problem is that this behavior also applies when you use IO_Uring.

Take a look at the following code:

struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
if (!sqe) {
    return 1;
}
io_uring_prep_write(sqe, fd, buffer, BUFFER_SIZE, 0);
io_uring_submit(&ring);


struct io_uring_cqe *cqe;
ret = io_uring_wait_cqe(&ring, &cqe);
if (ret < 0) {
    return 2;
}

If BUFFER_SIZE is 3 GB, then this will write about 2 GB to the file. The number of bytes written is correctly reported, but the complexity this generates is huge. Consider the following function:

int32_t rvn_write_io_ring(
    void *handle,
    int32_t count,
    struct page_to_write *buffers,
    int32_t *detailed_error_code);

There is a set of buffers that I want to write, and the natural way to do that is:

int32_t rvn_write_io_ring(
    void *handle,
    int32_t count,
    struct page_to_write *buffers,
    int32_t *detailed_error_code)
{
    struct handle *handle_ptr = handle;
    for (size_t i = 0; i < count; i++)
    {
        struct io_uring_sqe *sqe = io_uring_get_sqe(
            &handle_ptr->global_state->ring);
        io_uring_prep_write(sqe,
            handle_ptr->file_fd, 
            buffers[i].ptr, 
            buffers[i].count_of_pages * VORON_PAGE_SIZE, 
            buffers[i].page_num * VORON_PAGE_SIZE
        );
    }
    return _submit_and_wait(&handle_ptr->global_state->ring,
                            count, detailed_error_code);
}


int32_t _submit_and_wait(
    struct io_uring* ring,
    int32_t count,
    int32_t* detailed_error_code)
{
    int32_t rc = io_uring_submit_and_wait(ring, count);
    if(rc < 0)
    {
        *detailed_error_code = -rc;
        return FAIL_IO_RING_SUBMIT;
    }


    struct io_uring_cqe* cqe;
    for(int i = 0; i < count; i++)
    {
        rc = io_uring_wait_cqe(ring, &cqe);
        if (rc < 0)
        {
            *detailed_error_code = -rc;
            return FAIL_IO_RING_NO_RESULT;
        }
        if(cqe->res < 0)
        {
            *detailed_error_code = -cqe->res;
            return FAIL_IO_RING_WRITE_RESULT;
        }
        io_uring_cqe_seen(ring, cqe);
    }
    return SUCCESS;
}

In other words, send all the data to the IO Ring, then wait for all those operations to complete. We verify complete success and can then move on. However, because we may have a write that is greater than 2 GB, and because the interface allows the IO Uring to write less than we thought it would, we need to handle that with retries.

After thinking about this for a while, I came up with the following implementation:

int32_t _submit_writes_to_ring(
    struct handle *handle,
    int32_t count,
    struct page_to_write *buffers,
    int32_t* detailed_error_code)
{
    struct io_uring *ring = &handle->global_state->ring;
    off_t *offsets = handle->global_state->offsets;
    memset(offsets, 0, count * sizeof(off_t));   


    while(true)
    {
        int32_t submitted = 0;


        for (size_t i = 0; i < count; i++)
        {
            off_t offset =  offsets[i];
            if(offset == buffers[i].count_of_pages * VORON_PAGE_SIZE)
                continue;


            struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
            if (sqe == NULL) // the ring is full, flush it...
                break;


            io_uring_sqe_set_data(sqe, i);
            io_uring_prep_write(sqe, handle->file_fd, 
                buffers[i].ptr + offset,
                buffers[i].count_of_pages * VORON_PAGE_SIZE - offset,
                buffers[i].page_num * VORON_PAGE_SIZE + offset);


            submitted++;
        }


        if(submitted == 0)
            return SUCCESS;
        
        int32_t rc = io_uring_submit_and_wait(ring, submitted);
        if(rc < 0)
        {
            *detailed_error_code = -rc;
            return FAIL_IO_RING_SUBMIT;
        }


        struct io_uring_cqe *cqe;
        uint32_t head = 0;
        uint32_t i = 0;
        bool has_errors = false;


        io_uring_for_each_cqe(ring, head, cqe) {
            i++;
            uint64_t index = io_uring_cqe_get_data64(cqe);
            int result = cqe->res;
            if(result < 0)
            {
                has_errors = true;
                *detailed_error_code = -result;
            }
            else
            {
                offsets[index] += result;
                if(result == 0)
                {
                    // there shouldn't be a scenario where we return 0 
                    // for a write operation, we may want to retry here
                    // but figuring out if this is a single happening, of if 
                    // we need to retry this operation (_have_ retried it?) is 
                    // complex enough to treat this as an error for now.
                    has_errors = true;
                    *detailed_error_code = EIO;   
                }
            }
        }
        io_uring_cq_advance(ring, i);


        if(has_errors)
            return FAIL_IO_RING_WRITE_RESULT;
    }
}

That is a lot of code, but it is mostly because of how C works. What we do here is scan through the buffers we need to write, as well as scan through an array of offsets that store additional information for the operation.

If the offset to write doesn’t indicate that we’ve written the whole thing, we’ll submit it to the ring and keep going until we either fill the entire ring or run out of buffers to work with. The next step is to submit the work and wait for it to complete, then run through the results, check for errors, and update the offset that we wrote for the relevant buffer.

Then, we scan the buffers array again to find either partial writes that we have to complete (we didn’t write the whole buffer) or buffers that we didn’t write at all because we filled the ring. In either case, we submit the new batch of work to the ring and repeat until we run out of work.

This code assumes that we cannot have a non-error state where we write 0 bytes to the file and treats that as an error. We also assume that an error in writing to the disk is fatal, and the higher-level code will discard the entire IO_Uring if that happens.

The Windows version, by the way, is somewhat simpler. Windows explicitly limits the size of the buffer you can pass to the write() call (and its IO Ring equivalent). It also ensures that it will write the whole thing, so partial writes are not an issue there.

It is interesting to note that the code above will effectively stripe writes if you send very large buffers. Let’s assume that we send it two 4 GB buffers, like so:

Offset	Size
Buffer 1	1 GB	4 GB
Buffer 2	10 GB	6 GB

The patterns of writes that will actually be executed are:

1GB .. 3 GB, 10 GB .. 12 GB
3 GB .. 5 GB, 12 GB .. 14 GB
14 GB .. 16 GB

I can “fix” that by never issuing writes that are larger than 2 GB and issuing separate writes for each 2 GB range, but that leads to other complexities (e.g., tracking state if I split a write and hit the full ring status, etc.). At those sizes, it doesn’t actually matter in terms of efficiency or performance.

Partial writes are almost always a sign of either very large writes that were broken up or some underlying issue that is already problematic, so I don’t care that much about that scenario in general. For the vast majority of cases, this will always issue exactly one write for each buffer.

What is really interesting from my point of view, however, is how even a pretty self-contained feature can get pretty complex internally. On the other hand, this behavior allows me to push a whole bunch of work directly to the OS and have it send all of that to the disk as fast as possible.

In our scenarios, under load, we may call that with thousands to tens of thousands of pages (each 8 KB in size) spread all over the file. The buffers are actually sorted, so ideally, the kernel will be able to take advantage of that, but even if not, just reducing the number of syscalls will result in performance savings.

Jan 22 2025

AnswerWhat does this code do?

time to read 4 min | 764 words

Tweet Share Share 2 comments

Tags:

I previously asked what the code below does, and mentioned that it should give interesting insight into the kind of mindset and knowledge a candidate has. Take a look at the code again:

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>
#include <errno.h>
#include <sys/stat.h>


#define BUFFER_SIZE (3ULL * 1024 * 1024 * 1024) // 3GB in bytes


int main() {
    int fd;
    char *buffer;
    struct stat st;


    buffer = (char *)malloc(BUFFER_SIZE);
    if (buffer == NULL) {
        return 1;
    }


    fd = open("large_file.bin", O_WRONLY | O_CREAT | O_TRUNC, S_IRUSR | S_IWUSR);
    if (fd == -1) {
        return 2;
    }


    if (write(fd, buffer, BUFFER_SIZE) == -1) {
        return 3;
    }


    if (fsync(fd) == -1) {
        return 4;
    }


    if (close(fd) == -1) {
        return 5;
    }


    if (stat("large_file.bin", &st) == -1) {
        return 6;
    }


    printf("File size: %.2f GB\n", (double)st.st_size / (1024 * 1024 * 1024));


    free(buffer);
    return 0;
}

This program will output: File size: 2.00 GB

And it will write 2 GB of zeros to the file:

~$ head  large_file.bin  | hexdump -C
00000000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
7ffff000

The question is why? And the answer is quite simple. Linux has a limitation of about 2 GB for writes to the disk. Any write call that attempts to write more than that will only write that much, and you’ll have to call the system again. This is not an error, mind. The write call is free to write less than the size of the buffer you passed to it.

Windows has the same limit, but it is honest about it

In Windows, all write calls accept a 32-bit int as the size of the buffer, so this limitation is clearly communicated in the API. Windows will also ensure that for files, a WriteFile call that completes successfully writes the entire buffer to the disk.

And why am I writing 2 GB of zeros? In the code above, I’m using malloc(), not calloc(), so I wouldn’t expect the values to be zero. Because this is a large allocation, malloc() calls the OS to provide us with the buffer directly, and the OS is contractually obligated to provide us with zeroed pages.

Jan 20 2025

ChallengeWhat does this code do?

time to read 3 min | 536 words

Tweet Share Share 1 comments

Tags:

Here is a pretty simple C program, running on Linux. Can you tell me what you expect its output to be?

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>
#include <errno.h>
#include <sys/stat.h>


#define BUFFER_SIZE (3ULL * 1024 * 1024 * 1024) // 3GB in bytes


int main() {
    int fd;
    char *buffer;
    struct stat st;


    buffer = (char *)malloc(BUFFER_SIZE);
    if (buffer == NULL) {
        return 1;
    }


    fd = open("large_file.bin", O_WRONLY | O_CREAT | O_TRUNC, S_IRUSR | S_IWUSR);
    if (fd == -1) {
        return 2;
    }


    if (write(fd, buffer, BUFFER_SIZE) == -1) {
        return 3;
    }


    if (fsync(fd) == -1) {
        return 4;
    }


    if (close(fd) == -1) {
        return 5;
    }


    if (stat("large_file.bin", &st) == -1) {
        return 6;
    }


    printf("File size: %.2f GB\n", (double)st.st_size / (1024 * 1024 * 1024));


    free(buffer);
    return 0;
}

And what happens when I run:

head  large_file.bin  | hexdump -C

This shows both surprising behavior and serves as a good opening for discussion on a whole bunch of issues. In an interview setting, that can give us a lot of insight into the sort of knowledge a candidate has.

Jan 17 2025

Production post-mortemInspecting ourselves to death

time to read 9 min | 1642 words

Tweet Share Share 0 comments

Tags:

The problem was that this took time - many days or multiple weeks - for us to observe that. But we had the charts to prove that this was pretty consistent. If the RavenDB service was restarted (we did not have to restart the machine), the situation would instantly fix itself and then slowly degrade over time.

The scenario in question was performance degradation over time. The metric in question was the average request latency, and we could track a small but consistent rise in this number over the course of days and weeks. The load on the server remained pretty much constant, but the latency of the requests grew.

That the customer didn’t notice that is an interesting story on its own. RavenDB will automatically prioritize the fastest node in the cluster to be the “customer-facing” one, and it alleviated the issue to such an extent that the metrics the customer usually monitors were fine. The RavenDB Cloud team looks at the entire system, so we started the investigation long before the problem warranted users’ attention.

I hate these sorts of issues because they are really hard to figure out and subject to basically every caveat under the sun. In this case, we basically had exactly nothing to go on. The workload was pretty consistent, and I/O, memory, and CPU usage were acceptable. There was no starting point to look at.

Those are also big machines, with hundreds of GB of RAM and running heavy workloads. These machines have great disks and a lot of CPU power to spare. What is going on here?

After a long while, we got a good handle on what is actually going on. When RavenDB starts, it creates memory maps of the file it is working with. Over time, as needed, RavenDB will map, unmap, and remap as needed. A process that has been running for a long while, with many databases and indexes operating, will have a lot of work done in terms of memory mapping.

In Linux, you can inspect those details by running:

$

600a33834000-600a3383b000 rSize: KernelPageSize: MMUPageSize: Rss: Pss: Shared_Clean: Shared_Dirty: Private_Clean: Private_Dirty: Referenced: Anonymous: LazyFree: AnonHugePages: ShmemPmdMapped: FilePmdMapped: Shared_Hugetlb: Private_Hugetlb: Swap: SwapPss: Locked: THPeligible: VmFlags: 600a3383b000-600a33847000 rSize: KernelPageSize: MMUPageSize: Rss: Pss: Shared_Clean: Shared_Dirty: Private_Clean: Private_Dirty: Referenced: Anonymous: LazyFree: AnonHugePages: ShmemPmdMapped: FilePmdMapped: Shared_Hugetlb: Private_Hugetlb: Swap: SwapPss: Locked: THPeligible: VmFlags:

cat /proc/22003/smaps punctuation">--p 00000000 08:30 214585                     /data/ravendb/Raven.Server 28 kB 4 kB 4 kB 28 kB 26 kB 4 kB 0 kB 24 kB 0 kB 28 kB 0 kB 0 kB 0 kB 0 kB 0 kB 0 kB 0 kB 0 kB 0 kB 0 kB 0 rd mr mw me dw punctuation">-xp 00006000 08:30 214585                     /data/ravendb/Raven.Server 48 kB 4 kB 4 kB 48 kB 46 kB 4 kB 0 kB 44 kB 0 kB 48 kB 0 kB 0 kB 0 kB 0 kB 0 kB 0 kB 0 kB 0 kB 0 kB 0 kB 0 rd ex mr mw me dw

Here you can see the first page of entries from this file. Just starting up RavenDB (with no databases created) will generate close to 2,000 entries. The smaps virtual file can be really invaluable for figuring out certain types of problems. In the snippet above, you can see that we have some executable memory ranges mapped, for example.

The problem is that over time, memory becomes fragmented, and we may end up with an smaps file that contains tens of thousands (or even hundreds of thousands) of entries.

Here is the result of running perf top on the system, you can see that the top three items that hogs most of the resources are related to smaps accounting.

This file provides such useful information that we monitor it on a regular basis. It turns out that this can have… interesting effects. Consider that while we are running the scan through all the memory mapping, we may need to change the memory mapping for the process. That leads to contention on the kernel locks that protect the mapping, of course.

It’s expensive to generate the `smaps` file

Reading from /proc/[pid]/smaps is not a simple file read. It involves the kernel gathering detailed memory statistics (e.g., memory regions, page size, resident/anonymous/shared memory usage) for each virtual memory area (VMA) of the process. For large processes with many memory mappings, this can be computationally expensive as the kernel has to gather the required information every time /proc/[pid]/smaps is accessed.

When /proc/[pid]/smaps is read, the kernel needs to access memory-related structures. This may involve taking locks on certain parts of the process’s memory management system. If this is done too often or across many large processes, it could lead to contention or slow down the process itself, especially if other processes are accessing or modifying memory at the same time.

If the number of memory mappings is high, and the frequency with which we monitor is short… I hope you can see where this is going. We effectively spent so much time running over this file that we blocked other operations.

This wasn’t an issue when we just started the process, because the number of memory mappings was small, but as we worked on the system and the number of memory mappings grew… we eventually started hitting contention.

The solution was two-fold. We made sure that there is only ever a single thread that would read the information from the smaps (previously it might have been triggered from multiple locations). We added some throttling to ensure that we aren’t hammering the kernel with requests for this file too often (returning cached information if needed) and we switched from using smaps to using smaps_rollup instead. The rollup version provides much better performance, since it deals with summary data only.

With those changes in place, we deployed to production and waited. The result was flat latency numbers and another item that the Cloud team could strike off the board successfully.

Jan 15 2025

Accidenal complexity: A tale of two GUIDs

time to read 3 min | 462 words

Tweet Share Share 0 comments

Tags:

For a new feature in RavenDB, I needed to associate each transaction with a source ID. The underlying idea is that I can aggregate transactions from multiple sources in a single location, but I need to be able to distinguish between transactions from A and B.

Luckily, I had the foresight to reserve space in the Transaction Header, I had a whole 16 bytes available for me. Separately, each Voron database (the underlying storage engine that we use) has a unique Guid identifier. And a Guid is 16 bytes… so everything is pretty awesome.

There was just one issue. I needed to be able to read transactions as part of the recovery of the database, but we stored the database ID inside the database itself. I figured out that I could also put a copy of the database ID in the global file header and was able to move forward.

This is part of a much larger change, so I was going full steam ahead when I realized something pretty awful. That database Guid that I was relying on was already being used as the physical identifier of the storage as part of the way RavenDB distributes data. The reason it matters is that under certain circumstances, we may need to change that.

If we change the database ID, we lose the association with the transactions for that database, leading to a whole big mess. I started sketching out a design for figuring out that the database ID has changed, re-writing all the transactions in storage, and… a colleague said: why don’t we use another ID?

It hit me like a ton of bricks. I was using the existing database Guid because it was already there, so it seemed natural to want to reuse it. But there was no benefit in doing that. Instead, it added a lot more complexity because I was adding (many) additional responsibilities to the value that it didn’t have before.

Creating a Guid is pretty easy, after all, and I was able to dedicate one I called Journal ID to this purpose. The existing Database ID is still there, and it is completely unrelated to it. Changing the Database ID has no impact on the Journal ID, so the problem space is radically simplified.

I had to throw away heaps of complexity because of a single comment. I used the Database ID because it was there, never considering having a dedicated value for it. That single suggestion led to a better, simpler design and faster delivery.

It is funny how you can sometimes be so focused on the problem at hand, when a step back will give you a much wider view and a better path to the solution.

Jan 13 2025

The memory leak in ConcurrentQueue

time to read 3 min | 457 words

Tweet Share Share 1 comments

Tags:

We ran into a memory issue recently in RavenDB, which had a pretty interesting root cause. Take a look at the following code and see if you can spot what is going on:

ConcurrentQueue<Buffer> _buffers = new();


void FlushUntil(long maxTransactionId)
{
    List<Buffer> toFlush = new();
    while(_buffers.TryPeek(out buffer) && 
        buffer.TransactionId <= maxTransactionId)
    {
        if(_buffers.TryDequeue(out buffer))
        {
            toFlush.Add(buffer);
        }
    }


    FlushToDisk(toFlush);
}

The code handles flushing data to disk based on the maximum transaction ID. Can you see the memory leak?

If we have a lot of load on the system, this will run just fine. The problem is when the load is over. If we stop writing new items to the system, it will keep a lot of stuff in memory, even though there is no reason for it to do so.

The reason for that is the call to TryPeek(). You can read the source directly, but the basic idea is that when you peek, you have to guard against concurrent TryTake(). If you are not careful, you may encounter something called a torn read.

Let’s try to explain it in detail. Suppose we store a large struct in the queue and call TryPeek() and TryTake() concurrently. The TryPeek() starts copying the struct to the caller at the same time that TryTake() does the same and zeros the value. So it is possible that TryPeek() would get an invalid value.

To handle that, if you are using TryPeek(), the queue will not zero out the values. This means that until that queue segment is completely full and a new one is generated, we’ll retain references to those buffers, leading to an interesting memory leak.

Jan 10 2025

Performance discoveryIOPS vs. IOPS

time to read 15 min | 2973 words

Tweet Share Share 2 comments

Tags:

RavenDB is a transactional database, we care deeply about ACID. The D in ACID stands for durability, which means that to acknowledge a transaction, we must write it to a persistent medium. Writing to disk is expensive, writing to the disk and ensuring durability is even more expensive.

After seeing some weird performance numbers on a test machine, I decided to run an experiment to understand exactly how durable writes affect disk performance.

A few words about the term durable writes. Disks are slow, so we use buffering & caches to avoid going to the disk. But a write to a buffer isn’t durable. A failure could cause it to never hit a persistent medium. So we need to tell the disk in some way that we are willing to wait until it can ensure that this write is actually durable.

This is typically done using either fsync or O_DIRECT | O_DSYNC flags. So this is what we are testing in this post.

I wanted to test things out without any of my own code, so I ran the following benchmark.

I pre-allocated a file and then ran the following commands.

Normal writes (buffered) with different sizes (256 KB, 512 KB, etc).

dd if=/dev/zero of=/data/test bs=256K count=1024
dd if=/dev/zero of=/data/test bs=512K count=1024

Durable writes (force the disk to acknowledge them) with different sizes:

dd if=/dev/zero of=/data/test bs=256k count=1024 oflag=direct,sync
dd if=/dev/zero of=/data/test bs=256k count=1024 oflag=direct,sync

The code above opens the file using:

openat(AT_FDCWD, "/data/test", O_WRONLY|O_CREAT|O_TRUNC|O_SYNC|O_DIRECT, 0666) = 3

I got myself an i4i.xlarge instance on AWS and started running some tests. That machine has a local NVMe drive of about 858 GB, 32 GB of RAM, and 4 cores. Let’s see what kind of performance I can get out of it.

Write sizeTotal writesBuffered writes

256 KB	256 MB	1.3 GB/s
512 KB	512 MB	1.2 GB/s
1 MB	1 GB	1.2 GB/s
2 MB	2 GB	731 Mb/s
8 MB	8 GB	571 MB/s
16 MB	16 GB	561 MB/s
2 MB	8 GB	559 MB/s
1 MB	1 GB	554 MB/s
4 KB	16 GB	557 MB/s
16 KB	16 GB	553 MB/s

What you can see here is that writes are really fast when buffered. But when I hit a certain size (above 1 GB or so), we probably start having to write to the disk itself (which is NVMe, remember). Our top speed is about 550 MB/s at this point, regardless of the size of the buffers I’m passing to the write() syscall.

I’m writing here using cached I/O, which is something that as a database vendor, I don’t really care about. What happens when we run with direct & sync I/O, the way I would with a real database? Here are the numbers for the i4i.xlarge instance for durable writes.

Write sizeTotal writesDurable writes

256 KB	256 MB	1.3 GB/s
256 KB	1 GB	1.1 GB/s
16 MB	16 GB	584 GB/s
64 KB	16 GB	394 MB/s
32 KB	16 GB	237 MB/s
16 KB	16 GB	126 MB/s

In other words, when using direct I/O, the smaller the write, the more time it takes. Remember that we are talking about forcing the disk to write the data, and we need to wait for it to complete before moving to the next one.

For 16 KB writes, buffered writes achieve a throughput of 553 MB/s vs. 126 MB/s for durable writes. This makes sense, since those writes are cached, so the OS is probably sending big batches to the disk. The numbers we have here clearly show that bigger batches are better.

My next test was to see what would happen when I try to write things in parallel. In this test, we run 4 processes that write to the disk using direct I/O and measure their output.

I assume that I’m maxing out the throughput on the drive, so the total rate across all commands should be equivalent to the rate I would get from a single command.

To run this in parallel I’m using a really simple mechanism - just spawn processes that would do the same work. Here is the command template I’m using:

parallel -j 4 --tagstring 'Task {}' dd if=/dev/zero of=/data/test bs=16M count=128 seek={} oflag=direct,sync ::: 0 1024 2048 3072

This would write to 4 different portions of the same file, but I also tested that on separate files. The idea is to generate a sufficient volume of writes to stress the disk drive.

Write sizeTotal writesDurable & Parallel writes

16 MB	8 GB	650 MB/s
16 KB	64 GB	252 MB/s

I also decided to write some low-level C code to test out how this works with threads and a single program. You can find the code here. I basically spawn NUM_THREADS threads, and each will open a file using O_SYNC | O_DIRECT and write to the file WRITE_COUNT times with a buffer of size BUFFER_SIZE.

This code just opens a lot of files and tries to write to them using direct I/O with 8 KB buffers. In total, I’m writing 16 GB (128 MB x 128 threads) to the disk. I’m getting a rate of about 320 MB/sec when using this approach.

As before, increasing the buffer size seems to help here. I also tested a version where we write using buffered I/O and call fsync every now and then, but I got similar results.

The interim conclusion that I can draw from this experiment is that NVMes are pretty cool, but once you hit their limits you can really feel it. There is another aspect to consider though, I’m running this on a disk that is literally called ephemeral storage. I need to repeat those tests on real hardware to verify whether the cloud disk simply ignores the command to persist properly and always uses the cache.

That is supported by the fact that using both direct I/O on small data sizes didn’t have a big impact (and I expected it should). Given that the point of direct I/O in this case is to force the disk to properly persist (so it would be durable in the case of a crash), while at the same time an ephemeral disk is wiped if the host machine is restarted, that gives me good reason to believe that these numbers are because the hardware “lies” to me.

In fact, if I were in charge of those disks, lying about the durability of writes would be the first thing I would do. Those disks are local to the host machine, so we have two failure modes that we need to consider:

The VM crashed - in which case the disk is perfectly fine and “durable”.
The host crashed - in which case the disk is considered lost entirely.

Therefore, there is no point in trying to achieve durability, so we can’t trust those numbers.

The next step is to run it on a real machine. The economics of benchmarks on cloud instances are weird. For a one-off scenario, the cloud is a godsend. But if you want to run benchmarks on a regular basis, it is far more economical to just buy a physical machine. Within a month or two, you’ll already see a return on the money spent.

We got a machine in the office called Kaiju (a Japanese term for enormous monsters, think: Godzilla) that has:

32 cores
188 GB RAM
2 TB NVMe for the system disk
4 TB NVMe for the data disk

I ran the same commands on that machine as well and got really interesting results.

Write sizeTotal writesBuffered writes

4 KB	16 GB	1.4 GB/s
256 KB	256 MB	1.4 GB/s
2 MB	2 GB	1.6 GB/s
2 MB	16 GB	1.7 GB/s
4 MB	32 GB	1.8 GB/s
4 MB	64 GB	1.8 GB/s

We are faster than the cloud instance, and we don’t have a drop-off point when we hit a certain size. We are also seeing higher performance when we throw bigger buffers at the system.

But when we test with small buffers, the performance is also great. That is amazing, but what about durable writes with direct I/O?

I tested the same scenario with both buffered and durable writes:

ModeBufferedDurable

1 MB buffers, 8 GB write	1.6 GB/s	1.0 GB/s
2 MB buffers, 16 GB write	1.7 GB/s	1.7 GB/s

Wow, that is an interesting result. Because it means that when we use direct I/O with 1 MB buffers, we lose about 600 MB/sec compared to buffered I/O. Note that this is actually a pretty good result. 1 GB/sec is amazing.

And if you use big buffers, then the cost of direct I/O is basically gone. What about when we go the other way around and use smaller buffers?

ModeBufferedDurable

128 KB buffers, 8 GB write	1.7 GB/s	169 MB/s
32 KB buffers, 2 GB	1.6 GB/s	49.9 MB/s
Parallel: 8, 1 MB, 8 GB	5.8 GB/s	3.6 GB/s
Parallel: 8, 128 KB, 8 GB	6.0 GB/s	550 MB/s

For buffered I/O - I’m getting simply dreamy numbers, pretty much regardless of what I do 🙂.

For durable writes, the situation is clear. The bigger the buffer we write, the better we perform, and we pay for small buffers. Look at the numbers for 128 KB in the durable column for both single-threaded and parallel scenarios.

169 MB/s in the single-threaded result, but with 8 parallel processes, we didn’t reach 1.3 GB/s (which is 169x8). Instead, we achieved less than half of our expected performance.

It looks like there is a fixed cost for making a direct I/O write to the disk, regardless of the amount of data that we write. When using 32 KB writes, we are not even breaking into the 200 MB/sec. And with 8 KB writes, we are barely breaking into the 50 MB/sec range.

Those are some really interesting results because they show a very strong preference for bigger writes over smaller writes.

I also tried using the same C code as before. As a reminder, we use direct I/O to write to 128 files in batches of 8 KB, writing a total of 128 MB per file. All of that is done concurrently to really stress the system.

When running iotop in this environment, we get:

Total DISK READ:         0.00 B/s | Total DISK WRITE:       522.56 M/s
Current DISK READ:       0.00 B/s | Current DISK WRITE:     567.13 M/s
    TID  PRIO  USER     DISK READ DISK WRITE>    COMMAND
 142851 be/4 kaiju-1     0.00 B/s    4.09 M/s ./a.out
 142901 be/4 kaiju-1     0.00 B/s    4.09 M/s ./a.out
 142902 be/4 kaiju-1     0.00 B/s    4.09 M/s ./a.out
 142903 be/4 kaiju-1     0.00 B/s    4.09 M/s ./a.out
 142904 be/4 kaiju-1     0.00 B/s    4.09 M/s ./a.out
... redacted ...

So each thread is getting about 4.09 MB/sec for writes, but we total 522 MB/sec across all writes. I wondered what would happen if I limited it to fewer threads, so I tried with 16 concurrent threads, resulting in:

Total DISK READ:         0.00 B/s | Total DISK WRITE:        89.80 M/s
Current DISK READ:       0.00 B/s | Current DISK WRITE:     110.91 M/s
    TID  PRIO  USER     DISK READ DISK WRITE>    COMMAND
 142996 be/4 kaiju-1     0.00 B/s    5.65 M/s ./a.out
 143004 be/4 kaiju-1     0.00 B/s    5.62 M/s ./a.out
 142989 be/4 kaiju-1     0.00 B/s    5.62 M/s ./a.out
... redacted ..

Here we can see that each thread is getting (slightly) more throughput, but the overall system throughput is greatly reduced.

To give some context, with 128 threads running, the process wrote 16GB in 31 seconds, but with 16 threads, it took 181 seconds to write the same amount. In other words, there is a throughput issue here. I also tested this with various levels of concurrency:

Concurrency(8 KB x 16K times - 128 MB)Throughput per threadTime / MB written

1	15.5 MB / sec	8.23 seconds / 128 MB
2	5.95 MB / sec	18.14 seconds / 256 MB
4	5.95 MB / sec	20.75 seconds / 512 MB
8	6.55 MB / sec	20.59 seconds / 1024 MB
16	5.70 MB / sec	22.67 seconds / 2048 MB

To give some context, here are two attempts to write 2GB to the disk:

ConcurrencyWriteThroughputTotal writtenTotal time

16	128 MB in 8 KB writes	5.7 MB / sec	2,048 MB	22.67 sec
8	256 MB in 16 KB writes	12.6 MB / sec	2,048 MB	22.53 sec
16	256 MB in 16 KB writes	10.6 MB / sec	4,096 MB	23.92 sec

In other words, we can see the impact of concurrent writes. There is absolutely some contention at the disk level when making direct I/O writes. The impact is related to the number of writes rather than the amount of data being written.

Bigger writes are far more efficient. And concurrent writes allow you to get more data overall but come with a horrendous latency impact for each individual thread.

The difference between the cloud and physical instances is really interesting, and I have to assume that this is because the cloud instance isn’t actually forcing the data to the physical disk (it doesn’t make sense that it would).

I decided to test that on an m6i.2xlarge instance with a 512 GB io2 disk with 16,000 IOPS.

The idea is that an io2 disk has to be durable, so it will probably have similar behavior to physical hardware.

DiskBuffer SizeWritesDurableParallelTotalRate

io2	256.00	1,024.00	No	1.00	256.00	1,638.40
io2	2,048.00	1,024.00	No	1.00	2,048.00	1,331.20
io2	4.00	4,194,304.00	No	1.00	16,384.00	1,228.80
io2	256.00	1,024.00	Yes	1.00	256.00	144.00
io2	256.00	4,096.00	Yes	1.00	1,024.00	146.00
io2	64.00	8,192.00	Yes	1.00	512.00	50.20
io2	32.00	8,192.00	Yes	1.00	256.00	26.90
io2	8.00	8,192.00	Yes	1.00	64.00	7.10
io2	1,024.00	8,192.00	Yes	1.00	8,192.00	502.00
io2	1,024.00	2,048.00	No	8.00	2,048.00	1,909.00
io2	1,024.00	2,048.00	Yes	8.00	2,048.00	1,832.00
io2	32.00	8,192.00	No	8.00	256.00	3,526.00
io2	32.00	8,192.00	Yes	8.00	256.00	150.9
io2	8.00	8,192.00	Yes	8.00	64.00	37.10

In other words, we are seeing pretty much the same behavior as on the physical machine, unlike the ephemeral drive.

In conclusion, it looks like the limiting factor for direct I/O writes is the number of writes, not their size. There appears to be some benefit for concurrency in this case, but there is also some contention. The best option we got was with big writes.

Interestingly, big writes are a win, period. For example, 16 MB writes, direct I/O:

Single-threaded - 4.4 GB/sec
2 threads - 2.5 GB/sec X 2 - total 5.0 GB/sec
4 threads - 1.4 X 4 - total 5.6 GB/sec
8 threads - ~590 MB/sec x 8 - total 4.6 GB/sec

Writing 16 KB, on the other hand:

8 threads - 11.8 MB/sec x 8 - total 93 MB/sec
4 threads - 12.6 MB/sec x 4- total 50.4 MB/sec
2 threads - 12.3 MB/sec x 2 - total 24.6 MB/sec
1 thread - 23.4 MB/sec

This leads me to believe that there is a bottleneck somewhere in the stack, where we need to handle the durable write, but it isn’t related to the actual amount we write. In short, fewer and bigger writes are more effective, even with concurrency.

As a database developer, that leads to some interesting questions about design. It means that I want to find some way to batch more writes to the disk, especially for durable writes, because it matters so much.

Expect to hear more about this in the future.

Oren Eini

Oren Eini

CEO of RavenDB

NTFS has an emergency stash of disk space

What happens when a sparse file allocation fails?

Configuration values & Escape hatches

Partial writes, IO_Uring and safety

AnswerWhat does this code do?

Windows has the same limit, but it is honest about it

ChallengeWhat does this code do?

Production post-mortemInspecting ourselves to death

It’s expensive to generate the `smaps` file

Accidenal complexity: A tale of two GUIDs

The memory leak in ConcurrentQueue

Performance discoveryIOPS vs. IOPS

FUTURE POSTS

RECENT SERIES

RECENT COMMENTS

Syndication

Main feed
Comments feed

Oren Eini

CEO of RavenDB

NTFS has an emergency stash of disk space

What happens when a sparse file allocation fails?

Configuration values & Escape hatches

Partial writes, IO_Uring and safety

AnswerWhat does this code do?

Windows has the same limit, but it is honest about it

ChallengeWhat does this code do?

Production post-mortemInspecting ourselves to death

It’s expensive to generate the smaps file

Accidenal complexity: A tale of two GUIDs

The memory leak in ConcurrentQueue

Performance discoveryIOPS vs. IOPS

FUTURE POSTS

RECENT SERIES

RECENT COMMENTS

Syndication

It’s expensive to generate the `smaps` file