Ayende @ Rahien

Feb 03 2025

Challenge: Giving file system developer ulcer

time to read 4 min | 759 words

I’m trying to reason about the behavior of this code, and I can’t decide if this is a stroke of genius or if I’m suffering from a stroke. Take a look at the code, and then I’ll discuss what I’m trying to do below:

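#define MB (1024 * 1024)

// First handle: regular (buffered) access, used as the backing for the memory map.
HANDLE hFile = CreateFileA("R:/original_file.bin",
    GENERIC_READ | GENERIC_WRITE,
    FILE_SHARE_READ | FILE_SHARE_WRITE,
    NULL,
    OPEN_ALWAYS, FILE_ATTRIBUTE_NORMAL,
    NULL);
if (hFile == INVALID_HANDLE_VALUE) {
    printf("Error creating file: %lu\n", GetLastError());
    exit(__LINE__);
}

// Map the whole file (a size of 0, 0 means "use the current file size").
HANDLE hMapFile = CreateFileMapping(hFile, NULL,
    PAGE_READWRITE, 0, 0, NULL);
if (hMapFile == NULL) {
    fprintf(stderr, "Could not create file mapping object: %lx\n", GetLastError());
    exit(__LINE__);
}

char* lpMapAddress = MapViewOfFile(hMapFile, FILE_MAP_WRITE, 0, 0, 0);
if (lpMapAddress == NULL) {
    fprintf(stderr, "Could not map view of file: %lx\n", GetLastError());
    exit(__LINE__);
}

// Modify the range [2 MB, 4 MB) through the memory map.
for (size_t i = 2 * MB; i < 4 * MB; i++)
{
    lpMapAddress[i]++;
}

// Second handle to the same file, used for the direct writes below.
// As listed, it passes FILE_ATTRIBUTE_NORMAL; unbuffered I/O would also need FILE_FLAG_NO_BUFFERING.
HANDLE hDirect = CreateFileA("R:/original_file.bin",
    GENERIC_READ | GENERIC_WRITE,
    FILE_SHARE_READ | FILE_SHARE_WRITE,
    NULL,
    OPEN_ALWAYS,
    FILE_ATTRIBUTE_NORMAL,
    NULL);
if (hDirect == INVALID_HANDLE_VALUE) {
    fprintf(stderr, "Error opening file again: %lx\n", GetLastError());
    exit(__LINE__);
}

LARGE_INTEGER fileSize;   // receives the new file position
DWORD bytesWritten;

// Write [6 MB, 10 MB) back to the file, using the mapped memory itself as the source buffer.
SetFilePointerEx(hDirect, (LARGE_INTEGER) { .QuadPart = 6 * MB }, &fileSize, FILE_BEGIN);
for (int i = 6; i < 10; i++) {
    if (!WriteFile(hDirect, lpMapAddress + i * MB, MB, &bytesWritten, NULL)) {
        fprintf(stderr, "WriteFile direct failed on iteration %d: %lx\n", i, GetLastError());
        exit(__LINE__);
    }
}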

The idea is pretty simple: I’m opening the same file twice. The first time is in buffered mode, mapping that memory for both reads & writes. The problem is that to flush the data to disk, I have to either wait for the OS or call FlushViewOfFile() and FlushFileBuffers() to flush it explicitly.
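As a rough illustration of that explicit flush path, here is a minimal sketch in Python rather than Win32 C. It assumes that `mmap.flush()` and `os.fsync()` play roughly the roles of FlushViewOfFile() and FlushFileBuffers(); the helper names, the offsets, and the temporary file are made up for the example.

```python
import mmap
import os
import tempfile


def write_via_mapping(path, offset, data):
    """Modify a file through a memory map, then flush explicitly."""
    with open(path, "r+b") as f:
        with mmap.mmap(f.fileno(), 0) as view:   # length 0 maps the whole file
            view[offset:offset + len(data)] = data
            view.flush()                          # push dirty mapped pages to the file
        os.fsync(f.fileno())                      # ask the OS to persist the file


def demo():
    """Write through the mapping, then read back through a separate handle."""
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, b"\0" * 4096)                # the file must be non-empty to be mapped
        os.close(fd)
        write_via_mapping(path, 100, b"hello")
        with open(path, "rb") as f:
            f.seek(100)
            return f.read(5)
    finally:
        os.remove(path)
```

Calling `demo()` writes five bytes through the map and reads them back through an ordinary file handle, which is the well-behaved version of what the listing above is trying to sidestep.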

The problem with this approach is that FlushFileBuffers() has undesirable side effects. So I’m opening the file again, this time for unbuffered I/O. I’m writing to the memory map and then using the same mapping to write to the file itself. On Windows, that goes through a separate path (and may lose coherence with the memory map).

The idea here is that since I’m writing from the same location, I can’t lose coherence. I either get the value from the file or from the memory map, and they are both the same. At least, that is what I hope will happen.

For the purpose of discussion, I can ensure that there is no one else writing to this file while I’m abusing the system in this manner. What do you think Windows will do in this case?

I believe that when I’m writing using unbuffered I/O in this manner, I’m forcing the OS to drop the mapping and refresh from the disk. That is likely the reason why it may lose coherence, because there may already be reads that aren’t served from main memory, or something like that.

This isn’t an approach that I would actually take for production usage, but it is a damn interesting thing to speculate on. If you have any idea what will actually happen, I would love to have your input.

Jan 20 2025

Challenge: What does this code do?

time to read 3 min | 536 words

Here is a pretty simple C program, running on Linux. Can you tell me what you expect its output to be?

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>
#include <errno.h>
#include <sys/stat.h>

#define BUFFER_SIZE (3ULL * 1024 * 1024 * 1024) // 3GB in bytes

int main() {
    int fd;
    char *buffer;
    struct stat st;

    buffer = (char *)malloc(BUFFER_SIZE);
    if (buffer == NULL) {
        return 1;
    }

    fd = open("large_file.bin", O_WRONLY | O_CREAT | O_TRUNC, S_IRUSR | S_IWUSR);
    if (fd == -1) {
        return 2;
    }

    if (write(fd, buffer, BUFFER_SIZE) == -1) {
        return 3;
    }

    if (fsync(fd) == -1) {
        return 4;
    }

    if (close(fd) == -1) {
        return 5;
    }

    if (stat("large_file.bin", &st) == -1) {
        return 6;
    }

    printf("File size: %.2f GB\n", (double)st.st_size / (1024 * 1024 * 1024));

    free(buffer);
    return 0;
}

And what happens when I run:

head  large_file.bin  | hexdump -C

This both shows surprising behavior and serves as a good opening for discussion on a whole bunch of issues. In an interview setting, that can give us a lot of insight into the sort of knowledge a candidate has.
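One of the topics this program can open up is that write() is allowed to transfer fewer bytes than requested, so a single call is not guaranteed to write the whole buffer. Below is a minimal sketch of a retry loop, in Python for brevity. `write_all`, `capped_write`, and the injectable `write` parameter are hypothetical names for illustration and are not part of the original program.

```python
import os
import tempfile


def write_all(fd, data, write=os.write):
    """Keep calling write until every byte of data has been handed to the OS."""
    view = memoryview(data)
    total = 0
    while total < len(view):
        written = write(fd, view[total:])   # may accept only part of the buffer
        if written == 0:
            raise OSError("write returned 0 bytes; cannot make progress")
        total += written
    return total


def capped_write(fd, chunk, limit=1000):
    """Stand-in for a system that accepts at most `limit` bytes per call (assumption)."""
    return os.write(fd, chunk[:limit])
```

Passing `write=capped_write` simulates short writes with small buffers, so the loop can be exercised without allocating gigabytes.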

Jul 01 2024

Challenge: Efficient snapshotable state

time to read 4 min | 778 words

At the heart of RavenDB, there is a data structure that we call the Page Translation Table. It is one of the most important pieces inside RavenDB.

The page translation table is basically a Dictionary<long, Page>, mapping between a page number and the actual page. The critical aspect of this data structure is that it is both concurrent and multi-version. That is, at any single point, there may be multiple versions of the table, each representing its state at a given point in time.

The way it works, a transaction in RavenDB generates a page translation table as part of its execution and publishes the table on commit. However, each subsequent table builds upon the previous one, so things become more complex. Here is a usage example (in Python pseudo-code):

table = {}

with write_tx(table) as wtx1:
    wtx1.put(2, 'v1')
    wtx1.put(3, 'v1')
    wtx1.publish(table)

# table has (2 => v1, 3 => v1)

with write_tx(table) as wtx2:
    wtx2.put(2, 'v2')
    wtx2.put(4, 'v2')
    wtx2.publish(table)

# table has (2 => v2, 3 => v1, 4 => v2)

This is pretty easy to follow, I think. The table is a simple hash table at this point in time.

The catch is when we mix read transactions as well, like so:

# table has (2 => v2, 3 => v1, 4 => v2)

with read_tx(table) as rtx1:

    with write_tx(table) as wtx3:
        wtx3.put(2, 'v3')
        wtx3.put(3, 'v3')
        wtx3.put(5, 'v3')

        with read_tx(table) as rtx2:
            rtx2.read(2) # => gives v2
            rtx2.read(3) # => gives v1
            rtx2.read(5) # => gives None

        wtx3.publish(table)

    # table has (2 => v3, 3 => v3, 4 => v2, 5 => v3)
    # but rtx2 still observes the values as they were when
    # rtx2 was created

    rtx2.read(2) # => gives v2
    rtx2.read(3) # => gives v1
    rtx2.read(5) # => gives None

In other words, until we publish a transaction, its changes don’t take effect. And any read transaction that was already started isn’t impacted. We also need this to be concurrent, so we can use the table in multiple threads (a single write transaction at a time, but potentially many read transactions). Each transaction may modify hundreds or thousands of pages, and we’ll only clear the table of old values once in a while (so it isn’t infinite growth, but may certainly reach respectable numbers of items).

The implementation we have inside of RavenDB for this is complex! I tried drawing that on the whiteboard to explain what was going on, and I needed both the third and fourth dimensions to illustrate the concept.

Given these requirements, how would you implement this sort of data structure?
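To make the requirements concrete, here is one naive way to get snapshot semantics, as a minimal sketch and not RavenDB's actual implementation. It keeps a per-key list of (version, value) entries and lets each transaction remember the version that was published when it started. All class and method names are hypothetical. It assumes a single writer at a time (as stated above), leans on CPython's GIL for reader safety, and never cleans up old versions.

```python
class VersionedTable:
    """Multi-version map: each key keeps (version, value) entries, oldest first."""

    def __init__(self):
        self._history = {}     # key -> [(version, value), ...] in ascending version order
        self._published = 0    # latest version visible to newly started transactions

    def read_tx(self):
        return ReadTx(self, self._published)

    def write_tx(self):
        # Assumption: the caller guarantees only one write transaction at a time.
        return WriteTx(self, self._published)

    def _lookup(self, key, snapshot):
        # Walk from newest to oldest; return the first entry this snapshot may see.
        for version, value in reversed(self._history.get(key, ())):
            if version <= snapshot:
                return value
        return None


class ReadTx:
    def __init__(self, table, snapshot):
        self._table = table
        self._snapshot = snapshot   # fixed for the lifetime of the transaction

    def read(self, key):
        return self._table._lookup(key, self._snapshot)


class WriteTx(ReadTx):
    def __init__(self, table, snapshot):
        super().__init__(table, snapshot)
        self._pending = {}          # changes are private until publish()

    def put(self, key, value):
        self._pending[key] = value

    def read(self, key):
        if key in self._pending:
            return self._pending[key]
        return super().read(key)

    def publish(self):
        table = self._table
        version = table._published + 1
        for key, value in self._pending.items():
            table._history.setdefault(key, []).append((version, value))
        # Bumping the version last keeps half-written state invisible to new readers.
        table._published = version
```

Replaying the scenario above against this sketch, a read transaction opened before `wtx3.publish()` keeps returning v2, v1, and None for pages 2, 3, and 5, while one opened afterwards sees v3. The cost is that lookups scan a per-key history and nothing is ever trimmed, which is exactly where a real implementation has to get clever.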

Oct 13 2023

Challenge: Fastest node selection metastable error state – answer

time to read 2 min | 290 words

In the previous post, I showed a very simple request router that would always select the fastest node. That worked for a long while, until it didn’t, and the challenge is figuring out why.

As it turns out, the issue is a simple one of spooky action at a distance. Here is what happens. Assume that we have three servers and 10 clients. Each server is sized to handle 4 clients. So far, so good, the system has the capacity to spare.

The problem is in the manner in which clients will detect which is the fastest node in the cluster. The only thing that is considered is the state of the node at the time of selection. At that time, we may end up with all the clients selecting one particular node as the fastest.

In other words, we have three servers, two of them have no clients talking to them and one of the servers has all the clients talking to it. That results in that node going down, obviously. The clients would then react appropriately, and select a new node to talk to. All of them would do that, find the fastest node, and… bring it down as well. Rinse & repeat.

The issue can be stated as Time Of Check vs Time Of Use, but also as a race condition, where all individual nodes end up doing a synchronized “wave” operation that kills the system.

How do you prevent this?

You introduce randomness into the system. You don’t test the status once, but re-check on a regular basis so you can respond to shifting load. You also randomize that process itself, so the nodes won’t all do this at exactly the same time and end up in the same position.
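A minimal sketch of both ideas, assuming each client keeps a map of measured latencies per node. The function names, the 20% tolerance, and the jitter factor are illustrative choices, not values from the original router.

```python
import random


def pick_node(latencies, rng, tolerance=0.2):
    """Pick at random among the nodes that are within `tolerance` of the fastest."""
    best = min(latencies.values())
    candidates = sorted(
        name for name, ms in latencies.items() if ms <= best * (1 + tolerance)
    )
    return rng.choice(candidates)


def next_check_delay(base_seconds, rng, jitter=0.5):
    """Stretch the re-check interval by a random factor in [1, 1 + jitter]."""
    return base_seconds * (1 + rng.uniform(0, jitter))


if __name__ == "__main__":
    rng = random.Random()
    latencies = {"a": 10.0, "b": 11.0, "c": 30.0}   # milliseconds, made-up numbers
    print(pick_node(latencies, rng), next_check_delay(10, rng))
```

With this shape, clients that see similar latencies for two nodes split between them instead of all piling onto the single "fastest" one, and the jittered delay keeps them from re-evaluating in lockstep.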

Oct 12 2023

Challenge: Fastest node selection metastable error state

time to read 1 min | 186 words

Side note: The current state in Israel right now is bad. I’m writing this blog post as a form of escapism so I can talk about something that makes sense and follows logic and reason. I’ll not comment on the current status otherwise in this area.

Consider the following scenario. We have a bunch of servers and clients. The clients want to send requests for processing to the fastest node that they have available. But the algorithm that was implemented has an issue. Can you see what it is?

To simplify things, we are going to assume that the work that is being done for each request is the same, so we don’t need to worry about different request workloads.

The idea is that each client node will find the fastest node (usually meaning the nearest one) and if there is enough load on the server to have it start throwing errors, it will try to find another one. This system has successfully spread the load across all servers, until one day, the entire system went down. And then it stayed down.

Can you figure out what the issue is?

Sep 19 2023

Challenge: Spot the bug

time to read 1 min | 27 words

The following bug cost me a bunch of time, can you see what I’m doing wrong?

For fun, it’s so nasty because usually, it will accidentally work.

Jan 04 2023

Challenge: what does this code print?

time to read 1 min | 42 words

Given the following code:

Can you guess what it will do?

Can you explain why?

I love that this snippet is under 20 lines of code, but being able to explain it shows a lot more knowledge about C# than you would expect.

Dec 14 2022

Challenge: What does this code print?

time to read 1 min | 30 words

Take a look at the following code, what do you think it will print?

Since it obviously doesn’t print the expected value, why do you think this is the case?

Jul 01 2022

Challenge: Find the stack smash bug… – answer

time to read 2 min | 249 words

Yesterday I presented a bug that killed the process in a particularly rude manner. This is a recursive function that guards against stack overflows using RuntimeHelpers.EnsureSufficientExecutionStack().

Because of how this function kills the process, it took some time to figure out what was going on. There was no StackOverflowException, just an exit code. Here is the relevant code:

This looks okay, we optimize for zero allocations on the common path (less than 2K items), but also handle the big one.

The problem is that our math is wrong. More specifically, take a look at this line:

var sizeInBytes = o.Count / (sizeof(byte) * 8) + o.Count % (sizeof(byte) * 8) == 0 ? 0 : 1;

Let’s assume that your count is 10, what do you think the value of this is going to be?

Well, it looks like this should give us 2, right?

10 / 8 + 10 % 8 == 0 ? 0 : 1

The problem is in the operator precedence. I read this as:

(10 / 8) + (10 % 8 == 0 ? 0 : 1)

And the C# compiler reads it as:

(10 / 8 + 10 % 8) == 0 ? 0 : 1

In other words, *#@*!*@!.

The end result is that we overwrite past our allocated stack. Usually that doesn’t do anything bad, since there is enough stack space. But sometimes, if the stack is aligned just right, we cross into the stack guard page and kill the process.

Oops, that was not expected.
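The two readings can be checked with a small worked example. This is a sketch in Python, where each function spells out one of the parenthesizations from above explicitly; the function names and the `round_up` alternative are illustrative and are not taken from the original code.

```python
def intended(count):
    # (count / 8) + (count % 8 == 0 ? 0 : 1)
    return count // 8 + (0 if count % 8 == 0 else 1)


def as_parsed(count):
    # (count / 8 + count % 8) == 0 ? 0 : 1
    return 0 if (count // 8 + count % 8) == 0 else 1


def round_up(count):
    # A common alternative that avoids the conditional entirely.
    return (count + 7) // 8


if __name__ == "__main__":
    for count in (0, 8, 10, 2048):
        print(count, intended(count), as_parsed(count), round_up(count))
```

For a count of 10, the intended reading gives 2 bytes while the parsed reading gives 1, and for any non-zero count the parsed reading never gives more than 1, which is how the buffer ends up too small.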

Jun 30 2022

Challenge: Find the stack smash bug…

time to read 1 min | 171 words

The following code is something that we ran into yesterday. Under some conditions, this code will fail with a stack overflow. More specifically, the process crashes and the return code is -1073740791 (or as it is usually presented: 0xC0000409).

At this point in my career I can look at that error code and just recall that this is the Windows error code for a stack overflow, to be more precise, this is: STATUS_STACK_BUFFER_OVERRUN
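The two presentations of the code are the same 32-bit value, read once as signed and once as unsigned. A small sketch of the conversion (the helper name is made up for the example):

```python
def exit_code_to_hex(code):
    """Reinterpret a signed 32-bit exit code as unsigned and format it as hex."""
    return "0x{:08X}".format(code & 0xFFFFFFFF)


if __name__ == "__main__":
    print(exit_code_to_hex(-1073740791))  # prints 0xC0000409
```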

That… makes sense, I guess, this is recursive code, after all. Let’s take a look:

Except that this code explicitly protects against this. Note the call to:

RuntimeHelpers.EnsureSufficientExecutionStack();

In other words, if we are about to run out of stack space, we ask the .NET framework to throw (just before we run out, basically).

This code doesn’t fail often, and when we tried to push a deeply nested structure through it, we got an InsufficientExecutionStackException thrown.

Sometimes, however, when we run this code with a relatively flat structure (2 – 4 levels), it will just die with this error.

Can you spot the bug?

Oren Eini

CEO of RavenDB
