Reviewing Sled: Part III

Apr 23 2019

Reviewing SledPart III

time to read 6 min | 1078 words

Unusually for me, I had a bit of a pause in reviewing Sled. As a reminder, Sled is an embedded database engine written in Rust. I last stopped looking at the buffer management, but I still don’t really have a good grasp of what is going on.

The next file is the iterator. It looks like it translates between segments and messages in these segments. Here is the struct:

As you can see, the log iterator holds an iterator of segments, and iterating over it looks like it will go over all the messages in the segments in order. Yep, here is the actual work being done:

The next() method is fairly straightforward, I found. But I have to point out this:

First, the will need call is really interesting. Mostly because you have a pretty obvious way to do conditional compiling that doesn’t really sucks. #if is usually much more jarring in the code.

Second, I think that the style of putting really important functions inside an if result in a pretty dense code. Especially since the if is entered only on error. I would have preferred to have it as a stand alone variable, and then check if it failed.

What I don’t understand is the read_segment call. Inside that method, we have:

There are also similar calls on segment trailer. It looks like we have a single file for the data, but stuff that is too large is held externally, in the blob files.

We then get to this guy, which I find really elegant way to handle all the different states.

That is about it for interesting bits in the iterator, the next fun bit is the Log. I do have to admit that I don’t like the term log. It is too easy to get it confused with a debug log. In Voron, I used the term Journal or Write Ahead Journal (OH in the office: “did we waj it yet?”).

The fact that you need to figure out where to get the offset of the data you are about to write is really interesting. This is the method that does the bulk of the work:

Note that it just reserve and complete the operation. This also does not flush the data to disk. That is handled by the flusher or by explicit call. The reserve() method calls to reserve_internal() and there we find this gem:

I know what it does (conditional compilation), but I find it really hard to follow. Especially because it looks like a mistake, with buf being defined twice. This is actually a case where an #if statement would be better, in my eyes.

Most of the code in there is to manage calls to the iobuf, which I already reviewed. So I’m going to skip ahead and look at something that is going to be more interesting, the page cache. Sled has an interesting behavior, in that it can shred a page into multiple location, requiring some logic to bring it all back together. That is going to be really interesting to look at, I hope.

The file stats with this:

And this… takes a while to unpack. Remember that epoch is manual GC pattern for concurrent data structure without GC.

The cached_ptr value is a shared pointer to a Node (inside a lock free stack) that holds a CacheEntry with static lifetime and thread safe to a generic argument that must have static lifetime and be thread safe. And there is a unsigned long there as well.

No idea yet what is going on. But here is the first method on this struct:

That is… a lot. The cache entry is a discriminated union with the following options:

There are some awesome documentation comments here, including full blown sample code that really help understand what is going on in the code.

There seems to be a few key methods that are involved here:

allocate(val) – create a new page and write an initial value, gets back a page id.
link(id, val) – make a write to a page id. Which simply write a value out.
get(id) – read all the values for a page id, and uses a materializer to merge them all to a single value.
replace(id, val) – set the page id to the new value, removing all the other values it had.

The idea here, as I gather. Is to allow sequential writes to the data, but allow fast reads, mostly by utilizing SSD’s random read feature.

I’m trying to follow the code, but it is a bit complicated. In particular, we have:

This try to allocate either a free page or allocate a new one. One of the things that really mess with me here is that the use of the term Page. I’m using to B+Tree, where a page is literally some section of memory. Here it refers to something more nebulous. Key point here, I don’t see where the size is specified. But given that you can link items to page, that sort of make sense. I just need to get used to Pages != storage.

The fact that all of this is generic also make it hard to follow what is actually going on. I’m getting lost in here, so I think that I’ll stop for now.

Tweet Share Share 0 comments

Tags:

reviews

Oren Eini

Oren Eini

CEO of RavenDB