I just got the following email:
I didn’t give an answer, but I thought that I might provide a hint. Use binary data, and here is your underlying record structure.
I just got the following email:
I didn’t give an answer, but I thought that I might provide a hint. Use binary data, and here is your underlying record structure.
The Sled project is an embedded database written in Rust. I run into it a few times recently and given my day job, I decided to take a peek and understand how it works. The project talks about being Log Structure Merge (and also exposing this to the client) with B+Tree read performance. The last time I read an LSM codebase was quite some time ago, so this is going to be quite interesting, I hope.
As usual, I’m doing a stream of consciousness as I’m going through the code. I’m looking at commit 6ce3e1264d3bb46de768e8fe338bade02d0b30dc.
The first thing that I found is the actual source code for this:
The fact that there is a page cache component already tell me quite a lot about the project. That is uses pages (LevelDB does not, for example) and that it is cached (so likely not using mmap). That will likely be important later, but when starting out I enjoy jumping to conclusions and then seeing if I was right or not.
The following is taken from the readme for the pagecache:
This is quite interesting. Mostly because this shows a very different approach to building a page cache. It isn’t your usual B-Tree page, it seems. The docs reference LLAMA a lot, but I would rather go through the code first and read scholarly articles later.
As usual, I’m going over the code in lexical order, which means that the first real code file that I run into is a Rust implementation of doubly linked list. Nothing really interesting there, except massive amounts of unsafe code, so I’m going to skip to it’s usage, which is to implement a least recently used cache.
There are N shards, determined by the configuration of the system. This is controlled by cache_bits, where 2^CacheBits is the actual size. I’m not sure why the code uses this methods instead of just a number.
The codebase contains the values 0, 2 and 6 for cache_bits, giving us 1 shard, 4 shards and 64 shards, respectively. I assume that they use this to reduce contention on the same Shard instance for concurrent operations.
The LRU’s main method is accessed, which simply forward the call to one of the shards, so I’m going to skip that and see how that actually works when I’ll look at the shard code. Here is what the Shard looks like:
Note that capacity is actually the max capacity, not reserved space in the entries vector. Each entry contains the actual size that it is using, as well and the Node itself (that reside in the LRU). The actual value that is kept is a PageId, which is the native integer type for the platform (32 bits – 4 bytes / 64 bits – 8 bytes). I’m not sure yet if this is a persisted value, but for something that is meant to be used in a persisted store, I find that choice very strange. It means that on 32 bits machine, you can only use 4 billion pages, while 64 bits there is much higher limit. That means that a database created on 64 bits might not be valid for 32 bits.
When building Voron, we were very careful that data will be the same in 32 bits and 64 bits, Windows / MacOS / Linux, etc.
When you call accessed on the Shard, it will add it to the list, and if the size of all entries is higher than the capacity, will remove entries until the size in the cache is smaller than the maximum capacity.
Interestingly enough, Sled is using an Epoch based GC system in Rust, implemented by Crossbeam. I wrote about Epoch systems when I reviewed FASTER, and I understand that Epoch GC is pretty common in Rust concurrent programming, because otherwise you run the risk of memory corruption.
The ds (I assume data structure) folder of the page cache also provide a concurrent Stack implementation, which it pretty elegant to read, but you need to be well versed in the Epoch handling to understand how it all works correctly under the covers.
The interesting stuff in the ds folder is the pagetable, which start with the following structs:
Each of those is initialized to array of pointers that is 2 MB in size (262,144 pointers, basically). That is a pretty big structure, and the PageTable struct ties it all together:
The top of the file mention that this assume that the caller is going to have a dense key space, and this make sense. Theoretically, this will use a maximum of 512GB if you store a sparse set of data in the page table.
All the operations in the page table are some variant of the traverse operation, so let’s focus on that.
First, the key is split to the first 18 bits (0 .. 256K) and the rest of the value. The first part is used to find the level 1 value.
The debug_delay() calls are debug only calls that add random waits, which help catch race conditions. Given that the code uses explicit Atomic instructions, I’m not surprised to see that there is really no help from the Rust safety behavior to ensure that there aren’t any logical race conditions.
This is the first time I’m reading real Rust code that explicitly deal with concurrency (without using the hammer that is Mutex, at least), and it is… reassuring seeing the same kind of behaviors that I’m injecting into my code to validate things.
This next bit handles adding a new value (safely) at the next level. Given the dense assumption, we can safely assume that this is a rare operation
Look at the comment when we fail to win the compare_and_set call. Who is actually going to drop the unused created value? That is on the Guard and Epoch, I think. We associated the value with them when we called into_shared, and since we didn’t take it out from the guard, it will be removed when the Epoch is over.
The final act of traverse() is to just return the value from the second level, like so:
Note that this may be an empty value, so let’s see how this actually get used. I’m using the simplest caller, which is get().
You can see that we have to do explicit null checking, translating that to an Option call to make calling code idiomatic. The rest of the code there is just memory management, so I’m going to skip that and look at the rest of the pagecache code, now that I understand the underlying data structures.
Note, the code uses the Lsn as a logical type name, this maps to a 64 bits signed integer.
The first file in the pagecache crate is the blob_io one, which handles reading blob. Here is the (crate public only) functions that are implemented there.
Remember, LSN is the (Logical Sequence Number) and is a signed 64 bits integer. Config I’m not sure about at this point. The read_blob and write_blob are fairly simple. They just read / write (with CRC32 checks) data from disk. This is done on a full file each time. So that is quite interesting. The Config is responsible for mapping between an LSN and a file path.
The call to remove_blob will simply delete the mapped file from the disk, but it is the call to gc_blobs that is really interesting. What this does is to basically delete all the files whose name is greater than the stable_lsn provided. This is interesting, because it tells us a lot about the actual on disk persistent format.
Based on just the blob_io.rs file, I can tell that Sled is writing fairly large files to disk and reading them to memory. It is probably going to call gc_blobs() as part of its recovery and I’m willing to assume at this point that each blob is actually a segment. As I said, I like assuming things upfront when doing this kind of code reviews. I’m going to continue my reading and see if I was right.
The config file starts with usage instructions, which looks like:
So that is really interesting. First, we can see that Sled is probably only calling fsync once in a while, depending on its configuration. I looked into a real world project, and it uses the default value, which is flush every 500 ms and snapshot after 1 million. The reason this is interesting is that my experience tells me is that whenever I see a configuration option like that, it means that fsync is going to be pretty expensive and the database engine has chosen to allow users to trade performance for safety.
Voron, in contrast, actually don’t have any such feature. We don’t allow users to make that choice, instead, we accepted the fsync costs and dealt with them (using async commit and transaction merging, for example).
Anyway, we were talking about the config and how it works, here is how the Config gives us the path for a blob:
So this is pretty much just as expected. The data for each blob is a separate file under the blobs directory. Now we are only left with figuring out what that blob was.
The next file of interest is the DiskPtr file, which contains this struct:
And the only interesting method is the read():
So now we have more interesting data. There is stuff that is kept inline (presumably small items) and larger stuff that are read from the blobs directory. We already saw what the read_blob behavior is, but we haven’t checked what the meaning of read_message is. This is implemented later in the code, so I’m going to ignore it and just look at its return value:
So that allow us to make sense of what is going above. Basically, this is “gimme some bytes” call, regardless of where this is actually stored (which I don’t know yet).
The next file is event_log, and it is a debug tool to help verify the behavior of the system. I wholeheartedly approve and it is a really good sign for the maturity of the codebase.
The next is flusher, which is a dedicate thread that call this continuously:
There is nothing of interest in there, but it does indicate no coordination between flushing and other activities. Not sure whatever this is a good or bad idea.
Now, let’s see what this iobuf is all about, shall we?
The first thing I run into in this file is this:
And I can’t really figure this one out. I mean, why? Just setting InnerLsn = i64 should be good enough for all platforms, I think. I guess there might be some optimizations stuff that I’m missing, but that seems unlikely. In 64 bits, I would expect isize and i64 to have the exact same behavior.
This is now getting pretty late, so I’m going to close this for today. Tomorrow, I’m going to try to figure out how the rest of it works.
I don’t usually talk about what we call the Studio Features. These set of features impact the RavenDB Studio and are meant to make the lives of developers and administrators of RavenDB easier. We actually spend lot of effort on the Studio, because making features accessible and usable is considered to be a core part of the feature itself. We run the metrics, and bout 30% – 35% of the time spent building RavenDB 4.0 was spent on the Studio. We are still investing > 10% of our time and effort in continuously improving the Studio. I tend not to write about the Studio explicitly because we usually use it to show case the server features.
This feature, on the other hand, is purely a Studio feature. Yesterday I talked about Revisions in RavenDB and was reminded that we also have a new major feature for Revisions in the Studio only. Now you can diff different revisions directly in the Studio.
Here is how you invoke this feature:
When you click on the compare button, you’ll get the following screen:
You can compare a revision to the current document or a revision to revision.
This can make it very easy to see what changed between two versions of a document.
RavenDB allows you to tune, per transaction, what level of safety you want your changes to have. At the most basic level, you can select between a single node transaction and a cluster wide transaction.
We get questions from customers about the usage scenario for each mode. It seems obvious that we really want to always have the highest level of safety for our data, so why not make sure that all the data is using cluster wide transactions at all times?
I like to answer this question with the lottery example. In the lottery example, there are two very distinct phases for the system. First, you record lottery tickets as they are purchased. Second, you run the actual lottery and select the winning numbers. (Yes, I know that this isn’t exactly how it works, bear with me for the sake of a clear example).
While I’m recording purchased lottery ticket, I always want to succeed in recording my writes. Even if there is a network failure of some kind, I never want to lose a write. It is fine if only one node will accept this write, since it will propagate the data to the rest off the cluster once communication is restored. In this case, you can use the single node transaction mode, and rely on RavenDB replication to distribute the data to the rest of the cluster. This is also the most scalable approach, since we can operate on each node separately.
However, for selecting the winning numbers (and tickets), you never want to have any doubt or the possibility of concurrency issues with that. In this case, I want to ensure that there is just one lottery winner selection transaction that actually commit, and for those purposes, I’m going to use the cluster transaction mode. In this way, we ensure that a quorum of the cluster will confirm the transaction for it to go through. This is the right thing to do for high value, low frequency operations.
We also have additional settings available, beyond the single node / full cluster quorum, which is write to a single node and wait for the transaction to propagate the write to some additional nodes. I don’t really have a good analogy for this use case using the lottery example, though. Can you think of one?
The feature outlined in this post is a hidden behind a small button at a relatively obscure part of the RavenDB Studio (Database > Settings > Document Revisions). You can see how it looks like on the right. Despite its unassuming appearance, this is a pretty important feature. Revisions revert is a feature that we wish that no one use, though, which make it an interesting one.
Revision Revert allow you to take your entire database back to a particular moment in time. Documents changes will be undone, deleted documents will be restored, new documents will be removed, etc.
So far, this isn’t a surprising feature, being able to restore to a point in time is a feature that many other database have. How is this feature different? In most systems, a point in time restore require you to… well, restore. In a large database, that can take a lot of time. Revision Revert is an alternative to that. Instead of restoring from scratch, it utilize the revisions features in RavenDB to allow you to just hit the time machine button and go back to the desired time.
The common use case for that is immediately after the “Opps” moment. You have run an query without specifying a where clause, deployed a bad version of your app that removed important fields, etc.
Revision Revert is an online operation, you don’t need to take the database down. In fact, you can still serve reads while the process is going on. Since typically you’ll need to go back in time a relatively short period, this is a very quick process.
In a distributed system, the admin will invoke this process on one of the nodes in the system and the reverts will be applied on that node and then replicated from there to all the other nodes in the system. We have made every attempt to make what is likely to be a pretty stressful event as easy as possible.
You might have noticed the Window configuration in the screen above. What is that about?
To be honest, this is something that we expect most users to never really care about. It is there for correctness’ sake in a distributed environment. Let’s dig a little deeper into this feature.
First thing we need to talk about is time. The point in time that we’ll restore to is the user’s local time. This is converted into UTC internally and used to compute the cutoff point for the revert. In a distributed system, it is possible (even likely) that different machines on the network will have different clocks. (Note that while RavenDB will work just fine and do the Right Thing if your nodes have different timezones, we have found that really confusing. Better to keep all nodes on the same timezone and clock sync system).
This means that one problem for this feature is that changes happening on two machines at the same time may have different time stamps (in UTC, the local time is not relevant). You need to take that into account when using Revision Revert since that is what RavenDB uses to decide what stays and what go.
The second problem is that just because two updates happened at the same time, it doesn’t mean that we learned about them at the same time. What I mean here is that a change that two changes that happened at the same time on different machine may have reached a particular node at very different times. That is where the Window option come into play. We scan the revisions log for all changes to the system. And we scan them in the order that we learned about them. By default, we’ll go back 4 days until we are sure that there aren’t any revisions that we got out of order and missed.
A few additional things about this feature. Obviously, it requires that you’ll have revisions enabled (and have enough revisions capacity to go back far enough in time, naturally). It support live restores and operates nicely in a distributed environment. Note that if you are doing Revision Revert and not all your documents have revisions enabled, only those that have revisions will be reverted.
Currently we apply this revert globally, we are considering allowing you to select specific collections to revert, but I’m not sure how useful that would be in practice.
The most common network topology for RavenDB replication is a full mesh. For example, if you have three nodes in your cluster and a database that reside on all three nodes, you’ll have a replication topology that will look like this:
This works great when the number of nodes that you have in your cluster is reasonably small. However, we recently got a customer question about a different kind of topology. They have a bunch of nodes, in the order of a few dozens, which cooperate to perform some non trivial task. A key part of this is that the nodes are transient and identical. So a new node may pop up, live for a while (days, weeks, months) and then go away. At any given time you might have a few dozen nodes. That kind of environment won’t really work with a full mesh topology. If we would try, it would look something like that (fully connected network with 40 nodes):
This has a total of 780 connections(!) in it. You can create a topology like that, but a lot of the processing power in the network is going to be dedicated to just maintaining these connections. And you don’t actually need it. RavenDB’s replication algorithm is actually a gossip algorithm, and as you grow the number of nodes that take part in the replication, the less connection you need between nodes. In this case, we can take each of the live nodes and connect each of them to four other (random) nodes. The result would look like so:
Remember, each of the nodes is actually connected to a random four other nodes. RavenDB’s replication will ensure that a change to any document in any of the nodes under these conditions will propagate to all the other nodes efficiently.
This approach will also transparently handle any intermediary failures and be robust for nodes coming and leaving on the fly. RavenDB doesn’t implement gossip membership, mostly because that is very heavily dependent on the application and deployment pattern, but once you tell a node who its neighbors are, everything will proceed on its own.
I’m happy to let you know that as of last week, RavenHQ has been updated with full general availability of managed RavenDB 4.1 in the cloud.
If you aren’t familiar with RavenHQ, this update gives you the ability to get a managed RavenDB cluster in minutes. In this case, you get all the usual benefits of RavenDB as well as the peace and quite of someone else managing all the other details for you.
We have been steadily working on making RavenDB a self managed database, but there are still things that it can’t do on its own. RavenHQ close the loop. If your database is large and is about to run out of disk space, RavenHQ can automatically increase the allocated storage, for example. A machine went down? RavenHQ will transparently replace it and allow the cluster to recover without having any impact on your client code.
If you have used RavenHQ in the past, there are important changes. Previously, you would use RavenHQ to provision a database. That has changed, you now provision a cluster of nodes and you can create as many databases as you want. If you need a test database for CI, you can just create one on the fly, no extra charges or need to use additional APIs over the RavenDB native client.
All the usual management features of RavenDB are available as well, including the ability to extend RavenDB on the fly using index extensions and analyzers. We went over all the feedback that we got from the community and users and have done the same thing for RavenHQ that was done in the move from RavenDB 3.5 to RavenDB 4.0. Everything should be better, in the case of RavenDB, we had a rule that things should be at least 10x better, and I believe that you’ll find the RavenHQ experience similar.
As always, we would really love your feedback.
About a month ago I wrote about a particular issue that we wanted to resolve. RavenDB is using X509 certificates for authentication. These are highly secured and are a good answer for our clients who need to host sensitive information or are working in highly regulated environments. However, certificates have a problem, they expire. In particular, if you are following common industry best practices, you’ll replace your certificates every 2 – 3 months. In fact, the common setup of using RavenDB with Let’s Encrypt will do just that. Certificates will be replaced on the fly by RavenDB without the need for an administrator involvement.
If you are running inside a single cluster, that isn’t something you need to worry about. RavenDB will coordinate the certificate update between the nodes in such a way that it won’t cause any disruption in service. However, it is pretty common in RavenDB to have multi cluster topologies. Either because you are deployed in a geo-distributed manner or because you are running using complex topologies (edge processing, multiple cooperating clusters, etc). That means that when cluster A replaces its certificate, we need to have a good story for cluster B still allowing it access, even though the certificate has changed.
I outlined our thinking in the previous post, and I got a really good hint, 13xforever suggested that we’ll look at HPKP (HTTP Public Key Pinning) as another way to handle this. HPKP is a security technology that was widely used, run into issues and was replaced (mostly by certificate transparency). With this hint, I started to investigate this further. Here is what I learned:
Given these facts, we can rely on that to avoid the issues with certificate expiration, distributing new certificates, etc.
Whenever a cluster need a new certificate, it will use the same private/public key pair to generate the new certificate. Because the public key is the same (and we verify that the client has the private key during the handshake), even if the certificate itself changed, we can verify that the other side know the actual secret, the private key.
In other words, we slightly changed the trust model in RavenDB. From trusting a particular certificate, we trust that certificate’s private key. That is what grants access to RavenDB. In this way, when you update the certificate, as long as you keep the same key pair, we can still authenticate you.
This feature means that you can drastically reduce the amount of work that an admin has to do and lead you to a system that you setup once and just keeps working.
There are some fine details that we still had to deal with, of course. An admin may issue a certificate and want it to expire, so just having the user re-generate a new certificate with the private key isn’t really going to work for us. Instead, RavenDB validates that the chain of signatures on the certificate are the same. Actually, to be rather more exact, it verifies that the chain of signatures that signed the original (trusted by the admin) certificate and the new certificate that was just presented to us are signed by the same chain of public key hashes.
In this way, if the original issuer gave you a new certificate, it will just work. If you generate a new certificate on your own with the same key pair, we’ll reject that. The model that we have in mind here is trusting a driver’s license. If you have an updated driver’s license from the same source, that is considered just as valid as the original one on file. If the driver license is from Toys R Us, not so much.
Naturally, all such automatic certificate updates are going to be logged to the audit log, and we’ll show the updated certificates in the management studio as well.
As usual, we’ll welcome your feedback, the previous version of this post got us a great feature, after all.
This post really annoyed me. Feel free to go ahead and go through it, I’ll wait. The gist of the post, titled: “WAL usage looks broken in modern Time Series Databases?” is that time series dbs that uses a Write Ahead Log system are broken, and that their system, which isn’t using a WAL (but uses Log-Structure-Merge, LSM) is also broken, but no more than the rest of the pack.
This post annoyed me greatly. I’m building databases for a living, and for over a decade or so, I have been focused primarily with building a distributed, transactional (ACID), database. A key part of that is actually knowing what is going on in the hardware beneath my software and how to best utilize that. This post was annoying, because it make quite a few really bad assumptions, and then build upon them. I particularly disliked the outright dismissal of direct I/O, mostly because they seem to be doing that on very partial information.
I’m not familiar with Prometheus, but doing fsync() every two hours basically means that it isn’t on the same plane of existence as far as ACID and transactions are concerned. Cassandra is usually deployed in cases where you either don’t care about some data loss or if you do, you use multiple replicas and rely on that. So I’m not going to touch that one as well.
InfluxDB is doing the proper thing and doing fsync after each write. Because fsync is slow, they very reasonable recommend batching writes. I consider this to be something that the database should do, but I do see where they are coming from.
Postgres, on the other hand, I’m quite familiar with, and the description on the post is inaccurate. You can configure Postgres to behave in this manner, but you shouldn’t, if you care about your data. Usually, when using Postgres, you’ll not get a confirmation on your writes until the data has been safely stored on the disk (after some variant of fsync was called).
What really got me annoyed was the repeated insistence of “data loss or corruption”, which shows a remarkable lack of understanding of how WAL actually works. Because of the very nature of WAL, the people who build them all have to consider the nature of a partial WAL write, and there are mechanisms in place to handle it (usually by considering this particular transaction as invalid and rolling it back).
The solution proposed in the post is to use SSTable (sorted strings table), which is usually a component in LSM systems. Basically, buffer the data in memory (they use 1 second intervals to write it to disk) and then write it in one go. I’ll note that they make no mention of actually writing to disk safely. So no direct I/O or calls to fsync. In other words, a system crash may leave you a lot worse off than merely 1 second of lost data. In fact, it is possible that you’ll have some data there, and some not. Not necessarily in the order of arrival.
A proper database engine will:
I’m leaving aside major issues with LSM and SSTables, of which write amplification, and the inability to handle sustained high loads (because there is never a break in which you can do book keeping). Just the portions on the WAL usage (which shows broken and inefficient use) to justify another broken implementation is quite enough for me.
One of the primary reasons why businesses chose to use workflow engines is that they get pretty pictures that explain what is going on and look like they are easy to deal with. The truth is anything but that, but pretty sell.
My recommended solution for workflow has a lot going for it, if you are a developer. But if you’ll try to show a business analyst this code, they are likely to just throw their hands up in the air and give up. Where are the pretty pictures?
One of the main advantages of this kind of approach is that it is very rigid. You are handling things in the event handlers, registering the next step in the workflow, etc. All of which is very regimented. This is so for a reason. First, it make it very easy to look at the code and understand what is going on. Second, it allow us to process the code in additional ways.
Consider the following AST visitor, which operate over the same code.
This took me about twenty minutes to write, mostly to figure out the Graphviz notation. It take advantage of the fact that the structure of the code is predictable to generate the actual flow of actions from the code.
You get to use readable code and maintainable practices and show pretty pictures to the business people.
There are posts all the way to Nov 19, 2019