Ayende @ Rahien

Hi!
My name is Oren Eini
Founder of Hibernating Rhinos LTD and RavenDB.
You can reach me by phone or email:

ayende@ayende.com

+972 52-548-6969

, @ Q c

Posts: 6,195 | Comments: 46,074

filter by tags archive

Database Building 101Graph querying over large datasets

time to read 3 min | 522 words

I mentioned that maintaining physical ids is important for performance reasons in my previous post, but I skipped on exactly why. The short answer is that if I have a physical ids, it is much easier to implement locality and much easier to implement parallel locality.

Let us imagine a database whose size is about 100GB, running on a machine that has 6 GB of RAM. You need to do run some sort of computation that traverse the graph, but doing so naively will likely cause us to trash quite a lot, as we page memory in and out of the disk, only to jump far away in the graph, paging even more, and effectively killing all your performance.

Instead, we can do something like this, let us imagine that you have a machine with 4 cores on it, and the previous mention setup. And then you start 4 threads (each marked with a different color on the image, and start processing nodes.

image

However, there is a trick here, each thread has a queue, and only ids that fall without the area of responsibility of the thread will arrive there. But we aren’t done, inside a thread we define additional regions, and route requests to process each region into each own queue.

Finally, within each thread, we process one region at a time. So the idea is that while we are running over a region, we may produce work that will need to run on other regions (or even other threads), but we don’t care, we queue that work and continue emptying the work that exists on our own region. Only once once we have completed all work in a particular region will we move to the next one. The whole task complete when, in all threads, there are no more regions with work to be done.

Note that the idea here is that each thread is working on one region at a time, and that region maps to a section of the database file that was memory mapped. So we keep that are of the page cache alive and well.

When we move between regions, we can hint to the memory manager that we are going to need the next region, etc. We can’t escape the need to process the same region multiple times, because processing in one region may lead us to processing in another, and then back, but assuming we run the regions using least recently accessed, we can take advantage on the stuff remaining in the page cache (hopefully) from the previous run and using that.

Which is why the physical location on disk is important.

Note that the actual query that we run is less important. Typical graph queries are fall into one of two categories:

  • Some sort of Breadth First Search or Depth First Search and walking through the graph. 
  • Finding a sub-graph in the larger graph that matches this criteria.

In both cases, we can process such queries using the aforementioned process, and the reduction in random work that the database has to do is big.

Database Building 101Stable node ids

time to read 4 min | 647 words

A few posts ago, I talked about the problem of having unstable ids, in particular, ids that can be reused. That leads to quite a lot of complexity, as anyone who ever had to deal with Lucene documents ids knows.

So we are willing to pay something toward stable ids, the questions is what?

One way of doing that is to just store the physical id (unstable) and a virtual id (stable) in a B+Tree (actually, a pair of them, since you’ll need to refer to them back and forth). That means that for the most part, internally to the engine, we’ll use the physical id (with its nice property of having O(1) access time), but externally we’ll expose the stable virtual id (probably sequential numbering, since that is easiest).

Note that I still want to use the physical ids, I’ll discuss exactly why that is important in my next post, for now, let us just say that it is an important component of ensuring high performance for large datasets.

The problem with using B+Tree is that the cost of finding the virtual <—> physical id mapping is O(logN), which for 10 million nodes and 100 million edges is 23 & 24 respectively. Except that this isn’t the real cost function for B+Tree.

Assuming that we have 255 items per page, we actually would need to do 4 page lookups, and a total of 54 comparisons to find the right value. For the edges, we would need 5 page look ups and over 60 comparisons.  Note that this isn’t an issue on its own, but it is an issue when we are talking about having this kind of cost in the hot path of the application. And this is very likely going to be in the hot path.

Oh, there are ways around it, we can only translate back and forth at the edges of the database, so internally we’ll always use the physical address, and only translate it out when we are done. But that is hard to actually do properly, since you need the virtual address for a whole lot of stuff all over the place.

We can steal the idea of page translation tables from the processor. Something like this:

image

Effectively, we’ll lazy allocate segments of pages and pull them together into a hierarchy. So finding out the physical address of id 84 would involve looking at the root, finding the next page down with mod operation, and so forth until we find the right value and check there. This has the advantage of being simple, O(1) and obvious. It is also pretty good in terms of space saving, since the virtual id can be “stored” without taking any space (it is the position of the physical id in the “array” we created.

This has one drawback, there is no way to recover space. Because the indexer into this data structure is meaningful, we can’t just compact things. Once space is allocated, that is it.  Now, to be fair, the cost in size here for all 100 million edges is about 0.75 GB, so not meaningful in the long run, but if we have a busy database that always write and delete, we have no way to recover the space.

The “proper” answer, by the way, is to implement an external hash table. That has the property of O(1), can grow and shrink as the amount of data changes. I’m not presenting it here mostly because it is something that we haven’t yet had the need to implement in Voron, so it isn’t something that I can just show and move on. Beside, it is fun to explore all the wrong ways of doing something.

Database Building 101High level graph operations

time to read 2 min | 247 words

I talked about high level and low level data operations. So far, all we have imageseen are very low level operations (get node, get edges for, etc).

Let us see how we’ll deal with a bigger challenge. In this case, we want to implement a classic graph operation, doing a depth first search, filtering by both nodes and edges.

Here is how we can implement this:

In the real world, we’ll need quite a bit more. On each node (and edge) we’ll need to decide if to return it from the query, or just traverse through it, etc. And that is just to start with.

But I think this demonstrate the point of how to layer behavior on top of the lower level database operations.

There is one thing that we need to talk about still, this code will actually use a lot of individual transactions, one for each independent operation. That is quite expensive, we can open a single transaction and pass it to the functions we call, so there is just a single cost for the entire duration of the operation.

Other things we can do is to explicitly designate specific scenarios as important and change the design so we can answer them very quickly (as in the O(1) cost for accessing nodes/edge data).

Database Building 101Graphs aren’t toys

time to read 2 min | 396 words

image

I  keep calling this a toy database, and it is true for more reasons than the code existing mostly as unconnected snippets. When defining the storage layer, I breezed through quite a lot of stuff because they didn’t really matter for the point I was trying to make.

We’ll start with talking about node IDs. When we save a node to the database, we get an int64 ID back. What is that? We know that it gives us an O(1) access time to the node (or the edge data), but that’s about it. Typically, we don’t expose the internal table ID in this manner. Because the ID we get back from the Insert corresponds to the location on the disk where that data exists. So the node ID is something that is arbitrary and doesn’t correlate to anything. It isn’t a sequential number, or something that the user defines, or anything like that.

That can lead to issues. In particular, if you want to look up a node by some property, you need to have a way to do so that will retrieve its ID, and only then you can do graph operations from there. The common solution is to use a Lucene index for searching by the properties and finding the root node(s) and proceed from there.

But what about deletes? Deleting a node is possible, and when you do that, the space that was reserved for that node will be freed, and can be reused, so you’ll have a different node with the same ID. This leads to some awkwardness (you can see that with the Lucene document IDs, which have the same problem, although for very different reasons).

Updates also pose a problem, because if you need to extend the size of the node, it might be relocated, which changes the ID. Deleting is a challenge, because you need to remove the deleted node ID from all the edges that reference it, instead of cleaning it up on the fly.

This leads us to either abstract the physical ID with a virtual one (and pay the O(logN) cost for resolving it) or find a way to deal with the above inside your API.

Database Building 101The cost of graph storage

time to read 5 min | 802 words

imageSo now that we know how to store the data, in a way that allows efficient graph traversal, let’s compute some back of the envelope computations for storage costs.

Like any storage system, Voron needs to store some metadata about our data, and sometimes this can be very surprising to people.

Let’s look at each of the items that we store in turn.

  • Node data is stored in a table.
  • Edge data is stored in a table.
  • The edge itself is stored in a B+Tree containing fixed size trees.

A table does a bunch of stuff, including reserving some space on the disk, but we don’t have dynamic tables here, so those costs are relatively fixed.

The cost per item, however, depends on the size of the data. If the data size is less than 2036 bytes, then the cost of storing the data is a 4 bytes overhead. If, however, the size of the data item is higher than 2036, we round it up to 4KB section.

In other words, storing ten million nodes, which measure 1KB in size, will cost us about 40 MB  of overhead (compared to roughly 10 GB of data). However, if the size of the data is 2KB, we need to store them in a full page. The reason for this, by the way, is that we need to balance the cost of insert and the cost of update. So we only modify things on page boundary (in this case, 4KB). If the value is small, we pack multiples of them in a single page, but beyond a certain size, we just allocate dedicated pages for them, and live with a bit of free space in the end.

More interesting is the storage of the edge data, actually. A B+Tree costs a minimum of 4KB, and we have one of these per each of the edge types that we have. In practice, we don’t expect there to be a high number of edge types, and we can readily ignore this as fixed size costs. In most cases, I would be stunned to hear that there is more than a single 4KB page for all your edges types (should be enough for a hundred or so).

What isn’t fixed size is the number of fixed size tree (one per source node) and the number of entries in the fixed size trees (one per connection). The whole reason we have fixed size trees is that they allow us to make efficient use of storage by making assumptions. You can see this in their usage. A fixed size tree has an int64 as the key, and you need to specify upfront how much space you need for the values. That makes it very simple to write.

Fixed size trees actually have two forms, they can be embedded or they can be free floating. That mostly depends on their size. If they are embedded, they are stored inside the parent tree, but if they are too large, we store them in their own page. Embedded usage takes 6 bytes per fixed size tree, we have 8 bytes for the key, and the entry size itself (which in our case is also 8 bytes). So a total of 16 bytes per entry inside the fixed size tree.

What this means, in practice, is that up until the moment a node has more than 254 connections, it can be stored as embedded value. When it goes beyond that, we’ll spill over to a full page and start taking space at 4KB increments.

One thing that I didn’t mention is that we store the fixed size trees (keyed by their source node ID), inside a parent B+Tree. Here we need to track a lot more information (keys and values have different sizes, etc). The overhead cost per entry in a B+Tree is 14 bytes. Add to that the 8 bytes for the source node id, and it comes to 22 bytes per source node.

Given all of those numbers, if we had a graph with 10 million nodes and each node was connected to a 10 other nodes in average, and each node/edge was 0.5KB in size, we would have:

  • 5 GB – Nodes data – 10,000,000
  • 50 GB – Edges data – 100,000,000
  • 80 MB – overhead data for nodes & edges.
  • 1.75 GB – edges information about all nodes.

Note that in such a graph, we have 10 million nodes, but a hundred million edges. All of which can fit comfortably into RAM on a good server, and give you blazing good performance.

Database Building 101The cost of graphing

time to read 2 min | 368 words

image

So far we looked into how we can store the nodes and the edges, and we explored some interesting data structures inside Voron. Now, let’s see how we can traverse the graph.

So getting the node is pretty easy, and remember that to access the table by ID gives us an O(1) cost.

Now, what about finding all the edges from a node?

This is a lot heftier, but let’s try to break it into individual pieces. First, we find the tree holding all the edges of that particular type, then we access (by the ID of the source node) the fixed size tree that holds all the connected nodes and the edge data ID.

Finally, we access the edges data (if it exists) to get the data about the edge itself, after which, we return it all to the caller.

Unlike the previous method, here we can’t claim to have O(1) costs. Instead, our costs are composed of:

  1. O(logN) – where N is the number of edges types (typically, very few), to search for the right edges tree.
  2. O(logN) – where N is the number of source nodes (typically, very large, but the fixed size trees are very good at being fast, and they don’t have to do much).
  3. O(N) – where N is the number of connection between the source node and the destination node.

I’m excluding the cost of loading the edge data, since this is an O(1) operation and is directly relevant to the iteration over all nodes connected to the source node.

Ideally, we can find a way to turn the 2nd operation into an O(1) cost, but that should be more than good enough for the moment.

Now, this just gives us traversal of the nodes, but going from here to Breadth First Search / Depth First Search and doing them well is fairly trivial, although there are a few things that we'll need to talk about, but that would serve as a separate post.

Database Building 101The storage layer

time to read 3 min | 559 words

We are going to be using Voron to actually handle the storage of the data. Voron is a low level storage engine, which provide, among other things, fully ACID, MVCC and high performance.

For today, what'll look at are the following operations:

  • Add a node.
  • Add an edge between two nodes.
  • Traverse from a node to all its edges (cheaply)
  • Traverse from an edge to the other node (cheaply).

Here is the interface code that we will need to build:

I tried to make it as simple as it possible could be. I’m using NameValueCollection because it is a trivial property bag to serialize / deserialize without bringing any additional dependencies.

Let us focus on the initialization of this class. We need to create a StorageEnvironment in Voron, and setup some structures.

Voron has several different storage options for data, and in this case, we are using a table. A table (which is very much unlike a RDMBS table) it a way to store records that has some indexes, but in this case, we are doing something strange. We are defining a table that has no indexes (this is so strange that we need to explicitly allow it). This is useful because tables manages indexes for us automatically, but in this case we will use them strictly for storage. We’ll see why in just a bit.

 

The code here is a bit strange. Tables are actually meant to hold multiple values, and define indexes on this. So using them to store a single value is something that you wouldn’t normally do. Except that the id that we get back from the table has a very interesting property. It has O(1) cost of access. So given a node id, I can access it directly, regardless of how big my database is. That is a property that I want, because effectively random access of nodes is something that happens quite a lot and is highly desirable.

Now, let us see how we can connect two edges, shall we. This code ignores all error handling, missing nodes, etc. It is meant to explain the core stuff, this is a toy database, after all.

Here we continue doing strange stuff. We already know that we use the empty schema table to have an O(1) access to data. And we store the edge’s data there. But then we run into some new stuff. We create a B+Tree called “Edges_”+type, which hold all of the edges of a particular type. But the content of this B+Tree is not simple. Instead, it is using fixed size trees. Those are also B+Trees, but they have well known size both for keys (which must be long) and the value (which must be small < 256 bytes). Because they are very compact, we can pack quite a lot of data into them, and work with them efficiently.

The end result is that we are now storing the node data and access it at O(1) cost. We also store a B+Tree full of fixed size tree (whose name is the source node id) and whose keys are the destination nodes, and the values are the edge data.

Confusing yet? Not much different than Dictionary<SourceNodeId, Dictionary<DestinationNodeId, EdgeDataId>>. That is quite enough for today, tomorrow I’ll show how we can traverse the data, and what kind of costs are associated with it.

Database Building 101Let us graph this for real

time to read 3 min | 487 words

In the Guts n’ Glory of Database Internals posts series (which I’ll probably continue if people suggest new topics), I talked about the very low level things that are involved in actually building a database. From how to ensure consistency to the network protocols. But those are very low level concerns. Important ones, but very low level. In this series, I want to start going up a bit in the stack and actually implement a toy database on top of real production system, to show you what the database engine actually does.

In practice, we divide the layers of a database engine this way:

  1. Low level storage (how we save the bits to disk), journaling, ACID.
  2. High level storage (what kind of storage options do we have, B+Tree, nested trees, etc).
  3. Low level data operations (working on a single item at time).
  4. High level data operations (large scale operations, typically).
  5. Additional features (subscribing to changes, for example).

In order to do something interesting, we are going to be writing a toy graph database. I’m going to focus on levels 3 & 4 here, the kind of data operations that we need to provide the database we want, and we are going to build over pre-existing storage solution that handles 1 & 2.

Selecting the storage engine – sometimes it make sense to go elsewhere for the storage engine. Typical examples includes using LMDB or LevelDB as embedded databases that handles the storage, and you build the data operations on top of that. This works, but it is limiting. You can’t do certain things, and sometimes you really want to. For example, LMDB supports the notion of multiple trees (and even recursive trees), while LevelDB has a flat key space. That has a big impact on how you design and build the database engine.

At any rate, I don’t think it will surprise anyone that I’m using Voron as the storage engine. It was developed to be a very flexible storage engine, and it works very well for the purpose.

We’ll get to the actual code in tomorrow’s post, but let’s lay out what we want to end up with.

  • The ability to store nodes (for simplicity, a node is just an arbitrary property bag).
  • The ability to connect nodes using edges.
    • Edges belong to types, so KNOWS and WORKED_AT are two different connection types.
    • An edge can be bare (no properties) or have data (again, for simplicity, just arbitrary property bag)

The purpose of the toy database we build is to allow the following low level operations:

  • Add a node.
  • Add an edge between two nodes.
  • Traverse from a node to all its edges (cheaply)
  • Traverse from an edge to the other node (cheaply).

That is it, should be simple enough, right?

The Guts n’ Glory of Database Internals: Early lock release

time to read 2 min | 376 words

After talking about transaction merging, there is another kind of database trick that you can use to optimize your performance even further. It is called early lock release. With early lock release, we rely on the fact that the client will only know when the transaction has been committed after we told him. And that we don’t have to tell it right away.

That sounds strange, how can not telling the client that its transaction has been committed improve performance. Well, it improve throughput, but it can also improve the performance. But before we get into the details, let us see the code, and then talk about what it means:

The code is nearly identical to the one in the previous post, but unlike the previous post, here we play a little game. Instead of flushing the journal synchronously, we are going to do this in an async manner. And what is more important, we don’t want for it to complete. Why is that important?

It is important because notifying the caller that the transaction has been complete has now moved into the async callback, and we can start processing additional operations are write them to the journal file at the same time that the I/O for the previous operation completes.

As far as the client is concerned, it doesn’t matter how it works, it just need to get the confirmation that it has been successfully committed. But from the point of view of system resources, it means that we can parallelize a key aspect of the code, and we can proceed with the next set of transactions  before the previous one even hit the disk.

There are some issues to consider. Typically you have only a single pending write operation, because if the previous journal write had an error, you need to abort all future transactions that relied on its in memory state (effectively, rollback that transaction and any speculative transactions we executed assuming we can commit that transaction to disk. Another issue is that error handling is more complex, if a single part of the merge transaction failed, it will fail unrelated operations, so you need to unmerge the transaction, run them individually, and report on each operation individually.

The Guts n’ Glory of Database Internals: Merging transactions

time to read 4 min | 790 words

A repeating choice that we have to make in databases is to trade off performance for scale. In other words, in order to process more requests per time frame, we need to increase the time it takes to process a single request. Let’s see how this works with the case of transaction commits.

In the simplest model, a transaction does its work, prepares itself to be committed, writes itself to the journal, and then notifies the client that it has successfully committed. Note that the most important part of the process is writing to the journal, which is how the transaction maintains durability. It also tends to be, by far, the most expensive part of the operation.

This leads to a lot of attempts to try and change the equation. I talked about one such option when we talked about the details of the transaction journal, having a journal actor responsible for writing the transaction changes as they happen, and amortize the cost of writing them over time and many transactions. This is something that quite a few databases do, but that does have certain assumptions.

To start with, it assumes that transactions are relatively long, and spend a lot of their time waiting for network I/O. In other words, this is a common model in the SQL world, where you have to open a connection, make a query, then make another query based on the results of that, etc. The idea is that you parallelize the cost of writing the changes to the journal along with the time it takes to read/write from the network.

But other databases do not use this model. Most NoSQL databases use the concept of a single command (which may aggregate commands), but they don’t have the notion of a long conversation with the client. So there isn’t that much of a chance to spread the cost of writing to the journal on the network.

Instead, a common trick is transaction merging. Transaction merging relies on the observation that I/O costs are no actually linear to the amount of I/O that you use. Oh, sure, writing 1KB is going to be faster than writing 10 MB. But it isn’t going to be two orders of magnitude faster. In fact, it is actually more efficient to make a single 10MB write than 1024 writes on 1 KB. If this make no sense, consider buying groceries and driving back to your house. The weight of the groceries has an impact on the fuel efficiency of the car, but the length of the drive is of much higher importance in terms of how much fuel is consumed. If you buy 200 KG of groceries (you are probably planning a hell of a party) and drive 10 KB home, you are going to waste a lot less fuel then if you make 4 trips with 50 KG in the trunk.

So what is transaction merging? Put simply, instead of calling the journal directly, we queue the operation we want to make, and let a separate thread run through all the concurrent operations. Here is the code:

The secret here is that if we have a load on the system, by the time we read from the queue, there are going to be more items in there. This means that when we write to the journal file, we’ll write not just a single operation (a trip back & forth to the grocery store), but we’ll be able to pack a lot of those operations immediately (one single trip). The idea is that we buffer all of the operations, up to a limit (in the code above, we use time, but typically you would also consider space), and then flush them to the journal as a single unit.

After doing this, we can notify all of the operations that they have been successfully completed, at which point they can go and notify their callers. We traded off the performance of a single operation (because we now need to be more complex and pay more I/O) in favor of being able to scale to a much higher number of operations. If your platform support async, you can also give up the thread (and let some other request run) while you are waiting for the I/O to complete.

The number of I/O requests you are going to make is lower, and the overall throughput is higher. Just to give you an example, in one of our tests, this single change moved us from doing 200 requests / second (roughly the maximum number of fsync()/ sec that the machine could support) to supporting 4,000 requests per second (x20 performance increase)!

FUTURE POSTS

  1. Voron’s internals: MVCC - All the moving parts - 13 hours from now
  2. Voron internals: Cleaning up scratch buffers - 4 days from now
  3. Voron internals: The transaction journal & recovery - 5 days from now
  4. Voron internals: I/O costs analysis - 6 days from now
  5. Benchmarking with Go - 7 days from now

And 3 more posts are pending...

There are posts all the way to Sep 06, 2016

RECENT SERIES

  1. Database Building 101 (8):
    25 Aug 2016 - Graph querying over large datasets
  2. Production postmortem (16):
    23 Aug 2016 - The insidious cost of managed memory
  3. Digging into the CoreCLR (3):
    12 Aug 2016 - Exceptional costs, Part II
  4. The Guts n’ Glory of Database Internals (20):
    08 Aug 2016 - Early lock release
View all series

RECENT COMMENTS

Syndication

Main feed Feed Stats
Comments feed   Comments Feed Stats