One of our guys is having fun playing with graph databases, and we had a serious discussion on how we can use Voron for that.
No, we don’t have any plans to do a graph database. This is purely one of the guys playing with something that interests him.
For the purpose of this post, I’m only concerned with the ability to store graph data and read it efficiently. I don’t care about the actual graph operations.
Let us look at the following graph:
[Figure: a node named “Michael” linked to a node named “Graphs” by a “Plays With” relationship.]
Here is how we can define this in code:
    var michael = db.CreateNode();
    michael["Name"] = "Michael";

    var graphs = db.CreateNode();
    graphs["Name"] = "Graphs";

    var edge = michael.RelatesTo(graphs, db.Relationship("Plays With"));
Now, how would we go about implementing something like this? Well, with Voron, that is pretty easy.
We’ll start by defining a Nodes tree, which is going to use an incremental 64-bit integer for the key and a JSON object for the value. This means that on CreateNode, we just allocate the next id and store the node itself as a JSON object, which can be as complex as you want.
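To make that concrete, here is a minimal sketch of the idea. It does not touch Voron at all: a SortedDictionary stands in for the Nodes tree, the JSON is passed around as a plain string, and the NodeStore class and its method names are hypothetical, picked just for illustration.

```csharp
using System.Collections.Generic;

// Hypothetical in-memory stand-in for the Nodes tree (Key: Int64, Val: JSON).
public sealed class NodeStore
{
    // Sorted by id, the way a tree keyed on the node id would be.
    private readonly SortedDictionary<long, string> _nodes =
        new SortedDictionary<long, string>();

    // Last id handed out. A real store would have to persist this counter.
    private long _lastId;

    // Allocates the next id and stores the node's JSON under it.
    public long CreateNode(string json)
    {
        long id = ++_lastId;
        _nodes.Add(id, json);
        return id;
    }

    // Looks a node up by id; returns false if there is no such node.
    public bool TryReadNode(long id, out string json) =>
        _nodes.TryGetValue(id, out json);
}
```

Creating two nodes with this sketch hands out the ids 1 and 2, and each id reads back the JSON it was created with.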
We also have relationships, and here it gets a bit more complex. A relationship always goes from one node to another, and it has a specific type. Because there tend to be very few relationship types, we will limit them to 65,536, so a type id fits in 2 bytes. I think that this would be more than enough. This leads us to another tree in Voron, the RelationshipTypes tree, whose key is the string name of the relationship and whose value is just an incremental short. As a result, I can quickly get the id of a relationship type from its name. The reason we need to do this will be obvious shortly.
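A rough sketch of that lookup might look like the code below. Again, this is not Voron code: a Dictionary stands in for the RelationshipTypes tree, and the RelationshipTypeRegistry name is hypothetical. I am also assuming an unsigned 16-bit id (ushort) here, since that is what gives exactly 65,536 distinct values.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for the RelationshipTypes tree (Key: string, Val: 16-bit id).
public sealed class RelationshipTypeRegistry
{
    private readonly Dictionary<string, ushort> _ids =
        new Dictionary<string, ushort>(StringComparer.Ordinal);

    // Next id to hand out. Kept as an int so we can detect running past 65,535.
    private int _next;

    // Returns the id for a relationship name, allocating a new one on first use.
    public ushort GetOrCreate(string name)
    {
        if (_ids.TryGetValue(name, out ushort id))
            return id;

        if (_next > ushort.MaxValue)
            throw new InvalidOperationException("All 65,536 relationship type ids are in use.");

        id = (ushort)_next++;
        _ids.Add(name, id);
        return id;
    }
}
```

The first name registered gets id 0, the next one gets 1, and asking for a known name again returns the id it already has.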
After we have the relationship type, we need to record the actual relationships. Relationships can have their own properties, so the actual relationship is going to be another JSON object, stored as the value in a tree. But what about the key for this tree? That depends on how we are going to work with it. What sort of queries are we going to issue? Obviously, in a graph database, we are going to follow relationships a lot, and the kind of question we are going to ask is almost always going to be “from node X, find all outgoing relations of type Y”. So we might as well do this properly.
The key for the Relationships tree would be 18 bytes: the first 8 bytes are the source node id, the next 2 bytes are the relationship type, and the last 8 bytes are the destination node id. That means that on disk, the data is actually sorted first by the source node id, then by the relationship type, and finally by the destination node id, which makes the kind of queries that I was talking about very natural and fast.
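Here is a small sketch of that key layout and of the “from node X, all outgoing relations of type Y” query as a range scan. It is a toy under a few assumptions, not Voron code: a SortedSet with a byte-by-byte comparer stands in for the tree, the JSON value is left out to keep the focus on the key, and RelationshipKey, ByteOrderComparer and RelationshipIndex are hypothetical names. I am also assuming keys are compared as raw bytes, which is why the numbers are written big-endian: that way byte order matches numeric order for non-negative ids.

```csharp
using System;
using System.Buffers.Binary;
using System.Collections.Generic;

// Builds and reads the 18-byte key: source (8) + type (2) + destination (8).
public static class RelationshipKey
{
    public const int Size = 18;

    public static byte[] Encode(long source, ushort type, long destination)
    {
        var key = new byte[Size];
        // Big-endian, so comparing keys byte by byte sorts them numerically.
        BinaryPrimitives.WriteInt64BigEndian(key.AsSpan(0, 8), source);
        BinaryPrimitives.WriteUInt16BigEndian(key.AsSpan(8, 2), type);
        BinaryPrimitives.WriteInt64BigEndian(key.AsSpan(10, 8), destination);
        return key;
    }

    public static long Destination(byte[] key) =>
        BinaryPrimitives.ReadInt64BigEndian(key.AsSpan(10, 8));
}

// Compares keys as unsigned bytes, left to right.
public sealed class ByteOrderComparer : IComparer<byte[]>
{
    public int Compare(byte[] x, byte[] y) => x.AsSpan().SequenceCompareTo(y);
}

// Hypothetical stand-in for the Relationships tree, keys only.
public sealed class RelationshipIndex
{
    private readonly SortedSet<byte[]> _keys =
        new SortedSet<byte[]>(new ByteOrderComparer());

    public void Add(long source, ushort type, long destination) =>
        _keys.Add(RelationshipKey.Encode(source, type, destination));

    // All destinations reachable from 'source' through relationships of 'type'.
    // Node ids are assumed to be non-negative, so 0..long.MaxValue covers them all.
    public IEnumerable<long> Outgoing(long source, ushort type)
    {
        byte[] first = RelationshipKey.Encode(source, type, 0);
        byte[] last = RelationshipKey.Encode(source, type, long.MaxValue);

        // Both bounds are inclusive; everything in between shares the same 10-byte prefix.
        foreach (byte[] key in _keys.GetViewBetween(first, last))
            yield return RelationshipKey.Destination(key);
    }
}
```

With the relationships (1, 0, 2), (1, 0, 5), (1, 1, 3), (2, 0, 1) and (256, 0, 7) added in any order, Outgoing(1, 0) yields 2 and then 5. The query never looks at the other keys, because they sort outside the range. Note that with a little-endian encoding, node 256 would sort before node 2, and the range scan would fall apart.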
And that is pretty much it. Oh, you’re going to need a metadata tree for things like the last relationship type id, and probably other stuff. But from the point of view of the storage, that is it.
The overall structure is:
Nodes - (Key: Int64, Val: JSON)
RelationshipTypes - (Key: string, Val: Int16)
Relationships - (Key: Int64 + Int16 + Int64, Val: JSON)
And on top of that you’ll be able to write any sort of graph logic.