Database Building 101The cost of graphing
So far we looked into how we can store the nodes and the edges, and we explored some interesting data structures inside Voron. Now, let’s see how we can traverse the graph.
So getting the node is pretty easy, and remember that to access the table by ID gives us an O(1) cost.
Now, what about finding all the edges from a node?
This is a lot heftier, but let’s try to break it into individual pieces. First, we find the tree holding all the edges of that particular type, then we access (by the ID of the source node) the fixed size tree that holds all the connected nodes and the edge data ID.
Finally, we access the edges data (if it exists) to get the data about the edge itself, after which, we return it all to the caller.
Unlike the previous method, here we can’t claim to have O(1) costs. Instead, our costs are composed of:
- O(logN) – where N is the number of edges types (typically, very few), to search for the right edges tree.
- O(logN) – where N is the number of source nodes (typically, very large, but the fixed size trees are very good at being fast, and they don’t have to do much).
- O(N) – where N is the number of connection between the source node and the destination node.
I’m excluding the cost of loading the edge data, since this is an O(1) operation and is directly relevant to the iteration over all nodes connected to the source node.
Ideally, we can find a way to turn the 2nd operation into an O(1) cost, but that should be more than good enough for the moment.
Now, this just gives us traversal of the nodes, but going from here to Breadth First Search / Depth First Search and doing them well is fairly trivial, although there are a few things that we'll need to talk about, but that would serve as a separate post.
More posts in "Database Building 101" series:
- (25 Aug 2016) Graph querying over large datasets
- (24 Aug 2016) Stable node ids
- (22 Aug 2016) High level graph operations
- (19 Aug 2016) Graphs aren’t toys
- (18 Aug 2016) The cost of graph storage
- (17 Aug 2016) The cost of graphing
- (16 Aug 2016) The storage layer
- (15 Aug 2016) Let us graph this for real
Comments
Hi, in your previous post, you said we can access by key in O(1) no matter how big the database is. while this is good in theory, in practice when you have 1B items, things are getting more clumsy. (at least from mine experience) Is this something you faced with?
Uri, The ID is actually the offset of the value on disk, so while there are cost associated with seeking to it (or at the FS level, translating it to the proper disk offset), that isn't truly something to worry about.
should it be: if(it.Seek(Slices.BeforeAllKeys) == false) yield break; ?
I'm really enjoying this topic and series especially since I'm in the middle of reading a book on graph theory and about to take an algorithms course. Will/have you released the full source code for this demo? I grabbed the gists and a copy of the Vorlon project, but some of the bits to put them together aren't clear to me.
Tal, You are correct, I fixed the code.
Drew, This isn't really a project. This is the equivalent to notepad code, basically. There isn't anything that is fully functional database here that I have. I merely wrote each part to handle the details for the post.
Comment preview