Reading the NSA’s codebaseLemonGraph review–Part II - Storing edges and properties
The last post left us figuring out how LemonGraph is storing nodes and edges. There are the special properties type and val that are used for this kind of lookup. I find it interesting that these two properties are treated in a special manner. It seems like it would be more useful to have such distinction completely arbitrary or use a single property. A more common approach would be to just have a unique key for each node / edge and be done with it. I’m guessing that this is needed because you might want to do partial searches. So give me all the nodes whose type is X without caring what the value is.
Here is the sample code from the project’s page:
In the last post, I was able to follow up on the first two lines. Now, let’s see how the edge is actually used and how properties are stored. The edge() method call ends up calling the __edge_resolve() method:
This is very similar to what we saw with __node_resolve() method. This might be a good time to stop and consider the edge_t and node_t structures:
They seems pretty simple. The is_new:1 syntax is used by C for bitfields. In other words, the is_new field takes a single bit, while the rectype takes 7 bits.
A node has an id, the record type, the next pointer (which I assume is used for historical queries) and the type/val fields. The edge is pretty much the same shape, except that it has the source and target fields as well.
I’m going to skip detailed dive into the internals of __edge_resolve(), they are nearly identical as the __node_resolve() one. I do want to note that LemonGraph define the following indexes on edges:
- type, val, src, tgt –> id
- src, type –> id
- tgt, type –> id
In other words, it can very quickly find matches for queries on:
- Edges by type and values
- Edges from a source of a particular type
- Edges to a target of a particular type
Setting a property ends up calling _prop_resolve() which looks like this:
Again, very similar to what we have seen before. This is a good thing. Predictability in a code base is a very nice property (pun intended).
Properties are indexes on:
- parent, key –> id
In other words, you can lookup the properties of a node (or an edge) based on its parent or type. Interestingly enough, it isn’t indexing the value of a property. I would expect to have an index on (key, value –> id), which would allow for richer queries. On the other hand, I guess they do most / all of their queries on the type/val pair as the source.
With this, I feel like I have a good grasp on what is going on in LemonGraph in terms of the on disk data structure. In the next post, I’m going to try to figure out how it is handling queries.
More posts in "Reading the NSA’s codebase" series:
- (13 Aug 2018) LemonGraph review–Part VII–Summary
- (10 Aug 2018) LemonGraph review–Part VI–Executing queries
- (09 Aug 2018) LemonGraph review–Part V–Query parsing
- (08 Aug 2018) LemonGraph review–Part IV–Compressed, sortable integers
- (07 Aug 2018) LemonGraph review–Part III - Figuring out queries
- (06 Aug 2018) LemonGraph review–Part II - Storing edges and properties
- (03 Aug 2018) LemonGraph review–Part I - Storing Nodes