Reading the NSA’s codebase: LemonGraph review–Part I - Storing Nodes

Aug 03 2018

Reading the NSA’s codebaseLemonGraph review–Part I - Storing Nodes

time to read 7 min | 1294 words

A few months ago the NSA released LemonGraph, a graph database based on LMDB. This weekend I took the time to explore the codebase. As usual, I’m going to be reading the code in lexical order, and mostly trying to figure out where things are happening and what the code is doing.

I started to read the Python code, but it immediately called to native C code:

Looking at the lemongraph.h file, I see a lot of quite interesting methods. This looks like the heart of LemonGraph is implemented in C, while the higher level Python code is orchestrating things. Not sure if this is a fair statement, but I’m going to be reading some C code now.

The first interesting thing in the C code is that LemonGraph actually wrapped LMDB with their own interface. From db.c, we have:

The MDB_xyz flags are defined by LMDB, DB_xyz are defined by LemonGraph. I assume that this was done to hedge their bets with regards to the underlying storage engine. From my own experience, any such attempt is usually doomed to failure. It isn’t the compilation errors that will kill you but the different behavior that you are coupled to that matters. A lot of the code there is basically forwarding directly to LMDB, but there is actual behavior implemented in db.c. In particular, they seem to have a lot of logic around managing remapping of LMDB (which is fixed size and requires reopening to change).

The lemongraph.c file is about 1,750 lines in size. I skimmed it a bit, but let’s get to the interesting bits. I tried to check where it is creating a new node and follow the code from there, but I got lost. So I started with the docs and looked at this code:

As a reminder, I’m reading this code in a text editor, not running it, so I’m somewhat limited by what I can discover. This Python code is strange, because it is doing creation by omission. In other words, if you already had a a node with “foo:bar” there, it would return it. Otherwise, a new node is created. Not my cup of tea, but make sense for a Python interface.

Let’s dig deeper into how this is built, shall we? This call ends up here:

This then calls to:

I want to stop everything for a moment and look at the first line of this code. It calls malloc, and it does this in a pretty strange manner. Note that it looks likes it dereferences the variable before it has been initialized. This isn’t what is actually going on.

The sizeof operator is a compile time feature. And this relies on this code:

In C, structs are referred to as “struct STRUCT_NAME” and are typically typedef to themselves to make things easier. In this codebase, there is “struct node_t” and “node_t”, which is a pointer to the struct. I find this very confusing, to be honest.

At any rate, back to _node_resolve method. You can see something pretty interesting here. we have the _string_resolve method. What is it doing there? This just forward the call to __resolve_blob, which is really interesting.

It is a bit long, and you can read it at your leisure here. The signature is here:

Basically, it uses a feature of LMDB that allow to store multiple values for a key. I started describing what this is doing, but then I realized that this is crazy. Sample code would be much easier. Here is the essence of this method:

Dictionary<uint, List<long>> _crcToIds;

Dictionary<long, string> _idsToStrs;

long _lastId;

long __resolve_blob(string str){

uint crc = ComputeCrc(str);

if(_crcToIds.TryGetValue(crc, out var list))

{

foreach(var id in list)

{

var existingStr = _idsToStrs[id];

if(existingStr == str)

return id;

}

}

else

{

_crcToIds[crc] = list = new List<long>();

}

var id = ++_lastId;

list.Add(id);

_idsToStrs.Add(id, str);

return id;

}

view raw __resolve_blob.cs hosted with ❤ by GitHub

As you can see, the idea is that LemonGraph maintains an internal string index, allowing it to turn any existing string into a numeric id. Going back to the _node_resolve() method, you can see that the _string_resolve method is used to get a unique id for both the type and val parameters. This is then sent to the __node_resolve() method.

As an aside, I’m loving the fact that you can figure out the call graph for methods based on the number of underscores prefixed to the name. No underscore, public method. Single underscore, private method used internally. Two underscores, hidden method, used sparingly. At three underscores, I could tell you what it means, but given that this is NSA’s codebase, I’ll have to kill you afterward.

This is a nice method. We first do a lookup, and if it isn’t there, we’ll append a new node. I’m going to look at the _node_append() method first, because it will likely make looking at the _node_lookup() method easier.

The encode() method uses length prefixed ints to encode int64 values in fewer bytes. I’m not sure why they chose this format over variant size int. That would save at least one byte in almost all cases. The end result of this method is that we pass the dbuf buffer to the _log_append() method with the (next, type, val) values encoded in it. I’m not sure what next is at this point, but I know that in the case we have here, this is meant to be 0, so we can probably ignore it for now. The _log_append() method simply append the buffer to a tree and returns a sequential id, nothing more. Going back up a bit, the _node_index() is now called, which looks like this:

This is basically is just adding the value to the log and creating an index to search by (type,val) – giving the id. This just make me that much more interested in the next. I’m pretty sure that this is required for a feature they call historical views, which likely means that they never actually do deletes, just markers. This is a nice feature if you want to effectively go back in time, or maybe compare current and past events.

On the one hand, this limits the usability of the graph, because your data set will only grow. On the other hand, not having to worry about deletes is so much easier.

Now we can go and figure out what is going on with the _node_lookup() method.

You can see that this is using the same DB_NODE_IDX from the _node_index() method. So basically this is doing the reverse of the past few methods. This will lookup the index, then fetch the relevant log id for this.

This is enough for now, I have a good grasp of the basic underlying data structures. I’m pretty sure that now that I know how nodes works, I can quickly check how edges and properties work as well. Hopefully, next post will be able to talk about the graph operations directly, but I might run into other interesting bits first.

Tweet Share Share 0 comments

Tags:

reviews

Oren Eini

Oren Eini

CEO of RavenDB