Reading the NSA’s codebase: LemonGraph review–Part III - Figuring out queries
After figuring out how LemonGraph stores data on disk, my next task is to understand how queries are handled. Here are some example queries:
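Roughly speaking, LGQL patterns look something like the following. The patterns and the txn.query() call here are illustrative of the general shape of the query language, not snippets taken verbatim from LemonGraph:

```python
# Illustrative LGQL-style patterns (approximate syntax, for orientation only):
queries = [
    'n()',                           # all nodes
    'n(type="foo")',                 # all nodes of a given type
    'n(type="foo",value="bar")',     # a specific node
    'n()->e(type="knows")->n()',     # a node -> edge -> node chain
]

# Roughly how such a pattern would be run against a transaction
# (the txn.query() usage is a sketch, not a verified signature):
# for chain in txn.query(queries[1]):
#     print(chain)
```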
A query starts in the Python Query class, where it is parsed by the MatchLGQL class. I scanned this code briefly; it basically parses the query into an in-memory representation. This kind of code is typically ugly, and that holds true here as well. Python’s dynamic nature also means that there isn’t an easy-to-follow set of AST classes. I’ll skip the details of query parsing, then, and just look at how the queries are actually handled.
I started to look into the query execution and got lost in details that I didn’t understand. In particular, there is a very tight coupling between the query parsing and the execution, more so than I expected. What brought this home was this piece of code, which is used to rank the most efficient way to start executing the query.
At this point, I think that the idea is that when you start running a query, you want to start from the smallest set of seed nodes. The ranking seems to be a nice way to go about doing just that, but I’m not really sure yet how this is applied.
The ranking is later used in the Query.finalize() method to figure out what the best starting location is.
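To make the idea concrete, here is a toy version of what I think is going on: rank each clause of the query by how selective its seed lookup would be, then start execution from the best-ranked one. The rank values and clause shapes here are my own illustration, not LemonGraph’s actual scoring:

```python
# Toy ranking: the more specific a clause, the cheaper the seed set it yields.
# Rank values and clause shapes are illustrative, not LemonGraph's own.
def rank(clause):
    if 'type' in clause and 'value' in clause:
        return 0   # exact (type, value) lookup - smallest possible seed set
    if 'type' in clause:
        return 4   # scan one type's range in the index
    return 6       # full scan of all nodes/edges - the worst case

clauses = [{'type': 'foo'}, {'type': 'foo', 'value': 'bar'}, {}]
best = min(range(len(clauses)), key=lambda i: rank(clauses[i]))
print(best)  # -> 1: the exact lookup is the cheapest place to start from
```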
This all came together for me inside the _adhoc() method on Query, where the query is actually run:
The self.compiled field is the set of already-parsed matches. For each of them, we create a context (which will track the already-visited nodes / edges) and start by finding the seeds of the query. Seeds are handled using… an interesting approach:
It took me a few reads to get what this code is trying to do, and I still think that this is an obnoxious way to write things. This basically does the other side of the ranking: it uses the ranking to decide which method to call. I think that an enum would be about three times more readable. Especially since a bit lower you have:
I have to say that the modulus usage is a sign of true genius. Or madness. I’m not sure which, because I can’t figure out the history of this code. It has been this way since the initial commit, but I think that this code has history from before the public release, and there might be a reason for this kind of code. But I don’t like it, and I think it would have been much easier to read if it wasn’t using magic numbers all over the place.
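For contrast, here is roughly what I mean by “an enum would be more readable”: dispatch on named values instead of magic numbers and modulus tricks. This is my sketch of an alternative, not code from LemonGraph, and the rank names and the txn.nodes() arguments are placeholders:

```python
from enum import IntEnum

# Sketch of the enum-based dispatch I would have preferred.
# Names and rank values are illustrative only.
class SeedRank(IntEnum):
    EXACT_NODE = 0      # lookup by (type, value)
    NODES_BY_TYPE = 4   # scan the node index by type
    ALL_NODES = 6       # scan everything

def seeds(txn, rank, spec):
    rank = SeedRank(rank)            # raises ValueError on an unknown rank
    if rank == SeedRank.EXACT_NODE:
        return txn.nodes(type=spec['type'], value=spec['value'])
    if rank == SeedRank.NODES_BY_TYPE:
        return txn.nodes(type=spec['type'])
    return txn.nodes()
```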
At any rate, let’s assume that we have the simplest query, for all the nodes. This would send us to the txn.nodes() method. This would be rank 6, by the way. Here is what this looks like:
As you can see, we have two modes here. If the rank was 6, we aren’t passed a type. But if the rank was 4, we get a type. I’m going to start with the search by type, which seems more interesting.
Here is where we end up in the C code:
Yes, the graph_nodes_type() method is calling the _graph_nodes_edges_type() method. That was confusing to me as well. The key here is the DB_NODE_IDX index, which tells it to use a different tree for the lookups.
The graph_string_lookup() call is something that we have already run into: this is the __blob_resolve() method, which searches for the string id of the given type. The code starts to get interesting when we see the graph_iter_new() call:
So we specify an iterator on the given prefix. From the previous post, you might recall that DB_NODE_IDX is specified as (type, val –> id), so this does a search for the first item that matches the prefix. The _cleanse_beforeID() method will ensure that the beforeID is only valid if it represents a value between 1 and the max log id that was generated on the graph.
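Conceptually, this is just a prefix scan over a sorted key space plus a sanity check on beforeID. Here is a small Python model of those two pieces; the key encoding and names are my simplification, not a translation of the C code:

```python
from bisect import bisect_left

# Keys in DB_NODE_IDX sort as (encoded type, encoded val, encoded id), so a
# lookup by type is just a prefix scan over the sorted keys. Simplified model.
def cleanse_beforeID(before_id, max_log_id):
    # beforeID only makes sense if it falls inside the log that actually exists
    return before_id if before_id and 1 <= before_id <= max_log_id else 0

def prefix_scan(sorted_keys, prefix):
    # position the cursor at the first key >= prefix, then walk forward while
    # the prefix still matches - which is what an LMDB range lookup plus a
    # per-step prefix check gives you
    i = bisect_left(sorted_keys, prefix)
    while i < len(sorted_keys) and sorted_keys[i].startswith(prefix):
        yield sorted_keys[i]
        i += 1

keys = [b'\x01foo\x01bar\x05', b'\x01foo\x01baz\x07', b'\x01zzz\x01qux\x09']
print(list(prefix_scan(keys, b'\x01foo')))  # both 'foo' entries
```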
The iterator we got from the nodes() method just implements the Python iteration interface, starting from the graph_iter_new() item, then calling graph_iter_next() until the end. This is implemented like so:
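The shape of the wrapper, roughly (this is my own sketch with placeholder names, not the actual cffi binding code), is to hold the low-level iterator handle and expose __iter__ / __next__ over it, threading beforeID through to every lookup:

```python
# Sketch of the iterator wrapper shape; placeholder names, not the real bindings.
class GraphIterator:
    def __init__(self, handle, next_id_func, before_id=0):
        self._handle = handle         # stands in for the C graph_iter_t handle
        self._next_id = next_id_func  # stands in for graph_iter_next()
        self._before_id = before_id

    def __iter__(self):
        return self

    def __next__(self):
        entry = self._next_id(self._handle, self._before_id)
        if entry is None:             # the low-level side signals "no more entries"
            raise StopIteration
        return entry

# Usage with a toy stand-in for the C side:
items = iter([b'node-1', b'node-2'])
it = GraphIterator(None, lambda handle, before_id: next(items, None))
print(list(it))  # [b'node-1', b'node-2']
```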
Here we see for the first time the actual usage of beforeID. I’m not sure what it does yet, so we’ll skip it for now, look at the _iter_idx_nextID() method, and come back to it later. This method is quite interesting. We have the head of the iterator, which is set to true in the iterator’s init. I’m not sure what this means yet. What seems interesting is the _blarf() method, which I understand to be a cry of frustration (I had to look it up).
I’m not sure what the next pointer is doing there at all. We’ll look at that after I inspect (with some measure of dread) the _blarf() function.
To start with, I love the use of goto instead of return statements. I understand that this may be a coding convention here, and that it is used to clearly mark where resources are supposed to be disposed, but still…
The iter_next_key() call ends up moving the cursor (and validating that the prefix still matches). The _parse_idx_logID() call is here:
And this is just a fancy way of saying, gimme the last value in this buffer.
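If I model the key as a sequence of length-prefixed big-endian integers (my assumption, based on the compressed, sortable integer encoding covered later in this series; the real layout may differ in detail), “the last value in this buffer” is just “decode forward and keep the final integer”, which is the logID sitting at the end of the (type, val, id) key:

```python
# Toy version of "give me the last value in this buffer", assuming each value
# is stored as one length byte followed by that many big-endian payload bytes.
# This encoding is my assumption, not necessarily LemonGraph's exact layout.
def decode_all(buf):
    values, i = [], 0
    while i < len(buf):
        n = buf[i]                                         # length prefix
        values.append(int.from_bytes(buf[i + 1:i + 1 + n], 'big'))
        i += 1 + n
    return values

def last_value(buf):
    return decode_all(buf)[-1]

key = bytes([1, 7]) + bytes([1, 3]) + bytes([2, 1, 44])    # (7, 3, 300)
print(last_value(key))  # 300 - the value at the end of the key
```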
To understand what is going on, let’s step back a bit. We are actually scanning the DB_NODE_IDX index, whose keys are (type, val, id). Since the iteration is done in sorted order, you can iterate over all the entries that match the type you want. The use of beforeID for filtering here, however, is interesting. I wonder how common historical queries like this are, because if you need to skip over a lot of items, this will result in an O(N) operation in which you scan and discard a lot of data.
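In other words, the historical filter is applied while scanning, something like the sketch below (conceptual only; in particular, whether the cutoff comparison is < or <= is my guess):

```python
# Conceptual sketch of beforeID filtering: walk the index in order and drop
# every entry whose logID is not older than the requested point in time.
# If most entries are newer than beforeID, you still pay to visit all of them.
def nodes_of_type_before(index_entries, wanted_type, before_id=None):
    for entry_type, value, log_id in index_entries:
        if entry_type != wanted_type:
            continue
        if before_id and log_id >= before_id:
            continue   # created at or after the cutoff - scanned, then discarded
        yield value, log_id

entries = [('foo', 'a', 3), ('foo', 'b', 9), ('foo', 'c', 12)]
print(list(nodes_of_type_before(entries, 'foo', before_id=10)))
# -> [('a', 3), ('b', 9)] - but ('c', 12) was still visited and thrown away
```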
Anyway, there doesn’t seem to be any use of the graph_iter_t.next in this code path, so I’ll leave it for now. The iteration over all the nodes is done with the exact same method, but without specifying any prefix, which means that it matches everything.
I have to admit that at this point, I don’t really get how it processes the more complex queries. I had to give up on the “let’s not run the code” approach and try a few things out. Surprisingly, on WSL, the code just causes a segmentation fault. I tracked it down to something in cffi, but I didn’t care that much. I created a simple Ubuntu machine and played with it for a bit. So far, we have just looked at how we get the first seeds of the query. A lot of the smarts seem to be hidden in the MatchCTX class. In particular, the matches() method seems to be doing a lot of magic.
I’m going to concentrate on that in my next post.
More posts in "Reading the NSA’s codebase" series:
- (13 Aug 2018) LemonGraph review–Part VII–Summary
- (10 Aug 2018) LemonGraph review–Part VI–Executing queries
- (09 Aug 2018) LemonGraph review–Part V–Query parsing
- (08 Aug 2018) LemonGraph review–Part IV–Compressed, sortable integers
- (07 Aug 2018) LemonGraph review–Part III - Figuring out queries
- (06 Aug 2018) LemonGraph review–Part II - Storing edges and properties
- (03 Aug 2018) LemonGraph review–Part I - Storing Nodes