Recently I have been playing with graphs a bit, trying to understand them in more depth. Because I learn much better by doing, I thought that I would build a toy graph query engine to see how that works. I loaded the MovieLens small data set into a set of C# classes and started playing with them.
Here is what the source data looks like:
I’m not dealing with typical issues, such as how to fetch the data, optimizing indexes, etc. Instead, I want to focus solely on the problem of finding patterns in the graph.
Here is a simple example of a pattern in the graph:
The syntax is called Cypher, which is commonly used for graph queries.
What we are trying to find here is a set of triads. User A who rated a movie that was also rated by user B. The result of this query is a list of tuples matching (userA, movie, userB).
This is really similar to the way I remember learning Prolog, so I thought about giving it a shot and solving the problem in this way.
The first thing to do is to break the query itself into independent steps:
(userA:User)-[:Rated]->(movie:Movie) AND (userB:User)-[:Rated]->(movie:Movie)
Note that in this case, the first and second queries are exactly the same, but now they are somewhat easier to reason about. We just need to do the match ups property, here is how I would write the code:
This query can take a while to run, because on the small data set (with just 100,004 recommendations and 671 users) there are over 6.2 million such connections. And yes, I used join intentionally, because it show case the interesting problem of cartesian product.
Now, these queries aren’t really interesting and they can be quite expensive. A better query would be to find the set of movies that were rated by both user 1 and user 306. This can be done as simple as changing the previous code starting location:
Again, this is a pretty simple scenario. A more complex one would be to find a list of movies a particular user has not rated that were rated by people who liked the same movies as this user. As a query, this will look roughly like this:
(userA:User)-[:Rated(Rating >= 4)]->(:Movie)<-[:Rated(Rating >= 4)]-(userB:User) AND (userB:User)-[:Rated(Rating >= 4)]->(notRatedByA:Movie) AND NOT (userA:User)-[:Rated]->(notRatedByA:Movie)
Note that this merely specify the first part, find me users that liked the same movies as userA. The second part is a bit more complex, we want to find movies rated by the second users and exclude movies rated by the first. Let’s break it into its component parts, shall we?
Here is the code for the first clause: (userA:User)-[:Rated(Rating >= 4)]->(:Movie)<-[:Rated(Rating >= 4)]-(userB:User)
As you can see, the output of this code is a set of ( userA, userB ). Now, let’s go to the second one, shall we? We already have a match on userB in this case, so we can start evaluating that. Here is the next stage: (userB:User)-[:Rated(Rating >= 4)]->(notRatedByA:Movie)
Now we have the last stage, where we need to filter things out:
And now we have the final results.
For me, thinking about these kind of queries as a “fill in the blanks” makes the most sense.