Playing with graphs and logic systems

time to read 3 min | 554 words

imageRecently I have been playing with graphs a bit, trying to understand them in more depth. Because I learn much better by doing, I thought that I would build a toy graph query engine to see how that works. I loaded the MovieLens small data set into a set of C# classes and started playing with them.

Here is what the source data looks like:

public class Movie
{
public int MovieId;
public string Title;
public string[] Genres;
}
public class UserRating
{
public int MovieId;
public float Rating;
public DateTime Timestamp;
public List<string> Tags = new List<string>();
}
public class User
{
public int UserId;
public Dictionary<int, UserRating> Ratings = new Dictionary<int, UserRating>();
}
var users = new Dictionary<int, User>();
var movies = new Dictionary<int, Movie>();
view raw setup.cs hosted with ❤ by GitHub

I’m not dealing with typical issues, such as how to fetch the data, optimizing indexes, etc. Instead, I want to focus solely on the problem of finding patterns in the graph.

Here is a simple example of a pattern in the graph:

(userA:User)-[:Rated]->(movie:Movie)<-[:Rated]-(userB:User)

The syntax is called Cypher, which is commonly used for graph queries.

What we are trying to find here is a set of triads. User A who rated a movie that was also rated by user B. The result of this query is a list of tuples matching (userA, movie, userB).

This is really similar to the way I remember learning Prolog, so I thought about giving it a shot and solving the problem in this way.

The first thing to do is to break the query itself into independent steps:

(userA:User)-[:Rated]->(movie:Movie) AND (userB:User)-[:Rated]->(movie:Movie)

Note that in this case, the first and second queries are exactly the same, but now they are somewhat easier to reason about. We just need to do the match ups property, here is how I would write the code:

var user_and_movie = (
from user in users.Values
from rating in user.Ratings.Values
select new
{
user,
movie = movies[rating.MovieId]
}
).ToList();
var one = user_and_movie;
var two = user_and_movie;
var results = (
from userA in one
join userB in two
on userA.movie equals userB.movie
where userA.user != userB.user
select new
{
userA = userA.user.UserId,
userB = userB.user.UserId,
movie = userA.movie.MovieId
}
).ToList();
view raw simple.query.cs hosted with ❤ by GitHub

This query can take a while to run, because on the small data set (with just 100,004 recommendations and 671 users) there are over 6.2 million such connections. And yes, I used join intentionally, because it show case the interesting problem of cartesian product.

Now, these queries aren’t really interesting and they can be quite expensive. A better query would be to find the set of movies that were rated by both user 1 and user 306. This can be done as simple as changing the previous code starting location:

var one = (
from user in new[] { users[1] }
from rating in user.Ratings.Values
select new
{
user,
movie = movies[rating.MovieId]
}
).ToList();
var two = (
from user in new[] { users[306] }
from rating in user.Ratings.Values
select new
{
user,
movie = movies[rating.MovieId]
}
).ToList();
view raw specific.cs hosted with ❤ by GitHub

Again, this is a pretty simple scenario. A more complex one would be to find a list of movies a particular user has not rated that were rated by people who liked the same movies as this user. As a query, this will look roughly like this:

(userA:User)-[:Rated(Rating >= 4)]->(:Movie)<-[:Rated(Rating >= 4)]-(userB:User) AND (userB:User)-[:Rated(Rating >= 4)]->(notRatedByA:Movie) AND NOT (userA:User)-[:Rated]->(notRatedByA:Movie)

Note that this merely specify the first part, find me users that liked the same movies as userA.  The second part is a bit more complex, we want to find movies rated by the second users and exclude movies rated by the first. Let’s break it into its component parts, shall we?

Here is the code for the first clause:  (userA:User)-[:Rated(Rating >= 4)]->(:Movie)<-[:Rated(Rating >= 4)]-(userB:User)

var one = (
from user in new[] { users[1] }
from rating in user.Ratings.Values
where rating.Rating >= 4
select new
{
user,
movie = movies[rating.MovieId]
}
).ToList();
var two = (
from user in users.Values
from rating in user.Ratings.Values
where rating.Rating >= 4
select new
{
user,
movie = movies[rating.MovieId]
}
).ToList();
var clauseA = (
from userA in one
join userB in two
on userA.movie equals userB.movie
where userA.user != userB.user
select new
{
userA = userA.user,
userB = userB.user
}
).ToList();
view raw first.cs hosted with ❤ by GitHub

As you can see, the output of this code is a set of ( userA, userB ). Now, let’s go to the second one, shall we? We already have a match on userB in this case, so we can start evaluating that. Here is the next stage: (userB:User)-[:Rated(Rating >= 4)]->(notRatedByA:Movie)

var clauseB = from matches in clauseA
from rating in matches.userB.Ratings.Values
where rating.Rating >= 4
select new
{
matches.userA,
matches.userB,
notRatedByA = movies[rating.MovieId]
};
view raw clauseB.cs hosted with ❤ by GitHub

Now we have the last stage, where we need to filter things out:

var clauseC = from matches in clauseB
where matches.userA.Ratings.ContainsKey(matches.notRatedByA.MovieId)
select matches;
view raw clauseC.cs hosted with ❤ by GitHub

And now we have the final results.

For me, thinking about these kind of queries as a “fill in the blanks” makes the most sense.