Querying your way to madness: the Facebook timeline

time to read 16 min | 3048 words

Facebook certainly changed the way we are doing things. Sometimes, that ain’t always for the best, as can be seen from the way too much time humanity as a whole spend watching cat videos.

One of the ways that Facebook impacted our professional lives is that a lot of people look at that as a blue print of how they want their application to be. I am not going to touch on whatever that is a good thing or not, suffice to say that this is a well known model that is very easy for a lot of users to grasp.

It is also a pretty hard model to actually design and build. I recently had a call from a friend who was tasked with building a Facebook like timeline. Like most such tasks, we have the notion of social network, with people following other people. I assume that this isn’t actually YASNP (yet another social network project), but I didn’t check too deeply.

The question was how to actually build the timeline. The first thing that most people would try is something like this:

   1: var user = session.Load<User>(userId);
   2: var timelineItems = 
   3:    session.Query<Items>()
   4:       .Where(x=>x.Source.In(user.Following))
   5:       .OrderByDescending(x=>x.PublishedAt)
   6:       .Take(30)
   7:       .ToList();

Now, this looks good, and it would work, as long as you have small number of users and no one follows a lot of people. And as long as you don’t have  a lot of items. And as long as you don’t have to do any additional work.  When any of those assumption is broken… well, welcome to unpleasantville, population: you.

It can’t work. And I don’t care what technology you are using for storage. You can’t create a good solution using queries for something like the timeline.

Nitpicker corner:

  • If you have users that are following a LOT of people (and you will have those), you are likely to get into problems with the query.
  • The more items you have, the slower this query becomes. Since you need to sort them all before you can return results. And you are likely to have a LOT of them.
  • You can’t really shard this query nicely or easily.
  • You can’t apply additional filtering in any meaningful way.

Let us consider the following scenario. Let us assume that I care for this Rohit person. But I really don’t care for Farmville.

hide farmville ribbon

And then:

hide farmville

Now, try to imagine doing this sort of thing in a query. For fun, assume that I do care for Farmville updates from some people, but not from all. That is what I mean when I said that you want to do meaningful filtering.

You can’t do this using queries. Not in any real way.

Instead, you have to turn it around. You would do something like this:

   1: var user = session.Load<User>(userId);
   2: var timelineItmes = session.Query<TimeLineItems>() 
   3:       .Where(x=>x.ForUser == userId)
   4:       .OrderBy(x=>x.Date)
   5:       .ToList();

Note how we structure this. There is a set of TimeLineItems objects, which store a bit of information about a set of items. Usually we would have one per user per day. Something like:

  • users/123/timeline/2013-03-12
  • users/123/timeline/2013-03-13
  • users/123/timeline/2013-03-14

That means that we get well scoped values, we only need to search on a single set of items (easily sharded, with a well known id, which means that we can also just load them by id, instead of querying for them).

Of course, that means that you have to have something that builds those timeline documents. That is usually an async process that run whenever you have a user that update something. It goes something like this:

   1: public void UpdateFollowers(string itemId)
   2: {
   3:     var item = session.Include<Item>(x=>x.UserId)
   4:         .Load(itemId);
   6:     var user = session.Load<User>(item.UserId);
   8:     // user.Followers list of documents with batches of followers
   9:     // we assume that we might have a LOT, so we use this techinque
  10:     // to avoid loading them all into memory at once
  11:     // http://ayende.com/blog/96257/document-based-modeling-auctions-bids
  12:     foreach(var followersDocId in user.Followers)
  13:     {
  14:         NotifyFollowers(followersDocId, item);
  15:     }
  16: }
  18: public void NotifyFollowers(string followersDocId, Item item)
  19: {
  20:     var followers = session.Include<FollowersCollection>(x=>x.Followers)
  21:         .Load(followersDocId);
  23:     foreach(var follower in followers.Followers)
  24:     {
  25:         var user = session.Load<User>(follower);
  26:         if(user.IsMatch(item) == false)
  27:             continue;
  28:         AddToTimeLine(follower, item);
  29:     }
  30: }

As you can see, we are batching the operation, loading the followers and batched on their settings, decide whatever to let that item to be added to their timeline or not.

Note that this has a lot of implications. Different people will see this show up in their timeline in different times (but usually very close to one another). Your memory usage is kept low, because you are only processing some of it at any given point in time. For users with a LOT of followers, and there will be some like those, you might want to build special code paths, but this should be fine even at its current stage.

What about post factum operations? Let us say that I want to start following a new user? This require special treatment, you would have to read the latest timeline items from the new user to follow and start merging that with the existing timeline. Likewise when you need to delete someone. Or adding a new filter.

It is a lot more work than just changing the query, sure. But you can get things done this way. And you cannot get anywhere with the query only approach.