This is pure “scratch an itch” design post, since I started thinking about the topic, I might as well put my thoughts in a structured format. I am going to use twitter for this example, and assume that my end goal is to be able to do graph traversal, not just finding neighbors. Applying this to other social network scenarios should be pretty easy.
I generated this graph using this tool, this works using mentions, rather than following / followers graph, though. It will server to give you an idea what I am talking about.
One of the fundamentals of computer science is: localizing data leads to greater read performance. This is true whatever we are talking about keeping the data in the CPU L1 cache or in a distributed networked system. Typically, part of a sharding strategy is to keep all the data related to a root in a single location. The problem is, of course, that graphs don’t really have roots. And in most social network graphs, there is no such thing as a closed graph. There are, on average, ~7 million users within three hops from any twitter user. Now, it would be very easy to put 7 millions users to a single machine, except, as they say, they aren’t the same 7 million users.
Given that, I think that I can come up with an approach to allow more efficient queries and higher localization in the graph. The model assume an open and dynamic model (indeed, it relies on that).
We starts with geographical distribution. When we create a new user, we will place it in a shard dedicate to the geographical location the user is located on. This is a piece of data that we can get cheaply, and it has the advantage that users that interact with their circle of physical friends would tend to be clustered together anyway.
Next, we start assigning weights to associations. We only take into account outgoing associations (which solve the problem with outliers for incoming associations such as @aplsuk), but with a small twist, the weight of each outgoing association is taken as a portion of the total number of outgoing associations. In other words, the value of an outgoing association when you are following 10 friends is 0.1, but when you are following 500 friends, the value of each association is 0.002.
Next, we place some value on each mention that the user twits. A mention indicate that the association is active. For that matter, we probably need to create silent associations if a user keep mentioning someone that they are not following. For now, we will say that this value is 1% of the value of an association. That means that if I am following 100 users and I mentioned another user a hundred time, the value of the association is 0.02.
How does gravity comes into play here? Well, each outgoing association exact a pull on a node. But moving a node between shards is expensive, so we give shards an escape velocity. When we check if we need to re-shard a node, we aggregate the pulls of all the associations per node. Only if one shard pull is higher than the current shard pull + escape velocity will the node be shifted to the new shard.
Over time, this will tend to migrate most users close to the users that they are actively associated with. With that in mind, we can now move of to queries.
As I mentioned, I am interested more in this for graph traversal, and the good thing about this approach is that for the most part, most of the relevant information is located on the same shard. When the times comes to perform a query, we can assert that queries, too, need to have an escape velocity to cross a shard boundary. Unless there are enough outgoing connections to a given shard to overcome the escape velocity, outgoing connections to that shard are ignored. We can limit the cost of remote calls further if we increase the escape velocity for the query as the query search deeper. In other words, the escape velocity at depth = 0 would be 10, at depth = 1 it would be 100, at depth = 2 it would be 1000, etc.
While this represent a loss in accuracy of the results, it also mean that for the most case, results will tend to be more relevant.
Is this important if I don’t care for graph traversal?
There is another issue to consider, quite aside from graph traversal. The gravity approach outlined will tend to higher localization, and most operations are local. Consider writing a status update to all the people interested in that, when you have good locality, the cost of such on operation grows down drastically, in fact, I would say, it is highly likely that many such operations could be completed completely locally.