That No SQL Thing: Scaling Graph Databases
Yesterday I talked about graph databases, outlining what they are and how they work. One of the interesting things about this series is that in many cases, I am posing a question (to myself), trying to answer it, and then going to find out what other people do.
When thinking about scaling scenarios for a graph database, I had the following scenario in mind: a graph of nodes that is spread across multiple servers, where each member of the graph may reside on any machine in the system. The following diagram demonstrates what I am thinking about; each rectangle represents a different machine in the system:
Why is this important?
A single machine solution is obviously a barrier to scaling (and safety, but that is another concern). In a graph database, having relations between the nodes is the point, and that makes sharding more complicated: unless you store the entire graph on a single machine, you are forced to query across machine boundaries. And you can't store a graph on a single machine, for the simple reason that it is unlikely that you can limit a graph to be that small. Think about the implications of Six Degrees of Separation for graph databases and it will be clear what the problem is. In real world graphs, everyone is connected to everyone.
The problem with breaking the graph across machines is that graph traversals become much more expensive. The following query, which previously ran on a single machine:
new GraphDatabaseQuery
{
    SourceNode = ayende,
    MaxDepth = 3,
    RelationsToFollow = new[] { "As Known As", "Family", "Friend", "Romantic", "Ex" },
    Where = node => node.Location == ayende.Location,
    SearchOrder = SearchOrder.BreadthFirst
}.Execute();
now needs to touch three different machines. Worse, it isn't the number of machines that impacts the cost, but the spread of the graph's nodes across the machines in the system.
After spending some time thinking about it, I came to the conclusion that I can't envision any general way to solve the problem. Oh, I can think of several ways to reduce the problem:
- Batching cross-machine queries so we only perform them at the close of each breadth-first step.
- Storing multiple levels of associations (so "users/ayende" would store not only its own relations, but also the relations of each related node, such as "users/arik"); see the sketch after this list.
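To make the second idea concrete, here is a minimal sketch of what a denormalized node record might look like. The type and property names are hypothetical, not taken from any actual graph database:

using System.Collections.Generic;

// Hypothetical record shape: along with its own relations, each node keeps a
// denormalized copy of its neighbors' relations, updated whenever a neighbor changes.
public class NodeRecord
{
    public string Id;                                   // e.g. "users/ayende"
    public Dictionary<string, List<string>> Relations;  // relation type -> related node ids
    // neighbor id -> that neighbor's own relations, duplicated locally so a
    // depth-2 traversal never has to leave this machine
    public Dictionary<string, Dictionary<string, List<string>>> NeighborRelations;
}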
The solution most likely to be successful is limiting the depth of cross-machine node searches. In many cases, that is acceptable, I think. If we put the depth limit at 3, we can still give pretty good answers in a reasonable time frame. But the only way this can be made to work is with good batching support.
The algorithm may look like:
public IEnumerable<Node> DistributedTraverse(Node sourceNode, int depth, string relationToFollow, Func<Node, bool> predicate)
{
    if(depth == 0) // feeling defensive
        yield break;

    // resolve the source node's relations on its own shard
    var related = GetRelatedNodes(sourceNode.ShardName, relationToFollow, predicate);
    foreach(var result in related)
        yield return result;

    if(depth == 1) // don't even bother asking down the line
        yield break;

    // one batched remote call per shard, instead of one call per node
    foreach(var relationsByShard in related.GroupBy(x => x.ShardName))
    {
        var shard = GetShardProxy(relationsByShard.Key);
        var results = shard.BatchedQuery(relationsByShard.ToArray(), depth - 1, relationToFollow, predicate);
        foreach(var result in results)
            yield return result;
    }
}
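For illustration, a call over this hypothetical API might look like:

// Hypothetical usage: find friends up to 3 hops away who share ayende's location
var nearby = DistributedTraverse(
    ayende,       // source node, resolved on its local shard first
    3,            // the depth limit discussed above
    "Friend",
    node => node.Location == ayende.Location);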
This gives us a maximum of (depth * number_of_machines_in_cluster) – depth remote calls: with a depth of 3 and 3 machines in the cluster, we would have a max of 6 calls.
With that theory out of our heads, let us examine how real world graph databases have tried to resolve this issue…
Neo4j (which seems to be pretty much the default graph database) doesn't currently handle this. There are some hints that they intend to offer cluster-wide replication, but nothing about design or implementation details. Neo4j does offer a write-master/read-slaves approach for scaling out, which is really nice, but even that approach is limited at some point, and in this post I am focusing on what happens when you go beyond that point.
FlockDB (which is what Twitter uses) does include, as part of its design goals, "horizontal scaling including replication". However, FlockDB isn't trying to solve the problem outlined above; indeed, graph traversal is a non-goal for it. FlockDB is about finding one level of relations very fast, more than anything else.
In summary, I believe that while you can shard a graph database, doing so places a very hard limit on the type of graph walking queries you can make. Just to give you an idea of the scale involved, Neo4j, for example, appears to be able to routinely handle billions of nodes on a single machine, so you might not need to scale beyond that.
More posts in "That No SQL Thing" series:
- (03 Jun 2010) Video
- (14 May 2010) Column (Family) Databases
- (09 May 2010) Why do I need that again?
- (07 May 2010) Scaling Graph Databases
- (06 May 2010) Graph databases
- (22 Apr 2010) Document Database Migrations
- (21 Apr 2010) Modeling Documents in a Document Database
- (20 Apr 2010) The relational modeling anti pattern in document databases
- (19 Apr 2010) Document Databases – usages
Comments
Hi Ayende,
Nice points. I think you touched on something really important in that you separate the way a graph is sharded [1] from the way it is queried [2].
[1] assumes queries will be performed as they are on an unsharded graph, but attempts to reduce the performance impact by trying to keep "related" data "close"
[2] attempts to compensate for the shortcomings of [1], for example by using the classical approach of overlapping "communication" with "computation" to mask network latency
I'm actually writing my thesis at Neo4j at the moment, the topic of which is basically what we're discussing here. If you're interested, please take a look at my recent blog about it (my blog URL & email address should be displayed with this post)
Feel free to contact me, I'd love to discuss this with you in more detail!
Regards,
Alex
Bryan,
Except that Google doesn't (I believe) do graph traversal on the graph.
As I understand it, they tend to walk one depth (find links to this url) and then weight things based on that.
Google appears to have something called Pregel, which is supposed to be a graph database, but I really have no information about how it works.
Alex,
I found your post while researching this one.
I think that with regard to graphs, we have to assume that you can't truly keep one on a single machine, not if there is any degree of interconnection between the nodes (6 degrees and all that).
Given that, there isn't really any ability to keep related data close.
Oh, you can play games with weighting the graph and trying to make decisions based on its center of gravity, but that isn't easy to decide on. You mention shard migration strategies in your post, and that might be a good approach to balance things, since it would trend toward a large percentage of the queries staying within a small number of machines. The problem is that shard migration is a pretty expensive operation; for that matter, it might very well require a single master to maintain the node_id -> shard_id mapping.
It seems simpler by far to just accept the basically random distribution of nodes across machines, and try to focus on reducing the processing time needed to handle a query.
I like the idea of overlapping communication and computation.
If we assume a set of machines holding a graph, each machine may be an active participant, rather than a passive one (just responding to queries).
We can define a query as:
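(A hypothetical message shape; the names here are illustrative, mirroring the query object from the post, with the original requester's address added so results can flow straight back.)

public class DistributedGraphQuery
{
    public string QueryId;             // lets the client correlate partial results
    public string SourceNodeId;
    public int RemainingDepth;
    public string[] RelationsToFollow;
    public string ReplyTo;             // endpoint of the original requester
}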
The processing logic in each node may be something like:
- For each link matching the query, send a copy of the query (with that node as the source and depth - 1) to the machine holding that node.
- Send results directly back to the original requester.
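A rough sketch of that handler, assuming some messaging infrastructure (SendTo, GetLocalRelations and PartialResult are stand-ins, not a real API):

public void Handle(DistributedGraphQuery query)
{
    // resolve the local relations for this step of the traversal
    var matches = GetLocalRelations(query.SourceNodeId, query.RelationsToFollow);

    // partial results go straight back to the original requester
    SendTo(query.ReplyTo, new PartialResult(query.QueryId, matches));

    if (query.RemainingDepth <= 1)
        return;

    // forward a copy of the query to whichever machine owns each related node
    foreach (var node in matches)
    {
        SendTo(node.ShardName, new DistributedGraphQuery
        {
            QueryId = query.QueryId,
            SourceNodeId = node.Id,
            RemainingDepth = query.RemainingDepth - 1,
            RelationsToFollow = query.RelationsToFollow,
            ReplyTo = query.ReplyTo,
        });
    }
}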
That way, you can parallelize most of the graph search. The client gets a set of messages describing partial results, and would need to rebuild the graph from the discrete information it got, but that would be pretty easy.
Here's a link to an abstract of the Pregel paper:
docs.google.com/.../...520graph%2520processing.pdf
Hi Ayende,
With regard to your comment about interconnections, this really depends on the domain.
Small world graphs (social networks) naturally have clusters, which may be worth exploiting when sharding.
Scale free graphs (the internet) are basically random with a number of big "hubs", which may be much more difficult to shard meaningfully.
I agree with you that shard migration has a cost, but in reality there's no avoiding it. When adding/removing machines from a cluster you also need to migrate data. An interesting question is, when does the benefit outweigh the cost?
Regarding the need for a single master, I don't think this is necessary. Designing performant distributed indexes is a challenge for sure, but it's not a new problem, and it may be possible to use existing technologies here... hopefully.
I like the distributed computation model you just described: seeing each computer as an independent actor that performs "tasks" locally and returns results (be it to the client or another server) asynchronously. It also sounds like a fun system to design :)
Of course... at the moment we're mostly discussing read scalability; we haven't considered replication, and we're basically ignoring ACID.
If we take something like Twitter as an example of such a graph, we certainly have clusters, but I am not sure the clusters are really meaningful.
As you can see here, http://www.sysomos.com/insidetwitter/sixdegrees/, about 6% of users are within 3 hops of one another. Twitter has about 100 million users, which gives us about 6,000,000 nodes within 3 hops, and 240,000 nodes within 2 hops. At that point, I am not sure how useful clusters are.
You would have them, but there are multiple dimensions for clusters.
For example, in my case, I am on the .net cluster, but I am also following a lot of people from Canada and from Israel. What cluster do I belong to?
For that matter, a query that includes @ayende in the search would pretty much have to cross clusters. That is why I think that you can't really get meaningful optimization from clustering alone; the nature of the problem doesn't lend itself well to closed, or nearly closed, clusters.
Distributed indexes aren't actually needed; you just need a DHT to store the mapping between node id & shard id. A complex implementation, for sure, but a solved problem from the point of view of theoretical complexity.
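For what it's worth, a toy version of that mapping, assuming consistent hashing so that adding or removing a machine only remaps a fraction of the node ids (a real DHT adds replication, membership, and so on):

using System;
using System.Collections.Generic;
using System.Linq;
using System.Security.Cryptography;
using System.Text;

public class ShardRing
{
    private readonly SortedDictionary<uint, string> ring = new SortedDictionary<uint, string>();

    public void AddShard(string shardName, int virtualNodes = 128)
    {
        // several positions per shard smooth out the distribution
        for (var i = 0; i < virtualNodes; i++)
            ring[Hash(shardName + "#" + i)] = shardName;
    }

    public string GetShardFor(string nodeId)
    {
        var h = Hash(nodeId);
        // first ring position at or after the hash, wrapping around to the start
        foreach (var entry in ring)
            if (entry.Key >= h)
                return entry.Value;
        return ring.First().Value;
    }

    private static uint Hash(string key)
    {
        using (var md5 = MD5.Create())
        {
            var bytes = md5.ComputeHash(Encoding.UTF8.GetBytes(key));
            return BitConverter.ToUInt32(bytes, 0);
        }
    }
}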
CAP bites you again with ACID and replication. I don't think that you can really define ACID semantics over a big graph. There isn't a consistent view, since it is always updating, and the only way to really take locks is to lock the entire graph, which isn't feasible.
I would defer replication worries until we can actually figure out how to handle things without any errors :-)
Isn't FlockDB based on MySQL and purely relational? (i.o.w. not really NoSQL-based?)
Your basic approach can work with bounded latency. In your example you make N sequential queries (each of bounded latency).
One interesting thing to note is that some "complex" graph queries are easily distributed and don't require traversal, such as the cosine algorithms for "similarity" and "recommendations". For example, to find nodes similar to node A:
Find all nodes N with edges into A; aggregate them by partition; for each partition, find all nodes that its subset Np points out to, aggregated by count. Then add the results from each partition together.
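If I read that right, a rough sketch of the per-partition fan-out (IPartition and its lookups are assumed abstractions, not FlockDB's actual API):

using System.Collections.Generic;

public interface IPartition
{
    IEnumerable<string> GetIncomingNodes(string nodeId);
    IEnumerable<string> GetOutgoingNodes(string nodeId);
}

public Dictionary<string, int> SimilarTo(string nodeA, IEnumerable<IPartition> partitions)
{
    var counts = new Dictionary<string, int>();
    foreach (var partition in partitions) // each iteration could run in parallel
    {
        // step 1: nodes on this partition with an edge into A
        var incoming = partition.GetIncomingNodes(nodeA);

        // step 2: everything those nodes point out to, counted locally
        foreach (var n in incoming)
        {
            foreach (var candidate in partition.GetOutgoingNodes(n))
            {
                int current;
                counts.TryGetValue(candidate, out current);
                counts[candidate] = current + 1; // summed across partitions as we go
            }
        }
    }
    // nodes sharing many in-neighbors with A score highest; normalizing these
    // counts by degree gives the cosine-style similarity mentioned above
    return counts;
}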
The other thing to note is that if you use a random partitioning scheme (effectively what FlockDB does) you can still do approximations of arbitrary depth traversals. Each partition has edges (not nodes) randomly removed. So in a system with 15 partitions, each partition is like the full graph with 14/15ths of the edges randomly removed. A lot of fancy graph algorithms work by randomly removing edges and then doing random walks and such.
Of course, they randomly remove edges and still try to preserve the connectivity of the graph. In this case, we have no such preservation (or if we do, it's an accident). But I have a hypothesis that if you run arbitrary depth traversals across all the partitions, and you somehow "merge" the results together, you can approximate a lot of random walk-style algorithms, with managed latency.
Also, not to talk smack, but the statement that Neo4j can handle "billions" of nodes should not be stated without qualification. Can all of these nodes be stored in memory? If not, with what locality are they stored on disk?
FlockDB can store quadzillions of edges per node, but the performance would be atrocious.
RE: if(depth == 0) // feeling defensive
If you were really feeling defensive it'd be: if(depth <= 0)
Frans,
FlockDB is a NoSQL database that uses MySQL as a storage engine.
Although I agree with you that it might be better to call it a graph service rather than a graph database.
Nick,
Regarding the billions comment for Neo4j, I don't know; I am basing this on their documentation.
Another way I see to at least reduce the problem is to have one graph database that only stores relations, and use another (e.g. document) database to store the other information (username, etc.).
This way you can heavily reduce the space requirements and hence store more relations on a single server.
Of course, now you need to query a separate server to get the actual data you are interested in (anything other than the id), but on average it should be more efficient.
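A sketch of what that split might look like in application code (graphDb, documentDb and UserDocument are assumed abstractions, not any particular product's API):

// the graph store holds only ids and edges; the document store holds the data
public IEnumerable<UserDocument> FriendsNearby(string userId, string location)
{
    // cheap, id-only traversal on the smaller, relation-only graph store
    IEnumerable<string> friendIds = graphDb.Traverse(userId, "Friend", maxDepth: 3);

    // one batched load from the document store for the full records
    return documentDb.Load<UserDocument>(friendIds)
                     .Where(doc => doc.Location == location);
}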
That being said I don't have any experience in regards to graph databases across multiple machines, so I may have missed something ;-)