Sharding vs. Having multiple databases

time to read 4 min | 610 words

I was recently at a customer site, and he made a mention of sharding their database in a specific way, putting all of the user generated content on a separate server. I jumped in and told him that this isn’t sharding. And then I had to explain, both to the customer and to myself, why splitting things up in that way bothered me.

The basic idea is the same, we want to split the information over several databases to reduce the load on each individual server, and get higher capacity. We’ll use the Flickr model for this post. What the client suggested was:


And what I suggested was:


On the face of it, there isn’t much difference, right? But I really don’t like calling the first example sharding. From my perspective, this is merely splitting the data into multiple databases, it isn’t sharding. I’ll accept arguments about this still being sharding, but it doesn’t feel right from my perspective.

The main problem with the first option is that in order to actually do something interesting, you have to go to three different servers. On the other hand, by selecting a good sharding function, you can usually serve a single request completely from a single database.

That is important, because databases, whatever they are relational or RavenDB all include multiple ways for you to gather several pieces of information in an efficient manner. The common part about all of those efficient ways to gather multiple pieces of information? They all break down when you are trying to gather different pieces of information from different servers. That is leaving aside things like connection pooling and persistent connections, which are also quite important.

For example, loading an image with its user’s information and its comments would be the following three queries in the first example:

var image = imagesSession.Load<Image>("images/1");
var user = usersSession.Load<User>(image.UserId)
var comments = commentsSession.Load<ImageComments>(images.CommentsId);

Whereas with the second example, we will have:


Other things, like transactions across the data set, are also drastically simpler when you are storing all of the related data on the same server.

In short, sharding means splitting the dataset, but a good sharding function would make sure that related information is located on the same server (strong locality), since that drastically simply things. A bad sharding function (splitting by type) would create very weak locality, which will drastically impact the system performance down the road.