Building a social media platform without going bankruptPart X–Optimizing for whales

time to read 5 min | 925 words

Unless I get good feedback / questions on the other posts in the series, this is likely to be the last post on the topic. I was trying to show what kind of system and constraints you have to deal with if you wanted to build a social media platform without breaking the bank.

I talked about the expected numbers that we have for the system, and then set out to explain each part of it independently. Along the way, I was pretty careful not to mention any one particular technological solution. We are going to need:

  • Caching
  • Object storage (S3 compatible API)
  • Content Delivery Network
  • Key/value store
  • Queuing and worker infrastructure

Note that the whole thing is generic and there are very little constraints on the architecture. That is by design, because if your architecture can hit the lowest common denominator, you have a lot more freedom. Instead of tying yourself to a particular provider, you have a lot more freedom. For that matter, you can likely set things up so you can have multiple disparate providers without too much of a hassle.

My goal with this system was to be able to accept 2,500 posts per second and to handle reads of 250,000 per second. This sounds like a lot, but a most of the load is meant to be handled by CDN and the infrastructure, not the core servers. Caching in a social network is somewhat problematic, since you’ll have a lot of the work is obviously personalized. That said, there is still quite a lot that can be cached, especially the more popular posts and threads.

If we’ll assume that only about 10% of the reading load hits our servers, that is 25,000 reads per second. If we have just 25 servers for handling this (assuming five each in five separate data centers) we can accept the load at 1,000 requests per second. On the one hand, that is a lot, but on the other hand…. most of the cost is supposed to be about authorization, minor logic, etc. We can also at this point add more application servers and scale linearly.

Just to give some indication of costs, a dedicated server with 8 cores & 32 GB disk will cost 100$ a month, and there is no charge for traffic. Assuming that I’m running 25 of these, that will cost me 2,500 USD a month. I can safely double or triple that amount without much trouble, I think.

Having to deal with 1,000 requests per server is something that requires paying attention to what you are doing, but it isn’t really that hard, to be frank. RavenDB can handle more than a million queries a second, for example.

One thing that I didn’t touch on, however, which is quite important, is the notion of whales. In this case, a whale is a user that has a lot of followers. Let’s take Mr. Beat as an example, he has 15 million followers and is a prolific poster. In our current implementation, we’ll need to add to the timeline of all his followers every time that he posts something. Mrs. Bold, on the other hand, has 12 million followers. At one time Mr. Beat and Mrs. Bold got into a post fight. This looks like this:

  1. Mr. Beat: I think that Mrs. Bold has a Broccoli’s bandana.
  2. Mrs. Bold: @mrBeat How dare you, you sniveling knave
  3. Mr. Beat: @boldMr2 I dare, you green teeth monster
  4. Mrs. Bold: @mrBeat You are a yellow belly deer
  5. Mr. Beat: @boldMr2 Your momma is a dear

This incredibly witty post exchange happened during a three minute span. Let’s consider what this will do, given the architecture that we outlined so far:

  • Post #1 – written to 15 million timelines.
  • Post #2 - 5 – written to the timelines of everyone that follows both of them (mention), let’s call that 10 million.

That is 55 million timeline writes to process within the span of a few minutes. If other whales also join in (and they might) the number of writes we’ll have to process will sky rocket.

Instead, we are going to take advantage of the fact that only a small number of accounts are actually followed by many people. We’ll place the limit at 10,000 followers. At which point, we’ll no longer process writes for such accounts. Instead, we’ll place the burden at the client’s side. The code for showing the timeline will then become something like this:

In other words, we record the high profile users in the system and instead of doing the work for them on write, we’ll do that on read. The benefit of doing it in this manner is that the high profile users tiimeline reads will have very high cache utilization.

Given that the number of high profile people you’ll follow are naturally limited, that can save quite a lot of work.

The code above can be improved, of course, there are usually a lot of difference in the timeline posts, so we may have a high profile user that is off for a day or two, they shouldn’t show up in the current timeline and can be removed entirely. You need to do a bit more work around the time frames as well, which means that timeline should also allow us to query itself by most recent post id, but that is also not too hard to implement.

And with that, we are at the end. I think that I covered quite a few edge cases and interesting details, and hopefully that was interesting for you to read.

As usual, I really appreciate any and all feedback.

More posts in "Building a social media platform without going bankrupt" series:

  1. (05 Feb 2021) Part X–Optimizing for whales
  2. (04 Feb 2021) Part IX–Dealing with the past
  3. (03 Feb 2021) Part VIII–Tagging and searching
  4. (02 Feb 2021) Part VII–Counting views, replies and likes
  5. (01 Feb 2021) Part VI–Dealing with edits and deletions
  6. (29 Jan 2021) Part V–Handling the timeline
  7. (28 Jan 2021) Part IV–Caching and distribution
  8. (27 Jan 2021) Part III–Reading posts
  9. (26 Jan 2021) Part II–Accepting posts
  10. (25 Jan 2021) Part I–Laying the numbers