Building a social media platform without going bankrupt: Part X–Optimizing for whales
Unless I get good feedback / questions on the other posts in the series, this is likely to be the last post on the topic. I was trying to show what kind of system and constraints you would have to deal with if you wanted to build a social media platform without breaking the bank.
I talked about the expected numbers that we have for the system, and then set out to explain each part of it independently. Along the way, I was pretty careful not to mention any one particular technological solution. We are going to need:
- Caching
- Object storage (S3 compatible API)
- Content Delivery Network
- Key/value store
- Queuing and worker infrastructure
Note that the whole thing is generic and there are very few constraints on the architecture. That is by design, because if your architecture can hit the lowest common denominator, you have a lot more freedom: instead of tying yourself to a particular provider, you can likely set things up so you can use multiple disparate providers without too much of a hassle.
My goal with this system was to be able to accept 2,500 posts per second and to handle reads of 250,000 per second. This sounds like a lot, but most of the load is meant to be handled by the CDN and the infrastructure, not the core servers. Caching in a social network is somewhat problematic, since a lot of the work is obviously personalized. That said, there is still quite a lot that can be cached, especially the more popular posts and threads.
If we assume that only about 10% of the read load hits our servers, that is 25,000 reads per second. With just 25 servers handling this (five each in five separate data centers), each server needs to accept about 1,000 requests per second. On the one hand, that is a lot; on the other hand, most of the cost is supposed to be about authorization, minor logic, etc. We can also add more application servers at this point and scale linearly.
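Spelled out as a quick back-of-the-envelope calculation (just a sketch of the arithmetic above):

```python
# Back-of-the-envelope math for the read path, using the numbers above.
total_reads_per_sec = 250_000      # overall read load
cdn_offload = 0.90                 # assume ~90% is served by the CDN / caches
data_centers = 5
servers_per_dc = 5

reads_on_servers = total_reads_per_sec * (1 - cdn_offload)       # 25,000
per_server = reads_on_servers / (data_centers * servers_per_dc)  # 1,000
print(f"{per_server:,.0f} requests/sec per server")
```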
Just to give some indication of costs, a dedicated server with 8 cores & 32 GB RAM will cost $100 a month, with no charge for traffic. Assuming that I'm running 25 of these, that will cost me $2,500 a month. I can safely double or triple that amount without much trouble, I think.
Having to deal with 1,000 requests per second on each server is something that requires paying attention to what you are doing, but to be frank, it isn't really that hard. RavenDB can handle more than a million queries a second, for example.
One thing that I didn't touch on, however, which is quite important, is the notion of whales. In this case, a whale is a user that has a lot of followers. Let's take Mr. Beat as an example: he has 15 million followers and is a prolific poster. In our current implementation, we'll need to add to the timeline of every one of his followers each time he posts something. Mrs. Bold, on the other hand, has 12 million followers. At one point, Mr. Beat and Mrs. Bold got into a post fight. It looks like this:
- Mr. Beat: I think that Mrs. Bold has a Broccoli’s bandana.
- Mrs. Bold: @mrBeat How dare you, you sniveling knave
- Mr. Beat: @boldMr2 I dare, you green teeth monster
- Mrs. Bold: @mrBeat You are a yellow belly deer
- Mr. Beat: @boldMr2 Your momma is a dear
This incredibly witty exchange happened during a three-minute span. Let's consider what it will do, given the architecture that we outlined so far:
- Post #1 – written to 15 million timelines.
- Posts #2–5 – written to the timelines of everyone that follows both of them (because of the mentions), let's call that 10 million each.
That comes to 55 million timeline writes (15 million for the first post, plus four replies at roughly 10 million each) to process within the span of a few minutes. If other whales also join in (and they might), the number of writes we'll have to process will skyrocket.
Instead, we are going to take advantage of the fact that only a small number of accounts are actually followed by many people. We'll place the limit at 10,000 followers; past that point, we'll no longer process timeline writes for such accounts. Instead, we'll place the burden on the client side. The code for showing the timeline will then become something like this:
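A rough sketch of the shape of that code (all helper names here, such as get_precomputed_timeline, get_followed_whales and get_recent_posts, are placeholders, not a real API):

```python
# Sketch of the read path: merge the precomputed timeline with the
# recent posts of the high-profile accounts this user follows.
def show_timeline(user_id, since, page_size=50):
    # Posts that were fanned out to this user's timeline on write.
    posts = get_precomputed_timeline(user_id, since, page_size)

    # Accounts over the 10,000 follower limit are not fanned out on
    # write; pull their recent posts directly. These queries hit the
    # same hot data for many users, so they cache extremely well.
    for whale_id in get_followed_whales(user_id):
        posts.extend(get_recent_posts(whale_id, since))

    # Merge both sources, newest first, and trim to a single page.
    posts.sort(key=lambda post: post.timestamp, reverse=True)
    return posts[:page_size]
```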
In other words, we record the high profile users in the system and instead of doing the work for them on write, we do it on read. The benefit of doing it in this manner is that the high profile users' timeline reads will have very high cache utilization.
Given that the number of high profile people you'll follow is naturally limited, that can save quite a lot of work.
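On the write side, the change is just a guard in the fan-out worker. A minimal sketch, where the threshold constant and the helpers are again assumptions:

```python
FOLLOWER_FANOUT_LIMIT = 10_000  # the limit discussed above

def fan_out(post):
    # Skip the per-follower timeline writes for high-profile accounts;
    # readers will query those accounts directly instead.
    if get_follower_count(post.author_id) > FOLLOWER_FANOUT_LIMIT:
        mark_as_high_profile(post.author_id)
        return
    for follower_id in get_followers(post.author_id):
        append_to_timeline(follower_id, post.id)
```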
The code above can be improved, of course. Posting frequency varies a lot between accounts, so we may have a high profile user who has been quiet for a day or two; they shouldn't show up in the current timeline query and can be skipped entirely. You also need to do a bit more work around time frames, which means the timeline should allow querying by most recent post id, but that isn't too hard to implement either.
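One possible shape for that refinement, assuming time-sortable post ids (which this sketch takes for granted) and a hypothetical get_last_post_id helper:

```python
def whales_to_query(user_id, oldest_post_id_on_page):
    # Skip any high-profile account whose latest post is older than the
    # oldest post on the timeline page we are about to render.
    return [whale_id
            for whale_id in get_followed_whales(user_id)
            if get_last_post_id(whale_id) >= oldest_post_id_on_page]
```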
And with that, we are at the end. I think that I covered quite a few edge cases and interesting details, and hopefully that was interesting for you to read.
As usual, I really appreciate any and all feedback.
More posts in "Building a social media platform without going bankrupt" series:
- (05 Feb 2021) Part X–Optimizing for whales
- (04 Feb 2021) Part IX–Dealing with the past
- (03 Feb 2021) Part VIII–Tagging and searching
- (02 Feb 2021) Part VII–Counting views, replies and likes
- (01 Feb 2021) Part VI–Dealing with edits and deletions
- (29 Jan 2021) Part V–Handling the timeline
- (28 Jan 2021) Part IV–Caching and distribution
- (27 Jan 2021) Part III–Reading posts
- (26 Jan 2021) Part II–Accepting posts
- (25 Jan 2021) Part I–Laying the numbers
Comments
What order of magnitude should I have in mind when considering: "Given that the number of high profile people you'll follow is naturally limited, that can save quite a lot of work."
Because if you define a whale as >10k followers, then I think (from what I see around me) that many people follow 10 to 25 or even more whales.
And that is just for frequently posting whales (say, once per day). If I look at infrequently posting whales (say, once per week/month), then people can easily follow 50+ of them. That can still grow into a shit-storm of posts when something really happens and everybody goes online at once.
I mean, in your example, your little discussion with 55 million timeline writes (at 250k per second) would require over 3 minutes to reach everybody. But then the shit-storm really just starts: every small whale who was following them will repost the news for their followers, who will then repost it to their friends and families. Which could go to a very much higher number than 55 million.
So I would also think that, just for the whales, you need some sort of deduplication method, so that an average user only gets the original post and maybe one repost, but not the original post and then 20 reposts.
C.W,
Consider the fact that you'll likely have a limit on the number of people you can follow. Twitter sets this limit at 5,000. See this post for more research: https://askwonder.com/research/social-influencers-10k-100k-followers-world-us-jzrt5l5im
It puts accounts with more than 10K followers at 0.1% of the population at large.
Assuming 50 million users, that means a max of 50K whales. You are not likely to be following all of them, and it would be possible to cache their data near you, so you can query "all my influencers" separately from your timeline, then merge the results on the client.
I don't think that the term "small whale" is that meaningful. The long tail is really deep here. But yes, you'll have repercussions, sure, but they will smooth out over time. That is why we employ a queue.
Some thoughts on edge-cases for whales:
I see a possibility for duplicate posts on users' timelines when somebody they follow has just become a whale - posts from when the whale had <10k followers would have been written to users' timelines directly, and then they would also appear in this second 'whale query' the moment the whale reached 10k.
On the other hand, there's a case of 'missing posts' when somebody has just been 'degraded' and is no longer a whale. Say MrHappyBroccoli used to be a very respected user with 50k followers, but since broccolis are passé, his userbase is in decline and quickly drops to a mere 9,998 followers. Posts from when he was a 'whale' weren't propagated to users' timelines directly, and now that he's not a whale they wouldn't be queried on the client side either - you would only see new posts.
Probably solvable with some background process on the whale/non-whale switch, but then, if it's a 'costly' process, somebody could abuse it by following and unfollowing an account with 9,999 followers. You know, just because they like to see social network platforms burn :)
Adrian,
The client already has to handle the scenario where it has duplicate posts in the timeline, so I don't expect that to be an issue.
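For illustration, client-side deduplication can be as simple as filtering by post id (a trivial sketch, nothing more):

```python
def dedup(posts):
    # Keep the first occurrence of each post id, preserving order.
    seen = set()
    unique = []
    for post in posts:
        if post.id not in seen:
            seen.add(post.id)
            unique.append(post)
    return unique
```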
As for old posts in the timeline, they aren't actually all that useful. In fact, it is pretty common to have a hard limit on how far back you can go in your own timeline.
It isn't actually meaningful to say: "Show me my timeline from a year ago", usually. You'll go to the timeline of a thread or a user, but you won't scroll that far back in the timeline view that you see. So I don't think this is much of an issue here.