I touched briefly on the issue of post statistics in a previous post, but it deserves its own post. There are all sorts of metrics that we want to track on a post; views, replies and likes are just a few of them.
Unlike most of the items that we have discussed so far, these details are going to be very relevant for both reads and writes. In particular, it is very common for these numbers to be updated concurrently, especially on popular posts. At the simplest level, these can be represented as a map<key, int64>. That gives us maximum flexibility for our needs and can also be used in the future for additional use cases.
Given that this is effectively a distributed counter problem, there are all sorts of ways that we can handle this. At the client level, we send the increment operation to the server and manually update the displayed value. That gets us 90% of the way there in terms of UX, but there is a lot to handle behind the scenes.
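To make the client-side part concrete, here is a minimal sketch of sending the increment and updating the local value. The `PostStatsView` class and the `send` callback are hypothetical names used for illustration only; the actual transport to the server is not specified here.

```python
from typing import Callable

# Hypothetical transport: send(post_id, counter_name, delta) delivers the
# increment to the server. Its real shape is an assumption of this sketch.
SendIncrement = Callable[[str, str, int], None]


class PostStatsView:
    """Client-side copy of a post's counters (a map of name -> int)."""

    def __init__(self, post_id: str, counters: dict[str, int], send: SendIncrement):
        self.post_id = post_id
        self.counters = dict(counters)  # the counters as last read from the server
        self._send = send

    def increment(self, name: str, delta: int = 1) -> int:
        # Ship the operation, not the new total, so concurrent clients
        # do not overwrite each other's updates.
        self._send(self.post_id, name, delta)
        # Update the local copy so the UI reflects the action immediately.
        self.counters[name] = self.counters.get(name, 0) + delta
        return self.counters[name]
```

The key design choice is that the client sends a delta rather than an absolute value, which leaves the server free to reconcile concurrent updates however it likes.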
The likes and replies objects have a property for each node that increments a value. Each property holds the value that we have for that node as well as the etag for this change. It is easy to merge such a model between different versions, because for each node we can always take the value with the higher etag to get the latest value. In this way, we can allow concurrent and distributed updates across the entire system, and it will eventually converge to the right value. Another option may be to push the commands all the way to the owning data center, where we’ll apply the operations, but that may put a high load on hot posts in the system. Better to distribute this globally and not really concern ourselves with the matter.
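The merge rule described above can be sketched roughly as follows. This is an illustration under assumptions, not the actual storage format: the `NodeCount` type, the integer etags and the node names used with it are all hypothetical, and the sketch assumes that only a node itself ever modifies its own entry.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeCount:
    value: int  # running total of increments accepted by this node
    etag: int   # version of that total; assumed to grow with every change on the node


# One counter (for example "likes") is a map of node id -> NodeCount.
Counter = dict[str, NodeCount]


def increment(counter: Counter, node: str, delta: int = 1) -> Counter:
    """Record an increment accepted by `node`; only that node's entry changes."""
    current = counter.get(node, NodeCount(value=0, etag=0))
    updated = dict(counter)
    updated[node] = NodeCount(current.value + delta, current.etag + 1)
    return updated


def merge(a: Counter, b: Counter) -> Counter:
    """Merge two versions by keeping, per node, the entry with the higher etag."""
    merged = dict(a)
    for node, entry in b.items():
        if node not in merged or entry.etag > merged[node].etag:
            merged[node] = entry
    return merged


def total(counter: Counter) -> int:
    """The number shown to users is the sum over all nodes."""
    return sum(entry.value for entry in counter.values())
```

Because each node only ever touches its own entry, the merge gives the same result regardless of the order in which versions arrive, which is what lets the updates flow through the system without coordination.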
Looking at Twitter, there are about 200 billion tweets a year. That means that we have to be ready for quite a few of those values. Having that in a dedicated system is a good idea, since it has a far different read and write skew than other parts of our system. As part of reading posts, however, we’ll likely want to build some mechanism for pushing those counters to the post itself so we can remove them from the rest of the system. An easy way to handle that is to do so on an hourly basis. So instead of the format above, we’ll have:
Here we have the last two hours of update operations on the post. Once every hour we’ll consolidate all the updates from two hours ago and write them to the post itself. When we get to the point where we have no more pending updates for the post, we can safely delete the value.
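The hourly consolidation could look something like the sketch below. The layout is an assumption made for illustration (post id, then hour bucket, then metric name mapped to a delta), as are the `record` and `consolidate` helpers; the real document format may differ.

```python
# Hypothetical layout: post id -> hour bucket (hours since epoch) -> metric -> delta.
Pending = dict[str, dict[int, dict[str, int]]]
# The posts themselves, reduced here to post id -> metric -> consolidated total.
Posts = dict[str, dict[str, int]]


def record(pending: Pending, post_id: str, hour: int, metric: str, delta: int = 1) -> None:
    """Add an update to the bucket for the hour in which it happened."""
    bucket = pending.setdefault(post_id, {}).setdefault(hour, {})
    bucket[metric] = bucket.get(metric, 0) + delta


def consolidate(pending: Pending, posts: Posts, current_hour: int) -> None:
    """Fold buckets that are at least two hours old into the post, then drop them."""
    cutoff = current_hour - 2
    for post_id in list(pending):
        buckets = pending[post_id]
        for hour in [h for h in buckets if h <= cutoff]:
            stats = posts.setdefault(post_id, {})
            for metric, delta in buckets.pop(hour).items():
                stats[metric] = stats.get(metric, 0) + delta
        # No buckets left means no pending updates, so the entry can go.
        if not buckets:
            del pending[post_id]
```

Running `consolidate` once an hour keeps only recent buckets in the counter store, while the post carries everything older.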
The reason you want to add this complexity is that there is a big difference between all the posts on a social media platform and the active working set. The working set tends to be far smaller, which can dramatically reduce the amount of data we need to keep and manage. Assuming that the working set is at 25 million posts or so across the network seems reasonable, and that amount of data can be easily handled by any server instance you care to use. Managing 200 billion posts per year, on the other hand, puts us in a different class of problem, and we’ll need more and more resources down the line.
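A quick back-of-the-envelope calculation shows the gap between the two. The 200 bytes per post used for the counter record is purely an assumption for illustration; only the post counts come from the text above.

```python
BYTES_PER_POST = 200  # assumed size of one post's counter record; not a measured figure

WORKING_SET_POSTS = 25_000_000
POSTS_PER_YEAR = 200_000_000_000


def storage_bytes(posts: int, bytes_per_post: int = BYTES_PER_POST) -> int:
    """Raw storage needed for the counters of `posts` posts."""
    return posts * bytes_per_post


working_set = storage_bytes(WORKING_SET_POSTS)
yearly = storage_bytes(POSTS_PER_YEAR)

print(f"working set: {working_set / 1e9:.0f} GB")        # working set: 5 GB
print(f"one year of posts: {yearly / 1e12:.0f} TB")      # one year of posts: 40 TB
print(f"ratio: {POSTS_PER_YEAR // WORKING_SET_POSTS}x")  # ratio: 8000x
```

Whatever the real record size turns out to be, the ratio between the two stays at 8,000x, which is the point of keeping only the working set in the counter store.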