A social media platform has to deal with both the concept of now and its history. For the most part, users interact with the current state of the system: looking at their timeline, watching current posts, etc. At the same time, there is a wealth of information that you can get from looking at the past.
It isn’t out of the question that you’ll have users diving into another user’s post history, going as far back as possible. That can be a parent whose kids just left the house, looking at baby pictures, or it can be a new friend trying to learn some interesting tidbits before a party (when we still had those).
It can also be automated processes, such as: “5 years ago you posted…”
The architecture that I presented in these posts is relatively agnostic to such a scenario. Given the timeline feature, going back in time means that you can fairly easily discriminate based on age. Older sections in the timeline can be moved to a lower-class storage tier (basically, moving to HDD instead of NVMe, for example). They are still accessible, still available, but far cheaper to store.
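To make the age-based split concrete, here is a minimal sketch of how a timeline section might be assigned a tier. The `TimelineSection` type, the tier names and the one-year cutoff are all assumptions made up for illustration, not part of the architecture described in this series.

```python
from dataclasses import dataclass
from datetime import date, timedelta

# Hypothetical cutoff: sections older than this are candidates for the HDD tier.
COLD_AFTER = timedelta(days=365)


@dataclass
class TimelineSection:
    user_id: str
    newest_post: date    # date of the most recent post in this section
    tier: str = "nvme"   # "nvme" (hot) or "hdd" (cheaper); names are illustrative


def tier_for_age(section: TimelineSection, today: date) -> str:
    """Pick a storage tier purely from the age of the section."""
    if today - section.newest_post > COLD_AFTER:
        return "hdd"
    return "nvme"
```

A background job could walk the timeline sections, call `tier_for_age` and move any section whose current tier no longer matches. Age alone is only the first approximation, of course.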
I don’t believe that you can usually go with an archive tier for the timelines, not unless you are willing to be effectively unable to access them when a user requests it. A policy of moving old and rarely used timeline sections and posts to HDD is absolutely doable, however. Note that things like intelligent tiering are not a good solution for our needs. That would move items based on age and access. We do want to move items by age, but older items are still accessed, just far more rarely, and we don’t want an occasional read to move them back into hot storage.
That said, certain posts are likely to generate activity for a long time, so we can’t send data to cold storage based on age alone. We also need to take into account the recent access patterns. On the other hand, consider a post from a few years ago that talks about Broccoli, when people still did that. Mr. Beat discovers that Mrs. Bold has such a post and blasts it all over social media. Very quickly, that old post becomes very active. That means that we should have a way to move data back to hot storage if there is enough access.
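The two cases above (a post that stays busy for years, and an old post that suddenly blows up) can be sketched as a single placement rule: heavy recent access always wins, and age only matters once the post is quiet. Everything here is an assumption for illustration: the `PostPlacement` name, the thresholds and the idea of keeping raw read timestamps (a real system would more likely use a cheap approximate counter).

```python
from collections import deque
from dataclasses import dataclass, field

# All thresholds are hypothetical and would need tuning against real traffic.
COLD_AFTER_SECONDS = 365 * 24 * 3600  # age before a post is eligible for cold storage
WINDOW_SECONDS = 3600                 # how far back "recent access" looks
PROMOTE_AT = 100                      # reads within the window that count as "active"


@dataclass
class PostPlacement:
    created_at: float                 # seconds since the epoch
    tier: str = "hot"
    recent_reads: deque = field(default_factory=deque)

    def _trim(self, now: float) -> None:
        # Forget reads that fell out of the sliding window.
        while self.recent_reads and self.recent_reads[0] <= now - WINDOW_SECONDS:
            self.recent_reads.popleft()

    def record_read(self, now: float) -> None:
        self.recent_reads.append(now)
        self._trim(now)

    def decide_tier(self, now: float) -> str:
        self._trim(now)
        busy = len(self.recent_reads) >= PROMOTE_AT
        old = now - self.created_at > COLD_AFTER_SECONDS
        if busy:
            self.tier = "hot"    # active posts stay in (or return to) hot storage
        elif old:
            self.tier = "cold"   # old and quiet: cheap storage is good enough
        return self.tier         # young and quiet: leave it where it is
```

Note that a handful of reads on an old post does not promote it. Only sustained access inside the window does, which is the behavior that a generic "move on any access" policy would not give us.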
Ideally, we can rely on the underlying storage to do that for us, but we have to know how it actually works behind the scenes and understand what is actually going on there. The nice thing about this is that, unlike most of the details we discussed so far, it is something that we can punt down the road. We already have the architecture in place that will allow us to introduce this cost-saving measure down the line; we don’t have to have it figured out from day one. Given that we have multi-level caches, we can probably just age out old information to cold storage and not usually have to think about it too much.
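As a rough illustration of why the caches let us mostly ignore where a post physically lives, here is a sketch of a read path that falls through cache, hot storage and cold storage. The `TieredReader` class is hypothetical, and plain dictionaries stand in for the real cache and storage layers.

```python
from typing import Optional


class TieredReader:
    """Looks up a post in the cache first, then hot storage, then cold storage."""

    def __init__(self, cache: dict, hot: dict, cold: dict):
        self.cache = cache
        self.hot = hot
        self.cold = cold

    def read(self, post_id: str) -> Optional[bytes]:
        if post_id in self.cache:
            return self.cache[post_id]
        for store in (self.hot, self.cold):
            if post_id in store:
                # Populate the cache so repeated reads skip the slower tiers.
                self.cache[post_id] = store[post_id]
                return store[post_id]
        return None  # unknown post
```

With something like this in place, the slow HDD read is paid once, on the first request for an old post. If that post becomes popular again, the cache absorbs the load while a separate process decides whether to move it back to hot storage.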
When we have enough data that this is a serious concern, on the other hand… we will have the time and resources to also handle it.
More posts in "Building a social media platform without going bankrupt" series:
- (05 Feb 2021) Part X–Optimizing for whales
- (04 Feb 2021) Part IX–Dealing with the past
- (03 Feb 2021) Part VIII–Tagging and searching
- (02 Feb 2021) Part VII–Counting views, replies and likes
- (01 Feb 2021) Part VI–Dealing with edits and deletions
- (29 Jan 2021) Part V–Handling the timeline
- (28 Jan 2021) Part IV–Caching and distribution
- (27 Jan 2021) Part III–Reading posts
- (26 Jan 2021) Part II–Accepting posts
- (25 Jan 2021) Part I–Laying the numbers