Building a social media platform without going bankrupt: Part III–Reading posts

Jan 27 2021

So far in this series of posts I looked into how we write posts. The media goes to an S3-compatible API and the posts themselves go to a key/value store. Reading them back, on the other hand, isn’t that simple. For the media, I’m going to assume that the S3 storage is connected to a CDN and that this is handled; I want to focus on how we deal with reading posts. In particular, I’m not talking here about how we display the timeline for a user. That is going to be the subject of another post. Right now, I’m assuming that this is handled and talking about the next step: we have a list of post ids that we want to get, and we need to manage that.

The endpoint in question would look like this:

GET /api/v1/read?post=1352410889870336005&post=1351081958063951875

The result of this API is a JSON object with the post ids as the keys and the content of each post as the values.

This API is about as simple as you can imagine, but even from this scenario you can see a few interesting details:

The API is using GET, which means that there is a natural limit to the size of the URL. This is good and by design. We will likely limit this to a maximum of 128 items at a time anyway.
The API is inherently about dealing with batches of information.
The media is handled separately (requested directly by the client) so we can return far less information.
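
As a rough illustration of the first two points, here is a minimal Python sketch of how the server side might pull the repeated post parameters out of the URL and enforce the batch limit. The function name parse_post_ids, the error handling and the decision to drop duplicate ids are assumptions for illustration, not something this series prescribes.

```python
from urllib.parse import parse_qs, urlsplit

MAX_POSTS_PER_REQUEST = 128  # the batch limit discussed above


def parse_post_ids(url: str) -> list[str]:
    """Extract the repeated `post` query parameters from a read URL."""
    ids = parse_qs(urlsplit(url).query).get("post", [])
    if not ids:
        raise ValueError("at least one post id is required")
    if len(ids) > MAX_POSTS_PER_REQUEST:
        raise ValueError(f"at most {MAX_POSTS_PER_REQUEST} posts per request")
    for post_id in ids:
        # Ids stay as strings: values this large do not fit in a
        # double-precision number without losing digits.
        if not (post_id.isascii() and post_id.isdigit()):
            raise ValueError(f"invalid post id: {post_id!r}")
    # Assumption: duplicates carry no meaning, so drop them but keep order.
    return list(dict.fromkeys(ids))
```

Keeping the ids as strings is a deliberate choice in this sketch: they also serve as the keys of the JSON object in the response, and JSON object keys are strings anyway.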

In many cases, this is going to be a unique set of posts. When you view your timeline, for example, the set of posts you see is likely to be unique to you. However, in many other cases, you’ll send a request that is similar or identical to what others send.

When you are looking at a popular thread, for example, you’ll be asking for the same post ids as everyone else. That means there is a good chance we can easily add caching for this via a CDN or the like and benefit greatly as a result.

Internally, the implementation of this API is probably just going to issue direct reads by id to the key/value store and return the results. There should usually be a minimal amount of processing involved, except for one factor: authorization.

Assuming that the key/value interface has a get(id) method, the backend code for this critical API should be something like the code below. Note that this is server side code; I'm not showing any client side code in this series of posts. This is the backend code that handles reading a batch of ids sent by the client.
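
What follows is a minimal sketch of such a handler in Python with asyncio, under stated assumptions rather than a definitive implementation. The only thing taken from the text is the get(id) method on the key/value store. The field names author and protected, the follows/{author}/{reader} key layout, the rule that authors can always read their own posts, and the InMemoryStore stand-in are all hypothetical and exist only to make the example runnable.

```python
import asyncio
from typing import Any, Optional


class InMemoryStore:
    """Hypothetical stand-in for the key/value store; only get(id) is assumed."""

    def __init__(self, data: dict[str, dict[str, Any]]):
        self._data = dict(data)

    async def get(self, key: str) -> Optional[dict[str, Any]]:
        return self._data.get(key)


async def read_posts(kv, reader_id: str, post_ids: list[str]):
    """Return (visible posts keyed by id, whether the result may be cached)."""
    # Fan out one read per id and wait for all of them together.
    results = await asyncio.gather(*(kv.get(pid) for pid in post_ids))
    # A post may have been removed in the meantime; drop those null items.
    found = {pid: post for pid, post in zip(post_ids, results) if post is not None}

    # Issue a single follow check per distinct author of a protected post.
    authors = sorted(
        {p["author"] for p in found.values()
         if p["protected"] and p["author"] != reader_id}
    )
    checks = await asyncio.gather(
        # Assumed key layout: present only if the author follows the reader.
        *(kv.get(f"follows/{author}/{reader_id}") for author in authors)
    )
    allowed = {author for author, hit in zip(authors, checks) if hit is not None}

    visible: dict[str, dict[str, Any]] = {}
    cacheable = True
    for pid, post in found.items():
        if post["protected"]:
            # Any protected post makes the response reader-specific.
            cacheable = False
            # Assumption: authors can always read their own posts.
            if post["author"] != reader_id and post["author"] not in allowed:
                continue
        visible[pid] = post
    return visible, cacheable
```

A caller would run something like `visible, cacheable = await read_posts(store, "carol", ids)` and serialize `visible` as the JSON response. Requests that touch only public posts never issue a follow check at all.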

Note that the code assumes there is nothing to gain from doing a batch operation on the key/value store itself. That isn’t always the case, but I’ll assume that here. We issue N async promises to the key/value store and wait to get them all back. This assumes that the latency from the API node to the key/value servers is minimal, which lets us treat a lot of remote calls as if they were near calls.

The vast majority of the function is dedicated to the auth behavior. A post can be marked as public or protected, and if it is the latter, we need to ensure that only people that the author of the post follows will be able to see it. You’ll note that I’m doing a lot of stuff in an async manner here. For example, we’ll only issue a single check per post author, and we can safely assume that most posts are public anyway. I’m including the “full” code here to give you an indication of the level of complexity that I would expect to see in the API.

You should also note that we indicate whether we allow caching of the results or not. In the case of a request that includes a protected post, we don’t allow it. But for the most part, we can expect a high percentage of requests to include only public posts, and those can benefit from caching.
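
One plausible way to pass that decision on to a CDN is through the Cache-Control response header. The sketch below is an assumption about how that could look: the function name cache_headers and the 60 second lifetime are made up for illustration, and the input is simply a boolean saying whether the batch contained only public posts.

```python
def cache_headers(cacheable: bool, max_age_seconds: int = 60) -> dict[str, str]:
    """Translate the cache decision into an HTTP Cache-Control header."""
    if cacheable:
        # Only public posts: shared caches such as a CDN may store the response.
        return {"Cache-Control": f"public, max-age={max_age_seconds}"}
    # A protected post was requested: the response depends on who is asking.
    return {"Cache-Control": "private, no-store"}
```

The right lifetime depends on how quickly edits and removals need to become visible, so treat the default here as a placeholder.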

Because we are running in a distributed system, we also have to take into account all sorts of interesting race conditions. For example, you may be trying to read a post that has been removed. We explicitly clear all such null items from the results. Another way to handle that is to replace the content of the post and set a marker flag, but we’ll touch on that in another post.

Finally, the code above doesn’t handle caching or distribution. That is going to be handled both above and below this code. I’ll have a dedicated post around that tomorrow.

Oren Eini

CEO of RavenDB