Ayende @ Rahien

Oren Eini, aka Ayende Rahien, is the CEO of Hibernating Rhinos LTD, which develops RavenDB, a NoSQL Open Source Document Database.

time to read 2 min | 363 words

A few days ago I posted about looking at GitHub projects for junior developer candidates. One of the things that is very common in such scenarios is the use of string concatenation for queries, which I hate. I just went to a random candidate’s GitHub profile and found this gem:
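
The shape of it was something like this (a sketch of the pattern, with invented names):

```csharp
using System.Data.SqlClient;

public User GetUser(string username)
{
    var connection = new SqlConnection(connectionString);
    connection.Open();
    // User input concatenated straight into the query text
    var command = new SqlCommand(
        "SELECT * FROM Users WHERE Username = '" + username + "'",
        connection);
    var reader = command.ExecuteReader();
    if (reader.HasRows) // redundant: Read() already tells us this
    {
        if (reader.Read())
            return ReadUser(reader); // hypothetical mapping helper
    }
    return null;
    // Note: the connection, command and reader are never disposed
}
```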

The number of issues that I have with this code is legion.

  • Not closing the connection or disposing the command.
  • The if can be dropped entirely.
  • And, of course, the actual SQL INJECTION vulnerability in the code.

There is a reason that I have such a reaction to this type of code, even when looking at junior developer candidates. For them, this is acceptable, I guess. They are learning and focusing mostly on what is going on, not the myriad of taxes that you have to pay in order to get something to production. This was never meant to be production code (I hope, at least), and I’m not judging it on that level. But I have to very consciously remind myself of this fact whenever I run into code like this (and as I said, this is all too common).

The reason I have such a visceral reaction to this type of code is that I see it in production systems all too often. And that leads to nasty stuff like this:
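
I won’t reproduce the actual code, but the pattern was essentially user input interpolated straight into the query, something like:

```csharp
// Not the actual code, just the pattern: interpolation instead of parameters
var command = new SqlCommand(
    $"SELECT * FROM Posts WHERE AuthorId = {userInput} ORDER BY CreatedAt DESC",
    connection);
```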

And this code led to a 70GB data leak on Gab. The killer for me is that this code was written by someone with 23 years of experience.

I actually had to triple check what I was seeing when I read the code the first time, because I wasn’t sure that this was actually possible. I thought maybe this was some fancy processing done to avoid SQL injection, not that this was basically string interpolation.

Some bugs are things that you can excuse. A memory leak or a double free are things that will happen to anyone writing in C, regardless of experience and how carefully they write. They are often subtle and easy to miss, happening in corner cases of error handling.

This sort of bug is a big box of red flags. It is also on fire.

time to read 2 min | 398 words

I ran into an interesting scenario the other day. As part of the sign in process, the application will generate a random token that will be used to authenticate the following user requests. Basically, an auth cookie. Here is the code that was used to generate it:


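It was something along these lines (a sketch using the RavenDB session API, with invented document fields):

```csharp
public string GetSecurityToken(IDocumentSession session, string userId)
{
    var user = session.Load<User>(userId);
    // If we already generated a token within the past hour, reuse it
    if (user.TokenGeneratedAt > DateTime.UtcNow.AddHours(-1))
        return user.SecurityToken;

    // Otherwise, generate a new token from cryptographically secure random bytes
    var buffer = new byte[32];
    RandomNumberGenerator.Fill(buffer);
    user.SecurityToken = Convert.ToBase64String(buffer);
    user.TokenGeneratedAt = DateTime.UtcNow;
    session.SaveChanges();
    return user.SecurityToken;
}
```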
This is a pretty nice mechanism. We use cryptographically secure random bytes as the token, so no one can guess what it will be.

There was a serious issue with this implementation, however. The problem was that on application startup, two different components would race to complete the signup. As you can see from the code, this is meant to be done in such a way that two such calls within the space of one hour will return the same result. In most cases, this is exactly what happens. In some cases, however, we got into a situation where the two calls would race each other. Both would load the document from the database at the same time, each generate an (obviously different) security token and write it out, and then one of them would return the “wrong” security token.

At that point, it meant that we got an authentication attempt that was successful, but gave us the wrong token back. The first proposed solution was to handle that using a cluster wide transaction in RavenDB. That would allow us to ensure that in the case of racing operations, we’ll fail one of the transactions and then have to repeat it.

Another way to resolve this issue without the need for distributed transactions is to make the sign up operation idempotent. Concurrent calls at the same time will end up with the same result, like so:
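
A sketch of what that can look like, assuming something like the Konscious.Security.Cryptography Argon2 package (the names and parameters here are illustrative only):

```csharp
public string GenerateSecurityToken(User user, byte[] pepper)
{
    // Round the current time down to the hour, so two calls within
    // the same hour derive the exact same token
    var now = DateTime.UtcNow;
    var hour = new DateTime(now.Year, now.Month, now.Day, now.Hour, 0, 0,
        DateTimeKind.Utc);

    using var argon2 = new Argon2i(user.HashedPassword)
    {
        // The rounded time acts as the salt; the pepper is a server-side secret
        Salt = Encoding.UTF8.GetBytes(hour.ToString("o")),
        KnownSecret = pepper,
        DegreeOfParallelism = 1,
        MemorySize = 8192, // in KB
        Iterations = 2
    };
    return Convert.ToBase64String(argon2.GetBytes(32));
}
```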

In this case, we generate the security token using the current time, rounded to an hour basis. We use Argon2i (a password hashing algorithm) to generate the required security token from the user’s own hashed password, the current time and some pepper to make it impossible for outsiders to guess what the security token is even if they know what the password is. By making the output predictable, we make the rest of the system easier.

Note that the code above is still not completely okay. If the two requests come within milliseconds of one another but fall on different hours, we have the same problem. I’ll leave the problem of fixing that to you, dear reader.

time to read 9 min | 1604 words

I was talking to a developer recently and had a really interesting discussion around the notion of consistency. For simplicity’s sake, let’s imagine that we are talking about a game and we need to deal with awarding achievements.

The story in question begins with a seemingly innocent business requirement:

We want to announce a unique achievement in the game. The first player to kill 10,000 rabbits will get a unique class-appropriate item.

Conceptually, that means that we need to write the following code:
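
Something like this naive sketch (invented names, and note that there is no synchronization whatsoever here):

```csharp
public void OnRabbitKilled(Character character)
{
    character.RabbitsKilled++;
    if (character.RabbitsKilled == 10_000 && world.FirstRabbitSlayer == null)
    {
        world.FirstRabbitSlayer = character.Id;
        AwardUniqueClassItem(character); // class-appropriate reward
    }
}
```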

The code is simple, easy and trivial. I wish we could end the post at this point, but as you can imagine, the situation is a bit more complex.

Consider the typical architecture of a game: we have the game world, which is composed of multiple servers, often located in different parts of the world. In this case, we can assume that we have three data centers, on separate continents.


Notice that we have a world-first scenario here? That means that we need to synchronize the state across all of the data centers in some manner, including when there are network issues, failures, etc.

For that matter, when we are talking about an achievement for a character, we can at least be sure that there is a single chain of events that we can follow, but what happens if we were to apply this achievement to a guild? In this case, multiple players may be competing to complete the achievement. If we allow it to also run on different servers (and regions), the situation becomes quite complex.

Are there technical ways to resolve this? Of course there are.

We can use distributed transactions to handle this, at first glance. However, that introduces an utterly unworkable spanner in the works (see what I did there?). Games care a lot about performance and latency; forcing us to run a globally distributed transaction on every rabbit kill is not a workable solution. There are stiff performance costs associated with it. For that matter, in a partial failure scenario, you still want to allow players to kill rabbits, even if you lost some servers on another continent. That leads to several additional scenarios that you need to cover:

  • What happens if a player killed rabbits but wasn’t able to record that using the world scoped state? Do we also store that in a character / server level state? What happens if we already awarded the achievement and then find out that there are delayed updates in the mix?
  • What happens when two players kill their 10,000th rabbit at times T1 and T2 (where T2 > T1), but this happens near enough to one another that the write about the 2nd player is committed first?
    • Note that this can happen on the same server, on the same region or across different regions.
  • What happens when two players kill their 10,000th rabbit at the same instant?
    • Note that this can happen on the same server, on the same region or across different regions.
    • Note that this can actually happen at different times, but clock divergence will make it look like they happened at the same time.
  • What happens when a player kills their 10,000th rabbit, but we had a network hiccup and couldn’t record the action in time?

In fact, there are a multitude of issues that we have to deal with, if we accept the scenario as is. And the impact on the entire system because of the unstated requirements of a single achievement are huge.

That is likely not the intended result. We can make some minimal changes to the system and get pretty much what we want at a much reduced cost. We start by giving up the implicit assumption that we have to award the achievement for the 10,000th kill within the same tick as the actual kill. If we give up this requirement, we have a far more relaxed environment to deal with.

We can say that we’ll process the achievements at the end of the next hour, for example. That gives us enough time to get updates from the rest of the system, let the dust settle and avoid millisecond level decision making. In almost all businesses, there is no such thing as a race condition. The best example of that is “the check is in the mail”. The payment date for a check is not when you cashed it but when you posted it. My uncle used to go to the post office at 4:50 PM on a Friday to post checks. They would get the right time stamp, but would sit in the post office over the weekend before actually being delivered. That gave him 3 – 5 extra days before the checks were cashed.

In the case of our game, giving things time to settle makes sense. It also makes sense from a marketing and community perspective. World wide announcements don’t just happen, they are scheduled for a particular time frame to maximize impact. Delaying the announcement of such an achievement makes it a lot easier from a technical perspective, but it also gives us more business level benefits to work with.

In many situations, we jump to assumptions about the kind of requirements that we have to meet, but usually there is a lot more flexibility. Discovering where you can get some slack can be of tremendous value.

Possible reactions from the business in this scenario can be:

  • We don’t actually care if this is exact. We want to give the achievement when this happens, but it is okay if there is a little fudge factor and the “wrong” person wins when the time difference is tiny.
  • It is actually fine if two players get the achievement when it happens at the “same” time.
  • Oh, I didn’t realize that this is so hard. Can we make it a server-wide achievement, then? Would that be easier?

Any of those responses will translate to a great simplification of what we have to deliver. In the first case, we can track the state of bunny killing without distributed transactions and only apply a distributed transaction when we get to the 10,000th rabbit. That is going to be a rare event, so it is fine to pay a little extra there.

For that matter, a good question to ask is how important the operation is. What happens when we have a failure in the distributed transaction as we record the 10,000th rabbit? I’m not talking about someone else getting there first, I’m talking about a network hiccup or a faulty wire that causes the operation to fail. Do we retry? Do we output an error (and if so, to where?) or just ignore it? Do we need a secondary mechanism for checking for errors in the process?

The answer is that it depends; you can get into an effectively infinite loop of trying to solve ever more unlikely scenarios. The question is what the impact on the system at large is, and whether it is worth the cost.

For a game achievement, the answer is probably no (but then again, I’m not a gamer). People usually draw the line at money as the place where they’ll do their utmost to avoid issues. That is strange, because money specifically has so many ways to run compensating actions as part of the normal workflow that you shouldn’t bend yourself into a pretzel to avoid them.

Let’s say that we are offering three unique mounts in our game, selling them for $9.99 each. We obviously have to deal with race conditions here. There are only three mounts, but there are more users who want them and are willing to pay for the privilege. We are dealing with money here, so we can assume that we want to be careful; the question is, how careful?

We have three mounts, but more users. The payment process itself takes at least a few seconds, so how do you deal with it?

  • Throw the purchase attempts into a queue.
  • Pull the first three offers from the queue and attempt to charge them.
  • If charging failed, we mark the purchase attempt as errored and move on to the next item in the queue.
  • Once we successfully processed three purchases, we mark all the other purchase attempts as failed.

Simple, right?
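
In code, that flow might look something like this (the payment gateway and helper methods are invented for illustration):

```csharp
public void ProcessMountPurchases(Queue<PurchaseAttempt> queue)
{
    int sold = 0;
    while (sold < 3 && queue.Count > 0)
    {
        var attempt = queue.Dequeue();
        try
        {
            // The charge itself can take several seconds
            paymentGateway.Charge(attempt.CardToken, 9.99m);
            GrantMount(attempt.UserId);
            sold++;
        }
        catch (PaymentFailedException)
        {
            MarkAsErrored(attempt); // move on to the next attempt in line
        }
    }
    // The mounts are gone; everyone still waiting is out of luck
    foreach (var remaining in queue)
        MarkAsFailed(remaining);
}
```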

What happens when a charge takes longer than expected and the request times out, but the charge actually goes through? This can happen if you have SMS authentication in the process: the card company sends a message to the card holder, who has to reply with an SMS to approve the transaction. That can take long enough for the transaction to time out. But the user did authorize the transaction and funds were transferred, while you already sold the unique mount to someone else, whose credit card had fewer security features slowing things down.

What happens then? You can refund the money, provide another unique mount, etc.

The key here is that even for something that we may consider critical, the number of failure modes is too high. Trying to handle them all and ensure a consistent world is usually too expensive. It is far better to have the hooks in place to handle failures and apply compensations. They are far rarer and much easier to deal with in isolation.

When you start seeing that something happens on a frequent basis, that is when you want to figure out a way to automate that failure mode.

time to read 5 min | 925 words

Unless I get good feedback / questions on the other posts in the series, this is likely to be the last post on the topic. I was trying to show what kind of system and constraints you have to deal with if you wanted to build a social media platform without breaking the bank.

I talked about the expected numbers that we have for the system, and then set out to explain each part of it independently. Along the way, I was pretty careful not to mention any one particular technological solution. We are going to need:

  • Caching
  • Object storage (S3 compatible API)
  • Content Delivery Network
  • Key/value store
  • Queuing and worker infrastructure

Note that the whole thing is generic and there are very few constraints on the architecture. That is by design, because if your architecture can hit the lowest common denominator, you have a lot more freedom. Instead of tying yourself to a particular provider, you can likely set things up so that you have multiple disparate providers without too much of a hassle.

My goal with this system was to be able to accept 2,500 posts per second and to handle reads of 250,000 per second. This sounds like a lot, but most of the load is meant to be handled by the CDN and the infrastructure, not the core servers. Caching in a social network is somewhat problematic, since a lot of the work is obviously personalized. That said, there is still quite a lot that can be cached, especially the more popular posts and threads.

If we assume that only about 10% of the reading load hits our servers, that is 25,000 reads per second. If we have just 25 servers handling this (say, five each in five separate data centers), each needs to accept a load of 1,000 requests per second. On the one hand, that is a lot, but on the other hand… most of the cost is supposed to be about authorization, minor logic, etc. We can also at this point add more application servers and scale linearly.

Just to give some indication of costs, a dedicated server with 8 cores & 32 GB of RAM will cost $100 a month, with no charge for traffic. Assuming that I’m running 25 of these, that will cost me $2,500 a month. I can safely double or triple that amount without much trouble, I think.

Having to deal with 1,000 requests per second per server requires paying attention to what you are doing, but it isn’t really that hard, to be frank. RavenDB can handle more than a million queries a second, for example.

One thing that I didn’t touch on, however, which is quite important, is the notion of whales. In this case, a whale is a user that has a lot of followers. Let’s take Mr. Beat as an example; he has 15 million followers and is a prolific poster. In our current implementation, we’ll need to add to the timelines of all his followers every time he posts something. Mrs. Bold, on the other hand, has 12 million followers. At one time, Mr. Beat and Mrs. Bold got into a post fight. It looks like this:

  1. Mr. Beat: I think that Mrs. Bold has a Broccoli’s bandana.
  2. Mrs. Bold: @mrBeat How dare you, you sniveling knave
  3. Mr. Beat: @boldMr2 I dare, you green teeth monster
  4. Mrs. Bold: @mrBeat You are a yellow belly deer
  5. Mr. Beat: @boldMr2 Your momma is a dear

This incredibly witty post exchange happened during a three minute span. Let’s consider what this will do, given the architecture that we outlined so far:

  • Post #1 – written to 15 million timelines.
  • Posts #2 – 5 – written to the timelines of everyone that follows both of them (because of the mentions); let’s call that 10 million each.

That is 55 million timeline writes to process within the span of a few minutes. If other whales also join in (and they might), the number of writes we’ll have to process will skyrocket.

Instead, we are going to take advantage of the fact that only a small number of accounts are followed by many people. We’ll place the limit at 10,000 followers, at which point we’ll no longer process writes for such accounts. Instead, we’ll place the burden on the client side. The code for showing the timeline will then become something like this:
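
A sketch of the idea (the stores and method names here are placeholders):

```csharp
public async Task<List<Post>> GetTimeline(string userId)
{
    // The precomputed timeline holds writes from "normal" accounts only
    var postIds = await timelineStore.GetRecentPostIds(userId);

    // High profile accounts are not fanned out on write; instead we pull
    // their recent posts directly on read
    foreach (var whale in await followStore.GetHighProfileFollows(userId))
    {
        postIds.AddRange(await timelineStore.GetRecentPostIds(whale));
    }

    var posts = await postStore.GetPosts(postIds);
    return posts.OrderByDescending(p => p.CreatedAt).ToList();
}
```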

In other words, we record the high profile users in the system and instead of doing the work for them on write, we do it on read. The benefit of doing it in this manner is that the high profile users’ timeline reads will have very high cache utilization.

Given that the number of high profile people you follow is naturally limited, that can save quite a lot of work.

The code above can be improved, of course. There is usually a lot of variance in the timeline posts, so we may have a high profile user that has been quiet for a day or two; they shouldn’t show up in the current timeline and can be skipped entirely. You need to do a bit more work around the time frames as well, which means that the timeline should also allow us to query it by most recent post id, but that is also not too hard to implement.

And with that, we are at the end. I think that I covered quite a few edge cases and interesting details, and hopefully that was interesting for you to read.

As usual, I really appreciate any and all feedback.

time to read 3 min | 551 words

A social media platform has to deal with the concept of now and its history. Most users are interacting with the current state of the system: looking at their timeline, watching current posts, etc. At the same time, there is a wealth of information that you can get from looking at the past.

It isn’t out of the question that you’ll have users diving into the history of posts of another user going as far back as possible. That can be a parent whose kids just left the house, looking at baby pictures or it can be a new friend, trying to learn some interesting tidbits before a party (when we still had those).

It can also be automated processes, such as: “5 years ago you posted…”

The architecture that I presented in these posts is relatively agnostic to such a scenario. Given the timeline feature, going back in time means that you can fairly easily discriminate based on age. Older sections in the timeline can be moved to a lower storage tier (basically, moved to HDD instead of NVMe, for example). They are still accessible, still available, but far cheaper to store.

I don’t believe that you can usually go with an archive tier for the timelines, not unless you are willing to be effectively unable to serve them when a user asks, but a policy of moving old and rarely used timeline sections and posts to HDD is absolutely doable. Note that something like intelligent tiering is not a good solution for our needs. That would move items based on age and access, but while we want to move items by age, older items are still accessed, just far more rarely, and we don’t want those rare accesses to move them back into hot storage.

That said, certain posts are likely to generate activity for a long time, so we can’t send data to cold storage based on age alone. We need to also take recent access patterns into account. On the other hand, consider a post from a few years ago that talks about Broccoli, when people still did that. Mr. Beat discovers that Mrs. Bold has such a post and blasts it all over social media. Very quickly, that old post becomes very active. That means that we should have a way to move data back to hot storage if there is enough access.

Ideally, we can rely on the underlying storage to do that for us, but we have to know how it actually works behind the scenes and understand what is going on there. The nice thing is that, unlike most of the details we discussed so far, this is something that we can punt down the road. We already have the architecture in place that will allow us to introduce this cost saving measure down the line; we don’t have to have it figured out from day one. Given the fact that we have multi level caches, we can probably just age out old information to cold storage and not usually have to think about it too much.

When we have enough data that this is a serious concern, on the other hand… we will have the time and resources to also handle it.

time to read 3 min | 555 words

Quite a few of the features that we consider native to social media actually came about as a result of users’ behavior, not pre-planned action. For example, #tagging and @mentions were both created by users and then adopted as official features by the social media giants.

I already touched on how mentions are handled: as part of writing the document itself, we insert the post id into the timeline of the mentioned user. For tagging, the process is very similar. Each tag has a timeline, and we can insert posts into the tag’s timeline. From there on, the process is basically identical to what we already described.

I want to stop for a second and emphasize the coolness factor of a significant feature being handled (in the backend) with simply:
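
Roughly this (a sketch, assuming an AppendToTimeline helper like the ones we discussed):

```csharp
// When a post is created, each tag's timeline gets the post id appended,
// exactly like a user's timeline would
foreach (var tag in post.Tags)
{
    await AppendToTimeline("tags/" + tag, post.Id);
}
```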

There is probably UI work to do here, but that is roughly all you’ll need to manage tags. Presumably you’ll want some better policies, but that is the core behavior you’ll need to support this sort of feature.

What about searching? Full text search indexes are nothing new and you can get an off-the-shelf solution to manage your searches easily enough. That is likely to be one of the annoying pieces of the system. Luckily, we can usually handle things by offering two tiers of searching. The first tier covers posts in the recent past (a month or two) and must have very fast queries; then we have full data search, for which we have a much longer SLA. By far, most queries are going to hit the recent data set, which makes the task itself easier. The actual choice of indexing solution and its usage is fairly irrelevant at this point. You’ll need something that is distributed, but there is enough variety that you can get away with selecting pretty much anything.

We don’t need to provide sophisticated full text search features; we just want users to be able to find results by text queries.

You’ll note that throughout this series of posts, I’m not trying to find novel ways to get the best solution. I’m using practical options for the actual use case presented and in many cases, I can get away with a lot by changing the requirements just slightly. For that matter, a lot of the limitations that I accept are real limitations that you’ll find with other social media networks as well.

Finally, I just wanted to show how we can enable a basic search capability using a minimal amount of code, given the infrastructure we have so far:
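
Something like this (using the same hypothetical AppendToTimeline / GetTimelinePostIds helpers):

```csharp
// Indexing: each term in the post gets the post id appended to its own timeline
public async Task IndexPost(Post post)
{
    var terms = post.Text
        .Split(' ', StringSplitOptions.RemoveEmptyEntries)
        .Select(t => t.Trim().ToLowerInvariant())
        .Distinct();
    foreach (var term in terms)
    {
        await AppendToTimeline("search/" + term, post.Id);
    }
}

// Querying: the timeline for a term is the list of posts containing it
public Task<List<long>> Search(string term)
{
    return GetTimelinePostIds("search/" + term.ToLowerInvariant());
}
```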

As you can see, we built a simple full text search here. To query it, you get the timeline for a particular term, which holds the list of posts containing it.

For tags and searches, as you can imagine, this can be a huge list, which is partly why timelines are built on the concept of sections that can be so easily distributed.

The solution above isn’t actually a good one for full text search. I can’t easily turn it into a search by phrase, and there are many other features that I’ll likely want to have, but it is a good example of how the infrastructure that we built for one part of the system can be utilized for a completely different purpose.

time to read 4 min | 623 words

I touched briefly on the issue of post statistics in a previous post, but it deserves its own post. There are all sorts of metrics that we want to track on a post. Here are just a few of them:

  • Likes
  • Replies
  • Shares

Unlike most of the items that we discussed so far, these details are going to be very relevant for both reads and writes. In particular, it is very common for these numbers to be updated concurrently, especially for popular posts. At the simplest level, these can be represented as a map<key, int64>. That gives us the maximum flexibility for our needs and can also be utilized in the future for additional use cases.

Given that this is effectively a distributed counter problem, there are all sorts of ways that we can handle it. At the client level, we send the increment operation to the server and optimistically update the displayed value. That gets us 90% of the way there in terms of UX, but there is a lot to handle behind the scenes.

A good algorithm to use for this is the PN Counter model from the CRDT playbook. RavenDB implements these for you, for example. In essence, that means that we have the following data model:
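
Something along these lines (node names and values are illustrative):

```json
{
    "likes": {
        "DC-A": { "value": 1921, "etag": 482 },
        "DC-B": { "value": 1203, "etag": 311 }
    },
    "replies": {
        "DC-A": { "value": 89, "etag": 483 }
    }
}
```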

The likes and replies objects have a property per node that incremented the value, holding the value we have for that node as well as the etag of the change. It is easy to merge such a model between different versions, because we can always take the value with the higher etag to get the latest one. In this way, we can allow concurrent and distributed updates across the entire system and it will converge in the end to the right value. Another option may be to push the commands all the way to the owning data center, where we’ll apply the operations, but that may add a high load on hot posts in the system. Better to distribute this globally and not really concern ourselves with the matter.

Looking at Twitter, there are about 200 billion tweets a year. That means that we have to be ready for quite a few of those values. Keeping them in a dedicated system is a good idea, since they have a far different read & write skew than other parts of our system. As part of reading posts, however, we’ll likely want some mechanism to fold those counters into the post itself, so we can remove them from the rest of the system. An easy way to handle that is to do so on an hourly basis. So instead of the format above, we’ll have:
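
Illustrative again, with the counters bucketed by hour:

```json
{
    "2021-02-04T14:00": {
        "likes":   { "DC-A": { "value": 1921, "etag": 482 } },
        "replies": { "DC-A": { "value": 89,   "etag": 483 } }
    },
    "2021-02-04T15:00": {
        "likes": { "DC-B": { "value": 733, "etag": 97 } }
    }
}
```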

Here we have the last two hours of updates on the post. Once every hour, we’ll consolidate all the updates from two hours ago and write them to the post itself. When we get to the point where there are no more updates for the post, we can safely delete the value.

The reason you want to add this complexity is that there is a big difference between all the posts in a social media platform and the active working set. The working set tends to be far smaller, which can dramatically reduce the amount of data we need to keep and manage. Assuming that the working set is at 25 million posts or so across the network seems reasonable, and that amount of data can be easily handled by any server instance you care to use. Managing 200 billion per year, on the other hand, puts us in a different class of problem, and we’ll need more and more resources down the line.

time to read 5 min | 834 words

In the series so far, we talked about reading and writing posts, updating the timeline and distributing it, etc. I talked briefly about the challenges of caching data when we have to deal with updates in the background, but I haven’t really touched on that. Edits and updates are a pain to handle, because they invalidate the cache, which is one of the primary ways we can scale cheaply.

We need to consider a few different aspects of the problem. What sort of updates do we have in the context of a social media platform? We can easily disable editing of posts, of course, but you have to be able to support deletes. A user may post about Broccoli, which is verboten, and we have to be able to remove that. And of course, users will want to be able to delete their own posts, lest their salad tendencies come back to haunt them in the future. Another reason we need to handle updates is the interaction counters (likes, replies, and so on) that we show on each post.

We keep track of the interactions on the post and we need to update them as they change. In fact, in many cases we want to update them “live”. How are we going to handle this? I discussed the caching aspects of this earlier, but the general idea is that we have two caching layers in place.



A user getting data will first hit the CDN, which may cache the data, and then the API, which will get the data from the backend. The API endpoint will query its own local state, and other pieces of the puzzle are responsible for the data distribution.

When changes happen, we need to push them through each of those layers.

Each change means that we have to make a policy decision. For example, deleting a post means that we need to push an update to all the data centers. The same is relevant for updates to the post itself. In general, updating the content of the post and updating its view counts aren’t really that different. We’ll usually want to avoid editing the post content for non technical reasons, not for lack of ability to do so.

Another important aspect to take into account is latency. Depending on the interaction model with the CDN, we likely have it set up to cache data based on duration, so API requests are cached for a period of a few seconds to a minute or two. That is usually good enough to reduce the load on our servers and still retain a good enough level of freshness.

Another advantage that we can use is the fact that when we get to high numbers, we can reduce the update rate. Consider the rounded counters that are actually displayed once the numbers get big (“2.3M likes” rather than an exact count):

We now need to update the post only once every 100,000 likes or shares, or once every 10,000 replies. Depending on the rate of change, we can skip even that if it happened recently enough. That is the kind of thing that can reduce the load curve significantly.

There is also the need to consider live updates. Typically, that means that we’ll have the client connected via web socket to a server, and we need to be able to tell it that a post has been updated. We can do that using the same cache update mechanism. The update cache command is placed on a queue and the web socket servers process messages from there. A client will indicate what post ids it is interested in, and the web socket server will notify it about changes to them.
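
A sketch of the web socket server side (locking and error handling elided, names invented):

```csharp
// Post id -> the clients that asked to be notified about it
private readonly ConcurrentDictionary<long, HashSet<WebSocket>> subscriptions = new();

public void Subscribe(WebSocket client, long[] postIds)
{
    foreach (var id in postIds)
        subscriptions.GetOrAdd(id, _ => new HashSet<WebSocket>()).Add(client);
}

// Called for each "update cache" command pulled off the queue
public async Task OnPostUpdated(long postId)
{
    if (!subscriptions.TryGetValue(postId, out var clients))
        return;
    var message = Encoding.UTF8.GetBytes($"{{\"updated\":{postId}}}");
    foreach (var client in clients)
    {
        await client.SendAsync(new ArraySegment<byte>(message),
            WebSocketMessageType.Text, true, CancellationToken.None);
    }
}
```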

The idea is that we can completely separate the different pieces in the system. We have the posts storage and the timeline as one system and the live updates as a separate system. There is some complexity here about cache usage, but it is actually better to assume that stuff will not work than to try for cache coherency.

For example, a client may get an update that a particular post was changed. When it queries for the new post details, it gets a notice that the post was not modified. This is a classic race condition, which can cause a lot of trouble for the backend people to eradicate. If we don’t try, we can simply state that on the client side, getting a not modified response after an update notification is not an error. Instead, the client schedules a query for the post again after the cache period has elapsed.

A core design tenet in the system is to assume failure and timing issues, to avoid having to force a unified view of the system, because that is hard. Punting the problem even just a little bit allows us a much better architecture.

time to read 12 min | 2245 words

In the series of posts so far, when discussing reads, I punted on the part where we know what to read. I mentioned that we get a whole batch of post ids from somewhere and discussed how reading them is going to work, but that was it. Now I want to talk about exactly how this works.

The timeline concept is a fairly simple one. We have a list of posts ids that the user goes through. As they are browsing through the list, items are added at the top, etc. This is basically the Twitter model. Another alternative is that as you scroll, if there are new items in the list, you are shown them before older values (the Facebook approach), but that is more complex.

Conceptually, the timeline is as simple as:
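
A minimal sketch, nothing more than a list of ids:

```csharp
public class Timeline
{
    public List<long> PostIds = new();

    public void Append(long postId) => PostIds.Add(postId);

    // A reader records where they started and scrolls backward from there
    public IEnumerable<long> ReadBackwardFrom(int position)
    {
        for (var i = position - 1; i >= 0; i--)
            yield return PostIds[i];
    }
}
```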

In other words, we just have a list of post ids, we add items at the end and as we scroll we keep track of where we started. That is sufficient to get us pretty much all the features that we want, surprisingly enough.

When you go to the home page and look at your timeline, you’ll typically start with whatever the latest value there is; we’ll record the last position we saw and then start scrolling backward in time. In other words, if we have 10,000 items in the timeline, we’ll record that the position we started at was 10,000 and then start going back toward zero. If there are new items, the size of the list will increase and we can jump back to the top, etc.

That is simple enough, but how does this actually help us? It may be good if I wanted to see the public timeline of the entire network, but what about the actual features? I don’t care that a restaurant in Prague is now offering discounted deliveries, for example. I care about the accounts that I follow.

The idea is that we don’t have just a single such timeline, but many. In fact, pretty much all operations in the social platform can be represented using the timeline abstraction.

Let’s consider typical usage of an account. I’m adding posts, but I also want to be able to see people talking to me or about tags that I follow. How is that going to work?

Well, I’m actually going to have two timelines:

  • Personal Timeline – where we’ll add all the posts from the user, and maybe posts that mention / reply to them, etc.
  • Public Timeline – where we’ll add posts from users that you follow, mentions, replies to discussions the user took part in, etc.

In both cases, the behavior of the system is identical. We simply go through the list. If you’ll recall, I left a lot unsaid when I discussed writing posts. In particular, how do I publish them to interested parties? This is where we start to apply policies. Part of the process of adding a post is to figure out what timelines it should go to.

By moving most of the cost to the write side, we drastically reduce the overall complexity. Furthermore, it also makes a lot more sense, given that most posts aren’t going to have a wide blast radius.

When posting a message, we need to consider the following:

  • Is this a high impact account? (Let’s say, > 50,000 followers) If so, we’ll have special behaviors.
  • Who is following this user?
  • Is this a reply to another post?
  • Are there mentions in this post? If so, we need to apply the logic based on the mention policies.

As you can imagine, this is quite involved, but in general, it will work something like this:
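
A sketch of the fan-out logic (the policy helpers are placeholders):

```csharp
public async Task PublishPost(Post post)
{
    var targets = new List<string> { "timelines/personal/" + post.AuthorId };

    // High impact accounts are not fanned out to their followers on write
    if (!await IsHighImpactAccount(post.AuthorId))
    {
        foreach (var follower in await GetFollowers(post.AuthorId))
            targets.Add("timelines/public/" + follower);
    }

    // A reply also goes on the timeline of the post it replies to
    if (post.ReplyTo != null)
        targets.Add("timelines/posts/" + post.ReplyTo);

    // Mentions are routed according to the mentioned user's policies
    foreach (var mention in post.Mentions)
        targets.AddRange(await ApplyMentionPolicies(mention, post));

    foreach (var timeline in targets)
        await AppendToTimeline(timeline, post.Id);
}
```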

The key here is that the whole thing works by selecting what we’ll publish to. Furthermore, you can see that a user has multiple timelines, and in the case of a mention, we can apply additional policies to see how the post gets routed. This is complex and often changing, but it also happens a lot less often than reads, so it is a net benefit to move all the costs to the write side.

Another thing to notice is how we handle a reply. A reply is just appended to the timeline of the post it replies to. In other words, it is timelines all the way down. We want a single simple abstraction to handle as much of the system as we can. In this case, we need to handle replies on a post, and that can be anything from very few to hundreds of thousands. By generating a timeline for the post as well, we reuse all the same behaviors and it just works.

As for the timeline itself? It is merely a queue of post ids, and it allows you to set a position in it in an efficient manner. The list of post ids above is how it works conceptually, but we have to think about the numbers here. How big can a timeline get?

  • The personal timeline of a user is limited to how many posts they can make. In general, even very heavy users will not top a few hundred posts a day and low thousands a month. That means that we have a reasonable upper bound on how big the personal timeline can grow. Ten years of posting 5,000 posts a month will get you over half a million posts, but I would assume that is the top rate for anything that isn’t an automated system.
  • Your Public Timeline is impacted by how many people you follow and how prolific they are. There is a natural limit to how many people an account can follow, so there is a bound here, but assuming that you follow 1,000 accounts that each post 1,000 posts a month, that adds up to a million posts a month. Over a ten year span, that would be 120 million posts. That said, we’ll discuss other properties of the Public Timeline below.
  • A post’s timeline is all the replies that were made to it. Most posts have very few replies, but some will garner a lot. It took me a minute to find a tweet on Twitter that had close to 400,000 replies, for example.

So a timeline may be big, potentially very big. However, there is an interesting issue here, how much do we actually need to keep?

The purpose of the Public Timeline, for example, is to show you the front page of the site. How far back do we need to keep data? Is there a reason to keep your timeline from three years ago? The answer is probably no. We can keep the Public Timeline at a certain size and likely benefit from a lot of space savings. On the other hand, the replies to a post can be quite interesting, and while they can grow very big, it probably makes sense to never trim them.

So we have the concept of a timeline, but what is it actually going to be?

In terms of REST API, we are going to have the following endpoints:

  • GET /timelines?id=1351081943163123854
  • GET /timelines/sections/F4BE2048BF51F3DCC69EA4CA4ED08F12A36BD6524C9F12018BA0CE6F7C076BB2

In other words, we can access the timeline or a section in that timeline. I would rather show the output and then discuss what it all means. The first endpoint gives us the timeline itself, which looks like this:
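
Something like the following (the values here are illustrative):

```json
{
    "id": "1351081943163123854",
    "posts": [
        1356028078939799552,
        1356021191970848768,
        1356019024286014464
    ],
    "sections": [
        "F4BE2048BF51F3DCC69EA4CA4ED08F12A36BD6524C9F12018BA0CE6F7C076BB2",
        "19A32C4B5E6F708192A3B4C5D6E7F8091A2B3C4D5E6F708192A3B4C5D6E7F801"
    ]
}
```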

There are a few things to note here. When we ask for a timeline, we get the most recent posts in the timeline, as well as the most recent sections. The way a timeline works, you can always append post ids to it, which works great, except that at some point the sheer size involved starts to be problematic.

If we consider a big timeline, one with 400,000 posts in it, that comes to about 3.2 MB used just to store the post ids (8 bytes each). In practice, due to concurrency and distribution concerns, we can’t have an actual single list of post ids, so we need better management. Another factor is that you very rarely need or want the entire timeline; you want to start from the top and work your way down.

We can handle that easily enough using a two stage approach. First, all the new post ids appended to a timeline are written in a “loose” form, each with its own entry. Once we hit a certain limit (128, for example), we know that this is likely to grow bigger. We then grab the loose post ids in the timeline and gather them into a section. A section is an independently addressed part of the timeline. The idea is that we gather all of the post ids currently loose in the timeline, write them into a single object and compress that. Then we use the hash of the resulting object as the key in an object store.
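
Here is a sketch of that compaction step (the object store API is a placeholder):

```csharp
public async Task<string> CompactLoosePosts(string timelineId)
{
    // Gather the ids that are currently loose on the timeline
    List<long> loose = await GetLoosePostIds(timelineId);

    // Serialize and compress them into a single immutable blob
    using var output = new MemoryStream();
    using (var brotli = new BrotliStream(output, CompressionLevel.Optimal))
    {
        foreach (var id in loose)
            brotli.Write(BitConverter.GetBytes(id));
    }
    var compressed = output.ToArray();

    // The content hash is the section's address in the object store
    var hash = Convert.ToHexString(SHA256.HashData(compressed));
    await objectStore.Put("timelines/sections/" + hash, compressed);
    await ReplaceLooseWithSection(timelineId, hash, loose);
    return hash;
}
```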

Side note: timeline sections are immutable. Once a section is created, it cannot change. You can add additional filtering on the timeline on read, on the other hand. The timeline also has to handle the case where posts in it have been deleted, since we cannot modify the sections. For ease of implementation, we’ll also allow duplication in the post ids. Clients are expected to handle and ignore duplicate post ids that happen within a certain time range.

The reason that the timeline section is compressed is to reduce the size, obviously. In my testing, I was able to get a 65% reduction in size without taking any special efforts. Throwing the compressed data into object storage (S3 and the like) also means that it is much easier to scale reads on them. If we have a user who is very popular, we can move that timeline into compressed sections faster to reduce the load. This design explicitly acknowledges the problems with distributed systems and concurrency. It is possible that a compressed section will have an id that also appears in the loose portion of the timeline. The responsibility for handling such a scenario is on the client code, which can do so far more easily than the server side.


Given a post id size of 8 bytes, and assuming that we can compress it by 65% (my naïve tests using Brotli & GZip say yes, and we can probably do better than that), every thousand post ids or so we can generate a new section (which would end up being about 2KB in size). Even a very big timeline with hundreds of thousands of entries would end up with just a few hundred sections at the top.

The entire mechanism is very limited, quite intentionally. The external operations we allow on a timeline are append and get, with the client expected to understand the manner in which they go deeper into the timeline. The limitations and expectations from the client (allowing duplicate post ids, handling post ids that point to deleted posts, etc.) are all there to make it easy to scale the system out.

Consider a typical use case: I go into a popular account and look at their posts. Effectively, I’m browsing their personal timeline. My interactions with the server go like this:

  • Get a list of the post ids in the timeline. The first step is: GET /timelines?id=1351081943163123854
    • This gets me the list of loose post ids and the recent segments.
    • Notice that this API call is open for caching as well, so we can get the scaling benefit of that as well.
  • Get the actual posts, which I can do with the batch post read API that we discussed earlier.

In many cases, the cost of getting the timeline for the first time will be amortized over the reading time of many posts. The bulk read API gets me 128 posts at a time, so as the user is reading, I can get the next batch ready and give them the next part immediately.

Once I’m done with the loose posts, I can go into the compressed sections and do the same there. If each section holds about 1,000 post ids, that will be sufficient for quite some time. And because I’m driving this from the client, it is very easy for me to scale. Throwing the data into an object store like S3 means that I can get CDN support easily, and my scaling issue is now “serve a lot of small files”, which is a very well understood problem.

I still have to take permissions into account, but that is already something that we handle in the batch read API. Notice that for the common case of public posts, pretty much the whole thing has caching and distribution baked in, and the amount of work that we can let the rest of the system handle is very high.

Hot spots in the system are going to be handled by the infrastructure, not by our own code, big machines or clever algorithms. They are going to be handled by the architecture of the system making it simple to manage them, giving plenty of room for caching and CDN to take charge and reduce our costs.

That is, after all, the whole point of this series of posts.

time to read 9 min | 1673 words

In my previous post we looked at how we process a request for a batch of posts and get their results. The code made no assumptions about where it is running and, aside from specifying whether or not it is okay to allow caching, did no such work.

Caching is important; it matters a lot for performance. To the point where if you aren’t using caching, you are past willful neglect and in the territory of intentional malpractice. The difference can be between needing 18,500 cores to serve a website and needing less than 400 to serve a much busier website. Except that the difference will likely be more pronounced.

Because it is so important, we need to take it into account at the architecture level. Another aspect we have to consider is data distribution. Assuming we want to build a global social media platform, we need to access it from multiple locations. Which means, in turn, that we have to consider the fallacies of distributed computing in our system. Locality of reference is another key factor, which means that you have to consider the flow of data in the system.

Let’s assume that we have several data centers spread around the world, Brisbane and Odessa among them.

We are using geo routing and the relevant infrastructure to make sure that you’ll always hit the nearest data center to you.

Let’s say that we have Mr. Beat on our social platform, who is very popular and likes to post controversial messages, such as the need to abolish peppers from your menus. Mr. Beat is located in Australia, so when he posts yet another “peppers have no place in the kitchen” post, the data center in Brisbane is going to be the one to field the request.

A system like that does best if we can avoid any and all required coordination between the different data centers; as such, we are going to use gossip to share the results among all the data centers. In other words, the post will be written in the Brisbane data center and then replicated to the rest of the data centers. There are several ways that we can implement such a feature.

The simplest way is to replicate all the data to all the data centers. This way, we can always access it from the local store. However, that presents two challenges:

  • First, there is latency involved in replicating information across the globe. We may get a request in the Odessa data center for a post originating in Brisbane. If the network devils decided to have a party, we may not have that particular post yet in Odessa. What do you do then? This is where the format of the post id comes into play. If we can’t find the post in our local storage, we can figure out who owns it (based on the machine id segment) and go ask the owning data center.
  • The second problem is that in many cases, the data is purely local. For example, consider a purely local Twitter account. For the most part, everyone following it is going to be located nearby. I assume you may have a very small minority of followers who left the area and are still interested in following what is going on, but for the most part, this is not that common. That means that replicating all the information to all the data centers is likely a waste.

Given these two facts together, we are actually better off using a different model for data distribution and caching. Recall the format of the post id: among other parts, it embeds a machine segment that identifies where the post was created.

The machine segment in the post id (10 bits long) gives us a good indication of where the post was created. Originally this is used only for generating unique ids, but we can make far greater use of it. We decide that the data center that created a post is also the one that owns it. In other words, Mr. Beat’s posts are “owned” by the Brisbane data center. The actual posts are held in a key/value store, but there is a difference in how we do lookup by id, based on ownership.

Here is the relevant code:
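
A sketch of the lookup (the key/value API here is a placeholder):

```csharp
public async Task<Post> GetPost(long postId)
{
    // Check the local key/value store first (cached nulls included)
    var (found, post) = await localKv.TryGet(postId);
    if (found)
        return post;

    var owner = DataCenterFromMachineSegment(postId);
    if (owner == currentDataCenter)
        return null; // we own this id: the post doesn't exist or was removed

    // Fetch from the owning data center and cache it locally, flagged
    // as remote so it can be purged early; a null result is cached too
    post = await FetchFromDataCenter(owner, postId);
    await localKv.Put(postId, post, remote: true);
    return post;
}
```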

The idea is that we first check the key/value store in the current data center for the post id. If it isn’t found, we check if we are the owner of the post. If so, the post doesn’t exist or was removed. If it belongs to another data center, we’ll go and fetch it from there. Note that we’ll record a null if needed, so we won’t go and fetch a missing value each time it is requested.

In other words, the first time that an item is requested from the owner data center, we’ll place it in our own key/value store for next time. We do so with an indication to the key/value store that this is a remote value. Remote values may be purged early, put on a least frequently viewed rotation, etc.

There are other things that we need to deal with, of course. Concurrency, for example: if we have multiple concurrent requests for the same missing id, we’ll have multiple round trips between data centers to fetch the same value. I don’t think that this is a problem. It is likely to only happen the first time, and then it is cached. We might want to monitor how often remote values are requested and only fetch them after a certain number of requests, but in all honesty, it probably doesn’t matter.

The key/value store architecture is already likely to cause us to use both disk and memory. We can take advantage of the fact that a remote key isn’t important locally and not persist it to disk right away, or not in a durable format. When we need to evict the value from memory, we can check whether it had enough hits to warrant writing it locally or not.

The most complex issue, however, is related to the cache itself. We have a strong ownership model, in which the data center that created a post is its owner. What happens when we need to update a post?

Twitter, for example, doesn’t have an Edit feature. That is a great reduction of the complexity we have to deal with, but it isn’t all that simple. An update to a post can also be a delete. For example, let’s say that Mr. Beat posted: “eating steamed Br0ccoli”. Such a violation of community standards cannot stand, even though Mr. Beat cleverly disguised his broccoli tendencies by typo-ing the forbidden term. Mr. Beat is very popular and his posts have likely spread across many data centers. An admin marking the post as deleted also has to deal with the possibility that the post is cached in other locations as well.

We can try to keep track of this, but to be fair, it is easier to simply queue a delete command on all the data centers except the owner. That will ensure that they will remove the cached version and have to re-read it from the owner data center.

Everything that I described so far was about behavior, but we also ought to talk about policies. As part of the work we do when writing a post, we can apply all sorts of interesting policies. For example, we may know that Mr. Beat is popular globally and, as part of writing a post of his, preemptively send that post to all the other data centers. If we have an update to a post, instead of sending a delete command to the rest of the data centers and letting it refresh automatically, we can send the updated post content immediately.

I intentionally don’t want to dig too deeply into those policies. They are important, but they aren’t on the same level as the infrastructure I describe here. They are the cherry on top: if you like such a thing, it can take something good and make it great. But given that those are policies that can be applied on a per item basis and modified as you go along, there isn’t a reason to go there yet.

Finally, there is a last aspect to discuss: Expiry. 

In most social media, the now is important beyond all else. In other words, you are very unlikely to be seeing posts from two years ago and if you are, it matters a lot less if you have to deal with slightly higher latency. Expiring remote content that is over 3 months old, for example, and not placing that in our local key/value at all can be a great way to handle long tail issues. For that matter, given that old content is rarely accessed, we can also optimize our storage by compressing old posts instead of holding on to them directly.

And after all of this discussion, I want to point out that you are also likely to want another layer of caching in place. The API calls you make may in many cases be good candidates for at least short term caching. In other words, if you put them behind something like Cloudflare, given that we explicitly state what post ids we want, we can set a cache duration of 1 – 3 minutes without needing to worry too much about updates. That can massively reduce the number of requests that we actually have to handle, and it costs a lot less. Under what scenarios would that be useful?

Consider people going to view a popular user, such as Mr. Beat’s page. The list of posts there is going to be the same for anyone, and even a short duration on the cache would massively help our load.

As you can see, the design of the system assumes caching and actively works to make it possible for you to utilize the cache at multiple levels.
