Ayende @ Rahien

Oren Eini aka Ayende Rahien CEO of Hibernating Rhinos LTD, which develops RavenDB, a NoSQL Open Source Document Database.

Get in touch with me:

oren@ravendb.net

+972 52-548-6969

Posts: 7,252 | Comments: 50,433

Privacy Policy Terms
filter by tags archive
time to read 12 min | 2245 words

In the series of posts so far, when discussing reads, I punted the part where we know what to read. I mentioned that we get a whole batch of post ids from some where and discussed how that is going to work, but that was it. Now I want to talk exactly on how this works.

The timeline concept is a fairly simple one. We have a list of posts ids that the user goes through. As they are browsing through the list, items are added at the top, etc. This is basically the Twitter model. Another alternative is that as you scroll, if there are new items in the list, you are shown them before older values (the Facebook approach), but that is more complex.

Conceptually, the timeline is as simple as:

In other words, we just have a list of post ids, we add items at the end and as we scroll we keep track of where we started. That is sufficient to get us pretty much all the features that we want, surprisingly enough.

When you go to the home page and look at your timeline, you’ll typically start with whatever the latest value there, we’ll record the last position we saw and then start scrolling backward in time. In other words, if we have 10,000 items in the timeline, we’ll record that the position we started at was 10,000 and then start going back toward zero. If there are new items, the size of the list will increase and we can jump back to the top, etc.

That is simple enough, but how does this actually help us? That may be good if I wanted to see the public timeline of the entire network, but what about the actual features. I don’t care that a restraint in Prague is now offering discounted deliveries, for example. I care about the accounts that I follow.

The idea is that we don’t have just a single such timeline, but many. In fact, pretty much all operations in the social platform can be represented using the timeline abstraction.

Let’s consider typical usage of an account. I’m adding posts, but I also want to be able to see people talking to me or about tags that I follow. How is that going to work?

Well, I’m actually going to have two timelines:

  • Public Timeline – where we’ll add all the posts from the user, and maybe posts that mention / reply, etc.
  • Private Timeline – where we’ll add posts from users that you follow, mentions, replies to discussion the user took part of, etc.

In both cases, the behavior of the system is identical. We simply go through the list.  If you’ll recall, I left a lot unsaid when I discussed writing posts. In particular, how do I publish them to interested parties. This is where we start to apply policies. Part of the process of adding a post is to figure out what timelines it should go to.

By moving most of the cost to the write side, we drastically reduce the overall complexity. Furthermore, it also make a lot more sense, given that most posts aren’t going to have a wide blast radius.

When posting a message, we need to consider the following:

  • Is this a high impact account? (Let’s say, > 50,000 followers) If so, we’ll have special behaviors.
  • Who is following this user?
  • Is this a reply to another post?
  • Are there mentions on this post? If so, need to apply the logic based on the mention policies.

As you can imagine, this is quite involved, but in general, the way it will work is something like this:

The key here is the whole manner in which this works is done via selecting what we’ll publish to. Further more, you can see that a user have multiple timelines, and in the case of a mention, we can apply additional policies to see how the post gets routed. This is complex and often changing, but it also happens a lot less often than reads. So it is a net benefit to move all the costs to the write side.

Another thing to notice in this case is how we handle a reply. A reply it just appended to the timeline of the post. In other words, it is timelines all the way down. We want to have a single simple abstraction to handle as much of the system as we want. In this case, we need to handle replies on a post and that can be anything from very few to hundreds of thousands. By generating a timeline for the post as well, we can reuse all the same behaviors and it just works.

As for the timeline itself? It is merely a queue of post ids, and it allows you to set a position in it in an efficient manner. The list of post ids above is how it works conceptually, but we have to think about the numbers here. How big can a timeline get?

  • The personal timeline of a user is limited to how many posts they can make. In general, even very heavy users will not top a few hundreds a day and low thousands a month. That means that we have a good reasonable upper bound to how big the personal timeline can grow. Ten years of posting 5,000 posts a month will get you over half a million, but that I would assume be the top rate for anything that isn’t an automated system.
  • Your Public Timeline is impact by how many people you follow and how prolific they are. There is a natural limit to how many people an account can follow, so there is a bound here, but assuming that you follow 1000 accounts that all post 1000 posts a month, that adds up to a million posts a month. Over a ten year span, that would be 120 million posts. That said, we’ll discuss other properties of the public timeline below.
  • Post’s timeline is all the replies that were made to the post. Most posts have very few replies, but some will garner a lot. It took me a minute to find a Tweet on Twitter that had close to 400,000 replies, for example.

So a timeline may be big, potentially very big. However, there is an interesting issue here, how much do we actually need to keep?

The purpose of the Public Timeline, for example, is to show you the front page of the site, how much data back do we need to keep? Is there a reason to keep your timeline from three years ago? The answer is probably no. We can keep the public timeline at a certain size and likely benefit from a lot of space savings. On the other hand, the replies for a post can be quite interesting, and while they can grow very big, it probably make sense to never trim them.

So we have the concept of a timeline, but what is it actually going to be?

In terms of REST API, we are going to have the following endpoints:

  • GET /timelines?id=1351081943163123854
    GET /timelines/sections/F4BE2048BF51F3DCC69EA4CA4ED08F12A36BD6524C9F12018BA0CE6F7C076BB2

In other words, we can access the timeline or a section in that timeline. I would rather show the output and then discuss what it all means. The first endpoint gives us the timeline itself, and looks like this:

There are a few things to note here. When we ask for a timeline, we get the most recent posts in the timeline, as well as the past few sections. The way a timeline works, you can always append posts ids to it, which works great, except that at some point the sheer size that is involved is starting to be problematic.

If we consider a big timeline, one with 400,000 posts in it, that comes to about 3.2 MB used, just to store the post ids (8 bytes each). In practice, due to concurrency and distribution concerns, we can’t have an actual list of post ids, so we need some better management. Another factor is that you very rarely need or want to get the entire timeline, you want to start from the top and work your way down.

We can handle that easily enough using two stage approach. First, all the new post ids appended to a timeline are written in a “loose” form. Each one with each own entry. Once we hit a certain limit (128, for example), we know that this is likely to grow bigger. We can grab the loose post ids in the timeline and gather them into a section. A section is an independently addressed part of the timeline. The idea is that we gather all of the post ids currently loose in the timeline, write them into a single object and compress that. Then we use the hash of the resulting object to as the key to an object store.

Side note, timelines are immutable. Once the section is created, it cannot change. You can add additional filtering on the timeline on read, on the other hand. The timeline also should handle the case where posts in the timeline has been deleted, since we aren’t cannot modify it. For ease of implementation, we’ll also allow duplication in the posts ids. Clients are expected to handle and ignore duplicate post ids that happen within a certain time range.

The reason that the timeline section is compressed is to reduce the size, obviously. In my testing, I was able to get 65% reduction in size without taking any special efforts. Throwing the compressed data into object storage (S3 and the like) also means that it is much easier to scale reads on them. If we have a user who is very popular, we can move that timeline to a compression section faster to reduce load. This design explicitly acknowledge the problems with distributed systems and concurrency. It is possible that a compressed section will have an id that also appears in the loose portion of the timeline. The responsibility to handle such a scenario is on the client code, which is able to do so far more easily than the server side portion.

After compressing the loose posts in the timeline, we record the new section hash and allow clients to access it. It might be easier to see how that would work in the following image:

image

Given a post id size of 8 bytes, and assuming that we can compress it by 65% (my naïve tests using Brotli & GZip says yes, can probably do better than that) we can state that every thousand post ids or so we can generate a new section (meaning that it would be about 2KB in size, in the end). Even a very big timeline with hundreds of thousands of entries would end up with just a few hundreds of sections at the top.

The entire mechanism is very limited, quite intentionally. The external operations we allow on a timeline are append and get, with the client expected to understand the manner in which they are going to go deeper into the timeline. The limitation and expectation from the client (like allowing duplicate post ids, handling post ids that point to deleted posts, etc) are all there to make it easy to handle scaling out the system.

Consider a typical use case, I go into a popular account and look at their posts. Effectively, I’m browsing their public timeline. My interactions with the server goes like this:

  • Get a list of the post ids in the timeline. The first step is: GET /timelines?id=1351081943163123854
    • This gets me the list of loose post ids and the recent segments.
    • Notice that this API call is open for caching as well, so we can get the scaling benefit of that as well.
  • Get the actual posts, which I can do with the batch post read API that we discussed earlier.

In many cases, the cost of getting the timeline for the first time will be amortize over the reading time of many posts. The bulk read API gets me 128 posts at a time, so as the user is reading, I can get the next batch ready and give them the next part immediately.

Once I’m done with the loose posts, I can go into the compressed sections and do the same there. If each section has about 1000 posts ids, that will be sufficient for quite some time. And because I’m driving this from the client, it is very easy for me to scale. Throwing the data into object store like S3 means that I can get both CDN support easily and my scaling issue is now: “serve a lot of small files”, which is a very well understood problem.

I still have to take into account permissions, but that is already something that we handle in the batch read API. Notice that for the common case of public posts, pretty much the whole thing has caching and distribution baked and the amount of work that we can let the rest of the system handle is very high.

Hot spots in the system are going to be handled by the infrastructure, not by our own code, big machines or clever algorithms. They are going to be handled by the architecture of the system making it simple to manage them, giving plenty of room for caching and CDN to take charge and reduce our costs.

That is, after all, the whole point of this series of posts.

time to read 9 min | 1673 words

In my previous post we looked at the process in which we process a request for a batch of posts and get their results. The code made no assumptions about where it is running and aside from specifying whatever it is okay or not to allow caching, did no such work.

Caching is important, it matters a lot for performance. To the point where if you aren’t using caching, you are past willful neglect and in the territory of intentional malpractice. The difference can be between needing 18,500 cores to serve a website and needing less than 400  to serve a much busier website. Except that the difference will likely be more pronounced.

Because it is so important, we need to take it into account at the architecture level. Another aspect we have to consider is the data distribution. Assuming we want to build a global social media platform, that means having to access it from multiple locations. Which mean, in turn, that we have to consider the fallacies of distributed computing in our system. Locality of reference is another key factor that you have to take into account. Which means that you have to consider the flow of data in the system.

Let’s assume that we have the following datacenters around the world:

image

We are using geo routing and the relevant infrastructure to make sure that you’ll always hit the nearest data center to you.

Let’s say that we have a Mr. Beat in our social platform, who is very popular and like to post controversial messages such as the need to abolish peppers from your menus. Mr. Beat is located in Australia, so when he is posting yet another “peppers have no place in the kitchen” post, the data center in Brisbane is going to be the one to field the request.

A system like that would do best if we can avoid any and all required coordination between the different data centers, as such, we are going to be using gossip to share the results among all the data centers. In other words, the post will be written in the Brisbane data center and then replicated to the rest of the data centers. There are several ways that we can implement such a feature.

The simplest way to replicate all the data to all the data centers. This way, we can always access it from the local store. However, that present two challenges:

  • First, there is latency involved with replicating information across the glove. We may get a request for a post in the Odessa data center for a post originating in Brisbane. If the network devils decided to have a party, we may not have that particular post yet in Odessa. What do you do then? This is where the format of the post id come into play. If we can’t find the post in our local storage, we can figure out who owns that (based on the machine id segment) and go ask the owning data center.
  • The second problem is that in many cases, the data is purely local. For example, consider this Twitter account. For the most part, everyone that is following that is going to be located nearby. I assume you may have a very small minority of followers who left the area who are still interested in following up on what is going on, but for the most part, this is not that common. That means that replicating all the information to all the data centers is likely a waste.

Given these two facts together, we are actually better off using a different model for data distribution and caching. The post id is generated using the following mechanism:

image

The machine segment in the post id (10 bits long) gives us a good indication of where that post was created. This is originally used only for the purpose of generating unique ids, but we can make far greater usage of this. We decide that the data center that created the post is also the one that owns it. In other words, Mr. Beat’s posts are “owned” by the Brisbane data center. The actual posts are held in a key/value store, but there is a difference in how we do lookup by id based on the ownership.

Here is the relevant code:

The idea is that we first check the key/value in the current data center for the post id. If it isn’t found, we check if we are the owner of the post. If so, it doesn’t exists or was removed. If it belongs to another data center, we’ll go and fetch it from there. Note that we’ll record a null if needed, so we’ll not need to go and fetch a missing value each time it is requested.

In other words, the first time that an item is requested from the owner data center, we’ll place it in our own key/value for next time. We do so with an indication to the key/value that this is a remote value. Remote values may be purge early or be put on a least frequently viewed rotation, etc.

There are other things that we need to deal with, of course. Concurrency, for example, if we have multiple concurrent requests to the same missing id. We’ll have multiple chats between data centers to fetch the same value. I don’t think that this is a problem. That is likely to only happen the first time, and then it is cached. We might want to monitor how often remote values are requested and only get them after a certain number of requests, but in all honesty, it probably doesn’t matter.

The key/value store architecture is already likely to cause us to use both disk and memory. We can take advantage of the fact that the remote key isn’t important locally and not persist it to disk right away, or not in a durable format. When we need to remove the value from memory, we can see if it had enough hits to warrant writing locally or not.

The most complex issue, however, is related to the cache itself. We have a strong ownership model, in which the data center that created a post is its owner. What happens when we need to update a post?

Twitter, for example, doesn’t have an Edit feature. That is a great reduction of the complexity we have to deal with, but it isn’t all that simple. An update to the post can also be a delete. For example, let’s say that Mr. Beat posted: “eating steamed Br0ccoli”. Such a violation of community standards cannot stand. Even though Mr. Beat cleverly disguised his broccoli tendencies by typo-ing the forbidden term. Mr. Beat is very popular and his posts have likely spread across many data centers. An admin marking the post as deleted also has to deal with the possibility that the post is located on other locations as well.

We can try to keep track of this, but to be fair, it is easier to simply queue a delete command on all the data centers except the owner. That will ensure that they will remove the cached version and have to re-read it from the owner data center.

Everything that I describe so far was about behavior, but we also ought to talk about policies. As part of the work we do when writing a post, we can apply all sort of interesting policies. For example, we may know that Mr. Beat is popular globally and as part of writing a post from him preemptively send that post to all the other data centers. If we have an update to a post, instead of sending a delete command to the rest of the data centers and let it refresh automatically, we can send the updated post content immediately.

I intentionally don’t want to dig too deeply into those policies, because they are important, but they aren’t on the same level as the infrastructure I describe here is. Those are like the cherry on top, if you like such a thing, it can take something good and make it great. But given that those are policies that can be applied on a per item basis and modified as you go along, there isn’t a reason to start going there yet.

Finally, there is a last aspect to discuss: Expiry. 

In most social media, the now is important beyond all else. In other words, you are very unlikely to be seeing posts from two years ago and if you are, it matters a lot less if you have to deal with slightly higher latency. Expiring remote content that is over 3 months old, for example, and not placing that in our local key/value at all can be a great way to handle long tail issues. For that matter, given that old content is rarely accessed, we can also optimize our storage by compressing old posts instead of holding on to them directly.

And after all of this discussion, I wanted to point out that you are also likely to want to have another layer of caching in place. The API calls you make in many cases may be good candidate for at least short term caching. In other words, if you put them behind something like Cloudflare, given that we explicitly state what post ids we want, we can set a cache duration of 1 – 3 minutes without needing to worry too much about updates. That can massively reduce the number of requests that we actually have to handle, and it costs a lot less. Under what scenarios would that be useful?

Consider people going to view a popular user, such as Mr. Beat’s page. The list of posts there is going to be the same for anyone, and even a short duration on the cache would massively help our load.

As you can see, the design of the system assume caching and actively work to make it possible for you to utilize the cache at multiple levels.

time to read 4 min | 769 words

So far in this series of posts I looked into how we write posts. The media goes to S3 compatible API and the posts themselves will go to  a key/value store. Reading them back, on the other hand, isn’t that simple. For the media, I”m going to assume that the S3 is connected to CDN and that is handled, but I want to focus on the manner in which we deal with reading posts. In particular, I’m not talking here about how we can display the timeline for a user. That is going to be the subject on another post, right now, I’m assuming that this is handled and talking about the next step. We have a list of post ids that we want to get and we need to manage that.

The endpoint in question would look like this:

GET /api/v1/read?post=1352410889870336005&post=1351081958063951875

The result of this API is a JSON object with the keys as the posts ids and the values as the content of the post.

This simple API is about as simple as you can imagine, but even from this simple scenario you can see a few interesting details:

  • The API is using GET, which means that there is a natural limit to the size of the URL. This is good and by design. We will likely limit this to a maximum of 128 items as a time anyway.
  • The API is inherently about dealing with batches of information.
  • The media is handled separately (generated directly from the client) so we can return far less information.

In many cases, this is going to be a unique set of posts, for example, when you view your timeline, it is likely that you’ll see a unique set of posts. However, in many other cases, you’ll send a request that is similar or identical to what others will use.

When you are looking at a popular thread, for example, you’ll be asking the same posts ids as everyone else, which means that there is a good chance to easily add caching for this via CDN or the like and benefit greatly as a result.

Internally, the implementation of this API is probably just going to issue direct reads by ids to the key/value store and just return the result. There should be a minimal amount of processing involved, usually, except for one factor, authorization.

Assuming that the key/value interface has a get(id) method, the backend code for this API critical API should be something like the code below. Note that this is server side code, I'm not showing any client side code in this series of posts. This is the backend code to handle address the reading of a batch of ids from the client.

The code itself assumes that there is no meaning to doing batch operation on the key/value itself, mind. That isn’t always the case, but I’ll assume that. We issue N async promises to the key/value and wait to get them all back. This assumes that the latency from the API node to the key/value servers is minimal and let us batch a lot of remote calls into near calls.

The vast majority of the function is dedicated to the auth behavior. A post can be marked as public or protected, and if it is the later, we need to ensure that only people that the author of the post follow will be able to see this. You’ll note that I’m doing a lot of stuff in an async manner here. For example, we’ll only issue a single check per post author and we can safely assume that most posts are public anyway. I’m including the “full” code here to give you an indication about the level of complexity that I would expect to see in the API.

You should also note that we indicate whatever we allow to cache the results or not. In the case of a request that include a protected post, we don’t allow it. But for the most part, we can expect to see high percentage of posts that would be only public and can benefit from that.

Because we are running in a distributed system, we also have to take into account all sort of interesting race conditions. For example, you may be trying to read a post that has been removed. We explicitly clear all such null items from the results. Another way to handle that is to replace the content of the post and set a marker flag, but we’ll touch that on another post.

Finally, the code above doesn’t handle caching or distribution. That is going to be handled both above and below this code. I’ll have a dedicated post around that tomorrow.

time to read 6 min | 1076 words

This design deal with creating what is effectively a Twitter clone, seeing how we can do that efficiently. A really nice feature of Twitter is that it has just one type of interaction a tweet. The actual tweet may be a share, a reply, a mention or any number of other things, but those are properties on the post, not a different model entirely. Contrast that with Facebook, on the other hand, where you have Posts and Replies as very distinct items.

As it turns out, that can be utilized quite effectively to build the core foundation of the system with great efficiency. There are two separate sides for a social network, the write and read portions. And the read side is massively bigger than the write side. Twitter currently has about 6,000 tweets a second, for example, but it has 186 million daily users.

We are going to base the architecture on a few assumptions:

  • We are going to favor reads over writes.
  • Reads’ speed should be a priority at all times.
  • It is fine to take some (finite, small) amount of time to show a post to followers.
  • Favor the users’ experience over actual guarantees.

What this means is that when we write a new post, the process is going to be roughly so:

  • Post the new message to a queue and send confirmation to the client.
  • Add the new post to the user’s timeline on the client side directly.
  • Done.

Really important detail here. The process of placing an item on the queue is simple, trivial to scale on an infinite basis and can easily handle huge spikes in load.

On the other hand, the fact that the client’s code will show the user that the message in their timeline is usually sufficient for good user experience.

There is the need to process that and send it to followers ASAP, but that is as soon as possible in people’s terms. In other words, if it takes 30 seconds or two minutes, it isn’t a big deal.

With just those details, we are pretty much done with the write side. We accepted the post, we pretend to the user that we are done processing it, but that is roughly about it. All the rest of the work that we need to do now is to see how we can most easily generate the read portion of things.

There are some other considerations to take into account. We need to deal not just with text but also images and videos. A pretty core part of the infrastructure is going to be an object storage with S3 compatible API. The S3 API has became an industry standard and is pretty widely supported. That help us reduce the dependency issue. If needed, we can MinIO, run on Backblaze, etc.

When a user send a new post, any media elements of the post are stored directly in the S3 storage and then the post itself is written to a queue. Workers will fetch items from the queue and process them. Such processing may entail things like:

  • Stripping EXIF data from images.
  • Re-encoding videos.
  • Analyzing content for language / issues. For example, we never want to have posts about Broccoli, so we can remove / reject them at this stage.

This is where a lot of the business logic will reside, mind. During the write portion, we have time. This is an asynchronous process that we can afford to take some time. Scaling workers to read from a queue is cheap, simple an easy technique, after all. That means that we can afford to shift most of the work required to this part of the process.

For example, maybe a user posted a reply to a message that only allow replies from users mentioned on the post? That sort of thing.

Once processed, we end up with the following architecture:

image

The keys for each post are numeric (this will be important later). We can generate them using the Snowflake method:

image

In other words, we use 40 bits with 16 millisecond precision for the time, 10 bits (1,024) as the machine id and 14 bits (16,384) as the sequence number. The 16 ms precision is already the granularity that you can expect from most computer clocks, so we aren’t actually losing much by giving it up. It does means that we don’t really have to think about it. A single instance can generate 16K ids each 16 ms, or about a million ids per second. More than enough for our needs.

The key about those ids is that they are going to be roughly sorted. That will be very nice to use later on.  When accepting a post, we’ll generate an id for that, and then place that in the key/value store using that id. All other work from that point of is about working with those ids, but we’ll discuss that with more details when we talk about timelines.

For now, I think that this post gives a good (and intentionally partial) view of how I expect to handle a new write:

  • Upload any media to S3 compatible API.
  • Generate a new ID for the post.
  • Run whatever processing you need for the post and the media.
  • Write the post to the key/value store under the relevant id.
    • This include also the appropriate references for the parent post, any associated media, etc.
  • Publish to the appropriate timelines. (I’ll discuss this in a future post)

I’m using the term key/value store here generically, because we’ll do a lookup per id and find the relevant JSON for the post. Such systems can scale pretty much linearly with very little work. Given the fact that we use roughly time based ids and the time base nature of most social interactions, we can usually move most posts to archive mode in a very natural way. But that would be a separate optimization step that I don’t think that would actually be relevant at this point. It is good to have such options, though.

And that is pretty much it for writes. There are probably pieces here that I’m missing, but I expect that they are related to the business processing that you’ll want to do on the posts, not the actual infrastructure. On my next post, I’ll deal with the other side. How do we actually read a post? Given the difference in scale, I think that this is a much more interesting scenario.

time to read 5 min | 832 words

Following the discussion a few days ago, I thought that I would share my high level architecture for building a social media platform in a way that would make sense. In other words, building software that is performant, efficient and not waste multiples of your yearly budget on unnecessary hardware.

It turns out that 12 years ago, I wrote a post that discusses how I would re-architect twitter. At the time, the Fail Whale would make repeated appearances several times a week and Twitter couldn’t really handle its load. I think that a lot of the things that I wrote then are still applicable and would probably allow you to scale your system without breaking the bank. That said, I would like to think that I learned a lot since that time, so it is worth re-visiting the topic.

Let’s outline the scenario, in terms of features, we are talking about basically cloning the key parts of Twitter. Core features include:

  • Tweets
  • Replies
  • Mentions
  • Tags

Such an application does quite a lot a the frontend, which I’m not going to touch. I’m focusing solely on the backend processing here. There are also a lot of other things that we’ll likely need to deal with (metrics, analytics, etc), which are separate and not that interesting. They can be handled via existing analytics platforms and don’t require specialized behavior.

One of the best parts of a social media platform is that by its very nature, it is eventually consistent. It doesn’t matter if I post a tweet and you see it now or in 5 seconds. That gives us a huge amount of flexibility in how we can implement this system efficiently.

Let’s talk about numbers I can easily find:

There are problem with those stats, however. A lot of them are old, some of them are very old, nearly a decade!

Given that I’m writing this blog to myself (and you, my dear reader), I’m going to make some assumptions so we can move forward:

  • 50 million users, but we’ll assume that they are more engaged than the usual group.
  • Out of which 50% (25,000,000) would post on a given month.
  • 80% of the users post < 5 posts a month. That means 20 million users that post very rarely.
  • 20% of the users 5 million or so, post more frequently, with a maximum of around 300 posts a month.
  • 1% of the active users 50,000) posts even more frequently, to the tune of a couple of hundred posts a day.

Checking my math, that means that:

  • 50,000 high active users with 150 posts a day for a total of 225 million posts.
  • 5 million active users with 300 posts a month for another 1.5 billion posts.
  • 20 million other users with 5 posts a month, given us another 100 million posts.

Total month posts in this case, would be:

  • 1.745 billion posts a month.
  • 2.4 million posts an hour.
  • 670 posts a second.

That assume that there is  a constant load on the system, which is probably not correct. For example, the 2016 Super Bowl saw a record of 152,000 tweets per minute with close to 17 million tweets posted during the duration of the game.

What this means is that the load is highly variable.  We may have low hundreds of posts per second to thousands. Note that 152,000 posts per minute are “just” 2,533 posts per second, which is a lot less scary, even if it means the same.

We’ll start by stating that we want to process 2,500 posts per second as the current maximum acceptable target.

One very important factor that we have to understand is what exactly do we mean by “processing” a post. That means recording the act of the post and doing that within an acceptable time frame, we’ll call that 200 ms latency for the 99.99%.

I’m going to focus on text only mode, because for binaries (pictures and movies) the solution is to throw the data on a CDN and link to it, nothing more really needs to be done. Most CDNs will already handle things like re-encoding, formatting, etc, so that isn’t something that you need to worry about to start with.

Now that I have laid the ground works, we can start thinking about how we can actually process this. That is going to be handled in a few separate pieces. First, how do we accept the post and process it and then how do we distribute it to all the followers. I’ll start dissecting those issues in my next post.

time to read 7 min | 1238 words

I run into the following twitter, which list some of Parler’s requirements (using the upper limits specified):

  • Scylla cluster – 40 nodes with 64 cores, 512GB RAM, 14TB NVME drives for each node. For a total of 2,560 cores and 20TB RAM, 560 TB of disks.
  • PostgreSQL cluster – 100 nodes with 96 cores, 768 GB RAM and 4 TB NVME. For a total of 9,600 cores, 75 TB RAM and 400 TB of disks.
  • 400 application instances – 16 cores & 64 GB RAM.

Their internal traffic is about 6.6 GB / sec and their external traffic is about 2 GB / sec. There is a lot of interesting discussion on the twitter feed on these numbers, but I thought that it would be interesting to see how much it would cost to build that.

The 64 Cores & 512 GB RAM can be handled via Gigabyte R282-Z90, the given specs says that a single one would cost 27,000 USD. That means that the Scylla cluster alone would be about a million dollar, but I haven’t even touched on the drives. I couldn’t find a 14 TB NVMe drive in a cursory search, but a 15.36TB drive (Micron 9300 Pro 15.36TB NVMe) costs 2,500 USD per unit. That makes the cost of the hardware alone for the Scylla cluster at 1.15 million USD.

I would expect about twice that much for the PostgreSQL cluster, for what it’s worth. For the application servers, that is a lot less, with about a 4,000 USD cost per instance. That comes to another 1.6 million USD.

Total cost is roughly 5 million USD, and we aren’t talking about the other stuff (power, network, racks, etc). I’m not a hardware guy, mind! I’m probably missing a lot of stuff. At that size, you can likely get volume discounts, but I’m missing that the stuff that I’m missing would cost quite a lot as well. Call it a minimum of 7.5 million USD to setup a data center with those numbers. That does not include labor and licensing costs, I want to add.

Also, note that that kind of capacity is likely something that you can’t just get from anyone but the big cloud providers with a quick turnaround basis. I’ll estimate that this is a multiple months just to order the parts, to be honest.

In other words, they are going to be looking at a major financial commitment and some significant lead time.

Then again… Given their location in Henderson, Nevada, the average developer salary is 77,000 USD per year. That means that the personal cost, which is typically significantly higher than any other expense, is actually not that big. As of Non 2020, they had about 30 people working for Parler, assuming all of them are developers paid 100,000 USD a year (significantly higher than the average salary in their location), the employment costs of the entire company would likely be under half of the cost of the hardware required.

All of that said…. what we can really see here is a display of incompetency. At the time it was closed, Parler has roughly 15 – 20 million users. A lot of them were recently registrations, of course, but Parler already experience several cases of high number of user registrations in the past. In June of 2020 it saw 500,000 users registering to its services within 3 days, for example.

Let’s take the 20 million users as the number of users, and assume that all of them are in the states and have the same hours of activity. We’ll further assume that we have high participation numbers and all of those users are actively viewing. Remember the 1% rule, only a small minority of users are actually generating content on most platforms. The vast majority are silent observers. That would give us roughly 200,000 users that generate content, but even then, not all content is made equal. We have posts and comments, basically, and treating them differently is a basic part of building efficient system.

On Twitter, Katy Perry has just under 110 million followers. Let’s assume that the Parler ecosystem was highly interconnected and most of the high profile accounts would be followed by the majority of the users. That means that the top 20,000 users will be followed by all the other 20 millions. The rest of the 180,000 users that active post will likely do so in reaction, not independently, and have comparatively smaller audiences.

Now, we need to estimate how much these people will post. I looked at Dave Weigel’s account (591.7K followers), covering politics for Washington Post. I’m writing this on Jan 20, so the Biden inauguration takes place. I’m assuming that this is a busy time for political correspondents.  Looking at his twitter feed, he posted 3,220 tweets this month and Jan 6, which had a lot to report on, had 377 total tweets. Let’s take 500 as the reasonable upper bound for the number of interactions of most of the top users in the system, shall we?

That means that we have:

  • 20,000 high profiler users.
  • Each posting to a max of 500 a day.
  • Let’s assume that this all happens in 8 hours, instead of over the entire day.
  • That translates to roughly 1,250,000 posts an hour. If we express this in terms of posts per second, that comes to 348 posts per second.

Go and look at the specs above. Using these metrics, you can dedicate a machine for each one of those posts. Given the number of cores requested for application instances (400 x 16 = 6400 cores), this is beyond ridiculous.

Just to give you some context, when we run benchmarks of RavenDB, we run it on a Raspberry Pi 3. That is a 25$ machine, with a shady power supply and heating issues. We were able to reach over 1,000 writes / second on a sustained basis. Now, that is for simple writes, sure, but again, that is a Raspberry Pi doing three times as much as we would need to handle Parler’s expected load (which I think I was overestimating).

This post is getting a bit long, but I want to point out another social network Stack Exchange (Stack Overflow), with 1.3 billion page views per month (assuming perfect distribution, roughly 485 page views per second, each generating multiple requests).

  • Their web servers handle 450 req/sec at peak across 9 web servers (Max of 4,050 req/sec) with peak CPU usage of 12%.
  • 2 SQL Server clusters with 4 machines in total. Handling an aggregate of 23800 queries / sec with peak CPU usage of 15%.
  • Render time across the board of < 20 ms.

The hardware that is used for those servers:

  • 9 Web - 48 cores + 64 GB RAM
  • 4 DB – 32 cores + 768 GB RAM

There are a few other type of servers there, and I recommend looking into the links, because there is a lot of interesting details there.

The key here is that they are running top 200 site in significantly less hardware, and are able to serve requests and provide great quality of service.

To be fair, Stack Overflow is a read heavy site, with under half a million questions and answers in a month. In other words, less than 0.04% of the views generate a write. That said, I’m not certain that the numbers would be meaningfully different in other social media platforms.

In my next post, I’m going to explore how you can build a social media platform without going bankrupt.

time to read 2 min | 333 words

I am really proud with the level of transparency and visibility that RavenDB gives out of the box to its users. The server dashboard gives you all the critical information about what a node is doing and can serve as a great mechanism to tell at a glance what is the health of a node.

A repeated customer request is to take that up a notch. Not just showing a single server status, but showing the entire cluster state. This isn’t a replacement for a full monitoring solution, but it is meant to provide you with a quick overview of exactly what is going on in your cluster.

I want to present some early prototypes of how we are thinking about showing this information, and I wanted to get your feedback about those, as well as any other information that you think should be relevant for the cluster dashboard.

Here is the resource utilization portion of the dashboard. We aren’t settled yet on the graphs, but we’ll likely have CPU usage and memory (including memory breakdowns).

image

Some of this information may be hidden by default, and you can expand it:

image

You can get more details about what the cluster is doing here:

image

And finally, the overall view of task assignment in the cluster:

image

You can also drill down to a particular server status:

image

This is early stages yet, we pretty much have just the mockup, so this is the right time to ask for what you want to see there.

time to read 1 min | 165 words

Among the advantages of a highly distributed system with endless edge points are that you can outsource data collection to a universe of locations, and even include them in your workflow, thereby expanding your operations. The challenges are when you have endpoints that contribute to your organization and systems, but you don’t exactly trust. They can be newcomers that you don’t know enough about, or entities with a history of misusing the data inclusion to your systems give them access to. You want the value they create, the information they amass and gather to be copied from the edge up the levels of your system, but you don’t want to give too much for that value or pay for it in the form of greater risk. Filtered replication is the art of enabling nontrusted edge points to access your system in a limited manner, replicating the information they produce in a nontrusted format.


time to read 4 min | 636 words

Yesterday I posted about Parler banning and the likely impact of that, both legally and in terms of the technical details. My expectations is that new actors will step in to fill the existing demand created by the current social network account suspensions. I had spent some time thinking about the likely effects of this, and I think that it will lead to some interesting results.

A new social network will very likely rise as a result of those actions. That network would have to be resilient for de-platforming issues. That means that it cannot assume that it can run on any of the cloud services, at least not as normally understood by today’s standards. That means that we are likely to see one of two options:

  • Fully distributed systems – independent nodes collaborating with one another to create a network. Each node may be host and operated independently. Similar to how torrents work and other fully distributed P2P systems.
  • Distributed infrastructure – a set of servers that are running on behalf of a single entity, but are spread over multiple vendors and locations. The idea is that the shutdown of a single or multiple vendors will have little impact, because of distribution of effort.

The first option is probably something like Mastodon, but I would really like to see a return to blogs & RSS as the preferred social network. That has the advantage of a true distributed model without a single controlling actor. It is also much lower cost in terms of technology and complexity. Discovery of new blogs can be handled via recommendations, search, etc.

The reason I prefer this option is that I like to blog Smile. More seriously, owning your own content and distribution platform has just become quite important. A blog is about as simple a piece of software as you can imagine. Consuming blogs is an act that require no publication of personal information, no single actor that can observe everything you do, etc.

I don’t know if this will be the direction, although it is my favorite one. It is possible that we’ll end up with Mastodon empire, with many actors creating networks of servers which may or may not be interconnected. I can see a future where you’ll have a network of dog owners vs. cat owners, but the two aren’t federated and there are isolated discussions between them.

Given that you could create links from one to the other, I don’t think we have to deal with total echo chambers. Consider a post in the cats social network: The dog owners are talking about the chore of having to go for walks at “dogs://social.media/walks-are-great”, that is so high maintenance, the silly buggers. 

That would create separate communities, with their own rules and moderation. Consider this something like subreddits, but without the single organization that can enforce global rules.

The other alternative is that a social network would rise with a truly distributed backend that is resilient to de-platforming issues. From an outside perspective, this will present as something to the existing social networks. That has the advantage of requiring the least from users, but it is a non trivial technical challenge.

I prefer the first option, but I believe it is more likely we’ll end up with the second. The reason for that is monetization strategies. If you have a many different actors cooperating to create a network, there is a question on how you pay for that. The typical revenue model for social network is advertising. That doesn’t work so well where there isn’t a single actor that can sell ads (and track users).

That said, it would be much faster and easier to get started with the first option and it may be that we’ll end up there with the force of inertia.

FUTURE POSTS

  1. Feature Design: ETL for Queues in RavenDB - 17 hours from now
  2. re: Why IndexedDB is slow and what to use instead - about one day from now
  3. Implementing a file pager in Zig: What do we need? - 3 days from now
  4. Production postmortem: The memory leak that only happened on Linux - 6 days from now
  5. Talk: Scalable architecture from the ground up - 7 days from now

And 6 more posts are pending...

There are posts all the way to Dec 22, 2021

RECENT SERIES

  1. Challenge (63):
    03 Nov 2021 - The code review bug that gives me nightmares–The fix
  2. Talk (6):
    23 Apr 2020 - Advanced indexing with RavenDB
  3. Production postmortem (32):
    17 Sep 2021 - The Guinness record for page faults & high CPU
  4. re (29):
    23 Jun 2021 - The performance regression odyssey
View all series

Syndication

Main feed Feed Stats
Comments feed   Comments Feed Stats