Ayende @ Rahien

Oren Eini, aka Ayende Rahien, is the CEO of Hibernating Rhinos LTD, which develops RavenDB, a NoSQL open source document database.

time to read 8 min | 1563 words

We got a pretty nasty bug at a customer site a few months ago. Every now and then, the server running RavenDB would go into high CPU mode, using 100% CPU and staying there for an extended period of time. After a while, it would just return to normal.

Looking at the details, there was nothing that should cause this scenario. The load on the system didn’t justify it; we are talking about a maximum of under 500 requests / second. RavenDB can handle that much on a Raspberry Pi without straining itself, after all.

The problem was that we couldn’t figure out what was going on… none of the usual metrics were relevant. Typically, when we see high CPU utilization, the fault is either in our code or in a GC that is working hard. In this case, however, while the RavenDB process was responsible for the CPU usage, there was no indication that it was any of the usual suspects. Here is what the spike looked like:

[Image: CPU utilization graph showing the spike]

The customer had increased the size of the machine several times, trying to accommodate the load, but the situation was not getting better. In fact, it appeared to be getting worse. This was on a server running Windows 2016, and all the nodes in the cluster would experience this behavior, effectively taking the system down. They did not do so on a synchronized schedule; rather, one node would go into high CPU load, the clients would fail over to the other nodes, and that would usually (but not always) trigger the same situation there. After a short while, things would drop back to normal rates, but that was obviously not a good state of affairs.

After a while, we found something that was absolutely crazy. Looking at Task Manager, we added all the available columns and examined them. Take a look at the following screenshot:

[Image: Task Manager with the Page Fault Delta column showing values in the hundreds of thousands]

What you see here is the Page Fault Delta: basically, how many page faults happened for this process in the past second. Hundreds of page faults in this column already counts as a high number. Thousands of page faults usually means that you are swapping badly. Hundreds of thousands? I had never seen such a scenario and couldn’t imagine what it actually meant.

What is crazier is that this number should be physically impossible. At a glance, it implies a huge amount of reads from the disk, but looking at the disk metrics, we could see very little activity there.

That is when we discovered that there is another important metric, Hard Page Faults / sec. That metric, on the other hand, typically ranged in the single digits or very low double digits, nothing close to what we were seeing. So what was going on here?

In addition to hard page faults (reading the data from disk), there is also the concept of soft page faults. Those page faults can happen if the OS can find the data it needs in RAM (the page cache, for example). But if it is in RAM, why do we even have a page fault in the first place? The answer is that while the data may be in RAM, it may not be mapped into the address space of the process in question.

Consider the following image: we have two processes that have mapped the same file to memory. In both processes, the first page is mapped to the same physical page. But the third page in the second process (5678) is not mapped. What do you think will happen if that process accesses this page?

[Image: two processes mapping the same file, with one page not mapped in the second process]

At this point, the CPU will trigger a page fault, which the OS needs to handle. How is it going to do that? It can fetch the data from disk, but it doesn’t actually need to do so. What it needs to do is just update the page mapping to point to the already loaded page in memory. That is what is called a soft page fault (a hard page fault being the one that requires going to disk).

Note that the CPU utilization above shows that the vast majority of the time is actually spent on system time, not user time. That means that the kernel is somehow doing a lot of work, but what is going on?

The page fault delta was the key to understanding what exactly was going on here. When we looked deeper using ETW, we were able to capture the following trace:

[Image: ETW trace showing most of the time spent handling page faults in the kernel, contending on a lock]

What you can see in the image is that the vast majority of the time is spent handling the page fault on the kernel side, as expected given the information we had so far. However, the trace also shows the reason for the issue: we are contending on an exclusive lock. What lock is that?

We worked with Microsoft to figure out what exactly was going on, and we found that in order to modify the process mapping table, Windows needs to take a lock. That makes sense, since you need to avoid concurrent modifications to the page table. However, on Windows 2016, that is a process-wide lock. Consider the impact: if a lot of threads want to access pages that aren’t mapped into the process, what will happen?

Each thread will take a page fault, which needs to be handled. If it is a hard page fault, we will issue a read and put the thread to sleep until it completes. But what if it is a soft page fault? Then we just need to take the process lock, update the mapping table and return. But what if there is a high degree of concurrency, like 64 cores all contending on the same exact lock? You are going to end up with exactly the situation above: a hotly contested lock, with all your time spent at the kernel level.

The question now was, why did this happen? The design of RavenDB relies heavily on memory mapped I/O and it is something that we have been using for over a decade. What can cause us to have so many soft page faults?

The answer came from looking even more deeply into the ETW traces we took. Take a look at the following stack:

[Image: ETW stack trace of a FlushFileBuffers call]

When we call FlushFileBuffers, as we need to do to ensure that the data is consistent on disk, there is a lot that actually goes on. One of the key aspects, however, is that Windows will remove pages that were written by FlushFileBuffers from the working set of the process. That will lead to page faults (soft ones). We confirmed with Microsoft that this is the expected behavior: calling FlushFileBuffers (fsync) will trim the modified pages from the process mapping table. This is done to improve coherency between the memory mapped pages and the page cache, I believe.

To reproduce this scenario, you’ll need to do something similar to:

  • Map a large number of pages (in this case, hundreds of GB)
  • Modify the data in those pages (in our case, write documents, indexes, etc)
  • Call FlushFileBuffers on the data
  • From many threads, access the recently flushed data (each thread ideally accessing a different page)
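
Putting those steps into a rough sketch (this is illustrative Python, not the actual RavenDB code; the file name, size and thread count are made up, and since Python’s GIL serializes these threads, a faithful reproduction needs true multi-core parallelism, as in .NET):

```python
import mmap
import os
import threading

PAGE = 4096
SIZE = 1024 * 1024 * 1024   # 1 GB here; the real scenario mapped hundreds of GB

with open("data.bin", "w+b") as f:
    f.truncate(SIZE)                       # step 1: map a large file
    mm = mmap.mmap(f.fileno(), SIZE)

    for offset in range(0, SIZE, PAGE):    # step 2: dirty the mapped pages
        mm[offset] = 1

    mm.flush()                             # step 3: flush the mapping; on Windows,
    os.fsync(f.fileno())                   # os.fsync ends up calling FlushFileBuffers

    def touch(start):                      # step 4: many threads read the recently
        total = 0                          # flushed pages, each taking a soft page
        for offset in range(start, SIZE, 64 * PAGE):
            total += mm[offset]            # fault that contends on the process lock

    threads = [threading.Thread(target=touch, args=(i * PAGE,)) for i in range(64)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```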

On Windows 2016, you’ll hit a spinlock contention issue and spend most of your time contending inside the kernel. The recommendation from Microsoft has been to move to Windows 2019, where the memory locks are more granular, so the threads won’t all contend on the same lock. Indeed, when testing on Windows 2019, we weren’t able to reproduce the problem.

The really strange thing here is that we have been using the exact same code and approach in RavenDB for many years, and only recently have most of our customers shifted to running on Linux. This particular access pattern is how we have always been running, and I would have expected it to be triggered often.

The annoying thing about this is that it is actually a case of too much of a good thing. RavenDB usually scales linearly with the number of cores for reads. The customer in question moved from RavenDB 3.5 to RavenDB 5.2, using the same size machine in both cases. RavenDB 5.2 is far more efficient, however: it was able to utilize the cores a lot better and trigger this behavior on a consistent basis. With RavenDB 3.5, on the other hand, a lot of the CPU time was spent doing other things, so we didn’t trigger this issue. Indeed, one workaround that improved performance was to reduce the number of cores on the system. That reduced the contention and made the whole system more stable.

The actual solution, however, was to run on Windows 2019, though getting there was not simple. We tested pretty much every scenario we could think of to see what could help us here. And yes, we tested this on Linux, and didn’t see any indication of a similar problem.

time to read 4 min | 606 words

I recently had to discuss the impact of latency a few times, and I found the coffee cup analogy to be an excellent tool to explain exactly what is going on. Consider the humble coffee cup, without which there would be no code.

It is a pretty simple drink, composed of coffee, water and milk. I’ll ignore coffee snobs and the like for now and focus strictly on the process of making a cup of coffee. I found this recipe:

  • 1 cup milk
  • ½ cup cold brewed coffee
  • 2 sweetener

Mix milk, coffee, and sweetener together in a glass until sweetener is dissolved.

If I was writing this in code, I would probably write something like this:
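
Something along these lines, as a sketch (the pantry, the amounts to buy and the helper names are all illustrative):

```python
COFFEE_AMOUNT_TO_BUY = 10      # enough coffee for 20 cups
MILK_AMOUNT_TO_BUY = 20        # enough milk for 20 cups
SWEETENER_AMOUNT_TO_BUY = 40   # enough sweetener for 20 cups

pantry = {"coffee": 10.0, "milk": 20.0, "sweetener": 40.0}   # stocked for 20 cups

def go_to_store(ingredient, amount):
    # The expensive part: a full round trip to the store and back.
    pantry[ingredient] += amount

def take(ingredient, amount, amount_to_buy):
    if pantry[ingredient] < amount:
        go_to_store(ingredient, amount_to_buy)
    pantry[ingredient] -= amount
    return amount

def coffee(amount):
    return take("coffee", amount, COFFEE_AMOUNT_TO_BUY)

def milk(amount):
    return take("milk", amount, MILK_AMOUNT_TO_BUY)

def sweetener(amount):
    return take("sweetener", amount, SWEETENER_AMOUNT_TO_BUY)

def mix(*ingredients):
    return sum(ingredients)    # stand-in for "stir until dissolved"

def make_coffee():
    # Reads almost exactly like the recipe above.
    return mix(milk(1), coffee(0.5), sweetener(2))
```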

Simple enough, right? There are just a few details to fill in. How are the coffee() or sweetener() methods implemented?

The nice thing about this code is that it is nicely abstracted; the coffee recipe and the code read in almost the same manner. However, there is an issue with the actual implementation. We have the go_to_store() method, and we know that this is an expensive operation. To avoid calling it too often, we calculate the amounts we need to make 20 cups of coffee and make sure to set the relevant XYZ_AMOUNT_TO_BUY values appropriately.

What do you think will happen on the 21st cup of coffee, however? We run out of coffee, so we’ll go to the store to get some. Once we have it, we can pour the coffee into the cup, but then we need to put the milk in, at which point we’ll discover that we have run out of that too. Off to the store we go, and all the way back. And then there is the sweetener that has run out, so that is the third trip to the store.

Abstraction, in this case, is actively hurting us. We ignore the fact that ingredients may be missing, and that isn’t something that we can afford to ignore. The cost of going to the store outweighs anything else in the process of making a cup of coffee, and we just did it three times.

In the context of software, of course, we are talking about the issue of making a remote call. For example, sending a separate query to the database for each datum that you need. The cost of the remote call far exceeds any other cost you have in the system.

To solve the coffee cup problem, you’ll need to do something like:
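
Continuing the same sketch (still illustrative), we check the pantry once, up front, and make at most a single trip for everything that is missing:

```python
def restock(shopping_list):
    # One round trip to the store buys everything on the list.
    for ingredient, amount in shopping_list.items():
        pantry[ingredient] += amount

def make_coffee():
    needed = {"milk": 1, "coffee": 0.5, "sweetener": 2}

    if any(pantry[ing] < amt for ing, amt in needed.items()):
        restock({
            "milk": MILK_AMOUNT_TO_BUY,
            "coffee": COFFEE_AMOUNT_TO_BUY,
            "sweetener": SWEETENER_AMOUNT_TO_BUY,
        })

    for ing, amt in needed.items():
        pantry[ing] -= amt
    return mix(*needed.values())
```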

Abstraction? What abstraction? There are no abstractions here. We are very clearly focused on the things that need to happen to get it working properly. In fact, a better alternative would be to not check that we have enough for the current cup but to schedule a purchase when we notice that we are low.

That, again, intermixes the responsibilities of making the coffee and making sure that we have the ingredients at hand. That is not an actual problem, however; it is something that we are fine with, given the difference in performance that this entails.

In the same manner, when I see people trying to hide remote calls (RPC, database calls, etc.) behind an abstraction layer, I know that it will almost always end in tears. Because if something that looks like a cheap function call goes to the store for you, the end result is that you have to wait a long time for your coffee. Maybe long enough to (gasp) not even have coffee.

On that note, I have a cup of coffee to finish…

time to read 2 min | 251 words

I got an interesting question by email and I thought that it is worth a post. The question was whether RavenDB can handle pivot tasks. Consider the case where I have orders data, and I want to see a summary of product sales on a monthly basis, like so:

[Image: summary of product sales, one row per product per month]

This data was produced using the sample data in RavenDB and the following map/reduce index:

That works, but it gives each individual month its own row. When using Excel, we can pivot the whole thing so that instead of rows we get columns. For certain types of data, that makes it much easier to work with. For example, let’s say that I want to compare monthly sales data across different products.

The data we see is the same; it is just the way we process and show it that is different. Let’s see how we can do that in RavenDB. We can do it with a secondary aggregation step in the reduce, like so:

The idea is that the reduce step in RavenDB can have its own complex processing, and the result of this process gives us the following output:

If we use JavaScript indexes, we can even manipulate the data to skip the nested values. The code is nastier (likely a product of my JavaScript skills, I’ll freely admit), but the results are nice.

[Image: the pivoted results, with one column per month]

time to read 5 min | 834 words

I wrote a post a couple of weeks ago called: Architecture foresight: Put a queue on that. I got an interesting comment from Mike Tomaras on the post that deserves its own post in reply.

Even though the benefits of an async queue are indisputable, I will respectfully point out that you brush over or ignore the drawbacks.

… redacted, see the real comment for details …

I think we agree that your sync code example is much easier to reason about than your async one. "Well, it is a bit more complex to manage in the user interface", "And you can play games on the front end" hides a lot of complexity in the FE to accommodate async patterns.

Your "At more advanced levels" section presents no benefits really, doing these things in a sync pattern is exactly the same as in async, the complexity is moved to the infrastructure instead of the code.

This is a great discussion, and I agree with Mike that there are additional costs to using the async option compared to the synchronous one. There is a really good reason why pretty much all modern languages have something similar to async/await, after all. And anyone who has done any work with Node.js and promises without it knows exactly what the cost is of trying to keep the state of the system through multiple levels of callbacks.

It is important to note, however, that my recommendation had nothing to do with async directly, although that is the end result. It had a lot more to do with breaking apart the behavior of the system, so you aren’t expected to give immediate replies to the user.

Consider this: ⏱. When you are processing a user’s request, you have a timer inherent to the operation. That timer can be a real one (how long until the request times out) or it can be a mental one (how long until the user gets bored). That means that you have a very short SLA to run the actual request.

What is the impact of that on your system? You have to provision enough capacity to handle the spikes within the small SLA that you have to work with. That is tough. Let’s assume that you are running a website that accepts comments, and you need to run spam detection on each comment before actually posting it. This seems like a pretty standard scenario, right? It doesn’t require anything specialized.

However, the service you use has a rate limit of 10 comments / sec. That is also something that is pretty common and reasonable. How would you handle something like that if you have a post that suddenly gets a lot of comments? Well, you’ll have something that ensures you don’t exceed the limit, but then the user is sitting there waiting, thinking that the request timed out. On the other hand, if you accept the request and place it in a queue, you can show it in the UI as accepted immediately and then process it at leisure.
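
As a minimal sketch of that idea (the spam check and the publish step here are stand-ins for the real services), the request handler does nothing but enqueue, and a background worker feeds the rate-limited service at its own pace:

```python
import queue
import threading
import time

pending_comments = queue.Queue()

def submit_comment(comment):
    # Called from the request handler: cheap, never waits on the spam service.
    pending_comments.put(comment)
    return "accepted"                    # the UI can show the comment as pending

def is_spam(comment):
    return False                         # stand-in for the rate-limited spam API

def publish(comment):
    print("published:", comment)         # stand-in for actually posting the comment

def spam_worker(max_per_sec=10):
    while True:
        comment = pending_comments.get()
        if not is_spam(comment):
            publish(comment)
        time.sleep(1.0 / max_per_sec)    # stay under the service's rate limit

threading.Thread(target=spam_worker, daemon=True).start()
```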

Yes, this is more complex than just making the call inline, but it also ensures that you have proper separation in your system. The front end submits messages to the backend, which replies when it is done. By having this separation upfront, as part of your overall design, you get options. You can quickly change how you process things in the backend. Your front end feels fast (which is usually much more important than actually being fast, mind you).

As for the rate limits and the SLA? In the case of a spam detection API or similar services, sure, this is obvious. But there are usually a lot of implicit SLAs like that. Your database disk is only able to serve so many writes a second, for example. That isn’t usually surfaced to you as an X writes / sec limit, but it is true nevertheless. A queue will smooth over any such issues easily. When making the request directly, you have to ensure that you have enough capacity to handle spikes, and that is usually far more expensive.

What is more interesting, in my opinion, is that the queue gives you options that you wouldn’t have otherwise. For example, tracing of all operations (great for audits), retries if needed, easy model for scale out, smoothing out of spikes, etc.

You cannot actually put everything into a queue, of course. The typical example is a login page: you cannot really “let the user log in immediately and process it in the background”. Another example where you don’t want to use asynchronous processing is when you are making a query. There are patterns for async query completion, but they are pretty horrible to work with.

In general, the idea is that whenever there is an operation in the system, you throw it into a queue. Reads and certain key operations are things that you’ll need to run directly.

time to read 2 min | 395 words

A user called us to ask how they could move a particular report from a legacy system to RavenDB. They need to be able to ask questions such as the following one:

This is an interesting issue when you think about it from the point of view of a database engine. The distinct requirement means that we have to keep state (all the unique values) while we evaluate the query, which can be expensive. One of the design principles of RavenDB is that we want to make it hard to accidentally create expensive queries. Indeed, a query like that isn’t trivial to implement in RavenDB; we need a two-stage approach to implement this feature.

First, we’ll introduce a Map/Reduce index, which will aggregate the data by Employee, Company and City. Along the way, it will effectively run the distinct operation on the City, because it groups by it. That gives us a model in which we get the distinct values for free, in a highly efficient manner. Here is the index in question:

The interesting thing about this index is that querying it directly will not give us the right results. We don’t want the details grouped by Employee, Company and City; we want them just by Employee and Company. This is where the second stage comes into play. Instead of running a simple query on the index, we’ll use a faceted query. Here is what it looks like:

What this does is to aggregate the results (which were already partially aggregated by the Map/Reduce) and give us the totals. And here are the results:

The end result is that we are able to do most of the work at indexing time, and query time is spent working on already aggregated data. That means the queries should be much faster and there is a lot less work for the database to do.

That said, this also isn’t RavenDB’s strong suit. Such queries are typically more in line with OLAP systems, to be honest. If you know what your query patterns look like, you can use this technique to handle such queries easily, but if there is a wide range of dynamic queries, you may want to use RavenDB as the system of record and then use either SQL ETL or OLAP ETL to push the data to a reporting system.

time to read 1 min | 149 words

Implementing a unit of work in Python can be an interesting challenge. Consider the following code:
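
Something along these lines (a sketch; the class and method names are illustrative):

```python
class UnitOfWork:
    def __init__(self):
        self._tags = {}

    def tag(self, obj, tag):
        self._tags[obj] = tag        # the object itself is used as the dictionary key

    def get_tag(self, obj):
        return self._tags.get(obj)
```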

This is about as simple as code gets: associate a tag with an object, right?

However, this code will fail for the following scenario:
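
For example (a sketch, assuming the UnitOfWork above), tagging a dataclass instance:

```python
from dataclasses import dataclass

@dataclass
class Item:
    name: str

uow = UnitOfWork()
uow.tag(Item("laptop"), "electronics")   # TypeError: unhashable type: 'Item'
```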

You’ll get a lovely “TypeError: unhashable type: 'Item'” when you try this. This is because data classes in Python have a complicated relationship with __hash__().

An obvious solution to the problem is to use:
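
Something like this (a sketch), keying the dictionary on id(obj) so the object itself never needs to be hashable:

```python
class UnitOfWork:
    def __init__(self):
        self._tags = {}

    def tag(self, obj, tag):
        self._tags[id(obj)] = tag        # id(obj) is always hashable

    def get_tag(self, obj):
        return self._tags.get(id(obj))
```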

However, id() in Python is only unique for the lifetime of the object; once an object dies, its id can be handed out again. Consider the following code:
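
A sketch, reusing the Item dataclass from above:

```python
print(id(Item("first")))    # this Item becomes garbage immediately...
print(id(Item("second")))   # ...so the next one can be allocated at the same address
```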

On my machine, running this code gives me:

124597181219840
124597181219840

In other words, the id has been reused. This makes sense, since this is just the pointer to the value. We can fix that by holding on to the object reference, like so:
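
A sketch of that fix: store the object itself alongside its tag, so the reference we hold keeps the id from being recycled, and we can verify it really is the same object.

```python
class UnitOfWork:
    def __init__(self):
        self._tags = {}

    def tag(self, obj, tag):
        self._tags[id(obj)] = (obj, tag)           # holding obj keeps its id stable

    def get_tag(self, obj):
        entry = self._tags.get(id(obj))
        if entry is not None and entry[0] is obj:  # true reference equality
            return entry[1]
        return None
```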

With this approach, we are able to implement proper reference equality and make sure that we aren’t mixing different values.

time to read 5 min | 945 words

A common question raised by customers is how to determine what kind of hardware you need to run RavenDB on. I’m sorry, but the answer is: it depends, because there are a lot of variables to juggle. In this post, I’m going to try to give some insight into what sort of things you should consider when sizing your instances.

In general, you have three axes that you can work with: CPU, memory and I/O. In terms of the best bang for the buck, optimizing I/O is usually the way to go and will pay the most dividends. This is because most of the time, RavenDB will be bottlenecked on I/O. This is especially true when you are running on the cloud, where 500 IOPS is a fairly common default (that is basically zilch for a database).

To give a more concrete answer, we’ll need more details. Let’s say that you have an application with a database per customer (common for multi-tenant scenarios). The structure of each database is the same, but each contains data that is separate per customer. Each database has 20 indexes in total: 15 map / full text search indexes, as well as 5 for map-reduce / facets operations. There are also a few ETL tasks and a couple of subscriptions for background work.

Let’s break down the load on a single server in this model, shall we?

  • 100 databases (meaning 100 tx merger threads for I/O).
  • 2,000 indexes - 20 indexes x 100 databases (meaning 2,000 indexing threads).

Across the cluster, we also have:

  • 500 ETL tasks – 5 per database x 100
  • 200 subscriptions – you get the drill

The latter items are spread fairly evenly among all the nodes that you have, but the first two are present on all nodes in the cluster.

What does this mean? Do we have 2,100 threads active at any given point in time? Well, that is where things get a bit complex. We need to know more than just the raw numbers; we need to understand usage.

How many of those databases are active at any given point in time? In a multi tenant system, it is common to have many customers using the system sporadically, which can allow you to pack a lot more instances into the same hardware resources.

Of more interest, however, is usually the rate of writes. Here we need to ask ourselves what the write rate is as well. In general, for reads RavenDB will load all the relevant items into memory and serve directly from there. For writes, given its durable nature, RavenDB must hit the disk. And the question now becomes: how many databases are actively written to at the same time?

This is important, because 10 writes per second to a single database are far better than 10 writes / second across 10 databases. This is because RavenDB is able to batch I/O for a single database, but not across databases. Let’s consider a scenario where writes impact 5 indexes in the database. What is going to happen when we have 10 writes / sec in a single database?

  • 1 – 5 writes to disk for the actual document writes (this depends on a lot of factors, and assumes that we are talking about concurrent requests here).
  • 5 – 10 index update writes: 1 – 2 index updates x 5 relevant indexes (in most cases, we are able to batch index writes even better than document writes).

Total number of writes to disk: 6 – 15 writes.

However, what if we take the same scenario but run it across 10 databases, each getting a single write? There is no way for us to batch the updates, so we’ll have:

  • 10 databases x (1 document write + 5 index updates) = 60 writes to disk.

If the number of relevant indexes is high, or if there are more databases involved, it is easy to hit the limits of I/O, especially on the cloud.

I’m actually painting a somewhat bleak picture; in most cases you don’t have to worry too much about those details, and RavenDB will take care of that for you. However, when you need to consider sizing, you want to be aware of the possible load that you’ll have. Ironically enough, if you have enough load, RavenDB is able to really optimize things; it is when you have sporadic operations, spread across many locations, that we start putting a lot of load on the underlying system.

So far, I have been talking about I/O only, but there are other factors as well. Let’s assume that you are running 100 databases with 20 indexes each on a system with 4 cores. How is RavenDB going to split the load across the system?

The first priority is given to processing requests, and only then do we start running indexes. That is by design, to ensure that we won’t overwhelm the underlying system by issuing too much work all at once. That means we’ll round-robin the work across all the indexes that want to run, while keeping enough capacity to process user requests. In this case, more cores allow a higher degree of parallelism, but if you have an unbalanced system (a lot of CPU but slow I/O), you’re going to see stalls because we’ll spend a lot of time waiting for I/O.

In short, you need to have a fair idea about how your system is going to be used. If you don’t have at least a good guess on the topic, you are probably better off getting more I/O bandwidth than anything else. RavenDB continuously monitors itself and will alert you if there are resource issues. You can then shore up anything that is lacking to get the best system performance.

time to read 1 min | 199 words

It turns out that there were quite a lot of podcasts and videos that we took part in recently, enough that I didn’t get to talk about all of them. This post is to clear the queue, so to speak.

  1. What Is a noSQL Database? – Dejan, our developer advocate, is discussing the high level details of non relational databases and how they fit into the ecosystem.

  2. Getting started with RavenDB – Dejan is talking about how to get started using RavenDB, including live demos.

  3. Applying BDD techniques using RavenDB – Dejan will show you how you can build a behavior driven design system while leaning on RavenDB’s capabilities.

  4. Live demoing RavenDB – I talk about RavenDB and take you for a walk through all its features. This video is close to two hours and covers many of the interesting bits of RavenDB and its design.

  5. Interview with Oren Eini – Here I talk about my career and how I got to building RavenDB.

  6. The Birth of RavenDB – This is a podcast in which I discuss in detail how I got started working in the databases field and how I ended up here.

time to read 6 min | 1001 words

RavenDB offers both single-node transactions and cluster-wide transactions. You are free to use either one, or even mix them together. That level of freedom, on the other hand, brings with it its own set of challenges. How do you know which to use? What are the scenarios and implications of each option? Remember, RavenDB is a distributed database that allows you to make modifications on any node in the cluster.

In essence, this boils down to a simple concept: how important is the write that you are making? In detail, this gets complex. It’s easy to say that for low-importance writes you’ll use single-node transactions, and for high-value items you’ll use cluster-wide transactions. But that isn’t correct. The primary issue is what you are trying to achieve. I’m afraid that I have no choice but to dig into this topic.

Let’s consider the following scenario: a user clicked on “Add to Cart” in the application. How should we record this fact? There is a “shopping-carts/ayende” document for this user, which represents their current shopping cart. But how should we save it?

Obviously, we never want to lose an item from the shopping cart, right? We can use a cluster-wide transaction here to ensure maximum safety! Except… a cluster-wide transaction will fail if the node that we reached cannot reach a majority of the nodes in the cluster. Going back to the business, I asked them about it. The answer I got was “never lose an item from the shopping cart”. That means that we need to process the write even if we can reach no other node.

That leads us to single-node transactions, which will do just that. However, now we have to deal with another issue: what happens if two concurrent transactions modify the same document on different nodes at the same time? Now we have a conflict, and when the nodes replicate the data to one another, we’ll need to resolve it somehow. RavenDB defaults to resolving to the latest version, meaning that some of the changes will be lost. However, we can set up a resolution script that merges our changes between multiple versions of a document.

This is confusing, I’m aware. The rule of thumb goes like this:

  1. Use single-node transactions by default – if there are errors / conflicts / issues, let RavenDB resolve them to the latest version (a revision exists, so you can recover anything that was lost).
  2. Use single-node transactions + a conflict resolver script if you actually care about applying some sort of logic to the merging of conflicts. This is rare; the scenario is usually something that can be modified concurrently and merged together. A shopping cart is an excellent example of this.
  3. Use a cluster-wide transaction when you would rather fail than go forward if you cannot ensure the operation is successful. This is also rare, usually reserved for things such as selling a limited amount of some item.

The default recommendation (let RavenDB manage it and accept that it may select the latest version) is not something that I make lightly. It is based on quite a bit of experience in how users actually use RavenDB. In almost any business context, you are going to have large parts of the model that have only a single reason to change, even under the worst-case assumption. A customer changing their billing address, for example, can reasonably be assumed to want to keep the latest version they put in. There is also no real meaning to concurrency in this scenario; the modification to a particular document is done by the relevant customer directly. Failures are rare (but they do happen, so you have to account for them), so you need to consider what impact they’ll have. If this is something that doesn’t normally have multiple concurrent operations going on (and proper document modeling will usually ensure that it doesn’t), you can just ignore the problem.

I say “ignore the problem” because there is a real question of what it would even mean not to ignore it. You can try to write your own conflict resolution script, but even with knowledge of your model, what are you expected to do with two conflicting versions of a customer, each with a different billing address?

And trying to do something generic doesn’t work. It will fail, but because this is rare, it will happen only a year after deployment, when no one recalls exactly what the behavior is, and an error in such a case will cause a hard failure in production.

For some cases, like the shopping cart, you can write meaningful merge code and the scenario makes sense: I may click on two “Add to Cart” buttons at the same time from different locations, and I don’t want to lose either of those items.

The last scenario, using a cluster-wide transaction, is actually the reverse. Usually RavenDB will jump through all sorts of hoops to ensure that it won’t lose a write, but cluster-wide transactions go the other way: they need to fail if they can’t ensure that they went through. In this case, you’ll usually be working on something very specific. The classic example is ensuring a unique username in the system; we want to fail if we can’t absolutely ensure that this username is unique. But that isn’t something we want to do all the time. Updating the LastLogin time on the user’s document is not something that needs that level of consistency (and in this case, selecting the latest is also, by definition, the right thing to do).

I like to say that you should use a single-node transaction to record that you purchased a lottery ticket, and a cluster-wide transaction to record who won the lottery. That gives you the right mindset about the stakes involved. I never want to lose the record of a sale, but I want to ensure that once the win is awarded, I get it absolutely right.

time to read 3 min | 549 words

Assume that you have a service that needs to accept some data from a user. Let’s say that the scenario in question is that the user wants to upload a photo that you’ll later process / aggregate / do stuff with.

How would you approach such a system? One way to do this is to do something similar to this:

[Image: the user uploads to your code (a Lambda function here), which then pushes the data to S3]

The user will upload the file to your code (in the case above, a Lambda function, but it can be an EC2 instance, etc.), which will then push the data to its final location (S3, in this case). This is simple and quite obvious to do. It is also wrong.

There is no need to involve your code in the middle. What you should do, instead, is to have the user talk directly to the end location (S3, Azure Blob Storage, Backblaze, etc). A better option would be:

[Image: the user gets an upload link from your code, uploads directly to S3, and S3 notifies your code]

In this model, we have:

  1. The user pings your code to generate a secured upload link. You can also set up an “upload only” area in storage that users can upload files to ahead of time, removing this step.
  2. The user uploads directly to S3 (or equivalent).
  3. S3 then pings your code when the upload is done.
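
A sketch of step 1, assuming AWS and boto3 (the bucket name, key layout and expiry time are illustrative): your code only hands the client a short-lived URL that it can PUT the file to directly, so the bytes never pass through you.

```python
import boto3

s3 = boto3.client("s3")

def create_upload_link(user_id, file_name):
    key = f"uploads/{user_id}/{file_name}"
    return s3.generate_presigned_url(
        ClientMethod="put_object",
        Params={"Bucket": "my-uploads-bucket", "Key": key},
        ExpiresIn=300,              # the link is only valid for five minutes
    )
```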

Why use this approach rather than the first one?

Quite simply, because in the first example, you are paying for the processing / upload / bandwidth for the work. In the second option, it is on the cloud / storage provider to actually provision enough resources to handle this. You are paying for the storage, but nothing else.

Consider the case of a user that uploads a 5 MB image over 5 seconds. With the first option, you’ll pay for the full 5 seconds of compute time if you are using something like Lambda. If you are using EC2, your machine is busy and consuming resources.

This is most noticeable if you also have to handle spikes in load. If you have 100 concurrent users, the first option will likely cost quite a lot just in the compute resources you use (either serverless or provisioned machines). In the second option, it is the cloud provider that needs to have the machines ready to accept the data, and we don’t pay for any of that.

In fact, an even better solution is shown below. Again, the user gets the upload link in some manner and then uploads directly to S3. At that point, instead of S3 calling you directly, it will push a notification to a queue (SQS), and your code can then handle it.

Here is what this looks like:

[Image: the user uploads directly to S3, S3 pushes a notification to SQS, and your code consumes the queue]

Note that in this case, you are in control of how fast or slow you want to process the data on the queue. You can set a maximum number of concurrent workers / lambdas and let the cloud infrastructure manage that for you. At this point, you can smooth any peaks that you have in the process.
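
A sketch of what that consumer might look like with boto3 (the queue URL and the handler are illustrative), assuming S3 publishes its event notifications to the SQS queue:

```python
import json
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/uploads-queue"

def handle_upload(record):
    print("new object:", record["s3"]["object"]["key"])   # stand-in for real processing

def process_uploads():
    while True:
        resp = sqs.receive_message(
            QueueUrl=QUEUE_URL,
            MaxNumberOfMessages=10,   # batch size controls how fast we drain the queue
            WaitTimeSeconds=20,       # long polling, no busy waiting
        )
        for msg in resp.get("Messages", []):
            body = json.loads(msg["Body"])
            for record in body.get("Records", []):
                handle_upload(record)
            sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])
```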

A lot of this is just setting up the orchestration properly so that you aren’t in the way, and so that you utilize the cloud infrastructure instead of writing the code yourself.
