Ayende @ Rahien

Oren Eini aka Ayende Rahien CEO of Hibernating Rhinos LTD, which develops RavenDB, a NoSQL Open Source Document Database.

time to read 3 min | 542 words

image

One of our developers recently got a new machine, and we were excited to see what kind of performance we could get out of it. It is an AMD Ryzen 9, 12 cores @ 3.79 GHz with 32 GB of RAM. The disk used was a Samsung SSD 970 EVO Plus 500 GB.

This isn’t an official benchmark, to be fair. This is us testing how fast the machine is. As such, this is a plain vanilla Windows 10 machine, with no effort to perform any optimizations. Our typical benchmark involves loading all of Stack Overflow into RavenDB, so we’ll have enough data to work with. Here is what things looked like midway through:

image

As you can see, the write speed we are able to get is impressive.

We were able to insert all of Stack Overflow, a bit over 52GB, in three and a half minutes, at a sustained rate of about 300 MB / sec.

Then we tested indexing.

  • Map/Reduce on users by registration month (source ~6 million users) – under a minute.
  • Full text search on users – two and a half minutes.
  • Simple index on questions by tag (over 18 million questions & answers) – 11.5 minutes.
  • Full text search on all questions and answers – 33 minutes.

Remember, these numbers are for indexing everything for the first time. It is worth noting that RavenDB dedicates a single thread per index, to avoid hammering the system with too much work. That means that these indexes were building concurrently with one another.

Here is the system utilization while this was going on:

image

Finally, we tested some other key scenarios (caching disabled in all of them):

  • Reading documents (small working set, representing recent questions) – 243,371 req / sec at 512 MB / sec.
  • Full random reads (data size exceeds memory, so disk hits) – 15,393.66 req / sec at 13.4 MB / sec.

These two are really interesting numbers. For the first one, we generate queries to specific documents over and over (with no caching). That means that RavenDB is able to answer them from memory directly. The idea is to simulate a common scenario of a working set that can fit entirely in memory.

The second one is different. The data size on disk is 52 GB and we have 32 GB available for us. We generate random queries here, for different documents each time. We ensure that the queries cannot be served directly from memory and that RavenDB will have to hit the disk. As you can see, even under this scenario, we are doing fairly well. As an aside, it helps that the disk is good. We tried running this on HDD once. The results were… not nice.

The final test we did was for writes, writing a small document to RavenDB. We got 118,000 writes/sec on a sustained basis, with about 32MB / sec in data throughput. Note that we can do more by playing with the system configuration, but we are already at a high enough rate that it probably wouldn’t matter.

All in all, that is a pretty nice machine.

time to read 7 min | 1260 words

In the previous post, I wrote about how I changed the structure of the hash leaf page to increase data density. I managed to get it down to 32MB range when I’m using random keys. That is a pretty great number, for memory usage, but what is the cost in terms of performance?

Well, let’s figure it out, shall we?

I added some tracing code and got the first result:

3.124000 us/op with 32.007813 MB

That is not too shabby, right? Let’s see where we are spending most of our time, shall we? I opened the profiler and got:

image

Okay, that is a good point, isn’t it? Changing to release mode gives us:

1.471000 us/op with 32.007813 MB

That is much nicer, but still, profiler please…

As a side note, it actually takes less time to run the profiler than for it to analyze its output. I was looking at this for a while.

image

The result was… stunning:

image

What is this thing? And why did it take almost 50% of my runtime?

As it turns out, I was compiling for x86, and I’m using a lot of shifts on 64 bits numbers. This _allshl seems to be part of the x86 runtime. That means that what I expected to be a cheap instruction on a register was actually a method call.

That is interesting, but easy to fix. When running in Release/x64, we get the following results:

0.723 us/op with 32.007813 MB

Okay, so we are under a microsecond per op, and very reasonable memory, good to go, right?

Well, remember that I did absolutely zero optimizations so far? What does the profiler tell us now? Here is an interesting hotspot:

image

That is reasonable, we are benching this method, after all. But inside that method, we see:

image

This is the part where we scan an existing piece to see if the value is inside it or not. This tells us if we need to add a new value or update an existing one. It makes sense that this will be hot: we have to do it on each put, against the data in the piece where we want to put the new key.
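To make the scan concrete, here is a minimal Python sketch of what such a piece scan could look like. It assumes a piece is a flat run of LEB128-style varints laid out as key, value, key, value. That layout and the names `encode_varint`, `decode_varint` and `find_in_piece` are illustrative assumptions, not the actual code from the project.

```python
def encode_varint(n: int) -> bytes:
    """Encode a non-negative integer, 7 bits per byte, low bits first."""
    out = bytearray()
    while n >= 0x80:
        out.append((n & 0x7F) | 0x80)  # high bit set: more bytes follow
        n >>= 7
    out.append(n)
    return bytes(out)


def decode_varint(buf: bytes, pos: int) -> tuple[int, int]:
    """Decode one varint starting at pos; return (value, next position)."""
    result = 0
    shift = 0
    while True:
        b = buf[pos]
        pos += 1
        result |= (b & 0x7F) << shift
        if b < 0x80:  # high bit clear: this was the last byte
            return result, pos
        shift += 7


def find_in_piece(piece: bytes, key: int):
    """Walk every key/value pair in the piece; return the value or None."""
    pos = 0
    while pos < len(piece):
        k, pos = decode_varint(piece, pos)  # one decode call for the key
        v, pos = decode_varint(piece, pos)  # and one for the value
        if k == key:
            return v
    return None


# Three pairs: (3, 30), (300, 40000), (7, 70)
piece = b"".join(encode_varint(n) for n in (3, 30, 300, 40000, 7, 70))
print(find_in_piece(piece, 300))  # key present
print(find_in_piece(piece, 5))    # key absent
```

Note that a miss has to decode every pair in the piece, which is why the decode function dominates the profile when most puts are for new keys.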

There are a few ways to deal with this. We can try to move from the simple varint mode to a more complex (and performant) system. StreamVByte would probably be a good solution, in terms of raw performance. But it is meant for 32 bit numbers and doesn’t play nice with being able to remove and add values from the stream easily.

I could also try to play games, instead of calling this function twice, call it once and pass both k and v. However, that is almost assuredly a false play. The varint method is small enough that it doesn’t really matter, the compiler can inline it and play its own optimizations. Also, I tried it and there was no noticeable performance change, so that’s down.

Another way to deal with it is to reduce the number of times we call this function. And here is where things get interesting. Why is this called so much? Because during the put process, we find a page to put a value, then in that page, we find a piece (a 64 byte range) that we will put the key and value in. When we get to the piece, we need to check the already existing data if the key is there or not. So far, so good, but there is another factor to consider, overflows.

A piece may overflow and spill into consecutive pieces. After all, that is what allowed us to reduce the memory usage from 147MB to just 32MB in the random integers scenario. However, that also means that we may need to scan a much larger piece of the page. That explains why we are seeing so much usage of the decoding function.

Let’s look at the previous behavior, where we have no overflow at all:

0.551000 us/op with 147.320313 MB

That is a much cheaper cost, but much higher memory. It looks like the typical compute vs. memory cycle, but let’s look at the actual costs?

image

You’ll notice that we spend most of our time on increasing the hash table size, allocating and moving memory, etc. So even though we are faster, that isn’t a good option for us.

One thing to note, we are looking for the same key, and decoding all the data to find it. But we don’t actually need to do that, we already have the key, and encoded it to its varint form. We can do a search on the raw encoded data to find it. It won’t be good enough for the positive case (we may have a value that was encoded to the same form), but it should help for the common case of inserting a new value. If we find something with memmem(), we still need to decode the data itself and see if the pattern we found is a key or a value, but that should help.
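A rough Python sketch of that idea follows. It uses `bytes.find` as a stand-in for `memmem()`, and it assumes a piece is a flat run of varint-encoded key/value pairs. The helper names are hypothetical. If the encoded key does not appear anywhere in the raw bytes, the key cannot be present, so the insert case skips decoding entirely. A raw hit still has to be confirmed by decoding, because the same bytes may belong to a value or to the tail of a longer varint.

```python
def encode_varint(n: int) -> bytes:
    """Encode a non-negative integer, 7 bits per byte, low bits first."""
    out = bytearray()
    while n >= 0x80:
        out.append((n & 0x7F) | 0x80)
        n >>= 7
    out.append(n)
    return bytes(out)


def decode_varint(buf: bytes, pos: int) -> tuple[int, int]:
    """Decode one varint starting at pos; return (value, next position)."""
    result = 0
    shift = 0
    while True:
        b = buf[pos]
        pos += 1
        result |= (b & 0x7F) << shift
        if b < 0x80:
            return result, pos
        shift += 7


def contains_key(piece: bytes, key: int) -> bool:
    needle = encode_varint(key)
    if piece.find(needle) == -1:
        # Fast negative: the encoded key is nowhere in the raw bytes.
        return False
    # Possible match: decode pair by pair to check it really is a key.
    pos = 0
    while pos < len(piece):
        k, pos = decode_varint(piece, pos)
        _, pos = decode_varint(piece, pos)  # skip over the value
        if k == key:
            return True
    return False


# Three pairs: (3, 30), (300, 40000), (7, 70)
piece = b"".join(encode_varint(n) for n in (3, 30, 300, 40000, 7, 70))
print(contains_key(piece, 300))  # a real key
print(contains_key(piece, 30))   # only present as a value
print(contains_key(piece, 5))    # not in the raw bytes at all
```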

I tested it using GCC’s implementation, and the performance dropped by almost 50%, it took 1.3 us/op! Maybe if I was using a SIMD optimized implementation, that would help, but given the kind of data we are looking for, it didn’t pan out.

Another option is to reduce the number of times we’ll try to overflow a value. Right now, if we can’t put a value in its proper place, we’ll try putting it in any of the other locations. That means that we may probe as many as 127 pieces. It also means that during put, we have to scan overflow chains. As we saw in the previous post, that can add up to scanning up to 1.8 KB of data for a single put. What happens if we limit the overflow amount?

Let’s see if we limit the overflow to 32 probes. Now it only takes 0.403 us/op, which is a huge improvement. But what about the memory size? It’s easier to look things up as a table:

Max chain   Overall Time (sec)   us/op      Size (MB)
1           0.545000             0.545000   147.320313
2           0.359000             0.359000   75.156250
4           0.372000             0.372000   55.523438
8           0.322000             0.322000   36.882813
16          0.336000             0.336000   32.226563
32          0.448000             0.448000   32.007813
64          0.596000             0.596000   32.007813
128         0.770000             0.770000   32.007813

These numbers are interesting, but let’s look at them as a graph, shall we?

image

We can see that the size drops sharply as we allow more probes, while performance is best between 8 and 16 probe attempts, so all we are left choosing is the memory cost.

If we go with 8 probe attempts, we’ll pay an additional 4.875 MB, but with 16 probe attempts, we’ll use just 224KB more, at a cost of 0.014 us/op more than the optimal value.

We could go to 32, of course, which gives us optimal size, with about 60% of the cost of doing the full scan. However, by paying just 224KB more, we get down to 43% of the initial cost. And that certainly seems like it is worth it.
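The trade-off can be recomputed directly from the table above. This is just a small Python sketch of the arithmetic. The rows are copied from the table, while the `tradeoff` helper and its field names are made up for illustration.

```python
# (max chain, us/op, size in MB), copied from the table in this post
results = [
    (1, 0.545, 147.320313),
    (2, 0.359, 75.156250),
    (4, 0.372, 55.523438),
    (8, 0.322, 36.882813),
    (16, 0.336, 32.226563),
    (32, 0.448, 32.007813),
    (64, 0.596, 32.007813),
    (128, 0.770, 32.007813),
]

full_scan_time = results[-1][1]                   # 128 probes: scan it all
smallest_size = min(size for _, _, size in results)
fastest_time = min(t for _, t, _ in results)


def tradeoff(max_chain: int) -> dict:
    """Extra memory and time for one row, relative to the best values."""
    _, t, size = next(r for r in results if r[0] == max_chain)
    return {
        "extra_kb": (size - smallest_size) * 1024,
        "extra_us": t - fastest_time,
        "pct_of_full_scan": 100 * t / full_scan_time,
    }


for chain in (8, 16, 32):
    print(chain, tradeoff(chain))
```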

You can find the full source code (a little bit cleaned up) here.

time to read 14 min | 2615 words

We got a few requests for some guidance on how to optimize RavenDB insert rate. Our current benchmark is standing at 135,000 inserts/sec on a sustained basis, on a machine that costs less than $1,000. However, some users tried to write their own benchmarks and got far less (about 50,000 writes / sec). Therefore, this post, in which I’m going to do a bunch of things and see if I can make RavenDB write really fast.

I’m sorry, this is likely to be a long post. I’m going to be writing this as I’m building the benchmark and testing things out. So you’ll get a stream of consciousness. Hopefully it will make sense.

Because of the size of this post, I decided to move most of the code snippets out. I created a repository just for this post, and I’m showing my steps as I go along.

Rules for this post:

  • I’m going to use the last stable version of RavenDB (4.2, at the time of writing)
  • Commodity hardware is hard to quantify, so I’m going to use AWS machines because they are a fairly standard metric and likely where you’re going to run it.
    • Note that this does mean that we’ll probably have less performance than if we were running on dedicated hardware.
    • Another thing to note (and we’ll see later) is that I/O rate on the cloud is… an interesting topic.
  • No special system setup
    • Kernel config
    • Reformatting of hard disk
    • Changing RavenDB config parameters

The first thing to do is to figure out what we are going to write.

The test machine is:  t3a.xlarge with 4 cores, 16 GB RAM. This seemed like a fairly reasonable machine to test with. I’m using Ubuntu 18.04 LTS as the operating system.

The machine has an 8GB drive that I’m using to host RavenDB and a separate volume for the data itself. I created a 512GB gp2 volume (with 1536 IOPS) to start with. Here is what this looked like from inside the machine:

image

I’m including the setup script here for completeness, as you can see, there isn’t really anything here that matters.

Do note that I’m going the quick & dirty mode here without security, this is mostly so I can see what the impact of TLS on the benchmark is at a later point.

We are now pretty much ready, I think. So let’s take a look at the first version I tried. Writing 100,000 random user documents like the following:

image

As you can see, that isn’t too big and shouldn’t really be too hard on RavenDB. Unfortunately, I discovered a problem, the write speed was horrible.

image

Oh wait, the problem exists between keyboard and chair, I was running that from my laptop, so we actually had to go about 10,000 KM from client to server. That… is not a good thing.

Writing the data took almost 12 minutes. But at least this is easy to fix. I setup a couple of client machines on the same AZ and tried again. I’m using spot instances, so I got a t3.large instance and a m5d.large instance.

That gave me a much nicer number, although still far from what I wanted to have.

image

On the cloud machines, this takes about 23 - 25 seconds. Better than 12 minutes, but nothing to write home about.

One of the reasons that I wanted to write this blog post is specifically to go through this process, because there are a lot of things that matter, and it sometimes can be hard to figure out what does.

Let’s make a small change in my code, like so:

image

What this does is remove the call to RavenDB entirely. The only cost we have here is the cost of generating the data from the Bogus library. This time, the code completes in 13 seconds.

But wait, there are no RavenDB calls here, why does it take so long? Well, as it turns out, the fake data generation library has a non-trivial cost to it, which impacts the whole test. I changed things so that we’ll generate 10,000 users and then use bulk insert to send them over and over again. That means that the time that we measure is just the cost of sending the data over. With these changes, I got much nicer numbers:
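The shape of that change, in a hedged Python sketch: build a fixed pool of documents once, outside the timed region, then send the pool repeatedly so the timer covers only the send path. Here `make_user` is a hypothetical stand-in for the fake data library and `send_batch` is a hypothetical stand-in for the bulk insert call. Neither is a real RavenDB or Bogus API.

```python
import json
import random
import string
import time


def make_user(rng: random.Random) -> dict:
    """Hypothetical stand-in for a fake-data library call."""
    name = "".join(rng.choices(string.ascii_lowercase, k=12))
    return {"name": name, "email": name + "@example.com",
            "age": rng.randint(18, 90)}


def send_batch(docs: list, sink: list) -> None:
    """Hypothetical stand-in for bulk insert: serialize and 'send'."""
    for doc in docs:
        sink.append(json.dumps(doc))


def run(total: int, pool_size: int, seed: int = 1):
    rng = random.Random(seed)
    # Generate the pool once, before the clock starts.
    pool = [make_user(rng) for _ in range(pool_size)]
    sink = []
    start = time.perf_counter()
    sent = 0
    while sent < total:
        batch = pool[: min(pool_size, total - sent)]  # reuse the same docs
        send_batch(batch, sink)
        sent += len(batch)
    elapsed = time.perf_counter() - start
    return sent, elapsed, len(sink)


sent, elapsed, _ = run(total=100_000, pool_size=10_000)
print(f"sent {sent} docs in {elapsed:.3f}s (generation excluded)")
```

The trade-off is that the same 10,000 documents are written over and over, so this measures the write path rather than behavior on varied data.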

image

While this is going on, by the way, we have an interesting observation about the node while I’m doing this.

image

You can see that while we have two machines trying to push data in as fast as they can, we have a lot of spare capacity. This is key, actually. The issue is what the bottleneck is, and we already saw that the problem is probably on the client. We improved our performance by over 300% by simply reducing the cost of generating the data, not writing to RavenDB. As it turns out, we are also leaving a lot of performance on the table because we are doing this single threaded. A lot of the time is actually spent on the client side, doing serialization, etc.

I changed the client code to use multiple threads and tried it again. By the way, you might notice that the client code is… brute forced, in a way. I intentionally did everything in the most obvious way possible, caring not at all about the structure of the code. I just want it to work, so no error handling, nothing sophisticated at all here.

image

This is with both client machines setup to use 4 threads each to send the data. It’s time to dig a bit deeper and see what is actually going on here. The t3.large machine has 2 cores, and looking into what it is doing while it has 4 threads sending data is… instructive…

image

The m5d.large instance also has two cores, and is in a similar state:

image

Leaving aside exactly what is going on here (I’ll discuss this in more depth later in this post), it is fairly obvious that the issue here is on the client side, we are completely saturating the machine’s capabilities.

I created another machine to serve as a client, this time a c5.9xlarge, an instance that has 36 cores and is running a much faster CPU than the previous instances. This time, I used a single machine and just a single thread, and I got the following results:

image

And at the same time, the server resources utilization was:

image

Note that this is when we have a single thread doing the work… what happens when we increase the load?

Given the disparity between the client (36 cores) and the server (just 4), I decided to start slow and told the client to use just 12 threads to bulk insert the data. The result:

image

Now we are talking, but what about the server’s resources?

image

We got ourselves some spare capacity to throw around, it seems.

At this point, I decided to go all in and see what happens when I’m using all 36 cores for this. As it turns out, we can get faster, which is great, but the rise isn’t linear, unfortunately.

image

At this point, I mostly hit the limits. Regardless of how much load I put on the client, it wasn’t able to hit any higher than this. I decided to look at what the server is doing. Write speed for RavenDB is almost absolutely determined by the ACID nature of the database, we have to wait for the disk to confirm the write. Because this is such an important factor of our performance, we surface all of that information to you. In the database’s stats page, you can go into the IO Stats section, like so:

image

The first glance might be a bit confusing, I’ll admit. We tried to pack a lot of data into a single view.

image

The colors are important. Blue marks writes to the journal, which are the thing that would usually hold up the transaction commit. The green (data write / flush) and red (sync) are other types of disk operations, and they are shown here to allow you to see if there is any correlation. For example, if you have a big sync operation, it may suck up all the I/O bandwidth, and your journal writes will be slow. With this view, you can clearly correlate that information. The brighter the color, the bigger the write; the wider the write, the more time it took. I hope that this is enough to understand the gist of it.

Now, let’s zoom in. Here you can see a single write, for 124KB, that took 200ms.

image

Here is another one:

image

These are problematic for us, because we are stalling. We can’t really do a lot while we are waiting for the disk (actually, we can, we start processing the next tx, but there is a limit to that as well). That is likely causing us to wait when we read from the network and is likely the culprit. You might have noticed that both slow writes happened in conjunction with the sync (the red square below), which indicates that we might have latency because both operations go to the same location at the same time.

On the other hand, here is another section, where we have two writes very near one another and they are both very slow, without a concurrent sync. So the interference from the sync is a theory, not a proven fact.

image

We can go and change the gp2 drive we have to an io1 drive with provisioned IOPS (1536, same as the gp2). That would cost me 3 times as much, so let’s see if we can avoid this. Journals aren’t meant to be forever. They are used to maintain durability of the data, once we synced the data to disk, we can discard them.

I created an 8 GB io2 drive with 400 IOPS and attached it to the server instance and then set it up:

Here is what this ended up as:

image

Now, I’m going to setup the journals’ directory for this database to point to the new drive, like so:

And now we have a better separation of the journals and the data, let’s see what this will give us? Not much, it seems, I’m seeing roughly the same performance as before, and the IO stats tells the same story.

image

Okay, time to see what we can do when we change instance types. As a reminder, so far, my server instance was t3a.xlarge (4 cores, 16 GB). I launched a r5d.large instance (2 cores, 16 GB) and set it up with the same configuration as before.

  • 512 GB gp2 (1536 IOPS) for data
  • 8GB io2 (400 IOPS) for journals

Here is what I got when I started hammering the machine:

image

This is interesting, because you can see a few discrepancies:

  • The machine feels faster, much faster
  • We are now bottlenecked on CPU, but note the number of writes per second
  • This is when we reduced the number of cores by half!

That seems pretty promising, so I decided to switch instances again. This time to an i3en.xlarge instance (4 cores, 30GB, 2 TB NVMe drive). To be honest, I’m mostly interested in the NVMe drive.

Here are the results:

image

As you can see, we are running pretty smoothly with 90K – 100K writes per second sustained.

On the same i3en.xlarge system, I attached the two volumes (512GB gp2 and 8GB io2) with the same setup (journals on the io2 volume), and I’m getting some really nice numbers as well:

image

And now, the hour is nearing 4AM, and while I had a lot of fun, I think this is the time to close this post. The main factor in write performance for RavenDB is the disk, but we care a lot more about latency than throughput for these kinds of operations.

A single t3a.xlarge machine was able to hit a peak of 77K writes/second, and by changing the instance type and getting better IO, we were able to push that to 100K writes/sec. Our current benchmark is sitting at 138,000 writes/second, by the way, but it isn’t running on a virtual machine but on physical hardware. Probably the most important part of that machine is the fact that it has an NVMe drive (latency, again).

However, there is one question that still remains. Why did we have to spend so much compute power on generating the bulk insert operations? We had to hit the server from multiple machines or use 36 concurrent threads just to be able to push enough data so the server will sweat it.

To answer this, I’m going to do the Right Thing and look at the profiler results. The problem is in the client side, so let’s profile the client and see what is taking so much computation horse power. Here are the results:

image

Serialization is the major cost here. That is why we need to parallelize the work, otherwise, as we saw, RavenDB is basically going to sit idle.

The reason for this "issue" is that JSON.Net is a powerful library with many features, but it does have a cost. For bulk insert scenarios, you typically have a very well defined set of documents, and you don't need all this power. For this reason, RavenDB exposes an API that allows you to fully control how serialization works for bulk insert:

DocumentStore.Conventions.BulkInsert.TrySerializeEntityToJsonStream

You can use that to significantly speed up your insert processes.

time to read 5 min | 929 words

One of the measures that we don’t care much about is the startup time of RavenDB. Whether it takes 5 seconds or 15 seconds is of little concern to us. Whether it takes 15 seconds or 3 minutes, however, is something that we most certainly want to pay attention to.

One of our customers has an interesting use case. They are running on Azure machines and take full advantage of the multiple storage options that they have available there. In particular, their journals are using a premium storage disk but their data is residing on a large (and slow) disk. This is because they have quite a lot of data. One of their indexes just exceeded the 256GB mark, for example.

In their case, the startup time for RavenDB wasn’t acceptable. We investigated the issue and it turned out that the root of the problem was that RavenDB was running recovery on the database, re-applying recent transactions to make sure that we are consistent. This is expected, and in most cases, shouldn’t cause you to spend too much time at startup. By default, journals are going to be about 256MB if you are heavily loaded. But due to the customer’s access patterns, we saw transactions that included multiple GBs. We compress the transaction data before writing it to disk, so a single transaction (which cannot be split into multiple journal files) that takes multiple GBs compressed has likely written 10+ GB to the data file. We can tell that we don’t need to apply a transaction if it was already applied, but we need to read and analyze it first.

Multiply that by a number of databases and a number of indexes per database and you can see that restarting RavenDB begins to be something that you plan for. That is not where we want to be, obviously. Now, if we just had a crash, there is really no good way to avoid reapplying these transactions, but the problem was that we saw the same behavior without a crash. We saw this when doing a normal shutdown.

The basic problem was that RavenDB doesn’t track the location in the journal file that we know has been safely synced to disk. We only track things at the journal level. That means that on startup, we need to read through the entire journal file and figure out whether we need to apply each of the transactions inside it. We could track the last synced transaction location, of course. That would mean changing the on disk format at a very low level, something that we have the facilities to do, but is probably going to be awkward and cause compatibility concerns that I would rather not get into.

We also looked into changing the runtime behavior so we’ll be more likely to move to a new journal file after we synced the data in the previous one if it is too large. I was looking at this today and figured out something silly. Whenever we have a large transaction (where large is bigger than the max journal size) we need to ensure that we have enough space for the transaction. We do that by allocating a big enough file on disk. However, the way we did that was interesting.

image 

As you can see, if the current journal size is smaller than the minimum required size, we make sure to increase it. And because we want to avoid making too many file allocation calls, we try to ensure that we’ll use a size that is big enough that the journal file can be used for the next transaction as well. Now, consider the common scenario where the current journal size is 256MB (which is the default journal file limit) and the transaction size is 1.56 GB.

What will happen then is that we’ll get a journal size of 2GB, of which only 1.56GB is used. This is fine, and we’ll use the rest of the space, if we can. However, if the next transaction is too large (let’s say, 800MB), we’ll need to create a new file, whose size will be 1GB, etc.

It is when we sync the data to disk, that we really hit the bad behavior. We just synced the data to disk, so we can get rid of the journal file. But there are still 440MB of disk space allocated to the journal file, so we keep the journal around for the next transaction. And if we restart at that point, we’ll have to go through the entire 2 GB journal file to make sure that we haven’t missed anything. The fix, in this case, was stupidly easy:

image

All we need to do is to ensure that if the power of two size of the write to the journal is bigger than the max journal size, we’ll use the size of the write to the journal. That will create a journal that has just a single transaction on it. Most importantly, that means that once the data is synced to disk, there is no more space available on that journal file and Voron will immediately know that it can clear it. No big journal sticking around, no need to re-structure our on disk data or to go into tricky changes of behavior. I really love this change because it is succinct, simple and does the job.
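The sizing rule before and after the change can be sketched in a few lines of Python. This is an illustration built from the description in this post, not the Voron source. The function names are hypothetical, and the real code works in pages rather than raw bytes.

```python
MB = 1024 * 1024
MAX_JOURNAL_SIZE = 256 * MB  # the default journal file limit from the post


def next_power_of_two(n: int) -> int:
    """Smallest power of two that is >= n."""
    return 1 if n <= 1 else 1 << (n - 1).bit_length()


def journal_size_before(current: int, required: int) -> int:
    """Old rule: always round a too-large write up to a power of two."""
    if current >= required:
        return current
    return next_power_of_two(required)


def journal_size_after(current: int, required: int,
                       max_journal: int = MAX_JOURNAL_SIZE) -> int:
    """New rule: an oversized write gets a journal of exactly its size."""
    if current >= required:
        return current
    rounded = next_power_of_two(required)
    if rounded > max_journal:
        # No slack left over, so the journal can be cleared after a sync.
        return required
    return rounded


tx = 1597 * MB  # roughly the 1.56 GB transaction from the example
print(journal_size_before(256 * MB, tx) // MB)  # journal size in MB, old rule
print(journal_size_after(256 * MB, tx) // MB)   # journal size in MB, new rule
```

Writes that still fit under the max journal size keep the power of two rounding, so the small-transaction path makes no more allocation calls than before.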

time to read 1 min | 114 words

Federico is our go-to guy for all our performance issues. He talks about a lot of our challenges in this talk.

Micro-optimizations at the RavenDB vNext storage engine are critical to achieve 50K+ write requests per second on single node commodity hardware. In this talk we'll explore the use of the new hardware intrinsic introduced on CoreCLR 2.1 in the context of real-life critical path bottlenecks. We will touch on hardcore topics like CPU architecture and its effect on instruction latency and throughput, the effect of cache behaviors (hit/miss ratio, poisoning), prefetching, etc. The talk is aimed at engineers doing micro-optimization and high performance computing.

time to read 3 min | 512 words

I needed to use Bloom Filters, and I didn’t want to use the implementation we already have in RavenDB. That one is too tied up in our infrastructure to be easily used. So I found the Maybe.NET project. It is nice, but it doesn’t have a CoreCLR Nuget package. This meant that I had to go into the code and look at what is going on. And I started going, “nope, that isn’t the way I want it done”.

Now, to be clear, the code in the project is great. It is clear, obvious and idiomatic C#. It is also raising every red flag for inefficient code that I have built up over the past few years of making RavenDB faster. This is such a small sample that I thought it would make a good blog post, because I can explain what the code is doing, what changes I’m making there, and why. Before I can get to the Bloom Filter implementation, I need to use the hash function, and that was just a full stop for me. Let me show you what I mean. The key parts are below, and you can find the full code here.

image

This is a hash method; it produces several hashes for the purpose of the bloom filter. I underlined in red every time that this code allocates. As you can see, this allocates a lot.

There is also the fact that this accepts a generic object as a parameter and serializes that to a byte[]. I’m ignoring the allocations in that part of the code, but I’m assuming that they are significant. So let’s talk about how we can optimize this function.

Well, to start with, I’m going to decide that accepting an object is too high level. This is a hash function, the caller should give us bytes. Now, let’s see what impact that has on us, shall we?

Now this is much better. We don’t allocate anything in the ComputeHashes method and we give the compiler the chance to build really efficient code here. We can probably require that the maxHashValue be a power of two and avoid the mod operation in favor of bit shifting, but I’m not writing RavenDB here and worrying about every little thing. I’ll leave that part as an exercise for the reader.
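For readers who want to try that exercise, the usual trick is a bit mask: when the upper bound is a power of two, `h % m` and `h & (m - 1)` give the same result for non-negative `h`. Here is a small Python sketch of the idea. The function names are made up, and the real code would be C# operating on unsigned integers.

```python
def bucket_mod(hash_value: int, max_hash_value: int) -> int:
    """General case: works for any positive upper bound."""
    return hash_value % max_hash_value


def bucket_mask(hash_value: int, max_hash_value: int) -> int:
    """Power-of-two case: a single AND replaces the division."""
    if max_hash_value <= 0 or max_hash_value & (max_hash_value - 1):
        raise ValueError("max_hash_value must be a power of two")
    return hash_value & (max_hash_value - 1)


# The two agree for every non-negative hash when the bound is 2**k.
for h in (0, 1, 1023, 1024, 123_456_789, 2**32 - 1):
    assert bucket_mod(h, 1024) == bucket_mask(h, 1024)
print(bucket_mask(123_456_789, 1024))
```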

Now, let’s look at the actual Hash function, shall we?

There is quite a bit going on, but essentially, I’m using fixed to get the pointer from the span, then computing the hash 4 bytes at a time, then handling the remainder. There are no allocations, and this has far fewer instructions that actually need to run. Note that this would be a great place to stop and run unit tests to verify that I didn’t break something. I’m going to assume that I got it right and close this post; I still want to talk about the optimizations that are available for the bloom filter.

time to read 3 min | 436 words

image

You knew that this had to come. After talking about memory and CPU so often, we need to talk about actual I/O.

Did I mention that the cluster was set up by a drunk monkey? That term was raised in the office today, and we had a bunch of people fighting over who was the monkey. So to clarify things, here is the monkey:

image

If you have any issues with this being a drunk monkey, you are likely drunk as well. How about you set up a cluster that we can test things on?

At any rate, after setting things up and pushing the cluster, we started seeing some odd behaviors. It looked like the cluster was… acting funny.

One of the things we build into RavenDB is the ability to inspect its state easily. You can see it in the image on the right. In particular, you can see that we have a journal write taking 12 seconds to run.

It is writing 76KB to disk, at a rate of about 6KB per second. To compare, a 1984 modem would actually be faster. What is going on? As it turned out, the IOPS on the system were left in their default state, and we had less than 200 IOPS for the disk.

Did I mention that we are throwing production traffic and some of the biggest stuff we have on this thing? As it turns out, if you use all your IOPS burst capacity, you end up having to get your I/O through a straw.

This is excellent, since it exposed a convoy situation in some cases, and also gave us a really valuable lesson about things we should look at when we are investigating issues (the whole point of doing this “setup by a monkey” exercise).

For the record, here is what this looks like when you do things properly:

image

Why does this matter, by the way?

A journal write is how RavenDB writes to the transaction journal. This is absolutely critical to ensuring that the transaction ACID properties are kept.

It also means that when we write, we must wait for the disk to OK the write before we consider it completed. And that means that there were requests somewhere that were sitting there waiting for 12 seconds for a reply because the IOPS ran out.

time to read 2 min | 332 words

This blog post was a very interesting read. It talks about the ipify service and how it grew. It is an interesting post, but what really caught my eye was the performance details.

In particular, this talks about exceeding 30 billion requests per month. The initial implementation used Node, and couldn’t get more than 30 req/second or so, and the current version seems to be in Go and can handle about 2,000 requests a second.

I’m interested in performance, so I decided to see what my results would be. I very quickly wrote the simplest possible implementation using Kestrel and threw that on a t2.nano in AWS. I’m pretty sure that this is the equivalent of the dyno that he is using. I then spun up a t2.small instance and used that to bench the performance of the system. All I did was just run the code in release mode on the t2.nano, and here are the results:

image

So we get 25,000 requests a second and some pretty awesome latencies under very high load on a 1 CPU / 512 MB machine. For that matter, here are the top results from midway through on the t2.nano machine:

image

I should say that the code I wrote was quick and dirty in terms of performance. It allocates several objects per request and likely can be improved several times over, but even so, these are nice numbers. Primarily because there is actually so little that needs to be done here. To be fair, a t2.nano machine is meant for burst traffic and is not likely to be able to sustain such load over time, but even when throttled by an order of magnitude, it will still be faster than the Go implementation.
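To put the monthly and per-second figures on the same scale, here is a quick back-of-the-envelope conversion in Python. It assumes a 30-day month and traffic spread perfectly evenly, which real traffic never is, and it treats 25,000 req/sec as sustainable, which (as noted above) a t2.nano is not. The helper names are made up.

```python
SECONDS_PER_MONTH = 30 * 24 * 60 * 60  # assumes a 30-day month


def average_rate(requests_per_month: int) -> float:
    """Average requests per second if traffic were perfectly even."""
    return requests_per_month / SECONDS_PER_MONTH


def monthly_capacity(requests_per_second: int) -> int:
    """Requests per month at a constant per-second rate."""
    return requests_per_second * SECONDS_PER_MONTH


# 30 billion requests a month, expressed as an average per-second rate
print(round(average_rate(30_000_000_000)))

# What a constant 25,000 req/sec would add up to over a month
print(monthly_capacity(25_000))
```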
