Ayende @ Rahien

Oren Eini aka Ayende Rahien CEO of Hibernating Rhinos LTD, which develops RavenDB, a NoSQL Open Source Document Database.

time to read 1 min | 88 words

I posted a few weeks ago about a performance regression in our metrics that we tracked down to the disk being exhausted.

We replaced the hard disk with a new one. Can you see what the results were?

image (1)

I’m posting this mostly because we were pretty sure that the disk was the problem, but couldn’t rule out that it was something else. Good to know that we were on track.

time to read 2 min | 319 words

We recently added support for running RavenDB on an Ubuntu machine (or Debian) using DEB files. I thought that I would post a short walkthrough of how you can install RavenDB on such a machine.

I’m running the entire process on a clean EC2 instance.

Steps beforehand, making sure that the firewall is set up appropriately:

image

Note that I’m opening up just the ports we need for actual running of RavenDB.

Next, go and fetch the relevant package. You can do that from the Download Page, where you can find the most up-to-date DEB file.

image

SSH into the machine and then we’ll need to download and install the package:

$ sudo apt-get update && sudo apt-get install libc6-dev -y
$ wget --content-disposition https://hibernatingrhinos.com/downloads/RavenDB%20for%20Ubuntu%2020.04%20x64%20DEB/51027
$ sudo dpkg -i ravendb_5.1.8-0_amd64.deb

This will download and install the RavenDB package, after making sure that the environment is properly set up for it.

Here is what this will output:

### RavenDB Setup ###
#
#  Please navigate to http://127.0.0.1:53700 in your web browser to complete setting up RavenDB.
#  If you set up the server through SSH, you can tunnel RavenDB setup port and proceed with the setup on your local.
#
#  For public address:    ssh -N -L localhost:8080:localhost:53700 ubuntu@34.235.129.104
#
#  For internal address:  ssh -N -L localhost:8080:localhost:53700 ubuntu@ip-172-31-22-131
#
###

RavenDB is installed, but we now need to configure it. For security, RavenDB defaults to listening on localhost only. However, we are now running it on a remote server. That is why the installer output gives you the port forwarding commands. We can exit SSH and run one of these commands, which lets us run the setup via secured port forwarding and set up a secured RavenDB instance in minutes.

time to read 5 min | 993 words

I mentioned in a previous post that I’m teaching a Cloud Computing course at university. That led to some good questions about established wisdom, which I have to field and really think about. One such question that I ran into was about the intersection of databases and the cloud.

One of the most important factors for database performance is the I/O rate that you can get. Let’s take a fairly typical off the shelf drive, shall we?

The drive costs less than 500$ US for a 2TB disk. It can write at close to 5GB/sec, with sustained writes sitting at 3GB/sec according to User Benchmark, and it is also rated to hit 1 million IOPS. That is a lot, and that is what you get for less than 500$.

On the other hand, a comparable drive would be an Azure P40, which costs 235.52$ per month for 2TB of disk space. It offers a stunning rate of 7,500 IOPS (with bursts of 30,000!). The write rate is 250MB/sec with bursts of 1GB/sec. The best you can get on Azure, though, is an Ultra disk, where a disk comparable to the on-premise option would cost you literally thousands per month (and would offer about a tenth of the performance).

In other words, the cloud option is drastically more costly. To be fair, we aren’t comparing the same thing at all. A cloud disk is more than just renting of the hardware. There is redundancy to consider, the ability to “move” the disk between instances, the ability to take snapshots and restore, etc.

A more comparable scenario would be to look at NVMe instances. If we take an L8sv2 instance on Azure, that gives us a 2TB NVMe drive with 400,000 IOPS and 2GB/sec throughput. That is at least within reach of the off-the-shelf disk I pointed out before. The cost? About 500$ per month. But now we are talking about a machine that has 8 cores and 64 GB of RAM.

The downside of NVMe instances is that the disks are transient. If there is a failure that requires us to stop and start the machine (basically, moving hosts), the data is lost. You can reboot the machine, but you cannot stop the cloud allocation of the machine without losing the data.

The physical hardware option is much cheaper, it seems. If we add everything around the disk, we are going to get somewhat different costs. I found a similar server to L8sv2 on Dell for about 7,000 $ US, for example. Pretty sure that you can get it for less if you know what you are doing, but it was my first try and it included 3.2 TB of enterprise grade NVMe drives.

Colocation pricing can run about 100$ a month (again, first search result, you can get it for less). If we spread the cost of the server over a single year, that means that the total monthly cost is roughly 685$. That is comparable to the cloud, actually, but doesn’t account for the fact that you can use the same server for much longer than a single year. It is also probably wasting a lot of money on bigger hardware. What you don’t get, which you probably want, is the envelope around that. The ability to say: “I want another server” (or ten), the ability to move and manage your resources easily, etc. And that is as long as you are managing just hardware resources.
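To make the arithmetic explicit, here is a minimal sketch in Python. The prices are the ones quoted in this post; the amortization periods (and the helper name `monthly_cost`) are my own assumptions for illustration.

```python
# Figures quoted in the post. The amortization periods below are assumptions.
SERVER_PRICE_USD = 7_000
COLOCATION_PER_MONTH_USD = 100
CLOUD_PER_MONTH_USD = 500  # approximate price quoted for the L8sv2 instance


def monthly_cost(amortization_months: int) -> float:
    """Server price spread over the given period, plus colocation."""
    return SERVER_PRICE_USD / amortization_months + COLOCATION_PER_MONTH_USD


for months in (12, 36, 60):
    print(f"{months:>2} months: about ${monthly_cost(months):,.0f} per month")
```

Over 12 months this lands within a couple of dollars of the figure above. Stretching the same server over three or five years is what makes the hardware look cheap, before counting any of the operational work around it.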

You don’t get any of the services or the expertise in running things. Given that even professional organizations can suffer devastating issues, you want to have an expert manage them, because an amateur handling that topic leads to problems.

A lot of the attraction of the cloud comes from a very simple reason. I don’t want to deal with all of that stuff. None of that is your competitive advantage and you would rather just pay and not think about that. The key for the success of the cloud is that globally, you are paying less (in time, effort and manpower) than taking the cost of managing it yourself.

There are two counterpoints here, though.

  • At some scale, it would make sense to move out from the cloud to your own hardware. Dropbox did that at some point, moving some of its infrastructure off the cloud to savings of over 75 million dollars. You don’t have to be at Dropbox size to benefit from having some of your own servers, but you do need to hit some tipping point before that would make sense.
  • StackOverflow is famously running on their own hardware, and is able to get great results out of that. I wonder how much the age of StackOverflow has to do with that, though.

The cloud is a pretty good abstraction, but it isn’t one that you get for free. There are a lot of scenarios where it makes a lot of sense to have some portions of your system outside of the cloud. The default of “everything is in the cloud”, however, makes a lot of sense, specifically because you don’t need to do complex (and costly) sizing computations. Once you have the system running and the load figured out, you can decide if it makes sense to move things to your own servers.

And, of course, this all assumes that we are talking about just the hardware. That is far from the case in today’s cloud. Cloud services are another really important aspect of what you get in the cloud. Consider the complexity of running a Kubernetes cluster, or setting up a system for machine vision or distributed storage or any of the things that the cloud providers have commoditized.

The decision of cloud usage is no longer a simple buy vs. rent but a much more complex one about where do you draw the line of what should be your core concerns and what should be handled outside of your purview.

time to read 3 min | 438 words

We are gearing up to a new release of RavenDB, and it is about time that I start talking about the new features. I thought I would start with one of our most requested features: Read only access to RavenDB.

This has been asked by enough customers that we decided to implement it, even though I don’t like the concept very much. One of the key aspects of RavenDB design is the notion that RavenDB is an application database, not a shared database. As such, we limit access per database and expect that pretty much all accesses to the database will have the same privilege level. After all, this is meant to be a single application, even if it is deployed as multiple processes / services.

From real world usage, this expectation is false. People want to isolate access even within the scope of a single application. Hence, the read only mode.

The documentation does a good job describing what read only mode is, but I wanted to give some additional background.

As usual, authentication to RavenDB is done using X509 certificates. When you define the certificate’s permissions, you can grant it read only access to a database (or databases). At this point, applications and users using this certificate will be limited to only reading from the database. That isn’t a big surprise, right?

The devil is in the details, however. A read only certificate can perform the following operations:

  • Load documents by id – the security boundary here is the entire database, there is no limit on a per document / collection.
  • Query documents using predefined indexes – note that you cannot deploy new indexes.
  • Query documents using auto generated indexes – this also implies that a query by a read only certificate can cause the database engine to create an index to answer the query. This is explicitly allowed when you are using a read only certificate.
  • Inspect the database state and its ongoing tasks – you can look at the tasks and their status, but things like the connection strings details are hidden.
  • Connect to a subscription and accept documents – this is a case where a read only certificate will modify the state of the database (by advancing what documents the subscription consumes). This is explicitly allowed since it is a likely scenario for read only certificates. Creating a subscription, on the other hand, is something that you’ll need to do with more permissions.

I’m not trying to give you the whole thing, you can read the documentation for that, but I am trying to give you some idea about the kind of use cases you can use read only certificates for.

time to read 2 min | 325 words

Yesterday I asked about dealing with livelihood detection of nodes running in AWS. The key aspect is that this needs to be simple to build and easy to explain.

Here are a couple of ways that I came up with, nothing ground breaking, but they do the work while letting someone else do all the heavy lifting.

Have a well known S3 bucket that each of the nodes will write an entry to. The idea is that we’ll have something like (filename –  value):

  • i-04e8d25534f59e930 – 2021-06-11T22:01:02
  • i-05714ffce6c1f64ad – 2021-06-11T22:00:49

The idea is that each node will scan the bucket and read through each of the files, getting the last seen time for all the nodes. We’ll consider all the nodes whose timestamp is within the last 1 minute to be alive and any other node is dead.  Of course, we’ll also need to update the node’s file on S3 every 30 seconds to ensure that other nodes know that we are alive.

The advantage here is that this is trivial to explain and implement and it can work quite well in practice.
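A minimal sketch of the liveness check in Python, under a few assumptions: a plain dict stands in for the contents of the S3 bucket (node id to last written timestamp), and the function name `alive_nodes` and the third, stale node id are made up for illustration. In a real exercise, the dict would be filled by listing and reading the objects in the bucket.

```python
from datetime import datetime, timedelta

# A node counts as alive if its heartbeat is at most one minute old.
ALIVE_WINDOW = timedelta(minutes=1)


def alive_nodes(heartbeats: dict, now: datetime) -> list:
    """Return the ids of nodes whose last heartbeat is within the window.

    `heartbeats` maps node id -> ISO 8601 timestamp. It stands in for the
    files in the bucket, one file per node.
    """
    alive = []
    for node_id, last_seen in heartbeats.items():
        if now - datetime.fromisoformat(last_seen) <= ALIVE_WINDOW:
            alive.append(node_id)
    return sorted(alive)


heartbeats = {
    "i-04e8d25534f59e930": "2021-06-11T22:01:02",
    "i-05714ffce6c1f64ad": "2021-06-11T22:00:49",
    "i-0hypothetical0dead": "2021-06-11T21:55:00",  # made-up stale node
}
now = datetime(2021, 6, 11, 22, 1, 30)
print(alive_nodes(heartbeats, now))
```

Each node would run this check periodically, and separately rewrite its own entry every 30 seconds. The sketch ignores clock skew between nodes, which is a real concern when the window is this short.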

The other option is to actually piggyback on top of the infrastructure that is dedicated to this sort of scenario. Create an elastic load balancer and set up a target group. On startup, the node will register itself to the target group and set up the health check endpoint. From this point on, each node can ask the target group to find all the healthy nodes.

This is pretty simple as well, although it requires significantly more setup. The advantage here is that we can detect more failure modes (a node that is up, but firewalled away, for example).
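Asking the target group for the healthy nodes could look roughly like the sketch below. It only filters a response shaped like the one returned by the `describe_target_health` call of the boto3 `elbv2` client; the sample response, the instance states in it, and the helper name `healthy_node_ids` are assumptions for illustration.

```python
def healthy_node_ids(response: dict) -> list:
    """Pick the ids of targets that the load balancer reports as healthy."""
    return sorted(
        description["Target"]["Id"]
        for description in response.get("TargetHealthDescriptions", [])
        if description["TargetHealth"]["State"] == "healthy"
    )


# With boto3 installed and credentials configured, the response would come from:
#   import boto3
#   response = boto3.client("elbv2").describe_target_health(
#       TargetGroupArn=target_group_arn)  # target_group_arn: your own ARN
# Hand-written sample with the same shape, so the sketch runs offline:
sample_response = {
    "TargetHealthDescriptions": [
        {"Target": {"Id": "i-04e8d25534f59e930", "Port": 8080},
         "TargetHealth": {"State": "healthy"}},
        {"Target": {"Id": "i-05714ffce6c1f64ad", "Port": 8080},
         "TargetHealth": {"State": "unhealthy"}},
    ]
}
print(healthy_node_ids(sample_response))
```

Here the load balancer does the probing, so the node code only has to expose a health check endpoint and read the result.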

Other options, such as having the nodes ping each other, are actually quite complex since they need to find each other. That leads to some level of service locator, but then you’ll have to avoid each node pinging all the other nodes, since that can get busy on the network.

time to read 2 min | 286 words

I’m teaching a course at university about cloud computing. That can be a lot of fun, but quite frustrating at times. The key issue for me is that I occasionally need to provide students with some way to do something that I know how to do properly, but I can’t.

Case in point, assuming that I have a distributed cluster of nodes, and we need to detect what nodes are up or down, how do you do that?

With RavenDB, we assign an observer to the cluster whose job is to do health monitoring. I can explain that to the students, but I can’t expect them to utilize this technique in their exercises, there is too much detail there. The focus of the lesson or exercise is not to build a distributed system but to make use of one, after all.

As a rule, I try to ensure that all projects that we are working on can be done in under 200 lines of Python code. That puts a hard limit to the amount of behavior I can express. Because of that, I find myself looking for ways to rely on existing infrastructure to deal with the situation. 

Each node is running the same code, and they are set up so they can talk to one another, if needed. It is important that all the live nodes converge to agree on the active nodes in relatively short order.

The task is to find the list of active nodes in a cluster, where nodes may go up or down dynamically. We are running in AWS cloud so you can use its resources, how would you do that?

The solution should be as simple as possible and easy to explain to students.

time to read 5 min | 885 words

In the database field and information retrieval in general, there is a very common scenario. I have a list of (sorted) integers that I want to store, and I want to do that in as efficient a manner as possible. There are dozens of methods to do this and this is a hot topic for research. This is so useful because there are so many places where you can operate on a sorted integer list and gain massive benefits. Unlike generic compression routines, we can usually take advantage of the fact that we understand the data we are trying to work with and get better results.

The reason I need to compress integers (actually, int64 values) is that I’m trying to keep track of matches for some data, so the integers that I’m tracking are actually file offsets for users’ data inside of Voron. That leads to a few different scenarios that I have to deal with:

  • There is a single result
  • There is a reasonable number of results
  • There is a boatload of results

I’m trying to figure out the best way to store the latter two options in as efficient a manner as possible.

The first stop was Daniel Lemire’s blog, naturally, since he has written about this extensively. I looked at the following schemes: FastPFor and StreamVByte. They have somewhat different purposes, but basically, FastPFor uses a bit stream while StreamVByte uses a byte oriented mode. Theoretically speaking, you can get a better compression rate from FastPFor, but StreamVByte is supposed to be faster. Another integer compression system comes from the Gorilla paper from Facebook. That is a bigger scheme, which includes compression of time series values. However, part of that scheme talks about how you can compress integers (they use that to store the ticks of a particular operation). We are actually using that for the time series support inside of RavenDB.

I’m not going to cover that in depth; here is the paper on Gorilla compression, and the relevant description is in section 4.1.1. Suffice to say that they are using a bit stream and a delta of deltas computation. Basically, if you keep getting values that are the same distance apart, you don’t need to record all the values, you can compute them naturally. In the best case scenario, Gorilla compression needs a single bit per value, assuming the results are spaced similarly.
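To illustrate the delta of deltas idea (and only the idea, this is not Gorilla’s actual bit layout), here is a small Python sketch. The helper names and the sample values are made up for illustration.

```python
def deltas(values):
    """Differences between consecutive values."""
    return [b - a for a, b in zip(values, values[1:])]


def delta_of_deltas(values):
    """Differences between consecutive deltas."""
    return deltas(deltas(values))


evenly_spaced = [1000, 1060, 1120, 1180, 1240]
jittery = [1000, 1060, 1125, 1180, 1241]

print(delta_of_deltas(evenly_spaced))  # [0, 0, 0]
print(delta_of_deltas(jittery))        # [5, -10, 6]
```

When the values are evenly spaced, every delta of deltas is zero, and a zero is what can be recorded with a single bit. Any jitter produces non-zero values that have to be written out in full.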

For my purposes, I want to get as high a compression rate as possible, and I need to store just the list of integers. The problem with Gorilla compression is that if we aren’t getting numbers that are the same distance apart, we need to record the amount by which they differ. That means that we’ll need a minimum of 9 bits per value. That adds up quickly, sadly.

On the other hand, with PFor, there is a different system. PFor computes the maximum number of bits required for a batch of integers, and then records just those bits for each value. I found the Binary Packing section (2.6) in this paper to be the clearest explanation of how that works exactly. The problem with PFor, however, is that if you have a single large outlier, you’ll waste a lot of bits unnecessarily.
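The outlier problem is easy to see with a little arithmetic. The following Python sketch only computes sizes for plain binary packing of non-negative integers; it is a simplified illustration, not the FastPFor implementation, and the helper names and sample batches are assumptions.

```python
def bits_needed(batch):
    """Width, in bits, needed to store the largest value in the batch."""
    return max(max(batch).bit_length(), 1)


def packed_size_bits(batch):
    """Every value in the batch is stored using the same width."""
    return bits_needed(batch) * len(batch)


small = [3, 7, 1, 5] * 32                # 128 values, all below 8
with_outlier = small[:-1] + [1_000_000]  # same batch, one large value

print(packed_size_bits(small))         # 128 * 3 = 384
print(packed_size_bits(with_outlier))  # 128 * 20 = 2560
```

A single large value forces all 128 entries to use 20 bits instead of 3.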

I decided to see if I can do something about that and created an encoder that works on batches of 128 integers at a time. This encoder will:

  • Check the maximum number of bits required to record the deltas of the integers. That alone already saves us a lot.
  • Then we check the top and bottom halves of the batch, to see if we’ll get a benefit from recording them separately. A single large value (or a group of them) that is localized to a part of the batch will be recorded independently in this case.
  • Finally, instead of only recording the meaningful bit ranges, we’ll also analyze the batch we get further. The idea is to try to find ranges within the batch that have the same distance from one another. We can encode those as repetitions instead of each independent value. That can end up saving a surprising amount of space.
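The steps above can be sketched as size estimates in Python. This is not the actual encoder: it ignores headers and other bookkeeping, and the helper names and sample deltas are assumptions made for illustration.

```python
from itertools import groupby


def bits_needed(values):
    """Width, in bits, needed to store the largest value."""
    return max(max(values).bit_length(), 1)


def cost_whole(deltas):
    """Bits used when the whole batch shares one width."""
    return bits_needed(deltas) * len(deltas)


def cost_split(deltas):
    """Bits used when each half of the batch gets its own width."""
    mid = len(deltas) // 2
    return cost_whole(deltas[:mid]) + cost_whole(deltas[mid:])


def runs(deltas):
    """Collapse consecutive equal deltas into (delta, count) pairs."""
    return [(value, len(list(group))) for value, group in groupby(deltas)]


# 128 deltas, with a single large outlier at the very end of the batch.
deltas = [5, 9, 3, 12] * 31 + [5, 9, 3, 70_000]
print(cost_whole(deltas))  # 128 * 17 = 2176
print(cost_split(deltas))  # 64 * 4 + 64 * 17 = 1344

# Values that are all the same distance apart collapse into a single run.
print(runs([8192] * 128))  # [(8192, 128)]
```

Splitting keeps the outlier from inflating the first half, and a run of identical deltas can be recorded as one pair instead of 128 separate values.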

You can look at the results of my research here. I’ll caution you that this is raw, but the results are promising. I’m able to beat (in terms of compression rate) the standard PFor implementation by a bit, with a lot less code.

I’m also looking at a compression rate of 30% – 40% for realistic data sets. And if the data is set up just right and I’m able to take advantage of the repeated delta optimization, I can pack things real tight.

Current numbers say that I can push upward of 10,000 int64 values into an 8KB buffer without any repeated deltas. It goes to just under 500,000 int64 values in an 8KB buffer if I can take full advantage of the deltas.
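To put those numbers in terms of bits per value, here is a quick calculation in Python. It assumes that 8KB means 8,192 bytes and uses the rounded counts quoted above; the helper name is made up.

```python
BUFFER_BITS = 8 * 1024 * 8  # an 8KB buffer, expressed in bits


def bits_per_value(count: int) -> float:
    """Average number of bits available per value in a full buffer."""
    return BUFFER_BITS / count


for label, count in (("no repeated deltas", 10_000),
                     ("full use of repeated deltas", 500_000)):
    print(f"{label}: {bits_per_value(count):.2f} bits per value, "
          f"compared to 64 for a raw int64")
```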

The reason I mention the delta so often is that it is very likely that I’ll record values that are roughly the same size, so we’ll get offsets that are the same distance from one another. In that case, my encoder goes to town and the compression rate is basically crazy.

This is a small piece of a much larger work, but this is the first time in a while that I got to code at Voron’s level. This is fun.

time to read 2 min | 369 words

Last week, Amazon had an outage in its Frankfurt region. Here is what they had to say about it:

We can confirm increased API error rates and latencies for the EC2 APIs and connectivity issues for instances within a single Availability Zone (euc1-az1) within the EU-CENTRAL-1 Region, caused by an increase in ambient temperature within a subsection of the affected Availability Zone. Other Availability Zones within the EU-CENTRAL-1 Region are not affected by the issue and we continue to work towards resolving the issue.

That is of particular interest to us, because we have clients running RavenDB Cloud clusters on that region. Here are some of the alerts that we got when the incident happened:

image

This is marked as a Disaster level event, because we lost all connectivity with the node and none of the redundant watchdogs were able to bring it back up.

Our operations team looked at the issue, figured out that this is an AWS outage that impacted us and then dropped the matter.

Wait a minute, dropped the matter?! What kind of a reaction is that from an operations team?

The right reaction. There wasn’t anything that we could have done, since the problem was out of our hands.

What does that mean for our customers? Well, they didn’t notice that anything happened. RavenDB was explicitly designed to survive just this sort of incident.

On the cloud, we are running each cluster with three nodes on separate availability zones. A single node going down is a non event, the rest of the cluster will just make a note of that and clients will transparently failover to the other nodes.

This behavior is the basis for a lot of operations inside of RavenDB and RavenDB Cloud. For example, we routinely put ourselves in this position, whenever we do a maintenance run or whenever a user wants to scale their systems up or down.

When the AWS outage ended, our internal systems brought the nodes back online and they were integrated into their clusters automatically. All in all, that was pretty much a non event for everyone, except for the fact that we suddenly got flooded with “the sky is falling” messages.

time to read 4 min | 660 words

image (2)

We care a lot about the performance of RavenDB.

Aside from putting a lot of time and effort into ensuring that RavenDB uses optimal code, we also have a monitoring system in place to alert us if we observe a performance degradation. Usually those are fairly subtle issues, but we got an alert on the following scenario. As you can see, we are seeing a big degradation of this test.

The actual test in question is doing low level manipulation of Voron (RavenDB’s storage engine), and as such, stands at the core of the performance hotspots that we watch for.

Looking at the commits around that time frame, we quickly narrowed the fault down to the following changes:

image

A really important observation here, however, is that this method is not called in the test. So we were looking at whether this change caused a regression in code generation. I couldn’t believe that this was the case, to be honest.

Indeed, looking at the generated assembly, there was no difference between the two versions. But something caused the performance of this test to degrade significantly enough that it raised all sorts of alarm bells.

We started looking into a lot of details about the system, the usual things like checking for thermal throttling, etc.

We struck gold on this command: sudo smartctl --all /dev/nvme0n1

Take a look at the results:

    SMART overall-health self-assessment test result: FAILED!
    - NVM subsystem reliability has been degraded
    SMART/Health Information (NVMe Log 0x02, NSID 0x1)
    Critical Warning:                   0x04
    Temperature:                        35 Celsius
    Available Spare:                    100%
    Available Spare Threshold:          10%
    Percentage Used:                    115%
    Data Units Read:                    462,613,897 [236 TB]
    Data Units Written:                 2,100,668,468 [1.07 PB]
    Host Read Commands:                 10,355,495,581
    Host Write Commands:                9,695,954,131
    Controller Busy Time:               70,777

In other words, the disk is literally crying at us. This tells us that the drive has been in action for ~50 days of actual activity and that it has gone beyond its design specs.

In particular, you can see that we wrote over a petabyte of data to the disk as part of our test case executions. This is a 500GB drive, which means that we filled it to capacity over 2,000 times before we hit this issue.
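The numbers can be reproduced from the SMART output with a bit of arithmetic, shown here in Python. The sketch relies on two facts about NVMe health logs: a data unit is 1,000 blocks of 512 bytes, and the controller busy time is reported in minutes. The 500GB capacity is taken as 500 × 10^9 bytes, which is an assumption.

```python
DATA_UNIT_BYTES = 512 * 1000            # one NVMe data unit = 1,000 x 512 bytes
DRIVE_CAPACITY_BYTES = 500 * 10**9      # assuming 500GB means 500 * 10^9 bytes

data_units_written = 2_100_668_468      # "Data Units Written" from the output
controller_busy_minutes = 70_777        # "Controller Busy Time" from the output

bytes_written = data_units_written * DATA_UNIT_BYTES
full_drive_writes = bytes_written / DRIVE_CAPACITY_BYTES
busy_days = controller_busy_minutes / 60 / 24

print(f"written: {bytes_written / 10**15:.3f} PB")
print(f"full drive writes: {full_drive_writes:,.0f}")
print(f"controller busy: {busy_days:.1f} days")
```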

Once we hit this capacity (Percentage Used is > 100%), the drive needs to do a lot more work, so we are seeing longer test times.

This is the first time that I closed a bug by sending a PO to get new hardware, I have to admit.


