Ayende @ Rahien

Oren Eini, aka Ayende Rahien, CEO of Hibernating Rhinos LTD, which develops RavenDB, a NoSQL Open Source Document Database.

time to read 2 min | 369 words

Last week, Amazon had an outage in its Frankfurt region. Here is what they had to say about it:

We can confirm increased API error rates and latencies for the EC2 APIs and connectivity issues for instances within a single Availability Zone (euc1-az1) within the EU-CENTRAL-1 Region, caused by an increase in ambient temperature within a subsection of the affected Availability Zone. Other Availability Zones within the EU-CENTRAL-1 Region are not affected by the issue and we continue to work towards resolving the issue.

That is of particular interest to us, because we have clients running RavenDB Cloud clusters in that region. Here are some of the alerts that we got when the incident happened:

(screenshot: the alerts we received during the incident)

This is marked as a Disaster level event, because we lost all connectivity with the node and none of the redundant watchdogs were able to bring it back up.

Our operations team looked at the issue, figured out that this was an AWS outage that impacted us, and then dropped the matter.

Wait a minute, dropped the matter?! What kind of a reaction is that from an operations team?

The right reaction. There wasn’t anything that we could have done, since the problem was out of our hands.

What does that mean for our customers? Well, they didn’t notice that anything happened. RavenDB was explicitly designed to survive just this sort of incident.

On the cloud, we are running each cluster with three nodes on separate availability zones. A single node going down is a non-event; the rest of the cluster will just make a note of it, and clients will transparently fail over to the other nodes.
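
For illustration, here is a minimal sketch of how a RavenDB client is typically set up with all of the cluster nodes (the URLs and database name below are placeholders): if the node it is currently talking to becomes unreachable, the client retries the request against another node from the list.

    using Raven.Client.Documents;

    // Minimal sketch: give the client every node in the cluster, not just one.
    // The URLs and database name are placeholders, not a real cluster.
    var store = new DocumentStore
    {
        Urls = new[]
        {
            "https://a.example-cluster.ravendb.cloud",
            "https://b.example-cluster.ravendb.cloud",
            "https://c.example-cluster.ravendb.cloud"
        },
        Database = "Orders"
    };
    store.Initialize();

    // If the node serving this session goes down, the request is routed to one
    // of the remaining nodes; the application code does not change.
    using var session = store.OpenSession();
    var order = session.Load<object>("orders/1-A");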

This behavior is the basis for a lot of operations inside of RavenDB and RavenDB Cloud. For example, we routinely put ourselves in this position whenever we do a maintenance run or whenever a user wants to scale their system up or down.

When the AWS outage ended, our internal systems brought the nodes back online and they were integrated into their clusters automatically. All in all, this was pretty much a non-event for everyone, aside from the fact that we suddenly got flooded with “the sky is falling” messages.

time to read 4 min | 660 words

We care a lot about the performance of RavenDB.

Aside from putting a lot of time and effort into ensuring that RavenDB uses optimal code, we also have a monitoring system in place to alert us when we observe a performance degradation. Usually those are fairly subtle issues, but we got an alert on the following scenario, where we saw a big degradation in this test.

The actual test in question does low level manipulation of Voron (RavenDB’s storage engine) and, as such, sits at the core of the performance hotspots we watch for.

Looking at the commits around that time frame, we quickly narrowed the fault down to the following changes:

(screenshot: the suspect commits)

A really important observation here, however, is that this method is not called in the test. So we were looking at whether this change caused a regression in code generation. I couldn’t believe that this was the case, to be honest.

Indeed, looking at the generated assembly, there was no difference between the two versions. But something caused the performance to degrade significantly enough in this test to raise all sorts of alarm bells.
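
As an aside, one way to get at the generated assembly for a hot method is BenchmarkDotNet’s disassembly diagnoser; the benchmark below is a made-up stand-in, not the actual Voron test:

    using BenchmarkDotNet.Attributes;
    using BenchmarkDotNet.Running;

    // Hypothetical benchmark standing in for the real test; attaching the
    // DisassemblyDiagnoser dumps the JITed assembly for each benchmark, so the
    // output of two builds can be compared side by side.
    [DisassemblyDiagnoser]
    public class PageTouchBenchmark
    {
        private readonly byte[] _page = new byte[8192];

        [Benchmark]
        public int TouchPage()
        {
            var sum = 0;
            for (var i = 0; i < _page.Length; i += 64)
                sum += _page[i];
            return sum;
        }
    }

    public class Program
    {
        public static void Main() => BenchmarkRunner.Run<PageTouchBenchmark>();
    }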

We started looking into a lot of details about the system, the usual things like checking for thermal throttling, etc.

We struck gold on this command: sudo smartctl --all /dev/nvme0n1

Take a look at the results:

    SMART overall-health self-assessment test result: FAILED!
    - NVM subsystem reliability has been degraded
    SMART/Health Information (NVMe Log 0x02, NSID 0x1)
    Critical Warning:                   0x04
    Temperature:                        35 Celsius
    Available Spare:                    100%
    Available Spare Threshold:          10%
    Percentage Used:                    115%
    Data Units Read:                    462,613,897 [236 TB]
    Data Units Written:                 2,100,668,468 [1.07 PB]
    Host Read Commands:                 10,355,495,581
    Host Write Commands:                9,695,954,131
    Controller Busy Time:               70,777

In other words, the disk is literally crying at us. This tells us that the drive has been in action for ~50 days of actual busy time and that it has gone beyond its design specs.

In particular, you can see that we wrote over a petabyte of data to the disk as part of our test case executions. This is a 500GB drive, which means that we filled it to capacity over 2,000 times before we hit this issue.

Once we go past the drive’s rated endurance (Percentage Used > 100%), the drive needs to do a lot more work, so we are seeing longer test times.
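
A quick back-of-the-envelope check of those numbers (NVMe data units are reported in thousands of 512-byte blocks, i.e. 512,000 bytes each):

    using System;

    // Sanity-checking the SMART numbers above.
    const double dataUnitBytes = 512_000;                     // 1,000 blocks of 512 bytes
    double written = 2_100_668_468 * dataUnitBytes;           // just over a petabyte
    double busyDays = 70_777 / 60.0 / 24.0;                   // ~49 days the controller was busy
    double driveFills = written / (500.0 * 1_000_000_000);    // ~2,150 full writes of a 500 GB drive

    Console.WriteLine($"{written / 1e15:F2} PB written, ~{busyDays:F0} busy days, ~{driveFills:F0} drive fills");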

I have to admit, this is the first time that I closed a bug by sending a PO to get new hardware.

time to read 2 min | 298 words

Yesterday I presented a webinar on using Machine Learning and Time Series in RavenDB. One of the things that I love about doing these webinars is that I get to field questions from the audience and have to really think on my feet.

In almost all cases, I think that I am able to provide good answers, and I usually accompany these with a live demo showing what is going on.

Yesterday, that wasn’t the case. During the demo, I managed to run into a fairly obscure bug very deep in the internals of RavenDB and got the wrong results. Then I got stuck on that and couldn’t figure out a proper workaround until just after the webinar concluded.

Hugely embarrassing, but at least I can take comfort that it wasn’t the first time that a live demo failed and probably won’t be the last time.

The good news, on the other hand, is that I created an issue to fix this problem. Today I made myself a cup of coffee and was about to dig into the problem when I realized that it had already been fixed. Time from opening the bug to it getting fixed: under 6 hours. That is without rushing it, mind you. I think that, given the turnaround time, it is a good thing overall that I ran into this.

The actual problem, for what it is worth, is that we had lazily evaluated a collection during its enumeration. However, when using GroupBy(), we effectively enumerated the collection twice, fetching different values each time. The second time around, we would get the last value from the collection, since it was lazily evaluated. You can check the pull request if you are that much of a nerd 🙂.
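
To give a flavor of how this class of bug tends to look (a contrived sketch, not RavenDB’s actual code): a lazy sequence that reuses a single mutable instance while yielding, combined with GroupBy() buffering the yielded references, means every buffered element ends up showing only the last value.

    using System;
    using System.Collections.Generic;
    using System.Linq;

    class Entry
    {
        public string Name = "";
        public int Value;
    }

    class Program
    {
        // Lazily produced sequence that reuses one Entry instance to avoid allocations.
        static IEnumerable<Entry> ReadEntries()
        {
            var entry = new Entry();
            for (var i = 1; i <= 4; i++)
            {
                entry.Name = "entry-" + i;
                entry.Value = i;
                yield return entry;          // yields the same reference every time
            }
        }

        static void Main()
        {
            // GroupBy buffers the yielded references while building its groups, so by
            // the time we read them, every element reflects the final mutation.
            foreach (var group in ReadEntries().GroupBy(e => e.Value % 2))
                foreach (var e in group)
                    Console.WriteLine($"key {group.Key}: {e.Name}");   // prints "entry-4" four times
        }
    }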

time to read 1 min | 103 words

You can now get a Grafana dashboard to monitor your RavenDB instances. It is fed by Telegraf 1.18.0+, which includes a RavenDB input plugin that exposes all the relevant details.
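
As a rough sketch (the URL is a placeholder; check the plugin documentation for the full set of options), the Telegraf side amounts to pointing the ravendb input at your node and adding an output that Grafana can read, Prometheus for example:

    # Minimal sketch of a telegraf.conf; the URL below is a placeholder for your own node.
    [[inputs.ravendb]]
      url = "https://localhost:8080"

    # Expose the collected metrics for Grafana to scrape (any supported output works).
    [[outputs.prometheus_client]]
      listen = ":9273"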

Here is how that looks:

(screenshot: the RavenDB Grafana dashboard)

This is part of the work we want to do in order to make it easier and smoother to operate your RavenDB clusters. It joins the work we do on the cluster dashboard as well as a host of other (mostly minor) changes.

time to read 1 min | 93 words

In this talk, Oren Eini, founder of RavenDB, is going to take apart a database engine on stage. We are going to inspect all the different pieces that make up an industrial-grade database engine, from the way the data is laid out on disk to how the database ensures that transactions are durable. We'll explore structures and algorithms such as B+Trees and write-ahead logs, discuss concurrency strategies, and see how different features of the database work together to achieve the end goals.

time to read 1 min | 70 words

From the very start, most of the RavenDB community communication was handled via the mailing list.

Some members of the community mentioned that this workflow is outdated and wanted to move to a more modern option. As a result, we opened up the GitHub Discussions board, and we welcome the community there as well.

Ask your questions, interact and let us know what you are doing with RavenDB.

time to read 3 min | 438 words

RavenDB has been using the Raft protocol for years now. In fact, we have written three or four different implementations of Raft along the way. I implemented Raft using pure message passing, on top of async RPC, and on top of TCP. I did that using the actor model, using direct parallel programming, and in the usual spaghetti mode as well.

The Raft paper is beautiful in how it explains a non-trivial problem in a way that is easy to grok, but implementing it still requires dealing with a number of subtleties. I want to discuss some of the ways to successfully implement it. Note that I’m assuming that you are familiar with Raft, so I won’t explain the protocol itself here.

A key problem with Raft implementations is that you have multiple concurrent things happening all at once, on different machines, and you always have the election timer waiting in the background. In order to deal with that, I divide the system into independent threads, each with its own task.

I’m going to talk specifically about the leader mode, which is usually the most complex aspect. In this mode, we have:

  • Leader thread – responsible for determining the current progress in the cluster.
  • Follower thread – once per follower – responsible for communicating with a particular follower.

In addition, we may have values being appended to our log concurrently with all of the above. The key here is that each follower thread will communicate with its own follower and push data to it. The overall structure of a follower thread is sketched below.

What is the idea? We have a dedicated thread that communicates with the follower. It will either ping the follower with an empty AppendEntries (every 1/3 of the election timeout) or send a batch of up to 50 entries to update the follower. Note that there is nothing here about the wider machinery of Raft; that isn’t the responsibility of the follower thread. The leader, on the other hand, listens to the notifications from the follower threads, as shown in the sketch below.

The idea is that each aspect of the system runs independently, and the only communication between them is that one can signal another that it did some work. We can then compute whether that work changed the state of the system.
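
Here is a minimal sketch of that structure, following the description above; all of the type and member names are hypothetical, and this is not the actual RavenDB code:

    using System;
    using System.Collections.Generic;
    using System.Linq;
    using System.Threading;
    using System.Threading.Tasks;

    record LogEntry(long Index, long Term, byte[] Data);

    class Follower
    {
        public long MatchIndex;                                   // highest index known to be on this follower
        public Func<long, IReadOnlyList<LogEntry>, Task> SendAppendEntriesAsync =
            (_, _) => Task.CompletedTask;                         // stand-in for the actual network call
    }

    class Leader
    {
        readonly List<LogEntry> _log = new();
        readonly List<Follower> _followers = new();
        readonly SemaphoreSlim _workDone = new(0);                // follower threads signal progress here
        readonly long _term = 1;
        readonly int _electionTimeoutMs = 300;
        long _commitIndex;

        // One of these runs per follower: heartbeat, or push a batch of up to 50 entries.
        async Task FollowerLoop(Follower follower, CancellationToken token)
        {
            var heartbeat = TimeSpan.FromMilliseconds(_electionTimeoutMs / 3.0);
            while (token.IsCancellationRequested == false)
            {
                var batch = _log.Where(e => e.Index > follower.MatchIndex).Take(50).ToList();
                await follower.SendAppendEntriesAsync(_term, batch);   // empty batch == heartbeat

                if (batch.Count > 0)
                {
                    follower.MatchIndex = batch[^1].Index;
                    _workDone.Release();                          // wake the leader thread
                    continue;                                     // keep draining the backlog
                }
                await Task.Delay(heartbeat, token);
            }
        }

        // The leader thread only reacts to signals and recomputes the cluster state.
        async Task LeaderLoop(CancellationToken token)
        {
            while (token.IsCancellationRequested == false)
            {
                await _workDone.WaitAsync(token);                 // some follower made progress

                // An entry is committed once a majority of the cluster (leader included) has it.
                var matched = _followers.Select(f => f.MatchIndex)
                    .Append(_log.Count == 0 ? 0 : _log[^1].Index) // the leader always has its own log
                    .OrderByDescending(i => i)
                    .ToList();
                var majority = matched[matched.Count / 2];
                if (majority > _commitIndex)
                    _commitIndex = majority;                      // applying entries happens asynchronously
            }
        }
    }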

Note that the code here is merely drafts, missing many details. For example, we aren’t sending the last commit index on AppendEntries, and committing the log is an asynchronous operation, since it can take a long time and we need to keep the system in operation.
