Ayende @ Rahien

Oren Eini aka Ayende Rahien CEO of Hibernating Rhinos LTD, which develops RavenDB, a NoSQL Open Source Document Database.

Get in touch with me:

oren@ravendb.net

+972 52-548-6969

Posts: 7,198 | Comments: 50,268

Privacy Policy Terms
filter by tags archive
time to read 3 min | 509 words

Ransomware, Cyber, Crime, Attack, Malware, HackerNewsBlur (a personal news aggregator) suffered from a data breech / ransomware “attack”. I’m using the term “attack” here in quotes because this is the equivalent to having your car broken into after you left it with the engine running with the keys inside in a bad part of town.

As a result of the breech, users’ data including personal RSS feeds, access tokens for social media, email addresses and other sundry items of various import. It looks like about 250GB of data was taken hostage, by the way.

The explanation about what exactly happened is really interesting, however.

NewsBlur moved their MongoDB instance from its own server to a container. Along the way, they accidently (looks like a Docker default configuration) opened the MongoDB port to the whole wide world. By default, MongoDB will only listen to the localhost, in this case, I think that from the perspective of MongoDB, it was listening to the local port, it is Docker infrastructure that did the port forwarding and tied the public port to the instance. From that point on, it was just a matter of time. It apparently took two hours or so for some automated script to run into the welcome mat and jump in, wreak havoc and move on.

I’m actually surprised that it took so long. In some cases, machines are attacked in under a minute from showing up on the public internet. I used the term bad part of town earlier, but it is more accurate to say that the entire internet is a hostile environment and should be threated as such.

That lead to the next problem. You should never assume that you are running in anywhere else. In the case above, we have NewsBlur assuming that they are running on a private network where only the internal servers can access. About a year ago, Microsoft had a similar issue, they exposed an Elastic cluster that was supposed to be on an internal network only and lost 250 million customer support records.

In both cases, the problem was lack of defense in depth. Once the attacker was able to connect to the system, it was game over. There are monitoring solutions that you can use, but in general, the idea is that you don’t trust your network. You authenticate and encrypt all the traffic, regardless of where you are running it. The additional encryption cost is not usually meaningful for typical workloads (even for demanding workloads), given that most CPUs have dedicated encryption instructions.

When using RavenDB, we have taken the steps to ensure that:

  • It is simple and easy to run in a secure mode, using X509 client certificate for authentication and all network communications are encrypted.
  • It is hard and complex to run without security.

If you run the RavenDB setup wizard, it takes under two minutes to end up with a secured solution, one that you can expose to the outside world and not worry about your data taking a walk.

time to read 4 min | 714 words

imageA RavenDB database can reside on multiple nodes in the cluster. RavenDB uses a multi master protocol to handle writes. Any node holding the database will be able to accept writes. This is in contrast to other databases that use the leader/follower model. In such systems, only a single instance is able to accept writes at any given point in time.

The node that accepted the write is responsible for disseminating the write to the rest of the cluster. This should work even if there are some breaks in communication, mind, which makes things more interesting.

Consider the case of a write to node A. Node A will accept the write and then replicate that as part of its normal operations to the other nodes in the cluster.

In other words, we’ll have:

  • A –> B
  • A –> C
  • A –> D
  • A –> E

In a distributed system, we need to be prepare for all sort of strange failures. Consider the case where node A cannot talk to node C, but the other nodes can. In this scenario, we still expect node C to have the full data. Even if node A cannot send the data to it directly.

The simple solution would be to simply have each node replicate the data it get from any source to all its siblings. However, consider the cost involved?

  • Write to node A (1KB document) will result in 4 replication (4KB)
  • Node B will replicate to 4 nodes (including A, mind), so that it another 4KB.
  • Node C will replicate to 4 nodes, so that it another 4KB.
  • Node D will replicate to 4 nodes, so that it another 4KB.
  • Node E will replicate to 4 nodes, so that it another 4KB.

In other words, in a 5 nodes cluster, a single 1KB write will generate 20KB of network traffic, the vast majority of it totally unnecessary.

There are many gossip algorithms, and they are quite interesting, but they are usually not meant for a continuous stream of updates. They are focus on robustness over efficiency.

RavenDB takes the following approach, when a node accept a write from a client directly, it will send the new write to all its siblings immediately. However, if a node accept a write from replication, the situation is different. We assume that the node that replicate the document to us will also replicate the document to other nodes in the cluster. As such, we’ll not initiate replication immediately. What we’ll do, instead, it let all the nodes that replicate to us, that we got the new document.

If we don’t have any writes on the node, we’ll check every 15 seconds whatever we have documents that aren’t present on our siblings. Remember that the siblings will report to us what documents they currently have, proactively. There is no need to chat over the network about that.

In other words, during normal operations, what we’ll have is node A replicating the document to all the other nodes. They’ll each inform the other nodes that they have this document and nothing further needs to be done. However, in the case of a break between node A and node C, the other nodes will realize that they have a document that isn’t on node C, in which case they’ll complete the cycle and send it to node C, healing the gap in the network.

I’m using the term “tell the other nodes what documents we have”, but that isn’t what is actually going on. We use change vectors to track the replication state across the cluster. We don’t need to send each individual document write to the other side, instead, we can send a single change vector (a short string) that will tell the other side all the documents that we have in one shot.  You can read more about change vectors here.

In short, the behavior on the part of the node is simple:

  • On document write to the node, replicate the document to all siblings immediately.
  • On document replication, notify all siblings about the new change vector.
  • Every 15 seconds, replicate to siblings the documents that they missed.

Just these rules allow us to have a sophisticated system in practice, because we’ll not have excessive writes over the network but we’ll bypass any errors in the network layer without issue.

time to read 5 min | 812 words

Earlier this week, we have released RavenDB 5.2 to the world. This is an exciting release for a few reasons. We have a bunch of new features available and as usual, the x.2 release is our LTS release.

RavenDB 5.2 is compatible with all 4.x and 5.x releases, you can simply update your server binaries and you’ll be up and running in no time. RavenDB 4.x clients can talk to RavenDB 5.2 server with no issue. Upgrading in a cluster (from 4.x or 5.x versions) can be done using rolling update mode and mixed version clusters are supported (some features will not be available unless a majority of the cluster is running on 5.2, though).

Let’s start by talking about the new features, they are more interesting, I’ll admit.

I’m going to be posting details about all those features, but I want to point out what is probably the most important aspect of RavenDB, even beyond the feature, OLAP ETL. RavenDB 5.2 is a LTS release.

Long Term Support release

LTS stands for Long Term Support, we support such releases for an extended period of time and they are recommended for production deployments and long term projects.

Our previous LTS release, RavenDB 4.2, was released in May 2019 and is still fully supported. Standard support for RavenDB 4.2 will lapse in July 2022 (a year from now), we’ll offer extended support for users who want to use that version afterward.

We encourage all RavenDB users to migrate to RavenDB 5.2 as soon as they are able.

OLAP ETL

This new feature deserve its own post (which will show up next week), but I wanted to say a few words on that. RavenDB is meant to serve as an application database, serving OLTP workloads. It has some features aimed at reporting, but that isn’t the primary focus.

For almost a decade, RavenDB has supported native ETL process that will push data on the fly from RavenDB to a relational database. The idea is that you can push the data into your reporting solution and continue using that infrastructure.

Nowadays, people are working with much larger dataset and there is usually not a single reporting database to work with. There are data lakes (and presumably data seas and oceans, I guess) and the cloud has a much higher presence in the story.  Furthermore, there is another interesting aspect for us. RavenDB is often deployed on the edge, but there is a strong desire to see what is going across the entire system. That means pushing data from all the edge locations into the cloud and offering reports based on that.

To answer those needs, RavenDB 5.2 has the OLAP ETL feature. At its core, it is a simple concept. RavenDB allows you to define a script that will transform your data into a tabular format. So far, that is very much the same as the SQL ETL. The interesting bit happens afterward. Instead of pushing the data into a relational database, we’ll generate a set of Parquet files (columnar data store) and push them to a cloud location.

On the cloud, you can use your any data lake solution to process such file, issue reports, etc. For example, you can use Presto or AWS Athena to run queries over the uploaded files.

You can define the ETL process in a single location or across your entire fleet of databases on the edge, they’ll push the data to the cloud automatically and transparently. All the how, when, failure management, reties and other details are handled for you. RavenDB is also capable of integrating with custom solution, such as generating a one time token on each upload (no need to expose your credentials on the edge).

The end result is that you have a simple and native option to run large scale queries across your data, even if you are working in a widely distributed system. And even for those who run just a single cluster, you have a wider set of options on how to report and aggregate your data.

time to read 1 min | 88 words

I posted a few weeks ago about a performance regression in our metrics that we tracked down the to the disk being exhausted.

We replaced the hard disk to a new one, can you see what the results were?

image (1)

This is mostly because we were pretty sure that this is the problem, but couldn’t rule out that this was something else. Good to know that we were on track.

time to read 2 min | 319 words

We recently added support for running RavenDB on a Ubuntu machine (or Debian) using DEB files. I thought that I would post a short walkthrough of how you can install RavenDB on such a machine.

I’m running the entire process on a clean EC2 instance.

Steps beforehand, making sure that the firewall is setup appropriately:

image

Note that I’m opening up just the ports we need for actual running of RavenDB.

Next is to go and fetch the relevant package, you can do that from the Download Page, where you can find the most up to date DEB file.

image

SSH into the machine and then we’ll need to download and install the package:

$ sudo apt-get update && sudo apt-get install libc6-dev –y 
$ wget --content-disposition https://hibernatingrhinos.com/downloads/RavenDB%20for%20Ubuntu%2020.04%20x64%20DEB/51027
$ sudo dpkg -i ravendb_5.1.8-0_amd64.deb

This will download and install the RavenDB package, after making sure that the environment is properly setup for it.

Here is what this will output:

### RavenDB Setup ###
#
#  Please navigate to http://127.0.0.1:53700 in your web browser to complete setting up RavenDB.
#  If you set up the server through SSH, you can tunnel RavenDB setup port and proceed with the setup on your local.
#
#  For public address:    ssh -N -L localhost:8080:localhost:53700 ubuntu@34.235.129.104
#
#  For internal address:  ssh -N -L localhost:8080:localhost:53700 ubuntu@ip-172-31-22-131
#
###

RavenDB is installed, but we now need to configure it. For security, RavenDB default to listening to the local host only, however, we are now running it on a remote server. tat is why the installer output gives you the port forwarding command. We can exit SSH and run these commands, getting us to run the setup via secured port forwarding and setting up a secured RavenDB instance in minutes.

time to read 5 min | 993 words

I mentioned that I’m teaching a Cloud Computing course at university in a previous post. That lead to some good questions that I have to field about established wisdom that I have to really think about. One such question that I run into was about the intersection of databases and the cloud.

One of the most important factors for database performance is the I/O rate that you can get. Let’s take a fairly typical off the shelf drive, shall we?

Cost of the drive is less than 500 $ US for a 2TB disk and it can write at close 5GB / sec with sustained writes sitting at 3GB /sec at  User Benchmark, it is also rated to hit 1 million IOPS. That is a lot. And that is when you spend less than 500$ on that.

On the other hand, a comparable drive would be Azure P40, which cost 235.52$ per month for 2TB of disk space. It also offers a stunning rate of 7,500 IOPS (with bursts of 30,000!). The write rate is 250MB/sec with bursts of 1GB/sec. The best you can get on Azure, though, is an Ultra disk. Where a comparable disk to the on premise option would cost you literally thousands per month (and would be about a tenth of the performance).

In other words, the cloud option is drastically more costly. To be fair, we aren’t comparing the same thing at all. A cloud disk is more than just renting of the hardware. There is redundancy to consider, the ability to “move” the disk between instances, the ability to take snapshots and restore, etc.

A more comparable scenario would be to look at NVMe instances. If we’ll take L8sv2 instance on Azure, that gives us a 2TB NVMe drive with 400,000 IOPS and 2GB/sec throughput. That is at least within reach of the off the shelf disk I pointed out before. The cost? About 500$ per month. But now we are talking about a machine that has 8 cores and 64 GB of RAM.

The downside of NVMe instances is that the disk are transient. If there is a failure that requires us to stop and start the machine (basically, moving hosts), that would mean that the data is lost. You can reboot the machine, but not stop the cloud allocation of the machine without losing the data.

The physical hardware option is much cheaper, it seems. If we add everything around the disk, we are going to get somewhat different costs. I found a similar server to L8sv2 on Dell for about 7,000 $ US, for example. Pretty sure that you can get it for less if you know what you are doing, but it was my first try and it included 3.2 TB of enterprise grade NVMe drives.

Colocation pricing can run about 100$ a month (again, first search result, you can get for less) and that means that the total monthly cost is roughly 685$. That is comparable to the cloud, actually, but doesn’t account for the fact that you can use the same server for much longer than a single year. It is also probably wasting a lot of money on bigger hardware. What you don’t get, which you probably want, is the envelope around that. The ability to say: “I want another server” (or ten), the ability to move and manage your resources easily, etc. And that is as long as you are managing just hardware resources.

You don’t get any of the services or the expertise in running things. Given that even professional organizations can suffer devastating issues, you want to have an expert manage than, because an armature handling that topic lead to problems. 

A lot of the attraction of the cloud comes from a very simple reason. I don’t want to deal with all of that stuff. None of that is your competitive advantage and you would rather just pay and not think about that. The key for the success of the cloud is that globally, you are paying less (in time, effort and manpower) than taking the cost of managing it yourself.

There are two counterpoints here, though.

  • At some scale, it would make sense to move out from the cloud to your own hardware. Dropbox did that at some point, moving some of its infrastructure off the cloud to savings of over 75 million dollars. You don’t have to be at Dropbox size to benefit from having some of your own servers, but you do need to hit some tipping point before that would make sense.
  • StackOverflow is famously running on their own hardware, and is able to get great results out of that. I wonder how much the age of StackOveflow has to do with that, though.

The cloud is a pretty good abstraction, but it isn’t one that you get for free. There are a lot of scenarios where it makes a lot of sense to have some portions of your system outside of the cloud. The default of “everything is in the cloud”, however, make a lot of sense. Specifically because you don’t need to do complex (and costly) sizing computations. Once you have the system running and the load figured out, you can decide if it make sense to move things to your own severs.

And, of course, this all assumes that we are talking about just the hardware. That is far from the case in today’s cloud. Cloud services are another really important aspect of what you get in the cloud. Consider the complexity of running a  Kubernetes cluster, or setting up a system for machine vision or distributed storage or any of the things that the cloud providers has commoditized.

The decision of cloud usage is no longer a simple buy vs. rent but a much more complex one about where do you draw the line of what should be your core concerns and what should be handled outside of your purview.

time to read 3 min | 438 words

We are gearing up to a new release of RavenDB, and it is about time that I’ll start talking about the new features. I thought to start with what one of our most requested features: Read only access to RavenDB.

This has been asked by enough customers that we decided to implement it, even though I don’t like the concept very much. One of the key aspects of RavenDB design is the notion that RavenDB is an application database, not a shared database. As such, we limit access per database and expect that pretty much all accesses to the database will have the same privilege level. After all, this is meant to be a single application, even if it is deployed as multiple processes / services.

From real world usage, this expectation is false. People want to isolate access even within the scope of a single application. Hence, the read only mode.

The documentation does a good job describing what read only mode is, but I wanted to give some additional background.

As usual, authentication to RavenDB is done using X509 certificates. When you define the certificate’s permissions, you can grant it read only access to a database (or databases). At this point, applications and users using this certificate will be limits to only reads from the database. That isn’t a big surprise, right?

The devil is in the details, however. A read only certificate can perform the following operations:

  • Load documents by id – the security boundary here is the entire database, there is no limit on a per document / collection.
  • Query documents using predefined indexes. – note that you cannot deploy new indexes.
  • Query documents using auto generated indexes – this also implies that a query by a read only certificate can cause the database engine to create an index to answer the query. This is explicitly allowed when you are using a read only certificate.
  • Inspect the database state and its ongoing tasks – you can look at the tasks and their status, but things like the connection strings details are hidden.
  • Connect to a subscription and accept documents – this is a case where a read only certificate will modify the state of the database (by advancing what documents the subscription consumes). This is explicitly allowed since it is a likely scenario for read only certificates. Creating a subscription, on the other hand, is something that you’ll need to do with more permissions.

I’m not trying to give you the whole thing, you can read the documentation for that, but I am trying to give you some idea about the kind of use cases you can use read only certificates for.

time to read 2 min | 325 words

Yesterday I asked about dealing with livelihood detection of nodes running in AWS. The key aspect is that this need to be simple to build and easy to explain.

Here are a couple of ways that I came up with, nothing ground breaking, but they do the work while letting someone else do all the heavy lifting.

Have a well known S3 bucket that each of the nodes will write an entry to. The idea is that we’ll have something like (filename –  value):

  • i-04e8d25534f59e930 – 2021-06-11T22:01:02
  • i-05714ffce6c1f64ad – 2021-06-11T22:00:49

The idea is that each node will scan the bucket and read through each of the files, getting the last seen time for all the nodes. We’ll consider all the nodes whose timestamp is within the last 1 minute to be alive and any other node is dead.  Of course, we’ll also need to update the node’s file on S3 every 30 seconds to ensure that other nodes know that we are alive.

The advantage here is that this is trivial to explain and implement and it can work quite well in practice.

The other option is to actually piggy back on top of the infrastructure that is dedicated for this sort of scenario. Create an elastic load balancer and setup a target group. On startup, the node will register itself to the target group and setup the health check endpoint. From this point on, each node can ask the target group to find all the healthy nodes.

This is pretty simple as well, although it requires significantly more setup. The advantage here is that we can detect more failure modes (a node that is up, but firewalled away, for example).

Other options, such as having the nodes ping each other, are actually quite complex since they need to find each other. That lead to some level of service locator, but then you’ll have to avoid each node pining all the other nodes, since that can get busy on the network.

time to read 2 min | 286 words

I’m teaching a course at university about cloud computing. That can be a lot of fun, but quite frustrating at time. The key issue for me is that I occasionally need to provide students with some way to do something that I know how to do properly, but I can’t.

Case in point, assuming that I have a distributed cluster of nodes, and we need to detect what nodes are up or down, how do you do that?

With RavenDB, we assign an observer to the cluster whose job is to do health monitoring. I can explain that to the students, but I can’t expect them to utilize this technique in their exercises, there is too much detail there. The focus of the lesson or exercise is not to build a distributed system but to make use of one, after all.

As a rule, I try to ensure that all projects that we are working on can be done in under 200 lines of Python code. That puts a hard limit to the amount of behavior I can express. Because of that, I find myself looking for ways to rely on existing infrastructure to deal with the situation. 

Each node is running the same code, and they are setup so they can talk to one another, if needed. It is important that all the live nodes will converge to agree on the active nodes in relatively short order.

The task is to find the list of active nodes in a cluster, where nodes may go up or down dynamically. We are running in AWS cloud so you can use its resources, how would you do that?

The situation should be as simple as possible and easy to explain to students.

FUTURE POSTS

  1. Atomic reference counting (with Zig code samples) - 2 days from now

There are posts all the way to Sep 20, 2021

RECENT SERIES

  1. Production postmortem (31):
    17 Sep 2021 - The Guinness record for page faults & high CPU
  2. RavenDB 5.2 (2):
    06 Aug 2021 - Simplifying atomic cluster wide transactions
  3. Postmortem (2):
    23 Jul 2021 - Accidentally quadratic indexing output
  4. re (28):
    23 Jun 2021 - The performance regression odyssey
  5. Challenge (58):
    16 Jun 2021 - Detecting livelihood in a distributed cluster
View all series

Syndication

Main feed Feed Stats
Comments feed   Comments Feed Stats