Ayende @ Rahien

Hi!
My name is Oren Eini
Founder of Hibernating Rhinos LTD and RavenDB.
You can reach me by phone or email:

ayende@ayende.com

+972 52-548-6969

, @ Q c

Posts: 6,283 | Comments: 46,769

filter by tags archive

AnswerWhat does this code do?

time to read 2 min | 225 words

This was a surprising shock, this code seems so simple, but it does something very different than what I would expect.

The question is why?

As it turns out, we are missing one character here:

image

Notice the lack of the comma? Let us see how the compiler is treating this code by breaking it apart into steps, shall we?

First, let us break it into two statements:

Note that we moved the name line into the first statement, since there isn’t a command here. But what is it actually doing? This looks very strange, but that is just because we have the dictionary initializer here. If we drop it, we get:

image

And that make a lot more sense. If we’ll break it all down, we have:

And that explains it all, make perfect sense, and a very nasty trap. We run into it accidently in production code, and it was near impossible to figure out what was going on or why it happened.

The RavenDB Pi Demo Unit

time to read 7 min | 1300 words

imageI just had an interesting discussion on Raspberry Pi and RavenDB over twitter. The 140 characters limit is a PITA when you try to express complex topics, so I thought that I would do better job explaining it here.

RavenDB is running on the Raspberry Pi and we routinely use it internally for development and demonstrations purposes. That seems to have run into some objections, in particular: “Looks like a toy and shows no real value and can also show poor performance and make uneducated people concerned” and “Nobody is going to deploy a database on a machine with quad cores and lack of RAM and the kind of poor IO a Pi has.”

In fact, I wish that I had the ability to purchase a few dozens Raspberry Pi Zero, since that would bring the per unit cost significantly. But why bother?

Let us start with the things that I think we all agree on, the Pi isn’t a suitable production machine. Well, sort off. If you don’t have any high performance requirements, it is a perfectly suitable machine, we have one that is used to show the build status and various metrics, and it is a production machine (in the sense that if it breaks, it needs fixing), but that is twisting things around.

We have two primary use cases for RavenDB on Raspberry Pi. Internal testing and external demos. Let examine each scenario in turn.

Internal testing most refers to having developers multiple Pis on their desktop, participating in a cluster, and serving as physically remote endpoints. We could likely handle that with a container / vm easily enough, but it is much nicer to have a separate machine (that gets knocked down to the floor, have its network cable unplugged, etc for certain kind of experiments. It also provide us with assurance that we are actually hitting real network and deal with actual hardware. A nice “problem” that we have with the Pi is that it is pretty slow on IO, especially since we typically run it on an internal SD card, so it give us a slightly slower system overall, and prevent the usual fallacies of distributed computing from taking hold. One of our Pis would crash when used under load, and it was an interesting demonstration of how to deal with a flaky node (the problem was that we used the wrong power supply, if you care).

Using the Pis as semi realistic nodes help us avoid certain common mistaken assumptions, but it is by no mean replacing actual structural testing of those (Jepsen for certain parts, actual tests with various simulated failures).  Another important factor here is that the Pi is cool. It is really funny to see how some people in the office threat their Pis. And it make for a nicer work environment than just running a bunch of VMs.

Running on much slower hardware has also enabled us to find and resolve a few bottlenecks much earlier (and cheaper) in the process, which is always nice. But more importantly, it gives us a consistent low end machine that we can try on and verify that even on something like that, our minimum SLA is met.

So that is it in terms of the internal stuff, but I’ll have to admit that those mostly were born out as a result of our desire to be able to demo RavenDB on a Raspberry PI.

You might reasonably ask, why on earth would you want to demo your product on a machine that is literally decades behind the curve (in 1999, Dell PowerEdge 6450 had 4 PIII Xeon 550 CPUs and 8 GB RAM and sold for close to 20,000$).

It circles back again to the Raspberry Pi being cool, but a lot of that has to do with the environment in which we give the demos. Most of our demos are given at a booth in a conference. In those cases, we have 5 – 10 minutes to give a complete demo of RavenDB. This is obviously impossible, so what we are actually doing is giving a demo of the highlights, and provide enough information that people will go and seek out more on their own.

That means that we have our sexiest features upfront and center. And one of those feature is RavenDB’s clustering. A lot of the people we meet in conferences have very little to no experience with NoSQL, and when they do have some experience, it is often mixed.

A really cool demo that we do is the “time to cluster”, in which we are asking one of the people listening to pull out a phone and setup a timer, and see how much time it takes us to build a multi master cluster with full failover capabilities. (Our current record, if anyone wants to know, is about 37 seconds, but we have a bug report to stream line this). Once that is setup, we can show how RavenDB works in the presence of failures and network issues.

We could do all of that virtually, using firewall / iptables or something like that, but people accustomed with “setting up a cluster takes at least a week or two” are already suspicious that we couldn’t possibly make this work. Having them physically disconnect the server from the network and seeing how it behaves give a much better demo experience, and it allows us to bypass a lot of the complexities involved in explaining some of the concepts around distributed computing.

One of the most common issue, by the way, is the notion of a partition. It is relatively easy to explain to a user that a server may go down, but some people had hard time realizing that a node can be up, but unable to connect to some of the network. When you are actually setting up the wires so they can actually see it, you can see a light bulb moment going on.

A cool demo might sound like it isn’t important, but it generates a lot of interest and people really like it. Because the Raspberry Pi isn’t a 20,000$ machine, we have taken to taking a few of them with us to every conference, and just raffling them off by the end of the conference. The very fact that we treat the Raspberry Pi as a toy make this very attractive for us.

As for people assuming that the performance on the PI is the actual performance on real hardware… We haven’t run into those yet, nor do I believe that we are likely to. To be frank, we are fast enough that this isn’t really an issue unless you are purposefully trying to load test the system. Hell, I have production systems right now that are I can handle with a trio of Pis without breaking a sweat. At one point, we had a customer visit where we literally left them with a Raspberry Pi as their development server because it was the simplest way to get them started. I don’t expect to be shipping a lot of RavenDB Development “Servers” in the future, but it was fun to do.

All in all, I think that this has been a great investment for us, and we expect to see the Pis generate interest (and the RavenDB itself keeping it) whenever we show it.

If you are around NDC London, we have a few guys there in a booth right now, and they’ll raffle a few Raspberry PIs by the end of the week.

Protocol design implications: REST vs. TCP

time to read 3 min | 444 words

I was going over design documents today, and I noticed some common themes in the changes that we have between RavenDB 3.5 and RavenDB 4.0.

With RavenDB 3.5 (and all previous versions), we always had the communication layer as HTTP REST calls between nodes. When I designed RavenDB, REST was the thing to do, and it is reflected in the design of RavenDB itself. However, 8 years later, we sat down and considered whatever this is really appropriate for everything. The answer was a resounding no. In fact, while over 95% of RavenDB is still pure REST calls, we have moved certain key functions to using TCP directly.

Note that this goes in directly contrast to this post of mine from 2012: Why TCP is evil and HTTP is king.

The concerns in this post are still valid, but we have found that there are a few major reasons why we want to switch to TCP for certain stuff. In particular, the basic approach is that the a client will communicate with the server using HTTP calls, but servers communicate with one another using TCP. The great thing about TCP is that it is a stream oriented protocol, so I don’t need to carry state with me on every call.

With HTTP, each call is stateless, and I can’t assume anything about the other side. That means that I need to send the state, manage the state on the other side, and have to deal with potential issues such as concurrency in the same conversation, restarts of one side that the other side can’t easily detect, repeated validation on each call, etc.

With TCP, on the other hand, I can make a lot of assumptions about the conversation. I have state that I can carry between calls to the other side, and as long as the TCP connection is opened, I can assume that it is valid. For example, if I need to know what is the last item I sent to the remote end, I can query that at the beginning of the TCP connection, as part of the handshake, and then I can just assume that what I sent to the other side has arrived (since otherwise I’ll eventually get an error, requiring me to create a new TCP connection and do another handshake). On the other side, I can verify the integrity of a connection once, without requiring me to repeatedly verify our mutual state on each and every message being passed.

This has drastically simplified a lot of code on both the sending and receiving ends, and reduced the number of network roundtrips by a significant amount.

Getting the design ready for production troubleshooting

time to read 2 min | 339 words

The following is an excerpt from a design document for a major feature in RavenDB 4.0 that I’m currently reviewing, written by Tal.

One of the major problems when debugging such issues in production is the fact that most of the interesting information resides in memory and goes away when the server restarts, the sad thing is that the first thing an admin will do when having issues with the server is to recycle it, giving us very little to work with. Yes, we have logs, but debug level logs are very expensive and usually are not enabled in production (nor should they), we already have the ability to turn logs on, on a production system which is a great option but not enough. The root cause of a raft problem usually resides in the past so unless we have logs from the beginning of time there is not much use for them. The suggested solution is a persistent log for important events that indicate that things went south.

This is based on our experience (and frustration) from diagnosing production issues. By the time the admin see something is wrong, the problem already occurred, and in the process of handling the problem, the admin will typically focus on fixing it, rather than figuring out what exactly is going on.

Those kind of features, focusing explicitly on giving us enough information to find the root cause of the issue has been an on going effort for us. Yesterday they enabled us to get a debug package from a customer (a zip file that the server can generate with a lot of important information), go through it and figure out exactly what the problem was (the customer was running in 32 bits mode and running into virtual memory exhaustion) in one support roundtrip, rather than having to go back and forth multiple times to try to get a bunch of different data points to figure out the issue.

Also, go and read Release It, it has a huge impact on actual system design.

Rust based load balancing proxy server with async I/O

time to read 5 min | 981 words

In my previous Rust post, I built a simple echo server that spun a whole new thread for each connection. In this one, I want to do this in an async manner. Rust doesn’t have the notion of async/await, or something similar to Go green threads (it seems that it used to, and it was removed as costly abstraction for low level system language).

I’m going to use Tokio.rs to do that, but sadly enough, the example on the front page is about doing an async echo server. That kinda killed the mood for me there, since I wanted to deal with actually implementing it from scratch. Because of that, I decided to do something different and build an async Rust based TCP level proxy server.

Expected usage:

cargo run live-test.ravendb.net:80 localhost:8080

Which should print the port that this proxy runs on and then route the connection to one of those endpoints.

This led to something pretty strange, check out the following code:

image

Can you figure out what the type of addr is? It is inferred, but from what? The addr definition line does not have enough detail to figure it out. Therefor, the compiler actually goes down and see that we are passing it to the bind() method, which takes a std::net::SocketAddr value. So it figures out that the value must be a std::net::SocketAddr.

This seems to be utterly backward and fragile to me.  For example, I added this:

image

And the compiler was very upset with me:

image

I’m not used to the variable type being impacted by its usage. It seems very odd and awkward. It also seems to be pretty hard to actually figure out what the type of a variable is from just looking at the code. And there isn’t an easy way to get it short of causing an intentional compiler error that would reveal those details.

The final code looks like this:

At the same time, there is a lot going on here and this is very simple.

Lines 1 – 15 are really not interesting. Lines 17 – 29 are about parsing the user’s input, but the fun stuff begins from line 30 and onward.

I use fun cautiously, it wasn’t very fun to work with, to be honest. On lines 30 & 31 I setup the event loop handlers. And then bind them to a TCP listener.

On lines 40 – 62 I’m building the server (more on that later) and on line 64 I’m actually running the event loop.

The crazy stuff is all in the server handling. The incoming().for_each() call will call the method for each connected client, passing the stream and the remote IP. I then split the TCP stream into a read half and a write half, and select a node to load balance to.

Following that, I’m doing an async connect to that node, and if it is successful I’m splitting the server and then reverse them using the copy methods. Basically attaching the input and output of each to the other side. Finally, I’m joining the two together, so we’ll have a future that will only be done when both sending and receiving is done, and then I’m sending it back to the event loop.

Note that when I’m accepting a new TCP connection, I’m not actually pausing to connect to the remote server. Instead, I’m going to setup the call and then pass the next stage to the event loop ( the spawn ) method.

This was crazy hard to do and generated a lot of compilation errors along the way. Why? See line 57, where we erase the types?

The type of send_data without this line is something like Future<Result<(u64,u64), Error>>. But the map & map_err turn it into just a Future. If you don’t do that? Well, the compiler errors are generally very good, but it seems that inference can take you into la-la land, see this compiler error. That reminds me of trying to make sense of C++ template errors in 1999.

image

Now, here is the definition of the spawn method:

image

And I didn’t understand this syntax at all. Future is a trait, and it has associated types, but I’m thinking about generics as only the stuff inside the <>, so that was pretty confusing.

Basically, the problem was that I was passing a future that was returning values, while the spawn method expected one that was expecting none.

I also tried to change the and_then to just then, but at that point I got:

image

At which point I just out.

However, just looking at the code on its own, it is quite nicely done, and it expresses exactly what I want it to. My problem is that every single change that I make there has repercussions down the line, which is hard for me to predict.

Database security and defaults

time to read 3 min | 454 words

imageThe nightmare scenario for a database vendor is something like this: Over 27,000 databases managed by MongoDB held to ransom; 99,000 still vulnerable.

To be fair, this isn’t quite the nightmare scenario. The nightmare scenario would be if this would be due to some vulnerability in the database, but in this case, this isn’t that at all. It is simply that admins have setup a publicly visible database with no permissions on the internet, and said “okay, we are done, what is the next ticket?”.

Now, I presume that it didn’t really went on like that, but the problem is that if you follow the proper instructions, you are fine, by default, all your data is exposed over the network. I’m assuming that a few of those were setup by a proper dev ops team, and mostly they were done by “Joe, we are going to prod, here are the server credentials, make sure that the db is running there”.  Or, also likely, “We are done with dev, we can just use the same servers for prod”, with no one going in and setting them up properly.

You should note that this isn’t really about MongoDB specifically (although this is the one that has the most noise at the moment). This makes for a pretty sad reading, you literally require nothing to do to “hack” into production systems, and access over 600 TB of data (just for MongoDB).

The scary thing is that you have questions like this: bind_ip = 127.0.0.1 does not work but 0.0.0.0 works.

So the user will actively try to fight any measure you have to protect them.

With RavenDB, we have actually made it a startup error (the server will abort) if you are running a production instance (identified with a license) but you don’t require authentication. Now, there are scenarios where this is valid, such as running on a secured network, but they are pretty far, so you have a configuration option that you can set that will enable this scenario, but that require an explicit step and hopefully get the user thinking. With RavenDB 4.0, we’ll require authentication (or explicit configuration override) whenever a user ask us to bind to an interface other than localhost.

I think that is one case where you have to reverse “let’s make it easy to use us” and also consider putting hurdles to actually get it running. Because in the long run, getting this wrong means that it is very easy to shoot yourself in the foot.

Building a low level trie with Rust: Part I

time to read 3 min | 481 words

Before getting to grips with a distributed gossip system in Rust, I decided that it would be better to look at something a bit more challenging, but smaller in scope. I decided to implement the low level trie challenge in Rust.

This is interesting, because it is complex enough problem to require thinking even for experienced developers, but at the same time, it isn’t complex, just have a lot of details. It also require us to do a lot of lot level stuff and manipulate memory directly, so that is something that would be an interesting test for a system level programming language.

On the one hand, even with just a few hours with Rust, I can see some elegance coming out of certain pieces.  For example, take a look at the following code:

This is responsible for searching on the trie for a value, and I like that the find_match function traverse the tree and allow me to return both an enum value and a the closest match to this when it fails (so I can continue the process directly from there).

On the other hand, we have pieces of code like this:

image

And any line that has four casts in it is already suspect. And as I’m dealing with the raw memory, I have quite a bit of this.

And I certainly feeling the pain of the borrow checker. Here is where I’m currently stumped.

This is a small and simple example that shows the issue. It fails to compile:

image

I have a method that takes a mutable MyTrie reference, and pass it to a method that expects a immutable reference. This is fine, and would work. But I need to use the value from the find method in the delete_internal method, which again needs a mutable instance. But this fails with:

error[E0502]: cannot borrow `*self` as mutable because it is also borrowed as immutable

I understand the problem, but I am not really sure how to solve it. The problem is that I kinda want the find method to remain immutable, since it is also used on the read method, which can run on immutable instances.Technically speaking, I could copy the values that I want out of the node reference and do a lexical scope that would force the immutable borrow to end, but I’m unsure yet what would be the best option.

It seems like a lot of work to get what I want in spite, and not with the help of, the compiler.

Initial design for strong encryption in RavenDB 4.0

time to read 7 min | 1322 words

The previous post generated some great discussion, and we have done a bit of research in the meantime about what is going to be required in order to provide strong encryption support in RavenDB.

Note: I’m still no encryption expert. I’m basing a lot of what I have here on reading libsodium code and docs.

The same design goals that we had before still hold. We want to encrypt the data at the page level, but it looks like it is going to be impossible to just encrypt the whole page. The reason behind that is that encryption is actual a pure mathematical operation, and given the same input text and the same key, it is always going to generate the same value. Using that, you can create certain attacks on the data by exploiting the sameness of the data, even if you don’t actually know what it is.

In order to prevent that, you would use an initialization vector or nonce (seems to be pretty similar, with the details about them being relevant only with regards to the randomness requirements they have). At any rate, while I initially hoped that I can just use a fixed value per page, that is a big “nope, don’t do that”. So we need some place to store that information.

Another thing that I run into is the problem with modifying the encrypted text in order to generate data that can be successfully decrypted but is different from the original plain text. A nice example of that can be seen here (see the section: How to Attack Unauthenticated Encryption). So we probably want to have that as well.

This is not possible with the current Voron format. Luckily, one of the reasons we built Voron is so we can get it to do what we want. Here is what a Voron page will look after this change:

  Voron page: 8 KB in size, 64 bytes header
+-------------------------------------------------------------------------+
|Page # 64 bits|Page metadata up to 288 bits  |mac 128 bits| nonce 96 bits|
+-------------------------------------------------------------------------+
|                                                                         |
|  Encrypted page information                                             |
|                                                                         |
|       8,128 bytes                                                       |
|                                                                         |
|                                                                         |
+-------------------------------------------------------------------------+

The idea is that when we need to encrypt a page, we’ll do the following:

  • First time we need to encrypt the page, we’ll generate a random nonce. Each time that we encrypt the page, we’ll increment the nonce.
  • We’ll encrypt the page information and put it in the page data section
  • As well as encrypting the data, we’ll also sign both it and the rest of the page header, and place that in the mac field.

The idea is that modifying either the encrypted information or the page metadata will generate an error because the tampering will be detected.

This is pretty much it as far as the design of the actual data encryption goes. But there is more to it.

Voron uses a memory mapped file to store the information (actually, several, with pretty complex interactions, but it doesn’t matter right now). That means that if we want to decrypt the data, we probably shouldn’t be doing that on the memory mapped file memory. Instead, each transaction is going to set aside some memory of its own, and when it needs to access a page, it will be decrypted from the mmap file into that transaction private copy. During the transaction run, the information will be available in plain text mode for that transaction. When the transaction is over, that memory is going to be zeroed. Note that transactions RavenDB tend to be fairly short term affairs. Because of that, each read transaction is going to get a small buffer to work with and if there are more pages accessed than allowed, it will replace the least recently used page with another one.

That leaves us with the problem of the encryption key. One option would be to encrypt all pages within the database with the same key, and use the randomly generated nonce per page and then just increment that. However, that does leave us with the option that two pages will be encrypted using the same key/nonce. That has a low probability, but it should be considered. We can try deriving a new key per page from the master page, but that seems… excessive. But it looks like there is another option is to generate use a block chipper, where we pass different block counter for each page.

This would require a minimal change to crypto_aead_chacha20poly1305_encrypt_detached, allowing to pass a block counter externally, rather than have it as a constant. I asked the question with more details so I can have a more authoritative answer about that. If this isn’t valid, we’ll probably use a nonce that is composed of the page # and the number of changes that the page has went through. This would limit us to about 2^32 modifications on the same page, though. We could limit the a single database file size to mere 0.5 Exabyte rather than 128 Zettabyte, but somehow I think we can live with it.

This just leave us with the details of key management. On Windows, this is fairly easy. We can use the CryptProtectData / CryptUnprotectData to protect the key. A transaction will start by getting the key, doing its work, then zeroing all the pages it touched (and decrypted) and its key. This way, if there are no active transactions, there is no plaintext key in memory. On Linux, we can apparently use Libsecret to do this. Although it seems like it has a much higher cost to do so.

FUTURE POSTS

  1. The metrics calculation methods - 3 days from now
  2. The struggle with Rust - 4 days from now

There are posts all the way to Jan 24, 2017

RECENT SERIES

  1. Answer (9):
    20 Jan 2017 - What does this code do?
  2. Challenge (48):
    19 Jan 2017 - What does this code do?
  3. Implementing low level trie (2):
    14 Dec 2016 - Part II
  4. The performance regression in the optimization (2):
    01 Dec 2016 - Part II
  5. Digging into the CoreCLR (4):
    25 Nov 2016 - Some bashing on the cost of hashing
View all series

Syndication

Main feed Feed Stats
Comments feed   Comments Feed Stats