Ayende @ Rahien

Hi!
My name is Oren Eini
Founder of Hibernating Rhinos LTD and RavenDB.


The RavenDB Pi Demo Unit

time to read 7 min | 1300 words

I just had an interesting discussion about Raspberry Pi and RavenDB on Twitter. The 140-character limit is a PITA when you try to express complex topics, so I thought I would do a better job explaining it here.

RavenDB runs on the Raspberry Pi and we routinely use it internally for development and demonstration purposes. That seems to have run into some objections, in particular: "Looks like a toy and shows no real value and can also show poor performance and make uneducated people concerned" and "Nobody is going to deploy a database on a machine with quad cores and lack of RAM and the kind of poor IO a Pi has."

In fact, I wish I had the ability to purchase a few dozen Raspberry Pi Zeros, since that would bring the per-unit cost down significantly. But why bother?

Let us start with the things that I think we all agree on: the Pi isn't a suitable production machine. Well, sort of. If you don't have any high performance requirements, it is a perfectly suitable machine. We have one that is used to show the build status and various metrics, and it is a production machine (in the sense that if it breaks, it needs fixing), but that is twisting things around.

We have two primary use cases for RavenDB on the Raspberry Pi: internal testing and external demos. Let's examine each scenario in turn.

Internal testing mostly refers to developers having multiple Pis on their desktops, participating in a cluster and serving as physically remote endpoints. We could likely handle that with a container / VM easily enough, but it is much nicer to have a separate machine (that gets knocked to the floor, has its network cable unplugged, etc. for certain kinds of experiments). It also provides us with assurance that we are actually hitting a real network and dealing with actual hardware. A nice "problem" that we have with the Pi is that it is pretty slow on IO, especially since we typically run it on an internal SD card, so it gives us a slightly slower system overall and prevents the usual fallacies of distributed computing from taking hold. One of our Pis would crash when used under load, and it was an interesting demonstration of how to deal with a flaky node (the problem was that we used the wrong power supply, if you care).

Using the Pis as semi-realistic nodes helps us avoid certain common mistaken assumptions, but it is by no means a replacement for actual structural testing (Jepsen for certain parts, actual tests with various simulated failures). Another important factor here is that the Pi is cool. It is really funny to see how some people in the office treat their Pis. And it makes for a nicer work environment than just running a bunch of VMs.

Running on much slower hardware has also enabled us to find and resolve a few bottlenecks much earlier (and cheaper) in the process, which is always nice. But more importantly, it gives us a consistent low-end machine that we can test on and verify that even on something like that, our minimum SLA is met.

So that is it in terms of the internal stuff, but I'll have to admit that those mostly grew out of our desire to be able to demo RavenDB on a Raspberry Pi.

You might reasonably ask: why on earth would you want to demo your product on a machine that is literally decades behind the curve? (In 1999, a Dell PowerEdge 6450 with 4 PIII Xeon 550 CPUs and 8 GB RAM sold for close to $20,000.)

It circles back again to the Raspberry Pi being cool, but a lot of that has to do with the environment in which we give the demos. Most of our demos are given at a booth at a conference. In those cases, we have 5 – 10 minutes to give a complete demo of RavenDB. This is obviously impossible, so what we are actually doing is giving a demo of the highlights, and providing enough information that people will go and seek out more on their own.

That means that we have our sexiest features front and center. And one of those features is RavenDB's clustering. A lot of the people we meet at conferences have very little to no experience with NoSQL, and when they do have some experience, it is often mixed.

A really cool demo that we do is the "time to cluster", in which we ask one of the people listening to pull out a phone and set up a timer, and see how much time it takes us to build a multi-master cluster with full failover capabilities. (Our current record, if anyone wants to know, is about 37 seconds, but we have a bug report to streamline this.) Once that is set up, we can show how RavenDB works in the presence of failures and network issues.

We could do all of that virtually, using a firewall / iptables or something like that, but people accustomed to "setting up a cluster takes at least a week or two" are already suspicious that we couldn't possibly make this work. Having them physically disconnect the server from the network and see how it behaves gives a much better demo experience, and it allows us to bypass a lot of the complexities involved in explaining some of the concepts around distributed computing.

One of the most common issues, by the way, is the notion of a partition. It is relatively easy to explain to a user that a server may go down, but some people have a hard time realizing that a node can be up, but unable to connect to some of the network. When you are actually setting up the wires so they can see it, you can watch the light bulb moment happen.

A cool demo might sound like it isn't important, but it generates a lot of interest and people really like it. Because the Raspberry Pi isn't a $20,000 machine, we have taken to bringing a few of them with us to every conference and just raffling them off at the end of the conference. The very fact that we treat the Raspberry Pi as a toy makes this very attractive for us.

As for people assuming that the performance on the Pi is the actual performance on real hardware… we haven't run into those yet, nor do I believe that we are likely to. To be frank, we are fast enough that this isn't really an issue unless you are purposefully trying to load test the system. Hell, I have production systems right now that I could handle with a trio of Pis without breaking a sweat. At one point, we had a customer visit where we literally left them with a Raspberry Pi as their development server because it was the simplest way to get them started. I don't expect to be shipping a lot of RavenDB Development "Servers" in the future, but it was fun to do.

All in all, I think that this has been a great investment for us, and we expect the Pis to keep generating interest (and RavenDB itself to keep it) whenever we show them.

If you are around NDC London, we have a few guys there in a booth right now, and they'll raffle off a few Raspberry Pis at the end of the week.

Database security and defaults

time to read 3 min | 454 words

The nightmare scenario for a database vendor is something like this: Over 27,000 databases managed by MongoDB held to ransom; 99,000 still vulnerable.

To be fair, this isn't quite the nightmare scenario. The nightmare scenario would be if this were due to some vulnerability in the database, but in this case, it isn't that at all. It is simply that admins have set up a publicly visible database with no permissions on the internet, and said "okay, we are done, what is the next ticket?".

Now, I presume that it didn't really go down like that, but the problem is that even though you are fine if you follow the proper instructions, by default, all your data is exposed over the network. I'm assuming that a few of those were set up by a proper dev ops team, but most were done by "Joe, we are going to prod, here are the server credentials, make sure that the db is running there". Or, also likely, "We are done with dev, we can just use the same servers for prod", with no one going in and setting them up properly.

You should note that this isn't really about MongoDB specifically (although it is the one getting the most noise at the moment). This makes for pretty sad reading: you literally need to do nothing to "hack" into production systems, and access over 600 TB of data (just for MongoDB).

The scary thing is that you have questions like this: bind_ip = 127.0.0.1 does not work but 0.0.0.0 works.

So the user will actively try to fight any measure you have to protect them.

With RavenDB, we have actually made it a startup error (the server will abort) if you are running a production instance (identified by a license) but you don't require authentication. Now, there are scenarios where this is valid, such as running on a secured network, but they are pretty far between, so there is a configuration option that you can set to enable this scenario, but that requires an explicit step and hopefully gets the user thinking. With RavenDB 4.0, we'll require authentication (or an explicit configuration override) whenever a user asks us to bind to an interface other than localhost.
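To make the intent concrete, here is a minimal sketch of that kind of startup guard; the configuration names here are hypothetical stand-ins, not the actual RavenDB settings:

    using System;

    // Hypothetical configuration shape - illustrative names, not actual RavenDB settings.
    public class ServerConfiguration
    {
        public string License;                        // non-null means a production (licensed) instance
        public bool RequireAuthentication;            // is authentication enabled?
        public bool AllowUnauthenticatedProduction;   // explicit opt-in for secured-network scenarios
    }

    public static class SecurityStartupCheck
    {
        // Abort startup when a production server would run without authentication
        // and the admin has not explicitly opted in to that configuration.
        public static void Validate(ServerConfiguration config)
        {
            bool isProduction = config.License != null;
            if (isProduction && !config.RequireAuthentication && !config.AllowUnauthenticatedProduction)
                throw new InvalidOperationException(
                    "Production license detected but authentication is disabled. " +
                    "Enable authentication, or set the explicit override if this server runs on a secured network.");
        }
    }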

I think this is one case where you have to reverse "let's make it easy to use us" and instead consider putting hurdles in the way of actually getting it running. Because in the long run, getting this wrong means that it is very easy to shoot yourself in the foot.

Initial design for strong encryption in RavenDB 4.0

time to read 7 min | 1322 words

The previous post generated some great discussion, and we have done a bit of research in the meantime about what is going to be required in order to provide strong encryption support in RavenDB.

Note: I’m still no encryption expert. I’m basing a lot of what I have here on reading libsodium code and docs.

The same design goals that we had before still hold. We want to encrypt the data at the page level, but it looks like it is going to be impossible to just encrypt the whole page. The reason behind that is that encryption is actually a pure mathematical operation, and given the same input text and the same key, it is always going to generate the same value. Using that, you can mount certain attacks on the data by exploiting the sameness of the data, even if you don't actually know what it is.

In order to prevent that, you use an initialization vector or a nonce (they seem to be pretty similar, with the details being relevant only with regard to the randomness requirements they have). At any rate, while I initially hoped that I could just use a fixed value per page, that is a big "nope, don't do that". So we need some place to store that information.

Another thing that I ran into is the problem of modifying the encrypted text in order to generate data that can be successfully decrypted but is different from the original plain text. A nice example of that can be seen here (see the section: How to Attack Unauthenticated Encryption). So we probably want authentication as well.

This is not possible with the current Voron format. Luckily, one of the reasons we built Voron is so we can get it to do what we want. Here is what a Voron page will look like after this change:

  Voron page: 8 KB in size, 64 bytes header
+-------------------------------------------------------------------------+
|Page # 64 bits|Page metadata up to 288 bits  |mac 128 bits| nonce 96 bits|
+-------------------------------------------------------------------------+
|                                                                         |
|  Encrypted page information                                             |
|                                                                         |
|       8,128 bytes                                                       |
|                                                                         |
|                                                                         |
+-------------------------------------------------------------------------+

The idea is that when we need to encrypt a page, we’ll do the following:

  • The first time we need to encrypt a page, we'll generate a random nonce. Each subsequent time we encrypt that page, we'll increment the nonce.
  • We'll encrypt the page information and put it in the page data section.
  • As well as encrypting the data, we'll also sign both it and the rest of the page header, and place the result in the mac field.

The idea is that modifying either the encrypted information or the page metadata will generate an error because the tampering will be detected.
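A minimal sketch of what that encryption call might look like against libsodium's detached ChaCha20-Poly1305 API (using the IETF variant because of its 96-bit nonce; the header offsets here are illustrative assumptions, not the final Voron layout):

    using System;
    using System.Runtime.InteropServices;

    public static unsafe class PageEncryption
    {
        // libsodium's detached ChaCha20-Poly1305 (IETF variant: 96-bit nonce, 128-bit MAC).
        [DllImport("libsodium", CallingConvention = CallingConvention.Cdecl)]
        private static extern int crypto_aead_chacha20poly1305_ietf_encrypt_detached(
            byte* cipher, byte* mac, ulong* macLength,
            byte* message, ulong messageLength,
            byte* additionalData, ulong additionalDataLength,
            byte* nsec, byte* nonce, byte* key);

        // Illustrative header layout (not the final Voron one): 8 bytes page #,
        // 28 bytes metadata, 16 bytes MAC, 12 bytes nonce = 64 byte header.
        private const int PageSize = 8192;
        private const int HeaderSize = 64;
        private const int MacOffset = 36;
        private const int NonceOffset = 52;

        public static void EncryptPage(byte* page, byte* key)
        {
            ulong macLength = 0;
            int rc = crypto_aead_chacha20poly1305_ietf_encrypt_detached(
                page + HeaderSize,                 // ciphertext overwrites the plaintext body
                page + MacOffset, &macLength,      // MAC stored in the header
                page + HeaderSize, PageSize - HeaderSize,
                page, MacOffset,                   // page # + metadata signed as additional data
                null,
                page + NonceOffset,                // per-page 96-bit nonce from the header
                key);
            if (rc != 0)
                throw new InvalidOperationException("Failed to encrypt page");
        }
    }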

This is pretty much it as far as the design of the actual data encryption goes. But there is more to it.

Voron uses a memory mapped file to store the information (actually, several, with pretty complex interactions, but it doesn't matter right now). That means that if we want to decrypt the data, we probably shouldn't be doing that on the memory mapped file memory. Instead, each transaction is going to set aside some memory of its own, and when it needs to access a page, it will be decrypted from the mmap file into that transaction's private copy. During the transaction run, the information will be available in plain text for that transaction. When the transaction is over, that memory is going to be zeroed. Note that transactions in RavenDB tend to be fairly short-term affairs. Because of that, each read transaction is going to get a small buffer to work with, and if more pages are accessed than allowed, it will replace the least recently used page with another one.

That leaves us with the problem of the encryption key. One option would be to encrypt all pages within the database with the same key, use a randomly generated nonce per page and then just increment it. However, that does leave us with the chance that two pages will be encrypted using the same key/nonce. That has a low probability, but it should be considered. We can try deriving a new key per page from the master key, but that seems… excessive. But it looks like there is another option: use a block cipher, where we pass a different block counter for each page.

This would require a minimal change to crypto_aead_chacha20poly1305_encrypt_detached, allowing us to pass a block counter externally, rather than have it as a constant. I asked the question with more details so I can get a more authoritative answer on that. If this isn't valid, we'll probably use a nonce that is composed of the page # and the number of changes that the page has gone through. This would limit us to about 2^32 modifications of the same page, though. It would also limit a single database file to a mere 0.5 exabytes rather than 128 zettabytes, but somehow I think we can live with that.
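A sketch of that fallback scheme, assuming the 96-bit nonce is split into the 64-bit page number plus a 32-bit modification counter:

    using System;

    public static class NonceComposition
    {
        // Compose a 96-bit nonce from the 64-bit page number and the page's 32-bit
        // modification counter. With 32 bits for the counter, a single page can be
        // rewritten at most 2^32 times before the nonce would repeat.
        public static byte[] ComposeNonce(long pageNumber, uint modificationCount)
        {
            var nonce = new byte[12];                                   // 96 bits
            BitConverter.GetBytes(pageNumber).CopyTo(nonce, 0);         // 8 bytes: page #
            BitConverter.GetBytes(modificationCount).CopyTo(nonce, 8);  // 4 bytes: change counter
            return nonce;
        }
    }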

That just leaves us with the details of key management. On Windows, this is fairly easy. We can use CryptProtectData / CryptUnprotectData to protect the key. A transaction will start by getting the key, doing its work, then zeroing all the pages it touched (and decrypted) and its copy of the key. This way, if there are no active transactions, there is no plaintext key in memory. On Linux, we can apparently use libsecret to do this, although it seems to come at a much higher cost.
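On the .NET side those Windows calls are exposed through System.Security.Cryptography.ProtectedData, so protecting the key at rest might look roughly like this (a sketch, not our actual key management code):

    using System.Security.Cryptography;

    public static class KeyProtection
    {
        // Protect the master key with DPAPI (CryptProtectData under the covers) so that
        // only this machine / user account can recover it.
        public static byte[] ProtectKey(byte[] masterKey) =>
            ProtectedData.Protect(masterKey, optionalEntropy: null, scope: DataProtectionScope.CurrentUser);

        // Recover the plaintext key just before a transaction needs it; the caller is
        // responsible for zeroing the returned buffer as soon as the transaction is done.
        public static byte[] UnprotectKey(byte[] protectedKey) =>
            ProtectedData.Unprotect(protectedKey, optionalEntropy: null, scope: DataProtectionScope.CurrentUser);
    }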

Searching shouldn’t be so hard

time to read 3 min | 411 words

The trigger for this post is a StackOverflow question that caught my eye.

Let us imagine that you have the following UI, and you need to implement the search function:

[Image: the search UI in question]

For simplicity’s sake, I’ll assume that you have the following class:
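The class itself was shown as an image in the original post; something along these lines (the exact property names are assumptions):

    public class Restaurant
    {
        public string Name { get; set; }     // e.g. "Rama Thai"
        public string Street { get; set; }   // e.g. "48th st"
        public string Cuisine { get; set; }  // e.g. "Korean"
    }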

And we need to implement this search: we want users to be able to search by the restaurant name, or its location, or its cuisine, or all of the above, for that matter. A query such as "Rama Thai" or "coffee 48th st" should all give us results.

One way of doing that is to do something like this:
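The query was also an image; a rough equivalent with the RavenDB LINQ client (assuming an open session and the Restaurant class above) would be:

    var searchTerm = "Rama Thai";

    // Naive approach: the term has to match one of the fields exactly.
    var results = session.Query<Restaurant>()
        .Where(r => r.Name == searchTerm ||
                    r.Street == searchTerm ||
                    r.Cuisine == searchTerm)
        .ToList();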

Of course, that would only find stuff that matches directly. It will find "Rama" or "Thai", but "Rama Thai" would confuse it. We can make it better, somewhat, by doing a bit of work on the client side and changing the query, like so:
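Again, the original was an image; the client-side tweak amounts to splitting the input into terms and matching any of them (using the In extension from Raven.Client.Linq):

    var terms = searchTerm.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);

    // Any field matching any of the individual terms is now good enough.
    var results = session.Query<Restaurant>()
        .Where(r => r.Name.In(terms) ||
                    r.Street.In(terms) ||
                    r.Cuisine.In(terms))
        .ToList();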

That would now find results for "Rama Thai", right? But what about "Mr Korean"? Consider a user who has no clue about the restaurant name, let alone how to spell it, but just remembers enough pertinent information: "it was Korean food and had a Mr in its name, on Fancy Ave".

You can spend a lot of time trying to cater to those needs. Or you can stop thinking about the data you search as having the same shape as your results, and use this index:
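The index was shown as an image; in the C# client it would look roughly like this (class and field names assumed):

    using System.Linq;
    using Raven.Abstractions.Indexing;
    using Raven.Client.Indexes;

    public class Restaurants_Search : AbstractIndexCreationTask<Restaurant>
    {
        public class Result
        {
            public string Query { get; set; }
        }

        public Restaurants_Search()
        {
            // Pluck every searchable field into a single index field...
            Map = restaurants => from r in restaurants
                                 select new
                                 {
                                     Query = new[] { r.Name, r.Street, r.Cuisine }
                                 };

            // ...and mark it as analyzed, so it goes through full text analysis.
            Index("Query", FieldIndexing.Analyzed);
        }
    }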

Note that what we are doing here is picking from the restaurant document all the fields that we want to search on and plucking them into a single location, which we then mark as analyzed. This lets RavenDB know that it is time to start cranking. It merges all of those details together and arranges them in such a way that the following query can be done:
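And the query against that index (again, a sketch of the shape rather than the exact code from the post):

    // Search and OfType come from Raven.Client.Linq / System.Linq.
    var results = session.Query<Restaurants_Search.Result, Restaurants_Search>()
        .Search(r => r.Query, "Mr Korean Fancy Ave")
        .OfType<Restaurant>()
        .ToList();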

And now we don't do a field by field comparison; instead, we apply the same analysis rules at query time that we applied at indexing time, after which we can search the index. And now we have sufficient information not just to find a restaurant named "Fancy Mr Korean" (which to my knowledge doesn't exist), but to find the appropriate Korean restaurant on the appropriate street, pretty much for free.

Those kinds of features can dramatically uplift your application's usability and attractiveness to users. "This site gets me."

Strong data encryption questions

time to read 3 min | 428 words

With RavenDB 4.0, we are looking to strengthen our encryption capabilities. Right now RavenDB is capable of encrypting document data and the contents of indexes at rest. That is, if you look at the disk, the data is securely encrypted. However, in memory, we keep quite a bit of information in plain text (mostly in caches of various kinds), and the document metadata isn't encrypted, so document keys are visible.

With RavenDB 4.0 we are looking into making some stronger guarantees. That means that we want to keep all data encrypted on disk, and only decrypt it during a transaction, after which it will immediately be encrypted again.

Now, encryption and security in general are pretty big fields, and I’m by no means an expert, so I thought that I would outline the initial goals of our research and see if you have anything to add.

  • All encryption / decryption operations are done on data that is aligned on a 4KB boundary and is always in multiples of 4 KB. It would be extremely helpful if the encryption did not change the size of the data. Given that the data is always in 4KB increments, I don't think that this is going to be an issue.
  • We can't use a managed API to do so. Our data is actually residing in unmanaged memory, so ideally we would need something like this:
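The original showed the desired signature as an image; a sketch of its shape, with hypothetical library and entry point names:

    using System.Runtime.InteropServices;

    // Hypothetical library and entry point names - the point is the shape of the API:
    // raw pointers in and out, so data living in unmanaged memory never has to be
    // copied into managed byte[] buffers just to be encrypted or decrypted.
    public static unsafe class NativeCrypto
    {
        [DllImport("somecryptolib")]
        public static extern int encrypt(byte* dst, byte* src, int len, byte* key, byte* nonce);

        [DllImport("somecryptolib")]
        public static extern int decrypt(byte* dst, byte* src, int len, byte* key, byte* nonce);
    }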




  • I also need to be able to call this from C#, and it needs to run on Windows, Linux and hopefully Mac OS.
  • I’ve been looking at stuff like this page, trying to understand what it means and hoping that this is actually using best practices for safety.

Another problem is that just getting the encryption code right doesn't help without managing all the rest of it properly: selecting the appropriate algorithm and mode, making sure that the library we use is both well known and respected, etc. How do we distribute / deploy / update it over multiple platforms?

Any recommendations?

You can see some sample code that I have made here: https://gist.github.com/ayende/13b206b9d83e7aa126df77d6b12711f3

This is basically the OpenSSL sample translated to C# with a bit of P/Invoke. Note that this is meant for our own use, so we don't need padding, since we always pass a buffer that is a multiple of 4KB.

I'm assuming that since this is based on the example on the OpenSSL wiki, it is also a best practice sample. There is a chance that I am mistaken, however, which is why we have this post.

RavenDB 4.0 Indexing Benchmark

time to read 4 min | 645 words

I ran into this post, which was pretty interesting to read. It compares a bunch of ways to index inside CouchDB, so I decided to see how RavenDB 4.0 compares.

I wrote the following code to generate the data inside RavenDB:
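The generation code was an image in the original post; the gist of it, using the bulk insert API (the Item class here is a stand-in, not the actual benchmark schema):

    using Raven.Client;

    public class Item
    {
        public string Name { get; set; }
        public int Value { get; set; }
    }

    public static class DataGenerator
    {
        public static void Generate(IDocumentStore store)
        {
            // Insert 100,000 small documents through the bulk insert API.
            using (var bulkInsert = store.BulkInsert())
            {
                for (int i = 0; i < 100000; i++)
                {
                    bulkInsert.Store(new Item { Name = "item-" + i, Value = i % 1000 });
                }
            }
        }
    }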

In the CouchDB post, this took… a while. With RavenDB, this took 7.7 seconds, and the database size at the end was 48.06 MB. This is my laptop, a 6th gen i7 with 16 GB RAM and an SSD drive. The CouchDB tests were run on a similar machine, but with 8 GB RAM; RavenDB didn't get to use much memory at all throughout the benchmark.

It is currently sitting at around a 205 MB working set, having allocated a total of 70 MB of managed memory and 19 MB of native memory.

I then created the following index:

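The index definition was also an image; it was a simple map index, roughly of this shape (field names assumed to match the stand-in documents above):

    using System.Linq;
    using Raven.Client.Indexes;

    public class Items_ByName : AbstractIndexCreationTask<Item>
    {
        public Items_ByName()
        {
            // A plain map index over a single field - the nearest RavenDB equivalent
            // to the simple CouchDB views used in the linked post.
            Map = items => from item in items
                           select new { item.Name };
        }
    }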

This does pretty much the same as the indexes created in CouchDB. Note that in this case, this is running in process, so the nearest equivalent would be the Erlang native views.

This took 1.281 seconds to index 100,000 documents, giving us a total of 78,064 indexed documents per second.

The working set during the indexing process grew to 312 MB.

But those were small documents. What about larger ones? In the linked post, there was another test done, this time with each document also having a 300KB text property. Note that this is a single property.

The math goes like this: 300KB * 100,000 documents = 30,000,000 KB = 29,296 MB = 28.6 GB.

Inserting those took 1 minute and 40 seconds. However, the working set for RavenDB peaked at around 500 MB, and the total size of the database at the end was  512.06 MB. It took me a while to figure out what was going on.

One of the features that we have with the blittable format is the fact that we can compress large string values. The large string property was basically lorem ipsum repeated 700 times, so it compressed real well. And the beauty of this is that unless we actually need to touch that property and do something to it, it can remain compressed.

Indexing that amount of data took 1.45 seconds, for 68,965 indexed documents per second.

My next test was to avoid creating a single large property and go with a lot of smaller properties.
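The code for this scenario was an image as well; the rough idea (a sketch, not the original):

    using System;
    using System.Collections.Generic;
    using System.Linq;
    using Raven.Client;

    public static class LargeDocumentGenerator
    {
        public static void Generate(IDocumentStore store)
        {
            var random = new Random();
            // Roughly 1 KB of text per property; between ~90 and ~900 properties
            // per document puts each document in the 90 KB - 900 KB range.
            string chunk = string.Concat(Enumerable.Repeat("lorem ipsum dolor sit amet ", 38));

            using (var bulkInsert = store.BulkInsert())
            {
                for (int i = 0; i < 100000; i++)
                {
                    var document = new Dictionary<string, object> { ["Name"] = "item-" + i };
                    int properties = random.Next(90, 900);
                    for (int j = 0; j < properties; j++)
                        document["Text" + j] = chunk;

                    bulkInsert.Store(document, "items/" + i);
                }
            }
        }
    }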

This generates 100,000 documents in the 90KB – 900 KB range. Inserting this amount of data took a while longer, mostly because of the amount of data that we write in this scenario, which is in the many-GB range. In fact, this is what my machine looked like while the insert was going on:

[Screenshot: machine resource usage during the insert]

Overall, the insert process took 8:01 minutes to complete. And the final DB size was around 12 GB.

Indexing that amount of information took a lot longer, and consumed a fair few resources:

[Screenshot: resource usage during indexing]

Time to index? Well, it was about twice as much as before, with RavenDB taking a total of 2.58 seconds to index 100,000 documents totaling 12 GB in size.

That comes to 38,759 indexed documents per second.

A large part of that is related to the way we are storing information in blittable format. We don’t need to do any processing to it, so we are drastically faster.

Comparing to the CouchDB results in the linked post, we have:

  Indexed documents per second:

                       RavenDB    CouchDB
  Small documents       78,064     19,821
  Large documents       38,759      9,363

In other words, we are talking about being 4 times faster than the fastest CouchDB option.

RavenDB 4.0 Alpha is out!

time to read 3 min | 428 words

For the past two years or so, we have been working very hard on RavenDB 4.0. You have seen some of the posts in this blog detailing some of the work we have been doing. In RavenDB 4.0 we have put a tremendous amount of emphasis on performance and on reducing the amount of work we are doing. We have been quite successful. Our metrics show a minimum of 10x improvement across the board, and often we are even faster.

I’m going to be posting a lot about RavenDB 4.0 in the next few weeks, to highlight some of the new stuff, this time in context, but for now, let me just say that I haven’t been this excited about releasing software since the 1.0 release of RavenDB.

You can download the new alpha here; we have pre-built binaries for Windows, Linux (Ubuntu) and Raspberry Pi. Mac OS support is scheduled and will likely be in the next alpha.

The alpha version supports building databases, working with documents, subscriptions, changes, and indexing (simple, full text and map/reduce). There have been significant improvements to indexing speed and how indexes work; memory utilization across the board is much lower and CPU utilization is much reduced.

We aren’t ready to do full blown benchmarks yet, but a sample scenario that has RavenDB 3.5 pegged at 100% with 45,000 requests per second has RavenDB 4.0 relaxing with 70% CPU with 153,000 requests per second.

We would like to get feedback on RavenDB 4.0, and so please download the software and give it a whirl.

You need to make sure that you use both server & client from the same version (older versions will not work with the new server). We actually have a couple of clients: the updated client for 4.0, which is pretty much the same as the old client, just updated to work with the new server, and a completely new client, which is built to provide much higher performance and to apply the lessons we learned while building the server to the client code. Over time, the new client is going to be the default one, but even though we aren't there yet, we want people to have access to it from the get go.

Please remember that this is pre-release alpha software, as such, bugs are expected, and should be reported.

A different sort of cross platform bug

time to read 2 min | 334 words

During the move of RavenDB to Linux we had to deal with the usual share of cross platform bugs. The common ones are things like file system case sensitivity, concatenating paths with \ separators, drive letters, etc. The usual stuff.

Note that I'm not counting the hairy stuff (different threading models, different I/O semantics, etc.); we had already dealt with those before.

The real problems had to do with the file systems. In particular, ext3 doesn't support fallocate, which we kind of want to use to allocate our disk space in advance (which gives the file system information so it can allocate all of that space together). But that was pretty straightforward. EOPNOTSUPP is clear enough for me, and we could just use ext4, which works.
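A sketch of how that check might look from C# (P/Invoking the Linux fallocate call directly; not the actual RavenDB code):

    using System;
    using System.IO;
    using System.Runtime.InteropServices;

    public static class FilePreallocation
    {
        [DllImport("libc", SetLastError = true)]
        private static extern int fallocate(int fd, int mode, long offset, long length);

        private const int EOPNOTSUPP = 95; // errno on Linux when the file system can't do it (e.g. ext3)

        // Try to pre-allocate the file's disk space up front; if the file system does not
        // support fallocate, report that so the caller can fall back to growing the file lazily.
        public static bool TryPreallocate(int fileDescriptor, long size)
        {
            if (fallocate(fileDescriptor, 0, 0, size) == 0)
                return true;

            int errno = Marshal.GetLastWin32Error();
            if (errno == EOPNOTSUPP)
                return false;

            throw new IOException("fallocate failed with errno " + errno);
        }
    }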

But this bug was special. To start with, it snuck in and lay there dormant for quite a while, and it was a true puzzle. Put simply, a certain database would work just fine as long as we didn't restart it. On restart, some of its indexes would be in an error state.

It took a while to figure out what was going on. At first, we thought that there was some I/O error, but the issue turned out to be much simpler. We assumed that we would be reading indexes in order; in fact, our code output the index directories in such a way that they would be lexically sorted. That worked quite well, and had the nice benefit that on NTFS, we would always read the indexes in order.

But on Linux, there is no guaranteed order, and on one particular machine, we indeed hit the out-of-order issue. In that case, we got an error because we were trying to create an index whose id was too low. Perfectly understandable once you realize what is going on, and trivial to fix, but quite hard to figure out, because we kept assuming that we had done something horrible to the I/O system.
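The essence of the fix is to never rely on the file system's enumeration order; a sketch (the path and the loader here are hypothetical):

    using System;
    using System.IO;
    using System.Linq;

    // Sort the index directories explicitly so ids always come up in ascending order,
    // on NTFS and ext4 alike. 'indexesPath' and 'LoadIndex' are hypothetical.
    var indexDirectories = Directory.GetDirectories(indexesPath)
        .OrderBy(Path.GetFileName, StringComparer.OrdinalIgnoreCase)
        .ToList();

    foreach (var directory in indexDirectories)
        LoadIndex(directory);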
