Ayende @ Rahien

Hi!
My name is Oren Eini
Founder of Hibernating Rhinos LTD and RavenDB.
You can reach me by email or phone:

ayende@ayende.com

+972 52-548-6969


Production postmortem: The random high CPU

time to read 2 min | 253 words

A customer complained that every now and then RavenDB hits 100% CPU and stays there. They were kind enough to provide a minidump, and I started the investigation.

I loaded the minidump into WinDbg and started debugging. The first thing you do with high CPU is run the “!runaway” command, which sorts the threads by how busy they are:

[Image: !runaway output, threads sorted by CPU time]

I switched to the first thread (39) and asked for its stack; I’ve highlighted the interesting parts:

[Image: thread 39’s stack, with the interesting frames highlighted]

This is enough to form a strong suspicion about what is going on. I checked some of the other high CPU threads and my suspicion was confirmed, but this single stack trace alone would have been enough.

Pretty much whenever you see a thread burning CPU inside the Dictionary class, it means that you are accessing it concurrently. This is unsafe and may lead to strange effects, one of them being an infinite loop.

In this case, several threads were caught in this infinite loop. The stack trace also told us where in RavenDB we were doing this, and from there we could confirm that there is indeed a rare set of circumstances in which a timer can fire fast enough that the previous timer callback hasn’t had a chance to complete, with both callbacks modifying the same dictionary and causing the issue.
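
To make the failure mode concrete, here is a minimal sketch (my own illustration, not RavenDB’s actual code) of how overlapping timer callbacks can corrupt a shared Dictionary, and the usual fix:

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading;

public class StatsCollector
{
    // BUG: Dictionary<TKey,TValue> is not thread safe. If a timer callback
    // fires again before the previous one completes, two threads can mutate
    // the bucket table at the same time, corrupting the entry links and
    // sending later lookups into an infinite loop - the 100% CPU above.
    private readonly Dictionary<string, long> _unsafeCounts = new Dictionary<string, long>();

    // Fix: use a concurrent map (or take a lock around every access).
    private readonly ConcurrentDictionary<string, long> _safeCounts =
        new ConcurrentDictionary<string, long>();

    private readonly Timer _timer;

    public StatsCollector()
    {
        // A period short enough that a slow callback overlaps the next firing.
        _timer = new Timer(_ => Collect(), state: null, dueTime: 0, period: 100);
    }

    private void Collect()
    {
        // Unsafe - two overlapping callbacks race on the same dictionary:
        // _unsafeCounts["requests"] = _unsafeCounts.TryGetValue("requests", out var v) ? v + 1 : 1;

        // Safe:
        _safeCounts.AddOrUpdate("requests", 1, (_, count) => count + 1);
    }
}
```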

RavenDB Setup: How the automatic setup works

time to read 8 min | 1456 words

One of the coolest features in the RC2 release of RavenDB is the automatic setup; in particular, how we managed to get a completely automated secured setup with a minimal amount of fuss on the user’s end.

You can watch the whole thing from start to finish; it takes about 3 minutes to go through the process (if you aren’t also explaining what you are doing), and you end up with a fully secured cluster whose nodes talk to each other over secured TLS 1.2 channels. This was made harder because we are actually running with trusted certificates. That was a hard requirement, because we use the RavenDB Studio to manage the server, and that is a web application hosted on RavenDB itself. As such, it is subject to all the usual rules of browser based applications, including scary warnings and the inability to act if the certificate isn’t valid and trusted.

In many cases, this leads people to choose HTTP, because at least with that model you don’t have to deal with all the hassle. Consider the problem: unlike a website, which has (at least conceptually) a single deployment, RavenDB is deployed on customer sites and runs on anything from local developer machines to cloud servers. In many cases, it is hidden behind multiple layers of firewalls, routers and internal networks. Users may choose to run it in any number of strange and wonderful configurations, and it is our job to support all of them.

In such a situation, defaulting to HTTP only makes things easy, mostly because things just work. Using HTTPS requires a certificate. We can obviously use a self signed certificate, and have the following shown to the user on first access to the website:

[Image: browser warning for an untrusted self signed certificate]

As you can imagine, this is not going to inspire confidence in users. In fact, I can think of few better ways to ensure the shortest “download to recycle bin” path. Now, we could ask the administrator to generate a certificate and ensure that this certificate is trusted. That would work, if we could assume that there is an administrator. I think that asking a developer who isn’t well versed in security practices to do that is likely to produce an even faster “this is a waste of my time” reaction than the unsecured warning option.

We considered the option of installing a (locally generated) root certificate and generating a certificate from that. This would work, but only on the local machine, and RavenDB is, by nature, a distributed database. It would make for a great demo, but it would cause a great deal of hardship down the line; exactly the kind of feature and behavior that we don’t want. And even if we generated the root certificate locally and threw it away immediately afterward, the idea still bothered me greatly, so that was something we considered only in times of great depression.

So, to sum it all up, we need a way to generate a valid certificate for a random server, likely running in a protected network, inaccessible from the outside (as in, pretty much all corporate / home networks these days). We need to do this without requiring the user to set up dynamic DNS, configure port forwarding on the router or generate their own certificates. It also needs to be fast enough that we can do it as part of the setup process; anything that would require a few hours / days is out of the question.

We looked into what it would take to generate our own trusted SSL certificates. This is actually quite possible, but the cost is prohibitive, given that we wanted to allow this for free users as well, and all the options we found had a per-certificate cost associated with them.

Let’s Encrypt is the answer for HTTPS certificate generation on the public web, but the vast majority of our deployments are likely to be inside the firewall, where Let’s Encrypt’s usual HTTP challenge cannot reach the server. Furthermore, going that route would require users to define and manage DNS settings as part of deploying RavenDB. That is something we wanted to avoid.

This might require some explanation. The setup process I’m talking about is not just for setting up a production instance. We consider any installation of RavenDB to be worth a production grade setup. This is a lesson from the database ransomware tales; I see no reason why we should learn it again on the backs of our users, so a high priority was given to making sure that the default install mode is also the secure and proper one.

All the options ruled out in this post (provide your own certificate, set up DNS, etc.) are entirely possible (and quite easy) with RavenDB, if an admin so chooses, and we expect that many will want to set up RavenDB in a manner that fits their organization’s policies. But here we are talking about the baseline (yes, dear) install, and we want to make it as simple and straightforward as we possibly can.


There is another problem with Let’s Encrypt for our situation: we need to generate a lot of certificates, significantly more than the default rate limit that Let’s Encrypt allows. Luckily, they provide a way to request an extension to this rate limit, which is exactly what we did. Once this was granted, we were almost there.

The way RavenDB generates certificates as part of the setup process is a bit involved. We can’t just generate a certificate for any old hostname; we need to provide proof to Let’s Encrypt that we own the hostname in question. For that matter, who is the “we” in question? I don’t want to be exposed to all the certificates that are generated for the RavenDB instances out there. That is not a good way to handle security.

The key for the whole operation is the following domain name: dbs.local.ravendb.net

During setup, the user registers a subdomain under that, such as arava.dbs.local.ravendb.net. We ensure that only a single user can claim each domain. Once they have done that, they tell RavenDB what IP address they want to run on. This can be a public IP exposed on the internet, a private one (such as 192.168.0.28) or even a loopback device (127.0.0.1).

The local server, running on the user’s machine, then initiates a challenge against Let’s Encrypt for the hostname in question. With the answer to the challenge, the local server then calls api.ravendb.net. This is our own service, running in the cloud. Its purpose is to validate that the user “owns” the domain in question and to update the DNS records to match the Let’s Encrypt challenge.

The local server can then go back to Let’s Encrypt and ask it to complete the process and generate the certificate for the server. At no point does the certificate go through our own servers; it is all handled on the client machine. There is another thing happening here: alongside the DNS challenge, we also update the domain the user chose to point to the IP they are going to be hosted at. This means that the global DNS network will point to your database. This is important, because the hostname you use to talk to RavenDB needs to match the hostname on the certificate.
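
For the curious, here is roughly what the DNS side of such a challenge involves. This is a sketch based on the ACME dns-01 challenge format; the helper names are mine, not RavenDB’s actual code:

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

public static class AcmeDns01
{
    // For a dns-01 challenge, the CA expects a TXT record at
    // _acme-challenge.<domain> whose value is the base64url-encoded SHA-256
    // digest of "<challenge token>.<account key thumbprint>".
    public static string TxtRecordValue(string token, string accountKeyThumbprint)
    {
        var keyAuthorization = $"{token}.{accountKeyThumbprint}";
        using (var sha256 = SHA256.Create())
        {
            var digest = sha256.ComputeHash(Encoding.UTF8.GetBytes(keyAuthorization));
            return Base64Url(digest);
        }
    }

    private static string Base64Url(byte[] data) =>
        Convert.ToBase64String(data).TrimEnd('=').Replace('+', '-').Replace('/', '_');
}

// In the flow described above, the local server computes this value and hands
// it to api.ravendb.net, which publishes it as the TXT record for
// _acme-challenge.arava.dbs.local.ravendb.net before Let's Encrypt validates.
```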

Obviously, RavenDB will also make sure to refresh the Let’s Encrypt certificate on a timely basis.

The entire process is seamless and quite amazing when you see it, especially because even developers might not realize just how much goes on under the covers and how much pain it takes away from them.

We ran into a few issues along the way, and Let’s Encrypt support has been quite wonderful in this regard, including deploying a code fix that allowed us to make the RC2 deadline with the full feature in place.

There are still issues if you are running on a completely isolated network, and some DNS configurations can cause problems, but we typically detect those and give a good warning (allowing you to switch to 8.8.8.8 as a workaround for most such issues). The important thing is that we achieved the main goal: a seamless and easy setup with the highest level of security.

RavenDB 4.0 book update is available

time to read 2 min | 388 words

A new update to the Inside RavenDB book is available. I’m up to chapter 9 (although chapter 8 is just a skeleton). You can read it here.

In particular, the details about running RavenDB in a cluster and the distributed technologies and approaches it uses are now fully covered. I still have to get back to discussing ETL strategies, but there are two full chapters discussing how RavenDB clusters and replication work in detail. I would dearly appreciate any feedback on that part.

This is a complex topic, and I want to get additional eyes on it to make sure that it is understandable, especially if you are new to distributed systems and how they work.

Another major advantage is that we now have a professional editor going through chapters 1 – 7, so the usage of the English language has probably leveled up at least twice. Errors, awkward phrasing and outright mistakes remain my own, and I would love to hear about any issues you find.

Also new in this drop is a full chapter about querying RavenDB and a dive into the new RQL language. There is still a lot to cover about indexes, and this chapter hasn’t been edited yet, but I think it should give good insight into how we actually do things and what you can do with the new query language.
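
To give a taste of it, here is roughly what a simple RQL query looks like when issued from the .NET client; the Employee class and the exact query text are just illustrative:

```csharp
using System.Collections.Generic;
using System.Linq;
using Raven.Client.Documents;

public class Employee
{
    public string FirstName { get; set; }
    public string LastName { get; set; }
}

public static class RqlExample
{
    public static List<Employee> FindOrens(IDocumentStore store)
    {
        using (var session = store.OpenSession())
        {
            // RawQuery sends the RQL text to the server as-is.
            return session.Advanced
                .RawQuery<Employee>("from Employees where FirstName = 'Oren' order by LastName")
                .ToList();
        }
    }
}
```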

In addition to that, we are ramping up documentation work as we start closing things down for the actual final release. We are currently aiming for the end of the year, so it is right around the corner. I would also like to remind people that we are currently giving a 30% discount on RavenDB licenses for the duration of the Release Candidate. This offer will go away after the RTM release.

Another source of confusion seems to be the community license. I wanted to clarify that you can absolutely use the community license for production usage, including using features such as high availability and running in a cluster.

So grab a license, or just grab the bits and run with them. But most importantly, grab the book (https://github.com/ravendb/book/releases) and let me know what you think.

RavenDB 4.0 nightly builds are now available

time to read 2 min | 245 words

With the RC release out of the way, we are starting on a much faster cadence of fixes and user visible changes as we get ready for the release.

In order to allow users to report issues and have them resolved as soon as possible, we are now publishing nightly builds.

The nightly release is literally just whatever we have at the top of the branch at the time of the release. A nightly release goes through the following release cycle:

  • It compiles
  • Release it!

In other words, a nightly should be used only in a development environment, where you are fine with the database deciding that all names must be “Green Jane”, burping all over your data or investigating how hot it can make your CPU.

More seriously, nightlies are a way to keep up with what we are doing, and their stability is directly related to whatever we are currently working on. As we come closer to the release, the stability of the nightly builds will improve, but there are no safeguards there.

It means that the typical turnaround for most issues can be as low as 24 hours (and it gives me back the ability to say, “thanks for the bug report, fixed and it will be available tonight”). All other releases remain with the same level of testing and preparedness.

RavenDB 4.0: Managing encrypted databases

time to read 6 min | 1012 words

On the right you can see how the new database creation dialog looks when you want to create a new encrypted database. I talked about the actual implementation of full database encryption before, but today’s post has a different focus.

I want to talk about managing encrypted databases. As an admin tasked with working with encrypted data, I need to manage not only the data in the database itself, but also a lot more failure points than when not using encryption. The most obvious is that if you have an encrypted database in the first place, then the data you are protecting is very likely sensitive in nature.

That raises the immediate question of who can see that information. For that matter, are you allowed to see that information? RavenDB 4.0 has support for time limited credentials: you register to get credentials in the system, and using whatever workflow you have, the system generates a temporary API key for you that is valid for a short time and then expires.
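
The core mechanic is simple enough to sketch. This is just the idea, not RavenDB’s implementation:

```csharp
using System;

// The essence of a time limited credential: the key carries an expiry,
// and every request checks it before granting access.
public class TemporaryApiKey
{
    public string Key { get; private set; }
    public DateTime ExpiresUtc { get; private set; }

    public static TemporaryApiKey Issue(TimeSpan validity) => new TemporaryApiKey
    {
        Key = Guid.NewGuid().ToString("N"), // stand-in for a real signed token
        ExpiresUtc = DateTime.UtcNow + validity
    };

    public bool IsValid => DateTime.UtcNow < ExpiresUtc;
}
```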

What about all the other things an admin needs to do? The most obvious example is backups, either routine or emergency ones. It is pretty obvious that if the database is encrypted, we also want the backups to be encrypted, but are they going to use the same key? How do you restore? What about moving the database from one machine to another?

In the end, it all hangs on the notion of keys. When you create a new encrypted database, we’ll generate a key for you and require you to confirm that you have persisted it somewhere. You can print it, download it, etc., and you can see the key right there in plain text during the database creation. However, this is the last time that the database key will ever reside in plain text.

So what about this QR code? What is it doing there? Put simply, it is there to capture attention. It replicates the same information that you have in the key itself, obviously. But what for?

The idea is that users often hurry through the process (the “Yes, dear!” mode), and we want to encourage them to stop for a second and think. The QR code also makes it much more likely that the admin will print and save the key in an offline manner, which is likely to be safer than most other methods.

So this is how we encourage administrators to safely store the encryption key. This is useful because it gives the admin the ability to take a snapshot on one machine and then recover it on another, where the encryption key is not available, or to just move the hard disk between machines if the old one failed. It is actually quite common in cloud scenarios to have a machine with attached cloud storage: if the machine fails, you just spin up a new machine and attach the storage to the new one.

We keep the encryption keys secret by utilizing system specific keys (either DPAPI or a decryption key that only the specific user can access), so moving machines like that will require the admin to provide the encryption key before we can continue working.
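
On Windows, DPAPI makes this kind of protection a one-liner. A sketch of the idea (not our exact code):

```csharp
using System.Security.Cryptography;

public static class KeyProtection
{
    // The database encryption key is stored on disk wrapped with a key that
    // only this user on this machine can unwrap. Copy the file to another
    // machine and Unprotect() fails - which is exactly why moving a disk
    // between machines requires the admin to supply the key again.
    public static byte[] Protect(byte[] databaseKey) =>
        ProtectedData.Protect(databaseKey, optionalEntropy: null,
            scope: DataProtectionScope.CurrentUser);

    public static byte[] Unprotect(byte[] protectedKey) =>
        ProtectedData.Unprotect(protectedKey, optionalEntropy: null,
            scope: DataProtectionScope.CurrentUser);
}
```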

The issue of backups is different. It is very common to store long term backups and to need to restore them in a separate location / situation. At that point, we need the backup to be encrypted, but we don’t want it to use the same encryption key as the database itself. This is mostly about managing keys: if I’m managing multiple databases, I don’t want to have to track down the encryption key for each one as part of the restore process. That opens us up to a missed key and a useless backup that we can do nothing about.

Instead, when you set up backups (mandatory for encrypted databases, optional for regular ones), we’ll give you the option to provide a public key that we’ll then use to encrypt the backup. That means you can more safely store it in cloud scenarios, and regardless of how many databases you have, as long as you have the private key, you’ll be able to restore the backup.
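
This is classic envelope encryption. Here is a sketch of how such a scheme can work; this is my illustration, not the actual backup format:

```csharp
using System;
using System.IO;
using System.Security.Cryptography;

public static class BackupEncryption
{
    // Encrypt the backup with a fresh one-time AES key, then wrap that key
    // with the admin's RSA public key. Anyone holding only the public key can
    // write backups; only the private key holder can ever read them back.
    public static void Encrypt(Stream backup, Stream output, RSA adminPublicKey)
    {
        using (var aes = Aes.Create())
        {
            // Wrap the one-time AES key and store it ahead of the payload.
            var wrappedKey = adminPublicKey.Encrypt(aes.Key, RSAEncryptionPadding.OaepSHA256);
            output.Write(BitConverter.GetBytes(wrappedKey.Length), 0, sizeof(int));
            output.Write(wrappedKey, 0, wrappedKey.Length);
            output.Write(aes.IV, 0, aes.IV.Length);

            using (var encryptor = aes.CreateEncryptor())
            using (var crypto = new CryptoStream(output, encryptor, CryptoStreamMode.Write))
            {
                backup.CopyTo(crypto);
            }
        }
    }
}
```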

Finally, we have one last topic to cover with regards to encryption: the overall system configuration. Each database can be encrypted, sure, but the management data for the database (such as connection strings it replicates to, API keys it uses to store backups and a lot of other potentially sensitive information) is still stored in plain text. For that matter, even if the database itself shouldn’t be encrypted, you might still want to encrypt the full system configuration. That leads to somewhat of a chicken and egg problem.

On the one hand, we can’t encrypt the server configuration from the get go, because then the admin would not know what the key is, and they might need it if they have to move machines, etc. But once we’ve started, we are already using the server configuration, so we can’t just encrypt it on the fly. What we ended up with is a set of command line parameters: if the admin wants to run with an encrypted server configuration, they can stop the server, run a command to encrypt the server configuration and set up the appropriate encryption key retrieval process (DPAPI, for example, under the right user).

That gives us the chance to make the user aware of the key and allow them to save it in a secure location. If one member of a cluster has an encrypted server configuration, all the other members must have it as well, which prevents accidental leaks.

I think that is about it with regards to the operational details of managing encryption. I’m pretty sure that I missed something, but this post is getting long as it is.

Elemar Junior is joining our Latin America RavenDB team

time to read 2 min | 321 words

Today is a great plus-one news day for RavenDB Latin America: Elemar Junior is joining our team as an official RavenDB consultant on January 1st.

Elemar will help our customers, write a lot of code, produce demos, videos and tutorials, and in general focus on making the getting started process easier, faster and smoother for developers and operations people as they get familiar and accustomed to getting the best out of RavenDB.

Elemar is well known for writing and speaking about advanced topics in development, design and software architecture. If you aren’t familiar with him, feel free to check out his blog (pt-br only) or LinkedIn, but the short gist of it is that he started writing computer code when he was nine years old and still loves it. He has over seventeen years of professional experience developing software (used in over thirty countries) for furniture manufacturing and for the design and planning of residential and commercial spaces.

With a strong focus on the Microsoft ecosystem, he has been a Microsoft MVP since 2011. Elemar is the author of FluentIL, an IL-emitting library for the .NET platform, and a leading figure behind CodeCracker, a popular analyzer library for C# and VB that uses Roslyn to produce refactorings, code analysis, and other niceties.

Elemar can be reached at elemarjr@ravendb.net and is currently looking for someone who will Photoshop this image with a cat or ask him tough questions about RavenDB.

On a more serious note, this represents a bigger focus on having dedicated people who just work with customers, providing guidance and support and writing tutorials and sample applications; in general, making everything that much easier.

Code quality gateways

time to read 2 min | 319 words

I just merged a Pull Request from one of our guys. This is a pretty important piece of code, so it went through two rounds of code reviews before it actually hit the PR stage.

That was the point where the tests ran (our full test suite takes over an hour, so we run a limited set frequently and leave the rest for later), and we got three failing tests. The funny thing was, it wasn’t a functional test that failed; it was the code quality gateway tests.

The RavenDB team has grown quite a lot, and we are hiring again, and it is easy to lose knowledge of the “unimportant” things. Stuff that doesn’t bite you in the ass right now, for example, but will most assuredly bite you in the ass (hard) at a later point in time.

In this case, an exception class was introduced, but it didn’t have the proper serialization support. Now, we don’t actually use AppDomains all that often (at all, really), but our users sometimes do, and getting a “your exception could not be serialized” error makes for an annoying debug session. Another couple of failing tests related to missing asserts on new configuration values. We want to make sure that a new configuration value is actually connected, parsed and working, and we have a test that fails if you aren’t checking all the configuration properties.
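
For reference, this is the classic pattern such a gateway test enforces; the exception name here is just an example:

```csharp
using System;
using System.Runtime.Serialization;

// Without the [Serializable] attribute and the serialization constructor,
// the exception cannot cross an AppDomain boundary, and the user gets a
// serialization failure instead of the actual error.
[Serializable]
public class DocumentConflictException : Exception
{
    public DocumentConflictException() { }
    public DocumentConflictException(string message) : base(message) { }
    public DocumentConflictException(string message, Exception inner) : base(message, inner) { }

    protected DocumentConflictException(SerializationInfo info, StreamingContext context)
        : base(info, context) { }
}
```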

There are quite a few such tests, from making sure that we don’t have async void methods to ensuring that the tests don’t leak connections or resources (which could harm other tests and complicate root cause analysis).

We also started making use of code analyzers, for things like keeping all complex log statements inside conditionals (to save allocations) and validating that all async calls use ConfigureAwait(false).
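
Both rules look roughly like this in practice; the ILog interface below is a stand-in for whatever logging abstraction the codebase uses:

```csharp
using System.Threading.Tasks;

// Stand-in logging abstraction for the example.
public interface ILog
{
    bool IsInfoEnabled { get; }
    void Info(string message);
}

public class RequestHandler
{
    private readonly ILog _log;

    public RequestHandler(ILog log) => _log = log;

    public async Task HandleAsync(string documentId)
    {
        // Guarding the call keeps the interpolated string (and its
        // allocations) from being built at all when Info logging is off.
        if (_log.IsInfoEnabled)
            _log.Info($"Loading document {documentId}");

        // Library code shouldn't capture the synchronization context;
        // the analyzer flags any await that is missing this.
        await LoadDocumentAsync(documentId).ConfigureAwait(false);
    }

    private Task LoadDocumentAsync(string id) => Task.CompletedTask;
}
```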

Those aren’t big things, but getting them right, all the time, gives us a really cool product and a maintainable codebase.

Looking at the bottom line

time to read 3 min | 457 words

In 2009, I decided to make a pretty big move: I decided to take RavenDB and turn it from a side project into a real product. That was something I actually had a lot of trouble with. Unlike the profiler suite, which is a developer tool with a relatively short time to purchase, a database was something I knew was going to be a lot more complex in terms of just getting sales.

Unlike a developer tool, which is a pretty low risk investment, a database is something pretty significant, and that means it takes time to settle into the market. Even once a user starts developing with RavenDB, it is usually 3 – 6 months minimum just to get to the point where they order a license. Add that to the cost of bringing a new product to market, and…

Anyway, it wasn’t an easy decision. Today I was looking at some reports when I noticed something interesting. The following is the breakdown of our revenue by product since the first sale of NHibernate Profiler. There is no doubt that NH Prof is a really good product for us, but it is actually pretty awesome that RavenDB is in second place.

[Image: revenue breakdown by product since the first NHibernate Profiler sale]

This is especially significant given that the profilers had several years of lead time in the market over RavenDB. In fact, running the numbers, until 2011 we sold precious few RavenDB licenses. Here are the sales numbers for the past few years:

[Image: RavenDB license sales by year]

Obviously, the numbers for 2013 are still not complete, but we have already more than surpassed 2012, and we still have a full quarter to go.

For that matter, looking at the numbers just for 2013, we see:

[Image: revenue breakdown by product for 2013]

So NH Prof is still a very major product, but RavenDB is now our top performing product for 2013, which makes me feel a whole lot better.

Of course, it also means that we probably need to get rid of a few other products; in particular, the LLBLGen, Linq to SQL and Hibernate profilers don’t look like they are worth the trouble to keep going. But that is a matter for another time.

Hibernating Rhinos Practices: A Sample Project

time to read 2 min | 320 words

I have previously stated that one of the things I look for in a candidate is actual code written by the candidate. Now, I won’t accept “this is a project that I did for a client / employer”, and while it is nice to be pointed at a URL from the last project the candidate took part in, it is not a really good way to evaluate someone’s abilities.

Ideally, I would like to see an OSS portfolio that we can look at, but that isn’t always an option. Instead, I decided to send potential candidates the following:

Hi,

I would like to give you a small project, and see how you handle that.

The task at hand is to build a website for webinar questions. We run bi-weekly webinars for our users, and we want to do the following:

  • Show the users a list of our webinars (The data is here: http://www.youtube.com/user/hibernatingrhinos)
  • Show a list of the next few scheduled webinars (in the user’s own time zone)
  • Allow the users to submit questions, comment on questions and vote on questions for the next webinar.
  • Allow the admin to mark specific questions as answered in a specific webinar (after it was uploaded to YouTube).
  • Manage Spam for questions & comments.

The project should be written in C#, beyond that, feel free to use whatever technologies that you are most comfortable with.

Things that we will be looking at:

  • Code quality
  • Architecture
  • Ease of modification
  • Efficiency of implementation
  • Ease of setup & deployment

Please send us the link to a Git repository containing the project, as well as any instructions that might be necessary.

Thanks in advance,

     Oren Eini

This post will go live about two weeks after I started sending this to candidates, so I am not sure yet what the response will be.

Hibernating Rhinos Practices: Design

time to read 2 min | 286 words

One of the things that I routinely get asked is how we design things. The answer is that we usually do not. Most things do not require complex design; the requirements we set pretty much dictate how things are going to work. Sometimes, users make suggestions that turn into a light bulb moment, and things shift very rapidly.

But sometimes, usually with the big things, we actually do need to do some design upfront. This is usually true in the complex / user facing parts of our projects. The Map/Reduce system, for example, was mostly re-written in RavenDB 2.0, and that only happened after multiple internal design sessions, a full stand alone spike implementation and a lot of coffee, curses and sweat.

In many cases, when we can, we will post a suggested design on the mailing list and ask for feedback. Here is an example of such a scenario:

In this case, we didn’t get to this feature in time for the 2.0 release, but we kept thinking and refining the approach for that.

The interesting thing is that in those cases, we usually “design” things by writing the high level user visible API and then just letting it percolate. There are a lot of additional things we would need to change to make this work (backward compatibility being a major one), so there is a lot of additional work to be done, but that can happen during the implementation. Right now we can let it sit, get users’ feedback on the proposed design and get the current minor release out the door.
