Ayende @ Rahien

Hi!
My name is Oren Eini
Founder of Hibernating Rhinos LTD and RavenDB.
You can reach me by email or phone:

ayende@ayende.com

+972 52-548-6969


I want to see the QA process that catches this bug!

time to read 2 min | 344 words

When we get bug reports from the field, we routinely also do a small assessment to figure out why we missed the issue in our own internal tests and runway to production.

We just got a bug report like that. RavenDB is not usable at all on a Raspberry Pi because of an error about non-ASCII usage.

This is strange. To start with, we test on Raspberry Pi. To be rather more exact, we test on the same hardware and software combination that the user was running on. And what is this non-ASCII stuff? We don’t have any such thing in our code.

As we investigated, we figured out that the root cause was that we were trying to pass a non-ASCII value to the headers of the request. That didn’t make sense; the only things we write to the request in this case are well-defined values, such as numbers and constant strings, all of which should be in ASCII. What was going on?

After a while, the mystery cleared. In order to reproduce this bug, you needed to have the following preconditions:

  • A file hashed to a negative Int64 value.
  • A system whose culture settings were set to sv-SE (Swedish).
  • Run on Linux.

This is detailed in this issue. On Linux (and not on Windows), when using the Swedish culture, negative numbers are written as “−1” and not “-1”.

For those of you with sharp eyes, you noticed that this is U+2212 (minus sign), and not U+002D (hyphen-minus). On Linux, for Unicode knows what reason, this is used as the negative sign. I would complain, but my native language has „.

Anyway, the fix was to force the usage of the invariant culture when converting the Int64 to a string for the header, which is pretty obvious. We are also exploring how to fix this in a more global manner.
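To make the failure mode concrete, here is a minimal sketch in Python (not the C# that RavenDB itself is written in). The `culture_format` helper is hypothetical and only mimics a culture that uses U+2212 as its negative sign; the hash value is made up. It shows why the culture-formatted number cannot go into an ASCII-only header while the invariant form can.

```python
MINUS_SIGN = "\u2212"     # the negative sign the Swedish culture produced on Linux
HYPHEN_MINUS = "\u002d"   # the plain ASCII '-'


def culture_format(value: int, negative_sign: str) -> str:
    """Mimic culture-sensitive formatting with a configurable negative sign."""
    return (negative_sign if value < 0 else "") + str(abs(value))


def invariant_format(value: int) -> str:
    """str() on an int never consults the locale, so the result is always ASCII."""
    return str(value)


def is_valid_header_value(text: str) -> bool:
    """Header values in this sketch must be pure ASCII."""
    return text.isascii()


file_hash = -8125391204418762  # made-up negative Int64 hash
print(is_valid_header_value(culture_format(file_hash, MINUS_SIGN)))  # False
print(is_valid_header_value(invariant_format(file_hash)))            # True
```

The two strings look almost identical on screen, which is part of why the bug is so hard to spot by eye.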

But I keep coming back to the set of preconditions that is required. Sometimes I wonder why we missed a bug; in this case, I can only say that I would have been surprised if we had found it.

Times are hard

time to read 2 min | 277 words

One of the things RavenDB does is allow you to define a backup task that will be executed on a given schedule (such as every Saturday at midnight). However, as it turns out, specifying the right time is actually a pretty hard thing to do. The problem is what to do when you have multiple time zones involved:

  • UTC
  • The server local time
  • The operator’s local time
  • The business hours of the application using the database

In some cases, you might have a server in Germany being managed from Japan with users primarily from South Africa. There are at least four different options for when Saturday’s midnight is, and the one sure thing is that it will happen when you least want it to.

Because of that, RavenDB takes the simple position that the time that it cares about is the server's own time. An operator is free to define it as they wish, but only the server local time is relevant. But we still need to make the operator’s job easier, and we do it using the following method:

image

The operator can specify the schedule using CRON syntax (which should be familiar to most admins). We translate the CRON syntax to a human readable string, but we also provide the next backup date in the server’s time (when it will actually run), in the operator’s local time (which, as you can see, is a bit different from the server’s) and the duration until then. The latter is actually really important because it gives the operator an intuitive understanding of when the backup is going to run next.
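As a rough illustration of the three values shown to the operator, here is a small Python sketch. It handles only a weekly "day and time" schedule instead of full CRON syntax, and the fixed UTC offsets for the server and the operator are assumptions chosen to match the Germany and Japan example above (no daylight saving handling).

```python
from datetime import datetime, timedelta, timezone


def next_weekly_run(now_server: datetime, weekday: int, hour: int, minute: int) -> datetime:
    """Next occurrence of a weekly schedule, evaluated in the server's own time zone.

    weekday follows datetime.weekday(): Monday is 0, Saturday is 5.
    """
    candidate = now_server.replace(hour=hour, minute=minute, second=0, microsecond=0)
    candidate += timedelta(days=(weekday - now_server.weekday()) % 7)
    if candidate <= now_server:
        candidate += timedelta(days=7)  # this week's slot has already passed
    return candidate


SERVER_TZ = timezone(timedelta(hours=1))    # hypothetical server in Germany
OPERATOR_TZ = timezone(timedelta(hours=9))  # hypothetical operator in Japan

now = datetime(2018, 1, 10, 15, 30, tzinfo=SERVER_TZ)  # a Wednesday afternoon
run_at = next_weekly_run(now, weekday=5, hour=0, minute=0)

print("server time:  ", run_at.isoformat())
print("operator time:", run_at.astimezone(OPERATOR_TZ).isoformat())
print("runs in:      ", run_at - now)
```

The schedule is only ever evaluated against the server's clock; the operator's time and the remaining duration are derived from that single answer purely for display.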

Migrating data from RavenDB 3.5 to 4.0

time to read 2 min | 325 words

One of the first steps you’ll have when migrating RavenDB from 3.5 to 4.0 is to actually get your data into 4.0. There are a few ways of doing that.

You can create a new database in 4.0 from a 3.5 database directory. You can click on the chevron on the New database button to access it:

image

This will give you the following screen, where you can point to the existing database directory (the RavenDB 3.5 server must be offline for this) and the Raven.StorageExporter tool that comes with the 3.5 distribution. RavenDB 4.0 will then create your database and import all the data from the existing db to the new one.

image

This works great if you are doing this as a one-time operation, but in many cases, the migration process is a long one. You’ll start by migrating your code, and it will take one or two iterations to complete the full process.

In order to handle that scenario, you’ll create a new database on 4.0 normally, then go to Settings > Import and select importing from another database. In this mode, the 3.5 server is online and running. You’ll provide the details of the server and database and then click on Migrate Database, as you can see in the picture.

image

This will import all the data from the existing database to the new database. This can be an ongoing process. Once this is done, you can migrate your application code to use RavenDB 4.0 and at deployment time, you’ll run this again.

Each time you run this migration, it will get only the updated data from the source server; it doesn’t have to read it all from scratch.
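The post does not describe how the incremental import tracks what it has already copied. One common way to get that behavior is to remember a change marker from the previous run, which the following Python sketch illustrates. The `etag` counter, `SourceDoc` and `Migrator` are all hypothetical names for illustration and are not RavenDB's actual API.

```python
from dataclasses import dataclass, field


@dataclass
class SourceDoc:
    doc_id: str
    etag: int   # hypothetical change counter that grows on every write
    body: dict


@dataclass
class Migrator:
    target: dict = field(default_factory=dict)
    last_etag: int = 0  # highest change counter already imported

    def run(self, source: list) -> int:
        """Copy only documents changed since the previous run; return how many were copied."""
        changed = [doc for doc in source if doc.etag > self.last_etag]
        for doc in sorted(changed, key=lambda d: d.etag):
            self.target[doc.doc_id] = doc.body
            self.last_etag = doc.etag
        return len(changed)


source = [
    SourceDoc("users/1", 1, {"name": "Oren"}),
    SourceDoc("users/2", 2, {"name": "Arava"}),
]
migrator = Migrator()
first_run = migrator.run(source)                            # copies both documents
source[0] = SourceDoc("users/1", 3, {"name": "Oren Eini"})  # one document changes
second_run = migrator.run(source)                           # copies only the changed one
print(first_run, second_run)
```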

Production Test Run: The self flagellating server

time to read 2 min | 354 words

Sometimes you see the impossible. In one of our scenarios, we saw a cluster that had such a bad case of split brain that it came near to fracturing the very boundaries of space & time.

In a three node cluster, we have one node that looked to be fine. It connected to all the other nodes and was the cluster leader. The other two nodes, however, were not in the cluster and in fact, they were showing signs that they never were in the cluster.

What was really strange was that we took the other two machines down and the first node was still showing a successful cluster. We looked deeper and realized that it wasn’t actually a healthy situation, in fact, this node was very rapidly switching between leader and follower mode.

It took a bit of time to figure out what was going on, but the root cause was DNS. We had the three nodes on separate DNS names (a.oren.development.run, b.oren.development.run, c.oren.development.run) and they were set up to point to the three machines. However, we had previously used the same domain names to run a cluster on the first machine only. Because of the way DNS updates propagate, whenever the machine at a.oren.development.run tried to connect to b.oren.development.run, it would actually connect to itself.

At this point, A would tell B that it is the leader. But A is B, so A would respond by becoming a follower (because it was told it should, by itself). Because it became a follower, it disconnected from itself. After a timeout, it would become leader again, and the cycle would continue.

Every time that the server would get up, it would whip itself down again. “I’m a leader”, “No, I’m a leader”, etc.

This is a fun thing to discover. We had to trace pretty deep to figure out that the problem was in the DNS cache (since the DNS itself was properly updated).

We fixed things so we now recognize if we are talking to ourselves and error properly.
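The post does not show how the check is implemented. A minimal Python sketch of the idea, assuming each node carries a unique identity that it reports during a handshake, might look like this. `Node`, `handshake`, `ConnectedToSelfError` and the `resolve` callback (a stand-in for DNS) are all hypothetical.

```python
import uuid


class ConnectedToSelfError(Exception):
    """Raised when a handshake reveals that the remote peer is this very node."""


class Node:
    def __init__(self, url: str):
        self.url = url
        self.node_id = uuid.uuid4().hex  # hypothetical per-node identity

    def handshake(self) -> str:
        """What a node reports about itself when a peer connects."""
        return self.node_id

    def connect(self, resolve, peer_url: str) -> str:
        """Connect to peer_url; `resolve` stands in for DNS and may return stale answers."""
        remote = resolve(peer_url)
        remote_id = remote.handshake()
        if remote_id == self.node_id:
            raise ConnectedToSelfError(f"{peer_url} resolved back to this node ({self.url})")
        return remote_id


a = Node("a.oren.development.run")
b = Node("b.oren.development.run")

fresh_dns = {a.url: a, b.url: b}.__getitem__
stale_dns = lambda url: a  # stale cache: every name still points at the first machine

print(a.connect(fresh_dns, b.url) == b.node_id)  # True
try:
    a.connect(stale_dns, b.url)
except ConnectedToSelfError as error:
    print("refused:", error)
```

Comparing identities rather than addresses is what makes this robust: with a stale DNS cache the addresses look different even though both ends are the same process.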

Production Test Run: When your software is configured by a monkey

time to read 3 min | 457 words

System configuration is important, and the more complex your software is, the more knobs you usually have to deal with. That is complex enough as it is, because sometimes these configurations are interdependent. But it becomes a lot more interesting when we are talking about a distributed environment.

In particular, one of the oddest scenarios that we had to deal with in the production test run was when we got the different members in the cluster to be configured differently from each other. Including operational details such as endpoints, security and timeouts.

This can happen for real when you make a modification on a single server, because you are trying to fix something, and it works, and you forget to deploy it to all the others. Because people drop the ball, or because you have different people working on different things at the same time.

We classified such errors into three broad categories:

  • Local state which is fine to be different on different machines. For example, if each node has a different base directory or runs under a different user, we don’t really care about that.
  • Distributed state which breaks horribly if misconfigured. For example, if we use the wrong certificate trust chains on different machines. This is something we don’t really care about, because things will break in a very visible fashion when this happens, which is quite obvious and will allow quick resolution.
  • Distributed state which breaks horribly and silently down the line if misconfigured.

The last state was really hard to figure out and quite nasty. One such setting is the timeout for cluster consensus. In one of the nodes, this was set to 300 ms and on another, it was set to 1 minute. We derive a lot of behavior from this value. A server will heartbeat every 1/3 of this value, for example, and will consider a node down if it didn’t get a heartbeat from it within this timeout.

This kind of issue meant that when the nodes are idle, one of them would ping the others every 20 seconds, while they would expect a ping every 300 milliseconds. However, when they escalated things to check explicitly with the server, it replied that everything was fine, leading to the whole cluster being confused about what is going on.

To make things more interesting, if there is activity in the cluster, we don’t wait for the timeout, so this issue shows up only in idle periods.

We tightened things so we enforce the requirement that such values be the same across the cluster by explicitly validating this, which can save a lot of time down the road.
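The arithmetic behind the mismatch, and the kind of validation described, can be sketched in a few lines of Python. The function names and the dictionary of per-node settings are assumptions for illustration; the post does not show RavenDB's actual validation code.

```python
from datetime import timedelta


def heartbeat_interval(consensus_timeout: timedelta) -> timedelta:
    """A node sends a heartbeat every third of the consensus timeout."""
    return consensus_timeout / 3


def validate_cluster_timeouts(timeouts: dict) -> None:
    """Raise if the nodes do not all agree on the consensus timeout."""
    if len(set(timeouts.values())) > 1:
        details = ", ".join(f"{node}={value}" for node, value in sorted(timeouts.items()))
        raise ValueError(f"cluster timeout mismatch: {details}")


configured = {
    "A": timedelta(milliseconds=300),
    "B": timedelta(minutes=1),  # the misconfigured node
    "C": timedelta(milliseconds=300),
}

print(heartbeat_interval(configured["B"]))  # 0:00:20, one ping every 20 seconds
print(heartbeat_interval(configured["A"]))  # 0:00:00.100000, what the others expect

try:
    validate_cluster_timeouts(configured)
except ValueError as error:
    print(error)
```

With a 300 ms timeout on the listening side and a 20 second heartbeat on the sending side, the listeners declare the sender down many times between two of its pings, which matches the confusion described above.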

Production Test Run: Too much of a good thing isn’t so good for you

time to read 2 min | 316 words

Not all of our testing happened in a production setting. One of our test clusters was simply running a pretty simple loop of writes, reads and queries on all the nodes in the cluster while intentionally destabilizing the system.

After about a week of this we learned that this worked, there were no memory leaks or increased resource usage and also that the size of the data on disk was about three orders of magnitude too much.

Investigating this we discovered that the test process introduced conflicts because it wrote the same set of documents to each of the nodes, repeatedly. We are resolving this automatically but are also keeping the conflicted copies around so users can figure out what happened to their system. In this particular scenario, we had a lot of conflicted revisions, and it was hard initially to figure out what took that space.

In our production system, we also discovered that we log too much. One of the interesting feedback items we were looking for in this production test run is to see what kind of information we can get from the logs and make sure that the details there are actionable. A part of that was to see if we could troubleshoot something simply using the logs, and add missing details if there was stuff that we couldn’t figure out from them.

We also discovered that under load, we would log a lot. In particular, we had logs detailing every indexed document and replicated item. These are almost never useful, but they generated a lot of noise when we lowered the log settings. So that went away as well. We are very focused on log usability; it should be possible to understand what is going on and why without drowning in minutiae.

Production Test Run: The worst is yet to come

time to read 4 min | 676 words

Before stamping RavenDB with the RTM marker, we decided that we wanted to push it to our production systems. That is something that we have been doing for quite a while, obviously, dogfooding our own infrastructure. But this time was different. While before we had a pretty simple deployment and stable pace, this time we decided to mix things up.

In other words, we decided to go ahead with the IT version of the stooges, for our production systems. In particular, that means this blog, the internal systems that run our business, all our websites, external services that are exposed to customers, etc. As I’m writing this, one of the nodes in our cluster has run out of disk space, it has been doing that since last week. Another node has been torn down and rebuilt at least twice during this run.

We also did a few rounds of “it compiles, it fits production”. In other words, we basically read this guy’s twitter stream and did what he said. This resulted in an infinite loop in production on two nodes, and that issue was handled by someone who didn’t know what the problem was, wasn’t part of the change that caused it, was able to figure it out, and then had to work around it with no code changes.

We also had two different teams upgrade their (interdependent) systems at the same time, which included both upgrading the software and adding new features. I also had two guys with the ability to manage machines, and a whole brigade of people who were uploading things to production. That meant that we had a distinct lack of knowledge across the board, so the people managing the machines weren’t always aware of what the system was experiencing and the people deploying software weren’t aware of the actual state of the system. At some points I’m pretty sure that we had two concurrent (and opposing) rolling upgrades to the database servers.

No, I didn’t spike my coffee with anything but extra sugar. This mess of a production deployment was quite carefully planned. I’ll admit that I wanted to do that a few months earlier, but it looks like my shipment of additional time was delayed in the mail, so we do what we can.

We need to support this software for a minimum of five years, likely longer, which means that we really need to see where all the potholes are and patch them as best we can. This means that we need to test it in bad situations. And there is only so much that a chaos monkey can do. I don’t want to see what happens when the network fails. That is easy enough to simulate and certainly something that we are thinking about. But being able to diagnose a live production system with an infinite loop because of bad error handling, and recover from that? That is the kind of stuff that I want to know that we can do in order to properly support things in production.

And while we had a few glitches, for the most part I don’t think that any of them was really observed externally. The reason for that is the reliability mechanisms in RavenDB 4.0: we need just a single server to remain functional, for the most part, which meant that we could just run without issue even if most of the cluster was flat out broken for an extended period of time.

We got a lot of really interesting results from this experience, and I’ll be posting about some of them in the near future. I don’t think that I would recommend doing that to any customers, but the problem is that we have seen systems that are managed about as poorly, and we want to be able to survive in such (hostile) environments and also be able to support customers that have partial or even misleading ideas about what their own systems look like and how they behave.

The roadmap for 2018

time to read 6 min | 1179 words

The year 2018 just rolled by, and now it is the time to talk about what we want to do this year. The release of the 4.0 version is going to be just the start, to be honest.

In no particular order, I want the following things to happen in the near future:

  • Finishing the book (github.com/ravendb/book). I currently have more than 300 pages in it, and I’m afraid that I’m only 2/3 of the way, if that. RavenDB has gotten big and doing justice to everything it does takes a lot of time. My wish list here is that I’ll finish writing all the content by the first quarter and have it out (as in, you can have it on your desk) by the second quarter. Note that you can read it right now, and the feedback would be very welcome.
  • All the client APIs RTM’ed. We currently have clients for .NET, Python, JVM, Go, Ruby and Node.JS. Some of them are already ready for production, some are at RC level and some are still beta quality. We’ll dedicate some effort and release all of these in the first quarter as well. I think that alongside being able to run on multiple operating systems, we want to give people the choice of using RavenDB from multiple platforms and having a client for a particular platform is the first step on that road.
  • Getting (and incorporating) users’ feedback. We have worked closely with several of our customers on the release of 4.0, and we have got people chomping at the bit to just get it out (who wants to say no to being 10 – 50 times faster). But RavenDB 4.0 is a huge undertaking, and there are going to be things that we missed. The feedback from the RC releases has been invaluable in finding scenarios and conditions that we didn’t consider. I’ve explicitly put aside time to handle that sort of feedback as people are rolling out RavenDB and need to smooth any rough corners that still remain.

These are all the near term plans, for the next few months. These mostly deal with actually dealing with the aftermath of a big release, with nothing major planned for the near future because I expect all of us to be dealing with all the other things that you need to do with a big release.

The last year has seen us grow by over 40% in terms of manpower, and the flexibility of having so many great people working here, who can push the product in so many directions at once, is intoxicating. I have been dealing with a lot of retrospectives recently as we have been completing RavenDB 4.0 and it amazed me just how much was accomplished and how many irons we still have in the fire. So let’s talk about the big plans for 2018, shall we?

Additional storage types

In 4.0, we have JSON documents and binary attachments that you can add to a document. One of our goals in 2018 is to add two or three additional options, turning RavenDB from a document database to a true multi paradigm database. In particular, we want to add:

  • Distributed counters
  • Time series
  • Graph operations

These are all going to be living together with documents, so you can have a user’s document with a FitBit and a heart rate time series on that document that updates every 5 seconds. Or you can have a post document in a blog and use a counter to track how many likes it has gotten. And I don’t believe that I need to explain about graph operations. We want to allow you to define connections between documents and query them directly.

The idea here is that we got documents, but they aren’t always the best tool for the job, so we want to offer you the option to do that in a way that is optimized, fast, easy and convenient to use.

Better integration

We have already done a lot of work around working with additional services and environments; we just need to polish and expose that. This means things like being able to get a PouchDB instance that is running in your browser and have it sync automatically and securely to your RavenDB cluster. Or being able to point RavenDB at an instance of a relational database and have it suck in all the data, build the document model and save you a lot of work on migrating to a document database.

Ever faster

Performance is addictive, and it has caught us. We are now orders of magnitude faster than ever before. We have actually been scaling down our production servers intentionally to be able to see if we can find more bottlenecks in the real world, so far we went down CPU by half and memory to one quarter and we are still seeing faster response times and better latencies. That said, we can do better, and we are planning to.

What I’m looking forward to are things like Span<T> and Memory<T>, which would really reduce overheads in some key scenarios. We are also eagerly awaiting the arrival of SIMD intrinsics in the CoreCLR and already have some code paths that are going to be heavily optimized as a result. Early results show something like 20% – 40% improvement, but we’ll probably be able to get more over time. One of the reasons I’m so excited about the release is that people get to actually use the software and see how much it improved, but also give us feedback on the things that can be made even faster.

Until we have actual people using us in production over a long period of time, it is hard to avoid optimizing in the dark, and that never gives a good ROI.

Community

Everything before was mostly technical things: features that are upcoming and new things that you get to do with RavenDB. We are also going to invest heavily in getting the word out, showing up at conferences and user groups. We are also scheduling a lot of workshops around the globe to teach RavenDB 4.0. The first round is already available here.

There is also the new community license for RavenDB, allowing you to go to production without needing to purchase a commercial license. This should reduce the barrier for adoption and we hope to see a lot of new users starting to come to RavenDB. We are now free to use, running on multiple operating systems and available on the most commonly used platforms. And we are easy to get right, although it was anything but easy to get there. A good example of that is the setup video.

All in all, 2017 has been a major year for us, both in terms of growth on any parameter we track and the culmination of years of effort that led us to the release of RavenDB and seeing the new version take its first steps in the first days of 2018.

Happy new year, everyone.

Production postmortem: The random high CPU

time to read 2 min | 253 words

A customer complained that every now and then RavenDB is hitting 100% CPU and stays there. They were kind enough to provide a minidump, and I started the investigation.

I loaded the minidump into WinDbg and started debugging. The first thing you do with high CPU is run the “!runaway” command, which sorts the threads by how busy they are:

image

I switched to the first thread (39) and asked for its stack, I highlighted the interesting parts:

image

This is enough to have a strong suspicion on what is going on. I checked some of the other high CPU threads and my suspicion was confirmed, but even from this single stack trace it is enough.

Pretty much whenever you see a thread doing high CPU within the Dictionary class it means that you are accessing it in a concurrent manner. This is unsafe, and may lead to strange effects. One of them being an infinite loop.

In this case, several threads were caught in this infinite loop. The stack trace also told us where in RavenDB we are doing this, and from there we could confirm that indeed, there is a rare set of circumstances that can cause a timer to fire fast enough that the previous timer didn’t have a chance to complete, and both of these timers will modify the same dictionary, causing the issue.
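The post does not say which fix was chosen. One option for this kind of bug is to make sure a timer callback never overlaps with its previous run, sketched below in Python. Python dictionaries do not fail the same way the .NET Dictionary does, so this only illustrates the guard itself; `NonOverlappingTask` is a hypothetical name.

```python
import threading


class NonOverlappingTask:
    """Wrap a timer callback so a new firing is skipped while the previous one runs."""

    def __init__(self, work):
        self._work = work
        self._running = threading.Lock()

    def fire(self) -> bool:
        """Run the callback unless it is already running; report whether it ran."""
        # Try to take the lock without waiting; failure means the last run is still active.
        if not self._running.acquire(blocking=False):
            return False
        try:
            self._work()
            return True
        finally:
            self._running.release()


shared_state = {}


def update_state():
    # Only one firing at a time ever reaches this mutation.
    shared_state["runs"] = shared_state.get("runs", 0) + 1


task = NonOverlappingTask(update_state)
print(task.fire())   # True, nothing else was running
print(shared_state)  # {'runs': 1}
```

The alternatives are to protect the dictionary itself with a lock or to use a collection designed for concurrent access; skipping the overlapping run is simply the smallest change when a missed tick is harmless.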

RavenDB Setup: How the automatic setup works

time to read 8 min | 1456 words

One of the coolest features in the RC2 release for RavenDB is the automatic setup, in particular, how we managed to get a completely automated secured setup with a minimal amount of fuss on the user’s end.

You can watch the whole thing from start to finish, it takes about 3 minutes to go through the process (if you aren’t also explaining what you are doing) and you have a fully secured cluster talking to each other over secured TLS 1.2 channels.  This was made harder because we are actually running with trusted certificates. This was a hard requirement, because we use the RavenDB Studio to manage the server, and that is a web application hosted on RavenDB itself. As such, it is subject to all the usual rules of browser based applications, including scary warnings and inability to act if the certificate isn’t valid and trusted.

In many cases, this led people to choose to use HTTP, because at least with that model, you don’t have to deal with all the hassle. Consider the problem. Unlike a website, which has (at least conceptually) a single deployment, RavenDB is actually deployed on customer sites and is running on anything from local developer machines to cloud servers. In many cases, it is hidden behind multiple layers of firewalls, routers and internal networks. Users may choose to run it in any number of strange and wonderful configurations, and it is our job to support all of them.

In such a situation, defaulting to HTTP only makes things easy, mostly because things just work. Using HTTPS requires that we use a certificate. We can obviously use a self-signed certificate, and have the following shown to the user on the first access to the website:

image

As you can imagine, this is not going to inspire confidence with users. In fact, I can think of few other ways to ensure the shortest “download to recycle bin” path. Now, we could ask the administrator to generate a certificate and ensure that this certificate is trusted. And that would work, if we could assume that there is an administrator. I think that asking a developer who isn’t well versed in security practices to do that is likely to result in an even shorter “this is a waste of my time” reaction than the unsecured warning option.

We considered the option of installing a (locally generated) root certificate and generating a certificate from that. This would work, but only on the local machine, and RavenDB is, by nature, a distributed database. So that would make for a great demo, but it would cause a great deal of hardships down the line. Exactly the kind of feature and behavior that we don’t want. And even if we generate the root certificate locally and throw it away immediately afterward, the idea still bothered me greatly, so that was something that we considered only in times of great depression.

So, to sum it all up, we need a way to generate a valid certificate for a random server, likely running in a protected network, inaccessible from the outside (as in, pretty much all corporate / home networks these days). We need to do so without requiring the user to do things like set up dynamic DNS, configure port forwarding in the router or generate their own certificates. We also need it to be fast enough that we can do that as part of the setup process. Anything that would require a few hours / days is out of the question.

We looked into what it would take to generate our own trusted SSL certificates. This is actually easily possible, but the cost is prohibitive, given that we wanted to allow this for free users as well, and all the options we got always had a per generated certificate cost associated with it.

Let’s Encrypt is the answer for HTTPS certificate generation on the public web, but the vast majority of our deployments are likely to be inside the firewall, so we can’t verify a certificate using Let’s Encrypt. Furthermore, doing so will require users to define and manage DNS settings as part of the deployment of RavenDB. That is something that we wanted to avoid.

This might require some explanation. The setup process that I’m talking about is not just to setup a production instance. We consider any installation of RavenDB to be worth a production grade setup. This is a lesson from the database ransomware tales. I see no reason why we should learn this lesson again on the backs of our users, so a high priority was given to making sure that the default install mode is also the secure and proper one.

All the options that are ruled out in this post (provide your own certificate, set up DNS, etc.) are entirely possible (and quite easy) with RavenDB, if an admin so chooses, and we expect that many will want to set up RavenDB in a manner that fits their organization’s policies. But here we are talking about the base line (yes, dear) install and we want to make it as simple and straightforward as we possibly can.


There is another problem with Let’s Encrypt for our situation, we need to generate a lot of certificates, significantly more than the default rate limit that Let’s Encrypt provides. Luckily, they provide a way to request an extension to this rate limit, which is exactly what we did. Once this was granted, we were almost there.

The way RavenDB generates certificates as part of the setup process is a bit involved. We can’t just generate any old hostname, we need to provide proof to Let’s Encrypt that we own the hostname in question. For that matter, who is the we in question? I don’t want to be exposed to all the certificates that are generated for the RavenDB instances out there. That is not a good way to handle security.

The key for the whole operation is the following domain name: dbs.local.ravendb.net

During setup, the user will register a subdomain under that, such as arava.dbs.local.ravendb.net. We ensure that only a single user can claim each domain. Once they have done that, they let RavenDB know what IP address they want to run on. This can be a public IP, exposed on the internet, a private one (such as 192.168.0.28) or even a loopback device (127.0.0.1).

The local server, running on the user’s machine, then initiates a challenge to Let’s Encrypt for the hostname in question. With the answer to the challenge, the local server then calls api.ravendb.net. This is our own service, running in the cloud. The purpose of this service is to validate that the user “owns” the domain in question and to update the DNS records to match the Let’s Encrypt challenge.

The local server can then go to Let’s Encrypt and ask them to complete the process and generate the certificate for the server. At no point do we need to have the certificate go through our own servers, it is all handled on the client machine. There is another thing that is happening here. Alongside the DNS challenge, we also update the domain the user chose to point to the IP they are going to be hosted at. This means that the global DNS network will point to your database. This is important, because we need the hostname that you’ll use to talk to RavenDB to match the hostname on the certificate.
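The order of the steps can be sketched as a small in-memory simulation in Python. Nothing here talks to Let’s Encrypt or to api.ravendb.net: `DomainRegistry`, `setup_node` and the two callbacks standing in for the certificate authority are hypothetical names, and the record layout is an assumption. The point is only to show who does what, and that the certificate itself never passes through the registry service.

```python
class DomainRegistry:
    """In-memory stand-in for the service that owns the DNS zone (hypothetical API)."""

    def __init__(self, zone: str):
        self.zone = zone
        self.owners = {}       # hostname -> user who claimed it
        self.txt_records = {}  # hostname -> challenge answer
        self.a_records = {}    # hostname -> IP address

    def claim(self, user: str, subdomain: str) -> str:
        """Reserve a subdomain; only a single user can claim each one."""
        hostname = f"{subdomain}.{self.zone}"
        if self.owners.setdefault(hostname, user) != user:
            raise PermissionError(f"{hostname} is already claimed")
        return hostname

    def publish(self, user: str, hostname: str, challenge_answer: str, ip: str) -> None:
        """Publish the challenge answer and point the hostname at the node's IP."""
        if self.owners.get(hostname) != user:
            raise PermissionError(f"{user} does not own {hostname}")
        self.txt_records[hostname] = challenge_answer
        self.a_records[hostname] = ip


def setup_node(registry, user, subdomain, ip, request_challenge, finalize):
    """Run the setup steps in order, entirely from the local machine."""
    hostname = registry.claim(user, subdomain)
    answer = request_challenge(hostname)          # stand-in for the CA issuing a challenge
    registry.publish(user, hostname, answer, ip)  # DNS now proves ownership and points at the node
    return finalize(hostname, registry.txt_records[hostname])  # the CA checks DNS, then issues


registry = DomainRegistry("dbs.local.ravendb.net")
issued = setup_node(
    registry, "alice", "arava", "192.168.0.28",
    request_challenge=lambda host: f"token-for-{host}",
    finalize=lambda host, proof: f"certificate for {host}",
)
print(issued)
print(registry.a_records)
```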

Obviously, RavenDB will also make sure to refresh the Let’s Encrypt certificate on a timely basis.

The entire process is seamless and quite amazing when you see it. Especially because even developers might not realize just how much goes on under the cover and how much pain was taken away from them.

We ran into a few issues along the way and Let’s Encrypt support has been quite wonderful in this regard, including deploying a code fix that allowed us to make the time for RC2 with the full feature in place.

There are still issues if you are running on a completely isolated network, and some DNS configurations can cause issues, but we typically detect and give a good warning about that (allowing you to switch to 8.8.8.8 as a good workaround for most such issues). The important thing is that we achieve the main goal, seamless and easy setup with the highest level of security.
