Ayende @ Rahien

Hi!
My name is Oren Eini
Founder of Hibernating Rhinos LTD and RavenDB.
You can reach me by email or phone:

ayende@ayende.com

+972 52-548-6969

, @ Q c

Posts: 6,545 | Comments: 48,085

filter by tags archive

When what you need is a square triangle…

time to read 2 min | 331 words

imageRecently I had a need to test our SSL/TLS infrastructure. RavenDB is making heavy use of SSL/TLS and I was trying to see if I could get it to do the wrong thing. In order to do that, I needed to make strange TLS connections. In particular, the kind of things that would violate the spec and hopefully cause the server to behave in a Bad Way.

The problem is that in order to do that, beyond the first couple of messages, you need to handle the whole encryption portion of the TLS stack. That is not fun. I asked around, and it looks like the best way to do that is to start with an existing codebase and break it. For example, get OpenSSL and modify it to generate sort of the right response. But getting OpenSSL compiling and working is a non trivial task, especially because I’m not familiar with the codebase and it is complex.

I tried that for a while, and then despaired and looked at the abyss. And the abyss looked back, and there was JavaScript. More to the point, there was this projectForge is a Node.JS implementation of TLS from the ground up. I’m not sure why it exists, but it does. It is also quite easy to follow up and read the code, and it meant that I could spend 10 minutes and have a debuggable TLS implementation that I could hack without any issues in VS Code. I liked that.

I also heard good things about the Go TLS client in this regard, but this was easier.

This was a good reminder for me to look further than the obvious solution. Oh, and I tested things and it my hunch that the server would die was false, it handled things properly, so that was a whole lot of excitement over nothing.

Building a Let’s Encrypt ACME V2 client

time to read 2 min | 301 words

The Let’s Encrypt ACME v2 staging endpoint is live, with planned release date of February 27. This is a welcome event, primarily because it is going to bring wild card certificates support to Let’s Encrypt.

That is something that is quite interesting for us, so I sat down and built an ACME v2 client for C#. You can find the C# ACME v2 Let’s Encrypt client here, you’ll note that this is a gist containing a single file and indeed, this is all you need, with the only dependency being JSON.Net.

Here is how you use this code for certificate generation:

Note that the code itself is geared toward our own use case (generating the certs as part of a setup process) and it only handles DNS challenges. This is partly why it is not a project but a gist, because it is just to share what we have done with others, not to take it upon ourselves to build a full blown client.

I have to say that I like the V2 protocol much better, is seems much more intuitive to use and easier to work with. I particularly liked the fact that I can resume working on an order after a while, which means that failure modes such as failing to propagate a DNS update can now be much more easily recoverable. It also means that trying to run the same order twice for some reason doesn’t generate a new order, but resume the existing one, which is quite nice, given the rate limits imposed by Let’s Encrypt.

Note that the code is also making assumptions, such as caching details for you behind the scenes and not bother with other parts of the API that are not important for our needs (modifying an account or revoking a certificate).

There is a Docker in your assumptions

time to read 2 min | 317 words

imageAbout a decade ago I remember getting a call from a customer that was very upset with RavenDB. They just deployed to production a brand new system, they were ready for high load so they wen with a 32 GB and 16 cores system (which was a lot at the time).

The gist of the issue, RavenDB was using 15% of the CPU and about 3 GB on RAM to serve requests. When I inquired further about how fast it was doing that I got a grudging “a millisecond or three, not more”. I ended the call wondering if I should add a thread that would do nothing but allocate memory and compute primes. That was a long time ago, since the idea of having a thread do crypto mining didn’t occur to me Smile.

This is a funny story, but it shows a real problem. Users really want you to be able to make full use of their system, and one of the design goals for RavenDB has been to do just that. This means making use of as much memory as we can and as much CPU as we need. We did that with an eye toward common production machines, with many GB of memory and cores to spare.

And then came Docker, and suddenly it was the age of the 512MB machine with a single core all over again. That cause… issues for us. In particular, our usual configuration is meant for a much stronger machine, so we now need to also deploy separate configuration for lower end machines. Luckily for us, we were always planning on running on low end hardware, for POS and embedded scenarios, but it is funny to see the resurgence of the small machine in production again.

When letting the user put the system into an invalid state is a desirable property

time to read 2 min | 381 words

imageA bug report was opened against RavenDB by one of our developers: “We should prevent error if the user told us to use a URL whose hostname isn’t matching the certificate provided.”

The intent is clear, we have a setup in which we have a certificate, and the only valid URLs in this case are hostnames that are in the certificate. If the user configured the system so we’ll be listening on: https://my-rvn-one but the certificate has hostname “their-rvn-two”, then we know that this is going to cause issues for clients. They will try to connect but fail because of certificate validation. The hostname in the URL and the hostnames in the certificate don’t match, therefor, an error.

I closed this bug as Won’t Fix, and that deserve an explanation. I usually care very deeply about the kind of errors that we generate, and we want to catch things as early as possible.

But sometimes, that is the worst possible thing. By preventing the user from doing the “wrong” thing, we also prevent it from doing something that is required if you got yourself into a bad state.

Consider the following case, a node is down, and we provisioned another one. We got a different IP, but we need to update the DNS record. That is going to take 24 hours to propagate properly, but we need to be up now. So I change the system configuration to use a different URL, but I can’t get a certificate for the new one yet for whatever reason. Now the validation kicks in, and I’m dead in the water. I might just want to be able to peek into the system, or configure the clients to ignore the certificate error, or something.

In this case, putting the system into an invalid state (such as mismatch between hostname and certificate) is desirable. An admin may want to do this for a host of reasons, mostly because they are under the gun and need things to work. There are surprisingly a large number of such cases, where you know that the situation is invalid, but you allow it because not doing so will lead to blocking off important scenarios.

Distributed computing fallacies: There is one administrator

time to read 4 min | 624 words

imageActually I would state it better as “there is any administrator / administration”.

This fallacy is meant to refer to different administrators defining conflicting policies, at least according to wikipedia. But our experience has shown that the problem is much more worse. It isn’t that you have two companies defining policies that end up at odds with one another. It is that even in the same organization, you have very different areas of expertise and responsibility, and troubleshooting any problem in an organization of a significant size is a hard matter.

Sure, you have the firewall admin which is usually the same as the network admin or at least has that one’s number. But what if the problem is actually in an LDAP proxy making a cross forest query to a remote domain across DMZ and VPN lines, with the actual problem being a DNS change that hasn’t been propagated on the internal network because security policies require DNSSEC verification but the update was made without one.

Figure that one out, with the only error being “I can’t authenticate to your software, therefor the blame is yours” can be a true pain. Especially because the person you are dealing with has likely no relation to the people who have a clue about the problem and is often not even able to run the tools you need to diagnose the issue because they don’t have permissions to do so (“no, you may not sniff our production networks and send that data to an outside party” is a security policy that I support, even if it make my life harder).

So you have an issue, and you need to figure out where it is happening. And in the scenario above, even if you managed to figure out what the actual problem was (which will require multiple server hoping and crossing of security zones) you’ll realize that you need the key used for the DNSSEC, which is at the hangs of yet another admin (most probably at vacation right now).

And when you fix that you’ll find that you now need to flush the DNS caches of multiple DNS proxies and local servers, all of which require admin privileges by a bunch of people.

So no, you can’t assume one (competent) administrator. You have to assume that your software will run in the most brutal of production environments and that users will actively flip all sort of switches and see if you break, just because of Murphy.

What you can do, however, is to design your software accordingly. That means reducing to a minimum the number of external dependencies that you have and controlling what you are doing. It means that as part of the design of the software, you try to consider failure points and see how much of everything you can either own or be sure that you’ll have the facilities in place to inspect and validate with as few parties involved in the process.

A real world example of such a design decision was to drop any and all authentication schemes in RavenDB except for X509 client certificates. In addition to the high level of security and trust that they bring, there is also another really important aspect. It is possible to validate them and operate on them locally without requiring any special privileges on the machine or the network. A certificate can be validate easy, using common tools available on all operating systems.

The scenario above, we had similar ones on a pretty regular basis. And I got really tired of trying to debug someone’s implementation of active directory deployment and all the ways it can cause mishaps.

The married couple component design pattern

time to read 3 min | 528 words

imageOne of the tough problems in distributed programming is how to deal with a component that is made up of multiple nodes. Consider a reservation service that is made up of a few nodes that needs to ensure that regardless of which node you’re talking to, if you made a reservation, you’ll have a spot. There are a lot of ways to solve this from the inside, but that isn’t what I want to talk about right now. I want to talk about the overall approach to modeling such systems.

Instead of focusing on how you would implement such a system, consider this to be an internal problem for this particular component. A good parallel for this problem is making plans with a couple for a meetup. You might be talking to both of them or just one, but you don’t care. The person you are talking to is the one that is going to giving you a decision that is valid for the couple.

How they do that is not relevant. It can be that one of them is in charge of the social calendar or that they flip based on the day of the week or whoever got out of bed first this morning or whatever his mother called her dog last year or… you don’t care. Furthermore, you probably don’t want to know. That is an internal problem and sticking your nose into the internal decision making is a Bad Idea that may lead to someone sleeping on your couch for an indefinite period of time.

But, and this is important, you can walk to either one of them and they will make such a decision. It may be something in the order of “Let me talk to my significant other and I’ll get back to you by tomorrow” or it can be “Sure, how about 8 PM on Sunday?” or even “I don’t like you, so nope, nope nope” but you’ll get some sort of an answer.

Taking this back to the distributed components design, that kind of decision in internal to the component and the mechanics of this is handled internally shouldn’t be exposed outside. Let’s take a look on why this is the case.

Starting out, we run all such decisions as a consensus that required a majority of the nodes. But a couple of nodes went down and took down the system in a bad way, so the next iteration we had moved to reserving some spots for each node that they own and can hand off to others on their own, without consulting any other nodes.  This sort of change shouldn’t matters to callers of this component, but it is very common to have outside parties take notice of how you are doing things and take a dependency on that.

The main reason I call it the married couple design problem is that it should immediately cause you to consider how you should stay away from the decision making there. Of course, if you don’t, I’m going to call your design the Mother In Law Architecture.

Reducing the cost of support with anticipatory errors

time to read 2 min | 262 words

I speak a lot about good error handlings and our perspective on support. We consider support to be a cost center (as in, we don’t want or expect to make money from support), I spoke at this at more length here. Today I run into what is probably the best example for exactly what this means in a long while.

A user got an error:

image

The error is confusing, because they are able to access this machine and URL. The actual issue, if you open the show details is here:

Connection test failed: An exception was thrown while trying to connect to http://zzzzzzzz.com:8081 : System.Net.Internals.SocketExceptionFactory+ExtendedSocketException (0x80004005): A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond 143.29.63.128:53329

And that is good enough to explain to me, what is going on. RavenDB is usually using HTTP for most things, but it is using TCP connections to handle performance critical stuff, such as clustering, replication, etc. In this case, the server is listening on port 53329 for TCP connections, but since this is a public facing instance, the port is not accessible from the outside world.

This issue has generated a support call, but a better message, explaining that we could hit the HTTP endpoint and not the TCP endpoint would have led the user knowing exactly what the issue is and solving the problem on their own, much faster.

Queries++ in RavenDBFacets of information

time to read 3 min | 536 words

image

RavenDB has a lot of functionality that is available just underneath the surface. In addition to just finding documents, you can use RavenDB to find a lot more about what is going on in your database. This series of posts is aimed at exposing some of the more fun things that you can do with RavenDB that you are probably not aware of.

One of the those things is the idea of not just querying for information, but also querying for the facets of the results. This can be very useful if you are likely to search for something that would return a lot of results and you want to quickly filter these out without having the user do a lot of trial an error. This is one of those cases where it is much easier to explain what is going on with a picture.

Imagine that you are searching for a phone. You might have a good idea what you are looking for a phone on eBay. I just did that and it gave me over 300 thousands results. The problem is that if I actually want to buy one of them, I’m not going to scroll through however many pages of product listings. I need a way to quickly narrow down the selection, and facets allow me to do that, as you can see in the image. Each of these is a facet and I can filter out things so only the stuff that I’m interested in will be shown, allowing me to quickly make a decision (and purchase).

Using the sample dataset in RavenDB, we’ll explore how we can run faceted searches in RavenDB. First, we’ll define the “Products/Search” index:

Using this index, we can now ask RavenDB to give us the facets from this dataset, like so:

image

This will give us the following results:

image

And we can inspect each of them in turn:

image     image

These are easy, because they give us the count of matching products for each category and supplier. Of more interest to us is the Prices facet.

image

And here we can see how we sliced and diced the results. We can narrow things further with the user’s choices, of course, let’s check out this query:

image

Which gives us the following Prices facet:

image

This means that you can, in a very short order, produce really cool search behavior for your users.

Time handling and user experience, Backup Scheduling in RavenDB as a case study

time to read 3 min | 597 words

Time handling in software sucks. In one of the RavenDB conferences a few years ago we had a fantastic talk for over an hour that talked about just that.  It sucks because what a computer think about as time and what we think about as time are two very different things. This usually applies to applications, since that is where you are typically working with dates & times in front of the users, but we had an interesting case with backups scheduling inside RavenDB.

RavenDB allow you to schedule full and incremental backups and it used the CRON format to set things up. This make things very easy to setup and is highly configurable.

It is also very confusing. Consider the following initial user interface:

image

It’s functional, does exactly what it needs to do and allow the administrator complete freedom. It is also pretty much opaque, requiring the admin to either know the CRON format by heart (possible, but not something that we want to rely on) or find a webpage that would translate that.

The next thing we did was to avoid the extra step and let you know explicitly what this definition meant.

image

This is much better, but we can still do better. Instead of just an abstract description, let us let the user know when the next backup is going to happen. If you run backups each Friday, you probably want to change that to the before or after Black Friday, for example. So date & time matter.

image

This lead us to the next issue, what time? In particular, backups are done on the server’s local time, on the assumption that most of the time this is what the administrator will expects. This make it easier to do things like schedule backup to happen in the off hours. We thought about doing that always in UTC, but this would require you to always do date math in your head.

That does lead to the issue of what to do when the admin’s clock and the server clock are out of sync? This is how this will look like in that case.


image

We let the user know that the backup will run in the local server time and when that will happen in the user’s time.

We also provide on the fly translation from CRON format to a human readable form.

image

And finally, to make sure that we cover all the basis, in addition to giving you the time specification, the server time and local time, we also give you the time duration to the next backup.

image

I think that this covers up pretty much every scenario that I can think of.

Except getting the administrator to do practice restores to ensure that they are familiar with how to do this. Smile

Update: Here is how the field looks like when empty:

image

FUTURE POSTS

No future posts left, oh my!

RECENT SERIES

  1. Production Test Run (3):
    18 Jan 2018 - When your software is configured by a monkey
  2. Reminder (8):
    09 Jan 2018 - Early bird pricing for RavenDB workshops about to close
  3. Book Recommendation (2):
    08 Jan 2018 - Serious Cryptography
  4. Talk (3):
    03 Jan 2018 - Modeling in a Non Relational World
  5. Invisible race conditions (3):
    02 Jan 2018 - The cache has poisoned us
View all series

RECENT COMMENTS

Syndication

Main feed Feed Stats
Comments feed   Comments Feed Stats