Ayende @ Rahien

Oren Eini, aka Ayende Rahien, is the CEO of Hibernating Rhinos LTD, which develops RavenDB, a NoSQL Open Source Document Database.

You can reach me by:

oren@ravendb.net

+972 52-548-6969

time to read 3 min | 434 words

RavenDB makes extensive use of certificates for authentication and encryption. They allow us to safely communicate between distributed instances without worrying about a man in the middle or eavesdroppers. Given the choices we had to implement authentication, I’m really happy with the results of choosing certificates as the foundation of our authentication infrastructure.

It would be too good, however, to expect to have no issues with certificates. The topic of this post is a puzzler. A user chose to use a self-signed certificate for the nodes in the cluster, but was unable to authenticate between the servers unless they registered the certificate in the OS certificate store.

That sounds reasonable, right? If this is a self-signed certificate, we obviously don’t trust it, so we need this extra step to ensure that we do. However, we designed RavenDB specifically to avoid this step. If you are using a self-signed certificate, the server will trust its own certificate, and thus will trust anyone that is using the same certificate.

In this case, however, that wasn’t happening. For some reason, the code path that we use to ensure that we trust our own certificate was not being activated, and that was a puzzler indeed.

One of the things that RavenDB does on first startup is to try to connect to itself as a client. It checks whether that was successful. If not, we’ll try again, ignoring the registered root CAs. If we are successful at that point, we know what the issue is and ensure that we ignore the untrusted signer on the certificate. We only enable this code path if, by default, we don’t trust our own certificate.
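
That self-check can be sketched as follows. This is a minimal illustration of the decision logic, not the actual RavenDB code; `connect_verified` and `connect_ignoring_cas` are hypothetical stand-ins for the real client calls:

```python
def should_ignore_untrusted_signer(connect_verified, connect_ignoring_cas):
    """Decide whether to install the 'trust our own certificate' hook.

    connect_verified: connects to ourselves with normal certificate
        validation, returns True on success.
    connect_ignoring_cas: connects while ignoring the registered root
        CAs, returns True on success.
    """
    if connect_verified():
        # The OS already trusts our certificate; no hook needed.
        return False
    if connect_ignoring_cas():
        # The only problem is the untrusted signer, so we can safely
        # ignore it for our own certificate.
        return True
    # Something else entirely is wrong (for example, a DNS problem);
    # we never get to install the hook.
    return False
```

Note the last branch: if even the relaxed connection fails, the hook is never installed, which is exactly the failure mode in this post.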

Looking at the logs, we could see that we got a failure when talking to ourselves, some sort of a “device not ready” issue. That was strange. We attached strace to look into what was going on, but nothing was wrong at the syscall level. Then we dug deeper and realized that the server was configured to use: https://ravendb-1.francecentral.cloudapp.azure.com/ but was actually hosted on https://ravendb-1-tst.francecentral.cloudapp.azure.com/

Do you see the difference?

The server was trying to contact itself using the configured hostname. It failed because of the DNS mismatch, so it couldn’t contact itself to figure out that the certificate was untrusted. At that point, it didn’t install the hook and wouldn’t trust the self-signed certificate.

So an issue that started with investigating why nodes in the cluster don’t trust each other’s self-signed certificate got resolved by finding a simple configuration error.

time to read 2 min | 281 words

Subscriptions in RavenDB give you a great way to handle backend business processing. You can register a query and get notified whenever a document that matches your query is changed. This works if the document actually exists, but what happens if you want to handle a business process relating to a document’s deletion?

I want to explicitly call out that I’m generally against deletion. There are very few business cases for it. But sometimes you have to (GDPR comes to mind), or you have an actual business reason for it.

A key property of deletion is that the data is gone, so how can you process deletions? A subscription will let you know when a document changes, but not when it is gone. Luckily, there is a nice way to handle this. First, you need to enable revisions on the collection in question, like so:

[Image: the revisions configuration for the collection, in the Studio]
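
The configuration shown in the Studio boils down to something like the following revisions configuration. This is a sketch from memory of the JSON shape; the exact property names and values here are assumptions:

```json
{
    "Collections": {
        "Employees": {
            "Disabled": false,
            "PurgeOnDelete": false,
            "MinimumRevisionsToKeep": 5
        }
    }
}
```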

At this point, RavenDB will create revisions for all changed documents, and a revision is created for deletions as well. You can use the Revisions Bin in the Studio to track deleted documents.

[Image: the Revisions Bin in the Studio, listing deleted documents]

But how does this work with Subscriptions? If you try to run a subscription query at this point, you won’t find the deleted employee. For that, you have to use a versioned subscription, like so:

[Image: creating a versioned subscription]

And now you can subscribe to get notified whenever an employee is deleted.
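
For reference, a subscription over revisions is created with a query along these lines (hedged: this is from memory of the RavenDB 4.x syntax and may differ in detail):

```
from Employees (Revisions = true)
```

Each item the subscription delivers then carries the previous and current versions of the document; for a deletion, the current version is null, which is how you detect that the employee is gone.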

time to read 2 min | 388 words

I recently had what amounted to a drive-by code review. I was looking at code that wasn’t committed or in a PR, code that might not even have been saved to disk at the time I saw it. I saw it while working with the developer on something completely different. And yet even a glance was enough to make me pause and ensure that this code would be significantly changed before it ever moved forward. The code in question is here:

What is bad about this code? No, it isn’t the missing ConfigureAwait(false); in that scenario we don’t need it. The problem is in the very first line of code.

This is meant to be public API. It will have consumers from outside our team. That means that the very first thing that we need to ensure is that we don’t expose our own domain model to the outside world.

There are multiple reasons for this. To start with, versioning is a concern. Sure, we have the /v1/ in the route, but there is nothing here that would break the build if we changed our domain model in a way that a third-party client relies on. We have a compiler; we really want to be able to use it.

The second issue, which I consider more important, is that this leaks information that I may not really want to share. By exposing my full domain model to the outside world, I risk quite a bit. For example, I may have internal notes on the support ticket which I don’t want to expose to the public. Any field that I expose to the outside world is a compatibility concern, but any field that I add is a problem as well. This is especially true if I assume that those fields are private.

The fix is something like this:

Note that I have a class that explicitly defines the shape that I’m giving to the outside world. I also manually map between the internal and external fields. Using something like AutoMapper is not something that I want, because I want all of those decisions to be made explicitly. In particular, I want to be sure that every single field that I share with the outside world is done in such a way that it is visible during PR reviews.
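
The shape of the fix, sketched in Python (the original code is C#; the ticket fields and names here are invented for illustration):

```python
from dataclasses import dataclass

@dataclass
class SupportTicket:           # internal domain model
    id: str
    title: str
    internal_notes: str        # must never leave the server

@dataclass
class SupportTicketDto:        # the public shape of the API
    id: str
    title: str

def to_public(ticket: SupportTicket) -> SupportTicketDto:
    # Explicit, field-by-field mapping: adding a field to the domain
    # model does not silently expose it, and every field that *is*
    # exposed shows up plainly in a PR review.
    return SupportTicketDto(id=ticket.id, title=ticket.title)
```

The point is the deliberate friction: sharing a new field with the outside world requires touching both the DTO and the mapping.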

time to read 2 min | 260 words

These are not the droids you are looking for! – Obi-Wan Kenobi

Sometimes you need to find a set of documents not because of their own properties, but based on a related document. A good example may be needing to find all employees that drive a blue Nissan. Here is the actual model:

[Image: the Employee and Car document model]

In SQL, we’ll want a query that goes like this:

This is something that you cannot express directly in a plain RQL query. Luckily, you aren’t going to be stuck; RavenDB has a couple of options for this. The first, and the most closely related to the SQL option, is to use a graph query. That is how you will typically query over relationships in RavenDB. Here is what this looks like:

Of course, if you have a lot of matches here, you will probably want to do things in a more efficient manner. RavenDB allows you to do so using indexes. Here is what the index looks like:

The advantage here is that you can now query on the index in a very simple manner:
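
Assuming an index that uses LoadDocument to pull the related Car’s properties into each employee entry, the query might look roughly like this (the index and field names are made up for illustration):

```
from index 'Employees/ByCarColorAndMake'
where Color = 'Blue' and Make = 'Nissan'
```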

RavenDB will ensure that you get the right results, and changing the Car’s color will automatically update the index’s value.

The choice between these two comes down to frequency of change and how large the work is expected to be. The index favors more upfront work for faster query times while the graph query option is more flexible but requires RavenDB to do more on each query.

time to read 2 min | 254 words

We run a lot of benchmarks internally, and sometimes it feels like there is a roaming band of performance-focused optimizers that go through the office and try to find underutilized machines. Some people mine bitcoin for fun; in our office, we benchmark RavenDB and try to see if we can either break a record or break RavenDB.

Recently a new machine was… repurposed to serve as a benchmarking server. You could call it a rite of passage for most new machines here. The problem with that machine was that the client would error out, and it would do so at exactly the same interval. We tested from multiple clients and from multiple machines and found that every 30 minutes, on the dot, we’d have an outage that lasted under one second.

Today I came to the office to news that the problem had been found:

[Image: the power management setting responsible for turning off the network adapter]

It seems that after 30 minutes of idle time (no user logged in), the machine would turn off the Ethernet, regardless of whether there were active connections. Shortly afterward it would be woken up, of course, but it would be down just long enough for us to notice.

In fact, I’m really happy that we got an error. I would hate to try to figure out latency spikes because of something like this, and I still don’t know how the team found the root cause.

time to read 3 min | 568 words

Compression is a nice way to trade time for space. Sometimes that is desirable, especially as you get to the higher tiers of data storage. If your data is mostly archived, you can get significant storage savings in trade for a bit more CPU. This perfectly reasonable desire creates somewhat of a problem for RavenDB; we have competing needs here. On the one hand, you want to compress a lot of documents together, to benefit from the duplication between documents. On the other hand, we absolutely must be able to load a single document as fast as possible. That means that just taking 100MB of documents and compressing them in a naïve manner is not going to work, even if it results in a great compression ratio. I have been looking at zstd recently to help solve this issue.

The key feature for zstd is the ability to train the model on some of the data, and then reuse the resulting dictionary to greatly increase the compression ratio.

Here is the overall idea. Given a set of documents (10MB or so) that we want to compress, we’ll train zstd on the first 16 documents and then reuse the resulting dictionary to compress each of the documents individually. I used a set of 52MB of JSON documents as the test data. They represent restaurant reviews, I think, but I intentionally don’t really care about the data.

Raw data: 52.3 MB. Compressing it all with 7z gives us 1.08 MB. But that means that there is no way to access a single document without decompressing the whole thing.

Using zstd with the compression level of 3, I was able to compress the data to 1.8MB in 118 milliseconds. Choosing compression level 100 reduced the size to 1.02MB but took over 6 seconds to run.

Using zstd on each document independently, where each document is under 1.5 KB in size, reduced the total to 6.8 MB. This is without the dictionary, and the compression took 97 milliseconds.

With a dictionary whose size was set to 64 KB, computed from the first 128 documents, I got a total size of 4.9 MB, and it took 115 milliseconds.

I should note that the runtime of the compression is variable enough that I’m pretty much going to call all of them the same.

I decided to try this on a different dataset and ran it over the current senators dataset. The total data size is 563 KB, and compressing it as a single unit would give us 54 KB. Compressing the individual values, on the other hand, gave us 324 KB.

Training zstd on the first 16 documents with a 4 KB dictionary got things down to 105 KB.

I still need to mull over the results, but I find them quite interesting. Using a dictionary will complicate things, because the time to build the dictionary is non-trivial. It can take twice as long to build the dictionary as it does to compress the data. For example, a 4 KB dictionary built from 16 documents takes 233 milliseconds to build, but only 138 milliseconds are needed to compress the 52 MB. It is also possible for the dictionary to make the compression ratio worse, so that is fun.
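
The dictionary idea itself can be demonstrated with Python’s standard-library zlib, which supports preset dictionaries. To be clear, zstd *trains* its dictionary from samples rather than reusing a document, and its API differs; this is only a sketch of the technique, with made-up sample data:

```python
import zlib

# Documents that share a lot of structure, as JSON documents tend to.
docs = [
    ('{"restaurant": "Rhino Grill", "rating": %d, '
     '"review": "the food was excellent, would come again"}' % i).encode()
    for i in range(100)
]

# Use one representative document as the preset dictionary; zstd would
# train a dedicated dictionary from the first N documents instead.
dictionary = docs[0]

def compress(doc: bytes, zdict: bytes = b"") -> bytes:
    co = zlib.compressobj(zdict=zdict) if zdict else zlib.compressobj()
    return co.compress(doc) + co.flush()

def decompress(data: bytes, zdict: bytes = b"") -> bytes:
    do = zlib.decompressobj(zdict=zdict) if zdict else zlib.decompressobj()
    return do.decompress(data)

plain = sum(len(compress(d)) for d in docs)
with_dict = sum(len(compress(d, dictionary)) for d in docs)

# Each document remains individually decompressible, but the shared
# dictionary strips out the per-document redundancy.
assert with_dict < plain
assert decompress(compress(docs[3], dictionary), dictionary) == docs[3]
```

This is exactly the trade-off discussed above: per-document random access is preserved, at the cost of managing a dictionary alongside the data.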

Any other ideas on how we can get both the space savings and the random access option would be greatly appreciated.

time to read 2 min | 311 words

RavenDB has always had optimistic concurrency; I consider it an important feature for building correct distributed and concurrent systems. However, RavenDB doesn’t implement pessimistic locking. At least, not explicitly. It turns out that we have all the components in place to support it. If you want to read more about what pessimistic locking actually is, this Stack Overflow answer has good coverage of the topic.

There are two types of pessimistic locking: offline and online. In the online mode, the database server takes an actual lock when modifying a record. That model works for a conversational pattern with the database, where you open a transaction and hold it open while you mutate the data. In today’s world, where most processing is handled using request / response (REST, RPC, etc.), that kind of interaction is rare. Instead, you’ll typically want to use an offline pessimistic lock. That is, a lock that can live longer than a single transaction. With RavenDB, we build this feature on top of the usual optimistic concurrency, as well as the document expiration feature.

Let’s take the classic example of pessimistic locking. Reserving seats for a show. Once you have selected a seat, you have 15 minutes to complete the order, otherwise the seats will automatically be released. Here is the code to do this:

The key here is that we rely on the @expires feature to remove the seatLock document automatically. We use a well known document id to coordinate concurrent requests that try to get the same seat. The rest is just the usual RavenDB’s optimistic concurrency behavior.
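
This is not RavenDB client code, but the pattern can be sketched in memory. The well-known document id and the 15-minute window are from the post; the store, its methods, and the field names are stand-ins for illustration:

```python
class ConcurrencyError(Exception):
    pass

class Store:
    """A toy document store with optimistic 'create if absent' and
    document expiration, standing in for RavenDB."""
    def __init__(self):
        self.docs = {}  # doc id -> (document, expires_at)

    def try_create(self, doc_id, doc, expires_at, now):
        existing = self.docs.get(doc_id)
        if existing is not None and existing[1] > now:
            # Someone else holds a still-valid lock; this mirrors an
            # optimistic concurrency failure on an existing document.
            raise ConcurrencyError(doc_id)
        # Either there is no lock, or the old one expired (RavenDB's
        # @expires feature deletes it on the server side for us).
        self.docs[doc_id] = (doc, expires_at)

def reserve_seat(store, seat_id, customer, now):
    # The well-known document id coordinates concurrent requests
    # that try to reserve the same seat.
    lock_id = "seatLocks/" + seat_id
    store.try_create(lock_id, {"customer": customer},
                     expires_at=now + 15 * 60, now=now)
```

A second reservation for the same seat fails until the first lock expires, at which point the seat becomes available again with no cleanup code on our side.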

You have 15 minutes before the expiration and then it goes poof. From the point of view of implementing this feature, you’ll spend most of your time writing the edge cases, because from the point of view of RavenDB, there is really not much here at all.

time to read 3 min | 568 words

We are now working on proper modeling scenarios for RavenDB’s time series as part of our release cycle. We are trying to consider as many scenarios as possible and to verify that we have good answers for them. As part of this, we looked at applying time series in RavenDB to problems that customers raised in the past.

The scenario in question is storing data from a traffic camera. The idea is that we have a traffic camera that will report [Time, Car License Number, Speed] for each car that it captures. The camera will report all cars, not just those that are speeding. Obviously, we don’t want to store a document for each and every car registered by the camera. At the same time, we are interested in knowing the speed on the roads over time.

Therefore, we are going to handle this in the following manner:
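
Sketched in Python, the split works like this: every measurement is appended to the camera’s time series, and only speeders get a full ticket document. The speed limit, the names, and the in-memory structures are assumptions for illustration:

```python
def handle_measurement(timeseries, tickets, camera_id, time, plate, speed,
                       speed_limit=90):
    # Every car goes into the camera's time series, keeping the
    # per-entry cost tiny.
    timeseries.setdefault(camera_id, []).append((time, speed))
    # Only the speeders warrant a full document.
    if speed > speed_limit:
        tickets.append({"camera": camera_id, "time": time,
                        "plate": plate, "speed": speed})

ts, tk = {}, []
handle_measurement(ts, tk, "cameras/1", 1000, "12-345-67", 80)
handle_measurement(ts, tk, "cameras/1", 1010, "98-765-43", 120)
assert len(ts["cameras/1"]) == 2  # all cars are recorded
assert len(tk) == 1               # only the speeder got a ticket
```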

This allows us to handle both the ticket issuance and recording the traffic on the road over time. This works great, but it does leave one thing undone. How do I correlate the measurement to the ticket?

In this case, let’s assume that I have some additional information about the measurement that I record in the time series (for example, the confidence level of the camera in its speed report) and that I need to be able to go from the ticket to the actual measurement and vice versa.

The question is how to do this? The whole point of time series is that we are able to compress the data we record significantly. We use about 4 bits per entry, and that is before we apply actual compression here. That means that if we want to be able to use the minimal amount of disk space, we need to consider how to do this.

One way of handling this is to first create the ticket and attach the Ticket’s Id to the measurement. That is where the tag on the entry comes into play. This works, but it isn’t ideal. The idea about the tag on the entry is that we expect there to be a lot of common values. For example, if we have a camera that uses two separate sensors, we’ll use the tag to denote which sensor took the measurement. Or maybe it will use the make & model of the sensor, etc. The set of values for the tag is expected to be small and to highly repeat itself. If the number of tickets issued is very small, of course, we probably wouldn’t mind. But let’s assume that we can’t make that determination.

So we need to correlate the measurement to the ticket, and the simplest way to handle that is to record the time of the measurement in the ticket, as well as which camera generated the report. With this information, you can load the relevant measurement easily enough. But there is one thing to consider. RavenDB’s timestamps use millisecond accuracy, while .NET’s DateTime has 100 nanosecond accuracy. You’ll need to account for that when you store the value.
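
In other words, truncate the timestamp to millisecond precision before storing it on the ticket, so the stored value round-trips against the time series. A Python equivalent of that .NET concern (the function name is mine):

```python
from datetime import datetime

def to_millisecond_precision(dt: datetime) -> datetime:
    # Drop the sub-millisecond part, matching the precision that the
    # time series actually stores.
    return dt.replace(microsecond=(dt.microsecond // 1000) * 1000)

measured = datetime(2020, 2, 21, 8, 30, 15, 123456)
stored = to_millisecond_precision(measured)
assert stored.microsecond == 123000
assert to_millisecond_precision(stored) == stored  # idempotent
```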

With that in place, you can do all sorts of interesting things. For example, consider the following query.

This will allow us to show the ticket as well as the road conditions around the time of the ticket. You can use it to say “but everyone does it”, which I am assured is a valid legal defense strategy.

time to read 2 min | 383 words

I did a code review recently and pretty much the most frequent suggestion was something along the line of: “This needs to be pushed to the infrastructure”. I was asked to be clearer about this, so I decided to write a blog post about it.

In general, whenever you see a repeating code pattern, you don’t need to immediately start extracting it. What you need to do is check whether this code pattern serves a purpose. Only if it doesn’t serve a purpose is it time to see if we can abstract it away to remove duplication. I phrase things in this manner because all too often we see a tendency to immediately want to abstract things out. The truth is that in many cases, trying to abstract things is going to make things less clear down the line. That is why I wanted to call it out first, even as I explain how to do the exact thing that I caution you about. Resource cleanup in performance-sensitive code is a good example of a scenario where you don’t want to push things to the infrastructure; you want everything to be right there. There are other reasons, too.

After all the disclaimers, let’s talk about a concrete example in which we should do something about it.

Error handling is a great case for moving to the infrastructure. This code is running inside an MVC Controller, and we can move our error handling from inside each action to the infrastructure; you can read about it here. I’m not sure if this is the most up-to-date reference for error handling, but that isn’t the point. The exact mechanism you use doesn’t matter. The whole idea is that you don’t want to see it. You push it to the infrastructure and then it is handled.
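
The idea is language-agnostic. Here is a sketch in Python of pulling per-action error handling into one infrastructure wrapper; the MVC exception-handling mechanism mentioned above is the real-world version of this, and all names here are invented:

```python
import functools

def handles_errors(action):
    """Infrastructure: converts exceptions into error responses once,
    instead of a try/except inside every controller action."""
    @functools.wraps(action)
    def wrapper(*args, **kwargs):
        try:
            return {"status": 200, "body": action(*args, **kwargs)}
        except KeyError as e:
            return {"status": 404, "body": "not found: %s" % e}
        except Exception as e:
            return {"status": 500, "body": str(e)}
    return wrapper

@handles_errors
def get_order(orders, order_id):
    # The action itself stays focused on the business logic.
    return orders[order_id]

assert get_order({"o1": "widgets"}, "o1")["status"] == 200
assert get_order({}, "o1")["status"] == 404
```

Every action gets consistent error responses for free, and none of them contains error-handling noise.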

In the same manner, if you need to do logging or auditing, push them down the stack if they are in the form of: “User X accessed Y”. On the other hand, if you need something like: “Manager X authorized N vacations days for Y”, that is a business audit which should be recorded in the business logic, not in the infrastructure.  I wrote about this a lot in the past.
