Ayende @ Rahien

Oren Eini, aka Ayende Rahien, is the CEO of Hibernating Rhinos LTD, which develops RavenDB, a NoSQL Open Source Document Database.

You can reach me by:

oren@ravendb.net

+972 52-548-6969

time to read 4 min | 654 words

A user reported that a particular query returned the results in an unexpected order. The query in question looked something like the following:

image

Note that we first sort by score(), and then by the amount of sales. The problem was that documents that should have had the same score ended up in different positions.

Running the query, we would get:

image

But all the documents have the same data (so they should have the same score), and when sorting by descending sales, 2.4 million should be higher than 62 thousand. What is going on?

We looked at the score, here are the values for the part where we see the difference:

  • 1.702953815460205
  • 1.7029536962509155

Okay… that is interesting. You might notice that the query above has include explanations(), which will give you the details of why we have sorted the data as we did. The problem? Here is what we see:

image

I’m only showing a small piece, but the values are identical on both documents. We managed to reduce the issue to a smaller data set (a few dozen documents, instead of tens of thousands), but the actual issue was still a mystery.

We had to dig into Lucene to figure out how the score is computed. In the land of indirection and virtual method calls, we ended up tracing the score computation for those two documents. Here is how Lucene computes the score:

image

They sum the various scoring components to get the final value (highly simplified). But I checked, and the data is the same for all of the documents. Why do we get different values? Let’s look at this in more detail, shall we?

image

Here is the deal: if we add all of those together in a calculator, we’ll get: 1.702953756

This is close, but not quite what we get from the score. This is probably because the calculator uses arbitrary precision numbers, and we use floats. The problem is that all of the documents in the query have the exact same numbers, so why do we get different values?

I then tried to figure out what was going on there. The way Lucene handles the values, each subsection of the scoring (part of the graph above) is computed on its own and then summed. That still doesn’t explain what is going on. Then I realized that Lucene is using a heap of mutable values to store the scorers as it is scoring the values. So whenever we score a document, we mark the scorer as a match and then put it at the end of the heap. But the order of the items in the heap is not guaranteed.

Usually, this doesn’t matter, but please look at the above values and consider the following fact:

What do you think are the values of C and D in the code above?

  • c = 1.4082127
  • d = 1.4082128

Yes, the order of operations for addition matters a lot for floats. This is expected, given the way floating point values are represented in memory, but it is still surprising when you run into it.
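To make the effect concrete, here is a minimal stand-in, not the snippet from this post and not Lucene code. It is written in Python, which uses 64-bit doubles rather than the 32-bit floats Lucene scores with, so the numbers differ from the ones above, but the behavior is the same: adding the same values in a different order can give a different result. The helper names (`add_in_order`, `distinct_sums`, `stable_sum`) are made up for this illustration.

```python
import math
from itertools import permutations

# The same three values, grouped differently.
left_to_right = (0.1 + 0.2) + 0.3   # 0.6000000000000001
right_to_left = 0.1 + (0.2 + 0.3)   # 0.6


def add_in_order(values):
    """Plain left-to-right addition, one rounding step per operation."""
    total = 0.0
    for value in values:
        total += value
    return total


def distinct_sums(values):
    """Every result you can get by only changing the order of the additions."""
    return {add_in_order(order) for order in permutations(values)}


def stable_sum(values):
    """math.fsum tracks the lost low-order bits, so the order stops mattering."""
    return math.fsum(values)


if __name__ == "__main__":
    print(left_to_right == right_to_left)              # False
    print(sorted(distinct_sums([0.1, 0.2, 0.3])))
    print(stable_sum([0.1, 0.2, 0.3]) == stable_sum([0.3, 0.2, 0.1]))  # True
```

If the summation order cannot be pinned down, an order-independent summation such as `math.fsum` is one way to get a deterministic result, at some extra cost per addition.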

The fact that the order of operations on the Lucene scorer is not consistent means that you may get some very subtle bugs. In order to avoid reproducing this bug, you can do pretty much anything and it will go away. It requires very careful setup and is incredibly delicate. And yet it tripped me hard enough to make me lose most of a day trying to figure out exactly where we went wrong.

Really annoying.

time to read 2 min | 219 words

I announced the beta availability of RavenDB 5.0 last week, but I missed some of the details on how to enable that. In this post, I’ll give you a detailed guide on how to set up RavenDB 5.0 for your use right now.

For the client libraries, you can use the MyGet feed. At this time, you can run:

Install-Package RavenDB.Client -Version 5.0.0-nightly-20200321-0645 -Source https://www.myget.org/F/ravendb/api/v3/index.json

If you want to run RavenDB on your machine, you can download from the downloads page, click on the Nightly tab and select the 5.0 version:

image

And on the cloud, you can register a (free) account and then, add a product:

image

Create a free instance:

image

Select the 5.0 release channel:

image

And then create the RavenDB instance.

Wait a few minutes, and then you can connect to your RavenDB 5.0 instance and start working with the new features.

You can also run it with Docker using:

docker pull ravendb/ravendb-nightly:5.0-ubuntu-latest

time to read 3 min | 454 words

I care a lot about the performance of RavenDB, as you might have noticed from this blog. A few years ago we had a major architecture shift that increased our performance by a factor of ten, which was awesome. But with the meal comes the appetite, and we are always looking for better performance.

One of the things we did with RavenDB is build things so we’ll have the seams in place to change the internal behavior without users noticing how things are working behind the scenes. We have used this capability a bunch of times to improve the performance of RavenDB. This post is about one such scenario that we recently implemented and that will go into RavenDB 5.0.

Consider the following query:

image

As you can see, we are doing a range based query on a date field. Now, the source collection in this case has just over 5.9 million entries and there are a lot of unique elements in the specified range. Let’s consider how RavenDB will handle this query in version 4.2:

  • First, find all the unique CreatedAt values between those ranges (there can be tens to hundreds of thousands).
  • Then, for each one of those unique values, find all the matching documents (usually, only one).

This is expensive, and the problem almost always shows up when doing date range queries over non trivial ranges, because they combine the two elements of many unique terms and very few results per term.

The general recommendation was to avoid running the query above and instead use:

image

This allows RavenDB to use a different method for the range query, based on numeric values, not distinct string values. The performance difference is huge.
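As a rough illustration of why the two approaches cost so differently, here is a toy model in Python. It is not how RavenDB or Lucene implement range queries; the index layouts, the function names and the `orders/N` ids are all assumptions made up for this sketch. The term based version does one lookup per unique string value in the range, while the numeric version converts the dates to ticks and takes a single slice of a sorted array.

```python
from bisect import bisect_left, bisect_right
from datetime import datetime, timedelta

EPOCH = datetime(1, 1, 1)


def to_ticks(moment):
    """100-nanosecond ticks since 0001-01-01, using integer math only."""
    delta = moment - EPOCH
    return (delta.days * 86_400 + delta.seconds) * 10_000_000 + delta.microseconds * 10


def as_term(moment):
    """Fixed-width ISO string, so string order matches chronological order."""
    return moment.isoformat(timespec="microseconds")


def build_indexes(docs):
    """docs is an iterable of (doc_id, created_at) pairs."""
    ordered = sorted(docs, key=lambda doc: doc[1])
    term_index = {}
    for doc_id, created_at in ordered:
        term_index.setdefault(as_term(created_at), []).append(doc_id)
    sorted_terms = sorted(term_index)
    sorted_ticks = [to_ticks(created_at) for _, created_at in ordered]
    doc_ids = [doc_id for doc_id, _ in ordered]
    return term_index, sorted_terms, sorted_ticks, doc_ids


def range_by_terms(term_index, sorted_terms, start, end):
    """Enumerate every unique term in the range, then look each one up."""
    lo = bisect_left(sorted_terms, as_term(start))
    hi = bisect_right(sorted_terms, as_term(end))
    results = []
    for term in sorted_terms[lo:hi]:      # one lookup per unique value
        results.extend(term_index[term])
    return results


def range_by_ticks(sorted_ticks, doc_ids, start, end):
    """Two binary searches over numeric values, then a single slice."""
    lo = bisect_left(sorted_ticks, to_ticks(start))
    hi = bisect_right(sorted_ticks, to_ticks(end))
    return doc_ids[lo:hi]


if __name__ == "__main__":
    base = datetime(2020, 1, 1)
    docs = [(f"orders/{i}", base + timedelta(minutes=i)) for i in range(1000)]
    term_index, sorted_terms, sorted_ticks, doc_ids = build_indexes(docs)
    start, end = base + timedelta(minutes=10), base + timedelta(minutes=19)
    print(range_by_terms(term_index, sorted_terms, start, end))
    print(range_by_ticks(sorted_ticks, doc_ids, start, end))
```

Both functions return the same documents. The difference is in the work done: the first scales with the number of unique values in the range, the second does not.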

But the second query is ugly and far less readable. I don’t like such a solution, even if it can serve as a temporary workaround. Because of that, we implemented a better system in RavenDB 5.0. Behind the scenes, RavenDB now translates the first query into the second one. You don’t have to do anything to make it happen (when migrating from 4.2 instances, you’ll need to reset the index to get this behavior). You just use dates as you would normally expect them to be used and RavenDB will do the right thing and optimize it for you.

To give you a sense of the difference in performance, the query above on a data set of 5.9 million records will have the following performance:

  • RavenDB 4.2 - 7,523 ms
  • RavenDB 5.0 - 134 ms

As you can imagine, I’m really happy about this kind of performance boost.

time to read 1 min | 99 words

The RavenDB 5.0 release train is gathering steam as we speak. The most recent change to talk about is the fact that you can now deploy the RavenDB 5.0 beta in RavenDB Cloud:

image

This allows you to start experimenting with the newest features of RavenDB, including the time series capabilities.

Please take a look. The RavenDB 5.0 option is available on the free tier as well, so I would encourage you to give it a run.

As always, your feedback is desired and welcome.

time to read 1 min | 165 words

Due to the Coronavirus issue that has been going around, my March 18 session has been moved to be online only.

You can register for the event here. We’ll also be sharing it live on the RavenDB Facebook page.

As a reminder, the talk is: Building a Grown Up Database

A database is a complex, often fussy beast. For years, Oren Eini has made his living by fixing performance issues of various kinds. After seeing the same mistakes happen again and again, Oren decided to build his own database where these problems will never arise.
In this Webinar he will talk about the kind of features that make RavenDB a grown up database:
-- It doesn't need a full-time babysitter
-- Uses AI automatic indexing and self optimizing engines
-- Understands the operational environment and adjusts to it without the need for a human in the loop
-- High Availability
-- Secured development

time to read 2 min | 303 words

Recently the time.gov site had a complete makeover, which I love. I don’t really have much to do with time in the US in the normal course of things, but this site has a really interesting feature that I love.

Here is what this shows on my machine:

image

I love this feature because it showcases a real world problem very easily. Time is hard. The concept we have in our head about time is completely wrong in many cases. And that leads to interesting bugs. In this case, the second machine will be adjusted at midnight from the network and the clock drift will be fixed (hopefully).

What will happen to any code that runs when this happens? As far as it is concerned, time will move back.

RavenDB has a feature called document expiration. You can set a time for a document to go away. We had a bug which caused us to read the entries to be deleted at time T and then delete the documents that are older than T. Except that in this case, the T wasn’t the same. We travelled back in time (and the log was confusing) and got an earlier result. That meant that we removed the expiration entries but not their related documents. When the time moved forward enough again to have those documents expire, the expiration record was already gone.

As far as RavenDB was concerned, the documents were updated to expire in the future, so the expiration records were no longer relevant. And the documents never expired, ouch.

We fixed that by remembering the original time we read the expiration records. I’m comforted by knowing that we aren’t the only ones having to deal with it.
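The shape of the bug and of the fix can be sketched in a few lines of Python. This is an illustration only, not RavenDB’s code: the two dictionaries, the function names and the scripted clock are assumptions for the example. The buggy version reads the clock twice, once to find the due entries and once to decide which documents to delete. The fixed version reads it once and reuses that value.

```python
from datetime import datetime, timedelta


def make_clock(readings):
    """A scripted clock that returns the given readings one after another."""
    remaining = iter(readings)
    return lambda: next(remaining)


def expire_buggy(entries, docs, clock):
    """entries and docs both map doc_id -> expiration time."""
    read_time = clock()
    due = [doc_id for doc_id, at in entries.items() if at <= read_time]
    delete_time = clock()   # second reading, may be EARLIER than read_time
    for doc_id in due:
        del entries[doc_id]                   # the expiration entry is gone...
        if doc_id in docs and docs[doc_id] <= delete_time:
            del docs[doc_id]                  # ...but the document may survive


def expire_fixed(entries, docs, clock):
    now = clock()           # read once, reuse for both decisions
    due = [doc_id for doc_id, at in entries.items() if at <= now]
    for doc_id in due:
        del entries[doc_id]
        if doc_id in docs and docs[doc_id] <= now:
            del docs[doc_id]


if __name__ == "__main__":
    t = datetime(2020, 3, 1, 0, 0, 5)
    expires_at = t - timedelta(seconds=2)
    # The clock is adjusted backwards by ten seconds between the two readings.
    readings = [t, t - timedelta(seconds=10)]

    entries, docs = {"docs/1": expires_at}, {"docs/1": expires_at}
    expire_buggy(entries, docs, make_clock(readings))
    print(entries, docs)    # entry removed, document still there

    entries, docs = {"docs/1": expires_at}, {"docs/1": expires_at}
    expire_fixed(entries, docs, make_clock(readings))
    print(entries, docs)    # both removed
```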

time to read 6 min | 1077 words

I ran across this article, which talks about unit testing. There isn’t anything there that would be ground breaking, but I ran across this quote, and I felt that I had to write a post to answer it.

The goal of unit testing is to segregate each part of the program and test that the individual parts are working correctly. It isolates the smallest piece of testable software from the remainder of the code and determines whether it behaves exactly as you expect.

This is a fairly common talking point when people discuss unit testing. Note that this isn’t the goal. The goal is what you want to achieve; this is a method of applying unit testing. Some of the benefits of unit testing are:

Makes the Process Agile and Facilitates Changes and Simplifies Integration

There are other items in the list on the article, but you can just read it there. I want to focus right now on the items above, because they are directly contradicted by separating each part of the program and testing it individually, as is usually applied in software projects.

Here are a few examples from posts I wrote over the years. The common pattern is that you’ll have interfaces, and repositories and services and abstractions galore. That will allow you to test just a small piece of your code, separate from everything else that you have.

This is great for unit testing. But unit testing isn’t a goal in itself. The point is to enable change down the line, to ensure that we aren’t breaking things that used to work, etc.

An interesting thing happens when you have this kind of architecture (and especially if you have this specifically so you can unit test it): it becomes very hard to make changes to the system. That is because the number of times you repeated yourself has grown. You have something once in the code and a second time in the tests.

Let’s consider something rather trivial. We have the following operation in our system, sending money:

image

A business rule says that we can’t send money if we don’t have enough in our account. Let’s see how we may implement it:

This seems reasonable at first glance. We have a lot of rules around money transfer, and we expect to have more of these in the future, so we created the IMoneyTransferValidationRules abstraction to model that and we can easily add new rules as time goes by. Nothing objectionable about that, right? And this is important, so we’ll have unit tests for each one of those rules.
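For readers who want something concrete to look at, here is a hypothetical sketch of that kind of design in Python. It is not the code from this post. The `MoneyTransferValidationRule` protocol stands in for the IMoneyTransferValidationRules abstraction, and every other name (`Account`, `Transfer`, `SufficientFundsRule`, `send_money`) is an assumption for the example.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Protocol


@dataclass
class Account:
    id: str
    balance: int            # in cents, to avoid float rounding


@dataclass
class Transfer:
    source: Account
    destination: Account
    amount: int


class MoneyTransferValidationRule(Protocol):
    def validate(self, transfer: Transfer) -> Optional[str]:
        """Return an error message, or None if the rule is satisfied."""
        ...


class SufficientFundsRule:
    """The business rule from the text: no sending more than you have."""

    def validate(self, transfer: Transfer) -> Optional[str]:
        if transfer.amount > transfer.source.balance:
            return "insufficient funds"
        return None


def send_money(transfer: Transfer, rules: Iterable[MoneyTransferValidationRule]) -> None:
    errors = []
    for rule in rules:
        error = rule.validate(transfer)
        if error is not None:
            errors.append(error)
    if errors:
        raise ValueError("; ".join(errors))
    transfer.source.balance -= transfer.amount
    transfer.destination.balance += transfer.amount


if __name__ == "__main__":
    alice, bob = Account("accounts/alice", 10_000), Account("accounts/bob", 0)
    send_money(Transfer(alice, bob, 2_500), [SufficientFundsRule()])
    print(alice.balance, bob.balance)
```

A test written against `send_money` survives a change in how the rules are evaluated. A test written against each rule class has to be rewritten whenever that internal shape changes.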

During the last stages of the system, we realize that each one of those rules generates a bunch of queries to the database and that when we have load on the system, the transfer operation will create too much pain as it currently stands. There are a few options that we have available at this point:

  • Instead of running individual operations that will each load their own data, we’ll load the data once for all of them. Here is what this will look like:

As you can see, we now have a way to use Lazy queries to reduce the number of remote calls this will generate.

  • Instead of taking the data from the database and checking it, we’ll send the check script to the database and do the validation there.

And here we moved pretty much the same overall architecture directly into the database itself. So we’ll not have to pay the cost of remote calls when we need to access more information.

The common thing for both approaches is that they are perfectly in line with the old way of doing things. We aren’t talking about a major conceptual change. We just changed things so that it is easier to work with properly.

What about the tests?

If we tested each one of the rules independently, we now have a problem. All of those tests will now require non trivial modification. That means that instead of allowing change, the tests now serve as a barrier for change. They have set our architecture and code in concrete and made it harder to make changes. If those changes were bugs, that would be great. But in this case, we don’t want to modify the system behavior, only how it achieves its end result.

The key issue with unit testing the system as a set of individually separated components is the concept that there is value in each component independently. There isn’t. “The whole is greater than the sum of its parts” is very much in play here.

If we had tests that looked at the system as a whole, those wouldn’t break. They would continue to serve us properly and validate that this big change we made didn’t break anything. Furthermore, at the edges of the system, changing the way things happen is usually a problem. We might have external clients or additional APIs that rely on us, after all. So keeping the exterior stable is something that I want to enforce with tests.

That said, when you build your testing strategy, you may have to make allowances. It is very important for the tests to run as quickly as possible. Slow feedback cycles can be incredibly annoying and will kill productivity. If there are specific components in your system that are slow, it makes sense to insert seams to replace them. For example, if you have a certificate generation bit in your system (which can take a long time), in the tests you might want to return a certificate that was prepared ahead of time. Or if you are working with a remote database, you may want to use an in memory version of that. An external API you’ll want to mock, etc.

The key here isn’t that you are trying to look at things in isolation, the key is that you are trying to isolate things that are preventing you from getting quick feedback on the state of the system.
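Here is a minimal sketch of such a seam in Python, using the certificate example. All of the names (`Server`, `slow_generate_certificate`, `PREPARED_CERTIFICATE`) are hypothetical, and the slow generator is simulated with a sleep. The production default stays the slow path. The test swaps in a certificate prepared ahead of time and still exercises the server from the outside.

```python
import time
from typing import Callable


def slow_generate_certificate(name: str) -> bytes:
    """Stand-in for real certificate generation, which can take a long time."""
    time.sleep(2)           # simulated cost, an assumption for this sketch
    return f"certificate-for-{name}".encode()


class Server:
    def __init__(
        self,
        name: str,
        generate_certificate: Callable[[str], bytes] = slow_generate_certificate,
    ) -> None:
        # The seam: the slow component is passed in, the rest is untouched.
        self.name = name
        self.certificate = generate_certificate(name)

    def is_secured(self) -> bool:
        return len(self.certificate) > 0


# Prepared once, ahead of time, and reused by every test.
PREPARED_CERTIFICATE = b"prepared-test-certificate"


def make_test_server(name: str) -> Server:
    return Server(name, generate_certificate=lambda _name: PREPARED_CERTIFICATE)


if __name__ == "__main__":
    server = make_test_server("node-a")
    print(server.is_secured())      # no two second wait here
```

Note that the seam exists only because the real component is slow, not because the component needs to be tested in isolation.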

In short, unless there is uncertainty about a particular component (implementing new algorithm or data structure, exploring unfamiliar library, using 3rd party code, etc), I wouldn’t worry about testing that in isolation. Test it from outside, as a user would (note that this may take some work to enable that as an option) and you’ll end up with a far more robust testing infrastructure.
