Ayende @ Rahien

Oren Eini, aka Ayende Rahien, CEO of Hibernating Rhinos LTD, which develops RavenDB, a NoSQL Open Source Document Database.

time to read 2 min | 298 words

Yesterday I presented a webinar on using Machine Learning and Time Series in RavenDB. One of the things that I love about doing these webinars is that I get to field questions from the audience and have to really think on my feet.

In almost all cases, I think that I am able to provide good answers, and I usually accompany these with a live demo showing what is going on.

Yesterday, that wasn’t the case. During the demo, I managed to run into a fairly obscure bug very deep in the internals of RavenDB and got the wrong results. Then I got stuck on it and couldn’t figure out a proper workaround until just after the webinar concluded.

Hugely embarrassing, but at least I can take comfort that it wasn’t the first time that a live demo failed and probably won’t be the last time.

The good news, on the other hand, is that I created an issue to fix this problem. Today I made myself a cup of coffee and was about to dig into the problem when I realized that it had already been fixed. Time from opening the bug to getting it fixed: under 6 hours. That is without rushing it, mind you. Given the turnaround time, I think it is a good thing overall that I ran into this.

The actual problem, for what it is worth, is that we lazily evaluated a collection during its enumeration. However, when using GroupBy(), we effectively enumerated the collection twice, fetching different values each time. The second time around, we would get the last value from the collection, since it was lazily evaluated. You can check the pull request if you are that much of a nerd :-).
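Here is a minimal sketch (not the actual RavenDB code) of this class of bug: a lazily evaluated sequence yields different values on each enumeration.

using System;
using System.Collections.Generic;
using System.Linq;

class LazyEnumerationDemo
{
    static void Main()
    {
        var counter = 0;
        // A lazy sequence: the Select() runs again on every enumeration,
        // so two passes over the same variable observe different values.
        IEnumerable<int> items = Enumerable.Range(0, 3).Select(_ => ++counter);

        Console.WriteLine(string.Join(",", items)); // 1,2,3
        Console.WriteLine(string.Join(",", items)); // 4,5,6 -- second enumeration

        // A pipeline that enumerates its source twice (as our GroupBy()
        // path effectively did) therefore sees two different collections.
        // Materializing once makes repeated passes stable:
        var snapshot = items.ToList();
        Console.WriteLine(string.Join(",", snapshot)); // same values every time
    }
}

The fix is the usual one: materialize the sequence once before anything that may enumerate it more than once.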

time to read 2 min | 222 words

I’m doing some cloud work, following the official documentation, trying to automate the creation of an AWS Lambda function. To allow me to iterate quickly, I’m basically creating the entire thing from scratch each time.

I have the following code:

aws iam create-role --role-name $AWS_ROLE --assume-role-policy-document file://trust-policy.json
aws iam attach-role-policy --role-name $AWS_ROLE --policy-arn arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole
aws lambda create-function --function-name $FUNC_NAME --zip-file fileb://lambda.zip --handler lambda_function.lambda_handler --runtime python3.8 --role $ARN_ROLE

So far, so good, and exactly like it shows in the docs. But if you run it as a script, it will fail with:

An error occurred (InvalidParameterValueException) when calling the CreateFunction operation: The role defined for the function cannot be assumed by Lambda.

If I re-run the exact same command, however, it works properly.

There is this interesting command, whose very existence indicates that roles are subject to eventual consistency:

aws iam wait role-exists --role-name $AWS_ROLE

Except… that this doesn’t work. It looks like there is some additional delay between creating the role, validating that it was created, and the point when it is actually available for Lambda to use.

After looking around and feeling like a fool, I added a sleep for 10 seconds to the script, and the problem went away.

I’m posting this for posterity’s sake and in the hope that someone can tell me that there is a better way. For now, I think I need a shower.
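One less blind option would be to retry the creation until the role propagates. A sketch (shelling out to the same CLI command as above; the retry bound and delay are arbitrary):

using System;
using System.Diagnostics;
using System.Threading;

class CreateFunctionWithRetry
{
    static bool TryCreateFunction(string funcName, string roleArn)
    {
        var proc = Process.Start(new ProcessStartInfo
        {
            FileName = "aws",
            Arguments = $"lambda create-function --function-name {funcName} " +
                        "--zip-file fileb://lambda.zip " +
                        "--handler lambda_function.lambda_handler " +
                        $"--runtime python3.8 --role {roleArn}",
            UseShellExecute = false
        });
        if (proc == null)
            return false;
        proc.WaitForExit();
        return proc.ExitCode == 0; // non-zero: likely "cannot be assumed by Lambda" yet
    }

    static void Main(string[] args)
    {
        // args[0]: function name, args[1]: role ARN
        for (var attempt = 1; attempt <= 30; attempt++)
        {
            if (TryCreateFunction(args[0], args[1]))
                return; // created successfully
            Thread.Sleep(TimeSpan.FromSeconds(2)); // role not propagated yet, retry
        }
        throw new TimeoutException("Role never became assumable by Lambda");
    }
}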

time to read 1 min | 188 words

Just had to troubleshoot a really strange problem in a production system. The situation is pretty simple. At one point in time, a machine had an issue and shortly afterward became utterly unable to reach the outside world. I was luckily able to SSH into the machine, but any outgoing connection would fail.

Given that TCP worked (I was using it for SSH), how can that be? I could create a connection to the system and have bidirectional communication, but not initiate any calls to the outside world.

Strange doesn’t begin to cover how weird this is. The root cause of the failure, by the way, was that the disk was absolutely full. Why a full disk would cause a network outage was the mystery.

It took a while to figure out that the culprit was the full disk, which caused an error in systemd-resolved, which in turn failed to set up DNS properly. I was also unable to make connections via IP address, which was strange, but the problem went away after the disk was cleared.

I have to say, network failure due to disk errors was not how I expected to start the day.

time to read 2 min | 235 words

A user reported an issue with RavenDB. They got unexpected results in their production database, but when they imported the data locally and tested things, everything worked.

Here is the simplified version of their index:
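It looked something like this (a sketch with placeholder User and Company documents, not their exact code):

using System.Linq;
using Raven.Client.Documents.Indexes;

// Placeholder document types for the sketch:
public class User { public string Id; public string CompanyId; public string Name; }
public class Company { public string Id; public string Name; }

public class Users_ByCompany : AbstractMultiMapIndexCreationTask<Users_ByCompany.Result>
{
    public class Result
    {
        public string CompanyId;
        public string CompanyName;
        public int UserCount;
    }

    public Users_ByCompany()
    {
        // Map over the Users collection -- note that it carries no company name
        AddMap<User>(users =>
            from u in users
            select new Result { CompanyId = u.CompanyId, CompanyName = (string)null, UserCount = 1 });

        // Map over the Companies collection -- this is where the name comes from
        AddMap<Company>(companies =>
            from c in companies
            select new Result { CompanyId = c.Id, CompanyName = c.Name, UserCount = 0 });

        // Aggregate the results from both maps per company
        Reduce = results =>
            from r in results
            group r by r.CompanyId into g
            select new Result
            {
                CompanyId = g.Key,
                CompanyName = g.First().CompanyName, // <-- the problematic line
                UserCount = g.Sum(x => x.UserCount)
            };
    }
}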

This is a multi-map index that covers multiple collections and aggregates data across them. In this case, the issue was that in production, for some of the results, the CompanyName field was null.

The actual index was more complex, but once we trimmed it down to something more manageable, it became obvious what the problem was. Let’s look at the problematic line:

CompanyName = g.First().CompanyName,

The problem is with the First() call. There is no promise of ordering in the grouping results, so you are getting an arbitrary first item there. If that item happened to be the one from the Company map, the index will appear to work and you’ll get the right company name. However, if the result from the User map shows up first, we’ll have null in the CompanyName.

We don’t make any guarantees about the order of elements in the grouping, but in practice it often (don’t rely on it!) depends on the order of updates to the documents. So you can update the user after the company and see the results change in the index.

The right way to index this data is to do so explicitly, like so:

CompanyName = g.First(x => x.CompanyName != null).CompanyName,

time to read 2 min | 233 words

Our internal ERP system is configured to send an email when we have orders that cross a certain threshold. For obvious reasons, I like knowing when a big deal has closed. For the most part, for large orders, there is a back and forth process with the customer, so we know to expect the order. In most cases, the process involves talking to legal, purchasing agents and varying degrees of bureaucracy.

Sometimes we have a customer that comes out of the blue, places a large order, and never even talks to a person throughout the whole thing. I love those cases, as you can imagine, because it means that there was no friction anywhere that we needed to deal with.

Today, I got notified about an order with a large enough sum that I had to stop and count the number of zeroes. I was quite happy with the number, but was somewhat concerned that I would have to pay enough taxes to close down the national debt. I checked with the accounting department to see what was going on there.

Sadly, we won’t be able to purchase a large island with pocket change. This was a manual order, and instead of putting in the amount of the order, they put in the invoice id.

Cue sad trombone sound.

If you need me, I’m currently busy trying to cancel my tickets to the moon.

time to read 5 min | 970 words

A customer reported that their system suffered from frequent cluster elections in some cases. That is usually an indication that system resources are strained in some manner. From experience, that usually means that the I/O on the machine is capped (very common in the cloud) or that there is some network issue.

The customer was able to rule these issues out. The latency to the storage was typically under a millisecond, and the maximum latency never exceeded 5 ms. The network monitoring showed that everything was fine as well. The CPU was hovering around 7%, and there was no apparent reason for the issue.

Looking at the logs, we saw very suspicious gaps in the servers’ activity, but with no good reason for them. Furthermore, the customer was able to narrow the issue down to a single scenario: resetting the indexes would cause the cluster state to become unstable. And it did so with very high frequency.

“That is flat out impossible,” I said. And I meant it. Indexing workload is something that we have a lot of experience managing, and in RavenDB 4.0 we made some major changes to our indexing structure to handle exactly this kind of scenario. In particular, this meant that indexing:

  • Runs in dedicated threads.
  • Is scoped to run outside certain cores, leaving the OS capacity to run other tasks.
  • Self-monitors and knows when to wind down to avoid impacting system performance.
  • Runs its threads with lower priority.
  • The cluster state, on the other hand, runs with high priority.

The entire thing didn’t make sense. However… the customer did a great job in setting up an environment where they could show us: click on the reset button, and the cluster becomes unstable. So it is impossible, but it happens.

We explored a lot of possibilities around this issue. The machine is big and runs multiple NUMA nodes; maybe it was some interaction with that? It was highly unlikely, and eventually didn’t pan out, but that is one example of the things that we tried.

We set up a similar cluster on our end and gave it 10x more load than the customer did, on a set of machines that had a fraction of the customer’s power. The cluster and the system state remained absolutely stable.

I officially declared that we were in a state of perplexation.

When we ran the customer’s own scenario on our system, we saw something, but nothing like what we saw on their end. One of the things that we usually do when we investigate resource constraint issues is to give the machines under test a lot less capability. Less memory and slower disks, for example, make it much easier to surface many problems. But the worse we made the situation for the test cluster, the better the results became.

We changed things up. We gave the cluster machines with 128 GB of RAM and fast disks and tried it again. The situation immediately reproduced.

Cue facepalm sound here.

Why would giving more resources to the system cause instability in the cluster? Note that the other metrics also suffered, which made absolutely no sense.

We started digging deeper and we found the following index:
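It was roughly equivalent to this (a reconstructed sketch, with a hypothetical Item document):

using System.Linq;
using Raven.Client.Documents.Indexes;

public class Item { public string Name; public object State; }

public class Items_ByNameAndState : AbstractIndexCreationTask<Item>
{
    public Items_ByNameAndState()
    {
        // Just a plain map index over a couple of fields -- nothing that
        // should ever trouble RavenDB. The catch is what State turns out to be.
        Map = items => from item in items
                       select new
                       {
                           item.Name,
                           item.State
                       };
    }
}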

It is about as simple an index as you can imagine, and it should cause absolutely no issue for RavenDB. So what was going on? We then looked at the documents…

[image: one of the documents, showing the State field]

I would expect the State field to be a simple enum property. But it is an array that describes the state changes in the system. This array also holds complex objects. The size of the array is about 450 items on average, and I saw it hit a max of 13,000 items.

That helped clarify things. During indexing, we have to process the State property, and because it is an array, we index each of the elements inside it. That means that for each document, we’ll index between 400 and 13,000 items for the State field. What is more, we have complex objects to index. RavenDB will index each of those as a JSON string, so effectively the indexing generates a lot of strings. These strings are going to be held in memory until the end of the indexing batch. So far, so good, but the problem in this case was that there were enough resources to have a big batch of documents.

That means that we would generate over 64 million string objects in one of those batches.

Enter GC, stage left.

The GC will be invoked based on how many allocations you have (in this case, a lot) and how many live objects you have (in this case, also a lot, until the index batch is completed). And because we ran the GC multiple times during the indexing batch, we had promoted significant numbers of objects to the next generation, and Gen1 or Gen2 collections are far more expensive.
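You can see the effect in isolation with a toy snippet (not RavenDB code): hold allocations alive for the duration of a “batch” and watch the collection counts climb across generations:

using System;
using System.Collections.Generic;

class BatchAllocationDemo
{
    static void Main()
    {
        var gen0 = GC.CollectionCount(0);
        var gen1 = GC.CollectionCount(1);
        var gen2 = GC.CollectionCount(2);

        // Simulate an indexing batch: millions of short strings that stay
        // referenced until the batch completes, surviving Gen0 collections
        // and getting promoted to the older (more expensive) generations.
        var batch = new List<string>();
        for (var i = 0; i < 5_000_000; i++)
            batch.Add(i.ToString());

        Console.WriteLine($"Gen0: {GC.CollectionCount(0) - gen0}, " +
                          $"Gen1: {GC.CollectionCount(1) - gen1}, " +
                          $"Gen2: {GC.CollectionCount(2) - gen2}");
        GC.KeepAlive(batch); // keep the "batch" rooted until here
    }
}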

Once we knew what the problem was, it was easy to find a fix: don’t index the State field. Given that the indexed values were JSON strings, it was unlikely that the customer actually queried on them (later confirmed by talking to the customer).

On the RavenDB side, we added monitoring for the allocation frequency, and we will now close the indexing batch early to avoid handing the GC too much work all at once.

The reason we failed to reproduce this on lower end machines was simple: RavenDB was already using enough memory that we closed the batch early, before we could gather enough objects to make the GC really work hard. When running on a big machine, it had time to get the ball rolling and hand the whole big mess to the GC for cleanup.

time to read 3 min | 449 words

A user reported a bug to our support: when running on macOS, they were unable to authenticate against a remote RavenDB instance.

That was strange, since we support running on macOS, both as a client and as a server. We have had some issues around different behavior, but it works, so what could the issue be?

RavenDB uses X509 certificates for authentication. That ensures mutual authentication for client and server, as well as securing the communication from any prying eyes. But on that particular system, it just did not work. RavenDB was accessible, but when attempting to access it, we weren’t able to authenticate. When using the browser, we didn’t get the “Choose the certificate” dialog either. That was really strange. Digging deeper, we verified that the certificate was set up properly in the keychain. We also tested Firefox, which has a separate store for certificates; nothing worked.

Then we tested using curl and were able to properly access and authenticate to the server. So something was really strange here. Testing from a different machine, we observed no issues.

The user mentioned that they had recently moved to Catalina, which is known to have some changes in how it processes certificates. None of those applied to our scenario, however.

Eventually, we started comparing network traces and then we found something really interesting. Take a look at this:

[image: a network trace of the connection]

That was an interesting discovery. The user had an antivirus installed, and the AV installed a root CA and then set up a proxy to direct all traffic through the AV. Because it added a root CA, it was able to sniff all the traffic on the machine.

However, with a client certificate, that model doesn’t work. The proxy would need the private key of the certificate to be able to authenticate to the remote system, which it obviously does not have. It silently stripped the request for a client certificate, which meant that as far as RavenDB was concerned, there was no client certificate in the request, so we rightfully rejected it.

I found it interesting that we were able to use curl; I assume that Avast didn’t set up the proxy in a way that would cover curl.

The solution was simple: exclude RavenDB from the inspected addresses. That immediately fixed the problem.

I spent some time trying to figure out if there was a good way for us to detect this automatically. Sadly, there is no way to tell from the client side which certificate was actually used. If there was, we could compare it to the expected result and alert on that.
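One partial check that is possible (a sketch, not something we ship): pin the server certificate on the client side and alert when the handshake presents something else, such as an inspection proxy’s re-signed certificate. The thumbprint below is a placeholder:

using System;
using System.Net.Http;
using System.Net.Security;

class PinnedClient
{
    static void Main()
    {
        const string ExpectedThumbprint = "PLACEHOLDER_THUMBPRINT";

        var handler = new HttpClientHandler
        {
            ServerCertificateCustomValidationCallback = (request, cert, chain, errors) =>
            {
                // If a proxy intercepts TLS, the certificate we see here is the
                // proxy's re-signed one, not the server's -- so the pin fails.
                if (!string.Equals(cert?.Thumbprint, ExpectedThumbprint,
                        StringComparison.OrdinalIgnoreCase))
                    Console.Error.WriteLine($"Possible TLS interception, got: {cert?.Thumbprint}");

                return errors == SslPolicyErrors.None;
            }
        };

        using var client = new HttpClient(handler);
        // client.GetAsync("https://your-ravendb-server") ...
    }
}

Note that this only catches interception of the server’s certificate; it still cannot tell whether the client certificate made it through to the other side.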

time to read 5 min | 965 words

Recently JetBrains announced that dotTrace and dotMemory support profiling applications on Linux. We have been testing these out for production usage, analyzing production scenarios, and have been finding (and fixing) issues with about an order of magnitude less work and hassle. I have been a fan of JetBrains for over 15 years; they have been providing excellent software for a long time. So why this post? (And no, this isn’t an ad; we pay for a JetBrains subscription for all our devs.)

I’m super excited about being able to profile on Linux, because this is where most of our customers are now running. Before, we had to use arcane tooling and try to figure things out. With better tools, we are able to pinpoint issues almost immediately.

Here is the result of a single profiling session:

[image: profiler snapshot showing a List being created and then ToArray() called on it]

Here, we created a new List and then called ToArray() on it. All we had to do was create the array upfront, without going through resizes and additional allocations.

[image: profiler snapshot of the operation, before and after remembering values across transactions]

Here we had to do a much more involved operation: remembering values across transactions to reduce load. Note how much time we saved!

And of course, just as important:

[image: profiler snapshot of the attempted optimization]

We tried to optimize an operation. It looks like it didn’t work out exactly as expected, so we need to take another look at that.

These examples are about performance, but even more important here is the ability to track memory and its sources. We were able to identify multiple places where we were using too much memory. In some cases, this is about configuration options. In one such case, limiting the amount of memory that we speculatively hold back resulted in a reduction of memory utilization over time from 3.5 GB to 800 MB. These cases are much harder to figure out, because it is very hard to see the full picture.

Another factor that became obvious is that RavenDB’s current infrastructure was written in 2015 / 2016. The lowest pieces of the code were written on what was then called DNX. It was written to be high performance and stable, and we haven’t had to touch it much since. Profiling memory utilization showed us that we hold a large number of values on the large object heap. This is because we need a byte[] that we can also send to native code, so we need to pin it. That was because we needed to call methods such as Stream.Read(byte[], int, int) and then process the buffer using raw pointers.

When the code was written, that was the only choice: allocate it on the large object heap (to reduce fragmentation) and run with it. Now, however, we have Memory&lt;byte&gt; and Span&lt;byte&gt;, which are much easier to work with for this scenario, so we can discard this specific behavior and optimize our memory utilization further.
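To make that concrete, here is a sketch of the two patterns (ProcessNative is a stand-in for the actual native call):

using System;
using System.IO;
using System.Runtime.InteropServices;

static unsafe class BufferPatterns
{
    // Stand-in for the real native entry point:
    static void ProcessNative(IntPtr data, int length) { }

    // The 2015-era pattern: a buffer big enough to land on the large object
    // heap, pinned for its whole lifetime so native code can hold the pointer.
    public static void OldStyle(Stream stream)
    {
        var buffer = new byte[128 * 1024]; // >85,000 bytes => LOH
        var handle = GCHandle.Alloc(buffer, GCHandleType.Pinned);
        try
        {
            var read = stream.Read(buffer, 0, buffer.Length);
            ProcessNative(handle.AddrOfPinnedObject(), read);
        }
        finally
        {
            handle.Free();
        }
    }

    // With Span<byte>: same I/O, but the buffer is pinned only for the
    // duration of the native call, and it no longer has to be a byte[].
    public static void NewStyle(Stream stream)
    {
        Span<byte> buffer = new byte[4096];
        var read = stream.Read(buffer);
        fixed (byte* ptr = buffer)
        {
            ProcessNative((IntPtr)ptr, read);
        }
    }
}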

I don’t think we would have gotten to this without the profiler pointing a big red blinking arrow at the issue. And here is another good example: the memory state of RavenDB from dotTrace after running a benchmark consisting of a hundred thousand queries with JavaScript projections, this one:

[image: the benchmarked query, using a JavaScript projection]

And here is the state:

[image: the memory state]

That is… interesting. Why do we have so many? The JavaScript engine is single-threaded, but we have concurrent queries, so we use multiple such engines. We cache them, so we have:

[image: the engine cache]

And the mechanism there to clean them up would not be triggered normally, so we had these hanging around needlessly. After fixing the cache eviction logic, we get:

[image: the memory state after fixing the cache eviction logic]

Looking at the byte arrays here, we can track this down further and see that the root cause is the pinning buffer I spoke about earlier. We’re handling that fix in a different branch, so this is a pretty good sign that we are done with this kind of benchmark.

[image: the remaining byte arrays]

The downsides of dotTrace and dotMemory do exist. On Linux, dotTrace only supports sampling, not tracing. That sounds like a minor issue, but it means that when I’m running the same benchmark, I’m missing something that I find critical: the number of times a method was run. If a method is taking 50% of the runtime but is invoked a hundred million times, I need a very different approach than for something that happens three times. The lack of a tracing option with method call counts is a real issue for my usual workflow.

For dotMemory, if you have a highly connected graph (for example, a doubly linked list) and a lot of objects, the amount of resources that it takes to analyze the process is huge. I deployed a cloud machine with 128 GB of RAM and 32 cores to try to analyze an 18 GB process dump. That wasn’t sufficient to give all the details, sadly. It was still able to provide me with enough that I could figure out what the issue was, though.

Another issue is that the visuals are amazing, but when I have 120 million links to a node, trying to represent that in the UI is a hopeless cause. I wish there was a way to dump the same information that dotMemory provides to text. That would allow analyzing ridiculous amounts of information in a structured manner.

time to read 2 min | 287 words

When you have an error-code model for returning errors, you are going to be fairly limited in how you can report actual issues.

Here is a good example, taken from the ZStd source code:

[image: three different checks in the ZStd source that all return the same “source size is wrong” error]

You can see that the resulting error is the same, but we have three very different errors. Well, to be fair, we have two types of errors: the total size is wrong, or the number of samples is either too big or too small. There is no way to express that distinction to the calling code, which may be far higher up the stack. There is just the “The source size is wrong” error.
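To make the contrast concrete, a small sketch (hypothetical limits and names, in C# rather than the original C):

using System;

static class ErrorReporting
{
    const int MaxTotalSize = 1 << 20, MinSamples = 10, MaxSamples = 1000;

    enum DictError { None, SrcSizeWrong }

    // Error-code style: three distinct failures collapse into one code,
    // and the caller can't tell which check fired or by how much.
    static DictError TrainWithCodes(int totalSize, int samples)
    {
        if (totalSize > MaxTotalSize) return DictError.SrcSizeWrong;
        if (samples < MinSamples) return DictError.SrcSizeWrong;
        if (samples > MaxSamples) return DictError.SrcSizeWrong;
        return DictError.None;
    }

    // Exception style: the message is an arbitrary string generated at
    // error time, so the caller gets the actual numbers involved.
    static void TrainWithExceptions(int totalSize, int samples)
    {
        if (totalSize > MaxTotalSize)
            throw new InvalidOperationException($"Total size {totalSize} exceeds the {MaxTotalSize} limit");
        if (samples < MinSamples)
            throw new InvalidOperationException($"Too few samples: {samples}, need at least {MinSamples}");
        if (samples > MaxSamples)
            throw new InvalidOperationException($"Too many samples: {samples}, at most {MaxSamples} allowed");
    }
}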

There is actually an attempt at proper error reporting. The DISPLAYLEVEL macro is a way to report more information about the error, but like any C library, we are talking about custom error reporting. The DISPLAYLEVEL macro will write to the standard output if a flag is set. That flag cannot be set from outside the compilation unit, as far as I can see. So when consuming this from managed code, I have to just guess what these things are.

You can say a lot about the dangers and complexities of exceptions. But having a good way to report complex errors to the caller is very important. Note that in this case, complex means an arbitrary string generated at error time, not a complex object. An error code is great if you need to handle the error. But if you need to show it to the user, log it, or handle it after the fact, a good clear error message is key.

time to read 4 min | 654 words

A user reported that a particular query returned the results in an unexpected order. The query in question looked something like the following:

[image: the query, sorting by score() and then by sales descending]

Note that we first sort by score() and then by the amount of sales. The problem was that documents that should have had the same score were sorted into different locations.

Running the query, we would get:

[image: the query results]

But all the documents have the same data (so they should have the same score), and when sorting by descending sales, 2.4 million should rank higher than 62 thousand. What is going on?

We looked at the scores; here are the values for the part where we see the difference:

  • 1.702953815460205
  • 1.7029536962509155

Okay… that is interesting. You might notice that the query above has include explanations(), which will give you the details of why we sorted the data as we did. The problem? Here is what we see:

[image: the explanations output for the two documents]

I’m only showing a small piece, but the values are identical for both documents. We managed to reduce the issue to a smaller data set (a few dozen documents instead of tens of thousands), but the actual cause was still a mystery.

We had to dig into Lucene to figure out how the score is computed. In the land of indirection and virtual method calls, we ended up tracing the score computation for those two documents and figured out the following. Here is how Lucene computes the score:

[image: Lucene’s score computation]

Lucene sums the various scoring factors to get the final value (highly simplified). But I checked, and the data is the same for all of the documents. Why do we get different values? Let’s look at things in more detail, shall we?

[image: the individual scoring factors]

Here is the deal: if we add all of those together in a calculator, we’ll get 1.702953756.

This is close, but not quite what we get from the score. That is probably because the calculator uses arbitrary-precision numbers while we use floats. The problem remains: all of the documents in the query have the exact same numbers, so why do we get different values?

I then tried to figure out what was going on there. The way Lucene handles the values, each subsection of the scoring (part of the graph above) is computed on its own and then summed. That still doesn’t explain what is going on. Then I realized that Lucene uses a heap of mutable values to store the scorers as it scores the values. Whenever we scored a document, we would mark the scorer as a match and then put it at the end of the heap. But the order of the items in the heap is not guaranteed.

Usually, this doesn’t matter, but please look at the above values and consider the following:
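The code in question boiled down to something like this (reconstructed; the constants here are placeholders):

using System;

class FloatOrder
{
    static void Main()
    {
        // Placeholder values standing in for the scoring factors above:
        float f1 = 0.6031286f, f2 = 0.5479815f, f3 = 0.25710264f;

        float c = f1 + f2 + f3; // the scorers in one heap order
        float d = f3 + f2 + f1; // the same scorers, reversed

        Console.WriteLine(c); // the two sums can differ in the last
        Console.WriteLine(d); // digit, even though the inputs are equal
    }
}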

What do you think are the values of c and d in the code above?

  • c = 1.4082127
  • d = 1.4082128

Yes, the order of operations for addition matters a lot for floats. This is expected, given the way floating-point numbers are represented in memory, and yet it still manages to surprise.

The fact that the order of operations in the Lucene scorer is not consistent means that you may get some very subtle bugs. If you try to reproduce this bug, you can do pretty much anything and it will go away. It requires a very careful setup and is incredibly delicate. And yet it tripped me hard enough to make me lose most of a day trying to figure out exactly what we did wrong.

Really annoying.
