Ayende @ Rahien

Oren Eini, aka Ayende Rahien, is the CEO of Hibernating Rhinos LTD, which develops RavenDB, an open source NoSQL document database.

time to read 1 min | 183 words

RavenDB has a public live test instance, which we have recently upgraded to version 6.0. That means that you can start playing around with RavenDB 6.0 directly, including giving us feedback on any issues that you find.

Of particular interest, of course, is the sharding feature, which is right here:

[screenshot: enabling the sharding feature]

And once enabled, you can see things in more detail:

[screenshot: the sharded database view]

If we did things properly, the only indication that you are running in sharded mode is:

[screenshot: the sharded mode indicator]

Take a look, and let us know what you think.

As a reminder, at the top right of the page, there is the submit feedback option:

[screenshot: the Submit Feedback option]

Use it; we are waiting for your insights.

time to read 2 min | 259 words

Let’s say that you have the following scenario. You have an object in your hands that is similar to this one:

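The code itself isn’t reproduced here, but a minimal sketch of such a class might look like this (the Frame name comes from the post; the buffer details are my assumption):

using System;
using System.Runtime.InteropServices;

public class Frame : IDisposable // third-party class, we cannot change it
{
    private IntPtr _buffer; // unmanaged memory holding the frame data

    public Frame(int size)
    {
        _buffer = Marshal.AllocHGlobal(size);
    }

    public void Dispose()
    {
        if (_buffer != IntPtr.Zero)
        {
            Marshal.FreeHGlobal(_buffer);
            _buffer = IntPtr.Zero;
        }
    }
}
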
It holds some unmanaged resources, so you have to dispose it.

However, this is used in the following manner:

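Again, a sketch rather than the original snippet: a single current frame that many threads read, and that one thread occasionally replaces (VideoSource is a hypothetical name):

using System.Threading;

public class VideoSource
{
    private Frame _currentFrame = new Frame(1024);

    // called concurrently from many reader threads
    public Frame GetCurrentFrame() => Volatile.Read(ref _currentFrame);

    // called from a single writer thread
    public void UpdateFrame(Frame newFrame)
    {
        var old = Interlocked.Exchange(ref _currentFrame, newFrame);
        // calling old.Dispose() here would be a use-after-free for any
        // reader that still holds the previous frame
    }
}
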
What is the problem? This object may be used concurrently. In the past, the frame was never updated, so it was safe to read from it from multiple threads. Now there is a need to update the frame, and that is a problem: even though only a single thread can update the frame, there may be other threads that still hold a reference to the old one. Disposing it on update is a huge risk, since they’ll access freed memory. At best, we’ll have a crash; more likely, it will be a security issue. At this point in time, we cannot modify all the calling sites without incurring a huge cost. The Frame class comes from a third party and cannot be changed, so what can we do? Not disposing the frame would lead to a memory leak, after all.

Here is a nice trick that lets us add a finalizer to a third-party class. This is how the code works:

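The snippet isn’t shown in this version of the post, but the trick looks roughly like this (FrameCleaner and Disposer are my names for it):

using System.Runtime.CompilerServices;

public static class FrameCleaner
{
    // the table holds a weak reference to the frame (the key); the value's
    // reference back to the key does not keep the pair alive
    private static readonly ConditionalWeakTable<Frame, Disposer> Disposers = new();

    public static void DisposeOnCollect(Frame frame)
    {
        Disposers.Add(frame, new Disposer(frame));
    }

    private sealed class Disposer
    {
        private readonly Frame _frame;

        public Disposer(Frame frame)
        {
            _frame = frame;
        }

        // the finalizer we could not add to the third-party Frame class
        ~Disposer()
        {
            _frame.Dispose();
        }
    }
}
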
The ConditionalWeakTable associates the lifetime of the disposer with the frame, so only when there are no more outstanding references to the frame (guaranteed by the GC) will the finalizer be called and the memory freed.

It’s not the best option, but it is a great one if you want to make minimal modifications to the code and get the right behavior out of it.

time to read 2 min | 262 words

Trevor Hunter from Kobo Rakuten is going to be speaking about Kobo’s usage of RavenDB in a webinar next Wednesday.

When I started at Kobo, we needed to look beyond the relational and into document databases. Our initial technology choice didn't work out for us in terms of reliability, performance, or flexibility, so we looked for something new and set on a journey of discovery, exploratory testing, and having fun pushing contender technologies to their limits (and breaking them!). In this talk, you'll hear about our challenges, how we evaluated the options, and our experience since widely adopting RavenDB. You'll learn about how we use it, areas we're still a bit wary of, and features we're eager to make more use of. I'll also dive into the key aspects of development - from how it affects our unit testing to the need for a "modern DBA" role on a development team.

About the speaker: Trevor Hunter: "I am a leader and coach with a knack for technology. I’m a Chief Technology Officer, a mountain biker, a husband, and a Dad. My curiosity to understand how things work and my desire to use that understanding to help others are the things I hope my kids inherit from me. I am currently the Chief Technology Officer of Rakuten Kobo. Here I lead the Research & Development organization where our mission is to deliver the best devices and the best services for our readers. We innovate, create partnerships, and deliver software, hardware, and services to millions of users worldwide."

You can register for the webinar here.

time to read 2 min | 246 words

RavenDB Sharding is now running as a production replica in our backend systems, and we are stepping up our testing in a real-world environment.

We are now also publishing nightly builds of RavenDB 6.0, including Sharding support.

[screenshot: the RavenDB 6.0 nightly builds]

There are some known (minor) issues in the Studio, which we are busy fixing, but it is already possible to create and manage sharded clusters. As usual, we would love your feedback.

Here is one of the new features that I’m excited about. You can see that we have an obese bucket here, far larger than all the other items:

[screenshot: bucket size distribution, with one obese bucket]

And we can drill down and find out why:

[screenshot: drill-down into the obese bucket]

We have quite a few revisions of a pretty big document, and they are all obviously under a single bucket.

You can now also get query timings information across the entire sharded cluster, like so:

[screenshot: query timings across the sharded cluster]

And as you can see, this will give you a detailed view of exactly where the costs are.

You can get the nightly build and start testing it right now. We would love to hear from you, especially as you test the newest features.

time to read 9 min | 1605 words

It’s very common to model your backend API as a set of endpoints that mirror your internal data model. For example, consider a blog engine, which may have:

  • GET /users/{id}: retrieves information about a specific user, where {id} is the ID of the user
  • GET /users/{id}/posts: retrieves a list of all posts made by a specific user, where {id} is the ID of the user
  • POST /users/{id}/posts: creates a new post for a specific user, where {id} is the ID of the user
  • GET /posts/{id}/comments: retrieves a list of all comments for a specific post, where {id} is the ID of the post
  • POST /posts/{id}/comments: creates a new comment for a specific post, where {id} is the ID of the post

This mirrors the internal structure pretty closely, and it is very likely that you’ll arrive at an API similar to this if you start writing a blog backend. This represents the usual set of operations very clearly and easily.

The problem is that the blog example is so attractive because it is inherently limited. There isn’t really that much going on in a blog from a data modeling perspective. Let’s consider a restaurant and what its API would look like:

  • GET /menu: Retrieves the restaurant's menu
  • POST /orders: Creates a new order
  • POST /orders/{order_id}/items: Adds items to an existing order
  • POST /payments: Allows the customer to pay their bill using a credit card

This looks okay, right?

We sit at a table, grab the menu and start ordering. From a REST perspective, we need to take into account that multiple users may add items to the same order concurrently.

That matters, because we may have bundles to take into account. John ordered the salad & juice, Jane the omelet, and Derek just got coffee. But coffee is already included in the omelet bundle, so there is no separate charge for it. Here is what this will look like:

 ┌────┐┌────┐┌─────┐┌──────────────────────┐
 │John││Jane││Derek││POST /orders/234/items│
 └─┬──┘└─┬──┘└──┬──┘└─────┬────────────────┘
   │     │      │         │       
   │    Salad & Juice     │       
   │─────────────────────>│       
   │     │      │         │       
   │     │     Omelet     │       
   │     │───────────────>│       
   │     │      │         │       
   │     │      │ Coffee  │       
   │     │      │────────>│       

The actual record we have in the end, on the other hand, looks like:

  • Salad & Juice
  • Omelet & Coffee

In this case, we want the concurrent nature of separate requests, since each user will be ordering at the same time, but the end result should be the final tally, not just an aggregation of the separate totals.

In the same sense, how would we handle payments? Can we do that in the same manner?

 ┌────┐┌────┐┌─────┐┌──────────────────┐
 │John││Jane││Derek││POST /payments/234│
 └─┬──┘└─┬──┘└──┬──┘└────────┬─────────┘
   │     │      │            │          
   │     │     $10           │          
   │────────────────────────>│          
   │     │      │            │          
   │     │      │ $10        │          
   │     │──────────────────>│          
   │     │      │            │          
   │     │      │    $10     │          
   │     │      │───────────>│  

In this case, however, we are in a very different state. What happens in this scenario if one of those charges is declined? What happens if they pay too much? What happens if there is a concurrent request to add an item to the order while the payment is underway?

When you have separate operations, you have to manage all of that somehow. Maybe with a distributed transaction coordinator, maybe by trusting the operator, or maybe by dumb luck, for a while. But this is actually an incredibly complex topic. And a lot of that isn’t inherent to the problem itself, but to how we modeled the interaction with the server.

Here is the life cycle of an order:

  • POST /orders: Creates a new order – returns the new order id
  • ** POST /orders/{order_id}/items: Adds / removes items on an existing order
  • ** POST /orders/{order_id}/submit: Submits all pending order items to the kitchen
  • POST /orders/{order_id}/bill: Closes the order and computes the total charge
  • POST /payments/{order_id}: Handles the actual payment (or payments)

I have marked with ** the two endpoints that may be called multiple times. Everything else can only be called once.

Consider the transactional behavior around this sort of interaction. Adding / removing items from the order can be done concurrently. But submitting the pending items to the kitchen is a boundary: a concurrent item addition would either be included (if it happened before the submission) or not (in which case it simply stays in the pending items for the next submission).

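Here is a minimal in-memory sketch of that boundary (the types and names are mine, not from any real system):

using System.Collections.Generic;

public sealed record OrderItem(string Name, decimal Price);

public sealed class Order
{
    private readonly object _lock = new();
    private readonly List<OrderItem> _pending = new();
    private readonly List<OrderItem> _submitted = new();

    // diners may add items concurrently
    public void AddItem(OrderItem item)
    {
        lock (_lock)
        {
            _pending.Add(item);
        }
    }

    // the submission boundary: everything pending goes to the kitchen,
    // anything added afterwards waits for the next submission
    public IReadOnlyList<OrderItem> Submit()
    {
        lock (_lock)
        {
            var batch = _pending.ToArray();
            _pending.Clear();
            _submitted.AddRange(batch);
            return batch;
        }
    }
}
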
We are also not going to make any determination on the offers / options that were selected by the diners until they actually move to the payment portion. Even the payment itself is handled via two interactions. First, we ask to get the bill for the order. This is the point when we’ll compute the order and figure out what bundles, discounts, etc. we have. The result of that call is the final tally. Second, we have the call to actually handle the payment. Note that this is one call, and the idea is that its content is going to be something like the following:

{
  "order_id": "789",
  "total": 30.0,
  "payments": [
    {
      "amount": 15.0,
      "payment_method": "credit_card",
      "card_number": "****-****-****-3456",
      "expiration_date": "12/22",
      "cvv": "123"
    },
    {
      "amount": 10.0,
      "payment_method": "cash"
    },
    {
      "amount": 5.0,
      "payment_method": "credit_card",
      "card_number": "****-****-****-5678",
      "expiration_date": "12/23",
      "cvv": "456"
    }
  ]
}

The idea is that by submitting it all at once, we are removing a lot of complexity from the backend. We don’t need to worry about complex interactions, race conditions, etc. We can deal with just the issue of handling the payment, which is complicated enough on its own, no need to borrow trouble.

Consider the case that the second credit card fails the charge. What do we do then? We already charged the previous one, and we don’t want to issue a refund, naturally. The result here is a partial error, meaning that there will be a secondary submission to handle the remaining payment.

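To make that concrete, here is a hypothetical sketch of the handler behind the single payments call, reporting a partial error rather than rolling back charges that already went through (TryCharge stands in for the real payment gateway):

using System.Collections.Generic;

public sealed record Payment(decimal Amount, string PaymentMethod);

public sealed record PaymentResult(decimal AmountPaid, decimal Remaining, IReadOnlyList<string> Declined);

public static class PaymentsHandler
{
    public static PaymentResult ProcessPayments(decimal total, IEnumerable<Payment> payments)
    {
        decimal paid = 0;
        var declined = new List<string>();
        foreach (var payment in payments)
        {
            if (TryCharge(payment))
                paid += payment.Amount; // successful charges are kept, no refunds
            else
                declined.Add(payment.PaymentMethod);
        }
        // a non-zero remainder tells the client to make a secondary submission
        return new PaymentResult(paid, total - paid, declined);
    }

    private static bool TryCharge(Payment payment) => true; // stand-in for the gateway call
}
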
From an architectural perspective, it makes the system a lot simpler to deal with, since you have well-defined scopes. I probably made it more complex than I should have, to be honest. We can simply make the entire process serial and forbid actual concurrency throughout the process. If we are dealing with humans, that is easy enough, since the latencies involved are short enough that they won’t be noticed. But I wanted to add the bit about making a part of the process fully concurrent, to deal with the more complex scenarios.

In truth, we haven’t made a big change in the system; we simply restructured the set of calls and the way you interact with the backend. But the end result is that the amount of code and complexity you have to juggle for your infrastructure needs is far smaller. On real-world systems, that also has the impact of reducing your latencies, because you are aggregating multiple operations and submitting them as a single shot. The backend will also have an easier time, because you don’t need complex transaction coordination or distributed locking.

It is a simple change, on its face, but it has profound implications.

time to read 2 min | 399 words

Let’s assume that you want to make a remote call to another server. Your code looks something like this:

var response = await httpClient.GetAsync("https://api.myservice.app/v1/create-snap", cancellationTokenSource.Token);

This is simple, and it works, until you realize that you have a problem. By default, this request will time out in 100 seconds. You can set a shorter timeout using the HttpClient.Timeout property, but that will lead to other problems.

The problem is that internally, inside HttpClient, if you are using a Timeout, it will call CancellationTokenSource.CancelAfter(). That is... what we want to do, no?

Well, in theory, but there is a problem with this approach. Let’s take a look at how this actually works, shall we?

It ends up setting up a Timer instance, as you can see in the code. The problem is that this will modify a global value (well, one of them; by default there are N timers in the process, where N is the number of CPUs on the machine).

What that means is that in order to register a timeout, you need to take a lock. If you have a high concurrency situation, setting up the timeouts may be incredibly expensive.

Given that the timeout is usually a fixed value, within RavenDB we solved that in a different manner. We set up a set of timers that go off periodically and use those instead: we can request a task that will be completed on the next timeout duration. This way, we won’t be contending on the global locks, and we’ll have a single value to set when the timeout happens.

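A minimal sketch of the idea (this is not RavenDB’s actual implementation, just the shape of it): a single coarse timer ticks once a second and completes every registered wait whose deadline has passed.

using System;
using System.Collections.Concurrent;
using System.Threading;
using System.Threading.Tasks;

public static class TimeoutManager
{
    private static readonly ConcurrentQueue<(DateTime Due, TaskCompletionSource Tcs)> Waits = new();
    private static readonly Timer SweepTimer = new(OnTick, null, 1000, 1000);

    public static Task WaitFor(TimeSpan duration, CancellationToken token = default)
    {
        var tcs = new TaskCompletionSource(TaskCreationOptions.RunContinuationsAsynchronously);
        token.Register(() => tcs.TrySetCanceled(token));
        // registering is just an enqueue - no contended global lock
        Waits.Enqueue((DateTime.UtcNow + duration, tcs));
        return tcs.Task;
    }

    private static void OnTick(object? state)
    {
        var now = DateTime.UtcNow;
        // sweep only the entries that were present when the tick started
        for (var i = Waits.Count; i > 0; i--)
        {
            if (!Waits.TryDequeue(out var wait))
                break;
            if (wait.Due <= now)
                wait.Tcs.TrySetResult();
            else
                Waits.Enqueue(wait); // not due yet, put it back
        }
    }
}
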
The code we use ends up being somewhat more complex:

var sendTask = httpClient.GetAsync("https://api.myservice.app/v1/create-snap", cancellationTokenSource.Token);
var waitTask = TimeoutManager.WaitFor(TimeSpan.FromSeconds(15), cancellationTokenSource.Token);

// await whichever finishes first, without blocking a thread
if (await Task.WhenAny(sendTask, waitTask) == waitTask)
{
    throw new TimeoutException("The request to the service timed out.");
}

var response = await sendTask;

Because we aren't spending a lot of time doing setup for a (rare) event, we can complete things a lot faster.

I don't like this approach, to be honest. I would rather have a better system in place, but it is a good workaround for a serious problem when you are dealing with high-performance systems.

You can see how we implemented the TimeoutManager inside RavenDB. The goal was to get roughly the same time frame, but we are absolutely fine with doing roughly the right thing rather than paying the full cost of doing this exactly as needed. For our scenario, roughly is more than accurate enough.

time to read 5 min | 962 words

When I started using GitHub Copilot, I was quite amazed at how good it was. Sessions using ChatGPT can be jaw-dropping in terms of the generated content.

The immediate reaction from many people is to consider what the impact of that would be on the humans who currently fill those roles. Surely, if we can get a machine to do the task of a human, we can all benefit (except for the person made redundant, I guess).

I had a long discussion on the topic recently and I think that it is a good topic for a blog post, given the current interest in the subject matter.

The history of replacing manual labor with automated machines goes back as far as you’d like to stretch it. I wouldn’t go back to the horse & plow, but certainly the Luddites and their arguments about the impact of machinery on the populace will sound familiar to anyone today.

The standard answer is that some professions will go away, but new ones will pop up instead. The classic example is the ice salesman. That used to be an actual occupation: a guy on a horse-drawn carriage would sell you ice to keep your food cold. You can assume that this profession is no longer relevant, of course.

The difference here is that we now have computer programs and AI taking over what was classically thought impossible. You can ask Dall-E or Stable Diffusion for an image and in a few seconds, you’ll have a beautiful render that may actually match what you requested.

You can start writing code with GitHub Copilot and it will predict what you want to do to an extent that is absolutely awe-inspiring.

So what is the role of the human in all of this? If I can ask ChatGPT or Copilot to write me an email validation function, what do I need a developer for?

Here is ChatGPT’s output:

[screenshot: ChatGPT’s email validation function]

And here is Copilot’s output:

[screenshot: Copilot’s email validation function]

I would rate the MailAddress version better, since I know that you can’t actually validate emails via a regex. I tried to take this further and asked ChatGPT about the regex, and got:

[screenshot: ChatGPT’s explanation of the regex]

ChatGPT is confused, and the answer doesn’t make any sort of sense.

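For reference, a MailAddress-based check along the lines of what Copilot produced (my sketch, not the exact output from the screenshot) might be:

using System;
using System.Net.Mail;

public static class EmailValidator
{
    public static bool IsValidEmail(string email)
    {
        if (string.IsNullOrWhiteSpace(email))
            return false;
        try
        {
            // MailAddress does the parsing; the round-trip comparison
            // rejects inputs with trailing display-name junk
            return new MailAddress(email).Address == email;
        }
        catch (FormatException)
        {
            return false;
        }
    }
}
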
Most of the time spent on “research” for this post was waiting for ChatGPT to actually produce a result, but this post isn’t about nitpicking.

The whole premise around “machines will make us redundant” is that the sole role of a developer is taking a low-level requirement such as email validation and producing the code to match.

Writing such low-hanging fruit is not your job. For that matter, a single function is not your job. Nor is writing code a significant portion of the job. A developer needs to be able to build the system architecture and design the interaction between components and the overall system.

They need to make sure that the system is performant, meets the non-functional requirements, etc. A developer will spend a lot more time reading code than writing it.

Here is a more realistic example of using ChatGPT, asking it to write to a file using a write-ahead log. I am both amazed by the quality of the answer and find myself unable to use even a bit of the code in there. The scary thing is that this code looks correct at a glance. It is wrong, dangerously so, but you’ll need to be a subject matter expert to know that. In this case, the provided solution doesn’t meet the requirements, has security issues, and doesn’t actually work.

On the other hand, I asked it about password hashing and I would give this answer a good mark.

I believe it will get better over time, but the overall context matters. We have a lot of experience in trying to get the secretary to write code. There have been many tools trying to do that, going all the way back to CASE in the 80s.

There used to be a profession called “computer”, where you could hire a person to compute math for you. Pocket calculators didn’t invalidate them, and Excel didn’t make them redundant. They are now called accountants or data scientists instead, and they use the new tools (admittedly, calling calculators or Excel new feels very strange) to boost their productivity enormously.

Developing with something like Copilot is a far easier task, since I can usually just tab-complete a lot of the routine details. But having a tool to do some part of the job doesn’t mean that there is no work to be done. It means that a developer can speed up the routine bits and get to grips faster / more easily with the other challenges they have, such as figuring out why the system doesn’t do what it needs to, improving existing behavior, etc.

Here is a great way to use ChatGPT as part of your work, ask it to optimize a function. For this scenario, it did a great job. For more complex scenarios? There is too much context to express.

My final conclusion is that this is a really awesome tool to assist you. It can have a massive impact on productivity, especially for people working in an area that they aren’t familiar with. The downside is that sometimes it will generate junk; then again, sometimes real people do that as well.

The next few years are going to be really interesting, since it provides a whole new level of capability for the industry at large, but I don’t think that it would shake the reality on the ground.

time to read 2 min | 393 words

A customer reported a scenario where RavenDB was using stupendous amounts of memory, on the order of tens of GB, on a system that didn’t have that much load.

Our first suspicion was that this was an issue with reading the metrics, since RavenDB will try to keep as much of the data in memory as it can, which sometimes leads users to worry. I spoke about this at length in the past.

In this case, that wasn’t the case. We were able to drill down into the exact cause of the memory usage and we found out that RavenDB was using an abnormally high amount of memory. The question was why that was, exactly.

We looked into the common operations on the server, and we found a suspicious query that looked something like this:

from index 'Sales/Actions'
where endsWith(WorkflowStage, '/Final')

The endsWith query was suspicious, so we looked into that further. In general, endsWith requires us to scan all the unique terms for a particular field, but in most cases, there aren’t that many unique values for a field. In this case, however, that wasn’t the case, here are some of the values for WorkflowStage:

  • Workflows/3a1af12a-b5d2-4c96-9348-177ebaacab6c/Step-2
  • Workflows/6aacc86c-2f28-4b8b-8dee-1024314d5add/Final

In total, there were about 250 million sales in the database, each one of them with a unique WorkflowStage value.

What does this mean, in terms of RavenDB query execution? Well, the fields are indexed, but we need to effectively do:

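Roughly speaking (GetAllUniqueTerms and GetDocumentsFor are hypothetical helpers standing in for the indexing internals):

var results = new List<Document>();
// scan every unique term indexed for the field
foreach (var term in GetAllUniqueTerms("WorkflowStage")) // ~250 million terms
{
    if (term.EndsWith("/Final"))
        results.AddRange(GetDocumentsFor("WorkflowStage", term));
}
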
This isn’t the actual code, but it will show you what is going on.

In other words, in order to process this query, we have to scan (and materialize) all 250 million unique terms for this field. Obviously that is going to consume a lot of memory.

But what is the solution to that? Instead of doing an expensive endsWith query, we can move the computation from the query time to the index time.

In other words, instead of indexing the WorkflowStage field as is, we’ll extract the information we want from it. The index would have one of these:

IsFinalWorkFlowStage = doc.WorkflowStage.EndsWith("/Final"),

WorkflowStagePostfix = doc.WorkflowStage.Split('/').Last()

The first one checks whether the value is final or not, while the second just gets the postfix (one of hopefully a few) for the field. We can then query using equality instead of endsWith, leading to far better performance and greatly reduced memory usage, since we don’t need to materialize any values during the query.

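With either of those index fields in place, the query becomes a cheap equality check, along these lines:

from index 'Sales/Actions'
where IsFinalWorkFlowStage = true

or, using the second field:

from index 'Sales/Actions'
where WorkflowStagePostfix = 'Final'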