Ayende @ Rahien


Paranoid decisions and OMG customers

time to read 3 min | 427 words

I used to be a consultant for a long while, and that meant that I worked on a lot of customer projects. That led to me seeing and acting in some really strange ways.

Sometimes you go into a codebase and you can’t really believe what you see there. I think that this is similar to how an archeologist feels, seeing just remnants of something and having to deduce what forces drove the people who built it. In some cases, what looks like bad code is actually a reaction to a bad policy that people are trying to work around.

I think that the strangest of these cases was when I was working for a customer that refused to let external consultants use their internal source control system. Apparently, they had sensitive stuff there that they couldn’t isolate or something like that. They were using either Team Foundation Server or Visual Source Safe and I didn’t really want to use that source control anyway, so I didn’t push. I did worry about source control, so we had a shared directory being used as a Subversion repository; this was over a decade ago, mind.

So far, so good, and nothing really interesting to talk about. What killed me was that their operations team flat out refused to back up the Subversion folder. That folder was hosted on a shared server that belonged to the consulting company (but resided at the customer site), and they were unwilling to either back up a “foreign” computer or to provide us with a shared space to host Subversion that they would back up.

For a while, I would back up the Subversion repository every few days to my iPod, then take a copy of the entire source code history home with me. That wasn’t sustainable, and I was deeply concerned about the future of the project over time, so I also added a twist. As part of the build process, we packed the entire source directory of the codebase as an embedded resource into the binary. That way, if the code was ever lost, which I considered to be a real possibility, I would have a way to recover it all back.

After we handed off the project, I believe they moved the source to their own repository, so we never actually needed that, but I slept a lot better knowing that I had a second string to my bow.

What is your craziest story?

Managing a multi version project

time to read 4 min | 707 words


As I’m writing this, we have the following branches in the main repository of RavenDB. Looking at their history, we have:

Branch | Last Commit  | Number of commits this year
v1.0   | Feb 3, 2013  | 0
v2.0   | Oct 14, 2016 | 0
v2.5   | Oct 18, 2018 | 14
v3.0   | Aug 14, 2018 | 10
v3.5   | Oct 11, 2018 | 45
v4.0   | Oct 18, 2018 | 2,270
v4.1   | Oct 18, 2018 | 3,214
v4.2   | Oct 18, 2018 | 95

The numbers are actually really interesting. Branches v1.0 and v2.0 are legacy and no longer supported. Branch v2.5 is also legacy, but we have a few customers with support contracts that are still using it, so there are still minor bug fixes going on there occasionally. Most of the people on the 3.x line are using 3.5, which is now in maintenance mode, so you can see that there is very little work on the v3.0 branch and a bit of ongoing bug fixing for customers.

The bulk of the work is on the 4.x line. We released v4.0 in February of this year, and then switched to working on v4.1, which was released a couple of months ago. We actively started working on v4.2 this month. We are going to close down the v4.0 branch for new features at the end of this month and move it, too, to maintenance mode.

In practical terms, we very rarely need to do cross-major-version work, but we do have a lot of parallel work across the previous, current, and next versions. In other words, the situation right now is that a bug fix has to go to at least v4.1 and v4.2, and usually to v4.0 as well. We have been trying several different ways to handle this.

For v4.0 and v4.1 work, which went on in parallel for most of this year, we had the developers submit two pull requests for their changes, one for v4.0 and one for v4.1. This increased the amount of work each change took, but the cost was usually just a few minutes at PR submission time, since we could usually just cherry-pick the relevant changes and be done with it. The reason we did it this way was to avoid big merges as we moved work between actively developed branches. Those would require having someone dedicated just to handle them, and it was easier to do it inline rather than in a big-bang fashion.

For the v4.2 branch, we are experimenting with something else. Most of the work is going on in the v4.1 branch at this point, mostly minor features and updates, while the v4.2 branch is experimenting with a much larger scope of changes. It doesn’t make sense to ask the team to send three PRs, and we are going to close down v4.0 this month anyway. What we are currently doing is designating a person who is in charge of merging the v4.1 changes into v4.2 on a regular basis. So far, the branches are still pretty close and there hasn’t been a large amount of changes. Depending on how it goes, we’ll either keep doing the dual PRs once v4.0 is retired from active status or see if the regular merges can keep going.

For feature branches, the situation is more straightforward. We typically ask the owner of the feature to rebase on a regular basis on top of whatever the baseline is, and the responsibility to do that is on them.

A long feature branch for us can last up to a month or so, but we had a few that took 3 months when it was a big change. I tend to really dislike those and we are trying to get them down to much shorter timeframes. Most of the work doesn’t happen in feature branches; we’ll accept partial solutions (if they don’t impact anything else), and we tend to collaborate a lot more closely on code that is already merged than on code in independent branches.

Memory management goop in Windows & Linux

time to read 2 min | 323 words

Regardless of the operating system you use, you are going to get roughly the same services from each of them. In particular: process and memory isolation, managing the hardware, etc. It can sometimes be really interesting to see the difference between the operating systems’ approaches to solving the same problem. Case in point, how both Windows and Linux manage memory. Both of them run on the same hardware and do roughly the same thing. But they have very different styles, and this ends up having profound implications for the applications using them.

Consider what appears to be a very simple question: what do I have in my RAM? Linux keeps track of the Resident Set Size on a per-mapping basis, which means that we are able to figure out how much of a mmap file is actually in memory. Furthermore, we can figure out how much of the mmap data is clean, which means that it is easily discardable, and how much is dirty and needs to be written to disk. Linux exposes this information via /proc/[pid]/smaps.
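For example, here is a minimal sketch (not RavenDB code) of what using that looks like: just sum up the Rss / clean / dirty counters that the kernel already maintains per mapping in /proc/self/smaps.

```cpp
#include <fstream>
#include <iostream>
#include <sstream>
#include <string>

int main() {
    std::ifstream smaps("/proc/self/smaps");
    long rss_kb = 0, clean_kb = 0, dirty_kb = 0;
    std::string line;
    while (std::getline(smaps, line)) {
        std::istringstream fields(line);
        std::string key;
        long value = 0;
        fields >> key >> value;  // e.g. "Rss:    1234 kB"
        if (key == "Rss:") rss_kb += value;
        else if (key == "Shared_Clean:" || key == "Private_Clean:") clean_kb += value;
        else if (key == "Shared_Dirty:" || key == "Private_Dirty:") dirty_kb += value;
    }
    std::cout << "resident: " << rss_kb << " kB, clean: " << clean_kb
              << " kB, dirty: " << dirty_kb << " kB\n";
}
```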

On the other hand, Windows doesn’t seem to bother to do this tracking. You can get this information, but you need to ask for it for each page individually. This means that it isn’t feasible to check what percentage of the system memory is clean (mmap pages that haven’t been modified and can be cheaply discarded). Windows exposes this via the QueryWorkingSetEx method.
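A comparable sketch on Windows (again, not RavenDB code) has to hand QueryWorkingSetEx an array of addresses, one per page, and inspect each page’s attributes itself:

```cpp
#include <windows.h>
#include <psapi.h>   // link against psapi.lib
#include <iostream>
#include <vector>

int main() {
    const SIZE_T page_size = 4096;
    // Some region we want to inspect; here just a touched heap allocation.
    std::vector<char> region(64 * page_size, 1);

    // One entry per page; VirtualAddress is the input, the rest is filled in.
    std::vector<PSAPI_WORKING_SET_EX_INFORMATION> info(64);
    for (size_t i = 0; i < info.size(); i++)
        info[i].VirtualAddress = region.data() + i * page_size;

    if (!QueryWorkingSetEx(GetCurrentProcess(), info.data(),
                           (DWORD)(info.size() * sizeof(info[0])))) {
        std::cerr << "QueryWorkingSetEx failed: " << GetLastError() << "\n";
        return 1;
    }

    size_t resident = 0;
    for (const auto& page : info)
        if (page.VirtualAttributes.Valid)  // page is currently in the working set
            resident++;

    std::cout << resident << " of " << info.size() << " pages resident\n";
}
```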

As a result, we have to be more conservative on Windows when the system reports high memory usage. We know that our usage pattern means that a high amount of memory in use (coming from clean mmap pages) is fine. It is a small detail, but it has caused us to have to jump through several hoops when we are running under load. I guess that Windows doesn’t need this information, so it isn’t exposed, while on Linux it seems to be used by plenty of callers.

The mental weight of open pull requests

time to read 1 min | 108 words

I just merged two PRs into RavenDB, and for the first time in a while, I got this beautiful number:

[screenshot: zero open pull requests]

For the past few months, we have been working on several long running features, graph queries being the most obvious example. We are now at a stage where we are ready to pull all this work together, which means that all the long running feature branches (and the discussions about them in the PRs) are merged into the next release branch.

And for a while, I can luxuriate in that wonderful feeling.

Travel schedule, 2019

time to read 2 min | 220 words

I’m going to be traveling extensively in the first part of 2019.

In January, I’m going to be in Sandusky, Ohio for CodeMash, where I’ll be speaking about Extreme Performance Architecture: how we were able to refactor RavenDB to get an insane level of performance. I’m going to be covering both low level details (there might be some assembly code) and the high level architecture that makes it all possible.

In February, I’ll be at the RavenDB booth at the O’Reilly Software Architecture conference in New York. You should come and see what goodies we’ll have to show off at that stage. If you are in healthcare, we’ll also be at the HIMMS conference in Orlando, where we’ll be showing off some of the ways RavenDB makes working on healthcare software easier.

I’m also going to have an event in New York in February (details to follow) in which I’ll speak about a grown-up database: how RavenDB is able to run without a dedicated admin and what kind of behaviors you can expect from it.

In March, I’m going to be visiting the Gartner Data & Analytics conference in Orlando; you can ping me if you want to sit down and have lunch there.

RavenDB C++ client: Laying the ground work

time to read 2 min | 371 words

The core concept underlying the RavenDB client API is the notion of Unit of Work. This provides core features such as change tracking and an identity map. In all our previous clients, that was pretty easy to deal with, because the GC solved memory ownership and reflection gave us a lot of stuff basically for free.

Right now, what I want to achieve is the following:
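Roughly something along these lines; this is a sketch with illustrative names (the User type and the store() / save_changes() calls on a session), not the final API:

```cpp
#include <memory>
#include <string>

struct User { std::string name; };   // illustrative entity type

// 'Session' stands in for whatever session type the client ends up exposing.
template <typename Session>
void example(Session& session) {
    auto user = std::make_shared<User>();
    user->name = "Oren";
    session.store(user, "users/1");

    // Modified after store() but before save_changes(); the session shares
    // ownership of the same User, so the saved document should say "Oren Eini".
    user->name = "Oren Eini";
    session.save_changes();
}
```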

It seems pretty simple, right? Because both the session and the caller code are going to share ownership of the passed User. Notice that we modify the user after we call store() but before save_changes(). We expect to see the modification in the document that is being generated.

The memory ownership is handled here by using shared_ptr as the underlying way in which we accept and return data in the session. Now, let’s see how we actually deal with serialization, shall we? I have chosen to use nlohmann’s json for the project, which means that the provided API is quite nice. As a consumer, you’ll need to write your own JSON serialization code, but it is fairly obvious how to do so, check this out:
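Something like the following, using nlohmann::json’s customization points; the User type and its fields here are illustrative:

```cpp
#include <string>
#include <nlohmann/json.hpp>

struct User {
    std::string id;
    std::string name;
    int age = 0;
};

// nlohmann::json picks these up via argument-dependent lookup.
void to_json(nlohmann::json& j, const User& u) {
    j = nlohmann::json{{"Id", u.id}, {"Name", u.name}, {"Age", u.age}};
}

void from_json(const nlohmann::json& j, User& u) {
    j.at("Id").get_to(u.id);
    j.at("Name").get_to(u.name);
    j.at("Age").get_to(u.age);
}
```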

Given that C++ doesn’t have reflection, I think that this represents a really nice handling of the issue. Now, how does this play with everything else? Here is what the skeleton of the session looks like:
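A minimal sketch, assuming the names discussed below (IEntityDetails, EntityDetails) and leaving out everything except the type-erasure trick itself:

```cpp
#include <memory>
#include <string>
#include <utility>
#include <vector>
#include <nlohmann/json.hpp>

// The non-generic interface lets the session hold entities of different
// concrete types in a single collection.
struct IEntityDetails {
    virtual ~IEntityDetails() = default;
    virtual std::string id() const = 0;
    virtual nlohmann::json to_document() const = 0;
};

// The generic implementation knows the concrete type and uses the to_json
// function the user defined to turn the entity into a JSON document.
template <typename T>
class EntityDetails : public IEntityDetails {
public:
    EntityDetails(std::string id, std::shared_ptr<T> entity)
        : id_(std::move(id)), entity_(std::move(entity)) {}

    std::string id() const override { return id_; }

    nlohmann::json to_document() const override {
        return nlohmann::json(*entity_);   // calls to_json(json&, const T&)
    }

private:
    std::string id_;
    std::shared_ptr<T> entity_;
};

class document_session {
public:
    // The generic store() captures the concrete type; from here on, the
    // session only deals with the non-generic interface.
    template <typename T>
    void store(std::shared_ptr<T> entity, std::string id) {
        entities_.push_back(
            std::make_shared<EntityDetails<T>>(std::move(id), std::move(entity)));
    }

    void save_changes();   // serializes entities_ and sends them to the server

private:
    // shared_ptr to the abstract class, since the abstract class has no defined size.
    std::vector<std::shared_ptr<IEntityDetails>> entities_;
};
```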

There is a whole bunch of stuff that is going on here that we need to deal with.

First, we have the IEntityDetails interface, which is non-generic. It is implemented by the generic class EntityDetails, which has the actual type that we are using and can then use the JSON serialization we defined to convert the entity to JSON. The rest are just details: we need to use a vector of shared_ptr instead of the abstract class, because the abstract class has no defined size.

The generic store() method just captures the generic type and stores it, and the rest of the code can work with the non-generic interface.

I’m not sure how idiomatic this code is, or how performant, but at least as a proof of concept, it works to show that we can get a really good interface to our users in C++.

Abusing system flexibility to avoid paying collect tolls

time to read 3 min | 512 words

I’m going to feel like an old man for this post, but if you were born after 1995, it is likely that you have no idea what I’m talking about here, crazy as this sounds to me.

Before there was a phone in every pocket, there were land lines. A land line is like today’s phone, but much larger; you could only do voice calls, and if you wanted to screen your calls you needed to buy another appliance. If you watch the first few seasons of Friends, you’ll see how important a detail that can be. If you were out of the house or office and needed to place a call, you could use something called a public phone booth or a pay phone.

Sadly, the easiest way I can convey what this was is to invoke the Tardis: a small booth in which you had a public access phone. Because phone calls used to cost a lot, these phones had a way to drop some coins or tokens into the phone to pay for the call.

As a child, I didn’t have a wallet and still needed to occasionally make calls. Being stuck without cash at hand wasn’t such a strange thing, so there was another way to place the call. You could reverse the charge: instead of the person placing the call paying for it, you could call collect. In that case, the person answering the call would be paying for it. Naturally, since money is involved, you need the other party to accept the charge before actually charging them.

At some point in time, you called a special number and told the operator what number you wanted to make a collect call to. The operator would ring that number and ask for permission to connect the call and charge the receiver. I think that the rate for a collect call was significantly higher than for a normal call, so you wouldn’t normally do that.

As part of the system automation, the phone company replaced the manual operator collect call with an automated system. You would record a short message, which would be played to the other party. If they wanted to accept the call (and the charge), they could press 1 on the phone, or disconnect to avoid the charge.

As a kid, I quickly learned that instead of telling the other party who is calling and why (so they would accept the call), I could just tell them what my phone number is. In this way, they would write down the number, refuse the call and then call me back. That would avoid the collect toll charge.

I remember that at some point the phone company made the length of the collect hello message really short, but I got around that by speaking really fast (or sometimes by making two separate calls). I remember having to practice saying the phone number a few times to get it done in the right time.

System flexibility

time to read 3 min | 540 words

One of the absolutely most challenging things in designing software systems is that there is really no such thing as a perfect world. A business requirement that is set in stone turns out to be quite malleable. That can cause quite a big hassle for the development team, as they try to anticipate and address all aspects of change ahead of time.

A better alternative would be to not attempt to address all such issues in software, but in wetware.

I recently ordered lunch to go at a restaurant. I had already paid and was waiting to get my order when the clerk double checked with me that the order was to go. The order had been entered as if I was going to eat at the location, instead of taking the food away. After I confirmed that I wanted to take my order to go, I watched how the clerk fixed things up. She went to the kitchen window and shouted, “that last order, make it to go”. The kitchen staff double checked which order it was, then moved on with their tasks, eventually handing me a baggie of tasty food to go.

On the way back, I kept wondering in my head how a software system would handle something like this. You’ll need to shred the idea of “an order is immutable once it is paid”, for example. Or you’ll need to add a side channel of additional instructions to the kitchen, etc.

Or, you can ignore the whole thing completely and shout at the cook. In software, that might mean that we’ll keep ourselves agile and provide a “Manual” mode in which a user can enter free text / instructions for the next person on the line to process this.

There are some cases where this would be a bad idea, but mostly these involve not trusting your users to do their jobs. Sometimes, it is literally the software’s job to force the users to follow a specific path (usually because management decided that this must be so). However, a really important aspect of design is that it isn’t rigid; it allows the users to do their work instead of working around the software. Part of that involves designing specific places where users can do stuff that you didn’t think they would need.

For example, in an order, having a “Notes” text field that is editable even after the order is placed, which can be used for further communication. The idea is that you spend just a little bit of time considering whatever scenarios you didn’t cover and try to give the user something (just the bare minimum, maybe even below that), just to allow them to get by. The idea isn’t to provide a solution, but to get something that gives the user a choice and that will raise enough feedback so you can plug this into the next iteration of your product.
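As a sketch of the idea (the Order type and its fields are illustrative, not taken from any real system): the structured parts of the order are locked once it is paid, but the free-text notes channel stays open for the people handling it.

```cpp
#include <string>
#include <vector>

class Order {
public:
    void mark_paid() { paid_ = true; }

    // Structured changes are rejected after payment...
    bool set_delivery_type(const std::string& type) {
        if (paid_) return false;
        delivery_type_ = type;
        return true;
    }

    // ...but notes can always be appended, and are kept for later review.
    void add_note(const std::string& author, const std::string& text) {
        notes_.push_back(author + ": " + text);
    }

private:
    bool paid_ = false;
    std::string delivery_type_ = "eat-in";
    std::vector<std::string> notes_;
};
```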

Not having anything may mean that the users will solve their own problem using something else (email, or even just talking directly to one another) and we can’t have that*, obviously.

* It may sound silly, but in some cases, you literally can’t have that. Certain actions need to be logged, authorized, and tracked appropriately for many purposes.

The perils of full system resource utilization

time to read 3 min | 471 words

The following quotes (or something very similar) came from our interactions with customers:

“We paid a lot of money for this hardware, why isn’t your database making full use of it?”

“The machine is peaking at 100% CPU, the sky is falling, help, NOW!”

This is a problem, because I can empathize with both sides. On the one hand, having just put a five or six figure sum into new hardware, it can be depressing to see it “going to waste”. On the other hand, seeing the system under high load gives you that sinking feeling that the boat is going to overturn at any moment and production will go down.

Balancing resource consumption is a really hard problem, mostly because we don’t have any control over our work intake. We can’t control how many requests we accept nor do we control what kind of work is being asked of us. Actually, that isn’t true. We could control that, but in most cases, that is a false distinction.

At some point, RavenDB had a limit on the maximum number of concurrent requests, and users have hit that in the past. This resulted in angry calls from customers about RavenDB refusing requests. The fact that we did that to maintain the overall health of the system was immaterial. Refusing requests meant that the system (or some portion of it) was down. In those cases, it was actually better, from the customer’s perspective, for the whole thing to slow down a bit, as long as there were no errors.

Inside RavenDB, we attempt to manage our CPU consumption using separation of concerns. First, we have the processing of requests. The assumption is that such requests end up being waited on by an actual human, directly or indirectly, so we process them first, prioritizing them above almost everything else. The only thing that has higher priority is the cluster health and monitoring system, which ensures that all nodes are up, running, and in the same state.

As it turns out, RavenDB has a lot of additional processes internally that can be given a lower priority under load. For example, indexing, which RavenDB runs in the background, is something whose latency we can increase to give more resources to request processing.
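To make that concrete, here is a deliberately simplified sketch (not how RavenDB actually implements this): the background indexing loop backs off while requests are in flight, trading indexing latency for request throughput.

```cpp
#include <atomic>
#include <chrono>
#include <thread>

std::atomic<int> inflight_requests{0};

void handle_request() {
    inflight_requests++;
    // ... serve the request ...
    inflight_requests--;
}

void indexing_loop() {
    for (;;) {
        while (inflight_requests.load() > 0) {
            // Yield the CPU to request processing; indexing latency grows,
            // but user-facing work is not starved.
            std::this_thread::sleep_for(std::chrono::milliseconds(5));
        }
        // ... index the next batch of documents ...
    }
}
```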

We have a lot of experience in balancing the overall needs, and I’m still not sure that I have a good answer here. The reason for this post is that I just analyzed a dump file where it looked like requests were waiting for indexing to complete, but they were actually starving the indexes of the CPU time that they needed to actually run. The system progressed, but not fast enough for the user not to notice.

Actually, that is the primary criterion that we use. If the system is slow, but no one notices, the system ain’t slow.

The design & challenges of a RavenDB C++ client

time to read 3 min | 468 words

When I wrote the first version of RavenDB, I was coming off about six years of intensive work on NHibernate. I wanted the same level of convenience that I had with a world class OR/M, with none of the relational constraints (pun intended).

Given that I was working in a managed language, features such as change tracking, unit of work, etc. were pretty easy to provide. Since then, we created clients for: C#, Java, Python, Node.JS, Ruby and Go. A common feature of all these languages is that they all have automatic memory management. Go, in particular, has been interesting, because while it deals with explicit pointers, there is no need to deal with manually freeing memory.

We are now looking at what it would take to bring the same level of experience to a C++ client. For example, here is about the simplest CRUD scenario that I can think of:

This code isn’t showing anything special, until you realize that when you want to translate it to C++, you’ll need to take into account the explicit memory ownership. Another issue to deal with is how we can implement seamless integration between business objects and JSON documents.

I looked at how this is handled in other similar databases, and the result seems to be: pretty badly.

At least, when I compare it to how much higher the level of the code is in C++. Now, it is possible that C++ developers like working at this level. And certainly, the RavenDB client APIs actually have user exposed layers that are similar to this, but this is something that you’ll usually not need. Ideally, I want to be able to give the same level of experience to the C++ client as well.

The issue of JSON serialization actually seems to be well taken care of already. A user will need to define to_json and from_json functions to make this work, but given that C++ has no reflection, that seems reasonable to request. It also gives the user complete control over the serialization / deserialization process and avoids the process of “customizing” the JSON serialization, which you sometimes have to do.

The issue of memory ownership, though, is a bit more complex. I was thinking about exposing this via the following interface:
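Something along these lines; the names and signatures here are a sketch of the direction, not a finished proposal:

```cpp
#include <memory>
#include <string>

// The session accepts and returns shared_ptr only, so an entity it hands out
// (or was handed) can safely outlive the session itself.
class session {
public:
    template <typename T>
    void store(std::shared_ptr<T> entity, const std::string& id);

    template <typename T>
    std::shared_ptr<T> load(const std::string& id);

    void save_changes();
};
```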

The idea is that the RavenDB C++ client will only deal with shared_ptr, so that we can accept that the entities we manage may live longer than the lifetime of the session.

I’m no longer able to consider myself a C++ developer, and the developer we have starting on the C++ stuff is currently busy learning RavenDB itself, so I thought this would be a good time to ask for feedback.

Both on the kind of interface that you would like to see for a C++ client and on whether this approach is going to work.
