Oren Eini

CEO of RavenDB

a NoSQL Open Source Document Database

time to read 3 min | 568 words

Our test process occasionally crashed with an access violation exception. We consider these to be Priority 0 bugs, so we had one of the most experienced developers in the office sit on this problem.

Access violation errors are nasty, because they give you very little information about what is going on, and there is typically no real way to recover from them. We have a process to deal with them, though. We know how to set things up so we’ll get a memory dump on error, so the very first thing that we work toward is reproducing the error.

After a fair bit of effort, we managed to get to a point where we could semi-reliably reproduce this error. This means, if you wanna know, that we do “stuff” and get the error in under 15 minutes. That’s the reason we need the best people on these kinds of investigations. Actually getting to the point where this fails is the most complex part of the process.

The goal here is to get to two really important pieces of information:

  • A few memory dumps of the actual crash – these are important to be able to figure out what is going on.
  • A way to actually generate the crash – in a reasonable time frame, mostly because we need to verify that we actually fixed the issue.

After a bunch of work, we were able to look at the dump file and found that the crash originated in Voron’s code. The developer in charge then faulted, because they tried to increase the priority of an issue that was already at Priority 0, and P-2147483648 didn’t quite work out.

We also figured out that this can only occur on 32 bits, which is really interesting. 32 bits is a constrained address space, so it is a great way to run into memory bugs.

We started to look even more closely at this. The problem happened while running memcpy(), and looking at the addresses that were passed to the function, one of them was Voron allocated memory, whose state was just fine. The second value pointed to a MEM_RESERVE portion of memory, which didn’t make sense at all.

Up the call stack we went, to try to figure out what we were doing. Here is where we ended up (hint: the crash happened deep inside the Insert() call).

image

This is test code, mind you, exercising some really obscure part of Voron’s storage behavior. And once we actually looked at the code, it was obvious what the problem was.

We were capturing the addresses of an array in memory, using the fixed statement.

But then we used them outside the fixed block. The array is only pinned for the duration of the fixed block, so once we leave it the GC is free to move the array. If there happened to be a GC between these two lines, and if it happened to move the memory and free the segment, we would access memory that is no longer valid. This would result in an access violation, naturally. I think we were only able to reproduce this in 32 bits because of the tiny address space. In 64 bits, there is a lot less pressure to move the memory, so it remains valid.

Luckily, this is an error only in our tests, so we reduced our DEFCON level to a more reasonable value. The fix was trivial (move the Insert calls into the fixed scope), and we were able to verify that this fixed the issue.

time to read 2 min | 245 words

I’ll be writing a lot more about our RavenDB C++ client, but today I was reviewing some code and I got a reply that made me go: “Ohhhhh! Nice”, and I just had to blog about it.

image

This is pretty much a direct translation of how you’ll write this kind of query in C#, and the output of this is an RQL query that looks like this:

image

The problem is that I know how the C# version works. It uses Reflection to extract the field names from the type, so we can figure out what fields you are interested in. In C++, you don’t have Reflection, so how can this possibly work?

What Alexander did was really nice. The user already has to provide us with the serialization routine for this type (so we can turn the JSON into the types that will be returned). So inside the select_fields() call, he constructs an empty object, serializes it and then uses the field names in the resulting JSON to figure out what fields we want to project from the Users documents.
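To make the trick concrete, here is a minimal sketch of the same idea in Python rather than C++. The `User` type, the `to_json` routine and this `select_fields` function are hypothetical stand-ins, not the RavenDB C++ client API, and the shape of the generated RQL is an assumption.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class User:
    # Hypothetical document type; every field has a default so that an
    # "empty" instance can be constructed without arguments.
    first_name: str = ""
    last_name: str = ""


def to_json(user: User) -> str:
    # Stand-in for the serialization routine the user already supplies.
    return json.dumps(asdict(user))


def select_fields(collection: str, empty_instance, serialize) -> str:
    # Serialize an empty object and read the field names back out of the
    # resulting JSON; no reflection over the type is needed.
    field_names = list(json.loads(serialize(empty_instance)).keys())
    return f"from {collection} select {', '.join(field_names)}"


print(select_fields("Users", User(), to_json))
# prints: from Users select first_name, last_name
```

The only requirement this places on the user is one they already had to meet: the type must be serializable, and it must be possible to construct an empty instance of it.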

It makes perfect sense, it requires no additional work from the user and it gives us a consistent API. It is also something that I would probably never think to do.

time to read 6 min | 1078 words

Unusually for me, I had a bit of a pause in reviewing Sled. As a reminder, Sled is an embedded database engine written in Rust. I last stopped looking at the buffer management, but I still don’t really have a good grasp of what is going on.

The next file is the iterator. It looks like it translates between segments and messages in these segments. Here is the struct:

image

As you can see, the log iterator holds an iterator of segments, and iterating over it looks like it will go over all the messages in the segments in order. Yep, here is the actual work being done:

image

The next() method is fairly straightforward, I found. But I have to point out this:

image

First, the will need call is really interesting. Mostly because you have a pretty obvious way to do conditional compilation that doesn’t really suck. #if is usually much more jarring in the code.

Second, I think that the style of putting really important function calls inside an if results in pretty dense code. Especially since the if is entered only on error. I would have preferred to store the result in a standalone variable, and then check whether it failed.

What I don’t understand is the read_segment call. Inside that method, we have:

image

There are also similar calls on segment trailer. It looks like we have a single file for the data, but stuff that is too large is held externally, in the blob files.

We then get to this guy, which I find to be a really elegant way to handle all the different states.

image

That is about it for interesting bits in the iterator, the next fun bit is the Log. I do have to admit that I don’t like the term log. It is too easy to get it confused with a debug log. In Voron, I used the term Journal or Write Ahead Journal (OH in the office: “did we waj it yet?”).

image

The fact that you need to figure out where to get the offset of the data you are about to write is really interesting. This is the method that does the bulk of the work:

image

Note that it just reserves and completes the operation. This also does not flush the data to disk. That is handled by the flusher or by an explicit call. The reserve() method calls reserve_internal() and there we find this gem:

image

I know what it does (conditional compilation), but I find it really hard to follow. Especially because it looks like a mistake, with buf being defined twice. This is actually a case where an #if statement would be better, in my eyes.

Most of the code in there is to manage calls to the iobuf, which I already reviewed. So I’m going to skip ahead and look at something that is going to be more interesting, the page cache. Sled has an interesting behavior, in that it can shred a page into multiple locations, requiring some logic to bring it all back together. That is going to be really interesting to look at, I hope.

The file starts with this:

image

And this… takes a while to unpack. Remember that epoch is a manual GC pattern for concurrent data structures in an environment without a GC.

The cached_ptr value is a shared pointer to a Node (inside a lock free stack) that holds a CacheEntry with static lifetime and thread safety, over a generic argument that must also have static lifetime and be thread safe. And there is an unsigned long there as well.

No idea yet what is going on. But here is the first method on this struct:

image

That is… a lot. The cache entry is a discriminated union with the following options:

image

There are some awesome documentation comments here, including full blown sample code that really helps in understanding what is going on in the code.

There seem to be a few key methods that are involved here:

  • allocate(val) – creates a new page and writes an initial value, returning a page id.
  • link(id, val) – makes a write to a page id, which simply writes the value out.
  • get(id) – reads all the values for a page id, and uses a materializer to merge them all into a single value.
  • replace(id, val) – sets the page id to the new value, removing all the other values it had.

The idea here, as I gather, is to allow sequential writes to the data, but also allow fast reads, mostly by utilizing the fast random reads of SSDs.
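The four operations above can be sketched with a toy, in-memory model. This is only an illustration of the semantics as described in this review, written in Python; `ToyPageCache` and its materializer are hypothetical, and the sketch ignores the log, the lock free structures and the epochs that the real Sled code deals with.

```python
from typing import Callable, Dict, List


class ToyPageCache:
    """Toy model: a 'page' is a list of fragments, not a fixed-size block."""

    def __init__(self, materializer: Callable[[List[str]], str]):
        self._pages: Dict[int, List[str]] = {}
        self._next_id = 0
        self._materializer = materializer

    def allocate(self, val: str) -> int:
        # Create a new page with an initial value and hand back its id.
        page_id = self._next_id
        self._next_id += 1
        self._pages[page_id] = [val]
        return page_id

    def link(self, page_id: int, val: str) -> None:
        # Append another fragment; nothing already written is touched.
        self._pages[page_id].append(val)

    def get(self, page_id: int) -> str:
        # Merge every fragment of the page into a single value.
        return self._materializer(self._pages[page_id])

    def replace(self, page_id: int, val: str) -> None:
        # Drop all previous fragments and keep only the new value.
        self._pages[page_id] = [val]


cache = ToyPageCache(materializer=lambda fragments: "".join(fragments))
pid = cache.allocate("a")
cache.link(pid, "b")
cache.link(pid, "c")
print(cache.get(pid))  # prints: abc
cache.replace(pid, "z")
print(cache.get(pid))  # prints: z
```

Writes only ever append (or reset) fragments, which is what keeps them sequential, while get() pays the cost of gathering the fragments and merging them.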

I’m trying to follow the code, but it is a bit complicated. In particular, we have:

image

This tries to either reuse a free page or allocate a new one. One of the things that really messes with me here is the use of the term Page. I’m used to B+Trees, where a page is literally some section of memory. Here it refers to something more nebulous. Key point here, I don’t see where the size is specified. But given that you can link items to a page, that sort of makes sense. I just need to get used to Pages != storage.

The fact that all of this is generic also makes it hard to follow what is actually going on. I’m getting lost in here, so I think that I’ll stop for now.

time to read 3 min | 483 words

Over the weekend, I learned that Joe Armstrong has passed away. I have been thinking about it through all of yesterday, because I have met Joe and had a few discussions with him, but I never had the chance to actually know him. Which is a shame because, in a way, he changed my life.

One of the advantages of having a blog is that I can go back in time and trace things. In Sep 2007, I ran into Joe for the first time. It was at the JAOO conference in Aarhus. I sat in his talk and was quite impressed. This is what I had to say at the time:

I was at the Erlang talk, which is quite probably the best one that will be here. Joe has created the language and wrote the book about it, so he certainly knows his stuff, and he is a Character with a capital C. I am not sure if it is a show or not, but it was amazingly amusing.

Bought the Erlang book, it is a weird language compare to those I know, but I really need to learn a new language this year, and Erlang gets me both functional and concurrent aspects for the "price" of one.

A couple of years later I was at the same conference and wrote:

I remember sitting at a session with Joe Armstrong talking about Erlang and finally getting things that were annoying just beyond my grasp.

Ever since, whenever we were at the same conference, I made sure to sit in his talk. He was an amazing speaker and I still carry with me his advice on system design and distributed architecture. I never really liked the Erlang syntax, but the concepts were very attractive to me. It took a while for this to percolate, but after reading some more about Erlang, I looked for an OSS project in Erlang that I could read, to actually grok what it is like to write in Erlang. I chose to read the CouchDB source code.

This was the first time that I really dove down into NoSQL and I remember running into all sorts of things inside the CouchDB source code and thinking: “That isn’t how I would do it.” That code review ended up giving me so many ideas that I had to put them on paper (on keyboard, actually, I guess) and I wrote a whole series of blog posts on how to design a document database.

Just writing about it didn’t help, so I sat down and wrote some code. Some code turned into a lot of code, and that ended up being RavenDB.

And I can trace it all back to sitting in a conference room in JAOO, listening to Joe speak and being blown away.

Thank you, Joe.

time to read 4 min | 786 words

Krzysztof has been working on our RavenDB Go Client for almost a year, and we are at the final stretch (docs, tests, deployment, etc). He has written a blog post detailing the experience of porting over 50,000 lines of code from Java to Go.

I wanted to point out a few additional things about the porting effort and the Go client API that he didn’t get to.

From the perspective of RavenDB, we want to have as many clients as possible, because the more clients we have, the more approachable we are for developers. There are over a million Go developers, so that is certainly something that we want to enable. More importantly, Go is a great language for server side work and is primarily used for just the kind of applications that can benefit from using RavenDB.

RavenDB currently has clients for:

  1. .NET  / CLR – C#, VB.Net, F#, etc.
  2. JVM – Java, Kotlin, Clojure, etc.
  3. Node.js
  4. Python
  5. Go – finalization stage
  6. C++ – alpha stage

We also have a Ruby client under wraps and wouldn’t object to having a PHP one.

We used to only run on Windows and really only paid attention to the C# client. That changed toward the end of 2015, when we started the work on the 4.0 release of RavenDB. We knew that we were going to be cross platform and we knew that we were going to target additional languages and runtimes. That meant that we had to deal with a pretty tough choice.

Previously, when we had just a single client, we could do quite a lot in it. That meant that a lot of the functionality and the smarts could reside in the client. But we now have 6+ clients that we need to maintain, which means that we are in a very different position.

For reference, the RavenDB Server alone is 225 KLOC, the .NET client is 62 KLOC and the other clients are about 50 KLOC each (Linq support is quite costly for .NET, in terms of LOC and overall complexity).

One of the design guidelines for RavenDB 4.0 was that we want to move, as much as possible, responsibility from the client side to the server side. We have done a lot of stuff to make this happen, but the RavenDB client is still a pretty big chunk of code. With 50 KLOC, you can do quite a lot, so what is actually going on in there?

The RavenDB client core responsibilities are:

  • Commands on the server / documents – About 12 KLOC. This provides strongly typed access to commands, including command-specific error handling.
  • Caching, Failover & request processing – About 3 KLOC. Handles failover and recovery, topology handling and the client side portion of RavenDB’s High Availability features by implementing transparent failover if there is a failure. Also handles request caching as well as aggressive caching.
  • JSON handling – About 3 KLOC. Type converters, serialization helpers and other stuff related to handling JSON that we need client side.
  • Exceptions – About 1.5 KLOC. Type safe exceptions for the various errors take quite a bit of code, mostly because we try hard to get good errors to the user.

But by far, the most complex part of the RavenDB client is the session. The session is the typical API you have for working with RavenDB and it is how you’ll usually interact with it. You can see the Go client above using the session to store a document and save it to the database.

The session is about 20 KLOC or so. By far the biggest single component that we have.

But why is it so big? Especially since I just told you that we spent a lot of time moving responsibilities away from the client.

Because the session implements a lot of really important behaviors for the client. In no particular order, and off the top of my head, we have:

  • Unit of Work
  • Change Tracking
  • Identity Map
  • Queries
  • Patching
  • Lazy operations
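To give a feel for why even the first three items on that list need real code, here is a toy sketch of a session with an identity map, change tracking and a unit of work. It is written in Python for brevity; `ToySession` and its methods are hypothetical and far simpler than the actual RavenDB client session.

```python
import copy


class ToySession:
    def __init__(self, database: dict):
        self._database = database   # stands in for the server
        self._identity_map = {}     # id -> the single live instance
        self._snapshots = {}        # id -> copy taken when tracking began

    def load(self, doc_id: str) -> dict:
        # Identity map: loading the same id twice returns the same object.
        if doc_id not in self._identity_map:
            doc = copy.deepcopy(self._database[doc_id])
            self._identity_map[doc_id] = doc
            self._snapshots[doc_id] = copy.deepcopy(doc)
        return self._identity_map[doc_id]

    def store(self, doc_id: str, doc: dict) -> None:
        # New documents are tracked, but not written until save_changes().
        self._identity_map[doc_id] = doc
        self._snapshots[doc_id] = None

    def save_changes(self) -> list:
        # Unit of work + change tracking: write only what actually changed.
        changed = [doc_id for doc_id, doc in self._identity_map.items()
                   if doc != self._snapshots[doc_id]]
        for doc_id in changed:
            self._database[doc_id] = copy.deepcopy(self._identity_map[doc_id])
            self._snapshots[doc_id] = copy.deepcopy(self._identity_map[doc_id])
        return changed


db = {"users/1": {"name": "Alice"}, "users/2": {"name": "Bob"}}
session = ToySession(db)
user = session.load("users/1")
assert session.load("users/1") is user  # identity map at work
session.load("users/2")                 # loaded, but never modified
user["name"] = "Alicia"
print(session.save_changes())           # prints: ['users/1']
```

Even this toy has to decide when snapshots are taken and what counts as a change; queries, patching and lazy operations add to that.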

The surface area of RavenDB’s client API is very important to me. I think that giving you a high level API is quite important to reduce the complexity that you have to deal with and to make it easy for you to get things done. And that ends up taking quite a lot of code to implement.

The good news is that once we have a client, keeping it up to date is relatively simple. And having taken the onus of complexity upon ourselves, we free you from having to manage that. The overall experience of building applications using RavenDB is much better, to the point where you can pretty much ignore the database, because it will Just Work.

time to read 1 min | 72 words

I’m going to be in London at the beginning of June. I’ll be giving a keynote at Skills Matter as well as visiting some customers.

I have half-day and full-day slots available for consulting (RavenDB, databases and overall architecture). Drop me a line if you are interested.

I should also have an evening or two free if there is anyone who wants to sit over a beer and chat.

time to read 2 min | 287 words

In a previous post about authorization in a microservice environment, I wrote that one option is to generate an authorization token and have it hold the relevant claims for the application. I was asked how I would handle a scenario in which the security claim is over individual categories of orders and a user may have too many categories to fit the token.

This is a great question, because it showcases a really important part of such a design: an inherent limit to complexity. The fact that having a user with a thousand individual security claims is hard isn’t a bug in the system, it is a feature.

For many such cases, it really doesn’t make sense to set up security in such a manner. How can you ever audit or reason about such a system? It just doesn’t work this way in the real world. An agent may be authorized to a dozen customers, and her manager will be allowed access to them as well. But attaching each individual customer to the manager doesn’t work. Instead, you would create a group and attach the customers to the group, then allow the manager to access the group. Such a system is much easier to work with and review. It also matches a lot more closely how the real world works.
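The group approach described above can be sketched in a few lines. The group names, ids and the `can_access` function are hypothetical, and the sketch is in Python; the point is only that a user (or a token) needs to carry a short list of groups, not every customer.

```python
# Hypothetical data: customers are attached to a group once, and users are
# granted access to the group rather than to each individual customer.
GROUP_CUSTOMERS = {
    "accounts/north": {"customers/1", "customers/2", "customers/3"},
}
USER_GROUPS = {
    "users/agent": {"accounts/north"},
    "users/manager": {"accounts/north"},
}


def can_access(user_id: str, customer_id: str) -> bool:
    # A user may access a customer if any of the user's groups contains it.
    return any(
        customer_id in GROUP_CUSTOMERS.get(group, set())
        for group in USER_GROUPS.get(user_id, set())
    )


print(can_access("users/manager", "customers/2"))  # prints: True
print(can_access("users/manager", "customers/9"))  # prints: False
```

Auditing who can see a customer now means reading one group membership list, instead of scanning every user's individual claims.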

Some of the problems here are derived from the fact that it seems like, when we use a computer, we can build such a system. But in most cases, this is a false premise. Not because of actual technical limitations, but because of management overhead.

Building the system upfront so the things that should be hard are actually hard is going to be a lot better in the long run.

time to read 5 min | 808 words

I talked a bit about microservices architecture in the past few weeks, but I think that there is a common theme to those posts that is missed in the details.

A microservices architecture, just like Domain Driven Design or Event Sourcing and CQRS, is an architectural pattern that is meant to manage complexity. In the realm of operations, Kubernetes is another good example of a tool that is meant to manage complexity.

I feel that this is a part that all too often gets lost. The law of leaky abstractions means that you can’t really reduce complexity, you can only manage it. This means that tools and architectures that are meant to deal with complexity are themselves complex, by necessity. The problem is when you try to take a solution that was successfully applied to solve a complex problem, and apply that to something that isn’t of equal complexity.

Keep the following formula in mind:

Solution Complexity = Architecture Complexity + ( Problem Complexity / Architecture Factor )

Let’s try to solve this formula for a couple of projects. One would be managing a little league soccer website and the other would be the standard online shop. Here are the results:

Cost / Benefit of Architecture

                           Little League    Online Shop
  Architecture Complexity  10               10
  Problem Complexity       2                20
  Architecture Factor      3                3
  Solution Complexity      10.6             16.6

By the way, the numbers are arbitrary. I’m trying to show a point, and showing it with numbers makes it easier to get the point across. The formula is real, though, based on my experience.

The idea behind the formula and the table above is simple. Every architecture you make can be ranked along two axes. One is the architectural complexity and the second is the architectural factor. The architectural complexity is a (usually) fixed number that ranks how complex it is to use the architecture. The architectural factor is how much this architecture helps you deal with the overall problem complexity.

You can see above that applying the same architecture to two different problems can produce very different results. The overall solution complexity for the little league website is lower than for the online shop, as expected. But you can also see that there are huge fixed costs here that drive the overall complexity far higher.

Using a different architecture, which will have a much smaller architectural factor, but also much lower fixed complexity, will allow you to deliver a solution that has much lower complexity (and get it faster, with fewer bugs, etc).

Choosing a microservice architecture implies that you are going to have a net benefit here. The additional complexity of using microservices is offset by the fact that the architectural factor is going to reduce your overall complexity. Otherwise, it just doesn’t make sense.

An 18-wheeler is a great thing to have, if you need to ship a whole bunch of stuff. It is the Wrong Tool For The Job if you need to commute to work.

In most cases, people select the architecture that sounds right for their project, mostly because they focus on the architectural factor, without taking into account the fixed complexity cost. When they run into that, they either re-evaluate or strive forward regardless. Let’s assume that you run into a project where they chose the microservice architecture, and then they realized that some parts of it are complex, so they cut some corners. I’m thinking about something like what is shown here. Let’s analyze what you end up with.

Architecture Complexity – 10, Architecture Factor – 1, Problem Complexity – 8: Overall Complexity = 10 + ( 8 / 1 ) = 18

And that is for the good case where your architectural factor isn’t actually below 1, which I would argue is where these kinds of solutions actually end up. A Distributed Monolith has an architecture complexity of 10 and a factor of 0.75. So trying to solve a problem that has a complexity of 8 here will result in an overall complexity of 20.6.
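For anyone who wants to play with the numbers, the formula is easy to put into code. This is a small Python sketch; the function name and the scenario labels are made up for illustration, and the inputs are the same arbitrary numbers used in this post.

```python
def solution_complexity(architecture_complexity: float,
                        problem_complexity: float,
                        architecture_factor: float) -> float:
    # Solution Complexity =
    #   Architecture Complexity + (Problem Complexity / Architecture Factor)
    return architecture_complexity + problem_complexity / architecture_factor


# (architecture complexity, problem complexity, architecture factor)
scenarios = {
    "little league": (10, 2, 3),
    "online shop": (10, 20, 3),
    "microservices with cut corners": (10, 8, 1),
    "distributed monolith": (10, 8, 0.75),
}

for name, args in scenarios.items():
    print(f"{name}: {solution_complexity(*args):.2f}")
# prints:
# little league: 10.67
# online shop: 16.67
# microservices with cut corners: 18.00
# distributed monolith: 20.67
```

The figures quoted in the post (10.6, 16.6 and 20.6) are the same values with the digits after the first decimal dropped. Note how a factor below 1 makes the architecture amplify the problem complexity instead of reducing it.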

I don’t actually have real numbers to evaluate different architectures and solution complexities. That would probably require rigorous study, but empirical evidence can give good off the cuff numbers for most of the common architectures. I’m going to leave it up to the comments, if someone wants to take up this challenge.

Keep this in mind when you are choosing your architecture, for both greenfield and brownfield projects. That can save you a lot of trouble.

time to read 3 min | 520 words

They just aren’t. And I’m talking as someone who has actually implemented multiple distributed transaction systems. People moving to microservices are now discovering a lot of the challenges and hurdles of distributed systems and it is only natural to want to go back to the cozy transactional world, where you can reason about things properly.

This post is in response to this article: Microservices and distributed transactions, which I read with interest, because it isn’t often that a post will refute its own premise with the very first statement.

The two-phase commit protocol was designed in the epoch of the “big iron” systems like mainframes and UNIX servers; the XA specification was defined in 1991 when the typical deployment model consisted of having all the software installed in a single server.

That is a really important observation, because in this case, we remove one big factor from distributed transactions: the distributed part. Note that this was almost 30 years ago; distributed transactions and the two phase commit protocol aren’t running on a single node any longer. But the architecture is still rooted in the same concept. And it doesn’t work. I wrote a blog post explaining the core issues with two phase commit about 5 years ago. Nothing has changed since.

From a technical perspective, the approach that is shown in the article is interesting. It is really nice that you can have a “transaction” that spans multiple services and databases. It is a problem that this isn’t going to result in atomic behavior (you can observe some of the transactions being committed before others), it is a problem that this has really bad failure modes (hanging / timeout / inconsistencies) under fairly common scenarios and finally, it is a really bad approach because your microservices shouldn’t be composed using transactions.

Leaving aside all the technical details about why two phase commit is a bad idea, there is still the core architectural issue, you are tying together the services in your system. If service A is stalled for whatever reason, your service B is now impacted because it is waiting for a transaction to close.

Have fun trying to debug something like that, especially because your actual state is hidden away in some transaction manager and not readily visible. It means adding a tricky layer of complexity that will break, and will cause issues, and will create silent dependencies between your services. Silent ones, invisible ones, and they will come to haunt you.

The whole point of a microservice architecture is separation of concerns into independently managed, deployed and provisioned systems. If you actually need to have cross service transactions, you have either modelled things wrong or are doing very badly. Go back to a monolith with a single database backend and use that as the transactional store. You’ll be much happier.

Remember: Microservices. Are. Separated.

That isn’t a bug, that isn’t a hurdle to overcome. That is the point. Tying them close together is a mistake, but you’ll usually only see it after a few months of production. So take a measure of prevention before you’ll need a metric ton of cures.

time to read 4 min | 722 words

This post was triggered by this post. Mostly because I got people looking strangely at me when I shouted DO NOT DO THAT when I read the post.

We’ll use the usual Users and Orders example, because that is simple to work with. We have the usual concerns about users in our application:

  • Authentication
    • Password reset
    • Two factor auth
    • Unusual activity detection
    • Etc, etc, etc.
  • Authorization
    • Can the user perform this particular operation?
    • Can the user perform this action on this item?
    • Can the user perform this action on this item on behalf of this user?

Authentication itself is a fairly simple matter to deal with: don’t build it, go and use a built-in solution. Authentication is complex, but the good side of it is that there is rarely any business specific stuff around it. You need to authenticate a user, and that is one of those things that is generally such a common concern that you can take an off-the-shelf solution and go with that.

Authorization is a lot more interesting. Note that we have three separate ways to ask the same question. It might be better to give concrete examples about what I mean for each one of them.

Can the user create a new order? Can they check the recent product updates, etc? Note that in this case, we aren’t operating on a particular entity, but performing global actions.

Can the user view this order? Can they change the shipping address?  Note that in this case, we have both authorization rules (you should be able to view your own orders) and business rules (you can change the shipping address on your order if the order didn’t ship and the shipping cost is the same).

Can the helpdesk guy check the status of an order for a particular customer? In this case, we have a user that is explicitly doing an action on behalf of another user. We might allow it (or not), but we almost always want to make a special note of this.

The interesting thing about this kind of system is that there are very different semantics for each of those operations. One of the primary goals of a microservice architecture is the separation of concerns, so I don’t want to keep pinging the authorization service on each operation. That is important, and not just for the architectural purity of the system: one of the most common reasons for performance issues in systems is the cost of authorization checks. If you make that go over the network, that is going to kill your system.

Therefore, we need to consider how to enable proper encapsulation of concerns. An easy way to do that is to have the client hold that state. In other words, as part of the authentication process, the client is going to get a token, which it can use for the next calls. That token contains the list of allowed operations / enough state to compute the authorization status for the actual operations. Naturally, that state is not something that the client can modify, and it is protected with cryptography. A good example of that would be JWT. The authorization service generates a token with a key that is trusted by the other services. You can then verify most authorization actions without leaving your service boundary.
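As a rough illustration of a token that the client holds but cannot modify, here is a sketch using only the Python standard library. This is not JWT and not production code: the shared symmetric key, the claim names and both functions are assumptions made for the example, and a real system would use an established JWT library instead.

```python
import base64
import hashlib
import hmac
import json

# Assumption: a symmetric key shared by the authorization service and the
# other services. Real deployments often use asymmetric keys instead.
SHARED_KEY = b"demo-key-known-to-all-services"


def issue_token(claims: dict) -> str:
    # Done by the authorization service: encode the claims and sign them.
    payload = base64.urlsafe_b64encode(json.dumps(claims, sort_keys=True).encode())
    signature = hmac.new(SHARED_KEY, payload, hashlib.sha256).hexdigest()
    return payload.decode() + "." + signature


def verify_token(token: str) -> dict:
    # Done locally by any service: no network call to the authorization service.
    payload, _, signature = token.rpartition(".")
    expected = hmac.new(SHARED_KEY, payload.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(signature, expected):
        raise ValueError("token was modified or signed with another key")
    return json.loads(base64.urlsafe_b64decode(payload))


token = issue_token({"user_id": "users/1", "ops": ["orders/create", "orders/view/self"]})
claims = verify_token(token)
print("orders/create" in claims["ops"])  # prints: True
```

If the client edits the payload to grant itself more operations, the signature no longer matches and verify_token() rejects the token.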

This is easy for operations such as creating a new order, but how do you handle authorization on a specific entity? You aren’t going to be able to encode all the allowed entities in the token, at least not in most reasonable systems. Instead, you combine the allowed operations and the semantics of the operation itself. In other words, when loading an order, you check whether the user has the “orders/view/self” operation and that the order is for the same user id.
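In code, that check is a combination of a claim from the token and a property of the entity being loaded. A minimal Python sketch follows; the shapes of the `claims` and `order` dictionaries are assumptions, only the “orders/view/self” operation name comes from the text above.

```python
def can_view_order(claims: dict, order: dict) -> bool:
    # The token says the user may view their *own* orders; the entity
    # itself tells us whether this particular order is theirs.
    has_operation = "orders/view/self" in claims.get("ops", [])
    return has_operation and order["user_id"] == claims["user_id"]


claims = {"user_id": "users/1", "ops": ["orders/view/self"]}
print(can_view_order(claims, {"id": "orders/17", "user_id": "users/1"}))  # prints: True
print(can_view_order(claims, {"id": "orders/18", "user_id": "users/2"}))  # prints: False
```

Nothing here requires a call outside the service: the token supplies the allowed operation and the order document supplies the owner.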

A more complex process is required when you have operations on behalf of another user. You don’t want the helpdesk people to start sniffing into what <Insert Famous Person Name Here> ordered last night, for example. Instead of complicating the entire system with “on behalf of” operations, a much better approach is to go back to the authorization service. You can ask that service to generate a special “on behalf of” token for you, with the user id of the required user. This creates an audit trail of such actions and allows the authorization service to decide if a particular user should have the authority to act on behalf of a particular user.
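A sketch of what the authorization service might do for such a request is shown below, in Python. Everything in it is hypothetical: the policy (a fixed set of helpdesk users), the claim names, the five minute lifetime and the in-memory audit log are there only to show the decision and the audit trail happening in one place.

```python
import time

AUDIT_LOG = []                          # stand-in for durable audit storage
HELPDESK_USERS = {"users/helpdesk-7"}   # hypothetical policy


def issue_on_behalf_of_token(actor_id: str, target_user_id: str, reason: str) -> dict:
    # The authorization service decides whether the actor may impersonate.
    if actor_id not in HELPDESK_USERS:
        raise PermissionError(f"{actor_id} may not act on behalf of other users")
    # Every grant is recorded before the token is handed out.
    AUDIT_LOG.append({
        "actor": actor_id,
        "on_behalf_of": target_user_id,
        "reason": reason,
        "at": time.time(),
    })
    # Claims for a short-lived token; signing is omitted in this sketch.
    return {
        "user_id": target_user_id,
        "acting_user": actor_id,
        "ops": ["orders/view/self"],
        "expires": time.time() + 300,
    }


claims = issue_on_behalf_of_token("users/helpdesk-7", "users/1", "ticket 1234")
print(claims["user_id"], len(AUDIT_LOG))  # prints: users/1 1
```

The other services keep using the same “self” checks they already have, because the token carries the target user id; the fact that someone else is acting is visible in the token and in the audit log.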
