Ayende @ Rahien

Oren Eini aka Ayende Rahien CEO of Hibernating Rhinos LTD, which develops RavenDB, a NoSQL Open Source Document Database.

time to read 4 min | 606 words

I recently had to discuss the impact of latency a few times, and I found the coffee cup analogy to be an excellent tool to explain exactly what is going on. Consider the humble coffee cup, without which there would be no code.

It is a pretty simple drink, composed of coffee, water and milk. I’ll ignore coffee snobs and the like for now and focus strictly on the process of making a cup of coffee. I found this recipe:

  • 1 cup milk
  • ½ cup cold brewed coffee
  • 2 sweetener

Mix milk, coffee, and sweetener together in a glass until sweetener is dissolved.

If I was writing this in code, I would probably write something like this:

Simple enough, right? There are just a few details to fill in. How are the coffee() or sweetner() methods implemented?

The nice thing about this code is that it is nicely abstracted; the coffee recipe and the code read in almost the same manner. However, there is an issue with the actual implementation. We have the go_to_store() method, but we know that this is an expensive operation. To avoid making it too often, we calculate the amounts that we need to make 20 cups of coffee and make sure that we set the relevant XYZ_AMOUNT_TO_BUY appropriately.

What do you think will happen on the 21st cup of coffee, however? We have run out of coffee, so we’ll go to the store to get some. Once we have it, we can pour the coffee into the cup, but then we need to put the milk in, at which point we’ll discover that we have run out of that as well. Off to the store we go, and all the way back. And then there is the sweetener, which has also run out, so that is the third trip to the store.

Abstraction, in this case, is actively hurting us. We ignore the fact that ingredients may be missing, and that isn’t something that we can afford to do. The cost of going to the store outweighs anything else in the process of making a cup of coffee, and we just did that three times.
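The three trips can be reproduced with a small simulation. This is only a sketch and not the code from the original post: the Pantry class, the take() method and the stock numbers are assumptions made for illustration; only go_to_store() and the idea of an "amount to buy" come from the text.

```python
class Pantry:
    """Tracks stock and counts how many (expensive) store trips were made."""

    def __init__(self, stock):
        self.stock = dict(stock)
        self.trips = 0

    def go_to_store(self, item, amount):
        # Stand-in for the expensive remote call.
        self.trips += 1
        self.stock[item] += amount

    def take(self, item, amount, amount_to_buy):
        # Each ingredient checks (and restocks) on its own, hidden from the caller.
        if self.stock[item] < amount:
            self.go_to_store(item, amount_to_buy)
        self.stock[item] -= amount


def make_coffee(pantry):
    pantry.take("coffee", 0.5, amount_to_buy=10)
    pantry.take("milk", 1, amount_to_buy=20)
    pantry.take("sweetener", 2, amount_to_buy=40)


pantry = Pantry({"coffee": 10, "milk": 20, "sweetener": 40})  # enough for 20 cups
for _ in range(20):
    make_coffee(pantry)
trips_before = pantry.trips          # trips made during the first 20 cups
make_coffee(pantry)                  # the 21st cup
trips_for_cup_21 = pantry.trips - trips_before
print(trips_before, trips_for_cup_21)  # prints "0 3"
```

Nothing in make_coffee() hints that a single call can turn into three round trips, which is exactly the problem.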

In the context of software, of course, we are talking about the issue of making a remote call. For example, sending a separate query to the database for each datum that you need. The cost of the remote call far exceeds any other cost you have in the system.

To solve the coffee cup problem, you’ll need to do something like:

Abstraction? What abstraction? There are no abstractions here. We are very clearly focused on the things that need to happen to get it working properly. In fact, a better alternative would be to not check that we have enough for the current cup but to schedule a purchase when we notice that we are low.

That, again, intermixes the responsibilities of making the coffee and making sure that we have the ingredients at hand. That is not an actual problem, however. It is something that we are fine with, given the difference in performance that this entails.
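A minimal sketch of that idea, checking everything up front so that a single trip covers all missing ingredients. The RECIPE table, CUPS_PER_TRIP and the shape of go_to_store() are assumptions for illustration, not the code from the original post.

```python
RECIPE = {"coffee": 0.5, "milk": 1, "sweetener": 2}   # per cup, from the recipe above
CUPS_PER_TRIP = 20                                     # assumed restock size


def go_to_store(stock, shopping_list):
    """One expensive trip that brings back everything on the list."""
    for item, amount in shopping_list.items():
        stock[item] += amount


def make_coffee(stock):
    """Check every ingredient first, so at most one trip is needed per cup."""
    missing = {
        item: amount * CUPS_PER_TRIP
        for item, amount in RECIPE.items()
        if stock[item] < amount
    }
    trips = 0
    if missing:
        go_to_store(stock, missing)
        trips = 1
    for item, amount in RECIPE.items():
        stock[item] -= amount
    return trips


stock = {"coffee": 0, "milk": 0, "sweetener": 0}   # everything has run out
trips = make_coffee(stock)
print(trips)  # prints 1
```

The function now knows about shopping as well as brewing, which is the mixing of responsibilities discussed above, but the worst case drops from three trips to one.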

In the same manner, when I see people trying to hide remote operations (RPC, database calls, etc.) behind an abstraction layer, I know that it will almost always end in tears. Because if you have what looks like a cheap function call go to the store for you, the end result is that you have to wait a long time for your coffee. Maybe long enough to (gasp) not even have coffee.

On that note, I have a cup of coffee to finish…

time to read 3 min | 585 words

I recently ran into a bit of code that made me go: Stop! Don’t you dare go this way!

The reason that I had such a reaction to the code in question is that I have seen where such code will lead you, and that is not anywhere good. The code in question?

This is a pretty horrible thing to do to your system. Let’s count the ways:

  • Queries are happening fairly deep in your system, which means that you’re now putting this sort of behavior in a place where it is generally invisible for the rest of the code.
  • What happens if the calling code also has something similar? Now we’ve got retries on retries.
  • What happens if the code that you are calling has something similar? Now we’ve got retries on retries on retries.
    • You can absolutely rely on the code you are calling to do retries. If only because that is how TCP behaves. But also because there are usually resiliency measures implemented.
  • What happens if the error actually matters? There is no exception thrown in any case, which means that important information is written to the log, which no one ever reads.
  • There is no distinction between the types of errors where a retry may help and where it won’t.
  • What if the query has side effects? For example, you may be calling a stored procedure, but multiple times.
  • What happens when you run out of retries? The code will return null, which means that the calling code will likely fail with an NRE.
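To make a couple of those points concrete, here is a small sketch of the pattern as described in the list above. It is not the code that prompted the post; query_with_retries() and charge_customer() are hypothetical names used only to show the side effect and the silent None result.

```python
import logging

log = logging.getLogger("db")


def query_with_retries(run_query, retries=3):
    """The anti-pattern: swallow every error, log it, retry, finally return None."""
    for attempt in range(retries):
        try:
            return run_query()
        except Exception as exc:          # no distinction between error types
            log.warning("attempt %d failed: %s", attempt + 1, exc)
    return None                           # the caller gets no exception at all


calls = []


def charge_customer():
    # Hypothetical query with a side effect: the charge goes through,
    # then the connection drops before the reply arrives.
    calls.append("charged")
    raise ConnectionError("connection reset")


result = query_with_retries(charge_customer)
print(result, len(calls))  # prints "None 3"
```

The side effect ran three times, the only trace of the failure is in the log, and the caller is left holding None.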

What is worse, by the way, is that this piece of code is attempting to fix a very specific issue: being unable to reach the relevant database. For example, if you are writing a service, you may run into that on reboot. Your service may have started before the database, so you need to retry a few times to let the database load. A better option would be to specify the load order of the services.

Or maybe there was some network hiccup that you had to deal with? That would sort of work, and it is probably the one case where this will work. But TCP already does that by resending packets, so you are adding retries on top of that, and it is building up to be a nasty case.

When there is an error, your application is going to sulk, throw strange errors and refuse to tell you what is going on. There are going to be a lot of symptoms that are hard to diagnose and debug.

To quote Release It!:

Connection timeouts vary from one operating system to another, but they’re usually measured in minutes! The calling application’s thread could be blocked waiting for the remote server to respond for ten minutes!

You added a retry on top of that, and then the system just… stops.

Let’s take a look at the usage pattern, shall we?

That will fail pretty badly (and then cause a null reference exception). Let’s say that this is a service code, which is called from a client that uses a similar pattern for “resiliency”.

Question – what do you think will happen the first time that there is an error?  Cascading failures galore.

In general, unknown errors shouldn’t be handled locally, since you don’t have a way to do that here. You should raise them up as far as possible. And yes, showing the error to the user is generally better than just spinning in place, without giving the user any feedback whatsoever.

time to read 5 min | 834 words

I wrote a post a couple of weeks ago called: Architecture foresight: Put a queue on that. I got an interesting comment from Mike Tomaras on the post that deserves its own post in reply.

Even though the benefits of an async queue are indisputable, I will respectfully point out that you brush over or ignore the drawbacks.

… redacted, see the real comment for details …

I think we agree that your sync code example is much easier to reason about than your async one. "Well, it is a bit more complex to manage in the user interface", "And you can play games on the front end" hides a lot of complexity in the FE to accommodate async patterns.

Your "At more advanced levels" section presents no benefits really, doing these things in a sync pattern is exactly the same as in async, the complexity is moved to the infrastructure instead of the code.

This is a great discussion, and I agree with Mike that there are additional costs to using the async option compared to the synchronous one. There is a really good reason why pretty much all modern languages have something similar to async/await, after all. And anyone who did any work with Node.js and promises without that knows exactly what the cost is of trying to keep the state of the system through multiple levels of callbacks.

It is important to note, however, that my recommendation had nothing to do with async directly, although that is the end result. My recommendation had a lot more to do with breaking apart the behavior of the system, so you aren’t expected to give immediate replies to the user.

Consider this: ⏱. When you are processing a user’s request, you have a timer inherent to the operation. That timer can be a real one (how long until the request times out) or it can be a mental one (how long until the user gets bored). That means that you have a very short SLA to run the actual request.

What is the impact of that on your system? You have to provision enough capacity in the system to handle the spikes within the small SLA that you have to work with. That is tough. Let’s assume that you are running a website that accepts comments, and you need to run spam detection on the comment before actually posting that. This seems like a pretty standard scenario, right? It doesn’t require specialized scenarios.

However, the service you use has a rate limit of 10 comments / sec. That is also something that is pretty common and reasonable. How would you handle something like that if you have a post that suddenly gets a lot of comments? Well, you’ll have something that ensures that you don’t pass the limit, but then the user is sitting there, waiting and thinking that the request timed out. On the other hand, if you accept the request and place it into a queue, you can show it in the UI as accepted immediately and then process it at leisure.
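A minimal in-memory sketch of that flow. The CommentQueue class and its method names are assumptions for illustration; a real system would use a durable queue and call the actual spam detection service, and one call to process_one_second() stands in for one second of wall-clock time.

```python
from collections import deque

RATE_LIMIT = 10  # spam checks per second, as in the example above


class CommentQueue:
    def __init__(self):
        self.pending = deque()
        self.posted = []

    def submit(self, comment):
        """Called from the request handler: enqueue and answer immediately."""
        self.pending.append(comment)
        return "accepted"

    def process_one_second(self, is_spam):
        """Background worker: handle at most RATE_LIMIT comments per tick."""
        for _ in range(min(RATE_LIMIT, len(self.pending))):
            comment = self.pending.popleft()
            if not is_spam(comment):
                self.posted.append(comment)


queue = CommentQueue()
replies = [queue.submit(f"comment {i}") for i in range(25)]   # a sudden spike

ticks = 0
while queue.pending:
    queue.process_one_second(is_spam=lambda c: False)  # stand-in for the spam API
    ticks += 1
print(ticks, len(queue.posted))  # prints "3 25"
```

All 25 users got an immediate "accepted", and the spike was drained over three seconds without ever exceeding the limit.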

Yes, this is more complex than just making the call inline, but it also ensures that you have proper separation in your system. The front end submits messages to the backend, which will reply when it is done. By having this separation upfront, as part of your overall design, you get options. You can change how you are processing things in the backend quickly. Your front end feels fast (which is usually much more important than being fast, mind you).

As for the rate limits and the SLA? In the case of a spam API or similar services, sure, this is obvious. But there are usually a lot of implicit SLAs like that. Your database disk is only able to serve so many writes a second, for example. That isn’t usually surfaced to you as an X writes / sec limit, but it is true nevertheless. And a queue will smooth over any such issues easily. When making the request directly, you have to ensure that you have enough capacity to handle spikes, and that is usually far more expensive.

What is more interesting, in my opinion, is that the queue gives you options that you wouldn’t have otherwise. For example, tracing of all operations (great for audits), retries if needed, easy model for scale out, smoothing out of spikes, etc.

You cannot actually put everything into a queue, of course. The typical example is that you’ll want to handle a login page. You cannot really “let the user login immediately and process in the background”. Another example where you don’t want to use asynchronous processing is when you are making a query. There are patterns for async query completions, but they are pretty horrible to work with.

In general, the idea is that whenever there is any operation in the system, you throw that to a queue. Reads and certain key aspects are things that you’ll need to run directly.

time to read 6 min | 1120 words

I was pointed to the Odin language after my post about the Zig language. On the surface, Odin and Zig are very similar, but they have some fundamental differences in behavior and mindset. I’m basing most of what I’m writing here on admittedly cursory reading of the Odin language docs and this blog post.

Odin has a great point on conditional compilation. The if statements that are evaluated at compile time are hard to distinguish. I like Odin’s when clauses better, but Zig has comptime if as well, which makes it easier. The actual problem I have with this model in Zig is that it is easy to get to a situation where you write (new) code that doesn’t get called, but Zig will detect that it is unused and not bother compiling it. When you are actually trying to use it, you’ll hit a lot of compilation errors that you need to fix. This is in contrast to the way I would usually work, which is to almost always have the code in a compilable state and lean hard on the compiler to double check my work.

Beyond that, I have grave disagreements with Ginger, the author of the blog post and the Odin language. I want to pull just a couple of what I think are the most important points from that post:

I have never had a program cause a system to run out of memory in real software (other than artificial stress tests). If you are working in a low-memory environment, you should be extremely aware of its limitations and plan accordingly. If you are a desktop machine and run out of memory, don’t try to recover from the panic, quit the program or even shut-down the computer. As for other machinery, plan accordingly!

This is in relation to automatic heap allocations (which can fail, which will usually kill the process because there is no good way to report it). My reaction to that is “640KB is enough for everything”, right?

To start with, I write databases for a living. I run my code on containers with 128MB when the user uses a database that is 100s of GB in size. Even when running on proper server machines, I almost always have to deal with datasets that are bigger than memory. Running out of memory happens to us pretty much every single time we start the program. And handling this scenario robustly is important to building system software. In this case, planning accordingly, in my view, means not using a language that can put me in a hole. This is not theoretical, it is a real scenario that we have to deal with.

The biggest turnoff for me, however, was this statement on errors:

…my issue with exception-based/exception-like errors is not the syntax but how they encourage error propagation. This encouragement promotes a culture of pass the error up the stack for “someone else” to handle the error. I hate this culture and I do not want to encourage it at the language level. Handle errors there and then and don’t pass them up the stack. You make your mess; you clean it.

I didn’t really know how to answer that at first. There are so many cases where that doesn’t even make sense that it isn’t even funny. Consider a scenario where I need to call a service that would compute some value for me. I’m doing that as gRPC over TCP + SSL. Let me count the number of errors that can happen here, shall we?

  • Bad reaction on the service (run out of memory, for example).
  • Argument passed is not a valid one
  • Invalid SSL certificate
  • Authentication issues
  • TCP firewall issue
  • DNS issue
  • Wrong URL / port

My code, which is calling the service, needs to be able to handle any / all of those. And probably quite a few more that I didn’t account for. Trying to build something like that is onerous, fragile and doesn’t really work. For that matter, if I passed the wrong URL for the service, what is the code that is doing the gRPC call supposed to do but bubble the error up? If the DNS is returning an error, or there is a certificate issue, how do you clean it up? The only reasonable thing to do is to give as much context as possible and raise the error to the caller.
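As a sketch of "give as much context as possible and raise the error to the caller", here is how that can look with exception chaining. The names ServiceCallError, resolve() and compute_remote() are hypothetical, and the failing DNS lookup is simulated rather than performed.

```python
class ServiceCallError(Exception):
    """Raised to the caller with enough context to decide what to do."""


def resolve(host):
    # Stand-in for a DNS lookup that fails; a real call could fail in many other ways.
    raise OSError(f"name resolution failed for {host}")


def compute_remote(url, argument):
    host = url.split("//", 1)[-1].split("/", 1)[0]
    try:
        resolve(host)
    except OSError as exc:
        # Nothing can be "cleaned up" here: add context and raise to the caller.
        raise ServiceCallError(f"calling {url} with {argument!r} failed") from exc


try:
    compute_remote("https://compute.example/v1", 42)
except ServiceCallError as err:
    message = f"{err} (cause: {err.__cause__})"
    print(message)
```

The low-level code does not pretend to know what to do about a DNS failure; it only makes sure that whoever does know gets both the operation and the underlying cause.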

When building robust software, bubbling the error up so the caller can decide what to do isn’t about passing the buck, it is a best practice. You only need to look at Erlang and how applications with the highest requirements for reliability are structured. They are meant to fail; error handling and recovery is something that happens in dedicated locations (supervisors), because these places have the right context to make an actual determination.

The killer impact of this, however, is that Zig has an explicit notion of errors, while Odin relies on the multiple return values system. We have seen how good that is with Go. In fact, one of the most common issues with Go is how much manual work it takes to do proper error handling.

But I think that the key issue here is that errors as a first class aspect of the language give us a very powerful ability, errdefer. This single language feature is the reason I think that Zig is an amazing language. The concept of first class errors combined with errdefer makes building complex structures so much easier.

Consider the following code:

Note that I’m opening a file, mapping it to memory, validating its size and then checking that it has the right hash. I’m using defer to ensure that I clean up the file handle, but what about the returned memory? In this case, I want to clean it up if there is an error, but not otherwise.

Consider how you would write this code without errdefer. I would have to add the “close the map” portion to both places where I want to return an error. And what happens if I’m using more than a couple of resources? I may need to do something that requires a file, network socket, memory, etc. Any of those operations can fail, but I want to clean them up only on failure. Otherwise, I need to return them to my caller. Using errdefer (which relies on the explicit distinction between regular returns and errors) will ensure that I don’t have a problem. Everything works, and the amount of state that I have to keep in my head is greatly reduced.
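For readers who don’t know Zig, the same "clean up only on failure" shape can be approximated in Python. This is not the Zig code from the post; it is a sketch that uses contextlib.ExitStack and its pop_all() method as a stand-in for errdefer, and map_validated() is a hypothetical name.

```python
import contextlib
import hashlib
import mmap
import os
import tempfile


def map_validated(path, expected_size, expected_sha256):
    """Return a memory map of the file; the map is closed only if validation fails."""
    with open(path, "rb") as f:                       # like defer: always closed
        with contextlib.ExitStack() as on_error:
            mapped = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
            on_error.callback(mapped.close)           # like errdefer
            if len(mapped) != expected_size:
                raise ValueError("unexpected file size")
            # mapped[:] copies the bytes, which is fine for a small sketch.
            if hashlib.sha256(mapped[:]).hexdigest() != expected_sha256:
                raise ValueError("hash mismatch")
            on_error.pop_all()                        # success: caller owns the map
            return mapped


data = b"hello raven"
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(data)

good = map_validated(tmp.name, len(data), hashlib.sha256(data).hexdigest())
content = good[:]          # still usable: the success path did not close it
good.close()

try:
    map_validated(tmp.name, len(data), "not-the-hash")
    failed = False
except ValueError:
    failed = True          # the map was closed by the error-only cleanup

os.remove(tmp.name)
```

Note that the success path has to remember to call pop_all(). In Zig, the distinction between returning a value and returning an error makes that step unnecessary.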

Consider how you’ll do that in Odin or Go, on the other hand, and you can see how error handling becomes a big beast. Having explicit language support to assist in that is really nice.

time to read 6 min | 1001 words

RavenDB offers both single node transactions as well as cluster wide transactions. You are free to use either one or even mix them together. That level of freedom, on the other hand, brings with it its own set of challenges. How do you know what to use? What are the scenarios and implications for each operation? Remember, RavenDB is a distributed database that allows you to make modifications on any node in the cluster.

In essence, this boils down to a simple concept: how important is the write that you are making? In detail, this gets complex. It’s easy to say that for low importance writes, you’ll use single node transactions, and for high value items, you’ll use cluster wide transactions. But that isn’t correct. The primary issue is what you are trying to achieve. I’m afraid that I have no choice but to dig into this topic.

Let’s consider the following scenario: A user clicked on “Add to Cart” in the application. How should we record this fact? There is a “shopping-carts/ayende” document for this user, which represents their current shopping cart. But how should we save it?

Obviously, we never want to lose an item from the shopping cart, right? We can use a cluster wide transaction here to ensure maximum safety! Except… a cluster wide transaction will fail if the node that we reached cannot access the majority of the nodes in the cluster. Going back to the business, I asked them about it. The answer I got was “never lose an item from the shopping cart”. That means that we need to process the write even if we can reach no other node.

That leads us to single node transactions, which will do just that. However, now we have to deal with another issue: what happens if two concurrent transactions modify the same document on different nodes at the same time? Now we have a conflict, and when the nodes replicate the data to one another, we’ll need to resolve it somehow. RavenDB will default to resolving to latest, meaning that some of the changes will be lost. However, we can set up a resolution script that can merge our changes between multiple versions of a document.

This is confusing, I’m aware. The rule of thumb goes like this:

  1. Use single node transactions by default – if there are errors / conflicts / issues, let RavenDB resolve them to the latest version (a revision exists so you can recover anything lost).
  2. Use single node transactions + conflict resolver script if you actually care about applying any sort of logic to the merging of conflicts. This is rare, the scenario is usually when we have something that can be modified and merged together. Shopping cart is an excellent example of this.
  3. Use a cluster wide transaction when you would rather fail than go forward if you cannot ensure the operation is successful. This is also rare, usually reserved for things such as selling limited amount of some item.

The default recommendation, to let RavenDB manage that and accept that it may select the latest version, is not something that I make lightly. It is based on quite a bit of experience in how users are actually using RavenDB.  In almost any business context, you are going to have large parts of the model that have only a single reason to change, even in the worst case assumption. A customer changing their billing address, for example, can be reasonably assumed to want to keep the latest version they put in. There is also no real meaning to concurrency in this scenario, since the modification to a particular document is done by the relevant customer directly.  Failures are rare (but they do happen, so you have to account for them), so you need to consider what the impact will be. If this is something that doesn’t normally have multiple concurrent operations going on (and proper document modeling will suggest that it doesn’t), you can just ignore the problem.

I’m saying ignore the problem because there is the question of what not ignoring the issue would even mean. You can try to write your conflict resolution script, but even with knowledge of your model, what are you expected to do with two conflicting versions of a customer, with different billing addresses in each?

And trying to do something generic doesn’t work. It will fail, but because this is rare, it will happen only a year after deployment, when no one recalls what exactly the behavior is, and an error in such a case will cause a hard failure in production.

For some cases, like the shopping cart, you can write meaningful merge code, and the scenario makes sense. I may click on two “Add to Cart” buttons at the same time from different locations and I don’t want to lose any of that.
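To show what "meaningful merge code" could mean for a cart, here is a sketch of the merging logic only. It does not use RavenDB’s actual conflict resolution API, and the document shape (an "items" map of name to quantity) and the rule of taking the larger quantity are assumptions made for illustration.

```python
def merge_carts(versions):
    """Merge conflicting cart versions: keep every item, take the larger quantity."""
    merged = {}
    for cart in versions:
        for item, quantity in cart["items"].items():
            merged[item] = max(merged.get(item, 0), quantity)
    return {"items": merged}


node_a = {"items": {"coffee beans": 1, "mug": 2}}
node_b = {"items": {"coffee beans": 1, "milk frother": 1}}
merged = merge_carts([node_a, node_b])
print(merged)
# prints {'items': {'coffee beans': 1, 'mug': 2, 'milk frother': 1}}
```

The point is that for a cart there is an obvious answer (keep everything), whereas for two different billing addresses there is none.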

The last scenario, using a cluster wide transaction, is actually the reverse. Usually RavenDB will jump through all sorts of hoops to ensure that it won’t lose a write, but cluster wide transactions are actually going the other way. They need to fail if they can’t ensure that they went through. In this case, you’ll usually be working on something very specific. The classic example is ensuring a unique user name in the system: we want to fail if we can’t absolutely ensure that this username is unique. But that isn’t something that we want to do all the time. Updating the LastLogin time on the user’s document is not something that you need to ensure will be consistent (and in this case, selecting the latest is also by definition the right thing to do).

I like to say that you should use a single node transaction to record that you purchased a lottery ticket, and a cluster wide transaction to record who won the lottery. That gives the right mindset about the stakes involved. I never want to lose the record of a sale, but I want to ensure that once the win is awarded, I get it absolutely right.

time to read 3 min | 549 words

Assume that you have a service that needs to accept some data from a user. Let’s say that the scenario in question is that the user wants to upload a photo that you’ll later process / aggregate / do stuff with.

How would you approach such a system? One way to do this is to do something similar to this:


The user will upload the data to your code (in the case above, a Lambda function, but it can be an EC2 instance, etc.) which will then push the data to its final location (S3, in this case). This is simple, and quite obvious to do. It is also wrong.

There is no need to involve your code in the middle. What you should do, instead, is to have the user talk directly to the end location (S3, Azure Blob Storage, Backblaze, etc). A better option would be:


In this model, we have:

  1. The user pings your code to generate a secured upload link. You can also set up an “upload only area” in storage that a user can upload files to ahead of time, removing this step.
  2. The user uploads directly to S3 (or equivalent).
  3. S3 will then ping your code when the upload is done.
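The first step can be sketched as a short-lived, signed link. This is a generic HMAC illustration of the idea and not how any specific provider signs its upload links (providers such as S3 have their own presigned URL mechanism); SECRET, the storage.example host and all function names are assumptions.

```python
import hashlib
import hmac
import time

SECRET = b"demo-secret"   # assumption: a key shared by the link issuer and the storage side


def sign(key, expires):
    payload = f"{key}:{expires}".encode()
    return hmac.new(SECRET, payload, hashlib.sha256).hexdigest()


def make_upload_link(key, ttl_seconds=300, now=None):
    """Step 1: your code hands out a short-lived link, it never sees the bytes."""
    expires = int(now if now is not None else time.time()) + ttl_seconds
    return f"https://storage.example/{key}?expires={expires}&sig={sign(key, expires)}"


def is_valid(key, expires, sig, now=None):
    """What the storage side would check before accepting the upload."""
    current = now if now is not None else time.time()
    return current <= expires and hmac.compare_digest(sig, sign(key, expires))


link = make_upload_link("photos/cat.jpg", now=1_000)
expires = 1_300
sig = link.rsplit("sig=", 1)[1]
print(is_valid("photos/cat.jpg", expires, sig, now=1_100))   # True: within the window
print(is_valid("photos/cat.jpg", expires, sig, now=2_000))   # False: expired
print(is_valid("photos/dog.jpg", expires, sig, now=1_100))   # False: different object
```

Generating the link is a few microseconds of work for your code, regardless of how large the upload turns out to be.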

Why use this approach rather than the first one?

Quite simply, because in the first example, you are paying for the processing / upload / bandwidth for the work. In the second option, it is on the cloud / storage provider to actually provision enough resources to handle this. You are paying for the storage, but nothing else.

Consider the case of a user that uploads a 5 MB image over 5 seconds. If you are using the first option, you’ll pay for the full 5 seconds of compute time if you are using something like Lambda. If you are using EC2, your machine is busy and consuming resources.

This is most noticeable if you also have to handle spikes / load. If you have 100 concurrent users, the first option will likely cost quite a lot just in the compute resources you use (either serverless or provisioned machines). In the second option, it is the cloud provider that needs to have the machines ready to accept the data, and we don’t pay for any of that.

In fact, there is a much better solution. Again, the user gets the upload link in some manner and then uploads directly to S3. At that point, instead of S3 calling you, it will push the notification to a queue (SQS) and then your code can handle this.

Here is what this looks like:


Note that in this case, you are in control of how fast or slow you want to process the data on the queue. You can set a maximum number of concurrent workers / lambdas and let the cloud infrastructure manage that for you. At this point, you can smooth any peaks that you have in the process.
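As a local illustration of capping concurrency, here is a sketch that uses a thread pool in place of the cloud queue and its workers. MAX_WORKERS, process() and the notification names are assumptions; in a real deployment the cap would be a setting on the queue consumer rather than code you write.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

MAX_WORKERS = 2          # assumed cap on concurrent processing
lock = threading.Lock()
active = 0
peak = 0


def process(notification):
    """Handle one 'upload finished' message taken from the queue."""
    global active, peak
    with lock:
        active += 1
        peak = max(peak, active)
    time.sleep(0.01)     # stand-in for the real image processing
    with lock:
        active -= 1
    return notification.upper()


notifications = [f"photo-{i}.jpg" for i in range(10)]   # a burst of uploads
with ThreadPoolExecutor(max_workers=MAX_WORKERS) as pool:
    results = list(pool.map(process, notifications))
print(len(results), peak)
```

Ten notifications arrive at once, but no more than MAX_WORKERS are ever processed at the same time; the rest simply wait in the queue.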

A lot of this is just setting up the orchestration properly so you aren’t in the way, and so that you utilize the cloud infrastructure instead of writing your own code.

Looking into Zig

time to read 5 min | 939 words

I think that it was the Pragmatic Programmer that recommended that you should learn a new language a year. For me, in 2020 that was Rust. I read a bunch of books about the language, I read a significant amount of code and wrote a non trivial amount of code in Rust. That was sufficient to get me to grok the language. I’m not a Rust developer by any means, but I walked with Rusty shoes for long enough to get the feeling.

This year, I decided to look into Zig. Both Zig and Rust are more or less in the same space, replacing C. They couldn’t be more different, however. Zig is a small language. I spent a couple of evenings going through the language reference and that was it, I had a pretty good idea about how to do things.

The learning curve is mostly flat, and that is pretty huge. This is especially because I can’t help but compare Zig to Rust. I spent a lot of effort understanding Rust, but I had spent almost no cycles trying to grok Zig. It was simple, obvious and quite clear.  In terms of power, mind, I would rate both languages on roughly the same spot. You can write really nice code in Zig, it is just that you don’t need to bend your head into funny shapes to get the compiler to accept your code.

One of the key features for Zig is its comptime feature. This is a feature that allows Zig to run code at compilation time. That isn’t a new or exciting feature, to be honest, C++ had it for many years. The key difference is that Zig can use this feature for code generation. For example, to create a generic list, you’ll write the following code:

Note that we are writing here a function that returns a type, which you can then use. That approach is incredibly powerful, but at the same time, this is simple, it is obvious.
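The "function that returns a type" idea can be shown in Python as well, with one big caveat: Python builds the class at run time, while Zig’s comptime does the equivalent work during compilation. This is an analogy and not the Zig code from the post; List() and TypedList are hypothetical names.

```python
def List(item_type):
    """Build and return a list class specialised for item_type."""

    class TypedList:
        def __init__(self):
            self.items = []

        def push(self, value):
            if not isinstance(value, item_type):
                raise TypeError(f"expected {item_type.__name__}")
            self.items.append(value)

    TypedList.__name__ = f"List({item_type.__name__})"
    return TypedList


IntList = List(int)          # calling the function produces a new type
numbers = IntList()
numbers.push(1)
numbers.push(2)
print(IntList.__name__, numbers.items)  # prints "List(int) [1, 2]"
```

There is no special generics syntax involved: the type parameter is just an ordinary argument, which is what makes the approach feel so obvious.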

Zig is easy to learn, because there isn’t a whole lot more that is hidden from you behind the scenes.

That actually leads to another really important aspect in the design of Zig. There isn’t anything behind the scenes. For example, you cannot allocate memory in Zig, there is no global function that will do that for you. You need to do so using an allocator. That means that the fact that memory allocations can fail is pervasive throughout the API, standard library and your code. There isn’t a "this may fail on rare occasions” scenario that you somehow need to handle, this is clear and in your face.

At the same time, Zig does a lot more to make things easier than C. I want to focus on a few important points:

  • Zig has the concept of Errors. In the code above, the function push() may fail because of an allocation failure. In this case, the function will fail with a return code. That is handled by the try keyword, which will abort the current function and return the error. Note that errors and regular values are separate channels in Zig (there is a union marked with ! at the function declaration).
  • Zig has support for the defer and errdefer keywords. The defer keyword works just as you would expect it to: when the enclosing scope exits, it will run all the deferred statements in reverse order. The errdefer is a lot more interesting, because it will only run if the function exits with an error. This seemingly simple change has a huge impact on the code quality and the complexity that a developer needs to keep in their head.
  • Zig has built-in testing, to the point where test is a keyword in the language.

To give you some context, when I was writing C code, I literally wrote the exact same thing (manually, with macros and nastiness) in order to help me get things done.

In the same manner, the fact that allocations are explicit and managed from the top (all types that need to allocate get the allocator from their parents) means that you get to do some really cool things with memory. It is easy to say something like “this piece of code gets 10MB of memory only” and let it run like that. It also ends up creating a more robust software system, I think, because dealing with memory allocations isn’t a rare occurrence, it happens all the time.
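A toy model of "this piece of code gets 10MB of memory only", written in Python purely to show the shape of the idea. BudgetAllocator and Buffer are hypothetical names and this is not Zig’s allocator interface; the point is that the allocator is handed down from the parent and that failure is an expected outcome.

```python
class BudgetAllocator:
    """Hands out buffers until a fixed byte budget is used up."""

    def __init__(self, budget_bytes):
        self.remaining = budget_bytes

    def alloc(self, size):
        if size > self.remaining:
            raise MemoryError(f"requested {size} bytes, {self.remaining} left")
        self.remaining -= size
        return bytearray(size)


class Buffer:
    """A type that allocates only through the allocator its parent gave it."""

    def __init__(self, allocator, size):
        self.data = allocator.alloc(size)


allocator = BudgetAllocator(10 * 1024 * 1024)      # "10MB of memory only"
first = Buffer(allocator, 8 * 1024 * 1024)
try:
    Buffer(allocator, 4 * 1024 * 1024)             # would exceed the budget
    over_budget = False
except MemoryError:
    over_budget = True
print(over_budget, allocator.remaining)  # prints "True 2097152"
```

Because every allocation goes through an object that can say no, the failure path gets exercised routinely instead of only on the worst day in production.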

In general, Zig feels like a much better C, with no additional mental overhead. Compared to Rust, you can get working almost immediately and the compilation speed is excellent, to the point where you don’t really need to think about it. Rust makes you feel the slow compilation cost from the get go, basically, which is noticeable as your system grows bigger.

Thinking about this, I actually feel that we should compare Zig to Go, because it is closer in concept to what I think Go wanted to be. In fact, looking at the most common complaints people have against Go, Zig answers them all.

If you haven’t noticed, I’m quite enjoying working with Zig.

And as an aside, the fact that a language can implement a language server and get automatic IDE support is freaking amazing. You can also debug Zig code inside VS Code, for example, pretty much with no more issues than you would for native code. Zig is implemented on top of LLVM and gains a lot of the benefits from it.

One thing that kept going through my mind when I looked at all that I got out of the package is: standing on the shoulders of giants.

time to read 3 min | 511 words

I’m teaching a course in university, which gives me some interesting perspective into the mind of new people who join our profession.

One of the biggest differences that I noticed was with the approach to software architecture and maintenance concerns.  Frankly, some of the the exercises that I had to review made my eyes bleed a little (the students got full marks, because the point was getting things done, not code quality). I talked with the students about the topic and I realized that I have a very different perspective on software development and architecture than they have.

The codebase that I work with the most is RavenDB. I have been working on the project for the past 12 years, with some pieces of code going back closer to two decades. In contrast, my rule for giving tasks to students is that I must be able to complete the task in under two hours, starting from a blank slate.

Part and parcel of the way that I’m thinking about software is the realization that any piece of code that I’ll write is going to be maintained for a long period of time. A student writing code for a course doesn’t have that approach. In fact, it is rare that they use the same code across semesters. That leads them to see a lot of practices as unnecessary or superfluous. Even some of the things that I consider to be the very basics (source control, tests, build scripts) are things that the students didn’t even encounter up to this point (years 2 and 3 for most of the people I interact with) and they may very well get a degree with no real exposure to those concerns.

Most tasks in university are well scoped, clear and they are known to be feasible within the given time frame. Most of the tasks outside of university are anything but.

That got me thinking about how you can get a student to realize the value inherent in industry best practices, and the only real way to do that is to immerse them in a big project, something that has been around for at least 3 – 5 years. Ideally, you could have some project that the students will do throughout the degree, but that requires a massive amount of coordination upfront. It is likely not feasible outside of specific fields. If you are learning to be a programmer in the gaming industry, maybe you can do something like produce a game throughout the degree, but my guess is that this is still not possible.

A better alternative would be to give students the chance to work with a large project, maybe even contributing code to it. The problem there is that having a whole class start randomly sending pull requests to a project is likely to cause some heartburn to the maintenance staff.

What was your experience when moving from single use, transient projects to projects that are expected to run for decades? Not as a single running instance, just a project that is going to be kept alive for a long while…

time to read 5 min | 955 words

The Open Closed Principle is part of the SOLID principles. It isn’t new or anything exciting, but I wanted to discuss it today in the context of using it not as a code artifact but as part of your overall architecture.

The Open Closed Principle states that the code should be open for extension, but closed for modification. That is a fancy way to say that you should spend most of your time writing new code, not modifying old code. Old code is something that is known to be working, it is stable (hopefully), but messing around with old code can break that. Adding new code, on the other hand, carries far less risk. You may break the new things, but the old stuff will continue to work.

There is also another aspect to this: to successfully add new code to a project, you should have a structure that supports that. In other words, you typically have a very small core of functionality and then the entire system is built on top of this.

Probably the best example of systems that follow the Open Closed Principle is the vast majority of PHP applications.

Hold up, I can hear you say. Did you just call out PHP as an architectural best practice? Indeed I did, and more than that, the more basic the PHP application in question, the closer it is to the ideal of the Open Closed Principle.

Consider how you’ll typically add a feature to a PHP application. You’ll create a new script file and write the functionality there. You might need to add links to that (or you already have this happen automatically), but that is about it. You aren’t modifying existing code, you are adding new code. The rest of the system just knows how to respond to that and handle it appropriately.

Your shared components might be the site’s menu, a site map and the like. Adding new functionality may occasionally involve adding a link to a new page, but for the most part, all of those operations are safe, isolated and independent from one another.

In C#, on the other hand, you can do the same by adding a new class to a project. It isn’t at the same level of not even touching anything else, since it all compiles to a single binary, but the situation is roughly the same.

That is the Open Closed Principle when it applies to the code inside your application. What happens when you try to apply the same principle to your overall architecture?

I think that Terraform is a great example of doing just that. They have a plugin system that they built, which spawns a new process (so completely independent) and then connects to it via gRPC. Adding a new plugin to Terraform doesn’t involve modifying any code (you do have to update some configuration, but even that can be automated away). You can write everything using separate systems, runtimes and versions quite easily.
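To make the “plugin as a separate process” idea concrete, here is a minimal sketch. It is not Terraform’s actual protocol: it assumes a single JSON message over stdin/stdout instead of gRPC, and `PLUGIN_SOURCE` and `call_plugin` are hypothetical names used only for illustration.

```python
import json
import subprocess
import sys

# Hypothetical plugin source. In a real system this would be a separate
# executable, built and versioned independently of the host.
PLUGIN_SOURCE = """
import json, sys
request = json.loads(sys.stdin.readline())
response = {"result": request["value"].upper()}
print(json.dumps(response))
"""


def call_plugin(value: str) -> str:
    """Spawn the plugin in its own process and exchange one JSON message."""
    completed = subprocess.run(
        [sys.executable, "-c", PLUGIN_SOURCE],
        input=json.dumps({"value": value}) + "\n",
        capture_output=True,
        text=True,
        check=True,
    )
    return json.loads(completed.stdout)["result"]
```

The host only knows the message format. The plugin could be written in any language, and a crash in it cannot corrupt the host’s memory.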

If we push the idea a bit further, we’ll discover that the Open Closed Principle at the architecture level is the Service Oriented Architecture. Note that I explicitly don’t count Microservices in this role, because they are usually intermixed (yes, I know they aren’t supposed to be, I’m talking about what is).

In those situations, adding a new feature to the system would involve adding a new service. For example, in a banking system, if you want to add a new feature to classify fraudulent transactions, how would you do it?

One way is to go to the transaction processing code and write something like:

That, of course, would mean that you are going to have to modify existing code, which is not a good idea. Welcome to six months of meetings about when you can deploy your changes to the code.
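For concreteness, the inline approach might look something like the following sketch. All the names here (`FraudSuspectedError`, `looks_fraudulent`, `process_transaction`) and the threshold rule are assumptions made up for this example, not code from a real banking system.

```python
class FraudSuspectedError(Exception):
    """Raised to abort a transaction that looks fraudulent."""


def looks_fraudulent(transaction: dict) -> bool:
    # Placeholder rule, purely for illustration.
    return transaction["amount"] > 10_000


def process_transaction(transaction: dict, ledger: list) -> None:
    # The new fraud check is wedged into existing, previously stable code.
    if looks_fraudulent(transaction):
        raise FraudSuspectedError(f"transaction {transaction['id']} rejected")
    ledger.append(transaction)
```

The check is simple, but it lives inside the function that every transaction goes through, so any mistake in it affects all of them.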

On the other hand, applying the Open Closed Principle to the architecture, we won’t ever touch the actual system that processes transactions. Instead, we’ll use a side channel. Transactions will be written to a queue and we’ll be able to add listeners to the queue. In such a way, we’ll have the ability to add additional processing seamlessly. Another fraud system will just have to listen to the stream of messages and react accordingly.
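A minimal sketch of that side channel, assuming an in-process `TransactionQueue` as a stand-in for a real durable queue (every name here is hypothetical):

```python
from typing import Callable

Transaction = dict
Listener = Callable[[Transaction], None]


class TransactionQueue:
    """Hypothetical in-process stand-in for a durable message queue."""

    def __init__(self) -> None:
        self._listeners: list[Listener] = []

    def subscribe(self, listener: Listener) -> None:
        self._listeners.append(listener)

    def publish(self, transaction: Transaction) -> None:
        # Every listener sees every transaction, after it was applied.
        for listener in self._listeners:
            listener(transaction)


queue = TransactionQueue()
ledger: list[Transaction] = []
alerts: list[int] = []


def process_transaction(transaction: Transaction) -> None:
    # The core system only applies the transaction and publishes it.
    ledger.append(transaction)
    queue.publish(transaction)


def fraud_listener(transaction: Transaction) -> None:
    # Added later, without touching process_transaction.
    if transaction["amount"] > 10_000:
        alerts.append(transaction["id"])


queue.subscribe(fraud_listener)
process_transaction({"id": 1, "amount": 50})
process_transaction({"id": 2, "amount": 50_000})
```

Note that both transactions end up in the ledger. The listener can flag the second one, but it cannot prevent it.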

Note that there is a big difference here, however. Unlike with modifying the code directly, we can no longer just throw an exception to stop the process. By the time that we process the message, the transaction has already been applied. That requires us to build the system in such a way that there are ways to stop transactions after the fact (maybe by actually submitting them to the central bank only after a certain amount of time, or releasing them to the system only after all the configured endpoints have authorized them).

At the architecture level, we are intentionally building something that is initially more complex, because we have to take into account asynchronous operations and work that happens out of band, including work that we couldn’t expect. In the context of a bank, that means that we need to provide the mechanisms for future code to intervene. For example, we may not know what we’ll want the additional code to do, but we’ll have a way to do things like pause a transaction for manual review, add additional fees, raise alerts, etc. Those are the capabilities of the system, and the additional behavior would be policy built on top of them.
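The split between capabilities and policy can be sketched like this. It is only an illustration under stated assumptions: `TransactionControls`, `PendingTransaction` and `fraud_policy` are hypothetical names, and a real system would persist this state rather than keep it in memory.

```python
from dataclasses import dataclass, field


@dataclass
class PendingTransaction:
    id: int
    amount: float
    fees: float = 0.0
    holds: list[str] = field(default_factory=list)


class TransactionControls:
    """Capabilities the core system exposes; it knows nothing about fraud."""

    def __init__(self) -> None:
        self.pending: dict[int, PendingTransaction] = {}
        self.alerts: list[str] = []

    def submit(self, tx: PendingTransaction) -> None:
        self.pending[tx.id] = tx

    def pause_for_review(self, tx_id: int, reason: str) -> None:
        self.pending[tx_id].holds.append(reason)

    def add_fee(self, tx_id: int, amount: float) -> None:
        self.pending[tx_id].fees += amount

    def raise_alert(self, message: str) -> None:
        self.alerts.append(message)

    def release_ready(self) -> list[int]:
        """Release every pending transaction that has no outstanding hold."""
        ready = [tx_id for tx_id, tx in self.pending.items() if not tx.holds]
        for tx_id in ready:
            del self.pending[tx_id]
        return ready


def fraud_policy(controls: TransactionControls, tx: PendingTransaction) -> None:
    # Policy, added later, expressed only in terms of existing capabilities.
    if tx.amount > 10_000:
        controls.pause_for_review(tx.id, "large amount")
        controls.raise_alert(f"review transaction {tx.id}")


controls = TransactionControls()
for tx in (PendingTransaction(1, 50.0), PendingTransaction(2, 50_000.0)):
    controls.submit(tx)
    fraud_policy(controls, tx)
released = controls.release_ready()
```

The core never learns what “fraud” means. It only offers a way to hold, charge or flag a transaction, and whatever policy shows up later decides when to use it.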

There are other things that make this very attractive: you don’t have to run everything at the same time, you can independently upgrade different pieces and you have clear lines of demarcation between the different pieces of your system.

time to read 2 min | 253 words

Conceptually, a thread and a task are very similar. That is very much by design, since the Task is meant to allow you to work with asynchronous code while maintaining the illusion that you are running in a sequential manner. It is tempting to think about this in terms of the Task actually executing the work, but that isn’t actually the case.

The Task doesn’t represent the execution of whatever asynchronous process is running. The Task represents a ticket that will be punched when the asynchronous process is done. Consider the case of going to a restaurant and asking for a table. If there isn’t an available table, you cannot be seated. What the restaurant will do is hand you a pager that will buzz when the table is ready. In the same sense, a Task is just such a token. The restaurant pager doesn’t mean that someone is actively clearing a table for you. It is just something that will buzz when a table is ready.

A code sample may make things clearer:

In this case, we are manually coordinating the Task using its completion source and you can see that the Task instance that was handed out when trying to get a table doesn’t actually start anything. It is simply waiting to be completed when a table becomes ready.

That is an important aspect of how System.Threading.Tasks.Task works, because it usually contradicts the mental model in our heads.
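The pager analogy can be sketched as follows. This is Python rather than C#, with `asyncio.Future` standing in as a rough equivalent of a Task driven by a completion source; the `Restaurant` class and its methods are hypothetical names for illustration only.

```python
import asyncio


class Restaurant:
    """Hands out 'pagers' (futures) and buzzes them when a table frees up."""

    def __init__(self) -> None:
        self._waiting: list[asyncio.Future] = []

    def get_table(self) -> asyncio.Future:
        # Creating the future starts no work; it is only a token.
        pager = asyncio.get_running_loop().create_future()
        self._waiting.append(pager)
        return pager

    def table_ready(self, table_number: int) -> None:
        # Someone else completes the future, whenever they get to it.
        if self._waiting:
            self._waiting.pop(0).set_result(table_number)


async def demo() -> tuple[bool, int]:
    restaurant = Restaurant()
    pager = restaurant.get_table()
    await asyncio.sleep(0)        # give the event loop a chance to run
    done_before = pager.done()    # nothing is running on our behalf
    restaurant.table_ready(7)     # the "buzz"
    table = await pager
    return done_before, table
```

Running `asyncio.run(demo())` shows that holding the pager does nothing by itself. The future only completes when `table_ready` is called explicitly.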

