In this talk, Oren Eini, founder of the RavenDB NoSQL database, will speak about how you can design and architect software systems that can easily scale to stupendous sizes, without causing either headache or pinching your wallet. It is obvious that if you chose the right algorithm for a particular task, you can gain tremendous benefits in terms of efficiency. What isn't as obvious is that the same apply to the global system architecture. In this talk, we aren't going to cover specific products or implementation. Rather, we'll cover architectural styles and approaches that withstood the test of time and allow you to create systems that are simple and cheap to operate while giving you the flexibility to grow. Arguably even more important, large complex systems require ongoing maintenance and care. We'll look into how we can provide insight and visibility into the innards of the system without putting too much of a strain on the development team. How we can enable operations features such as rolling deployment, recovery and audits at ease.
I gave a fast paced talk in the .NET Conf 2021 about architecting for performance, you can see it here:
I spoke at the Donetos conference about how to design your system for high performance, using RavenDB’s story as the backdrop. I think it went great.
I was pointed to the Odin language after my post about the Zig language. On the surface, Odin and Zig are very similar, but they have some fundamental differences in behavior and mindset. I’m basing most of what I’m writing here on admittedly cursory reading of the Odin language docs and this blog post.
Odin has a great point on conditional compilation. The if statements that are evaluated at compile time are hard to distinguish. I like Odin’s when clauses better, but Zig has comptime if as well, which make it easier. The actual problem I have with this model in Zig is that it is easy to get to a situation where you write (new) code that doesn’t get called, but Zig will detect that it is unused and not bother compiling it. When you are actually trying to use it, you’ll hit a lot of compilation errors that you need to fix. This is in contrast to the way I would usually work, which is to almost always have the code in compliable state and leaning hard on the compiler to double check my work.
Beyond that, I have grave disagreements with Ginger, the author of the blog post and the Odin language. I want to pull just a couple of what I think are the most important points from that post:
I have never had a program cause a system to run out of memory in real software (other than artificial stress tests). If you are working in a low-memory environment, you should be extremely aware of its limitations and plan accordingly. If you are a desktop machine and run out of memory, don’t try to recover from the panic, quit the program or even shut-down the computer. As for other machinery, plan accordingly!
This is in relation to automatic heap allocations (which can fail, which will usually kill the process because there is no good way to report it). My reaction to that is “640KB is enough for everything”, right?
To start with, I write databases for a living. I run my code on containers with 128MB when the user uses a database that is 100s of GB in size. Even if running on proper server machines, I almost always have to deal with datasets that are bigger than memory. Running out of memory happens to us pretty much every single time we start the program. And handling this scenario robustly is important to building system software. In this case, planning accordingly in my view is not using a language that can put me in a hole. This is not theoretical, that is real scenario that we have to deal with.
The biggest turnoff for me, however, was this statement on errors:
…my issue with exception-based/exception-like errors is not the syntax but how they encourage error propagation. This encouragement promotes a culture of pass the error up the stack for “someone else” to handle the error. I hate this culture and I do not want to encourage it at the language level. Handle errors there and then and don’t pass them up the stack. You make your mess; you clean it.
I didn’t really know how to answer that at first. There are so many cases where that doesn’t even make sense that it isn’t even funny. Consider a scenario where I need to call a service that would compute some value for me. I’m doing that as gRPC over TCP + SSL. Let me count the number of errors that can happen here, shall we?
- Bad reaction on the service (run out of memory, for example).
- Argument passed is not a valid one
- Invalid SSL certificate
- Authentication issues
- TCP firewall issue
- DNS issue
- Wrong URL / port
My code, which is calling the service, need to be able to handle any / all of those. And probably quite a few more that I didn’t account for. Trying to build something like that is onerous, fragile and doesn’t really work. For that matter, if I passed the wrong URL for the service, what is the code that is doing the gRPC call supposed to do but bubble the error up? If the DNS is returning an error, or there is a certificate issue, how do you clean it up? The only reasonable thing to do is to give as much context as possible and raise the error to the caller.
When building robust software, bubbling it up so the caller can decide what to do isn’t about passing the back, it is a b best practice. You only need to look at Erlang and how applications with the highest requirements for reliability are structured. They are meant to fail, error handling and recovery is something that happens in dedicated (supervisors) locations, because these places has the right context to make an actual determination.
The killer impact of this, however, is that Zig has explicit notion of errors, while Odin relies on the multiple return values system. We have seen how good that is with Go. In fact, one of the most common issues with Go is the issue with how much manual work it takes to do proper error handling.
But I think that the key issue here is that errors as a first class aspect of the language gives us a very powerful ability, errdefer. This single language feature is the reason I think that Zig is an amazing language. The concept of first class errors combine with errdefer makes building complex structures so much easier.
Consider the following code:
Note that I’m opening a file, mapping it to memory, validating its size and then that it has the right hash. I’m using defer to ensure that I cleanup the file handle, but what about the returned memory, in this case, I want to clean it up if there is an error, but not otherwise.
Consider how you would write this code without errdefer. I would have to add the “close the map” portion to both places where I want to return an error. And what happens if I’m using more than a couple of resources, I may be needing to do something that require a file, network socket, memory, etc. Any of those operations can fail, but I want to clean them up only on failure. Otherwise, I need to return them to my caller. Using errdefer (which relies on the explicit distinction between regular returns and errors) will ensure that I don’t have a problem. Everything works, and the amount of state that I have to keep in my head is greatly reduce.
Consider how you’ll that that in Odin or Go, on the other hand, and you can see how error handling become a big beast. Having explicit language support to assist in that is really nice.
It turns out that there were quite a lot podcasts and videos that we took part of recently, enough that I didn’t get to talk about all of them. This post is to clear the queue, so to speak.
What Is a noSQL Database? – Dejan, our developer advocate, is discussing the high level details of non relational databases and how they fit into the ecosystem.
Getting started with RavenDB – Dejan is talking about how to get started using RavenDB, including live demos.
Applying BDD techniques using RavenDB – Dejan will show you how you can build a behavior driven design system while leaning on RavenDB’s capabilities.
Live demoing RavenDB – I talk about RavenDB and take you for a walk through all its features. This video is close to two hours and cover many of the interesting bits of RavenDB and its design.
Interview with Oren Eini – Here I talk about my career and how I got to building RavenDB
The Birth of RavenDB – This is a podcast in which I discuss in details how I got started working in the databases field and how I ended up here .
In this episode of Coffee with Pros , we have special guest speaker Oren Eini. He is the CEO of RavenDB, a NoSQL Distributed Database that's Fully Transactional (ACID) both across your database and throughout your database cluster. RavenDB Cloud is the Managed Cloud Service (DBaaS) for easy use.
I’m teaching a course in university, which gives me some interesting perspective into the mind of new people who join our profession.
One of the biggest differences that I noticed was with the approach to software architecture and maintenance concerns. Frankly, some of the the exercises that I had to review made my eyes bleed a little (the students got full marks, because the point was getting things done, not code quality). I talked with the students about the topic and I realized that I have a very different perspective on software development and architecture than they have.
The codebase that I work with the most is RavenDB, I have been working on the project for the past 12 years, with some pieces of code going back closer to two decades. In contrast, my rule for giving tasks for students is that I can complete the task in under two hours from an empty slate.
Part and parcel of the way that I’m thinking about software is the realization that any piece of code that I’ll write is going to be maintained for a long period of time. A student writing code for a course doesn’t have that approach, in fact, it is rare that they use the same code across semesters. That lead to seeing a lot of practices as unnecessary or superfluous. Even some of the things that I consider as the very basic (source control, tests, build scripts) are things that the students didn’t even encounter up to this point (years 2 and 3 for most of the people I interact with) and they may very well get a degree with no real exposure for those concerns.
Most tasks in university are well scoped, clear and they are known to be feasible within the given time frame. Most of the tasks outside of university are anything but.
That got me thinking about how you can get a student to realize the value inherent in industry best practices, and the only real way to do that is to immerse them in a big project, something that has been around for at least 3 – 5 years. Ideally, you could have some project that the students will do throughout the degree, but that requires a massive amount of coordination upfront. It is likely not feasible outside of specific fields. If you are learning to be a programmer in the gaming industry, maybe you can do something like produce a game throughout the degree, but my guess is that this is still not possible.
A better alternative would be to give students the chance to work with a large project, maybe even contributing code to it. The problem there is that having a whole class start randomly sending pull requests to a project is likely to cause some heartburn to the maintenance staff.
What was your experience when moving from single use, transient projects to projects that are expected to run for decades? Not as a single running instance, just a project that is going to be kept alive for a long while…
Dejan, our developer advocate, is live coding with RavenDB in this video cast:
Yesterday I asked about dealing with livelihood detection of nodes running in AWS. The key aspect is that this need to be simple to build and easy to explain.
Here are a couple of ways that I came up with, nothing ground breaking, but they do the work while letting someone else do all the heavy lifting.
Have a well known S3 bucket that each of the nodes will write an entry to. The idea is that we’ll have something like (filename – value):
- i-04e8d25534f59e930 – 2021-06-11T22:01:02
- i-05714ffce6c1f64ad – 2021-06-11T22:00:49
The idea is that each node will scan the bucket and read through each of the files, getting the last seen time for all the nodes. We’ll consider all the nodes whose timestamp is within the last 1 minute to be alive and any other node is dead. Of course, we’ll also need to update the node’s file on S3 every 30 seconds to ensure that other nodes know that we are alive.
The advantage here is that this is trivial to explain and implement and it can work quite well in practice.
The other option is to actually piggy back on top of the infrastructure that is dedicated for this sort of scenario. Create an elastic load balancer and setup a target group. On startup, the node will register itself to the target group and setup the health check endpoint. From this point on, each node can ask the target group to find all the healthy nodes.
This is pretty simple as well, although it requires significantly more setup. The advantage here is that we can detect more failure modes (a node that is up, but firewalled away, for example).
Other options, such as having the nodes ping each other, are actually quite complex since they need to find each other. That lead to some level of service locator, but then you’ll have to avoid each node pining all the other nodes, since that can get busy on the network.