Byzantine and non-Byzantine distributed systems
The underlying assumption for any distributed system is that the network is hostile. That assumption is pervasive. If you open a socket, you have to be aware of malicious people on the other side (and in the middle). If you accept input from external sources, you have to be careful, because it may be carefully crafted to do Bad Things. In short, the network is hostile, and you need to protect yourself from harm at all levels.
In many cases, you already have multiple layers of protection. For example, when building web applications, the HTTP server already validates that the incoming stream follows the HTTP protocol. Even ignoring maliciousness, you will get services that connect to you using the wrong protocol and create havoc. For reference, look at 1347703880, 1213486160 and 1195725856 and the issues they cause. As it turns out, these are relatively benign issues, because they are caught almost immediately. In the real world, the network isn’t only hostile, it is also smart.
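If those numbers look arbitrary, a quick decode shows what is going on: they are the first four bytes of an HTTP message, accidentally read as a 32-bit integer by a binary protocol that expected a length or opcode field.

```python
import struct

# The first four bytes of an HTTP message, read as a 32-bit integer
# by a binary (length-prefixed) protocol on the other side:
print(struct.pack(">I", 1195725856))  # b'GET '  - an HTTP request, big-endian
print(struct.pack(">I", 1213486160))  # b'HTTP'  - an HTTP response, big-endian
print(struct.pack("<I", 1347703880))  # b'HTTP'  - an HTTP response, little-endian
```

So a server logging a request for 1.2 GB of memory out of nowhere may simply be a binary protocol that was handed an HTTP request.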
The problem was originally posed by Lamport in the Byzantine Generals paper. You have a group of generals that need to agree on a particular time to attack a city. They can only communicate by (unreliable) messenger, and one or more of them are traitors. The paper itself is interesting to read, and the problem is pervasive enough that we now divide distributed systems into Byzantine and non-Byzantine systems. We now have pervasive cryptography deployed, to the point where you are reading this post over an encrypted channel, verified using public key infrastructure to validate that it indeed came from me. You can solve the Byzantine generals problem easily now.
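The core of the cryptographic fix is the signed-messages idea from the paper: a traitor relaying an order cannot alter it without invalidating the signature. As a minimal sketch, this uses an HMAC with a shared key from Python's standard library; real deployments use asymmetric signatures and PKI, and the key and order values here are purely illustrative.

```python
import hashlib
import hmac

# Illustrative shared key - real systems would use per-node asymmetric keys.
KEY = b"shared-secret-for-illustration"

def sign(order: bytes) -> bytes:
    """Attach a message authentication code to an order."""
    return hmac.new(KEY, order, hashlib.sha256).digest()

def verify(order: bytes, signature: bytes) -> bool:
    """Check that a relayed order was not tampered with in transit."""
    return hmac.compare_digest(sign(order), signature)

order = b"attack at dawn"
sig = sign(order)
assert verify(order, sig)                    # the untampered order passes
assert not verify(b"retreat at dawn", sig)   # a traitor's altered order fails
```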
Today, the terminology has changed. We now refer to Byzantine networks as systems where some of the nodes are malicious, and non-Byzantine as systems where we trust that the other nodes will do their task. For example, Raft and Paxos are both distributed consensus algorithms that assume a non-Byzantine system. Oh, the network communication goes through a hostile environment, but that is what we have TLS for. Authentication and encryption over the wire are mostly a solved problem at this point. It isn’t a simple problem, but it is a solved one.
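How solved? In Python, for instance, the default TLS context already does the non-Byzantine groundwork for you: it verifies the peer's certificate chain and hostname before you trust a single byte on the channel. A minimal sketch (the hostname `db-node-2.example` is made up for illustration):

```python
import ssl

# The default context verifies the peer's certificate against the system's
# trusted CAs and checks that the hostname matches the certificate.
ctx = ssl.create_default_context()
assert ctx.verify_mode == ssl.CERT_REQUIRED
assert ctx.check_hostname

# To actually talk to another node, you would wrap a connected socket:
#   with ctx.wrap_socket(sock, server_hostname="db-node-2.example") as tls:
#       tls.sendall(b"...")
```

The point is that the transport layer gives you authenticated, encrypted channels off the shelf; what it does not give you is any guarantee about what an authenticated node will do afterwards.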
So where would you run into Byzantine systems today? The obvious examples are cryptocurrencies and BitTorrent. In both cases, you have a distributed environment with incentives for the other side to cheat. In the case of cryptocurrencies, this is handled by proof of work / proof of stake, as well as the cost of getting a 51% majority on the network. In the case of BitTorrent, it is an attempt to get peers to both download and upload. These examples are the first ones that pop to mind, but in reality, the most common tool used with a Byzantine network is the browser.
The browser has to assume that every site is malicious, and every web server has to assume that every client is malicious. You only have to read through the OWASP listings to understand that.
And how does this relate to databases? Distributed databases are composed of independent nodes, which cooperate to store and process your data. Most of the generally available database systems assume a non-Byzantine model. In other words, they authenticate the other nodes, but once past the authentication, the other node is trusted to operate as expected.
For the most part, that is a reasonable assumption to make. You run your database on machines that you tend to trust, after all. And assuming a non-Byzantine system allows for a drastically simpler design and much higher performance.
However, we are starting to see more and more systems deployed on the edge. And that raises an interesting question: who controls the edge? Let’s assume that we have a traffic monitoring system, based on software that is running on your phone. While it may all be part of a single system, you have to take into account that you are now running on a device that is controlled by someone else, who may modify it at will.
That leads to interesting issues with regards to the design of such a system. On the one hand, you want to get data from the nodes in the field, but on the other hand, you need to be careful about trusting those nodes.
How would you approach such a system? Keep in mind that you want to reduce, as much as possible, the complexity of the system while not breaching its security.
Comments
I would probably introduce a hard distinction: systems you can trust and systems you cannot trust. Systems you cannot trust have a specific set of permissions, e.g. they are not eligible for Raft and other "admin" operations. Ideally, I would only allow them to push data - they should not be able to modify or delete data. Naturally, requests from these edge systems have to be validated, e.g. that they do not try to do any operation not allowed for them - if the request is malicious, terminate it. Probably a careful review of the API surface would be required, too, to avoid any malformed requests which can take down your system. Oh, that leads to my last point: we need to design access controls etc. in a resilient manner, so that bad edge systems cannot take down the cluster, e.g. circuit breakers.
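The push-only restriction described above could be sketched as an allow-list check at the request boundary; all of the names here (`EDGE_ALLOWED`, `handle_request`, the operation strings) are hypothetical, just to make the idea concrete:

```python
# Hypothetical sketch: untrusted edge nodes get a hard allow-list.
# Push-only means append is permitted; updates, deletes, and cluster
# (consensus/admin) operations are not.
EDGE_ALLOWED = {"append"}

def handle_request(node_is_trusted: bool, operation: str) -> bool:
    """Return True if the operation may proceed, False if it must be rejected."""
    if node_is_trusted:
        return True
    return operation in EDGE_ALLOWED

assert handle_request(False, "append")          # edge nodes can push data
assert not handle_request(False, "delete")      # but not modify or delete it
assert not handle_request(False, "raft_vote")   # and never join consensus
assert handle_request(True, "delete")           # trusted nodes are unrestricted
```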
COVID-19 tracking apps come to mind; the official German service uses verification by phone and hands out a token: https://github.com/corona-warn-app/cwa-verification-portal
Christian,
What you suggest is actually really complex. If you have untrusted nodes as part of the cluster, that is bad.
There are Raft variants that handle this (with signatures), but they are complex and much more costly. The review of the API you suggested is something that would have an impact on the overall design of the system and affect pretty much everything there.