Answers: Detecting liveness in a distributed cluster
Yesterday I asked about dealing with liveness detection of nodes running in AWS. The key aspect is that this needs to be simple to build and easy to explain.
Here are a couple of ways that I came up with. Nothing groundbreaking, but they do the work while letting someone else do all the heavy lifting.
Have a well known S3 bucket that each of the nodes will write an entry to. The idea is that we’ll have something like (filename – value):
- i-04e8d25534f59e930 – 2021-06-11T22:01:02
- i-05714ffce6c1f64ad – 2021-06-11T22:00:49
Each node will scan the bucket and read through each of the files, getting the last seen time for all the nodes. We’ll consider all the nodes whose timestamp is within the last minute to be alive, and any other node to be dead. Of course, each node also needs to update its own file on S3 every 30 seconds to ensure that other nodes know that it is alive.
The advantage here is that this is trivial to explain and implement and it can work quite well in practice.
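Here is a minimal sketch of the S3 approach in Python, to make the moving parts concrete. It assumes a boto3 S3 client is passed in as `s3` (for example `boto3.client("s3")`). The bucket name `my-cluster-heartbeats` and the function names are made up for illustration, and I'm assuming the timestamps are stored in UTC.

```python
from datetime import datetime, timedelta, timezone

BUCKET = "my-cluster-heartbeats"      # hypothetical bucket name
ALIVE_WINDOW = timedelta(minutes=1)   # nodes seen within this window are alive
TIME_FORMAT = "%Y-%m-%dT%H:%M:%S"     # same shape as the sample entries above


def write_heartbeat(s3, node_id, now=None):
    """Write this node's current time to its own file. Call every 30 seconds."""
    now = now or datetime.now(timezone.utc)
    s3.put_object(Bucket=BUCKET, Key=node_id,
                  Body=now.strftime(TIME_FORMAT).encode("utf-8"))


def read_heartbeats(s3):
    """Scan the bucket and return {node_id: last_seen} for every file in it."""
    seen = {}
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=BUCKET):
        for obj in page.get("Contents", []):
            body = s3.get_object(Bucket=BUCKET, Key=obj["Key"])["Body"].read()
            stamp = datetime.strptime(body.decode("utf-8"), TIME_FORMAT)
            seen[obj["Key"]] = stamp.replace(tzinfo=timezone.utc)
    return seen


def alive_nodes(seen, now=None):
    """Return the sorted ids of nodes whose last heartbeat is recent enough."""
    now = now or datetime.now(timezone.utc)
    return sorted(node for node, stamp in seen.items()
                  if now - stamp <= ALIVE_WINDOW)
```

A node would call `write_heartbeat` on a 30 second timer and `alive_nodes(read_heartbeats(s3))` whenever it needs the current view of the cluster. Note that this relies on the nodes' clocks being roughly in sync, since each node compares the other nodes' timestamps against its own clock.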
The other option is to piggyback on top of the infrastructure that is dedicated to this sort of scenario. Create an Elastic Load Balancer and set up a target group. On startup, the node will register itself with the target group and set up the health check endpoint. From this point on, each node can ask the target group for the list of all the healthy nodes.
This is pretty simple as well, although it requires significantly more setup. The advantage here is that we can detect more failure modes (a node that is up, but firewalled away, for example).
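As a rough sketch of the target group approach, again in Python: the code below assumes a boto3 `elbv2` client is passed in (for example `boto3.client("elbv2")`), and that the target group and its health check were already created. The function names and parameters are made up for illustration.

```python
def register_self(elbv2, target_group_arn, instance_id, port):
    """On startup, add this node to the target group."""
    elbv2.register_targets(
        TargetGroupArn=target_group_arn,
        Targets=[{"Id": instance_id, "Port": port}],
    )


def healthy_nodes(elbv2, target_group_arn):
    """Ask the target group which registered nodes pass the health check."""
    response = elbv2.describe_target_health(TargetGroupArn=target_group_arn)
    return sorted(
        description["Target"]["Id"]
        for description in response["TargetHealthDescriptions"]
        if description["TargetHealth"]["State"] == "healthy"
    )
```

Here the load balancer does the probing, so a node only needs to answer on its health check endpoint and call `healthy_nodes` when it wants the current cluster membership.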
Other options, such as having the nodes ping each other, are actually quite complex, since the nodes need to find each other first. That leads to some level of service locator, and then you’ll have to avoid each node pinging all the other nodes, since that can get busy on the network.
Comments
Perhaps I did not understand the question completely but, if you have a distributed cluster, isn't it kind of necessary that the nodes know about each other? Wouldn't you need to have all nodes communicate via network anyway (peer-to-peer)? If the nodes do not know about each other, then they probably know some central service (master-slave)? If that is so, then you can already know which services are live or not by having a protocol where you can ask any one of them for the 'status' of the cluster.
Dalibor,
Yes, the question is how do you discover the additional nodes.
The question is how that presence discovery is done.