A note on distributed election algorithms

2018-12-08 · 1220 words · 6 minute read

distributed-systems · english

Distributed election is a problem of a choosing the single process, named “coordinator”, among the group of a concurrently running processes in a distributed systems. Furthermore, each process in a system must be aware of who the coordinator is, and all of them must agree on that.

This article largely relates on work “Elections in a Distributed Computing” by H. Garcia-Molina - ‎1982

Environment model

There are two possible environment models, reflecting guaranties for algorithms performance: weak and strong. The first algorithm would require the strong model, the second - only the weak. The key difference between those models is a communication network reliability.

First of all, each process has a unique id. Furthermore, each process knows ids of each other process within the system.

Process internal state

At any given point of time each process can be in one of four states:

Down - a process cannot perform its duties currently
Election - process participates in an election
Reorganization - a new coordinator is selected, process interacts with it in order to adjust its program for new network topology
Normal - process finished reorganization, coordinator now is redundant

Each process has some atomic non-volatile storage (designated $S(i)$), which persist even when a process is in Down state. It consists of several fields:

$S(i).s$ - one of four states
$S(i).c$ - coordinator id
$S(i).d$ - “definition of task” - an information got from cooritator, during the “reorganization” stage.

There are several other assumptions:

Each node functions according to the same algorithm.
There aren’t any Byzantine faults.
Whenever a process discovers its own malfunction, it declares itself as Down.
Other processes in a system may discover that if the failed process doesn’t answer within a predefined $T$ time limit.

Common networking guarantees

There aren’t any transmission errors. In other words, the sent message and the received one are the same.
Messages are received in the same order as it was sent. If the connection pipe doesn’t have such guarantee, it can be obtained artificially. Each message has to be populated with it’s unique_id, and whenever a process gets a message, it declares all messages with lower id to be lost and ignores it.
Messages are delivered relatively fast. If a message is delivered longer, than $T$, the process will be recognized as failed.

Strong model

In a strong model, the reliability of the networking is assumed. Each sent message will reach it’s destination. Therefore, the only fault that can be handled is a complete process’s fault.

Under this model, there are two criteria of a correct election algorithm:

The coordinator must be the same: $\forall i,j: S(i).s, S(j).s \in {‘Reorganization’, ‘Normal’} \implies S(i).c=S(j).c$
If there are no failures during the election, the algorithm eventually has to select a single process as coordinator. Although, if we allow for failures to occur, we can design a counterexample with an infinite election.

Weak model

Under the weak model, we would like to weak the network constraint and allow messages to get lost. It implies that a partitioning may also occur. If information can’t be transferred between the parts, electing a single coordinator is an impossible task.

Therefore, the definition of an election should be changed. We introduce the concept of groups. So, each process at any point is a part of a single group, and in each group, we require only one coordinator to be selected.

Formally:

$S(i).g$ is added to state, representing current group id.
$\forall i,j: S(i).s, S(j).s \in {‘Reorganization’, ‘Normal’} \vee S(i).g=S(j).g \implies S(i).c=S(j).c$
For each living process the coordinator should be eventually selected.

It can be seen that a strong model can be obtained from the weak model by setting equal group id to all processes.

HighRank algorithm

This algorithm was originally named “The Bully”, but according to the principles of nonviolent communication, we would instead name it “HighRank”.

The key idea - the process with the largest id should be selected as a coordinator.

There are several types of messages:

Election - indicating an intent to start an election
Answer - indicate that other process agreed to be a coordinator in a future election.
Coordinator - indicate finishing election with an elected coordinator.

Election start

A round of elections can be started by any process. For example, if it discovers some of the other’s process failure and therefore reconfiguration is required.

The process changes its state to Election and sends an Election message to all processes with higher rank.

If there aren’t any responses, therefore, the current process has a “throne claim” - it can elect itself. It broadcasts the Coordinator message with its id.

Message response

Otherwise, if a process gets at least one Answer, the process has to wait while another process with a higher rank will repeat the same algorithm.
If a process gets an Election message, it responds with an Answer. If the current process is not participating in an election yet ($S(i).s \not = ‘Election’), it should start the election - again, it has to ensure if there are no alive processes with higher ranks.
If a process gets a Coordinator message, it means that elections are now over, it updates $S(i).c$ and changes state to Reorganization. Election now have succeeded

The Invitation Election Algorithm

This algorithm operates in a weak environment model.

Each process has to maintain the group set - a set of process belonging to its group.

The algorithm for electing coordinator within a group is the same. The only thing that is needed - is a protocol for merging groups.

Merging

For simplicity, the right to perform a merge can be delegated to a separate agent - merger. There can be arbitrary many instances of mergers, for example in each network partition.

Merger periodically asks all process. If there are at least two coordinators online, their groups can be merged. First of all, when a merger decides to merge two groups, each process within each group has to ensure that each process from the opposing group is available. Therefore, we can ensure that in the new group reliable network constraint will be valid.

If merger decides that groups can be merged, it broadcast the new group set to all members of the new group and starts the election process.

Conclusions

The proposed algorithms may operate in a relatively unstable environment. Although, there are several assumptions, which can be hard to meet in some systems:

The algorithm requires to assign a unique id and make sure, each other process in a system knows all other ids. This can be hard to achieve in distributed systems with a large number of processes.
The algorithm demands messages are delivered within a fixed time gap. It introduces typical trade-off - with high $T$ failures are discovered much later and with lower $T$ system tends to recognize functioning processes as dead.

Anyway, the proposed algorithms can be used is a broad range of unstable environment and can be a basis for the development of more sophisticated algorithms.

One possible direction is organizing processes in a balanced tree, such that all communication and election can be done through a proxy. This would allow processes to hold less information about other parts of a system and reduce a network load around coordinator, as it would broadcast messages not directly, but through a path in a balanced tree.

	github.com/omrigan
	linkedin.com/in/omrigan
	gitlab.com/u/omrigan
	twitter.com/omrigan