Raft Algorithm Log Analysis - etcd Cluster Leader Election Process

ยท

3 min read

๐Ÿ’ก
The Raft algorithm is a consensus algorithm designed to manage a replicated log. It ensures that multiple servers (nodes) in a distributed system agree on a consistent state, even in the presence of failures. The algorithm operates in three main states
  1. Follower: Passive state, responds to requests from leaders and candidates.

  2. Candidate: State during an election, where a node attempts to become the leader.

  3. Leader: Active state, handles all client requests and log replication.

Terms and Raft Index

  • Terms: Each term begins with an election. If a candidate wins, it remains the leader for the duration of the term. Terms help nodes identify stale information and avoid split-brain scenarios.

  • Raft Index: Used to maintain the order of log entries. Nodes with higher indexes have more up-to-date logs, which is crucial during the pre-vote and vote phases.

Voting and Pre-voting Concepts

  • Pre-vote Phase: Nodes ask other nodes if they would likely support them in the next election. This phase helps reduce the chances of disruptive elections.

  • Vote Phase: After a successful pre-vote, the candidate officially asks for votes. The candidate must receive votes from a majority of the cluster nodes to become the leader.

Quorum Concept

  • A quorum is essential for maintaining consistency. In a cluster with n nodes, a quorum is achieved with ceil(n/2) + 1 nodes. This ensures that any majority-based decision is respected across the cluster, preventing data inconsistencies.

1. Case 1: When the term is lower

2. Case 2: When the term is the same but the index is different

Case 1: When the term is lower

  • Term: Represents the recent election round a node has participated in.

  • Master node fails.

  • Timer on master1 ends first, making it a pre-candidate.

  • Pre-candidate sends a prevote request.

  • Nodes compare terms; if master1's term is lower, the election is rejected.

  • All nodes revert to followers, and timers reset.

  • Timer on node3 ends first, making it a pre-candidate.

  • Pre-candidate sends a prevote request.

  • Pre-candidate sends a prevote request.

  • Prevote succeeds, initiating the main vote.

  • All nodes' terms update to 41.

  • Nodes compare raft indexes.

  • Node3 becomes the leader.
  • The leader sends periodic heartbeats to followers to reset their random timers. This prevents pre-candidates from emerging while the leader is still active.
  • Now, let's simulate the failure of the leader node.

  • Heartbeats will no longer be sent to followers.

  • A new pre-candidate will emerge, send a prevote request, and the main vote will start, initiating the 42nd term election.

  • A new leader is elected.

  • Revive node3.

  • The leader synchronizes the current term information, and node3 rejoins the cluster.

Case 2: When the term is the same but the index is different

  • Node1 is down.

  • Timer on node3 ends first, making it a pre-candidate.

  • Pre-candidate sends a prevote request.

  • Node2 rejects due to a higher index while Timer on node2 is running out.

  • Node2 becomes a pre-candidate and sends a prevote request.

  • Node2's higher index makes it the candidate.

  • Node3 reverts to follower.

  • Main vote starts, and node2 becomes the leader.
ย