Why is ZooKeeper always configured with an odd number of nodes?

Someone on Quora asked me, “Why is ZooKeeper always configured with an odd number of nodes?” That's a great question, but the sad part is that not many practitioners, even those who run ZooKeeper in production, can explain it simply.

I will try to keep this really simple, I promise.

ZooKeeper (ZK) is a highly available, highly reliable, and fault-tolerant coordination and consensus service for distributed applications like Apache Storm or Kafka. (In the case of Kafka specifically, it also acts as a distributed metadata and configuration service.) ZK stores its state in a simple database, a.k.a. the data tree.
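To make the “data tree” concrete, here is a minimal sketch using the kazoo Python client (my choice of client, not something this post depends on); the connection string and the /app/config path are hypothetical:

    # Minimal sketch of ZK's data tree via the kazoo client.
    # Assumes a ZK server at 127.0.0.1:2181 (hypothetical address).
    from kazoo.client import KazooClient

    zk = KazooClient(hosts="127.0.0.1:2181")
    zk.start()

    # The data tree is a hierarchy of "znodes", each holding a small blob.
    zk.ensure_path("/app/config")           # create the path if missing
    zk.set("/app/config", b"retention=7d")  # write data into the znode

    data, stat = zk.get("/app/config")      # read it back
    print(data.decode(), stat.version)

    zk.stop()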

Critical parts of the ZK definition

Highly available and highly reliable – ZK achieves these through replication, and it is designed to perform well under read-heavy workloads. The ZK database is fully replicated across an ensemble, i.e., an array of hosts/nodes, of which one is the leader of a quorum (i.e., a majority). See the figures below.

A quorum is simply the number of nodes that must be up and running in the ensemble for the ZK service to be considered up. By default, ZK uses a “node majority quorum”. (That's all we need to worry about for now; if you are curious, here is a good source to understand the different forms of quorum configuration.) See figure 2.
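For concreteness, a five-node ensemble is declared in each node's zoo.cfg along these lines (hostnames are placeholders; 2888/3888 are the conventional peer-connection and leader-election ports):

    # zoo.cfg (sketch) – the same server list goes on every node
    tickTime=2000
    initLimit=10
    syncLimit=5
    dataDir=/var/lib/zookeeper
    clientPort=2181
    # server.<id>=<host>:<peer port>:<leader-election port>
    server.1=zk1.example.com:2888:3888
    server.2=zk2.example.com:2888:3888
    server.3=zk3.example.com:2888:3888
    server.4=zk4.example.com:2888:3888
    server.5=zk5.example.com:2888:3888

With five servers, the node majority quorum is three: the service is up as long as any three of them can reach each other.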

[Figure 1: a ZooKeeper ensemble – an array of nodes, one of which is the leader]

[Figure 2: a majority quorum within the ensemble]

Why an odd number of nodes?

Just imagine how many node failures the cluster above can tolerate. The formula is:

Tolerated failures = Round(N/2) – 1

where N is the number of nodes in the ensemble and Round rounds halves up (so Round(5/2) = 3).

Applying the formula to the five-node cluster above gives Round(5/2) – 1 = 2 nodes. Now say we had 6 nodes instead of 5. Re-applying the formula gives Round(6/2) – 1 = 2 nodes again. The sixth node adds no tangible fault-tolerance benefit, so replicating to that one extra node is pure write overhead.
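The arithmetic is small enough to sketch in a few lines of Python (names are mine), which makes the odd/even pattern obvious:

    # Failures an N-node, majority-quorum ensemble can tolerate:
    # a quorum of floor(N/2) + 1 nodes must survive, so the rest may fail.
    def tolerated_failures(n: int) -> int:
        quorum = n // 2 + 1
        return n - quorum          # equals Round(n/2) - 1

    for n in range(3, 8):
        print(n, "nodes ->", tolerated_failures(n), "failures tolerated")
    # 3 -> 1, 4 -> 1, 5 -> 2, 6 -> 2, 7 -> 3
    # An even-sized ensemble never tolerates more failures than
    # the odd-sized ensemble one node smaller.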

Another way to put it: when you think about the availability and fault tolerance of your cluster, just ask how many node failures you want it to tolerate. To tolerate N failures, you need 2N + 1 nodes in the cluster. Applied to our case: we wanted to tolerate 2 failures, hence 2(2) + 1 = 5 nodes. To tolerate 3 failures, you need 2(3) + 1 = 7 nodes. 2N + 1 is always odd, which is why ensembles come in odd sizes.
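Going the other direction, sizing an ensemble from the desired fault tolerance is an equally small sketch (function name is mine):

    # Smallest majority-quorum ensemble that tolerates a given
    # number of simultaneous node failures.
    def nodes_needed(failures: int) -> int:
        return 2 * failures + 1

    print(nodes_needed(2))  # 5
    print(nodes_needed(3))  # 7 -- 2N + 1 is always odd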

This explanation holds for any distributed application that uses quorum-based replication with a node majority quorum.

Hope this makes sense.

Further reading