Zookeeper 4 Letter Word Netcat No Response

Problem

To report Zookeeper service health, we use the 4 letter word (4LW) command ruok and check that response is imok.

netcat is the channel via which the message is sent.

After upgrading the operating system from Ubuntu 16.04 to 18.04, we see that intermittently there is no response. Hmm…

Triage

Is Zookeeper getting those commands?#

A simple bash script can be used to verify this:

#!/bin/bash

for i in {1..10}
do
  status=$(echo "ruok" | nc localhost 2181)
  echo -e "try ${i}: ${status}"
done

Output looks suspect:

root@zk-node:~# ./check_zk.sh
try 1: imok
try 2: imok
try 3: imok
try 4: imok
try 5: imok
try 6:
try 7: imok
try 8:
try 9: imok
try 10: imok
root@zk-node:~#

But Zookeeper service has no issues processing the requests:

Jul 24 22:21:07 zk-node zookeeper[23515]: 2020-07-24 22:21:07,054 [myid:] - INFO  [NIOWorkerThread-31:NIOServerCnxn@507] - Processing ruok command from /127.0.0.1:49880
Jul 24 22:21:07 zk-node zookeeper[23515]: 2020-07-24 22:21:07,091 [myid:] - INFO  [NIOWorkerThread-22:NIOServerCnxn@507] - Processing ruok command from /127.0.0.1:49882
Jul 24 22:21:07 zk-node zookeeper[23515]: 2020-07-24 22:21:07,110 [myid:] - INFO  [NIOWorkerThread-20:NIOServerCnxn@507] - Processing ruok command from /127.0.0.1:49884
Jul 24 22:21:07 zk-node zookeeper[23515]: 2020-07-24 22:21:07,125 [myid:] - INFO  [NIOWorkerThread-4:NIOServerCnxn@507] - Processing ruok command from /127.0.0.1:49886
Jul 24 22:21:07 zk-node zookeeper[23515]: 2020-07-24 22:21:07,139 [myid:] - INFO  [NIOWorkerThread-3:NIOServerCnxn@507] - Processing ruok command from /127.0.0.1:49888
Jul 24 22:21:07 zk-node zookeeper[23515]: 2020-07-24 22:21:07,153 [myid:] - INFO  [NIOWorkerThread-1:NIOServerCnxn@507] - Processing ruok command from /127.0.0.1:49890
Jul 24 22:21:07 zk-node zookeeper[23515]: 2020-07-24 22:21:07,166 [myid:] - INFO  [NIOWorkerThread-23:NIOServerCnxn@507] - Processing ruok command from /127.0.0.1:49892
Jul 24 22:21:07 zk-node zookeeper[23515]: 2020-07-24 22:21:07,179 [myid:] - INFO  [NIOWorkerThread-11:NIOServerCnxn@507] - Processing ruok command from /127.0.0.1:49894
Jul 24 22:21:07 zk-node zookeeper[23515]: 2020-07-24 22:21:07,190 [myid:] - INFO  [NIOWorkerThread-9:NIOServerCnxn@507] - Processing ruok command from /127.0.0.1:49896
Jul 24 22:21:07 zk-node zookeeper[23515]: 2020-07-24 22:21:07,201 [myid:] - INFO  [NIOWorkerThread-5:NIOServerCnxn@507] - Processing ruok command from /127.0.0.1:49898

No errors or warnings or exceptions. Looks like it is responding just fine.

4LW commands do not work consistently#

Next, we tried looking for various causes that result in 4LW commands silently failing. Bumped into this known Zookeeper issue ZOOKEEPER-737: some 4 letter words may fail with netcat (nc) 1.

Now we did not observe any exceptions in ZK logs - probably because we’re using a later version of ZK - v3.6.1. But it does seem like a plausible root cause.

A StackOverflow post 2 points to the same with a fix to add delay in netcat before closing the channel.

Solution

Add delay to netcat before closing the channel#

Based on the StackOverflow post 2, we added the -q 1 flag; which adds a 1 second delay after sending the message and before it closes the channel.

Result?#

# status=$(echo "ruok" | nc localhost 2181 -q 1)

root@zk-node:~# ./check_zk.sh
try 1: imok
try 2: imok
try 3: imok
try 4: imok
try 5: imok
try 6: imok
try 7: imok
try 8: imok
try 9: imok
try 10: imok
root@zk-node:~#

Consistency at last!

The goal of any platform - it works or it doesn’t; either is fine, as long as it does that consistently.

TIL


  1. https://issues.apache.org/jira/browse/ZOOKEEPER-737 ↩︎

  2. https://stackoverflow.com/questions/26182416/zookeeper-server-started-but-ruok-no-output ↩︎

comments powered by Disqus

© 2020 Aditya Vaja