Zookeeper 4 Letter Word Netcat No Response
Problem
To report Zookeeper service health, we use the 4 letter word (4LW) command
ruok and check that response is imok.
netcat is the channel via which the message is
sent.
After upgrading the operating system from Ubuntu 16.04 to 18.04, we see that
intermittently there is no response. Hmm…
Triage
Is Zookeeper getting those commands?#
A simple bash script can be used to verify this:
#!/bin/bash
for i in {1..10}
do
status=$(echo "ruok" | nc localhost 2181)
echo -e "try ${i}: ${status}"
done
Output looks suspect:
root@zk-node:~# ./check_zk.sh
try 1: imok
try 2: imok
try 3: imok
try 4: imok
try 5: imok
try 6:
try 7: imok
try 8:
try 9: imok
try 10: imok
root@zk-node:~#
But Zookeeper service has no issues processing the requests:
Jul 24 22:21:07 zk-node zookeeper[23515]: 2020-07-24 22:21:07,054 [myid:] - INFO [NIOWorkerThread-31:NIOServerCnxn@507] - Processing ruok command from /127.0.0.1:49880
Jul 24 22:21:07 zk-node zookeeper[23515]: 2020-07-24 22:21:07,091 [myid:] - INFO [NIOWorkerThread-22:NIOServerCnxn@507] - Processing ruok command from /127.0.0.1:49882
Jul 24 22:21:07 zk-node zookeeper[23515]: 2020-07-24 22:21:07,110 [myid:] - INFO [NIOWorkerThread-20:NIOServerCnxn@507] - Processing ruok command from /127.0.0.1:49884
Jul 24 22:21:07 zk-node zookeeper[23515]: 2020-07-24 22:21:07,125 [myid:] - INFO [NIOWorkerThread-4:NIOServerCnxn@507] - Processing ruok command from /127.0.0.1:49886
Jul 24 22:21:07 zk-node zookeeper[23515]: 2020-07-24 22:21:07,139 [myid:] - INFO [NIOWorkerThread-3:NIOServerCnxn@507] - Processing ruok command from /127.0.0.1:49888
Jul 24 22:21:07 zk-node zookeeper[23515]: 2020-07-24 22:21:07,153 [myid:] - INFO [NIOWorkerThread-1:NIOServerCnxn@507] - Processing ruok command from /127.0.0.1:49890
Jul 24 22:21:07 zk-node zookeeper[23515]: 2020-07-24 22:21:07,166 [myid:] - INFO [NIOWorkerThread-23:NIOServerCnxn@507] - Processing ruok command from /127.0.0.1:49892
Jul 24 22:21:07 zk-node zookeeper[23515]: 2020-07-24 22:21:07,179 [myid:] - INFO [NIOWorkerThread-11:NIOServerCnxn@507] - Processing ruok command from /127.0.0.1:49894
Jul 24 22:21:07 zk-node zookeeper[23515]: 2020-07-24 22:21:07,190 [myid:] - INFO [NIOWorkerThread-9:NIOServerCnxn@507] - Processing ruok command from /127.0.0.1:49896
Jul 24 22:21:07 zk-node zookeeper[23515]: 2020-07-24 22:21:07,201 [myid:] - INFO [NIOWorkerThread-5:NIOServerCnxn@507] - Processing ruok command from /127.0.0.1:49898
No errors or warnings or exceptions. Looks like it is responding just fine.
4LW commands do not work consistently#
Next, we tried looking for various causes that result in 4LW commands silently failing. Bumped into this known Zookeeper issue ZOOKEEPER-737: some 4 letter words may fail with netcat (nc) 1.
Now we did not observe any exceptions in ZK logs - probably because we’re
using a later version of ZK - v3.6.1. But it does seem like a plausible
root cause.
A StackOverflow post 2 points to the same with a fix to add delay in netcat before closing the channel.
Solution
Add delay to netcat before closing the channel#
Based on the StackOverflow post 2, we added the -q 1 flag;
which adds a 1 second delay after sending the message and before it closes
the channel.
Result?#
# status=$(echo "ruok" | nc localhost 2181 -q 1)
root@zk-node:~# ./check_zk.sh
try 1: imok
try 2: imok
try 3: imok
try 4: imok
try 5: imok
try 6: imok
try 7: imok
try 8: imok
try 9: imok
try 10: imok
root@zk-node:~#
Consistency at last!
The goal of any platform - it works or it doesn’t; either is fine, as long as it does that consistently.
TIL