Signals to freeze a Data Node: simulating trouble

Last week I was struggling to find an easy way to simulate a troubled *Data Node* (ndbd process) in MySQL Cluster. It’s as simple as pancakes: use the kill command!

To freeze a process, you just need to send it the SIGSTOP signal with kill. To let the process continue, send it SIGCONT. Here’s an example shell script showing how you would use these two signals on a data node:

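# Sample line from ndb_3_out.log; the ndbd PID is the 11th whitespace-separated field:
# 2010-05-03 08:11:46 [ndbd] INFO     -- Angel pid: 542 ndb pid: 543
NDBDPID=$(grep 'Angel pid' ndb_3_out.log | tail -n1 | awk '{ print $11 }')
kill -STOP "$NDBDPID"   # freeze the data node
sleep 10                # keep it frozen for 10 seconds
kill -CONT "$NDBDPID"   # let it continue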

I’m using the out-log because the file ndb_3.pid contains only the PID of the Angel process, not that of the ndbd process itself. The sleep duration is up to you: set it as low or as high as you want.
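
A slightly more defensive variant of the script could look like the sketch below. It assumes the out-log line always ends with the ndbd PID, as in the sample line, and both function names (ndb_pid_from_log, freeze_for) are made up for this post; they are not part of MySQL Cluster. The trap is there so the data node still gets its SIGCONT if you hit Ctrl-C during the sleep.

```shell
#!/bin/sh
# Print the ndbd PID from the most recent "Angel pid" line of an out-log.
# Assumes the PID is the last field on that line, as in the sample line.
ndb_pid_from_log() {
    awk '/Angel pid:/ { pid = $NF }
         END { if (pid == "") exit 1; print pid }' "$1"
}

# Freeze process $1 for $2 seconds, then let it continue.
freeze_for() {
    target=$1
    kill -STOP "$target" || return 1
    # Resume the target before exiting if we are interrupted while it is frozen.
    trap 'kill -CONT "$target" 2>/dev/null; exit 1' INT TERM
    sleep "$2"
    kill -CONT "$target"
    status=$?
    trap - INT TERM
    return "$status"
}

# Usage (not run here): freeze the data node behind ndb_3_out.log for 10 seconds.
# NDBDPID=$(ndb_pid_from_log ndb_3_out.log) && freeze_for "$NDBDPID" 10
```

Trying freeze_for on a throwaway process first (for example a background sleep) is a cheap way to convince yourself it behaves before pointing it at a data node.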

In the above example the script sleeps long enough for the data node to fail with an Arbitration Error. If you set the options HeartbeatIntervalDbDb and TimeBetweenWatchDogCheck to lower values than their defaults, you could only sleep for a few seconds before the node fails. The result:

 [MgmtSrvr] WARNING  -- Node 2: Node 3 missed heartbeat 2
 [MgmtSrvr] WARNING  -- Node 2: Node 3 missed heartbeat 3
 [MgmtSrvr] ALERT    -- Node 1: Node 3 Disconnected
 [MgmtSrvr] ALERT    -- Node 1: Node 3 Disconnected
 [MgmtSrvr] WARNING  -- Node 2: Node 3 missed heartbeat 4
 [MgmtSrvr] ALERT    -- Node 2: Node 3 declared dead due to missed heartbeat
 [MgmtSrvr] INFO     -- Node 2: Communication to Node 3 closed
 [MgmtSrvr] ALERT    -- Node 2: Network partitioning - arbitration required
 [MgmtSrvr] INFO     -- Node 2: President restarts arbitration thread [state=7]
 [MgmtSrvr] ALERT    -- Node 2: Arbitration won - positive reply from node 1
 [MgmtSrvr] ALERT    -- Node 2: Node 3 Disconnected
 [MgmtSrvr] INFO     -- Node 2: Started arbitrator node 1 [ticket=019b00025cc8aad8]
 [MgmtSrvr] ALERT    -- Node 3: Forced node shutdown completed.
  Caused by error 2305:  'Node lost connection to other nodes and can not
  form a unpartitioned cluster, please investigate if there are error(s)
  on other node(s)(Arbitration error). Temporary error, restart node'.
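
If you repeat this experiment a lot, it may help to check the cluster log for the forced shutdown automatically. A minimal sketch, assuming the messages look like the ones in the log above; the function name node_shut_down and the log path in the usage comment are placeholders of mine.

```shell
#!/bin/sh
# Succeed if the cluster log file $1 reports a forced shutdown of node $2.
# The message text is copied from the log excerpt in this post.
node_shut_down() {
    grep -qF -e "Node $2: Forced node shutdown completed" "$1"
}

# Usage (not run here; the path is a placeholder for your own cluster log):
# node_shut_down /path/to/cluster.log 3 && echo "node 3 went down"
```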

How is this useful?

Well, for example, to simulate a data node that runs into problems while under load. Maybe you would like to see what happens if you tune the WatchDog or Heartbeat parameters. Or maybe you want to give a demonstration to your management (e.g. as a proof of concept) without going through the hassle of overloading a disk or CPU or pulling network cables.
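
When you tune the heartbeat, a rough feel for how long to freeze the node can be handy. The sketch below assumes, going by the log above, that a node is declared dead after its fourth missed heartbeat, and adds one interval of slack. That is my reading of the log, not a documented guarantee, and freeze_seconds is a made-up helper name. The 1500 in the last line is only an example interval in milliseconds.

```shell
#!/bin/sh
# Estimate how many whole seconds to freeze a node so that it misses
# $2 heartbeats of $1 milliseconds each, plus one interval of slack.
freeze_seconds() {
    interval_ms=$1
    missed=${2:-4}   # assumption: declared dead after the 4th missed heartbeat
    # Integer ceiling of (missed + 1) * interval_ms / 1000.
    echo $(( (interval_ms * (missed + 1) + 999) / 1000 ))
}

freeze_seconds 1500   # prints 8
```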

In any case, I think it’s a cool use of the kill command, and one I didn’t know of.

Comments

Evert
Smart trick, I'll be using this too =)