Bug 15695 and NDB initial start

The process for starting up a cluster is pretty interesting. Where, of course, “interesting” is translated to “complex”. There are a lot of things you have to watch out for (namely, you want exactly one cluster, not two or ten). You also want to actually start a cluster, not just wait forever for everybody to show up.

Except in some situations, initial start being one. With an initial start, you really want all the nodes present (you don’t want to run the risk of starting up two separate clusters!).

Bug 15695 is a bug to do with initial start. If you have three nodes (a management node and two data nodes), break the network connection just between the two data nodes, and then reconnect it at the wrong time (where “the wrong time” means you trigger the bug), the cluster will never start. A workaround is to restart one of the data nodes, after which everything comes up.

Note that this happens only during initial start, so it’s not an HA bug or anything. Just really annoying.

This seems to get hit when people have firewalls stopping the nodes from talking to each other and then fix the firewall (without shutting down the cluster).

As is documented in the bug, you can replicate this with some iptables foo.

One of the main blocks involved in starting the cluster (and managing it once it’s up) is the Quorum Manager – QMGR. You’ll find the code in ndb/src/kernel/blocks/qmgr/. You’ll also find some in the older CMVMI (Cluster Manager Virtual Machine Interface).

A useful thing to do is to define DEBUG_QMGR_START in your build. This gives you some debugging output printed to the ndb_X_out.log file.
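If you’re wondering what that kind of compile-time switch looks like, here’s a minimal sketch of the general pattern (this is not the actual QMGR macro or its output format, just an illustration of a debug guard you turn on with -DDEBUG_QMGR_START in your build flags):

```cpp
// Sketch only: a compile-time debug guard in the style of DEBUG_QMGR_START.
// The message text and macro name DEBUG_START are made up for illustration.
#include <cstdio>

#ifdef DEBUG_QMGR_START
#define DEBUG_START(msg) std::fprintf(stderr, "QMGR start: %s\n", (msg))
#else
#define DEBUG_START(msg) do {} while (0)  // compiles to nothing in normal builds
#endif

int main() {
    DEBUG_START("entering president election");  // printed only in a -DDEBUG_QMGR_START build
}
```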

The first bit of code in QmgrMain.cpp is the heartbeat code. execCM_HEARTBEAT simply resets the number of outstanding heartbeats for the node that sent the heartbeat. Really simple signal there.
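To make the idea concrete, here’s a tiny self-contained sketch of that bookkeeping. This is not the real QMGR code – the node limit, threshold and names are made up – but it shows the shape of it: each node keeps a count of heartbeat periods it hasn’t heard from a peer, receiving a heartbeat resets that count, and a periodic check declares the peer failed once the count passes a limit.

```cpp
// Simplified heartbeat bookkeeping sketch (not the actual QMGR implementation).
#include <array>
#include <cstdio>

constexpr int MAX_NODES = 48;             // hypothetical cluster size limit
constexpr int MAX_MISSED_HEARTBEATS = 4;  // hypothetical failure threshold

struct NodeInfo {
    bool alive = false;
    int missedHeartbeats = 0;             // "outstanding" heartbeat periods
};

std::array<NodeInfo, MAX_NODES> nodes;

// Analogue of execCM_HEARTBEAT: the sender is clearly alive, so clear its count.
void onHeartbeat(int senderNodeId) {
    nodes[senderNodeId].missedHeartbeats = 0;
}

// Run once per heartbeat period: bump every live node's counter and flag
// any node that has been silent for too many periods.
void onHeartbeatTimer() {
    for (int i = 0; i < MAX_NODES; ++i) {
        if (!nodes[i].alive) continue;
        if (++nodes[i].missedHeartbeats > MAX_MISSED_HEARTBEATS)
            std::printf("node %d missed too many heartbeats, declaring it failed\n", i);
    }
}

int main() {
    nodes[2].alive = true;
    onHeartbeat(2);                       // heartbeat arrives, counter resets
    for (int t = 0; t < 5; ++t)
        onHeartbeatTimer();               // silence for 5 periods -> failure reported
}
```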

During SR (System Restart) there is a timeout period during which we wait for nodes to start. This means we’ll be starting the cluster with as many nodes present as possible (it’s much easier doing an SR with as many nodes as possible than doing NR – Node Recovery – on lots of nodes). NR requires copying data over the wire; SR probably doesn’t. Jonas is working on optimised node recovery, which is going to be really needed for disk data. This will only copy the changed/updated data over the wire instead of all the data that node needs. Pretty neat stuff.

We implement the timeout by sending a delayed signal to ourselves. Every 3 seconds we check how the election of a president is going. If we reach our limit (30 seconds) we try to start the cluster anyway, not allowing other nodes to join.
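That’s the classic “delayed signal to yourself” timer. Here’s a rough standalone sketch of the shape of it – again, not the actual block code: the real block sends itself a delayed signal rather than sleeping in a thread, and the election checks are far more involved. The constants mirror the 3-second check interval and 30-second limit mentioned above.

```cpp
// Sketch of the periodic "check election, give up after a limit" pattern.
#include <chrono>
#include <cstdio>
#include <thread>

using Clock = std::chrono::steady_clock;

constexpr auto CHECK_INTERVAL   = std::chrono::seconds(3);   // re-check period
constexpr auto ELECTION_TIMEOUT = std::chrono::seconds(30);  // give-up limit

bool electionFinished = false;  // would be set once a president is agreed upon

void checkElection(Clock::time_point started) {
    if (electionFinished) {
        std::printf("president elected, continuing the start\n");
        return;
    }
    if (Clock::now() - started >= ELECTION_TIMEOUT) {
        std::printf("timeout: starting with the nodes we have, no new joiners\n");
        return;
    }
    // In the real block this is a delayed signal back to the same block;
    // a sleep plus another call stands in for it here.
    std::this_thread::sleep_for(CHECK_INTERVAL);
    checkElection(started);
}

int main() {
    checkElection(Clock::now());  // runs for up to 30 seconds in this toy version
}
```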

The current problem is that each node in this two-node not-quite-yet cluster thinks it has won the election and so switches its state to ZRUNNING (see Qmgr.h), hence stopping the search for other nodes. When the link between the two nodes is brought back up – hugs and puppies do not ensue.

I should have a patch soon too.

For a more complete explanation on the stages of startup, have a look at the text files in ndb/src/kernel/blocks. Start.txt is a good one to read.
