One of the things I’m working on is adding the ability to use ndb_mgm to issue a restart command to ndb_mgmd (i.e. from a management client, get management servers to restart). At the moment, you have to go and shut it down then start it yourself.
So why would you ever want to restart a management server? Well, bugs aren’t really a reason – I’ve never heard of anyone having to restart the management server “just to get something to work again”.
The reason is online configuration upgrades.
There are a bunch of parameters you can change without having to restart your cluster. We call it a “rolling upgrade” as it’s the same procedure as upgrading one compatible version to another.
The whole procedure would be really easy if we only ever had one management server (which is what a lot of people have anyway – you only ever need a mgm server to have a node join the cluster).
It’s also tricky because you don’t want management servers up and serving different configurations. This would tend to be bad and never lead to hugs and puppies.
For supporting more dramatic configuration changes (e.g. add/drop node) we’ll be needing configuration locks and enforcing that everyone agrees on the config they’re serving out.
There exists some code from a previous effort a few years ago. So I’m having a look through it and trying to work out the state of everything. There seems to be a bit of bitrot and I’m trying to work ouf if anything is worth using.
The approach that I’ve come up with is to have a “single user mode” for the mgm server – i.e. nobody but one connection can do anything. This is where we’d do updates and changes before unlocking.
I wasn’t really caring about notifying ndbd about changes as the way you do things atm is to restart each ndbd and they then pick up the changes.
Otherwise, we really want a “this parameter changed” rather than “there’s a new configuration”.
So, the ‘mgm restart’ thing is really going to be implemented as “config reload” – getting the mgmds to stop and restart in the right order and so there is never more than one version of the configuration is being served at a time.
hrrm… back to the code to figure out what’s going on with this older stuff.