Rolling reboot script

The snmp-castor-tool.sh support script, bundled in our support tools (swarm-support-tools.tgz), has a rolling reboot option.

The snmp-castor-tool.sh script is developed and tested only on the CSN.

The option for a rolling reboot is -G.

I often make a manual list of IPs, one per chassis, for the script to work from, so that it doesn't try to reboot every IP address on every chassis. The script should account for multiple IPs per chassis and NOT reboot any chassis more than once, but I like to be sure that it sends the reboot to only one IP address per chassis. A reboot sent to any one IP address in a chassis reboots the entire chassis. To use a manual list, create a file called NODES.csv in the directory you are running the script from, add one IP address from each chassis, one per line, and add the -n option when you run the script.

Here's a demonstration. You might prefer to run this command in a screen session in case the SSH session breaks for any reason.

Notice that I have 2 IPs per MAC address (2 node processes per chassis):

[root@c-csn1 ~]# /opt/caringo/csn/bin/ip-assignments
000c294816c6 192.168.201.84
000c294816c6 192.168.201.85
000c2957da63 192.168.201.86
000c2957da63 192.168.201.87
000c299e564c 192.168.201.88
000c299e564c 192.168.201.89
000c292422d5 192.168.201.90
000c292422d5 192.168.201.91

I have created a file in my local directory that includes one IP address from each of those chassis:

[root@c-csn1 ~]# cat NODES.csv
192.168.201.84
192.168.201.86
192.168.201.88
192.168.201.90

I created it by running /opt/caringo/csn/bin/ip-assignments --ips >> NODES.csv and then removing the IPs that I didn't want to send reboots to. Note that >> appends, so make sure there is no old NODES.csv in the directory first.
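If you would rather not trim the list by hand, the same result can be sketched with a small awk filter. This is only an illustration: it assumes the plain ip-assignments output has the "MAC IP" layout shown above, and one_ip_per_chassis is a made-up helper name, not something shipped in the support tools.

```shell
#!/usr/bin/env bash
# Sketch: keep only the first IP listed for each MAC address,
# which gives one IP per chassis.
# Assumption: input lines look like "<mac> <ip>", as in the
# ip-assignments listing above.
# one_ip_per_chassis is a hypothetical helper, not part of the support tools.
one_ip_per_chassis() {
    # seen[$1]++ evaluates to 0 the first time a MAC appears, so only
    # the first IP for each MAC is printed. Lines with fewer than two
    # fields are skipped.
    awk 'NF >= 2 && !seen[$1]++ { print $2 }'
}

# Example, run on the CSN:
#   /opt/caringo/csn/bin/ip-assignments | one_ip_per_chassis > NODES.csv
```

Check the resulting NODES.csv with cat before starting the rolling reboot, exactly as you would with a hand-made list.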


[root@c-csn1 ~]# snmp-castor-tool.sh -G -n
Please type NO in all CAPS to avoid a rolling restart of all nodes. Any other response will restart the nodes. This requires the SNMP read/write password.
yes
Beginning a rolling reboot of the cluster. The script will timeout if a rebooted node isn't in an ok state within 60 minutes after reboot
A log of this rolling reboot is at ./rollingreboot-2016_0310-143816.log
Rebooting 192.168.201.84
Waiting 5 minutes before trying to contact the rebooted node
Node 192.168.201.84 has returned and its volumes are mounted
Rebooting 192.168.201.86
Waiting 5 minutes before trying to contact the rebooted node
Node 192.168.201.86 has returned and its volumes are mounted
Rebooting 192.168.201.88
Waiting 5 minutes before trying to contact the rebooted node
Node 192.168.201.88 has returned and its volumes are mounted
Rebooting 192.168.201.90
Waiting 5 minutes before trying to contact the rebooted node
Node 192.168.201.90 has returned and its volumes are mounted

The rolling reboot was successful.


The script writes a log file that shows its progress, which is especially useful if something unexpected happens. If a chassis doesn't return and you have to reboot it manually (and assuming the script has timed out), simply remove the IPs of the nodes that were already rebooted from NODES.csv and run the script again to reboot the remaining nodes.
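Trimming NODES.csv after a timeout can also be sketched in awk. Treat this as a rough helper under a loud assumption: that the log file contains the same "Node <ip> has returned ..." lines as the console output shown above, which you should confirm against your own log first. remaining_nodes is a made-up name, not part of the support tools.

```shell
#!/usr/bin/env bash
# Sketch: print the IPs from a node list that the log has not yet
# reported as returned.
# Assumption: the log holds "Node <ip> has returned ..." lines like the
# console output above; verify this in your own log.
# remaining_nodes is a hypothetical helper, not part of the support tools.
remaining_nodes() {
    # $1 = node list (one IP per line), $2 = rolling reboot log
    awk '
        FILENAME == ARGV[1] {
            # Log pass: remember every IP reported as returned.
            if ($1 == "Node" && $3 == "has" && $4 == "returned") finished[$2] = 1
            next
        }
        # Node-list pass: print non-empty lines whose IP was not reported.
        NF && !($1 in finished)
    ' "$2" "$1"
}

# Example:
#   remaining_nodes NODES.csv ./rollingreboot-2016_0310-143816.log > NODES.remaining
```

The node that was mid-reboot when the script timed out has no "returned" line, so it stays in the output. If you have already rebooted that chassis manually, remove its IP by hand before renaming NODES.remaining to NODES.csv and running the script again.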

The script considers a node to have returned when the node IP that it rebooted is back online with a drive state of OK or retired/retiring. It checks only the state of the node IP that it rebooted, so there is a chance that it goes on to reboot the next node in the list while some drives in the previous node are still mounting. This is only a concern if you have some very slow-mounting disks.

© DataCore Software Corporation. · https://www.datacore.com · All rights reserved.