Can you explain the process of a booting Swarm node in a CSN environment?

In a CSN environment, the CSN is used to provide all management services for a Swarm cluster. A booting node receives networking information, operating system, and Swarm configuration from the CSN. The following are the steps involved in that process.

  • This process assumes a running CSN with two logical (bond) interfaces: private and public. Each bond should include two physical interfaces.
  • We will assume that the CSN is powered on and the CSN processes are running.
  • A node is first booted with the BIOS option set to boot from the network.
  • The node should have a network interface in the same broadcast domain (VLAN) as the private bond of the CSN.
  • For the sake of example, assume the node has never been booted as a Swarm node.
  • Since the node is configured to boot from the network, it will first send a DHCP request out its interface.
  • The CSN should be configured to listen for DHCP requests on its private network interface (bond0).
  • The CSN hears the DHCP request from the node and responds with various information, including an IP address, netmask, gateway, the filename of the pxe boot file, and next-server (the tftp server where the node can find information on how to pxe boot).
  • The node configures its network interface with the information provided and then unicasts a tftp request to the next-server for the file named in the filename parameter (ex: gpxelinux.0).
  • Both the next-server and the filename to retrieve were handed to the node in the DHCP response as special options.
  • The node downloads the pxelinux configuration files via tftp.
  • The node then uses the instructions in the pxelinux configuration file to make a new request via http on a special port to the CSN for its operating system.
  • The CSN hears the request for the node's operating system and allows the node to download the Swarm code and the node's associated Swarm configuration file.
  • If the node has never booted as a Swarm node, the node's disks will be reformatted in Swarm's proprietary structure.
  • In the Swarm configuration file, the CSN assigns the node one IP address per Swarm process.
  • The CSN chooses an IP address from a different address range than the DHCP range.
  • The node's networking is restarted with the newly assigned IP address.
  • You will not see the node's final IP address in the /var/lib/dhcpd/dhcpd.leases file, as the final address is not assigned via DHCP.
  • After the node receives its configuration file, which includes its IP address, it sends a DHCP release for the IP address it used to download the gpxelinux.0 file and the Swarm OS. That address is now available for other newly booted nodes.
  • To be clear, there are two pools of addresses that the CSN manages. One is the pool of DHCP addresses, which are used only by newly booting nodes. The other pool is used by the CSN to assign the final IP address to a booted node. This address is assigned in the configuration file and is not DHCP.
  • This means that the final assigned IP address is not susceptible to DHCP lease timeouts.
  • These two IP ranges are created dynamically when you choose a subnet range during CSN startup. The specifics of these ranges are discussed in the CSN guides.
  • The CSN monitors all of its managed nodes. If one of its nodes becomes unresponsive, then the CSN can re-use the IP address that was assigned to the unresponsive node.
  • If a node goes unresponsive, another newly booting node can use its IP address. If the unresponsive node comes back online later and another node is using its previously assigned address, it will have a new IP address assigned by the CSN. This is by design.
  • If a node is rebooted in a CSN environment and the CSN is no longer available, the node will fail to boot. The Swarm OS is stored in RAM and is lost on reboot.
  • The CSN is required even after the node is booted for management purposes. It provides NTP, syslog, and monitoring functions essential to proper Swarm cluster operation.
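
The two address pools described in the steps above can be illustrated with a small Python model. This is only a sketch for building intuition, not CSN code: the class name, method names, node names, and the 10.0.0.x ranges are all hypothetical, and it simplifies by giving each node a single final address rather than one per Swarm process. Handing a reclaimed address to the next booting node first is also a modeling choice, not documented CSN behavior.

```python
import ipaddress


def addr_range(first, last):
    """Return every IPv4 address from first to last, inclusive."""
    start = int(ipaddress.IPv4Address(first))
    end = int(ipaddress.IPv4Address(last))
    return [ipaddress.IPv4Address(i) for i in range(start, end + 1)]


class CsnAddressPools:
    """Toy model of the CSN's two address pools (hypothetical, not CSN code)."""

    def __init__(self, dhcp_range, final_range):
        self.dhcp_free = list(dhcp_range)    # temporary, used only while booting
        self.final_free = list(final_range)  # written into the Swarm config file
        self.leases = {}                     # node -> temporary DHCP address
        self.assigned = {}                   # node -> final address

    def boot(self, node):
        """Walk one node through lease, final assignment, and DHCP release."""
        temp = self.dhcp_free.pop(0)          # DHCP offer used for tftp/http download
        self.leases[node] = temp
        final = self.final_free.pop(0)        # final address from the other pool
        self.assigned[node] = final
        self.dhcp_free.append(self.leases.pop(node))  # DHCP release
        return temp, final

    def mark_unresponsive(self, node):
        """Reclaim the final address of a node that stopped responding."""
        # Assumption: the reclaimed address is offered to the next node first.
        self.final_free.insert(0, self.assigned.pop(node))


# Hypothetical ranges; real ranges are derived by the CSN from the chosen subnet.
pools = CsnAddressPools(
    dhcp_range=addr_range("10.0.0.200", "10.0.0.209"),
    final_range=addr_range("10.0.0.10", "10.0.0.19"),
)

temp_a, final_a = pools.boot("node-a")
print("node-a booted with", temp_a, "and now runs on", final_a)

# node-a goes unresponsive, then a new node boots and reuses its address.
pools.mark_unresponsive("node-a")
temp_b, final_b = pools.boot("node-b")
print("node-b now runs on", final_b)

# node-a comes back and receives a different final address.
temp_a_again, final_a_again = pools.boot("node-a")
print("node-a returned and now runs on", final_a_again)
```

In this model the temporary address never stays in the lease table once boot completes, which mirrors why the final address does not appear in dhcpd.leases and is not subject to lease timeouts.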

© DataCore Software Corporation. · https://www.datacore.com · All rights reserved.