Procedure for Shutting Down and Starting Up DataCore Swarm Cluster

This document outlines the appropriate procedure to safely power down and power on a DataCore Swarm
Cluster, including the SCS, Elasticsearch cluster, Gateways, HAProxy and storage nodes.

Before you begin, open a support ticket with Swarm Support, collect a fresh support bundle from the SCS, all Elasticsearch nodes, and all Gateways, and upload the bundles to the ticket.

How to collect a support bundle

Power Down Procedure

Step 1: Disable Client Access via Load Balancers

Stop client access at the load balancers, typically using HAProxy:

sudo systemctl stop haproxy
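
Before moving on, it can help to confirm that HAProxy has actually stopped on each load balancer. A minimal sketch, assuming HAProxy runs as the systemd unit haproxy (as in the command above); the helper name service_is_stopped is hypothetical:

```shell
# Returns 0 (success) when the named systemd unit is NOT active.
# `systemctl is-active --quiet` exits 0 only while the unit is running.
service_is_stopped() {
    if systemctl is-active --quiet "$1"; then
        echo "$1 is still running" >&2
        return 1
    fi
    echo "$1 is stopped"
}

# Example: service_is_stopped haproxy
```

The same helper could be pointed at other units, for example cloudgateway on a Gateway node.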

Step 2: Stop and Power Down Gateways

Stop the cloudgateway services and power down the gateway nodes:

  • On each Gateway node:

    sudo systemctl stop cloudgateway
    sudo shutdown now

Step 3: Suspend Recoveries on Storage Nodes

Suspend recoveries so that the cluster does not start recovering content while nodes go offline:

  • On the SCS:

    cd /root/dist/
    ./swarmctl -p admin:<adminPassword> -d <any_swarm_node_ip> -fsuspend

  • Wait approximately 6 minutes for the recoveries to suspend.

Step 4: Power Down All Storage Nodes

  • Power down all the storage nodes from the SCS:

  • Verify shutdown: Check that all nodes have powered down using IPMI or by pinging their network addresses.
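
The ping-based verification can be scripted. A minimal sketch, assuming Linux iputils ping and placeholder node addresses; the function names and the INTERVAL and MAX_TRIES defaults are hypothetical, not values from Swarm documentation:

```shell
INTERVAL="${INTERVAL:-10}"     # seconds between polling rounds (illustrative)
MAX_TRIES="${MAX_TRIES:-30}"   # give up after this many rounds (illustrative)

# Succeeds only when none of the given hosts answers a ping.
all_hosts_down() {
    for host in "$@"; do
        # -c 1: send one probe; -W 2: wait at most 2 seconds for a reply
        if ping -c 1 -W 2 "$host" >/dev/null 2>&1; then
            return 1   # at least one node still answers
        fi
    done
    return 0
}

# Polls until every host is silent, or fails after MAX_TRIES rounds.
wait_until_down() {
    tries=0
    until all_hosts_down "$@"; do
        tries=$((tries + 1))
        if [ "$tries" -ge "$MAX_TRIES" ]; then
            echo "timed out: some nodes still respond" >&2
            return 1
        fi
        sleep "$INTERVAL"
    done
    echo "all nodes are down"
}

# Example with placeholder addresses:
# wait_until_down 10.0.0.11 10.0.0.12 10.0.0.13
```

A node that blocks ICMP would look powered down to this check, so IPMI remains the more reliable confirmation.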

Step 5: Shut Down Elasticsearch Cluster

  • Put Elasticsearch in maintenance mode:

    • On the SCS or any Elasticsearch node:

  • Stop the Elasticsearch services and power down the Elasticsearch nodes:

    • On each Elasticsearch node:
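
The commands for this step are not shown here. As a hedged illustration only: Elasticsearch's own documentation prepares a full-cluster restart by restricting shard allocation through the cluster settings API, and that may be what "maintenance mode" refers to. The sketch below assumes Elasticsearch listens on port 9200 without authentication and runs as the systemd unit elasticsearch; ES_HOST and both function names are hypothetical. Confirm the exact commands with Swarm Support before using anything like this.

```shell
ES_HOST="${ES_HOST:-127.0.0.1}"   # placeholder: any Elasticsearch node

# Builds the cluster-settings request body. "primaries" limits allocation
# to primary shards; "null" (unquoted) resets the setting to its default.
allocation_body() {
    case "$1" in
        null) value=null ;;
        *)    value="\"$1\"" ;;
    esac
    printf '{"persistent":{"cluster.routing.allocation.enable":%s}}' "$value"
}

# Sends the setting to the cluster (assumes no authentication is required).
set_allocation() {
    curl -s -X PUT "http://${ES_HOST}:9200/_cluster/settings" \
        -H 'Content-Type: application/json' \
        -d "$(allocation_body "$1")"
}

# Before shutdown:       set_allocation primaries
# Then on each ES node:  sudo systemctl stop elasticsearch && sudo shutdown now
# After the restart:     set_allocation null
```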

Step 6: Power Down SCS Node

  • Power down the SCS

Power On Procedure

The power-on procedure is essentially the reverse of the power-down procedure.

Step 1: Power on the SCS

Begin by powering on the SCS node and wait approximately 5 minutes. Ensure that all containers on the SCS are up and running.

Step 2: Power on Elasticsearch Nodes

  • Power on all Elasticsearch nodes and wait for them to become ready.

    • On the SCS, verify Elasticsearch health:

    • Once the cluster is in yellow status, take Elasticsearch out of maintenance mode from the SCS and confirm that the cluster becomes green:
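
Cluster health can be read from Elasticsearch's _cluster/health API. A minimal sketch, assuming Elasticsearch listens on port 9200 without authentication; ES_HOST and the function names are hypothetical:

```shell
ES_HOST="${ES_HOST:-127.0.0.1}"   # placeholder: any Elasticsearch node

# Extracts the value of the "status" field (green, yellow, or red)
# from a _cluster/health JSON response read on stdin.
parse_status() {
    sed -n 's/.*"status"[[:space:]]*:[[:space:]]*"\([a-z]*\)".*/\1/p' | head -n 1
}

# Queries the cluster and prints only its status.
es_status() {
    curl -s "http://${ES_HOST}:9200/_cluster/health" | parse_status
}

# Example: poll every 10 seconds until the cluster reports green.
# until [ "$(es_status)" = "green" ]; do sleep 10; done
```

Parsing with sed keeps the sketch free of extra dependencies; where jq is installed, `jq -r .status` would be a sturdier choice.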

Step 3: Power on Storage Nodes

Power on the storage nodes, staggering them by about 30 seconds to avoid multiple nodes requesting PXE images simultaneously.
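
The staggered start can be scripted. A minimal sketch, assuming the nodes' BMCs are reachable with ipmitool over the lanplus interface; the function names, the BMC addresses, and the credential variables IPMI_USER and IPMI_PASS are hypothetical placeholders:

```shell
STAGGER="${STAGGER:-30}"   # seconds between nodes, per the guidance above

# Powers on one node through its BMC (ipmitool "chassis power on").
power_on_node() {
    ipmitool -I lanplus -H "$1" -U "$IPMI_USER" -P "$IPMI_PASS" chassis power on
}

# Runs power_on_node for each address, pausing STAGGER seconds between
# nodes so that they do not all request PXE images at once.
staggered_power_on() {
    first=1
    for bmc in "$@"; do
        if [ "$first" -eq 0 ]; then
            sleep "$STAGGER"
        fi
        first=0
        echo "powering on $bmc"
        power_on_node "$bmc" || echo "failed to power on $bmc" >&2
    done
}

# Example with placeholder BMC addresses:
# staggered_power_on 10.0.1.11 10.0.1.12 10.0.1.13
```

A failure on one node is reported but does not stop the loop, so the remaining nodes still start.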

Monitor the storage nodes' status via IPMI, or ping their IP addresses, to confirm they are online.

Step 4: Monitor Swarm Cluster Rejoining

  • From the SCS, monitor the storage nodes as they rejoin the cluster:

  • Wait for the nodes to mount and show OK status.

Step 5: Resume Recoveries

Once all nodes are up and operational, resume recoveries from the SCS:

Step 6: Power on Gateway Nodes

Power on all Gateway nodes and ensure that the Gateway service has started on each one.

Step 7: Validate System Health

  • Check the cluster hardware status via the Storage UI.

  • Ensure the Elasticsearch cluster and feeds are functioning correctly.

  • Verify content access through the Content UI for Tenants, Domains and Buckets, testing each Gateway individually.

Step 8: Power on Load Balancer (optional)

Power on all HAProxy nodes and ensure that the HAProxy service has started on each one.

Step 9: Enable Client Access

Re-enable client access at the load balancers.

Additional Notes

  • Ensure that all services are started and stopped in the correct order as per the procedures above to avoid data inconsistency or service failure.

  • After the system is back online, monitor logs and check for any errors to ensure that all nodes and services are fully operational.

  • On the SCS, verify that Swarm is able to send health reports to DataCore:

Conclusion

By following this procedure, you can safely power down and power on a DataCore Swarm cluster, minimizing risk and ensuring a smooth recovery.

© DataCore Software Corporation. · https://www.datacore.com · All rights reserved.