Retiring Hardware

These are typical reasons to retire hardware:

Important

Reformatting for Encryption - Reformat and remount the volumes if retiring volumes to implement encryption at rest. Contact DataCore Support for a utility to streamline this process. (v10.1)

  • Planned EOL and Upgrades: Replacing old, serviceable hardware with new equipment (a hardware refresh). Planned end-of-life for enclosures involves decommissioning the enclosure by retiring all chassis and moving the reformatted drives back into service.

Why not hotplug? Although Swarm does support moving drives within a cluster for quick data migration, moving an entire set of drives from a server chassis or rack enclosure temporarily risks data loss because the hardware receiving the drives may hold several of the replicas/segments of the same object. In addition, moving drives requires a level of hardware compatibility, and some hardware situations do not support such drive moves:

  • Use of incompatible RAID cards/tagging between chassis (especially true for cards that emulate JBOD with single-disk RAID-0 definitions).

  • Inability of the controller in the new chassis/enclosure to recognize or otherwise work with the drives being moved (such as drive firmware vs. controller firmware, even with a pure HBA/JBOD setup).

  • Inability of the old or new equipment (or both) to properly support hot plug for drives (see https://perifery.atlassian.net/wiki/spaces/public/pages/2443808801).

Best Practice

The safest way to move all data being stored in an end-of-life chassis or enclosure is to retire the chassis and then format and reintroduce the drives.

Retiring Volumes

Modern disk storage devices and interfaces are sophisticated: the underlying disk system already performs many error detection steps, bad sector remappings, and retry attempts. If a physical error still propagates up to the Swarm software level, there is little chance that a deterministic set of steps can work around the failure. Additionally, there is no guarantee that the extent of the error can be isolated, or that continued use of a failing device allows the node to keep operating normally with its peripheral storage devices.

Because of these inherent challenges, Swarm takes the conservative approach of retiring a volume as soon as it receives a configurable number of I/O errors (disk.ioErrorToRetire). If a configured number of additional errors (disk.ioErrorTolerance) is received during the retire, Swarm immediately marks the volume as Unavailable and kicks off both the failed volume recovery process (FVR) and the erasure-coding recovery process (ECR) to relocate all objects on the volume.

Tip

If Swarm retires a disk automatically because of I/O errors, check the diagnostic data collected in the logs. For the Swarm UI, see Managing Chassis and Drives. (v11.1)

Triggers for Retire

Swarm changes a volume's state to Retiring when:

  • Retire is clicked next to a volume on the node status page in the Swarm UI (or the legacy Admin Console).

  • Retire Chassis/Node is clicked, which retires all volumes on the node at the same time.

  • The number of I/O errors specified by disk.ioErrorToRetire occurs within the time period specified by disk.ioErrorWindow.
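The error-count trigger in the last bullet can be pictured as a sliding-window counter. The following is a minimal sketch for illustration only, not Swarm's implementation: the class name, method name, and the threshold values used in the usage note are assumptions, and only the two settings it models (disk.ioErrorToRetire and disk.ioErrorWindow) come from the text above.

```python
from collections import deque


class IoErrorWindow:
    """Illustrative sliding-window counter (hypothetical, not Swarm code).

    errors_to_retire models disk.ioErrorToRetire.
    window_seconds models disk.ioErrorWindow (units assumed to be seconds).
    """

    def __init__(self, errors_to_retire, window_seconds):
        self.errors_to_retire = errors_to_retire
        self.window_seconds = window_seconds
        self._timestamps = deque()

    def record_error(self, now):
        """Record an I/O error at time `now`; return True if a retire should start."""
        self._timestamps.append(now)
        # Forget errors that are older than the window.
        cutoff = now - self.window_seconds
        while self._timestamps and self._timestamps[0] <= cutoff:
            self._timestamps.popleft()
        return len(self._timestamps) >= self.errors_to_retire
```

With an assumed threshold of 3 errors in a 60-second window, two errors followed by a long quiet period do not trigger a retire, because the old errors age out before the threshold is reached.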

A Retiring volume accepts no new or updated objects. A volume remains in the Retiring state until all objects stored on that volume (including replicas) are moved to other volumes in the cluster. The Retiring state persists even if the node is rebooted. The object count may increase.

When all objects are moved, Swarm changes the volume state to Retired and no longer uses the volume. At that point, remove the volume for repair or discard it.
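The volume lifecycle described in this section can be summarized as a small state model. This is a hedged sketch only: the function, its parameters, and the OK state name are hypothetical, and error_tolerance merely stands in for disk.ioErrorTolerance. It does not reflect how Swarm tracks state internally.

```python
from enum import Enum


class VolumeState(Enum):
    OK = "ok"                    # assumed name for a volume in normal service
    RETIRING = "retiring"
    RETIRED = "retired"
    UNAVAILABLE = "unavailable"


def next_state(state, objects_remaining, errors_during_retire, error_tolerance):
    """Return the state a Retiring volume moves to, per the rules in the text.

    - Too many additional I/O errors during the retire: Unavailable.
    - All objects moved off the volume: Retired.
    - Otherwise the volume stays in its current state.
    """
    if state is VolumeState.RETIRING:
        if errors_during_retire >= error_tolerance:
            return VolumeState.UNAVAILABLE
        if objects_remaining == 0:
            return VolumeState.RETIRED
    return state
```

The error check comes first on purpose: a volume that keeps failing is treated as Unavailable even if its object count happens to reach zero at the same moment.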

Canceling an Ongoing Retire

An ongoing retire can be canceled by using the castorCancelVolumeRetire SNMP action. It takes a string argument: either the name of a specific volume (such as "/dev/sda") or "all" to cancel the retire on every volume.

Canceling Retire on a Specific Volume

snmpset -v2c -c ourpwdofchoicehere -m ./CARINGO-MIB.txt:./CARINGO-CASTOR-MIB.txt 192.168.99.100 castorCancelVolumeRetire s "/dev/sda"

Canceling Retire on All Volumes

snmpset -v2c -c ourpwdofchoicehere -m ./CARINGO-MIB.txt:./CARINGO-CASTOR-MIB.txt 192.168.99.100 castorCancelVolumeRetire s "all"
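For scripted use, the two commands above can be generated from one helper. This is a minimal sketch: the function name and parameter names are hypothetical, and the host, community string, and MIB paths are simply the placeholder values from the examples above. It only builds the argument list and leaves running it (for example with subprocess.run) to the caller.

```python
import shlex


def cancel_retire_command(
    host,
    community,
    volume="all",
    mib_files=("./CARINGO-MIB.txt", "./CARINGO-CASTOR-MIB.txt"),
):
    """Build the snmpset argument list that cancels a retire.

    `volume` is a device name such as "/dev/sda", or "all" for every volume.
    """
    return [
        "snmpset", "-v2c",
        "-c", community,
        "-m", ":".join(mib_files),
        host,
        "castorCancelVolumeRetire", "s", volume,
    ]


# Print the command for review instead of executing it.
print(shlex.join(cancel_retire_command("192.168.99.100", "ourpwdofchoicehere", "/dev/sda")))
```

Passing the arguments as a list avoids shell quoting problems if a community string contains spaces or special characters.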

Fast vs. Slow Retire

The only way to have a slow retire (without recovery) is to initiate it manually. If a retire is kicked off from I/O errors, it will always be the “fast” retire.

  • Retire can be initiated due to Swarm detecting hardware issues or the retire may be manually initiated. In the latter case, the volume can be unretired to return it to normal service.

  • The “fast” retire initiates a cluster-wide recovery of the volume that attempts to rapidly replace, elsewhere in the cluster, the replicas that were on the retiring volume(s). This action usually has an impact on cluster performance during recovery. The “slow” retire has minimal performance impact but takes longer. After the recovery phase, “fast” and “slow” retires do a checking phase that is largely the same.

  • Retiring a volume requires significant work and typically takes weeks to complete. A reasonable estimate is three health processor (HP) cycles. The retire rate per hour can be monitored via SNMP. It is also visible in the management API under a node endpoint. Customers can watch the stream counts go down on the volume(s).
    It is normal for the stream count to stay constant for long periods near the end of the process (progress is not linear). This is because the entire disk has to be scanned for the remaining streams.
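To watch progress from recorded stream counts, a rough rate can be computed from two or more samples. This is a minimal sketch under stated assumptions: the function names and the (timestamp, stream count) sample format are hypothetical, and how the samples are collected (SNMP or the management API) is left to the reader. Because progress is not linear, any projection from it is only a loose indicator.

```python
def retire_rate_per_hour(samples):
    """Average streams removed per hour between the first and last sample.

    `samples` is a list of (unix_seconds, stream_count) pairs, oldest first.
    Returns None when there is not enough data to compute a rate.
    """
    if len(samples) < 2:
        return None
    (t0, c0), (t1, c1) = samples[0], samples[-1]
    elapsed_hours = (t1 - t0) / 3600.0
    if elapsed_hours <= 0:
        return None
    return (c0 - c1) / elapsed_hours


def hours_remaining(samples):
    """Naive linear projection; returns None while the count is flat or rising."""
    rate = retire_rate_per_hour(samples)
    if rate is None or rate <= 0:
        return None
    return samples[-1][1] / rate
```

A result of None from hours_remaining during a flat stretch is expected near the end of a retire and does not by itself indicate a stalled process.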

Retire a Chassis

Retiring a node/chassis is equivalent to retiring all of its volumes: Swarm retires every drive within the chassis. Drives that are in good shape can be reformatted and returned to service.

  1. From the Storage UI, select Cluster > Hardware, open the Chassis Details, and select Retire from the action (gear) menu.

  2. Wait for all volumes to reach the status of "retired".

  3. Stop the storage processes from the system console on the physical chassis: System Control > 3. Stop Storage Processes

  4. Format any disk that may be returned to service: Disk Volumes > ALL

  5. Shut down the node: System Control > 2. Shutdown System

  6. Transfer reformatted drives to other chassis as appropriate. (See )

Retire an Enclosure

  1. Add the new equipment to the cluster.

  2. Retire each chassis in the old enclosure following the procedure above to reclaim any serviceable drives.

  3. Power down and remove the enclosure when every chassis is retired (with drives reformatted for reuse) and shut down.

  4. Insert the formatted drives into a chassis; once inserted, the drives return to service storing Swarm data.

Retire Multiple Chassis

When retiring multiple chassis, retire one chassis at a time to minimize cluster disruption and maintain client performance.
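The one-at-a-time rule can be sketched as a simple loop that starts the next retire only after every volume on the current chassis reports "retired". This is an illustration only: start_retire and volume_states are hypothetical callbacks that the reader would implement against their own tooling (the Swarm UI procedure above remains the documented method), and the polling interval is an arbitrary assumption.

```python
import time


def retire_chassis_sequentially(
    chassis_ids,
    start_retire,      # hypothetical callback: start_retire(chassis_id)
    volume_states,     # hypothetical callback: returns a list of state strings
    poll_seconds=300,  # assumed polling interval, not a Swarm setting
    sleep=time.sleep,
):
    """Retire each chassis in turn, waiting for all of its volumes to finish."""
    for chassis in chassis_ids:
        start_retire(chassis)
        # Block until every volume on this chassis reports "retired".
        while not all(s == "retired" for s in volume_states(chassis)):
            sleep(poll_seconds)
```

Note that a chassis reporting no volumes at all is treated as finished, so a real implementation may want to guard against an empty or failed status query.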

 

© DataCore Software Corporation. · https://www.datacore.com · All rights reserved.