Modern disk storage devices and interfaces are sophisticated: the underlying disk system already performs many error detection steps, bad sector remappings, and retry attempts. If a physical error still propagates to the Swarm software level, there is little chance that a deterministic set of steps can work around the failure. Additionally, there is no guarantee that the extent of the error can be isolated or that continued use of a failing device will allow the node to continue operating normally with its peripheral storage devices.

Because of these inherent challenges, Swarm takes the conservative approach of retiring a volume as soon as it receives a configurable number of I/O errors. If the configured number of additional errors (disk.ioErrorTolerance) is received during the retire, Swarm immediately marks the volume as Unavailable and kicks off both the failed volume recovery process (FVR) and the erasure-coding recovery process (ECR) to relocate all of the volume's objects.

Tip

If Swarm retires a disk automatically because of I/O errors, check the diagnostic data collected in the logs. For the Swarm UI, see Managing Chassis and Drives. (v11.1)

Triggers for Retire

Swarm changes a volume's state to Retiring when any of these events occurs:

Retire is clicked next to a volume on the node status page in the Swarm UI (or the legacy Admin Console).
Retire Chassis/Node is clicked, which retires all volumes on the node at the same time.
The number of I/O errors specified by disk.ioErrorToRetire occurs within the time period specified by disk.ioErrorWindow.
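The I/O error trigger in the list above can be pictured as a counter over a trailing time window. The following is a minimal sketch of that idea, not Swarm's implementation: the class name `IoErrorWindow`, the sliding-window behavior, and the use of seconds as the unit are all assumptions made for illustration. Only the two setting names come from this page.

```python
from collections import deque


class IoErrorWindow:
    """Hypothetical model: counts I/O errors seen within a trailing time window."""

    def __init__(self, errors_to_retire: int, window_seconds: float) -> None:
        self.errors_to_retire = errors_to_retire  # stands in for disk.ioErrorToRetire
        self.window_seconds = window_seconds      # stands in for disk.ioErrorWindow (unit assumed)
        self._timestamps = deque()

    def record_error(self, now: float) -> bool:
        """Record one error at time `now`; return True if the retire threshold is reached."""
        self._timestamps.append(now)
        # Discard errors that are older than the window.
        while self._timestamps and now - self._timestamps[0] > self.window_seconds:
            self._timestamps.popleft()
        return len(self._timestamps) >= self.errors_to_retire
```

In this model, errors that are spread out over a period longer than the window never accumulate to the threshold, while a burst of errors inside the window does.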

A Retiring volume accepts no new or updated objects. A volume remains in the Retiring state until all objects stored on that volume (including replicas) are moved to other volumes in the cluster. The Retiring state persists even if the node is rebooted. The object count may increase.

When all objects are moved, the volume state changes to Retired and Swarm no longer uses the volume. At that point, remove the volume for repair or discard it.

...

Note

If continued I/O errors exceed the number specified by disk.ioErrorTolerance while the volume is in the Retiring state, the volume state changes to Unavailable, regardless of whether Swarm has finished moving objects to other volumes.
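The two ways a volume can leave the Retiring state, as described on this page, can be summarized as a small decision function. This is an illustrative sketch only: `VolumeState` and `next_state` are hypothetical names, and the function models the documented behavior rather than Swarm's internal logic.

```python
from enum import Enum


class VolumeState(Enum):
    RETIRING = "Retiring"
    RETIRED = "Retired"
    UNAVAILABLE = "Unavailable"


def next_state(state, objects_remaining, errors_since_retire, io_error_tolerance):
    """Hypothetical model of how a Retiring volume changes state.

    io_error_tolerance stands in for disk.ioErrorTolerance.
    """
    if state is not VolumeState.RETIRING:
        return state  # Retired and Unavailable are treated as final in this sketch
    # Exceeding the tolerance takes precedence, whether or not objects are still being moved.
    if errors_since_retire > io_error_tolerance:
        return VolumeState.UNAVAILABLE
    # All objects have been moved to other volumes.
    if objects_remaining == 0:
        return VolumeState.RETIRED
    return VolumeState.RETIRING
```

The error check comes first because the note above says the change to Unavailable happens regardless of whether the move has finished.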

Canceling an Ongoing Retire

An ongoing retire can be canceled by using the castorCancelVolumeRetire SNMP action. It takes a string that names a specific volume, or all.

Canceling Retire on a Specific Volume

snmpset -v2c -c ourpwdofchoicehere -m ./CARINGO-MIB.txt:./CARINGO-CASTOR-MIB.txt \
	192.168.99.100 castorCancelVolumeRetire s "/dev/sda"

Canceling Retire on All Volumes

snmpset -v2c -c ourpwdofchoicehere -m ./CARINGO-MIB.txt:./CARINGO-CASTOR-MIB.txt \
	192.168.99.100 castorCancelVolumeRetire s "all"
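The two invocations above differ only in the string argument. If the cancel action needs to be scripted, the argument list can be assembled in one place. The sketch below is a hypothetical helper (`cancel_retire_command` is not part of Swarm or Net-SNMP); it only builds the same `snmpset` arguments shown above and does not run anything. The node address, community string, and MIB file paths are the example values from this page.

```python
import shlex


def cancel_retire_command(node, community, volume="all",
                          mib_files=("./CARINGO-MIB.txt", "./CARINGO-CASTOR-MIB.txt")):
    """Build the snmpset argument list that cancels a retire (hypothetical helper).

    volume is a device name such as "/dev/sda", or "all" for every volume on the node.
    """
    return [
        "snmpset", "-v2c",
        "-c", community,            # SNMP community string
        "-m", ":".join(mib_files),  # MIB files, colon-separated as in the examples
        node,
        "castorCancelVolumeRetire", "s", volume,
    ]


# Print the command for review instead of executing it.
argv = cancel_retire_command("192.168.99.100", "ourpwdofchoicehere", "/dev/sda")
print(shlex.join(argv))
```

Keeping the command as a list avoids shell quoting problems if it is later passed to `subprocess.run(argv, check=True)`.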

...
