Modern disk storage devices and interfaces already perform many error-detection steps, bad-sector remappings, and retry attempts on their own. If a physical error still propagates to the Swarm software level, there is little chance that a deterministic set of steps can work around the failure. There is also no guarantee that the extent of the error can be isolated, or that the node can keep operating normally with its other storage devices while the failing device stays in use.
Because of these inherent challenges, Swarm takes the conservative approach of retiring a volume as soon as it receives a configurable number of I/O errors. If a further configured number of errors (disk.ioErrorTolerance) is received during the retire, Swarm immediately marks the volume as Unavailable and starts both the failed volume recovery (FVR) process and the erasure-coding recovery (ECR) process to relocate all of the volume's objects.
Info |
---|
Tip: If Swarm retires a disk automatically because of I/O errors, check the diagnostic data collected in the logs. For the Swarm UI, see Managing Chassis and Drives. (v11.1) |
Triggers for Retire
...
When all objects have been moved, the volume state changes to Retired and Swarm no longer uses the volume. At that point, remove the volume for repair or discard it.
...
Note
If continued I/O errors exceed the number specified by disk.ioErrorTolerance while the volume is in the Retiring state, the volume state changes to Unavailable, regardless of whether Swarm has finished moving objects to other volumes.
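The state rules described in this section can be summarized in a small shell sketch. It is an illustration only: the function name `next_volume_state`, its arguments, and the tolerance value of 200 used in the calls are hypothetical and are not part of Swarm. The sketch simply encodes the two rules stated here: exceeding disk.ioErrorTolerance during a retire forces Unavailable, and otherwise the volume becomes Retired once no objects remain on it.

```shell
#!/bin/sh
# Illustrative only: next_volume_state is a hypothetical helper, not a Swarm command.
# Usage: next_volume_state <io_errors_during_retire> <io_error_tolerance> <objects_remaining>
next_volume_state() {
    errors=$1
    tolerance=$2
    remaining=$3
    if [ "$errors" -gt "$tolerance" ]; then
        # Errors exceeded the tolerance while Retiring: Unavailable takes
        # precedence, whether or not objects are still being moved.
        echo "Unavailable"
    elif [ "$remaining" -eq 0 ]; then
        # All objects have been moved off the volume.
        echo "Retired"
    else
        echo "Retiring"
    fi
}

# The tolerance of 200 is an assumed value chosen for the example.
next_volume_state 5 200 1000    # prints Retiring
next_volume_state 201 200 1000  # prints Unavailable
next_volume_state 5 200 0       # prints Retired
```

The error check comes first so that a volume exceeding the tolerance is reported as Unavailable even when the move has already finished.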
Canceling an Ongoing Retire
An ongoing retire can be canceled with the castorCancelVolumeRetire SNMP action. It takes a string argument: either the name of a specific volume (for example, /dev/sda) or all to cancel the retire on every volume.
Canceling
...
Retire on a
...
Specific Volume
Code Block | ||
---|---|---|
| ||
snmpset -v2c -c ourpwdofchoicehere -m ./CARINGO-MIB.txt:./CARINGO-CASTOR-MIB.txt 192.168.99.100 castorCancelVolumeRetire s "/dev/sda" |
Canceling
...
Retire on
...
All Volumes
Code Block | ||
---|---|---|
| ||
snmpset -v2c -c ourpwdofchoicehere -m ./CARINGO-MIB.txt:./CARINGO-CASTOR-MIB.txt 192.168.99.100 castorCancelVolumeRetire s "all" |
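When the same cancellation has to be issued against several nodes, it may help to build the command from parameters and review it before running it. The following is a minimal sketch under stated assumptions: `cancel_retire_cmd` is a hypothetical helper name, and the MIB file paths, node address, and community string are simply copied from the examples above. It only prints the snmpset command line and does not execute it.

```shell
#!/bin/sh
# Illustrative wrapper: cancel_retire_cmd is a hypothetical name, not a Swarm tool.
# It prints the snmpset command so it can be reviewed before being run.
# Usage: cancel_retire_cmd <node_ip> <snmp_community> <volume|all>
cancel_retire_cmd() {
    if [ "$#" -ne 3 ]; then
        # Require an explicit target so "all" is never chosen by accident.
        echo "usage: cancel_retire_cmd <node_ip> <snmp_community> <volume|all>" >&2
        return 1
    fi
    node=$1
    community=$2
    target=$3
    printf 'snmpset -v2c -c %s -m ./CARINGO-MIB.txt:./CARINGO-CASTOR-MIB.txt %s castorCancelVolumeRetire s "%s"\n' \
        "$community" "$node" "$target"
}

# Print the command for one volume, then for all volumes.
cancel_retire_cmd 192.168.99.100 ourpwdofchoicehere /dev/sda
cancel_retire_cmd 192.168.99.100 ourpwdofchoicehere all
```

Once the printed command looks right, it can be pasted into a shell on a host that has the Net-SNMP tools and the MIB files available.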
...