
...

Modern disk storage devices and interfaces are sophisticated: the underlying disk system already performs many error detection steps, bad sector remappings, and retry attempts. If a physical error still propagates to the Swarm software level, there is little chance that a deterministic set of steps can work around the failure. Additionally, there is no guarantee that the extent of the error can be isolated or that the node can keep operating normally while it continues to use a failing device.

Because of these inherent challenges, Swarm takes the conservative approach of retiring a volume as soon as it receives a configurable number of I/O errors. If the volume then receives the configured number of additional errors while the retire is in progress (disk.ioErrorTolerance), Swarm immediately marks the volume as Unavailable and starts both the volume recovery process (FVR) and the erasure-coding recovery process (ECR) to relocate all of the volume's objects.
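The two-threshold behavior described above can be sketched as a small state machine. This is only an illustrative model, not Swarm's implementation: the class, the state names, and `retire_threshold` are hypothetical, and `io_error_tolerance` merely stands in for the disk.ioErrorTolerance setting.

```python
from dataclasses import dataclass
from enum import Enum


class VolumeState(Enum):
    OK = "ok"
    RETIRING = "retiring"
    UNAVAILABLE = "unavailable"


@dataclass
class Volume:
    # Hypothetical name: number of I/O errors that triggers a retire.
    retire_threshold: int
    # Stands in for disk.ioErrorTolerance: additional errors tolerated
    # while the retire is in progress.
    io_error_tolerance: int
    state: VolumeState = VolumeState.OK
    errors: int = 0
    errors_during_retire: int = 0

    def record_io_error(self) -> VolumeState:
        """Count one I/O error and return the resulting volume state."""
        if self.state is VolumeState.OK:
            self.errors += 1
            if self.errors >= self.retire_threshold:
                self.state = VolumeState.RETIRING
        elif self.state is VolumeState.RETIRING:
            self.errors_during_retire += 1
            if self.errors_during_retire >= self.io_error_tolerance:
                # In the text, this is the point where the volume is marked
                # Unavailable and FVR and ECR take over relocating objects.
                self.state = VolumeState.UNAVAILABLE
        return self.state
```

With `Volume(retire_threshold=2, io_error_tolerance=3)`, for example, the second error would move the model to the retiring state, and three further errors during the retire would move it to unavailable.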

Tip

If Swarm retires a disk automatically because of I/O errors, you can check the diagnostic data collected in the logs. For the Swarm UI, see Managing Chassis and Drives. (v11.1)

Triggers for Retire

Swarm changes a volume's state to Retiring when any of these events occurs:

...

A Retiring volume accepts no new or updated objects. A volume remains in the Retiring state until all of the objects stored on that volume (including replicas) are moved to other volumes in the cluster. The Retiring state persists even if the node is rebooted. You may see the object count increase.

When all objects are moved, Swarm changes the volume state to Retired and no longer uses the volume. At that point, remove the volume for repair or discard it.
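As a rough illustration of the Retiring-to-Retired transition, the sketch below models a volume that rejects writes while its objects are relocated and flips to Retired once it is empty. All names here are hypothetical and the object store is reduced to a Python set; it does not reflect how Swarm tracks or moves objects internally.

```python
class RetiringVolume:
    """Toy model of a volume that is being drained of its objects."""

    def __init__(self, objects):
        self.objects = set(objects)
        self.state = "Retiring"
        self._update_state()

    def _update_state(self):
        # The volume becomes Retired only when nothing is left on it.
        if not self.objects:
            self.state = "Retired"

    def write(self, name):
        # A Retiring (or Retired) volume accepts no new or updated objects.
        raise PermissionError(f"volume is {self.state}; cannot write {name!r}")

    def relocate(self, name, target):
        """Move one object to another volume, modeled here as a set."""
        self.objects.remove(name)
        target.add(name)
        self._update_state()
```

Relocating every object in turn leaves the model in the Retired state, which corresponds to the point at which the physical volume can be pulled for repair or discarded.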

...