I/O errors and Swarm nodes

It is normal for a disk volume to go into "retiring" mode following any I/O error. Unless there is a cascade of additional errors, Swarm continues to retire the volume until it is empty. Once the volume transitions to the "retired" state, you can replace the disk. If a cascade of errors prevents the retire from completing, Swarm initiates a failed volume recovery (FVR), which re-replicates the content from the failed disk more quickly as a high-priority cluster operation.

After a single I/O error, a volume starts to retire. After 200 I/O errors, a volume is moved offline, whether or not the retire has started or finished. 200 is the default value of disk.ioErrorTolerance. Currently, there is no configuration option for the first response: a single I/O error puts the volume on the retire path.
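The two thresholds described above can be pictured as a small decision function. The following Python sketch is illustrative only and is not Swarm code: the function name `volume_state` and the state strings are hypothetical, and it assumes the volume goes offline when the error count reaches the tolerance. Only the default of 200 for disk.ioErrorTolerance comes from the text.

```python
# Illustrative model only; this is not Swarm source code.
IO_ERROR_TOLERANCE = 200  # default value of disk.ioErrorTolerance, per the text


def volume_state(io_errors: int, tolerance: int = IO_ERROR_TOLERANCE) -> str:
    """Return the (hypothetical) state a volume moves toward for an error count."""
    # Assumption: the volume goes offline once the count reaches the tolerance.
    if io_errors >= tolerance:
        return "offline"
    # The first response is not configurable: one error starts the retire.
    if io_errors >= 1:
        return "retiring"
    return "ok"
```

For example, `volume_state(1)` and `volume_state(199)` both give `"retiring"`, while `volume_state(200)` gives `"offline"`.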

The state change does not happen at the moment the I/O error occurs, but on the next disk heartbeat (every 60 seconds). As a result, if more than one I/O error has occurred, the log message may indicate that a disk is retiring because of some number of errors. In fact, it is retiring because of the first error; the number reported is simply how many errors occurred in the fraction of a minute before the retire began.
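To see why the logged count can exceed one, the heartbeat timing can be modeled roughly as follows. This is a minimal Python sketch under stated assumptions, not Swarm's implementation: the function name `errors_reported_at_retire` is hypothetical, heartbeats are assumed to fire at exact multiples of 60 seconds, and the retire is assumed to begin at the first heartbeat at or after the first error.

```python
HEARTBEAT_SECONDS = 60  # disk heartbeat interval, per the text


def errors_reported_at_retire(error_times, heartbeat=HEARTBEAT_SECONDS):
    """Given I/O error timestamps (seconds), return (heartbeat_time, errors_logged).

    Assumptions (illustrative, not Swarm internals): heartbeats fire at
    multiples of `heartbeat`, and the retire starts at the first heartbeat
    at or after the first error.
    """
    if not error_times:
        return None
    first = min(error_times)
    # Round the first error time up to the next heartbeat tick.
    tick = -(-first // heartbeat) * heartbeat
    # Every error seen up to that tick is counted in the log message.
    count = sum(1 for t in error_times if t <= tick)
    return tick, count
```

With errors at 65, 70, 90, and 130 seconds, `errors_reported_at_retire([65, 70, 90, 130])` returns `(120, 3)`: the retire is triggered by the error at 65 seconds, but three errors have accumulated by the heartbeat at 120 seconds.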

© DataCore Software Corporation. · https://www.datacore.com · All rights reserved.