I/O errors and Swarm nodes

It is normal for a disk volume to enter the "retiring" state after any I/O error. Unless a cascade of additional errors occurs, Swarm continues to retire the volume until it is empty. Once the volume transitions to the "retired" state, you can replace the disk. If a cascade of errors prevents the retire from completing, Swarm initiates a failed volume recovery (FVR), which re-replicates the content from the failed disk more quickly, as a high-priority cluster operation.

After a single I/O error, a volume starts to retire. After 200 I/O errors, the volume is taken offline, whether or not the retire has started or finished. 200 is the default value of disk.ioErrorTolerance. Currently, there is no configuration option for the first response: a single I/O error always sends the volume down the retire path.
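The two thresholds can be summarized as a small decision function. This is only an illustrative sketch, not Swarm code: the names VolumeState and state_for_error_count are hypothetical, and it assumes the volume goes offline once the error count reaches the tolerance value (the text does not say whether the comparison is "at" or "above" 200).

```python
from enum import Enum

# Default for disk.ioErrorTolerance, as stated in the article.
DEFAULT_IO_ERROR_TOLERANCE = 200
# The first response is fixed at one error and is not configurable.
RETIRE_THRESHOLD = 1


class VolumeState(Enum):
    """Hypothetical labels for the states described in the article."""
    OK = "ok"
    RETIRING = "retiring"
    OFFLINE = "offline"


def state_for_error_count(io_errors, tolerance=DEFAULT_IO_ERROR_TOLERANCE):
    """Map a cumulative I/O error count to the state the article describes.

    Assumption: reaching the tolerance (>=) takes the volume offline.
    """
    if io_errors < 0:
        raise ValueError("io_errors must be non-negative")
    if io_errors >= tolerance:
        # Offline regardless of whether the retire started or finished.
        return VolumeState.OFFLINE
    if io_errors >= RETIRE_THRESHOLD:
        return VolumeState.RETIRING
    return VolumeState.OK


if __name__ == "__main__":
    for count in (0, 1, 199, 200):
        print(count, state_for_error_count(count).value)
```

Only the upper threshold is a parameter here, mirroring the fact that disk.ioErrorTolerance can be changed while the single-error retire trigger cannot.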

The state change does not happen at the moment the I/O error occurs, but at the next disk heartbeat (every 60 seconds). As a result, if more than one I/O error has occurred, the log message may say that the disk is retiring because of some larger number of errors. In fact, the retire was triggered by the first error; the number reported is simply how many errors accumulated in the fraction of a minute before the retire began.
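To see why the logged count can exceed one, the heartbeat timing can be modeled roughly as follows. This is a simplified sketch, not Swarm's implementation: the function name errors_reported_at_retire is hypothetical, and it assumes heartbeats fall on exact multiples of 60 seconds and that an error landing exactly on a heartbeat is counted by that heartbeat.

```python
import math

# Disk heartbeat interval from the article.
HEARTBEAT_SECONDS = 60


def errors_reported_at_retire(error_times, heartbeat=HEARTBEAT_SECONDS):
    """Return (heartbeat_time, errors_counted) for the heartbeat that
    starts the retire, or None if there were no errors.

    error_times: seconds (since an arbitrary start) at which I/O errors
    occurred. Assumption: heartbeats occur at multiples of `heartbeat`.
    """
    if not error_times:
        return None
    first_error = min(error_times)
    # The retire begins at the first heartbeat at or after the first error.
    beat = math.ceil(first_error / heartbeat) * heartbeat
    # Every error seen up to that heartbeat is included in the logged count.
    counted = sum(1 for t in error_times if t <= beat)
    return beat, counted


if __name__ == "__main__":
    # Errors at 65 s, 70 s, and 110 s all precede the heartbeat at 120 s;
    # the error at 130 s arrives after the retire has already begun.
    print(errors_reported_at_retire([65, 70, 110, 130]))
```

In this example the first error at 65 seconds is what triggers the retire, but the heartbeat at 120 seconds would report three errors.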

© DataCore Software Corporation. · https://www.datacore.com · All rights reserved.