Troubleshooting a spike in disk failures or retires

If you see an unusually large number of hard disk failures or retires in a short period of time (a high frequency of Unavailable or Retired volumes in the Swarm UI), several causes are possible.

Drive errors — On a Swarm node, when the Linux kernel reports 2 I/O failures on a drive (disk.ioErrorToRetire = 2 by default), Swarm starts retiring that drive. These errors are considered serious enough that Swarm no longer writes to the drive. If more than 200 I/O errors (disk.ioErrorTolerance = 200 by default) occur on a single drive, the drive is immediately taken offline, with no chance to recover the streams from the retiring drive itself.
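
The interaction of these two thresholds can be illustrated with a short sketch. This is not Swarm's implementation; only the setting names and their defaults come from the text above, and the class and method names are hypothetical:

    # Illustrative sketch of the documented retire/offline thresholds.
    # NOT Swarm code: only the setting names and defaults are from the docs.

    DISK_IO_ERROR_TO_RETIRE = 2    # disk.ioErrorToRetire default
    DISK_IO_ERROR_TOLERANCE = 200  # disk.ioErrorTolerance default

    class DriveErrorTracker:
        """Tracks kernel-reported I/O errors for one drive (hypothetical)."""

        def __init__(self):
            self.io_errors = 0
            self.state = "ok"

        def record_io_error(self):
            self.io_errors += 1
            if self.state == "ok" and self.io_errors >= DISK_IO_ERROR_TO_RETIRE:
                # Swarm stops writing to the drive and begins recovering
                # its streams elsewhere in the cluster.
                self.state = "retiring"
            if self.io_errors > DISK_IO_ERROR_TOLERANCE:
                # Past the tolerance, the drive is taken offline at once,
                # with no further chance to read streams off the drive.
                self.state = "offline"
            return self.state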

Swarm depends on how the embedded Linux OS surfaces disk errors to it. If the kernel reports an I/O error, Swarm cannot validate that the kernel is correct: the kernel is responsible for the correctness of the error, and the Linux Generic SCSI driver (sg) determines what constitutes an exception. Swarm sees only the error code returned by this driver, such as "Errno 5: Input/output error".
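
For illustration only, this is how such an error surfaces to a userspace process on Linux: the kernel reports the failure, and the caller receives errno 5 (EIO). The device path is a hypothetical placeholder, and this is not Swarm code:

    import errno
    import os

    DEVICE = "/dev/sdb"  # hypothetical example device

    try:
        fd = os.open(DEVICE, os.O_RDONLY)
        os.read(fd, 4096)
        os.close(fd)
    except OSError as e:
        if e.errno == errno.EIO:
            # The "Errno 5: Input/output error" described above; the
            # process must trust the kernel/sg driver's verdict.
            print(f"I/O error on {DEVICE}: {e}")
        else:
            raise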

Important

Relaxing the error threshold settings (disk.ioErrorToRetire, disk.ioErrorWindow, disk.ioErrorTolerance) can jeopardize data integrity. Do not change these settings from their defaults without contacting DataCore Support for a thorough analysis and recommendation.

Firmware and driver versions — If a drive is being taken offline, either the drive has experienced genuine errors (as described above) or something in the SAS/SATA/SCSI end-to-end chain is amiss:

  • The disk firmware is outdated
  • The controller firmware is outdated
  • The controller firmware is out of sync with the mpt2sas driver delivered with that particular Swarm version

Important

The SAS/SATA/SCSI end-to-end chain (HDD firmware, RAID Controller firmware, and the mpt2sas driver delivered with the Swarm version) must be consistent to rule out erroneous disk failures.

To check the consistency of the SAS/SATA/SCSI end-to-end chain and resolve any conflicts found, follow these steps:

  1. Check with the HDD manufacturer to determine whether the drives come from a bad batch or a model with known issues.
  2. Check that the HDD firmware is at the latest version provided by the manufacturer.
    • Outdated HDD firmware has been shown to produce errors that cause disk failures on otherwise functional disk drives.
  3. Check that the RAID Controller firmware is at the latest version provided by the manufacturer.
    • Outdated RAID Controller firmware has been known to create conflicts and critical failures (due to the needed upgrade) that are reported as disk failures of otherwise functional hard disks.
  4. Check that the mpt2sas driver embedded in your version of Swarm is the version recommended by your disk controller's manufacturer.
    • A mismatch between the RAID Controller firmware and the mpt2sas driver may not yield the best performance and can also result in hard disk failures. If a mismatch is identified, upgrade to the appropriate Swarm version (the one with the correct mpt2sas driver); a sketch that gathers these versions in one pass follows this list.
    • To check the mpt2sas driver version of a particular Swarm version, see: http://list.of.mpt2sas_version.to.swarm_version
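
To make steps 2 and 4 concrete, the following diagnostic sketch gathers the drive firmware and mpt2sas driver versions in one pass. It assumes smartmontools (smartctl) and modinfo are available and that it runs as root on the node's OS or a comparable Linux environment; the device list is a placeholder, and it is not a DataCore-provided tool:

    #!/usr/bin/env python3
    """Collect the versions that make up the SAS/SATA/SCSI chain (sketch)."""
    import subprocess

    def run(cmd):
        """Run a command and return its stdout, or '' on failure."""
        try:
            return subprocess.run(cmd, capture_output=True, text=True,
                                  check=True).stdout
        except (OSError, subprocess.CalledProcessError):
            return ""

    def drive_firmware(device):
        """Extract the 'Firmware Version' line from smartctl -i output."""
        for line in run(["smartctl", "-i", device]).splitlines():
            if line.startswith("Firmware Version:"):
                return line.split(":", 1)[1].strip()
        return "unknown"

    def mpt2sas_driver_version():
        """Extract the version field from modinfo mpt2sas output."""
        for line in run(["modinfo", "mpt2sas"]).splitlines():
            if line.startswith("version:"):
                return line.split(":", 1)[1].strip()
        return "unknown"

    if __name__ == "__main__":
        for dev in ["/dev/sda", "/dev/sdb"]:  # hypothetical device list
            print(f"{dev}: firmware {drive_firmware(dev)}")
        print(f"mpt2sas driver: {mpt2sas_driver_version()}")

The RAID Controller firmware (step 3) must be read with the vendor's own utility, which varies by controller model; compare all three versions against the manufacturer's and DataCore's recommendations before concluding that a drive has genuinely failed.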

keywords: failed drive, unavailable, retired, retiring

© DataCore Software Corporation. · https://www.datacore.com · All rights reserved.