Diagnosing Missing, Unavailable or Degraded Disks on Swarm Storage Node

This article guides you through diagnosing missing or degraded disks on a Swarm storage node. Persistent I/O errors, buffer errors, slow touch times, and unrecovered read errors in system logs often indicate hardware issues that require attention. This guide provides a step-by-step process for gathering diagnostics and identifying faulty disks.

Step 1: Update Support Tools

Ensure you have the latest version of Support Tools to retrieve accurate diagnostics.

  1. Access the Support Tools directory:

    cd /root/dist/
  2. Update the Support Tools bundle:

    ./updateBundle.sh
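
Optionally, to confirm the bundle actually refreshed, a quick check of the modification times in the tools directory is usually enough (the exact file names in your bundle may vary):

    # Quick sanity check: list the most recently modified files in the Support Tools directory
    ls -lt /root/dist/ | head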

Step 2: Retrieve System Logs and Hardware Information

Collect dmesg logs and hardware details (hwinfo) to check for hardware-related issues on the storage node.

  1. Retrieve dmesg logs (to check recent hardware events and errors):

    cd /root/dist/
    ./swarmctl -p admin:<adminPassword> -d <node_IP_address> -Qdmesg -x

Replace <adminPassword> and <node_IP_address> (e.g. 192.168.1.84) with the node’s credentials and IP address.

  2. Retrieve hwinfo (for detailed hardware information):

    ./swarmctl -p admin:<adminPassword> -d <node_IP_address> -Qhwinfo -x

Example Logs: If you observe messages such as “unrecovered read errors”, “buffer I/O errors”, or “Read Capacity([number]) failed” in dmesg, this likely points to physical issues on the disk, such as bad sectors or critical medium errors. For instance:

[Tue Jun 25 17:46:38 2024] sd 1:0:3:0: [sdc] Read Capacity(10) failed: Result: hostbyte=DID_OK driverbyte=DRIVER_OK
[Tue Nov 12 15:35:44 2024] Buffer I/O error on dev sds, logical block 4378185822, async page read
[Tue Nov 12 15:36:55 2024] sd 14:0:18:0: [sds] tag#93 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=5s
[Tue Nov 12 15:36:49 2024] blk_update_request: critical medium error, dev sds, sector 35025486576 op 0x0:(READ) flags 0x80700 phys_seg 5 prio class 0
[Tue Nov 12 15:36:55 2024] sd 14:0:18:0: [sds] tag#93 CDB: Read(16) 88 00 00 00 00 08 27 ae 82 f0 00 00 00 08 00 00

With Read Capacity(10) failed errors, the disk may not even appear in the Swarm console.
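
If you captured the dmesg output to a file, a simple grep can surface these error signatures. The following is a minimal sketch that assumes the output was saved as dmesg.log; adjust the file name to match how you captured it:

    # Scan a saved dmesg capture for the disk-error signatures discussed above
    grep -iE 'unrecovered read error|Buffer I/O error|Read Capacity|critical medium error' dmesg.log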

Step 3: Verify Disk Information from Health Report

The health report provides details on the disk device names and serial numbers, which can help verify faulty disks and locate replacement details.

  1. Generate a health report for the node:

    cd /root/dist/
    ./swarmctl -p admin:<adminPassword> -d <node_IP_address> -Qhealthreport -x
  2. Extract the disk information from the JSON health report (e.g., 2024_1113_1241-healthreport-100.126.145.57.json as shown below) and display each disk in a readable format to verify against the dmesg output. This step requires jq. If jq is not installed natively on your SCS, you can use the jq binary included in the Support Tools bundle (usually /root/dist/jq) directly. The example below reads this information from an SCS or Content Gateway with the Support Tools loaded in /root/dist.

    /root/dist/jq -r '.["SNMP tables"]["Drive Table"] | .["drive name"] as $names | .["drive serial number"] as $serials | [$names, $serials] | transpose[] | "drive name: \(.[0]) - drive serial number: \(.[1])"' 2024_1113_1241-healthreport-100.126.145.57.json

Sample output:

drive name: /dev/sda - drive serial number: ZVT86YFA
drive name: /dev/sdb - drive serial number: ZVT86YES
drive name: /dev/sdc - drive serial number: ZVT87VXK
drive name: /dev/sdd - drive serial number: ZVT86TXG
drive name: /dev/sde - drive serial number: ZVT88E8A
drive name: /dev/sdf - drive serial number: ZVT88E8K
drive name: /dev/sdg - drive serial number: ZVT86W67
drive name: /dev/sdh - drive serial number: ZVT86V03
drive name: /dev/sdi - drive serial number: ZVT7CLDZ
drive name: /dev/sdj - drive serial number: ZVT88GR6
drive name: /dev/sdk - drive serial number: ZVT86W31
drive name: /dev/sdl - drive serial number: ZVT87HNQ
drive name: /dev/sdm - drive serial number: ZVT87VXY
drive name: /dev/sdn - drive serial number: ZVT88E93
drive name: /dev/sdo - drive serial number: ZVT88DRE
drive name: /dev/sdp - drive serial number: ZVT86WAN
drive name: /dev/sdq - drive serial number: ZVT87VXW
drive name: /dev/sdr - drive serial number: ZVT88MED
drive name: /dev/sds - drive serial number: ZVT87HJY
drive name: /dev/sdt - drive serial number: ZVT87T65
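
To cross-check a suspect device from dmesg against this list, you can pull just that drive's serial number from the health report. A minimal sketch, reusing the health report file name from above and assuming /dev/sds is the suspect device:

    # Print the serial number for a single suspect device (here /dev/sds)
    /root/dist/jq -r '.["SNMP tables"]["Drive Table"]
      | [.["drive name"], .["drive serial number"]]
      | transpose[]
      | select(.[0] == "/dev/sds")
      | "drive name: \(.[0]) - drive serial number: \(.[1])"' 2024_1113_1241-healthreport-100.126.145.57.json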

Step 4: Check for Disk Errors in castor.log

Besides looking in dmesg, you can also find disk error messages in the castor.log on your CSN/SCS. Certain warning-level messages typically appear when a drive is failing, usually as slow responses to our touch requests or as I/O errors. Touch times should normally be very fast; we warn at two thresholds: touch times longer than 2 seconds and touch times longer than 10 seconds. The warnings look like this (the following log lines are abbreviated):

VOL-SDI-1 DISK.ZFS ERROR: EZF76 [/dev/sdi] touch 2 I/O requests took longer than threshold of 2000 ms(longest: 2440 ms), degrading node performance and possibly indicating impending drive failure.
VOL-SDAC-1 DISK.ZFS CRITICAL: CZF12 [/dev/sdac] touch an I/O request took longer than threshold of 10000 ms(longest: 18386 ms), degrading node performance and possibly indicating impending drive failure.

You can grep the castor.log for the error codes EZF76 and CZF12 to see these messages, as shown below.
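
For example, substituting the actual castor.log location on your CSN/SCS:

    # Find slow-touch warnings (2-second and 10-second thresholds) in castor.log
    grep -E 'EZF76|CZF12' <path_to_castor.log>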

When a disk accumulates too many I/O errors, which is common with slow disks, it will be retired. You will see a message such as this:

MAINPROC DISK CRITICAL: CDD24 [/dev/sde] 2 IOErrors forces volume to retire sn=ZR5AAAAA busId=14:0:4:0

You can grep for CDD24 to see these messages. If you see them, the disk should be removed from the cluster and replaced; it can no longer be trusted to protect data.

If a disk that has been forced to retire continues to exhibit significant issues, we mark it Unavailable instead of retiring it. Those errors look like this:

MAINPROC CRITICAL: CDD23 [/dev/sdj] Too many IOErrors (35920/200) - marking unavailable sn=ZAD5J6SA0000C902AAAA busId=14:2:9:0

You can grep for CDD23 to see these messages.

If a disk cannot even be marked as Unavailable, we tell the remaining nodes in the cluster that it is Unavailable, but because the mark could not be written to the disk itself, the disk will not know its own state after a reboot. It may try to mount again, possibly succeed, and then fail again some time later. This is the worst case. Any disk exhibiting this pattern of behavior should be removed immediately.

CDD24 [/dev/sdf] Could not mark volume unavailable. Please physically remove the volume. sn=00afac9f094bf0ef2a0027b6bfa0aaaa busId=10:2:5:0

You can search for CDD24 to see these errors, although they are rare. A single grep covering both codes is shown below.
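
A single grep covering both the retire and unavailable codes from this step can sweep the log in one pass (again, substitute your castor.log location):

    # Sweep castor.log for retire (CDD24) and unavailable (CDD23) disk events
    grep -E 'CDD23|CDD24' <path_to_castor.log>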

Step 5: When All Disks Are Reported Unavailable, Check the RAID Controller

If ALL disks on a node are reported as unavailable or show I/O errors at the same time, this is a strong indicator of HBA (Host Bus Adapter) or RAID controller failure rather than individual disk issues.

This condition may appear in the health report as missing or zero-capacity drives, and in the dmesg logs as widespread I/O errors or firmware faults related to the controller.

Example:

All disks were reported as unavailable due to a critical failure of the MegaRAID SAS controller at PCI address 0000:18:00.0. The controller entered a firmware fault state and was unable to recover, disconnecting all disks.

Key evidence from dmesg.log:

megaraid_sas 0000:18:00.0: FW in FAULT state Fault code:0xfff0000 subcode:0xff00 func:megasas_wait_for_outstanding_fusion
megaraid_sas 0000:18:00.0: resetting fusion adapter scsi10.
megaraid_sas 0000:18:00.0: Diag reset adapter never cleared megasas_adp_reset_fusion 4063
megaraid_sas 0000:18:00.0: Reset failed, killing adapter scsi10.

Subsequent disk I/O failures:

blk_update_request: I/O error, dev sds, sector 5599281408 op 0x0:(READ)
Buffer I/O error on dev sds, logical block 699910176
sds: detected capacity change from 35156656128 to 0

Recommendation: If such symptoms are observed, contact the hardware vendor immediately to service or replace the HBA or RAID controller. Do not attempt to replace all the disks unless the RAID controller has been verified to be healthy.
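
To check a saved dmesg capture for controller-level faults rather than per-disk errors, a pattern like the following can help. It is shown for the MegaRAID driver from the example above; other HBA drivers log under different names:

    # Look for controller/firmware fault messages rather than individual disk errors
    grep -iE 'megaraid_sas|FW in FAULT state|resetting fusion adapter|killing adapter' dmesg.log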

Step 6: Coordinating Replacement with Hardware Vendor

If persistent I/O errors are observed and confirmed by comparing the disk serial numbers from dmesg and the health report, contact your hardware vendor for replacements.
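
When opening the case, it helps to hand the vendor both the failing device names and their serial numbers. The following is a rough sketch combining the earlier steps, assuming dmesg.log and the Step 3 health report JSON are in the current directory:

    # List devices that logged I/O errors in dmesg, then look up their serial numbers
    for dev in $(grep -oE 'dev sd[a-z]+' dmesg.log | awk '{print $2}' | sort -u); do
        /root/dist/jq -r --arg d "/dev/$dev" '
          .["SNMP tables"]["Drive Table"]
          | [.["drive name"], .["drive serial number"]]
          | transpose[]
          | select(.[0] == $d)
          | "\(.[0]) - serial: \(.[1])"
        ' 2024_1113_1241-healthreport-100.126.145.57.json
    done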

Summary

By following these steps, you can identify and address missing or degraded disks on a Swarm storage node. Replacing affected drives promptly will help prevent data instability and ensure smooth node operations. Bad disks are a common occurrence in the life of any storage cluster and typically should not require a support ticket to Datacore. Armed with the information above, you should feel confident in troubleshooting related issues.