In most cases, the defective drive can be identified by its LED, which turns on and stays on for one hour when the volume is marked unavailable or retired. Use the drive light feature of the UI to identify a failed or failing drive.
The Drive Identify Plugin can help locate the defective drive by its LED, but some hardware does not support the identify function.
Identify the defective drive
Cross-reference the Web UI information with the dmesg output to find the slot number of the failed or failing disk.
1. Log in to the Web UI, select the storage node with the defective drive, and note the device name (e.g. sdac) of the failed or failing disk.
2. From the CSN/SCS, export the dmesg of the storage node with the defective drive (e.g. 100.126.4.94).
cd /root/dist/
./swarmctl -d 100.126.4.94 -Q -x -p admin:[password]
Output:
dmesg for 100.126.4.94 written to 2023_0307_1004-dmesg-100.126.4.94.txt in this directory
3. Get the bus ID (SCSI address) of the failed or failing disk from dmesg. The match must be case-insensitive because the kernel logs the word as FAILED.
grep 'sdac' 2023_0307_1004-dmesg-100.126.4.94.txt | grep -i 'failed'
Output:
[Tue Feb 14 21:30:05 2023] sd 14:0:28:0: [sdac] tag#887 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=5s
[Tue Feb 14 21:30:45 2023] sd 14:0:28:0: [sdac] tag#2096 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=5s
[Tue Feb 14 21:30:50 2023] sd 14:0:28:0: [sdac] tag#2097 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=10s
[Tue Feb 14 21:43:21 2023] sd 14:0:28:0: [sdac] tag#3101 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=9s
[Tue Feb 14 21:43:26 2023] sd 14:0:28:0: [sdac] tag#3102 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=14s
[Tue Feb 14 21:43:52 2023] sd 14:0:28:0: [sdac] tag#2894 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=5s
[Tue Feb 14 21:44:20 2023] sd 14:0:28:0: [sdac] tag#803 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=5s
The bus ID of the failed disk is 14:0:28:0.
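The lookup in step 3 can be wrapped in a small helper so that the bus ID is printed on its own instead of being read off the log lines by eye. The following is a minimal sketch, not part of the Swarm tooling: the function name failed_bus_ids is hypothetical, and it assumes the dmesg lines have the "sd H:C:T:L: [device] ... FAILED" shape shown in the output above.

```shell
# Hypothetical helper (not a Swarm command): print the unique SCSI
# address(es) that logged FAILED commands for one device.
# Usage: failed_bus_ids DEVICE DMESG_FILE
failed_bus_ids() {
    local device="$1" dmesg_file="$2"
    # -F treats "[sdac]" as a literal string, so "sdac" does not also match "sdaca".
    grep -F "[${device}]" "$dmesg_file" \
        | grep -i 'failed' \
        | sed -n 's/.*sd \([0-9][0-9]*:[0-9][0-9]*:[0-9][0-9]*:[0-9][0-9]*\): \[.*/\1/p' \
        | sort -u
}
```

For example, failed_bus_ids sdac 2023_0307_1004-dmesg-100.126.4.94.txt should print 14:0:28:0 for the log excerpt above. More than one line of output would suggest that the device name was reassigned at some point and the log needs a closer look.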
4. Identify the slot number of the failed disk from its bus ID.
grep '14:0:28:0' 2023_0307_1004-dmesg-100.126.4.94.txt | grep 'slot'
Output:
[Tue Jan 17 11:52:22 2023] scsi 14:0:28:0: enclosure logical id (0x500605b0000273bf), slot(28)
The slot number of the failed disk is 28.
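Step 4 can be scripted in the same way. This is a sketch under the assumption that the enclosure lines look like the one above ("scsi H:C:T:L: enclosure logical id (...), slot(N)"); the function name slot_for_bus_id is hypothetical and not part of the Swarm tooling.

```shell
# Hypothetical helper (not a Swarm command): print the enclosure slot
# number reported in dmesg for one SCSI address.
# Usage: slot_for_bus_id BUS_ID DMESG_FILE
slot_for_bus_id() {
    local bus_id="$1" dmesg_file="$2"
    # The trailing ":" keeps 14:0:2:0 from also matching 14:0:2:01 or similar.
    grep -F "scsi ${bus_id}:" "$dmesg_file" \
        | sed -n 's/.*slot(\([0-9][0-9]*\)).*/\1/p' \
        | sort -u
}
```

For example, slot_for_bus_id 14:0:28:0 2023_0307_1004-dmesg-100.126.4.94.txt should print 28 for the log line above.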
5. Check from dmesg whether the slot numbering of the chassis starts at 0 or 1.
grep 'slot(' 2023_0307_1004-dmesg-100.126.4.94.txt | head -5
Output:
[Tue Jan 17 11:52:22 2023] scsi 14:0:0:0: enclosure logical id (0x500605b0000273bf), slot(0)
[Tue Jan 17 11:52:22 2023] scsi 14:0:1:0: enclosure logical id (0x500605b0000273bf), slot(1)
[Tue Jan 17 11:52:22 2023] scsi 14:0:2:0: enclosure logical id (0x500605b0000273bf), slot(2)
[Tue Jan 17 11:52:22 2023] scsi 14:0:3:0: enclosure logical id (0x500605b0000273bf), slot(3)
[Tue Jan 17 11:52:22 2023] scsi 14:0:4:0: enclosure logical id (0x500605b0000273bf), slot(4)
The slot numbering of this storage node starts at 0.
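Reading the first five lines works when the log lists slots in ascending order. As a hedged alternative, the lowest slot number can be computed over the whole file; the function name lowest_slot below is hypothetical, and the sketch assumes every enclosure line carries a slot(N) field as in the output above.

```shell
# Hypothetical helper (not a Swarm command): print the lowest slot number
# found in a dmesg export, which indicates whether numbering starts at 0 or 1.
# Usage: lowest_slot DMESG_FILE
lowest_slot() {
    local dmesg_file="$1"
    sed -n 's/.*slot(\([0-9][0-9]*\)).*/\1/p' "$dmesg_file" \
        | sort -n \
        | head -n 1
}
```

For the export used in this article, lowest_slot 2023_0307_1004-dmesg-100.126.4.94.txt would be expected to print 0. A bay that was empty at boot does not appear in dmesg, so a result of 1 may also mean that slot 0 was simply unpopulated; confirm against the hardware specification of the chassis.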
Replace the defective disk
Cross-reference the slot number with the hardware specification of the storage node to identify the drive bay number and its physical location in the chassis.
e.g. Disk Slots on Huawei 5288
Slot of front disks
Slot of rear disks
Suspend volume recovery from either the UI or the CLI:
/root/dist/swarmctl -d [node_ip] -C recovery.suspend -V true -p admin:[password]
Pull the failed disk out of the chassis according to the slot number identified in the previous section, and verify its serial number to confirm that it is the correct disk.
If the serial number does not match, simply reinsert the disk into the chassis.
NOTE: Swarm can recognize the data on a reinserted disk without data loss or a rebuild. Verify that the drive reappears in the Swarm Admin Console after a few minutes.
Insert the replacement drive in place of the defective drive.
Turn off the volume recovery suspension from either the UI or the CLI:
/root/dist/swarmctl -d [node_ip] -C recovery.suspend -V false -p admin:[password]
Verify that the new drive appears in the Swarm Admin Console and has a non-zero stream count after several minutes.
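Because the suspend and resume commands in this procedure differ only in the -V value, they can be wrapped in one function to reduce the chance of leaving recovery suspended by a typo. This is a sketch only: the function name set_recovery_suspend and the SWARMCTL and DRY_RUN variables are hypothetical, and the sketch uses only the swarmctl options that appear in this article (-d, -C recovery.suspend, -V, -p).

```shell
# Path to swarmctl as used in this article; override SWARMCTL if it lives elsewhere.
SWARMCTL="${SWARMCTL:-/root/dist/swarmctl}"

# Hypothetical wrapper: suspend (true) or resume (false) volume recovery.
# Usage: set_recovery_suspend NODE_IP true|false USER:PASSWORD
set_recovery_suspend() {
    local node_ip="$1" state="$2" credentials="$3"
    case "$state" in
        true|false) ;;
        *) echo "state must be 'true' or 'false'" >&2; return 2 ;;
    esac
    if [ -n "${DRY_RUN:-}" ]; then
        # Dry run: print the command with the credentials masked, run nothing.
        echo "$SWARMCTL -d $node_ip -C recovery.suspend -V $state -p ***"
        return 0
    fi
    "$SWARMCTL" -d "$node_ip" -C recovery.suspend -V "$state" -p "$credentials"
}
```

A typical sequence would be set_recovery_suspend [node_ip] true admin:[password] before pulling the disk and set_recovery_suspend [node_ip] false admin:[password] after the replacement is inserted. Setting DRY_RUN=1 first prints the command that would be run, which is a cheap way to check the node IP before touching the cluster.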