Most of the time, the defective drive can be identified by its LED light, which turns on and stays on for one hour when a volume is marked unavailable or retired. Use the light features of the UI to identify a failed or failing drive.
The Drive Identify Plugin can help locate the defective drive by its LED light, but some hardware does not support the Identify function.
Identify defective Drive
Cross reference the Web UI information with the dmesg output to find the slot number of the failed or failing disk.
...
Although the hard drive ID feature built into Swarm can usually be used for drive identification, there are some instances where the hardware in place is not supported by it. The scope of this article is to assist with drive identification in a cluster when normal means to do so cannot be used.
Identify the Disk
1. Log into Swarm UIS, select the storage node with the defective drive, and determine the SCSI device name (e.g. /dev/sdac) of the failed or failing disk:
...
2. Next, log into a system used for cluster management (usually an SCS or CSN system) where the Swarm Support Tools bundle is installed (typically under /root/dist for the ‘root’ user). From there, use the ‘swarmctl’ utility to export the ‘dmesg -T’ output of the storage node with the defective drive (e.g. 100.126.4.94):
Code Block |
---|
cd /root/dist/ |
./swarmctl -d 100.126.4.94 -Q -x -p admin:[password] |
...
Code Block |
---|
dmesg for 100.126.4.94 written to 2023_0307_1004-dmesg-100.126.4.94.txt in this directory |
3. From the dmesg output file generated, look for the Bus ID of the failed or failing disk:
Code Block |
---|
grep 'sdac' 2023_0307_1004-dmesg-100.126.4.94.txt | grep -i 'failed' |
...
Code Block |
---|
[Tue Feb 14 21:30:05 2023] sd 14:0:28:0: [sdac] tag#887 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=5s |
[Tue Feb 14 21:30:45 2023] sd 14:0:28:0: [sdac] tag#2096 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=5s |
[Tue Feb 14 21:30:50 2023] sd 14:0:28:0: [sdac] tag#2097 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=10s |
[Tue Feb 14 21:43:21 2023] sd 14:0:28:0: [sdac] tag#3101 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=9s |
[Tue Feb 14 21:43:26 2023] sd 14:0:28:0: [sdac] tag#3102 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=14s |
[Tue Feb 14 21:43:52 2023] sd 14:0:28:0: [sdac] tag#2894 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=5s |
[Tue Feb 14 21:44:20 2023] sd 14:0:28:0: [sdac] tag#803 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=5s |
From this output, we can see that the Bus ID of the failed disk is 14:0:28:0.
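The lookup in this step can also be scripted. The following is a minimal sketch, not part of the Swarm tooling: `bus_id_for_device` is a hypothetical helper name, and the sample file is built from two of the dmesg lines shown above so the snippet runs on its own. It assumes the log lines follow the `sd <bus id>: [<device>]` layout seen in this article.

```shell
#!/bin/sh
# Sample dmesg excerpt copied from the output above. In practice, set
# DMESG_FILE to the file written by swarmctl instead of creating this one.
DMESG_FILE="$(mktemp)"
cat > "$DMESG_FILE" <<'EOF'
[Tue Feb 14 21:30:05 2023] sd 14:0:28:0: [sdac] tag#887 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=5s
[Tue Feb 14 21:30:45 2023] sd 14:0:28:0: [sdac] tag#2096 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=5s
EOF

# bus_id_for_device FILE DEVICE  (hypothetical helper)
# Prints each distinct Bus ID that appears on a failure line for DEVICE.
bus_id_for_device() {
    grep -F "[$2]" "$1" | grep -i 'failed' |
        sed -n 's/.*\] sd \([0-9][0-9:]*\): \[.*/\1/p' | sort -u
}

BUS_ID="$(bus_id_for_device "$DMESG_FILE" sdac)"
echo "Bus ID: $BUS_ID"
```

If the helper prints more than one Bus ID for the same device name, the name was probably reassigned between boots, and the timestamps should be checked by hand.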
4. Now, identify the slot number of the failed disk by its Bus ID:
Code Block |
---|
grep '14:0:28:0' 2023_0307_1004-dmesg-100.126.4.94.txt | grep 'slot' |
...
Code Block |
---|
[Tue Jan 17 11:52:22 2023] scsi 14:0:28:0: enclosure logical id (0x500605b0000273bf), slot(28) |
With this, we see that the drive in question is located in enclosure "0x500605b0000273bf" at slot 28.
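As a sketch of the same lookup in script form, the snippet below pulls the enclosure ID and slot number out of the matching line. `slot_for_bus_id` is a hypothetical helper name, and the sample lines are copied from the dmesg output in this article. It assumes the `enclosure logical id (...), slot(N)` layout shown here.

```shell
#!/bin/sh
# Sample dmesg excerpt copied from this article. In practice, set
# DMESG_FILE to the file written by swarmctl.
DMESG_FILE="$(mktemp)"
cat > "$DMESG_FILE" <<'EOF'
[Tue Jan 17 11:52:22 2023] scsi 14:0:28:0: enclosure logical id (0x500605b0000273bf), slot(28)
[Tue Jan 17 11:52:22 2023] scsi 14:0:2:0: enclosure logical id (0x500605b0000273bf), slot(2)
EOF

# slot_for_bus_id FILE BUS_ID  (hypothetical helper)
# Prints "<enclosure id> <slot>" for the given Bus ID.
# The surrounding space and ": " keep 14:0:2:0 from matching 14:0:28:0.
slot_for_bus_id() {
    grep -F " $2: " "$1" |
        sed -n 's/.*enclosure logical id (\(0x[0-9a-fA-F]*\)), slot(\([0-9]*\)).*/\1 \2/p' |
        sort -u
}

LOCATION="$(slot_for_bus_id "$DMESG_FILE" 14:0:28:0)"
echo "Enclosure and slot: $LOCATION"
```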
5. Check whether the slot numbering of the chassis starts at 0 or 1. By convention, slot numbers should start at '0'. This can be confirmed from the dmesg output as follows:
Code Block |
---|
grep 'slot(' 2023_0307_1004-dmesg-100.126.4.94.txt | head -5 |
...
Code Block |
---|
[Tue Jan 17 11:52:22 2023] scsi 14:0:0:0: enclosure logical id (0x500605b0000273bf), slot(0) |
[Tue Jan 17 11:52:22 2023] scsi 14:0:1:0: enclosure logical id (0x500605b0000273bf), slot(1) |
[Tue Jan 17 11:52:22 2023] scsi 14:0:2:0: enclosure logical id (0x500605b0000273bf), slot(2) |
[Tue Jan 17 11:52:22 2023] scsi 14:0:3:0: enclosure logical id (0x500605b0000273bf), slot(3) |
[Tue Jan 17 11:52:22 2023] scsi 14:0:4:0: enclosure logical id (0x500605b0000273bf), slot(4) |
In this example, the output confirms that the slot numbering scheme starts at '0'.
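Instead of reading the first few lines by eye, the lowest slot number reported anywhere in the file can be computed. This is a minimal sketch: `first_slot` is a hypothetical helper name, and the sample lines are copied from the output above.

```shell
#!/bin/sh
# Sample dmesg excerpt copied from the output above. In practice, set
# DMESG_FILE to the file written by swarmctl.
DMESG_FILE="$(mktemp)"
cat > "$DMESG_FILE" <<'EOF'
[Tue Jan 17 11:52:22 2023] scsi 14:0:0:0: enclosure logical id (0x500605b0000273bf), slot(0)
[Tue Jan 17 11:52:22 2023] scsi 14:0:1:0: enclosure logical id (0x500605b0000273bf), slot(1)
[Tue Jan 17 11:52:22 2023] scsi 14:0:2:0: enclosure logical id (0x500605b0000273bf), slot(2)
[Tue Jan 17 11:52:22 2023] scsi 14:0:3:0: enclosure logical id (0x500605b0000273bf), slot(3)
[Tue Jan 17 11:52:22 2023] scsi 14:0:4:0: enclosure logical id (0x500605b0000273bf), slot(4)
EOF

# first_slot FILE  (hypothetical helper)
# Prints the smallest slot number found in any "slot(N)" entry.
first_slot() {
    sed -n 's/.*slot(\([0-9][0-9]*\)).*/\1/p' "$1" | sort -n | head -1
}

FIRST_SLOT="$(first_slot "$DMESG_FILE")"
echo "Slot numbering starts at: $FIRST_SLOT"
```

Unlike `head -5`, this does not depend on the order of lines in the file. It only reports the lowest slot that has a drive logged, so an empty first bay would shift the result.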
Replace defective disk
Cross reference the slot number determined above with the drive enclosure specification of the storage node. This will allow you to identify the physical location of the failed drive in the enclosure / chassis.
For the example below, we use the disk slot designations on a Huawei 5288 V5 server enclosure:
Slot of front disks
Slot of rear disks
With the physical location of the drive determined, you can now replace it. First, suspend volume recovery in the storage cluster from either Swarm UIS or the ‘swarmctl’ CLI tool (we use swarmctl below):
Code Block |
---|
/root/dist/swarmctl -d [node_ip] -C recovery.suspend -V true -p admin:[password] |
Next, remove the failed disk from the chassis according to the slot number identified in the previous section. Compare the serial number on the drive with the one displayed in Swarm UIS to confirm the correct drive has been removed. If the serial numbers don’t match, simply insert the disk back into the enclosure.
NOTE: Swarm can recognize the existing data on a reinserted disk, so no data is lost. Verify that the drive reappears in the Swarm Admin Console after a few minutes.
With the failed drive removed, insert your replacement drive into the empty slot.
You can now re-enable volume recovery in the cluster by turning off volume recovery suspension:
Code Block |
---|
/root/dist/swarmctl -d [node_ip] -C recovery.suspend -V false -p admin:[password] |
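Because the suspend and resume commands differ only in the value passed to `-V`, they can be wrapped in one small function to reduce typing mistakes. This is a sketch only: `set_recovery_suspend` and the `SWARMCTL` override variable are hypothetical names introduced here, and the function uses only the swarmctl options already shown in this article.

```shell
#!/bin/sh
# Path to swarmctl as used in this article. SWARMCTL is a hypothetical
# override for systems where the tool lives elsewhere.
SWARMCTL="${SWARMCTL:-/root/dist/swarmctl}"

# set_recovery_suspend NODE_IP true|false PASSWORD  (hypothetical helper)
# "true" suspends volume recovery, "false" re-enables it.
set_recovery_suspend() {
    case "$2" in
        true|false) ;;
        *)
            echo "usage: set_recovery_suspend NODE_IP true|false PASSWORD" >&2
            return 2
            ;;
    esac
    "$SWARMCTL" -d "$1" -C recovery.suspend -V "$2" -p "admin:$3"
}

# Example calls (placeholders as in the commands above):
#   set_recovery_suspend [node_ip] true  [password]   # before pulling the drive
#   set_recovery_suspend [node_ip] false [password]   # after inserting the replacement
```

Rejecting any value other than `true` or `false` guards against leaving recovery suspended through a typo.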
Verify the new drive appears in Swarm UIS. After a few minutes have passed, it should show a non-zero stream count (which means it’s actively taking data).