
Swarm's built-in hard drive ID feature can usually be used to identify a drive, but it does not support all hardware. This article explains how to identify a failed or failing drive in a cluster when that normal method cannot be used.

Identify the Disk

1. Log into Swarm UIS, then select the storage node that holds the defective drive and determine the SCSI device name (e.g. /dev/sdac) of the failed or failing disk.
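
If the Swarm UIS is unavailable, the device name can often be recovered from the dmesg export generated in step 2 below by searching for I/O errors. This is a minimal sketch; the exact error wording varies by driver and failure mode:

grep -iE 'error|failed' 2023_0307_1004-dmesg-100.126.4.94.txt | head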

2. Next, log into a system used for cluster management (usually an SCS or CSN system) where the Swarm Support Tools bundle is installed (typically under /root/dist for the ‘root’ user). From there, use the ‘swarmctl’ utility to export the ‘dmesg -T’ output of the storage node with the defective drive (e.g. 100.126.4.94):

# Run from the directory where the Support Tools bundle is installed
cd /root/dist/
# Export the node's 'dmesg -T' output to a timestamped file in the current directory
./swarmctl -d 100.126.4.94 -Q -x -p admin:[password]

Output:

dmesg for 100.126.4.94 written to 2023_0307_1004-dmesg-100.126.4.94.txt in this directory

3. In the generated dmesg output file, look for the Bus ID of the failed or failing disk:

grep 'sdac' 2023_0307_1004-dmesg-100.126.4.94.txt | grep 'failed'

Output:

[Tue Feb 14 21:30:05 2023] sd 14:0:28:0: [sdac] tag#887 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=5s
[Tue Feb 14 21:30:45 2023] sd 14:0:28:0: [sdac] tag#2096 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=5s
[Tue Feb 14 21:30:50 2023] sd 14:0:28:0: [sdac] tag#2097 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=10s
[Tue Feb 14 21:43:21 2023] sd 14:0:28:0: [sdac] tag#3101 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=9s
[Tue Feb 14 21:43:26 2023] sd 14:0:28:0: [sdac] tag#3102 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=14s
[Tue Feb 14 21:43:52 2023] sd 14:0:28:0: [sdac] tag#2894 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=5s
[Tue Feb 14 21:44:20 2023] sd 14:0:28:0: [sdac] tag#803 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=5s

From this output, we can see that the Bus ID of the failed disk is 14:0:28:0.
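
To extract the Bus ID programmatically instead, a minimal sketch along these lines works against the dmesg layout shown above, where the Bus ID is the seventh whitespace-delimited field:

grep '\[sdac\]' 2023_0307_1004-dmesg-100.126.4.94.txt | grep 'FAILED' | awk '{print $7}' | sed 's/:$//' | sort -u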

4. Now, identify the Slot Number of the failed disk by Bus ID:

grep '14:0:28:0' 2023_0307_1004-dmesg-100.126.4.94.txt | grep 'slot'

Output:

[Tue Jan 17 11:52:22 2023] scsi 14:0:28:0: enclosure logical id (0x500605b0000273bf), slot(28)

With this, we see that the drive in question is located in enclosure "0x500605b0000273bf" at slot 28.

5. Slot numbers, by convention, start at '0'. This can be confirmed as follows:

grep 'slot(' 2023_0307_1004-dmesg-100.126.4.94.txt | head -5

Output:

[Tue Jan 17 11:52:22 2023] scsi 14:0:0:0: enclosure logical id (0x500605b0000273bf), slot(0)
[Tue Jan 17 11:52:22 2023] scsi 14:0:1:0: enclosure logical id (0x500605b0000273bf), slot(1)
[Tue Jan 17 11:52:22 2023] scsi 14:0:2:0: enclosure logical id (0x500605b0000273bf), slot(2)
[Tue Jan 17 11:52:22 2023] scsi 14:0:3:0: enclosure logical id (0x500605b0000273bf), slot(3)
[Tue Jan 17 11:52:22 2023] scsi 14:0:4:0: enclosure logical id (0x500605b0000273bf), slot(4)

This confirms that the slot numbering scheme in this example starts at '0'.
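
To view the complete Bus-ID-to-slot mapping for the enclosure at a glance, a one-liner like the following tabulates every drive (again assuming the log format shown above):

grep 'slot(' 2023_0307_1004-dmesg-100.126.4.94.txt | awk '{print $7, $NF}'

Each output line pairs a Bus ID with its slot (e.g. '14:0:28:0: slot(28)'), which makes it easy to sanity-check the numbering before pulling a drive.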

Replace the Defective Disk

  1. Cross-reference the slot number determined above with the drive enclosure specification of the storage node. This identifies the drive bay for the failed drive in the enclosure/chassis.
    For the example below, we use the disk slot designations on a Huawei 5288 server enclosure:
    [Diagram: slot layout of the front disks]
    [Diagram: slot layout of the rear disks]

  2. With the physical location and drive determined, you can now replace it. First, suspend volume recovery in the storage cluster, either from Swarm UIS or with the ‘swarmctl’ CLI tool (swarmctl shown below):

    # Suspend volume recovery cluster-wide before pulling the drive
    /root/dist/swarmctl -d [node_ip] -C recovery.suspend -V true -p admin:[password]

  3. Next, remove the failed disk from the chassis according to the slot number identified in the previous section. Verify the serial number printed on the drive against the one displayed in Swarm UIS to confirm the correct drive has been removed. If the serial numbers don’t match, simply insert the disk back into the enclosure.

  4. With the failed drive removed, insert your replacement drive into the empty slot.

  5. You can now re-enable volume recovery in the cluster by turning off volume recovery suspension:

    # Re-enable volume recovery now that the replacement drive is in place
    /root/dist/swarmctl -d [node_ip] -C recovery.suspend -V false -p admin:[password]

  6. Verify the new drive appears in Swarm UIS. After a few minutes, it should show a non-zero stream count, which means it is actively taking data.
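
    As an optional cross-check, you can re-export the node’s dmesg output with the same swarmctl command used in the identification steps and review the most recent messages for the Bus ID; they should now show the replacement drive attaching rather than I/O failures. The output file name below is illustrative, as swarmctl timestamps it at export time:

    /root/dist/swarmctl -d 100.126.4.94 -Q -x -p admin:[password]
    grep '14:0:28:0' 2023_0307_1015-dmesg-100.126.4.94.txt | tail -5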
