Managing Chassis and Drives
Chassis Details
Detailed hardware and status information for each chassis (physical or virtual machine) is displayed on the hardware details page.
Tip
Streams are counts of the total number of Swarm-managed data components (such as replicas and segments). Streams are not logical objects (such as video files).
Status States: These are the states reported for hardware in a cluster and how to interpret them:
Status | Nodes / Chassis | Volumes / Disks |
---|---|---|
ok | Nominal | Nominal |
idle | Nominal, but the node is idle | Nominal, but idle |
retiring | One or more volumes are offloading streams to the cluster due to retire | Offloading streams to the cluster due to retire |
retired | All volumes are retired | Empty of objects and not taking new ones |
unavailable | In an error state | |
error | Errors are reported on the node (hardware or software) | |
mounting | One or more volumes are mounting | Mounting at startup/discovery |
finalizing | Can appear while the node is rebooting or shutting down, as the node finishes sessions in process | |
maintenance | A 3-hour window during an administrative reboot or shutdown where Failed Volume Recovery does not run | |
initializing | Volumes have mounted but the node is not yet ready for client activity | |
offline | Node is known to be offline but not in maintenance | |
Details Tab
Each detail row displays a disk name, status, total capacity, amount of used journal space, the largest stream size it contains in MB, model number, serial number, ID, firmware version, and encryption status. The largest stream size displays as 0 if the largest stream on the disk is smaller than 1 MB.
Tip
Watch the Streams count to track the progress when retiring a disk.
Logs Tab
The Logs tab lists the last 10 logged announcements in the cluster as well as the last 10 logged critical alerts. The tab itself includes a count of these messages, and appears red if any are errors.
Use the Clear command to remove from the display any log messages that have been addressed or are not of interest.
Click the Log Level (gear) settings command to view and change the log levels set for this machine.
Hot-Swapping Disks: Messages display on this tab if a disk is removed from or inserted into a running node. This feature, referred to as Hot Swapping and Plugging Disks, allows removing failed disks for analysis or adding storage capacity to a node at any time.
The following messages appear when a volume is added and then removed:
mounted /dev/sdb, volumeID is 561479FB832DCC526B1D7EDCD06B83E1
removed /dev/sdb, volumeID was 561479FB832DCC526B1D7EDCD06B83E1
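For scripts that watch forwarded logs for hot-swap events, the two messages above can be parsed with a regular expression. The following is a minimal sketch in Python; the function name `parse_hotswap` and the pattern are illustrative assumptions based only on the two sample lines shown here, not a documented Swarm format or API.

```python
import re

# Pattern derived from the two sample messages above (an assumption);
# other message formats are not covered by this sketch.
HOTSWAP_RE = re.compile(
    r"^(?P<action>mounted|removed) (?P<device>/dev/[^\s,]+), "
    r"volumeID (?:is|was) (?P<volume_id>[0-9A-Fa-f]{32})$"
)

def parse_hotswap(line):
    """Return a dict with action, device, and volume_id, or None if no match."""
    match = HOTSWAP_RE.match(line.strip())
    return match.groupdict() if match else None

messages = [
    "mounted /dev/sdb, volumeID is 561479FB832DCC526B1D7EDCD06B83E1",
    "removed /dev/sdb, volumeID was 561479FB832DCC526B1D7EDCD06B83E1",
]
events = [parse_hotswap(m) for m in messages]
for event in events:
    print(event["action"], event["device"], event["volume_id"])
```

Matching on the volume ID rather than the device name makes it possible to pair a removal with its earlier mount even if the device path is reused.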
Message Levels
These messages appear at the announcement level. Additional debug level messages appear in the syslog.
Driver Message Tab
dmesg (driver message) prints the message buffer of the kernel. These driver messages are useful for diagnosing a Swarm issue when a system panic or error occurs.
Hardware Info Tab
hwinfo (hardware information) is the Linux hardware detection tool output. This tool probes for the hardware present in the system and displays detailed information about various hardware components in human-readable format.
Memory Tab
The usage report on the Memory tab provides detailed information to help with troubleshooting insufficient memory.
Each node uses memory to hold an index of the objects stored in it. If a node runs out of index space, it stops storing new content until space is freed through deletions. A full node continues to respond to client read requests for data already present. Each named or alias object requires two index slots. Erasure coding typically requires more memory than replication; exactly how much depends on the encoding.
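The two-slot rule can be turned into a rough planning estimate. The sketch below is illustrative only: the function name is hypothetical, the one-slot figure for other objects is an assumption that the text above does not state, and the additional memory needed for erasure coding is not modeled.

```python
SLOTS_PER_NAMED_OR_ALIAS = 2  # stated above: named and alias objects need two slots
SLOTS_PER_OTHER = 1           # ASSUMPTION: not stated in this document

def estimate_index_slots(named_or_alias_count, other_count):
    """Rough index-slot estimate; ignores erasure-coding overhead."""
    return (named_or_alias_count * SLOTS_PER_NAMED_OR_ALIAS
            + other_count * SLOTS_PER_OTHER)

# Example: 1,000 named objects and 500 other objects.
slots = estimate_index_slots(1000, 500)
print(slots)
```

Treat the result as a lower bound for comparison against the Memory tab's usage report, not as a sizing guarantee.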
Statistics Tab
The Statistics tab rolls up a detailed, expandable report combining Health Processor (HP), Communications (cluster network), and Memory usage counts and values, to help with analysis and troubleshooting.
The health processor runs on each Swarm node to check the status of streams, performing a wide range of actions:
Sends replica checks to the other nodes and adds or trims replicas based on responses
Deletes streams requiring deletion according to life points
Provides a safety net to remove older alias and named stream versions when a newer version is found in the cluster (which can happen when nodes are restored)
Checks each stream for data corruption using comparison with the stored stream hash
Moves the stream on disk if defragmentation is needed
Verifies the disk index is consistent with the streams found on disk
Verifies replicas are distributed properly in the cluster
Advanced Tab
The Advanced tab allows dynamically changing machine-level logging levels and working with Swarm's management API, through both a hands-on HAL browser and a Swagger visualizer.
The Health Data is the raw JSON content of the health report the cluster sends to DataCore Support. See Health Data to Support.
The log levels can be reset from this tab as well as from the Logs tab.
Restarting or Shutting Down a Chassis
The gear icon at the top of the page allows restarting or shutting down the chassis. A node shut down or rebooted by an Administrator appears with a Maintenance state on other nodes in the cluster.
Retiring a Chassis
Retire the chassis when replacing Swarm storage volumes for regular maintenance or to upgrade the cluster chassis with higher capacity disks. Retiring a chassis copies all objects to other chassis in the cluster, allowing safe removal of the chassis disks without risking any data loss.
To initiate a retire, select the Retire option under the gear icon at the top of the Chassis Details page. When initiating the retire, choose either a minimally disruptive retire limited to the chassis being retired, or an accelerated retire that uses all nodes in the cluster to replicate the objects on the retiring chassis as quickly as possible.
A retiring chassis accepts no new or updated objects. After all objects are copied elsewhere, each chassis volume's state changes to Retired and Swarm no longer uses the volume. The volume can be safely removed at this point.
Rate of the Retire: Swarm calculates the retire rate over the last hour, which it publishes using SNMP as retireRatePerHour. This covers the entire chassis regardless of how many volumes are being retired.
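A published hourly rate can be combined with a remaining count to project how long a retire might take. The sketch below is a hypothetical helper, not part of Swarm: it assumes the remaining count (for example, the Streams count shown on the Details tab) is expressed in the same unit as the retireRatePerHour value, which this document does not confirm, and it assumes the rate stays constant.

```python
def estimate_retire_hours(remaining, rate_per_hour):
    """Project hours left in a retire; returns None when the rate is not positive.

    ASSUMPTION: 'remaining' and 'rate_per_hour' use the same unit.
    """
    if rate_per_hour <= 0:
        return None  # no progress measured in the last hour, so no estimate
    return remaining / rate_per_hour

# Example with made-up numbers: 120,000 remaining at 40,000 per hour.
hours = estimate_retire_hours(120_000, 40_000)
print(hours)
```

Because the rate is calculated over the last hour only, the projection is worth recomputing periodically rather than trusting a single reading.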
Canceling the Retire: Cancel an in-process retire by selecting the Cancel Retire option under the gear icon at the top of the Chassis Details page. A retire can be canceled while one or more disks in the chassis have a Retiring status.
Retiring a Disk (Volume)
Disk-level retires are useful for targeting bad (slow) disks and for working around capacity that is too limited to retire an entire chassis. If a disk retires automatically because of I/O errors, check the diagnostic data collected in the logs. (v11.1)
To retire a volume, locate and click the gear icon in the row for the affected disk.
Select the speed of the retire. The fastest method incurs maximum effort by the cluster to move the content.
Rate of the Retire: When Swarm completes a retire task on a disk, it generates an announce-level message reporting the overall duration and rate of the retire. (v11.0)
See Retiring Hardware | Retire Rate.
Canceling the Retire: Click the gear icon in the row for the affected disk and select the Cancel retire command.
Identifying a Disk
When attempting to identify a failed or failing disk, it helps to enable the disk's LED light. Click the disk light toggle in the disk's display row to flash the light for that specific disk.
© DataCore Software Corporation. · https://www.datacore.com · All rights reserved.