SNMP Tools and Monitoring Systems
Tip
Swarm supports SNMP version 2.
Any standard SNMP query tool and monitoring system can be used to interact with Swarm. The examples in this section use the open source Net-SNMP (formerly UCD-SNMP) package available for UNIX and Microsoft Windows platforms. Install the Swarm MIB definition file before using most tools and monitoring packages. Follow the instructions included with the tool or package for more information.
Open Source Tools
The following tools can be useful to monitor and manage Swarm. DataCore does not endorse the applicability nor the fitness of these products when used within any environment.
Net-SNMP (net-snmp.sourceforge.net). Provides command-line tools for UNIX and Windows environments to send and receive SNMP requests.
Nagios (nagios.org). Provides a web-based monitoring system for UNIX environments that monitors systems and sends alerts through email and pager.
Zenoss (zenoss.com). An SNMP-based system for IT monitoring and management.
SNMP Examples with Swarm
Complete the following to prepare to use the examples in this section:
Record the IP address of a storage cluster node. Record the SCSP Proxy if the cluster is not in the subnet. The node's IP address is 172.16.0.32 in the examples below.
Copy CARINGO-CASTOR-MIB.txt from the root directory of the USB flash drive or distribution to a local directory. Run the commands from the directory containing CARINGO-CASTOR-MIB.txt.
Record the following passwords:
See Defining Swarm Admins, Swarm Users, and Swarm Passwords.
Change in snmpwalk
The 7.2 release changed a top-level snmpwalk of the CASTOR MIB so that it skips several large, detailed tables in SNMP groups, which protects cluster performance. Administrators running CSN v6.5 must upgrade to update the CSN reporter.
Create a targeted snmpwalk request if data from those skipped tables is needed. The snmp.getnextskips setting directs top-level snmpwalk to skip the groups and tables under the following: clusterConfig, responseHistogramTable, hp, clusterdata, indexer, configVariableTable, castorFeeds, feedVolTable, performance, recoveryTable
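As a rough illustration of a targeted request, the sketch below builds the argument list for an snmpwalk rooted at one of the skipped groups. It reuses the options from the examples in this section. The helper name `targeted_walk_args` is hypothetical, and it assumes the group names listed above resolve as object names once the MIB file is loaded with `-m`.

```python
import shlex

# Groups and tables that a top-level snmpwalk skips, as listed above.
SKIPPED_GROUPS = (
    "clusterConfig", "responseHistogramTable", "hp", "clusterdata", "indexer",
    "configVariableTable", "castorFeeds", "feedVolTable", "performance",
    "recoveryTable",
)


def targeted_walk_args(node_ip, community, group,
                       mib_file="./CARINGO-CASTOR-MIB.txt"):
    """Return the argv list for an snmpwalk limited to one skipped group.

    Hypothetical helper: it only builds the command and does not run it.
    """
    if group not in SKIPPED_GROUPS:
        raise ValueError(f"not one of the skipped groups: {group}")
    return ["snmpwalk", "-v", "2c", "-c", community,
            "-m", f"+{mib_file}", node_ip, group]


if __name__ == "__main__":
    # Print the command line for walking only the performance group.
    args = targeted_walk_args("172.16.0.32", "read-only-password", "performance")
    print(shlex.join(args))
```

The returned list can be passed to `subprocess.run` on a host where Net-SNMP is installed.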
SNMP walk (snmpwalk) of all Swarm values on a node:
snmpwalk -v 2c -c read-only-password -m +./CARINGO-CASTOR-MIB.txt 172.16.0.32 caringo
Request for a specific SNMP variable from a Swarm node:
snmpget -v 2c -c read-only-password -m +./CARINGO-CASTOR-MIB.txt 172.16.0.32 reads
Set request to shut down a Swarm node:
snmpset -v 2c -c read-write-password -m +./CARINGO-CASTOR-MIB.txt 172.16.0.32 castorShutdownAction s shutdown
CARINGO-CASTOR-MIB::castorShutdownAction = STRING: "shutdown"
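Monitoring scripts often need to pick the value out of a response line like the one above. The following is a minimal sketch of such a parser, assuming the default Net-SNMP output style of `MIB::name = TYPE: value`. The function name `parse_snmp_line` is hypothetical, and other output formats (selected with Net-SNMP's `-O` options) are not handled.

```python
import re

# Matches lines such as:
#   CARINGO-CASTOR-MIB::castorShutdownAction = STRING: "shutdown"
_LINE = re.compile(
    r'^(?P<mib>[\w-]+)::(?P<name>\S+) = (?P<type>[\w-]+): (?P<value>.*)$'
)


def parse_snmp_line(line):
    """Split one response line into MIB, object name, type, and value.

    Returns None when the line does not match the assumed format.
    """
    match = _LINE.match(line.strip())
    if match is None:
        return None
    value = match.group("value")
    # Strip the surrounding double quotes that are printed for STRING values.
    if match.group("type") == "STRING" and len(value) >= 2 \
            and value[0] == '"' and value[-1] == '"':
        value = value[1:-1]
    return {
        "mib": match.group("mib"),
        "name": match.group("name"),
        "type": match.group("type"),
        "value": value,
    }


if __name__ == "__main__":
    print(parse_snmp_line(
        'CARINGO-CASTOR-MIB::castorShutdownAction = STRING: "shutdown"'
    ))
```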
Set request to change the cluster's sleepAfter setting to 7260 seconds (121 minutes):
SNMP Action OIDs
The "action" OIDs in Swarm are the SNMP objects affecting the operation of a node or the cluster.
Important
Write an action to a single node only. For cluster-level parameters such as volumeRecoverySuspend, this lets one node update the persisted settings UUID and prevents conflicts.
OID | Description |
---|---|
castorFeedRestartAction | Restarts a feed on a node using SNMP. Setting the OID value to a specific feed value restarts that feed on all nodes in the cluster. Use the castorFeedTable OID to view the Swarm feed information for a specific node; each entry indicates a feed running on the selected node. The SNMP Repository Dump page in the Admin Console also provides node-specific information. |
logHost | Sets the logging host for writing log messages. A node sets the logging host based on the loghost parameter when booted. Redirect syslog messages to a workstation to debug an issue. |
logLevel | Sets the logging level. A node sets the logging level based on the loglevel parameter when booted. Increase the logging level to debug an issue and then return the level to the previous value when completed. |
nodeLogLevel | Sets the logging level for a specific node in the cluster, overriding the boot configuration specified by the loglevel parameter as well as the cluster-wide logLevel object. |
logForceAudit | Sets forced audit logging for all nodes in the cluster, independent of the overall log level. |
castorRetireAction | Removes the contents of a disk volume or an entire node in an orderly fashion. Consider retiring disks instead of removing them, to save content not stored on another disk. Write the device name from the node configuration vols parameter, or the string all, to this OID. Volumes from multiple nodes in the cluster can be retired simultaneously. |
castorShutdownAction | Gracefully shuts down or reboots a node or an entire cluster. The supported values are: |
volumeRecoverySuspend | Suspends volume recovery and erasure coding recovery behavior in the cluster during an upgrade or a network outage. |
Practical SNMP with Swarm
This section outlines practical approaches in using the built-in SNMP agent to monitor the health and operational aspects of a storage cluster.
Health Monitoring
The following variables can be used to monitor the basic health of a Swarm node. The volume table has n from 1 to the number of volumes.
caringo.castor.castorState. Equals "OK".
caringo.castor.castorVolTable.volEntry.volState.n. Equals "OK".
caringo.castor.castorVolTable.volEntry.volErrors.n. Equals zero.
Timeouts when the monitoring console reads these variables indicate a problem with the node. State values other than "OK" indicate the node or its disks are transitioning out of the normal state.
Component | Node | Volume |
---|---|---|
Valid States | OK, Retiring, Retired | OK, Retiring, Retired, Unavailable |
Any non-zero value in the volume error count indicates a hard error has surfaced from the hardware through the OS driver and to the Swarm process.
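The health rules above can be sketched as a small check that a monitoring script might run after collecting the values. The function name `node_health` and its input shape are hypothetical. State strings are compared case-insensitively, which is an assumption about how the agent reports them, and a read timeout is represented as `None`.

```python
def node_health(castor_state, volumes):
    """Return a list of problems; an empty list means the node looks healthy.

    castor_state: value of castorState, or None if the read timed out.
    volumes: list of (volState, volErrors) pairs, one per volume index n.
    """
    problems = []
    if castor_state is None:
        problems.append("node: no response (timeout)")
    elif castor_state.upper() != "OK":
        problems.append(f"node: state is {castor_state}")
    # Volume indexes start at 1, matching the volume table.
    for n, (state, errors) in enumerate(volumes, start=1):
        if state.upper() != "OK":
            problems.append(f"volume {n}: state is {state}")
        if errors != 0:
            problems.append(f"volume {n}: {errors} hard error(s)")
    return problems


if __name__ == "__main__":
    for problem in node_health("OK", [("OK", 0), ("Retiring", 2)]):
        print(problem)
```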
Capacity Monitoring
The following variables can be monitored and collected for capacity alerting and reporting. The volume table has n from 1 to the number of volumes.
caringo.castor.castorFreeSlots. Greater than zero.
caringo.castor.castorVolTable.volEntry.volMaxMbytes.n
caringo.castor.castorVolTable.volEntry.volFreeMbytes.n
caringo.castor.castorVolTable.volEntry.volTrappedMbytes.n
The castorFreeSlots variable indicates how many more objects a node can hold before it exhausts the memory index. If the index is exhausted, the node cannot store additional objects until objects are deleted or moved to other cluster nodes, or more RAM is added to the node. The free slots indicate how much RAM is required per object.
See the Hardware Requirements for Storage for RAM effects on node storage.
Add the values volFreeMbytes and volTrappedMbytes to compute the amount of disk space available for writing content.
(volFreeMbytes + volTrappedMbytes) / volMaxMbytes × 100 = % free space on a disk volume
volUsedMbytes / volMaxMbytes × 100 = % space used by current content
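As a worked example of the two ratios, the sketch below computes both percentages for one volume. The function name `volume_capacity` is hypothetical, and the sample values are made up for illustration.

```python
def volume_capacity(max_mb, free_mb, trapped_mb, used_mb):
    """Return percent free and percent used for one disk volume.

    Arguments correspond to volMaxMbytes, volFreeMbytes, volTrappedMbytes,
    and volUsedMbytes for the same volume index n.
    """
    if max_mb <= 0:
        raise ValueError("volMaxMbytes must be positive")
    return {
        "pct_free": 100.0 * (free_mb + trapped_mb) / max_mb,
        "pct_used": 100.0 * used_mb / max_mb,
    }


if __name__ == "__main__":
    # Illustrative values only: a 1000 MB volume with 300 MB free,
    # 100 MB trapped, and 600 MB used.
    print(volume_capacity(1000, 300, 100, 600))
```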
Client Activity Reporting
Collect and report the amount of client activity received by the nodes to understand the end-user usage patterns and identify nodes receiving significantly more activity than others. The resulting value can indicate a poor primary access node selection mechanism in the client application code.
The following SNMP variables indicate client request activity on a Swarm node.
caringo.castor.scsp.writes
caringo.castor.scsp.reads
caringo.castor.scsp.infos
caringo.castor.scsp.deletes
caringo.castor.scsp.errors
caringo.castor.scsp.updates
caringo.castor.scsp.copies
caringo.castor.scsp.appends
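One possible way to spot nodes receiving significantly more activity than others is to compare each node's total request count against the cluster median. The sketch below does that under several assumptions: the function name `busy_nodes` and the `factor` threshold are hypothetical, and the input values are taken to be per-node counts collected over the same interval.

```python
from statistics import median

# The SCSP activity variables listed above.
SCSP_COUNTERS = ("writes", "reads", "infos", "deletes",
                 "errors", "updates", "copies", "appends")


def busy_nodes(samples, factor=2.0):
    """Return the nodes whose total activity exceeds factor x the median.

    samples: {node_ip: {counter_name: count}} for the same time interval.
    """
    totals = {
        ip: sum(counts.get(name, 0) for name in SCSP_COUNTERS)
        for ip, counts in samples.items()
    }
    if not totals:
        return []
    mid = median(totals.values())
    return sorted(ip for ip, total in totals.items() if total > factor * mid)


if __name__ == "__main__":
    # Illustrative counts only.
    print(busy_nodes({
        "172.16.0.32": {"reads": 80, "writes": 20},
        "172.16.0.33": {"reads": 100, "writes": 20},
        "172.16.0.34": {"reads": 450, "writes": 50},
    }))
```

A median is used rather than a mean so that a single very busy node does not raise the baseline it is compared against.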
SNMP Repository Dump
The SNMP Repository Dump page provides additional node-specific information.
Accessing the Repository Dump
Access the SNMP Repository Dump page for a cluster node:
Open the legacy Admin Console.
In the Node IP column, click the IP address of the target node.
Scroll down and maximize Node Info.
Scroll down and click SNMP Repository.
See the SNMP MIB Reference file included in the Swarm download bundle for more on the SNMP Repository Dump tables.
Disk Monitoring
Swarm 12 collects more health data from the SMART values reported by storage disks. This data can be accessed via the SNMP Drive table. (v12.0)
driveStatus is now correctly computed.
drivePowerOnHours is from SMART attribute 9.
driveTempC is from SMART attribute 194.
driveCompromisedCount is the sum of SMART attributes 5, 187, 188, 197, and 198. A non-zero value may indicate an impending disk failure.
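The arithmetic behind these fields can be illustrated with a short sketch. It mirrors the attribute mapping described above, not Swarm's actual implementation. The function name `drive_summary` is hypothetical, the input is assumed to be a mapping of SMART attribute ID to raw value, and missing attributes are counted as zero.

```python
# SMART attribute IDs summed into driveCompromisedCount, per the list above.
COMPROMISED_ATTRS = (5, 187, 188, 197, 198)


def drive_summary(smart_raw):
    """Derive the drive table fields from raw SMART attribute values.

    smart_raw: {attribute_id: raw_value}, for example {9: 12000, 194: 35}.
    """
    compromised = sum(smart_raw.get(attr, 0) for attr in COMPROMISED_ATTRS)
    return {
        "drivePowerOnHours": smart_raw.get(9),
        "driveTempC": smart_raw.get(194),
        "driveCompromisedCount": compromised,
        # A non-zero count may indicate an impending disk failure.
        "warn": compromised != 0,
    }


if __name__ == "__main__":
    # Illustrative raw values only.
    print(drive_summary({5: 1, 9: 12000, 194: 35, 197: 2}))
```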
Discontinued Items
Note these SNMP items are no longer populated (v9.4):
planarTemp
tempStatus
fanRedundancy
psuRedundancy
instantaneousWatts
instantaneousMA
minPowerCap
maxPowerCap
nics
nicTable (including detail)
nicFwVsn
driveTable.driveStatus
fans
fanTable (including detail)
psus
psuTable (including detail)
powerIntervals
powerDrawTable (including detail)
If you rely on these SNMP values, they can be re-populated with a configuration change. Contact DataCore Support for instructions.
© DataCore Software Corporation. · https://www.datacore.com · All rights reserved.