SNMP Tools and Monitoring Systems

Tip

Swarm supports SNMP version 2.

Any standard SNMP query tool and monitoring system can be used to interact with Swarm. The examples in this section use the open source Net-SNMP (formerly UCD-SNMP) package available for UNIX and Microsoft Windows platforms. Install the Swarm MIB definition file before using most tools and monitoring packages. Follow the instructions included with the tool or package for more information.

Open Source Tools

The following tools can be useful to monitor and manage Swarm. DataCore does not endorse the applicability or fitness of these products for use in any environment.

  • Net-SNMP (net-snmp.sourceforge.net). Provides command-line tools for UNIX and Windows environments to send and receive SNMP requests.

  • Nagios (nagios.org). Provides a web-based monitoring system for UNIX environments that monitors systems and sends alerts through email and pager.

  • Zenoss (zenoss.com). An SNMP-based system for IT monitoring and management.

SNMP Examples with Swarm

Complete the following to prepare to use the examples in this section:

  1. Record the IP address of a storage cluster node. Record the SCSP Proxy if the cluster is not in the subnet.
    The node's IP address is 172.16.0.32 in the examples below.

  2. Copy CARINGO-CASTOR-MIB.txt from the root directory of the USB flash drive or distribution to a local directory.
    Run the example commands from the directory containing CARINGO-CASTOR-MIB.txt.

  3. Record the following passwords:

    • read-only-password. The password for the read-only user defined in the security.operators setting. Default: public

    • read-write-password. The password for the read-write user defined in the security.administrators setting. Default: ourpwdofchoicehere

See https://perifery.atlassian.net/wiki/spaces/public/pages/2443811291.

Change in snmpwalk

As of the 7.2 release, an snmpwalk of the whole CASTOR MIB skips several large, detailed tables in SNMP groups to protect cluster performance. Administrators must upgrade from CSN v6.5 to update the CSN reporter.

Create a targeted snmpwalk request if data from those skipped tables is needed. The snmp.getnextskips setting directs top-level snmpwalk to skip the groups and tables under the following: clusterConfig, responseHistogramTable, hp, clusterdata, indexer, configVariableTable, castorFeeds, feedVolTable, performance, recoveryTable
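A targeted request can follow the same form as the full-walk command in this section, with the final argument replaced by one of the skipped groups or tables. The following is a minimal sketch, assuming the names listed above resolve as object names once the MIB file is loaded. The function name targeted_walk_cmd is illustrative and is not part of Swarm or Net-SNMP. It prints each command rather than running it, so the command can be reviewed first.

```shell
#!/bin/sh
# Sketch: build a targeted snmpwalk command for one skipped group or table.
# Assumption: the names from the snmp.getnextskips list resolve as object
# names when CARINGO-CASTOR-MIB.txt is loaded with -m.

# targeted_walk_cmd COMMUNITY HOST OBJECT
# Prints the command instead of running it.
targeted_walk_cmd() {
    printf 'snmpwalk -v 2c -c %s -m +./CARINGO-CASTOR-MIB.txt %s %s\n' "$1" "$2" "$3"
}

# Print one command per table of interest. Walk only what is needed,
# because these tables are skipped to protect cluster performance.
for object in responseHistogramTable configVariableTable; do
    targeted_walk_cmd read-only-password 172.16.0.32 "$object"
done
```

Pipe the printed lines to sh to run them once they have been checked.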

SNMP walk (snmpwalk) of all Swarm values on a node:

snmpwalk -v 2c -c read-only-password -m +./CARINGO-CASTOR-MIB.txt 172.16.0.32 caringo

Request for a specific SNMP variable from a Swarm node:

snmpget -v 2c -c read-only-password -m +./CARINGO-CASTOR-MIB.txt 172.16.0.32 reads

Set request to shut down a Swarm node:

snmpset -v 2c -c read-write-password -m +./CARINGO-CASTOR-MIB.txt 172.16.0.32 castorShutdownAction s shutdown

CARINGO-CASTOR-MIB::castorShutdownAction = STRING: "shutdown"

Set request to change the cluster's sleepAfter setting to 7260 seconds (121 minutes):

SNMP Action OIDs

The "action" OIDs in Swarm are the SNMP objects affecting the operation of a node or the cluster.

Important

For cluster-level parameters such as volumeRecoverySuspend, write the action to a single node only. Updating the persisted settings UUID from a single node prevents conflicts.

castorFeedRestartAction

Restarts a feed on a node using SNMP. Setting the OID value to a specific feed value restarts that feed on all nodes in the cluster. The castorFeedTable OID shows the Swarm feed information for a specific node; each entry indicates a feed running on the selected node. The SNMP Repository Dump page in the Admin Console provides node-specific information.

logHost

Sets the logging host for writing log messages. A node sets the logging host based on the loghost parameter when booted. Redirect syslog messages to a workstation to debug an issue.

logLevel

Sets the logging level. A node sets the logging level based on the loglevel parameter when booted. Increase the logging level to debug an issue and then return the level to the previous value when completed.

nodeLogLevel

Sets the logging level for a specific node in the cluster, overriding the boot configuration specified by the loglevel parameter as well as the cluster-wide logLevel object.

logForceAudit

Sets forced audit logging for all nodes in the cluster, independent of the overall log level.

castorRetireAction

Removes the contents of a disk volume or an entire node in an orderly fashion. Retire a disk instead of simply removing it so that content not stored on another disk is preserved. Write either the device name from the node configuration vols parameter or the string all to this OID. Volumes from multiple nodes in the cluster can be retired simultaneously.

castorShutdownAction

Gracefully shuts down or reboots a node or an entire cluster. The supported values are:

  • shutdown. Shuts down this node.

  • reboot. Reboots this node.

  • clustershutdown. Shuts down all nodes in the cluster.

  • clusterreboot. Reboots all nodes in the cluster.
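Because a mistyped value here can affect the whole cluster, a wrapper script may want to check the value before building the set request. The following is a minimal sketch that accepts only the four values listed above and prints the snmpset command in the same form as the shutdown example in this document. The function names are illustrative and are not part of Swarm or Net-SNMP.

```shell
#!/bin/sh
# Sketch: accept only the four documented castorShutdownAction values.

# is_shutdown_action VALUE -> exit status 0 if VALUE is a supported action.
is_shutdown_action() {
    case "$1" in
        shutdown|reboot|clustershutdown|clusterreboot) return 0 ;;
        *) return 1 ;;
    esac
}

# shutdown_cmd COMMUNITY HOST ACTION
# Prints the snmpset command for a valid action; refuses anything else.
shutdown_cmd() {
    if is_shutdown_action "$3"; then
        printf 'snmpset -v 2c -c %s -m +./CARINGO-CASTOR-MIB.txt %s castorShutdownAction s %s\n' "$1" "$2" "$3"
    else
        echo "unsupported action: $3" >&2
        return 1
    fi
}

# Prints the command for review; nothing is sent to the node.
shutdown_cmd read-write-password 172.16.0.32 reboot
```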

volumeRecoverySuspend

Suspends volume recovery and erasure coding recovery behavior in the cluster during an upgrade or a network outage.

Practical SNMP with Swarm

This section outlines practical approaches in using the built-in SNMP agent to monitor the health and operational aspects of a storage cluster.

Health Monitoring

The following variables can be used to monitor the basic health of a Swarm node. The volume table has n from 1 to the number of volumes.

  • caringo.castor.castorState. Equals "OK".

  • caringo.castor.castorVolTable.volEntry.volState.n. Equals "OK".

  • caringo.castor.castorVolTable.volEntry.volErrors.n. Equals zero.

Timeouts when the monitoring console tries to read these variables indicate a problem with the node. State values other than "OK" indicate the node or its disks are transitioning from the normal state.



Valid States

  • Node: OK, Retiring, Retired

  • Volume: OK, Retiring, Retired, Unavailable

Any non-zero value in the volume error count indicates a hard error has surfaced from the hardware through the OS driver and to the Swarm process.
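The checks in this section can be expressed as a small classification step in a monitoring script. The following is a minimal sketch that works on values already read from the node. It assumes the state may arrive with surrounding quotes and in either case, so it normalizes both. The function names and the three result labels are illustrative, not Swarm terminology.

```shell
#!/bin/sh
# Sketch: classify a node or volume from values already read over SNMP.

# normalize_state VALUE -> lower-case value with any quotes removed.
normalize_state() {
    printf '%s\n' "$1" | tr -d '"' | tr '[:upper:]' '[:lower:]'
}

# check_state STATE ERRORS
# Prints "errors" for a non-zero hard error count, "healthy" when the
# state is OK, and "transitioning" for any other state.
# Pass 0 for ERRORS when checking a node rather than a volume.
check_state() {
    state=$(normalize_state "$1")
    if [ "$2" -ne 0 ]; then
        echo "errors"
    elif [ "$state" = "ok" ]; then
        echo "healthy"
    else
        echo "transitioning"
    fi
}

# Literal sample values; in practice they would come from a request such as
#   snmpget -v 2c -c read-only-password -m +./CARINGO-CASTOR-MIB.txt -Oqv 172.16.0.32 castorState
check_state '"OK"' 0
check_state 'Retiring' 0
check_state 'ok' 3
```

A timeout from the SNMP request itself should be treated as a separate failure case before this function is called.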

Capacity Monitoring

The following variables can be monitored and collected for capacity alerting and reporting. The volume table has n from 1 to the number of volumes.

  • caringo.castor.castorFreeSlots. Greater than zero.

  • caringo.castor.castorVolTable.volEntry.volMaxMbytes.n

  • caringo.castor.castorVolTable.volEntry.volFreeMbytes.n

  • caringo.castor.castorVolTable.volEntry.volTrappedMbytes.n

The castorFreeSlots variable indicates how many more objects a node can hold before it exhausts the memory index. If this occurs, the node is unable to store additional objects until objects are deleted or moved to other cluster nodes, or until more RAM is added to the node. The free slots indicate how much RAM is required per object.

See https://perifery.atlassian.net/wiki/spaces/public/pages/2443808741 for RAM effects on node storage.

Add the values volFreeMbytes and volTrappedMbytes to compute the amount of disk space available for writing content.

(volFreeMbytes + volTrappedMbytes) / volMaxMbytes × 100 = % free space on a disk volume

volUsedMbytes / volMaxMbytes × 100 = % space used by current content
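The two ratios can be computed directly from collected values. The following is a minimal sketch using awk for the arithmetic. The function names are illustrative and the sample numbers are made up, not taken from a real cluster.

```shell
#!/bin/sh
# Sketch: percentage calculations from the volume table values (all in MB).

# vol_free_pct FREE_MB TRAPPED_MB MAX_MB
# Percent of the volume available for writing content.
vol_free_pct() {
    awk -v f="$1" -v t="$2" -v m="$3" 'BEGIN {
        if (m <= 0) exit 1
        printf "%.1f\n", (f + t) / m * 100
    }'
}

# vol_used_pct USED_MB MAX_MB
# Percent of the volume used by current content.
vol_used_pct() {
    awk -v u="$1" -v m="$2" 'BEGIN {
        if (m <= 0) exit 1
        printf "%.1f\n", u / m * 100
    }'
}

# Made-up sample values for one volume.
vol_free_pct 300000 50000 1000000   # prints 35.0
vol_used_pct 650000 1000000         # prints 65.0
```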

Client Activity Reporting

Collect and report the amount of client activity received by the nodes to understand end-user usage patterns and to identify nodes receiving significantly more activity than others. Such an imbalance can indicate a poor primary access node selection mechanism in the client application code.

The following SNMP variables indicate client request activity on a Swarm node.

  • caringo.castor.scsp.writes

  • caringo.castor.scsp.reads

  • caringo.castor.scsp.infos

  • caringo.castor.scsp.deletes

  • caringo.castor.scsp.errors

  • caringo.castor.scsp.updates

  • caringo.castor.scsp.copies

  • caringo.castor.scsp.appends
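To compare nodes, it can help to turn two readings of the same variable into a rate. The following is a minimal sketch, assuming these variables are cumulative counts that only increase between samples; that behavior is an assumption and is not stated in this document. The function name is illustrative, and counter resets or wraps are rejected rather than handled.

```shell
#!/bin/sh
# Sketch: per-second request rate from two samples of one scsp variable.
# Assumption: the value is a cumulative count that only increases
# between samples (no reset or wrap handling here).

# request_rate OLD NEW SECONDS -> requests per second over the interval.
request_rate() {
    awk -v o="$1" -v n="$2" -v s="$3" 'BEGIN {
        if (s <= 0 || n < o) exit 1
        printf "%.2f\n", (n - o) / s
    }'
}

# Made-up samples of caringo.castor.scsp.reads taken 60 seconds apart.
request_rate 120000 123000 60   # prints 50.00
```

Computing the same rate for each node over the same interval makes uneven client load easier to spot.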

SNMP Repository Dump

The SNMP Repository Dump page provides additional node-specific information.

Accessing the Repository Dump

Access the SNMP Repository Dump page for a cluster node:

  1. Open the legacy Admin Console.

  2. In the Node IP column, click the IP address of the target node.

  3. Scroll down and maximize Node Info.

  4. Scroll down and click SNMP Repository.

See the SNMP MIB Reference file included in the Swarm download bundle for more on the SNMP Repository Dump tables.

Disk Monitoring

Swarm 12 collects more health data from the SMART values reported by storage disks. This data can be accessed via the SNMP Drive table. (v12.0)

  • driveStatus is now correctly computed.

  • drivePowerOnHours is from SMART attribute 9.

  • driveTempC is from SMART attribute 194.

  • driveCompromisedCount is the sum of SMART attributes 5, 187, 188, 197, and 198. A non-zero value may indicate an impending disk failure.
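The driveCompromisedCount sum can be reproduced for cross-checking against SMART data gathered by other means. The following is a minimal sketch, assuming the raw attribute values are the ones summed; this document does not say whether raw or normalized values are used. The function names and sample values are illustrative.

```shell
#!/bin/sh
# Sketch: reproduce the driveCompromisedCount sum from five SMART values.
# Assumption: raw attribute values are the inputs.

# compromised_count A5 A187 A188 A197 A198
# Arguments are the values of SMART attributes 5, 187, 188, 197, and 198.
compromised_count() {
    echo $(( $1 + $2 + $3 + $4 + $5 ))
}

# drive_warning COUNT -> flags any non-zero count for follow-up.
drive_warning() {
    if [ "$1" -ne 0 ]; then
        echo "possible impending disk failure"
    else
        echo "ok"
    fi
}

count=$(compromised_count 0 2 0 1 0)   # made-up values
echo "$count"                          # prints 3
drive_warning "$count"                 # prints possible impending disk failure
```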

Discontinued Items

Note these SNMP items are no longer populated (v9.4):

  • planarTemp

  • tempStatus

  • fanRedundancy

  • psuRedundancy

  • instantaneousWatts

  • instantaneousMA

  • minPowerCap

  • maxPowerCap

  • nics

  • nicTable (including detail)

  • nicFwVsn

  • driveTable.driveStatus

  • fans

  • fanTable (including detail)

  • psus

  • psuTable (including detail)

  • powerIntervals

  • powerDrawTable (including detail)

These SNMP values can be re-populated with a configuration change if they are still needed. Contact DataCore Support for instructions.

© DataCore Software Corporation. · https://www.datacore.com · All rights reserved.