Prometheus Node Exporter and Grafana

Hardware Diagnostics with Prometheus

Prometheus is an open-source systems monitoring and alerting toolkit that allows viewing what statistics are available for a system, even under failure conditions.

  • Prometheus scrapes metrics from instrumented jobs, running rules over this data to record aggregated time series or to generate alerts. 

  • Grafana and other API consumers can visualize the collected data.

The Prometheus Node Exporter is included with Swarm for monitoring and diagnostics on the machines in a Swarm cluster; it provides a wide variety of hardware- and kernel-related metrics.

Configuring the Node Exporter

The required Storage setting for the Node Exporter is enabled by default: metrics.enableNodeExporter = True. If it has been disabled, a cluster reboot is required to re-enable it.

If needed, change how frequently the exports occur. This can be done using the Swarm UI or SNMP on the running cluster:  metrics.nodeExporterFrequency = 120
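On a self-managed Prometheus instance (SwarmTelemetry ships preconfigured), a minimal scrape job for the node exporters might look like the following sketch; the job name and node IPs are placeholders:

```yaml
# prometheus.yml (fragment) -- node IPs are placeholders
scrape_configs:
  - job_name: swarm
    scrape_interval: 120s   # align with metrics.nodeExporterFrequency
    static_configs:
      - targets:
          - "192.168.1.84:9100"
          - "192.168.1.85:9100"
```

Keeping the scrape interval at or above the export frequency avoids scrapes that return unchanged data.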

Adding Grafana Dashboards

DataCore has published public Grafana dashboards for monitoring Swarm products and features, which visualize this Prometheus data. Check for the latest dashboards matching the versions of the Swarm products being used.

Customized dashboards are available for the following products:

Swarm System Monitoring (Select the dashboard for the version of Storage)

  • Visualizations include cluster health, capacity, indexing, licensing, temperature, and network and CPU loads.

  • Cluster-wide operations.

Swarm Node View (new since v12.0)

  • Detail view of a single Swarm node.

Gateway Monitoring

Note

Some statistics show a value only after S3 operations have run against the Gateway.

  • Visualizations include CPU load, operations, connections, and HTTP status codes.

Swarm Search (new since v15)

  • Visualizations include Elasticsearch 7.5.x metrics.

Video Clipping

Optional

Video Clipping is an optional feature.

  • Gateway / Content UI added the optional feature Video Clipping for Partial File Restore.

  • Visualizations include numbers, rates, and error counts for video clipping requests.

  • The errors are counted by stage (preprocessing, processing, postprocessing) to help with troubleshooting.

Importing a Dashboard

  1. Navigate to the Grafana "get started" page (Cloud, Self-managed, or Enterprise) to obtain a free hosted instance of Grafana (1 user, 5 dashboards).

  2. View the desired dashboard page and select Copy ID to Clipboard to get the dashboard ID.

  3. Open the dashboard search on the Grafana instance and then click Import to import a dashboard.

  4. Paste in the ID when prompted.

  5. Verify the name is correct once the dashboard is found.

Important

Set the Folder option to make the dashboard visible. The folder "General" is available by default.

  6. During the import process, Grafana prompts for the data source and for any metric prefixes (if the dashboard uses any).

Troubleshooting "No Data" Errors

There are multiple points in the pipeline, from collecting data to displaying charts, at which things can go wrong. The following is the process for troubleshooting "No Data" errors in graphs.

Checking Endpoints

Services monitored by Prometheus (Swarm nodes and Gateways) expose an endpoint (usually port 9100).

Swarm: In the Swarm UI under Cluster Settings > Advanced, verify metrics.nodeExporterFrequency=120. The setting metrics.enableNodeExporter=True must be set explicitly if not on the latest Swarm release. Test the endpoint:

curl http://SWARM_NODE:9100/metrics

Prometheus: Prometheus polls those endpoints as configured in /etc/prometheus/prometheus.yml. Test the targets:

http://PROMETHEUS:9090/targets
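In the Prometheus expression browser (http://PROMETHEUS:9090/graph), the built-in `up` metric, which Prometheus records for every scrape target, quickly surfaces failing endpoints:

```promql
# 1 when the last scrape of a target succeeded, 0 when it failed
up

# show only targets that are currently failing
up == 0
```

If a Swarm node appears here with a value of 0, check the node's 9100 endpoint with curl before suspecting the dashboards.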

Configuring Elasticsearch Exporter

A new Grafana dashboard, Swarm Search v7, is added to SwarmTelemetry. The new dashboard uses a new Prometheus exporter called elasticsearch_exporter, which runs as a service on SwarmTelemetry. It is important to set the target Elasticsearch host IP in /usr/lib/systemd/system/elasticsearch_exporter.service.

The process of changing the IP is as follows:

  1. Modify the --es.uri parameter in /usr/lib/systemd/system/elasticsearch_exporter.service to match one of the Elasticsearch node IPs.

  2. Reload and restart the service:

systemctl daemon-reload
systemctl enable elasticsearch_exporter
systemctl start elasticsearch_exporter

By default, this IP points to the internal IP address of the SwarmSearch VM on the Swarm storage network. See GitHub - prometheus-community/elasticsearch_exporter (Elasticsearch stats exporter for Prometheus) for the various Elasticsearch metrics.
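As an alternative to editing the unit file under /usr/lib directly (a package upgrade may overwrite it), the --es.uri change can be carried in a systemd drop-in override. This is a sketch; the ExecStart binary path is an assumption and should be copied from the installed unit file:

```ini
# /etc/systemd/system/elasticsearch_exporter.service.d/override.conf
[Service]
# Clear the packaged ExecStart, then restate it with the desired ES target
ExecStart=
ExecStart=/usr/local/bin/elasticsearch_exporter --es.uri=http://ES_NODE_IP:9200
```

After adding the override, run systemctl daemon-reload and restart the service as above.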

Checking Grafana

  1. Verify Grafana has a "Prometheus" data source and it is set as the default.
    Imported dashboards use this data source automatically; edit a panel to confirm.

  2. Verify Grafana is at least version 9.3.2.
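For a self-managed Grafana instance, the default Prometheus data source can also be provisioned from a file so it survives rebuilds. A minimal sketch, assuming a standard Grafana install with the usual provisioning directory:

```yaml
# /etc/grafana/provisioning/datasources/prometheus.yaml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    url: http://PROMETHEUS:9090   # placeholder host
    isDefault: true
```

Grafana reads this directory at startup, so a restart is needed after adding the file.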

Node Exporter Statistics

The following describes the Swarm statistics exported for Prometheus. Where possible, these statistics are correlated with MIB entries, although the scales may differ.

Blue indicates cluster-level scope.

| Metric Name | Label(s) | Value Meaning | Related SNMP Entry Name(s) |
| --- | --- | --- | --- |
| caringo_swarm_cluster_license_capacity_tb | cluster_name | Cluster capacity in terabytes. | totalGBLicensedCapacity |
| caringo_swarm_cluster_license_days_remaining | cluster_name | Integer number of days remaining on the license. | |
| caringo_swarm_cluster_license_enabled | cluster_name | 1 for license enabled; 0 for not enabled. | |
| caringo_swarm_cluster_state | cluster_name | -1 = unknown; 0 = ok; 1 = idle; 2 = mounting; 3 = initializing; 4 = finalizing; 5 = maintenance; 6 = retiring; 7 = retired; 8 = error; 9 = unavailable; 10 = offline | clusterState |
| caringo_swarm_index_overlay_state | | The overlay index status. 2 = authoritative; 1 = operational; 0 otherwise. | indexOverlayStatus |
| caringo_swarm_index_overlay_inflating | | Whether the overlay index is inflating on this node. 1 = true; 0 = false. | indexOverlayInflating |
| caringo_swarm_index_overlay_attractors | | The number of desired attractors. | indexOverlayDesiredAttractors |
| caringo_swarm_feeds_deleted_pending | feed_name, feed_type | The number of deleted object events pending processing. | feedNodeDeletesUnprocessed |
| caringo_swarm_feeds_deleted_retrying | feed_name, feed_type | The number of deleted object events needing to be retried. | feedNodeDeletesFailing |
| caringo_swarm_feeds_deleted_successful | feed_name, feed_type | The number of deleted object events successfully processed. | feedNodeDeletesSuccess |
| caringo_swarm_feeds_deleted_unqualified | feed_name, feed_type | The number of deleted object events potentially requiring processing. | feedNodeDeletesUnqualified |
| caringo_swarm_feeds_est_backlog_clear_time | feed_name, feed_type | The estimated number of seconds to complete all processing; -1 for unknown. | feedEstBacklogClearTime |
| caringo_swarm_feeds_existing_pending | feed_name, feed_type | The number of current object events pending processing. | feedNodeExistsUnprocessed |
| caringo_swarm_feeds_existing_retrying | feed_name, feed_type | The number of current object events needing to be retried. | feedNodeExistsFailing |
| caringo_swarm_feeds_existing_successful | feed_name, feed_type | The number of current object events successfully processed. | feedNodeExistsSuccess |
| caringo_swarm_feeds_existing_unqualified | feed_name, feed_type | The number of current object events potentially requiring processing. | feedNodeExistsUnqualified |
| caringo_swarm_feeds_feed_id | feed_name, feed_type | The ID number of the feed. | feedFeedId |
| caringo_swarm_feeds_feed_state | feed_name, feed_type | -1 = unknown; 0 = closed; 1 = config-error; 2 = too many overlapping feeds; 3 = blocked; 4 = paused by request; 5 = paused for recovery; 6 = priority (processing contexts after start/restart); 7 = ok | feedState |
| caringo_swarm_feeds_last_failure | feed_name, feed_type | The time of the last failure event in epoch milliseconds. | feedLastExistFailure, feedLastDeleteFailure, feedLastVersionedFailure |
| caringo_swarm_feeds_last_success | feed_name, feed_type | The time of the last successful event in epoch milliseconds. | feedLastSuccess |
| caringo_swarm_feeds_remote_failure | feed_name, feed_type | The number of replication/indexing failures. | feedPluginRemoteFailure |
| caringo_swarm_feeds_remote_success_duplicate | feed_name, feed_type | The number of duplicate indexing/replication successes. | feedPluginRemoteSuccessDuplicate |
| caringo_swarm_feeds_remote_success_transfer | feed_name, feed_type | The number of new indexing/replication successes. | |
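These metrics can also drive Prometheus alerting rules. A minimal sketch, using two of the metrics above; the file path, thresholds, and durations are illustrative assumptions to adjust per site:

```yaml
# /etc/prometheus/alert.rules.yml (illustrative path; reference it from
# the rule_files section of prometheus.yml)
groups:
  - name: swarm
    rules:
      - alert: SwarmLicenseExpiringSoon
        expr: caringo_swarm_cluster_license_days_remaining < 30
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "Swarm license for {{ $labels.cluster_name }} expires in under 30 days"
      - alert: SwarmClusterNotOk
        # cluster_state 0 = ok; anything greater indicates a degraded state
        expr: caringo_swarm_cluster_state > 0
        for: 15m
        labels:
          severity: critical
        annotations:
          summary: "Swarm cluster {{ $labels.cluster_name }} is not in the ok state"
```

After editing rule files, reload Prometheus (for example, with systemctl reload prometheus) so the rules take effect.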