Important: The Swarm Telemetry VM allows quick deployment of a single-instance Prometheus/Grafana installation in combination with Swarm 15 and higher.
Environment Prerequisites
The following infrastructure services are needed in the customer environment:
Swarm 15 solution stack installed and configured, with nodeExporter enabled and nodeExporterFrequency set to 120 (do not set it too fast; this value is in seconds). Enabled Swarm metrics:
metrics.nodeExporterFrequency = 120
metrics.enableNodeExporter = True
If you deploy your own telemetry solution, you will need to use the following software versions:
node_exporter 1.6.0
Prometheus 2.45.0
Grafana server 9.3.2
elasticsearch_exporter 1.5.0
DHCP server, needed for first-time boot and configuration (optional).
DNS server (recommended if you do not want to see IP addresses in your dashboards; this can also be solved by configuring static entries in /etc/hosts on this VM).
Info: This step is optional.
Configuration
VMware Network Configuration
Make sure the VM can reach the Swarm storage nodes directly via port 9100 before proceeding with configuring Grafana and Prometheus.
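A quick way to check this from the VM is a short bash loop that probes port 9100 on each node. This is a minimal sketch; the IPs below are hypothetical placeholders, so substitute your own storage node addresses:

```shell
#!/usr/bin/env bash
# Probe node_exporter's port on each Swarm storage node.
# The IPs are hypothetical examples; replace them with your own.
for ip in 10.10.10.84 10.10.10.85 10.10.10.86; do
  if timeout 2 bash -c "exec 3<>/dev/tcp/${ip}/9100" 2>/dev/null; then
    echo "${ip}:9100 reachable"
  else
    echo "${ip}:9100 NOT reachable"
  fi
done
```

Any node reported as NOT reachable indicates a network or firewall problem to resolve before continuing.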
By default, the VM uses a single NIC configured with DHCP.
If you have deployed a dual-network SCS/Swarm configuration, you must first select the appropriate storage VLAN for the second virtual network card.
Boot the VM and configure the second virtual network card inside the OS.
Edit /etc/sysconfig/network-scripts/ifcfg-ens160 and modify or add the following:
ONBOOT=yes
NETMASK=255.255.255.0 (match the netmask of your storage VLAN)
IPADDR=<storage VLAN IP> (picked from a third-party range to avoid conflicts)
BOOTPROTO=none
GATEWAY=<SCS IP> (usually, this is the gateway in the Swarm VLAN)
Enable it by typing:
ifdown ens160
ifup ens160
Verify that the new IP comes up correctly with "ip a".
Important: CentOS 7 sometimes renames interfaces. If this happens, rename the matching /etc/sysconfig/network-scripts/ifcfg-xxxx file to the new name shown by "ip a". Also rename the NAME and DEVICE parameters inside the ifcfg-xxxx file.
Note: It is recommended to assign a static IP to the NIC facing the Swarm storage network.
Time Synchronization
Prometheus requires correct time synchronization to work and present data to Grafana.
The following has already been done on the SwarmTelemetry VM, but it is mentioned here in case you need to re-apply it.
timedatectl set-timezone UTC
Edit /etc/chrony.conf and add "server 172.29.0.3 iburst" (set to your SCS IP), if missing.
systemctl stop chronyd
hwclock --systohc
systemctl start chronyd
Prometheus Master Configuration
We need to tell Prometheus which Swarm storage nodes we wish to collect metrics from.
Inside the /etc/prometheus/prometheus.yml file, modify the list of Swarm nodes in the following section:
Code Block
- job_name: 'swarm'
  scrape_interval: 30s
  static_configs:
    - targets: ['10.10.10.84:9100','10.10.10.85:9100','10.10.10.86:9100']
Make sure to change the targets to match your Swarm storage node IPs.
Note: You can use DNS names even in the absence of a DNS server: first add the desired names for each Swarm storage node to /etc/hosts, then use those names in the configuration file. This is highly recommended to avoid showing IP addresses on potentially public dashboards.
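For example, hypothetical /etc/hosts entries could look like this (the names and addresses are illustrative placeholders, not from this document):

```
10.10.10.84  swarm-node1
10.10.10.85  swarm-node2
10.10.10.86  swarm-node3
```

The targets line can then read ['swarm-node1:9100','swarm-node2:9100','swarm-node3:9100'].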
If you have content gateways in your deployment, you can add them to prometheus.yml as follows:
Note: if you have multiple gateways, just add them to the targets list.
Code Block
- job_name: 'swarmcontentgateway'
  scrape_interval: 30s
  static_configs:
    - targets: ['10.10.10.20:9100','10.10.10.21:9100']
  relabel_configs:
    - source_labels: [__address__]
      regex: "([^:]+):\\d+"
      target_label: instance
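As an aside, the relabel rule copies only the host portion of __address__ into the instance label, dropping the :9100 port. A quick sketch of the same capture with sed (using POSIX [0-9]+ in place of \d+):

```shell
# Same capture group as the Prometheus relabel regex "([^:]+):\d+":
# everything before the colon becomes the instance label.
echo "10.10.10.20:9100" | sed -E 's/^([^:]+):[0-9]+$/\1/'
# prints 10.10.10.20
```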
You will see the job_name in the gateway dashboard, so make sure it is human friendly.
Modify the swarmUI template in /etc/prometheus/alertmanager/template/basic-email.tmpl. It is used in the email HTML template to show a button linking to the chosen URL. Change the URL:
{{ define "__swarmuiURL" }}https://172.30.10.222:91/_admin/storage/{{ end }}
Modify the gateway job name in /etc/prometheus/alertmanager/alertmanager.yml. It must match what you chose in prometheus.yml.
Code Block
routes:
  - match:
      job: swarmcontentgateway
Modify the gateway job name in /etc/prometheus/alert.rules.yml:
Code Block
- alert: gateway_down
  expr: up{job="swarmcontentgateway"} == 0
To restart the service, type:
systemctl restart prometheus
To enable it for reboots, type:
systemctl enable prometheus
You can test that Prometheus is up by opening a browser and going to http://YourVMIP:9090/targets. This page shows which targets Prometheus is currently collecting metrics from and whether they are reachable.
You can also do this from a terminal:
curl YOURVMIP:9090/api/v1/targets
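The response is JSON; each active target carries a "health" field telling you whether Prometheus can scrape it. A minimal sketch of pulling those fields out with grep (the response excerpt below is a hypothetical illustration, not captured from a live system):

```shell
# Extract the health of each active target from a (hypothetical) excerpt
# of the /api/v1/targets response.
cat <<'EOF' | grep -o '"health":"[a-z]*"'
{"status":"success","data":{"activeTargets":[
{"labels":{"job":"swarm","instance":"10.10.10.84:9100"},"health":"up"},
{"labels":{"job":"swarm","instance":"10.10.10.85:9100"},"health":"down"}]}}
EOF
```

In practice you would pipe the curl output through the same filter, or use a JSON tool if one is installed.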
Gateway Node Exporter Configuration
Starting with Swarm 15.3, the gateway dashboard requires the node_exporter service to run on the gateways.
The systemd service must be configured to listen on port 9095, because the default port 9100 is used by the gateway metrics component.
Make sure to put the node_exporter Go binary in the /usr/local/bin directory.
Example systemd unit file for node_exporter:
Code Block
[Unit]
Description=Node Exporter
Wants=network-online.target
After=network-online.target

[Service]
User=root
Group=root
Type=simple
ExecStart=/usr/local/bin/node_exporter --web.listen-address=:9095 \
  --collector.diskstats.ignored-devices=^(ram|loop|fd|(h|s|v|xv)d[a-z])\\d+$ \
  --collector.filesystem.ignored-mount-points=^/(dev|proc|sys|var/lib/docker)($|/) \
  --collector.filesystem.ignored-fs-types=^/(autofs|binfmt_misc|cgroup|configfs|debugfs|devpts|devtmpfs|fusectl|hugetlbfs|mqueue|overlay|proc|procfs|pstore|rpc_pipefs|securityfs|sysfs|tracefs)($|/) \
  --collector.meminfo_numa --collector.ntp --collector.processes --collector.tcpstat \
  --no-collector.nfs --no-collector.nfsd --no-collector.xfs --no-collector.zfs \
  --no-collector.infiniband --no-collector.vmstat --no-collector.textfile \
  --collector.conntrack --collector.qdisc --collector.netclass

[Install]
WantedBy=multi-user.target
Enable and start the service:
systemctl enable node_exporter
systemctl start node_exporter
Add a job definition for it in the Prometheus master configuration file. Example:
Code Block
- job_name: 'gateway-node-exporter'
  scrape_interval: 30s
  static_configs:
    - targets: ['10.10.10.20:9095']
  relabel_configs:
    - source_labels: [__address__]
      regex: "([^:]+):\\d+"
      target_label: instance
SCS Node Exporter Configuration
Starting with Swarm 15.3, SCS requires the node_exporter service to monitor partition capacity information, which is exposed at the end of the Swarm Node View dashboard.
Use the same systemd unit as for the gateway, but here you must use the default listen port of 9100. SCS 1.5.1 has been modified to add a firewall rule for port 9100 on the Swarm storage network.
Make sure to put the node_exporter Go binary in the /usr/local/bin directory.
Enable and start the service:
systemctl enable node_exporter
systemctl start node_exporter
Add a job definition for it in the Prometheus master configuration file. Example:
Code Block
- job_name: 'scs-node-exporter'
  scrape_interval: 30s
  static_configs:
    - targets: ['10.10.10.2:9100']
  relabel_configs:
    - source_labels: [__address__]
      regex: "([^:]+):\\d+"
      target_label: instance
Elasticsearch Exporter Configuration
The Swarm Search v7 Dashboard requires a new elasticsearch_exporter service that runs locally on the Telemetry VM.
You will need to modify the systemd unit to point it at the IP address of one of your Elasticsearch nodes.
Modify /usr/lib/systemd/system/elasticsearch_exporter.service if the Elasticsearch node IP is different.
The --uri flag must point at the IP address of one of your Elasticsearch nodes; the exporter auto-discovers the other nodes from the metrics.
The new elasticsearch_exporter needs its own job and replaces the old way of scraping metrics from Elasticsearch nodes via plugins.
The job to add to /etc/prometheus/prometheus.yml, if missing, is as follows:
Code Block
- job_name: 'elasticsearch'
  scrape_interval: 30s
  static_configs:
    - targets: ['127.0.0.1:9114']
  relabel_configs:
    - source_labels: [__address__]
      regex: "([^:]+):\\d+"
      target_label: instance
Make sure the elasticsearch exporter is running and configured to start on a reboot.
systemctl enable elasticsearch_exporter
systemctl start elasticsearch_exporter
Prometheus Retention Time
By default, Prometheus keeps metrics for 15 days; on this VM it has been set to 30 days. If you wish to change this, follow the instructions below.
Edit the /root/prometheus.service file and select your retention time for the collected metrics.
Tip: 30 days is more than enough for POCs and demos. Modify the --storage.tsdb.retention.time=30d flag to your desired retention time.
The rule of thumb is 600 MB of disk space per Swarm node for 30 days. This VM comes with a 50 GB dedicated VMDK partition for Prometheus, which can handle up to 32 chassis for 30 days.
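The sizing arithmetic can be sketched as follows; the numbers come straight from the rule of thumb above, and the script simply multiplies them out:

```shell
# ~600 MB of Prometheus disk per Swarm node per 30 days of retention.
nodes=32
mb_per_node_30d=600
needed_mb=$((nodes * mb_per_node_30d))
avail_mb=$((50 * 1024))   # 50 GB dedicated vmdk partition
echo "needed ${needed_mb} MB, available ${avail_mb} MB"
# prints: needed 19200 MB, available 51200 MB
```

Longer retention scales the per-node figure roughly linearly, so re-run the arithmetic if you change --storage.tsdb.retention.time.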
If you have modified the retention, then you need to commit the change:
cp /root/prometheus.service /usr/lib/systemd/system
systemctl daemon-reload
promtool check config /etc/prometheus/prometheus.yml
systemctl restart prometheus
Prometheus Security
It may be desirable to restrict the Prometheus server to allow queries only from localhost, since grafana-server runs on the same VM. This can be done by editing the prometheus.service file and adding the flag --web.listen-address=127.0.0.1:9090.
Warning: If you decide to bind only to localhost, you will not be able to access the Prometheus built-in UI on port 9090 remotely.
Grafana Configuration
Modify /etc/grafana/grafana.ini to set the IP address the server should listen on. By default, it binds to all local IPs on port 80.
Review the admin_password parameter.
Note: The default Grafana admin password is "datacore".
Grafana has several authentication options including Google auth, OAuth, LDAP, and, by default, basic HTTP auth. See https://docs.grafana.org/ for more details.
To start the service, type "service grafana-server start" or "systemctl start grafana-server".
To enable it for reboots, type "systemctl enable grafana-server".
Alertmanager Configuration
We currently have 4 alerts defined in /etc/prometheus/alert.rules.yml:
Service_down: triggered if any Swarm storage node is down for more than 30 minutes.
Gateway_down: triggered if the cloudgateway service is down for more than 2 minutes.
Elasticsearch_cluster_state: triggered if the cluster state changes to "red" after 5 minutes.
Swarm_volume_missing: triggered if the reported drive count decreases over a period of 10 minutes.
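As an illustration only, a rule of this shape in alert.rules.yml could look like the following sketch. The expression matches the gateway_down example shown earlier in this document; the group name and annotation are hypothetical, and the for: duration mirrors the "more than 2 minutes" description above:

```yaml
groups:
  - name: swarm-alerts        # hypothetical group name
    rules:
      - alert: gateway_down
        expr: up{job="swarmcontentgateway"} == 0
        for: 2m               # "down for more than 2 minutes"
        annotations:
          summary: "cloudgateway service down on {{ $labels.instance }}"
```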
/etc/prometheus/prometheus.yml now contains a section that points to the alertmanager service on port 9093 as well as which alert.rules.yml file to use.
The configuration for where to send alerts is defined in /etc/prometheus/alertmanager/alertmanager.yml.
By default, the route is disabled as it requires manual input from your environment (SMTP server, user, password, etc.). Here is an example of a working route to email alerts via Gmail:
Code Block
- name: 'swarmtelemetry'
  email_configs:
    - to: swarmtelemetry@gmail.com
      from: swarmtelemetry@gmail.com
      smarthost: smtp.gmail.com:587
      auth_username: swarmtelemetry@gmail.com
      auth_identity: swarmtelemetry@gmail.com
      auth_password: YOURGMAILPASSWORD or APPPASSWORD
      send_resolved: true
Note: You need to configure this for both the swarmtelemetry and gatewaytelemetry routes. They are defined separately because they use their own custom email templates.
Warning: Prometheus Alertmanager does not support SMTP NTLM authentication; hence, you cannot use it to send authenticated emails directly to Microsoft Exchange. Instead, configure the smarthost to connect to localhost:25 without authentication, where the default CentOS postfix server is running. It will know how to send the email to your corporate relay (auto-discovered via DNS). You will need to add require_tls: false to the email config section in alertmanager.yml.
Example:
Code Block
- name: 'emailchannel'
  email_configs:
    - to: admin@acme.com
      from: swarmtelemetry@acme.com
      smarthost: smtp.acme.com:25
      require_tls: false
      send_resolved: true
Once configuration is complete, restart the Alertmanager:
systemctl restart alertmanager
To verify the alertmanager.yml has the correct syntax, run:
amtool check-config /etc/prometheus/alertmanager/alertmanager.yml
You should get the following output:
Code Block
Checking '/etc/prometheus/alertmanager/alertmanager.yml'  SUCCESS
Found:
- global config
- route
- 1 inhibit rules
- 2 receivers
- 1 templates
To show a list of active alerts, run:
amtool alert
To show which alert route is enabled, run:
amtool config routes show
Routing tree:
└── default-route receiver: disabled
Example Email Alert:
The easiest way to trigger an alert for testing purposes is to shut down one gateway.
Important: If you are aware of an alert whose resolution will take several days or weeks, you can silence it via the Alertmanager GUI on port 9093.
Dashboards on Grafana
Dashboard ID | Dashboard Name |
---|---|
16545 | DataCore Swarm AlertManager v15 |
16546 | DataCore Swarm Gateway v7 |
16547 | DataCore Swarm Node View |
16548 | DataCore Swarm System Monitoring v15 |
17057 | DataCore Swarm Search v7 |
19456 | DataCore Swarm Health Processor v1 |
General Advice Around Defining New Alerts
Pages should be urgent, important, actionable, and real.
They should represent either ongoing or imminent problems with your service.
Err on the side of removing noisy alerts – over-monitoring is a harder problem to solve than under-monitoring.
You should almost always be able to classify the problem into one of:
Availability and basic functionality
Latency
Correctness (completeness, freshness, and durability of data)
Feature-specific problems
Symptoms are a better way to capture more problems more comprehensively and robustly with less effort.
Include cause-based information in symptom-based pages or on dashboards, but avoid alerting directly on causes.
The further up your serving stack you go, the more distinct problems you catch in a single rule. But don't go so far that you can't sufficiently distinguish what's going on.
If you want a quiet on-call rotation, it's imperative to have a system for dealing with things that need a timely response but are not imminently critical.