For disk-related events requiring user action (such as disk removal), Swarm helps you locate the hardware by logging the SCSI locator (bus ID) and volume serial number at CRITICAL and ANNOUNCE log levels, which makes them display in the UI. (v9.2)

Helpful Statistics

Swarm keeps statistics on incomplete read and write requests, which can help you diagnose clients that may be behaving incorrectly.

  • SNMP: clientPrematureCloseRead, clientPrematureCloseWrite

  • UI: Drill into the health reports for chassis-level statistics. For the legacy Admin Console, statistics appear on a node's status page, under Node Operations: SCSP: Client premature close (read), SCSP: Client premature close (write)

Tip

For disk-related events requiring user action (such as disk removal), Swarm helps you locate the hardware by including the SCSI locator (bus ID) and volume serial number in the log message displayed in the UI. (v9.2)

Symptom

Action

A volume device failed.

Allow the node to continue running in a degraded state (lowered storage)
OR
Replace the volume at the earliest convenience.

See Replacing Failed Drives.

A node failed.

If a node fails but its volume storage devices are functioning properly, repair the hardware and return it to service within 14 days.

If a node is down for more than 14 days, all of its volumes are considered stale and cannot be used. After 14 days, force a volume to be remounted by modifying the volume specification and adding the :k (keep) policy option.

See Managing Volumes.
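
The 14-day rule above can be expressed as a small check. This is only an illustrative sketch: the function name and the way the downtime timestamp is obtained are assumptions, not part of Swarm.

```python
from datetime import datetime, timedelta

# Threshold taken from the text above: volumes go stale after 14 days offline.
STALE_AFTER = timedelta(days=14)


def volumes_are_stale(went_down: datetime, now: datetime) -> bool:
    """Return True if the node has been down for more than 14 days.

    `went_down` is a hypothetical timestamp recorded by the operator;
    Swarm itself is not queried here.
    """
    return now - went_down > STALE_AFTER


# Example: a node that failed on 1 March and is checked on 10 and 20 March.
failed_at = datetime(2024, 3, 1, 9, 0)
print(volumes_are_stale(failed_at, datetime(2024, 3, 10, 9, 0)))
print(volumes_are_stale(failed_at, datetime(2024, 3, 20, 9, 0)))
```

If the check reports stale volumes, the :k (keep) policy option described above is the documented way to force a remount.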

In the UI, all remaining cluster nodes are consistently or intermittently offline.

Viewing the legacy Admin Console from different nodes, other nodes appear offline and unreachable.

If a new node cannot see the remaining nodes in the cluster, check the Swarm network configuration setting in each node (particularly the group parameter) to verify that all nodes are configured as part of the same cluster and connected to the same subnet.

If the network configuration appears to be correct, verify that IGMP Snooping is enabled on the network switch. If it is enabled, an IGMP querier must also be enabled in the same network (broadcast domain). In multicast networks, this is normally enabled on the router leading to the storage cluster, which is usually the default gateway for the nodes.

See IGMP Snooping.
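
The subnet and group comparison described above can be sketched as a quick offline check. The node names, addresses, and group values below are made up for illustration, and the dictionary layout is an assumption; the settings would have to be collected from each node's configuration by hand.

```python
import ipaddress


def find_mismatched_nodes(nodes, subnet, group):
    """Return {node_name: [issues]} for nodes outside the expected subnet or group.

    `nodes` maps a node name to a dict with "ip" and "group" keys. This
    layout is hypothetical and is not a Swarm file format.
    """
    network = ipaddress.ip_network(subnet)
    problems = {}
    for name, settings in nodes.items():
        issues = []
        if ipaddress.ip_address(settings["ip"]) not in network:
            issues.append("outside subnet")
        if settings["group"] != group:
            issues.append("different group")
        if issues:
            problems[name] = issues
    return problems


# Illustrative values only.
cluster = {
    "node1": {"ip": "10.0.1.11", "group": "225.0.10.100"},
    "node2": {"ip": "10.0.2.12", "group": "225.0.10.100"},
    "node3": {"ip": "10.0.1.13", "group": "225.0.10.101"},
}
print(find_mismatched_nodes(cluster, "10.0.1.0/24", "225.0.10.100"))
```

A node flagged here would be the first place to look before moving on to the IGMP Snooping check.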

A user has read-only access to the UIs despite being listed in security.administrators.

The user cannot view the Swarm UI.

An operator (a read-only user) was added to security.operators, but the administrator user name and password were not added to security.operators as well. As a result, the Swarm UI cannot be accessed as an administrator.

To resolve this issue, add all administrator users to the security.operators parameter in the node or cluster configuration file.

See Defining Swarm Admins and Users.
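
As a rough aid for the fix above, the following sketch lists administrators that are missing from the operators list. It assumes the two settings have already been copied into plain Python dictionaries of user name to password; that representation and the sample users are hypothetical and do not reflect the configuration file syntax.

```python
def admins_missing_from_operators(administrators, operators):
    """Return the sorted administrator names that do not appear in operators."""
    return sorted(set(administrators) - set(operators))


# Hypothetical values transcribed from security.administrators and
# security.operators.
administrators = {"admin": "secret1", "backupadmin": "secret2"}
operators = {"viewer": "secret3", "admin": "secret1"}

print(admins_missing_from_operators(administrators, operators))
```

Any name the function returns is an administrator who would be limited to read-only access or locked out of the UI until added to security.operators.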

The network does not connect to a node configured with multiple NIC ports.

Verify that the network cable is plugged into the correct NIC. Depending on the bus order and the order in which the kernel drivers are loaded, the network ports may not match their external labeling.

A node automatically reboots.

If the node is plugged into a reliable power outlet and the hardware is functioning properly, this issue may indicate a software problem.

The Swarm system includes a built-in fail-safe that reboots the node if something goes wrong. Contact DataCore Support for guidance.

A node is unresponsive to network requests.

Perform the following steps until the node responds to network requests.

  • Verify that the client network settings are correct.

  • Ping the node.

  • Open the legacy Admin Console on the node by entering its IP address in a browser window (http://{ip-address}:90).

  • Attach a keyboard to the failed node and press Ctrl-Alt-Delete to force a graceful shutdown.

  • Press the hardware reset button on the node or power cycle the node.
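
Before resorting to the keyboard or reset steps, the Admin Console check can be scripted. The sketch below only tests whether a TCP connection to port 90 (the port in the URL above) can be opened; the function name and timeout are assumptions, and a successful connection does not prove that the console itself is healthy.

```python
import socket


def console_port_reachable(host: str, port: int = 90, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port opens within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

For example, `console_port_reachable("192.0.2.10")` would test a node at that (placeholder) address. A False result alongside a successful ping suggests the node is on the network but its services are not answering.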

The cluster is using more data than expected.

Using Elasticsearch, enumerate the CAStor-Application field to determine how much data is being written by which application. Many Swarm applications use this metadata header, and having it indexed lets you analyze which application created which content.
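
One way to enumerate the field is an Elasticsearch terms aggregation with a sum sub-aggregation. The sketch below only builds the request body. The indexed field names are passed in as parameters because they depend on how the search feed maps metadata; the names in the example call are placeholders, not confirmed Swarm field names.

```python
import json


def build_usage_by_application_query(app_field, size_field, max_apps=50):
    """Build an Elasticsearch request body that totals bytes per application.

    `app_field` is the indexed form of the CAStor-Application header and
    `size_field` is the indexed object size; both names are supplied by
    the caller because they vary with the index mapping.
    """
    return {
        "size": 0,  # return only aggregation results, no individual documents
        "aggs": {
            "by_application": {
                "terms": {"field": app_field, "size": max_apps},
                "aggs": {"total_bytes": {"sum": {"field": size_field}}},
            }
        },
    }


# Placeholder field names; check the index mapping for the real ones.
body = build_usage_by_application_query("castor_application", "content_length")
print(json.dumps(body, indent=2))
```

The resulting body can be sent to the index's `_search` endpoint with any HTTP client. Each bucket in the response then pairs an application name with its document count and total bytes.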

A node is not performing as expected.

In the castor.log, view the node statistics, which include periodic logging of CPU utilization for each process:

2015-11-05 16:13:22,898 NODE INFO: system utilization stats: 
		pid_cpusys: 0.06, 
		pid_cputot: 1.67, 
		pid_cpuusr: 1.61, 
		sys_contexts_rate: 5728.00, 
		sys_cpubusy: 0.91, 
		sys_cpubusy0: 0.37, 
		sys_cpubusy1: 1.46, 
		sys_cpuio: 0.02, 
		sys_cpuirq: 0.01, 
		sys_cpusys: 0.06, 
		sys_cpuusr: 0.82
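
To compare these values over time, the statistics can be pulled out of each log entry. The following is a minimal sketch that assumes entries follow the layout shown above (a "system utilization stats:" marker followed by comma-separated name: value pairs). The function name is illustrative.

```python
import re

STATS_MARKER = "system utilization stats:"


def parse_utilization_stats(entry: str) -> dict[str, float]:
    """Return the name/value pairs that follow the stats marker in a log entry."""
    _, found, tail = entry.partition(STATS_MARKER)
    if not found:
        return {}
    # Each statistic looks like "name: 1.23"; separators and whitespace are skipped.
    pairs = re.findall(r"(\w+):\s*(-?\d+(?:\.\d+)?)", tail)
    return {name: float(value) for name, value in pairs}


sample = (
    "2015-11-05 16:13:22,898 NODE INFO: system utilization stats: "
    "pid_cpusys: 0.06, pid_cputot: 1.67, pid_cpuusr: 1.61, sys_cpubusy: 0.91"
)
stats = parse_utilization_stats(sample)
print(stats["pid_cputot"], stats["sys_cpubusy"])
```

Because the timestamp sits before the marker, it is not mistaken for a statistic. Entries without the marker yield an empty dictionary.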