Swarm Storage 10.2 Release

New Features

Swarm 10 Performance
- The rate at which nodes retire is now improved over both version 10.1 and version 9.6 of Swarm Storage. (SWAR-8386)
- Swarm has boosted the performance of erasure-coded range reads under high loads. (SWAR-8182)
Prometheus Node Exporter — The Prometheus Node Exporter preview has configuration enhancements.
- The service is now enabled by default (metrics.enableNodeExporter=True), which makes basic hardware queries across nodes available without reboot.
- A new setting, metrics.nodeExporterFrequency, sets how frequently to refresh Swarm-specific metrics in Elasticsearch; it defaults to 0, which disables this export. (SWAR-8408)
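
As a rough illustration of the kind of basic hardware query the Node Exporter makes possible, the sketch below reads the exporter's text output from one node and parses it into a dictionary. Port 9100 is the port named in the 10.1 impacts. The /metrics path and the node_load1 metric are standard for the upstream Prometheus Node Exporter but are assumptions for this preview, and the function names and node address are hypothetical.

```python
from urllib.request import urlopen


def parse_metrics(text):
    """Parse Prometheus text-format samples into {series: value}.

    Assumes samples carry no trailing timestamp.
    """
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # The value is the last whitespace-separated token on the line.
        series, value = line.rsplit(None, 1)
        samples[series] = float(value)
    return samples


def fetch_node_metrics(host, port=9100, timeout=5):
    """Fetch and parse the metrics of one node (hypothetical address)."""
    with urlopen(f"http://{host}:{port}/metrics", timeout=timeout) as resp:
        return parse_metrics(resp.read().decode("utf-8"))
```

For example, fetch_node_metrics("10.0.0.11").get("node_load1") would return the 1-minute load average of that node, if the exporter exposes that metric.
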
Swarm Management
- The new node-level Swarm configuration setting security.securePhysicalConsole allows locking out access to the console's System Menu commands. This security measure is for nodes located where they can be at risk for unauthorized viewing and tampering. (10.2.1: SWAR-5309)
- To ease upgrades to Swarm 10, the cluster-wide setting ec.protectionLevel is now a persisted setting, so that it can be changed on demand via Swarm UI or SNMP. It no longer has to be managed in config files, which required keeping the files consistent across nodes and restarting the cluster. (SWAR-8231)
- For better management of multipart uploads, both castor-system-uploadid and castor-system-partnumber now allow query args to use either hyphens or underscores in the field name, as is supported for metadata headers such as content-type. (SWAR-8274)
- Swarm now raises alerts on objects that have persistent feed-related failures, such as objects that cannot be indexed in Elasticsearch or be remotely replicated. To investigate the cause of such failures, examine the details in the logs. (SWAR-8383)
- The versions query argument on listing queries now accepts versions=previous to limit results to only the past versions of an object. (SWAR-6847)
- Swarm now accepts named objects whose path name relative to the bucket looks like a UUID (32-character hexadecimal). (SWAR-8199)
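
Two of the items above can be pictured from the client side. The sketch below is an illustration only, not Swarm's implementation: canonical_arg maps the underscore spelling of the two multipart query args to the hyphenated one, and looks_like_uuid reports whether an object name is exactly 32 hexadecimal characters. Both function names are hypothetical. The sketch also maps mixed hyphen/underscore spellings, which the notes do not address.

```python
import re

# Query args that the 10.2 notes say accept hyphens or underscores.
MULTIPART_ARGS = {"castor-system-uploadid", "castor-system-partnumber"}

_UUID_LIKE = re.compile(r"[0-9a-fA-F]{32}")


def canonical_arg(name):
    """Map an underscore spelling of a multipart arg to the hyphenated one."""
    hyphenated = name.replace("_", "-")
    # Leave any other query arg name untouched.
    return hyphenated if hyphenated in MULTIPART_ARGS else name


def looks_like_uuid(object_name):
    """True if the name is exactly 32 hexadecimal characters."""
    return _UUID_LIKE.fullmatch(object_name) is not None
```

A client could use looks_like_uuid to spot which of its object names fall under the new acceptance rule for named objects.
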

Additional Changes

These items are other changes and improvements including those that come from testing and user feedback.

OSS Versions — See Third-Party Components for 10.2.1 for the complete listing of packages and versions.
- The Linux kernel is updated to 4.19.37 and the mpt3sas driver is updated to 26.100.00.00. (10.2.1: SWAR-8480)
- Intel network drivers are updated, ixgbe to 5.5.5 and i40e to 2.7.29. (10.2.1: SWAR-8498)
Fixed
- A 3-node cluster did not retire a volume efficiently if it contained objects requiring 3 replicas. (10.2.1: SWAR-8482)
- Changing the metrics.target host from an Elasticsearch 2.3.3 cluster to a 5.6.12 cluster did not trigger the needed update of the index schemas before new data was indexed. (SWAR-8426)
- An SNMP shutdown request for a Swarm node instead caused it to be rebooted. (SWAR-8422)
- Maintenance activities on the Elasticsearch cluster created erroneous reports of an index missing in the Swarm cluster. (SWAR-8413)
- Swarm now installs the Python requests package needed for the metrics migration script that is used during migration to Elasticsearch 5.6. (SWAR-8407)
- An issue caused Swarm to erroneously report low memory. (SWAR-8399)
- Swarm search queries hung if the associated Search feed referred to invalid or unavailable Elasticsearch nodes. (SWAR-8200)

Upgrade Impacts

These items are changes to the product function that may require operational or development changes for integrated applications. Address the upgrade impacts for each of the versions since the one you are currently running:

Impacts for 10.2

Upgrading Elasticsearch — You may continue to use Elasticsearch 2.3.3 with Storage 10.2 until you are able to move to 5.6 (see Migrating from Older Elasticsearch). Support for ES 2.3.3 ends in a future release. Before you upgrade to Gateway 6.0, however, you must complete the upgrade to Elasticsearch 5.6.
Configuration Settings — Run the Storage Settings Checker before any Swarm 10 upgrade to identify configuration issues. Note these changes:
- ec.protectionLevel is now persisted. (SWAR-8231)
- index.ovMinNodes=3 is the new default for the overlay index, in support of Swarm 10's new architecture. To keep your overlay index operational, set this new value in your cluster, through the UI or by SNMP (overlayMinNodes). (SWAR-8278)
- metrics.enableNodeExporter can be set to True, which enables the Prometheus Node Exporter on that node. (SWAR-8408, SWAR-8578)
- metrics.nodeExporterFrequency, a new dynamic setting, sets how frequently to refresh Swarm-specific Prometheus metrics in Elasticsearch; it defaults to 0, which disables this export. (SWAR-8408)

Impacts for 10.1

Upgrading Elasticsearch — Continue to use Elasticsearch 2.3.3 with Storage 10.1 until able to move to 5.6 (see Migrating from Older Elasticsearch). Support for ES 2.3.3 ends in a future release. Complete the upgrade to Elasticsearch 5.6 before upgrading to Gateway 6.0.
Configuration Settings — Run the Storage Settings Checker before any Swarm 10 upgrade to identify configuration issues.
- metrics.enableNodeExporter=true enables Swarm to run the Prometheus node exporter on port 9100. (SWAR-8170)
IP address update delay — When upgrading from Swarm 9 to the new architecture of Swarm 10, note the "ghosts" of previously used IP addresses may appear in the Storage UI; these resolve within 4 days. (SWAR-8351)
Update MIBs on CSN — Before upgrading to Storage 10.x, the MIBs on the CSN must be updated. From the Swarm Support tools bundle, run the platform-update-mibs.sh script. (CSN-1872)

Impacts for 10.0

Upgrading Elasticsearch: You may continue to use Elasticsearch 2.3.3 with Storage 10.0 until you are able to move to 5.6 (see Migrating from Older Elasticsearch). Support for ES 2.3.3 ends in a future release.
Configuration Settings: Run the Storage Settings Checker to identify these and other configuration issues.
- Changes for the new single-IP dense architecture:
  - network.ipAddress - multiple IP addresses now disallowed
  - chassis.processes - removed; multi-server configurations are no longer supported
  - ec.protectionLevel - new value "volume"
  - ec.subclusterLossTolerance - removed
- Changes for security (see next section)
  - security.administrators, security.operators - removed 'snmp' user
  - snmp.rwCommunity, snmp.roCommunity - new settings for 'snmp' user
  - startup.certificates - new setting to hold any and all public keys
- New settings:
  - disk.atimeEnabled
  - health.parallelWriteTimeout
  - search.pathDelimiter
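
The removed and restricted settings listed above lend themselves to a simple pre-upgrade scan. The sketch below is not the Storage Settings Checker and does not replace it. It only shows the idea, assuming the settings have already been read into a name-to-value mapping and that multiple IP addresses are separated by spaces or commas. The function name and the sample addresses are hypothetical.

```python
import re

# Settings the 10.0 notes list as removed.
REMOVED_SETTINGS = ("chassis.processes", "ec.subclusterLossTolerance")


def find_upgrade_issues(settings):
    """Return a list of readable issues for a {name: value} mapping."""
    issues = []
    for name in REMOVED_SETTINGS:
        if name in settings:
            issues.append(f"{name} is removed in Swarm 10")
    # Assumption: multiple addresses are separated by spaces or commas.
    raw = settings.get("network.ipAddress", "")
    addresses = [a for a in re.split(r"[\s,]+", raw) if a]
    if len(addresses) > 1:
        issues.append("network.ipAddress lists multiple IP addresses")
    return issues
```

Calling find_upgrade_issues({"network.ipAddress": "10.0.0.11 10.0.0.12", "chassis.processes": "2"}) reports both the removed setting and the multiple addresses.
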
Required SNMP Security Change: Remove the snmp key from the security.administrators setting, and update snmp.rwCommunity with its value. Nodes that contain only the snmp key in the security.administrators setting do not boot. If you changed the default value of the snmp key in the security.operators setting, update snmp.roCommunity with that value and then remove the snmp key from security.operators. In the security.operators setting, 'snmp' is a reserved key, and it cannot be an authorized console operator name. (SWAR-8097)
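
The SNMP change above amounts to moving two values between settings. The sketch below shows that move on an in-memory mapping, assuming security.administrators and security.operators are modeled as dictionaries of user name to password. The function name and sample passwords are hypothetical. As a simplification, it moves the operators' snmp value whenever the key is present, whether or not its default was changed.

```python
import copy


def migrate_snmp_keys(settings):
    """Return a copy of settings with the 'snmp' keys moved out of the
    security.* user lists and into the snmp.* community settings."""
    migrated = copy.deepcopy(settings)  # leave the caller's mapping intact
    admins = migrated.get("security.administrators", {})
    if "snmp" in admins:
        migrated["snmp.rwCommunity"] = admins.pop("snmp")
    operators = migrated.get("security.operators", {})
    if "snmp" in operators:
        migrated["snmp.roCommunity"] = operators.pop("snmp")
    return migrated
```

Given administrators {"admin": "adminpw", "snmp": "rwsecret"}, the result keeps only the admin user in security.administrators and sets snmp.rwCommunity to "rwsecret".
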
EC Protection
- Best practice: Use ec.protectionLevel=node, which distributes segments across the cluster's physical/virtual machines. Do not use ec.protectionLevel=subcluster unless you already have subclusters defined and are sure the specified EC encoding is supported. A new level, ec.protectionLevel=volume, allows EC writes to succeed if you have a small cluster with fewer than (k+p)/p nodes. (Swarm always seeks the highest protection possible for EC segments, regardless of the level you set.)
- Optimize hardware for EC by verifying there are more than k+p subclusters/nodes (as set by ec.protectionLevel); for example, with policy.ecEncoding=5:2, you need at least 8 subclusters/nodes. When Swarm cannot distribute EC segments adequately for protection, EC writes can fail despite ample free space. (SWAR-7985)
- Setting ec.protectionLevel=subcluster without creating subclusters (defining node.subcluster across sets of nodes) causes a critical error and lowers the protection level to 'node'. (SWAR-8175)
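
The node counts in the EC guidance above follow from simple arithmetic on the k:p encoding. The sketch below works through it, assuming policy.ecEncoding is given in the "k:p" form used in the example (such as 5:2). The function names are hypothetical.

```python
def parse_encoding(ec_encoding):
    """Split a 'k:p' encoding such as '5:2' into integers (k, p)."""
    k, p = (int(part) for part in ec_encoding.split(":"))
    return k, p


def recommended_min_units(ec_encoding):
    """Smallest count that is more than k+p subclusters/nodes."""
    k, p = parse_encoding(ec_encoding)
    return k + p + 1


def below_volume_threshold(node_count, ec_encoding):
    """True if the cluster has fewer than (k+p)/p nodes."""
    k, p = parse_encoding(ec_encoding)
    return node_count < (k + p) / p
```

For 5:2, recommended_min_units gives 8, matching the example above. The threshold (5+2)/2 is 3.5, so a cluster of 3 nodes falls below it and is the case that ec.protectionLevel=volume addresses, while a cluster of 4 nodes does not.
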
Small Clusters: Verify the following settings if using 10 or fewer Swarm nodes. Do not use fewer than 3 in production.
Important: If you need to change any, do so before upgrading to Swarm 10.
- policy.replicas: The min and default values for numbers of replicas to keep in your cluster must not exceed your number of nodes. For example, a 3-node cluster may have only min=2 or min=3.
- EC Encoding and Protection: For EC encoding, verify you have enough nodes to support the cluster's encoding (policy.ecEncoding). For EC writes to succeed with fewer than (k+p)/p nodes, use the new level, ec.protectionLevel=volume.
- Best Practice: Keep at least one physical machine in your cluster beyond the minimum number needed. This allows for one machine to be down for maintenance without compromising the constraint.
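
The replica rule and the spare-machine practice above can be checked with a few comparisons. This is a minimal sketch, assuming the min and default replica counts are already available as integers. The function names are hypothetical.

```python
def replicas_fit_cluster(min_replicas, default_replicas, node_count):
    """True if neither the min nor the default replica count exceeds
    the number of nodes in the cluster."""
    return min_replicas <= node_count and default_replicas <= node_count


def machines_with_spare(minimum_needed):
    """Machine count that keeps one machine beyond the minimum, so one
    can be down for maintenance without breaking the constraint."""
    return minimum_needed + 1
```

For a 3-node cluster, replicas_fit_cluster(2, 3, 3) holds, while a default of 4 replicas does not fit.
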
Cluster in a Box: Swarm supports a "cluster in a box" configuration as long as that box is running a virtual machine host and Swarm instances are running in 3 or more VMs. Each VM boots separately and has its own IP address. Follow the recommendations for small clusters, substituting VMs for nodes. If you have two physical machines, use the "cluster in a box" configuration, but move to direct booting of Swarm with 3 or more.
Offline Node Status: Because Swarm 10's new architecture reduces the number of IP addresses in your storage cluster, you may see the old IPs and subclusters reporting as Offline nodes until they time out in 4 days (crier.forgetOfflineInterval), which is expected.

Info

Multipath support is obsolete from Swarm 10 onward.

For Swarm 9 impacts, see Swarm Storage 9 Releases.

Watch Items and Known Issues

The following operational limitations and watch items exist in this release.

Under some conditions, Swarm may start without mounting some of its volumes. If this happens, reboot the node. (10.2.1: SWAR-8597)
The OS in 10.2.1 cannot mount USB flash drives and so cannot read node.cfg files from them. If you boot Swarm from a USB drive, contact DataCore Support for a corrected version. (10.2.1: SWAR-8501)
During a rolling reboot of a small cluster, erroneous CRITICAL errors may appear on the console, claiming that EC objects have insufficient protection. These errors may be disregarded. (SWAR-8421)
When restarting a cluster of virtual machines that are UEFI-booted (versus legacy BIOS), the chassis shut down but do not come back up. (SWAR-8054)
If you wipe your Elasticsearch cluster, the Storage UI shows no NFS config. Contact DataCore Support for help repopulating your SwarmFS config information. (SWAR-8007)
If you delete a bucket, any incomplete multipart upload into that bucket leaves the parts (unnamed streams) in the domain. To find and delete them, use the s3cmd utility (search the Support site for "s3cmd" for guidance). (SWAR-7690)
Logs showed the error "FEEDS WARNING: calcFeedInfo(etag=xxx) cannot find domain xxx, which is needed for a domains-specific replication feed". The root cause is fixed; if you received such warnings, contact DataCore Support so the issue can be resolved. (SWAR-7556)
With multipath-enabled hardware, the Swarm console Disk Volume Menu may erroneously show too many disks, having multiplied the actual disks in use by the number of possible paths to them. (SWAR-7248)

Note these installation issues:

The elasticsearch-curator package may show an error during an upgrade, which is a known curator issue. Workaround: Reinstall the curator: yum reinstall elasticsearch-curator (SWAR-7439)
Do not install the Swarm Search RPM before installing Java. If Gateway startup fails with "Caringo script plugin is missing from indexer nodes", uninstall and reinstall the Swarm Search RPM. (SWAR-7688)

Upgrading Swarm

To upgrade Swarm 9 or higher, proceed to How to Upgrade Swarm.

Important

If you need to upgrade from Swarm 8.x or earlier, contact DataCore Support for guidance.