Swarm Storage 10.0 Release

New Features

  • Density-Friendly Machine Addressing: Storage 10.0 introduces an internal architecture that takes advantage of very dense servers containing multiple CPU cores and disks. Every physical or virtual machine now requires only one IP address, so a deployment needs far fewer IP addresses, which simplifies network administration, monitoring, and architecture. Because only a single instance of Swarm software runs on each machine, a Swarm "node" now refers to the machine that hosts it.

  • Settings Checker: A new Storage Settings Checker tool, bundled with the Swarm Support Tools, identifies configuration changes needed to support the new Swarm 10 architecture. For this and all future upgrades of Swarm, running the settings checker before upgrading or during troubleshooting greatly improves Support's ability to keep the configuration pruned and tuned. See Storage Settings Checker.

  • Remote Replication without VPN Tunnels: The push-style (direct POST) replication protocol introduced in 9.6 offers better performance and flow control. With 10.0, direct POST now supports SSL/TLS network encryption and standard proxy servers for replication feeds, which eliminates the need for separate VPN tunnels between clusters. This capability streamlines deployments where encrypted communications are needed over wide-area, untrusted networks. See Replication Feeds over Untrusted Networks.

    • Swarm now supports using SSL for remote replication data transfer. This configuration requires an SSL offload proxy (such as HAProxy) in the target cluster environment. (SWAR-7826)

    • For sites using SSL with remote replication, Swarm allows the establishment of trusted certificates (public keys) that may be self-signed. (SWAR-8080) See Adding a Trusted Certificate to Swarm.

    • Swarm allows placing a forward proxy to the source cluster into the replication path. (SWAR-8025)

  • Elasticsearch 5.6: Swarm now ships with Elasticsearch 5.6, which extends Swarm's built-in metadata searching capabilities and allows integration with readily available off-the-shelf tools (such as the ELK stack) that use Elasticsearch for data analytics and monitoring. Swarm remains backwards compatible with Elasticsearch 2.3.3 to allow for a well-timed migration to the new version, which requires re-indexing of your search data. See Migrating from Older Elasticsearch.

  • Time of Access (atime) Tracking: Once you have begun re-indexing the search data on the new Elasticsearch 5.6 schema, you can enable Swarm's new atime (access time) feature, which provides client applications a method to track content usage and determine candidates for deletion or tiering to cold storage. See Time of Last Access - atime.

  • Improved Administration

    • Security: The SNMP password is separated from the security.administrators setting into its own setting (snmp.rwCommunity), as part of better administrative password handling. (SWAR-8097)

    • Search:

      • The new "versioned" search query argument allows filtering objects based on the versioning status. It checks the value of the Castor-System-IsVersioned header, which captures whether versioning was enabled (true) or suspended (false) in the object's context at the time it was written. (SWAR-6851)

      • When indexing quoted strings, Swarm now preserves any existing RFC-2047 encoding. (SWAR-8089)

    • Logging:

      • To help troubleshoot a failure to boot, the Swarm console can now display the startup log ("castor_init"). Under Diagnostics, select "11. Castor Startup log". (SWAR-8070)

      • dmesg is logged on boot and preserved automatically as a historical record of the starting state of the machine. (SWAR-8040)

  • OSS Updates: Storage 10.0 includes significant updates to third-party components. See Third-Party Components for 10.0 for the complete listing of packages and versions.

    • Upgraded kernel version to 4.14.53, linux firmware to 1.174, and bnx2 firmware to firmware-bnx2_20161130-3-bpo8+1. (SWAR-8048)

    • Added kernel support for Virtio devices to better support KVM virtual machines. (SWAR-8043)

    • Added the Broadcom NetXtreme driver, to support newer adapters. (SWAR-7627)

    • Updated third-party packages Twisted 18.4.0 and Boost 1.67. (SWAR-7981)

Additional Changes

These items are other changes and improvements including those that come from testing and user feedback.

  • SCSP fixes

    • When a write fails because volumes are not available, Swarm returns a 507 Insufficient Storage error rather than 4xx client errors. (SWAR-8073)

    • A multipart APPEND on a previously renamed object dropped the generation NID header, which prevented any segments written before the rename from finding the manifest, causing them to be deleted as orphans. (SWAR-8300)

    • Swarm no longer allows an object that has a multipart APPEND or PATCH initiated, and is then updated by a POST or PUT, to complete its multipart write. Those segments are unable to find the new manifest, which caused them to be deleted as orphans. (SWAR-8192)
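For client applications, the 507 change above mostly affects retry logic. The following is a minimal sketch, assuming a client that wants to treat 507 as a server-side capacity condition worth retrying later rather than a request error. The function name and the four category labels are illustrative and are not part of Swarm.

```python
from http import HTTPStatus


def classify_write_status(status: int) -> str:
    """Classify an SCSP write response for a hypothetical client retry policy."""
    if 200 <= status < 300:
        return "success"
    if status == HTTPStatus.INSUFFICIENT_STORAGE:  # 507
        # Volumes were unavailable; the request itself was not malformed.
        return "retry-later"
    if 400 <= status < 500:
        return "client-error"
    return "server-error"
```

A client that previously retried only on 5xx responses would now pick up the volume-unavailable case without special handling for 4xx codes.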

  • Feed fixes

    • During replication through Gateway, when Gateway disallowed replication on particular objects for whatever reason, Swarm treated those events as failures to be retried instead of allowing the feed to block. (SWAR-8060)

    • On the Feeds definition page of the legacy Admin Console (port 90), deselecting the "Propagate Deletes" checkbox had no effect on propagation. (SWAR-8076)

  • Boot fixes

    • Rebooting a recovered volume can erroneously trigger a recovery (FVR). Upgrading to Swarm 10 and completing a full HP cycle prevents this from occurring. (SWAR-8081)

    • Errors (Failed to start Create Volatile Files and Directories) appeared on the Swarm console during startup. (SWAR-7762)

    • When using DHCP IP assignment (no assigned network.ipAddress), Swarm 9.6+ failed to boot. (SWAR-8212)

Upgrade Impacts

These items are changes to the product function that may require operational or development changes for integrated applications.

Impacts for 10.0

  • Upgrading Elasticsearch: You may continue to use Elasticsearch 2.3.3 with Storage 10.0 until you are able to move to 5.6 (see Migrating from Older Elasticsearch). Support for ES 2.3.3 ends in a future release.

  • Configuration Settings: Run the Storage Settings Checker to identify these and other configuration issues.

    • Changes for the new single-IP dense architecture:

      • network.ipAddress - multiple IP addresses now disallowed

      • chassis.processes - removed; multi-server configurations are no longer supported

      • ec.protectionLevel - new value "volume"

      • ec.subclusterLossTolerance - removed

    • Changes for security (see next section)

      • security.administrators, security.operators - removed 'snmp' user

      • snmp.rwCommunity, snmp.roCommunity - new settings for 'snmp' user

      • startup.certificates - new setting to hold any and all public keys

    • New settings:

      • disk.atimeEnabled

      • health.parallelWriteTimeout

      • search.pathDelimiter
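The single-IP changes listed above can be spot-checked before running the real Storage Settings Checker. This is only an illustrative sketch and not the bundled tool: it assumes the configuration has already been parsed into a dictionary of setting name to string value, and it assumes multiple addresses in network.ipAddress are separated by commas or whitespace.

```python
import re

# Settings the release notes list as removed in Storage 10.0.
REMOVED_SETTINGS = ("chassis.processes", "ec.subclusterLossTolerance")


def find_swarm10_issues(settings: dict) -> list:
    """Return human-readable findings for a parsed (hypothetical) settings dict."""
    issues = []
    # Assumption: multiple addresses are separated by commas and/or whitespace.
    raw = settings.get("network.ipAddress", "")
    addresses = [a for a in re.split(r"[,\s]+", raw) if a]
    if len(addresses) > 1:
        issues.append("network.ipAddress: multiple IP addresses are disallowed")
    for name in REMOVED_SETTINGS:
        if name in settings:
            issues.append(f"{name}: setting was removed")
    return issues
```

An empty result here does not replace the Settings Checker, which reports other configuration issues and deprecations as well.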

  • Required SNMP Security Change: Remove the snmp key from the security.administrators setting, and update snmp.rwCommunity with its value. Nodes that contain only the snmp key in the security.administrators setting do not boot. If you changed the default value of the snmp key in the security.operators setting, update snmp.roCommunity with that value and then remove the snmp key from security.operators. In the security.operators setting, 'snmp' is a reserved key, and it cannot be an authorized console operator name. (SWAR-8097)
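The SNMP change amounts to moving two values between settings. A minimal sketch follows, assuming the settings have been parsed into a dictionary in which security.administrators and security.operators are themselves dictionaries of user name to password; that representation and the function name are assumptions for illustration. The sketch copies the operators value unconditionally, because it has no way to know whether the value was changed from the default.

```python
def migrate_snmp_settings(settings: dict) -> dict:
    """Return a copy of the settings with the 'snmp' user moved to the snmp.* settings."""
    # Copy nested dicts so the caller's settings are left untouched.
    migrated = {k: (dict(v) if isinstance(v, dict) else v) for k, v in settings.items()}

    admins = migrated.get("security.administrators", {})
    if "snmp" in admins:
        migrated["snmp.rwCommunity"] = admins.pop("snmp")

    operators = migrated.get("security.operators", {})
    if "snmp" in operators:
        migrated["snmp.roCommunity"] = operators.pop("snmp")

    return migrated
```

After a migration like this, security.administrators should still list at least one real administrator.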

  • EC Protection

    • Best practice: Use ec.protectionLevel=node, which distributes segments across the cluster's physical/virtual machines. Do not use ec.protectionLevel=subcluster unless you already have subclusters defined and are sure the specified EC encoding is supported. A new level, ec.protectionLevel=volume, allows EC writes to succeed if you have a small cluster with fewer than (k+p)/p nodes. (Swarm always seeks the highest protection possible for EC segments, regardless of the level you set.)

    • Optimize hardware for EC by verifying there are more than k+p subclusters/nodes (as set by ec.protectionLevel); for example, with policy.ecEncoding=5:2, you need at least 8 subclusters/nodes. When Swarm cannot distribute EC segments adequately for protection, EC writes can fail despite ample free space. (SWAR-7985)

    • Setting ec.protectionLevel=subcluster without creating subclusters (defining node.subcluster across sets of nodes) causes a critical error and lowers the protection level to 'node'. (SWAR-8175)
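The node counts in the EC guidance above follow directly from the k:p encoding. The sketch below works through that arithmetic, assuming the encoding is given as a "k:p" string such as 5:2; the function names and the keys of the returned dictionary are illustrative.

```python
def parse_ec_encoding(encoding: str) -> tuple:
    """Split an encoding such as '5:2' into (k, p)."""
    k, p = (int(part) for part in encoding.split(":"))
    return k, p


def ec_node_guidance(encoding: str, nodes: int) -> dict:
    """Apply the release-note thresholds to a cluster of the given size."""
    k, p = parse_ec_encoding(encoding)
    return {
        "segments": k + p,
        # "More than k+p" nodes means at least k+p+1.
        "recommended_nodes": k + p + 1,
        "meets_recommendation": nodes > k + p,
        # Below (k+p)/p nodes, the notes point to ec.protectionLevel=volume.
        "needs_volume_level": nodes < (k + p) / p,
    }
```

For policy.ecEncoding=5:2 this gives 7 segments and a recommendation of 8 nodes, matching the example above. A 3-node cluster falls below (5+2)/2 = 3.5, so it would need ec.protectionLevel=volume for EC writes to succeed.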

  • Small Clusters: Verify the following settings if using 10 or fewer Swarm nodes. Do not use fewer than 3 in production.
    Important: If you need to change any, do so before upgrading to Swarm 10.

    • policy.replicas: The min and default values for numbers of replicas to keep in your cluster must not exceed your number of nodes. For example, a 3-node cluster may have only min=2 or min=3.

    • EC Encoding and Protection: For EC encoding, verify you have enough nodes to support the cluster's encoding (policy.ecEncoding). For EC writes to succeed with fewer than (k+p)/p nodes, use the new level, ec.protectionLevel=volume.

    • Best Practice: Keep at least one physical machine in your cluster beyond the minimum number needed. This allows for one machine to be down for maintenance without compromising the constraint.
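The replica constraint for small clusters can be checked with a few comparisons. This is a minimal sketch that takes the node count and the min and default values of policy.replicas as plain integers; parsing the actual setting string is left out, and the function name is hypothetical.

```python
def check_small_cluster(nodes: int, min_replicas: int, default_replicas: int) -> list:
    """Return a list of problems with replica settings for a small cluster."""
    problems = []
    if nodes < 3:
        problems.append("fewer than 3 nodes should not be used in production")
    if min_replicas > nodes:
        problems.append("policy.replicas min exceeds the number of nodes")
    if default_replicas > nodes:
        problems.append("policy.replicas default exceeds the number of nodes")
    return problems
```

For a 3-node cluster, min=2 or min=3 passes, while min=4 is reported as a problem.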

  • Cluster in a Box: Swarm supports a "cluster in a box" configuration as long as that box is running a virtual machine host and Swarm instances are running in 3 or more VMs. Each VM boots separately and has its own IP address. Follow the recommendations for small clusters, substituting VMs for nodes. If you have two physical machines, use the "cluster in a box" configuration, but move to direct booting of Swarm with 3 or more.

  • Offline Node Status: Because Swarm 10's new architecture reduces the number of IP addresses in your storage cluster, you may see the old IPs and subclusters reporting as Offline nodes until they time out in 4 days (crier.forgetOfflineInterval), which is expected.

Info

Multipath support is obsolete from Swarm 10 onward.

For Swarm 9 impacts, see Swarm Storage 9 Releases.

Watch Items and Known Issues

The following operational limitations and watch items exist in this release.

  • The node may become temporarily unresponsive when Health Data (the raw JSON of the health report) on the Advanced tab of the Chassis Details page is viewed. (SWAR-8349)

  • While a reboot of a storage node is in progress, it may be reported to be in an unknown state rather than in maintenance mode. (SWAR-8348)

  • If you wipe your Elasticsearch cluster, the Storage UI shows no NFS config. Contact DataCore Support for help repopulating your SwarmFS config information. (SWAR-8007)

  • If you delete a bucket, any incomplete multipart upload into that bucket leaves the parts (unnamed streams) in the domain. To find and delete them, use the s3cmd utility (search the Support site for "s3cmd" for guidance). (SWAR-7690)

  • Dell DX hardware has less chassis-level monitoring information available via SNMP. If this is a concern, contact DataCore Support. (SWAR-7606)

  • If logs show the warning "FEEDS WARNING: calcFeedInfo() couldn't find realm for xxx", contact DataCore Support so the issue can be resolved; the root cause is fixed. (SWAR-7556)

Upgrading from 9.x

Important

Do not begin the upgrade until you complete the following:

  1. Plan Upgrade Impacts: Review and plan for the 10.0 upgrade impacts (above) and the impacts for each of the releases since the version you are running. For Swarm 9 impacts, see Swarm Storage 9 Releases.

  2. Finish Volume Retires: Do not start any elective volume retirements during the upgrade. Wait until the upgrade is complete before initiating any retires.

  3. Run Checker Script: Swarm 10 includes a migration checker script to run before upgrading from Swarm 9; it reports configuration setting issues and deprecations to be addressed. (SWAR-8230) See Storage Settings Checker.

If you need to upgrade from Swarm 8.x or earlier, contact DataCore Support for guidance.

  1. Download the correct bundle for the site. Swarm distributions bundle together the core components needed for implementation and updates; the latest versions are available in the Downloads section on the DataCore Support Portal.
    There are two bundles available:

    • Platform CSN 8.3 Full Install or Update (for CSN environments): Flat structure for scripted install/update on a CSN (See CSN Upgrades).

    • Swarm 10 Software Bundle (Platform 9.x and custom environments): Contains complete updates of all core components, organized hierarchically by component.

Note

Contact DataCore Support for new installs of Platform Server and for optional Swarm client components, such as SwarmFS Implementation, that have separate distributions.

  1. Download the comprehensive PDF of Swarm Documentation that matches your bundle distribution date, or use the online HTML version from the Documentation Archive.

  2. Select your type of upgrade. Swarm supports rolling upgrades (a single cluster running mixed versions during the upgrade process) and requires no data conversion unless noted for a release. Upgrades can be performed without scheduling an outage or bringing down the cluster. Restart the nodes one at a time with the new version and the cluster continues serving applications during the upgrade process.

    • Rolling upgrade: Reboot one node at a time and wait for its status to show as "OK" in the UI before rebooting the next node.

    • Alternative: Reboot the entire cluster at once after the software on all USB flash drives or the centralized configuration location has been updated.
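The rolling upgrade option is a simple loop: reboot one node, wait for "OK", then move to the next. The sketch below shows that control flow only. The reboot and get_status callables are hypothetical hooks that a site would have to supply, since these notes do not describe an API for either, and the polling interval and timeout are arbitrary illustrative defaults.

```python
import time
from typing import Callable, Iterable


def rolling_upgrade(
    nodes: Iterable[str],
    reboot: Callable[[str], None],
    get_status: Callable[[str], str],
    poll_seconds: float = 10.0,
    timeout_seconds: float = 1800.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> None:
    """Reboot nodes one at a time, waiting for each to report "OK"."""
    for node in nodes:
        reboot(node)
        deadline = clock() + timeout_seconds
        # Do not touch the next node until this one is healthy again.
        while get_status(node) != "OK":
            if clock() >= deadline:
                raise TimeoutError(f"{node} did not report OK in time")
            sleep(poll_seconds)
```

A real implementation would also need to confirm that the node actually went down before accepting an "OK" status, because a node may still report its pre-reboot state for a short time.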

  3. Choose whether to upgrade Elasticsearch 2.3.3 at this time. 

    • To upgrade to Elasticsearch 5.6 with an existing cluster, reindex the Search data and migrate any Metrics data to be kept. See Migrating from Older Elasticsearch for details. (SWAR-7395) 

  4. Note these installation issues:

    • The elasticsearch-curator package may show an error during an upgrade, which is a known curator issue. Workaround: Reinstall the curator: yum reinstall elasticsearch-curator (SWAR-7439)

    • Do not install the Swarm Search RPM before installing Java. If Gateway startup fails with "Caringo script plugin is missing from indexer nodes", uninstall and reinstall the Swarm Search RPM. (SWAR-7688)

    • During a rolling upgrade from 9.0.x–9.2.x, you may see intermittent "WriterMissingRemoteMD5 error token" errors from a client write operation through the Gateway or on writes with gencontentmd5 (or the equivalent). To prevent this, set autoRepOnWrite=0 during the upgrade and restore autoRepOnWrite=1 after it completes. (SWAR-7756)
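The autoRepOnWrite workaround above is an ordering constraint: disable, upgrade, then restore. A minimal sketch of that ordering follows, assuming a hypothetical set_setting callable that applies a setting cluster-wide and a run_upgrade callable that performs the upgrade; neither corresponds to a documented Swarm interface.

```python
from typing import Callable


def upgrade_with_autorep_disabled(
    set_setting: Callable[[str, int], None],
    run_upgrade: Callable[[], None],
) -> None:
    """Run an upgrade with autoRepOnWrite set to 0, restoring 1 only on success."""
    set_setting("autoRepOnWrite", 0)
    # If the upgrade raises, the setting deliberately stays at 0 so the
    # operator can finish or roll back before restoring it.
    run_upgrade()
    set_setting("autoRepOnWrite", 1)
```

The setting is restored only after the upgrade completes, which matches the guidance to restore autoRepOnWrite=1 after the upgrade rather than on any exit.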

  5. Review the page at https://perifery.atlassian.net/wiki/spaces/public/pages/2443808433.

© DataCore Software Corporation. · https://www.datacore.com · All rights reserved.