Hardware Requirements for Storage

Hardware requirements for implementing a storage cluster in a corporate enterprise and the best practices for maintaining it.

Note

Swarm installs and runs on enterprise-class (not consumer-grade) x86 commodity hardware.

Caution

Configure a cluster with a minimum of four nodes to guarantee high availability and failover in the event of a node failure.

Virtualization

Swarm storage nodes can run in a VM environment. Swarm supports VMware/ESXi and Linux KVM. Contact sales for more information and guidance.

Best Practice

Enable volume serial numbers on any virtual machines housing Swarm storage nodes (set disk.EnableUUID=TRUE).

Minimum Requirements

The following are the minimum hardware requirements for a storage cluster. Because Swarm nodes are designed to run using lights-out management (out-of-band management), they do not require a keyboard, monitor, and mouse (KVM) to operate.

  • Node: x86 with Pentium-class CPUs
  • Number of nodes: Four (to guarantee adequate recovery space in the event of node failure)
  • Switch: Nodes must connect to switches configured for multicasting
  • Log server: To accept incoming syslog messages
  • Node boot capability: USB flash drive or PXE boot
  • Network interfaces: One Gigabit Ethernet NIC with one RJ-45 port
  • Hard disks: One hard disk
  • RAM: Number of volumes x 4 GB (example: 2 volumes x 4 GB = 8 GB)
  • NTP server: To synchronize clocks across nodes

Recommended Requirements

The following are the recommended hardware requirements for a storage cluster.

  • Node: x86 with Intel Xeon or AMD Athlon64 (or equivalent) CPUs
  • Number of nodes: Four or more
  • Switch: Nodes must connect to switches configured for multicasting
  • Log server: To accept incoming syslog messages
  • Node boot capability: USB flash drive or PXE boot
  • Network interfaces: Two Dual Gigabit Ethernet NICs with two RJ-45 ports for link aggregation (NIC teaming)
    Important: Mixing network speeds among nodes is not supported. Do not put a node with a 100 Mbps NIC in the same cluster with a node containing a 1000 Mbps NIC.
  • Hard disks: One to four standard non-RAID SATA hard disks
  • RAM: Number of volumes x 8 GB (example: 16 volumes x 8 GB = 128 GB)
  • NTP server: To synchronize clocks across nodes

Although Swarm runs on a variety of x86 hardware, the requirements above describe the recommended base characteristics for a storage cluster. Adding systems with faster CPUs and more memory significantly improves cluster performance.

What hardware is best depends on storage and performance requirements, so ask a sales representative for hardware recommendations specific to the needs of the business.

CPU Guidelines for Storage Node

The following CPU core counts apply to a storage node running Swarm 15.3.x (a rough sizing sketch follows the list):

  • 1 CPU core minimum for base OS bookkeeping

  • 1 CPU core for the main process

  • At least 4 CPU cores for SCSP processes

  • At least 2 CPU cores for the CIP processes

  • 1 CPU core per volume (disk) in the server
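
As a rough illustration of the arithmetic above, the sketch below simply totals the per-node core counts in Python; the function name and structure are illustrative assumptions, not part of Swarm.

    # Hypothetical sizing helper; it totals the per-node core counts listed above.
    def minimum_cores(volume_count: int) -> int:
        base_os = 1        # base OS bookkeeping
        main_process = 1   # main Swarm process
        scsp = 4           # at least 4 cores for SCSP processes
        cip = 2            # at least 2 cores for CIP processes
        return base_os + main_process + scsp + cip + volume_count  # 1 core per volume

    # Example: a node with 8 volumes needs roughly 1 + 1 + 4 + 2 + 8 = 16 cores.
    print(minimum_cores(8))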

Memory Sizing Requirements

Review the following sections for factors influencing how memory and erasure coding are sized, as well as how to configure Swarm.

How RAM Affects Storage

The storage cluster is capable of holding the sum of maximum object counts from all nodes in the cluster. How many objects can be stored on a node depends on the node's disk capacity and the amount of system RAM.

The following table shows the estimates of the maximum possible number of replicated objects (regardless of size) that can be stored on a node, based on the amount of RAM in the node, with the default 2 replicas being stored. Each replica takes one slot in the in-memory index maintained on the node.

Amount of RAM    Maximum Immutable Objects    Maximum Alias or Named Objects
4 GB             33 million                   16 million
8 GB             66 million                   33 million
16 GB            132 million                  66 million
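
As an illustration only, the following sketch interpolates between the rows of the table above (roughly 8.25 million immutable replicas and 4 million alias or named objects per GB of RAM); the linear scaling and the helper itself are assumptions, not a Swarm formula.

    # Illustrative per-node capacity estimate derived from the table above by
    # assuming roughly linear scaling with RAM (an assumption, not a Swarm formula).
    IMMUTABLE_PER_GB = 33_000_000 / 4      # ~8.25 million immutable replicas per GB
    ALIAS_NAMED_PER_GB = 16_000_000 / 4    # ~4 million alias or named objects per GB

    def estimate_max_objects(ram_gb: float) -> dict:
        return {
            "immutable": int(ram_gb * IMMUTABLE_PER_GB),
            "alias_or_named": int(ram_gb * ALIAS_NAMED_PER_GB),
        }

    # Example: a 12 GB node falls between the 8 GB and 16 GB rows of the table.
    print(estimate_max_objects(12))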

How the Overlay Index Affects RAM

Larger clusters (those above 4 nodes by default) need additional RAM resources to take advantage of the Overlay Index.

To store the same reps=2 object counts shown above while using the Overlay Index, increase RAM as follows:

  • Immutable unnamed objects: 50% additional RAM

  • Alias or named objects: 25% additional RAM

Smaller clusters and larger clusters where the Overlay Index is disabled do not need this additional RAM.
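
A minimal sketch of that adjustment, assuming the baseline RAM has already been sized for the target object counts and that one object type dominates; the helper is illustrative only.

    # Illustrative helper applying the percentages above: 50% more RAM for clusters
    # storing mostly immutable unnamed objects, 25% more for alias or named objects.
    def ram_with_overlay_index(base_ram_gb: float, mostly_immutable: bool) -> float:
        factor = 1.50 if mostly_immutable else 1.25
        return base_ram_gb * factor

    # Example: a node sized at 8 GB for immutable objects needs ~12 GB with the
    # Overlay Index enabled.
    print(ram_with_overlay_index(8, mostly_immutable=True))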

See https://perifery.atlassian.net/wiki/spaces/public/pages/2443811756

How to Configure Small Objects

Swarm allows storage of objects up to a maximum of 4 TB. Configure the storage cluster accordingly if storing mostly small files.

By default, Swarm allocates a small amount of disk space to store, write, and delete the disk's file change logs (journals). This default amount is sufficient for typical deployments because the objects fill the remainder of the disk before the log space is consumed.

For installations that write mostly small objects (1 MB and under), the log space can fill up before the disk space does. If the cluster's usage focuses on small objects, increase the configurable amount of log space allocated on the disk before booting Swarm on the node for the first time.

The parameters used to change this allocation differ depending on the software version in use.

Supporting Erasure Coding

Erasure coding conserves disk space but involves additional calculations and communications. Anticipate an impact on memory and CPU utilization. 

CPU

In general, plan to support EC with more and faster CPU cores.

Memory

Scale up for larger objects: Larger objects require more memory to manage erasure sets. How many EC objects can be stored on a node per GB of RAM depends on the size of the object and the encoding specified in the configuration. The erasure-coding manifest takes two index slots per object, regardless of the type of object (named, immutable, or alias). Each erasure-coded segment in an erasure set takes one index slot. Larger objects can have multiple erasure sets, so multiple sets of segments exist. In k:p encoding (where k and p are the data and parity segment counts), there are p+1 manifests (up to the ec.maxManifests maximum), which means 3 manifests for 5:2 encoding. With the default segment size of 200 MB and a configured encoding of 5:2 (see the sketch after these examples):

  • 1-GB object: (5+2) slots for segments and (2+1) for manifests = 10 index slots

  • 3-GB object: 3 sets of segments @ 10 slots each = 30 index slots
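
A minimal sketch of this slot arithmetic, following the worked examples above; the function and its simplifying assumptions (whole erasure sets, p+1 manifests counted per set as in the 3-GB example) are illustrative, not a Swarm API.

    import math

    # Illustrative index-slot estimate for an erasure-coded object, mirroring the
    # 5:2 worked examples above (200 MB default segment size). Counting (p + 1)
    # manifests per erasure set follows the 3-GB example; this is a simplification.
    def ec_index_slots(object_gb: float, k: int = 5, p: int = 2,
                       segment_mb: int = 200) -> int:
        set_capacity_mb = k * segment_mb                      # data held by one erasure set
        sets = math.ceil(object_gb * 1000 / set_capacity_mb)  # number of erasure sets
        slots_per_set = (k + p) + (p + 1)                     # segment slots plus manifest slots
        return sets * slots_per_set

    print(ec_index_slots(1))  # 1-GB object, 5:2 encoding -> 10 index slots
    print(ec_index_slots(3))  # 3-GB object, 5:2 encoding -> 30 index slots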

Increase for Overlay Index: Larger clusters (above 4 nodes by default) need additional RAM resources to take advantage of the Overlay Index. For erasure-coded objects, allocate 10% additional RAM to enable the Overlay Index.

Network

Network use by the requesting SAN is an important factor in EC performance. The requesting SAN must orchestrate k+p segment writes (for an EC write) or k segment reads (for an EC read). Because these segment reads and writes must occur at the same rate, the slowest one slows the overall request. Consequently, when a cluster experiences a lot of EC requests, nodes can play different roles and slower nodes can affect multiple requests.
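
As a toy illustration of why the slowest transfer gates an EC request, consider the sketch below; the timings are invented for illustration, not measurements.

    # Toy model: an EC write must complete k + p segment writes, so the slowest
    # segment gates the whole request.
    segment_write_seconds = [0.8, 0.9, 0.8, 2.5, 0.9, 0.8, 0.9]  # hypothetical 5:2 write

    request_seconds = max(segment_write_seconds)  # slowest segment determines the request
    print(f"EC write takes ~{request_seconds} s even though most segments finish in <1 s")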

See https://perifery.atlassian.net/wiki/spaces/public/pages/2443812123

Supporting High-Performance Clusters

For the demands of high-performance clusters, Swarm benefits from faster CPUs and processor technologies such as large caches, 64-bit computing, and fast front-side bus (FSB) architectures.

Maximize these variables to design a storage cluster for peak performance:

  • Add nodes to increase cluster throughput – like adding lanes to a highway

  • Fast or 64-bit CPU with large L1 and L2 caches

  • Fast RAM BUS (front-side BUS) configuration

  • Multiple, independent, and fast disk channels

  • Hard disks with large on-board buffer caches and Native Command Queuing (NCQ) capability

  • Gigabit (or faster) network topology between all storage cluster nodes

  • Gigabit (or faster) network topology between the client nodes and the storage cluster

Balancing Resources

Attempt to balance resources across nodes as evenly as possible for best performance. For example, in a cluster of nodes with 7 GB of RAM, adding several new nodes with 70 GB of RAM can overwhelm those new nodes and negatively impact the entire cluster.

Creating a large cluster and spreading the user request load across multiple storage nodes significantly improves data throughput because Swarm is highly scalable. This improvement increases as nodes are added to the cluster.

Selecting Hard Disks

Selecting the right hard disks for the storage nodes improves both performance and recovery in the event of a node or disk failure. Follow the guidelines below when selecting disks. For an in-depth overview of disk characteristics, download the Intel white paper Enterprise class versus Desktop class Hard Drives.

Disk Type

Enterprise-level

The critical factor is whether the hard disk is designed for the demands of a cluster. Enterprise-level hard disks are rated for 24x7 continuous-duty cycles and have time-constrained error recovery logic suitable for server deployments where error recovery is handled at a higher level than the onboard controller.

In contrast, consumer-level hard disks are rated for desktop use only; they have limited-duty cycles and incorporate error recovery logic that can pause all I/O operations for minutes at a time. These extended error recovery periods and non-continuous duty cycles are not suitable or supported for Swarm deployments.

Reliability

Rated for continuous use

The reliability of hard disks from the same manufacturer varies because the disk models target different intended uses and duty cycles:

  • Consumer models are targeted at home users and assume the disk is not used continuously. These disks do not include the more advanced vibration and misalignment detection and handling features.

  • Enterprise models are targeted at server applications; they tend to be rated for continuous use (24x7) and include predictable error recovery times, as well as more sophisticated vibration compensation and misalignment detection.

Performance

Large on-board cache

Independent channels

Fast bus

Optimize the performance and data throughput of the storage subsystem in a node by selecting disks with these characteristics:

  • Large buffer cache: Larger onboard caches improve disk performance.

  • Independent disk controller channels: Reduces storage bus contention.

  • High disk RPM: Faster-spinning disks improve performance.

  • Fast storage bus speed: Faster data transfer rates between storage components, a feature incorporated in these types:

    • SATA-300

    • Serial Attached SCSI (SAS)

    • Fibre Channel hard disks

The storage bus type in the computer system and hard disks often drives the use of independent disk controllers.

  • PATA: Older ATA-100 and ATA-133 (Parallel Advanced Technology Attachment [PATA]) storage buses allow two devices on the same controller/cable. Bus contention occurs when both devices are in active use. Motherboards with PATA buses have only two controllers, so some bus sharing must occur if more than two disks are used.

  • SATA: Unlike PATA controllers, Serial ATA (SATA) controllers and disks include only one device on each bus to overcome the previous bus contention problems. Motherboards with SATA controllers typically have four or more controllers. Recent improvements in Serial ATA controllers and hard disks (commonly called SATA-300) have doubled the bus speed of the original SATA devices.

Recovery

Avoid highest capacity

Improve the failure and recovery characteristics of a node when a disk fails by selecting disks with server-class features but not the highest capacity.

  • Higher capacity means slower replication. When choosing the disk capacity for a node, consider the trade-off between the benefits of high-capacity disks and the time required to replicate the contents of a failed disk. Larger disks take longer to replicate than smaller ones, and that delay increases the business exposure when a disk fails.

  • Delayed errors mean erroneous recovery. Unlike consumer-oriented devices, for which it is acceptable for a disk to spend several minutes attempting to retry and recover from a read/write failure, redundant storage designs such as Swarm need the device to emit an error quickly so the operating system can initiate recovery. If a disk in a node requires a long delay before returning an error, the entire node may appear to be down, causing the cluster to initiate recovery actions for all disks in the node, not just the failed disk.

  • Short command timeouts mean less impact. The short command timeout value inherent in most enterprise-class disks allows recovery efforts to occur while the other disks in the system continue to support disk access requests from Swarm.

Controllers and RAID

JBOD, not RAID

Controller-compatible

  • Evaluate controller compatibility before each purchasing decision.

  • Buy controller-compatible hardware. The more types of controllers in a cluster, the more restrictions exist on how volumes can be moved. Study these restrictions, and keep this information with the hardware inventory.

  • Avoid RAID controllers. Always choose JBOD over RAID when specifying hardware for use in Swarm. RAID controllers are problematic for these reasons:

    • Incompatibilities in RAID volume formatting

    • Inability of many to hot plug, so the ability to move volumes between machines is lost

    • Problems with volume identification (disk lights)

  • HP ProLiant systems with P840 RAID controllers: Swarm performance degrades when this card runs in HBA mode. For best performance, configure a single RAID 0 per disk, but note that volumes in this system cannot be hot-plugged.

Firmware

Track kernel mappings

Keep track of controller driver to controller firmware mappings in the kernel shipped with Swarm Storage. This is particularly important when working with LSI-based controllers (which are the majority), because a mismatch between driver and firmware in LSI's Fusion MPT architecture can introduce indeterminate volume behavior (such as good disks reporting errors erroneously due to a driver mismatch).

Mixing Hardware

Swarm simplifies hardware maintenance by making disks independent of a chassis and disk slots. As long as disk controllers are compatible, move disks as needed. Swarm supports a variety of hardware, and clusters can blend hardware as older equipment fails or is decommissioned and replaced. The largest issue with mixing hardware is incompatibility among the disk controllers.

Follow these guidelines for best results with mixing hardware:

Track Controllers

Monitor the hardware inventory with special attention to the disk controllers when administering the cluster. Some RAID controllers reserve part of the disk for controller-specific information (DDF). Once Swarm formats a volume for use, it must be used with a chassis having that specific controller and controller configuration.

Many maintenance tasks involve physically relocating volumes between chassis to save time and data movement. Use the inventory of the disk controller types to spot when movement of formatted volumes is prohibited due to disk controller incompatibility.

Test Compatibility

To determine controller compatibility safely, test outside of the production cluster.

  1. Set up two spare chassis, each with the controller being compared.

  2. Format a new volume in Swarm in the first chassis.

  3. Move the volume to the second chassis and watch the log for error messages during mount or for any attempt to reformat the volume.

  4. Retire the volume in the second chassis and move it back to the first.

  5. Watch for errors or attempts to reformat the volume.

  6. If all goes well, erase the disk using dd and repeat the procedure in reverse, formatting the volume on the second chassis.

If no problems occur during this test, volumes can be swapped between these chassis within a cluster. If the test runs into trouble, do not swap volumes between these controllers.
