Determining Appropriate Erasure Coding (EC) Policies to Support EC Writes
Much has been written about designing storage clusters so that erasure coded (EC) data can still be read when Swarm storage nodes or volumes become unavailable, whether through unplanned events or routine maintenance. Although it may seem counterintuitive at first glance, it is even more important to design for cluster failure modes while still accommodating EC write activity. Typical requirements include:
“We need to be able to support EC writes with 'M' nodes unavailable”
“We need to support EC writes even when 'N' simultaneous volume failures occur”
Knowing Swarm’s constraints, it’s possible to design for supported failure modes in a cluster while still being able to support writes for a given EC protection policy. The purpose of this article is to provide guidance to that end.
NOTE: Unless otherwise noted, these guidelines will focus on the default EC boundary evaluations related to “node” level protection.
Considerations for writing EC segments in a Swarm cluster
Regardless of whether you use node, volume, or subcluster level EC protection policies, always bear in mind how the sum of data (k) and parity (p) segments, “k+p”, for a desired EC k:p protection policy aligns with the volume count in the cluster.
For node level protection (the default), you can have up to 'p' segments of an EC set on a single node, however…
You cannot have more than one segment of an EC set on a single volume.
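The two placement rules above imply lower bounds on volume and node counts for a given k:p policy. As a rough sketch (the function names and the ceiling formula are illustrative, not part of any Swarm API):

```python
import math

def min_volumes_for_segments(k: int, p: int) -> int:
    """Each of the k+p segments must land on a distinct volume,
    so at least k+p volumes are needed to place one EC set."""
    return k + p

def min_nodes_for_segments(k: int, p: int) -> int:
    """With node-level protection, a node may hold at most p
    segments of one EC set, so at least ceil((k+p)/p) nodes
    are needed to place the segments."""
    return math.ceil((k + p) / p)

# Example: a 5:2 policy has 7 segments, so placing them requires
# at least 7 volumes spread across at least 4 nodes.
print(min_volumes_for_segments(5, 2))  # 7
print(min_nodes_for_segments(5, 2))    # 4
```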
Considerations for writing EC manifests in a Swarm cluster
Manifests for EC objects (i.e., the entities that contain EC object metadata with pointers to associated segments) are protected in Swarm clusters using replication.
The number of replicated copies used is “p+1” for a defined EC k:p protection policy.
As such, EC manifests are subject to Swarm boundary constraints for writing replicated objects.
Given this, it is not possible to place multiple manifest replicas for an EC set on a single node.
This is because an object protected with replication can have only one replica associated with any one node volume index.
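A small sketch of the manifest rule (function names are hypothetical): because the p+1 manifest replicas must each land on a different node, the manifest constraint alone requires at least p+1 nodes to complete an EC write.

```python
def manifest_replicas(p: int) -> int:
    """EC manifests are replicated p+1 times for a k:p policy."""
    return p + 1

def min_nodes_for_manifests(p: int) -> int:
    """Each manifest replica must be written to a distinct node,
    so at least p+1 nodes must be available."""
    return p + 1

# Example: any k:2 policy writes 3 manifest replicas, so at
# least 3 nodes must be available to accept an EC write.
print(manifest_replicas(2))        # 3
print(min_nodes_for_manifests(2))  # 3
```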
Summary
In typical production architectures, the above guidelines are usually met easily, because production level hardware involves a reasonable number of Swarm storage nodes which, in turn, host non-trivial volume counts. As a result, there is greater flexibility in choosing an EC protection policy that supports EC writes. In general terms, given the above:
The EC segment (k+p) constraint will drive your necessary volume count for a given EC k:p policy.
The manifest constraint will drive your necessary Swarm storage node count for a given EC k:p policy.
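Putting the two constraints together, a rough sizing check for whether a cluster can still accept EC writes with a given number of nodes or volumes unavailable might look like the following. This is an illustrative sketch only, assuming the default node-level rules described above; it is not a Swarm utility.

```python
import math

def can_accept_ec_writes(k, p, total_nodes, volumes_per_node,
                         failed_nodes=0, failed_volumes=0):
    """Rough check under the default node-level EC rules:
    - k+p segments, each on a distinct volume
    - at most p segments of one EC set per node
    - p+1 manifest replicas, each on a distinct node
    """
    nodes = total_nodes - failed_nodes
    volumes = nodes * volumes_per_node - failed_volumes
    need_nodes = max(math.ceil((k + p) / p), p + 1)
    need_volumes = k + p
    return nodes >= need_nodes and volumes >= need_volumes

# Example: 6 nodes with 8 volumes each, running a 5:2 policy,
# can still accept EC writes with 2 nodes down (4 nodes remain,
# meeting both the segment and manifest constraints).
print(can_accept_ec_writes(5, 2, 6, 8, failed_nodes=2))  # True
print(can_accept_ec_writes(5, 2, 6, 8, failed_nodes=3))  # False
```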
© DataCore Software Corporation. · https://www.datacore.com · All rights reserved.