Document Identifier: TechNote 2015006 (replaces TechNote 2011009)
Document Date: October 31, 2015
Software Package: SWARM
Version: Swarm 7.5.0 or later
Abstract
Different applications and installations have different needs when balancing throughput for write and read operations against tolerance for some data loss in the event of a hardware failure. Swarm’s behavior can easily be tuned to match those differing requirements.
Swarm provides protection on disk by making replicas (multiple copies of each object on different nodes). You can control how many replicas are made of each object, and how quickly they are made after the object is initially stored in the cluster. Swarm provides integrity protection on disk by using an MD5 hash to validate contents; you can use the same mechanism to provide additional protection for the object in transit.
Note: If there is one replica of an object in a cluster, then there is only one instance of that object in the cluster; replica, instance, and object are synonymous in this usage.
Protection on Disk
Swarm is designed to be both fast and safe. Data stored in a Swarm cluster under normal conditions is quite safe, but it is important to understand exactly how safe it is, and to select the appropriate tradeoffs between performance, capacity, and safety that best meet business needs in each situation.
By default, each object in Swarm is stored with two replicas, each resident on a different node in the cluster. If a drive fails completely or disappears, the cluster reacts quickly and initiates a volume recovery process for each missing drive. That process rapidly creates additional replicas, elsewhere in the cluster, of every object that was stored on the now-missing drive(s), so that each object once again has two replicas.
If a second drive fails before this recovery process is complete, it may take with it the only remaining replica of some objects, and that data would be lost.
While a rapid sequence of drive failures is unlikely, it is certainly not impossible. If this presents an unacceptable risk for your application, the solution is to increase the minimum number of replicas – a tradeoff of space for security.
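For example, the cluster default can be raised from two replicas to three with a single configuration change (a hedged sketch; minreps is the parameter name used later in this note, but confirm the exact name for your Swarm version in its configuration guide):

    minreps = 3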
There is also a potential period of vulnerability at the moment an object is first stored on Swarm. By default, Swarm writes a new object to one node, responds to the application with a success code and UUID (or name), and then quickly replicates the object as needed to other nodes and/or subclusters. The replication step is performed as a lower-priority task. While this creates the best balance of throughput and fault tolerance in most circumstances, there are cases where you might want to give the replication task the same priority as reads and writes, which ensures replication occurs quickly even under heavy sustained loads.
This can be done by adding a single parameter to the cluster configuration file (or the individual node config files for all nodes) as follows:
repPriority = 1
(or health.replicationPriority)
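For illustration, a hedged sketch of how the two parameter forms might appear in configuration (file names and section layout here are assumptions; consult the configuration guide for your Swarm version):

    # Flat (legacy) form, e.g. in a cluster-wide configuration file:
    repPriority = 1

    # Namespaced form, e.g. in an INI-style node configuration file:
    [health]
    replicationPriority = 1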
With replication set to priority 1, object replication is interleaved in parallel with other operations. This might have a negative impact on cluster throughput for use cases involving sustained, heavy writes, and it is still possible, though much less likely, that the failure of a node or volume could cause some recently written objects to be lost if the failure occurs immediately after a write operation but before replication to another node can be completed.
Cluster administrators or application developers who have little or no tolerance for data loss of any kind can choose to require Swarm to create two replicas before the UUID (or name) of the new object is ever returned. This is done by using the Replicate on Write, or ROW feature. ROW can be utilized in two different ways, as follows:
- A cluster administrator can choose to enable ROW for all writes to a cluster by setting a parameter in the node or cluster configuration file as follows:
autoRepOnWrite = 1
(or scsp.replicateOnWrite)
- If this parameter is set, all writes return a response and the UUID (or name) of the new object only after two replicas have been successfully created in the cluster. Objects will be safe even if a failure occurs immediately after a write completes. Of course, write throughput will be impacted because two full copies need to be written to disk before a write operation can complete. In general, ROW writes have slightly better than half the throughput of ordinary writes.
- If an administrator chooses not to set autoRepOnWrite / scsp.replicateOnWrite for the entire cluster, individual applications can still opt for this high level of fault tolerance by setting the following query argument on a POST request:
POST /?replicate=immediate HTTP/1.1
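For illustration, the hedged Python sketch below issues such a write using the requests library; the node address and payload are placeholders, and the exact placement of the returned UUID in the response is described in the Swarm SCSP documentation:

    # Hypothetical example: write a single object with Replicate on Write,
    # without enabling ROW cluster-wide. Node address and payload are placeholders.
    import requests

    SWARM_NODE = "http://swarm-node.example.com:80"

    body = b"payload that must survive an immediate node or volume failure"
    resp = requests.post(
        f"{SWARM_NODE}/",
        params={"replicate": "immediate"},   # ROW for this write only
        headers={"Content-Type": "application/octet-stream"},
        data=body,
    )
    resp.raise_for_status()
    # By the time a success response arrives, at least two replicas exist;
    # the new object's UUID is returned in the response (see the SCSP guide).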
If both ROW and repPriority=1 are used at the same time, the Replicate on Write semantics apply to the first two replicas and any additional replicas (assuming reps > 2) are written with priority 1.
One of the functions performed by Swarm’s Health Processor is to periodically scan objects on disk to verify their integrity using the MD5 hash that Swarm created as the object was stored. If an object is found to be corrupted (possibly due to a bad sector on the disk), it is trimmed (that is, deleted) so that a remaining good replica can be replicated again, once more resulting in two (or more, if minreps is set greater than two) copies of the object in the cluster.
This process happens relatively frequently and ordinarily causes only a very brief interval of exposure during which the object is vulnerable to data loss from a drive failure. The exposure is more severe if the object is detected as corrupted after it is initially stored but before the first replica has been created; in that case, the only copy of the object might be deleted. Although this behavior seems harsh, deleting such an object is appropriate because the object is corrupt and, unfortunately, no other good replica exists. Using Replicate on Write, as discussed above, is a way to avoid this exposure.
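To make the mechanism concrete, the following is a conceptual Python sketch of the decision the integrity check performs; it is not Swarm's internal code, only an illustration of comparing a recomputed digest against the digest recorded at write time:

    # Conceptual illustration only (not Swarm internals): an MD5-based
    # integrity check over a stored replica.
    import hashlib

    def replica_is_intact(stored_bytes: bytes, recorded_md5_hex: str) -> bool:
        # Recompute the digest of the bytes on disk and compare it with the
        # digest recorded when the object was originally written.
        return hashlib.md5(stored_bytes).hexdigest() == recorded_md5_hex

    # A failed check means the replica is corrupt: it is trimmed, and a
    # surviving good replica is copied again to restore full protection.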
Protection In Transit
The Content-MD5 metadata header provides an end-to-end message integrity check of the content (excluding metadata) of an object as it is sent to and returned from Swarm. A client can check this header to detect modification of the object’s body in transit. Also, a client can provide this header to indicate that Swarm should compute and check it as it is storing or returning the object data.
Content-MD5 headers are stored with the object metadata and returned on all subsequent GET or HEAD requests.
If a Content-MD5 header is included with a GET request, Swarm computes the hash as the bytes are read, whether the header was originally stored with the object or not. If the computed and provided hashes do not match, the connection is closed before the last bytes are transmitted, which is the standard way to indicate something went wrong with the transfer.
During a POST or PUT, the client can provide a Content-MD5 header containing an MD5 digest computed from the content of the entity body, including any content-coding that has been applied, but not including any transfer-encoding applied to the message body.
If this header is present, Swarm computes an MD5 digest during data transfer and then compares the computed digest to the one provided in the header. If the hashes do not match, Swarm returns a 400 (Bad Request) error response, abandons the object, and closes the client connection.
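As an illustration of the write-side check, the hedged Python sketch below computes the Content-MD5 value (the base64-encoded MD5 digest of the entity body, per the HTTP specification) and supplies it on a write; the node address and payload are placeholders:

    # Hypothetical example: supply Content-MD5 so Swarm can verify that the
    # body it received matches what the client sent.
    import base64
    import hashlib
    import requests

    SWARM_NODE = "http://swarm-node.example.com:80"   # placeholder endpoint

    body = b"object content to protect in transit"
    content_md5 = base64.b64encode(hashlib.md5(body).digest()).decode("ascii")

    resp = requests.post(
        f"{SWARM_NODE}/",
        headers={
            "Content-Type": "application/octet-stream",
            "Content-MD5": content_md5,   # Swarm recomputes and compares
        },
        data=body,
    )
    # A mismatch results in a 400 (Bad Request) and the object is abandoned.
    resp.raise_for_status()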
The Content-MD5 header provides an extra level of insurance, protecting against potential damage in transit as well as from damage while in storage. See “Protecting Data in Transit” in the Swarm Application Guide for more details.
Protecting the Names of Your Objects
Remember that no matter how many replicas of an object are stored in Swarm, unless you know its name or UUID, you won’t be able to retrieve it. It might be appropriate for applications to store an index in Swarm, so that there is only one name or UUID to keep track of; accessing that index object will provide UUIDs for the rest. It would certainly be appropriate to maintain several extra replicas for such a critical object.
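As a hedged sketch of the index-object idea, the Python example below collects the UUIDs an application needs to remember into one JSON document and stores that document in Swarm, so only a single UUID must be tracked outside the cluster; the node address, keys, and UUIDs are placeholders:

    # Hypothetical example: one well-protected index object whose body maps
    # application keys to the UUIDs of other stored objects.
    import json
    import requests

    SWARM_NODE = "http://swarm-node.example.com:80"   # placeholder endpoint

    index = {
        "invoices/2015-10.pdf": "uuid-returned-by-an-earlier-write",
        "photos/site-visit.jpg": "uuid-returned-by-another-write",
    }

    resp = requests.post(
        f"{SWARM_NODE}/",
        params={"replicate": "immediate"},   # write the index with ROW
        headers={"Content-Type": "application/json"},
        data=json.dumps(index).encode("utf-8"),
    )
    resp.raise_for_status()
    # The UUID of this index object is the only identifier that must be kept
    # outside the cluster; reading it back resolves everything else.

The replica count for such a critical object can also be raised above the cluster default (for example, via Swarm’s lifepoint mechanism; see the Swarm Application Guide).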
© 2015 Caringo, Inc. - All rights reserved