On this page, you will find answers to the questions most commonly asked by customers.
No. Swarm does not cache writes; the data is fully protected when the write request completes. This means Swarm cannot match the apparent performance of products that do cache writes, but those products often cannot handle the sustained throughput that Swarm can, because of the problems that occur when such a cache eventually overflows.
For EC encoding, we use k and p values, implemented mostly with the zfec erasure-coding module. The EC literature also mentions an "m" value, which is simply k + p. When an EC object is written, there are k data segments and p parity segments. Data is striped into those segments, so each segment is written incrementally during an EC write and there is relatively little buffering. The segments are distributed throughout the cluster to minimize data loss or inaccessibility during outages. No two segments are ever put on the same volume; there can be some "doubling up" of segments on the same chassis, but never more than k on one chassis. In larger clusters, we strive for one segment per node. This distribution occurs on the original write and is maintained by the health processor over the lifetime of the object: the health processor keeps all k + p segments present through failures, recreating lost segments (for whatever reason). On read, only k segments are needed to reconstruct the data; generally, we choose which k of the k + p to read with performance in mind.
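The k and p arithmetic above can be sketched in a few lines. This is illustrative only (Swarm's actual encoding uses the zfec module), and the helper names are hypothetical:

```python
# Illustrative sketch of the k/p arithmetic above; Swarm's actual
# encoding is done by zfec, and these helper names are hypothetical.

def ec_footprint(k: int, p: int) -> float:
    """Storage footprint multiplier for a k:p encoding (m = k + p segments)."""
    return (k + p) / k

def readable(k: int, p: int, lost: int) -> bool:
    """An object stays readable as long as any k of its k + p segments survive."""
    return (k + p) - lost >= k

# A 4:2 encoding writes 6 segments at 1.5x the object's size and
# tolerates the loss of any 2 segments:
assert ec_footprint(4, 2) == 1.5
assert readable(4, 2, 2) and not readable(4, 2, 3)
```

Compare the 1.5x footprint of 4:2 with the 3x footprint of three whole replicas, which offers the same tolerance for two losses.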
The PAN/SAN mechanism largely applies to how clients interact with the cluster. Once the SAN is chosen, say on an EC write request, it orchestrates the segment writes and manifest writes, and there are no redirections during those writes, as each node has a model of the cluster, its resources, and its level of busyness.
The overlay index keeps a record of each EC segment, but segments are treated like any other object in the Swarm cluster. Overlay interaction for a GET of a particular stream (replica or segment) takes two UDP round-trips. The first goes to the node holding the overlay index for the stream and yields the most likely location(s) of the stream in the cluster. The second fetches current bid information for the replica. When the SAN performs an EC GET, we usually look for the k lowest bids and attempt to read those segments to assemble the EC object. There are a number of caveats and potential retries here, as segments may become unavailable during the request.
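Selecting the k lowest bids for an EC GET can be sketched as follows. The bid structure here is a hypothetical stand-in for Swarm's internal bidding (the doc only tells us that lower bids are preferred):

```python
import heapq

def choose_segments(bids: dict[str, float], k: int) -> list[str]:
    """Pick the k segment locations with the lowest current bids for an EC GET.
    `bids` maps segment location -> bid (lower is better); this shape is a
    hypothetical stand-in for Swarm's internal bidding, not its real API."""
    return heapq.nsmallest(k, bids, key=bids.get)

# Six segments of a 4:2 object, each with a current bid:
bids = {"vol-a": 0.9, "vol-b": 0.2, "vol-c": 0.5,
        "vol-d": 0.7, "vol-e": 0.1, "vol-f": 0.4}
assert choose_segments(bids, 4) == ["vol-e", "vol-b", "vol-f", "vol-c"]
```

If one of the chosen segments turns out to be unavailable mid-request, a retry would fall back to one of the remaining p segments, which is the caveat mentioned above.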
Smaller objects are "wholly replicated," meaning the cluster writes two (or often three) replicas at the same time to different volumes on different chassis. As with EC, we maintain proper replica counts and distribution over the lifetime of the object. On reads, we only need to choose one replica (usually the one with the lowest bid) to service the request. The PAN then redirects to the SAN, and the object is served from there.
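The placement rule for whole replication (no two replicas on the same chassis) can be sketched like this; the data shapes and function name are hypothetical, and real placement also weighs bids and capacity:

```python
def place_replicas(volumes_by_chassis: dict[str, list[str]], count: int) -> list[str]:
    """Choose `count` volumes for replicas, at most one per chassis.
    Hypothetical helper: taking one volume from each of `count` different
    chassis guarantees that losing a single chassis costs at most one replica."""
    if count > len(volumes_by_chassis):
        raise ValueError("not enough chassis for the requested replica count")
    return [vols[0] for vols in list(volumes_by_chassis.values())[:count]]

cluster = {"chassis-1": ["v1", "v2"], "chassis-2": ["v3"], "chassis-3": ["v4", "v5"]}
assert place_replicas(cluster, 3) == ["v1", "v3", "v4"]
```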
The short answer is no. The object is stored according to the policies in place at the time. That said, there are various mechanisms for changing the object’s encoding and protection over time.
The Gateway mostly serves to proxy the request; it's the storage node acting as SAN that distributes any parts. With a 4:2 encoding, there are k + p = 6 segments and p + 1 = 3 manifests. The manifests are treated like any other wholly replicated object. All are distributed throughout the cluster for maximum protection, and segments and manifests often land on different nodes/volumes.
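The stream counts for an EC write follow directly from k and p; a small sketch (the function name is hypothetical) makes the 4:2 case concrete:

```python
def streams_for_ec_write(k: int, p: int) -> dict[str, int]:
    """Streams written for one EC object: k + p segments plus p + 1 manifests."""
    return {"segments": k + p, "manifests": p + 1}

# A 4:2 encoding produces 6 segments and 3 manifest replicas:
assert streams_for_ec_write(4, 2) == {"segments": 6, "manifests": 3}
```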
The Gateway generally does not buffer significantly, except to hide some of the intricacies of interacting with the storage cluster, such as redirection. The Gateway itself never redirects, but storage nodes do so frequently. There's no real advantage to buffering in the Gateway, as the data isn't protected until it's written to the storage cluster; and because Swarm is focused on throughput, buffering potentially hundreds of requests would consume too much memory in the Gateway.
A multipart request is a bit like a transaction, with an initiate, an upload of one or more parts, and a completion operation that assembles the parts into the final object. The overall process can be immediate or spread over days. The Gateway proxies all of these requests to the storage cluster, where they are performed. During those requests, a storage node (PAN) may redirect to another node (SAN), or it may keep the request and perform it itself, taking both the PAN and SAN roles. Note that a redirection is just a 301 response naming a new location to which to send the request; it is always the SAN that does the work. A multipart write is similar to an EC write, but the manifest created there is temporary. The completion request looks up all the parts and assembles them into a final manifest. The Gateway's role in these requests is to authenticate the user and perform the protocol translation from S3 to our version of HTTP, which we call SCSP.
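The initiate/upload/complete lifecycle can be sketched as a small state holder. This is a toy model of the transaction shape described above, not Swarm's or S3's actual API:

```python
class MultipartUpload:
    """Toy model of the multipart lifecycle: initiate, upload parts, complete."""

    def __init__(self) -> None:
        # In a real cluster, initiate returns an upload id and creates a
        # temporary manifest; here we just track the parts.
        self.parts: dict[int, bytes] = {}

    def upload_part(self, number: int, data: bytes) -> None:
        # Parts may arrive in parallel and in any order.
        self.parts[number] = data

    def complete(self) -> bytes:
        # Completion assembles the parts, in part-number order, into the
        # final object (in Swarm, into the final manifest).
        return b"".join(self.parts[n] for n in sorted(self.parts))

up = MultipartUpload()
up.upload_part(2, b"world")   # out-of-order arrival is fine
up.upload_part(1, b"hello ")
assert up.complete() == b"hello world"
```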
The client has full control over the size of the parts. In clients like rclone, a part-size parameter decides how big each part should be. Say we have three parts: that means an initiate request, three part uploads (which can be done in parallel to improve bandwidth usage), and then a complete operation. In this case, the final object will have a slightly more complex manifest listing the "EC sets" of the original three parts, which include all the original segments. This encoding allows the operation to complete relatively quickly, but it imposes a small penalty on later readers of the object, as the SAN in those requests has to read k segments from each of the parts. You can imagine that range reads can start to get complex, too, and it's possible to have very inefficient manifests with hundreds of EC sets. Swarm has a mechanism to rewrite inefficient objects so that they have an ideal number of segments, but that doesn't happen immediately. We encourage customers to configure their S3 clients, such as rclone, for part sizes of 100M. Where very large objects (multiple GB) are the norm, a larger part size might be appropriate.
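To see why part size matters for later readers, count the EC sets a configuration produces: one EC set per uploaded part, per the description above. The 5 MiB figure below is an assumed small-client default, and "100M" is taken here as 100 MiB:

```python
import math

MiB = 1024 ** 2
GiB = 1024 ** 3

def ec_sets(object_size: int, part_size: int) -> int:
    """Each uploaded part becomes one EC set in the final manifest,
    so smaller parts mean more sets (and more segments to read later)."""
    return math.ceil(object_size / part_size)

# A 10 GiB object uploaded in small 5 MiB parts (an assumed client default):
assert ec_sets(10 * GiB, 5 * MiB) == 2048
# The same object with the recommended ~100M part size (taken as 100 MiB):
assert ec_sets(10 * GiB, 100 * MiB) == 103
```

At 2048 EC sets, every full read of the object touches k segments per set, which is exactly the inefficient-manifest case Swarm's rewrite mechanism eventually cleans up.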