Understanding Replication Feed Performance

Feeds are a mechanism for a cluster to do some processing for every stream (replica, EC manifest, EC segment, etc.) of every object in the cluster. Replication feeds “process” those streams by replicating logical objects to another cluster. There are many considerations and possibilities with feeds: how the clusters are connected, which data the feed will process, the size and characteristics of the objects, the difference between the catch-up phase and the steady-state phase, and so on.

This description focuses primarily on the processing of the feed backlog, which can be thought of as the content already in the cluster when the feed is defined. The backlog is processed primarily by the volume processes in the cluster, which gives a high degree of parallelism. New writes and deletes that arrive after a feed is defined are generally processed by the SCSP processes, and all feed work is queued, so new writes may be processed relatively quickly even when there is a significant backlog.
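
As a rough illustration of that parallelism, here is a back-of-the-envelope sketch (with entirely hypothetical stream counts and open rates, not measured Swarm numbers) showing that the elapsed catch-up time tracks the streams per volume rather than the cluster-wide total, since every volume works through its own share at the same time.

```python
# Hypothetical illustration of the parallelism: every volume works through its
# own share of the backlog concurrently, so elapsed catch-up time tracks
# streams *per volume*, not the cluster-wide total.

volumes = 48                         # assumed volume count in the cluster
streams_per_volume = 30_000_000      # assumed backlog streams per volume
opens_per_second = 250               # assumed per-volume stream-open rate

total_streams = volumes * streams_per_volume
elapsed_days = streams_per_volume / opens_per_second / 86_400

print(f"Total backlog streams: {total_streams:,}")
print(f"Elapsed scan time (all volumes in parallel): {elapsed_days:.1f} days")
```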

Let’s consider connectivity between clusters. The source cluster may have a direct network path to the target cluster. This configuration is the most efficient because it allows every Linux process in the source cluster to have a direct network path to all the nodes in the target cluster; bandwidth is usually not an issue in this situation. More commonly, though, clusters are separated by some complex network path, possibly even going over a wide area network. Replication may optionally go through a target cluster Gateway, a target cluster reverse proxy, a source cluster forward proxy or NAT device, and/or an HTTPS offload for secure communications. Adding hops doesn’t generally hurt throughput, except that throughput will necessarily be limited by the slowest link. Often this is the network link itself, especially if ISPs are involved. It’s also important to understand that with large object use cases, some remote replication requests will take hours. Any component on the network path can have timeouts that cut these transfers short, which can effectively clog the pipe: the large object transfers are tried over and over again without success while consuming bandwidth that could have gone to transfers that would succeed. If a replication feed shows a large number of “retrying” objects, this might be the issue.
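
To see how a large object can collide with an intermediary timeout, a rough transfer-time estimate helps. The object size, bandwidth, and timeout below are hypothetical; the point is only that a multi-hour transfer gives every proxy, offloader, or NAT device on the path a chance to cut the connection, after which the feed retries the same object and the cycle repeats.

```python
# Hypothetical arithmetic: how long does one large replication request take,
# and how does that compare with a typical idle/connection timeout on a hop?

object_size_bytes = 2 * 1024**4          # assumed 2 TiB object
wan_bandwidth_bits_per_sec = 200e6       # assumed 200 Mb/s effective WAN link
proxy_timeout_seconds = 3600             # assumed 1-hour timeout on some hop

transfer_seconds = object_size_bytes * 8 / wan_bandwidth_bits_per_sec

print(f"Single-object transfer time: {transfer_seconds / 3600:.1f} hours")
print(f"Exceeds the assumed timeout: {transfer_seconds > proxy_timeout_seconds}")
```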

It’s common for customers to replicate all of the data in the source cluster to the target cluster, but it is possible to limit which domains are transferred by a feed. One would hope that if domain D comprised 1/100th of the space used by a cluster, it would be replicated that much more quickly to the target cluster. Unfortunately, each feed must consider every stream on every volume in the cluster, regardless of any restriction in the feed definition. The savings from the restriction is that relatively few of those streams do the actual work of replication. In the feed telemetry, a new feed will have mostly “unqualified” streams: streams known to be on disk for which it is not yet known whether they match the feed definition. Streams found not to match are simply removed from “unqualified”. Matching streams are converted to “processing”, and the replication work is queued. Eventually, each such stream is marked as “success” or “retrying”. So, with our domain D, the feed processing time may be limited by the time it takes Swarm to scan all of the volumes for matching content.
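
As a small illustration (with hypothetical counts), here is why a feed restricted to domain D still has to examine everything: every stream starts out “unqualified”, and only the small matching slice ever becomes “processing”.

```python
# Illustrative model with hypothetical counts: a feed restricted to domain D
# still starts with every stream "unqualified"; only the matching slice
# becomes "processing" and queues replication work.

total_streams = 1_200_000_000          # assumed streams known on disk, cluster-wide
matching_fraction = 0.01               # domain D is ~1/100th of the space used

unqualified = total_streams                        # initial telemetry state
matching = int(total_streams * matching_fraction)  # becomes "processing"
non_matching = total_streams - matching            # dropped without queuing work

print(f"Examined (initially unqualified): {unqualified:,}")
print(f"Dropped without work:             {non_matching:,}")
print(f"Queued as 'processing':           {matching:,}")
```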

In the case of replication feeds, only a subset of the streams on disk triggers replication work. Each wholly replicated object has some number of replicas in the cluster, controlled by the cluster’s replication policy; typically there are three. In the case of EC objects, the p+1 manifests of a k:p encoded EC object are processed by the feed, causing the logical object to be replicated to the target cluster; segments are ignored by the feed. Remember that we replicate logical objects because the target cluster may have different replication policies in play.
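
As a quick worked example, here is how many on-disk streams drive feed work for a single logical object under these rules; the reps=3 and 5:2 encoding values are just examples, and the segment count assumes a single segment set of k data plus p parity segments.

```python
# For one logical object, count the streams a replication feed acts on.
# Wholly replicated objects contribute one stream per replica; a k:p EC
# object contributes its p+1 manifests, while its segments are ignored.

def feed_streams_for_object(reps=None, k=None, p=None):
    """Return (streams that drive replication, streams ignored by the feed)."""
    if reps is not None:                 # wholly replicated object
        return reps, 0
    manifests = p + 1                    # the p+1 manifests of a k:p encoding
    segments = k + p                     # k data + p parity segments (one segment set)
    return manifests, segments

print(feed_streams_for_object(reps=3))      # 3 replicas -> (3, 0)
print(feed_streams_for_object(k=5, p=2))    # 5:2 EC -> (3, 7): 3 manifests, 7 segments ignored
```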

Replicas and manifests are scattered throughout the cluster and will be processed by a feed at different times. When each replica/manifest is processed, the feed will attempt to perform the replication to the target cluster; the first replica/manifest to do so is the winner. The other replicas will find that the object is already in the target cluster (based on a HEAD request) and simply check off the work as done. So, if you imagine that all the streams in the source cluster are processed at an even rate, a new feed will do a lot of replication work early in its lifetime because everything needs to be transferred. Over time, more and more content will be found to be replicated already. So the feed may be making linear progress through the streams, but the rate of objects actually being replicated looks more like an S-curve: high at the start, a fairly rapid drop, then a long tail. The shape of the curve depends on object size, the EC and replication policies, and many other factors. There are statistics tracking remote transfer successes and duplicated successes. The sum of these gives an idea of overall feed progress, while the slopes of these statistics can indicate where on this curve the replication process is.
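
The front-loaded shape is easy to see in a toy simulation: assume every object has three replicas, the feed reaches them in random order at an even rate, and only the first replica reached performs the remote transfer (the later ones find the object already present, as with the HEAD check above). The counts are purely illustrative.

```python
import random

# Toy model of first-replica-wins: each object has 3 replicas scattered across
# the processing timeline; only the first replica reached performs the remote
# transfer, later ones count as "duplicated successes".

random.seed(1)
objects, replicas = 100_000, 3
timeline = [obj for obj in range(objects) for _ in range(replicas)]
random.shuffle(timeline)                  # streams processed at an even, random rate

already_replicated = set()
buckets = 10
per_bucket = len(timeline) // buckets
for b in range(buckets):
    transfers = 0
    for obj in timeline[b * per_bucket:(b + 1) * per_bucket]:
        if obj not in already_replicated:
            already_replicated.add(obj)
            transfers += 1                # remote transfer success
    print(f"decile {b + 1:2d}: {transfers:6d} transfers, "
          f"{per_bucket - transfers:6d} duplicated successes")
```

The falling per-decile transfer count corresponds to a falling slope in the remote-transfer-success statistic, while the duplicated-success statistic climbs.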

Unfortunately, feed processing is not linear, as the previous example assumed. Feed processing is opportunistic within Swarm. A major expense in feed processing is simply opening the stream on disk to see what kind of stream it is and whether it qualifies for the feed. Instead of making a special process for this, EVERY stream open has the potential to nudge along the feed processing. There are three main sources of stream opens. First, any GET or HEAD request will open a stream; these occur randomly and are not a significant driver of feed processing. Second, the health processor (HP) periodically scans every stream on every volume, and this can drive feed processing. Each HP cycle starts at the beginning of the volume’s data region and goes to the end, but a feed may be defined in the middle of an HP cycle, in which case feed processing “starts” wherever the HP iteration happens to be when the feed is defined. The third mechanism, called the catch-up task, is focused solely on feed processing. It starts at the beginning of a volume’s data region and opens streams, triggering feed processing, until the feed’s queue is full; then it pauses and later resumes. Over time, the catch-up task forces the feed processing of every stream on every volume. (Both HP and the catch-up task operate on every volume in the cluster in parallel.)
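
Here is a heavily simplified sketch of the catch-up behavior just described; it is not the actual Swarm implementation, and the names (open_stream, qualifies, the queue size) are invented for illustration.

```python
import time
from queue import Queue, Full

# Illustrative sketch only: a catch-up-style walker that opens every stream in
# a volume's data region in order and nudges qualifying work onto a bounded
# feed queue, pausing whenever the queue is full and resuming later.

def catch_up_task(stream_ids, open_stream, qualifies, feed_queue, pause_seconds=1.0):
    for stream_id in stream_ids:             # walk the data region in order
        stream = open_stream(stream_id)      # the expensive step: open the stream
        if not qualifies(stream):            # does it match the feed definition?
            continue
        while True:
            try:
                feed_queue.put_nowait(stream)    # queue the replication work
                break
            except Full:
                time.sleep(pause_seconds)        # queue full: pause, resume later

# Example wiring with stand-in callables:
q = Queue(maxsize=100)
catch_up_task(range(10), open_stream=lambda s: s,
              qualifies=lambda s: s % 2 == 0, feed_queue=q)
```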

Usually, when a feed is newly defined, HP and the catch-up task are operating in different regions of a volume, and both contribute to feed processing by performing stream opens that send work to the feed queue. Over time, either the HP iteration or the catch-up task will traverse into a region of the disk that the other has already processed. It’s likely, then, that no feed processing will be done by that process, though the other one may still be in new territory. It can also happen that BOTH HP and the catch-up task are in regions that the other has already processed, which means that, for a while, no work may be queued and the processing will appear stalled. Generally, during a feed lifetime, there will be “double duty” early on, then single duty, then nothing, then single duty again until the backlog has been processed. Again, this is going on across multiple volumes in parallel, so feed processing can appear to slow down in a cluster due to these “beats effects”, which generally occur near the end of the catch-up phase.
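
The mechanism behind the beats effect can be sketched with a toy, single-volume model: two scanners advancing at the same rate with no pauses, where a stream queues feed work only the first time either scanner reaches it. Real feeds add catch-up pauses, unequal scan rates, and many volumes with different HP phases, which is what produces the fuller double/single/none/single pattern described above; the region size and HP offset here are arbitrary.

```python
# Toy model of the HP / catch-up interplay on one volume, with equal scan
# rates and no pauses: each position queues feed work only the first time
# either scanner opens it, so the rate of new work drops as the scanners
# cross into territory the other has already covered.

region = 1200                         # stream positions in the data region
hp_pos = 400                          # HP cycle happens to be partway through
catch_pos = 0                         # catch-up task starts at the beginning
processed = [False] * region
interval, new_work = 200, 0

for step in range(region):
    for pos in (catch_pos, hp_pos):
        if not processed[pos]:
            processed[pos] = True
            new_work += 1             # this stream open queued new feed work
    catch_pos += 1
    hp_pos = (hp_pos + 1) % region    # HP wraps into a new cycle at the end
    if (step + 1) % interval == 0:
        print(f"steps {step + 1 - interval:4d}-{step:4d}: {new_work:3d} new work items")
        new_work = 0
```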

Between the S-shaped curve caused by multiple replicas in the source cluster and the beats effect caused by the HP and catch-up task interaction, the apparent feed processing rate can slow significantly over time. This is at odds with an expectation of linear processing and predictable completion times. And none of that considers the expense of actually transferring the data from one cluster to another, which can add further delays. This design does not lend itself to predictability, and our computed completion estimates are notoriously inaccurate as a result.

© DataCore Software Corporation. · https://www.datacore.com · All rights reserved.