Why do my stream counts decrease on an idle cluster?
For the sake of data redundancy, Swarm is generally quick to make replicas of data and slower to trim them. In a steady-state cluster it is normal to see stream counts fluctuate slightly, and on clusters under load it is normal to see them fluctuate significantly. However, the stream count should never fall below the number of objects written multiplied by the number of replicas (for example, 1,000 objects written x 2 replicas = 2,000 streams on the cluster).
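The floor described above is simple arithmetic. The following is a minimal Python sketch for checking an observed stream count against it; the function names are illustrative only and are not part of Swarm.

```python
def minimum_expected_streams(objects_written: int, replicas: int) -> int:
    """Lowest stream count the cluster should ever report."""
    return objects_written * replicas


def is_below_floor(observed_streams: int, objects_written: int, replicas: int) -> bool:
    """True when the observed count has dropped below the floor."""
    return observed_streams < minimum_expected_streams(objects_written, replicas)


print(minimum_expected_streams(1000, 2))  # 2000
print(is_below_floor(2150, 1000, 2))      # False: extra replicas, still safe
print(is_below_floor(1990, 1000, 2))      # True: worth investigating
```

A count above the floor that slowly drifts down toward it on an idle cluster is consistent with excess replicas being trimmed; a count below the floor is not.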
By default, the replication multicast frequency is set to 1%. This means that on every health processor (HP) cycle, 1% of the streams visited get a replica check multicast out to the other Swarm nodes. If the node does not get enough responses from other nodes holding the stream, it has another node make a new replica. If the node receives more responses from nodes holding the stream than are needed, it has one of those nodes trim its copy. Because of network latency, high machine utilization, and similar factors, a node may either miss a multicast or respond late, which leads to another node creating a new replica and hence to over-replication.
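The decision described above can be sketched in a few lines of Python. This is an illustrative model only, not Swarm's implementation: the function names, the `needed` parameter, and the use of a random draw to model the 1% sampling are all assumptions made for the example.

```python
import random


def is_selected_for_check(frequency_percent: float = 1.0, rng=random.random) -> bool:
    """Model the sampling: roughly frequency_percent of visited streams are checked."""
    return rng() < frequency_percent / 100.0


def replica_check_action(responses: int, needed: int) -> str:
    """Decide what to do after a replica check multicast.

    responses: number of other nodes that answered that they hold the stream
    needed:    number of such answers the checking node expects (assumed input)
    """
    if responses < needed:
        return "replicate"  # too few answers: ask another node to make a replica
    if responses > needed:
        return "trim"       # too many answers: ask one node to trim its copy
    return "none"


# A late or missed answer looks the same as a missing replica.
print(replica_check_action(responses=0, needed=1))  # replicate
# On a later check every holder answers, so the extra copy is detected.
print(replica_check_action(responses=2, needed=1))  # trim
```

The two calls at the end illustrate why counts rise quickly and fall slowly: a single missed response triggers a new replica at once, while the surplus is only removed when a later check happens to sample that stream.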
Another common source of over-replication is failed volume recovery (FVR). When a Swarm node detects that a volume it once knew about is missing, the node scans all of its streams to see whether a copy was known to live on the missing volume and, if so, has another node make a new replica. The node also tells all of the other nodes in the cluster to do the same. FVR can generate over-replication in the rush to replace the missing copies, which it does to narrow the window for data loss should another volume disappear. FVR can be triggered by an actual volume or node failure, but also by a temporary network outage, a machine reboot, and similar events. The excess replicas are eventually trimmed as the HP visits the streams on each node.
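The FVR scan can be pictured with a small Python sketch. The data model here (a dictionary mapping each stream to the set of volumes believed to hold a copy) and all names are hypothetical, chosen only to illustrate the logic described above.

```python
from typing import Dict, List, Set


def streams_to_rereplicate(known_locations: Dict[str, Set[str]], missing_volume: str) -> List[str]:
    """Return the streams that had a known copy on the missing volume."""
    return sorted(
        stream for stream, volumes in known_locations.items()
        if missing_volume in volumes
    )


def excess_copies(copies: int, target_replicas: int) -> int:
    """Number of copies the HP would eventually need to trim."""
    return max(0, copies - target_replicas)


# Hypothetical view held by one node: stream -> volumes known to hold a copy.
known = {
    "stream-a": {"vol1", "vol2"},
    "stream-b": {"vol2", "vol3"},
    "stream-c": {"vol1", "vol3"},
}

# vol2 disappears, so every stream with a copy there gets a new replica.
print(streams_to_rereplicate(known, "vol2"))  # ['stream-a', 'stream-b']

# If vol2 was only rebooting and returns, those streams now have 3 copies
# against a target of 2, leaving one excess copy each to be trimmed.
print(excess_copies(copies=3, target_replicas=2))  # 1
```

This is why a reboot or brief network outage can be followed by a temporary rise in stream counts and then a gradual decrease, even with no client activity.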
© DataCore Software Corporation. · https://www.datacore.com · All rights reserved.