Configuring Swarm Storage for Small Stream Use Cases

Small stream use cases are challenging for Swarm Storage. The main challenge is that every replica stored in Swarm carries a fixed overhead for a number of activities, such as health processor examinations, relocation during space re-balancing and retires, and indexing in Elasticsearch. This particular document doesn’t cover Elasticsearch sizing, but an important consideration in ES is the sheer number of objects; object size does not matter for the kind of metadata indexing we do with Elasticsearch. Instead, we will focus on two issues that sometimes come up with small stream use cases: 1) having adequate index space, and 2) having adequate journal space.

Index Space

The first issue that arises (and the easiest to address) is the number of index slots used by objects in Swarm storage. Each object consumes index memory in a running Swarm instance. With small stream use cases, full disks are rarely an issue; the sheer number of objects can eat up index memory faster than the disks fill, so index space becomes the limiter.

First, a little background. For small objects, Swarm stores objects as whole replicas, usually 2 or 3 replicas of each object in the cluster. Let’s consider the case of named objects, which is our most common case. Each replica is stored on a different node, and each replica uses two index slots. One slot is used for the etag of the object, making it possible to look up the object by its unique etag. The other slot is used for the “nid,” a uuid-like key that is computed from the name of the object rather than being just a random number. This allows us to use the index to find objects by name. So, if we have reps=3 in the cluster, there will be 6 slots plus one more slot, somewhere in the cluster, for the overlay index entry associated with the nid. That is a total of 7 slots in the cluster for one unique (logical) object. Multiply that number by the number of logical objects in the cluster to get an idea of how many total slots the whole cluster might use. Generally, we then divide that total by the number of nodes to get an idea of slots needed per node. The overlay index is a kind of cluster-wide index that takes advantage of nodes that have an excess of memory index slots.
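
To make the slot arithmetic concrete, here is a minimal Python sketch of the accounting just described; the object count, replica count, and node count are hypothetical examples, not recommendations.

    # Index slot accounting for named objects (whole replicas).
    def slots_per_logical_object(reps):
        # Two slots per replica (etag + nid), plus one overlay index slot
        # per logical object somewhere in the cluster.
        return reps * 2 + 1

    def slots_per_node(logical_objects, reps, nodes):
        # Rough per-node estimate: total slots spread evenly across nodes.
        return logical_objects * slots_per_logical_object(reps) / nodes

    # Example: 2 billion logical objects, reps=3, 16 nodes.
    print(slots_per_logical_object(3))               # 7
    print(slots_per_node(2_000_000_000, 3, 16))      # 875000000.0 slots per node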

The memory index is a big blob of memory that is allocated by a Swarm node at startup. Its size cannot be changed dynamically, so changing it requires a reboot of the node. In the default case, a Swarm node allows the memory index to be 62% of the physical memory of the node. The memory index can only be certain sizes, so the actual amount of memory used by the index is often a bit less than that. The remainder of the memory is used by the operating system to run Swarm processes, keep the kernel happy, and even to back the node’s internal Linux file system. Sometimes it’s necessary to shrink the index to “make room” for Swarm processes. In a small stream use case, we often make the index bigger so that (ideally) a node runs out of disk space around the same time it runs out of index slots.

The main parameter controlling index size is memory.reservePercentage, which is the fraction of physical memory held back from the memory index. Its default is 0.38, so the index gets (1 - 0.38) * 100% = 62% of memory, which is where the 62% figure mentioned earlier comes from. For a small stream use case, we often want to decrease this number, but it is rare to need to decrease it by much; often .34 or .30 is plenty. Going too far will either prevent the node from booting or leave it memory-starved, inhibiting performance.
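
For illustration, a lowered reserve might be staged as an entry like the fragment below; the 0.34 value is just one of the example values mentioned above, and the exact syntax should be verified against your Swarm configuration documentation.

    # node.cfg (example only): reserve less memory so the index can grow
    # from the default 62% of RAM to 66% of RAM.
    memory.reservePercentage = 0.34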

To complete the topic, each index slot takes 28 bytes. The index is actually a sparse data structure that cannot be completely filled, but as a rule of thumb, it can be about 80% full before index collisions start to defeat its usefulness, so we often use an effective slot size of 28 / 80% = 35 bytes when computing usable slots. For example, a node with 32G of memory allows 19.8G of index space (62%), and at 35 bytes per effective slot, that comes to roughly 567M slots per node. Based on the total number of logical objects in your cluster, you can do the math backward to arrive at a minimum index size per node and then compute a value of memory.reservePercentage for the cluster (or for specific nodes). Note that this is a read-only, node-specific setting that should appear in a node’s node.cfg file or in the cluster.cfg file.
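
The same arithmetic, run forward and backward, in a short Python sketch; the 32G node and the target slot count are hypothetical inputs.

    SLOT_BYTES = 28
    FILL_FACTOR = 0.80                                # usable fraction before collisions hurt
    EFFECTIVE_SLOT_BYTES = SLOT_BYTES / FILL_FACTOR   # 35 bytes per usable slot

    def usable_slots(ram_bytes, reserve):
        # The index gets (1 - reserve) of physical memory.
        return ram_bytes * (1 - reserve) / EFFECTIVE_SLOT_BYTES

    def reserve_for_slots(ram_bytes, slots_needed):
        # Smallest memory.reservePercentage that still leaves room for slots_needed.
        return 1 - (slots_needed * EFFECTIVE_SLOT_BYTES) / ram_bytes

    ram = 32e9                                        # 32G node
    print(round(usable_slots(ram, 0.38) / 1e6))       # ~567 (million slots at the default)
    print(round(reserve_for_slots(ram, 620e6), 2))    # 0.32 to fit ~620M slots per node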

In practice, if your average object size is over 100K, you will probably not need to adjust this, but if your nodes are very dense, you might also want to ramp up physical memory. It’s easy to compute the average object size from health reports: just divide the logical space by the number of logical objects.
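
For example, with hypothetical health-report figures:

    # Average object size from health-report totals (hypothetical numbers).
    logical_space_bytes = 150e12          # 150 TB of logical space
    logical_objects = 2_000_000_000       # 2 billion logical objects
    print(logical_space_bytes / logical_objects)   # 75000.0, i.e. ~75K average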

Perhaps the final lesson here is that it’s not terribly important to “get this right” when the cluster is first sized; often there are too many variables. But once a cluster is running and partially filled, it’s relatively easy to see whether there will be enough index space and whether it’s appropriate to add memory or make settings changes. Settings changes can simply be staged for a later reboot.

Journal Space

Each Swarm volume devotes about 5% of the physical disk to the “write journal”. The journal acts as a kind of summary of important changes (called “events”) to the disk. The health processor also journals the streams it has visited, so after a complete HP cycle the journal is completely rewritten and anything older can be discarded. The journal is used during mount to (relatively) quickly populate the memory index. It’s also used during FVR/ECR to quickly determine what new replicas and segments need to be made in the cluster.

One of the quirks of the journal is that it breathes. It’s at its smallest when old events are trimmed at the end of an HP cycle. Just before that point, though, it’s at its largest, since it holds events from the last full HP cycle, the ongoing cycle, and any other writes that have happened since. Rebooting the node too often inhibits completion of the full HP cycle needed to trim the journal. This breathing behavior makes it difficult to know when the journal is getting too full for proper use.

Swarm can issue CRITICAL messages when journal space is too low, but it is difficult to predict, just by looking at the cluster, whether that will happen. If it does happen, the cluster will continue to operate normally until the journal is truly full and volumes can no longer take writes. On reboot, the node’s index will not be fully populated, so some content will seemingly disappear. It is on disk, but there is no practical way for the node to find the replica. Over time, HP will find the replica and make index entries for the “missing” content. Before that happens, though, the cluster may be busy making more replicas elsewhere, or content may appear missing to customer applications.

Unfortunately, the journal size on a disk is fixed at the time the volume is formatted, and for a new cluster it’s almost impossible to know in advance that additional journal space will be needed. The percentage of space is controlled by disk.volFracWriteJournal, which defaults to .05. This setting is appropriate for nearly all use cases, but for very small objects (say, under 50K on average), increasing the setting makes sense. The largest value I have ever seen used is .08.
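
To put the fraction in perspective, here is a quick sketch with a hypothetical disk size:

    # Journal size for one volume (hypothetical 8 TB disk).
    disk_bytes = 8e12
    for frac in (0.05, 0.08):
        print(f"volFracWriteJournal={frac}: {disk_bytes * frac / 1e9:.0f} GB journal")
    # 0.05 -> 400 GB, 0.08 -> 640 GB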

What happens when a customer discovers late that they don’t have enough journal space? We added a feature that allows the journal space to be increased just prior to the mounting process. This feature fell into disrepair during the 10.0 – 14.1 releases, however, so for the best experience, use 15.0 or later. To use the feature, you must set disk.volFracWriteJournal to a higher value than what was used when the volume was created. The journal has to be at least 30% full, the volume can’t be so full that it won’t allow journal expansion, and disk.allowJournalExpansion must be set to True. Note that both settings are read-only, per-node settings that must be put in the node.cfg file. If journal expansion is possible and enabled, the next mount will take around an hour longer (depending on various factors) while the expansion is performed. It’s safe to leave these settings in place afterward, as this is a one-time operation.
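
For illustration, the two settings might be staged in node.cfg like the fragment below (example values; verify the exact syntax against your Swarm configuration documentation):

    # node.cfg (example only): request journal expansion on the next mount.
    disk.volFracWriteJournal = 0.08       # higher than the value used at format time
    disk.allowJournalExpansion = True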

© DataCore Software Corporation. · https://www.datacore.com · All rights reserved.