
It’s common for the data footprint of a cluster to grow over time and eventually exceed the original cluster sizing. There are a variety of things to consider to make adding capacity a smooth process.

Start Early

Cluster administrators should monitor space usage over time and not delay adding capacity. Remember that COVID and other economic disruptions often mean delays in getting hardware even after you are ready to cut a purchase order. If your cluster is growing and is over 80% full, it’s already time to start planning a capacity add.
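
If you want to automate that 80% check, below is a minimal monitoring sketch in Python. The management URL and the JSON field names (usedBytes, totalBytes) are hypothetical placeholders, not a documented Swarm API; substitute whatever capacity metrics your Swarm version actually exposes.

```python
# Minimal capacity-watch sketch. Endpoint and field names are hypothetical.
import json
import urllib.request

MGMT_URL = "http://swarm-admin.example.com:91/api/storage/clusters/_self"  # hypothetical
THRESHOLD = 0.80  # start planning a capacity add at 80% full

def check_capacity():
    with urllib.request.urlopen(MGMT_URL) as resp:
        stats = json.load(resp)
    used = stats["usedBytes"]    # hypothetical field name
    total = stats["totalBytes"]  # hypothetical field name
    pct = used / total
    if pct >= THRESHOLD:
        print(f"WARNING: cluster is {pct:.0%} full -- time to plan a capacity add")
    else:
        print(f"Cluster is {pct:.0%} full")

if __name__ == "__main__":
    check_capacity()
```

Run something like this from cron or your monitoring system so the warning fires well before the cluster fills.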

Check your License

Licensed capacity and hardware capacity are different things, and you might need to update your license as part of the capacity add. Note that trapped space does NOT count against your license, and it’s perfectly fine to have more hardware capacity than you are licensed for.

Adding Disks

Most customers start with their nodes fully populated, but if you can afford extra server capacity early, leaving disk slots empty is just fine. When there’s a need for more space, new disks can simply be hot-plugged into a node without downtime. Any failed or retired drive should likewise be swapped for an empty disk over time, and these operations don’t require a reboot.

Adding Nodes

Adding one or more servers at a time is the most common way to add capacity. Adding nodes to a cluster is relatively easy, so I won’t dwell on those steps here. Instead, let’s focus on what happens when a new, empty node is added to a cluster.

Client writes go to all nodes/volumes that have space, so the “Start Early” advice is largely about avoiding the situation where many volumes are full and the remaining ones with space become overloaded. New writes must be spread across different nodes/volumes in the cluster to help protect objects from hardware failures.

In the background, Swarm rebalances the cluster by moving objects from full volumes to less full ones. A couple of settings control this behavior, but relocation is a necessary load on the cluster that can have some impact on client writes. Ideally, the cluster has free space on all volumes so that there are many places to quickly write objects during the inevitable recoveries that happen when disks go bad. The same is true for new writes: having lots of disks with space enables better load balancing and better performance. If cluster performance is important to your business, proactively adding space allows for a longer rebalancing window with less performance impact.
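
To see why proactive adds shrink the rebalancing load, here is a back-of-the-envelope sketch, using made-up volume numbers, of the minimum data that has to move to even out fill levels: every volume above the cluster-wide average fill must shed its excess.

```python
# Back-of-the-envelope estimate of the relocation work a rebalance implies.
# Volume capacities and usage below are illustrative numbers, not real data.

volumes = [
    # (capacity_tb, used_tb)
    (8.0, 7.5), (8.0, 7.2), (8.0, 7.8),   # older volumes, nearly full
    (16.0, 0.0), (16.0, 0.0),             # newly added, empty
]

total_cap = sum(cap for cap, _ in volumes)
total_used = sum(used for _, used in volumes)
target_fill = total_used / total_cap  # uniform fill fraction after rebalancing

# Every volume above the target fill must shed the excess; summing those
# excesses gives the minimum amount of data that has to relocate.
tb_to_move = sum(max(0.0, used - cap * target_fill) for cap, used in volumes)

print(f"Average fill after rebalance: {target_fill:.0%}")
print(f"Minimum data to relocate: {tb_to_move:.1f} TB")
```

With three nearly full 8 TB volumes and two empty 16 TB ones, roughly 12.9 TB has to relocate; the earlier empty capacity is added, the smaller that backlog and the longer the window Swarm has to work through it quietly.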

A more involved technique can be used if all volumes in the cluster are nearly full by the time new nodes are added. The idea is to “sprinkle” empty volumes throughout all the nodes in the cluster so that when the operation is complete, “old” and “new” nodes alike have a mixture of “old” and “new” disks: mostly full disks alongside disks with plenty of available capacity. This option isn’t commonly used because it involves a lot of hands-on work on the cluster during the upgrade.

  • First, make a plan for how many new disks each node in the entire, upgraded cluster will have; typically, one or two new disks per node. Read the next steps and think through exactly where each disk will eventually go in the cluster (see the planning sketch after this list).

  • Next, add the new node(s) to the cluster, populated only with the empty disks each node will have.

  • Iterate over the remaining cluster nodes, taking each one down using normal shutdown operations; this lets the cluster clear any client requests from the node being rebooted. While the node is down, pull the full disks slated for replacement and insert empty disks in their place, then bring the node back up. Meanwhile, hot-plug the pulled disks into the new node; these steps can be done in parallel. At the end of each iteration, you will have added a handful of empty disks while shifting the location of existing disks.

  • Repeat for all the remaining nodes.
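
As mentioned in the first step, here is a rough planning sketch for the sprinkle technique. All node names, slot counts, and disk counts are illustrative assumptions; the point is simply to pair every pull of a full disk from an old node with an empty disk in its place and a destination bay in a new node.

```python
# Rough planning sketch for the "sprinkle" technique. All node names, slot
# counts, and disk counts below are illustrative, not from a real cluster.

old_nodes = ["node1", "node2", "node3", "node4"]  # each fully populated today
new_nodes = ["node5", "node6"]
slots_per_node = 8          # disk bays per chassis
empties_per_old_node = 2    # new empty disks to sprinkle into each old node

plan = []
pulled = []  # full disks pulled from old nodes, tagged by origin node

for node in old_nodes:
    # Pull full disks to free up bays, then fill those bays with empties.
    plan.append(f"{node}: pull {empties_per_old_node} full disk(s), "
                f"insert {empties_per_old_node} empty disk(s)")
    pulled.extend([node] * empties_per_old_node)

# New nodes boot with only empty disks, then absorb the pulled full disks.
per_new_node = len(pulled) // len(new_nodes)
for i, node in enumerate(new_nodes):
    absorbed = pulled[i * per_new_node:(i + 1) * per_new_node]
    empties = slots_per_node - len(absorbed)
    plan.append(f"{node}: start with {empties} empty disk(s), "
                f"then hot-plug {len(absorbed)} full disk(s) "
                f"pulled from {sorted(set(absorbed))}")

print("\n".join(plan))
```

In this example, each old node ends up with six full and two empty disks, and each new node with four of each, so both new writes and recovery traffic have targets spread across the whole cluster.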

At the end of this process, there will still be rebalancing work, but it will be spread more evenly throughout the cluster, and minimal replica movement will be needed to protect the existing data.
