DataCore Swarm Service Provider Guide v1.0
Introduction
Deploying DataCore Swarm to host an online service is a complex project. In our experience, many of the challenges involved have less to do with knowledge of Swarm as a product and more to do with building the “services shell” around it. As such, the questions we’re asked to address tend to fall into the following areas:
Business & service planning (i.e., what service customers will be provided, desired billing metrics, what customer use cases and associated workflows will be supported, etc.), with emphasis on designing for scale
Initial hardware estimates, licensing, identifying necessary third party components that aren’t DataCore Swarm products, etc.
Initial installation and service deployment, along with verification of fitness for purpose
Operational needs associated with service monitoring and capacity planning (i.e., “post install / post deployment”)
To that end, the purpose of this document is to assist partners and customers who wish to deploy DataCore Swarm as an internet hosted storage endpoint service. The main focus will be on considerations for acting as an S3 API storage endpoint, as that is the desired approach for the majority of deployments of this nature.
Service Planning
Targeted Workloads
An important input into service planning involves discovery of anticipated customer workload. Associated with that are client / application workflow requirements. Examples of this for typical deployments include:
S3 Backup Applications (e.g., Veeam, Commvault et al)
Generic S3 object access (e.g., clients such as s3browser, Cyberduck et al)
S3 Virtual drives (i.e., S3 endpoint targets presented as drives / file systems to the client)
Each of these use cases (and others) will have a material impact on the estimate calculations for your service. The more that is known ahead of time, the more accurately you can budget for your initial requirements re: networking, servers, licensing and other areas.
Note that estimate work for handling multiple use cases and client behavior in a single Swarm deployment can prove difficult due to conflicting requirements. It may be necessary to employ a “service silo” approach in situations where the targeted market addresses multiple workloads. We will touch on such strategies as we proceed through this document.
Hardware Estimates
Once all reasonable effort is completed in determining target use cases and associated workloads, the next step is to estimate what hardware will be needed to support the first group of customers. Ideally, this group will all have the same or very similar profiles, which should help normalize the effort. In the case of DataCore Swarm, hardware and networking estimates address these key components in the product architecture:
Swarm Gateway - Once traffic has been passed from the outside through the necessary load balancer and/or firewall layers, this component acts as the touch point for all customer interactions. The Gateway layer is primarily sized against estimated levels of concurrency. Specifically, it is treated as a “scale out” layer in the overall architecture (think in terms of “web server farm”).
Swarm Storage - This is the component that holds the customer data in the form of object data. Sizing for Storage is primarily driven by inputs related to average object size ingested, chosen data protection policies (e.g., replication or erasure coding schemes), and resulting total object count. The Storage component is designed to be a “scale up / scale out” architecture, so adding more load usually involves adding more storage nodes, more storage volumes, or both to address it.
Swarm Search (Elasticsearch) - This component acts as a cache of the metadata associated with the objects stored in Swarm Storage. Its primary role is to facilitate list and query operations on sets of data which have certain metadata characteristics. Since this is Elasticsearch under the covers, sizing estimates are driven mostly by the number of logical objects present in Swarm Storage, along with the average size of the metadata headers associated with those objects. These two primary inputs, along with anticipated list and query activity, drive the index and sharding approaches that will be used by the Elasticsearch cluster.
Note that DataCore can assist with your deployment sizing questions as needed.
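As a rough, back-of-the-envelope illustration of how these sizing inputs interact, the following Python sketch derives raw capacity and logical object count from an assumed logical data volume, average object size, and protection policy. This is not DataCore sizing guidance; every value below is a placeholder to be replaced with your own discovery results, and the overhead factors are the generic ones for replication and erasure coding (k data segments plus p parity segments).
# Rough sizing sketch: all inputs are assumptions, not DataCore guidance.
def protection_overhead(scheme: str, k: int = 0, p: int = 0, replicas: int = 0) -> float:
    if scheme == "replication":
        return float(replicas)        # e.g., 3 full copies -> 3.0x raw expansion
    if scheme == "ec":
        return (k + p) / k            # e.g., 5:2 erasure coding -> 1.4x raw expansion
    raise ValueError("unknown protection scheme")

logical_tb = 500                      # assumed logical data for the first customer group
avg_object_mb = 4                     # assumed average object size
overhead = protection_overhead("ec", k=5, p=2)

raw_tb = logical_tb * overhead                             # drives Storage and license sizing
object_count = logical_tb * 1024 * 1024 / avg_object_mb   # drives Elasticsearch sizing
print(f"Estimated raw capacity: {raw_tb:.0f} TB")
print(f"Estimated logical object count: {object_count:,.0f}")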
Fault Tolerance Considerations (Durability / Availability)
As part of the overall estimate, you should also consider any requirements related to service availability, performance under degraded operation, and other service level agreement (SLA) drivers which could affect the overall solution. Examples of these can include, but are not limited to:
“We need to be able to handle X number of client requests while performing routine maintenance”
“We need to provide full service availability during both planned and unplanned downtime”
“We need to be able to maintain overall service levels while N number of Gateway/Storage/Elasticsearch systems are unavailable”
As a general rule, if requirements are hard to quantify, a good starting point is the “plus one” approach. This method takes the initial calculated result for hardware (server count etc.) and then adds one. So, for example, if it’s estimated that 6 storage servers are needed to support the first group of customers, then budget for purchasing 7 storage servers up front (same for Gateway etc.). At minimum, this allows for performing rotating scheduled maintenance for each of the servers in the solution while minimizing impact to the customers (if not making such maintenance completely transparent).
It’s worth noting at this stage of planning that a distinction should always be made between “high availability” and “performance service level agreement (SLA)” targets. High availability is formally defined as the deployment’s ability to operate continuously without interruption, even in the event of component failures. Operating continuously at a guaranteed performance level is treated as a separate exercise, even though some aspects of high availability promote this (e.g., load balancing).
Most deployments focus on being highly available with the understanding that component failures will introduce performance degradation. Understanding how sensitive your targeted customers / applications are to this will be key to your overall success. This may involve “over engineering” certain characteristics of the deployment, so if performance targets are a goal they will need to be budgeted for. Even then, guaranteeing performance in a shared service can prove to be a difficult task.
Capacity Projections
Another item to consider when budgeting for initial outlay is anticipated service growth. This usually takes the form of a business driver communicated as, “we would also like to estimate hardware resource requirements for a growth rate of 20% per year for three years”. This can be treated as a simple compound factor multiplied by the initial “year one” estimate. For this example, that would look like the following:
(initial hardware resources) * (1.2)^3 = (projected resources for 20% growth per year over three years)
With the initial hardware outlay and projected numbers in hand, a more educated budget projection for supporting the service can be determined for a given time frame and customer onboard rate.
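As a quick worked instance of that formula (the starting capacity is an assumed placeholder):
# Compound growth projection matching the formula above.
initial_raw_tb = 700        # assumed "year one" raw capacity estimate
growth_rate = 0.20          # 20% growth per year
years = 3
projected_tb = initial_raw_tb * (1 + growth_rate) ** years
print(f"Projected raw capacity after {years} years: {projected_tb:.0f} TB")   # ~1210 TB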
License Estimate
Once you’ve completed the capacity estimates in the prior sections, you can then derive what your necessary “day one” licensed capacity will be. Note that, in practice, Swarm Storage licenses are checked against the amount of raw storage available in the cluster. This can be as simple as:
The total volume capacity per the storage cluster estimate
The total volume capacity, multiplied by the forward capacity factor (as outlined above)
A percentage of the total volume capacity (this is usually employed to lessen the initial outlay for the license cost, e.g., licensing for 50% of the estimated volume capacity for “day one”)
Naturally, a business decision will need to be made to determine which approach for licensing is desired. We won’t cover the details related to CSP/MSP pricing here, other than to note that your account representative can help translate the calculations for raw capacity into a suitable license for your deployment.
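For illustration, the three approaches above can be compared side by side; the capacity figure, growth factor, and day-one percentage are assumptions carried over from the earlier sketches:
# Illustrative comparison of the licensing approaches described above.
total_volume_tb = 700                # total raw volume capacity per the cluster estimate
forward_factor = 1.2 ** 3            # forward capacity factor from the growth projection
day_one_fraction = 0.5               # license only 50% of estimated capacity on day one
options = {
    "full estimated capacity": total_volume_tb,
    "capacity with forward growth factor": total_volume_tb * forward_factor,
    "percentage of capacity for day one": total_volume_tb * day_one_fraction,
}
for name, tb in options.items():
    print(f"{name}: {tb:.0f} TB licensed")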
Third Party Component Requirements
Some key third party components that are needed to support “Swarm as a Service” can include, but are not limited to:
Load Balancer(s) - it is strongly recommended to use one of the many commercially available load balancer solutions to act as the public touch point for the service front end. Some examples include:
F5 BIG-IP
A10 Thunder (A10 Networks)
HAProxy Enterprise
Firewall(s) - as with the load balancer layer, if a dedicated firewall is required then commercially available solutions are strongly recommended, for example:
Cisco ASA / Firepower
Juniper SRX
Check Point Security Appliance
Operating system images (and, if necessary, associated support licenses) - the operating system dependencies for all Swarm components (Storage Cluster Services / SCS, Gateway, Elasticsearch et al) are outlined in DataCore’s documentation and should be adhered to unless otherwise directed by DataCore Support. More details can be found in the requirements sections of the Swarm Deployment documentation
Network Time Servers (NTP) - this piece is critical to the overall function of the solution and has specific requirements related to it. We recommend familiarizing yourself with Setting Up the Network Services | Setting Up NTP for Time Synchronization as part of service planning. As a best practice, you will want to have all of your service layers time synchronized end-to-end for logging triage and service event reconciliation. Note that Swarm Storage in particular logs information using UTC time as opposed to local time, so this will have to be accounted for in any troubleshooting or audit inspection.
Networking and Namespace Conventions
Planning your network requirements for a Swarm service architecture usually involves the following areas:
Network Address Allocation - this will need to be defined end-to-end and will involve allocations for public and private address space, including such areas as:
Service front end (public IP address space)
Load balancer / firewall public address allocation and DMZ subnet(s), including Gateway address assignments and Swarm management server front end (SCS)
Note that a route to your authentication store (LDAP, Active Directory et al) will also need to be made available for the Gateway systems
Storage subnet assignment - this includes the IP address allocation for the Swarm management system back end (SCS) and the address pool to be used by both Swarm storage and Elasticsearch nodes
Name Service / Naming Conventions (DNS) - As a best practice, you will want to have an appropriate DNS zone in place to support the proposed deployment. This is required for outside clients to resolve your service and, more importantly, to be properly routed to the storage domain that has been provisioned for them. The end-to-end naming convention should be as follows:
Fully qualified domain name (FQDN) resolution in public DNS for the customer’s provisioned storage domain (e.g., “customer01.myservice.net”)
FQDN formatted name for the customer’s provisioned storage domain in Swarm that maps with the external DNS record (i.e., create a storage domain for the customer named “customer01.myservice.net”)
This approach allows for correct host header mapping via Gateway access and ensures that the customer properly references the stored data associated with their account (a brief client-side sketch follows this list)
Note that a wildcard DNS record (e.g., “*.myservice.net”) can be employed to accommodate provisioning for a group of customers. This approach is encouraged for onboarding customers at scale, and it also has ramifications for SSL/TLS certificate management for the service (see below)
SSL/TLS Certificate Management - as mentioned previously, “HTTPS” will likely be the access protocol used by your customers’ client software. Use of wildcard certificates for this purpose is usually employed, as managing individual certificates for each customer’s domain can prove cumbersome. As with DNS, the choice of supporting “vanity domains” is left as a business decision to the reader.
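To tie the naming convention and HTTPS endpoint together, here is a minimal client-side sketch using Boto3; the domain name, credentials, and the use of path-style bucket addressing are assumptions for illustration. The key point is that the FQDN the client connects to is the same string as the provisioned Swarm storage domain, so the Host header routes the request to the correct domain at the Gateway.
# Minimal client-side sketch: placeholder endpoint and credentials.
import boto3
from botocore.config import Config

endpoint = "https://customer01.myservice.net"     # public DNS name == Swarm storage domain name
s3 = boto3.client(
    "s3",
    endpoint_url=endpoint,
    aws_access_key_id="CUSTOMER01_TOKEN",         # placeholder S3 token
    aws_secret_access_key="CUSTOMER01_SECRET",    # placeholder S3 secret
    config=Config(s3={"addressing_style": "path"}),
)
# Every request now carries "Host: customer01.myservice.net", which the Gateway
# uses to select the matching storage domain for this customer.
print(s3.list_buckets()["Buckets"])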
Network Port & Protocol Mapping
Note that Swarm’s port & protocol requirements are documented in Setting Up the Swarm Network | Network Communications.
It’s strongly recommended to have a logical network diagram produced that incorporates these port & protocol requirements. This will help later for determining network level workflow, address / port / protocol exposure necessary for firewall and load balancer rules, audit materials, etc.
Installation Approach
With the planning exercise from the previous section in hand, we now can focus on executing the installation of the deployment. This is typically done in the following order:
Networking Layout
This involves implementing the routing, IP address numbering, and namespace plan in the network layer and naming services, specifically:
Preparing the network switch layer for hosting of Swarm storage, with emphasis on how multicast traffic will be handled in the storage cluster subnet (see notes on Swarm port & protocol requirements as outlined in Setting Up the Swarm Network)
Setting up appropriate routing and firewall rules for the service
Configuring the necessary DNS zones to handle both external and internal name mapping requirements
Configuring the load balancer layer with necessary server pool configuration for the Gateway servers
Configuring the load balancer with the necessary SSL/TLS certificates for SSL/TLS service termination and offload
Note that Gateway itself cannot handle SSL/TLS connection requests, which is why offload must be done at the load balancer layer
Environment Validation
Before installing the Swarm components, it’s strongly recommended at this stage to perform a validation of networking and systems in place. This work can include, but is not limited to:
Testing network connectivity between the various components and noting the results (e.g., iperf2 / iperf3 testing for networking between all the component relationships)
Testing disk I/O characteristics on the storage servers, Elasticsearch servers, and Gateway servers (e.g., using ‘fio’ or similar tools to profile raw disk performance and file system performance where appropriate)
Having this information in hand before the installation of the Swarm stack will help set expectations for performance in the overall cluster. It is also useful for identifying unforeseen bottlenecks in the infrastructure and third party component layouts. In our experience, many of the issues reported as poor Swarm performance have their roots in environment issues, so it’s wise to test for this up front before the software stack and services are in place.
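A minimal scripted sketch of such a connectivity sweep is shown below; it assumes iperf3 servers (“iperf3 -s”) are already running on the target hosts, and the host names are placeholders:
# Sweep iperf3 throughput from this host to each component host and report results.
import json
import subprocess

targets = ["gateway01.internal", "storage01.internal", "search01.internal"]   # placeholders
for host in targets:
    result = subprocess.run(
        ["iperf3", "-c", host, "-t", "10", "-J"],   # 10-second test, JSON output
        capture_output=True, text=True, check=True,
    )
    report = json.loads(result.stdout)
    gbits = report["end"]["sum_received"]["bits_per_second"] / 1e9
    print(f"{host}: {gbits:.2f} Gbit/s received")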
Swarm Server & Service Installation
This involves the following steps, usually done in order:
Installing Swarm Cluster Services (SCS) on the designated server to support the initial bootstrap of the storage cluster, including the install of the storage license that was issued
Performing a network boot (PXE boot) of the storage servers from the SCS server and confirming that the storage nodes are in working order and reflect the configuration defined during the SCS install
Installation of Swarm Search on the designated Elasticsearch servers and, once confirmed to be in working order, configuring the necessary search feed in Swarm Storage to point to the Elasticsearch cluster
Installation of Swarm Gateway on the designated Gateway servers, including confirmation that integration with the designated authentication store, Elasticsearch cluster, and storage cluster is in working order
Configuring a test account in the designated authentication store, along with a storage domain that can be used for verifying storage service functionality
This includes configuring an S3 token and secret for this account, so that testing with S3 clients can be performed
With the above steps completed, functional end-to-end service verification can be performed by interacting with the public service touch point
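A minimal smoke test along these lines, using Boto3 against the test account and storage domain provisioned above (the endpoint, token, and secret are placeholders):
# End-to-end smoke test: create a bucket, write, read back, list, and clean up.
import boto3
from botocore.config import Config

s3 = boto3.client(
    "s3",
    endpoint_url="https://testaccount.myservice.net",   # placeholder test domain
    aws_access_key_id="TEST_TOKEN",
    aws_secret_access_key="TEST_SECRET",
    config=Config(s3={"addressing_style": "path"}),
)
bucket, key, payload = "smoke-test", "hello.txt", b"hello swarm"
s3.create_bucket(Bucket=bucket)
s3.put_object(Bucket=bucket, Key=key, Body=payload)
assert s3.get_object(Bucket=bucket, Key=key)["Body"].read() == payload
print([obj["Key"] for obj in s3.list_objects_v2(Bucket=bucket).get("Contents", [])])
s3.delete_object(Bucket=bucket, Key=key)     # clean up so verification leaves no residue
s3.delete_bucket(Bucket=bucket)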
Swarm Telemetry and Service Baseline
At this stage, you can now deploy Swarm Telemetry (as outlined in the Swarm Telemetry Install Guide). This will allow you to gather metrics from the deployment to establish baseline behavior for the service. As a best practice, representative workloads should be run after this step to profile overall service operation. This can be done through use of the actual client software itself (e.g., a Veeam test load) or via synthetic means (e.g., MinIO Warp, AWS Boto3, rclone, or similar).
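As a very small example of the synthetic approach (dedicated tools such as MinIO Warp or rclone are better suited to sustained testing), the Boto3 sketch below pushes a batch of fixed-size objects with a thread pool and reports aggregate throughput. The endpoint, credentials, bucket, object size, and concurrency are all placeholder assumptions, and the bucket is assumed to already exist:
# Tiny synthetic PUT workload for baselining; tune size/count/concurrency to
# mirror the workload you expect to host.
import concurrent.futures
import os
import time

import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="https://testaccount.myservice.net",   # placeholder test domain
    aws_access_key_id="TEST_TOKEN",
    aws_secret_access_key="TEST_SECRET",
)
BUCKET, OBJECT_SIZE, COUNT, WORKERS = "baseline", 4 * 1024 * 1024, 200, 8
payload = os.urandom(OBJECT_SIZE)

def put_one(i: int) -> None:
    s3.put_object(Bucket=BUCKET, Key=f"baseline/obj-{i:05d}", Body=payload)

start = time.monotonic()
with concurrent.futures.ThreadPoolExecutor(max_workers=WORKERS) as pool:
    list(pool.map(put_one, range(COUNT)))
elapsed = time.monotonic() - start
print(f"{COUNT} x {OBJECT_SIZE >> 20} MiB objects in {elapsed:.1f}s "
      f"({COUNT * OBJECT_SIZE / elapsed / 1e6:.1f} MB/s aggregate)")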
Support Registration
Finally, in preparation for going live with your service, you will want to confirm that you’re properly registered with DataCore Support. This involves verifying that support access for the team has been set up in advance, understanding where supporting documentation such as knowledge base articles resides (as found at the Knowledge Base page), and finalizing all documentation related to the specifics of your deployment. Should an escalation be required, this will help ensure that Support response is prompt and that there is backing knowledge related to your deployment which can be quickly referenced.
Post-Deployment Operations
Runbooks and Changelogs
As use of the service progresses over time, it’s good to develop best practices related to day-to-day operations. One approach we can recommend is developing a runbook in conjunction with changelog tracking. This can assist in efficiently handling operational tasks and service escalation response. The purpose of both is outlined below:
Runbook
Provides focus for handling routine tasks and troubleshooting inspection
Content includes service diagrams, step-by-step instructions, and guides
The primary audience is IT service engineers or other operations personnel who need to execute procedures and respond to escalations quickly
Changelog
Focuses on change management (what changes were made, and why)
Content includes list of modifications, dates, authors, reasons for the changes, and appropriate sign-off from management
This addresses a broader audience that can include developers / DevOps, testers, project managers, up to and including customers who need to understand the evolution of the service
Many times, it’s tempting to blur the lines between the two. These functions are different, however, and those differences should be respected to ensure correct processes are in place. The main distinctions to keep clear are:
Runbooks guide actions, while Changelogs record history
Runbooks are highly detailed and focus on prescribed procedures, whereas Changelogs are more concise and descriptive (i.e., what happened and why)
Runbooks are process oriented, Changelogs are event oriented
Combined, these two practices provide the operational discipline needed to run a robust and reliable online service.
Logging and Auditing
The various components of DataCore Swarm are set up to generate system logging traffic in the form of syslog output. By default, these logs are primarily generated in the following areas:
Local log files on the Swarm Cluster Services (SCS) server
Local log files on each individual Gateway server
Remote logging of syslog traffic from the Swarm Storage nodes to the SCS
As a best practice, it’s worth considering setting up a centralized syslog target that can capture this information from each of the components. A good starting point for determining how this can be designed can be found at Configuring External Logging, in addition to the various knowledge base articles made available around this subject.
Note that, in the event that DataCore Support asks for tech support bundles to be generated, it will also be necessary to include the logs captured at any centralized log server in use.
Capacity Monitoring and Upgrade Planning
DataCore Swarm provides access to multiple metrics that can be tracked in a deployment. From a capacity planning perspective, however, there are some key items that you will want to monitor over time to determine when service expansion is needed. As mentioned in previous sections, these items can be easily inspected using your (by now fully populated) Prometheus database and Grafana dashboards. Access to the latest dashboards can be found at Search grafana.com | Grafana Labs, and the same metrics can also be polled programmatically, as sketched after the list below.
Key items of note for capacity planning fall into the following areas:
Storage capacity - This relates directly to license requirements and determining when extra storage nodes and/or volumes may be needed. A dashboard is also available that provides insights into health processor (HP) activity, which relates to inspection of data durability checks and other functions gated by HP cycles in the storage cluster. Overall storage node server loads can be ascertained from this information as well.
Swarm Search / Elasticsearch - These metrics can be used to inspect the total logical object count in the storage cluster, Elasticsearch index growth as objects are ingested, Elasticsearch disk capacity thresholds, indexing / list / query rates over time, shard counts, and other items that can be profiled to determine if extra capacity is warranted for the Search component.
Swarm Gateway - The items tracked with Gateway metrics relate to session concurrency levels, type and number of API calls being handled for clients, and other areas related to Gateway server load. These can be used to determine if the current Gateway pool is becoming overloaded and “scale out” capacity is necessary.
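As mentioned above, the same data can be polled programmatically from the Prometheus database that backs the dashboards. The sketch below uses the standard Prometheus HTTP query API; the server address, metric name, and threshold are hypothetical placeholders to be replaced with the metrics your Swarm Telemetry install actually exposes:
# Poll a capacity metric from Prometheus and flag nodes approaching a threshold.
import requests

PROMETHEUS = "http://telemetry01.internal:9090"      # placeholder telemetry server
QUERY = "swarm_volume_space_used_percent"            # hypothetical metric name
resp = requests.get(f"{PROMETHEUS}/api/v1/query", params={"query": QUERY}, timeout=10)
resp.raise_for_status()
for sample in resp.json()["data"]["result"]:
    instance = sample["metric"].get("instance", "unknown")
    used_pct = float(sample["value"][1])
    flag = "  <-- plan expansion" if used_pct > 80.0 else ""
    print(f"{instance}: {used_pct:.1f}% used{flag}")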
Software and License Upgrade Considerations
As time goes on and the service matures, it will be necessary to occasionally plan for software upgrades. From the perspective of DataCore Swarm, the components you would receive upgrades for include:
Swarm Cluster Services (SCS)
Swarm Storage
Gateway
Swarm Search (in the form of Elasticsearch upgrades, schema upgrades, etc.)
Of the above, only Swarm Storage requires a valid license file in order to function. As such, it is the only license that needs to be considered for any capacity upgrades. Fortunately, Swarm Storage nodes are designed to poll for license updates and do not require being restarted in order to apply an updated license with new capacity information. Thus, a license upgrade operation should be treated as transparent.
From a software upgrade perspective as a whole, it’s strongly recommended to review the release notes that ship with the new software bundles to determine what dependencies & impacts are involved. This can include but is not limited to:
Potential upgrades needed for the base operating system or other third party components (e.g., Java)
Whether an upgrade can be performed without planned down time / maintenance windows
Where down time may be necessary for the upgrade, determine if any guidance is available re: how long the upgrade operation will take
Whether any data conversion will be necessary as part of the upgrade (e.g., the need to create a new search feed for a schema change)
It is also recommended to work with your assigned Solution Architects and the DataCore Support teams as part of upgrade planning. They can review your service requirements and help identify areas where you may need to perform actions such as customer notifications for service maintenance, where you may encounter pain points in the process depending on the amount of data you have under management coupled with client activity, and other criteria of note.
Service Silos (“Divide and Conquer”)
As mentioned near the beginning of this document, you may encounter a situation where it’s effectively impossible to reconcile the needs of different workloads & workflows in a single DataCore Swarm deployment. For example, you may have conflicting customer activity (a.k.a. “the noisy neighbor” problem) which can’t be addressed with single stack tuning.
It’s possible in such situations to deploy multiple Swarm stacks within your back end service to split such activity out. Note that our strategy thus far has been to route customers to their appropriate path through front end load balancer evaluation, which then routes the customer through their designated Gateway pool and ultimately to their back end storage target. A comprehensive and robust load balancer solution will scale well in handling these “host header” level evaluation decisions, allowing you to set up the necessary segregated Swarm stacks to split up work.
As the service scales to a large pool of customers with diverse requirements, setting up different Swarm pools on the back end will allow you to scale out in addition to scaling vertically within single Swarm installs. Naturally, this will have ramifications in the following key areas:
License Management - each Swarm deployment will need its own license
Hardware Sizing - each of the deployments will need to be properly designed for the anticipated work that will be re-routed to it from the original cluster
Data Migration - moving the “noisy” customer’s data from the original deployment to the new deployment
As outlined before, capacity monitoring and planning along with an eye on overall Swarm load will provide indicators for when such work may prove necessary.
© DataCore Software Corporation. · https://www.datacore.com · All rights reserved.