
Introduction

Deploying DataCore Swarm for the purpose of hosting an online service is a complex project. In our experience, many of the challenges involved have less to do with knowledge of Swarm as a product and more to do with building the “services shell” around Swarm. As such, many of the questions we’re asked to address tend to fall into the following areas:

...

To that end, the purpose of this document is to assist partners and customers who wish to deploy DataCore Swarm as an internet-hosted storage endpoint service. The main focus will be on considerations for acting as an S3 API storage endpoint, as that is the desired approach for the majority of deployments of this nature.

Service Planning

Targeted Workloads

An important input into service planning involves discovery of anticipated customer workload. Associated with that are client / application workflow requirements. Examples of this for typical deployments include:

...

Note that estimating for multiple use cases and client behaviors in a single Swarm deployment can prove difficult due to conflicting requirements. It may be necessary to employ a “service silo” approach in situations where the targeted market addresses multiple workloads. We will touch on such strategies as we proceed through this document.

Hardware Estimates

Once a reasonable effort has been made to determine target use cases and associated workloads, the next step is to estimate what hardware will be needed to support the first group of customers. Ideally, this group will all have the same or very similar profiles, which should help normalize the effort. In the case of DataCore Swarm, hardware and networking estimates address these key components in the product architecture:

...

Note that DataCore can assist with your deployment sizing questions as needed.

Fault Tolerance Considerations (Durability / Availability)

As part of the overall estimate, you should also consider any requirements related to service availability, performance under degraded operation, and other service level agreement (SLA) drivers which could affect the overall solution. Examples of these can include, but are not limited to:

...

Most deployments focus on being highly available with the understanding that component failures will introduce performance degradation. Understanding how sensitive your targeted customers / applications are to this will be key to your overall success. Meeting performance targets may require “over-engineering” certain characteristics of the deployment, and this will need to be budgeted for. Even then, guaranteeing performance in a shared service can prove to be a difficult task.

Capacity Projections

Another item to consider when budgeting for initial outlay is anticipated service growth. This usually takes the form of a business driver communicated as, “we would also like to estimate hardware resource requirements for a growth rate of 20% per year for three years”. This can be treated as a simple compound factor multiplied by the initial “year one” estimate. For this example, that would look like the following:

...

With the initial hardware outlay and projected numbers in hand, a more educated budget projection for supporting the service can be determined for a given time frame and customer onboard rate.
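
As a minimal sketch of the compounding arithmetic (the 500 TB year-one figure below is hypothetical; the 20% annual growth rate matches the example above):

    # Hypothetical figures for illustration only
    year_one_tb = 500.0        # initial "year one" capacity estimate (TB)
    annual_growth = 0.20       # 20% growth per year
    years = 3

    capacity = year_one_tb
    for year in range(1, years + 1):
        capacity *= (1 + annual_growth)   # compound the growth factor
        print(f"Year {year}: {capacity:.0f} TB")
    # Year 1: 600 TB, Year 2: 720 TB, Year 3: 864 TB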

License Estimate

Once you’ve completed the capacity estimates in prior sections, you can then derive what your necessary “day one” licensed capacity will be. Note that, in practice, Swarm Storage licenses check for the use of a given amount of raw storage available in the cluster. This can be as simple as:

...

Naturally, a business decision will need to be made to determine which approach for licensing is desired. We won’t cover the details related to CSP/MSP pricing here, other than to note that your account representative can help translate the calculations for raw capacity into a suitable license for your deployment.
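
For rough planning purposes, a minimal sketch of deriving licensed raw capacity from a projected logical capacity might look like the following (the replication factor and overhead figures are hypothetical assumptions; your actual protection scheme and license terms will differ):

    # Hypothetical protection settings for illustration only
    logical_tb = 600.0            # projected logical (customer-visible) data, TB
    replicas = 2                  # e.g., a scheme keeping 2 replicas per object
    overhead = 0.05               # headroom for trapped space, metadata, etc.

    raw_tb = logical_tb * replicas * (1 + overhead)
    print(f"Estimated raw capacity to license: {raw_tb:.0f} TB")   # ~1260 TB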

Third Party Component Requirements

Some key third party components that are needed to support “Swarm as a Service” can include, but are not limited to:

  • Load Balancer(s) - it is strongly recommended to use one of the many commercially available load balancer solutions to act as the public touch point for the service front end. Some examples include:

    • F5 BIG-IP

    • A10 Networks

    • HAProxy Enterprise

  • Firewall(s) - in similar approach to the load balancer needs, if a dedicated firewall is required then commercially available solutions are strongly recommended, for example:

    • Cisco ASA / Firepower

    • Juniper SRX

    • Check Point Security Appliance

  • Operating system images (and, if necessary, associated support licenses) - the operating system dependencies for all Swarm components (Swarm Cluster Services / SCS, Gateway, Elasticsearch et al) are outlined in DataCore’s documentation and should be adhered to unless otherwise directed by DataCore Support. More details can be found in the requirements sections of the Swarm Deployment documentation

  • Network Time Servers (NTP) - this piece is critical to the overall function of the solution and has specific requirements related to it. We recommend familiarizing yourself with https://perifery.atlassian.net/wiki/spaces/public/pages/2443808587/Setting+Up+the+Network+Services#Setting-Up-NTP-for-Time-Synchronization as part of service planning. As a best practice, you will want to have all of your service layers time synchronized end-to-end for logging triage and service event reconciliation. Note that Swarm Storage in particular logs information using UTC time as opposed to local time, so this will have to be accounted for in any troubleshooting or audit inspection.
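
For example, when reconciling a customer-reported incident time against Swarm Storage logs, a quick conversion of the UTC log timestamp to local time can be done along these lines (the timestamp shown is illustrative; check the actual format used in your logs):

    from datetime import datetime, timezone

    # Illustrative UTC timestamp as it might appear in a storage log line
    log_time = datetime.strptime("2024-06-01T14:35:22", "%Y-%m-%dT%H:%M:%S")
    log_time = log_time.replace(tzinfo=timezone.utc)

    # Convert to the local time zone of the operator performing triage
    print("UTC:  ", log_time.isoformat())
    print("Local:", log_time.astimezone().isoformat())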

Networking and Namespace Conventions

Planning your network requirements for a Swarm service architecture usually involves the following areas:

  • Network Address Allocation - this will need to be defined end-to-end and will involve allocations for public and private address space, including such areas as:

    • Service front end (public IP address space)

    • Load balancer / firewall public address allocation and DMZ subnet(s), including Gateway address assignments and Swarm management server front end (SCS)

      • Note that a route to your authentication store (LDAP, Active Directory et al) will also need to be made available for the Gateway systems

    • Storage subnet assignment - this includes the IP address allocation for the Swarm management system back end (SCS) and the address pool to be used by both Swarm storage and Elasticsearch nodes

  • Name Service / Naming Conventions (DNS) - As a best practice, you will want to have an appropriate DNS zone in place to support the proposed deployment. This is required for outside clients to resolve your service and, more importantly, to be properly routed to the storage domain that has been provisioned for them. The end-to-end naming convention should be as follows (a minimal resolution check is sketched after this list):

    • Fully qualified domain name (FQDN) resolution in public DNS for the customer’s provisioned storage domain (e.g., “customer01.myservice.net”)

    • FQDN formatted name for the customer’s provisioned storage domain in Swarm that maps with the external DNS record (i.e., create a storage domain for the customer named “customer01.myservice.net”)

    • This approach allows for correct host header mapping via Gateway access and ensures that the customer properly references the stored data associated with their account

    • Note that a wildcard DNS record (e.g., “*.myservice.net”) can be employed to accommodate provisioning for a group of customers; this approach is encouraged for onboarding customers at scale, though it also has ramifications for SSL/TLS certificate management for the service

  • SSL/TLS Certificate Management - as mentioned previously, “HTTPS” will likely be the access protocol used by your customers’ client software. Wildcard certificates are usually employed for this purpose, as managing individual certificates for each customer’s domain can prove cumbersome. As with DNS, the choice of supporting “vanity domains” is left as a business decision to the reader.
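
Before onboarding a customer against this convention, it can be worth a quick check that the provisioned storage domain resolves publicly to the service front end. A minimal sketch (the domain name is hypothetical):

    import socket

    # Hypothetical customer storage domain following the FQDN convention
    storage_domain = "customer01.myservice.net"

    # Resolve the name as an outside client would; it should land on the
    # public IP(s) of the load balancer / service front end
    addresses = sorted({info[4][0] for info in socket.getaddrinfo(storage_domain, 443)})
    print(f"{storage_domain} resolves to: {', '.join(addresses)}")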

Network Port & Protocol Mapping

Note that Swarm’s port & protocol requirements are documented here: https://perifery.atlassian.net/wiki/spaces/public/pages/2443808571/Setting+Up+the+Swarm+Network#Network-Communications

It’s strongly recommended to produce a logical network diagram that incorporates these port & protocol requirements. This will help later when determining network-level workflow, the address / port / protocol exposure necessary for firewall and load balancer rules, audit materials, etc.

Installation Approach

With the planning exercise from the previous section in hand, we now can focus on executing the installation of the deployment. This is typically done in the following order:

Networking Layout

This involves implementing the routing, IP address numbering, and namespace plan in the network layer and naming services, specifically:

  • Preparing the network switch layer for hosting of Swarm storage, with emphasis on how multicast traffic will be handled in the storage cluster subnet (see notes on Swarm port & protocol requirements as outlined in Setting Up the Swarm Network)

  • Setting up appropriate routing and firewall rules for the service

  • Configuring the necessary DNS zones to handle both external and internal name mapping requirements

  • Configuring the load balancer layer with necessary server pool configuration for the Gateway servers

  • Configuring the load balancer with the necessary SSL/TLS certificates for SSL/TLS service termination and offload

    • Note that Gateway itself cannot handle SSL/TLS connection requests, which is why offload must be done at the load balancer layer
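
Once the load balancer is configured, a quick way to confirm that SSL/TLS termination is in place for the service front end is to inspect the certificate presented on port 443. A minimal sketch (the host name below is hypothetical):

    import socket
    import ssl

    # Hypothetical service front end behind the load balancer
    host = "customer01.myservice.net"

    context = ssl.create_default_context()
    with socket.create_connection((host, 443), timeout=10) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            cert = tls.getpeercert()
            print("Protocol:", tls.version())
            print("Subject: ", dict(item[0] for item in cert["subject"]))
            print("Expires: ", cert["notAfter"])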

Environment Validation

Before installing the Swarm components, it’s strongly recommended at this stage to perform a validation of networking and systems in place. This work can include, but is not limited to:

...

Having this information in hand before the installation of the Swarm stack will help set expectations for performance in the overall cluster. It is also useful for identifying unforeseen bottlenecks in the infrastructure and third-party component layouts. In our experience, many reported issues related to poor Swarm performance have roots in environment problems, so it’s wise to test for this up front before the software stack and services are in place.
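
As one small example of this kind of pre-installation check, the sketch below verifies that a server’s clock offset from your designated NTP source is within tolerance (it uses the third-party ntplib package, and the NTP host name and tolerance are hypothetical):

    # Requires the third-party ntplib package: pip install ntplib
    import ntplib

    ntp_server = "ntp.myservice.net"   # hypothetical NTP source for the deployment
    max_offset_seconds = 0.5           # hypothetical tolerance

    client = ntplib.NTPClient()
    response = client.request(ntp_server, version=3)
    print(f"Clock offset from {ntp_server}: {response.offset:.3f}s")
    if abs(response.offset) > max_offset_seconds:
        print("WARNING: clock drift exceeds tolerance; fix NTP before installing Swarm")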

Swarm Server & Service Installation

This involves the following steps, usually done in order:

  • Installing Swarm Cluster Services (SCS) on the designated server to support the initial bootstrap of the storage cluster, including the install of the storage license that was issued

  • Performing a network boot (PXE boot) of the storage servers from the SCS server and confirming the storage nodes are in working order and reflect the configuration defined during the SCS install

  • Installation of Swarm Search on the designated Elasticsearch servers and, once confirmed to be in working order, configuring the necessary search feed in Swarm Storage to point to the Elasticsearch cluster

  • Installation of Swarm Gateway on the designated Gateway servers, including confirmation that integration with the designated authentication store, Elasticsearch cluster, and storage cluster are in working order

  • Configuring a test account in the designated authentication store, along with a storage domain that can be used for verifying storage service functionality

    • This includes configuring an S3 token and secret for this account, so that testing with S3 clients can be performed

  • With the above steps completed, functional end-to-end service verification can be performed through interacting with the public service touch point
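
A minimal end-to-end verification through the public touch point might look like the following boto3 sketch; the endpoint, bucket name, and credentials are placeholders for the test account and storage domain created above:

    import boto3

    # Placeholders: the test storage domain plus the S3 token / secret
    # configured for the test account in the previous step
    s3 = boto3.client(
        "s3",
        endpoint_url="https://customer01.myservice.net",
        aws_access_key_id="TEST_TOKEN_ID",
        aws_secret_access_key="TEST_SECRET",
    )

    s3.create_bucket(Bucket="verify-bucket")
    s3.put_object(Bucket="verify-bucket", Key="hello.txt", Body=b"end-to-end check")
    body = s3.get_object(Bucket="verify-bucket", Key="hello.txt")["Body"].read()
    assert body == b"end-to-end check"
    print("End-to-end S3 verification succeeded")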

Swarm Telemetry and Service Baseline

At this stage, you can now deploy Swarm Telemetry (as outlined in the Swarm Telemetry Install Guide). This will allow you to gather metrics in the deployment to establish baseline behavior for the service. As a best practice, representative workloads should be run after this step to profile overall service operation. This can be done with the actual client software itself (e.g., a Veeam test load) or via synthetic means (e.g., MinIO Warp, AWS Boto3, rclone, or similar).
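
As a trivial illustration of a synthetic baseline (a purpose-built tool such as MinIO Warp is better suited for real profiling; the endpoint, credentials, object size, and count below are placeholders):

    import os
    import time
    import boto3

    # Placeholders: reuse the endpoint and credentials from the verification step
    s3 = boto3.client(
        "s3",
        endpoint_url="https://customer01.myservice.net",
        aws_access_key_id="TEST_TOKEN_ID",
        aws_secret_access_key="TEST_SECRET",
    )

    payload = os.urandom(4 * 1024 * 1024)   # 4 MiB of random data per object
    count = 50

    start = time.time()
    for i in range(count):
        s3.put_object(Bucket="verify-bucket", Key=f"baseline/obj-{i:04d}", Body=payload)
    elapsed = time.time() - start
    print(f"{count} PUTs of 4 MiB in {elapsed:.1f}s ({count * 4 / elapsed:.1f} MiB/s)")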

Support Registration

Finally, in preparation for going live with your service, you will want to confirm that you’re properly registered with DataCore Support. This involves verifying that support access for the team has been set up in advance, understanding where supporting documentation such as knowledge base articles resides (as found at the https://perifery.atlassian.net/wiki/spaces/KB page), and finalizing all documentation related to the specifics of your deployment. Should an escalation be required, this will help ensure that Support response is prompt and that there is backing knowledge related to your deployment which can be quickly referenced.

Post-Deployment Operations

Runbooks and Changelogs

As use of the service progresses over time, it’s good to develop best practices for day-to-day operations. One approach we recommend is developing a runbook in conjunction with changelog tracking. This can assist in efficiently handling operational tasks and service escalation response. The purpose of both is outlined below:

...

Combined, these two practices can provide the operational discipline needed to run a robust and reliable online service.

Logging and Auditing

The various components of DataCore Swarm are set up to generate system logging traffic in the form of syslog output. By default, these logs are primarily generated in the following areas:

...

Note that, in the event that DataCore Support asks for tech support bundles to be generated, it will also be necessary to include the logs captured at any centralized log server in use.

Capacity Monitoring and Upgrade Planning

DataCore Swarm provides access to multiple metrics that can be tracked in a deployment. From a capacity planning perspective, however, there are some key items you will want to monitor over time to determine when service expansion is needed. As mentioned in previous sections, these items can be easily inspected using your (by now fully populated) Prometheus database and Grafana dashboards. The latest dashboards can be found at https://grafana.com/search/?term=Datacore%20Swarm for this purpose.

...

  • Storage capacity - This relates directly to license requirements and to determining when extra storage nodes and/or volumes may be needed. A dashboard is also available that provides insight into health processor (HP) activity, which relates to data durability checks and other functions gated by HP cycles in the storage cluster. Overall storage node server loads can be ascertained from this information as well.

  • Swarm Search / Elasticsearch - These metrics can be used to inspect the total logical object (document) count in the storage cluster, Elasticsearch growth as objects are ingested, Elasticsearch disk capacity thresholds, indexing / list / query rates over time, shard counts, and other items that can be profiled to determine whether extra capacity is warranted for the Search component.

  • Swarm Gateway - The items tracked with Gateway metrics relate to session concurrency levels, type and number of API calls being handled for clients, and other areas related to Gateway server load. These can be used to determine if the current Gateway pool is becoming overloaded and “scale out” capacity is necessary.
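
If you want to pull these figures programmatically (for example, to feed your own capacity reports), the Prometheus HTTP API behind your Grafana dashboards can be queried directly. A minimal sketch follows; the Prometheus address and metric name are placeholders, and the actual metric names exposed in your deployment should be confirmed from the dashboards themselves:

    # Requires the third-party requests package: pip install requests
    import requests

    # Placeholders: Prometheus endpoint from the Swarm Telemetry install and an
    # example metric name; confirm actual metric names from your dashboards
    prometheus = "http://telemetry.myservice.internal:9090"
    query = "example_swarm_cluster_used_bytes"

    resp = requests.get(f"{prometheus}/api/v1/query", params={"query": query}, timeout=10)
    resp.raise_for_status()
    for result in resp.json()["data"]["result"]:
        instance = result["metric"].get("instance", "cluster")
        timestamp, value = result["value"]
        print(instance, value)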

Software and License Upgrade Considerations

As time goes on and the service matures, it will be necessary to occasionally plan for software upgrades. From the perspective of DataCore Swarm, upgrades are delivered for the following components:

...

It is also recommended to work with your assigned Solution Architects and the DataCore Support teams as part of upgrade planning. They can review your service requirements and help identify areas where you may need to perform actions such as customer notifications for service maintenance, where you may encounter pain points in the process (depending on the amount of data you have under management coupled with client activity), and other criteria of note.

Service Silos (“Divide and Conquer”)

As mentioned near the beginning of this document, you may encounter a situation where it’s effectively impossible to reconcile the needs of different workloads & workflows in a single DataCore Swarm deployment. For example, you may have conflicting customer activity (a.k.a. the “noisy neighbor” problem) which can’t be addressed with single-stack tuning.

...