S3 Protocol Special Topics
Amazon S3 is two distinct things: "S3 The Service" and "S3 The Protocol". Because the S3 protocol reflects characteristics of the Amazon service that may not apply outside of Amazon, the Gateway adapts these characteristics so they make sense when hosting storage within your own environment. These adaptations are transparent to most applications and enhance the protocol features by making use of the unique strengths of Swarm storage.
AWS Regions versus Storage Domains
Amazon AWS offers a fixed set of geographic regions worldwide, each of which is shared by thousands of end-users. In AWS, you choose a permanent region for each S3 bucket based upon latency, cost, and any applicable regulatory requirements.
An Amazon AWS region is roughly analogous to a Swarm storage domain: a Swarm bucket is tied to a domain just as an S3 bucket is tied to a region. However, the two differ fundamentally in the following ways:
AWS regions are finite, but any number of Swarm domains can be created; this allows achieving the optimal granularity for assigning content user permissions.
AWS buckets must have a name unique across all regions, but Swarm buckets need only be unique within the domain.
AWS buckets are fixed geographically, but Swarm content dynamically distributes across the entire Swarm cluster, according to the content protection settings.
AWS buckets can be replicated to a differently named bucket, but Swarm replication preserves the domain, bucket, and object names identically across every target cluster, which supports content sharing and direct DR fail-over.
AWS buckets can replicate in one direction, but Swarm clusters can use mirrored replication. A domain's content can be created, updated, and accessed across distributed storage clusters while remaining synchronized.
Best Practice
Domain creation and allocation in Swarm is lightweight, so be generous: provide each client (or business unit) with a separate storage domain. Doing so benefits both sides:
For storage admins, usage tracking and storage management are easier.
For storage end-users, access control policies are uncomplicated, and bucket naming is more flexible.
Bucket Location
The Amazon S3 GET Bucket Location request returns the AWS region in which the bucket is located. By default, Gateway returns an empty value for the location, which S3 clients interpret as us-east-1. If another region is desired, there are two options:
Supply the location in the bucket creation operation using LocationConstraint.
Set the [s3]region option in the Gateway configuration file to the preferred region. This applies to all buckets unless the location is specified during creation.
If you require the prior behavior of returning the cluster name, set [s3]region to that cluster name.
Best Practice
Unless you have an application that requires otherwise, either use the default, or choose another valid AWS region, as there are clients that validate the location against a list of known AWS regions.
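The precedence described above can be sketched in a few lines. This is only an illustrative model of the documented behavior, not Gateway code; the function names `bucket_location` and `client_view` are hypothetical.

```python
from typing import Optional


def bucket_location(location_constraint: Optional[str],
                    configured_region: Optional[str]) -> str:
    """Model the value reported by GET Bucket Location.

    location_constraint: the LocationConstraint supplied at bucket creation, if any.
    configured_region:   the [s3]region value from the Gateway configuration, if set.
    """
    if location_constraint:
        # A location given at creation time wins for that bucket.
        return location_constraint
    if configured_region:
        # Otherwise the configured region applies to all buckets.
        return configured_region
    # Default: an empty location value.
    return ""


def client_view(location: str) -> str:
    """S3 clients interpret an empty location as us-east-1."""
    return location or "us-east-1"
```

For instance, under these assumptions a bucket created without a LocationConstraint on a Gateway with no [s3]region setting reports an empty location, which a client then treats as us-east-1.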
Storage Class
Amazon S3 allows clients to set a storage class preference in the x-amz-storage-class header, which defines the data durability and access frequency of content. See Object Storage Classes – Amazon S3.
The S3 protocol tags all objects with the x-amz-storage-class-meta header, which holds the client application's requested class, or STANDARD if none is specified. For S3 protocol operations, the Gateway translates bi-directionally between the AWS S3 x-amz-storage-class header and the Swarm x-amz-storage-class-meta header.
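As a rough illustration of this translation, the sketch below maps headers in both directions. It is a simplified model under the assumption that headers are handled as a plain dictionary; the helper names `to_swarm` and `to_s3` are hypothetical and do not come from the Gateway.

```python
AWS_HEADER = "x-amz-storage-class"
SWARM_HEADER = "x-amz-storage-class-meta"
DEFAULT_CLASS = "STANDARD"


def to_swarm(request_headers: dict) -> dict:
    """Translate S3 request headers to the form stored with the object."""
    headers = {name.lower(): value for name, value in request_headers.items()}
    # Use the requested class, or STANDARD when the client sent none.
    headers[SWARM_HEADER] = headers.pop(AWS_HEADER, DEFAULT_CLASS)
    return headers


def to_s3(stored_headers: dict) -> dict:
    """Translate stored headers back to the form an S3 client expects."""
    headers = {name.lower(): value for name, value in stored_headers.items()}
    if SWARM_HEADER in headers:
        headers[AWS_HEADER] = headers.pop(SWARM_HEADER)
    return headers
```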
Virtual Hosting of Buckets
Amazon S3 supports mapping virtual host names to buckets within the storage service. This is accomplished by creating a DNS alias (CNAME record) for a virtual host name that points to one of the Amazon S3 region endpoints. For example, this allows the first web request below to be mapped to the second, which is the real Amazon S3 URL:
http://www.fred.com/hello.html
http://s3.amazonaws.com/www.fred.com/hello.html
The HTTP Host header can be used to specify the bucket name if the virtual host is mapped to the bucket named www.fred.com in the US Standard Region.
The S3 protocol also supports this mapping of virtual hosts to buckets. To accomplish this, the storage administrator configures the DNS server to perform wildcard resolution of host names to the front-end IP address of the Gateway. The Gateway then searches for a storage domain within Swarm, starting with the value of the Host header and then repeatedly popping off the leftmost word, using period (".") as a delimiter, until it finds a match or runs out of words. When a storage domain is found, the previously popped words are concatenated back together using periods, and the result becomes the bucket name for the request.
As an example, consider the storage domain in Swarm called fred.com containing a bucket called www and an object within called hello.html. The normal method to access this object is with the URL http://fred.com/www/hello.html, which has the following HTTP request headers:
GET /www/hello.html HTTP/1.1
Host: fred.com
Create DNS entries for www.fred.com and fred.com to set up the virtual host mapping from www.fred.com to the Swarm storage domain and bucket. In this example, the Content Gateway's front-end IP address is 10.100.100.81 and a wildcard match is used.
fred.com A 10.100.100.81
*.fred.com CNAME fred.com
After the DNS entries are in place, the hello.html object is also accessible with the additional URL http://www.fred.com/hello.html. The HTTP request headers arriving at the Gateway look like this:
GET /hello.html HTTP/1.1
Host: www.fred.com
The Gateway first checks for the non-existent storage domain www.fred.com, then removes www, the leftmost word, and finds the storage domain fred.com in the storage cluster. The Gateway then transparently modifies the HTTP request headers by prefixing the URI path with the removed word www and shortening the Host header as follows:
GET /www/hello.html HTTP/1.1
Host: fred.com
If the bucket name contains periods, such as the bucket hires.images with a virtual host name of hires.images.fred.com, the Gateway keeps removing the leftmost words until a domain is found or until it reaches a null hostname. In this example, it tests for the existence of the domains hires.images.fred.com and images.fred.com before finding fred.com. The request results in an error if the search reaches a null hostname. The [caching] domainExistenceRefresh configuration parameter in gateway.cfg is used to optimize domain existence testing.
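The domain search described in this section can be sketched as follows. This is a minimal model rather than the Gateway's implementation: it assumes the set of existing storage domains is available as an in-memory set, that the Host value carries no port, and the names `resolve_virtual_host` and `rewrite_path` are hypothetical.

```python
from typing import Optional, Set, Tuple


def resolve_virtual_host(host: str, domains: Set[str]) -> Optional[Tuple[str, str]]:
    """Return (storage_domain, bucket) for a Host header value.

    Returns None when the search reaches a null hostname, which
    corresponds to the error case.
    """
    remaining = host.split(".")
    popped = []
    while remaining:
        candidate = ".".join(remaining)
        if candidate in domains:
            # The popped words, rejoined with periods, form the bucket name.
            return candidate, ".".join(popped)
        # Pop the leftmost word and try the shorter host name.
        popped.append(remaining.pop(0))
    return None


def rewrite_path(path: str, bucket: str) -> str:
    """Prefix the URI path with the bucket name taken from the host."""
    return f"/{bucket}{path}" if bucket else path
```

With a single domain fred.com, the host www.fred.com resolves to the domain fred.com and the bucket www, and the path /hello.html is rewritten to /www/hello.html. A host that matches the domain directly yields an empty bucket name, so the path is left unchanged.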
Identity Management and S3 Authentication Tokens
The Gateway's S3 protocol uses an external identity management system (IDM) for users and groups, similar to a federation with AWS IAM, and uses an internal system for managing authentication tokens, similar to AWS temporary security credentials. These authentication tokens are created and managed within the Swarm cluster.
Authenticated requests to the S3 protocol follow the AWS S3 request signing rules for v2 and v4 signatures whereby each request includes a header in one of the following forms:
Authorization: AWS AccessKey:Signature (v2 signature)
Authorization: AWS4-HMAC-SHA256 Credential,SignedHeaders,Signature (v4 signature)
The construction of the Authorization header is automatically handled by S3 SDKs and S3 applications. For the elements of the header's value string and the S3 HMAC authentication mechanism, see the AWS S3 documentation.
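For readers who want to see what an SDK does on their behalf, here is a minimal sketch of the v2 form, following the string-to-sign layout published in the AWS signature version 2 rules (HMAC-SHA1 over the verb, Content-MD5, Content-Type, date, canonicalized x-amz headers, and canonicalized resource). The function name `sign_v2` is hypothetical, and the caller is assumed to pass already-canonicalized values; in practice an S3 SDK should do this work.

```python
import base64
import hashlib
import hmac


def sign_v2(access_key: str, secret_key: str, method: str, resource: str,
            date: str, content_md5: str = "", content_type: str = "",
            amz_headers: str = "") -> str:
    """Build an 'Authorization: AWS AccessKey:Signature' header value.

    amz_headers is the canonicalized x-amz-* header block (each entry
    ending in a newline) and resource is the canonicalized resource,
    for example '/bucket/object'. Both are assumed to be prepared by
    the caller.
    """
    string_to_sign = (
        "\n".join([method, content_md5, content_type, date])
        + "\n" + amz_headers + resource
    )
    digest = hmac.new(secret_key.encode("utf-8"),
                      string_to_sign.encode("utf-8"),
                      hashlib.sha1).digest()
    signature = base64.b64encode(digest).decode("ascii")
    return f"AWS {access_key}:{signature}"
```

A call might look like `sign_v2("TOKENID123", "abcdefg", "GET", "/www/hello.html", "Tue, 27 Mar 2007 19:36:42 +0000")`, where `TOKENID123` stands in for a real authentication token ID.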
Each S3 client needs at least one authentication token to authenticate S3 protocol requests to the Gateway. The Access Key value is the ID of an authentication token.
See Token-Based Authentication.
The X-User-Secret-Key-Meta header is required on the token creation request when creating authentication tokens for use with S3 (also referred to as "S3 authentication tokens"). The value of this header becomes the Secret Access Key (or "Secret Key") used to sign S3 requests. As previously mentioned, the Access Key is the value of the token cookie returned by the create request.
This example shows an authentication token being created by the user gcarlin with an S3 secret key of abcdefg and an expiration time of 365 days from now. Note: this request uses HTTP basic authentication to create the token.
This is an excerpt from the Gateway's response.
Construct the S3 Authorization header using the following to sign S3 requests with the created token:
Note
Tokens that contain an S3 secret key are used to sign S3 storage requests and may not be used to authenticate SCSP storage operations.
Error Response for Missing Resource
The S3 protocol includes an extra XML Resource tag in the error response to provide additional details for troubleshooting errors regarding non-existent content. This is an example showing the additional field.
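A client can pull the bucket and object names out of that tag when handling such an error. The sketch below is illustrative only: `SAMPLE_ERROR` is a made-up body in the general shape of an S3 error document, not captured Gateway output, and `missing_resource` is a hypothetical helper. It relies on the Resource value having the form /{bucketName}/{objectName}.

```python
import xml.etree.ElementTree as ET
from typing import Optional, Tuple

# Illustrative error body; the element values are assumptions for this example.
SAMPLE_ERROR = """<Error>
  <Code>NoSuchKey</Code>
  <Message>The specified key does not exist.</Message>
  <Resource>/mybucket/missing.txt</Resource>
</Error>"""


def missing_resource(error_xml: str) -> Optional[Tuple[str, str]]:
    """Return (bucket, object) from the Resource tag, or None if absent."""
    root = ET.fromstring(error_xml)
    resource = root.findtext("Resource")
    if not resource:
        return None
    # Split on the first slash only, since object names may contain slashes.
    bucket, _, object_name = resource.lstrip("/").partition("/")
    return bucket, object_name
```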
The format of the Resource tag is "/{bucketName}/{objectName}".
Multipart Uploads
AWS S3 requires that every part of a multipart upload (except the last) be at least 5 MB in size. The Gateway's S3 protocol implementation does not impose this limitation and allows each part of a multipart upload to be of any size.
Multipart writes are long-running operations with initial and final responses. For S3 compatibility, the initial response returns x-amz-version-id with the value of the expected ETag. If an error occurs while completing the write, no new object is created and the expected ETag given in the initial response is not valid. (v6.3)
List after Update Timing
After an object is created, the delay before that object appears in a list operation can vary depending upon the Swarm metadata feed batch timeout setting. New objects are available within two seconds following a create when the batch timeout is set to 1 second. The Amazon S3 documentation has specific developer guidance about this eventual consistency behavior.
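An application that must see a new object in a listing before proceeding can poll for it. The following is a minimal sketch of such a wait loop; `wait_until_listed` and its `list_keys` callback are hypothetical names, and the callback stands in for whatever listing call your S3 client provides. The timeout and interval defaults are arbitrary assumptions, not recommended values.

```python
import time
from typing import Callable, Iterable


def wait_until_listed(list_keys: Callable[[], Iterable[str]], key: str,
                      timeout: float = 5.0, interval: float = 0.25) -> bool:
    """Poll a listing until `key` appears or `timeout` seconds elapse.

    list_keys is any zero-argument callable returning the current
    object names in the bucket.
    """
    deadline = time.monotonic() + timeout
    while True:
        if key in set(list_keys()):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```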
PUT Object Copy Metadata
The AWS S3 PUT Object Copy request creates a duplicate of an existing object and, when the x-amz-copy-source header is included with the request, copies the x-amz-meta-* custom metadata from the source object.
Although Swarm allows for more custom metadata patterns than this, only the metadata matching the x-amz-meta-* pattern is copied during a PUT Object Copy operation.
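The effect on a source object's headers can be pictured with a small filter. This is a sketch of the selection rule only, assuming headers are available as a dictionary; `copied_metadata` is a hypothetical name and the sample header names are made up for illustration.

```python
def copied_metadata(source_headers: dict) -> dict:
    """Select the custom metadata carried over by a PUT Object Copy.

    Only headers whose names match x-amz-meta-* are kept; header names
    are compared case-insensitively.
    """
    prefix = "x-amz-meta-"
    return {
        name: value
        for name, value in source_headers.items()
        if name.lower().startswith(prefix)
    }
```

Given a source object with `x-amz-meta-project`, `X-Color-Meta`, and `Content-Type` headers, only `x-amz-meta-project` would be selected by this rule.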
© DataCore Software Corporation. · https://www.datacore.com · All rights reserved.