How to write streams of unknown length
Problem
I'm implementing an interface that does not provide the content length of streams at the start of a write; or I'm writing code to write Swarm streams from a streaming input of unknown length.
We've seen two specific cases:
- The HDFS FileSystem interface, which uses a create/write block/write block/.../close protocol for writing streams. The create call doesn't include a length parameter.
- Streaming data input for audio or video.
Solution
We can use the following four features of Swarm to solve the problem:
- Chunked transfers - Swarm accepts streams without a Content-length header if sent with Transfer-encoding: chunked
- ECP - Swarm writes all streams sent with chunked encoding as erasure coded, so Elastic Content Protection must be enabled and configured in the target Swarm cluster.
- Lifepoints - These can include reps specs, so we can use those to convert from erasure coding to straight replicas.
- Swarm Management API - This lets us query Swarm for the ec.minStreamSize.
Naive Approaches
Spooling
We could write all streams to non-Swarm storage and then read from that storage on stream completion and POST the stream of known length to Swarm.
If we're using a synchronous spooler, we'll add latency to the transfer. If we're using an asynchronous spooler, we'll have to implement something to track the result of writing to Swarm after the write to the spool completes, and we'll have to manage data loss within the spooler.
Both implementations are limited by the storage available to the spool, both for a single stream and in aggregate.
Writing streams with Transfer-Encoding: chunked without cleanup
We want to use chunked transfers and therefore store streams erasure-coded. This means, however, that small streams will be erasure-coded, which wastes disk and index memory space.
Better Answer
Write streams with Transfer-Encoding: chunked and then, on completion of the write, use COPY with a lifepoint to convert small streams to straight reps. COPY using a terminal reps= lifepoint will strip off the implicit EC and recode the stream as reps.
(Note that the recoding occurs asynchronously, either in the background immediately after the response or at the next Health Processor (HP) examination.)
How large is a small stream?
Ideally, we want to use the default cluster setting ec.minStreamSize in Swarm 8.0 and prior. Swarm 8.0 has a nifty hidden feature that allows you to read that. You can use the Swarm Management API, available on port 91 of a Swarm node by default, to retrieve various configuration settings. Specifically, we want the configuration setting at /api/storage/nodes/<node-ip>/settings/ec.minStreamSize.
EXAMPLE
The following curl command retrieves the value of ec.minStreamSize
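The command itself is not reproduced here, so the following is only a minimal sketch built from the port and path given above. The helper names `ec_min_stream_size_url` and `get_ec_min_stream_size` are hypothetical, the node address 192.168.3.84 is borrowed from the later examples, and the format of the response is not shown in this article.

```shell
# Build the Management API URL for ec.minStreamSize on one node.
# Assumes the default Management API port (91) described above.
ec_min_stream_size_url() {
    node_ip="$1"
    printf 'http://%s:91/api/storage/nodes/%s/settings/ec.minStreamSize\n' \
        "$node_ip" "$node_ip"
}

# Query the setting with curl (hypothetical helper; it only runs when called).
get_ec_min_stream_size() {
    curl -s "$(ec_min_stream_size_url "$1")"
}

# Usage:
#   get_ec_min_stream_size 192.168.3.84
```

Keeping the URL construction separate from the request makes it easy to check the path before sending anything to a node.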
Complete Example in curl
The example below uses the following curl options:
- --upload-file (alias -T) Path|- : Transfers a file to the destination URL. The example uses -, signifying standard input. Using - also tells curl to use chunked transfer and to write to the named resource in the destination URL without appending the input Path.
- -X POST : Tells curl to use POST rather than PUT
- -X COPY : Tells curl to use COPY
- -H "<header name>:<header value>": Tells curl to send a header in the request
- -L and --post301: Tells curl to follow redirects and to do so with a POST.
- -v: Turns on curl verbose mode.
Original write
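As a sketch of the original write, the options listed above can be combined as follows. The host 192.168.3.84, the name bucket1/file.txt, and the domain d1 are taken from the HEAD example in this article; the helper names `swarm_object_url` and `swarm_post_chunked` are hypothetical, and the server's response is not reproduced here.

```shell
# Build the object URL used throughout this example.
swarm_object_url() {
    host="$1"; name="$2"; domain="$3"
    printf 'http://%s/%s?domain=%s\n' "$host" "$name" "$domain"
}

# POST standard input to Swarm without a known length.
#   -T -           read the body from stdin, so curl uses chunked transfer
#   -X POST        use POST rather than curl's default PUT for uploads
#   -L --post301   follow redirects and do so with a POST
#   -v             show request and response headers
swarm_post_chunked() {
    curl -v -L --post301 -X POST -T - "$(swarm_object_url "$1" "$2" "$3")"
}

# Usage:
#   swarm_post_chunked 192.168.3.84 bucket1/file.txt d1 < file.txt
```

With -v enabled, the request and response headers discussed in the notes that follow appear in curl's output.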
In the above, note that:
- The POST was done with a chunked transfer: Transfer-encoding: chunked is in the request;
- The POST didn't include a content-length;
- The object was stored erasure-coded: Manifest: ec is in the response.
Retrieving the file size
We'd be able to track the file length if we were doing this in code, but for illustration we'll get the object length by making a HEAD request to Swarm:
curl -I "192.168.3.84/bucket1/file.txt?domain=d1" -L --post301
[...]
Again, note that the stream is stored erasure-coded (Manifest: ec). The stream length is 8,835 bytes (Content-Length: 8835).
Converting to straight replicas
From our previous call to retrieve ec.minStreamSize from the Swarm Management API, we know the stream we just wrote is much smaller than the default minimum for ec. We therefore want to convert the stream to use straight replicas.
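The size comparison described here could be expressed as a small helper. This is a sketch only: the function name `needs_reps_conversion` is hypothetical, and treating a stream exactly equal to the minimum as "large enough" is an assumption, not something this article specifies.

```shell
# Succeed (exit status 0) when the stream is smaller than ec.minStreamSize
# and is therefore a candidate for conversion to straight replicas.
needs_reps_conversion() {
    stream_size="$1"      # bytes, e.g. the Content-Length from the HEAD
    min_stream_size="$2"  # bytes, the value read from ec.minStreamSize
    [ "$stream_size" -lt "$min_stream_size" ]
}

# Usage (1000000 is a placeholder value, not a documented default):
#   if needs_reps_conversion 8835 1000000; then
#       echo "convert to replicas"
#   fi
```

In real code the first argument would come from the byte count tracked during the write rather than from a separate HEAD request.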
Note that we can use the Replica-Count value (2) returned on the original POST as our replicas value in our conversion to straight reps.
We'll COPY the original stream, adding a terminal lifepoint with a policy of reps=2 (Lifepoint: [] reps=2) to effect the conversion.
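A sketch of that COPY request, using the curl options listed earlier, might look like the following. The helper names `reps_lifepoint_header` and `swarm_copy_to_reps` are hypothetical, the URL is the one from the HEAD example, and any other headers a real COPY may need to carry are not covered here.

```shell
# Build the Lifepoint header for a terminal reps= policy.
reps_lifepoint_header() {
    printf 'Lifepoint: [] reps=%s\n' "$1"
}

# COPY the stream, adding the terminal lifepoint.
#   -X COPY        use the COPY method
#   -H             send the Lifepoint header
#   -L --post301   redirect handling, as in the other examples
swarm_copy_to_reps() {
    url="$1"; reps="$2"
    curl -v -L --post301 -X COPY \
        -H "$(reps_lifepoint_header "$reps")" "$url"
}

# Usage:
#   swarm_copy_to_reps "http://192.168.3.84/bucket1/file.txt?domain=d1" 2
```

The reps argument is the Replica-Count value noted from the original POST.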
The response still indicates the stream is erasure-coded, since it still contains a Manifest: ec header. If we HEAD the stream, however, we can see that the stream is no longer erasure-coded.
Note that the response no longer contains the Manifest header.