
All non-ASCII characters in object headers must be rendered into a standard format for accurate metadata indexing and querying with Elasticsearch. Swarm supports character encoding so non-ASCII characters may be used in the HTTP headers of objects. 

...

How Swarm Handles Non-ASCII

...

  • Swarm tolerates validation failures: When a header value fails validation, Swarm stores it as-is, so existing objects whose stored headers may fail decoding under the new header rules are not disturbed. Swarm does not reject any object because it cannot decode encoded words in a header.

  • Swarm stores and returns header fields as-is: Swarm allows all string-typed headers to span multiple lines and to contain encoded words. Swarm stores the header value as-is with the object metadata. When Swarm needs to use that value (such as for metadata indexing), it decodes the header field into Unicode and operates on the decoded value. The original encoded persistent headers remain stored with the object and are returned by HEAD and GET operations against the object.

...

  • Metadata goes to Elasticsearch as Unicode: Swarm sends metadata to Elasticsearch as document attributes through the Elasticsearch API. To do so, Swarm decodes the metadata using the algorithms specified in RFC 2047.

  • ISO-8859-1 encoding for headers: Swarm follows the HTTP/1.1 specification, which defines request header values as ISO-8859-1 characters. To store any Unicode characters (not only ASCII or ISO-8859-1), Swarm allows header values to be encoded according to RFC 2047. Applications can encode header values as RFC 2047 ‘encoded-words’ to safeguard the treatment of non-ASCII characters. This encoding uses only ASCII characters; Swarm decodes it to recover the original non-ASCII strings for indexing and searching.

...

  • Swarm encodes other character sets: Per the HTTP/1.1 specification, the field content of Swarm headers can be encoded so that clients can use characters from sets other than ISO-8859-1.

...

Note

Decoding affects performance, but the impact is minimal for fields with no special encodings; there is no performance impact unless large volumes of non-ASCII metadata are stored in a cluster that has full metadata searching enabled.

How to Encode Non-ASCII Characters

...

X-Alt-Meta-Name: =?UTF-8?Q?caf=C3=A9?= red white =?UTF-8?Q?caf=C3=A9_brown_orange?=

Note

Use ‘=20’ or underscores (_) to encode embedded spaces.
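As an illustration, the following minimal Python sketch builds encoded words like those in the example header above. The helper names `encode_q` and `encode_b` are hypothetical and not part of Swarm or any Swarm SDK. The sketch assumes the client escapes every byte that is not an ASCII letter or digit and writes spaces as underscores.

```python
import base64


def encode_q(text: str, charset: str = "UTF-8") -> str:
    """Build one RFC 2047 encoded word using Q encoding (hypothetical helper)."""
    parts = []
    for byte in text.encode(charset):
        ch = chr(byte)
        if ch == " ":
            parts.append("_")             # an underscore stands for a space
        elif ch.isascii() and ch.isalnum():
            parts.append(ch)              # ASCII letters and digits pass through
        else:
            parts.append(f"={byte:02X}")  # every other octet becomes =XX
    return f"=?{charset}?Q?{''.join(parts)}?="


def encode_b(text: str, charset: str = "UTF-8") -> str:
    """Build one RFC 2047 encoded word using Base64 (hypothetical helper)."""
    payload = base64.b64encode(text.encode(charset)).decode("ascii")
    return f"=?{charset}?B?{payload}?="


print(encode_q("caf\u00e9"))               # =?UTF-8?Q?caf=C3=A9?=
print(encode_q("caf\u00e9 brown orange"))  # =?UTF-8?Q?caf=C3=A9_brown_orange?=
print(encode_b("caf\u00e9"))               # =?UTF-8?B?Y2Fmw6k=?=
```

Plain ASCII words can be sent unencoded between encoded words, as the example header does with "red white".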

Best Practice

Although Swarm does not enforce compliance, the best practice is to comply with the RFC 2047 limits:

  • 75 characters per encoded word

  • 76 characters per line that contains an encoded word
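Because Swarm does not enforce these limits, a client may want to check them before sending a header. The sketch below is one way to do that; the function name `rfc2047_length_problems` and the regular expression are illustrative assumptions, not a Swarm API.

```python
import re

# Rough shape of a complete encoded word: =?charset?Q-or-B?encoded-text?=
ENCODED_WORD = re.compile(r"=\?[^?\s]+\?[QqBb]\?[^?\s]*\?=")


def rfc2047_length_problems(header_line: str) -> list:
    """Return human-readable problems for one header line (hypothetical helper)."""
    problems = []
    words = ENCODED_WORD.findall(header_line)
    for word in words:
        if len(word) > 75:
            problems.append(f"encoded word is {len(word)} characters (limit 75)")
    # The 76-character limit applies only to lines that contain an encoded word.
    if words and len(header_line) > 76:
        problems.append(f"line is {len(header_line)} characters (limit 76)")
    return problems


print(rfc2047_length_problems("X-Alt-Meta-Name: =?UTF-8?Q?caf=C3=A9?="))  # []
```

Longer values can be split into several shorter encoded words, each on its own folded header line.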

...

This is how Swarm decodes the following header values:

| Encoding | Header value | Decoded value |
| --- | --- | --- |
| ASCII | "alpha beta gamma" | 'alpha beta gamma' |
| UTF-8 | "=?utf-8?q?caf=C3=A9?=" | u'caf\xe9' |
| UTF-8 Base64 | "=?utf-8?b?Y2Fmw6k=?=" | u'caf\xe9' |
| Complex UTF-8 | "=?utf-8?q?caf=c3=a9?= aaa =?utf-8?q?caf=c3=a9?= bbb =?utf-8?q?caf=c3=a9?=" | u'caf\xe9 aaa caf\xe9 bbb caf\xe9' |
| ISO-8859-1 | "=?iso-8859-1?q?caf=E9?=" | u'caf\xe9' |
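The decoded values listed above can be reproduced with Python's standard `email.header` module, which implements RFC 2047 decoding. This is only an approximation for experimenting with header values: it is not Swarm's decoder, and the standard library is not guaranteed to apply the same validation rules that this page describes for Swarm. The function name `decode_value` is hypothetical.

```python
from email.header import decode_header, make_header


def decode_value(value: str) -> str:
    """Decode RFC 2047 encoded words in a header value to a Unicode string."""
    return str(make_header(decode_header(value)))


examples = [
    "alpha beta gamma",
    "=?utf-8?q?caf=C3=A9?=",
    "=?utf-8?b?Y2Fmw6k=?=",
    "=?utf-8?q?caf=c3=a9?= aaa =?utf-8?q?caf=c3=a9?= bbb =?utf-8?q?caf=c3=a9?=",
    "=?iso-8859-1?q?caf=E9?=",
]

for value in examples:
    # ascii() shows non-ASCII characters as escapes, e.g. 'caf\xe9'
    print(ascii(decode_value(value)))
```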

Valid Header Values

| Header value | Description |
| --- | --- |
| "alpha beta gamma" | Pure ASCII |
| "=?utf-8?q?caf=C3=A9?=" | UTF-8 |
| "=?utf-8?b?Y2Fmw6k=?=" | UTF-8 Base64 encoded |

Valid but Malformed Header Values

Incompletely formatted encodings are valid header content, but Swarm does not decode them because it does not recognize them as encoded. When the encoding format is malformed, Swarm treats the value as valid, unencoded content.

| Header value | Description |
| --- | --- |
| "=?utf-8?q?caf=C3=A9" | UTF-8 without the expected suffix |
| "=?utf-8?q?caf=C3=A9=" | UTF-8 with a partial suffix |
| "utf-8?q?caf=C3=A9?=" | UTF-8 without the expected prefix |
| "?utf-8?q?caf=C3=A9?=" | UTF-8 with a partial prefix |

Invalid Header Values

| Header value | Description |
| --- | --- |
| "=?utf-8?j?caf=C3=A9?=" | UTF-8 with an invalid coding indicator (not “Q” or “B”) |
| "=?utf-8?qcaf=C3=A9?=" | UTF-8 missing an internal “?” separator character |
| "=?utf-9?q?caf=C3=A9?=" | Invalid or unknown character encoding |
| "=?utf-8?q?caf=C3=FF?=" | Invalid or unknown character |
| "=?utf-8?b?Y2-mw6k=?=" | Base64 with invalid characters |
| "=?utf-8?b?Y2Fmw6k?=" | Base64 with invalid padding |
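To make the failure categories above concrete, here is a minimal sketch of a strict single-word decoder that raises `ValueError` with a reason instead of guessing. It is an assumption-laden illustration, not Swarm's implementation: the names `decode_word` and `decode_q` are hypothetical, and charset lookup relies on Python's codec registry.

```python
import base64
import binascii
import re

# Shape of a complete encoded word: =?charset?encoding?encoded-text?=
ENCODED_WORD = re.compile(r"=\?([^?\s]+)\?([^?\s]+)\?([^?\s]*)\?=")


def decode_q(text: str) -> bytes:
    """Decode Q-encoded text: '_' is a space and '=XX' is one hex octet."""
    out = bytearray()
    i = 0
    while i < len(text):
        if text[i] == "_":
            out.append(0x20)
            i += 1
        elif text[i] == "=":
            pair = text[i + 1:i + 3]
            if not re.fullmatch(r"[0-9A-Fa-f]{2}", pair):
                raise ValueError("bad =XX escape in Q-encoded text")
            out.append(int(pair, 16))
            i += 3
        else:
            out.append(ord(text[i]))
            i += 1
    return bytes(out)


def decode_word(word: str) -> str:
    """Strictly decode one encoded word, or raise ValueError with the reason."""
    match = ENCODED_WORD.fullmatch(word)
    if match is None:
        raise ValueError("not a complete encoded word")
    charset, encoding, text = match.groups()
    if encoding.upper() == "Q":
        raw = decode_q(text)
    elif encoding.upper() == "B":
        try:
            raw = base64.b64decode(text, validate=True)
        except binascii.Error as exc:
            raise ValueError(f"invalid Base64: {exc}") from None
    else:
        raise ValueError(f"invalid coding indicator {encoding!r}")
    try:
        return raw.decode(charset)
    except LookupError:
        raise ValueError(f"unknown character encoding {charset!r}") from None
    except UnicodeDecodeError:
        raise ValueError(f"invalid octets for {charset}") from None


print(ascii(decode_word("=?utf-8?q?caf=C3=A9?=")))  # 'caf\xe9'
```

A caller that mirrors the behavior described on this page would catch the `ValueError` and keep the original header value unchanged.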

Decoding Limitations

Swarm does not perform the following:

| Feature | Swarm Behavior | Workaround | Example |
| --- | --- | --- | --- |
| Disable decoding | Swarm has no configuration parameter to disable decoding of headers. Decoding rules are applied to all header values. | If a header value looks like an encoded string but is meant to be taken literally, encode the content itself as ISO-8859-1 with the “?” and “=” characters replaced by their octet values. | To have the header value "=?UTF-8?Q?caf=C3=A9?=" passed as-is to Elasticsearch (instead of being decoded), send it as "=?ISO-8859-1?Q?=3D=3FUTF-8=3FQ=3Fcaf=3DC3=3DA9=3F=3D?=" |
| Unicode normalizing | Swarm does not perform Unicode normalization. | If various encodings of a word such as “café” need to match, standardize how such words are encoded. | Valid variants for encoding diacritics in UTF-8: =?UTF-8?Q?caf=C3=A9?= and =?UTF-8?Q?caf=65=CC=81?= |
| Unicode case folding | Swarm performs no Unicode case folding. | None. For case-insensitive operations, Swarm always converts uppercase to lowercase, including for non-ASCII characters. | In ASCII, uppercasing a character and then lowercasing it always results in the same character. That is not always the case for Unicode characters. |
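The first two workarounds above can be sketched as small client-side helpers. Both names (`escape_literal`, `normalize_for_matching`) are hypothetical, the escape helper assumes ASCII-only input, and choosing NFC as the standard form is an assumption of this sketch rather than a Swarm requirement.

```python
import unicodedata


def escape_literal(value: str) -> str:
    """Wrap text that looks like an encoded word so it decodes back to itself.

    Every '=' and '?' is replaced by its octet value, as the workaround says.
    Assumes ASCII-only input (hypothetical helper, not a Swarm API).
    """
    value.encode("ascii")  # raises UnicodeEncodeError for non-ASCII input
    escaped = (
        value.replace("=", "=3D")
        .replace("?", "=3F")
        .replace("_", "=5F")  # a literal underscore would otherwise mean a space
        .replace(" ", "_")
    )
    return f"=?ISO-8859-1?Q?{escaped}?="


def normalize_for_matching(text: str) -> str:
    """Pick one canonical form (NFC here) before encoding a header value."""
    return unicodedata.normalize("NFC", text)


print(escape_literal("=?UTF-8?Q?caf=C3=A9?="))
# =?ISO-8859-1?Q?=3D=3FUTF-8=3FQ=3Fcaf=3DC3=3DA9=3F=3D?=

# 'e' followed by a combining acute accent becomes the single character U+00E9
print(normalize_for_matching("cafe\u0301") == "caf\u00e9")  # True

# Case round trips are not stable outside ASCII: the German sharp s becomes "ss"
print("\u00df".upper().lower())  # ss
```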

Troubleshooting Decoding

If Swarm does not decode a header as expected, review these possible reasons why Swarm found the encoding incomplete or invalid:

Problems in Encoded Word Structure

These examples have validation issues in the structure of the encoded-word framework, such as:

...

  Swarm passes these types of strings as unencoded text.

| Example | Error |
| --- | --- |
| '=?utf-8?Q?_brown=20=20and_blue?' | Missing the closing ‘=’. Per RFC 2047, an encoded word must start with “=?” and end with “?=”. |
| '=?utf-8Q?_brown=20=20and_blue?=' | Missing the “?” between the “utf-8” and “Q” characters. |
| '=?utf-8?J?_brown=20=20and_blue?=' | J encoding is invalid. |
| '=?utf-8??Q?_brown=20=20and_blue?=' | Extra “?” before “Q”. |

Unknown or Unreadable Encodings

When Swarm encounters an encoded word with an unknown encoding or a valid encoding with any other problem, such as an invalid octet, it passes it through as-is:

| Example | Error |
| --- | --- |
| '=?utf-9?Q?_brown=20=20and_blue?=' | utf-9 is not a known encoding. |
| '=?utf-8?Q?_brown=20=FFand_blue?=' | 0xFF is not a valid octet in UTF-8. |

Problems with Base64 Encoding

If either of the following validations fails in the Base64 encoding, Swarm passes the original header through as-is.

  • Characters: Base64-encoded words include only the characters A-Z, a-z, 0-9, +, and /, plus “=” as trailing padding; all other characters are invalid. If any invalid characters are present, Swarm treats the entire encoded word as invalid.

  • Padding: Base64 encodings consist of 4-character groups. Base64 encodings have trailing padding (with “=”) to keep the string length a multiple of four characters. Swarm treats any Base64 encoding that lacks the trailing padding as invalid.
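The two Base64 checks above can be expressed as a short pre-flight test on the encoded text (the part between “?B?” and “?=”). This is a sketch under the assumption that the standard Base64 alphabet applies; the function name `check_base64_text` is hypothetical.

```python
import re

# Standard Base64 alphabet, optionally followed by one or two '=' padding characters
BASE64_BODY = re.compile(r"[A-Za-z0-9+/]*={0,2}")


def check_base64_text(text: str) -> str:
    """Return 'ok', 'invalid characters', or 'invalid padding'."""
    if not BASE64_BODY.fullmatch(text):
        return "invalid characters"
    # Padded Base64 is always a whole number of 4-character groups.
    if len(text) % 4 != 0:
        return "invalid padding"
    return "ok"


print(check_base64_text("Y2Fmw6k="))  # ok
print(check_base64_text("Y2-mw6k="))  # invalid characters
print(check_base64_text("Y2Fmw6k"))   # invalid padding
```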

Problems with One of Several Encoded Words

HTTP header content can contain more than one encoded word, but Swarm does not partially decode headers. If any encoded word in a header is invalid, the entire header is passed through undecoded.

If a header includes both a complete encoding and an incomplete encoding (text that looks like an encoded word but is missing either the leading “=?” prefix or the trailing “?=” suffix), Swarm ignores the incomplete encoding, treating it like a valid non-encoded word, and decodes the complete word.
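The all-or-nothing rule and the treatment of incomplete words can be modeled as follows. This is a simplified sketch, not Swarm's code: `decode_header_value` is a hypothetical name, stray “=” characters in Q-encoded text are left untouched, and whitespace between adjacent encoded words is kept rather than collapsed as RFC 2047 specifies.

```python
import base64
import re

# Only complete encoded words match; text missing "=?" or "?=" is left alone.
ENCODED_WORD = re.compile(r"=\?([^?\s]+)\?([^?\s]+)\?([^?\s]*)\?=")


def _q_to_bytes(text: str) -> bytes:
    """Decode Q-encoded text: '_' is a space and '=XX' is one hex octet."""
    data = text.replace("_", " ").encode("ascii")
    return re.sub(rb"=([0-9A-Fa-f]{2})",
                  lambda m: bytes([int(m.group(1), 16)]), data)


def decode_header_value(value: str) -> str:
    """Decode every complete encoded word, or return value unchanged if any fails."""
    def replace(match):
        charset, encoding, text = match.groups()
        if encoding.upper() == "Q":
            raw = _q_to_bytes(text)
        elif encoding.upper() == "B":
            raw = base64.b64decode(text, validate=True)
        else:
            raise ValueError("invalid coding indicator")
        return raw.decode(charset)

    try:
        return ENCODED_WORD.sub(replace, value)
    except (ValueError, LookupError):
        # One invalid encoded word means the whole header passes through as-is.
        return value


# A complete word next to an incomplete one: only the complete word is decoded.
print(ascii(decode_header_value("=?utf-8?q?caf=C3=A9?= and =?utf-8?q?caf=C3=A9")))
# 'caf\xe9 and =?utf-8?q?caf=C3=A9'
```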

...