Encoding Non-ASCII Characters in Metadata

All non-ASCII characters in object headers must be rendered into a standard format for accurate metadata indexing and querying with Elasticsearch. Swarm supports character encoding so non-ASCII characters may be used in the HTTP headers of objects.

How Swarm Handles Non-ASCII

Non-ASCII characters in object metadata require extra processing by Swarm, which must convert them to a standard format for indexing in Elasticsearch. Swarm supports these characters with full metadata searching, provided they are encoded correctly. Here is a summary of how Swarm handles these characters:

  • Swarm tolerates validation failures: Swarm stores header values that fail validation or are left unencoded, so it does not disturb existing objects whose stored headers may fail decoding under the new header rules. Swarm does not reject any object based on an inability to decode encoded words in a header.

  • Swarm stores and returns header fields as-is: Swarm allows all string-typed headers to have multiple lines as well as encoded words. Swarm stores the header value as-is with the object metadata. When Swarm needs to use that value (such as for metadata indexing), it decodes the header field into Unicode and then operates on the decoded value. The original encoded persistent headers remain safely stored with the object and are returned when performing HEAD or GET operations against the object.

Exception

Different line breaks may display in multiple-line headers, since Swarm does not store the actual line breaks.

  • Metadata goes to Elasticsearch as Unicode: Swarm sends metadata to Elasticsearch as document attributes through the Elasticsearch API. Swarm decodes the metadata using the algorithms specified in RFC 2047 to achieve this. 

  • ISO-8859-1 encoding for headers: Swarm follows the HTTP/1.1 specification, which defines request header values as ISO-8859-1 characters. Swarm allows header values to be encoded according to RFC 2047 to store any Unicode characters (not limited to ASCII or ISO-8859-1). Applications can encode header values as RFC 2047 ‘encoded-words’ to safeguard treatment of non-ASCII characters. This encoding uses only ASCII characters. Swarm decodes it to get the original non-ASCII strings for indexing and searching.

Gateway

With Gateway, use only ASCII (not ISO-8859-1) characters in header values, even though ISO-8859-1 works with Swarm. ASCII header values can be RFC 2047 encoded Unicode characters, which support Elasticsearch indexing and searching.

  • Swarm encodes other character sets: Per the HTTP/1.1 specification, the field content of Swarm headers can be encoded, so clients can use characters from sets other than ISO-8859-1.

Note

Decoding affects performance, but the impact is minimal for fields with no special encodings; there is no performance impact unless large volumes of non-ASCII metadata are stored in a cluster that has full metadata searching enabled.

How to Encode Non-ASCII Characters

Suppose a header string containing a non-ASCII character, such as café, is sent to Swarm. The non-ASCII characters must be escaped if not encoded in ISO-8859-1. One header value can combine partial- and whole-word encoding:

  • Partial-word encoding: caf=?UTF-8?Q?=C3=A9?=

  • Whole-word encoding: =?UTF-8?Q?caf=C3=A9?=

The string “café red white café brown orange” can be handled both ways:

X-Alt-Meta-Name: =?UTF-8?Q?caf=C3=A9?= red white =?UTF-8?Q?caf=C3=A9_brown_orange?=
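The encoded words above can be produced programmatically. The following is a minimal Python sketch, not Swarm or client-library code: the helper names `q_encode_word` and `b_encode_word` are hypothetical, and the sketch deliberately escapes every byte that is not an ASCII letter or digit, which is stricter than RFC 2047 requires.

```python
import base64
import string

# Bytes written literally in this sketch; everything else becomes =XX.
# This is a conservative assumption, stricter than RFC 2047 requires.
_SAFE = frozenset((string.ascii_letters + string.digits).encode("ascii"))


def q_encode_word(text, charset="UTF-8"):
    """Return text as a single RFC 2047 'Q' encoded-word (hypothetical helper)."""
    parts = []
    for byte in text.encode(charset):
        if byte in _SAFE:
            parts.append(chr(byte))
        elif byte == 0x20:
            parts.append("_")             # Q encoding writes a space as "_"
        else:
            parts.append("=%02X" % byte)  # any other octet as =XX (hex)
    return "=?%s?Q?%s?=" % (charset, "".join(parts))


def b_encode_word(text, charset="UTF-8"):
    """Return text as a single RFC 2047 'B' (Base64) encoded-word (hypothetical helper)."""
    payload = base64.b64encode(text.encode(charset)).decode("ascii")
    return "=?%s?B?%s?=" % (charset, payload)


# Rebuild the X-Alt-Meta-Name value shown above: two encoded words with
# plain ASCII text between them.
header_value = " ".join(
    [q_encode_word("café"), "red white", q_encode_word("café brown orange")]
)
print(header_value)
print(b_encode_word("café"))
```

RFC 2047 limits an encoded word to 75 characters; this sketch does not split long values into several words.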

Examples of Decoding

This is how Swarm decodes the following header values:

ASCII

"alpha beta gamma"

'alpha beta gamma'

UTF-8

"=?utf-8?q?caf=C3=A9?="

u'caf\xe9'

UTF-8 Base64

"=?utf-8?b?Y2Fmw6k=?="

u'caf\xe9'

Complex UTF-8

"=?utf-8?q?caf=c3=a9?= aaa =?utf-8?q?caf=c3=a9?= bbb =?utf-8?q?caf=c3=a9?="

u'caf\xe9 aaa caf\xe9 bbb caf\xe9'

ISO-8859-1

"=?iso-8859-1?q?caf=E9?="

u'caf\xe9'
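To check how a value decodes before sending it to Swarm, a client can run it through an RFC 2047 decoder locally. The sketch below uses Python's standard `email.header` module as a stand-in; it is not Swarm's decoder, and the helper name `decode_value` is hypothetical. The standard library may be more lenient than the rules described in this document (for example, with Base64 padding), so treat it as a check for well-formed values only.

```python
from email.header import decode_header, make_header


def decode_value(value):
    """Decode RFC 2047 encoded-words in a header value to a Python str."""
    return str(make_header(decode_header(value)))


samples = [
    "alpha beta gamma",
    "=?utf-8?q?caf=C3=A9?=",
    "=?utf-8?b?Y2Fmw6k=?=",
    "=?utf-8?q?caf=c3=a9?= aaa =?utf-8?q?caf=c3=a9?= bbb =?utf-8?q?caf=c3=a9?=",
    "=?iso-8859-1?q?caf=E9?=",
]
for raw in samples:
    # ascii() shows non-ASCII characters as escapes, similar to the
    # u'caf\xe9' notation used in the examples above.
    print(ascii(decode_value(raw)))
```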

Valid Header Values

"alpha beta gamma"

Pure ASCII

"=?utf-8?q?caf=C3=A9?="

UTF-8

"=?utf-8?b?Y2Fmw6k=?="

UTF-8 Base64 encoded

Valid but Malformed Header Values

Incompletely formatted encodings are valid header values, but Swarm does not decode them because it does not recognize them as encoded. When the encoding format is malformed, Swarm treats the content as valid, unencoded text.

"=?utf-8?q?caf=C3=A9"

UTF-8 without the expected suffix

"=?utf-8?q?caf=C3=A9="

UTF-8 with a partial suffix

"utf-8?q?caf=C3=A9?="

UTF-8 without the expected prefix

"?utf-8?q?caf=C3=A9?="

UTF-8 with a partial prefix
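One way to picture why these values are left alone is a recognizer that only reports text carrying the complete "=?charset?encoding?text?=" frame. The following Python sketch is an illustration under that assumption; the pattern `ENCODED_WORD` and the helper `framed_words` are hypothetical and are not taken from Swarm.

```python
import re

# "=?" charset "?" encoding "?" encoded-text "?=", with no "?" or
# whitespace inside any of the three fields.
ENCODED_WORD = re.compile(r"=\?([^?\s]+)\?([^?\s]+)\?([^?\s]*)\?=")


def framed_words(value):
    """Return every substring of value that has the complete encoded-word frame."""
    return [match.group(0) for match in ENCODED_WORD.finditer(value)]


malformed = [
    "=?utf-8?q?caf=C3=A9",    # no suffix
    "=?utf-8?q?caf=C3=A9=",   # partial suffix
    "utf-8?q?caf=C3=A9?=",    # no prefix
    "?utf-8?q?caf=C3=A9?=",   # partial prefix
]
for value in malformed:
    # Nothing is recognized, so the value would be handled as plain text.
    print(value, framed_words(value))

print(framed_words("=?utf-8?q?caf=C3=A9?="))
```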

Invalid Header Values

"=?utf-8?j?caf=C3=A9?="

UTF-8 with an invalid coding indicator (not “Q” or “B”)

"=?utf-8?qcaf=C3=A9?="

UTF-8 missing an internal “?” separator character

"=?utf-9?q?caf=C3=A9?="

Invalid or unknown character encoding

"=?utf-8?q?caf=C3=FF?="

Invalid or unknown character

"=?utf-8?b?Y2-mw6k=?="

Base64 with invalid characters

"=?utf-8?b?Y2Fmw6k?="

Base64 with invalid padding

Decoding Limitations

Swarm does not perform the following:

Feature

Swarm Behavior

Workaround

Example

Disable decoding

Swarm has no configuration parameter to disable decoding of headers. Decoding rules are applied to all header values.

If you need a header value that looks like an encoded string but is meant to be taken literally, encode the content itself as ISO-8859-1, with the ? and = characters replaced by their octet values.

To have the header value passed as-is to Elasticsearch (instead of being decoded), replace the first form below with the second:

"=?UTF-8?Q?caf=C3=A9?="

"=?ISO-8859-1?Q?=3D=3FUTF-8=3FQ=3Fcaf=3DC3=3DA9=3F=3D?="

Unicode normalizing

Swarm does not perform Unicode normalization.

If various encodings of a word such as “café” need to match, standardize how such words are encoded.

Valid variants for encoding diacritics in UTF-8:

=?UTF-8?Q?caf=C3=A9?=

=?UTF-8?Q?caf=65=CC=81?=

Unicode case folding

Swarm performs no Unicode case folding.

None. For case-insensitive operations, Swarm always converts uppercase to lowercase, including for non-ASCII characters.

In ASCII, uppercasing a character and then lowercasing it always results in the same character. That is not always the case for Unicode characters.
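The workarounds for these limitations can be applied on the client before a value is written. The following Python sketch illustrates each one; it is not Swarm code. The helper `escape_literal` is hypothetical and assumes that every "=" and "?" in the literal (plus "_", spaces, and non-ASCII octets) is written as an octet value. The case example uses Python's own case mapping, which may not match Swarm's lowercasing exactly.

```python
import unicodedata


def escape_literal(value):
    """Wrap a literal string in an ISO-8859-1 'Q' encoded-word (hypothetical helper).

    After decoding, the result is the original literal text, for example
    text that itself looks like an encoded word.
    """
    parts = []
    for byte in value.encode("iso-8859-1"):
        ch = chr(byte)
        if ch in "=?_" or byte < 0x21 or byte > 0x7E:
            parts.append("=%02X" % byte)   # write the octet value
        else:
            parts.append(ch)
    return "=?ISO-8859-1?Q?" + "".join(parts) + "?="


print(escape_literal("=?UTF-8?Q?caf=C3=A9?="))

# Unicode normalization: the two variants decode to different strings,
# so a client should pick one form (NFC here) before encoding.
composed = "caf\u00e9"       # decoded form of =?UTF-8?Q?caf=C3=A9?=
decomposed = "cafe\u0301"    # decoded form of =?UTF-8?Q?caf=65=CC=81?=
print(composed == decomposed)                                # False
print(unicodedata.normalize("NFC", decomposed) == composed)  # True

# Case round trip: uppercasing and then lowercasing does not always
# return the original character outside ASCII.
print("ß".upper().lower())   # ss
```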

Troubleshooting Decoding

If Swarm does not decode a header as expected, review these possible reasons why Swarm found the encoding incomplete or invalid:

Problems in Encoded Word Structure

These examples have validation issues in the structure of the encoded-word framework, such as:

  • an incorrect starting or ending sequence

  • an issue in the “?” separators between the character set and the Q/B encoding indicator

  Swarm passes these types of strings as unencoded text.

Example

Error

'=?utf-8?Q?_brown=20=20and_blue?'

Missing the closing ‘=’. Per RFC 2047, an encoded word must start with “=?” and end with “?=”.

'=?utf-8Q?_brown=20=20and_blue?='

Missing the “?” between the “utf-8” and “Q” characters. 

'=?utf-8?J?_brown=20=20and_blue?='

J encoding is invalid.

'=?utf-8??Q?_brown=20=20and_blue?='

Extra “?” before “Q”.

Unknown or Unreadable Encodings

When Swarm encounters an encoded word with an unknown encoding or a valid encoding with any other problem, such as an invalid octet, it passes it through as-is:

Example

Error

'=?utf-9?Q?_brown=20=20and_blue?='

utf-9 is not a known encoding.

'=?utf-8?Q?_brown=20=FFand_blue?='

0xFF is not a valid octet in utf-8.

Problems with Base64 Encoding

Swarm passes through the original header as-is if either of the following validations fails in the Base64 encoding.

  • Characters: Base64-encoded words include only the characters A-Z, a-z, 0-9, +, and /, plus “=” as trailing padding; all other characters are invalid. If any invalid characters are present, Swarm treats the entire encoded word as invalid.

  • Padding: Base64 encodings include groups of 4-character sequences. Base64 encodings have trailing padding (with “=”) to maintain the string as a multiple of four characters. Swarm treats any Base64 encodings that lack the trailing padding as invalid.
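Both checks can be reproduced on the client before a value is sent. The sketch below uses the strict mode of Python's standard `base64` module; the helper name `decode_b_payload` is hypothetical, and the sketch only approximates the validation described above rather than reproducing Swarm's implementation.

```python
import base64


def decode_b_payload(payload):
    """Return the decoded bytes, or None if the Base64 text fails strict validation."""
    try:
        # validate=True rejects characters outside the Base64 alphabet
        # instead of silently skipping them; missing padding also fails.
        return base64.b64decode(payload, validate=True)
    except ValueError:  # binascii.Error is a subclass of ValueError
        return None


print(decode_b_payload("Y2Fmw6k="))   # b'caf\xc3\xa9'
print(decode_b_payload("Y2-mw6k="))   # None: "-" is not a Base64 character
print(decode_b_payload("Y2Fmw6k"))    # None: trailing "=" padding is missing
```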

Problems with One of Several Encoded Words

HTTP header content can contain more than one encoded word, but Swarm does not partially decode headers. If any encoded word in a header is invalid, the entire header is passed through without decoding.

If a header includes both a complete encoding and an incomplete encoding (text that looks like an encoded word but is missing either the leading “=?” prefix or the ending “?=” suffix), Swarm ignores the incomplete encoding (treating it like a valid non-encoded word) and decodes the complete word.
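The two rules above can be modeled in a few lines. The following Python sketch is an illustration under stated assumptions, not Swarm's decoder: the names `FRAMED`, `_decode_word`, and `decode_header_value` are hypothetical, only fully framed words are considered, and a single invalid word causes the whole value to be returned unchanged. It does not implement the RFC 2047 rule about whitespace between adjacent encoded words.

```python
import base64
import re

# Text with the complete "=?charset?encoding?text?=" frame.
FRAMED = re.compile(r"=\?([^?\s]+)\?([^?\s]+)\?([^?\s]*)\?=")


def _decode_word(charset, encoding, payload):
    """Decode one framed word; raise ValueError or LookupError if it is invalid."""
    if encoding in ("q", "Q"):
        data = re.sub(
            rb"=([0-9A-Fa-f]{2})",
            lambda m: bytes([int(m.group(1), 16)]),
            payload.replace("_", " ").encode("ascii"),
        )
    elif encoding in ("b", "B"):
        data = base64.b64decode(payload, validate=True)
    else:
        raise ValueError("unsupported encoding indicator: " + encoding)
    # Unknown charsets raise LookupError; invalid octets raise UnicodeDecodeError.
    return data.decode(charset)


def decode_header_value(value):
    """Decode every framed word, or return value unchanged if any word is invalid."""
    try:
        return FRAMED.sub(lambda m: _decode_word(*m.groups()), value)
    except (ValueError, LookupError):
        return value


# One invalid word (unknown charset): nothing is decoded.
print(decode_header_value("=?utf-8?q?caf=C3=A9?= and =?utf-9?q?caf=C3=A9?="))
# One incomplete word (no "?=" suffix): it stays as text, the other is decoded.
print(ascii(decode_header_value("=?utf-8?q?caf=C3=A9?= and =?utf-8?q?caf=C3=A9")))
```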

© DataCore Software Corporation. · https://www.datacore.com · All rights reserved.