Using the Hadoop S3A connector with Gateway S3

The Apache Hadoop S3 connector, S3A, works with Content Gateway S3. Below is a small but complete example that uses a single-node Hadoop system, which runs easily on any Docker server. It shows bucket listing, distcp, and a simple MapReduce job against a bucket in a Swarm domain.

Create a container from the Docker image at https://github.com/sequenceiq/hadoop-docker:
```
$ docker run -it sequenceiq/hadoop-docker:2.7.0 /etc/bootstrap.sh -bash
```
At the container's shell prompt, add the path containing the "hadoop" binary:
```
PATH=/usr/local/hadoop/bin:$PATH
```

Copy the JARs needed by the S3A connector into the HDFS lib directory:

cd /usr/local/hadoop-2.7.0/share/hadoop/tools/lib/ && cp -p hadoop-aws-2.7.0.jar aws-java-sdk-1.7.4.jar jackson-core-2.2.3.jar jackson-databind-2.2.3.jar jackson-annotations-2.2.3.jar /usr/local/hadoop-2.7.0/share/hadoop/hdfs/lib/
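
A missing JAR usually only shows up later as a class-loading error, so it can help to check the copy right away. The following is a minimal sketch, not part of the original procedure; the function name `check_s3a_jars` is hypothetical, and the file names are the ones used in the copy command above.

```shell
# Report any S3A-related JAR that is missing from the given directory.
# Returns 0 when all five JARs are present, 1 otherwise.
check_s3a_jars() {
    dir=$1
    missing=0
    for jar in hadoop-aws-2.7.0.jar aws-java-sdk-1.7.4.jar \
               jackson-core-2.2.3.jar jackson-databind-2.2.3.jar \
               jackson-annotations-2.2.3.jar; do
        if [ ! -f "$dir/$jar" ]; then
            echo "missing: $jar"
            missing=1
        fi
    done
    return "$missing"
}
```

For example, `check_s3a_jars /usr/local/hadoop-2.7.0/share/hadoop/hdfs/lib` prints nothing and returns success when the copy worked.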

Make sure your domain (mydomain.example.com) and bucket (hadoop-test) have been created, and that /etc/hosts or DNS resolves mydomain.example.com to your Content Gateway server, so that requests to http://mydomain.example.com/hadoop-test reach its S3 port.

Create an S3 token:

curl -i -u USERNAME -X POST --data-binary '' -H 'X-User-Secret-Key-Meta: secret' -H 'X-User-Token-Expires-Meta: +90' http://mydomain.example.com/.TOKEN/

HTTP/1.1 201 Created
...
Token e181dcb1d01d5cf24f76dd276b95a638 issued for USERNAME in [root] with secret secret
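
The token ID in the response becomes the S3 access key in the commands that follow, and the value sent in X-User-Secret-Key-Meta becomes the secret key. To avoid copying the ID by hand, it can be pulled out of the response with sed. This is a rough sketch that assumes the response body has the "Token <id> issued for ..." form shown above; `extract_token` is a hypothetical helper name.

```shell
# Print the token ID from a Gateway token response read on stdin.
# Assumes a body line of the form "Token <hex id> issued for ...",
# as in the sample response; all other lines are ignored.
extract_token() {
    sed -n 's/^Token \([0-9a-f][0-9a-f]*\) issued for .*/\1/p'
}

# Usage (the same request as above, with curl's -s to hide the progress meter):
#   ACCESS_KEY=$(curl -s -u USERNAME -X POST --data-binary '' \
#       -H 'X-User-Secret-Key-Meta: secret' \
#       -H 'X-User-Token-Expires-Meta: +90' \
#       http://mydomain.example.com/.TOKEN/ | extract_token)
```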

List your bucket (it should be empty).
```
hadoop fs -Dfs.s3a.access.key=e181dcb1d01d5cf24f76dd276b95a638 -Dfs.s3a.secret.key=secret -Dfs.s3a.endpoint=mydomain.example.com -ls s3a://hadoop-test/
```
Note: The error "No such file or directory" is expected if the bucket is empty or the trailing slash is missing.
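
Every hadoop command in this article repeats the same three -D options. One way to cut the repetition is a small shell helper, sketched below. The variable and function names (`S3A_ACCESS_KEY`, `S3A_SECRET_KEY`, `S3A_ENDPOINT`, `s3a_opts`) are hypothetical, and the sketch assumes none of the values contain whitespace or shell wildcard characters.

```shell
# Values from the steps above; replace with your own token, secret, and domain.
S3A_ACCESS_KEY=e181dcb1d01d5cf24f76dd276b95a638
S3A_SECRET_KEY=secret
S3A_ENDPOINT=mydomain.example.com

# Print the three S3A -D options, space-separated, on one line.
s3a_opts() {
    printf '%s %s %s\n' \
        "-Dfs.s3a.access.key=$S3A_ACCESS_KEY" \
        "-Dfs.s3a.secret.key=$S3A_SECRET_KEY" \
        "-Dfs.s3a.endpoint=$S3A_ENDPOINT"
}

# Usage (left unquoted on purpose so the shell splits it into three words):
#   hadoop fs $(s3a_opts) -ls s3a://hadoop-test/
```

Passing secrets on the command line exposes them in the process list and shell history, so this is a convenience for a test container rather than a recommendation for shared systems.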

Copy the sample "input" files into your bucket:

hadoop distcp -Dfs.s3a.access.key=e181dcb1d01d5cf24f76dd276b95a638 -Dfs.s3a.secret.key=secret -Dfs.s3a.endpoint=mydomain.example.com input s3a://hadoop-test/input

Verify with "-ls" or in the Content Gateway UI that the bucket now has about 31 objects.

hadoop fs -Dfs.s3a.access.key=e181dcb1d01d5cf24f76dd276b95a638 -Dfs.s3a.secret.key=secret -Dfs.s3a.endpoint=mydomain.example.com -ls s3a://hadoop-test/input/
Found 31 items
...

Run the sample MapReduce job that greps the input files:

hadoop jar /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.0.jar grep -Dfs.s3a.access.key=e181dcb1d01d5cf24f76dd276b95a638 -Dfs.s3a.secret.key=secret -Dfs.s3a.endpoint=mydomain.example.com s3a://hadoop-test/input s3a://hadoop-test/output 'dfs[a-z.]+'

Display the results, which the job writes to a part file under "output":

hadoop fs -Dfs.s3a.access.key=e181dcb1d01d5cf24f76dd276b95a638 -Dfs.s3a.secret.key=secret -Dfs.s3a.endpoint=mydomain.example.com -cat s3a://hadoop-test/output/part-r-00000
6	dfs.audit.logger
4	dfs.class
3	dfs.server.namenode.
2	dfs.period
2	dfs.audit.log.maxfilesize
...

PS: https://wiki.apache.org/hadoop/AmazonS3 makes a good point:

S3 is not a filesystem. The Hadoop S3 filesystem bindings make it pretend to be a filesystem, but it is not. It can act as a source of data, and as a destination -though in the latter case, you must remember that the output may not be immediately visible.
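
If a script reads job output right after the job finishes, one way to allow for delayed visibility is to poll for the object before using it. The helper below is a generic retry loop, offered as a sketch rather than anything Hadoop or Gateway provides; `retry_until_ok` is a hypothetical name.

```shell
# Run a command until it succeeds, up to max_tries attempts, sleeping
# delay seconds between attempts. Returns 0 on the first success,
# 1 if every attempt fails.
retry_until_ok() {
    max_tries=$1
    delay=$2
    shift 2
    try=1
    while [ "$try" -le "$max_tries" ]; do
        if "$@"; then
            return 0
        fi
        try=$((try + 1))
        if [ "$try" -le "$max_tries" ]; then
            sleep "$delay"
        fi
    done
    return 1
}

# Usage: wait up to 10 tries, 3 seconds apart, for the result file to exist.
#   retry_until_ok 10 3 hadoop fs -Dfs.s3a.access.key=e181dcb1d01d5cf24f76dd276b95a638 \
#       -Dfs.s3a.secret.key=secret -Dfs.s3a.endpoint=mydomain.example.com \
#       -test -e s3a://hadoop-test/output/part-r-00000
```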
