Using the Hadoop S3A connector with Content Gateway S3

The Apache Hadoop S3 connector "S3A" works with Content Gateway S3. Here is a small but complete example, using a single-node Hadoop system that can be easily run on any Docker host. It shows bucket listing, distcp, and a simple MapReduce job against a bucket in a Swarm domain.

  • Create a container using the Docker image from https://github.com/sequenceiq/hadoop-docker

    $ docker run -it sequenceiq/hadoop-docker:2.7.0 /etc/bootstrap.sh -bash

  • At the container's shell prompt, add the path containing the "hadoop" binary:

    PATH=/usr/local/hadoop/bin:$PATH

  • Copy the JARs needed for the S3A libraries:

    cd /usr/local/hadoop-2.7.0/share/hadoop/tools/lib/ && cp -p hadoop-aws-2.7.0.jar aws-java-sdk-1.7.4.jar jackson-core-2.2.3.jar jackson-databind-2.2.3.jar jackson-annotations-2.2.3.jar /usr/local/hadoop-2.7.0/share/hadoop/hdfs/lib/
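
    As a quick sanity check (using the same paths as above), you can confirm the copied JARs are now in the directory Hadoop picks up on its classpath:

    ls /usr/local/hadoop-2.7.0/share/hadoop/hdfs/lib/ | egrep 'hadoop-aws|aws-java-sdk|jackson'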

  • Make sure your domain (mydomain.example.com) and bucket (hadoop-test) have been created, and that your /etc/hosts or DNS is configured so that mydomain.example.com resolves to your Content Gateway server and its S3 port is reachable.
  • Create an S3 token

    curl -i -u USERNAME -X POST --data-binary '' -H 'X-User-Secret-Key-Meta: secret' -H 'X-User-Token-Expires-Meta: +90' http://mydomain.example.com/.TOKEN/

    HTTP/1.1 201 Created

    ...

    Token e181dcb1d01d5cf24f76dd276b95a638 issued for USERNAME in [root] with secret secret
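
    The steps below pass the token and secret explicitly with -D options each time. As an optional convenience (just a sketch, using an arbitrary shell variable name), you could collect the repeated S3A options once and reuse them:

    S3A_OPTS="-Dfs.s3a.access.key=e181dcb1d01d5cf24f76dd276b95a638 -Dfs.s3a.secret.key=secret -Dfs.s3a.endpoint=mydomain.example.com"
    hadoop fs $S3A_OPTS -ls s3a://hadoop-test/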

  • List your bucket (should be empty)

    hadoop fs -Dfs.s3a.access.key=e181dcb1d01d5cf24f76dd276b95a638 -Dfs.s3a.secret.key=secret -Dfs.s3a.endpoint=mydomain.example.com -ls s3a://hadoop-test/

    Note: the error "ls: `s3a://hadoop-test': No such file or directory" is expected if the bucket is empty or if you forget the trailing slash.
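
    Optionally, before running distcp, you can do a quick round-trip smoke test with a single small file (the object name hosts-test here is just an example):

    hadoop fs -Dfs.s3a.access.key=e181dcb1d01d5cf24f76dd276b95a638 -Dfs.s3a.secret.key=secret -Dfs.s3a.endpoint=mydomain.example.com -put /etc/hosts s3a://hadoop-test/hosts-test
    hadoop fs -Dfs.s3a.access.key=e181dcb1d01d5cf24f76dd276b95a638 -Dfs.s3a.secret.key=secret -Dfs.s3a.endpoint=mydomain.example.com -rm s3a://hadoop-test/hosts-test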

  • Copy the sample "input" files into your bucket:

    hadoop distcp -Dfs.s3a.access.key=e181dcb1d01d5cf24f76dd276b95a638 -Dfs.s3a.secret.key=secret -Dfs.s3a.endpoint=mydomain.example.com input s3a://hadoop-test/input

  • Verify with "-ls" or in the Content Gateway UI that the bucket now has ~31 streams:

    hadoop fs -Dfs.s3a.access.key=e181dcb1d01d5cf24f76dd276b95a638 -Dfs.s3a.secret.key=secret -Dfs.s3a.endpoint=mydomain.example.com -ls s3a://hadoop-test/input/

    Found 31 items ...
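
    Instead of counting the -ls lines by eye, you can also use "-count" (a standard fs shell option) to summarize directories, files, and bytes under the prefix:

    hadoop fs -Dfs.s3a.access.key=e181dcb1d01d5cf24f76dd276b95a638 -Dfs.s3a.secret.key=secret -Dfs.s3a.endpoint=mydomain.example.com -count s3a://hadoop-test/input/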

  • Run the sample MapReduce job that greps the input files:

    hadoop jar /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.0.jar grep -Dfs.s3a.access.key=e181dcb1d01d5cf24f76dd276b95a638 -Dfs.s3a.secret.key=secret -Dfs.s3a.endpoint=mydomain.example.com s3a://hadoop-test/input s3a://hadoop-test/output 'dfs[a-z.]+'
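
    Before displaying the results, you can list the "output" prefix; with the default committer a successful run normally leaves a _SUCCESS marker next to the part-r-* files, though the exact contents depend on the job configuration:

    hadoop fs -Dfs.s3a.access.key=e181dcb1d01d5cf24f76dd276b95a638 -Dfs.s3a.secret.key=secret -Dfs.s3a.endpoint=mydomain.example.com -ls s3a://hadoop-test/output/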

  • Display the results in the "output" directory:

    hadoop fs -Dfs.s3a.access.key=e181dcb1d01d5cf24f76dd276b95a638 -Dfs.s3a.secret.key=secret -Dfs.s3a.endpoint=mydomain.example.com -cat s3a://hadoop-test/output/part-r-00000

    6       dfs.audit.logger
    4       dfs.class
    3       dfs.server.namenode.
    2       dfs.period
    2       dfs.audit.log.maxfilesize
    ...
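
    To re-run the job, remember that MapReduce refuses to write to an existing output path, so remove the "output" prefix first (this deletes the objects under it):

    hadoop fs -Dfs.s3a.access.key=e181dcb1d01d5cf24f76dd276b95a638 -Dfs.s3a.secret.key=secret -Dfs.s3a.endpoint=mydomain.example.com -rm -r s3a://hadoop-test/output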

    PS: https://wiki.apache.org/hadoop/AmazonS3 makes a good point:

    "S3 is not a filesystem. The Hadoop S3 filesystem bindings make it pretend to be a filesystem, but it is not. It can act as a source of data, and as a destination -though in the latter case, you must remember that the output may not be immediately visible."
