Using the Hadoop S3A connector with Gateway S3

The Apache Hadoop S3 connector "S3A" works with Content Gateway S3. Here is a small but complete example, using a single-node Hadoop system that can be run easily on any Docker host. It shows bucket listing, distcp, and a simple MapReduce job against a bucket in a Swarm domain.

  1. Create a container using the docker image from https://github.com/sequenceiq/hadoop-docker.

    $ docker run -it sequenceiq/hadoop-docker:2.7.0 /etc/bootstrap.sh -bash
  2. At the container's shell prompt, add the directory containing the "hadoop" binary to your PATH:

    PATH=/usr/local/hadoop/bin:$PATH
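
    To make the PATH change persist in new shells inside the container (optional), it can also be appended to root's .bashrc:

    echo 'export PATH=/usr/local/hadoop/bin:$PATH' >> ~/.bashrc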
  3. Copy the JARs needed by the S3A connector into Hadoop's HDFS lib directory:

    cd /usr/local/hadoop-2.7.0/share/hadoop/tools/lib/ && cp -p hadoop-aws-2.7.0.jar aws-java-sdk-1.7.4.jar jackson-core-2.2.3.jar jackson-databind-2.2.3.jar jackson-annotations-2.2.3.jar /usr/local/hadoop-2.7.0/share/hadoop/hdfs/lib/
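
    As an optional check (not part of the original steps), confirm the JARs are now in place:

    ls /usr/local/hadoop-2.7.0/share/hadoop/hdfs/lib/ | grep -E 'aws|jackson'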
  4. Make sure your domain (mydomain.example.com) and bucket (hadoop-test) have been created, and that /etc/hosts or DNS resolves mydomain.example.com to your Content Gateway server so that requests to http://mydomain.example.com/hadoop-test reach the Gateway's S3 port.
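
    For example, if the Gateway's S3 front end is reachable at 10.0.1.50 (an example address), add an entry to the container's hosts file:

    echo "10.0.1.50   mydomain.example.com" >> /etc/hosts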

  5. Create an S3 token; the returned token value is used as the S3 access key and the value of the X-User-Secret-Key-Meta header as the S3 secret key:

    curl -i -u USERNAME -X POST --data-binary '' -H 'X-User-Secret-Key-Meta: secret' -H 'X-User-Token-Expires-Meta: +90' http://mydomain.example.com/.TOKEN/
    
    HTTP/1.1 201 Created
    ...
    Token e181dcb1d01d5cf24f76dd276b95a638 issued for USERNAME in [root] with secret secret
  6. List your bucket (it should be empty).

    hadoop fs -Dfs.s3a.access.key=e181dcb1d01d5cf24f76dd276b95a638 -Dfs.s3a.secret.key=secret -Dfs.s3a.endpoint=mydomain.example.com -ls s3a://hadoop-test/

    Note: A "No such file or directory" error is expected if the bucket is empty or the trailing slash is missing.
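
    If the Gateway's S3 front end listens on a non-default port or without TLS, the endpoint port and SSL setting can be supplied the same way (port 8090 below is only an example):

    hadoop fs -Dfs.s3a.access.key=e181dcb1d01d5cf24f76dd276b95a638 -Dfs.s3a.secret.key=secret -Dfs.s3a.endpoint=mydomain.example.com:8090 -Dfs.s3a.connection.ssl.enabled=false -ls s3a://hadoop-test/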

  7. Copy the sample "input" files into your bucket: 

    hadoop distcp -Dfs.s3a.access.key=e181dcb1d01d5cf24f76dd276b95a638 -Dfs.s3a.secret.key=secret -Dfs.s3a.endpoint=mydomain.example.com input s3a://hadoop-test/input
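
    distcp can also copy in the other direction, for example back from the bucket into HDFS (the "input-copy" target is only an example path):

    hadoop distcp -Dfs.s3a.access.key=e181dcb1d01d5cf24f76dd276b95a638 -Dfs.s3a.secret.key=secret -Dfs.s3a.endpoint=mydomain.example.com s3a://hadoop-test/input input-copy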
  8. Verify with "-ls" or in the Content Gateway UI that the bucket now holds ~31 objects:

    hadoop fs -Dfs.s3a.access.key=e181dcb1d01d5cf24f76dd276b95a638 -Dfs.s3a.secret.key=secret -Dfs.s3a.endpoint=mydomain.example.com -ls s3a://hadoop-test/input/
    Found 31 items
    ...

  9. Run the sample MapReduce job, which greps the input files:

    hadoop jar /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.0.jar grep -Dfs.s3a.access.key=e181dcb1d01d5cf24f76dd276b95a638 -Dfs.s3a.secret.key=secret -Dfs.s3a.endpoint=mydomain.example.com s3a://hadoop-test/input s3a://hadoop-test/output 'dfs[a-z.]+'
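
    If you rerun the job, delete the previous output directory first; the grep example fails if s3a://hadoop-test/output already exists:

    hadoop fs -Dfs.s3a.access.key=e181dcb1d01d5cf24f76dd276b95a638 -Dfs.s3a.secret.key=secret -Dfs.s3a.endpoint=mydomain.example.com -rm -r s3a://hadoop-test/output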
  10. Display the results written to the "output" directory:

    hadoop fs -Dfs.s3a.access.key=e181dcb1d01d5cf24f76dd276b95a638 -Dfs.s3a.secret.key=secret -Dfs.s3a.endpoint=mydomain.example.com -cat s3a://hadoop-test/output/part-r-00000
    6	dfs.audit.logger
    4	dfs.class
    3	dfs.server.namenode.
    2	dfs.period
    2	dfs.audit.log.maxfilesize
    ...
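
    To keep a local copy of the results (an optional extra step), download the part file with "-get"; /tmp/grep-results.txt is only an example destination:

    hadoop fs -Dfs.s3a.access.key=e181dcb1d01d5cf24f76dd276b95a638 -Dfs.s3a.secret.key=secret -Dfs.s3a.endpoint=mydomain.example.com -get s3a://hadoop-test/output/part-r-00000 /tmp/grep-results.txt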

PS: https://wiki.apache.org/hadoop/AmazonS3 makes a good point:

S3 is not a filesystem. The Hadoop S3 filesystem bindings make it pretend to be a filesystem, but it is not. It can act as a source of data, and as a destination, though in the latter case you must remember that the output may not be immediately visible.


