The Apache Hadoop S3 connector "S3A" works with Content Gateway S3. Here is a small but complete example using a single-node Hadoop system that can easily be run on any Docker server. It shows bucket listing, distcp, and a simple mapreduce job against a bucket in a DataCore Swarm domain.

  1. Create a container using the docker image from https://github.com/sequenceiq/hadoop-docker.

    Code Block
    $ docker run -it sequenceiq/hadoop-docker:2.7.0 /etc/bootstrap.sh -bash


  2. At the container's shell prompt, add the path containing the "hadoop" binary:

    Code Block
    PATH=/usr/local/hadoop/bin:$PATH


  3. Copy the jars needed by S3A into the HDFS lib directory:

    Code Block
    cd /usr/local/hadoop-2.7.0/share/hadoop/tools/lib/ && cp -p hadoop-aws-2.7.0.jar aws-java-sdk-1.7.4.jar jackson-core-2.2.3.jar jackson-databind-2.2.3.jar jackson-annotations-2.2.3.jar /usr/local/hadoop-2.7.0/share/hadoop/hdfs/lib/
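
    To confirm the copy worked, you can list the aws and jackson jars now present in the hdfs lib directory (same paths as above):

    Code Block
    # show the S3A-related jars copied in the previous command
    ls -l /usr/local/hadoop-2.7.0/share/hadoop/hdfs/lib/ | grep -E 'aws|jackson'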


  4. Make sure your domain (mydomain.example.com) and bucket (hadoop-test) have been created, and that /etc/hosts or DNS resolves mydomain.example.com to your Content Gateway server so that requests reach its S3 port.
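
    For example, inside the container you can add a hosts entry that points the domain at your Content Gateway (the IP below is a placeholder; substitute your gateway's address):

    Code Block
    # placeholder address - replace 192.0.2.10 with your Content Gateway's IP
    echo '192.0.2.10 mydomain.example.com' >> /etc/hosts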

  5. Create an S3 token:

    Code Block
    curl -i -u USERNAME -X POST --data-binary '' -H 'X-User-Secret-Key-Meta: secret' -H 'X-User-Token-Expires-Meta: +90' http://mydomain.example.com/.TOKEN/
    
    HTTP/1.1 201 Created
    ...
    Token e181dcb1d01d5cf24f76dd276b95a638 issued for USERNAME in [root] with secret secret
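
    If you want to script the steps that follow, the token can be captured into a shell variable; this is only a convenience sketch, assuming the response body has the "Token <id> issued for ..." format shown above:

    Code Block
    # capture the token id (second field of the "Token ... issued for ..." line)
    TOKEN=$(curl -s -u USERNAME -X POST --data-binary '' -H 'X-User-Secret-Key-Meta: secret' -H 'X-User-Token-Expires-Meta: +90' http://mydomain.example.com/.TOKEN/ | awk '/^Token/ {print $2}')
    echo "$TOKEN"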


  6. List your bucket (it should be empty).

    Code Block
    hadoop fs -Dfs.s3a.access.key=e181dcb1d01d5cf24f76dd276b95a638 -Dfs.s3a.secret.key=secret -Dfs.s3a.endpoint=mydomain.example.com -ls s3a://hadoop-test/

    Note: The error "No such file or directory" is expected if the bucket is empty or the trailing slash is missing.

  7. Copy the sample "input" files into your bucket: 

    Code Block
    hadoop distcp -Dfs.s3a.access.key=e181dcb1d01d5cf24f76dd276b95a638 -Dfs.s3a.secret.key=secret -Dfs.s3a.endpoint=mydomain.example.com input s3a://hadoop-test/input


  8. Verify with "-ls" or in the Content Gateway UI that the bucket now has ~31 objects. 

    Code Block
    hadoop fs -Dfs.s3a.access.key=e181dcb1d01d5cf24f76dd276b95a638 -Dfs.s3a.secret.key=secret -Dfs.s3a.endpoint=mydomain.example.com -ls s3a://hadoop-test/input/
    Found 31 items
    ...



  9. Run the sample mapreduce job that greps the input files. 

    Code Block
    hadoop jar /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.0.jar grep -Dfs.s3a.access.key=e181dcb1d01d5cf24f76dd276b95a638 -Dfs.s3a.secret.key=secret -Dfs.s3a.endpoint=mydomain.example.com s3a://hadoop-test/input s3a://hadoop-test/output 'dfs[a-z.]+'
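
    Note: mapreduce requires that the output location not exist yet, so if you re-run the job, remove the previous results first (this deletes everything under "output"):

    Code Block
    hadoop fs -Dfs.s3a.access.key=e181dcb1d01d5cf24f76dd276b95a638 -Dfs.s3a.secret.key=secret -Dfs.s3a.endpoint=mydomain.example.com -rm -r s3a://hadoop-test/output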


  10. Display the results from the "output" directory. 

    Code Block
    hadoop fs -Dfs.s3a.access.key=e181dcb1d01d5cf24f76dd276b95a638 -Dfs.s3a.secret.key=secret -Dfs.s3a.endpoint=mydomain.example.com -cat s3a://hadoop-test/output/part-r-00000
    6	dfs.audit.logger
    4	dfs.class
    3	dfs.server.namenode.
    2	dfs.period
    2	dfs.audit.log.maxfilesize
    ...
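
Rather than repeating the credentials on every command, the fs.s3a.access.key, fs.s3a.secret.key, and fs.s3a.endpoint values can be kept in shell variables (or set once in Hadoop's core-site.xml under the same property names). A minimal sketch using the values from the steps above:

    Code Block
    # S3A settings from the steps above
    S3A_ACCESS=e181dcb1d01d5cf24f76dd276b95a638
    S3A_SECRET=secret
    S3A_ENDPOINT=mydomain.example.com
    hadoop fs -Dfs.s3a.access.key=$S3A_ACCESS -Dfs.s3a.secret.key=$S3A_SECRET -Dfs.s3a.endpoint=$S3A_ENDPOINT -ls s3a://hadoop-test/input/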



...