
The Apache Hadoop S3 connector "S3A" works with Content Gateway S3. Here is a small but complete example, using a single-node Hadoop system that can be easily run on any Docker server. It shows bucket listing, distcp, and a simple MapReduce job against a bucket in a Caringo domain.

  1. Create a container using the docker image from https://github.com/sequenceiq/hadoop-docker.

    Code Block
    $ docker run -it sequenceiq/hadoop-docker:2.7.0 /etc/bootstrap.sh -bash


  2. At the container's shell prompt, add the path containing the "hadoop" binary:

    Code Block
    PATH=/usr/local/hadoop/bin:$PATH


  3. Copy the JARs needed for the S3A libraries:

    Code Block
    cd /usr/local/hadoop-2.7.0/share/hadoop/tools/lib/ && cp -p hadoop-aws-2.7.0.jar aws-java-sdk-1.7.4.jar jackson-core-2.2.3.jar jackson-databind-2.2.3.jar jackson-annotations-2.2.3.jar /usr/local/hadoop-2.7.0/share/hadoop/hdfs/lib/


  4. Make sure your domain (mydomain.example.com) and bucket (hadoop-test) have been created, and that your /etc/hosts or DNS resolves mydomain.example.com to your Content Gateway server so that http://mydomain.example.com/hadoop-test reaches its S3 port.

  5. Create an S3 token:

    Code Block
    curl -i -u USERNAME -X POST --data-binary '' -H 'X-User-Secret-Key-Meta: secret' -H 'X-User-Token-Expires-Meta: +90' http://mydomain.example.com/.TOKEN/
    HTTP/1.1 201 Created
    ...
    Token e181dcb1d01d5cf24f76dd276b95a638 issued for USERNAME in [root] with secret secret
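
The token ID in the response can be captured into a shell variable instead of being copied by hand. The following is a minimal sketch, assuming the response body has exactly the "Token ... issued for ..." form shown above; the function name extract_token is hypothetical and not part of Content Gateway or Hadoop.

```shell
# Hypothetical helper: read a .TOKEN/ response on stdin and print the token ID.
# Assumes the body contains a line of the form:
#   Token <hex digits> issued for USERNAME in ...
extract_token() {
    sed -n 's/^Token \([0-9a-f][0-9a-f]*\) issued for .*/\1/p'
}

# Example (not run here): pipe the curl response through the helper.
# TOKEN=$(curl -s -u USERNAME -X POST --data-binary '' \
#     -H 'X-User-Secret-Key-Meta: secret' \
#     -H 'X-User-Token-Expires-Meta: +90' \
#     http://mydomain.example.com/.TOKEN/ | extract_token)
```

If the response format differs in your release, the pattern would need adjusting; an empty result means no matching line was found.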


  6. List your bucket (it should be empty).

    Code Block
    hadoop fs -Dfs.s3a.access.key=e181dcb1d01d5cf24f76dd276b95a638 -Dfs.s3a.secret.key=secret -Dfs.s3a.endpoint=mydomain.example.com -ls s3a://hadoop-test/

    Note: The error "No such file or directory" is expected if the bucket is empty or the trailing slash is missing.
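
Every hadoop command in this example repeats the same three -D options. As a convenience, they can be generated by a small shell function. This is only a sketch; the name s3a_opts is hypothetical, and it assumes the key, secret, and endpoint contain no whitespace.

```shell
# Hypothetical helper: print the three S3A options shared by the commands here.
# Arguments: access key (the token ID), secret key, endpoint host.
s3a_opts() {
    printf '%s %s %s\n' \
        "-Dfs.s3a.access.key=$1" \
        "-Dfs.s3a.secret.key=$2" \
        "-Dfs.s3a.endpoint=$3"
}

# Example (not run here). The command substitution is left unquoted on purpose
# so that the shell splits the output into three separate arguments:
# hadoop fs $(s3a_opts e181dcb1d01d5cf24f76dd276b95a638 secret mydomain.example.com) -ls s3a://hadoop-test/
```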

  7. Copy the sample "input" files into your bucket:

    Code Block
    hadoop distcp -Dfs.s3a.access.key=e181dcb1d01d5cf24f76dd276b95a638 -Dfs.s3a.secret.key=secret -Dfs.s3a.endpoint=mydomain.example.com input s3a://hadoop-test/input

  8. Verify with "-ls" or in the Content Gateway UI that the bucket now has ~31 objects:

    Code Block
    hadoop fs -Dfs.s3a.access.key=e181dcb1d01d5cf24f76dd276b95a638 -Dfs.s3a.secret.key=secret -Dfs.s3a.endpoint=mydomain.example.com -ls s3a://hadoop-test/input/
    Found 31 items
    ...
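
The object count can also be checked in a script by reading the "Found N items" summary line from the listing. A minimal sketch, assuming the listing begins with that summary line as in the output above; the function name found_count is hypothetical.

```shell
# Hypothetical helper: read "hadoop fs -ls" output on stdin and print the
# number taken from its "Found N items" summary line.
found_count() {
    awk '/^Found [0-9]+ items/ { print $2; exit }'
}

# Example (not run here):
# hadoop fs -Dfs.s3a.access.key=e181dcb1d01d5cf24f76dd276b95a638 \
#     -Dfs.s3a.secret.key=secret -Dfs.s3a.endpoint=mydomain.example.com \
#     -ls s3a://hadoop-test/input/ | found_count
```

The helper prints nothing when no summary line is present, which is also what happens when the listing itself fails.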



  9. Run the sample MapReduce job that greps the input files:

    Code Block
    hadoop jar /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.0.jar grep -Dfs.s3a.access.key=e181dcb1d01d5cf24f76dd276b95a638 -Dfs.s3a.secret.key=secret -Dfs.s3a.endpoint=mydomain.example.com s3a://hadoop-test/input s3a://hadoop-test/output 'dfs[a-z.]+'

  10. Display the results from the "output" directory:

    Code Block
    hadoop fs -Dfs.s3a.access.key=e181dcb1d01d5cf24f76dd276b95a638 -Dfs.s3a.secret.key=secret -Dfs.s3a.endpoint=mydomain.example.com -cat s3a://hadoop-test/output/part-r-00000
    6	dfs.audit.logger
    4	dfs.class
    3	dfs.server.namenode.
    2	dfs.period
    2	dfs.audit.log.maxfilesize
    ...



PS: https://wiki.apache.org/hadoop/AmazonS3 makes a good point:

S3 is not a filesystem. The Hadoop S3 filesystem bindings make it pretend to be a filesystem, but it is not. It can act as a source of data, and as a destination -though in the latter case, you must remember that the output may not be immediately visible.
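
Because output written through the S3 bindings may not be immediately visible, a script that reads results right after a job finishes can wrap the read in a simple retry. The following is a sketch under that assumption, not a statement about how Content Gateway behaves; the function name retry_until and the attempt and delay values are hypothetical.

```shell
# Hypothetical helper: run a command until it succeeds, giving up after a
# maximum number of attempts and sleeping between tries.
# Usage: retry_until MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
retry_until() {
    retry_max=$1
    retry_delay=$2
    shift 2
    retry_attempt=1
    while ! "$@"; do
        if [ "$retry_attempt" -ge "$retry_max" ]; then
            return 1    # still failing after the last allowed attempt
        fi
        retry_attempt=$((retry_attempt + 1))
        sleep "$retry_delay"
    done
    return 0
}

# Example (not run here): try up to 5 times, 2 seconds apart.
# retry_until 5 2 hadoop fs -Dfs.s3a.access.key=e181dcb1d01d5cf24f76dd276b95a638 \
#     -Dfs.s3a.secret.key=secret -Dfs.s3a.endpoint=mydomain.example.com \
#     -cat s3a://hadoop-test/output/part-r-00000
```

A retry only hides short delays; a command that fails for another reason, such as a wrong endpoint, will simply fail after the last attempt.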
