Apache Spark and Swarm S3 Installation

The following guide is for Ubuntu 16.04 LTS with Apache Spark 2.3.0.

Important

This outlines the steps for a standalone demo or test configuration. For production, set up a cluster of at least three nodes: one master and two worker nodes.
  1. First, update all of the OS packages. 

    apt-get update
    apt-get upgrade
  2. Download the latest version of Spark, prebuilt for Hadoop, from Apache.
    Download spark-2.3.0-bin-hadoop2.7.tgz from http://spark.apache.org/downloads.html
  3. Java should already be installed on your Ubuntu system, but you also need Scala. Install it, then extract the Spark archive and move it into place:

    apt-get install scala
    
    tar xvzf spark-2.3.0-bin-hadoop2.7.tgz
    mv spark-2.3.0-bin-hadoop2.7 /usr/local/spark
  4. Add the following to ~/.bashrc so that the Spark binaries are on your PATH:

    export SPARK_HOME=/usr/local/spark
    export PATH=$SPARK_HOME/bin:$PATH
  5. Source ~/.bashrc to load the changes, or log in again.
  6. To start the Spark Scala shell with the hadoop-aws package, run the following:

    spark-shell --packages org.apache.hadoop:hadoop-aws:2.7.5
  7. You will then see output similar to the following:

    Using Scala version 2.11.8 (OpenJDK 64-Bit Server VM, Java 1.8.0_162)
    
    Type in expressions to have them evaluated.
    
    Type :help for more information.
    
    scala>
  8. Using the Content UI, create a domain such as "s3.demo.example.local" and a bucket called "sparktest".
  9. Open the sparktest bucket and upload a test text file called castor_license.txt.
  10. Create an S3 token for this domain and pass it to Spark, as follows (the Additional Examples below show how to apply the same settings when a session is first created):

    scala> sc.hadoopConfiguration.set("fs.s3a.access.key", "2d980ac1fabda441d1b15472620cb9c0")
    scala> sc.hadoopConfiguration.set("fs.s3a.secret.key", "caringo")
    scala> sc.hadoopConfiguration.set("fs.s3a.connection.ssl.enabled", "false")
    scala> sc.hadoopConfiguration.set("fs.s3a.path.style.access", "true")
    scala> sc.hadoopConfiguration.set("fs.s3a.endpoint", "http://s3.demo.example.local")
  11. To test, run a command to read the file from Swarm via S3 and count the lines in it (see the Additional Examples below for counting the words instead):

    scala> sc.textFile("s3a://sparktest/castor_license.txt").count()
    
    res5: Long = 39

    Note

    The s3a URI syntax is s3a://<BUCKET>/<FOLDER>/<FILE> or s3a://<BUCKET>/<FILE>.

  12. These three commands run a word count against the text file in S3 and write the results of the reduce operation back to S3 (see the Additional Examples below for reading the saved results back).

    scala> val sonnets = sc.textFile("s3a://sparktest/castor_license.txt")
    sonnets: org.apache.spark.rdd.RDD[String] = s3a://sparktest/castor_license.txt MapPartitionsRDD[3] at textFile at <console>:24
    
    scala> val counts = sonnets.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
    counts: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[6] at reduceByKey at <console>:25
    
    scala> counts.saveAsTextFile("s3a://sparktest/output.txt")
  13. Browse to the Spark web UI to see the jobs and stages (see the UI examples below):

    http://<IP>:4040/jobs/
    http://<IP>:4040/stages/stage/?id=2&attempt=0
  14. S3 access also works over SSL (even with self-signed certificates):

    scala> sc.hadoopConfiguration.set("fs.s3a.connection.ssl.enabled", "true");
    
    scala> sc.hadoopConfiguration.set("fs.s3a.endpoint", "https://s3.demo.sales.local")
    
    scala> sc.textFile("s3a://sparktest/castor_license.txt").count()
    res8: Long = 39
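
Additional Examples

The S3A settings in step 10 can also be applied once, when a session is first created, instead of on a running SparkContext. The sketch below shows this for a standalone application using Spark's spark.hadoop.* configuration prefix; in the interactive shell, which already has a session, the sc.hadoopConfiguration calls from step 10 remain the simplest approach. The application name is arbitrary, and the credentials, endpoint, and bucket are the demo values from the steps above.

    // Minimal sketch: apply the step 10 S3A settings at session creation via
    // the spark.hadoop.* prefix. The app name is arbitrary; the credentials,
    // endpoint, and bucket are the demo values used in this guide.
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("swarm-s3a-demo")
      .config("spark.hadoop.fs.s3a.access.key", "2d980ac1fabda441d1b15472620cb9c0")
      .config("spark.hadoop.fs.s3a.secret.key", "caringo")
      .config("spark.hadoop.fs.s3a.connection.ssl.enabled", "false")
      .config("spark.hadoop.fs.s3a.path.style.access", "true")
      .config("spark.hadoop.fs.s3a.endpoint", "http://s3.demo.example.local")
      .getOrCreate()

    // The spark.hadoop.* values are copied into the Hadoop configuration of the
    // underlying SparkContext, so s3a:// URIs work exactly as in step 11.
    val sc = spark.sparkContext
    println(sc.textFile("s3a://sparktest/castor_license.txt").count())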
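
Note that count() in step 11 returns the number of lines in the file, not the number of words. To count the words instead, split each line first; this one-line sketch reuses the same demo bucket and file:

    scala> sc.textFile("s3a://sparktest/castor_license.txt").flatMap(_.split(" ")).count()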
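
saveAsTextFile in step 12 writes a set of part-xxxxx objects under the output.txt prefix rather than a single object. To inspect the saved results from the same shell session, something like the following works (it reuses the counts RDD defined in step 12):

    // Read the saved (word, count) pairs back from the output.txt prefix.
    scala> sc.textFile("s3a://sparktest/output.txt").take(5).foreach(println)

    // Reuse the "counts" RDD from step 12 to list the ten most frequent words.
    scala> counts.sortBy(_._2, ascending = false).take(10).foreach(println)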

Spark Web UI Examples

