...

  1. First, update all of the OS packages. 

    Code Block
    languagebash
    apt-get update
    apt-get upgrade


  2. Download the latest Spark release, precompiled with Hadoop, from Apache.
    Download spark-2.3.0-bin-hadoop2.7.tgz from http://spark.apache.org/downloads.html, for example as shown below.
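
    Mirror links change over time; for this version the tarball can also be fetched directly from the Apache archive (check the downloads page for the current link):

    Code Block
    languagebash
    wget https://archive.apache.org/dist/spark/spark-2.3.0/spark-2.3.0-bin-hadoop2.7.tgz
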
  3. Java should already be installed on your Ubuntu system, but you will also need Scala. Install it, then extract the Spark tarball and move it into place:

    Code Block
    languagebash
    apt-get install scala
    
    tar xvzf spark-2.3.0-bin-hadoop2.7.tgz
    mv spark-2.3.0-bin-hadoop2.7 /usr/local/spark


  4. Add the following to ~/.bashrc so that the Spark binaries can be found on the PATH:

    Code Block
    languagebash
    export SPARK_HOME=/usr/local/spark
    export PATH=$SPARK_HOME/bin:$PATH


  5. Source ~/.bashrc to load the new environment variables, or simply log in again.
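
    For example, to reload it in the current shell:

    Code Block
    languagebash
    source ~/.bashrc
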
  6. To start the Spark Scala shell with the hadoop-aws plugin, run the following:

    Code Block
    languagebash
    spark-shell --packages org.apache.hadoop:hadoop-aws:2.7.5


  7. You will then see output similar to the following:

    Code Block
    languagebash
    Using Scala version 2.11.8 (OpenJDK 64-Bit Server VM, Java 1.8.0_162)
    
    Type in expressions to have them evaluated.
    
    Type :help for more information.
    
    scala>


  8. Using the Content UI, create a domain such as "s3.demo.example.local" and a bucket called "sparktest".
  9. Open the sparktest bucket and upload a test text file called castor_license.txt.
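
    Alternatively, if the AWS CLI is installed and configured with an S3 token for this domain (created in the next step), the bucket and test file can be set up from the command line. The file name here is just the example used throughout this guide:

    Code Block
    languagebash
    # if needed, force path-style addressing to match the Spark settings below
    aws configure set default.s3.addressing_style path
    aws s3api create-bucket --bucket sparktest --endpoint-url http://s3.demo.example.local
    aws s3 cp castor_license.txt s3://sparktest/ --endpoint-url http://s3.demo.example.local
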
  10. Create an S3 token for this domain and pass it to Spark, as follows:

    Code Block
    languagebash
    scala> sc.hadoopConfiguration.set("fs.s3a.access.key", "2d980ac1fabda441d1b15472620cb9c0")
    scala> sc.hadoopConfiguration.set("fs.s3a.secret.key", "caringo")
    scala> sc.hadoopConfiguration.set("fs.s3a.connection.ssl.enabled", "false")
    scala> sc.hadoopConfiguration.set("fs.s3a.path.style.access", "true")
    scala> sc.hadoopConfiguration.set("fs.s3a.endpoint", "http://s3.demo.example.local")


  11. To test, run a command that reads the file from Swarm via S3 and counts the lines in it:

    Code Block
    languagebash
    scala> sc.textFile("s3a://sparktest/castor_license.txt").count()
    
    res5: Long = 39


    Info
    titleNote

    s3a URI syntax is s3a://<BUCKET>/<FOLDER>/<FILE> or s3a://<BUCKET>/<FILE>


  12. The following three commands run a word count against the text file in S3 and write the results of the reduce operation back to S3:

    Code Block
    languagebash
    scala> val sonnets = sc.textFile("s3a://sparktest/castor_license.txt")
    sonnets: org.apache.spark.rdd.RDD[String] = s3a://sparktest/castor_license.txt MapPartitionsRDD[3] at textFile at <console>:24
    
    scala> val counts = sonnets.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
    counts: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[6] at reduceByKey at <console>:25
    
    scala> counts.saveAsTextFile("s3a://sparktest/output.txt")
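
    Note that saveAsTextFile writes the results as a set of part-NNNNN objects under the output.txt prefix rather than as a single file. One way to verify the results from the same shell:

    Code Block
    languagebash
    scala> sc.textFile("s3a://sparktest/output.txt").take(5).foreach(println)
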


  13. Browse to the Spark UI to see the jobs and stages (see the UI examples below):

    Code Block
    languagebash
    http://<IP>:4040/jobs/
    http://<IP>:4040/stages/stage/?id=2&attempt=0


  14. It also works over SSL (even with self-signed certificates):

    Code Block
    languagebash
    scala> sc.hadoopConfiguration.set("fs.s3a.connection.ssl.enabled", "true");
    
    scala> sc.hadoopConfiguration.set("fs.s3a.endpoint", "https://s3.demo.sales.local")
    
    scala> sc.textFile("s3a://sparktest/castor_license.txt").count()
    res8: Long = 39
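
    If the JVM does not already trust the self-signed certificate, spark-shell will fail the TLS handshake. One option is to import the certificate into the Java truststore before starting the shell; the certificate file name below is just a placeholder, and the truststore path shown is the Java 8 default:

    Code Block
    languagebash
    # import the endpoint's self-signed certificate into the JVM truststore (default password is "changeit")
    keytool -importcert -alias swarm-s3 -file /tmp/s3-demo-cert.pem \
      -keystore $JAVA_HOME/jre/lib/security/cacerts -storepass changeit
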


...