...

  1. First, update all of the OS packages. 

    Code Block
    languagebash
    apt-get update
    apt-get upgrade


  2. Download the latest Spark release, precompiled with Hadoop, from Apache.
    Download spark-2.3.0-bin-hadoop2.7.tgz from http://spark.apache.org/downloads.html, for example as shown below.
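
    Mirror links change over time; for this version the tarball can also be fetched directly from the Apache archive (check the downloads page for the current link):

    Code Block
    languagebash
    wget https://archive.apache.org/dist/spark/spark-2.3.0/spark-2.3.0-bin-hadoop2.7.tgz
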
  3. Java should already be installed on your Ubuntu system, but you will also need Scala. Install it, then extract the Spark tarball and move it into place:

    Code Block
    languagebash
    apt-get install scala
    
    tar xvzf spark-2.3.0-bin-hadoop2.7.tgz
    mv spark-2.3.0-bin-hadoop2.7 /usr/local/spark


  4. Add the following to ~/.bashrc so that the Spark binaries can be found on the PATH:

    Code Block
    languagebash
    export SPARK_HOME=/usr/local/spark
    export PATH=$SPARK_HOME/bin:$PATH


  5. Source ~/.bashrc to load the new environment variables, or simply log in again.
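
    For example, to reload it in the current shell:

    Code Block
    languagebash
    source ~/.bashrc
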
  6. To start the Spark Scala shell with the hadoop-aws plugin, run the following:

    Code Block
    languagebash
    spark-shell --packages org.apache.hadoop:hadoop-aws:2.7.5


  7. You will then see output similar to the following:

    Code Block
    languagebash
    Using Scala version 2.11.8 (OpenJDK 64-Bit Server VM, Java 1.8.0_162)
    
    Type in expressions to have them evaluated.
    
    Type :help for more information.
    
    scala>


  8. Using the Content UI, create a domain such as "s3.demo.example.local" and a bucket called "sparktest".
  9. Open the sparktest bucket and upload a test text file called castor_license.txt.
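
    Alternatively, if the AWS CLI is installed and configured with an S3 token for this domain (created in the next step), the bucket and test file can be set up from the command line. The file name here is just the example used throughout this guide:

    Code Block
    languagebash
    # if needed, force path-style addressing to match the Spark settings below
    aws configure set default.s3.addressing_style path
    aws s3api create-bucket --bucket sparktest --endpoint-url http://s3.demo.example.local
    aws s3 cp castor_license.txt s3://sparktest/ --endpoint-url http://s3.demo.example.local
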
  10. Create an S3 token for this domain and pass it to Spark, as follows:

    Code Block
    languagebash
    scala> sc.hadoopConfiguration.set("fs.s3a.access.key", "2d980ac1fabda441d1b15472620cb9c0")
    scala> sc.hadoopConfiguration.set("fs.s3a.secret.key", "caringo")
    scala> sc.hadoopConfiguration.set("fs.s3a.connection.ssl.enabled", "false")
    scala> sc.hadoopConfiguration.set("fs.s3a.path.style.access", "true")
    scala> sc.hadoopConfiguration.set("fs.s3a.endpoint", "http://s3.demo.example.local")


  11. To test, run a command that reads the file from Swarm via S3 and counts the lines in it:

    Code Block
    languagebash
    scala> sc.textFile("s3a://sparktest/castor_license.txt").count()
    
    res5: Long = 39


    Info
    titleNote

    s3a URI syntax is s3a://<BUCKET>/<FOLDER>/<FILE> or s3a://<BUCKET>/<FILE>


  12. The following three commands run a word count against the text file in S3 and write the results of the reduce operation back to S3:

    Code Block
    languagebash
    scala> val sonnets = sc.textFile("s3a://sparktest/castor_license.txt")
    sonnets: org.apache.spark.rdd.RDD[String] = s3a://sparktest/castor_license.txt MapPartitionsRDD[3] at textFile at <console>:24
    
    scala> val counts = sonnets.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
    counts: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[6] at reduceByKey at <console>:25
    
    scala> counts.saveAsTextFile("s3a://sparktest/output.txt")
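
    Note that saveAsTextFile writes the results as a set of part-NNNNN objects under the output.txt prefix rather than as a single file. One way to verify the results from the same shell:

    Code Block
    languagebash
    scala> sc.textFile("s3a://sparktest/output.txt").take(5).foreach(println)
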


  13. Browse to the Spark UI to see the jobs and stages (see the UI examples below):

    Code Block
    languagebash
    http://<IP>:4040/jobs/
    http://<IP>:4040/stages/stage/?id=2&attempt=0


  14. It also works over SSL (even with self-signed certificates):

    Code Block
    languagebash
    scala> sc.hadoopConfiguration.set("fs.s3a.connection.ssl.enabled", "true");
    
    scala> sc.hadoopConfiguration.set("fs.s3a.endpoint", "https://s3.demo.sales.local")
    
    scala> sc.textFile("s3a://sparktest/castor_license.txt").count()
    res8: Long = 39
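
    If the JVM does not already trust the self-signed certificate, spark-shell will fail the TLS handshake. One option is to import the certificate into the Java truststore before starting the shell; the certificate file name below is just a placeholder, and the truststore path shown is the Java 8 default:

    Code Block
    languagebash
    # import the endpoint's self-signed certificate into the JVM truststore (default password is "changeit")
    keytool -importcert -alias swarm-s3 -file /tmp/s3-demo-cert.pem \
      -keystore $JAVA_HOME/jre/lib/security/cacerts -storepass changeit
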


...