© 2015 IBM Corporation
Introduction to Apache Spark
Vincent Poncet
IBM Software Big Data Technical Sale
02/07/2015
2 © 2015 IBM Corporation
Credits
 This presentation draws upon previous work / slides by IBM
colleagues from the WW Software Big Data organization: Daniel
Kikuchi, Jacques Roy and Mokhtar Kandil
 I also used several materials from DataBricks and the Apache Spark
documentation
3 © 2015 IBM Corporation
Introduction and background
Spark Core API
Spark Execution Model
Spark Shell & Application Deployment
Spark Extensions (SparkSQL, MLlib, Spark Streaming)
Spark Future
Agenda
4 © 2015 IBM Corporation
Introduction and background
5 © 2015 IBM Corporation
 Apache Spark is a fast, general purpose,
easy-to-use cluster computing system for
large-scale data processing
– Fast
• Leverages aggressively cached in-memory
distributed computing and dedicated Executor
processes that stay alive even when no jobs are running
• Faster than MapReduce
– General purpose
• Covers a wide range of workloads
• Provides SQL, streaming and complex
analytics
– Flexible and easier to use than MapReduce
• Spark is written in Scala, an object-oriented,
functional programming language
• Scala, Python and Java APIs
• Scala and Python interactive shells
• Runs on Hadoop, Mesos, standalone or cloud
Logistic regression in Hadoop and Spark
Spark Stack
WordCount:
val wordCounts = sc.textFile("README.md").flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey((a, b) => a + b)
6 © 2015 IBM Corporation
Brief History of Spark
 2002 – MapReduce @ Google
 2004 – MapReduce paper
 2006 – Hadoop @ Yahoo
 2008 – Hadoop Summit
 2010 – Spark paper
 2013 – Spark 0.7 Apache Incubator
 2014 – Apache Spark top-level
 2014 – 1.2.0 release in December
 2015 – 1.3.0 release in March
 2015 – 1.4.0 release in June
 Spark is HOT!!!
 Most active project in Hadoop
ecosystem
 One of top 3 most active Apache
projects
 Databricks founded by the creators
of Spark from UC Berkeley’s
AMPLab
Activity for 6 months in 2014
(from Matei Zaharia – 2014 Spark Summit)
DataBricks
In June 2015, the code base was about 400K lines
7 © 2015 IBM Corporation
DataBricks / Spark Summit 2015
8 © 2015 IBM Corporation
Large Scale Usage
DataBricks / Spark Summit 2015
9 © 2015 IBM Corporation
Spark ecosystem
 Spark is quite versatile and flexible:
– Can run on YARN / HDFS but also standalone or on MESOS
– The general processing capabilities of the Spark engine can be exploited from
multiple “entry points”: SQL, Streaming, Machine Learning, Graph Processing
10 © 2015 IBM Corporation
Spark in the Hadoop ecosystem
 Currently, Spark is a general purpose parallel processing engine
which integrates with YARN along with the rest of the Hadoop frameworks
[Diagram: HDFS and YARN with Map/Reduce 2, Hive, Pig, Spark, HBase, BigSQL and Impala running on top]
11 © 2015 IBM Corporation
Future of Spark’s role in Hadoop ?
 The Spark Core engine is a good, performant replacement for
MapReduce:
[Diagram: HDFS and YARN with Spark Core running Spark SQL, Spark MLlib, Spark Streaming and custom code, alongside BigSQL, Hive and HBase]
12 © 2015 IBM Corporation
Spark Core API
13 © 2015 IBM Corporation
 An RDD is a distributed collection of Scala/Python/Java objects of
the same type:
– RDD of strings
– RDD of integers
– RDD of (key, value) pairs
– RDD of Java/Python/Scala class objects
 An RDD is physically distributed across the cluster, but manipulated
as one logical entity:
– Spark will “distribute” any required processing to all partitions where the RDD
exists and perform necessary redistributions and aggregations as well.
– Example: Consider a distributed RDD “Names” made of names
Resilient Distributed Dataset (RDD): definition
Names RDD — Partition 1: Mokhtar, Jacques, Dirk | Partition 2: Cindy, Dan, Susan | Partition 3: Dirk, Frank, Jacques
14 © 2015 IBM Corporation
 Suppose we want to know the number of names in the RDD “Names”
 User simply requests: Names.count()
– Spark will “distribute” count processing to all partitions so as to obtain:
• Partition 1: Mokhtar (1), Jacques (1), Dirk (1) → 3
• Partition 2: Cindy (1), Dan (1), Susan (1) → 3
• Partition 3: Dirk (1), Frank (1), Jacques (1) → 3
– Local counts are subsequently aggregated: 3 + 3 + 3 = 9
 To lookup the first element in the RDD: Names.first()
 To display all elements of the RDD: Names.collect() (careful with this)
Resilient Distributed Dataset: definition
Names RDD — Partition 1: Mokhtar, Jacques, Dirk | Partition 2: Cindy, Dan, Susan | Partition 3: Dirk, Frank, Jacques
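A minimal spark-shell sketch of the calls above (assuming the Names RDD is built with parallelize rather than loaded from storage; data and partition count mirror the picture):

// Build the "Names" RDD with 3 partitions
val names = sc.parallelize(Seq("Mokhtar", "Jacques", "Dirk", "Cindy", "Dan", "Susan", "Dirk", "Frank", "Jacques"), 3)
names.count()    // 9 -- per-partition counts aggregated on the driver
names.first()    // "Mokhtar"
names.collect()  // Array of all 9 names, pulled back to the driver (careful with large RDDs)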
15 © 2015 IBM Corporation
Resilient Distributed Datasets: Creation and Manipulation
 Three methods for creation
– Distributing a collection of objects from the driver program (using the
parallelize method of the spark context)
val rddNumbers = sc.parallelize(1 to 10)
val rddLetters = sc.parallelize(List("a", "b", "c", "d"))
– Loading an external dataset (file)
val quotes = sc.textFile("hdfs:/sparkdata/sparkQuotes.txt")
– Transformation from another existing RDD
val rddNumbers2 = rddNumbers.map(x=> x+1)
 Dataset from any storage supported by Hadoop
– HDFS, Cassandra, HBase, Amazon S3
– Others
 File types supported
– Text files, SequenceFiles, Parquet, JSON
– Hadoop InputFormat
16 © 2015 IBM Corporation
Resilient Distributed Datasets: Properties
 Immutable
 Two types of operations
– Transformations ~ DDL (Create View V2 as…)
• val rddNumbers = sc.parallelize(1 to 10): Numbers from 1 to 10
• val rddNumbers2 = rddNumbers.map (x => x+1): Numbers from 2 to 11
• The LINEAGE on how to obtain rddNumbers2 from rddNumbers is recorded
• It’s a Directed Acyclic Graph (DAG)
• No actual data processing takes place → lazy evaluation
– Actions ~ DML (Select * From V2…)
• rddNumbers2.collect(): Array [2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
• Performs the transformations and the action
• Returns a value (or writes to a file)
 Fault tolerance
– If data in memory is lost it will be recreated from lineage
 Caching, persistence (memory, spilling, disk) and check-pointing
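A small sketch of the lazy evaluation described above (spark-shell; values shown as comments):

val rddNumbers = sc.parallelize(1 to 10)      // nothing is executed yet
val rddNumbers2 = rddNumbers.map(x => x + 1)  // only the lineage (DAG) is recorded
rddNumbers2.toDebugString                     // prints the recorded lineage
rddNumbers2.collect()                         // action: the job runs now
                                              // Array(2, 3, 4, 5, 6, 7, 8, 9, 10, 11)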
17 © 2015 IBM Corporation
RDD Transformations
 Transformations are lazy evaluations
 Returns a pointer to the transformed RDD
 Pair RDD (K,V) functions for MapReduce style transformations
Transformation — Meaning
map(func) — Return a new dataset formed by passing each element of the source through a function func.
filter(func) — Return a new dataset formed by selecting those elements of the source on which func returns true.
flatMap(func) — Similar to map, but each input item can be mapped to 0 or more output items, so func should return a Seq rather than a single item.
join(otherDataset, [numTasks]) — When called on datasets of type (K, V) and (K, W), returns a dataset of (K, (V, W)) pairs with all pairs of elements for each key.
reduceByKey(func) — When called on a dataset of (K, V) pairs, returns a dataset of (K, V) pairs where the values for each key are aggregated using the given reduce function func.
sortByKey([ascending], [numTasks]) — When called on a dataset of (K, V) pairs where K implements Ordered, returns a dataset of (K, V) pairs sorted by keys in ascending or descending order.
combineByKey[C](createCombiner, mergeValue, mergeCombiners) — Generic function to combine the elements for each key using a custom set of aggregation functions. Turns an RDD[(K, V)] into a result of type RDD[(K, C)], for a "combined type" C. createCombiner: (V) ⇒ C, mergeValue: (C, V) ⇒ C, mergeCombiners: (C, C) ⇒ C
Full documentation at http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.package
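A short sketch combining several of these transformations on a pair RDD (the sample data is made up):

val lines = sc.parallelize(Seq("spark is fast", "spark is fun"))
val counts = lines.flatMap(_.split(" "))   // one element per word
                  .map(word => (word, 1))  // pair RDD of (K, V)
                  .reduceByKey(_ + _)      // (spark,2), (is,2), (fast,1), (fun,1)
val tags = sc.parallelize(Seq(("spark", "engine"), ("fast", "adjective")))
counts.join(tags).collect()                // e.g. Array((spark,(2,engine)), (fast,(1,adjective)))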
18 © 2015 IBM Corporation
RDD Actions
 Actions return values or save an RDD to disk
Action — Meaning
collect() — Return all the elements of the dataset as an array at the driver program. This is usually useful after a filter or another operation that returns a sufficiently small subset of the data.
count() — Return the number of elements in the dataset.
first() — Return the first element of the dataset.
take(n) — Return an array with the first n elements of the dataset.
foreach(func) — Run a function func on each element of the dataset.
saveAsTextFile(path) — Save the RDD as a text file.
Full documentation at http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.package
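As a sketch (the data and output path are illustrative):

val quotes = sc.parallelize(Seq("DAN Spark is cool", "BOB Spark is fun", "BRIAN Spark is great"))
quotes.count()                        // 3
quotes.first()                        // "DAN Spark is cool"
quotes.take(2)                        // local Array with the first two elements
quotes.foreach(line => println(line)) // runs on the executors, not on the driver
quotes.saveAsTextFile("hdfs:/sparkdata/quotesOut")  // hypothetical output directory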
19 © 2015 IBM Corporation
RDD Persistence
 Each node stores any partitions of the cache that it computes in memory
 Reuses them in other actions on that dataset (or datasets derived from it)
– Future actions are much faster (often by more than 10x)
 Two methods for RDD persistence: persist() and cache()
Storage Level — Meaning
MEMORY_ONLY — Store as deserialized Java objects in the JVM. If the RDD does not fit in memory, part of it will be cached and the rest will be recomputed as needed. This is the default; the cache() method uses this.
MEMORY_AND_DISK — Same, except partitions that don’t fit in memory are also stored on disk and read from memory and disk when needed.
MEMORY_ONLY_SER — Store as serialized Java objects (one byte array per partition). Space efficient, but more CPU intensive to read.
MEMORY_AND_DISK_SER — Similar to MEMORY_AND_DISK, but stored as serialized objects.
DISK_ONLY — Store only on disk.
MEMORY_ONLY_2, MEMORY_AND_DISK_2, etc. — Same as above, but replicate each partition on two cluster nodes.
OFF_HEAP (experimental) — Store the RDD in serialized format in Tachyon.
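A minimal sketch of persisting an RDD (reusing the sparkQuotes.txt path from the earlier examples):

import org.apache.spark.storage.StorageLevel
val quotes = sc.textFile("hdfs:/sparkdata/sparkQuotes.txt")
// cache() would be equivalent to persist(StorageLevel.MEMORY_ONLY); here we allow spilling to disk
quotes.persist(StorageLevel.MEMORY_AND_DISK)
quotes.count()      // first action materialises the cache
quotes.count()      // later actions reuse the cached partitions
quotes.unpersist()  // release the cached data when no longer needed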
20 © 2015 IBM Corporation
Scala
 Scala Crash Course
 Holden Karau, DataBricks
http://lintool.github.io/SparkTutorial/slides/day1_Scala_crash_course.pdf
21 © 2015 IBM Corporation
Code Execution (1)
// Create RDD
val quotes = sc.textFile("hdfs:/sparkdata/sparkQuotes.txt")
// Transformations
val danQuotes = quotes.filter(_.startsWith("DAN"))
val danSpark = danQuotes.map(_.split(" ")).map(x => x(1))
// Action
danSpark.filter(_.contains("Spark")).count()
DAN Spark is cool
BOB Spark is fun
BRIAN Spark is great
DAN Scala is awesome
BOB Scala is flexible
File: sparkQuotes.txt
 ‘spark-shell’ provides Spark context as ‘sc’
22 © 2015 IBM Corporation
Code Execution (2)
// Create RDD
val quotes = sc.textFile("hdfs:/sparkdata/sparkQuotes.txt")
// Transformations
val danQuotes = quotes.filter(_.startsWith("DAN"))
val danSpark = danQuotes.map(_.split(" ")).map(x => x(1))
// Action
danSpark.filter(_.contains("Spark")).count()
DAN Spark is cool
BOB Spark is fun
BRIAN Spark is great
DAN Scala is awesome
BOB Scala is flexible
File: sparkQuotes.txt RDD: quotes
DAN Spark is cool
BOB Spark is fun
BRIAN Spark is great
DAN Scala is awesome
BOB Scala is flexible
23 © 2015 IBM Corporation
Code Execution (3)
// Create RDD
val quotes = sc.textFile("hdfs:/sparkdata/sparkQuotes.txt")
// Transformations
val danQuotes = quotes.filter(_.startsWith("DAN"))
val danSpark = danQuotes.map(_.split(" ")).map(x => x(1))
// Action
danSpark.filter(_.contains("Spark")).count()
DAN Spark is cool
BOB Spark is fun
BRIAN Spark is great
DAN Scala is awesome
BOB Scala is flexible
File: sparkQuotes.txt RDD: quotes RDD: danQuotes
DAN Spark is cool
BOB Spark is fun
BRIAN Spark is great
DAN Scala is awesome
BOB Scala is flexible
DAN Spark is cool
DAN Scala is awesome
24 © 2015 IBM Corporation
Code Execution (4)
// Create RDD
val quotes = sc.textFile("hdfs:/sparkdata/sparkQuotes.txt")
// Transformations
val danQuotes = quotes.filter(_.startsWith("DAN"))
val danSpark = danQuotes.map(_.split(" ")).map(x => x(1))
// Action
danSpark.filter(_.contains("Spark")).count()
DAN Spark is cool
BOB Spark is fun
BRIAN Spark is great
DAN Scala is awesome
BOB Scala is flexible
File: sparkQuotes.txt RDD: quotes RDD: danQuotes RDD: danSpark
DAN Spark is cool
BOB Spark is fun
BRIAN Spark is great
DAN Scala is awesome
BOB Scala is flexible
DAN Spark is cool
DAN Scala is awesome
Spark
Scala
25 © 2015 IBM Corporation
Code Execution (5)
// Create RDD
val quotes = sc.textFile("hdfs:/sparkdata/sparkQuotes.txt")
// Transformations
val danQuotes = quotes.filter(_.startsWith("DAN"))
val danSpark = danQuotes.map(_.split(" ")).map(x => x(1))
// Action
danSpark.filter(_.contains("Spark")).count()
DAN Spark is cool
BOB Spark is fun
BRIAN Spark is great
DAN Scala is awesome
BOB Scala is flexible
File: sparkQuotes.txt
HadoopRDD
DAN Spark is cool
BOB Spark is fun
BRIAN Spark is great
DAN Scala is awesome
BOB Scala is flexible
RDD: quotes
DAN Spark is cool
DAN Scala is awesome
RDD: danQuotes
Spark
Scala
RDD: danSpark
1
26 © 2015 IBM Corporation
DataFrames
 A DataFrame is a distributed collection of data organized into named columns. It is
conceptually equivalent to a table in a relational database, an R data frame or a Python
pandas DataFrame, but distributed and with query optimizations and predicate pushdown
to the underlying storage.
 DataFrames can be constructed from a wide array of sources such as: structured data files,
tables in Hive, external databases, or existing RDDs.
 Released in Spark 1.3
DataBricks / Spark Summit 2015
27 © 2015 IBM Corporation
DataFrames Examples
// Create the DataFrame
val df = sqlContext.read.parquet("examples/src/main/resources/people.parquet")
// Show the content of the DataFrame
df.show()
// Print the schema in a tree format
df.printSchema()
// root
// |-- age: long (nullable = true)
// |-- name: string (nullable = true)
// Select only the "name" column
df.select("name").show()
// Select everybody, but increment the age by 1
df.select(df("name"), df("age") + 1).show()
// Select people older than 21
df.filter(df("age") > 21).show()
// Count people by age
df.groupBy("age").count().show()
28 © 2015 IBM Corporation
Spark
Execution Model
29 © 2015 IBM Corporation
Components (DataBricks)
Your program:
sc = new SparkContext
f = sc.textFile(“…”)
f.filter(…).count()
...
[Diagram: the Spark client (app master) runs the driver with the RDD graph, scheduler, block tracker and shuffle tracker; a cluster manager allocates Spark workers, each running task threads and a block manager on top of HDFS, HBase, …]
30 © 2015 IBM Corporation
Scheduling Process (DataBricks)
rdd1.join(rdd2).groupBy(…).filter(…)
[Diagram of the scheduling pipeline:
– RDD Objects: build the operator DAG
– DAGScheduler: splits the graph into stages of tasks and submits each stage as ready (agnostic to operators); a failed stage is resubmitted
– TaskScheduler: launches TaskSets via the cluster manager and retries failed or straggling tasks (doesn’t know about stages)
– Worker: executes tasks on threads and stores/serves blocks via the block manager]
31 © 2015 IBM Corporation
Scheduler Optimizations (DataBricks)
Pipelines narrow ops. within a stage
Picks join algorithms based on partitioning (minimize shuffles)
Reuses previously cached data
[Diagram: a DAG over RDDs A–G with map, union, join and groupBy operators split into Stage 1, Stage 2 and Stage 3; previously computed partitions are skipped]
32 © 2015 IBM Corporation
Directed Acyclic Graph (DAG)
 View the lineage with toDebugString
 The whole chain could equally be written as a single expression
scala> danSpark.toDebugString
res1: String =
(2) MappedRDD[4] at map at <console>:16
| MappedRDD[3] at map at <console>:16
| FilteredRDD[2] at filter at <console>:14
| hdfs:/sparkdata/sparkQuotes.txt MappedRDD[1] at textFile at <console>:12
| hdfs:/sparkdata/sparkQuotes.txt HadoopRDD[0] at textFile at <console>:12
val danSpark = sc.textFile("hdfs:/sparkdata/sparkQuotes.txt").
filter(_.startsWith("DAN")).
map(_.split(" ")).
map(x => x(1)).
filter(_.contains("Spark"))
danSpark.count()
33 © 2015 IBM Corporation
Showing Multiple Apps
[Diagram: each app’s driver program holds a SparkContext that talks to the cluster manager; executors on the worker nodes run that app’s tasks and hold its cache]
 Each Spark application runs as a set of processes coordinated by the
Spark context object (driver program)
– Spark context connects to Cluster Manager (standalone, Mesos/Yarn)
– Spark context acquires executors (JVM instance)
on worker nodes
– Spark context sends tasks to the executors
DataBricks
34 © 2015 IBM Corporation
Spark Terminology
 Context (Connection):
– Represents a connection to the Spark cluster. The application which initiated
the context can submit one or several jobs, sequentially or in parallel, in batch or
interactively, or run as a long-running server continuously serving requests.
 Driver (Coordinator agent)
– The program or process running the Spark context. Responsible for running
jobs over the cluster and converting the App into a set of tasks
 Job (Query / Query plan):
– A piece of logic (code) which will take some input from HDFS (or the local
filesystem), perform some computations (transformations and actions) and
write some output back.
 Stage (Subplan)
– Jobs are divided into stages
 Tasks (Sub section)
– Each stage is made up of tasks. One task per partition. One task is executed
on one partition (of data) by one executor
 Executor (Sub agent)
– The process responsible for executing a task on a worker node
 Resilient Distributed Dataset (RDD) – the distributed data abstraction described in the Spark Core API section
35 © 2015 IBM Corporation
Spark
Shell & Application Deployment
36 © 2015 IBM Corporation
Spark’s Scala and Python Shell
 Spark comes with two shells
– Scala
– Python
 APIs available for Scala, Python and Java
 Appropriate versions for each Spark release
 Spark’s native language is Scala, so it is more natural to write Spark
applications in Scala.
 This presentation will focus on code examples in Scala
37 © 2015 IBM Corporation
Spark’s Scala and Python Shell
 Powerful tool to analyze data interactively
 The Scala shell runs on the Java VM
– Can leverage existing Java libraries
 Scala:
– To launch the Scala shell (from Spark home directory):
./bin/spark-shell
– To read in a text file:
scala> val textFile = sc.textFile("README.txt")
 Python:
– To launch the Python shell (from Spark home directory):
./bin/pyspark
– To read in a text file:
>>> textFile = sc.textFile("README.txt")
38 © 2015 IBM Corporation
SparkContext in Applications
 The main entry point for Spark functionality
 Represents the connection to a Spark cluster
 Create RDDs, accumulators, and broadcast variables on that
cluster
 In the Spark shell, the SparkContext, sc, is automatically initialized
for you to use
 In a Spark program, import some classes and implicit conversions
into your program:
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
39 © 2015 IBM Corporation
A Spark Standalone Application in Scala
Import statements
SparkConf and
SparkContext
Transformations
and Actions
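The code shown on the original slide is an image; a minimal sketch of such a standalone application (the README.md path is an assumption) could look like this:

/* SimpleApp.scala */
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf

object SimpleApp {
  def main(args: Array[String]) {
    // SparkConf and SparkContext
    val conf = new SparkConf().setAppName("Simple Application")
    val sc = new SparkContext(conf)
    // Transformations and actions
    val logData = sc.textFile("README.md").cache()
    val numAs = logData.filter(line => line.contains("a")).count()
    val numBs = logData.filter(line => line.contains("b")).count()
    println(s"Lines with a: $numAs, Lines with b: $numBs")
    sc.stop()
  }
}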
40 © 2015 IBM Corporation
Running Standalone Applications
 Define the dependencies
– Scala → simple.sbt
 Create the typical directory structure with the files
 Create a JAR package containing the application’s code.
– Scala: sbt package
 Use spark-submit to run the program
Scala:
./simple.sbt
./src
./src/main
./src/main/scala
./src/main/scala/SimpleApp.scala
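As a sketch, simple.sbt declares the spark-core dependency (the Scala and Spark versions below are assumptions matching a Spark 1.4 build):

name := "Simple Project"

version := "1.0"

scalaVersion := "2.10.4"

libraryDependencies += "org.apache.spark" %% "spark-core" % "1.4.0"

After sbt package, the resulting JAR (its path depends on the project name and Scala version) is submitted with something like:
./bin/spark-submit --class "SimpleApp" --master local[4] target/scala-2.10/simple-project_2.10-1.0.jar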
41 © 2015 IBM Corporation
Spark Properties
 Set application properties via the SparkConf object
val conf = new SparkConf()
.setMaster("local")
.setAppName("CountingSheep")
.set("spark.executor.memory", "1g")
val sc = new SparkContext(conf)
 Dynamically setting Spark properties
– SparkContext with an empty conf
val sc = new SparkContext(new SparkConf())
– Supply the configuration values during runtime
./bin/spark-submit --name "My app" --master local[4] --conf spark.shuffle.spill=false --conf "spark.executor.extraJavaOptions=-XX:+PrintGCDetails -XX:+PrintGCTimeStamps" myApp.jar
– conf/spark-defaults.conf
 Application web UI
http://<driver>:4040
42 © 2015 IBM Corporation
Spark Configuration
 Three locations for configuration:
– Spark properties
– Environment variables
conf/spark-env.sh
– Logging
log4j.properties
 Override default configuration directory (SPARK_HOME/conf)
– SPARK_CONF_DIR
• spark-defaults.conf
• spark-env.sh
• log4j.properties
• etc.
43 © 2015 IBM Corporation
Spark Monitoring
 Three ways to monitor Spark applications
1. Web UI
• Default port 4040
• Available for the duration of the application
2. Metrics
• Based on the Coda Hale Metrics Library
• Report to a variety of sinks (HTTP, JMX, and CSV)
• /conf/metrics.properties
3. External instrumentations
• Ganglia
• OS profiling tools (dstat, iostat, iotop)
• JVM utilities (jstack, jmap, jstat, jconsole)
44 © 2015 IBM Corporation
Running Spark Examples
 Spark samples available in the examples directory
 Run the examples (from Spark home directory):
./bin/run-example SparkPi
where SparkPi is the name of the sample application
45 © 2015 IBM Corporation
Spark Extensions
46 © 2015 IBM Corporation
Spark Extensions
 Extensions to the core Spark API
 Improvements made to the core are passed to these libraries
 Little overhead to use with the Spark core
from http://spark.apache.org
47 © 2015 IBM Corporation
Spark SQL
 Process relational queries expressed in SQL (HiveQL)
 Seamlessly mix SQL queries with Spark programs
 In Spark since 1.0, refactored on top of DataFrames since 1.3
 Provide a single interface for efficiently working with structured
data including Apache Hive, Parquet and JSON files
 Leverages Hive frontend and metastore
– Compatibility with Hive data, queries
and UDFs
– HiveQL limitations may apply
– Not ANSI SQL compliant
– Little to no query rewrite optimization,
automatic memory management or
sophisticated workload management
 Graduated from alpha status with Spark 1.3
 Standard connectivity through JDBC/ODBC
48 © 2015 IBM Corporation
Spark SQL - Getting Started
 SQLContext created from SparkContext
// An existing SparkContext, sc
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
 HiveContext created from SparkContext
// An existing SparkContext, sc
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
 Import a library to convert an RDD to a DataFrame
– Scala:
import sqlContext.implicits._
 DataFrame data sources
– Inferring the schema using reflection
– Programmatic interface
49 © 2015 IBM Corporation
Spark SQL - Inferring the Schema Using Reflection
 The case class in Scala defines the schema of the table
case class Person(name: String, age: Int)
 The arguments of the case class become the names of the columns
 Create the RDD of the Person object and create a DataFrame
val people = sc.textFile("examples/src/main/resources/people.txt").
map(_.split(",")).
map(p => Person(p(0), p(1).trim.toInt)).toDF()
 Register the DataFrame as a table
people.registerTempTable("people")
 Run SQL statements using the sql method provided by the
SQLContext
val teenagers = sqlContext.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")
 The results of the queries are DataFrames and support all the normal
RDD operations
teenagers.map(t => "Name: " + t(0)).collect().foreach(println)
50 © 2015 IBM Corporation
Spark SQL - Programmatic Interface
 Use when you cannot define the case classes ahead of time
 Three steps to create the Dataframe
1. Schema encoded as a String, import SparkSQL Struct types
val schemaString = "name age"
import org.apache.spark.sql.Row;
import org.apache.spark.sql.types.{StructType,StructField,StringType};
2. Create the schema represented by a StructType matching the structure of
the Rows in the RDD from step 1.
val schema = StructType( schemaString.split(" ").map(fieldName =>
StructField(fieldName, StringType, true)))
3. Apply the schema to the RDD of Rows using the createDataFrame method.
val rowRDD = people.map(_.split(",")).map(p => Row(p(0), p(1).trim))
val peopleDataFrame = sqlContext.createDataFrame(rowRDD, schema)
 Then register the peopleDataFrame as a table
peopleDataFrame.registerTempTable("people")
 Run the sql statements using the sql method:
val results = sqlContext.sql("SELECT name FROM people")
results.map(t => "Name: " + t(0)).collect().foreach(println)
51 © 2015 IBM Corporation
SparkSQL - DataSources
Before: Spark 1.2.x
 ParquetFile
– val parquetFile = sqlContext.parquetFile("people.parquet")
 JSON
– val df = sqlContext.jsonFile("examples/src/main/resources/people.json")
Spark 1.3.x
 Generic Load/Save
– val df = sqlContext.load("<filename>", "<datasource type>")
– df.save("<filename>", "<datasource type>")
 ParquetFile
– val df = sqlContext.load("people.parquet") // parquet unless otherwise configured by spark.sql.sources.default
– df.select("name", "age").save("namesAndAges.parquet")
 JSON
– val df = sqlContext.load("people.json", "json")
– df.select("name", "age").save("namesAndAges.json", "json")
 CSV (external package)
– val df = sqlContext.load("com.databricks.spark.csv", Map("path" -> "cars.csv", "header" -> "true"))
– df.select("year", "model").save("newcars.csv", "com.databricks.spark.csv")
Spark 1.4.x
 Generic Load/Save
– val df = sqlContext.read.load("<filename>", "<datasource type>")
– df.write.save("<filename>", "<datasource type>")
 ParquetFile
– val df = sqlContext.read.load("people.parquet") // parquet unless otherwise configured by spark.sql.sources.default
– df.select("name", "age").write.save("namesAndAges.parquet")
 JSON
– val df = sqlContext.read.load("people.json", "json")
– df.select("name", "age").write.save("namesAndAges.json", "json")
 CSV (external package)
– val df = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").load("cars.csv")
– df.select("year", "model").write.format("com.databricks.spark.csv").save("newcars.csv")
The DataSource API provides generic methods to manage connectors to any datasource (file, JDBC, Cassandra, MongoDB, etc.), available from Spark 1.3, and provides predicate pushdown capabilities to leverage the performance of the backend. Most connectors are available at http://spark-packages.org/
52 © 2015 IBM Corporation
Spark Streaming
 Scalable, high-throughput, fault-tolerant stream processing of live
data streams
 Spark Streaming applications are written like regular Spark applications
 Recovers lost work and operator state (sliding windows) out of the box
 Uses HDFS and Zookeeper for high availability
 Data sources also include TCP sockets, ZeroMQ or other customized
data sources
53 © 2015 IBM Corporation
Spark Streaming - Internals
 The input stream goes into Spark Streaming
 Spark Streaming breaks it up into batches of input data
 Feeds them into the Spark engine for processing
 Generates the final results in streams of batches
 DStream - Discretized Stream
– Represents a continuous stream of data created from the input streams
– Internally, represented as a sequence of RDDs
54 © 2015 IBM Corporation
Spark Streaming - Getting Started
 Count the number of words coming in from the TCP socket
 Import the Spark Streaming classes
import org.apache.spark._
import org.apache.spark.streaming._
import org.apache.spark.streaming.StreamingContext._
 Create the StreamingContext object
val conf =
new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount")
val ssc = new StreamingContext(conf, Seconds(1))
 Create a DStream
val lines = ssc.socketTextStream("localhost", 9999)
 Split the lines into words
val words = lines.flatMap(_.split(" "))
 Count the words
val pairs = words.map(word => (word, 1))
val wordCounts = pairs.reduceByKey(_ + _)
 Print to the console
wordCounts.print()
55 © 2015 IBM Corporation
Spark Streaming - Continued
 No real processing happens until you tell it
ssc.start() // Start the computation
ssc.awaitTermination() // Wait for the computation to terminate
 Code and application can be found in the NetworkWordCount
example
 To run the example:
– Invoke netcat to start the data stream
– In a different terminal, run the application
./bin/run-example streaming.NetworkWordCount localhost 9999
56 © 2015 IBM Corporation
Spark MLlib
 Spark MLlib is Spark’s machine learning library
 Since Spark 0.8
 Provides common algorithms and
utilities
• Classification
• Regression
• Clustering
• Collaborative filtering
• Dimensionality reduction
 Leverages in-memory cache of Spark
to speed up iteration processing
57 © 2015 IBM Corporation
Spark MLlib - Getting Started
 Use k-means clustering for a set of latitudes and longitudes
 Import the Spark MLlib classes
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors
 Create the SparkContext object
val conf = new SparkConf().setAppName("KMeans")
val sc = new SparkContext(conf)
 Create a data RDD
val taxifile = sc.textFile("user/spark/sparkdata/nyctaxisub/*")
 Create Vectors for input to algorithm
val taxi =
taxifile.map{line=>Vectors.dense(line.split(",").slice(3,5).map(_.toDouble))}
 Run the k-means algorithm with 3 clusters and 10 iterations
val model = KMeans.train(taxi, 3, 10)
val clusterCenters = model.clusterCenters.map(_.toArray)
 Print to the console
clusterCenters.foreach(lines=>println(lines(0),lines(1)))
58 © 2015 IBM Corporation
SparkML
 SparkML provides an API to build ML pipelines (since Spark 1.3)
 Similar to Python scikit-learn
 SparkML provides abstractions for all steps of an ML workflow
[Diagram: generic ML workflow vs. real-life ML workflow]
 Transformer: A Transformer is an algorithm which can transform
one DataFrame into another DataFrame. E.g., an ML model is a
Transformer which transforms a DataFrame with features into a
DataFrame with predictions.
 Estimator: An Estimator is an algorithm which can be fit on a
DataFrame to produce a Transformer. E.g., a learning algorithm
is an Estimator which trains on a dataset and produces a model.
 Pipeline: A Pipeline chains multiple Transformers and Estimators
together to specify an ML workflow.
 Param: All Transformers and Estimators now share a common
API for specifying parameters. Xebia HUG France 06/2015
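A short sketch of chaining Transformers and an Estimator into a Pipeline, in the spirit of the spark.ml examples (the training and test DataFrames and their "text"/"label" columns are assumptions):

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

// training is assumed to be a DataFrame with "text" and "label" columns
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
val lr = new LogisticRegression().setMaxIter(10)
// Transformers and the Estimator chained into a single Pipeline (itself an Estimator)
val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))
val model = pipeline.fit(training)   // fit() produces a PipelineModel (a Transformer)
model.transform(test).select("text", "prediction").show()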
59 © 2015 IBM Corporation
Spark GraphX
 Flexible Graphing
–GraphX unifies ETL, exploratory analysis, and iterative graph
computation
–You can view the same data as both graphs and collections,
transform and join graphs with RDDs efficiently, and write custom
iterative graph algorithms with the API
 Speed
–Comparable performance to the fastest specialized graph
processing systems.
 Algorithms
–Choose from a growing library of graph algorithms
–In addition to a highly flexible API, GraphX comes
with a variety of graph algorithms
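A minimal GraphX sketch from spark-shell (the edge-list file path is an assumption):

import org.apache.spark.graphx.GraphLoader
// Load a graph from an edge list ("srcId dstId" per line) and run PageRank
val graph = GraphLoader.edgeListFile(sc, "hdfs:/sparkdata/followers.txt")
val ranks = graph.pageRank(0.0001).vertices          // RDD of (vertexId, rank)
ranks.sortBy(_._2, ascending = false).take(5).foreach(println)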
60 © 2015 IBM Corporation
Spark R
 Spark R is an R package that provides a light-weight front-end to use
Apache Spark from R
 Spark R exposes the Spark API through the RDD class and allows
users to interactively run jobs from the R shell on a cluster.
 Goal
– Make Spark R production ready
– Integration with MLlib
– Consolidations to the DataFrames and RDD concepts
 First release in Spark 1.4.0 :
– Support of DataFrames
 Spark 1.5
– Support of MLlib
61 © 2015 IBM Corporation
Spark internals refactoring : Project Tungsten
 Memory Management and Binary Processing:
leverage application semantics to manage memory
explicitly and eliminate the overhead of JVM object
model and garbage collection
 Cache-aware computation: algorithms and data
structures to exploit memory hierarchy
 Code generation: exploit modern compilers and
CPUs: allow efficient operation directly on binary data
DataBricks / Spark Summit 2015
62 © 2015 IBM Corporation
Spark: Final Thoughts
 Spark is a good replacement for MapReduce
– Higher performance
– The framework is easier to use than MapReduce (M/R)
– Powerful RDD & DataFrames concepts
– Rich higher-level libraries: SparkSQL, MLlib/ML, Streaming, GraphX
– Broad ecosystem adoption
 This is a very fast-paced environment, so keep up!
– Lots of new features in each new release (a major release every 3 months)
– Spark has the latest / best offering today, but things may change again
63 © 2015 IBM Corporation
Resources
 The Learning Spark O’Reilly book
 Lab(s) this afternoon
 The following course on big data university
More Related Content

What's hot (20)

PDF
Apache Spark 101
Abdullah Çetin ÇAVDAR
 
PDF
Apache Spark Tutorial
Ahmet Bulut
 
PPTX
Apache Spark Architecture | Apache Spark Architecture Explained | Apache Spar...
Simplilearn
 
PPTX
Introduction to Apache Spark Developer Training
Cloudera, Inc.
 
PDF
Pyspark Tutorial | Introduction to Apache Spark with Python | PySpark Trainin...
Edureka!
 
PDF
Apache Spark - Intro to Large-scale recommendations with Apache Spark and Python
Christian Perone
 
PPTX
Apache spark
TEJPAL GAUTAM
 
PPTX
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
Simplilearn
 
PDF
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Databricks
 
PDF
Introduction to Apache Spark
Samy Dindane
 
PDF
Spark
Intellipaat
 
PPTX
Introduction to Apache Spark
Rahul Jain
 
PDF
A look under the hood at Apache Spark's API and engine evolutions
Databricks
 
PPTX
Big Data and Hadoop Guide
Simplilearn
 
PDF
Spark SQL | Apache Spark
Edureka!
 
PDF
Apache spark
Dona Mary Philip
 
PPTX
Learn Apache Spark: A Comprehensive Guide
Whizlabs
 
PPTX
Lightening Fast Big Data Analytics using Apache Spark
Manish Gupta
 
PDF
Performance of Spark vs MapReduce
Edureka!
 
PPTX
Spark SQL
Caserta
 
Apache Spark 101
Abdullah Çetin ÇAVDAR
 
Apache Spark Tutorial
Ahmet Bulut
 
Apache Spark Architecture | Apache Spark Architecture Explained | Apache Spar...
Simplilearn
 
Introduction to Apache Spark Developer Training
Cloudera, Inc.
 
Pyspark Tutorial | Introduction to Apache Spark with Python | PySpark Trainin...
Edureka!
 
Apache Spark - Intro to Large-scale recommendations with Apache Spark and Python
Christian Perone
 
Apache spark
TEJPAL GAUTAM
 
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
Simplilearn
 
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Databricks
 
Introduction to Apache Spark
Samy Dindane
 
Introduction to Apache Spark
Rahul Jain
 
A look under the hood at Apache Spark's API and engine evolutions
Databricks
 
Big Data and Hadoop Guide
Simplilearn
 
Spark SQL | Apache Spark
Edureka!
 
Apache spark
Dona Mary Philip
 
Learn Apache Spark: A Comprehensive Guide
Whizlabs
 
Lightening Fast Big Data Analytics using Apache Spark
Manish Gupta
 
Performance of Spark vs MapReduce
Edureka!
 
Spark SQL
Caserta
 

Viewers also liked (13)

PDF
API Days Apidaze WebRTC Hype or Disruption 4 dec. 2013
Luis Borges Quina
 
PDF
Value Creation Strategies for APIs
Apigee | Google Cloud
 
PPTX
How to Talk about APIs (APIDays Paris 2016)
Andrew Seward
 
PPTX
APIdaze_Meetup require ('lx') _ TADHack 23 May 2015
Luis Borges Quina
 
PDF
Translation is UX manifesto
Antoine Lefeuvre
 
PDF
Networks, cloud & operator innovation- Mats Alendal
Ericsson
 
PDF
Incubateur HEC Presentation programme Oct 2016
Remi Rivas
 
PDF
Ottspott by Apidaze @API Days Paris 2015
Luis Borges Quina
 
PDF
Apache Spark and DataStax Enablement
Vincent Poncet
 
PDF
figo at API Days 2016 in Paris
Lars Markull
 
PDF
Api days 2014 from theatrophone to ap is_the 2020 telco challenge_
Luis Borges Quina
 
PDF
WebRTC Paris Meetup@ Google (10th Feb. 2014) : Apidaze Presentation
Luis Borges Quina
 
PDF
Manifeste 'Translation is UX'
Antoine Lefeuvre
 
API Days Apidaze WebRTC Hype or Disruption 4 dec. 2013
Luis Borges Quina
 
Value Creation Strategies for APIs
Apigee | Google Cloud
 
How to Talk about APIs (APIDays Paris 2016)
Andrew Seward
 
APIdaze_Meetup require ('lx') _ TADHack 23 May 2015
Luis Borges Quina
 
Translation is UX manifesto
Antoine Lefeuvre
 
Networks, cloud & operator innovation- Mats Alendal
Ericsson
 
Incubateur HEC Presentation programme Oct 2016
Remi Rivas
 
Ottspott by Apidaze @API Days Paris 2015
Luis Borges Quina
 
Apache Spark and DataStax Enablement
Vincent Poncet
 
figo at API Days 2016 in Paris
Lars Markull
 
Api days 2014 from theatrophone to ap is_the 2020 telco challenge_
Luis Borges Quina
 
WebRTC Paris Meetup@ Google (10th Feb. 2014) : Apidaze Presentation
Luis Borges Quina
 
Manifeste 'Translation is UX'
Antoine Lefeuvre
 
Ad

Similar to Introduction to Apache Spark (20)

PDF
Boston Spark Meetup event Slides Update
vithakur
 
PPTX
Apache spark sneha challa- google pittsburgh-aug 25th
Sneha Challa
 
PDF
Fast Data Analytics with Spark and Python
Benjamin Bengfort
 
PPTX
Introduction to Apache Spark
Mohamed hedi Abidi
 
PDF
Tuning and Debugging in Apache Spark
Databricks
 
PPTX
Paris Data Geek - Spark Streaming
Djamel Zouaoui
 
PDF
Meetup ml spark_ppt
Snehal Nagmote
 
PPTX
Ten tools for ten big data areas 03_Apache Spark
Will Du
 
PPTX
Dive into spark2
Gal Marder
 
PPT
Scala and spark
Fabio Fumarola
 
PPTX
Spark Study Notes
Richard Kuo
 
PPTX
Spark core
Prashant Gupta
 
PDF
Apache Spark: What? Why? When?
Massimo Schenone
 
PDF
Scala+data
Samir Bessalah
 
PPTX
Spark from the Surface
Josi Aranda
 
PPTX
Spark real world use cases and optimizations
Gal Marder
 
PDF
Unified Big Data Processing with Apache Spark (QCON 2014)
Databricks
 
PPTX
Introduction to Spark - DataFactZ
DataFactZ
 
PPT
Apache Spark™ is a multi-language engine for executing data-S5.ppt
bhargavi804095
 
PDF
Apache Spark Introduction - CloudxLab
Abhinav Singh
 
Boston Spark Meetup event Slides Update
vithakur
 
Apache spark sneha challa- google pittsburgh-aug 25th
Sneha Challa
 
Fast Data Analytics with Spark and Python
Benjamin Bengfort
 
Introduction to Apache Spark
Mohamed hedi Abidi
 
Tuning and Debugging in Apache Spark
Databricks
 
Paris Data Geek - Spark Streaming
Djamel Zouaoui
 
Meetup ml spark_ppt
Snehal Nagmote
 
Ten tools for ten big data areas 03_Apache Spark
Will Du
 
Dive into spark2
Gal Marder
 
Scala and spark
Fabio Fumarola
 
Spark Study Notes
Richard Kuo
 
Spark core
Prashant Gupta
 
Apache Spark: What? Why? When?
Massimo Schenone
 
Scala+data
Samir Bessalah
 
Spark from the Surface
Josi Aranda
 
Spark real world use cases and optimizations
Gal Marder
 
Unified Big Data Processing with Apache Spark (QCON 2014)
Databricks
 
Introduction to Spark - DataFactZ
DataFactZ
 
Apache Spark™ is a multi-language engine for executing data-S5.ppt
bhargavi804095
 
Apache Spark Introduction - CloudxLab
Abhinav Singh
 
Ad

Recently uploaded (20)

PPTX
AI Penetration Testing Essentials: A Cybersecurity Guide for 2025
defencerabbit Team
 
PDF
CIFDAQ Market Wrap for the week of 4th July 2025
CIFDAQ
 
PDF
What’s my job again? Slides from Mark Simos talk at 2025 Tampa BSides
Mark Simos
 
PPTX
Q2 FY26 Tableau User Group Leader Quarterly Call
lward7
 
PPTX
New ThousandEyes Product Innovations: Cisco Live June 2025
ThousandEyes
 
PDF
Book industry state of the nation 2025 - Tech Forum 2025
BookNet Canada
 
PDF
Newgen Beyond Frankenstein_Build vs Buy_Digital_version.pdf
darshakparmar
 
PDF
The 2025 InfraRed Report - Redpoint Ventures
Razin Mustafiz
 
PPTX
COMPARISON OF RASTER ANALYSIS TOOLS OF QGIS AND ARCGIS
Sharanya Sarkar
 
PDF
Peak of Data & AI Encore AI-Enhanced Workflows for the Real World
Safe Software
 
DOCX
Cryptography Quiz: test your knowledge of this important security concept.
Rajni Bhardwaj Grover
 
DOCX
Python coding for beginners !! Start now!#
Rajni Bhardwaj Grover
 
PPT
Ericsson LTE presentation SEMINAR 2010.ppt
npat3
 
PPTX
Agentforce World Tour Toronto '25 - MCP with MuleSoft
Alexandra N. Martinez
 
PDF
The Rise of AI and IoT in Mobile App Tech.pdf
IMG Global Infotech
 
PPTX
The Project Compass - GDG on Campus MSIT
dscmsitkol
 
PDF
“NPU IP Hardware Shaped Through Software and Use-case Analysis,” a Presentati...
Edge AI and Vision Alliance
 
PDF
Staying Human in a Machine- Accelerated World
Catalin Jora
 
PDF
UiPath DevConnect 2025: Agentic Automation Community User Group Meeting
DianaGray10
 
PDF
“Voice Interfaces on a Budget: Building Real-time Speech Recognition on Low-c...
Edge AI and Vision Alliance
 
AI Penetration Testing Essentials: A Cybersecurity Guide for 2025
defencerabbit Team
 
CIFDAQ Market Wrap for the week of 4th July 2025
CIFDAQ
 
What’s my job again? Slides from Mark Simos talk at 2025 Tampa BSides
Mark Simos
 
Q2 FY26 Tableau User Group Leader Quarterly Call
lward7
 
New ThousandEyes Product Innovations: Cisco Live June 2025
ThousandEyes
 
Book industry state of the nation 2025 - Tech Forum 2025
BookNet Canada
 
Newgen Beyond Frankenstein_Build vs Buy_Digital_version.pdf
darshakparmar
 
The 2025 InfraRed Report - Redpoint Ventures
Razin Mustafiz
 
COMPARISON OF RASTER ANALYSIS TOOLS OF QGIS AND ARCGIS
Sharanya Sarkar
 
Peak of Data & AI Encore AI-Enhanced Workflows for the Real World
Safe Software
 
Cryptography Quiz: test your knowledge of this important security concept.
Rajni Bhardwaj Grover
 
Python coding for beginners !! Start now!#
Rajni Bhardwaj Grover
 
Ericsson LTE presentation SEMINAR 2010.ppt
npat3
 
Agentforce World Tour Toronto '25 - MCP with MuleSoft
Alexandra N. Martinez
 
The Rise of AI and IoT in Mobile App Tech.pdf
IMG Global Infotech
 
The Project Compass - GDG on Campus MSIT
dscmsitkol
 
“NPU IP Hardware Shaped Through Software and Use-case Analysis,” a Presentati...
Edge AI and Vision Alliance
 
Staying Human in a Machine- Accelerated World
Catalin Jora
 
UiPath DevConnect 2025: Agentic Automation Community User Group Meeting
DianaGray10
 
“Voice Interfaces on a Budget: Building Real-time Speech Recognition on Low-c...
Edge AI and Vision Alliance
 

Introduction to Apache Spark

  • 1. © 2015 IBM Corporation Introduction to Apache Spark Vincent Poncet IBM Software Big Data Technical Sale 02/07/2015
  • 2. 2 © 2015 IBM Corporation Credits  This presentations draws upon previous work / slides by IBM colleagues from WW Software Big Data Organization : Daniel Kikuchi, Jacques Roy and Mokhtar Kandil  I used several materials from DataBricks and Apache Spark documentation
  • 3. 3 © 2015 IBM Corporation Introduction and background Spark Core API Spark Execution Model Spark Shell & Application Deployment Spark Extensions (SparkSQL, MLlib, Spark Streaming) Spark Future Agenda
  • 4. 4 © 2015 IBM Corporation Introduction and background
  • 5. 5 © 2015 IBM Corporation  Apache Spark is a fast, general purpose, easy-to-use cluster computing system for large-scale data processing – Fast • Leverages aggressively cached in-memory distributed computing and dedicated Executor processes even when no jobs are running • Faster than MapReduce – General purpose • Covers a wide range of workloads • Provides SQL, streaming and complex analytics – Flexible and easier to use than Map Reduce • Spark is written in Scala, an object oriented, functional programming language • Scala, Python and Java APIs • Scala and Python interactive shells • Runs on Hadoop, Mesos, standalone or cloud Logistic regression in Hadoop and Spark Spark Stack val wordCounts = sc.textFile("README.md").flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey((a, b) => a + b) WordCount
  • 6. 6 © 2015 IBM Corporation Brief History of Spark  2002 – MapReduce @ Google  2004 – MapReduce paper  2006 – Hadoop @ Yahoo  2008 – Hadoop Summit  2010 – Spark paper  2013 – Spark 0.7 Apache Incubator  2014 – Apache Spark top-level  2014 – 1.2.0 release in December  2015 – 1.3.0 release in March  2015 – 1.4.0 release in June  Spark is HOT!!!  Most active project in Hadoop ecosystem  One of top 3 most active Apache projects  Databricks founded by the creators of Spark from UC Berkeley’s AMPLab Activity for 6 months in 2014 (from Matei Zaharia – 2014 Spark Summit) DataBricks In June 2015, code base was about 400K lines
  • 7. 7 © 2015 IBM Corporation DataBricks / Spark Summit 2015
  • 8. 8 © 2015 IBM Corporation Large Scale Usage DataBricks / Spark Summit 2015
  • 9. 9 © 2015 IBM Corporation Spark ecosystem  Spark is quite versatile and flexible: – Can run on YARN / HDFS but also standalone or on MESOS – The general processing capabilities of the Spark engine can be exploited from multiple “entry points”: SQL, Streaming, Machine Learning, Graph Processing
  • 10. 10 © 2015 IBM Corporation Spark in the Hadoop ecosystem  Currently, Spark is a general purpose parallel processing engine which integrates with YARN along the rest of the Hadoop frameworks YARN HDFS Map/ Reduce 2 HivePig Spark HBase BigSQL Impala
  • 11. 11 © 2015 IBM Corporation Future of Spark’s role in Hadoop ?  The Spark Core engine is a good performant replacement for Map Reduce: YARN HDFS Spark Core BigSQL Spark SQL Spark MLlib Spark Streaming Hive Custom code HBase
  • 12. 12 © 2015 IBM Corporation Spark Core API
  • 13. 13 © 2015 IBM Corporation  An RDD is a distributed collection of Scala/Python/Java objects of the same type: – RDD of strings – RDD of integers – RDD of (key, value) pairs – RDD of class Java/Python/Scala objects  An RDD is physically distributed across the cluster, but manipulated as one logical entity: – Spark will “distribute” any required processing to all partitions where the RDD exists and perform necessary redistributions and aggregations as well. – Example: Consider a distributed RDD “Names” made of names Resilient Distributed Dataset (RDD): definition Mokhtar Jacques Dirk Cindy Dan Susan Dirk Frank Jacques Partition 1 Partition 2 Partition 3 Names
  • 14. 14 © 2015 IBM Corporation  Suppose we want to know the number of names in the RDD “Names”  User simply requests: Names.count() – Spark will “distribute” count processing to all partitions so as to obtain: • Partition 1: Mokhtar(1), Jacques (1), Dirk (1)  3 • Partition 2: Cindy (1), Dan (1), Susan (1)  3 • Partition 3: Dirk (1), Frank (1), Jacques (1)  3 – Local counts are subsequently aggregated: 3+3+3=9  To lookup the first element in the RDD: Names.first()  To display all elements of the RDD: Names.collect() (careful with this) Resilient Distributed Dataset: definition Mokhtar Jacques Dirk Cindy Dan Susan Dirk Frank Jacques Partition 1 Partition 2 Partition 3 Names
  • 15. 15 © 2015 IBM Corporation Resilient Distributed Datasets: Creation and Manipulation  Three methods for creation – Distributing a collection of objects from the driver program (using the parallelize method of the spark context) val rddNumbers = sc.parallelize(1 to 10) val rddLetters = sc.parallelize (List(“a”, “b”, “c”, “d”)) – Loading an external dataset (file) val quotes = sc.textFile("hdfs:/sparkdata/sparkQuotes.txt") – Transformation from another existing RDD val rddNumbers2 = rddNumbers.map(x=> x+1)  Dataset from any storage supported by Hadoop – HDFS, Cassandra, HBase, Amazon S3 – Others  File types supported – Text files, SequenceFiles, Parquet, JSON – Hadoop InputFormat
  • 16. 16 © 2015 IBM Corporation Resilient Distributed Datasets: Properties  Immutable  Two types of operations – Transformations ~ DDL (Create View V2 as…) • val rddNumbers = sc.parallelize(1 to 10): Numbers from 1 to 10 • val rddNumbers2 = rddNumbers.map (x => x+1): Numbers from 2 to 11 • The LINEAGE on how to obtain rddNumbers2 from rddNumber is recorded • It’s a Directed Acyclic Graph (DAG) • No actual data processing does take place  Lazy evaluations – Actions ~ DML (Select * From V2…) • rddNumbers2.collect(): Array [2, 3, 4, 5, 6, 7, 8, 9, 10, 11] • Performs transformations and action • Returns a value (or write to a file)  Fault tolerance – If data in memory is lost it will be recreated from lineage  Caching, persistence (memory, spilling, disk) and check-pointing
  • 17. 17 © 2015 IBM Corporation RDD Transformations  Transformations are lazy evaluations  Returns a pointer to the transformed RDD  Pair RDD (K,V) functions for MapReduce style transformations Transformation Meaning map(func) Return a new dataset formed by passing each element of the source through a function func. filter(func) Return a new dataset formed by selecting those elements of the source on which func returns true. flatMap(func) Similar to map, but each input item can be mapped to 0 or more output items. So func should return a Seq rather than a single item Full documentation at http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.package join(otherDataset, [numTasks]) When called on datasets of type (K, V) and (K, W), returns a dataset of (K, (V, W)) pairs with all pairs of elements for each key. reduceByKey(func) When called on a dataset of (K, V) pairs, returns a dataset of (K,V) pairs where the values for each key are aggregated using the given reduce function func sortByKey([ascendin g],[numTasks]) When called on a dataset of (K, V) pairs where K implements Ordered, returns a dataset of (K,V) pairs sorted by keys in ascending or descending order. combineByKey[C}(cr eateCombiner, mergeValue, mergeCombiners)) Generic function to combine the elements for each key using a custom set of aggregation functions. Turns an RDD[(K, V)] into a result of type RDD[(K, C)], for a "combined type" C. createCombiner: (V) ⇒ C, mergeValue: (C, V) ⇒ C, mergeCombiners: (C, C) ⇒ C)
  • 18. 18 © 2015 IBM Corporation RDD Actions  Actions returns values or save a RDD to disk Action Meaning collect() Return all the elements of the dataset as an array of the driver program. This is usually useful after a filter or another operation that returns a sufficiently small subset of data. count() Return the number of elements in a dataset. first() Return the first element of the dataset take(n) Return an array with the first n elements of the dataset. foreach(func) Run a function func on each element of the dataset. saveAsTextFile Save the RDD into a TextFile Full documentation at http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.package
  • 19. 19 © 2015 IBM Corporation RDD Persistence  Each node stores any partitions of the cache that it computes in memory  Reuses them in other actions on that dataset (or datasets derived from it) – Future actions are much faster (often by more than 10x)  Two methods for RDD persistence: persist() and cache() Storage Level Meaning MEMORY_ONLY Store as deserialized Java objects in the JVM. If the RDD does not fit in memory, part of it will be cached. The other will be recomputed as needed. This is the default. The cache() method uses this. MEMORY_AND_DISK Same except also store on disk if it doesn’t fit in memory. Read from memory and disk when needed. MEMORY_ONLY_SER Store as serialized Java objects (one bye array per partition). Space efficient, but more CPU intensive to read. MEMORY_AND_DISK_SER Similar to MEMORY_AND_DISK but stored as serialized objects. DISK_ONLY Store only on disk. MEMORY_ONLY_2, MEMORY_AND_DISK_2, etc. Same as above, but replicate each partition on two cluster nodes OFF_HEAP (experimental) Store RDD in serialized format in Tachyon.
  • 20. 20 © 2015 IBM Corporation Scala  Scala Crash Course  Holden Karau, DataBricks http://lintool.github.io/SparkTutorial/slides/day1_Scala_crash_course .pdf
  • 21. 21 © 2015 IBM Corporation Code Execution (1) // Create RDD val quotes = sc.textFile("hdfs:/sparkdata/sparkQuotes.txt") // Transformations val danQuotes = quotes.filter(_.startsWith("DAN")) val danSpark = danQuotes.map(_.split(" ")).map(x => x(1)) // Action danSpark.filter(_.contains("Spark")).count() DAN Spark is cool BOB Spark is fun BRIAN Spark is great DAN Scala is awesome BOB Scala is flexible File: sparkQuotes.txt  ‘spark-shell’ provides Spark context as ‘sc’
  • 22. 22 © 2015 IBM Corporation Code Execution (2) // Create RDD val quotes = sc.textFile("hdfs:/sparkdata/sparkQuotes.txt") // Transformations val danQuotes = quotes.filter(_.startsWith("DAN")) val danSpark = danQuotes.map(_.split(" ")).map(x => x(1)) // Action danSpark.filter(_.contains("Spark")).count() DAN Spark is cool BOB Spark is fun BRIAN Spark is great DAN Scala is awesome BOB Scala is flexible File: sparkQuotes.txt RDD: quotes DAN Spark is cool BOB Spark is fun BRIAN Spark is great DAN Scala is awesome BOB Scala is flexible
  • 23. 23 © 2015 IBM Corporation Code Execution (3) // Create RDD val quotes = sc.textFile("hdfs:/sparkdata/sparkQuotes.txt") // Transformations val danQuotes = quotes.filter(_.startsWith("DAN")) val danSpark = danQuotes.map(_.split(" ")).map(x => x(1)) // Action danSpark.filter(_.contains("Spark")).count() DAN Spark is cool BOB Spark is fun BRIAN Spark is great DAN Scala is awesome BOB Scala is flexible File: sparkQuotes.txt RDD: quotes RDD: danQuotes DAN Spark is cool BOB Spark is fun BRIAN Spark is great DAN Scala is awesome BOB Scala is flexible DAN Spark is cool DAN Scala is awesome
  • 24. 24 © 2015 IBM Corporation Code Execution (4) // Create RDD val quotes = sc.textFile("hdfs:/sparkdata/sparkQuotes.txt") // Transformations val danQuotes = quotes.filter(_.startsWith("DAN")) val danSpark = danQuotes.map(_.split(" ")).map(x => x(1)) // Action danSpark.filter(_.contains("Spark")).count() DAN Spark is cool BOB Spark is fun BRIAN Spark is great DAN Scala is awesome BOB Scala is flexible File: sparkQuotes.txt RDD: quotes RDD: danQuotes RDD: danSpark DAN Spark is cool BOB Spark is fun BRIAN Spark is great DAN Scala is awesome BOB Scala is flexible DAN Spark is cool DAN Scala is awesome Spark Scala
  • 25. 25 © 2015 IBM Corporation Code Execution (5) // Create RDD val quotes = sc.textFile("hdfs:/sparkdata/sparkQuotes.txt") // Transformations val danQuotes = quotes.filter(_.startsWith("DAN")) val danSpark = danQuotes.map(_.split(" ")).map(x => x(1)) // Action danSpark.filter(_.contains("Spark")).count() DAN Spark is cool BOB Spark is fun BRIAN Spark is great DAN Scala is awesome BOB Scala is flexible File: sparkQuotes.txt HadoopRDD DAN Spark is cool BOB Spark is fun BRIAN Spark is great DAN Scala is awesome BOB Scala is flexible RDD: quotes DAN Spark is cool DAN Scala is awesome RDD: danQuotes Spark Scala RDD: danSpark 1
  • 26. 26 © 2015 IBM Corporation DataFrames  A DataFrame is a distributed collection of data organized into named columns. It is conceptually equivalent to a table in a relational database, an R dataframe or Python Pandas, but in a distributed manner and with query optimizations and predicate pushdown to the underlying storage.  DataFrames can be constructed from a wide array of sources such as: structured data files, tables in Hive, external databases, or existing RDDs.  Released in Spark 1.3 DataBricks / Spark Summit 2015
  • 27. 27 © 2015 IBM Corporation DataFrames Examples // Create the DataFrame val df = sqlContext.read.parquet("examples/src/main/resources/people.parquet") // Show the content of the DataFrame df.show() // Print the schema in a tree format df.printSchema() // root // |-- age: long (nullable = true) // |-- name: string (nullable = true) // Select only the "name" column df.select("name").show() // Select everybody, but increment the age by 1 df.select(df("name"), df("age") + 1).show() // Select people older than 21 df.filter(df("age") > 21).show() // Count people by age df.groupBy("age").count().show()
  • 28. 28 © 2015 IBM Corporation Spark Execution Model
  • 29. 29 © 2015 IBM Corporation sc = new SparkContext f = sc.textFile(“…”) f.filter(…) .count() ... Your program Spark client (app master) Spark worker HDFS, HBase, … Block manager Task threads RDD graph Scheduler Block tracker Shuffle tracker Cluster manager Components DataBricks
  • 30. 30 © 2015 IBM Corporation rdd1.join(rdd2) .groupBy(…) .filter(…) RDD Objects build operator DAG agnostic to operators! doesn’t know about stages DAGScheduler split graph into stages of tasks submit each stage as ready DAG TaskScheduler TaskSet launch tasks via cluster manager retry failed or straggling tasks Cluster manager Worker execute tasks store and serve blocks Block manager Threads Task stage failed Scheduling Process DataBricks
  • 31. 31 © 2015 IBM Corporation Pipelines narrow ops. within a stage Picks join algorithms based on partitioning (minimize shuffles) Reuses previously cached data Scheduler Optimizations join union groupBy map Stage 3 Stage 1 Stage 2 A: B: C: D: E: F: G: = previously computed partition Task DataBricks
  • 32. 32 © 2015 IBM Corporation Direct Acyclic Graph (DAG)  View the lineage  Could be issued in a continuous line scala> danSpark.toDebugString res1: String = (2) MappedRDD[4] at map at <console>:16 | MappedRDD[3] at map at <console>:16 | FilteredRDD[2] at filter at <console>:14 | hdfs:/sparkdata/sparkQuotes.txt MappedRDD[1] at textFile at <console>:12 | hdfs:/sparkdata/sparkQuotes.txt HadoopRDD[0] at textFile at <console>:12 val danSpark = sc.textFile("hdfs:/sparkdata/sparkQuotes.txt"). filter(_.startsWith("DAN")). map(_.split(" ")). map(x => x(1)). .filter(_.contains("Spark")) danSpark.count()
  • 33. 33 © 2015 IBM Corporation Showing Multiple Apps SparkContext Driver Program Cluster Manager Worker Node Executor Task Task Cache Worker Node Executor Task Task Cache App  Each Spark application runs as a set of processes coordinated by the Spark context object (driver program) – Spark context connects to Cluster Manager (standalone, Mesos/Yarn) – Spark context acquires executors (JVM instance) on worker nodes – Spark context sends tasks to the executors DataBricks
  • 34. 34 © 2015 IBM Corporation Spark Terminology  Context (Connection): – Represents a connection to the Spark cluster. The Application which initiated the context can submit one or several jobs, sequentially or in parallel, batch or interactively, or long running server continuously serving requests.  Driver (Coordinator agent) – The program or process running the Spark context. Responsible for running jobs over the cluster and converting the App into a set of tasks  Job (Query / Query plan): – A piece of logic (code) which will take some input from HDFS (or the local filesystem), perform some computations (transformations and actions) and write some output back.  Stage (Subplan) – Jobs are divided into stages  Tasks (Sub section) – Each stage is made up of tasks. One task per partition. One task is executed on one partition (of data) by one executor  Executor (Sub agent) – The process responsible for executing a task on a worker node  Resilient Distributed Dataset
  • 35. 35 © 2015 IBM Corporation Spark Shell & Application Deployment
  • 36. 36 © 2015 IBM Corporation Spark’s Scala and Python Shell
 Spark comes with two shells
– Scala
– Python
 APIs are available for Scala, Python and Java
 Each Spark release ships matching versions of the shells and APIs
 Spark’s native language is Scala, so it is more natural to write Spark applications in Scala
 This presentation focuses on code examples in Scala
  • 37. 37 © 2015 IBM Corporation Spark’s Scala and Python Shell
 Powerful tool to analyze data interactively
 The Scala shell runs on the Java VM
– Can leverage existing Java libraries
 Scala:
– To launch the Scala shell (from the Spark home directory): ./bin/spark-shell
– To read in a text file: scala> val textFile = sc.textFile("README.txt")
 Python:
– To launch the Python shell (from the Spark home directory): ./bin/pyspark
– To read in a text file: >>> textFile = sc.textFile("README.txt")
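Once the file is loaded, the shell can be used to explore it interactively. A minimal sketch, continuing the Scala example above (README.txt is assumed to exist in the current directory):
scala> textFile.count()                                  // number of lines in the file
scala> textFile.first()                                  // first line of the file
scala> textFile.filter(_.contains("Spark")).count()      // number of lines mentioning "Spark"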
  • 38. 38 © 2015 IBM Corporation SparkContext in Applications
 The main entry point for Spark functionality
 Represents the connection to a Spark cluster
 Used to create RDDs, accumulators, and broadcast variables on that cluster
 In the Spark shell, the SparkContext, sc, is automatically initialized for you to use
 In a Spark program, import some classes and implicit conversions into your program:
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
  • 39. 39 © 2015 IBM Corporation A Spark Standalone Application in Scala
A standalone application has three parts: the import statements, the creation of the SparkConf and SparkContext, and the transformations and actions, as shown in the sketch below.
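A minimal sketch of such an application (the application name and input path are hypothetical; the imports match the previous slide):
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf

object SimpleApp {
  def main(args: Array[String]) {
    // SparkConf and SparkContext
    val conf = new SparkConf().setAppName("Simple Application")
    val sc = new SparkContext(conf)
    // Transformations and actions (README.md is a hypothetical input file)
    val logData = sc.textFile("README.md").cache()
    val numSparks = logData.filter(line => line.contains("Spark")).count()
    println("Lines with Spark: " + numSparks)
    sc.stop()
  }
}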
  • 40. 40 © 2015 IBM Corporation Running Standalone Applications
 Define the dependencies
– Scala: simple.sbt
 Create the typical directory structure with the files:
./simple.sbt
./src
./src/main
./src/main/scala
./src/main/scala/SimpleApp.scala
 Create a JAR package containing the application’s code
– Scala: sbt package
 Use spark-submit to run the program (see the sketch below)
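As a sketch, simple.sbt and the spark-submit invocation could look like the following; the Scala/Spark versions and the JAR path are assumptions matching a Spark 1.4 / Scala 2.10 build:
// simple.sbt
name := "Simple Project"

version := "1.0"

scalaVersion := "2.10.4"

libraryDependencies += "org.apache.spark" %% "spark-core" % "1.4.0"

# From the project directory: package, then submit the resulting JAR
sbt package
$SPARK_HOME/bin/spark-submit --class "SimpleApp" --master local[4] \
  target/scala-2.10/simple-project_2.10-1.0.jar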
  • 41. 41 © 2015 IBM Corporation Spark Properties
 Set application properties via the SparkConf object
val conf = new SparkConf()
  .setMaster("local")
  .setAppName("CountingSheep")
  .set("spark.executor.memory", "1g")
val sc = new SparkContext(conf)
 Dynamically setting Spark properties
– SparkContext with an empty conf
val sc = new SparkContext(new SparkConf())
– Supply the configuration values at runtime
./bin/spark-submit --name "My app" --master local[4] --conf spark.shuffle.spill=false --conf "spark.executor.extraJavaOptions=-XX:+PrintGCDetails -XX:+PrintGCTimeStamps" myApp.jar
– Or in conf/spark-defaults.conf
 Application web UI: http://<driver>:4040
  • 42. 42 © 2015 IBM Corporation Spark Configuration
 Three locations for configuration:
– Spark properties
– Environment variables: conf/spark-env.sh
– Logging: log4j.properties
 Override the default configuration directory (SPARK_HOME/conf) with SPARK_CONF_DIR
• spark-defaults.conf
• spark-env.sh
• log4j.properties
• etc.
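As an illustration only (the values below are assumptions, not recommendations), spark-defaults.conf and spark-env.sh might contain entries such as:
# conf/spark-defaults.conf  (whitespace-separated key / value pairs)
spark.master            spark://master-host:7077
spark.executor.memory   2g
spark.serializer        org.apache.spark.serializer.KryoSerializer

# conf/spark-env.sh  (environment variables, sourced when Spark processes start)
export SPARK_WORKER_CORES=4
export SPARK_WORKER_MEMORY=8g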
  • 43. 43 © 2015 IBM Corporation Spark Monitoring
 Three ways to monitor Spark applications
1. Web UI
• Default port 4040
• Available for the duration of the application
2. Metrics
• Based on the Coda Hale Metrics Library
• Reports to a variety of sinks (HTTP, JMX, and CSV)
• Configured in conf/metrics.properties
3. External instrumentation
• Ganglia
• OS profiling tools (dstat, iostat, iotop)
• JVM utilities (jstack, jmap, jstat, jconsole)
  • 44. 44 © 2015 IBM Corporation Running Spark Examples  Spark samples available in the examples directory  Run the examples (from Spark home directory): ./bin/run-example SparkPi where SparkPi is the name of the sample application
  • 45. 45 © 2015 IBM Corporation Spark Extensions
  • 46. 46 © 2015 IBM Corporation Spark Extensions
 Extensions to the core Spark API
 Improvements made to the core are passed to these libraries
 Little overhead to use with the Spark core
(figure from http://spark.apache.org)
  • 47. 47 © 2015 IBM Corporation Spark SQL
 Processes relational queries expressed in SQL (HiveQL)
 Seamlessly mixes SQL queries with Spark programs
 In Spark since 1.0, refactored on top of DataFrames since 1.3; graduated from alpha status with Spark 1.3
 Provides a single interface for efficiently working with structured data, including Apache Hive, Parquet and JSON files
 Leverages the Hive frontend and metastore
– Compatibility with Hive data, queries and UDFs
– HiveQL limitations may apply
– Not ANSI SQL compliant
– Little to no query rewrite optimization, automatic memory management or sophisticated workload management
 Standard connectivity through JDBC/ODBC (see the sketch below)
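As a sketch of the JDBC/ODBC connectivity (host, port and master are assumptions; the defaults follow the Spark SQL documentation), the Thrift JDBC/ODBC server can be started and then queried with the bundled beeline client:
# Start the HiveServer2-compatible Thrift JDBC/ODBC server (from the Spark home directory)
./sbin/start-thriftserver.sh --master local[4]

# Connect with beeline on the default port 10000
./bin/beeline -u jdbc:hive2://localhost:10000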
  • 48. 48 © 2015 IBM Corporation Spark SQL - Getting Started
 SQLContext created from a SparkContext
// An existing SparkContext, sc
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
 HiveContext created from a SparkContext
// An existing SparkContext, sc
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
 Import a library to convert an RDD to a DataFrame
– Scala: import sqlContext.implicits._
 DataFrame data sources
– Inferring the schema using reflection
– Programmatic interface
  • 49. 49 © 2015 IBM Corporation Spark SQL - Inferring the Schema Using Reflection
 The case class in Scala defines the schema of the table
case class Person(name: String, age: Int)
 The arguments of the case class become the names of the columns
 Create the RDD of Person objects and convert it to a DataFrame
val people = sc.textFile("examples/src/main/resources/people.txt").
  map(_.split(",")).
  map(p => Person(p(0), p(1).trim.toInt)).toDF()
 Register the DataFrame as a table
people.registerTempTable("people")
 Run SQL statements using the sql method provided by the SQLContext
val teenagers = sqlContext.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")
 The results of the queries are DataFrames and support all the normal RDD operations
teenagers.map(t => "Name: " + t(0)).collect().foreach(println)
  • 50. 50 © 2015 IBM Corporation Spark SQL - Programmatic Interface
 Use when you cannot define the case classes ahead of time
 Three steps to create the DataFrame
1. Encode the schema as a string and import the Spark SQL struct types
val schemaString = "name age"
import org.apache.spark.sql.Row;
import org.apache.spark.sql.types.{StructType,StructField,StringType};
2. Create the schema represented by a StructType matching the structure of the Rows in the RDD from step 1.
val schema = StructType(
  schemaString.split(" ").map(fieldName => StructField(fieldName, StringType, true)))
3. Apply the schema to the RDD of Rows using the createDataFrame method.
val rowRDD = people.map(_.split(",")).map(p => Row(p(0), p(1).trim))
val peopleDataFrame = sqlContext.createDataFrame(rowRDD, schema)
 Then register the peopleDataFrame as a table
peopleDataFrame.registerTempTable("people")
 Run SQL statements using the sql method:
val results = sqlContext.sql("SELECT name FROM people")
results.map(t => "Name: " + t(0)).collect().foreach(println)
  • 51. 51 © 2015 IBM Corporation SparkSQL - DataSources
The DataSource API provides generic methods to manage connectors to any data source (file, JDBC, Cassandra, MongoDB, etc.). From Spark 1.3, the DataSource API provides predicate pushdown capabilities to leverage the performance of the backend. Most connectors are available at http://spark-packages.org/
Before: Spark 1.2.x
 Parquet file
– val parquetFile = sqlContext.parquetFile("people.parquet")
 JSON
– val df = sqlContext.jsonFile("examples/src/main/resources/people.json")
Spark 1.3.x
 Generic load/save
– val df = sqlContext.load("<filename>", "<datasource type>")
– df.save("<filename>", "<datasource type>")
 Parquet file
– val df = sqlContext.load("people.parquet") // (parquet unless otherwise configured by spark.sql.sources.default)
– df.select("name", "age").save("namesAndAges.parquet")
 JSON
– val df = sqlContext.load("people.json", "json")
– df.select("name", "age").save("namesAndAges.json", "json")
 CSV (external package)
– val df = sqlContext.load("com.databricks.spark.csv", Map("path" -> "cars.csv", "header" -> "true"))
– df.select("year", "model").save("newcars.csv", "com.databricks.spark.csv")
Spark 1.4.x
 Generic load/save
– val df = sqlContext.read.format("<datasource type>").load("<filename>")
– df.write.format("<datasource type>").save("<filename>")
 Parquet file
– val df = sqlContext.read.load("people.parquet") // (parquet unless otherwise configured by spark.sql.sources.default)
– df.select("name", "age").write.save("namesAndAges.parquet")
 JSON
– val df = sqlContext.read.format("json").load("people.json")
– df.select("name", "age").write.format("json").save("namesAndAges.json")
 CSV (external package)
– val df = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").load("cars.csv")
– df.select("year", "model").write.format("com.databricks.spark.csv").save("newcars.csv")
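As an additional sketch of the same DataSource API, a JDBC source can be read through the generic reader (the connection URL, table name and driver class are hypothetical, and the JDBC driver JAR must be on the classpath):
val jdbcDF = sqlContext.read.format("jdbc")
  .options(Map(
    "url" -> "jdbc:postgresql://dbhost:5432/mydb",   // hypothetical connection URL
    "dbtable" -> "public.people",                    // hypothetical table
    "driver" -> "org.postgresql.Driver"))            // driver class for the JAR on the classpath
  .load()
jdbcDF.select("name", "age").show()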
  • 52. 52 © 2015 IBM Corporation Spark Streaming
 Scalable, high-throughput, fault-tolerant stream processing of live data streams
 Spark Streaming applications are written like regular Spark applications
 Recovers lost work and operator state (e.g. sliding windows) out of the box
 Uses HDFS and ZooKeeper for high availability
 Data sources also include TCP sockets, ZeroMQ and other customized data sources
  • 53. 53 © 2015 IBM Corporation Spark Streaming - Internals
 The input stream goes into Spark Streaming
 Spark Streaming breaks it up into batches of input data
 Feeds the batches into the Spark engine for processing
 Generates the final results as a stream of batches
 DStream - Discretized Stream
– Represents a continuous stream of data created from the input streams
– Internally, represented as a sequence of RDDs
  • 54. 54 © 2015 IBM Corporation Spark Streaming - Getting Started
 Count the number of words coming in from a TCP socket
 Import the Spark Streaming classes
import org.apache.spark._
import org.apache.spark.streaming._
import org.apache.spark.streaming.StreamingContext._
 Create the StreamingContext object
val conf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount")
val ssc = new StreamingContext(conf, Seconds(1))
 Create a DStream
val lines = ssc.socketTextStream("localhost", 9999)
 Split the lines into words
val words = lines.flatMap(_.split(" "))
 Count the words
val pairs = words.map(word => (word, 1))
val wordCounts = pairs.reduceByKey(_ + _)
 Print to the console
wordCounts.print()
  • 55. 55 © 2015 IBM Corporation Spark Streaming - Continued
 No processing starts until you explicitly start the context
ssc.start() // Start the computation
ssc.awaitTermination() // Wait for the computation to terminate
 The complete code is available as the NetworkWordCount example
 To run the example (see the commands below):
– Invoke netcat to start the data stream
– In a different terminal, run the application
./bin/run-example streaming.NetworkWordCount localhost 9999
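A sketch of the two terminals, following the Spark documentation for this example (port 9999 matches the code above):
# Terminal 1: start a simple data server with netcat on port 9999
nc -lk 9999

# Terminal 2: run the example against that socket (from the Spark home directory)
./bin/run-example streaming.NetworkWordCount localhost 9999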
  • 56. 56 © 2015 IBM Corporation Spark MLlib
 Spark MLlib is Spark’s machine learning library
 Part of Spark since 0.8
 Provides common algorithms and utilities
• Classification
• Regression
• Clustering
• Collaborative filtering
• Dimensionality reduction
 Leverages Spark’s in-memory caching to speed up iterative processing
  • 57. 57 © 2015 IBM Corporation Spark MLlib - Getting Started
 Use k-means clustering on a set of latitudes and longitudes
 Import the Spark MLlib classes
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors
 Create the SparkContext object
val conf = new SparkConf().setAppName("KMeans")
val sc = new SparkContext(conf)
 Create a data RDD
val taxifile = sc.textFile("user/spark/sparkdata/nyctaxisub/*")
 Create Vectors for input to the algorithm
val taxi = taxifile.map{line => Vectors.dense(line.split(",").slice(3,5).map(_.toDouble))}
 Run the k-means algorithm with 3 clusters and 10 iterations
val model = KMeans.train(taxi, 3, 10)
val clusterCenters = model.clusterCenters.map(_.toArray)
 Print to the console
clusterCenters.foreach(lines => println(lines(0), lines(1)))
  • 58. 58 © 2015 IBM Corporation SparkML
 SparkML provides an API to build ML pipelines (since Spark 1.3)
 Similar to Python’s scikit-learn
 SparkML provides abstractions for all steps of an ML workflow (the figure contrasted a generic ML workflow with a real-life ML workflow)
 Transformer: an algorithm which can transform one DataFrame into another DataFrame. E.g., an ML model is a Transformer which transforms a DataFrame with features into a DataFrame with predictions.
 Estimator: an algorithm which can be fit on a DataFrame to produce a Transformer. E.g., a learning algorithm is an Estimator which trains on a dataset and produces a model.
 Pipeline: a Pipeline chains multiple Transformers and Estimators together to specify an ML workflow.
 Param: all Transformers and Estimators share a common API for specifying parameters.
(Xebia, HUG France 06/2015)
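A minimal sketch of a pipeline, loosely based on the spark.ml examples of that era (the training DataFrame and its column names are hypothetical):
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

// Hypothetical training data: a DataFrame with "text" and "label" columns
val training = sqlContext.createDataFrame(Seq(
  (0L, "spark makes big data simple", 1.0),
  (1L, "mapreduce is batch oriented", 0.0)
)).toDF("id", "text", "label")

// Two Transformers and an Estimator chained into a Pipeline
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
val lr = new LogisticRegression().setMaxIter(10)
val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))

// fit() produces a PipelineModel (a Transformer) that can score new DataFrames
val model = pipeline.fit(training)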
  • 59. 59 © 2015 IBM Corporation Spark GraphX
 Flexibility
– GraphX unifies ETL, exploratory analysis, and iterative graph computation
– You can view the same data as both graphs and collections, transform and join graphs with RDDs efficiently, and write custom iterative graph algorithms with the API
 Speed
– Comparable performance to the fastest specialized graph processing systems
 Algorithms
– Choose from a growing library of graph algorithms
– In addition to a highly flexible API, GraphX comes with a variety of graph algorithms
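A minimal sketch of the property-graph API (the vertex and edge data below are hypothetical):
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD

// Hypothetical vertices (id, name) and edges carrying a relationship label
val users: RDD[(VertexId, String)] =
  sc.parallelize(Array((1L, "alice"), (2L, "bob"), (3L, "carol")))
val relationships: RDD[Edge[String]] =
  sc.parallelize(Array(Edge(1L, 2L, "follows"), Edge(2L, 3L, "follows"), Edge(3L, 1L, "follows")))

val graph = Graph(users, relationships)

// Built-in algorithm: PageRank, then join the ranks back to the user names
val ranks = graph.pageRank(0.0001).vertices
ranks.join(users).collect().foreach { case (id, (rank, name)) => println(s"$name: $rank") }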
  • 60. 60 © 2015 IBM Corporation SparkR
 SparkR is an R package that provides a light-weight front-end to use Apache Spark from R
 SparkR exposes the Spark API through the RDD class and allows users to interactively run jobs from the R shell on a cluster
 Goals
– Make SparkR production ready
– Integration with MLlib
– Consolidation of the DataFrame and RDD concepts
 First release in Spark 1.4.0:
– Support for DataFrames
 Spark 1.5:
– Support for MLlib
  • 61. 61 © 2015 IBM Corporation Spark internals refactoring : Project Tungsten  Memory Management and Binary Processing: leverage application semantics to manage memory explicitly and eliminate the overhead of JVM object model and garbage collection  Cache-aware computation: algorithms and data structures to exploit memory hierarchy  Code generation: exploit modern compilers and CPUs: allow efficient operation directly on binary data DataBricks / Spark Summit 2015
  • 62. 62 © 2015 IBM Corporation Spark: Final Thoughts
 Spark is a good replacement for MapReduce
– Higher performance
– The framework is easier to use than MapReduce (M/R)
– Powerful RDD & DataFrame concepts
– Rich higher-level libraries: SparkSQL, MLlib/ML, Streaming, GraphX
– Broad ecosystem adoption
 This is a very fast-paced environment, so keep up!
– Lots of new features in each release (a major release every 3 months)
– Spark has the latest and best offering today, but things may change again
  • 63. 63 © 2015 IBM Corporation Resources
 The Learning Spark O’Reilly book
 Lab(s) this afternoon
 The following course on Big Data University