© 2015 IBM Corporation
Introduction to Apache Spark
Vincent Poncet
IBM Software Big Data Technical Sale
02/07/2015
2 © 2015 IBM Corporation
Credits
 This presentation draws upon previous work / slides by IBM
colleagues from the WW Software Big Data organization: Daniel
Kikuchi, Jacques Roy and Mokhtar Kandil
 I also used several materials from DataBricks and the Apache Spark
documentation
3 © 2015 IBM Corporation
Introduction and background
Spark Core API
Spark Execution Model
Spark Shell & Application Deployment
Spark Extensions (SparkSQL, MLlib, Spark Streaming)
Spark Future
Agenda
4 © 2015 IBM Corporation
Introduction and background
5 © 2015 IBM Corporation
 Apache Spark is a fast, general purpose,
easy-to-use cluster computing system for
large-scale data processing
– Fast
• Leverages aggressively cached in-memory
distributed computing and dedicated Executor
processes that stay alive even when no jobs are running
• Faster than MapReduce
– General purpose
• Covers a wide range of workloads
• Provides SQL, streaming and complex
analytics
– Flexible and easier to use than MapReduce
• Spark is written in Scala, an object-oriented,
functional programming language
• Scala, Python and Java APIs
• Scala and Python interactive shells
• Runs on Hadoop, Mesos, standalone or cloud
Logistic regression in Hadoop and Spark
Spark Stack
WordCount:
val wordCounts = sc.textFile("README.md").flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey((a, b) => a + b)
6 © 2015 IBM Corporation
Brief History of Spark
 2002 – MapReduce @ Google
 2004 – MapReduce paper
 2006 – Hadoop @ Yahoo
 2008 – Hadoop Summit
 2010 – Spark paper
 2013 – Spark 0.7 Apache Incubator
 2014 – Apache Spark top-level
 2014 – 1.2.0 release in December
 2015 – 1.3.0 release in March
 2015 – 1.4.0 release in June
 Spark is HOT!!!
 Most active project in Hadoop
ecosystem
 One of top 3 most active Apache
projects
 Databricks founded by the creators
of Spark from UC Berkeley’s
AMPLab
Activity for 6 months in 2014
(from Matei Zaharia – 2014 Spark Summit)
DataBricks
In June 2015, the code base was about 400K lines
7 © 2015 IBM Corporation
DataBricks / Spark Summit 2015
8 © 2015 IBM Corporation
Large Scale Usage
DataBricks / Spark Summit 2015
9 © 2015 IBM Corporation
Spark ecosystem
 Spark is quite versatile and flexible:
– Can run on YARN / HDFS but also standalone or on MESOS
– The general processing capabilities of the Spark engine can be exploited from
multiple “entry points”: SQL, Streaming, Machine Learning, Graph Processing
10 © 2015 IBM Corporation
Spark in the Hadoop ecosystem
 Currently, Spark is a general purpose parallel processing engine
which integrates with YARN along with the rest of the Hadoop frameworks
[Diagram: HDFS and YARN with Map/Reduce 2, Hive, Pig, Spark, HBase, BigSQL and Impala running on top]
11 © 2015 IBM Corporation
Future of Spark’s role in Hadoop ?
 The Spark Core engine is a good, performant replacement for
MapReduce:
[Diagram: HDFS and YARN with Spark Core running Spark SQL, Spark MLlib, Spark Streaming and custom code, alongside BigSQL, Hive and HBase]
12 © 2015 IBM Corporation
Spark Core API
13 © 2015 IBM Corporation
 An RDD is a distributed collection of Scala/Python/Java objects of
the same type:
– RDD of strings
– RDD of integers
– RDD of (key, value) pairs
– RDD of Java/Python/Scala class objects
 An RDD is physically distributed across the cluster, but manipulated
as one logical entity:
– Spark will “distribute” any required processing to all partitions where the RDD
exists and perform necessary redistributions and aggregations as well.
– Example: Consider a distributed RDD “Names” made of names
Resilient Distributed Dataset (RDD): definition
Names RDD — Partition 1: Mokhtar, Jacques, Dirk | Partition 2: Cindy, Dan, Susan | Partition 3: Dirk, Frank, Jacques
14 © 2015 IBM Corporation
 Suppose we want to know the number of names in the RDD “Names”
 User simply requests: Names.count()
– Spark will “distribute” count processing to all partitions so as to obtain:
• Partition 1: Mokhtar (1), Jacques (1), Dirk (1) → 3
• Partition 2: Cindy (1), Dan (1), Susan (1) → 3
• Partition 3: Dirk (1), Frank (1), Jacques (1) → 3
– Local counts are subsequently aggregated: 3 + 3 + 3 = 9
 To lookup the first element in the RDD: Names.first()
 To display all elements of the RDD: Names.collect() (careful with this)
Resilient Distributed Dataset: definition
Names RDD — Partition 1: Mokhtar, Jacques, Dirk | Partition 2: Cindy, Dan, Susan | Partition 3: Dirk, Frank, Jacques
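A minimal spark-shell sketch of the calls above (assuming the Names RDD is built with parallelize rather than loaded from storage; data and partition count mirror the picture):

// Build the "Names" RDD with 3 partitions
val names = sc.parallelize(Seq("Mokhtar", "Jacques", "Dirk", "Cindy", "Dan", "Susan", "Dirk", "Frank", "Jacques"), 3)
names.count()    // 9 -- per-partition counts aggregated on the driver
names.first()    // "Mokhtar"
names.collect()  // Array of all 9 names, pulled back to the driver (careful with large RDDs)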
15 © 2015 IBM Corporation
Resilient Distributed Datasets: Creation and Manipulation
 Three methods for creation
– Distributing a collection of objects from the driver program (using the
parallelize method of the spark context)
val rddNumbers = sc.parallelize(1 to 10)
val rddLetters = sc.parallelize(List("a", "b", "c", "d"))
– Loading an external dataset (file)
val quotes = sc.textFile("hdfs:/sparkdata/sparkQuotes.txt")
– Transformation from another existing RDD
val rddNumbers2 = rddNumbers.map(x=> x+1)
 Dataset from any storage supported by Hadoop
– HDFS, Cassandra, HBase, Amazon S3
– Others
 File types supported
– Text files, SequenceFiles, Parquet, JSON
– Hadoop InputFormat
16 © 2015 IBM Corporation
Resilient Distributed Datasets: Properties
 Immutable
 Two types of operations
– Transformations ~ DDL (Create View V2 as…)
• val rddNumbers = sc.parallelize(1 to 10): Numbers from 1 to 10
• val rddNumbers2 = rddNumbers.map (x => x+1): Numbers from 2 to 11
• The LINEAGE on how to obtain rddNumbers2 from rddNumbers is recorded
• It’s a Directed Acyclic Graph (DAG)
• No actual data processing takes place → lazy evaluation
– Actions ~ DML (Select * From V2…)
• rddNumbers2.collect(): Array [2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
• Performs the transformations and the action
• Returns a value (or writes to a file)
 Fault tolerance
– If data in memory is lost it will be recreated from lineage
 Caching, persistence (memory, spilling, disk) and check-pointing
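A small sketch of the lazy evaluation described above (spark-shell; values shown as comments):

val rddNumbers = sc.parallelize(1 to 10)      // nothing is executed yet
val rddNumbers2 = rddNumbers.map(x => x + 1)  // only the lineage (DAG) is recorded
rddNumbers2.toDebugString                     // prints the recorded lineage
rddNumbers2.collect()                         // action: the job runs now
                                              // Array(2, 3, 4, 5, 6, 7, 8, 9, 10, 11)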
17 © 2015 IBM Corporation
RDD Transformations
 Transformations are lazy evaluations
 Returns a pointer to the transformed RDD
 Pair RDD (K,V) functions for MapReduce style transformations
Transformation — Meaning
map(func) — Return a new dataset formed by passing each element of the source through a function func.
filter(func) — Return a new dataset formed by selecting those elements of the source on which func returns true.
flatMap(func) — Similar to map, but each input item can be mapped to 0 or more output items, so func should return a Seq rather than a single item.
join(otherDataset, [numTasks]) — When called on datasets of type (K, V) and (K, W), returns a dataset of (K, (V, W)) pairs with all pairs of elements for each key.
reduceByKey(func) — When called on a dataset of (K, V) pairs, returns a dataset of (K, V) pairs where the values for each key are aggregated using the given reduce function func.
sortByKey([ascending], [numTasks]) — When called on a dataset of (K, V) pairs where K implements Ordered, returns a dataset of (K, V) pairs sorted by keys in ascending or descending order.
combineByKey[C](createCombiner, mergeValue, mergeCombiners) — Generic function to combine the elements for each key using a custom set of aggregation functions. Turns an RDD[(K, V)] into a result of type RDD[(K, C)], for a "combined type" C. createCombiner: (V) ⇒ C, mergeValue: (C, V) ⇒ C, mergeCombiners: (C, C) ⇒ C
Full documentation at http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.package
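A short sketch combining several of these transformations on a pair RDD (the sample data is made up):

val lines = sc.parallelize(Seq("spark is fast", "spark is fun"))
val counts = lines.flatMap(_.split(" "))   // one element per word
                  .map(word => (word, 1))  // pair RDD of (K, V)
                  .reduceByKey(_ + _)      // (spark,2), (is,2), (fast,1), (fun,1)
val tags = sc.parallelize(Seq(("spark", "engine"), ("fast", "adjective")))
counts.join(tags).collect()                // e.g. Array((spark,(2,engine)), (fast,(1,adjective)))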
18 © 2015 IBM Corporation
RDD Actions
 Actions return values or save an RDD to disk
Action — Meaning
collect() — Return all the elements of the dataset as an array at the driver program. This is usually useful after a filter or another operation that returns a sufficiently small subset of the data.
count() — Return the number of elements in the dataset.
first() — Return the first element of the dataset.
take(n) — Return an array with the first n elements of the dataset.
foreach(func) — Run a function func on each element of the dataset.
saveAsTextFile(path) — Save the RDD as a text file.
Full documentation at http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.package
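As a sketch (the data and output path are illustrative):

val quotes = sc.parallelize(Seq("DAN Spark is cool", "BOB Spark is fun", "BRIAN Spark is great"))
quotes.count()                        // 3
quotes.first()                        // "DAN Spark is cool"
quotes.take(2)                        // local Array with the first two elements
quotes.foreach(line => println(line)) // runs on the executors, not on the driver
quotes.saveAsTextFile("hdfs:/sparkdata/quotesOut")  // hypothetical output directory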
19 © 2015 IBM Corporation
RDD Persistence
 Each node stores any partitions of the cache that it computes in memory
 Reuses them in other actions on that dataset (or datasets derived from it)
– Future actions are much faster (often by more than 10x)
 Two methods for RDD persistence: persist() and cache()
Storage Level — Meaning
MEMORY_ONLY — Store as deserialized Java objects in the JVM. If the RDD does not fit in memory, part of it will be cached and the rest will be recomputed as needed. This is the default; the cache() method uses this.
MEMORY_AND_DISK — Same, except partitions that don’t fit in memory are also stored on disk and read from memory and disk when needed.
MEMORY_ONLY_SER — Store as serialized Java objects (one byte array per partition). Space efficient, but more CPU intensive to read.
MEMORY_AND_DISK_SER — Similar to MEMORY_AND_DISK, but stored as serialized objects.
DISK_ONLY — Store only on disk.
MEMORY_ONLY_2, MEMORY_AND_DISK_2, etc. — Same as above, but replicate each partition on two cluster nodes.
OFF_HEAP (experimental) — Store the RDD in serialized format in Tachyon.
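A minimal sketch of persisting an RDD (reusing the sparkQuotes.txt path from the earlier examples):

import org.apache.spark.storage.StorageLevel
val quotes = sc.textFile("hdfs:/sparkdata/sparkQuotes.txt")
// cache() would be equivalent to persist(StorageLevel.MEMORY_ONLY); here we allow spilling to disk
quotes.persist(StorageLevel.MEMORY_AND_DISK)
quotes.count()      // first action materialises the cache
quotes.count()      // later actions reuse the cached partitions
quotes.unpersist()  // release the cached data when no longer needed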
20 © 2015 IBM Corporation
Scala
 Scala Crash Course
 Holden Karau, DataBricks
http://lintool.github.io/SparkTutorial/slides/day1_Scala_crash_course.pdf
21 © 2015 IBM Corporation
Code Execution (1)
// Create RDD
val quotes = sc.textFile("hdfs:/sparkdata/sparkQuotes.txt")
// Transformations
val danQuotes = quotes.filter(_.startsWith("DAN"))
val danSpark = danQuotes.map(_.split(" ")).map(x => x(1))
// Action
danSpark.filter(_.contains("Spark")).count()
DAN Spark is cool
BOB Spark is fun
BRIAN Spark is great
DAN Scala is awesome
BOB Scala is flexible
File: sparkQuotes.txt
 ‘spark-shell’ provides Spark context as ‘sc’
22 © 2015 IBM Corporation
Code Execution (2)
// Create RDD
val quotes = sc.textFile("hdfs:/sparkdata/sparkQuotes.txt")
// Transformations
val danQuotes = quotes.filter(_.startsWith("DAN"))
val danSpark = danQuotes.map(_.split(" ")).map(x => x(1))
// Action
danSpark.filter(_.contains("Spark")).count()
DAN Spark is cool
BOB Spark is fun
BRIAN Spark is great
DAN Scala is awesome
BOB Scala is flexible
File: sparkQuotes.txt RDD: quotes
DAN Spark is cool
BOB Spark is fun
BRIAN Spark is great
DAN Scala is awesome
BOB Scala is flexible
23 © 2015 IBM Corporation
Code Execution (3)
// Create RDD
val quotes = sc.textFile("hdfs:/sparkdata/sparkQuotes.txt")
// Transformations
val danQuotes = quotes.filter(_.startsWith("DAN"))
val danSpark = danQuotes.map(_.split(" ")).map(x => x(1))
// Action
danSpark.filter(_.contains("Spark")).count()
DAN Spark is cool
BOB Spark is fun
BRIAN Spark is great
DAN Scala is awesome
BOB Scala is flexible
File: sparkQuotes.txt RDD: quotes RDD: danQuotes
DAN Spark is cool
BOB Spark is fun
BRIAN Spark is great
DAN Scala is awesome
BOB Scala is flexible
DAN Spark is cool
DAN Scala is awesome
24 © 2015 IBM Corporation
Code Execution (4)
// Create RDD
val quotes = sc.textFile("hdfs:/sparkdata/sparkQuotes.txt")
// Transformations
val danQuotes = quotes.filter(_.startsWith("DAN"))
val danSpark = danQuotes.map(_.split(" ")).map(x => x(1))
// Action
danSpark.filter(_.contains("Spark")).count()
DAN Spark is cool
BOB Spark is fun
BRIAN Spark is great
DAN Scala is awesome
BOB Scala is flexible
File: sparkQuotes.txt RDD: quotes RDD: danQuotes RDD: danSpark
DAN Spark is cool
BOB Spark is fun
BRIAN Spark is great
DAN Scala is awesome
BOB Scala is flexible
DAN Spark is cool
DAN Scala is awesome
Spark
Scala
25 © 2015 IBM Corporation
Code Execution (5)
// Create RDD
val quotes = sc.textFile("hdfs:/sparkdata/sparkQuotes.txt")
// Transformations
val danQuotes = quotes.filter(_.startsWith("DAN"))
val danSpark = danQuotes.map(_.split(" ")).map(x => x(1))
// Action
danSpark.filter(_.contains("Spark")).count()
DAN Spark is cool
BOB Spark is fun
BRIAN Spark is great
DAN Scala is awesome
BOB Scala is flexible
File: sparkQuotes.txt
HadoopRDD
DAN Spark is cool
BOB Spark is fun
BRIAN Spark is great
DAN Scala is awesome
BOB Scala is flexible
RDD: quotes
DAN Spark is cool
DAN Scala is awesome
RDD: danQuotes
Spark
Scala
RDD: danSpark
1
26 © 2015 IBM Corporation
DataFrames
 A DataFrame is a distributed collection of data organized into named columns. It is
conceptually equivalent to a table in a relational database, an R data frame or a Python
pandas DataFrame, but distributed and with query optimizations and predicate pushdown
to the underlying storage.
 DataFrames can be constructed from a wide array of sources such as: structured data files,
tables in Hive, external databases, or existing RDDs.
 Released in Spark 1.3
DataBricks / Spark Summit 2015
27 © 2015 IBM Corporation
DataFrames Examples
// Create the DataFrame
val df = sqlContext.read.parquet("examples/src/main/resources/people.parquet")
// Show the content of the DataFrame
df.show()
// Print the schema in a tree format
df.printSchema()
// root
// |-- age: long (nullable = true)
// |-- name: string (nullable = true)
// Select only the "name" column
df.select("name").show()
// Select everybody, but increment the age by 1
df.select(df("name"), df("age") + 1).show()
// Select people older than 21
df.filter(df("age") > 21).show()
// Count people by age
df.groupBy("age").count().show()
28 © 2015 IBM Corporation
Spark
Execution Model
29 © 2015 IBM Corporation
Components (DataBricks)
Your program:
sc = new SparkContext
f = sc.textFile(“…”)
f.filter(…).count()
...
[Diagram: the Spark client (app master) runs the driver with the RDD graph, scheduler, block tracker and shuffle tracker; a cluster manager allocates Spark workers, each running task threads and a block manager on top of HDFS, HBase, …]
30 © 2015 IBM Corporation
Scheduling Process (DataBricks)
rdd1.join(rdd2).groupBy(…).filter(…)
[Diagram of the scheduling pipeline:
– RDD Objects: build the operator DAG
– DAGScheduler: splits the graph into stages of tasks and submits each stage as ready (agnostic to operators); a failed stage is resubmitted
– TaskScheduler: launches TaskSets via the cluster manager and retries failed or straggling tasks (doesn’t know about stages)
– Worker: executes tasks on threads and stores/serves blocks via the block manager]
31 © 2015 IBM Corporation
Scheduler Optimizations (DataBricks)
Pipelines narrow ops. within a stage
Picks join algorithms based on partitioning (minimize shuffles)
Reuses previously cached data
[Diagram: a DAG over RDDs A–G with map, union, join and groupBy operators split into Stage 1, Stage 2 and Stage 3; previously computed partitions are skipped]
32 © 2015 IBM Corporation
Directed Acyclic Graph (DAG)
 View the lineage with toDebugString
 The whole chain could equally be written as a single expression
scala> danSpark.toDebugString
res1: String =
(2) MappedRDD[4] at map at <console>:16
| MappedRDD[3] at map at <console>:16
| FilteredRDD[2] at filter at <console>:14
| hdfs:/sparkdata/sparkQuotes.txt MappedRDD[1] at textFile at <console>:12
| hdfs:/sparkdata/sparkQuotes.txt HadoopRDD[0] at textFile at <console>:12
val danSpark = sc.textFile("hdfs:/sparkdata/sparkQuotes.txt").
filter(_.startsWith("DAN")).
map(_.split(" ")).
map(x => x(1)).
filter(_.contains("Spark"))
danSpark.count()
33 © 2015 IBM Corporation
Showing Multiple Apps
[Diagram: each app’s driver program holds a SparkContext that talks to the cluster manager; executors on the worker nodes run that app’s tasks and hold its cache]
 Each Spark application runs as a set of processes coordinated by the
Spark context object (driver program)
– Spark context connects to Cluster Manager (standalone, Mesos/Yarn)
– Spark context acquires executors (JVM instance)
on worker nodes
– Spark context sends tasks to the executors
DataBricks
34 © 2015 IBM Corporation
Spark Terminology
 Context (Connection):
– Represents a connection to the Spark cluster. The application which initiated
the context can submit one or several jobs, sequentially or in parallel, in batch or
interactively, or run as a long-running server continuously serving requests.
 Driver (Coordinator agent)
– The program or process running the Spark context. Responsible for running
jobs over the cluster and converting the App into a set of tasks
 Job (Query / Query plan):
– A piece of logic (code) which will take some input from HDFS (or the local
filesystem), perform some computations (transformations and actions) and
write some output back.
 Stage (Subplan)
– Jobs are divided into stages
 Tasks (Sub section)
– Each stage is made up of tasks. One task per partition. One task is executed
on one partition (of data) by one executor
 Executor (Sub agent)
– The process responsible for executing a task on a worker node
 Resilient Distributed Dataset (RDD) – the distributed data abstraction described in the Spark Core API section
35 © 2015 IBM Corporation
Spark
Shell & Application Deployment
36 © 2015 IBM Corporation
Spark’s Scala and Python Shell
 Spark comes with two shells
– Scala
– Python
 APIs available for Scala, Python and Java
 Appropriate versions for each Spark release
 Spark’s native language is Scala, so it is more natural to write Spark
applications in Scala.
 This presentation will focus on code examples in Scala
37 © 2015 IBM Corporation
Spark’s Scala and Python Shell
 Powerful tool to analyze data interactively
 The Scala shell runs on the Java VM
– Can leverage existing Java libraries
 Scala:
– To launch the Scala shell (from Spark home directory):
./bin/spark-shell
– To read in a text file:
scala> val textFile = sc.textFile("README.txt")
 Python:
– To launch the Python shell (from Spark home directory):
./bin/pyspark
– To read in a text file:
>>> textFile = sc.textFile("README.txt")
38 © 2015 IBM Corporation
SparkContext in Applications
 The main entry point for Spark functionality
 Represents the connection to a Spark cluster
 Create RDDs, accumulators, and broadcast variables on that
cluster
 In the Spark shell, the SparkContext, sc, is automatically initialized
for you to use
 In a Spark program, import some classes and implicit conversions
into your program:
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
39 © 2015 IBM Corporation
A Spark Standalone Application in Scala
Import statements
SparkConf and
SparkContext
Transformations
and Actions
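The code shown on the original slide is an image; a minimal sketch of such a standalone application (the README.md path is an assumption) could look like this:

/* SimpleApp.scala */
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf

object SimpleApp {
  def main(args: Array[String]) {
    // SparkConf and SparkContext
    val conf = new SparkConf().setAppName("Simple Application")
    val sc = new SparkContext(conf)
    // Transformations and actions
    val logData = sc.textFile("README.md").cache()
    val numAs = logData.filter(line => line.contains("a")).count()
    val numBs = logData.filter(line => line.contains("b")).count()
    println(s"Lines with a: $numAs, Lines with b: $numBs")
    sc.stop()
  }
}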
40 © 2015 IBM Corporation
Running Standalone Applications
 Define the dependencies
– Scala → simple.sbt
 Create the typical directory structure with the files
 Create a JAR package containing the application’s code.
– Scala: sbt package
 Use spark-submit to run the program
Scala:
./simple.sbt
./src
./src/main
./src/main/scala
./src/main/scala/SimpleApp.scala
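As a sketch, simple.sbt declares the spark-core dependency (the Scala and Spark versions below are assumptions matching a Spark 1.4 build):

name := "Simple Project"

version := "1.0"

scalaVersion := "2.10.4"

libraryDependencies += "org.apache.spark" %% "spark-core" % "1.4.0"

After sbt package, the resulting JAR (its path depends on the project name and Scala version) is submitted with something like:
./bin/spark-submit --class "SimpleApp" --master local[4] target/scala-2.10/simple-project_2.10-1.0.jar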
41 © 2015 IBM Corporation
Spark Properties
 Set application properties via the SparkConf object
val conf = new SparkConf()
.setMaster("local")
.setAppName("CountingSheep")
.set("spark.executor.memory", "1g")
val sc = new SparkContext(conf)
 Dynamically setting Spark properties
– SparkContext with an empty conf
val sc = new SparkContext(new SparkConf())
– Supply the configuration values during runtime
./bin/spark-submit --name "My app" --master local[4] --conf spark.shuffle.spill=false --conf "spark.executor.extraJavaOptions=-XX:+PrintGCDetails -XX:+PrintGCTimeStamps" myApp.jar
– conf/spark-defaults.conf
 Application web UI
http://<driver>:4040
42 © 2015 IBM Corporation
Spark Configuration
 Three locations for configuration:
– Spark properties
– Environment variables
conf/spark-env.sh
– Logging
log4j.properties
 Override default configuration directory (SPARK_HOME/conf)
– SPARK_CONF_DIR
• spark-defaults.conf
• spark-env.sh
• log4j.properties
• etc.
43 © 2015 IBM Corporation
Spark Monitoring
 Three ways to monitor Spark applications
1. Web UI
• Default port 4040
• Available for the duration of the application
2. Metrics
• Based on the Coda Hale Metrics Library
• Report to a variety of sinks (HTTP, JMX, and CSV)
• /conf/metrics.properties
3. External instrumentations
• Ganglia
• OS profiling tools (dstat, iostat, iotop)
• JVM utilities (jstack, jmap, jstat, jconsole)
44 © 2015 IBM Corporation
Running Spark Examples
 Spark samples available in the examples directory
 Run the examples (from Spark home directory):
./bin/run-example SparkPi
where SparkPi is the name of the sample application
45 © 2015 IBM Corporation
Spark Extensions
46 © 2015 IBM Corporation
Spark Extensions
 Extensions to the core Spark API
 Improvements made to the core are passed to these libraries
 Little overhead to use with the Spark core
from http://spark.apache.org
47 © 2015 IBM Corporation
Spark SQL
 Process relational queries expressed in SQL (HiveQL)
 Seamlessly mix SQL queries with Spark programs
 In Spark since 1.0, refactored on top of DataFrames since 1.3
 Provide a single interface for efficiently working with structured
data including Apache Hive, Parquet and JSON files
 Leverages Hive frontend and metastore
– Compatibility with Hive data, queries
and UDFs
– HiveQL limitations may apply
– Not ANSI SQL compliant
– Little to no query rewrite optimization,
automatic memory management or
sophisticated workload management
 Graduated from alpha status with Spark 1.3
 Standard connectivity through JDBC/ODBC
48 © 2015 IBM Corporation
Spark SQL - Getting Started
 SQLContext created from SparkContext
// An existing SparkContext, sc
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
 HiveContext created from SparkContext
// An existing SparkContext, sc
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
 Import a library to convert an RDD to a DataFrame
– Scala:
import sqlContext.implicits._
 DataFrame data sources
– Inferring the schema using reflection
– Programmatic interface
49 © 2015 IBM Corporation
Spark SQL - Inferring the Schema Using Reflection
 The case class in Scala defines the schema of the table
case class Person(name: String, age: Int)
 The arguments of the case class become the names of the columns
 Create the RDD of the Person object and create a DataFrame
val people = sc.textFile("examples/src/main/resources/people.txt").
map(_.split(",")).
map(p => Person(p(0), p(1).trim.toInt)).toDF()
 Register the DataFrame as a table
people.registerTempTable("people")
 Run SQL statements using the sql method provided by the
SQLContext
val teenagers = sqlContext.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")
 The results of the queries are DataFrames and support all the normal
RDD operations
teenagers.map(t => "Name: " + t(0)).collect().foreach(println)
50 © 2015 IBM Corporation
Spark SQL - Programmatic Interface
 Use when you cannot define the case classes ahead of time
 Three steps to create the Dataframe
1. Schema encoded as a String, import SparkSQL Struct types
val schemaString = "name age"
import org.apache.spark.sql.Row;
import org.apache.spark.sql.types.{StructType,StructField,StringType};
2. Create the schema represented by a StructType matching the structure of
the Rows in the RDD from step 1.
val schema = StructType( schemaString.split(" ").map(fieldName =>
StructField(fieldName, StringType, true)))
3. Apply the schema to the RDD of Rows using the createDataFrame method.
val rowRDD = people.map(_.split(",")).map(p => Row(p(0), p(1).trim))
val peopleDataFrame = sqlContext.createDataFrame(rowRDD, schema)
 Then register the peopleDataFrame as a table
peopleDataFrame.registerTempTable("people")
 Run the sql statements using the sql method:
val results = sqlContext.sql("SELECT name FROM people")
results.map(t => "Name: " + t(0)).collect().foreach(println)
51 © 2015 IBM Corporation
SparkSQL - DataSources
Before: Spark 1.2.x
 ParquetFile
– val parquetFile = sqlContext.parquetFile("people.parquet")
 JSON
– val df = sqlContext.jsonFile("examples/src/main/resources/people.json")
Spark 1.3.x
 Generic Load/Save
– val df = sqlContext.load("<filename>", "<datasource type>")
– df.save("<filename>", "<datasource type>")
 ParquetFile
– val df = sqlContext.load("people.parquet") // parquet unless otherwise configured by spark.sql.sources.default
– df.select("name", "age").save("namesAndAges.parquet")
 JSON
– val df = sqlContext.load("people.json", "json")
– df.select("name", "age").save("namesAndAges.json", "json")
 CSV (external package)
– val df = sqlContext.load("com.databricks.spark.csv", Map("path" -> "cars.csv", "header" -> "true"))
– df.select("year", "model").save("newcars.csv", "com.databricks.spark.csv")
Spark 1.4.x
 Generic Load/Save
– val df = sqlContext.read.load("<filename>", "<datasource type>")
– df.write.save("<filename>", "<datasource type>")
 ParquetFile
– val df = sqlContext.read.load("people.parquet") // parquet unless otherwise configured by spark.sql.sources.default
– df.select("name", "age").write.save("namesAndAges.parquet")
 JSON
– val df = sqlContext.read.load("people.json", "json")
– df.select("name", "age").write.save("namesAndAges.json", "json")
 CSV (external package)
– val df = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").load("cars.csv")
– df.select("year", "model").write.format("com.databricks.spark.csv").save("newcars.csv")
The DataSource API provides generic methods to manage connectors to any datasource (file, JDBC, Cassandra, MongoDB, etc.), available from Spark 1.3, and provides predicate pushdown capabilities to leverage the performance of the backend. Most connectors are available at http://spark-packages.org/
52 © 2015 IBM Corporation
Spark Streaming
 Scalable, high-throughput, fault-tolerant stream processing of live
data streams
 Spark Streaming applications are written like regular Spark applications
 Recovers lost work and operator state (sliding windows) out of the box
 Uses HDFS and Zookeeper for high availability
 Data sources also include TCP sockets, ZeroMQ or other customized
data sources
53 © 2015 IBM Corporation
Spark Streaming - Internals
 The input stream goes into Spark Streaming
 Spark Streaming breaks it up into batches of input data
 Feeds them into the Spark engine for processing
 Generates the final results in streams of batches
 DStream - Discretized Stream
– Represents a continuous stream of data created from the input streams
– Internally, represented as a sequence of RDDs
54 © 2015 IBM Corporation
Spark Streaming - Getting Started
 Count the number of words coming in from the TCP socket
 Import the Spark Streaming classes
import org.apache.spark._
import org.apache.spark.streaming._
import org.apache.spark.streaming.StreamingContext._
 Create the StreamingContext object
val conf =
new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount")
val ssc = new StreamingContext(conf, Seconds(1))
 Create a DStream
val lines = ssc.socketTextStream("localhost", 9999)
 Split the lines into words
val words = lines.flatMap(_.split(" "))
 Count the words
val pairs = words.map(word => (word, 1))
val wordCounts = pairs.reduceByKey(_ + _)
 Print to the console
wordCounts.print()
55 © 2015 IBM Corporation
Spark Streaming - Continued
 No real processing happens until you tell it
ssc.start() // Start the computation
ssc.awaitTermination() // Wait for the computation to terminate
 Code and application can be found in the NetworkWordCount
example
 To run the example:
– Invoke netcat to start the data stream
– In a different terminal, run the application
./bin/run-example streaming.NetworkWordCount localhost 9999
56 © 2015 IBM Corporation
Spark MLlib
 Spark MLlib is Spark’s machine learning library
 Since Spark 0.8
 Provides common algorithms and
utilities
• Classification
• Regression
• Clustering
• Collaborative filtering
• Dimensionality reduction
 Leverages in-memory cache of Spark
to speed up iteration processing
57 © 2015 IBM Corporation
Spark MLlib - Getting Started
 Use k-means clustering for a set of latitudes and longitudes
 Import the Spark MLlib classes
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors
 Create the SparkContext object
val conf = new SparkConf().setAppName("KMeans")
val sc = new SparkContext(conf)
 Create a data RDD
val taxifile = sc.textFile("user/spark/sparkdata/nyctaxisub/*")
 Create Vectors for input to algorithm
val taxi =
taxifile.map{line=>Vectors.dense(line.split(",").slice(3,5).map(_.toDouble))}
 Run the k-means algorithm with 3 clusters and 10 iterations
val model = KMeans.train(taxi, 3, 10)
val clusterCenters = model.clusterCenters.map(_.toArray)
 Print to the console
clusterCenters.foreach(lines=>println(lines(0),lines(1)))
58 © 2015 IBM Corporation
SparkML
 SparkML provides an API to build ML pipelines (since Spark 1.3)
 Similar to Python scikit-learn
 SparkML provides abstractions for all steps of an ML workflow
[Diagram: generic ML workflow vs. real-life ML workflow]
 Transformer: A Transformer is an algorithm which can transform
one DataFrame into another DataFrame. E.g., an ML model is a
Transformer which transforms a DataFrame with features into a
DataFrame with predictions.
 Estimator: An Estimator is an algorithm which can be fit on a
DataFrame to produce a Transformer. E.g., a learning algorithm
is an Estimator which trains on a dataset and produces a model.
 Pipeline: A Pipeline chains multiple Transformers and Estimators
together to specify an ML workflow.
 Param: All Transformers and Estimators now share a common
API for specifying parameters. Xebia HUG France 06/2015
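A short sketch of chaining Transformers and an Estimator into a Pipeline, in the spirit of the spark.ml examples (the training and test DataFrames and their "text"/"label" columns are assumptions):

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

// training is assumed to be a DataFrame with "text" and "label" columns
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
val lr = new LogisticRegression().setMaxIter(10)
// Transformers and the Estimator chained into a single Pipeline (itself an Estimator)
val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))
val model = pipeline.fit(training)   // fit() produces a PipelineModel (a Transformer)
model.transform(test).select("text", "prediction").show()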
59 © 2015 IBM Corporation
Spark GraphX
 Flexible Graphing
–GraphX unifies ETL, exploratory analysis, and iterative graph
computation
–You can view the same data as both graphs and collections,
transform and join graphs with RDDs efficiently, and write custom
iterative graph algorithms with the API
 Speed
–Comparable performance to the fastest specialized graph
processing systems.
 Algorithms
–Choose from a growing library of graph algorithms
–In addition to a highly flexible API, GraphX comes
with a variety of graph algorithms
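A minimal GraphX sketch from spark-shell (the edge-list file path is an assumption):

import org.apache.spark.graphx.GraphLoader
// Load a graph from an edge list ("srcId dstId" per line) and run PageRank
val graph = GraphLoader.edgeListFile(sc, "hdfs:/sparkdata/followers.txt")
val ranks = graph.pageRank(0.0001).vertices          // RDD of (vertexId, rank)
ranks.sortBy(_._2, ascending = false).take(5).foreach(println)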
60 © 2015 IBM Corporation
Spark R
 Spark R is an R package that provides a light-weight front-end to use
Apache Spark from R
 Spark R exposes the Spark API through the RDD class and allows
users to interactively run jobs from the R shell on a cluster.
 Goal
– Make Spark R production ready
– Integration with MLlib
– Consolidations to the DataFrames and RDD concepts
 First release in Spark 1.4.0 :
– Support of DataFrames
 Spark 1.5
– Support of MLlib
61 © 2015 IBM Corporation
Spark internals refactoring : Project Tungsten
 Memory Management and Binary Processing:
leverage application semantics to manage memory
explicitly and eliminate the overhead of JVM object
model and garbage collection
 Cache-aware computation: algorithms and data
structures to exploit memory hierarchy
 Code generation: exploit modern compilers and
CPUs: allow efficient operation directly on binary data
DataBricks / Spark Summit 2015
62 © 2015 IBM Corporation
Spark: Final Thoughts
 Spark is a good replacement for MapReduce
– Higher performance
– The framework is easier to use than MapReduce (M/R)
– Powerful RDD & DataFrames concepts
– Rich higher-level libraries: SparkSQL, MLlib/ML, Streaming, GraphX
– Broad ecosystem adoption
 This is a very fast-paced environment, so keep up!
– Lots of new features in each new release (a major release every 3 months)
– Spark has the latest / best offering today, but things may change again
63 © 2015 IBM Corporation
Resources
 The Learning Spark O’Reilly book
 Lab(s) this afternoon
 The following course on big data university
More Related Content

What's hot (20)

PDF
Apache Spark 101
Abdullah Çetin ÇAVDAR
 
PDF
Apache Spark Tutorial
Ahmet Bulut
 
PPTX
Apache Spark Architecture | Apache Spark Architecture Explained | Apache Spar...
Simplilearn
 
PPTX
Introduction to Apache Spark Developer Training
Cloudera, Inc.
 
PDF
Pyspark Tutorial | Introduction to Apache Spark with Python | PySpark Trainin...
Edureka!
 
PDF
Apache Spark - Intro to Large-scale recommendations with Apache Spark and Python
Christian Perone
 
PPTX
Apache spark
TEJPAL GAUTAM
 
PPTX
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
Simplilearn
 
PDF
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Databricks
 
PDF
Introduction to Apache Spark
Samy Dindane
 
PDF
Spark
Intellipaat
 
PPTX
Introduction to Apache Spark
Rahul Jain
 
PDF
A look under the hood at Apache Spark's API and engine evolutions
Databricks
 
PPTX
Big Data and Hadoop Guide
Simplilearn
 
PDF
Spark SQL | Apache Spark
Edureka!
 
PDF
Apache spark
Dona Mary Philip
 
PPTX
Learn Apache Spark: A Comprehensive Guide
Whizlabs
 
PPTX
Lightening Fast Big Data Analytics using Apache Spark
Manish Gupta
 
PDF
Performance of Spark vs MapReduce
Edureka!
 
PPTX
Spark SQL
Caserta
 
Apache Spark 101
Abdullah Çetin ÇAVDAR
 
Apache Spark Tutorial
Ahmet Bulut
 
Apache Spark Architecture | Apache Spark Architecture Explained | Apache Spar...
Simplilearn
 
Introduction to Apache Spark Developer Training
Cloudera, Inc.
 
Pyspark Tutorial | Introduction to Apache Spark with Python | PySpark Trainin...
Edureka!
 
Apache Spark - Intro to Large-scale recommendations with Apache Spark and Python
Christian Perone
 
Apache spark
TEJPAL GAUTAM
 
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
Simplilearn
 
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Databricks
 
Introduction to Apache Spark
Samy Dindane
 
Introduction to Apache Spark
Rahul Jain
 
A look under the hood at Apache Spark's API and engine evolutions
Databricks
 
Big Data and Hadoop Guide
Simplilearn
 
Spark SQL | Apache Spark
Edureka!
 
Apache spark
Dona Mary Philip
 
Learn Apache Spark: A Comprehensive Guide
Whizlabs
 
Lightening Fast Big Data Analytics using Apache Spark
Manish Gupta
 
Performance of Spark vs MapReduce
Edureka!
 
Spark SQL
Caserta
 

Viewers also liked (13)

PDF
API Days Apidaze WebRTC Hype or Disruption 4 dec. 2013
Luis Borges Quina
 
PDF
Value Creation Strategies for APIs
Apigee | Google Cloud
 
PPTX
How to Talk about APIs (APIDays Paris 2016)
Andrew Seward
 
PPTX
APIdaze_Meetup require ('lx') _ TADHack 23 May 2015
Luis Borges Quina
 
PDF
Translation is UX manifesto
Antoine Lefeuvre
 
PDF
Networks, cloud & operator innovation- Mats Alendal
Ericsson
 
PDF
Incubateur HEC Presentation programme Oct 2016
Remi Rivas
 
PDF
Ottspott by Apidaze @API Days Paris 2015
Luis Borges Quina
 
PDF
Apache Spark and DataStax Enablement
Vincent Poncet
 
PDF
figo at API Days 2016 in Paris
Lars Markull
 
PDF
Api days 2014 from theatrophone to ap is_the 2020 telco challenge_
Luis Borges Quina
 
PDF
WebRTC Paris Meetup@ Google (10th Feb. 2014) : Apidaze Presentation
Luis Borges Quina
 
PDF
Manifeste 'Translation is UX'
Antoine Lefeuvre
 
API Days Apidaze WebRTC Hype or Disruption 4 dec. 2013
Luis Borges Quina
 
Value Creation Strategies for APIs
Apigee | Google Cloud
 
How to Talk about APIs (APIDays Paris 2016)
Andrew Seward
 
APIdaze_Meetup require ('lx') _ TADHack 23 May 2015
Luis Borges Quina
 
Translation is UX manifesto
Antoine Lefeuvre
 
Networks, cloud & operator innovation- Mats Alendal
Ericsson
 
Incubateur HEC Presentation programme Oct 2016
Remi Rivas
 
Ottspott by Apidaze @API Days Paris 2015
Luis Borges Quina
 
Apache Spark and DataStax Enablement
Vincent Poncet
 
figo at API Days 2016 in Paris
Lars Markull
 
Api days 2014 from theatrophone to ap is_the 2020 telco challenge_
Luis Borges Quina
 
WebRTC Paris Meetup@ Google (10th Feb. 2014) : Apidaze Presentation
Luis Borges Quina
 
Manifeste 'Translation is UX'
Antoine Lefeuvre
 
Ad

Similar to Introduction to Apache Spark (20)

PDF
Boston Spark Meetup event Slides Update
vithakur
 
PPTX
Apache spark sneha challa- google pittsburgh-aug 25th
Sneha Challa
 
PDF
Fast Data Analytics with Spark and Python
Benjamin Bengfort
 
PPTX
Introduction to Apache Spark
Mohamed hedi Abidi
 
PDF
Tuning and Debugging in Apache Spark
Databricks
 
PPTX
Paris Data Geek - Spark Streaming
Djamel Zouaoui
 
PDF
Meetup ml spark_ppt
Snehal Nagmote
 
PPTX
Ten tools for ten big data areas 03_Apache Spark
Will Du
 
PPTX
Dive into spark2
Gal Marder
 
PPT
Scala and spark
Fabio Fumarola
 
PPTX
Spark Study Notes
Richard Kuo
 
PPTX
Spark core
Prashant Gupta
 
PDF
Apache Spark: What? Why? When?
Massimo Schenone
 
PDF
Scala+data
Samir Bessalah
 
PPTX
Spark from the Surface
Josi Aranda
 
PPTX
Spark real world use cases and optimizations
Gal Marder
 
PDF
Unified Big Data Processing with Apache Spark (QCON 2014)
Databricks
 
PPTX
Introduction to Spark - DataFactZ
DataFactZ
 
PPT
Apache Spark™ is a multi-language engine for executing data-S5.ppt
bhargavi804095
 
PDF
Apache Spark Introduction - CloudxLab
Abhinav Singh
 
Boston Spark Meetup event Slides Update
vithakur
 
Apache spark sneha challa- google pittsburgh-aug 25th
Sneha Challa
 
Fast Data Analytics with Spark and Python
Benjamin Bengfort
 
Introduction to Apache Spark
Mohamed hedi Abidi
 
Tuning and Debugging in Apache Spark
Databricks
 
Paris Data Geek - Spark Streaming
Djamel Zouaoui
 
Meetup ml spark_ppt
Snehal Nagmote
 
Ten tools for ten big data areas 03_Apache Spark
Will Du
 
Dive into spark2
Gal Marder
 
Scala and spark
Fabio Fumarola
 
Spark Study Notes
Richard Kuo
 
Spark core
Prashant Gupta
 
Apache Spark: What? Why? When?
Massimo Schenone
 
Scala+data
Samir Bessalah
 
Spark from the Surface
Josi Aranda
 
Spark real world use cases and optimizations
Gal Marder
 
Unified Big Data Processing with Apache Spark (QCON 2014)
Databricks
 
Introduction to Spark - DataFactZ
DataFactZ
 
Apache Spark™ is a multi-language engine for executing data-S5.ppt
bhargavi804095
 
Apache Spark Introduction - CloudxLab
Abhinav Singh
 
Ad

Recently uploaded (20)

PPTX
AI Penetration Testing Essentials: A Cybersecurity Guide for 2025
defencerabbit Team
 
PDF
CIFDAQ Market Wrap for the week of 4th July 2025
CIFDAQ
 
PDF
What’s my job again? Slides from Mark Simos talk at 2025 Tampa BSides
Mark Simos
 
PPTX
Q2 FY26 Tableau User Group Leader Quarterly Call
lward7
 
PPTX
New ThousandEyes Product Innovations: Cisco Live June 2025
ThousandEyes
 
PDF
Book industry state of the nation 2025 - Tech Forum 2025
BookNet Canada
 
PDF
Newgen Beyond Frankenstein_Build vs Buy_Digital_version.pdf
darshakparmar
 
PDF
The 2025 InfraRed Report - Redpoint Ventures
Razin Mustafiz
 
PPTX
COMPARISON OF RASTER ANALYSIS TOOLS OF QGIS AND ARCGIS
Sharanya Sarkar
 
PDF
Peak of Data & AI Encore AI-Enhanced Workflows for the Real World
Safe Software
 
DOCX
Cryptography Quiz: test your knowledge of this important security concept.
Rajni Bhardwaj Grover
 
DOCX
Python coding for beginners !! Start now!#
Rajni Bhardwaj Grover
 
PPT
Ericsson LTE presentation SEMINAR 2010.ppt
npat3
 
PPTX
Agentforce World Tour Toronto '25 - MCP with MuleSoft
Alexandra N. Martinez
 
PDF
The Rise of AI and IoT in Mobile App Tech.pdf
IMG Global Infotech
 
PPTX
The Project Compass - GDG on Campus MSIT
dscmsitkol
 
PDF
“NPU IP Hardware Shaped Through Software and Use-case Analysis,” a Presentati...
Edge AI and Vision Alliance
 
PDF
Staying Human in a Machine- Accelerated World
Catalin Jora
 
PDF
UiPath DevConnect 2025: Agentic Automation Community User Group Meeting
DianaGray10
 
PDF
“Voice Interfaces on a Budget: Building Real-time Speech Recognition on Low-c...
Edge AI and Vision Alliance
 
AI Penetration Testing Essentials: A Cybersecurity Guide for 2025
defencerabbit Team
 
CIFDAQ Market Wrap for the week of 4th July 2025
CIFDAQ
 
What’s my job again? Slides from Mark Simos talk at 2025 Tampa BSides
Mark Simos
 
Q2 FY26 Tableau User Group Leader Quarterly Call
lward7
 
New ThousandEyes Product Innovations: Cisco Live June 2025
ThousandEyes
 
Book industry state of the nation 2025 - Tech Forum 2025
BookNet Canada
 
Newgen Beyond Frankenstein_Build vs Buy_Digital_version.pdf
darshakparmar
 
The 2025 InfraRed Report - Redpoint Ventures
Razin Mustafiz
 
COMPARISON OF RASTER ANALYSIS TOOLS OF QGIS AND ARCGIS
Sharanya Sarkar
 
Peak of Data & AI Encore AI-Enhanced Workflows for the Real World
Safe Software
 
Cryptography Quiz: test your knowledge of this important security concept.
Rajni Bhardwaj Grover
 
Python coding for beginners !! Start now!#
Rajni Bhardwaj Grover
 
Ericsson LTE presentation SEMINAR 2010.ppt
npat3
 
Agentforce World Tour Toronto '25 - MCP with MuleSoft
Alexandra N. Martinez
 
The Rise of AI and IoT in Mobile App Tech.pdf
IMG Global Infotech
 
The Project Compass - GDG on Campus MSIT
dscmsitkol
 
“NPU IP Hardware Shaped Through Software and Use-case Analysis,” a Presentati...
Edge AI and Vision Alliance
 
Staying Human in a Machine- Accelerated World
Catalin Jora
 
UiPath DevConnect 2025: Agentic Automation Community User Group Meeting
DianaGray10
 
“Voice Interfaces on a Budget: Building Real-time Speech Recognition on Low-c...
Edge AI and Vision Alliance
 

Introduction to Apache Spark

  • 1. © 2015 IBM Corporation Introduction to Apache Spark Vincent Poncet IBM Software Big Data Technical Sale 02/07/2015
  • 2. 2 © 2015 IBM Corporation Credits  This presentations draws upon previous work / slides by IBM colleagues from WW Software Big Data Organization : Daniel Kikuchi, Jacques Roy and Mokhtar Kandil  I used several materials from DataBricks and Apache Spark documentation
  • 3. 3 © 2015 IBM Corporation Introduction and background Spark Core API Spark Execution Model Spark Shell & Application Deployment Spark Extensions (SparkSQL, MLlib, Spark Streaming) Spark Future Agenda
  • 4. 4 © 2015 IBM Corporation Introduction and background
  • 5. 5 © 2015 IBM Corporation  Apache Spark is a fast, general purpose, easy-to-use cluster computing system for large-scale data processing – Fast • Leverages aggressively cached in-memory distributed computing and dedicated Executor processes even when no jobs are running • Faster than MapReduce – General purpose • Covers a wide range of workloads • Provides SQL, streaming and complex analytics – Flexible and easier to use than Map Reduce • Spark is written in Scala, an object oriented, functional programming language • Scala, Python and Java APIs • Scala and Python interactive shells • Runs on Hadoop, Mesos, standalone or cloud Logistic regression in Hadoop and Spark Spark Stack val wordCounts = sc.textFile("README.md").flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey((a, b) => a + b) WordCount
  • 6. 6 © 2015 IBM Corporation Brief History of Spark  2002 – MapReduce @ Google  2004 – MapReduce paper  2006 – Hadoop @ Yahoo  2008 – Hadoop Summit  2010 – Spark paper  2013 – Spark 0.7 Apache Incubator  2014 – Apache Spark top-level  2014 – 1.2.0 release in December  2015 – 1.3.0 release in March  2015 – 1.4.0 release in June  Spark is HOT!!!  Most active project in Hadoop ecosystem  One of top 3 most active Apache projects  Databricks founded by the creators of Spark from UC Berkeley’s AMPLab Activity for 6 months in 2014 (from Matei Zaharia – 2014 Spark Summit) DataBricks In June 2015, code base was about 400K lines
  • 7. 7 © 2015 IBM Corporation DataBricks / Spark Summit 2015
  • 8. 8 © 2015 IBM Corporation Large Scale Usage DataBricks / Spark Summit 2015
  • 9. 9 © 2015 IBM Corporation Spark ecosystem  Spark is quite versatile and flexible: – Can run on YARN / HDFS but also standalone or on MESOS – The general processing capabilities of the Spark engine can be exploited from multiple “entry points”: SQL, Streaming, Machine Learning, Graph Processing
  • 10. 10 © 2015 IBM Corporation Spark in the Hadoop ecosystem  Currently, Spark is a general purpose parallel processing engine which integrates with YARN along the rest of the Hadoop frameworks YARN HDFS Map/ Reduce 2 HivePig Spark HBase BigSQL Impala
  • 11. 11 © 2015 IBM Corporation Future of Spark’s role in Hadoop ?  The Spark Core engine is a good performant replacement for Map Reduce: YARN HDFS Spark Core BigSQL Spark SQL Spark MLlib Spark Streaming Hive Custom code HBase
  • 12. 12 © 2015 IBM Corporation Spark Core API
  • 13. 13 © 2015 IBM Corporation  An RDD is a distributed collection of Scala/Python/Java objects of the same type: – RDD of strings – RDD of integers – RDD of (key, value) pairs – RDD of class Java/Python/Scala objects  An RDD is physically distributed across the cluster, but manipulated as one logical entity: – Spark will “distribute” any required processing to all partitions where the RDD exists and perform necessary redistributions and aggregations as well. – Example: Consider a distributed RDD “Names” made of names Resilient Distributed Dataset (RDD): definition Mokhtar Jacques Dirk Cindy Dan Susan Dirk Frank Jacques Partition 1 Partition 2 Partition 3 Names
  • 14. 14 © 2015 IBM Corporation  Suppose we want to know the number of names in the RDD “Names”  User simply requests: Names.count() – Spark will “distribute” count processing to all partitions so as to obtain: • Partition 1: Mokhtar(1), Jacques (1), Dirk (1)  3 • Partition 2: Cindy (1), Dan (1), Susan (1)  3 • Partition 3: Dirk (1), Frank (1), Jacques (1)  3 – Local counts are subsequently aggregated: 3+3+3=9  To lookup the first element in the RDD: Names.first()  To display all elements of the RDD: Names.collect() (careful with this) Resilient Distributed Dataset: definition Mokhtar Jacques Dirk Cindy Dan Susan Dirk Frank Jacques Partition 1 Partition 2 Partition 3 Names
  • 15. 15 © 2015 IBM Corporation Resilient Distributed Datasets: Creation and Manipulation  Three methods for creation – Distributing a collection of objects from the driver program (using the parallelize method of the spark context) val rddNumbers = sc.parallelize(1 to 10) val rddLetters = sc.parallelize (List(“a”, “b”, “c”, “d”)) – Loading an external dataset (file) val quotes = sc.textFile("hdfs:/sparkdata/sparkQuotes.txt") – Transformation from another existing RDD val rddNumbers2 = rddNumbers.map(x=> x+1)  Dataset from any storage supported by Hadoop – HDFS, Cassandra, HBase, Amazon S3 – Others  File types supported – Text files, SequenceFiles, Parquet, JSON – Hadoop InputFormat
  • 16. 16 © 2015 IBM Corporation Resilient Distributed Datasets: Properties  Immutable  Two types of operations – Transformations ~ DDL (Create View V2 as…) • val rddNumbers = sc.parallelize(1 to 10): Numbers from 1 to 10 • val rddNumbers2 = rddNumbers.map (x => x+1): Numbers from 2 to 11 • The LINEAGE on how to obtain rddNumbers2 from rddNumber is recorded • It’s a Directed Acyclic Graph (DAG) • No actual data processing does take place  Lazy evaluations – Actions ~ DML (Select * From V2…) • rddNumbers2.collect(): Array [2, 3, 4, 5, 6, 7, 8, 9, 10, 11] • Performs transformations and action • Returns a value (or write to a file)  Fault tolerance – If data in memory is lost it will be recreated from lineage  Caching, persistence (memory, spilling, disk) and check-pointing
  • 17. 17 © 2015 IBM Corporation RDD Transformations  Transformations are lazy evaluations  Returns a pointer to the transformed RDD  Pair RDD (K,V) functions for MapReduce style transformations Transformation Meaning map(func) Return a new dataset formed by passing each element of the source through a function func. filter(func) Return a new dataset formed by selecting those elements of the source on which func returns true. flatMap(func) Similar to map, but each input item can be mapped to 0 or more output items. So func should return a Seq rather than a single item Full documentation at http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.package join(otherDataset, [numTasks]) When called on datasets of type (K, V) and (K, W), returns a dataset of (K, (V, W)) pairs with all pairs of elements for each key. reduceByKey(func) When called on a dataset of (K, V) pairs, returns a dataset of (K,V) pairs where the values for each key are aggregated using the given reduce function func sortByKey([ascendin g],[numTasks]) When called on a dataset of (K, V) pairs where K implements Ordered, returns a dataset of (K,V) pairs sorted by keys in ascending or descending order. combineByKey[C}(cr eateCombiner, mergeValue, mergeCombiners)) Generic function to combine the elements for each key using a custom set of aggregation functions. Turns an RDD[(K, V)] into a result of type RDD[(K, C)], for a "combined type" C. createCombiner: (V) ⇒ C, mergeValue: (C, V) ⇒ C, mergeCombiners: (C, C) ⇒ C)
  • 18. 18 © 2015 IBM Corporation RDD Actions  Actions returns values or save a RDD to disk Action Meaning collect() Return all the elements of the dataset as an array of the driver program. This is usually useful after a filter or another operation that returns a sufficiently small subset of data. count() Return the number of elements in a dataset. first() Return the first element of the dataset take(n) Return an array with the first n elements of the dataset. foreach(func) Run a function func on each element of the dataset. saveAsTextFile Save the RDD into a TextFile Full documentation at http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.package
  • 19. 19 © 2015 IBM Corporation RDD Persistence  Each node stores any partitions of the cache that it computes in memory  Reuses them in other actions on that dataset (or datasets derived from it) – Future actions are much faster (often by more than 10x)  Two methods for RDD persistence: persist() and cache() Storage Level Meaning MEMORY_ONLY Store as deserialized Java objects in the JVM. If the RDD does not fit in memory, part of it will be cached. The other will be recomputed as needed. This is the default. The cache() method uses this. MEMORY_AND_DISK Same except also store on disk if it doesn’t fit in memory. Read from memory and disk when needed. MEMORY_ONLY_SER Store as serialized Java objects (one bye array per partition). Space efficient, but more CPU intensive to read. MEMORY_AND_DISK_SER Similar to MEMORY_AND_DISK but stored as serialized objects. DISK_ONLY Store only on disk. MEMORY_ONLY_2, MEMORY_AND_DISK_2, etc. Same as above, but replicate each partition on two cluster nodes OFF_HEAP (experimental) Store RDD in serialized format in Tachyon.
  • 20. 20 © 2015 IBM Corporation Scala  Scala Crash Course  Holden Karau, DataBricks http://lintool.github.io/SparkTutorial/slides/day1_Scala_crash_course .pdf
  • 21. 21 © 2015 IBM Corporation Code Execution (1) // Create RDD val quotes = sc.textFile("hdfs:/sparkdata/sparkQuotes.txt") // Transformations val danQuotes = quotes.filter(_.startsWith("DAN")) val danSpark = danQuotes.map(_.split(" ")).map(x => x(1)) // Action danSpark.filter(_.contains("Spark")).count() DAN Spark is cool BOB Spark is fun BRIAN Spark is great DAN Scala is awesome BOB Scala is flexible File: sparkQuotes.txt  ‘spark-shell’ provides Spark context as ‘sc’
  • 22. 22 © 2015 IBM Corporation Code Execution (2) // Create RDD val quotes = sc.textFile("hdfs:/sparkdata/sparkQuotes.txt") // Transformations val danQuotes = quotes.filter(_.startsWith("DAN")) val danSpark = danQuotes.map(_.split(" ")).map(x => x(1)) // Action danSpark.filter(_.contains("Spark")).count() DAN Spark is cool BOB Spark is fun BRIAN Spark is great DAN Scala is awesome BOB Scala is flexible File: sparkQuotes.txt RDD: quotes DAN Spark is cool BOB Spark is fun BRIAN Spark is great DAN Scala is awesome BOB Scala is flexible
  • 23. 23 © 2015 IBM Corporation Code Execution (3) // Create RDD val quotes = sc.textFile("hdfs:/sparkdata/sparkQuotes.txt") // Transformations val danQuotes = quotes.filter(_.startsWith("DAN")) val danSpark = danQuotes.map(_.split(" ")).map(x => x(1)) // Action danSpark.filter(_.contains("Spark")).count() DAN Spark is cool BOB Spark is fun BRIAN Spark is great DAN Scala is awesome BOB Scala is flexible File: sparkQuotes.txt RDD: quotes RDD: danQuotes DAN Spark is cool BOB Spark is fun BRIAN Spark is great DAN Scala is awesome BOB Scala is flexible DAN Spark is cool DAN Scala is awesome
  • 24. 24 © 2015 IBM Corporation Code Execution (4) // Create RDD val quotes = sc.textFile("hdfs:/sparkdata/sparkQuotes.txt") // Transformations val danQuotes = quotes.filter(_.startsWith("DAN")) val danSpark = danQuotes.map(_.split(" ")).map(x => x(1)) // Action danSpark.filter(_.contains("Spark")).count() DAN Spark is cool BOB Spark is fun BRIAN Spark is great DAN Scala is awesome BOB Scala is flexible File: sparkQuotes.txt RDD: quotes RDD: danQuotes RDD: danSpark DAN Spark is cool BOB Spark is fun BRIAN Spark is great DAN Scala is awesome BOB Scala is flexible DAN Spark is cool DAN Scala is awesome Spark Scala
  • 25. 25 © 2015 IBM Corporation Code Execution (5) // Create RDD val quotes = sc.textFile("hdfs:/sparkdata/sparkQuotes.txt") // Transformations val danQuotes = quotes.filter(_.startsWith("DAN")) val danSpark = danQuotes.map(_.split(" ")).map(x => x(1)) // Action danSpark.filter(_.contains("Spark")).count() DAN Spark is cool BOB Spark is fun BRIAN Spark is great DAN Scala is awesome BOB Scala is flexible File: sparkQuotes.txt HadoopRDD DAN Spark is cool BOB Spark is fun BRIAN Spark is great DAN Scala is awesome BOB Scala is flexible RDD: quotes DAN Spark is cool DAN Scala is awesome RDD: danQuotes Spark Scala RDD: danSpark 1
  • 26. 26 © 2015 IBM Corporation DataFrames  A DataFrame is a distributed collection of data organized into named columns. It is conceptually equivalent to a table in a relational database, an R dataframe or Python Pandas, but in a distributed manner and with query optimizations and predicate pushdown to the underlying storage.  DataFrames can be constructed from a wide array of sources such as: structured data files, tables in Hive, external databases, or existing RDDs.  Released in Spark 1.3 DataBricks / Spark Summit 2015
  • 27. 27 © 2015 IBM Corporation DataFrames Examples // Create the DataFrame val df = sqlContext.read.parquet("examples/src/main/resources/people.parquet") // Show the content of the DataFrame df.show() // Print the schema in a tree format df.printSchema() // root // |-- age: long (nullable = true) // |-- name: string (nullable = true) // Select only the "name" column df.select("name").show() // Select everybody, but increment the age by 1 df.select(df("name"), df("age") + 1).show() // Select people older than 21 df.filter(df("age") > 21).show() // Count people by age df.groupBy("age").count().show()
  • 28. 28 © 2015 IBM Corporation Spark Execution Model
  • 29. 29 © 2015 IBM Corporation sc = new SparkContext f = sc.textFile(“…”) f.filter(…) .count() ... Your program Spark client (app master) Spark worker HDFS, HBase, … Block manager Task threads RDD graph Scheduler Block tracker Shuffle tracker Cluster manager Components DataBricks
  • 30. 30 © 2015 IBM Corporation rdd1.join(rdd2) .groupBy(…) .filter(…) RDD Objects build operator DAG agnostic to operators! doesn’t know about stages DAGScheduler split graph into stages of tasks submit each stage as ready DAG TaskScheduler TaskSet launch tasks via cluster manager retry failed or straggling tasks Cluster manager Worker execute tasks store and serve blocks Block manager Threads Task stage failed Scheduling Process DataBricks
  • 31. 31 © 2015 IBM Corporation Pipelines narrow ops. within a stage Picks join algorithms based on partitioning (minimize shuffles) Reuses previously cached data Scheduler Optimizations join union groupBy map Stage 3 Stage 1 Stage 2 A: B: C: D: E: F: G: = previously computed partition Task DataBricks
  • 32. 32 © 2015 IBM Corporation Direct Acyclic Graph (DAG)  View the lineage  Could be issued in a continuous line scala> danSpark.toDebugString res1: String = (2) MappedRDD[4] at map at <console>:16 | MappedRDD[3] at map at <console>:16 | FilteredRDD[2] at filter at <console>:14 | hdfs:/sparkdata/sparkQuotes.txt MappedRDD[1] at textFile at <console>:12 | hdfs:/sparkdata/sparkQuotes.txt HadoopRDD[0] at textFile at <console>:12 val danSpark = sc.textFile("hdfs:/sparkdata/sparkQuotes.txt"). filter(_.startsWith("DAN")). map(_.split(" ")). map(x => x(1)). .filter(_.contains("Spark")) danSpark.count()
  • 33. 33 © 2015 IBM Corporation Showing Multiple Apps SparkContext Driver Program Cluster Manager Worker Node Executor Task Task Cache Worker Node Executor Task Task Cache App  Each Spark application runs as a set of processes coordinated by the Spark context object (driver program) – Spark context connects to Cluster Manager (standalone, Mesos/Yarn) – Spark context acquires executors (JVM instance) on worker nodes – Spark context sends tasks to the executors DataBricks
  • 34. 34 © 2015 IBM Corporation Spark Terminology  Context (Connection): – Represents a connection to the Spark cluster. The Application which initiated the context can submit one or several jobs, sequentially or in parallel, batch or interactively, or long running server continuously serving requests.  Driver (Coordinator agent) – The program or process running the Spark context. Responsible for running jobs over the cluster and converting the App into a set of tasks  Job (Query / Query plan): – A piece of logic (code) which will take some input from HDFS (or the local filesystem), perform some computations (transformations and actions) and write some output back.  Stage (Subplan) – Jobs are divided into stages  Tasks (Sub section) – Each stage is made up of tasks. One task per partition. One task is executed on one partition (of data) by one executor  Executor (Sub agent) – The process responsible for executing a task on a worker node  Resilient Distributed Dataset
  • 35. 35 © 2015 IBM Corporation Spark Shell & Application Deployment
  • 36. 36 © 2015 IBM Corporation Spark’s Scala and Python Shell
 Spark comes with two shells
– Scala
– Python
 APIs are available for Scala, Python and Java
 Each Spark release ships matching versions of the shells and APIs
 Spark’s native language is Scala, so it is more natural to write Spark applications in Scala
 This presentation focuses on code examples in Scala
  • 37. 37 © 2015 IBM Corporation Spark’s Scala and Python Shell
 Powerful tool to analyze data interactively
 The Scala shell runs on the Java VM
– Can leverage existing Java libraries
 Scala:
– To launch the Scala shell (from the Spark home directory): ./bin/spark-shell
– To read in a text file: scala> val textFile = sc.textFile("README.txt")
 Python:
– To launch the Python shell (from the Spark home directory): ./bin/pyspark
– To read in a text file: >>> textFile = sc.textFile("README.txt")
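Once the file is loaded, the shell can be used to explore it interactively. A minimal sketch, continuing the Scala example above (README.txt is assumed to exist in the current directory):
scala> textFile.count()                                  // number of lines in the file
scala> textFile.first()                                  // first line of the file
scala> textFile.filter(_.contains("Spark")).count()      // number of lines mentioning "Spark"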
  • 38. 38 © 2015 IBM Corporation SparkContext in Applications
 The main entry point for Spark functionality
 Represents the connection to a Spark cluster
 Used to create RDDs, accumulators, and broadcast variables on that cluster
 In the Spark shell, the SparkContext, sc, is automatically initialized for you to use
 In a Spark program, import some classes and implicit conversions into your program:
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
  • 39. 39 © 2015 IBM Corporation A Spark Standalone Application in Scala
A standalone application has three parts: the import statements, the creation of the SparkConf and SparkContext, and the transformations and actions, as shown in the sketch below.
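A minimal sketch of such an application (the application name and input path are hypothetical; the imports match the previous slide):
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf

object SimpleApp {
  def main(args: Array[String]) {
    // SparkConf and SparkContext
    val conf = new SparkConf().setAppName("Simple Application")
    val sc = new SparkContext(conf)
    // Transformations and actions (README.md is a hypothetical input file)
    val logData = sc.textFile("README.md").cache()
    val numSparks = logData.filter(line => line.contains("Spark")).count()
    println("Lines with Spark: " + numSparks)
    sc.stop()
  }
}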
  • 40. 40 © 2015 IBM Corporation Running Standalone Applications
 Define the dependencies
– Scala: simple.sbt
 Create the typical directory structure with the files:
./simple.sbt
./src
./src/main
./src/main/scala
./src/main/scala/SimpleApp.scala
 Create a JAR package containing the application’s code
– Scala: sbt package
 Use spark-submit to run the program (see the sketch below)
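As a sketch, simple.sbt and the spark-submit invocation could look like the following; the Scala/Spark versions and the JAR path are assumptions matching a Spark 1.4 / Scala 2.10 build:
// simple.sbt
name := "Simple Project"

version := "1.0"

scalaVersion := "2.10.4"

libraryDependencies += "org.apache.spark" %% "spark-core" % "1.4.0"

# From the project directory: package, then submit the resulting JAR
sbt package
$SPARK_HOME/bin/spark-submit --class "SimpleApp" --master local[4] \
  target/scala-2.10/simple-project_2.10-1.0.jar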
  • 41. 41 © 2015 IBM Corporation Spark Properties
 Set application properties via the SparkConf object
val conf = new SparkConf()
  .setMaster("local")
  .setAppName("CountingSheep")
  .set("spark.executor.memory", "1g")
val sc = new SparkContext(conf)
 Dynamically setting Spark properties
– SparkContext with an empty conf
val sc = new SparkContext(new SparkConf())
– Supply the configuration values at runtime
./bin/spark-submit --name "My app" --master local[4] --conf spark.shuffle.spill=false --conf "spark.executor.extraJavaOptions=-XX:+PrintGCDetails -XX:+PrintGCTimeStamps" myApp.jar
– Or in conf/spark-defaults.conf
 Application web UI: http://<driver>:4040
  • 42. 42 © 2015 IBM Corporation Spark Configuration
 Three locations for configuration:
– Spark properties
– Environment variables: conf/spark-env.sh
– Logging: log4j.properties
 Override the default configuration directory (SPARK_HOME/conf) with SPARK_CONF_DIR
• spark-defaults.conf
• spark-env.sh
• log4j.properties
• etc.
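As an illustration only (the values below are assumptions, not recommendations), spark-defaults.conf and spark-env.sh might contain entries such as:
# conf/spark-defaults.conf  (whitespace-separated key / value pairs)
spark.master            spark://master-host:7077
spark.executor.memory   2g
spark.serializer        org.apache.spark.serializer.KryoSerializer

# conf/spark-env.sh  (environment variables, sourced when Spark processes start)
export SPARK_WORKER_CORES=4
export SPARK_WORKER_MEMORY=8g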
  • 43. 43 © 2015 IBM Corporation Spark Monitoring
 Three ways to monitor Spark applications
1. Web UI
• Default port 4040
• Available for the duration of the application
2. Metrics
• Based on the Coda Hale Metrics Library
• Reports to a variety of sinks (HTTP, JMX, and CSV)
• Configured in conf/metrics.properties
3. External instrumentation
• Ganglia
• OS profiling tools (dstat, iostat, iotop)
• JVM utilities (jstack, jmap, jstat, jconsole)
  • 44. 44 © 2015 IBM Corporation Running Spark Examples  Spark samples available in the examples directory  Run the examples (from Spark home directory): ./bin/run-example SparkPi where SparkPi is the name of the sample application
  • 45. 45 © 2015 IBM Corporation Spark Extensions
  • 46. 46 © 2015 IBM Corporation Spark Extensions
 Extensions to the core Spark API
 Improvements made to the core are passed to these libraries
 Little overhead to use with the Spark core
(figure from http://spark.apache.org)
  • 47. 47 © 2015 IBM Corporation Spark SQL
 Processes relational queries expressed in SQL (HiveQL)
 Seamlessly mixes SQL queries with Spark programs
 In Spark since 1.0, refactored on top of DataFrames since 1.3; graduated from alpha status with Spark 1.3
 Provides a single interface for efficiently working with structured data, including Apache Hive, Parquet and JSON files
 Leverages the Hive frontend and metastore
– Compatibility with Hive data, queries and UDFs
– HiveQL limitations may apply
– Not ANSI SQL compliant
– Little to no query rewrite optimization, automatic memory management or sophisticated workload management
 Standard connectivity through JDBC/ODBC (see the sketch below)
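As a sketch of the JDBC/ODBC connectivity (host, port and master are assumptions; the defaults follow the Spark SQL documentation), the Thrift JDBC/ODBC server can be started and then queried with the bundled beeline client:
# Start the HiveServer2-compatible Thrift JDBC/ODBC server (from the Spark home directory)
./sbin/start-thriftserver.sh --master local[4]

# Connect with beeline on the default port 10000
./bin/beeline -u jdbc:hive2://localhost:10000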
  • 48. 48 © 2015 IBM Corporation Spark SQL - Getting Started
 SQLContext created from a SparkContext
// An existing SparkContext, sc
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
 HiveContext created from a SparkContext
// An existing SparkContext, sc
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
 Import a library to convert an RDD to a DataFrame
– Scala: import sqlContext.implicits._
 DataFrame data sources
– Inferring the schema using reflection
– Programmatic interface
  • 49. 49 © 2015 IBM Corporation Spark SQL - Inferring the Schema Using Reflection
 The case class in Scala defines the schema of the table
case class Person(name: String, age: Int)
 The arguments of the case class become the names of the columns
 Create the RDD of Person objects and convert it to a DataFrame
val people = sc.textFile("examples/src/main/resources/people.txt").
  map(_.split(",")).
  map(p => Person(p(0), p(1).trim.toInt)).toDF()
 Register the DataFrame as a table
people.registerTempTable("people")
 Run SQL statements using the sql method provided by the SQLContext
val teenagers = sqlContext.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")
 The results of the queries are DataFrames and support all the normal RDD operations
teenagers.map(t => "Name: " + t(0)).collect().foreach(println)
  • 50. 50 © 2015 IBM Corporation Spark SQL - Programmatic Interface
 Use when you cannot define the case classes ahead of time
 Three steps to create the DataFrame
1. Encode the schema as a string and import the Spark SQL struct types
val schemaString = "name age"
import org.apache.spark.sql.Row;
import org.apache.spark.sql.types.{StructType,StructField,StringType};
2. Create the schema represented by a StructType matching the structure of the Rows in the RDD from step 1.
val schema = StructType(
  schemaString.split(" ").map(fieldName => StructField(fieldName, StringType, true)))
3. Apply the schema to the RDD of Rows using the createDataFrame method.
val rowRDD = people.map(_.split(",")).map(p => Row(p(0), p(1).trim))
val peopleDataFrame = sqlContext.createDataFrame(rowRDD, schema)
 Then register the peopleDataFrame as a table
peopleDataFrame.registerTempTable("people")
 Run SQL statements using the sql method:
val results = sqlContext.sql("SELECT name FROM people")
results.map(t => "Name: " + t(0)).collect().foreach(println)
  • 51. 51 © 2015 IBM Corporation SparkSQL - DataSources
The DataSource API provides generic methods to manage connectors to any data source (file, JDBC, Cassandra, MongoDB, etc.). From Spark 1.3, the DataSource API provides predicate pushdown capabilities to leverage the performance of the backend. Most connectors are available at http://spark-packages.org/
Before: Spark 1.2.x
 Parquet file
– val parquetFile = sqlContext.parquetFile("people.parquet")
 JSON
– val df = sqlContext.jsonFile("examples/src/main/resources/people.json")
Spark 1.3.x
 Generic load/save
– val df = sqlContext.load("<filename>", "<datasource type>")
– df.save("<filename>", "<datasource type>")
 Parquet file
– val df = sqlContext.load("people.parquet") // (parquet unless otherwise configured by spark.sql.sources.default)
– df.select("name", "age").save("namesAndAges.parquet")
 JSON
– val df = sqlContext.load("people.json", "json")
– df.select("name", "age").save("namesAndAges.json", "json")
 CSV (external package)
– val df = sqlContext.load("com.databricks.spark.csv", Map("path" -> "cars.csv", "header" -> "true"))
– df.select("year", "model").save("newcars.csv", "com.databricks.spark.csv")
Spark 1.4.x
 Generic load/save
– val df = sqlContext.read.format("<datasource type>").load("<filename>")
– df.write.format("<datasource type>").save("<filename>")
 Parquet file
– val df = sqlContext.read.load("people.parquet") // (parquet unless otherwise configured by spark.sql.sources.default)
– df.select("name", "age").write.save("namesAndAges.parquet")
 JSON
– val df = sqlContext.read.format("json").load("people.json")
– df.select("name", "age").write.format("json").save("namesAndAges.json")
 CSV (external package)
– val df = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").load("cars.csv")
– df.select("year", "model").write.format("com.databricks.spark.csv").save("newcars.csv")
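As an additional sketch of the same DataSource API, a JDBC source can be read through the generic reader (the connection URL, table name and driver class are hypothetical, and the JDBC driver JAR must be on the classpath):
val jdbcDF = sqlContext.read.format("jdbc")
  .options(Map(
    "url" -> "jdbc:postgresql://dbhost:5432/mydb",   // hypothetical connection URL
    "dbtable" -> "public.people",                    // hypothetical table
    "driver" -> "org.postgresql.Driver"))            // driver class for the JAR on the classpath
  .load()
jdbcDF.select("name", "age").show()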
  • 52. 52 © 2015 IBM Corporation Spark Streaming
 Scalable, high-throughput, fault-tolerant stream processing of live data streams
 Spark Streaming applications are written like regular Spark applications
 Recovers lost work and operator state (e.g. sliding windows) out of the box
 Uses HDFS and ZooKeeper for high availability
 Data sources also include TCP sockets, ZeroMQ and other customized data sources
  • 53. 53 © 2015 IBM Corporation Spark Streaming - Internals
 The input stream goes into Spark Streaming
 Spark Streaming breaks it up into batches of input data
 Feeds the batches into the Spark engine for processing
 Generates the final results as a stream of batches
 DStream - Discretized Stream
– Represents a continuous stream of data created from the input streams
– Internally, represented as a sequence of RDDs
  • 54. 54 © 2015 IBM Corporation Spark Streaming - Getting Started
 Count the number of words coming in from a TCP socket
 Import the Spark Streaming classes
import org.apache.spark._
import org.apache.spark.streaming._
import org.apache.spark.streaming.StreamingContext._
 Create the StreamingContext object
val conf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount")
val ssc = new StreamingContext(conf, Seconds(1))
 Create a DStream
val lines = ssc.socketTextStream("localhost", 9999)
 Split the lines into words
val words = lines.flatMap(_.split(" "))
 Count the words
val pairs = words.map(word => (word, 1))
val wordCounts = pairs.reduceByKey(_ + _)
 Print to the console
wordCounts.print()
  • 55. 55 © 2015 IBM Corporation Spark Streaming - Continued
 No processing starts until you explicitly start the context
ssc.start() // Start the computation
ssc.awaitTermination() // Wait for the computation to terminate
 The complete code is available as the NetworkWordCount example
 To run the example (see the commands below):
– Invoke netcat to start the data stream
– In a different terminal, run the application
./bin/run-example streaming.NetworkWordCount localhost 9999
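A sketch of the two terminals, following the Spark documentation for this example (port 9999 matches the code above):
# Terminal 1: start a simple data server with netcat on port 9999
nc -lk 9999

# Terminal 2: run the example against that socket (from the Spark home directory)
./bin/run-example streaming.NetworkWordCount localhost 9999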
  • 56. 56 © 2015 IBM Corporation Spark MLlib
 Spark MLlib is Spark’s machine learning library
 Part of Spark since 0.8
 Provides common algorithms and utilities
• Classification
• Regression
• Clustering
• Collaborative filtering
• Dimensionality reduction
 Leverages Spark’s in-memory caching to speed up iterative processing
  • 57. 57 © 2015 IBM Corporation Spark MLlib - Getting Started
 Use k-means clustering on a set of latitudes and longitudes
 Import the Spark MLlib classes
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors
 Create the SparkContext object
val conf = new SparkConf().setAppName("KMeans")
val sc = new SparkContext(conf)
 Create a data RDD
val taxifile = sc.textFile("user/spark/sparkdata/nyctaxisub/*")
 Create Vectors for input to the algorithm
val taxi = taxifile.map{line => Vectors.dense(line.split(",").slice(3,5).map(_.toDouble))}
 Run the k-means algorithm with 3 clusters and 10 iterations
val model = KMeans.train(taxi, 3, 10)
val clusterCenters = model.clusterCenters.map(_.toArray)
 Print to the console
clusterCenters.foreach(lines => println(lines(0), lines(1)))
  • 58. 58 © 2015 IBM Corporation SparkML
 SparkML provides an API to build ML pipelines (since Spark 1.3)
 Similar to Python’s scikit-learn
 SparkML provides abstractions for all steps of an ML workflow (the figure contrasted a generic ML workflow with a real-life ML workflow)
 Transformer: an algorithm which can transform one DataFrame into another DataFrame. E.g., an ML model is a Transformer which transforms a DataFrame with features into a DataFrame with predictions.
 Estimator: an algorithm which can be fit on a DataFrame to produce a Transformer. E.g., a learning algorithm is an Estimator which trains on a dataset and produces a model.
 Pipeline: a Pipeline chains multiple Transformers and Estimators together to specify an ML workflow.
 Param: all Transformers and Estimators share a common API for specifying parameters.
(Xebia, HUG France 06/2015)
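A minimal sketch of a pipeline, loosely based on the spark.ml examples of that era (the training DataFrame and its column names are hypothetical):
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

// Hypothetical training data: a DataFrame with "text" and "label" columns
val training = sqlContext.createDataFrame(Seq(
  (0L, "spark makes big data simple", 1.0),
  (1L, "mapreduce is batch oriented", 0.0)
)).toDF("id", "text", "label")

// Two Transformers and an Estimator chained into a Pipeline
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
val lr = new LogisticRegression().setMaxIter(10)
val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))

// fit() produces a PipelineModel (a Transformer) that can score new DataFrames
val model = pipeline.fit(training)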
  • 59. 59 © 2015 IBM Corporation Spark GraphX
 Flexibility
– GraphX unifies ETL, exploratory analysis, and iterative graph computation
– You can view the same data as both graphs and collections, transform and join graphs with RDDs efficiently, and write custom iterative graph algorithms with the API
 Speed
– Comparable performance to the fastest specialized graph processing systems
 Algorithms
– Choose from a growing library of graph algorithms
– In addition to a highly flexible API, GraphX comes with a variety of graph algorithms
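A minimal sketch of the property-graph API (the vertex and edge data below are hypothetical):
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD

// Hypothetical vertices (id, name) and edges carrying a relationship label
val users: RDD[(VertexId, String)] =
  sc.parallelize(Array((1L, "alice"), (2L, "bob"), (3L, "carol")))
val relationships: RDD[Edge[String]] =
  sc.parallelize(Array(Edge(1L, 2L, "follows"), Edge(2L, 3L, "follows"), Edge(3L, 1L, "follows")))

val graph = Graph(users, relationships)

// Built-in algorithm: PageRank, then join the ranks back to the user names
val ranks = graph.pageRank(0.0001).vertices
ranks.join(users).collect().foreach { case (id, (rank, name)) => println(s"$name: $rank") }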
  • 60. 60 © 2015 IBM Corporation SparkR
 SparkR is an R package that provides a light-weight front-end to use Apache Spark from R
 SparkR exposes the Spark API through the RDD class and allows users to interactively run jobs from the R shell on a cluster
 Goals
– Make SparkR production ready
– Integration with MLlib
– Consolidation of the DataFrame and RDD concepts
 First release in Spark 1.4.0:
– Support for DataFrames
 Spark 1.5:
– Support for MLlib
  • 61. 61 © 2015 IBM Corporation Spark internals refactoring : Project Tungsten  Memory Management and Binary Processing: leverage application semantics to manage memory explicitly and eliminate the overhead of JVM object model and garbage collection  Cache-aware computation: algorithms and data structures to exploit memory hierarchy  Code generation: exploit modern compilers and CPUs: allow efficient operation directly on binary data DataBricks / Spark Summit 2015
  • 62. 62 © 2015 IBM Corporation Spark: Final Thoughts
 Spark is a good replacement for MapReduce
– Higher performance
– The framework is easier to use than MapReduce (M/R)
– Powerful RDD & DataFrame concepts
– Rich higher-level libraries: SparkSQL, MLlib/ML, Streaming, GraphX
– Broad ecosystem adoption
 This is a very fast-paced environment, so keep up!
– Lots of new features in each release (a major release every 3 months)
– Spark has the latest and best offering today, but things may change again
  • 63. 63 © 2015 IBM Corporation Resources
 The Learning Spark O’Reilly book
 Lab(s) this afternoon
 The following course on Big Data University