Introduction to Apache Spark
Hubert 范姜 @hubert
HadoopCon Taiwan 2015
Sep. 19, 2015, Taipei
Photo from http://quotesgram.com/spark-quotes/
Who are we?
• 亦思科技
• Located in Hsinchu Science Park
• Our main customers have historically been the major manufacturers in the park
• 2010.7 Approved to move into Hsinchu Science Park on an investment plan for developing cloud-computing software tools
• 2011 Began an industry-academia collaboration with Prof. Yeh-Ching Chung of the CS department at National Tsing Hua University
• One of the few specialist companies invited to the international cloud-computing conference IEEE CloudCom
• One of the few IT vendors with hands-on experience helping customers build complete Hadoop systems
• 2012.01 JackHare (ANSI SQL JDBC Driver)
• 2012.11 HareDB HBase Client
• 2013.08 Hare (High-Speed Query in HBase)
• 2013.12 Won the Science Park Innovative Product Award
• 2014.12 Won the IT Month Innovation Gold Medal Award
HareDB Architecture
(architecture diagram) HareDB Core runs over Hadoop, HBase, Hive, and Spark. It includes an HBase client and an HDFS client, indexing via Solr Cloud, security via Kerberos and Sentry, and exposes a RESTful service, JDBC/ODBC, and a cluster monitor.
WHAT IS SPARK?
What is Apache Spark?
• An open-source cluster computing framework.
• In contrast to Hadoop's two-stage, disk-based MapReduce paradigm, Spark's multi-stage in-memory primitives provide performance up to 100× faster for certain applications.
Databricks
• Founded in late 2013
• By the creators of Apache Spark
• Original team from UC Berkeley AMPLab
(Algorithms, Machines, People)
• Contributed more than 75% of the code
added to Spark in 2014
World Record
From Spark Summit 2015, Matei Zaharia
Spark is hot!
From Spark Summit 2015, Matei Zaharia
WILL SPARK REPLACE HADOOP?
From Spark Summit 2015, Mike Olson (Cloudera),
http://www.slideshare.net/SparkSummit/spark-in-the-hadoop-ecosystem-mike-olson
From http://hortonworks.com/blog/apache-spark-yarn-ready-hortonworks-data-platform/
&& Spark Summit 2015, Arun C. Murthy
From Spark Summit 2015, Anil Gadre (MapR),
http://www.slideshare.net/SparkSummit/spark-in-the-hadoop-ecosystem-mike-olson
http://www.slideshare.net/SparkSummit/intro-to-spark-development
http://www.slideshare.net/SparkSummit/intro-to-spark-development
http://www.slideshare.net/SparkSummit/intro-to-spark-development
Spark Software Stack
• Access and interfaces: Spark Streaming, Spark SQL, Spark R, GraphX, MLlib, ML Pipelines, BlinkDB, Splash
• Processing engine: Spark Core
• Storage: HDFS, S3
• Resource virtualization: Mesos, Hadoop YARN
Physical Bottleneck
• Memory vs. disk: roughly 10×–100× faster (disk at ~300–600 MB/s vs. memory at ~10 GB/s)
• Network: nodes in the same rack connect at 1 Gb/s (= 125 MB/s); nodes in another rack at 0.1 Gb/s (= 12.5 MB/s)
• Two takeaways: 1. keep data in memory; 2. keep data local
Spark Execution Flow
(diagram: a job's execution shown as numbered steps 1–3)
http://www.codeproject.com/Articles/1023037/Introduction-to-Apache-Spark
RDD
http://www.cs.berkeley.edu/~matei/papers/2010/hotcloud_spark.pdf
"The main abstraction in Spark is that of a resilient distributed dataset (RDD), which represents a read-only collection of objects partitioned across a set of machines that can be rebuilt if a partition is lost.
Users can explicitly cache an RDD in memory across machines and reuse it in multiple MapReduce-like parallel operations.
RDDs achieve fault tolerance through a notion of lineage: if a partition of an RDD is lost, the RDD has enough information about how it was derived from other RDDs to be able to rebuild just that partition."
What is an RDD?
Interactive Shell (Scala & Python only)
# Read a local txt file in Python
linesRDD = sc.textFile("/path/to/README.md")
// Read a local txt file in Scala
val linesRDD = sc.textFile("/path/to/README.md")
// Read a local txt file in Java
JavaRDD<String> lines = sc.textFile("/path/to/README.md");
Read From TextFile
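textFile() is lazy: the file is only read when an action runs. A quick check in Python, continuing the example above:

print(linesRDD.count())   # action: reads the file and returns the number of lines
print(linesRDD.first())   # action: returns the first line of the file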
(diagram: an RDD of 25 elements, item-1 through item-25, split into 5 partitions; each partition lives in an executor (Ex) on a worker (W))
more partitions = more parallelism
Where is the RDD?
http://www.slideshare.net/SparkSummit/intro-to-spark-development
logLinesRDD (input/base RDD): a mix of Error, Warn, and Info records, e.g. (Error, ts, msg1), (Warn, ts, msg2), (Info, ts, msg8), spread across four partitions
.filter( ) → errorsRDD: only the Error records remain: (Error, ts, msg1) ×3, (Error, ts, msg3), (Error, ts, msg4)
http://www.slideshare.net/SparkSummit/intro-to-spark-development
errorsRDD → .coalesce( 2 ) → cleanedRDD: the five Error records (msg1, msg3, msg1, msg4, msg1) consolidated from four partitions into two
.collect( ) → Driver
http://www.slideshare.net/SparkSummit/intro-to-spark-development
.collect( ) → the driver says: Execute!
http://www.slideshare.net/SparkSummit/intro-to-spark-development
.collect( ) → the driver traces the lineage back to the base logLinesRDD
http://www.slideshare.net/SparkSummit/intro-to-spark-development
The full lineage runs: logLinesRDD → .filter( ) → errorsRDD → .coalesce( 2 ) → cleanedRDD, holding (Error, ts, msg1), (Error, ts, msg3), (Error, ts, msg1), (Error, ts, msg4), (Error, ts, msg1); .collect( ) returns them to the Driver.
http://www.slideshare.net/SparkSummit/intro-to-spark-development
logLinesRDD → .filter( ) → errorsRDD → .coalesce( 2, shuffle=False ) → cleanedRDD; .collect( ) delivers the data to the Driver.
http://www.slideshare.net/SparkSummit/intro-to-spark-development
Once the action completes, the intermediate RDDs (logLinesRDD, errorsRDD, cleanedRDD) disappear and get destroyed.
http://www.slideshare.net/SparkSummit/intro-to-spark-development
Only the collected data remains with the Driver.
http://www.slideshare.net/SparkSummit/intro-to-spark-development
logLinesRDD → .filter( ) → errorsRDD → .coalesce( 2 ) → cleanedRDD: (Error, ts, msg1), (Error, ts, msg3), (Error, ts, msg1), (Error, ts, msg4), (Error, ts, msg1)
cleanedRDD → .filter( ) on msg1 → errorMsg1RDD: three (Error, ts, msg1) records
The same RDD can then feed multiple actions: .collect( ), .saveToCassandra( ), .count( ) → 5
http://www.slideshare.net/SparkSummit/intro-to-spark-development
Lifecycle of a Spark program
1. Create some input RDDs from external data or
parallelize a collection in your driver program.
2. Lazily transform them to define new RDDs
using transformations like filter() or map()
3. Ask Spark to cache() any intermediate RDDs
that will need to be reused.
4. Launch actions such as count() and collect()
to kick off a parallel computation, which is then
optimized and executed by Spark.
http://www.slideshare.net/SparkSummit/intro-to-spark-development
map() intersection() cartesian()
flatMap() distinct() pipe()
filter() groupByKey() coalesce()
mapPartitions() reduceByKey() repartition()
mapPartitionsWithIndex() sortByKey() partitionBy()
sample() join() ...
union() cogroup() ...
Transformations
http://www.slideshare.net/SparkSummit/intro-to-spark-development
reduce() takeOrdered()
collect() saveAsTextFile()
count() saveAsSequenceFile()
first() saveAsObjectFile()
take() countByKey()
takeSample() foreach()
... ...
Actions
http://www.slideshare.net/SparkSummit/intro-to-spark-development
SPARK SQL
# Python: create a HiveContext and run SQL against it
from pyspark.sql import HiveContext
sqlCtx = HiveContext(sc)
results = sqlCtx.sql("SELECT * FROM people")
names = results.map(lambda p: p.name)
What is Spark SQL?
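The query above assumes a table named people already exists. A self-contained sketch that registers one as a temporary table instead (names and data hypothetical):

from pyspark.sql import Row
rows = sc.parallelize([Row(name="Alice", age=30), Row(name="Bob", age=25)])
peopleDF = sqlCtx.createDataFrame(rows)   # build a DataFrame from an RDD of Rows
peopleDF.registerTempTable("people")      # make it queryable by name
print(sqlCtx.sql("SELECT name FROM people WHERE age > 26").collect())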
NEW FEATURES IN 1.4 AND 1.5
DataFrames
http://www.slideshare.net/databricks/new-directions-for-apache-spark-in-2015
http://www.slideshare.net/SparkSummit/reynold-xin
Spark R
http://www.slideshare.net/databricks/new-directions-for-apache-spark-in-2015
Machine Learning Pipelines
http://www.slideshare.net/databricks/new-directions-for-apache-spark-in-2015
External Data Sources
http://www.slideshare.net/databricks/new-directions-for-apache-spark-in-2015
Tungsten
http://www.slideshare.net/SparkSummit/reynold-xin
All New Spark
http://www.slideshare.net/SparkSummit/reynold-xin
Spark 1.5
• Whereas 1.3 and 1.4 introduced new user-facing features, a large part of Spark 1.5 focuses on under-the-hood changes to improve Spark's performance, usability, and operational stability.
• Spark 1.5 delivers the first phase of Project Tungsten.
Reference: https://databricks.com/blog/2015/08/18/spark-1-5-preview-now-available-in-databricks.html
Thank you
Editor's Notes
• #11 Mike Olson: Chief Strategy Officer at Cloudera
• #13 Senior Vice President, Product Management
• #26 This RDD has 5 partitions. An RDD is simply a distributed collection of elements. You can think of a distributed collection as something like an array or list in a single-machine program, except that it is spread out across multiple nodes in the cluster. In Spark, all work is expressed as creating new RDDs, transforming existing RDDs, or calling operations on RDDs to compute a result. Under the hood, Spark automatically distributes the data contained in RDDs across the cluster and parallelizes the operations you perform on them. So, Spark gives you APIs and functions that let you operate on the whole collection in parallel using all the nodes.
• #27 Introduce that Spark has operations, which can be transformations or actions. The four green blocks are unique blocks in a single HDFS file. Here we filter out the warning and info messages so we are left with just errors in the RDD. This doesn't actually read the file from HDFS yet; we're just building out a lineage graph.
• #29 Directed acyclic graph: formed by a collection of vertices and directed edges, each edge connecting one vertex to another, such that there is no way to start at some vertex v and follow a sequence of edges that eventually loops back to v again. A collection of tasks that must be ordered into a sequence, subject to constraints that certain tasks must be performed earlier than others, may be represented as a DAG with a vertex for each task and an edge for each constraint. https://en.wikipedia.org/wiki/Directed_acyclic_graph
• #32 ----- Meeting Notes (6/15/15 16:02) ----- This is a stage (which we'll talk about later).
• #33 Now the RDDs disappear and get destroyed.
• #36 It's okay if only part of the RDD actually fits in memory. Talk about lineage: parent RDD and child RDD.
• #37 ----- Meeting Notes (6/15/15 16:08) ----- Also note that an application can have many such 1-through-4 procedures.
• #39 Actions force the evaluation of the transformations required for the RDD they are called on, since they are required to actually produce output.