Introduction to Apache Spark
Hubert 范姜 @hubert
HadoopCon Taiwan 2015
Sep. 19, 2015, Taipei
Photo from http://quotesgram.com/spark-quotes/
Who are we?
• 亦思科技
• Located in Hsinchu Science Park
• Our main customers have historically been the major manufacturers in the park
• 2010.7 Approved to move into Hsinchu Science Park on an investment plan for developing cloud-computing software tools
• 2011 Began an industry-academia collaboration with Prof. Yeh-Ching Chung of the CS department at National Tsing Hua University
• One of the few specialist companies invited to the international cloud-computing conference IEEE CloudCom
• One of the few IT vendors with hands-on experience helping customers build complete Hadoop systems
• 2012.01 JackHare (ANSI SQL JDBC Driver)
• 2012.11 HareDB HBase Client
• 2013.08 Hare (High-Speed Query in HBase)
• 2013.12 Won the Science Park Innovative Product Award
• 2014.12 Won the IT Month Innovation Gold Medal Award
HareDB Architecture
(architecture diagram) HareDB Core runs over Hadoop, HBase, Hive, and Spark. It includes an HBase client and an HDFS client, indexing via Solr Cloud, security via Kerberos and Sentry, and exposes a RESTful service, JDBC/ODBC, and a cluster monitor.
WHAT IS SPARK?
What is Apache Spark?
• An open-source cluster computing framework.
• In contrast to Hadoop's two-stage, disk-based MapReduce paradigm, Spark's multi-stage in-memory primitives provide performance up to 100× faster for certain applications.
Databricks
• Founded in late 2013
• By the creators of Apache Spark
• Original team from UC Berkeley AMPLab
(Algorithms, Machines, People)
• Contributed more than 75% of the code
added to Spark in 2014
World Record
From Spark Summit 2015, Matei Zaharia
Spark is hot!
From Spark Summit 2015, Matei Zaharia
WILL SPARK REPLACE HADOOP?
From Spark Summit 2015, Mike Olson (Cloudera),
http://www.slideshare.net/SparkSummit/spark-in-the-hadoop-ecosystem-mike-olson
From http://hortonworks.com/blog/apache-spark-yarn-ready-hortonworks-data-platform/
&& Spark Summit 2015, Arun C. Murthy
From Spark Summit 2015, Anil Gadre (MapR),
http://www.slideshare.net/SparkSummit/spark-in-the-hadoop-ecosystem-mike-olson
http://www.slideshare.net/SparkSummit/intro-to-spark-development
http://www.slideshare.net/SparkSummit/intro-to-spark-development
http://www.slideshare.net/SparkSummit/intro-to-spark-development
Spark Software Stack
• Access and interfaces: Spark Streaming, Spark SQL, Spark R, GraphX, MLlib, ML Pipelines, BlinkDB, Splash
• Processing engine: Spark Core
• Storage: HDFS, S3
• Resource virtualization: Mesos, Hadoop YARN
Physical Bottleneck
• Memory vs. disk: roughly 10×–100× faster (disk at ~300–600 MB/s vs. memory at ~10 GB/s)
• Network: nodes in the same rack connect at 1 Gb/s (= 125 MB/s); nodes in another rack at 0.1 Gb/s (= 12.5 MB/s)
• Two takeaways: 1. keep data in memory; 2. keep data local
Spark Execution Flow
(diagram: a job's execution shown as numbered steps 1–3)
http://www.codeproject.com/Articles/1023037/Introduction-to-Apache-Spark
RDD
http://www.cs.berkeley.edu/~matei/papers/2010/hotcloud_spark.pdf
"The main abstraction in Spark is that of a resilient distributed dataset (RDD), which represents a read-only collection of objects partitioned across a set of machines that can be rebuilt if a partition is lost.
Users can explicitly cache an RDD in memory across machines and reuse it in multiple MapReduce-like parallel operations.
RDDs achieve fault tolerance through a notion of lineage: if a partition of an RDD is lost, the RDD has enough information about how it was derived from other RDDs to be able to rebuild just that partition."
What is an RDD?
Interactive Shell (Scala & Python only)
# Read a local txt file in Python
linesRDD = sc.textFile("/path/to/README.md")
// Read a local txt file in Scala
val linesRDD = sc.textFile("/path/to/README.md")
// Read a local txt file in Java
JavaRDD<String> lines = sc.textFile("/path/to/README.md");
Read From TextFile
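textFile() is lazy: the file is only read when an action runs. A quick check in Python, continuing the example above:

print(linesRDD.count())   # action: reads the file and returns the number of lines
print(linesRDD.first())   # action: returns the first line of the file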
(diagram: an RDD of 25 elements, item-1 through item-25, split into 5 partitions; each partition lives in an executor (Ex) on a worker (W))
more partitions = more parallelism
Where is the RDD?
http://www.slideshare.net/SparkSummit/intro-to-spark-development
logLinesRDD (input/base RDD): a mix of Error, Warn, and Info records, e.g. (Error, ts, msg1), (Warn, ts, msg2), (Info, ts, msg8), spread across four partitions
.filter( ) → errorsRDD: only the Error records remain: (Error, ts, msg1) ×3, (Error, ts, msg3), (Error, ts, msg4)
http://www.slideshare.net/SparkSummit/intro-to-spark-development
errorsRDD → .coalesce( 2 ) → cleanedRDD: the five Error records (msg1, msg3, msg1, msg4, msg1) consolidated from four partitions into two
.collect( ) → Driver
http://www.slideshare.net/SparkSummit/intro-to-spark-development
.collect( ) → the driver says: Execute!
http://www.slideshare.net/SparkSummit/intro-to-spark-development
.collect( ) → the driver traces the lineage back to the base logLinesRDD
http://www.slideshare.net/SparkSummit/intro-to-spark-development
The full lineage runs: logLinesRDD → .filter( ) → errorsRDD → .coalesce( 2 ) → cleanedRDD, holding (Error, ts, msg1), (Error, ts, msg3), (Error, ts, msg1), (Error, ts, msg4), (Error, ts, msg1); .collect( ) returns them to the Driver.
http://www.slideshare.net/SparkSummit/intro-to-spark-development
logLinesRDD → .filter( ) → errorsRDD → .coalesce( 2, shuffle=False ) → cleanedRDD; .collect( ) delivers the data to the Driver.
http://www.slideshare.net/SparkSummit/intro-to-spark-development
Once the action completes, the intermediate RDDs (logLinesRDD, errorsRDD, cleanedRDD) disappear and get destroyed.
http://www.slideshare.net/SparkSummit/intro-to-spark-development
Only the collected data remains with the Driver.
http://www.slideshare.net/SparkSummit/intro-to-spark-development
logLinesRDD → .filter( ) → errorsRDD → .coalesce( 2 ) → cleanedRDD: (Error, ts, msg1), (Error, ts, msg3), (Error, ts, msg1), (Error, ts, msg4), (Error, ts, msg1)
cleanedRDD → .filter( ) on msg1 → errorMsg1RDD: three (Error, ts, msg1) records
The same RDD can then feed multiple actions: .collect( ), .saveToCassandra( ), .count( ) → 5
http://www.slideshare.net/SparkSummit/intro-to-spark-development
Lifecycle of a Spark program
1. Create some input RDDs from external data or
parallelize a collection in your driver program.
2. Lazily transform them to define new RDDs
using transformations like filter() or map()
3. Ask Spark to cache() any intermediate RDDs
that will need to be reused.
4. Launch actions such as count() and collect()
to kick off a parallel computation, which is then
optimized and executed by Spark.
http://www.slideshare.net/SparkSummit/intro-to-spark-development
map() intersection() cartesian()
flatMap() distinct() pipe()
filter() groupByKey() coalesce()
mapPartitions() reduceByKey() repartition()
mapPartitionsWithIndex() sortByKey() partitionBy()
sample() join() ...
union() cogroup() ...
Transformations
http://www.slideshare.net/SparkSummit/intro-to-spark-development
reduce() takeOrdered()
collect() saveAsTextFile()
count() saveAsSequenceFile()
first() saveAsObjectFile()
take() countByKey()
takeSample() foreach()
... ...
Actions
http://www.slideshare.net/SparkSummit/intro-to-spark-development
SPARK SQL
# Python: create a HiveContext and run SQL against it
from pyspark.sql import HiveContext
sqlCtx = HiveContext(sc)
results = sqlCtx.sql("SELECT * FROM people")
names = results.map(lambda p: p.name)
What is Spark SQL?
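The query above assumes a table named people already exists. A self-contained sketch that registers one as a temporary table instead (names and data hypothetical):

from pyspark.sql import Row
rows = sc.parallelize([Row(name="Alice", age=30), Row(name="Bob", age=25)])
peopleDF = sqlCtx.createDataFrame(rows)   # build a DataFrame from an RDD of Rows
peopleDF.registerTempTable("people")      # make it queryable by name
print(sqlCtx.sql("SELECT name FROM people WHERE age > 26").collect())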
NEW FEATURES IN 1.4 AND 1.5
DataFrames
http://www.slideshare.net/databricks/new-directions-for-apache-spark-in-2015
http://www.slideshare.net/SparkSummit/reynold-xin
Spark R
http://www.slideshare.net/databricks/new-directions-for-apache-spark-in-2015
Machine Learning Pipelines
http://www.slideshare.net/databricks/new-directions-for-apache-spark-in-2015
External Data Sources
http://www.slideshare.net/databricks/new-directions-for-apache-spark-in-2015
Tungsten
http://www.slideshare.net/SparkSummit/reynold-xin
All New Spark
http://www.slideshare.net/SparkSummit/reynold-xin
Spark 1.5
• Whereas 1.3 and 1.4 introduced new user-facing features, a large part of Spark 1.5 focuses on under-the-hood changes to improve Spark's performance, usability, and operational stability.
• Spark 1.5 delivers the first phase of Project Tungsten.
Reference: https://databricks.com/blog/2015/08/18/spark-1-5-preview-now-available-in-databricks.html
Thank you
Editor's Notes
• #11 Mike Olson: Chief Strategy Officer at Cloudera
• #13 Senior Vice President, Product Management
• #26 This RDD has 5 partitions. An RDD is simply a distributed collection of elements. You can think of a distributed collection as something like an array or list in a single-machine program, except that it is spread out across multiple nodes in the cluster. In Spark, all work is expressed as creating new RDDs, transforming existing RDDs, or calling operations on RDDs to compute a result. Under the hood, Spark automatically distributes the data contained in RDDs across the cluster and parallelizes the operations you perform on them. So, Spark gives you APIs and functions that let you operate on the whole collection in parallel using all the nodes.
• #27 Introduce that Spark has operations, which can be transformations or actions. The four green blocks are unique blocks in a single HDFS file. Here we filter out the warning and info messages so we are left with just errors in the RDD. This doesn't actually read the file from HDFS yet; we're just building out a lineage graph.
• #29 Directed acyclic graph: formed by a collection of vertices and directed edges, each edge connecting one vertex to another, such that there is no way to start at some vertex v and follow a sequence of edges that eventually loops back to v again. A collection of tasks that must be ordered into a sequence, subject to constraints that certain tasks must be performed earlier than others, may be represented as a DAG with a vertex for each task and an edge for each constraint. https://en.wikipedia.org/wiki/Directed_acyclic_graph
• #32 ----- Meeting Notes (6/15/15 16:02) ----- This is a stage (which we'll talk about later).
• #33 Now the RDDs disappear and get destroyed.
• #36 It's okay if only part of the RDD actually fits in memory. Talk about lineage: parent RDD and child RDD.
• #37 ----- Meeting Notes (6/15/15 16:08) ----- Also note that an application can have many such 1-through-4 procedures.
• #39 Actions force the evaluation of the transformations required for the RDD they are called on, since they are required to actually produce output.