Apache Spark Crash Course

Robert Hryniewicz
Developer Advocate
T: @RobH8z
E: rhryniewicz@hortonworks.com
Apache Spark
Crash Course - DataWorks Summit – Sydney 2017

© Hortonworks Inc. 2011 – 2016. All Rights Reserved
Data Sources
• Internet of Things (IoT)
– Wind Turbines, Oil Rigs
– Beacons, Wearables
– Smart Cars
• User Generated Content (Social, Web & Mobile)
– Twitter, Facebook, Snapchat
– Clickstream
– PayPal, Venmo

[Chart: Data Growth in Zettabytes (ZB), 2006 to 2020, y-axis 0 to 70 ZB]
50+ ZB in 2021

Visualizing 50 ZB

The “Big Data” Problem
Problem
• A single machine cannot process or even store all the data!
Solution
• Distribute data over large clusters
Difficulty
• How to split work across machines?
• Moving data over the network is expensive
• Must consider data & network locality
• How to deal with failures?
• How to deal with slow nodes?
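A minimal plain-Python sketch of the "split work across machines" idea, not Spark code. The partitions stand in for cluster nodes, and the helper names (split_into_partitions, count_words, merge_counts) are made up for illustration. Each partition is processed locally and only the small per-partition summaries are merged, which is why moving less data over the network matters.

```python
from collections import Counter
from functools import reduce


def split_into_partitions(records, num_partitions):
    """Assign records round-robin to partitions (stand-ins for cluster nodes)."""
    partitions = [[] for _ in range(num_partitions)]
    for i, record in enumerate(records):
        partitions[i % num_partitions].append(record)
    return partitions


def count_words(partition):
    """Work done locally on one partition, without moving its raw data."""
    counts = Counter()
    for line in partition:
        counts.update(line.split())
    return counts


def merge_counts(left, right):
    """Combine two partial results; only these small summaries are exchanged."""
    return left + right


lines = ["spark is fast", "spark is distributed", "data is big"]
partitions = split_into_partitions(lines, num_partitions=2)
partial_counts = [count_words(p) for p in partitions]
total = reduce(merge_counts, partial_counts, Counter())
print(total["spark"], total["is"], total["big"])
```

Failures and slow nodes are not modeled here; a real engine would also need to re-run or duplicate the work of a lost or lagging partition.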

Apache Spark Background

What Is Apache Spark?
• Apache open source project originally developed at AMPLab (University of California, Berkeley)
• Unified, general data processing engine that operates across varied data workloads and platforms

Why Apache Spark?
• Elegant Developer APIs
– Single environment for data munging, data wrangling, and Machine Learning (ML)
• In-memory computation model – Fast!
– Effective for iterative computations and ML
• Machine Learning
– Implementation of distributed ML algorithms
– Pipeline API (Spark MLlib)

Spark SQL: Structured Data
Spark Streaming: Near Real-time
Spark MLlib: Machine Learning
GraphX: Graph Analysis

Spark SQL

More Flexible, Better Storage and Performance

Spark SQL Overview
• Spark module for structured data processing (e.g. ORC, Parquet, Avro, MySQL)
• Two ways to manipulate data:
– DataFrame/Dataset API
– SQL query

SparkSession
What is it?
• Main entry point for Spark functionality
• Allows programming with DataFrame and Dataset APIs
• Represented as spark and auto-initialized in a notebook-type environment (Zeppelin or Jupyter)

DataFrames
• Distributed collection of data organized into named columns
• Conceptually equivalent to a table in a relational DB or a data frame in R/Python
• API available in Scala, Java, Python, and R
• Data is described as a DataFrame with rows, columns, and a schema
[Diagram: a DataFrame as a grid of rows and columns Col1, Col2, …, ColN]

Sources
CSV, Avro, Hive, JSON
[Diagram: sources are read through Spark SQL into a DataFrame of rows and columns Col1, Col2, …, ColN]

Create a DataFrame
Example
// Read a JSON file into a DataFrame; the schema is inferred from the data
val path = "examples/flights.json"
val flights = spark.read.json(path)

Register a Temporary View (SQL API)
Example
flights.createOrReplaceTempView("flightsView")

Two API Examples: DataFrame and SQL APIs
DataFrame API
// The $"..." column syntax needs spark.implicits._ (typically pre-imported in notebooks)
flights.select("Origin", "Dest", "DepDelay")
  .filter($"DepDelay" > 15).show(5)
SQL API
SELECT Origin, Dest, DepDelay
FROM flightsView
WHERE DepDelay > 15 LIMIT 5
Results
+------+----+--------+
|Origin|Dest|DepDelay|
+------+----+--------+
|   IAD| TPA|      19|
|   IND| BWI|      34|
|   IND| JAX|      25|
|   IND| LAS|      67|
|   IND| MCO|      94|
+------+----+--------+

Spark Streaming

What is Stream Processing?
Batch Processing
• Ability to process and analyze data at-rest (stored data)
• Request-based, bulk evaluation and short-lived processing
• Enabler for Retrospective, Reactive and On-demand Analytics
Stream Processing
• Ability to ingest, process and analyze data in-motion in real- or near-real-time
• Event or micro-batch driven, continuous evaluation and long-lived processing
• Enabler for real-time Prospective, Proactive and Predictive Analytics for Next Best Action
Stream Processing (real-time, now) + Batch Processing (historical, past) = All Data Analytics

Modern Data Applications approach to Insights
Traditional Analytics: Structured & Repeatable
• Structure built to store data
• Start with hypothesis
• Test against selected data
• Analyze after landing…
Next Generation Analytics: Iterative & Exploratory
• Data is the structure
• Data leads the way
• Explore all data, identify correlations
• Analyze in motion…

Spark Streaming
Overview
• Extension of Spark Core API
• Stream processing of live data streams
– Scalable
– High-throughput
– Fault-tolerant
Note: ZeroMQ and MQTT sources are no longer supported in Spark 2.x

Spark Streaming

Spark Streaming
Discretized Streams (DStreams)
• High-level abstraction representing a continuous stream of data
• Internally represented as a sequence of RDDs
• An operation applied on a DStream translates to operations on the underlying RDDs

Spark Streaming
Example: flatMap operation
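A rough plain-Python sketch of the DStream idea, not Spark's API. The stream is modeled as a list of micro-batches (stand-ins for RDDs), and the flat_map helper and sample lines are assumptions made for illustration. Applying flatMap to the stream then simply means applying it to every batch.

```python
def flat_map(batch, func):
    """Apply func to every record of one micro-batch and flatten the results."""
    return [item for record in batch for item in func(record)]


# A hypothetical stream of text lines, discretized into three micro-batches.
micro_batches = [
    ["to be or", "not to be"],
    ["that is"],
    ["the question"],
]

# One operation on the stream becomes the same operation on each batch in turn.
word_batches = [flat_map(batch, str.split) for batch in micro_batches]
print(word_batches)
```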

Spark Streaming
Window Operations
• Apply transformations over a sliding window of data, e.g. rolling average
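A small plain-Python sketch of a rolling average over a sliding window of micro-batches. The names window_length and slide_interval are borrowed loosely from streaming terminology, but the function itself is hypothetical and counts both in batches rather than in seconds.

```python
from collections import deque


def windowed_averages(batches, window_length, slide_interval):
    """Average all values in the last window_length batches, every slide_interval batches."""
    window = deque(maxlen=window_length)  # oldest batch falls out automatically
    averages = []
    for index, batch in enumerate(batches, start=1):
        window.append(batch)
        window_is_full = index >= window_length
        if window_is_full and (index - window_length) % slide_interval == 0:
            values = [value for b in window for value in b]
            if values:  # skip windows that contain no data
                averages.append(sum(values) / len(values))
    return averages


batches = [[10, 20], [30], [40, 50], [60], [70, 80]]
print(windowed_averages(batches, window_length=3, slide_interval=2))
```

With a window of 3 batches sliding by 2, the first average covers batches 1 to 3 and the second covers batches 3 to 5, so batch 3 contributes to both.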

Challenges in Streaming Data
• Consistency
• Fault tolerance
• Out-of-order data

Structured Streaming
• High-Level APIs: DataFrames, Datasets and SQL. Same in streaming and in batch
• Event-time Processing: native support for working with out-of-order and late data
• End-to-end Exactly Once: transactional both in processing and output
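A simplified plain-Python model of event-time windows with a lateness threshold (a "watermark"), to illustrate out-of-order and late data. This is a sketch under assumptions, not how Spark implements it: the function name, the integer timestamps, and the per-event watermark update are all made up for illustration.

```python
from collections import defaultdict


def count_by_event_window(events, window_size, allowed_lateness):
    """Count events per event-time window; drop events whose window ended before the watermark."""
    counts = defaultdict(int)
    dropped = []
    max_event_time = 0
    for event_time, key in events:  # events are listed in arrival order
        max_event_time = max(max_event_time, event_time)
        watermark = max_event_time - allowed_lateness
        window_start = (event_time // window_size) * window_size
        window_end = window_start + window_size
        if window_end <= watermark:
            dropped.append((event_time, key))  # too late: window is considered final
        else:
            counts[(window_start, key)] += 1  # on time, or late but still accepted
    return dict(counts), dropped


# Event times 3 and 4 arrive out of order; only 4 arrives after its window is final.
events = [(1, "a"), (12, "a"), (3, "a"), (25, "a"), (4, "a")]
counts, dropped = count_by_event_window(events, window_size=10, allowed_lateness=10)
print(counts, dropped)
```

The event at time 3 is late but still updates its window, because the watermark has not yet passed the end of that window. The event at time 4 arrives after the watermark has moved past it and is dropped.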

Structured Streaming: Basics

Structured Streaming: Model

Handling late arriving data

Spark MLlib

Machine Learning use cases
Healthcare
• Predict diagnosis
• Prioritize screenings
• Reduce re-admittance rates
Financial services
• Fraud Detection/prevention
• Predict underwriting risk
• New account risk screens
Public Sector
• Analyze public sentiment
• Optimize resource allocation
• Law enforcement & security
Retail
• Product recommendation
• Inventory management
• Price optimization
Telco/mobile
• Predict customer churn
• Predict equipment failure
• Customer behavior analysis
Oil & Gas
• Predictive maintenance
• Seismic data management
• Predict well production levels

[Chart: choosing an algorithm family, starting from START]
Classification
• Logistic Regression
• Support Vector Machines (SVM)
• Random Forest (RF)
• Naïve Bayes
Regression
• Linear Regression
Collaborative Filtering
• Alternating Least Squares (ALS)
Clustering
• K-Means, LDA
Dimensionality Reduction
• Principal Component Analysis (PCA)
What is an ML Model?
• A mathematical formula with a number of parameters that need to be learned from the data. Fitting a model to the data is a process known as model training
• E.g. linear regression
– Goal: fit a line y = mx + c to data points
– After model training: y = 2x + 5
Input: 1, 0, 7, 2, … → Model → Output: 7, 5, 19, 9, …
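A short plain-Python sketch of what "model training" means for the line above, using ordinary least squares on the four input/output pairs shown on the slide. The fit_line helper is illustrative only; Spark MLlib's own implementation is not shown here.

```python
def fit_line(xs, ys):
    """Fit y = m*x + c by ordinary least squares and return (m, c)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    m = covariance / variance
    c = mean_y - m * mean_x
    return m, c


xs = [1, 0, 7, 2]   # inputs from the slide
ys = [7, 5, 19, 9]  # outputs from the slide
m, c = fit_line(xs, ys)
print(f"y = {m:.1f}x + {c:.1f}")  # the learned parameters m and c
```

Because these points lie exactly on a line, the learned parameters match the slide's trained model, y = 2x + 5.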

Scatter 2D Data Visualized
scatterData
+-----+--------+
|label|features|
+-----+--------+
|-12.0|  [-4.9]|
| -6.0|  [-4.5]|
| -7.2|  [-4.1]|
| -5.0|  [-3.2]|
| -2.0|  [-3.0]|
| -3.1|  [-2.1]|
| -4.0|  [-1.5]|
| -2.2|  [-1.2]|
| -2.0|  [-0.7]|
|  1.0|  [-0.5]|
| -0.7|  [-0.2]|
...

Linear Regression Model Training (one feature)
Training Result
Coefficients: 2.81
Intercept: 3.05
y = 2.81x + 3.05

Linear Regression (two features)
Coefficients: [0.464, 0.464]
Intercept: 0.0563

Spark ML Pipeline
• fit() is for training
• transform() is for prediction
Train: Input DataFrame (TRAIN) → Pipeline fit() → Pipeline Model
Predict: Input DataFrame (TEST) → Pipeline Model transform() → Output DataFrame (PREDICTIONS)
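A toy plain-Python sketch of the fit()/transform() contract, with lists of dicts standing in for DataFrames. All class names here (Scaler, MeanRegressor, and this Pipeline) are hypothetical stand-ins, not Spark classes, and every stage is treated as an estimator for simplicity.

```python
class Scaler:
    """Hypothetical estimator: fit() learns the largest x seen in training."""

    def fit(self, rows):
        return ScalerModel(max(row["x"] for row in rows))


class ScalerModel:
    """Fitted transformer: adds x_scaled using the value learned by fit()."""

    def __init__(self, max_x):
        self.max_x = max_x

    def transform(self, rows):
        return [{**row, "x_scaled": row["x"] / self.max_x} for row in rows]


class MeanRegressor:
    """Hypothetical estimator: fit() learns the mean training label."""

    def fit(self, rows):
        return MeanRegressorModel(sum(row["y"] for row in rows) / len(rows))


class MeanRegressorModel:
    """Fitted transformer: adds the learned mean as the prediction."""

    def __init__(self, mean_y):
        self.mean_y = mean_y

    def transform(self, rows):
        return [{**row, "prediction": self.mean_y} for row in rows]


class Pipeline:
    """fit() trains each stage in order, feeding it the output of earlier stages."""

    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            model = stage.fit(rows)
            rows = model.transform(rows)
            fitted.append(model)
        return PipelineModel(fitted)


class PipelineModel:
    """transform() replays the fitted stages on new data."""

    def __init__(self, models):
        self.models = models

    def transform(self, rows):
        for model in self.models:
            rows = model.transform(rows)
        return rows


train = [{"x": 2.0, "y": 1.0}, {"x": 4.0, "y": 3.0}]
test = [{"x": 8.0}]
model = Pipeline([Scaler(), MeanRegressor()]).fit(train)  # Train
predictions = model.transform(test)                       # Predict
print(predictions)
```

Note that the test row is scaled with the maximum learned from the training data, not from the test data. Keeping learned state inside the fitted model is the point of separating fit() from transform().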

Spark ML Pipeline
Pipeline stages: Feature transform 1 → Feature transform 2 → Combine features → Linear Regression
Train: Input DataFrame → Pipeline → Pipeline Model
Predict: Input DataFrame → Pipeline Model → Output DataFrame
Export Model
Sample Spark ML Pipeline
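from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier
# Feature-preparation stages (definitions elided on the slide)
indexer = …
parser = …
hashingTF = …
vecAssembler = …
# Final stage: random forest classifier with 100 trees
rf = RandomForestClassifier(numTrees=100)
pipe = Pipeline(stages=[indexer, parser, hashingTF, vecAssembler, rf])
model = pipe.fit(trainData)  # Train model
results = model.transform(testData)  # Test model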

Exporting ML Models - PMML
• Predictive Model Markup Language (PMML)
– XML-based predictive model interchange format
• Supported models
– K-Means
– Linear Regression
– Ridge Regression
– Lasso
– SVM
– Binary

Spark GraphX

• PageRank
• Topic Modeling (LDA)
• Community Detection
Source: ampcamp.berkeley.edu

GraphX Algorithms
• PageRank
• Connected components
• Label propagation
• SVD++
• Strongly connected components
• Triangle count

Sample GraphX Code in Scala
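import org.apache.spark.graphx._
// Build a graph from an RDD of vertices and an RDD of edges
val graph = Graph(vertices, edges)
// Load messages; joinVertices expects an RDD of (VertexId, value) pairs,
// so the raw lines still need to be parsed into that shape
val messages = spark.sparkContext.textFile("hdfs://...")
val graph2 = graph.joinVertices(messages) {
  (id, vertex, msg) => ...
}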

Apache Zeppelin Basics

What’s Apache Zeppelin?
Web-based notebook that enables interactive data analytics.
You can make beautiful data-driven, interactive and collaborative documents with SQL, Python, Scala and more.

Apache Zeppelin with HDP 2.6
Web-based Notebook for interactive analytics
Use Case
• Data exploration and discovery
• Visualization
• Interactive snippet-at-a-time experience
• “Modern Data Science Studio”
Features
• Ad-hoc experimentation
• Deeply integrated with Spark + Hadoop
• Supports multiple language backends
• Incubating at Apache

How does Zeppelin work?
[Diagram: Notebook Author and Collaborators/Report viewers → Zeppelin → Cluster (Spark | Hive | HBase, any of 30+ back ends)]

Big Data Lifecycle
Stages: Collect → ETL / Process → Analysis → Report → Data Product
Roles: Data Engineer, Data Scientist, Business user, Customer
All in Zeppelin!

Multitenancy with Zeppelin

Livy
• Livy is the open source REST interface for interacting with Apache Spark from anywhere
• Installed as Spark Ambari Service
[Diagram: Livy Client → HTTP → Livy Server → HTTP (RPC) → Spark Interactive Session (SparkContext) and Spark Batch Session (SparkContext)]
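A hedged sketch of what talking to Livy over REST can look like from Python, using only the standard library. The host name is a placeholder, and the two helper functions are made up for illustration. The endpoints used are Livy's POST /sessions and POST /sessions/{id}/statements. The requests are only built here, not sent.

```python
import json
from urllib.request import Request

# Hypothetical host; 8998 is Livy's default port.
LIVY_URL = "http://livy.example.com:8998"
JSON_HEADERS = {"Content-Type": "application/json"}


def create_session_request(kind="pyspark"):
    """Build the request that asks Livy to start an interactive Spark session."""
    body = json.dumps({"kind": kind}).encode("utf-8")
    return Request(f"{LIVY_URL}/sessions", data=body,
                   headers=JSON_HEADERS, method="POST")


def run_statement_request(session_id, code):
    """Build the request that submits a code snippet to an existing session."""
    body = json.dumps({"code": code}).encode("utf-8")
    return Request(f"{LIVY_URL}/sessions/{session_id}/statements", data=body,
                   headers=JSON_HEADERS, method="POST")


session_req = create_session_request()
statement_req = run_statement_request(0, "spark.range(10).count()")
print(session_req.get_method(), session_req.full_url)
print(statement_req.get_method(), statement_req.full_url)
# To actually send a request against a running Livy server:
#   urllib.request.urlopen(session_req)
```

In practice the session id comes from the JSON response to the first request. On a secured cluster, additional authentication (e.g. SPNEGO) would be needed, which this sketch leaves out.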

Security Across Zeppelin-Livy-Spark
[Diagram: Zeppelin (Shiro, LDAP, Ispark Group Interpreter) → SPNego: Kerberos → Livy Server (Livy APIs) → Kerberos → Spark on YARN (Driver)]

Reasons to Integrate with Livy
• Bring Sessions to Apache Zeppelin
– Isolation
– Session sharing
• Enable efficient cluster resource utilization
– Default Spark interpreter keeps YARN/Spark job running forever
– Livy interpreter recycled after 60 minutes of inactivity (controlled by livy.server.session.timeout)
• Identity Propagation
– Send user identity from Zeppelin > Livy > Spark on YARN

SparkSession Sharing
[Diagram: Client 1 and Client 2 attach to Session-1 and Client 3 attaches to Session-2; inside the Livy Server, Session-1 maps to SparkSession-1 and Session-2 to SparkSession-2, each backed by its own SparkContext]

Apache Zeppelin + Livy End-to-End Security
[Diagram: Zeppelin (LDAP login as Tommy Callahan, Ispark Group Interpreter) → SPNego: Kerberos → Livy Server (Livy APIs) → Kerberos/RPC → Spark on YARN; job runs as Tommy Callahan]

Hortonworks Data Platform (HDP) Basics

• Zeppelin → Interactive notebook
• Spark
• YARN → Resource Management
• HDFS → Distributed Storage Layer
[Diagram: APIs (Scala, Java, Python, R) over Spark SQL, Spark Streaming, MLlib and GraphX, on the Spark Core Engine, on YARN, on HDFS across nodes 1 to N]

Access patterns enabled by YARN
Applications: Batch, Interactive, Real-Time
• Batch: needs to happen, but with no timeframe limitations
• Interactive: needs to happen at human time
• Real-Time: needs to happen at machine execution time
YARN: Data Operating System
HDFS: Hadoop Distributed File System (nodes 1 to N)

Why Apache Spark on YARN?
• Resource management
• Utilizes existing HDP cluster infrastructure
• Scheduling and queues
[Diagram: Client, Spark Driver, Spark Application Master (YARN container), and three Spark Executors (one YARN container each, two Tasks each)]

Why HDFS?
Fault Tolerant Distributed Storage
• Divide files into big blocks and distribute 3 copies randomly across the cluster
• Processing Data Locality
• Not just storage but computation
[Diagram: a logical file is split into blocks 1 to 4; three copies of each block are spread across the cluster]
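A toy plain-Python sketch of the block-and-replica idea, assuming purely random placement as the slide describes. The function names, node names and tiny block size are made up for illustration, and real HDFS placement follows additional rules not modeled here.

```python
import random


def split_into_blocks(data, block_size):
    """Cut a logical file (bytes) into fixed-size blocks; the last one may be shorter."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks, nodes, replication=3, seed=0):
    """Pick `replication` distinct nodes at random for every block."""
    rng = random.Random(seed)  # seeded so the sketch is repeatable
    return {block_id: rng.sample(nodes, replication) for block_id in range(num_blocks)}


def readable_after_failure(placement, failed_node):
    """A block stays readable if at least one replica lives on another node."""
    return all(any(node != failed_node for node in replicas)
               for replicas in placement.values())


data = b"x" * 450                      # hypothetical 450-byte logical file
blocks = split_into_blocks(data, block_size=128)
nodes = ["node1", "node2", "node3", "node4", "node5"]
placement = place_replicas(len(blocks), nodes)

print(len(blocks), "blocks;", "survives one node failure:",
      all(readable_after_failure(placement, node) for node in nodes))
```

With three copies of every block on distinct nodes, losing any single node still leaves two replicas of each block, which is the fault-tolerance property the slide points at.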

Hortonworks Data Platform

Hortonworks Data Cloud (HDCloud) Basics

Hortonworks Cloud Solutions
Cloud providers: Microsoft, AWS, Google
Managed: Azure HDInsight
Non-Managed / Marketplace: Hortonworks Data Cloud for AWS
Cloud IaaS: Hortonworks Data Platform (via Ambari and via Cloudbreak)

Hortonworks Cloud Solutions: Flexibility and Choice
Hortonworks Data Cloud for AWS → Cloudbreak → HDP on Cloud IaaS
[Spectrum: from More Prescriptive and More Ephemeral / Short Lived to More Options and More Long Running]

Sample Architecture

Modern Data Apps
• HDP 2.6
– Batch Processing
• HDF 3.0
– Streaming Apps
[Diagram: Modern Data Applications combine DATA AT REST and DATA IN MOTION into ACTIONABLE INTELLIGENCE]

Custom or Off the Shelf
Real-Time Cyber Security: protects systems with superior threat detection
Smart Manufacturing: dramatically improves yields by managing more variables in greater detail
Connected, Autonomous Cars: drive themselves and improve road safety
Future Farming: optimizes soil, seeds and equipment to measured conditions on each square foot
Automatic Recommendation Engines: match products to preferences in milliseconds
[Diagram: Hortonworks DataFlow (DATA IN MOTION) and Hortonworks Data Platform (DATA AT REST) feed ACTIONABLE INTELLIGENCE]

Managed Dataflow
SOURCES → REGIONAL INFRASTRUCTURE → CORE INFRASTRUCTURE

NiFi part of HDF

High-Level Overview
IoT Devices → IoT Edge (single node) → NiFi Hub → Data Broker
Data Center (on prem/cloud): Column DB (HBase/Cassandra), Data Store (HDFS/S3), Live Dashboard

Robert Hryniewicz
T: @RobH8z
E: rhryniewicz@hortonworks.com
Thanks!
