Apache Spark
Lightning Fast Cluster Computing
Eric Mizell – Director, Solution Engineering
Page2 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
What is Apache Spark?
Apache Open Source Project
Distributed Compute Engine
for fast and expressive data processing
Designed for Iterative, In-Memory
computations and interactive data mining
Expressive Multi-Language APIs
for Java, Scala, Python, and R
Powerful Abstractions
Enable data workers to rapidly iterate over
data for:
• ETL, Machine Learning, SQL, Stream Processing,
and Graph Processing
[Diagram: the Spark stack. Scala, Java, and Python APIs; the Spark Core Engine; and the Spark SQL, Spark Streaming, MLlib, and GraphX libraries]
Why Spark?
Elegant Developer APIs
• Data Frames/SQL, Machine Learning, Graph algorithms and streaming
• Scala, Python, Java and R
• Single environment for pre-processing and Machine Learning
In-memory computation model
• Effective for iterative computations and machine learning
Machine Learning On Hadoop
• Implementation of distributed ML-algorithms
• Pipeline API (Spark ML)
Runs on Hadoop YARN, Mesos, or standalone
Interactions with Spark
Command Line
• Scala shell – Scala/Java (./bin/spark-shell)
• Python - (./bin/pyspark)
Notebooks
• Apache Zeppelin Notebook
• Jupyter/IPython Notebook
• IRuby Notebook
ODBC/JDBC (Spark SQL only via Thrift)
• Simba driver
• DataDirect driver
Introducing Apache Zeppelin
Web-based Notebook for interactive analytics
Features
Ad-hoc experimentation
Deeply integrated with Spark + Hadoop
Supports multiple language backends
Incubating at Apache
Use Case
Data exploration and discovery
Visualization
Interactive snippet-at-a-time experience
“Modern Data Science Studio”
Fundamental Abstraction: Resilient Distributed Datasets
RDD
Work with distributed collections as
primitives
RDD Properties
• Immutable collections of objects spread across
a cluster
• Built through parallel transformations (map,
filter, etc.)
• Automatically rebuilt on failure
• Controllable persistence (e.g. caching in RAM)
Multiple Languages
broad developer, partner and customer
engagement
[Diagram: the developer writes driver code against one logical RDD; physically, the RDD is split into partitions (Partition 1, 2, 3), each held on a worker node]
Spark Driver (pseudocode):
sc = new SparkContext
rdd = sc.textFile("hdfs://…")
rdd.filter(…)
rdd.cache
rdd.count
rdd.map(…)
…
RDDs are collections of objects distributed across a cluster,
cached in RAM or on disk. They are built through parallel
transformations, automatically rebuilt on failure and immutable
(each transformation creates a new RDD).
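The properties above (immutable, partitioned, rebuilt on failure) can be sketched in plain Python. This is a toy model only, not Spark's implementation: the class `ToyRDD` and its methods `compute` and `lose_partition` are hypothetical names invented for illustration.

```python
class ToyRDD:
    """Toy stand-in for an RDD: partitioned, immutable, rebuilt from lineage."""

    def __init__(self, partitions=None, parent=None, fn=None):
        self._source = partitions   # only set on the base dataset
        self._parent = parent       # lineage: the dataset this one derives from
        self._fn = fn               # per-partition function applied to the parent
        self._cache = {}            # partition index -> materialized list

    def num_partitions(self):
        if self._parent is None:
            return len(self._source)
        return self._parent.num_partitions()

    def map(self, f):
        # A transformation never mutates self; it returns a new dataset.
        return ToyRDD(parent=self, fn=lambda part: [f(x) for x in part])

    def filter(self, pred):
        return ToyRDD(parent=self, fn=lambda part: [x for x in part if pred(x)])

    def compute(self, i):
        # Use the cached partition if present, else replay the lineage.
        if i in self._cache:
            return self._cache[i]
        if self._parent is None:
            part = list(self._source[i])
        else:
            part = self._fn(self._parent.compute(i))
        self._cache[i] = part
        return part

    def lose_partition(self, i):
        # Simulate a worker failure that drops one cached partition.
        self._cache.pop(i, None)

    def collect(self):
        return [x for i in range(self.num_partitions()) for x in self.compute(i)]


base = ToyRDD(partitions=[[1, 2, 3], [4, 5, 6], [7, 8, 9]])
even_squares = base.filter(lambda x: x % 2 == 0).map(lambda x: x * x)

first_pass = even_squares.collect()
even_squares.lose_partition(1)        # "failure" of the node holding partition 1
second_pass = even_squares.collect()  # partition 1 is recomputed from its parent
```

Only the lost partition is recomputed; the other partitions are served from the cache, which is the idea behind lineage-based recovery.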
What can developers do with RDDs?
RDD Operations
Transformations
• e.g. map, filter, groupBy, join
• Lazy operations to build RDDs from other
RDDs
Actions
• e.g. count, collect, save
• Return a result or write it to storage
Other primitives
• Accumulator
• Broadcast Variables
[Diagram: the developer writes RDD operations (transformations and actions), accumulators, and broadcast variables]
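The difference between lazy transformations and actions can be illustrated without a cluster. The sketch below is a plain-Python analogy under stated assumptions: `LazyDataset` is a hypothetical class, not part of Spark, and it only records steps until an action asks for a result.

```python
class LazyDataset:
    """Records transformations; runs them only when an action is called."""

    def __init__(self, source, steps=()):
        self._source = source
        self._steps = steps  # tuple of (kind, function) pairs

    # Transformations: return a new dataset, do no work yet.
    def map(self, f):
        return LazyDataset(self._source, self._steps + (("map", f),))

    def filter(self, pred):
        return LazyDataset(self._source, self._steps + (("filter", pred),))

    def _run(self):
        for item in self._source:
            keep = True
            for kind, f in self._steps:
                if kind == "map":
                    item = f(item)
                elif not f(item):
                    keep = False
                    break
            if keep:
                yield item

    # Actions: trigger the computation and return a result.
    def count(self):
        return sum(1 for _ in self._run())

    def collect(self):
        return list(self._run())


seen = []

def times_ten(x):
    seen.append(x)  # side effect that reveals when the function really runs
    return x * 10

dataset = LazyDataset(range(5)).map(times_ten).filter(lambda x: x >= 20)
seen_before_action = list(seen)   # still empty: nothing has executed
total = dataset.count()           # the action runs the recorded steps
seen_after_action = list(seen)
```

Building `dataset` does no work; `count()` is what forces `times_ten` to run over the source.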
Example: Mining Console Logs
Load error messages from a log into memory, then interactively search for patterns

lines = sc.textFile("hdfs://...")                         # Base RDD (sc is the SparkContext)
errors = lines.filter(lambda s: s.startswith("ERROR"))    # Transformed RDD
messages = errors.map(lambda s: s.split('\t')[2])         # third tab-separated field
messages.cache()
messages.filter(lambda s: "foo" in s).count()             # Action
messages.filter(lambda s: "bar" in s).count()
. . .

Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data)
Result: scaled to 1 TB data in 5-7 sec (vs 170 sec for on-disk data)
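To make the data flow of the log-mining example concrete, here is the same logic on a small in-memory list, with no Spark involved. The sample log lines are invented for illustration, and the sketch assumes tab-separated lines whose third field is the message.

```python
# Hypothetical log lines: level <TAB> date <TAB> message
log_lines = [
    "ERROR\t2015-01-01\tfoo failed to start",
    "INFO\t2015-01-01\tfoo started",
    "ERROR\t2015-01-02\tbar timed out",
    "ERROR\t2015-01-03\tfoo and bar both down",
]

# "Transformations": keep error lines, then extract the message field.
errors = [s for s in log_lines if s.startswith("ERROR")]
messages = [s.split("\t")[2] for s in errors]  # materialized once, like cache()

# "Actions": repeated searches reuse the in-memory messages.
foo_count = sum(1 for m in messages if "foo" in m)
bar_count = sum(1 for m in messages if "bar" in m)
```

The point of caching in the slide's example is the same as here: the filtered messages are built once and then searched many times.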
RDD
Demo
SQL
SQL Access and Data Frames
[Diagram: the Spark stack (APIs, Spark Core Engine, Spark SQL, Spark Streaming, MLlib, GraphX) on YARN and HDFS]
Spark SQL
Table Structure
integrated to work with tables and rows
Hive Queries via Spark
the Spark SQL context can connect to Hive and query Hive tables
Bindings
to Python, Scala, Java, and R
Data Frames
a new abstraction that simplifies and speeds up SQL processing
[Diagram: Spark SQL on the Spark Core Engine, comprising the Data Frame DSL, Spark SQL, the Data Frame API, and the Data Source API]
What are Data Frames?
Data Frames represent data in RDDs as a Table
RDD is a low level abstraction
–Think of RDD as bytecode and DataFrame as the
Java Program
Data Frame Properties
–Data Frames attach schema to RDDs
–Allows users to perform aggressive query
optimizations
–Brings the power of SQL to RDDs!
dept   name       age
Bio    H Smith    48
CS     A Turing   54
Bio    B Jones    43
Phys   E Witten   61
[Diagram: tuples exposed through a relational view over several kinds of storage: columnar storage (ORCFile, Parquet), unstructured data (JSON, CSV, Text, Avro), and custom sources (Weblog)]
Data Frames are intuitive
RDD Example
Equivalent Data Frame Example
dept   name       age
Bio    H Smith    48
CS     A Turing   54
Bio    B Jones    43
Phys   E Witten   61
Find average age by department?
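The contrast this slide draws can be sketched in plain Python using the table above. This is an analogy, not Spark code: the first half works on positional tuples the way low-level RDD code tends to, the second half attaches column names and groups declaratively. The helper `group_avg` is a hypothetical name.

```python
from collections import defaultdict
from statistics import mean

rows = [
    ("Bio", "H Smith", 48),
    ("CS", "A Turing", 54),
    ("Bio", "B Jones", 43),
    ("Phys", "E Witten", 61),
]

# RDD style: positional fields, manual (sum, count) bookkeeping per key.
sums = defaultdict(lambda: [0, 0])
for row in rows:
    sums[row[0]][0] += row[2]
    sums[row[0]][1] += 1
avg_by_position = {dept: total / count for dept, (total, count) in sums.items()}

# Data Frame style: a schema names the columns, so the query reads like SQL.
columns = ("dept", "name", "age")
records = [dict(zip(columns, row)) for row in rows]

def group_avg(records, key, value):
    """Average `value` per distinct `key`, by column name."""
    groups = defaultdict(list)
    for record in records:
        groups[record[key]].append(record[value])
    return {k: mean(v) for k, v in groups.items()}

avg_by_name = group_avg(records, key="dept", value="age")
```

Both halves produce the same averages; the second states what is wanted (group by dept, average age) rather than how to track it.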
DataFrame
Demo
MLlib
Machine Learning Library
What is Machine Learning?
Machine learning is the study of
algorithms that learn concepts from
data.
A key aspect of learning is
generalization: how well a learning
algorithm is able to predict on unseen
examples.
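Generalization can be made concrete by scoring a model on examples it never saw during training. The sketch below is a minimal illustration with invented data and a deliberately simple one-feature threshold classifier; `fit_threshold` and `accuracy` are hypothetical helper names.

```python
from statistics import mean

def fit_threshold(examples):
    """Learn a cut point halfway between the two class means."""
    positives = [x for x, label in examples if label == 1]
    negatives = [x for x, label in examples if label == 0]
    return (mean(positives) + mean(negatives)) / 2

def accuracy(threshold, examples):
    """Fraction of examples whose predicted label matches the true label."""
    correct = sum(
        1 for x, label in examples if (1 if x > threshold else 0) == label
    )
    return correct / len(examples)

# Invented (feature, label) pairs.
train = [(1.0, 0), (2.0, 0), (6.0, 1), (7.0, 1)]
unseen = [(1.5, 0), (3.0, 0), (5.5, 1), (4.2, 0)]

threshold = fit_threshold(train)
train_accuracy = accuracy(threshold, train)
unseen_accuracy = accuracy(threshold, unseen)
```

The model is perfect on its training data but misses one unseen example, which is the gap that held-out evaluation is meant to expose.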
Machine Learning Primitives
Unsupervised Learning
Clustering (K-means)
Recommendation
Collaborative Filtering
- alternating least squares
Dimensionality Reduction
- principal component analysis (PCA) and singular
value decomposition (SVD)
Supervised Learning
Classification
- Naïve Bayes, Decision Tree, Random Forest,
Gradient Boosted Trees
Regression
- linear, logistic and Support Vector Machines
(SVMs)
ML Workflows are complex
[Diagram: Sponsored Search Advertising Pipeline. Stages shown: Ad Server, HDFS, Log Parsing/Cleaning, Feature Extraction (Q-Q and Q-A similarity, Ad category mapping, Query category mapping, Poly Exp (Q-A)), Features, Linear Solver, Model, train/test, Metrics]
Challenges:
-> specify pipeline
-> inspect and debug
-> tune hyperparameters
-> productionize
ML Pipeline makes ML workflows easier
Transformer
Transforms one dataset into another
Estimator
Fits model to data
Pipeline
Sequence of stages, consisting of estimators
or transformers
Parameters
Trait for components that take parameters
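The Transformer / Estimator / Pipeline roles can be sketched in a few lines of plain Python. This is a toy model of the pattern, not the Spark ML API: every class name below (`Tokenizer`, `KeywordEstimator`, `KeywordModel`, `Pipeline`, `PipelineModel`) is defined here for illustration, and the "learning" step is a deliberately naive keyword rule.

```python
class Tokenizer:
    """Transformer: turns (text, label) rows into (tokens, label) rows."""
    def transform(self, rows):
        return [(text.lower().split(), label) for text, label in rows]


class KeywordModel:
    """Fitted model (also a Transformer): predicts 1 if any keyword appears."""
    def __init__(self, keywords):
        self.keywords = keywords

    def transform(self, rows):
        return [1 if self.keywords & set(tokens) else 0 for tokens, _ in rows]


class KeywordEstimator:
    """Estimator: fit() learns words seen only in positive rows."""
    def fit(self, rows):
        positive, negative = set(), set()
        for tokens, label in rows:
            (positive if label == 1 else negative).update(tokens)
        return KeywordModel(positive - negative)


class PipelineModel:
    """Result of fitting a pipeline: a chain of transformers."""
    def __init__(self, stages):
        self.stages = stages

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


class Pipeline:
    """Sequence of stages; fit() fits estimators and keeps transformers as-is."""
    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        fitted = []
        for i, stage in enumerate(self.stages):
            model = stage.fit(rows) if hasattr(stage, "fit") else stage
            fitted.append(model)
            if i < len(self.stages) - 1:
                rows = model.transform(rows)  # feed the next stage
        return PipelineModel(fitted)


training = [
    ("spark is fast", 1),
    ("hadoop is slow", 0),
    ("spark streaming rocks", 1),
]
model = Pipeline([Tokenizer(), KeywordEstimator()]).fit(training)
predictions = model.transform([("Spark on YARN", None), ("slow disks", None)])
```

Fitting the pipeline returns an object that is itself a transformer, so the same preprocessing is applied at training and prediction time.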
Streaming
Real Time Stream Processing
Spark Streaming
• Spark Streaming is an extension of the core Spark API that supports scalable,
high-throughput, fault-tolerant streaming applications.
• Data can be ingested from many sources, such as Kafka, Flume, Twitter, ZeroMQ, or
TCP sockets
• Data is processed using the now-familiar operations: map, filter, reduce, join, and window
• Processed data can be pushed out to databases, filesystems, or live dashboards
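The window operation mentioned above can be sketched on plain Python lists. This is an illustration under stated assumptions rather than Spark Streaming code: it assumes the stream arrives as a sequence of small batches, and `windowed_counts` is a hypothetical helper that counts items over the most recent batches.

```python
from collections import Counter, deque

def windowed_counts(batches, window_length):
    """Yield item counts over the last `window_length` batches."""
    window = deque(maxlen=window_length)  # old batches fall out automatically
    for batch in batches:
        window.append(Counter(batch))     # per-batch count (a small reduce)
        total = Counter()
        for counts in window:
            total.update(counts)          # combine the batches in the window
        yield dict(total)

# Invented stream: three batches of events.
stream = [["a", "b"], ["a"], ["c", "a"]]
results = list(windowed_counts(stream, window_length=2))
```

Each result covers only the two most recent batches, so "b" drops out of the counts once its batch leaves the window.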
GraphX
Graph Processing
Spark GraphX
Graph API on Spark
Seamlessly work with graphs and collections
Growing library of graph algorithms
• SVD++, Connected Components, Triangle
Count, …
Iterative Graph Computations using
Pregel
Implements Valiant’s Bulk Synchronous
Parallel (BSP) model for distributing graph
algorithms.
Use Case
Social Media: Suggest new connections based
on existing relationships
Networking: Best routing through a given
network
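The Pregel / BSP idea, applied to the routing use case, can be sketched as a loop of supersteps in plain Python. This is a single-machine toy, not the GraphX Pregel API: the function `pregel_shortest_paths` and the small weighted edge list are invented for illustration.

```python
import math

def pregel_shortest_paths(edges, source):
    """Single-source shortest paths as a sequence of BSP supersteps."""
    vertices = {v for src, dst, _ in edges for v in (src, dst)}
    dist = {v: math.inf for v in vertices}
    inbox = {source: [0]}  # messages delivered at the first superstep

    while inbox:  # halt when no messages are in flight
        outbox = {}
        for vertex, messages in inbox.items():
            best = min(messages)
            if best < dist[vertex]:
                dist[vertex] = best
                # Tell each neighbor about the improved distance.
                for src, dst, weight in edges:
                    if src == vertex:
                        outbox.setdefault(dst, []).append(best + weight)
        inbox = outbox  # barrier: messages become visible next superstep
    return dist

# Hypothetical directed network: (from, to, cost)
edges = [
    ("A", "B", 1), ("A", "C", 4), ("B", "C", 1),
    ("C", "D", 1), ("A", "E", 2), ("E", "D", 5),
]
distances = pregel_shortest_paths(edges, "A")
```

Each vertex only reacts to messages from the previous superstep, which is what lets a real system run the per-vertex work in parallel between barriers.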
Distributed Graphs as Tables (RDDs)
[Diagram: a property graph with vertices A to F is stored as three RDD-backed tables split across two partitions (Part. 1, Part. 2) using a 2D vertex cut heuristic: a Vertex Table, a Routing Table, and an Edge Table with edges A-B, A-C, C-D, B-C, A-E, A-F, E-F, E-D]
How to Get Started with Spark
Try Spark Today
Download the Hortonworks Sandbox
http://hortonworks.com/products/hortonworks-sandbox/
Go to the Apache Spark Website
http://spark.apache.org/
Learn Spark
Build a Proof of Concept
Test New Functionality
Thank You!
Eric Mizell - Director, Solutions Engineering
emizell@hortonworks.com

More Related Content

What's hot (20)

PDF
IMCSummit 2015 - Day 2 Developer Track - Anatomy of an In-Memory Data Fabric:...
In-Memory Computing Summit
 
PPTX
Developing Real-Time Data Pipelines with Apache Kafka
Joe Stein
 
PPTX
Performance Comparison of Streaming Big Data Platforms
DataWorks Summit/Hadoop Summit
 
PDF
Hadoop Everywhere & Cloudbreak
Sean Roberts
 
PDF
OpenStack Scale-out Networking Architecture
Randy Bias
 
PDF
IMCSummit 2015 - Day 1 Developer Track - Open-Source In-Memory Platforms: Ben...
In-Memory Computing Summit
 
PDF
High Performance Enterprise Data Processing with Apache Spark with Sandeep Va...
Spark Summit
 
PDF
Camel Riders in the Cloud
Red Hat Developers
 
PPTX
Effective Spark on Multi-Tenant Clusters
DataWorks Summit/Hadoop Summit
 
PDF
20150716 introduction to apache spark v3
Andrey Vykhodtsev
 
PDF
Bringing Real-Time to the Enterprise with Hortonworks DataFlow
DataWorks Summit
 
PDF
Cooperative Data Exploration with iPython Notebook
DataWorks Summit/Hadoop Summit
 
PPT
OGCE TeraGrid 2010 Science Gateway Tutorial Intro
marpierc
 
PPTX
OpenStack + Nano Server + Hyper-V + S2D
Alessandro Pilotti
 
PDF
OpenStack in Action 4! Franz Meyer - What Use Case does Red Hat Enterprise ...
eNovance
 
PDF
Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)
Helena Edelson
 
PPTX
Intro to Big Data Analytics using Apache Spark and Apache Zeppelin
Alex Zeltov
 
PDF
Building Scalable Data Pipelines - 2016 DataPalooza Seattle
Evan Chan
 
PDF
Streaming SQL
Jungtaek Lim
 
PDF
Introduction to Apache NiFi And Storm
Jungtaek Lim
 
IMCSummit 2015 - Day 2 Developer Track - Anatomy of an In-Memory Data Fabric:...
In-Memory Computing Summit
 
Developing Real-Time Data Pipelines with Apache Kafka
Joe Stein
 
Performance Comparison of Streaming Big Data Platforms
DataWorks Summit/Hadoop Summit
 
Hadoop Everywhere & Cloudbreak
Sean Roberts
 
OpenStack Scale-out Networking Architecture
Randy Bias
 
IMCSummit 2015 - Day 1 Developer Track - Open-Source In-Memory Platforms: Ben...
In-Memory Computing Summit
 
High Performance Enterprise Data Processing with Apache Spark with Sandeep Va...
Spark Summit
 
Camel Riders in the Cloud
Red Hat Developers
 
Effective Spark on Multi-Tenant Clusters
DataWorks Summit/Hadoop Summit
 
20150716 introduction to apache spark v3
Andrey Vykhodtsev
 
Bringing Real-Time to the Enterprise with Hortonworks DataFlow
DataWorks Summit
 
Cooperative Data Exploration with iPython Notebook
DataWorks Summit/Hadoop Summit
 
OGCE TeraGrid 2010 Science Gateway Tutorial Intro
marpierc
 
OpenStack + Nano Server + Hyper-V + S2D
Alessandro Pilotti
 
OpenStack in Action 4! Franz Meyer - What Use Case does Red Hat Enterprise ...
eNovance
 
Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)
Helena Edelson
 
Intro to Big Data Analytics using Apache Spark and Apache Zeppelin
Alex Zeltov
 
Building Scalable Data Pipelines - 2016 DataPalooza Seattle
Evan Chan
 
Streaming SQL
Jungtaek Lim
 
Introduction to Apache NiFi And Storm
Jungtaek Lim
 

Viewers also liked (20)

PDF
The Anti-Henry Ford: How 200 hour discoveries revolutionized the way we do bu...
All Things Open
 
PDF
Marketing is not all fluff; engineering is not all math
All Things Open
 
PDF
Trademarks and Your Free and Open Source Software Project
All Things Open
 
PDF
Women in Open Source
All Things Open
 
PPTX
Giving a URL to All Objects using Beacons²
All Things Open
 
PDF
Open Source Systems Administration
All Things Open
 
PPTX
Sustainable Open Data Markets
All Things Open
 
ODP
How Raleigh Became an Open Source City
All Things Open
 
PPTX
All Things Open Opening Keynote
All Things Open
 
PPT
Open Sourcing the Public Library
All Things Open
 
PDF
Software Development as a Civic Service
All Things Open
 
PDF
The Ember.js Framework - Everything You Need To Know
All Things Open
 
PPTX
Great Artists (Designers) Steal
All Things Open
 
PDF
What Academia Can Learn from Open Source
All Things Open
 
PPTX
JavaScript and Internet Controlled Hardware Prototyping
All Things Open
 
PPTX
Javascript - The Stack and Beyond
All Things Open
 
PDF
Open Source in Healthcare
All Things Open
 
PDF
Choosing a Javascript Framework
All Things Open
 
PDF
The Gurubox Project: Open Source Troubleshooting Tools
All Things Open
 
PPTX
Considerations for Operating an OpenStack Cloud
All Things Open
 
The Anti-Henry Ford: How 200 hour discoveries revolutionized the way we do bu...
All Things Open
 
Marketing is not all fluff; engineering is not all math
All Things Open
 
Trademarks and Your Free and Open Source Software Project
All Things Open
 
Women in Open Source
All Things Open
 
Giving a URL to All Objects using Beacons²
All Things Open
 
Open Source Systems Administration
All Things Open
 
Sustainable Open Data Markets
All Things Open
 
How Raleigh Became an Open Source City
All Things Open
 
All Things Open Opening Keynote
All Things Open
 
Open Sourcing the Public Library
All Things Open
 
Software Development as a Civic Service
All Things Open
 
The Ember.js Framework - Everything You Need To Know
All Things Open
 
Great Artists (Designers) Steal
All Things Open
 
What Academia Can Learn from Open Source
All Things Open
 
JavaScript and Internet Controlled Hardware Prototyping
All Things Open
 
Javascript - The Stack and Beyond
All Things Open
 
Open Source in Healthcare
All Things Open
 
Choosing a Javascript Framework
All Things Open
 
The Gurubox Project: Open Source Troubleshooting Tools
All Things Open
 
Considerations for Operating an OpenStack Cloud
All Things Open
 
Ad

Similar to Apache Spark: Lightning Fast Cluster Computing (20)

PPTX
Intro to Spark with Zeppelin
Hortonworks
 
PPTX
Crash Course HS16Melb - Hands on Intro to Spark & Zeppelin
DataWorks Summit/Hadoop Summit
 
PPTX
Real Time Data Processing using Spark Streaming | Data Day Texas 2015
Cloudera, Inc.
 
PPTX
Real Time Data Processing Using Spark Streaming
Hari Shreedharan
 
PPTX
Pyspark presentationsfspfsjfspfjsfpsjfspfjsfpsjfsfsf
sasuke20y4sh
 
PPTX
Spark from the Surface
Josi Aranda
 
PDF
Spark mhug2
Joseph Niemiec
 
PPTX
CLOUD_COMPUTING_MODULE5_RK_BIG_DATA.pptx
bhuvankumar3877
 
PDF
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...
Debraj GuhaThakurta
 
PDF
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...
Debraj GuhaThakurta
 
PDF
Apache spark with java 8
Janu Jahnavi
 
PPTX
Apache spark with java 8
Janu Jahnavi
 
PDF
H2O Rains with Databricks Cloud - Parisoma SF
Sri Ambati
 
PPTX
Apache Spark on HDinsight Training
Synergetics Learning and Cloud Consulting
 
PDF
Spark meets Spring
mark_fisher
 
PDF
H2O Rains with Databricks Cloud - NY 02.16.16
Sri Ambati
 
PPTX
Sparkflows.io
sparkflows
 
PDF
Reactive dashboard’s using apache spark
Rahul Kumar
 
PPT
Spark_Part 1
Shashi Prakash
 
Intro to Spark with Zeppelin
Hortonworks
 
Crash Course HS16Melb - Hands on Intro to Spark & Zeppelin
DataWorks Summit/Hadoop Summit
 
Real Time Data Processing using Spark Streaming | Data Day Texas 2015
Cloudera, Inc.
 
Real Time Data Processing Using Spark Streaming
Hari Shreedharan
 
Pyspark presentationsfspfsjfspfjsfpsjfspfjsfpsjfsfsf
sasuke20y4sh
 
Spark from the Surface
Josi Aranda
 
Spark mhug2
Joseph Niemiec
 
CLOUD_COMPUTING_MODULE5_RK_BIG_DATA.pptx
bhuvankumar3877
 
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...
Debraj GuhaThakurta
 
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...
Debraj GuhaThakurta
 
Apache spark with java 8
Janu Jahnavi
 
Apache spark with java 8
Janu Jahnavi
 
H2O Rains with Databricks Cloud - Parisoma SF
Sri Ambati
 
Apache Spark on HDinsight Training
Synergetics Learning and Cloud Consulting
 
Spark meets Spring
mark_fisher
 
H2O Rains with Databricks Cloud - NY 02.16.16
Sri Ambati
 
Sparkflows.io
sparkflows
 
Reactive dashboard’s using apache spark
Rahul Kumar
 
Spark_Part 1
Shashi Prakash
 
Ad

More from All Things Open (20)

PDF
Agentic AI for Developers and Data Scientists Build an AI Agent in 10 Lines o...
All Things Open
 
PPTX
Big Data on a Small Budget: Scalable Data Visualization for the Rest of Us - ...
All Things Open
 
PDF
AI 3-in-1: Agents, RAG, and Local Models - Brent Laster
All Things Open
 
PDF
Let's Create a GitHub Copilot Extension! - Nick Taylor, Pomerium
All Things Open
 
PDF
Leveraging Pre-Trained Transformer Models for Protein Function Prediction - T...
All Things Open
 
PDF
Gen AI: AI Agents - Making LLMs work together in an organized way - Brent Las...
All Things Open
 
PDF
You Don't Need an AI Strategy, But You Do Need to Be Strategic About AI - Jes...
All Things Open
 
PPTX
DON’T PANIC: AI IS COMING – The Hitchhiker’s Guide to AI - Mark Hinkle, Perip...
All Things Open
 
PDF
Fine-Tuning Large Language Models with Declarative ML Orchestration - Shivay ...
All Things Open
 
PDF
Leveraging Knowledge Graphs for RAG: A Smarter Approach to Contextual AI Appl...
All Things Open
 
PPTX
Artificial Intelligence Needs Community Intelligence - Sriram Raghavan, IBM R...
All Things Open
 
PDF
Don't just talk to AI, do more with AI: how to improve productivity with AI a...
All Things Open
 
PPTX
Open-Source GenAI vs. Enterprise GenAI: Navigating the Future of AI Innovatio...
All Things Open
 
PDF
The Death of the Browser - Rachel-Lee Nabors, AgentQL
All Things Open
 
PDF
Making Operating System updates fast, easy, and safe
All Things Open
 
PDF
Reshaping the landscape of belonging to transform community
All Things Open
 
PDF
The Unseen, Underappreciated Security Work Your Maintainers May (or may not) ...
All Things Open
 
PDF
Integrating Diversity, Equity, and Inclusion into Product Design
All Things Open
 
PDF
The Open Source Ecosystem for eBPF in Kubernetes
All Things Open
 
PDF
Open Source Privacy-Preserving Metrics - Sarah Gran & Brandon Pitman
All Things Open
 
Agentic AI for Developers and Data Scientists Build an AI Agent in 10 Lines o...
All Things Open
 
Big Data on a Small Budget: Scalable Data Visualization for the Rest of Us - ...
All Things Open
 
AI 3-in-1: Agents, RAG, and Local Models - Brent Laster
All Things Open
 
Let's Create a GitHub Copilot Extension! - Nick Taylor, Pomerium
All Things Open
 
Leveraging Pre-Trained Transformer Models for Protein Function Prediction - T...
All Things Open
 
Gen AI: AI Agents - Making LLMs work together in an organized way - Brent Las...
All Things Open
 
You Don't Need an AI Strategy, But You Do Need to Be Strategic About AI - Jes...
All Things Open
 
DON’T PANIC: AI IS COMING – The Hitchhiker’s Guide to AI - Mark Hinkle, Perip...
All Things Open
 
Fine-Tuning Large Language Models with Declarative ML Orchestration - Shivay ...
All Things Open
 
Leveraging Knowledge Graphs for RAG: A Smarter Approach to Contextual AI Appl...
All Things Open
 
Artificial Intelligence Needs Community Intelligence - Sriram Raghavan, IBM R...
All Things Open
 
Don't just talk to AI, do more with AI: how to improve productivity with AI a...
All Things Open
 
Open-Source GenAI vs. Enterprise GenAI: Navigating the Future of AI Innovatio...
All Things Open
 
The Death of the Browser - Rachel-Lee Nabors, AgentQL
All Things Open
 
Making Operating System updates fast, easy, and safe
All Things Open
 
Reshaping the landscape of belonging to transform community
All Things Open
 
The Unseen, Underappreciated Security Work Your Maintainers May (or may not) ...
All Things Open
 
Integrating Diversity, Equity, and Inclusion into Product Design
All Things Open
 
The Open Source Ecosystem for eBPF in Kubernetes
All Things Open
 
Open Source Privacy-Preserving Metrics - Sarah Gran & Brandon Pitman
All Things Open
 

Recently uploaded (20)

PPTX
Digital Circuits, important subject in CS
contactparinay1
 
DOCX
Python coding for beginners !! Start now!#
Rajni Bhardwaj Grover
 
PPTX
Q2 FY26 Tableau User Group Leader Quarterly Call
lward7
 
PDF
“Squinting Vision Pipelines: Detecting and Correcting Errors in Vision Models...
Edge AI and Vision Alliance
 
PDF
CIFDAQ Market Wrap for the week of 4th July 2025
CIFDAQ
 
PDF
Staying Human in a Machine- Accelerated World
Catalin Jora
 
PDF
Automating Feature Enrichment and Station Creation in Natural Gas Utility Net...
Safe Software
 
PPTX
MuleSoft MCP Support (Model Context Protocol) and Use Case Demo
shyamraj55
 
PPT
Ericsson LTE presentation SEMINAR 2010.ppt
npat3
 
PPTX
Future Tech Innovations 2025 – A TechLists Insight
TechLists
 
PDF
Transcript: Book industry state of the nation 2025 - Tech Forum 2025
BookNet Canada
 
PDF
NLJUG Speaker academy 2025 - first session
Bert Jan Schrijver
 
PDF
“Computer Vision at Sea: Automated Fish Tracking for Sustainable Fishing,” a ...
Edge AI and Vision Alliance
 
PPTX
COMPARISON OF RASTER ANALYSIS TOOLS OF QGIS AND ARCGIS
Sharanya Sarkar
 
PDF
UPDF - AI PDF Editor & Converter Key Features
DealFuel
 
PDF
Book industry state of the nation 2025 - Tech Forum 2025
BookNet Canada
 
PDF
Go Concurrency Real-World Patterns, Pitfalls, and Playground Battles.pdf
Emily Achieng
 
PPTX
Agentforce World Tour Toronto '25 - MCP with MuleSoft
Alexandra N. Martinez
 
PPTX
Agentforce World Tour Toronto '25 - Supercharge MuleSoft Development with Mod...
Alexandra N. Martinez
 
PDF
Transforming Utility Networks: Large-scale Data Migrations with FME
Safe Software
 
Digital Circuits, important subject in CS
contactparinay1
 
Python coding for beginners !! Start now!#
Rajni Bhardwaj Grover
 
Q2 FY26 Tableau User Group Leader Quarterly Call
lward7
 
“Squinting Vision Pipelines: Detecting and Correcting Errors in Vision Models...
Edge AI and Vision Alliance
 
CIFDAQ Market Wrap for the week of 4th July 2025
CIFDAQ
 
Staying Human in a Machine- Accelerated World
Catalin Jora
 
Automating Feature Enrichment and Station Creation in Natural Gas Utility Net...
Safe Software
 
MuleSoft MCP Support (Model Context Protocol) and Use Case Demo
shyamraj55
 
Ericsson LTE presentation SEMINAR 2010.ppt
npat3
 
Future Tech Innovations 2025 – A TechLists Insight
TechLists
 
Transcript: Book industry state of the nation 2025 - Tech Forum 2025
BookNet Canada
 
NLJUG Speaker academy 2025 - first session
Bert Jan Schrijver
 
“Computer Vision at Sea: Automated Fish Tracking for Sustainable Fishing,” a ...
Edge AI and Vision Alliance
 
COMPARISON OF RASTER ANALYSIS TOOLS OF QGIS AND ARCGIS
Sharanya Sarkar
 
UPDF - AI PDF Editor & Converter Key Features
DealFuel
 
Book industry state of the nation 2025 - Tech Forum 2025
BookNet Canada
 
Go Concurrency Real-World Patterns, Pitfalls, and Playground Battles.pdf
Emily Achieng
 
Agentforce World Tour Toronto '25 - MCP with MuleSoft
Alexandra N. Martinez
 
Agentforce World Tour Toronto '25 - Supercharge MuleSoft Development with Mod...
Alexandra N. Martinez
 
Transforming Utility Networks: Large-scale Data Migrations with FME
Safe Software
 

Apache Spark: Lightning Fast Cluster Computing

  • 1. Apache Spark Lightening Fast Cluster Computing Eric Mizell – Director, Solution Engineering
  • 2. Page2 © Hortonworks Inc. 2011 – 2015. All Rights Reserved What is Apache Spark? Apache Open Source Project Distributed Compute Engine for fast and expressive data processing Designed for Iterative, In-Memory computations and interactive data mining Expressive Multi-Language APIs for Java, Scala, Python, and R Powerful Abstractions Enable data workers to rapidly iterate over data for: • ETL, Machine Learning, SQL, Stream Processing, and Graph Processing Scala Java Python APIs Spark Core EngineSpark Core Engine GraphX Spark SQL Spark Streaming MLlib
  • 3. Page3 © Hortonworks Inc. 2011 – 2015. All Rights Reserved Why Spark? Elegant Developer APIs • Data Frames/SQL, Machine Learning, Graph algorithms and streaming • Scala, Python, Java and R • Single environment for pre-processing and Machine Learning In-memory computation model • Effective for iterative computations and machine learning Machine Learning On Hadoop • Implementation of distributed ML-algorithms • Pipeline API (Spark ML) Runs on Hadoop on YARN, Mesos, standalone
  • 4. Page4 © Hortonworks Inc. 2011 – 2015. All Rights Reserved Interactions with Spark Command Line • Scala shell – Scala/Java (./bin/spark-shell) • Python - (./bin/pyspark) Notebooks • Apache Zeppelin Notebook • Juptyer/IPython Notebook • IRuby Notebook ODBC/JDBC (Spark SQL only via Thrift) • Simba driver • DataDirect driver
  • 5. Page5 © Hortonworks Inc. 2011 – 2015. All Rights Reserved Introducing Apache Zeppelin Web-based Notebook for interactive analytics Features Ad-hoc experimentation Deeply integrated with Spark + Hadoop Supports multiple language backends Incubating at Apache Use Case Data exploration and discovery Visualization Interactive snippet-at-a-time experience “Modern Data Science Studio”
  • 6. Page6 © Hortonworks Inc. 2011 – 2015. All Rights Reserved Fundamental Abstraction: Resilient Distributed Datasets RDD Work with distributed collections as primitives RDD Properties • Immutable collections of objects spread across a cluster • Built through parallel transformations (map, filter, etc.) • Automatically rebuilt on failure • Controllable persistence (e.g. caching in RAM) Multiple Languages broad developer, partner and customer engagement RDD Partition 1 RDD Partition 2 RDD Partition 3Worker Node Worker Node Worker Node RDD LogicalSpark Driver sc = new SparkContext rDD =sc.textfile(“hdfs://…”) rDD.filter(…) rDD.Cache rDD.Count rDD.map … Developer Physical Writes RDD RDDs are collections of objects distributed across a cluster, cached in RAM or on disk. They are built through parallel transformations, automatically rebuilt on failure and immutable (each transformation creates a new RDD).
  • 7. Page7 © Hortonworks Inc. 2011 – 2015. All Rights Reserved What can developers do with RDDs? RDD Operations Transformations • e.g. map, filter, groupBy, join • Lazy operations to build RDDs from other RDDs Actions • e.g. count, collect, save • Return a result or write it to storage Other primitives • Accumulator • Broadcast Variables Developer Writes RDD Operations Writes Accumulator s Actions Broadcast Variables Transformations
  • 8. Page8 © Hortonworks Inc. 2011 – 2015. All Rights Reserved lines = spark.textFile(“hdfs://...”) errors = lines.filter(lambda s: s.startswith(“ERROR”)) messages = errors.map(lambda s: s.split(‘t’)[2]) messages.cache() messages.filter(lambda s: “foo” in s).count() messages.filter(lambda s: “bar” in s).count() . . . Base RDD Transformed RDD Action Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data) Result: scaled to 1 TB data in 5-7 sec (vs 170 sec for on-disk data) Example: Mining Console Logs Load error messages from a log into memory, then interactively search for patterns
  • 9. Page9 © Hortonworks Inc. 2011 – 2015. All Rights Reserved RDD Demo
  • 10. Page10 © Hortonworks Inc. 2011 – 2015. All Rights Reserved SQL SQL Access and Data Frames YARN HDFS Scala Java Python APIs Spark Core EngineSpark Core Engine GraphX Spark Streaming MLlib Spark SQL
  • 11. Page11 © Hortonworks Inc. 2011 – 2015. All Rights Reserved YARN HDFS Spark SQL Table Structure integrated to work with tables and rows Hive Queries via Spark by Spark SQL Context can connect to Hive and query Hive Bindings to Python, Scala, Java, and R Data Frames new abstractions simplifies and speeds up SQL processing Spark Core Engine Spark SQL Data Frame DSL Spark SQL Data Frame API Data Source API
  • 12. Page12 © Hortonworks Inc. 2011 – 2015. All Rights Reserved Storage What are Data Frames? Data Frames represent data in RDDs as a Table RDD is a low level abstraction –Think of RDD as bytecode and DataFrame as the Java Program Data Frame Properties –Data Frames attach schema to RDDs –Allows users to perform aggressive query optimizations –Brings the power of SQL to RDDs! dept name age Bio H Smith 48 CS A Turing 54 Bio B Jones 43 Phys E Witten 61 Tuple Relational View Columnar Storage ORCFile Parquet Unstructured Data JSON CSV Text Avro Custom Weblog
  • 13. Page13 © Hortonworks Inc. 2011 – 2015. All Rights Reserved Data Frames are intuitive RDD Example Equivalent Data Frame Example dept name age Bio H Smith 48 CS A Turing 54 Bio B Jones 43 Phys E Witten 61 Find average age by department?
  • 14. Page14 © Hortonworks Inc. 2011 – 2015. All Rights Reserved DataFrame Demo YARN HDFS Scala Java Python APIs Spark Core EngineSpark Core Engine GraphX Spark Streaming MLlib Spark SQL
  • 15. Page15 © Hortonworks Inc. 2011 – 2015. All Rights Reserved MLlib Machine Learning Library YARN HDFS Scala Java Python APIs Spark Core EngineSpark Core Engine GraphX Spark SQL Spark Streaming MLlib
  • 16. Page16 © Hortonworks Inc. 2011 – 2015. All Rights Reserved What is Machine Learning? Machine learning is the study of algorithms that learn concepts from data. A key aspect of learning is generalization: how well a learning algorithm is able to predict on unseen examples.
  • 17. Page17 © Hortonworks Inc. 2011 – 2015. All Rights Reserved Machine Learning Primitives Unsupervised Learning Clustering (K-means) Recommendation Collaborative Filtering - alternating least squares Dimensionality Reductions - Principal component analysis (PCA) and singular value decomposition (SVD) Supervised Learning Classification - Naïve Bayes, Decision Tree, Random Forest, Gradient Boosted Trees Regression - linear, logistic and Support Vector Machines (SVMs)
  • 18. Page18 © Hortonworks Inc. 2011 – 2015. All Rights Reserved ML Workflows are complex Q-Q Q-A similarit y Log Parsing, Cleanin g Ad category mapping Query category mapping Poly Exp (Q-A) Feature s Model Linear Solver train test Metrics • Feature Extraction Feature Extraction Ad Server Sponsored Search Advertising Pipeline Challenges: -> specify pipeline -> inspect and debug -> tune hyperparameters -> productionize HDFS
  • 19. Page19 © Hortonworks Inc. 2011 – 2015. All Rights Reserved ML Pipeline makes ML workflows easier Transformer Transforms one dataset into another Estimator Fits model to data Pipeline Sequence of stages, consisting of estimators or transformers Parameters Trait for components that take parameters
  • 20. Page20 © Hortonworks Inc. 2011 – 2015. All Rights Reserved Streaming Real Time Stream Processing YARN HDFS Scala Java Python APIs Spark Core EngineSpark Core Engine GraphX Spark SQL MLlib Spark Streaming
  • 21. Page21 © Hortonworks Inc. 2011 – 2015. All Rights Reserved Spark Streaming • Spark Streaming is an extension of Spark-core API that supports scalable, high throughput and fault-tolerant streaming applications. • Data can be ingested from many data sources like Kafka, Flume, Twitter, ZeroMQ or TCP sockets • Data is processed using the now-familiar API: map, filter, reduce, join and window • Processed data can be stored in databases, filesystems, or live dashboards
  • 22. GraphX: Graph Processing. [Diagram: the Spark stack. Scala, Java, and Python APIs; Spark SQL, Spark Streaming, MLlib, and GraphX on the Spark Core Engine; YARN and HDFS underneath.]
  • 23. Spark GraphX. Graph API on Spark: seamlessly work with graphs and collections. Growing library of graph algorithms: SVD++, Connected Components, Triangle Count, … Iterative graph computations using Pregel, which implements Valiant’s Bulk Synchronous Parallel (BSP) model for distributing graph algorithms. Use cases: Social Media (suggest new connections based on existing relationships); Networking (best routing through a given network).
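The Pregel/BSP style mentioned on slide 23 proceeds in supersteps: each vertex reads its incoming messages, updates its state, sends messages to its neighbors, and then all vertices wait at a barrier. The sketch below computes connected components this way in plain, single-process Python. It is not the GraphX `Pregel` operator; the function name and the sample graph are hypothetical.

```python
# Single-process sketch of Pregel-style (BSP) connected components.
# NOT the GraphX API; it only illustrates the superstep/message pattern.

def pregel_connected_components(vertices, edges):
    """Label each vertex with the smallest vertex id in its component."""
    neighbors = {v: set() for v in vertices}
    for src, dst in edges:            # treat edges as undirected
        neighbors[src].add(dst)
        neighbors[dst].add(src)

    labels = {v: v for v in vertices}

    # Superstep 0: every vertex sends its own label to its neighbors.
    inbox = {v: [] for v in vertices}
    for v in vertices:
        for n in neighbors[v]:
            inbox[n].append(labels[v])

    # Later supersteps: only vertices that received messages do work.
    while any(inbox.values()):
        outbox = {v: [] for v in vertices}
        for v, messages in inbox.items():
            if not messages:
                continue
            best = min(messages)
            if best < labels[v]:      # state changed: notify neighbors
                labels[v] = best
                for n in neighbors[v]:
                    outbox[n].append(best)
        inbox = outbox                # barrier: messages arrive next superstep

    return labels


# Hypothetical graph: two components plus one isolated vertex.
component_labels = pregel_connected_components(
    vertices=[1, 2, 3, 4, 5, 6],
    edges=[(1, 2), (2, 3), (4, 5)],
)
```

The loop ends when a superstep produces no messages, which is the usual Pregel halting condition; vertices whose label did not change stay silent.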
  • 24. Distributed Graphs as Tables (RDDs). [Diagram: a property graph with vertices A to F is split by a 2D vertex-cut heuristic into two partitions (Part. 1, Part. 2) and stored as a Vertex Table (RDD), an Edge Table (RDD) listing the pairs A-B, A-C, C-D, B-C, A-E, A-F, E-F, and E-D, and a Routing Table (RDD).]
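The table layout on slide 24 can be mimicked with ordinary Python collections. The sketch below uses the vertex names and edge pairs that appear on the slide, but everything else is an assumption: the integer vertex ids, the 2 x 2 partition grid, and the modulo-based placement are a simplified stand-in for a 2D vertex-cut and do not reproduce GraphX's actual partitioner or the slide's two-partition layout.

```python
# Simplified illustration of "graphs as tables"; NOT GraphX's implementation.
GRID = 2  # hypothetical 2 x 2 grid, giving up to 4 edge partitions

# Vertex table: vertex id -> properties.
vertex_table = {vid: {"name": name} for vid, name in enumerate("ABCDEF")}
name_to_id = {props["name"]: vid for vid, props in vertex_table.items()}

# Edge table: the pairs listed on the slide, stored as (source id, target id).
edge_names = [("A", "B"), ("A", "C"), ("C", "D"), ("B", "C"),
              ("A", "E"), ("A", "F"), ("E", "F"), ("E", "D")]
edge_table = [(name_to_id[s], name_to_id[d]) for s, d in edge_names]


def edge_partition(src, dst, grid=GRID):
    """Place an edge in a grid cell: row from the source, column from the target."""
    return (src % grid) * grid + (dst % grid)


# Assign every edge to exactly one partition (edges are cut, vertices replicated).
edge_partitions = {}
for src, dst in edge_table:
    edge_partitions.setdefault(edge_partition(src, dst), []).append((src, dst))

# Routing table: for each vertex, the edge partitions that reference it.
routing_table = {vid: set() for vid in vertex_table}
for pid, part in edge_partitions.items():
    for src, dst in part:
        routing_table[src].add(pid)
        routing_table[dst].add(pid)


def triplets(pid):
    """Join one edge partition with the vertex table to recover vertex names."""
    return [(vertex_table[s]["name"], vertex_table[d]["name"])
            for s, d in edge_partitions.get(pid, [])]
```

With this placement a vertex can appear in at most `2 * GRID - 1` partitions (one grid row as a source, one grid column as a target), which is the property that makes a 2D cut attractive: the routing table bounds how many copies of each vertex must be shipped.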
  • 25. How to Get Started with Spark
  • 26. Try Spark Today. Download the Hortonworks Sandbox (http://hortonworks.com/products/hortonworks-sandbox/). Go to the Apache Spark website (http://spark.apache.org/). Learn Spark. Build a Proof of Concept. Test New Functionality.
  • 27. Thank You! Eric Mizell - Director, Solutions Engineering [email protected]

Editor's Notes

  • #6: TALK TRACK Ad-hoc experimentation: Spark, Hive, Shell, Flink, Tajo, Ignite, Lens, etc. Deeply integrated with Spark + Hadoop: can be managed via Ambari Stacks. Supports multiple language backends: pluggable “Interpreters”. Incubating at Apache: 100% open source and open community. [NEXT SLIDE]
  • #7: TALK TRACK Ensuring Spark is well integrated with YARN, Ambari, and Ranger enables enterprises to deploy Spark apps with confidence. Because HDP is available across Windows, Linux, on-premises, and cloud deployment environments, it is that much easier for enterprises to adopt. [NEXT SLIDE] http://www.cs.berkeley.edu/~matei/papers/2010/hotcloud_spark.pdf http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pd
  • #8: TALK TRACK Ensuring Spark is well integrated with YARN, Ambari, and Ranger enables enterprises to deploy Spark apps with confidence. Because HDP is available across Windows, Linux, on-premises, and cloud deployment environments, it is that much easier for enterprises to adopt. [NEXT SLIDE] http://www.cs.berkeley.edu/~matei/papers/2010/hotcloud_spark.pdf http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pd
  • #9: Key idea: add “variables” to the “functions” in functional programming
  • #14: Spark DataFrames represent tabular data.
  • #18: TALK TRACK Ensuring Spark is well integrated with YARN, Ambari, and Ranger enables enterprises to deploy Spark apps with confidence. Because HDP is available across Windows, Linux, on-premises, and cloud deployment environments, it is that much easier for enterprises to adopt. [NEXT SLIDE]
  • #24: TALK TRACK Ensuring Spark is well integrated with YARN, Ambari, and Ranger enables enterprises to deploy Spark apps with confidence. Because HDP is available across Windows, Linux, on-premises, and cloud deployment environments, it is that much easier for enterprises to adopt. [NEXT SLIDE] [RESOURCES] A vertex is an entity that can carry a bag of data (generally small). An edge connects vertices and can also carry its own bag of data. https://amplab.cs.berkeley.edu/wp-content/uploads/2013/05/grades-graphx_with_fonts.pdf
  • #28: Takeaways: Change the order of the interoperability slide. Flesh out the no lock-in slide to talk about “proprietary open source”.