Spark and Friends: 
Spark Streaming, 
Machine Learning, Graph Processing, 
Lambda Architecture, Approximations 
Chris Fregly 
East Bay Java User Group 
Oct 2014 
Kinesis 
Streaming
Who am I? 
Former Netflix’er 
(netflix.github.io) 
Spark Contributor 
(github.com/apache/spark) 
Founder 
(fluxcapacitor.com) 
Author 
(effectivespark.com, 
sparkinaction.com)
Spark Streaming-Kinesis Jira
Not-so-Quick Poll 
• Spark, Spark Streaming? 
• Hadoop, Hive, Pig? 
• Parquet, Avro, RCFile, ORCFile? 
• EMR, DynamoDB, Redshift? 
• Flume, Kafka, Kinesis, Storm? 
• Lambda Architecture? 
• PageRank, Collaborative Filtering, Recommendations? 
• Probabilistic Data Structs, Bloom Filters, HyperLogLog?
Quick Poll 
Raiders or 49’ers?
“Streaming” 
Kinesis 
Streaming 
Video 
Streaming 
Piping 
Big Data 
Streaming
Agenda 
• Spark, Spark Streaming 
• Use Cases 
• API and Libraries 
• Machine Learning 
• Graph Processing 
• Execution Model 
• Fault Tolerance 
• Cluster Deployment 
• Monitoring 
• Scaling and Tuning 
• Lambda Architecture 
• Probabilistic Data Structures 
• Approximations
Timeline of Spark Evolution
Spark and Berkeley AMP Lab 
Berkeley Data Analytics Stack (BDAS)
Spark Overview 
• Based on 2007 Microsoft Dryad paper 
• Written in Scala 
• Supports Java, Python, SQL, and R 
• Data fits in memory when possible, but not 
required 
• Improved efficiency over MapReduce 
– Up to 100x faster in memory, 2-10x faster on disk 
• Compatible with Hadoop 
– File formats, SerDes, and UDFs
Spark Use Cases 
• Ad hoc, exploratory, interactive analytics 
• Real-time + Batch Analytics 
– Lambda Architecture 
• Real-time Machine Learning 
• Real-time Graph Processing 
• Approximate, Time-bound Queries
Explosion of Specialized Systems
Unified High-level Spark Libraries 
• Spark SQL (Data Processing) 
• Spark Streaming (Streaming) 
• MLlib (Machine Learning) 
• GraphX (Graph Processing) 
• BlinkDB (Approximate Queries) 
• Statistics (Correlations, Sampling, etc) 
• Others 
– Shark (Hive on Spark) 
– Spork (Pig on Spark)
Benefits of Unified Libraries 
• Advancements in higher-level libraries are 
pushed down into core and vice-versa 
• Examples 
– Spark Streaming 
• GC and memory management improvements 
– Spark GraphX 
• IndexedRDD for random, hash-based access within 
a partition versus scanning entire partition 
– Spark Core 
• Sort-based Shuffle
Resilient Distributed Dataset 
(RDD)
RDD Overview 
• Core Spark abstraction 
• Represents partitions 
across the cluster nodes 
• Enables parallel processing 
on data sets 
• Partitions can be in-memory 
or on-disk 
• Immutable, recomputable, 
fault tolerant 
• Contains transformation history (“lineage”) for 
whole data set
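The lineage idea above can be sketched in plain Python, with no Spark required. This is a toy model, not Spark's implementation: ToyRDD and its methods are hypothetical names. Each transformation returns a new, immutable object that remembers how to rebuild its data from the source.

```python
# Toy model of RDD lineage in plain Python (no Spark required).
# ToyRDD and its methods are hypothetical names for illustration only.

class ToyRDD:
    def __init__(self, source, lineage=None):
        self._source = source              # zero-argument function that rebuilds the data
        self.lineage = lineage or ["source"]

    def map(self, fn):
        parent = self
        return ToyRDD(lambda: [fn(x) for x in parent._source()],
                      self.lineage + ["map"])

    def filter(self, pred):
        parent = self
        return ToyRDD(lambda: [x for x in parent._source() if pred(x)],
                      self.lineage + ["filter"])

    def collect(self):
        # Recomputes from the original source on every call, which mirrors
        # how a lost partition could be rebuilt from its lineage.
        return self._source()


numbers = ToyRDD(lambda: [1, 2, 3, 4, 5])
evens_squared = numbers.map(lambda x: x * x).filter(lambda x: x % 2 == 0)

print(evens_squared.lineage)    # ['source', 'map', 'filter']
print(evens_squared.collect())  # [4, 16]
```

The parent object is never modified, so the original data set can still be collected after any number of derived data sets have been built from it.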
Types of RDDs 
• Spark Out-of-the-Box 
– HadoopRDD 
– FilteredRDD 
– MappedRDD 
– PairRDD 
– ShuffledRDD 
– UnionRDD 
– DoubleRDD 
– JdbcRDD 
– JsonRDD 
– SchemaRDD 
– VertexRDD 
– EdgeRDD 
• External 
– CassandraRDD (DataStax) 
– GeoRDD (Esri) 
– EsSpark (ElasticSearch)
RDD Partitions
RDD Lineage Examples
Demo! 
• Load user data 
• Load gender data 
• Join user data 
with gender data 
• Analyze lineage
Join Optimizations 
• When joining large dataset with small dataset (reference 
data) 
• Broadcast small dataset to each node/partition of large 
dataset (one broadcast per node)
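A minimal plain-Python sketch of the broadcast join described above. The user and gender records and the broadcast_join name are made up for illustration. In Spark, the lookup table would be shipped once per node rather than built locally.

```python
# Plain-Python sketch of a broadcast (map-side) hash join.
# The data sets and the broadcast_join name are hypothetical.

def broadcast_join(large_partitions, small_dataset):
    lookup = dict(small_dataset)   # the small data set, "broadcast" as a hash map
    joined = []
    for partition in large_partitions:
        # Each partition probes its local copy of the lookup table,
        # so the large data set never has to be shuffled.
        joined.append([(key, value, lookup[key])
                       for key, value in partition if key in lookup])
    return joined


users = [[(1, "alice"), (2, "bob")], [(3, "carol"), (4, "dave")]]  # 2 partitions
genders = [(1, "F"), (2, "M"), (3, "F")]                            # small reference data

result = broadcast_join(users, genders)
print(result)  # [[(1, 'alice', 'F'), (2, 'bob', 'M')], [(3, 'carol', 'F')]]
```

User 4 has no match in the reference data, so it is dropped, as in an inner join.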
Spark API
Spark API Overview 
• Richer, more expressive than MapReduce 
• Native support for Java, Scala, Python, 
SQL, and R (mostly) 
• Unified API across all libraries 
• Operations 
– Transformations (lazy evaluation) 
– Actions (execute transformations)
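The split between lazy transformations and actions can be illustrated with Python's built-in lazy iterators. This is an analogy in plain Python rather than Spark API code. The traced_double helper is hypothetical and only records when work actually happens.

```python
# Lazy "transformations" versus an eager "action", using built-in iterators.
calls = []

def traced_double(x):
    calls.append(x)      # record when the function actually runs
    return x * 2

data = [1, 2, 3]
doubled = map(traced_double, data)          # "transformation": nothing runs yet
large = filter(lambda x: x > 2, doubled)    # still nothing runs
calls_before_action = list(calls)

result = list(large)                        # "action": forces evaluation

print(calls_before_action)  # []
print(result)               # [4, 6]
print(calls)                # [1, 2, 3]
```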
Transformations
Actions
Spark Execution Model
Job Scheduling 
• Job 
– Contains many stages 
• Contains many tasks 
• FIFO 
– Long-running jobs may starve other jobs of 
resources 
• Fair 
– Round-robin to prevent resource starvation 
• Fair Scheduler Pools 
– High-priority pool for important jobs 
– Separate user pools 
– FIFO within the pool 
– Modeled after Hadoop Fair Scheduler
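To make the FIFO versus Fair trade-off concrete, here is a small simulation in plain Python. It assumes a single execution slot, one task per time step, and at least one task per job. The function names are hypothetical and this is not how Spark's scheduler is implemented.

```python
from collections import deque

def fifo_finish_times(jobs):
    """jobs: list of (name, task_count); one task runs per time step."""
    clock, finish = 0, {}
    for name, tasks in jobs:
        clock += tasks
        finish[name] = clock
    return finish

def fair_finish_times(jobs):
    """Round-robin: each unfinished job gets one task per turn."""
    queue = deque(jobs)
    clock, finish = 0, {}
    while queue:
        name, remaining = queue.popleft()
        clock += 1
        if remaining == 1:
            finish[name] = clock
        else:
            queue.append((name, remaining - 1))
    return finish

jobs = [("long", 6), ("short", 2)]
print(fifo_finish_times(jobs))  # {'long': 6, 'short': 8}
print(fair_finish_times(jobs))  # {'short': 4, 'long': 8}
```

Under FIFO the short job waits behind the long one and finishes at step 8. Under round-robin it finishes at step 4, while the long job slips from step 6 to step 8.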
Spark Execution Model Overview 
• Parallel, distributed 
• DAG-based 
• Lazy evaluation 
• Allows optimizations 
– Reduce disk I/O 
– Reduce shuffle I/O 
– Single pass through dataset 
– Parallel execution 
– Task pipelining 
• Data locality and rack awareness 
• Worker node fault tolerance using RDD 
lineage graphs per partition
Execution Optimizations
Task Pipelining
Spark Cluster Deployment
Types of Cluster Deployments 
• Spark Standalone (default) 
• YARN 
– Allows hierarchies of resources 
– Kerberos integration 
– Multiple workloads from disparate execution frameworks 
• Hive, Pig, Spark, MapReduce, Cascading, etc 
• Mesos 
– Coarse-grained 
• A single, long-running Mesos task runs Spark mini-tasks 
– Fine-grained 
• New Mesos task for each Spark task 
• Higher overhead 
• Not good for long-running Spark jobs like Spark Streaming
Spark Standalone Deployment
Spark YARN Deployment
Master High Availability 
• Multiple Master Nodes 
• ZooKeeper maintains current Master 
• Existing applications and workers will be 
notified of new Master election 
• New applications and workers need to 
explicitly specify current Master 
• Alternatives (Not recommended) 
– Local filesystem 
– NFS Mount
Spark After Dark 
(sparkafterdark.com)
Spark After Dark ~ Tinder
Mutual Match!
Goal: Increase Matches (1/2) 
• Top Influencers 
– PageRank 
– Most desirable people 
• People You May Know 
– Shortest Path 
– Facebook-enabled 
• Recommendations 
– Alternating Least 
Squares (ALS)
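A rough sketch of the PageRank idea behind "Top Influencers", written as a plain-Python power iteration. The "likes" graph is invented for illustration, and the sketch assumes every person has at least one outgoing link and every target appears as a key. It is not the GraphX implementation.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links: dict mapping each node to the list of nodes it points to."""
    nodes = list(links)
    ranks = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(iterations):
        # Every node starts each round with the "random jump" share.
        new_ranks = {node: (1.0 - damping) / len(nodes) for node in nodes}
        for node, targets in links.items():
            # A node splits its damped rank evenly among the nodes it points to.
            share = damping * ranks[node] / len(targets)
            for target in targets:
                new_ranks[target] += share
        ranks = new_ranks
    return ranks

# Hypothetical "who liked whom" graph.
likes = {"ann": ["bob"], "bob": ["cat"], "cat": ["bob"], "dan": ["bob", "cat"]}
ranks = pagerank(likes)
most_desirable = max(ranks, key=ranks.get)
print(most_desirable)  # bob
```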
Goal: Increase Matches (2/2) 
• Cluster on Interests, 
Height, Religion, etc 
– K-Means 
– Nearest Neighbor 
• Textual Analysis 
of Profile Text 
– TF/IDF 
– N-gram 
– LDA Topic Extraction
Demo!
Spark Streaming
Spark Streaming Overview 
• Low latency, high throughput, fault-tolerance 
(mostly) 
• Long-running Spark application 
• Supports Flume, Kafka, Twitter, Kinesis, 
Socket, File, etc. 
• Graceful shutdown, in-flight message draining 
• Uses Spark Core, DAG Execution Model, and 
Fault Tolerance 
• Submits micro-batch jobs to the cluster
Spark Streaming Overview
Spark Streaming Use Cases 
• ETL and enrichment 
of streaming data 
on ingest 
• Operational 
dashboards 
• Lambda 
Architecture
Discretized Stream (DStream) 
• Core Spark Streaming abstraction 
• Micro-batches of RDDs 
• Operations similar to RDD – operates on underlying RDDs
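The micro-batch idea can be sketched in plain Python: timestamped events are cut into fixed intervals, and each interval becomes one small batch that is processed like an ordinary collection. The discretize function and the sample events are hypothetical.

```python
from collections import defaultdict

def discretize(events, batch_interval):
    """Group (timestamp, value) events into consecutive micro-batches."""
    batches = defaultdict(list)
    for timestamp, value in events:
        batches[int(timestamp // batch_interval)].append(value)
    if not batches:
        return []
    # Include empty batches for intervals in which no events arrived.
    return [batches[i] for i in range(max(batches) + 1)]

events = [(0.2, "a"), (0.9, "b"), (1.1, "c"), (3.5, "d")]
micro_batches = discretize(events, batch_interval=1.0)
batch_sizes = [len(batch) for batch in micro_batches]   # a per-batch operation

print(micro_batches)  # [['a', 'b'], ['c'], [], ['d']]
print(batch_sizes)    # [2, 1, 0, 1]
```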
Spark Streaming API
Spark Streaming API Overview 
• Rich, expressive API similar to core 
• Operations 
– Transformations (lazy) 
– Actions (execute transformations) 
• Window and State Operations 
• Requires checkpointing to snip long-running 
DStream lineage 
• Register DStream as a Spark SQL table 
for querying?! Wow.
DStream Transformations
DStream Actions
Window and State DStream Operations
DStream Window and State Example
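The original slide's example is not available in this transcript. As a stand-in, here is a minimal plain-Python sketch of the two ideas: a sliding window measured in batches, and state carried from batch to batch. Function names and data are hypothetical, and this is not Spark Streaming API code.

```python
def windowed_counts(batches, window_length, slide_interval):
    """Count the items in each sliding window, measured in batches."""
    counts = []
    for end in range(window_length, len(batches) + 1, slide_interval):
        window = batches[end - window_length:end]
        counts.append(sum(len(batch) for batch in window))
    return counts

def running_word_counts(batches):
    """Stateful update: carry per-key totals from batch to batch."""
    state, history = {}, []
    for batch in batches:
        for word in batch:
            state[word] = state.get(word, 0) + 1
        history.append(dict(state))   # snapshot of the state after this batch
    return history

batches = [["a", "b"], ["a"], [], ["b", "b"]]
print(windowed_counts(batches, window_length=2, slide_interval=1))  # [3, 1, 2]
print(running_word_counts(batches)[-1])                             # {'a': 2, 'b': 3}
```

The running totals depend on every earlier batch, which is why long-running stateful streams need periodic checkpoints to keep lineage short.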
Spark Streaming Cluster 
Deployment
Spark Streaming Cluster Deployment
Adding Receivers
Adding Workers
Spark Streaming 
+ 
Kinesis
Spark Streaming + Kinesis Architecture
Throughput and Pricing
Demo! 
Kinesis 
Streaming 
https://github.com/apache/spark/blob/master/extras/kinesis-asl/src/main/… 
Scala: …/scala/org/apache/spark/examples/streaming/KinesisWordCountASL.scala 
Java: …/java/org/apache/spark/examples/streaming/JavaKinesisWordCountASL.java
Spark Streaming 
Fault Tolerance
Characteristics of Sources 
• Buffered 
– Flume, Kafka, Kinesis 
– Allows replay and back pressure 
• Batched 
– Flume, Kafka, Kinesis 
– Improves throughput at expense of duplication on 
failure 
• Checkpointed 
– Kafka, Kinesis 
– Allows replay from specific checkpoint
Message Delivery Guarantees 
• Exactly once [1] 
– No loss 
– No redelivery 
– Perfect delivery 
– Incurs higher latency for transactional semantics 
– Spark default per batch using DStream lineage 
– Degrades to weaker guarantees depending on source 
• At least once [1..n] 
– No loss 
– Possible redelivery 
• At most once [0,1] 
– Possible loss 
– No redelivery 
– *Best configuration if some data loss is acceptable 
• Ordered 
– Per partition: Kafka, Kinesis 
– Global across all partitions: Hard to scale
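One common way to cope with at-least-once delivery is to make the consumer idempotent by tracking message ids. The sketch below is plain Python under that assumption. It is not a description of how Spark Streaming achieves its guarantees, and all names are hypothetical.

```python
def deliver_at_least_once(messages, lost_acks):
    """Resend any message whose acknowledgement was lost."""
    delivered = []
    for msg_id, payload in messages:
        delivered.append((msg_id, payload))      # first delivery arrives
        if msg_id in lost_acks:                  # ack lost, so the sender retries
            delivered.append((msg_id, payload))
    return delivered

def deduplicate(delivered):
    """Idempotent consumer: process each message id only once."""
    seen, processed = set(), []
    for msg_id, payload in delivered:
        if msg_id not in seen:
            seen.add(msg_id)
            processed.append(payload)
    return processed

messages = [(1, "a"), (2, "b"), (3, "c")]
delivered = deliver_at_least_once(messages, lost_acks={2})
print([msg_id for msg_id, _ in delivered])  # [1, 2, 2, 3]
print(deduplicate(delivered))               # ['a', 'b', 'c']
```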
Types of Checkpoints 
Spark 
1. Spark checkpointing of StreamingContext 
DStreams and metadata 
2. Lineage of state and window DStream 
operations 
Kinesis 
3. Kinesis Client Library (KCL) checkpoints 
current position within shard 
– Checkpoint info is stored in DynamoDB per 
Kinesis application keyed by shard
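The per-shard checkpoint idea can be sketched with an in-memory dictionary standing in for the DynamoDB table. ShardCheckpointStore, consume, and the checkpoint_every and crash_after parameters are hypothetical names for illustration and are not part of the KCL.

```python
class ShardCheckpointStore:
    """In-memory stand-in for a checkpoint table keyed by shard id."""
    def __init__(self):
        self._positions = {}

    def checkpoint(self, shard_id, sequence_number):
        self._positions[shard_id] = sequence_number

    def resume_position(self, shard_id):
        return self._positions.get(shard_id, -1)   # -1 means "start from the beginning"

def consume(shard_id, records, store, checkpoint_every=2, crash_after=None):
    """Process records after the last checkpoint, checkpointing periodically."""
    processed = []
    start = store.resume_position(shard_id) + 1
    for seq in range(start, len(records)):
        if crash_after is not None and len(processed) == crash_after:
            return processed                        # simulated crash
        processed.append(records[seq])
        if (seq + 1) % checkpoint_every == 0:
            store.checkpoint(shard_id, seq)
    return processed

store = ShardCheckpointStore()
records = ["r0", "r1", "r2", "r3", "r4"]
first_run = consume("shard-0", records, store, crash_after=3)
second_run = consume("shard-0", records, store)

print(first_run)   # ['r0', 'r1', 'r2']
print(second_run)  # ['r2', 'r3', 'r4']
```

Record r2 was processed before the crash but after the last checkpoint, so it is replayed on restart. This is the at-least-once behavior that a coarser checkpoint interval trades for lower overhead.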
Fault Tolerance 
• Points of Failure 
– Receiver 
– Driver 
– Worker/Processor 
• Possible Solutions 
– Use HDFS File Source for durability 
– Data Replication 
– Secondary/Backup Nodes 
– Checkpoints 
• Stream, Window, and State info
Streaming Receiver Failure 
• Use a backup receiver 
• Use multiple receivers pulling from multiple 
shards 
– Use checkpoint-enabled, sharded streaming 
source (e.g., Kafka and Kinesis) 
• Data is replicated to 2 nodes immediately 
upon ingestion 
– Spills to disk if it doesn’t fit in memory 
• Possible loss of most-recent batch 
• Possible at-least-once delivery of batch 
• Use buffered sources for replay 
– Kafka and Kinesis
Streaming Driver Failure 
• Use a backup Driver 
– Use DStream metadata checkpoint info to 
recover 
• Single point of failure – interrupts stream 
processing 
• Streaming Driver is a long-running Spark 
application 
– Schedules long-running stream receivers 
• State and Window RDD checkpoints to 
HDFS to help avoid data loss (mostly)
Stream Worker/Processor Failure 
• No problem! 
• DStream RDD partitions will be recalculated from lineage 
• Causes blip in processing during node failover
Spark Streaming 
Monitoring and Tuning
Monitoring 
• Monitor driver, receiver, worker nodes, and 
streams 
• Alert upon failure or unusually high latency 
• Spark Web UI 
– Streaming tab 
• Ganglia, CloudWatch 
• StreamingListener callback
Spark Web UI
Tuning 
• Batch interval 
– High: reduce overhead of submitting new tasks for each batch 
– Low: keeps latencies low 
– Sweet spot: DStream job time (scheduling + processing) is 
steady and less than batch interval 
• Checkpoint interval 
– High: reduces checkpointing overhead 
– Low: reduce amount of data loss on failure 
– Recommendation: 5-10x sliding window interval 
• Use DStream.repartition() to increase parallelism of processing 
DStream jobs across cluster 
• Use spark.streaming.unpersist=true to let the Streaming Framework 
figure out when to unpersist 
• Use CMS GC for consistent processing times
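The "sweet spot" rule for the batch interval can be checked with a small queueing model in plain Python. The scheduling_delays function and the sample timings are hypothetical. The model assumes batches are processed one at a time, in order.

```python
def scheduling_delays(processing_times, batch_interval):
    """Return the queueing delay each batch sees before it starts."""
    delays, backlog = [], 0.0
    for processing_time in processing_times:
        delays.append(backlog)
        # Backlog grows when a batch takes longer than the interval
        # and drains (never below zero) when it finishes early.
        backlog = max(0.0, backlog + processing_time - batch_interval)
    return delays

steady = scheduling_delays([0.8, 0.9, 0.7, 0.8], batch_interval=1.0)
falling_behind = scheduling_delays([1.5, 1.5, 1.5, 1.5], batch_interval=1.0)

print(steady)          # [0.0, 0.0, 0.0, 0.0]
print(falling_behind)  # [0.0, 0.5, 1.0, 1.5]
```

When processing time stays below the batch interval the delay stays at zero. When it exceeds the interval the delay grows without bound, which is the signal to lengthen the interval or add parallelism.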
Lambda Architecture
Lambda Architecture Overview 
• Batch Layer 
– Immutable, 
Batch read, 
Append-only write 
– Source of truth 
– e.g., HDFS 
• Speed Layer 
– Mutable, 
Random read/write 
– Most complex 
– Recent data only 
– e.g., Cassandra 
• Serving Layer 
– Immutable, 
Random read, 
Batch write 
– e.g., ElephantDB
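At query time, the layers above are typically combined: the precomputed batch view answers for older data and the speed layer fills in recent events. A minimal plain-Python sketch, assuming simple additive counts and hypothetical page-view data:

```python
def query(key, batch_view, speed_view):
    """Serve a count by combining the batch view with recent data."""
    return batch_view.get(key, 0) + speed_view.get(key, 0)

batch_view = {"page_a": 1000, "page_b": 250}   # recomputed periodically from the source of truth
speed_view = {"page_a": 7, "page_c": 3}        # covers only events since the last batch run

print(query("page_a", batch_view, speed_view))  # 1007
print(query("page_c", batch_view, speed_view))  # 3
```

After each batch run the speed view can be discarded for the period the batch now covers, which keeps the mutable, more complex layer small.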
Spark + AWS + Lambda
Spark + Lambda + GraphX + MLlib
Demo! 
• Load JSON 
• Convert to Parquet file 
• Save Parquet file to disk 
• Query Parquet file directly
Approximations
Approximation Overview 
• Required for scaling 
• Speed up analysis of large datasets 
• Reduce size of working dataset 
• Data is messy 
• Collection of data is messy 
• Exact isn’t always necessary 
• “Approximate is the new Exact”
Some Approximation Methods 
• Approximate time-bound queries 
– BlinkDB 
• Bernoulli and Poisson Sampling 
– RDD: sample(), takeSample() 
• HyperLogLog 
– PairRDD: countApproxDistinctByKey() 
• Count-min Sketch 
• Spark Streaming and Twitter Algebird 
• Bloom Filters 
– Everywhere!
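As an example of one of these structures, here is a minimal Bloom filter in plain Python. The sizes, the SHA-256 based hashing scheme, and the class itself are illustrative assumptions rather than any library's implementation. It can report false positives but never false negatives.

```python
import hashlib

class BloomFilter:
    def __init__(self, size_bits=1024, num_hashes=3):
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = 0                      # Python int used as a bit array

    def _positions(self, item):
        # Derive num_hashes bit positions by salting one hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).hexdigest()
            yield int(digest, 16) % self.size_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def might_contain(self, item):
        # True means "possibly present"; False means "definitely absent".
        return all((self.bits >> pos) & 1 for pos in self._positions(item))


seen_users = BloomFilter()
for user in ["alice", "bob", "carol"]:
    seen_users.add(user)

print(seen_users.might_contain("alice"))  # True
```

The filter uses a fixed 1024 bits no matter how many items are added. The cost is a false-positive rate that rises as the bit array fills up.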
Approximations In Action 
Figure: Memory Savings with Approximation Techniques 
(http://highlyscalable.wordpress.com/2012/05/01/probabilistic-structures-web-analytics-data-mining/)
Spark Statistics Library 
• Correlations 
– Dependence between 2 random variables 
– Pearson, Spearman 
• Hypothesis Testing 
– Measure of statistical significance 
– Chi-squared test 
• Stratified Sampling 
– Sample separately from different sub-populations 
– Bernoulli and Poisson sampling 
– With and without replacement 
• Random data generator 
– Uniform, standard normal, and Poisson distribution
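For reference, the Pearson correlation listed above can be computed by hand in a few lines of plain Python. This is a textbook formula sketch with made-up data, not the Spark statistics API.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

heights = [1, 2, 3, 4]
print(pearson(heights, [2, 4, 6, 8]))  # 1.0
print(pearson(heights, [8, 6, 4, 2]))  # -1.0
```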
Summary 
• Spark, Spark Streaming Overview 
• Use Cases 
• API and Libraries 
• Machine Learning 
• Graph Processing 
• Execution Model 
• Fault Tolerance 
• Cluster Deployment 
• Monitoring 
• Scaling and Tuning 
• Lambda Architecture 
• Probabilistic Data Structures 
• Approximations 
http://effectivespark.com http://sparkinaction.com 
Thanks!! 
Chris Fregly 
@cfregly
