8/21/2015
Jack Gudenkauf
VP Big Data
scala> sc.parallelize(List("Kafka Spark Vertica"), 3).mapPartitions(iter => { iter.toList.map(x=>print(x)) }.iterator).collect; println()
https://twitter.com/_JG
PLAYTIKA
• Founded in 2010
• Social Casino global category leader
• 10 games
• 13 platforms
• 1000+ employees
3© Cloudera, Inc. All rights reserved.
Hari Shreedharan, Software Engineer @ Cloudera
Committer/PMC Member, Apache Flume
Committer, Apache Sqoop
Contributor, Apache Spark
Author, Using Flume (O’Reilly)
Spark + Kafka:
Future of Streaming Processing
Motivation for Real-Time Stream Processing
Data is being created at unprecedented rates
• Exponential data growth from mobile, web, social
• Connected devices: 9B in 2012 to 50B by 2020
• Over 1 trillion sensors by 2020
• Datacenter IP traffic growing at CAGR of 25%
How can we harness this data in real time?
• Value can quickly degrade → capture value immediately
• From reactive analysis to direct operational impact
• Unlocks new competitive advantages
• Requires a completely new approach...
From Volume and Variety to Velocity
Big Data has evolved
• Past: Big-Data = Volume + Variety
• Present: Big-Data = Volume + Variety + Velocity
Hadoop Ecosystem evolves as well…
• Past: Batch Processing, Time to Insight of Hours
• Present: Batch + Stream Processing, Time to Insight of Seconds
Key Components of Streaming Architectures
• Data Ingestion & Transportation Service (Kafka, Flume)
• Real-Time Stream Processing Engine
• Real-Time Data Serving
• System Management
• Security
• Data Management & Integration
Canonical Stream Processing Architecture
[Diagram: Data Sources, Data Ingest (Kafka, Flume), Kafka, App 1, App 2, ..., HDFS, HBase]
Spark: Easy and Fast Big Data
• Easy to Develop (2-5× less code)
  • Rich APIs in Java, Scala, Python
  • Interactive shell
• Fast to Run (up to 10× faster on disk, 100× in memory)
  • General execution graphs
  • In-memory storage
Spark Architecture
[Diagram: one Driver and three Workers, each Worker holding Data in RAM]
RDDs
RDD = Resilient Distributed Dataset
• Immutable representation of data
• Operations on one RDD create a new one
• Memory caching layer that stores data in a distributed, fault-tolerant cache
• Created by parallel transformations on data in stable storage
• Lazy materialization
Two observations:
a. Can fall back to disk when data-set does not fit in memory
b. Provides fault-tolerance through concept of lineage
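The immutability, lazy materialization, and lineage points above can be illustrated without Spark itself. What follows is a minimal plain-Scala sketch under stated assumptions, not Spark's implementation: `MiniRDD`, `source`, and `LineageSketch` are hypothetical names invented for this illustration.

```scala
// Toy model of an RDD: immutable, lazily computed, rebuilt from its lineage.
// MiniRDD is a hypothetical illustration and not part of the Spark API.
object LineageSketch {
  final class MiniRDD[A](val lineage: List[String], compute: () => Seq[A]) {
    // Transformations return a NEW MiniRDD and only record what to do.
    def map[B](f: A => B): MiniRDD[B] =
      new MiniRDD[B](lineage :+ "map", () => compute().map(f))

    def filter(p: A => Boolean): MiniRDD[A] =
      new MiniRDD[A](lineage :+ "filter", () => compute().filter(p))

    // An action triggers the computation by replaying the recorded steps.
    def collect(): Seq[A] = compute()
  }

  // Stands in for data read from stable storage.
  def source[A](data: Seq[A]): MiniRDD[A] =
    new MiniRDD[A](List("source"), () => data)

  def main(args: Array[String]): Unit = {
    val errors = source(Seq("INFO ok", "ERROR disk", "ERROR net"))
      .filter(_.startsWith("ERROR"))
      .map(_.toUpperCase)
    println(errors.lineage.mkString(" -> "))
    println(errors.collect().mkString(", "))
  }
}
```

Because nothing is stored until `collect()` runs, a lost result can always be recomputed from the source by replaying the same steps, which is the idea behind lineage-based fault tolerance.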
Spark Streaming
Extension of Apache Spark’s Core API, for Stream Processing.
The framework provides:
• Fault tolerance
• Scalability
• High throughput
Spark Streaming
• Incoming data represented as Discretized Streams (DStreams)
• Stream is broken down into micro-batches
• Each micro-batch is an RDD – can share code between batch and streaming
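The discretization step can be sketched in plain Scala. This is only a model under stated assumptions, not Spark's DStream code: `Event`, `discretize`, and `countErrors` are hypothetical names, and ordinary `Seq`s stand in for RDDs.

```scala
// Plain-Scala model of cutting a stream into micro-batches.
// Event, discretize and countErrors are hypothetical names, not Spark API.
object MicroBatchSketch {
  final case class Event(timestampMs: Long, payload: String)

  // Group events into consecutive batches of `batchMs` milliseconds.
  // Returns (batch start time, payloads). Intervals with no events are
  // simply absent in this sketch.
  def discretize(events: Seq[Event], batchMs: Long): Seq[(Long, Seq[String])] =
    events
      .groupBy(e => e.timestampMs / batchMs)
      .toSeq
      .sortBy(_._1)
      .map { case (idx, es) =>
        (idx * batchMs, es.sortBy(_.timestampMs).map(_.payload))
      }

  // Written once against a plain collection, usable per batch or on all data.
  def countErrors(lines: Seq[String]): Int = lines.count(_.contains("ERROR"))

  def main(args: Array[String]): Unit = {
    val events = Seq(
      Event(0L, "INFO start"), Event(999L, "ERROR disk"),
      Event(1000L, "ERROR net"), Event(2500L, "INFO done"))
    discretize(events, 1000L).foreach { case (start, lines) =>
      println(s"batch @ $start ms: ${countErrors(lines)} error(s)")
    }
  }
}
```

The same `countErrors` function runs unchanged on a single micro-batch or on the whole data set, which is the code-sharing property the last bullet describes.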
“Micro-batch” Architecture
val tweets = ssc.twitterStream()
val hashTags = tweets.flatMap(status => getTags(status))
hashTags.saveAsHadoopFiles("hdfs://...")
[Diagram: the tweets DStream is split into batches @ t, t+1, t+2; flatMap is applied to each batch to produce the hashTags DStream, and each resulting batch is saved]
Stream composed of small (1-10s) batch computations
Use DStreams for Windowing Functions
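A windowed computation over micro-batches can be modelled in a few lines of plain Scala. This is a sketch under stated assumptions rather than Spark code: `windowedCounts` is a hypothetical helper, and both window length and slide are expressed in numbers of batches instead of durations. In Spark Streaming itself, the corresponding DStream operations include `window(windowLength, slideInterval)` and `reduceByKeyAndWindow`.

```scala
// Plain-Scala model of a sliding window over micro-batches.
// windowedCounts is a hypothetical helper, not part of the Spark API.
object WindowSketch {
  // Every `slide` batches, count the records in the most recent
  // `windowLen` batches. Early windows may cover fewer batches.
  def windowedCounts(batches: Seq[Seq[String]], windowLen: Int, slide: Int): Seq[Int] =
    (slide to batches.length by slide).map { end =>
      batches.slice(math.max(0, end - windowLen), end).map(_.size).sum
    }

  def main(args: Array[String]): Unit = {
    val batches = Seq(Seq("a"), Seq("b", "c"), Seq("d", "e", "f"), Seq("g", "h", "i", "j"))
    // Window of 3 batches, evaluated after every batch.
    println(windowedCounts(batches, windowLen = 3, slide = 1).mkString(", "))
  }
}
```

Keeping the window a multiple of the batch interval means each window is just a union of existing batches, so no record ever has to be split between windows.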
Spark Streaming
• Runs as a Spark job
• YARN or standalone for scheduling
• YARN has KDC integration
• Use the same code for real-time Spark Streaming and for batch Spark jobs.
• Integrates natively with messaging systems such as Flume, Kafka, ZeroMQ…
• Easy to write “Receivers” for custom messaging systems.
Sharing Code between Batch and Streaming
Library that filters "ERROR" lines:
def filterErrors(rdd: RDD[String]): RDD[String] = {
  rdd.filter(s => s.contains("ERROR"))
}
• Streaming generates RDDs periodically
• Any code that operates on RDDs can therefore be used in streaming as well
Sharing Code between Batch and Streaming
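Spark:
val lines = sc.textFile(…)
val filtered = filterErrors(lines)
filtered.saveAsTextFile(...)
Spark Streaming:
val dStream = FlumeUtils.createStream(ssc, "34.23.46.22", 4435)
// Decode each Flume event body to a String so filterErrors can be reused
val eventLines = dStream.map(e => new String(e.event.getBody.array(), "UTF-8"))
// transform returns a new DStream; foreachRDD returns Unit and cannot be assigned
val filtered = eventLines.transform((rdd: RDD[String]) => filterErrors(rdd))
filtered.saveAsTextFiles(…)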
Reliability
• Received data is automatically persisted to a Write Ahead Log (WAL) on HDFS to prevent data loss
• Set spark.streaming.receiver.writeAheadLog.enable=true in the Spark conf
• When the YARN Application Master (AM) dies, the application is restarted by YARN
• Received, ack-ed, and unprocessed data is replayed from the WAL (data that made it into blocks)
• Reliable Receivers can replay data from the original source, if required
• Un-acked data replayed from source.
• Kafka, Flume receivers bundled with Spark are examples
• Reliable Receivers + WAL = No data loss on driver or receiver failure!
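The interplay between a reliable receiver and the write ahead log can be modelled in plain Scala. This is a deliberately simplified sketch under stated assumptions, not Spark's receiver code: `Source`, `Receiver`, and their methods are hypothetical names, and an in-memory `Vector` stands in for the log on HDFS.

```scala
// Plain-Scala model of "log first, ack second" plus recovery.
// Source and Receiver are hypothetical classes, not Spark API.
object WalSketch {
  // A replayable source that tracks how many messages were acknowledged.
  final class Source(messages: Vector[String]) {
    private var acked = 0
    def unacked: Vector[String] = messages.drop(acked)
    def ack(n: Int): Unit = acked += n
  }

  final class Receiver(source: Source) {
    var wal: Vector[String] = Vector.empty // stands in for the log on HDFS
    private var processed = 0              // log entries already processed

    // Append to the log first, and acknowledge only after that write.
    def receive(n: Int): Unit = {
      val block = source.unacked.take(n)
      wal = wal ++ block
      source.ack(block.size)
    }

    // Process up to n logged entries and remember how far we got.
    def process(n: Int): Vector[String] = {
      val out = wal.slice(processed, processed + n)
      processed += out.size
      out
    }

    // After a failure: logged but unprocessed data comes from the log,
    // anything never acknowledged is fetched again from the source.
    def recover(): Vector[String] = wal.drop(processed) ++ source.unacked
  }

  def main(args: Array[String]): Unit = {
    val source = new Source(Vector("a", "b", "c", "d", "e"))
    val receiver = new Receiver(source)
    receiver.receive(3)
    receiver.process(1)
    println(receiver.recover().mkString(", "))
  }
}
```

Acknowledging only after the log write is what closes the gap: every message is either in the log or still un-acked at the source, so it is covered by exactly one of the two replay paths.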
Reliable Kafka DStream
• Stores received data to Write Ahead Log on HDFS for replay – no data loss!
• Stable and supported!
• Uses a reliable receiver to pull data from Kafka
• Application-controlled parallelism
• Create as many receivers as you want to parallelize
• Remember: each receiver is a task and holds one executor hostage; no processing happens on that executor.
• Tricky to do this efficiently, and so is controlling ordering (everything needs to be done explicitly)
Reliable Kafka DStream - Issues
• Kafka can replay messages if processing failed for some reason
• So the WAL is overkill: it causes an unnecessary performance hit
• In addition, the Reliable Stream causes a lot of network traffic due to unneeded HDFS writes etc.
• Receivers hold executors hostage that could otherwise be used for processing
• How can we solve these issues?
Direct Kafka DStream
• No long-running receiver = no executor hogging!
• Communicates with Kafka via the “low-level API”
• 1 Spark partition per Kafka partition (one-to-one mapping)
• At the end of every batch interval, the batch to process is:
• The first message after the last batch up to the current latest message in each partition
• If a max rate is configured, then rate × batch interval messages are downloaded & processed
• Checkpoint contains the starting and ending offset in the current RDD
• Recovering from checkpoint is simple: last offset + 1 is the least offset of the next batch
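The per-batch offset bookkeeping described above can be sketched in plain Scala. This is a model under stated assumptions, not the Spark Kafka integration: `nextBatch` and its parameters are hypothetical, and this `OffsetRange` is a local case class that only borrows the convention of an inclusive `fromOffset` and exclusive `untilOffset`. With an exclusive end, the next batch starts at the previous `untilOffset`, which is the same rule as "last offset + 1".

```scala
// Plain-Scala model of choosing offset ranges for one batch.
// nextBatch is a hypothetical helper; OffsetRange here is a local case class.
object OffsetRangeSketch {
  // fromOffset is inclusive, untilOffset is exclusive.
  final case class OffsetRange(partition: Int, fromOffset: Long, untilOffset: Long)

  def nextBatch(
      lastUntil: Map[Int, Long],     // exclusive end of the previous batch, per partition
      latest: Map[Int, Long],        // exclusive end of available data, per partition
      maxPerPartition: Option[Long]  // rate × batch interval, if a max rate is set
  ): Seq[OffsetRange] =
    latest.toSeq.sortBy(_._1).map { case (partition, end) =>
      val from = lastUntil.getOrElse(partition, 0L)
      // Without a cap, read everything available; with a cap, stop early.
      val until = maxPerPartition.fold(end)(m => math.min(end, from + m))
      OffsetRange(partition, from, until)
    }

  def main(args: Array[String]): Unit = {
    val checkpointed = Map(0 -> 100L, 1 -> 50L)
    val available = Map(0 -> 130L, 1 -> 50L)
    nextBatch(checkpointed, available, Some(20L)).foreach(println)
  }
}
```

Since a batch is fully described by these ranges, persisting them is enough to re-read exactly the same messages after a restart; no receiver-side copy of the data is needed.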
Direct Kafka DStream
• (Almost) Exactly once processing
• At the end of each interval, the RDD can provide information about the starting and ending offset
• These offsets can be persisted, so even on failure – recover from there
• Edge cases are possible and can cause duplicates
• Failure in the middle of HDFS writes -> duplicates!
• Failure after processing but before offsets getting persisted -> duplicates!
• More likely!
• Writes to Kafka can also cause duplicates, as can reads from Kafka
• Fix: Your app should really be resilient to duplicates
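One common way to make an application resilient to duplicates is to make its writes idempotent, for example by keying each output row on the message's partition and offset. The sketch below shows that idea in plain Scala under stated assumptions: `Record` and `Sink` are hypothetical, and an in-memory map stands in for a real keyed store. It illustrates one possible approach, not something prescribed by the slides.

```scala
// Plain-Scala model of an idempotent sink keyed by (partition, offset).
// Record and Sink are hypothetical names, not Spark or Kafka API.
object IdempotentSinkSketch {
  final case class Record(partition: Int, offset: Long, value: String)

  final class Sink {
    private val store = scala.collection.mutable.Map.empty[(Int, Long), String]

    // Keyed upsert: writing the same (partition, offset) again overwrites
    // the existing row instead of adding a second one.
    def write(batch: Seq[Record]): Unit =
      batch.foreach(r => store.update((r.partition, r.offset), r.value))

    def size: Int = store.size
    def values: Seq[String] = store.toSeq.sortBy(_._1).map(_._2)
  }

  def main(args: Array[String]): Unit = {
    val batch = Seq(Record(0, 100L, "a"), Record(0, 101L, "b"))
    val sink = new Sink
    sink.write(batch)
    sink.write(batch) // replay after a failure before offsets were persisted
    println(s"rows stored: ${sink.size}")
  }
}
```

With this kind of sink, replaying a batch after a failure changes nothing, so at-least-once delivery yields the same stored result as exactly-once would.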
Spark Streaming Use-Cases
• Real-time dashboards
• Show approximate results in real-time
• Reconcile periodically with source-of-truth using Spark
• Joins of multiple streams
• Time-based or count-based “windows”
• Combine multiple sources of input to produce composite data
• Re-use RDDs created by Streaming in other Spark jobs.
What is coming?
• Better Monitoring and alerting
• Batch-level and task-level monitoring
• SQL on Streaming
• Run SQL-like queries on top of Streaming (medium – long term)
• Python!
• Limited support already available, but more detailed support coming
• ML
• More real-time ML algorithms
Current Spark project status
• 400+ contributors and 50+ companies contributing
• Includes: Databricks, Cloudera, Intel, Huawei, Yahoo! etc
• Dozens of production deployments
• Spark Streaming Survived Netflix Chaos Monkey – production ready!
• Included in CDH!
More Info
• CDH Docs: http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH5/latest/CDH5-Installation-Guide/cdh5ig_spark_installation.html
• Cloudera Blog: http://blog.cloudera.com/blog/category/spark/
• Apache Spark homepage: http://spark.apache.org/
• Github: https://github.com/apache/spark
Thank you
hshreedharan@cloudera.com
@harisr1234

Spark Streaming & Kafka: The Future of Stream Processing, by Hari Shreedharan of Cloudera
