TechLabs by
Discovering Machine Learning, Redis, and Spark
Maturin BADO
@mccstanmbg
github.com/mccstan
SPARK
Spark : Introduction
Outline
❏ Data processing today
❏ Spark, Hadoop, MapReduce
❏ Spark ecosystem
❏ Spark basics
Data processing today
Data-intensive applications
Definition:
“We call an application data-intensive if data is its primary challenge—the
quantity of data, the complexity of data, or the speed at which it is changing—as
opposed to compute-intensive, where CPU cycles are the bottleneck.”
Martin Kleppmann
Data processing today
Today's applications need to:
❏ Store data (databases)
❏ Cache the results of expensive operations (caches)
❏ Search data (search indexes)
❏ Handle messages asynchronously (stream processing)
❏ Periodically process accumulated data (batch processing)
Spark, Hadoop, MapReduce
Spark : main differences from MapReduce
❏ Spark keeps as much of the working dataset in memory as it can
❏ It implements caching mechanisms that reduce reads from disk
❏ It is often much faster than MapReduce, partly thanks to its job scheduling
❏ It does not implement its own distributed storage layer, but
can run on top of Hadoop clusters (HDFS)
Spark ecosystem : open source
Spark ecosystem : features
Spark ecosystem : deployment
Spark basics : RDD
RDD : Resilient Distributed Dataset
❏ Primary Spark abstraction
❏ Fault-tolerant collection of elements
❏ Partitioned and immutable
❏ Two types of operations: transformations and actions
❏ Transformations are lazy: nothing is computed until an action is called
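The RDD properties listed above can be illustrated with a toy model. This is a minimal sketch in plain Python, not the Spark API: the `MiniRDD` class and its methods are hypothetical names made up for this illustration. It only aims to show immutability, partitioning, and the difference between lazy transformations and actions.

```python
class MiniRDD:
    """Toy stand-in for an RDD (hypothetical, not Spark code)."""

    def __init__(self, partitions, ops=()):
        # Partitions are stored as tuples so they are never mutated.
        self._partitions = tuple(tuple(p) for p in partitions)
        # Recorded chain of transformations, applied only on demand.
        self._ops = tuple(ops)

    # Transformations: return a new MiniRDD and compute nothing.
    def map(self, fn):
        return MiniRDD(self._partitions, self._ops + (("map", fn),))

    def filter(self, pred):
        return MiniRDD(self._partitions, self._ops + (("filter", pred),))

    def _compute(self, partition):
        # Replay the recorded transformations on one partition.
        items = iter(partition)
        for kind, fn in self._ops:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return list(items)

    # Actions: trigger the computation, partition by partition.
    def collect(self):
        return [x for p in self._partitions for x in self._compute(p)]

    def count(self):
        return sum(len(self._compute(p)) for p in self._partitions)


calls = []


def square(x):
    calls.append(x)  # records when the function really runs
    return x * x


rdd = MiniRDD([[1, 2, 3], [4, 5, 6]])           # two partitions
squares = rdd.map(square).filter(lambda x: x % 2 == 0)
calls_before_action = len(calls)                # still 0: nothing ran yet
result = squares.collect()                      # the action runs the chain
```

Because each `MiniRDD` keeps its source partitions and the chain of operations, any partition can be recomputed from scratch, which is the idea behind recovering lost data in this toy model.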
Spark basics : An execution flow
Spark Streaming
Outline
❏ Why In-stream processing ?
❏ Runtime and Programming Model
❏ Spark Streaming : Overview
❏ Benefits of Discretized Stream Processing
❏ Processing flow
❏ Transform operations
❏ Window operations
Why In-stream processing ?
Runtime and Programming Model
Native Streaming
Runtime and Programming Model
Micro-batch Streaming
Spark Streaming : Overview
Benefits of Discretized Stream Processing
Dynamic load balancing
Benefits of Discretized Stream Processing
Fast failure and straggler recovery
Benefits of Discretized Stream Processing
❏ Unification of batch, streaming and interactive analytics
❏ Advanced analytics like machine learning and interactive SQL
❏ Streaming + SQL and DataFrames
❏ Streaming + MLlib
Spark Streaming : Processing flow
Spark Streaming : DStreams
Discretized Streams (DStreams) :
❏ The basic Spark Streaming abstraction
❏ A continuous series of RDDs
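To make "a continuous series of RDDs" concrete, here is a minimal sketch in plain Python, not the Spark API. The `discretize` function is a hypothetical helper; it assumes events carry a non-negative timestamp in seconds and that each resulting list stands in for one RDD.

```python
from collections import defaultdict


def discretize(events, batch_interval):
    """Group (timestamp, value) events into consecutive batches.

    Each batch plays the role of one RDD. Intervals without data still
    produce an empty batch, so the series stays continuous.
    """
    if not events:
        return []
    buckets = defaultdict(list)
    for ts, value in sorted(events, key=lambda e: e[0]):
        buckets[int(ts // batch_interval)].append(value)
    last = max(buckets)
    return [buckets.get(i, []) for i in range(last + 1)]


events = [(0.2, "a"), (0.9, "b"), (1.5, "c"), (3.1, "d")]
batches = discretize(events, batch_interval=1)
```

With a one-second batch interval, the four events land in four batches, the third of which is empty.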
Spark Streaming : Transformations
Transform Operations : Any operation applied on a DStream translates
to operations on the underlying RDDs
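The sentence above can be sketched as follows. This is plain Python rather than Spark code: a DStream is modelled as a list of batches, and `transform` and `word_count` are hypothetical helper names chosen for the illustration.

```python
from collections import Counter


def transform(dstream, batch_fn):
    """Apply the same function to every batch of the stream."""
    return [batch_fn(batch) for batch in dstream]


def word_count(lines):
    """Count words in one batch of text lines."""
    return dict(Counter(word for line in lines for word in line.split()))


# Three batches of lines; the second interval received no data.
dstream = [["spark streaming", "spark"], [], ["redis spark"]]
counts = transform(dstream, word_count)
```

Each batch is processed independently, so the counts of one batch never leak into the next. This is what makes the operation stateless.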
Spark Streaming : Transformations
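Window Operations : transformations applied over a sliding window of data, combining the RDDs of several consecutive batches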
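Spark Streaming : Time abstractions
❏ Batch interval: how often incoming data is cut into a new batch (one RDD)
❏ Window size: how much recent data a window operation covers
❏ Sliding interval: how often the window operation is computed
Window size and sliding interval must both be multiples of the batch interval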
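How the three durations interact can be sketched in plain Python. This is not the Spark API: `windowed` is a hypothetical helper, a stream is modelled as a list of batches, and all durations are assumed to be whole numbers in the same unit.

```python
def windowed(batches, window_size, slide_interval, batch_interval=1):
    """Return one merged window every `slide_interval`.

    Each window covers the last `window_size` worth of batches. Early
    windows are shorter because less data has arrived yet.
    """
    if window_size % batch_interval or slide_interval % batch_interval:
        raise ValueError("window and slide must be multiples of the batch interval")
    per_window = window_size // batch_interval   # batches per window
    step = slide_interval // batch_interval      # batches between windows
    windows = []
    for end in range(step, len(batches) + 1, step):
        start = max(0, end - per_window)
        windows.append([x for batch in batches[start:end] for x in batch])
    return windows


batches = [[1], [2], [3], [4], [5], [6]]
windows = windowed(batches, window_size=3, slide_interval=2)
```

Here a window of 3 batches slides by 2 batches, so consecutive windows overlap by one batch.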
Spark Streaming : Some examples
❏ Word count
  ❏ Stateless operation: counts words in every batch
❏ Basic error count
  ❏ Stateless operation using a filter: contains("ERROR")
❏ Cumulative error count
  ❏ Stateful operation: errors since the beginning of processing
❏ Windowed error count
  ❏ Stateful operation: errors within the sliding time window
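The difference between the stateless and stateful error counts can be sketched in plain Python. This is not the code from the repositories and not the Spark API: the function names are hypothetical, a stream is modelled as a list of batches of log lines, and the window is expressed as a number of batches sliding by one batch.

```python
def is_error(line):
    return "ERROR" in line


def basic_error_counts(batches):
    """Stateless: each batch is counted on its own."""
    return [sum(1 for line in batch if is_error(line)) for batch in batches]


def cumulative_error_counts(batches):
    """Stateful: a running total is carried from batch to batch."""
    total, totals = 0, []
    for count in basic_error_counts(batches):
        total += count
        totals.append(total)
    return totals


def windowed_error_counts(batches, window_batches):
    """Stateful: errors in the last `window_batches` batches, sliding by one."""
    counts = basic_error_counts(batches)
    return [sum(counts[max(0, i + 1 - window_batches):i + 1])
            for i in range(len(counts))]


logs = [
    ["ERROR disk full", "INFO ok"],
    ["INFO ok"],
    ["ERROR network", "ERROR cpu"],
]
basic = basic_error_counts(logs)
cumulative = cumulative_error_counts(logs)
windowed_counts = windowed_error_counts(logs, window_batches=2)
```

The stateless count forgets everything after each batch, while the two stateful variants need results from earlier batches to produce their output.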
The git repo
https://github.com/SoatGroup/spark-streaming-java-examples
https://github.com/SoatGroup/spark-streaming-python

Discover Spark and Spark Streaming