© 2016 MapR Technologies
Lambda Architecture: The Best Way to Build
Scalable and Reliable Applications!
{“about” : “me”}
Tugdual “Tug” Grall
• MapR
• Technical Evangelist
• MongoDB
• Technical Evangelist
• Couchbase
• Technical Evangelist
• eXo
• CTO
• Oracle
• Developer/Product Manager
• Mainly Java/SOA
• Developer in consulting firms
• Web
• @tgrall
• http://tgrall.github.io
• tgrall
• NantesJUG co-founder
• Pet Project :
• http://www.resultri.com
• tug@mapr.com
• tugdual@gmail.com
Big Data & Hadoop
In Production
Data Warehouse Optimization
Data Hub
Choose the best “connector”:
• File
• Sqoop
• ETL
• …
Use the aggregated data
• In your applications
• To update other systems
• As an Open Data API
• …
[Diagram labels: Customer DB, Customer DB, Logs, …, Hadoop, NoSQL]
Financial Services
[Diagram labels: Fraud detection; Personalized offers; Fraud investigation tool; Fraud investigator; Fraud model; Recommendations table; Clickstream analysis; Online transactions; MapR Distribution for Hadoop; Analytics; Real-time Operational Applications; Interactive marketer]
Fault Tolerance
Fault Tolerance
hardware
software
developer
?
Human fault tolerance
Lambda Architecture
To the rescue
λ
A little bit of history…
• Defined by Nathan Marz
• Ex-BackType, Twitter
• Now in a new startup
• Creator of:
– Storm
– Cascalog
– ElephantDB
Lambda Architecture Requirements
• Fault-tolerant against both hardware failures and human errors
• Supports a variety of use cases, including low-latency querying as well as updates
• Linear scale-out capabilities
• Extensible, so that the system is manageable and can accommodate newer features easily
Lambda Architecture
[Diagram: the NEW DATA STREAM feeds both the BATCH LAYER (IMMUTABLE MASTER DATA; PRECOMPUTE VIEWS via BATCH RECOMPUTE) and the SPEED LAYER (PROCESS STREAM; INCREMENT VIEWS). The SERVING LAYER exposes the BATCH VIEWS and the REAL-TIME VIEWS (View 1, View 2, … View N), and a QUERY is answered by a MERGE of both.]
Data Ingestion
All data entering the system are dispatched to both
• the batch layer
• the speed layer
[Diagram: NEW DATA STREAM → BATCH LAYER and SPEED LAYER]
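The dispatch step can be sketched in a few lines of plain Python. This is an illustration only, not code from the talk: `ingest`, `master_log` and `speed_queue` are hypothetical names, and in-memory containers stand in for the distributed file system and the message stream.

```python
from collections import deque


def ingest(event, master_log, speed_queue):
    """Dispatch one incoming event to both layers.

    master_log  -- stand-in for the batch layer's append-only master dataset
    speed_queue -- stand-in for the stream consumed by the speed layer
    """
    master_log.append(event)   # batch layer: keep the raw event
    speed_queue.append(event)  # speed layer: process it as soon as possible


master_log = []
speed_queue = deque()
for event in [{"flight": "EY123", "action": "take-off"},
              {"flight": "LH17", "action": "landing"}]:
    ingest(event, master_log, speed_queue)
```

Both layers receive the same events; they differ only in when and how they process them.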
Batch Layer
• Manages the master dataset: an immutable, append-only set of raw data
• Precomputes arbitrary query functions over that data; the results are called batch views
[Diagram: BATCH LAYER: IMMUTABLE MASTER DATA → PRECOMPUTE VIEWS (BATCH RECOMPUTE) → BATCH VIEWS (View 1, View 2, … View N)]
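A minimal sketch of these two responsibilities, assuming an in-memory list stands in for the master dataset. `MasterDataset` and `precompute_view` are hypothetical names chosen for this illustration; they are not part of Hadoop or Spark.

```python
from collections import Counter


class MasterDataset:
    """Append-only store: records can be added and read, never changed."""

    def __init__(self):
        self._records = []

    def append(self, record):
        self._records.append(tuple(record))  # tuples keep each record immutable

    def scan(self):
        return iter(self._records)           # read-only pass over all the data


def precompute_view(master, key_fn):
    """Batch view = a function of ALL the data, rebuilt from scratch on each run."""
    return Counter(key_fn(record) for record in master.scan())


master = MasterDataset()
master.append(("2016-02-04T10:00:00", "MUC", "EY123", "take-off"))
master.append(("2016-02-04T10:09:00", "LHR", "LH17", "landing"))
master.append(("2016-02-04T10:10:00", "FCO", "AZ501", "take-off"))

events_per_action = precompute_view(master, key_fn=lambda r: r[3])
```

Because the view is rebuilt from the raw records every time, it never depends on its own previous state.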
Speed Layer
[Diagram: SPEED LAYER: PROCESS STREAM → INCREMENT VIEWS → REAL-TIME VIEWS (View 1, View 2, … View N)]
• The speed layer serves requests that are subject to low-latency requirements.
• It uses fast, incremental algorithms and deals with recent data only.
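The incremental style can be sketched as follows. `RealtimeView` is a hypothetical class for illustration, and the `discard` step is a simplification that assumes a batch run has absorbed every event seen so far.

```python
from collections import Counter


class RealtimeView:
    """Incrementally maintained view over recent events only."""

    def __init__(self):
        self.counts = Counter()

    def update(self, event):
        # Constant-time increment: no rescan of historical data
        self.counts[event["airport"]] += 1

    def discard(self):
        # Simplification: called once a batch run has absorbed these events
        self.counts.clear()


view = RealtimeView()
for event in [{"airport": "MUC"}, {"airport": "LHR"}, {"airport": "MUC"}]:
    view.update(event)
```

Each event touches one counter, which is what keeps the latency low compared with a full recomputation.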
Serving Layer
[Diagram: SERVING LAYER: a QUERY is answered by a MERGE of the BATCH VIEWS and the REAL-TIME VIEWS (View 1, View 2, … View N)]
• The serving layer indexes batch views so that they can be queried ad hoc with low latency.
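For a count-style view, the merge can be as simple as adding the two partial results. The sketch below assumes both views are plain dictionaries; `query` is a hypothetical helper, the AMS and CDG batch values are taken from the airport-load example in this deck, and the real-time increments are made up for illustration.

```python
def query(key, batch_view, realtime_view):
    """Merge the precomputed batch view with the real-time view, which
    covers the data the batch layer has not processed yet."""
    return batch_view.get(key, 0) + realtime_view.get(key, 0)


batch_view = {"AMS": 69, "CDG": 44}     # rebuilt by the last batch run
realtime_view = {"AMS": 2, "BRU": 1}    # hypothetical increments since that run

print(query("AMS", batch_view, realtime_view))  # 71
print(query("BRU", batch_view, realtime_view))  # 1
print(query("HEL", batch_view, realtime_view))  # 0
```

Addition works because counts are additive; other kinds of views (for example distinct counts) would need a different merge function.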
Lambda Architecture—Compensate Batch
[Timeline diagram: a time axis ending at “now”; the most recent segment is labeled “not absorbed”]
Lambda Architecture—Immutable Data + Views
http://openflights.org
Lambda Architecture—Immutable Data + Views
timestamp            airport  flight  action
2016-02-04T10:00:00  MUC      EY123   take-off
2016-02-04T10:05:00  BRU      SAS45   take-off
2016-02-04T10:07:00  AMS      BA99    take-off
2016-02-04T10:09:00  LHR      LH17    landing
2016-02-04T10:10:00  CDG      AF03    landing
2016-02-04T10:10:00  FCO      AZ501   take-off
Immutable master dataset
Lambda Architecture—Immutable Data + Views
air-borne: 2307

air-borne per airline:
airline  planes
AF       59
AZ       23
BA       167
EY       19
LH       201
SAS      28

airport load:
airport  planes
AMS      69
CDG      44
BRU      31
FCO      10
HEL      17
LHR      101
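The idea that every view is a pure function of the immutable event log can be sketched with the six sample events. The totals on the slide (for example 2307 air-borne planes) come from a much larger dataset, so this sketch only computes what six rows can support: net changes and event counts. The function names and the rule for extracting the airline code are assumptions made for this illustration.

```python
import re
from collections import Counter

# The six sample events from the slide: (timestamp, airport, flight, action)
MASTER_DATASET = (
    ("2016-02-04T10:00:00", "MUC", "EY123", "take-off"),
    ("2016-02-04T10:05:00", "BRU", "SAS45", "take-off"),
    ("2016-02-04T10:07:00", "AMS", "BA99", "take-off"),
    ("2016-02-04T10:09:00", "LHR", "LH17", "landing"),
    ("2016-02-04T10:10:00", "CDG", "AF03", "landing"),
    ("2016-02-04T10:10:00", "FCO", "AZ501", "take-off"),
)


def airline_of(flight):
    """Assumption: the airline code is the leading letters of the flight number."""
    return re.match(r"[A-Z]+", flight).group(0)


def airborne_delta(events):
    """Net change in air-borne planes: +1 per take-off, -1 per landing."""
    return sum(1 if action == "take-off" else -1 for _, _, _, action in events)


def airborne_delta_per_airline(events):
    delta = Counter()
    for _, _, flight, action in events:
        delta[airline_of(flight)] += 1 if action == "take-off" else -1
    return dict(delta)


def events_per_airport(events):
    """Simple proxy for airport load: number of take-offs and landings."""
    return dict(Counter(airport for _, airport, _, _ in events))
```

None of these functions modifies the events, so any view can be thrown away and rebuilt at any time.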
Implementation
Lambda Architecture
[Diagram: the Lambda Architecture overview shown earlier, repeated]
Batch Layer: View Generation
[Diagram stages: Events → “Raw” Storage (Master Data) → Processing → Aggregated Data (View 1, View 2)]
Apache Spark
• Cluster computing platform
• Extends “MapReduce” with:
– Streaming
– Interactive Analytics
• Runs in memory
Spark components
• Spark SQL
• Spark Streaming (Streaming)
• MLlib (Machine Learning)
• GraphX (Graph Computation)
• Spark Core (General execution engine)
• Mesos
• Hadoop YARN
• Distributed File System (HDFS, MapR-FS, S3, …)
Spark Jobs
Driver Program (application):
sc = new SparkContext
rdd = sc.textFile("hdfs://…")
rdd.map
[Diagram: the Driver Program works through a Cluster Manager with Workers; each Worker runs an Executor that executes Tasks]
Spark Resilient Distributed Datasets “RDD”
[Diagram: sc.textFile loads the data into the “Sensor RDD”, whose four partitions are spread over the Executors of three Workers (W): one holds P4, one holds P1 and P3, one holds P2]
P1: 8213034705, 95, 2.927373, jake7870, 0…
P2: 8213034705, 115, 2.943484, Davidbresler2, 1…
P3: 8213034705, 100, 2.951285, gladimacowgirl, 58…
P4: 8213034705, 117, 2.998947, daysrus, 95…
Spark Resilient Distributed Datasets
[Diagram: RDD → Transformation (filter()) → newRDD → Action (count()) → Value]
Transformations
• Process an RDD, return a new RDD
• Examples:
– map(): one value => another value
– mapToPair(): one value => a tuple
– filter(): filters values/tuples on a given condition
– groupByKey(): groups values by key
– reduceByKey(): aggregates values by key
– join(), cogroup(), …: join RDDs
Actions
• Process an RDD, return a value
• Examples:
– count(): counts the number of items in the dataset
– first(): returns the first entry
– take(n): returns an array of the first n elements
– foreach(): applies a function to each element
– collect(): returns all elements
– saveAsTextFile(): saves the elements to text files
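The split between transformations and actions can be illustrated without a cluster. The sketch below is a toy, single-process stand-in written for this purpose; `MiniRDD` and its methods are hypothetical and are not the Spark API. Transformations only record what to do and return a new `MiniRDD`; nothing is computed until an action is called.

```python
from itertools import islice


class MiniRDD:
    """Toy, single-process stand-in for an RDD (NOT the Spark API)."""

    def __init__(self, compute):
        self._compute = compute  # zero-argument function returning an iterator

    @classmethod
    def of(cls, items):
        data = tuple(items)      # immutable source data
        return cls(lambda: iter(data))

    # --- transformations: lazy, each returns a new MiniRDD ---
    def map(self, fn):
        return MiniRDD(lambda: (fn(x) for x in self._compute()))

    def filter(self, pred):
        return MiniRDD(lambda: (x for x in self._compute() if pred(x)))

    def reduce_by_key(self, fn):
        def compute():
            acc = {}
            for key, value in self._compute():
                acc[key] = fn(acc[key], value) if key in acc else value
            return iter(acc.items())
        return MiniRDD(compute)

    # --- actions: run the pipeline and return a plain value ---
    def count(self):
        return sum(1 for _ in self._compute())

    def take(self, n):
        return list(islice(self._compute(), n))

    def collect(self):
        return list(self._compute())


events = MiniRDD.of(["MUC take-off", "LHR landing", "MUC landing", "AMS take-off"])
pairs = events.map(lambda line: (line.split()[0], 1))      # in the spirit of mapToPair()
per_airport = pairs.reduce_by_key(lambda a, b: a + b)      # in the spirit of reduceByKey()
takeoffs = events.filter(lambda line: line.endswith("take-off"))

print(takeoffs.count())               # 2
print(sorted(per_airport.collect()))  # [('AMS', 1), ('LHR', 1), ('MUC', 2)]
```

Since no transformation mutates its input, `events` can be reused for both pipelines.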
Speed Layer
[Diagram stages: Events → Processing → NoSQL (Real Time View 1, Real Time View 2)]
Serving Layer: Aggregated Data
• Views are stored in a read/write database:
– Apache HBase
– MapR-DB (Binary & JSON)
– Cassandra
– MongoDB
– Elasticsearch
– …
Serving Layer
[Diagram stages: Events → Processing → Aggregated (Real Time View, Batch View) → Query/Visualisation (Query-SQL, Dataviz, SQL)]
-- Join a MapR-DB table, a Parquet file and a MongoDB collection
> SELECT u.name, b.category, COUNT(1) AS nb_review
  FROM mongo.yelp.`user` u,
       dfs.yelp.`review.parquet` r,
       (SELECT business_id, FLATTEN(categories) AS category
        FROM maprdb.`business`) b
  WHERE u.user_id = r.user_id
    AND b.business_id = r.business_id
  GROUP BY u.user_id, u.name, b.category
  ORDER BY nb_review DESC
  LIMIT 10;
+-----------+--------------+------------+
| name      | category     | nb_review  |
+-----------+--------------+------------+
| Rand      | Restaurants  | 1086       |
| J         | Restaurants  | 661        |
| Aileen    | Restaurants  | 499        |
| Michael   | Restaurants  | 496        |
+-----------+--------------+------------+
Events Capture?
Events Capture
[Diagram labels: Customer DB, API, Logs, … (sources); Streaming; Streams; Files]
What is Spark Streaming?
• Enables scalable, high-throughput, fault-tolerant stream processing of live data
• Extension of the core Spark API
[Diagram: Data Sources → Spark Streaming → Data Sinks]
Spark Streaming Architecture
• Divides the data stream into batches of X seconds (micro-batching)
• The result is called a DStream: a sequence of RDDs
[Diagram: the input data stream enters Spark Streaming and is cut, one batch interval at a time, into a DStream of RDD batches: data from time 0 to 1 → RDD @ time 1, data from time 1 to 2 → RDD @ time 2, data from time 2 to 3 → RDD @ time 3]
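Micro-batching itself is easy to picture in plain Python. The helper below is a sketch for illustration (`micro_batches` is a hypothetical name, not a Spark Streaming function): it groups timestamped events by batch interval, so each resulting list plays the role of one RDD in the DStream.

```python
from collections import defaultdict


def micro_batches(events, batch_interval):
    """Group (timestamp, value) events into consecutive batches.

    Batch k holds the events with k*interval <= timestamp < (k+1)*interval.
    Intervals without any event are simply skipped in this sketch.
    """
    batches = defaultdict(list)
    for timestamp, value in events:
        batches[int(timestamp // batch_interval)].append(value)
    return [batches[k] for k in sorted(batches)]


stream = [(0.2, "a"), (0.7, "b"), (1.1, "c"), (2.5, "d"), (2.9, "e")]
print(micro_batches(stream, batch_interval=1))
# [['a', 'b'], ['c'], ['d', 'e']]
```

A shorter interval lowers latency but produces more, smaller batches to schedule.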
What are Apache Kafka & MapR Streams?
• Publish Subscribe Messaging
• Fast
• Scalable
• Durable
• Distributed
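The publish/subscribe model matters for the Lambda Architecture because the batch and speed layers can each read the same events at their own pace. The toy class below sketches that idea with an append-only log and one read offset per consumer. It is illustrative only: `Topic`, `publish` and `poll` are hypothetical names here, not the Kafka or MapR Streams client API, and nothing in it is durable or distributed.

```python
class Topic:
    """Toy in-memory topic: an append-only log plus one offset per consumer."""

    def __init__(self):
        self._log = []      # messages are appended, never modified
        self._offsets = {}  # consumer name -> index of the next message to read

    def publish(self, message):
        self._log.append(message)

    def poll(self, consumer):
        """Return the messages this consumer has not seen yet."""
        start = self._offsets.get(consumer, 0)
        self._offsets[consumer] = len(self._log)
        return self._log[start:]


topic = Topic()
topic.publish("EY123 take-off")
topic.publish("LH17 landing")

batch_msgs = topic.poll("batch-layer")  # both messages
speed_msgs = topic.poll("speed-layer")  # the same two messages, independently
topic.publish("AF03 landing")
later = topic.poll("speed-layer")       # only the new message
```

Reading does not remove messages, so a slow consumer never prevents a fast one from moving on.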
Summary
Lambda Architecture
[Diagram: the Lambda Architecture overview again, annotated with implementation technologies: Streams, Distributed File System, NoSQL (twice)]
Lambda Architecture in Action
[Diagram labels: Batch processing (MapReduce); Tax reduction reporting; Shortest path graph algorithm (Titan on MapR-DB); Route optimization; Geolocation (×4); Online alerts; Real-time stream]
Lambda Architecture
• Fault-tolerant
• Use the batch layer to precompute complex/large dataset queries
• Use the speed layer to deal with “near real time” use cases
• Linear scale-out capabilities
• Tolerant of human error:
– Recompute data from the master dataset when needed
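Recovery by recomputation can be sketched as follows, assuming a view is a plain function over the immutable events. The data and both function names are hypothetical, invented for this illustration: a buggy view is deployed, the bug is fixed, and the correct view is rebuilt from the untouched master dataset.

```python
from collections import Counter

MASTER_DATASET = (  # immutable raw events, never edited in place
    ("MUC", "take-off"),
    ("LHR", "landing"),
    ("MUC", "landing"),
)


def takeoffs_per_airport_buggy(events):
    # Bug: forgets to filter on the action, so landings are counted too
    return Counter(airport for airport, _ in events)


def takeoffs_per_airport_fixed(events):
    return Counter(airport for airport, action in events if action == "take-off")


bad_view = takeoffs_per_airport_buggy(MASTER_DATASET)   # wrong: MUC -> 2
good_view = takeoffs_per_airport_fixed(MASTER_DATASET)  # recomputed: MUC -> 1
```

The mistake only ever touched derived data; the raw events were never at risk, which is the point of keeping them immutable.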
Q&A
@tgrall maprtech
tug@mapr.com
Engage with us!
MapR
maprtech
mapr-technologies

Editor's Notes

  • #4: First of all, since we will be talking about Big Data applications, let’s look at some very common use cases, at a high level.
  • #8: New apps, big data or not, must be “fault tolerant”, and the Lambda Architecture has been built for that. But at which level?
  • #9: Hardware: with commodity hardware, we know that it will fail, so we compensate for that using software: HDFS/MapR-FS, HBase/MapR-DB, ZooKeeper, … You have infrastructure to support failure. What about the developer? The human being is becoming the weakest link.
  • #10: So infrastructure using Hadoop/MapR/distributed software is fault tolerant, but we still need to deal with HUMAN ERROR. Since some of us make mistakes, the goal is to “recover from them”. WE ALL MAKE MISTAKES… look at these big names.
  • #11: Facebook apologises after crash: Social network site went down for the third time in a month due to a 'configuration issue'
  • #15: Storm is a real-time computation system. Storm makes it easy to reliably process unbounded streams of data, doing for real-time processing what Hadoop did for batch processing. Cascalog: fully-featured data processing and querying library for Clojure or Java. ElephantDB: distributed database specialized in exporting key/value data from Hadoop.
  • #17: As you can guess, in application development when we talk about architecture it is all about LAYERS
  • #19: So we can see that in this case the application generates “EVENTS”. Everything we do generates events: credit card payment, commit to Git, web page click, tweet, … The events are used to manipulate the “data”, but we can use the events as the main data.
  • #20: The events you generate are immutable: they have “happened”, and they are time-based.
  • #30: Data Locality
  • #34: Resilient distributed datasets, or RDDs, are the primary abstraction in Spark. An RDD is a collection of objects that is distributed across nodes in a cluster, and data operations are performed on RDDs. Once created, an RDD is immutable. You can also persist, or cache, RDDs in memory or on disk. Spark RDDs are fault-tolerant: if a given node or task fails, the RDD can be reconstructed automatically on the remaining nodes and the job will complete.
  • #35: There are two types of data operations you can perform on an RDD: transformations and actions. A transformation returns an RDD; since RDDs are immutable, the transformation returns a new RDD. An action returns a value.
  • #44: Data sources: Socket, Kafka, Flume, HDFS, MQ (ZeroMQ...), Twitter, ..., or a custom implementation of Receiver.
  • #50: Store all events as raw data. Create intermediate views. Errors are fixed using re-computation. Based on scalable and reliable storage: Distributed File System; optimized formats (Parquet, Avro, Protobuf, …); NoSQL engines (HBase, MapR-DB, Elasticsearch, Cassandra, MongoDB, …). Distributed processing: Spark, Drill (SQL).