Lambda Architecture: The Best Way to Build Scalable and Reliable Applications!

© 2016 MapR Technologies

{“about” : “me”}
Tugdual “Tug” Grall
• MapR
• Technical Evangelist
• MongoDB
• Couchbase
• eXo
• CTO
• Oracle
• Developer/Product Manager
• Mainly Java/SOA
• Developer in consulting firms
• Web
• @tgrall
• http://tgrall.github.io
• tgrall
• NantesJUG co-founder
• Pet Project :
• http://www.resultri.com
• tug@mapr.com
• tugdual@gmail.com

Big Data & Hadoop
In Production

Data Warehouse Optimization

Data Hub
Choose the best “connector”:
• File
• Sqoop
• ETL
• …
Use the aggregated data
• In your applications
• To update other systems
• As an Open Data API
• …
Customer DB
Customer DB
Logs
…
Hadoop
NoSQL

Financial Services
[Diagram labels: Online transactions; Clickstream analysis; Fraud detection; Fraud model; Fraud investigation tool; Fraud investigator; Personalized offers; Recommendations table; Interactive marketer; Analytics; Real-time Operational Applications; MapR Distribution for Hadoop]

Fault Tolerance

Fault Tolerance
hardware
software
developer
?

Human fault tolerance

Lambda Architecture
To the rescue
λ

A little bit of history…
• Defined by Nathan Marz
• Formerly at BackType and Twitter
• Now at a new startup
• Creator of …
– Storm
– Cascalog
– ElephantDB

Lambda Architecture Requirements
• Fault-tolerant against both hardware failures and human errors
• Supports a variety of use cases, including low-latency querying and updates
• Linear scale-out capabilities
• Extensible, so that the system stays manageable and can accommodate new features easily

Lambda Architecture
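[Diagram: the NEW DATA STREAM feeds both the BATCH LAYER and the SPEED LAYER. The BATCH LAYER holds the IMMUTABLE MASTER DATA and runs PRECOMPUTE VIEWS / BATCH RECOMPUTE to produce BATCH VIEWS (View 1, View 2, … View N). The SPEED LAYER runs PROCESS STREAM / INCREMENT VIEWS to produce REAL-TIME VIEWS (View 1, View 2, … View N). In the SERVING LAYER, a QUERY is answered by a MERGE of the batch views and the real-time views.]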

Data Ingestion
All data entering the system are dispatched to both
• the batch layer
• the speed layer
[Diagram: NEW DATA STREAM → BATCH LAYER and SPEED LAYER]
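
The dual dispatch described above can be sketched in a few lines of Python. This is only an illustration: the two lists stand in for the inputs of the batch and speed layers, and the names make_dispatcher, batch_input and speed_input are hypothetical, not part of any product mentioned in this deck.

```python
from typing import Callable, Dict, List

Event = Dict[str, str]


def make_dispatcher(*sinks: Callable[[Event], None]) -> Callable[[Event], None]:
    """Return a function that hands every incoming event to all sinks."""
    def dispatch(event: Event) -> None:
        for sink in sinks:
            sink(event)
    return dispatch


# Hypothetical stand-ins for the inputs of the two layers.
batch_input: List[Event] = []
speed_input: List[Event] = []

dispatch = make_dispatcher(batch_input.append, speed_input.append)
dispatch({"airport": "MUC", "flight": "EY123", "action": "take-off"})
dispatch({"airport": "LHR", "flight": "LH17", "action": "landing"})
```

Both layers see exactly the same events; what differs is how each layer processes them afterwards.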

Batch Layer
The batch layer is responsible for:
• Managing the master dataset, an immutable, append-only set of raw data
• Precomputing arbitrary query functions; the results are called batch views
[Diagram: BATCH LAYER: IMMUTABLE MASTER DATA → PRECOMPUTE VIEWS (BATCH RECOMPUTE) → BATCH VIEWS]
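
A minimal sketch of these two responsibilities, assuming a hypothetical clickstream of (user, page) records kept in memory. The MasterDataset class and the precompute_page_views function are invented for illustration; a real batch layer would store the data on a distributed file system and run the computation on a cluster.

```python
from collections import Counter
from typing import Dict, List, Tuple


class MasterDataset:
    """Append-only store of raw events; records are never updated or deleted."""

    def __init__(self) -> None:
        self._records: List[Tuple[str, str]] = []

    def append(self, user: str, page: str) -> None:
        self._records.append((user, page))

    def records(self) -> Tuple[Tuple[str, str], ...]:
        # Hand out an immutable snapshot so callers cannot alter history.
        return tuple(self._records)


def precompute_page_views(master: MasterDataset) -> Dict[str, int]:
    """Batch view: page -> number of visits, rebuilt from all raw data."""
    return dict(Counter(page for _user, page in master.records()))


master = MasterDataset()
master.append("alice", "/home")
master.append("bob", "/home")
master.append("alice", "/pricing")
batch_view = precompute_page_views(master)
```

The view is a pure function of the master dataset, so it can be thrown away and rebuilt at any time.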

Speed Layer
• The speed layer serves requests that are subject to low-latency requirements
• It uses fast, incremental algorithms and deals with recent data only
[Diagram: SPEED LAYER: PROCESS STREAM → INCREMENT VIEWS → REAL-TIME VIEWS]
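
As a rough illustration of an incremental algorithm, the sketch below keeps a page-view counter that is updated one event at a time. The RealTimeView class is hypothetical, and the reset() method rests on an assumption that is not spelled out on the slide: once a batch run has absorbed the recent events, the real-time view for them can be discarded.

```python
from collections import defaultdict
from typing import DefaultDict


class RealTimeView:
    """Page-view counts for recent events only, updated one event at a time."""

    def __init__(self) -> None:
        self.counts: DefaultDict[str, int] = defaultdict(int)

    def on_event(self, page: str) -> None:
        # Incremental update: constant work per event, no rescan of older data.
        self.counts[page] += 1

    def reset(self) -> None:
        # Assumption: called once a batch run has absorbed these events.
        self.counts.clear()


view = RealTimeView()
for page in ["/home", "/pricing", "/home"]:
    view.on_event(page)
```

Unlike the batch view, this view is never recomputed from scratch; it only ever sees new events.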

Serving Layer
• The serving layer indexes the batch views so that they can be queried ad hoc with low latency
[Diagram: SERVING LAYER: a QUERY is answered by a MERGE of the BATCH VIEWS and the REAL-TIME VIEWS]
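
The MERGE step can be sketched as follows for a simple counting view. The function name merged_count and the sample numbers are hypothetical. The sketch assumes that the real-time view only covers events not yet absorbed into the batch view, so adding the two counts does not count anything twice.

```python
from typing import Mapping


def merged_count(key: str,
                 batch_view: Mapping[str, int],
                 realtime_view: Mapping[str, int]) -> int:
    """Answer a query by combining the batch view with the real-time view."""
    return batch_view.get(key, 0) + realtime_view.get(key, 0)


# Hypothetical views: the batch view is complete up to the last batch run,
# the real-time view holds only what arrived since then.
batch_view = {"/home": 10, "/pricing": 4}
realtime_view = {"/home": 2, "/signup": 1}

home_total = merged_count("/home", batch_view, realtime_view)
signup_total = merged_count("/signup", batch_view, realtime_view)
```

For views that are not simple sums (distinct counts, for instance) the merge function would need to be chosen differently.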

Lambda Architecture—Compensate Batch
[Diagram: timeline ("time" axis ending at "now") with the most recent interval marked "not absorbed"]

Lambda Architecture—Immutable Data + Views
http://openflights.org

immutable master dataset:
timestamp            airport  flight  action
2016-02-04T10:00:00  MUC      EY123   take-off
2016-02-04T10:05:00  BRU      SAS45   take-off
2016-02-04T10:07:00  AMS      BA99    take-off
2016-02-04T10:09:00  LHR      LH17    landing
2016-02-04T10:10:00  CDG      AF03    landing
2016-02-04T10:10:00  FCO      AZ501   take-off

Views derived from the master dataset:

air-borne: 2307

air-borne per airline:
airline  planes
AF       59
AZ       23
BA       167
EY       19
LH       201
SAS      28

airport load:
airport  planes
AMS      69
CDG      44
BRU      31
FCO      10
HEL      17
LHR      101
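
To make the idea of views as functions of the master dataset concrete, here is a small sketch that uses only the six events shown in the table. Because earlier events are missing from this excerpt, the functions return the net change in air-borne planes over the excerpt, not the totals shown on the slide. The helper airline_code is a simplifying assumption: it treats the flight number minus its trailing digits as the airline code.

```python
from collections import Counter
from typing import Dict, List, NamedTuple


class FlightEvent(NamedTuple):
    timestamp: str
    airport: str
    flight: str
    action: str  # "take-off" or "landing"


MASTER: List[FlightEvent] = [
    FlightEvent("2016-02-04T10:00:00", "MUC", "EY123", "take-off"),
    FlightEvent("2016-02-04T10:05:00", "BRU", "SAS45", "take-off"),
    FlightEvent("2016-02-04T10:07:00", "AMS", "BA99", "take-off"),
    FlightEvent("2016-02-04T10:09:00", "LHR", "LH17", "landing"),
    FlightEvent("2016-02-04T10:10:00", "CDG", "AF03", "landing"),
    FlightEvent("2016-02-04T10:10:00", "FCO", "AZ501", "take-off"),
]


def airline_code(flight: str) -> str:
    # Assumption for this sketch: strip the trailing digits of the flight number.
    return flight.rstrip("0123456789")


def net_airborne_change(events: List[FlightEvent]) -> int:
    """+1 for every take-off, -1 for every landing."""
    return sum(1 if e.action == "take-off" else -1 for e in events)


def net_airborne_change_per_airline(events: List[FlightEvent]) -> Dict[str, int]:
    change: Counter = Counter()
    for e in events:
        change[airline_code(e.flight)] += 1 if e.action == "take-off" else -1
    return dict(change)


total_change = net_airborne_change(MASTER)
per_airline = net_airborne_change_per_airline(MASTER)
```

Neither function modifies MASTER; adding a new view later only means writing another function over the same immutable events.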

Implementation


Batch Layer: View Generation
[Diagram: Events → “Raw” Storage (Master Data) → Processing → Aggregated Data (View 1, View 2)]

Apache Spark
• Cluster computing platform
• Extends “MapReduce” with
– Streaming
– Interactive analytics
• Runs in memory

Spark components
• Spark SQL
• Spark Streaming (Streaming)
• MLlib (Machine Learning)
• GraphX (Graph Computation)
• Spark Core (General execution engine)
• Cluster managers: Mesos, Hadoop YARN
• Distributed File System (HDFS, MapR-FS, S3, …)

Spark Jobs
Driver Program (application)
sc = new SparkContext
rDD = sc.textFile("hdfs://…")
rDD.map
[Diagram: the Driver Program connects to the Cluster Manager, which runs Tasks in Executors on Worker nodes (two Workers, each with one Executor and two Tasks)]

Spark Resilient Distributed Datasets “RDD”
[Diagram: sc.textFile loads the Sensor RDD as four partitions spread over three worker (W) executors: P4 on the first, P1 and P3 on the second, P2 on the third]
P1: 8213034705, 95, 2.927373, jake7870, 0…
P2: 8213034705, 115, 2.943484, Davidbresler2, 1…
P3: 8213034705, 100, 2.951285, gladimacowgirl, 58…
P4: 8213034705, 117, 2.998947, daysrus, 95…

Spark Resilient Distributed Datasets
[Diagram: RDD → Transformation, e.g. filter() → newRDD → Action, e.g. count() → Value]

Transformations
• Process an RDD and return a new RDD
• Examples:
– map(): one value => another value
– mapToPair(): one value => a tuple
– filter(): keeps the values/tuples that match a given condition
– groupByKey(): groups values by key
– reduceByKey(): aggregates values by key
– join(), cogroup(), …: join RDDs

Actions
• Process an RDD and return a value
• Examples:
– count(): counts the number of items in the dataset
– first(): returns the first entry
– take(n): returns an array of the first n elements
– foreach(): applies a function to each element
– collect(): returns all elements
– saveAsTextFile(): saves the elements to text files
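
The difference between transformations and actions can be illustrated without a cluster. The MiniRDD class below is a toy, single-process stand-in written for this note; it is not the Spark API. It borrows a few method names (map, filter, count, collect, take) to show that a transformation only records what to do, while an action actually runs the chain.

```python
import itertools
from typing import Callable, Iterable, List


class MiniRDD:
    """Toy stand-in for an RDD (not the Spark API)."""

    def __init__(self, source: Callable[[], Iterable]) -> None:
        self._source = source  # a recipe for producing the data, not the data

    @classmethod
    def of(cls, items: List) -> "MiniRDD":
        return cls(lambda: iter(items))

    # Transformations: return a new MiniRDD and compute nothing yet.
    def map(self, fn: Callable) -> "MiniRDD":
        return MiniRDD(lambda: (fn(x) for x in self._source()))

    def filter(self, pred: Callable) -> "MiniRDD":
        return MiniRDD(lambda: (x for x in self._source() if pred(x)))

    # Actions: run the whole chain and return a plain value.
    def count(self) -> int:
        return sum(1 for _ in self._source())

    def collect(self) -> List:
        return list(self._source())

    def take(self, n: int) -> List:
        return list(itertools.islice(self._source(), n))


calls: List[str] = []


def parse(line: str) -> int:
    calls.append(line)  # record every invocation to make laziness visible
    return int(line)


lines = MiniRDD.of(["95", "115", "100", "117"])
big = lines.map(parse).filter(lambda v: v >= 100)  # transformations only
calls_before_action = len(calls)                   # nothing has run yet
n_big = big.count()                                # the action triggers the work
```

In this toy version every action re-runs the chain from the source, so calling collect() after count() parses the lines a second time.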

Speed Layer
[Diagram: Events → Processing → NoSQL (Real Time View 1, Real Time View 2)]

Serving Layer: Aggregated Data
• Views are stored in a read/write database
– Apache HBase
– MapR-DB (Binary & JSON)
– Cassandra
– MongoDB
– Elasticsearch
– …

Serving Layer
[Diagram: Events → Processing → Aggregated (Real Time View, Batch View) → Query/Visualisation (SQL query, Dataviz)]

Events Capture?

Events Capture
[Diagram: sources (Customer DB, API, Logs, …) are captured as Streams (streaming) or as Files]

What is Spark Streaming?
• Enables scalable, high-throughput, fault-tolerant stream processing of live data
• Extension of the core Spark API
[Diagram: Data Sources → Spark Streaming → Data Sinks]

Spark Streaming Architecture
• Divides the data stream into batches of X seconds (micro-batching)
• The result is called a DStream: a sequence of RDDs
[Diagram: input data stream → Spark Streaming → DStream of RDD batches, one per batch interval: data from time 0 to 1 → RDD @ time 1; data from time 1 to 2 → RDD @ time 2; data from time 2 to 3 → RDD @ time 3]
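
Micro-batching itself is easy to picture in plain Python. The sketch below is not Spark Streaming code; it groups an already collected list of (timestamp, value) pairs into consecutive intervals, which is roughly what happens to a live stream one interval at a time. The function name micro_batches and the sample events are hypothetical, and timestamps are assumed to be non-negative seconds.

```python
from typing import Dict, Iterable, List, Tuple


def micro_batches(events: Iterable[Tuple[float, str]],
                  interval: float) -> List[List[str]]:
    """Group (timestamp, value) events into consecutive batches.

    Batch k holds the events with k*interval <= timestamp < (k+1)*interval.
    Intervals without events yield an empty batch.
    """
    buckets: Dict[int, List[str]] = {}
    for ts, value in events:
        buckets.setdefault(int(ts // interval), []).append(value)
    if not buckets:
        return []
    return [buckets.get(k, []) for k in range(max(buckets) + 1)]


events = [(0.2, "a"), (0.9, "b"), (1.5, "c"), (2.1, "d"), (2.7, "e")]
batches = micro_batches(events, interval=1.0)
```

Each inner list plays the role of one RDD in the DStream: data from time 0 to 1, from 1 to 2, and from 2 to 3.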

What are Apache Kafka & MapR Streams?
• Publish Subscribe Messaging
• Fast
• Scalable
• Durable
• Distributed

Summary

Lambda Architecture
[Diagram: the Lambda Architecture diagram again (NEW DATA STREAM, BATCH LAYER, SPEED LAYER, SERVING LAYER), annotated with storage technologies: NoSQL, Distributed File System, NoSQL, Streams]

Lambda Architecture in Action
[Diagram labels: Geolocation (four sources, …); Real-time stream; Online alerts; Batch processing (MapReduce); Tax reduction reporting; Shortest path graph algorithm (Titan on MapR-DB); Route optimization]

Lambda Architecture
• Fault-tolerant
• Use the batch layer to precompute queries over complex/large data sets
• Use the speed layer to deal with “near real time” use cases
• Linear scale-out capabilities
• Tolerant of human error:
– Recompute the views from the master data set when needed
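
Recomputing after a mistake can be sketched as follows. The event list and both view functions are hypothetical. The point is only that a buggy view can be replaced by re-running a corrected function over the untouched master data.

```python
from typing import Callable, Dict, List

Event = Dict[str, str]

# Hypothetical excerpt of an immutable master dataset.
MASTER: List[Event] = [
    {"flight": "EY123", "action": "take-off"},
    {"flight": "LH17", "action": "landing"},
    {"flight": "BA99", "action": "take-off"},
]


def buggy_take_off_count(events: List[Event]) -> int:
    # Bug: counts every event, not only take-offs.
    return len(events)


def fixed_take_off_count(events: List[Event]) -> int:
    return sum(1 for e in events if e["action"] == "take-off")


def rebuild_view(view_fn: Callable[[List[Event]], int],
                 master: List[Event]) -> int:
    """Views hold no state of their own, so a rebuild is just a re-run."""
    return view_fn(master)


wrong_view = rebuild_view(buggy_take_off_count, MASTER)
right_view = rebuild_view(fixed_take_off_count, MASTER)
```

The buggy deployment never wrote to MASTER, so nothing has to be repaired there; only the derived view is replaced.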

Q&A
@tgrall maprtech
tug@mapr.com
Engage with us!
MapR
maprtech
mapr-technologies
