Lambda Architecture 
Analyzing large scale, unstructured, 
dynamic data 
Rajesh Muppalla (@codingnirvana) 
rajesh@indix.com
Indix - Quick Overview 
Am I priced higher or lower w.r.t. my competitor on the Nikon D700?
Which product has the UPC 8745354434?
What are all the variants of the Apple MacBook Air 13”?
What is the average price change of all Nike shoes in Walmart in the last 3 months?
Data Pipeline @ Indix 
[Diagram: the pipeline stages Crawling, Parsing, Classification and Matching, with ML models, feeding the Product & Price Catalog.]
Data Pipeline @ Indix 
Analytics 
(Precomputes, 
Insights) 
Search Index 
Product & Price 
Catalog 
Experiences 
We released v1.0 of our API today - developer.indix.com
Data is Dynamic 
[Diagram: the same Crawling, Parsing, Classification and Matching pipeline, now with a new ML model alongside the existing one.]
Data Scale 
● 400 M Product URLs
● 4 TB HTML Data Crawled Daily
● 100 TB Data Processed Daily
● 3000 Categories
● 10 B Price Points
● 2000 Sites
Data Pipeline v1.0
Batch using HBase & MapReduce
Problem 1 
Mutable State 
Data Systems should be Human Fault Tolerant
Problem 2 
Compactions 
Random Write databases are hard to manage at large scale
Problem 3 
16 hours 
16 hours of latency is a lot. We wanted it to be a couple of hours.
Three Problems 
● No Human Fault Tolerance 
○ Mutable State 
● Operational Complexity 
○ Random Writes (Compactions) 
● High Latency 
○ Batch system architectural tradeoff
Rethink our data systems
Lambda Architecture
Lambda Architecture 
● An approach to build big data systems 
○ Architectural Components & Principles 
○ Ties Batch & Real Time Systems 
○ General Purpose - Domain Agnostic 
● Coined by Nathan Marz 
○ Ex-Twitter Engineer 
○ Creator of Storm
Data System - Traditional Approach 
HBase (Source of Truth)
Application
Data System - New Approach 
Immutable Raw Data (Source of Truth)
Processed View(s)
Application
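The idea on this slide can be illustrated with a minimal sketch. The `Observation` record, the in-memory list standing in for the raw data store, and the function names are all assumptions made for illustration. Records are only ever appended, and the view is a pure function of all records, so it can be discarded and rebuilt at any time.

```python
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass(frozen=True)  # frozen: a record cannot be modified once created
class Observation:
    """Hypothetical raw fact: a product was seen in a category at some hour."""
    product_id: str
    category: str
    hour: int


def append(log: List[Observation], record: Observation) -> List[Observation]:
    """Return a new log with the record added; existing records are never edited."""
    return log + [record]


def build_view(log: List[Observation]) -> Dict[str, int]:
    """Derive the processed view (unique products per category) from all raw data."""
    products: Dict[str, Set[str]] = {}
    for record in log:
        products.setdefault(record.category, set()).add(record.product_id)
    return {category: len(ids) for category, ids in products.items()}


log: List[Observation] = []
log = append(log, Observation("p1", "C1", 9))
log = append(log, Observation("p2", "C1", 10))
log = append(log, Observation("p1", "C1", 11))  # the same product seen again
log = append(log, Observation("p3", "C2", 11))
view = build_view(log)
```

Because the application reads only the view, the view can be replaced wholesale without touching the raw data.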
Let’s take an example 
Find the count of unique products in any 
given category for the entire time range
Two Requirements 
● Recomputations 
● Large Scale
Batch Layer Implementation 
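[Diagram: Products Master Data stored in HDFS (vertical partitioning) in hourly partitions (9 am, 10 am, 11 am, 12 pm, 1 pm, 2 pm). MR Job 1 builds an intermediate view per category (C1 to C5); MR Job 2 takes new data and produces the batch view in HBase, which serves the query.]
Batch view (category, count): C1 5, C2 7, C3 4, C4 7, C5 1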
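The two jobs named on this slide could be sketched roughly as follows. This is plain Python rather than MapReduce, and the partition layout, the function names `job1_intermediate_view` and `job2_batch_view`, and the sample data are assumptions for illustration, not the Indix implementation.

```python
from typing import Dict, Iterable, List, Set, Tuple

# Hypothetical master data: one list of (category, product_id) records per hourly partition.
Partition = List[Tuple[str, str]]


def job1_intermediate_view(partitions: Iterable[Partition]) -> Dict[str, Set[str]]:
    """Stand-in for MR Job 1: collect the distinct product ids seen in each category."""
    view: Dict[str, Set[str]] = {}
    for partition in partitions:
        for category, product_id in partition:
            view.setdefault(category, set()).add(product_id)
    return view


def job2_batch_view(intermediate: Dict[str, Set[str]], new_data: Partition) -> Dict[str, int]:
    """Stand-in for MR Job 2: fold new data into the intermediate view and emit counts."""
    merged = {category: set(ids) for category, ids in intermediate.items()}
    for category, product_id in new_data:
        merged.setdefault(category, set()).add(product_id)
    return {category: len(ids) for category, ids in merged.items()}


master_data = {
    "9 am": [("C1", "p1"), ("C2", "p7")],
    "10 am": [("C1", "p2"), ("C1", "p1")],
}
intermediate = job1_intermediate_view(master_data.values())
batch_view = job2_batch_view(intermediate, new_data=[("C1", "p3"), ("C3", "p9")])
```

Keeping the intermediate view as sets of ids, rather than counts, is what lets the second step add new data without double counting products it has already seen.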
Handling Recomputations 
[Diagram: same layout as the Batch Layer Implementation slide.]
Handling Scale 
● Hadoop HDFS, MapReduce, HBase 
● Proven Linear Scalability
Three Problems (Recap) 
● No Human Fault Tolerance 
○ Mutable State 
● Operational Complexity 
○ Random Writes (Compactions) 
● High Latency 
○ Batch system architectural tradeoff
Human Fault Tolerance 
● Bugs in the batch jobs 
○ Discard views & Recompute 
● Bugs in the master data jobs 
○ Re-process the master data to hide the old data 
● Bugs in the query 
○ Re-deploy the query layer 
● Traceability as a side effect
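The first bullet, "Discard views & Recompute", can be illustrated with a small sketch. The sample records and both job functions are hypothetical; the point is only that a buggy view is replaced by rerunning a corrected job over the untouched raw data.

```python
from typing import Dict, Set, Tuple

# Immutable raw data: a tuple of (category, product_id) records that no job modifies.
RAW_DATA: Tuple[Tuple[str, str], ...] = (
    ("C1", "p1"), ("C1", "p1"), ("C1", "p2"), ("C2", "p3"),
)


def buggy_job(records: Tuple[Tuple[str, str], ...]) -> Dict[str, int]:
    """Bug: counts every record, so repeated sightings of a product inflate the count."""
    counts: Dict[str, int] = {}
    for category, _ in records:
        counts[category] = counts.get(category, 0) + 1
    return counts


def fixed_job(records: Tuple[Tuple[str, str], ...]) -> Dict[str, int]:
    """Fix: count distinct product ids per category."""
    seen: Dict[str, Set[str]] = {}
    for category, product_id in records:
        seen.setdefault(category, set()).add(product_id)
    return {category: len(ids) for category, ids in seen.items()}


view = buggy_job(RAW_DATA)  # a wrong view gets published
view = fixed_job(RAW_DATA)  # discard it and recompute from the same raw data
```

Nothing has to be repaired in place: the bad view is simply thrown away.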
Operational Complexity 
● No random writes in the batch layer 
○ Bulk Updates to build the batch view
Great… What about Latency?
Speed Layer 
Queue (Kafka)
Recent Data
Real Time Processing (Storm)
HyperLogLog Sets
Query
Random Writes (Updates)
Read-Write Data Store (Riak, HBase, Cassandra)
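The diagram mentions HyperLogLog sets for the speed layer. As a rough illustration of how such an estimator counts unique items in a small, fixed amount of memory, here is a minimal sketch. The class name, the precision `p=12` and the use of SHA-1 as the hash are choices made here, not details from the talk, and a production system would normally rely on an existing library.

```python
import hashlib
import math
from typing import List


class HyperLogLog:
    """Estimates the number of distinct items using 2**p small registers."""

    def __init__(self, p: int = 12) -> None:
        self.p = p
        self.m = 1 << p
        self.registers: List[int] = [0] * self.m

    def add(self, item: str) -> None:
        digest = hashlib.sha1(item.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")          # 64-bit hash value
        index = h >> (64 - self.p)                      # top p bits choose a register
        rest = h & ((1 << (64 - self.p)) - 1)           # remaining 64 - p bits
        rank = (64 - self.p) - rest.bit_length() + 1    # position of the first 1-bit
        self.registers[index] = max(self.registers[index], rank)

    def merge(self, other: "HyperLogLog") -> None:
        """Union with another sketch of the same precision (register-wise maximum)."""
        self.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]

    def count(self) -> float:
        alpha = 0.7213 / (1 + 1.079 / self.m)           # bias correction for m >= 128
        raw = alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros > 0:
            return self.m * math.log(self.m / zeros)    # small-range correction
        return raw


sketch = HyperLogLog()
for i in range(10000):
    sketch.add(f"product-{i}")
    sketch.add(f"product-{i}")  # repeated sightings do not change the estimate
estimate = sketch.count()
```

Updating a register is a small random write, which is why this structure fits a read-write store, and `merge` lets sketches built separately be combined without revisiting the items.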
Speed Layer has mutation... But 
● Speed layer deals with much smaller data 
○ Batch Layer - Months/years of data 
○ Speed Layer - Few hours or 1 day of data 
● Easy to manage operationally 
Complexity Isolation
Final Step - Merging Results 
Data feeds both the Batch Layer and the Speed Layer; the Query merges their results.
Batch Layer: C1 - 50000
Speed Layer: C1 - 499 (approximate, with error 0.02%)
Merged Results: C1 - 50499
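The merge step with the slide's numbers can be sketched as below. The function name `merge_results` is hypothetical, and the sketch assumes the speed layer's count covers only products that are not yet in the batch view; if the two overlapped, simple addition would double count.

```python
from typing import Dict


def merge_results(batch_view: Dict[str, int], speed_view: Dict[str, int]) -> Dict[str, int]:
    """Add the speed layer's count of recent products to the batch count per category."""
    categories = set(batch_view) | set(speed_view)
    return {c: batch_view.get(c, 0) + speed_view.get(c, 0) for c in sorted(categories)}


# Values taken from the slide: batch view 50000, speed layer 499.
merged = merge_results({"C1": 50000}, {"C1": 499})
```

A category present in only one layer is still returned, with the missing side treated as zero.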
What about Accuracy? 
Batch Layer 
Speed Layer 
Data 
Query 
Merged Results 
C1 - 499 
(Approximate with 
error 0.02%) 
C1’ - 50500 
Batch Layer 
C1 - 50000
C1’ - 50500
Eventually Accurate
Lambda Architecture
Lambda Architecture @ Indix
Batch Layer @ Indix 
● Pail 
○ Vertical partitioning 
○ Consolidation of small files 
● Scalding 
● Thrift for enforcing schemas 
● HBase/Solr for views 
○ Bulk updates to create views
Speed Layer @ Indix 
● Still WIP 
● To reduce latency 
○ Micro batches for Speed layer 
○ Use the last batch run + bulk update views
Open Challenges 
● Managing both Batch & Real Time still painful 
● Two broad directions 
○ Abstractions 
■ SummingBird (Twitter) 
○ Unified Stack 
■ Spark 
■ Kafka + Samza/Storm (LinkedIn) 
■ Cloud Data Flow (Google)
In Conclusion... 
● Lambda Architecture 
○ A different approach to build data systems 
○ Solid principles 
○ Domain Agnostic 
○ Tools not yet mature
Resources 
● Indix Engineering Blog - http://engineering.indix.com
● Runaway Complexity in Big Data Systems 
● Lambda Architecture 
● Big Data Book - Manning 
● Scalding 
● Spark 
● Pail 
● Summingbird
Key Takeaways 
- Human Fault Tolerance 
- Complexity Isolation 
- Higher Level Abstractions
Thank You
Batch vs Real Time Choices
Tying it all together - Go-CD
Extras 
● Monoids 
● LA is not new 
○ Search Engines (fast, slow crawl) 
○ Event Sourcing (immutable events to maintain state)
○ Patch, Audit, Bootstrap
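On the "Monoids" bullet: a monoid is an associative combine operation with an identity element, which is what allows partial aggregates to be computed per partition, or per layer, and combined in any grouping. A minimal sketch using set union over product ids, with hypothetical sample data:

```python
from functools import reduce
from typing import FrozenSet, Iterable

EMPTY: FrozenSet[str] = frozenset()  # identity element: combining with it changes nothing


def combine(a: FrozenSet[str], b: FrozenSet[str]) -> FrozenSet[str]:
    """Associative operation: the union of two sets of product ids."""
    return a | b


def aggregate(parts: Iterable[FrozenSet[str]]) -> FrozenSet[str]:
    """Fold any number of partial aggregates into one."""
    return reduce(combine, parts, EMPTY)


hourly = [frozenset({"p1", "p2"}), frozenset({"p2", "p3"}), frozenset({"p4"})]
all_at_once = aggregate(hourly)
# Same result when the older hours are aggregated first and the newest hour is added later.
incremental = combine(aggregate(hourly[:2]), aggregate(hourly[2:]))
```

Because the grouping does not matter, the same aggregation logic can run over a large batch or over small increments.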
Problem Statement - Optimization
