Scaling ETL on Hadoop: Bridging OLTP with OLAP
Agenda
• Data Ecosystem @ LinkedIn
• Problem: Bridging OLTP with OLAP
• Solution
• Details
• Conclusion and Future Work
Data Ecosystem @ LinkedIn
Data Ecosystem - Overview
[Diagram labels: Serving App; Online Stores (Espresso, Oracle, MySQL); Logs; Analytics Infra; Business Engines; Serving; OLAP]
Data Ecosystem – Data
• Tracking Data
  - Tracks user activity on the website
  - Append only
  - Example: Page View
• Database Data
  - Member-provided data in online stores
  - Inserts, updates and deletes
  - Example: Member Profiles, Likes, Comments
Problem
Scaling ETL on Hadoop
Bridging OLTP to OLAP
• Integrating site-serving data stores with Hadoop at scale with low latency
• Critical to LinkedIn’s
  - Member engagement
  - Business decision making
[Diagram labels: OLTP; OLAP; Kafka; Engines; Serving; Databases (Espresso, Oracle, MySQL); Tracking Data]
Challenge - Scalable ETL
• 600+ tracking topics
• 500+ database tables
• XXX TB of data at rest
• X TB of new data generated per day
• 5000 nodes, several Hadoop clusters
Challenge – Consistent Snapshot with SLA
• Must apply updates and deletes
• Copying full tables works, but
  - it incurs resource overheads
  - only a small fraction of the data changes
Requirements
• Refresh data on HDFS frequently
• Seamless handling of schema evolution
• Optimal resource usage
• Handle multiple data centers
• Efficient change capture on source
• Ensure last-update semantics
• Handle deletes
[Diagram labels: OLTP (Oracle, Espresso); OLAP; Engines; Serving; Database Data; Tracking Data]
Solution
Lumos
• Can use commit logs
• Delta processing
• Latencies in minutes
• Schema-agnostic framework
[Diagram labels: Data Capture (Databus, DB Extract Files, Others); Data Center with Colo-1 Databases and Colo-2 Databases; Hadoop Data Center with Lumos, databases (HDFS) and dbchanges (HDFS)]
Lumos – Multi-Datacenter
• Handle multi-datacenter stores
• Resolve updates via commit order
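Resolving updates via commit order can be sketched as follows. This is a minimal illustration, not the Lumos implementation: the `Change` record, its field names, and the assumption that commit sequence numbers are comparable across colos are all hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Change:
    key: str              # primary key of the row
    commit_seq: int       # assumed: commit order comparable across colos
    colo: str             # originating data center
    row: Optional[dict]   # None marks a hard delete


def resolve_by_commit_order(changes: Iterable[Change]) -> dict:
    """Keep, per key, the change with the highest commit sequence."""
    latest = {}
    for change in changes:
        current = latest.get(change.key)
        if current is None or change.commit_seq > current.commit_seq:
            latest[change.key] = change
    return latest


colo1 = [Change("m1", 10, "colo-1", {"title": "Engineer"}),
         Change("m2", 12, "colo-1", {"title": "Analyst"})]
colo2 = [Change("m1", 15, "colo-2", {"title": "Manager"}),
         Change("m2", 11, "colo-2", None)]

resolved = resolve_by_commit_order(colo1 + colo2)
```

Here the later commit wins regardless of which colo produced it, which is one way to obtain last-update semantics.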
Lumos – Data Organization
• Database Snapshot
  - Entire database on HDFS
  - With added latency
• Database Virtual Snapshot
  - Previous Snapshot + Delta
  - Enables faster refresh
HDFS layout:
/db/table/snapshot-0
    _delta
        dir-1
        dir-2
        dir-3
[Diagram labels: Virtual Snapshot; HDFS Layout; InputFormat; Pig & Hive Loaders]
Lumos - High Level Architecture
[Architecture diagram, ETL Hadoop Cluster: Change Capture (Full Drops, Increments); Pre-Process; Lazy Snapshot Builder; Staging (internal); Virtual Snapshot Builder; Published Virtual Snapshot (HDFS); Compactor; MR/Pig/Hive Loaders; User Jobs]
Alternative Approaches
• Sqoop
• HBase
• Hive Streaming
Details
Change Capture – File Based
• File format
  - Compressed CSV
  - Metadata
• Full drop
  - Via Fast Reader (Oracle, MySQL)
  - Via MySQL backups (Espresso)
  - Runs for hours with dirty reads
• Increments
  - Via SQL
  - Transactional
[Diagram: timeline from 1am to 4am showing a Full Drop and hourly increments (Inc h-1 to Inc h-4) between the previous and new high-water marks; DB files are exposed through a web service and pulled into HDFS over HTTPS]
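Pulling an increment via SQL between two high-water marks might look like the sketch below. It uses an in-memory SQLite table as a stand-in for the source database; the `member` table, the `modified_at` column and the integer high-water marks are assumptions made for illustration only.

```python
import sqlite3

# Stand-in source database with a hypothetical last-modified column.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE member (id INTEGER PRIMARY KEY, name TEXT, modified_at INTEGER)")
conn.executemany("INSERT INTO member VALUES (?, ?, ?)",
                 [(1, "ann", 100), (2, "bob", 205), (3, "cy", 310)])


def pull_increment(conn, prev_hw, new_hw):
    """Return rows modified in (prev_hw, new_hw], read inside one transaction."""
    with conn:  # the connection context manager commits or rolls back
        cur = conn.execute(
            "SELECT id, name, modified_at FROM member "
            "WHERE modified_at > ? AND modified_at <= ? ORDER BY id",
            (prev_hw, new_hw))
        return cur.fetchall()


inc_h1 = pull_increment(conn, prev_hw=100, new_hw=300)
inc_h2 = pull_increment(conn, prev_hw=300, new_hw=400)
```

Each pull starts where the previous one ended, so consecutive increments neither overlap nor leave gaps. A query like this only sees rows that still exist, so it would not observe hard deletes.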
Change Capture – Databus Based
• Reads database commit logs
• Multi-datacenter via Databus Relay
• Runs as an MR job
• Output: date-time partitioned, with multiple versions
• True change capture (including hard deletes)
[Diagram labels: Database; Databus Relay; Mapper (Databus Consumer); Reducer; dbchanges (HDFS); Hadoop]
Pre-Processing
• Data format conversion
• Field-level transformations
  - Privacy
  - Cleansing, e.g. remove recursive schema
• Metadata annotation
  - Add row counts for data validation
Snapshotting – Lazy Materializer
• One MR job per table, consumes full drops
• Supports dirty reads
• Hash partitions on primary key
  - Number of partitions based on data size
• Sorts on primary key
• Results published into staging directory
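The partition-and-sort step can be sketched in a few lines. This is a single-process illustration of the idea, not the MR job itself; the function names, the use of CRC32 as the hash, and the size-based sizing rule are assumptions.

```python
import math
import zlib


def num_partitions(total_bytes, target_partition_bytes):
    """Pick a partition count from the data size (at least one)."""
    return max(1, math.ceil(total_bytes / target_partition_bytes))


def partition_of(key, n):
    """Stable hash of the primary key; crc32 is an arbitrary choice here."""
    return zlib.crc32(str(key).encode("utf-8")) % n


def build_snapshot(rows, key_field, n):
    """Hash-partition rows on the primary key, then sort each partition."""
    parts = [[] for _ in range(n)]
    for row in rows:
        parts[partition_of(row[key_field], n)].append(row)
    for part in parts:
        part.sort(key=lambda r: r[key_field])
    return parts


rows = [{"id": i, "name": f"member-{i}"} for i in (7, 3, 9, 1, 4)]
n = num_partitions(total_bytes=250, target_partition_bytes=100)
snapshot = build_snapshot(rows, "id", n)
```

Sorting every partition on the primary key is what later allows a snapshot partition and its delta to be merged in a single pass.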
Snapshotting – Virtual Snapshot Builder
• One MR job for all tables
• Identifies all existing snapshots, both published and staged
• Creates appropriate delta partitions for every snapshot
  - Delta partition count equals snapshot partition count
  - Clubs multiple partitions into one file
  - Outputs latest row using delta column
• Publishes staged snapshots with new deltas
• Previously published snapshots are updated with new deltas
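Building a delta that lines up with its snapshot can be sketched as below: keep the latest version of each row according to the delta column, then partition with the same hash and the same partition count as the snapshot. The field names and the CRC32 hash are illustrative assumptions.

```python
import zlib


def partition_of(key, n):
    """Must be the same hash the snapshot was partitioned with."""
    return zlib.crc32(str(key).encode("utf-8")) % n


def build_delta(changes, key_field, delta_field, snapshot_partitions):
    """Keep the latest change per key and partition it like the snapshot."""
    latest = {}
    for change in changes:
        key = change[key_field]
        if key not in latest or change[delta_field] > latest[key][delta_field]:
            latest[key] = change
    parts = [[] for _ in range(snapshot_partitions)]
    for key in sorted(latest):  # keeps each partition sorted on the key
        parts[partition_of(key, snapshot_partitions)].append(latest[key])
    return parts


changes = [
    {"id": 3, "name": "old", "modified_at": 10},
    {"id": 3, "name": "new", "modified_at": 20},
    {"id": 8, "name": "x", "modified_at": 15},
]
delta = build_delta(changes, "id", "modified_at", snapshot_partitions=4)
```

Because partition counts match, delta partition i only ever needs to be merged with snapshot partition i.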
Snapshotting – Virtual Snapshot Builder
/db/table/snapshot-0 (10 partitions, 10 Avro files: Part-0 ... Part-9)
    _delta
        inc-1 (10 partitions, 2 Avro files: Part-0, Part-5, plus index files)
        inc-2 (10 partitions, 2 Avro files: Part-0, Part-5, plus index files)
• Incremental data is small
• Rolls increments
• Avoids creating small files
• Equi-partitions the increment like the snapshot
• Seek and read a partition
[Diagram: Part-0.avro holds Partition-0 to Partition-4 and Part-5.avro holds Partition-5 to Partition-9; each file has an index file]
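The "seek and read a partition" idea can be illustrated with a small sketch: several partitions are written into one file, and an index records where each partition starts. JSON lines in an in-memory buffer stand in for the Avro files here, and the index layout (offset, length) is an assumption.

```python
import io
import json


def write_clubbed(partitions):
    """Write several partitions into one buffer plus an offset index."""
    data = io.BytesIO()
    index = {}
    for pid, rows in partitions.items():
        start = data.tell()
        for row in rows:
            data.write((json.dumps(row) + "\n").encode("utf-8"))
        index[pid] = (start, data.tell() - start)  # (offset, length)
    return data.getvalue(), index


def read_partition(blob, index, pid):
    """Seek to one partition and read only its bytes."""
    offset, length = index[pid]
    stream = io.BytesIO(blob)
    stream.seek(offset)
    chunk = stream.read(length)
    return [json.loads(line) for line in chunk.decode("utf-8").splitlines()]


partitions = {0: [{"id": 1}], 1: [], 2: [{"id": 5}, {"id": 9}]}
blob, index = write_clubbed(partitions)
rows = read_partition(blob, index, 2)
```

Clubbing keeps the file count low while the index still lets a reader fetch a single partition without scanning the rest.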
Snapshotting – Loaders
• Custom InputFormat (MR)
  - Uses the index file to create splits
  - RecordReader merges partition-0 of Snapshot and Delta
    - Returns latest row from Delta if present
    - Masks row if deleted
    - Otherwise returns row from Snapshot
• Pig loader enables reading the virtual snapshot via Pig
• Storage handler enables reading the virtual snapshot via Hive
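The merge performed by the RecordReader can be sketched as a single pass over two inputs sorted on the primary key. This is an illustration in plain Python rather than the Hadoop API; the `id` and `deleted` field names are assumptions, and the delta is assumed to hold at most one (latest) row per key.

```python
def merge_partition(snapshot_rows, delta_rows, key_field="id", deleted_field="deleted"):
    """Merge two lists sorted on the primary key, preferring the delta."""
    merged = []
    i = j = 0
    while i < len(snapshot_rows) or j < len(delta_rows):
        snap = snapshot_rows[i] if i < len(snapshot_rows) else None
        delta = delta_rows[j] if j < len(delta_rows) else None
        if delta is None or (snap is not None and snap[key_field] < delta[key_field]):
            merged.append(snap)          # key exists only in the snapshot
            i += 1
            continue
        if snap is not None and snap[key_field] == delta[key_field]:
            i += 1                       # delta overrides the snapshot row
        j += 1
        if not delta.get(deleted_field, False):
            merged.append(delta)         # deleted rows are masked
    return merged


snapshot = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}, {"id": 4, "name": "d"}]
delta = [{"id": 2, "name": "b2"}, {"id": 3, "name": "c"}, {"id": 4, "deleted": True}]
view = merge_partition(snapshot, delta)
```

In this example row 2 is updated, row 3 is a new insert, and row 4 is masked because the delta marks it deleted.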
Snapshotting – Loaders (2)
[Diagram: /db/table/snapshot-0 (10 partitions, 10 Avro files: Part-0 ... Part-9) with _delta/Delta-1 (10 partitions, 2 Avro files: Part-0, Part-5, plus index files); Mapper-0 to Mapper-9 each read through the custom InputFormat]
• Delta-1.Part-0 contains partitions 0 to 4
• Delta-2.Part-5 contains partitions 5 to 9
• Snapshot-0.Part-0 contains partition 0
• Both sorted on primary key
Snapshotting – Compactor
• Required when partition size exceeds threshold
• Materializes Virtual Snapshot to Snapshot
  - With more partitions
• MR job with Reducer
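A compaction trigger and re-partitioning step might be sketched as follows. The sizing policy (enough partitions to fit the threshold, and always more than before) is a hypothetical choice for illustration; the slides only state that compaction runs when a partition exceeds a threshold and produces more partitions.

```python
import math
import zlib


def needs_compaction(partition_bytes, threshold_bytes):
    """True when any partition has grown past the threshold."""
    return max(partition_bytes) > threshold_bytes


def compacted_partition_count(partition_bytes, threshold_bytes):
    """Hypothetical policy: fit the threshold and always add partitions."""
    by_size = math.ceil(sum(partition_bytes) / threshold_bytes)
    return max(len(partition_bytes) + 1, by_size)


def compact(merged_rows, key_field, n):
    """Re-partition already merged rows, keeping each partition sorted."""
    parts = [[] for _ in range(n)]
    for row in sorted(merged_rows, key=lambda r: r[key_field]):
        parts[zlib.crc32(str(row[key_field]).encode("utf-8")) % n].append(row)
    return parts


sizes = [300, 250]
if needs_compaction(sizes, threshold_bytes=100):
    n = compacted_partition_count(sizes, threshold_bytes=100)
    new_snapshot = compact([{"id": i} for i in range(20)], "id", n)
```

The input to `compact` is the merged view of snapshot plus delta, so the output is a plain snapshot with no pending deltas.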
Operating billions of rows per day
• Dude, where’s my row?
  - Automatic data validation
• When data misses the bus
  - Handling late data
  - Look-back window
• Cluster downtime
  - Restartability
  - Active-active
  - Idempotent processing
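One way a look-back window and idempotent processing fit together is sketched below: every run rebuilds the output for the last few hours from scratch, so late events are picked up and a rerun after downtime produces the same result. The hourly granularity and the event fields are assumptions.

```python
def hours_to_process(current_hour, look_back):
    """Hours to (re)process: the current one plus a look-back window."""
    return list(range(max(0, current_hour - look_back), current_hour + 1))


def process(events, output, current_hour, look_back):
    """Rebuild each hourly output from scratch so reruns are idempotent."""
    for hour in hours_to_process(current_hour, look_back):
        output[hour] = sorted(e["id"] for e in events if e["hour"] == hour)
    return output


events = [{"id": "a", "hour": 5}, {"id": "b", "hour": 6}]
out = process(events, {}, current_hour=6, look_back=2)

events.append({"id": "late", "hour": 5})   # arrives after hour 5 was processed
out = process(events, out, current_hour=7, look_back=2)

# Running the same step again (e.g. after a restart) changes nothing.
rerun = process(events, dict(out), current_hour=7, look_back=2)
```

Overwriting whole hourly outputs, rather than appending to them, is what makes the rerun safe.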
Conclusion and Future Work
• Conclusion
  - Lumos: scalable ETL framework
  - Battle-tested in production
• Future Work
  - Unify internal and external data
  - Open source
Q & A
Questions?
Appendix

Bringing OLTP with OLAP: Lumos on Hadoop

Editor's Notes

  • #2: Today's talk is about scaling ETL in order to consolidate and democratize data and analytics on Hadoop at LinkedIn.
  • #3: Let’s start with the overall data ecosystem, then focus on the specific problem of integrating online data stores with Hadoop, and go over the solution.
  • #5: Members interact with the site apps and generate actions and data mutations, which are persisted in the log store and the online data stores. Espresso, MySQL and Oracle are the primary online data stores. Espresso is a home-grown, document-oriented, partitioned data store with transactional support. Kafka is used as the log store. Online data sources are periodically replicated to Hadoop for creating cubes and enrichments. Cubes are used externally on the site as well as internally in reports and insights for analysts (e.g. “Who viewed your profile”, campaign performance reports, member sign-up reports). Cubes are delivered via cube-serving engines. There are primarily three cube-serving stacks. Voldemort, a key-value store, is used to deliver static reports with pre-computed metrics. Pinot, a search technology, is used to deliver somewhat dynamic reports with pre-computed metrics (drill). Finally, the traditional BI stack, comprising TD + Tableau + MSTR, delivers insights to business users.
  • #6: Explain interactively which action generated which data, using a real use case. Tracking: user activity on the site turns into tracking data. Examples: PageView, AdClick. Append: each user activity generates new data. Immutable: once generated, the data does not change, but it grows over time. It is usually organized by time and accessed over a time range. Database: user-provided data stored in online stores. This data is mutable over time. Examples: Member Profile, Education. It is organized as a full table as of some time and accessed in full.
  • #8: The problem is simply replicating the data from the online stores to Hadoop. But LinkedIn has 300m members and generates a humongous amount of data. Fresh data directly impacts member engagement and business decision making.
  • #9: PROD is the data center that is accessible from outside; Hadoop is in the CORP data center.
  • #10: Deletes for compliance. Moving the data entirely puts load on the source system, the network and Hadoop resources.
  • #11: Commit time. Since tracking data is append-only, it is easier to handle and to arrange in time windows. DB data can have updates or deletes, and reflecting those on HDFS with low latency and optimal resource usage is a challenge.
  • #13: Talk about schema evolution.
  • #14: Talk about schema evolution.
  • #15: This is not an HDFS snapshot and not an HBase snapshot.
  • #17: Schema changes + rewrite the complete data. Sqoop: cross-colo database connections are not allowed; it may put load on the production databases. HBase: write the change logs and periodically do a snapshot and replicate; not all companies run HBase as part of the standard deployment; not clear if this will meet the low-latency requirement. Hive Streaming: looks similar to what we do; caveat: it only supports ORC.
  • #18: Change to Data Extract
  • #19: Bottom right
  • #20: TODO: cluster of databases and Relay. Reading off of Databus, with a picture. Checkpoint: SCN to time mapping. Backup slides towards the end.
  • #21, #22, #23, #27: DB dump format to Avro; Oracle data types; map-only job; field-level transformation; eliminate recursive schema; Avro schema attribute; JSON meta info; key and delta column; begin_date, end_date, drop_date, full_drop date; row counts.
  • #31: Change to Data Extract