Apache HBase at Airbnb
JINGWEI LU, LIYIN TANG, AND JASON ZHANG
Data Infrastructure at Airbnb

Batch Infrastructure
[Diagram. Components shown: Event Logs, MySQL Dumps, Kafka, Sqoop, Gold Cluster (HDFS, Hive, Yarn), Silver Cluster (HDFS, Hive, Yarn), ReAir, Spark Cluster (Spark), Airflow Scheduling, S3, Presto Cluster, AirPal, Caravel, Tableau.]

Streaming at Airbnb
[Diagram. Components shown: Event Logging, MySQL BINLOG, Spinaltap, Kafka (shown twice), Spark Streaming, HBase, Datadog, Druid, Cluster (HDFS, Hive, Yarn), Presto Cluster.]

Growing Pain

Stateless
[Diagram. Components shown: Source, DStream, DF, Computation, DF, Sink.]

Stateful
[Diagram. Components shown: Source, DStream, DF, Computation, DF, Sink1, Sink2, Sink N, State Storage, RDD.]

Multiple Streams
[Diagram. Components shown: several Source and DStream inputs, Align by Time, DataFrame, State Storage, and Process A through Process N, each with its own DataFrame and Sink1, Sink2, Sink3, …, SinkN.]

Streaming + Batch
[Diagram. Components shown: Source and DStream inputs, Align by Time, DataFrame, State Storage, and two pipelines, each labeled DataFrame, Process A, Sink1, Sink2, Sink3, …, SinkN.]

Simplify and Unify

AirStream Architecture
[Diagram. Components shown: Sources (Stream #1 … Stream #N, Hive Tables, HBase Tables), Virtual Table Views for Computation, Spark SQL, Customized Computation, Simple Config, Sinks, HBase, Services, Streaming Sources, Druid.]

AirStream Architecture
[Same diagram without the Simple Config label, annotated: Same Computation for Batch processing.]

Stateful

State Store
• Merge changes
• Provide fast lookup
• Fast persistent storage across streaming and batch jobs

Why HBase
• Rich Functionalities
• Rich Integration with Hadoop EcoSystem
• Easy Management
• Strong Community
• Reliable and Scalable

HBase State Store Operators in AirStream
• Full Table Scan
• Simple Aggregation
• Bulk Upload
• Key/Prefix Lookup
• Update

Computation DAG
[Diagram. Nodes shown: Input Data, Key Lookup, Left Outer Join, Result.]

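The nodes on this slide (input data, key lookup, left outer join, result) can be pictured with a small in-memory sketch. The dictionary below only stands in for the HBase state store, and the function and field names are hypothetical; none of this is AirStream's actual API.

```python
from typing import Dict, List, Optional


def left_outer_join_with_lookup(
    input_rows: List[dict],
    state: Dict[str, dict],
    key_field: str,
) -> List[dict]:
    """Join each input row with the state row found by key lookup.

    Input rows without a match are kept with `state` set to None,
    which is what makes this a left outer join and not an inner join.
    """
    result = []
    for row in input_rows:
        matched: Optional[dict] = state.get(row[key_field])  # key lookup
        result.append({**row, "state": matched})
    return result


# Illustrative stand-ins for the state table and one input batch.
STATE = {"user_1": {"bookings": 3}}
INPUT = [
    {"user": "user_1", "event": "search"},
    {"user": "user_2", "event": "search"},
]
JOINED = left_outer_join_with_lookup(INPUT, STATE, "user")
```

Keeping unmatched rows means an event for a key that has no state yet still reaches the result, so a later step can decide whether to create state for it.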
Key Space Design
• Hash partition key space for load balance
• Composite key for K -> V
• Support full key lookup
• Prefix lookup supported for all keys used in hash function

Hash | key1 | key2 | key3   (hash based on key prefix)
Hash | key1 | key2          (lookup based on key prefix: key1 = ‘value1’ and key2 = ‘value2’)

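A minimal sketch of the key layout above, assuming the hash is computed over key1 and key2 only. The bucket count, the separator, and the use of CRC32 are illustrative assumptions, and the sorted Python list merely imitates a table ordered by row key.

```python
import zlib
from bisect import bisect_left
from typing import List, Tuple

NUM_BUCKETS = 16  # assumed number of hash partitions
SEP = "\x00"      # assumed separator; must not occur inside a key part


def bucket(hash_parts: Tuple[str, ...]) -> str:
    """Fixed-width bucket id computed only from the hashed key prefix."""
    checksum = zlib.crc32(SEP.join(hash_parts).encode("utf-8"))
    return f"{checksum % NUM_BUCKETS:02d}"


def row_key(key1: str, key2: str, key3: str) -> str:
    """Full composite key: Hash | key1 | key2 | key3."""
    return SEP.join((bucket((key1, key2)), key1, key2, key3))


def lookup_prefix(key1: str, key2: str) -> str:
    """Prefix Hash | key1 | key2; computable because both hashed keys are given."""
    return SEP.join((bucket((key1, key2)), key1, key2)) + SEP


def prefix_scan(sorted_keys: List[str], prefix: str) -> List[str]:
    """Return every key that starts with `prefix` from a sorted key list."""
    start = bisect_left(sorted_keys, prefix)
    matches = []
    for key in sorted_keys[start:]:
        if not key.startswith(prefix):
            break
        matches.append(key)
    return matches
```

Because the bucket depends on key1 and key2 together, a lookup that supplies only key1 cannot recompute the hash. That is one way to read the bullet about prefix lookup being supported for all keys used in the hash function.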
Write Performance
• Partition based on key before write
• Use bulk upload for large volume update

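The first bullet can be illustrated with a sketch that groups records by a key-derived partition before anything is written, so that each batch targets one slice of the key space. The partition count and the CRC32 partitioner are assumptions for illustration; a real job would align partitions with the table's actual region boundaries.

```python
import zlib
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

NUM_PARTITIONS = 4  # assumed; stands in for the number of target regions

Record = Tuple[str, str]  # (row key, value)


def partition_of(row_key: str) -> int:
    """Map a row key to a partition id (hypothetical partitioner)."""
    return zlib.crc32(row_key.encode("utf-8")) % NUM_PARTITIONS


def partition_before_write(records: Iterable[Record]) -> Dict[int, List[Record]]:
    """Group records by partition and sort each group by row key."""
    batches: Dict[int, List[Record]] = defaultdict(list)
    for key, value in records:
        batches[partition_of(key)].append((key, value))
    return {pid: sorted(batch) for pid, batch in sorted(batches.items())}
```

Sorting inside each batch is included because a bulk upload of pre-built files generally expects rows in key order; for small updates the grouping alone may be enough.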
Case Study
Experiment realtime feedback
[Diagram. One AirStream job: Experiment Assignment Event, Update, HBase with TTL, Booking Event, Lookup, Druid, Datadog.]

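The flow on this slide (assignment events update a table with TTL, booking events look it up) can be sketched with an in-memory store. The class, the function names, and the explicit `now` arguments are hypothetical and chosen so the behaviour is easy to follow; they do not describe the real job.

```python
from typing import Dict, Optional, Tuple


class TtlStore:
    """Tiny in-memory stand-in for a table whose cells expire after a TTL."""

    def __init__(self, ttl_seconds: float) -> None:
        self.ttl_seconds = ttl_seconds
        self._cells: Dict[str, Tuple[float, str]] = {}

    def update(self, key: str, value: str, now: float) -> None:
        self._cells[key] = (now, value)

    def lookup(self, key: str, now: float) -> Optional[str]:
        cell = self._cells.get(key)
        if cell is None:
            return None
        written_at, value = cell
        if now - written_at >= self.ttl_seconds:
            del self._cells[key]  # expired: behave as if never written
            return None
        return value


def on_assignment(store: TtlStore, user_id: str, treatment: str, now: float) -> None:
    """Experiment assignment event: update the state store."""
    store.update(user_id, treatment, now)


def on_booking(store: TtlStore, user_id: str, now: float) -> Optional[Tuple[str, str]]:
    """Booking event: look up the assignment; None if absent or expired."""
    treatment = store.lookup(user_id, now)
    return None if treatment is None else (user_id, treatment)
```

The TTL keeps the state bounded: assignments that are too old to matter drop out without a separate cleanup job.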
Realtime Data Ingestion

Realtime Ingestion on HBase
[Diagram: Data Infrastructure. Components shown: MySQL, Analytical Events, Kafka, Spark Streaming, HBase, HDFS, Presto/Hive/Spark. Stage labels: Source, Ingest, Realtime Query, Snapshot, Batch Query.]

Access Data in HBase
[Diagram. Components shown: HBase, HDFS Snapshot, Hive, Presto, Spark SQL, Spark Streaming. Use cases: Batch Jobs, Interactive Query, Streaming.]
Table Mapping/Unified View on realtime data

Snapshot & Reseed
[Diagram. Components shown: HBase, HDFS, Snapshot (HFile Links), Bulk Upload.]

Case Study 1: Events Ingestion
[Diagram. Components shown: Kafka topics; Spark Executor 1, Executor 2, …, Executor K; HBase; DeDup; HDFS; Region1, Region2, …, Region M; Daily Snapshot; Realtime Query; Hive; Presto; Events Partition.]

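The slide only labels the HBase step "DeDup"; how duplicates are detected is not stated. One common approach, sketched here as an assumption, is to derive the row key from a unique event id so that a replayed event overwrites its own row and never creates a second one.

```python
from typing import Dict, Iterable


def ingest(table: Dict[str, dict], events: Iterable[dict]) -> int:
    """Write events keyed by event id and return how many rows were new.

    `event_id` is an assumed field; any stable unique identifier works.
    """
    new_rows = 0
    for event in events:
        row_key = event["event_id"]
        if row_key not in table:
            new_rows += 1
        table[row_key] = event  # idempotent: a replay rewrites the same row
    return new_rows
```

With this layout, reprocessing a batch after a failure leaves the table unchanged, which is the property a streaming job needs when delivery is at-least-once.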
Case Study 2: Streaming DB Export
[Diagram. Components shown: RDS (Table1, Table2, …, TableN); Kafka (Spinaltap.Table1, Spinaltap.Table2, …, Spinaltap.TableN); Spark Executor 1, Executor2, …, Executor K; HBase (Region1, Region2, …, Region M); HDFS (Region1, Region2, …, Region M); Daily Snapshot; Realtime Query.]

Case Study: Streaming DB Export
Rows                               CF: Columns   Version                     Value
<ShardKey><DB_TABLE_#1><PK_a=A>    id            Fri May 19 00:33:19 2016    101
<ShardKey><DB_TABLE_#1><PK_a=A>    city          Fri May 19 00:33:19 2016    San Francisco
<ShardKey><DB_TABLE_#1><PK_a=A>    city          Fri May 10 00:34:19 2016    New York
<ShardKey><DB_TABLE_#2><PK_a=A’>   id            Fri May 19 00:33:19 2016    1

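The table above can be read as one cell per (row key, column, version). A minimal sketch of turning a database row change into such cells follows; the shard computation, the key formatting, and the integer versions are assumptions, since the slide does not say how the ShardKey is derived.

```python
import zlib
from typing import Dict, List, NamedTuple, Optional

NUM_SHARDS = 8  # assumed shard count


class Cell(NamedTuple):
    row: str
    column: str
    version: int
    value: str


def make_row_key(table: str, primary_key: str) -> str:
    """Build <ShardKey><table><primary key>; the shard is a hash-derived prefix."""
    shard = zlib.crc32(f"{table}/{primary_key}".encode("utf-8")) % NUM_SHARDS
    return f"<{shard}><{table}><{primary_key}>"


def to_cells(table: str, primary_key: str,
             columns: Dict[str, str], version: int) -> List[Cell]:
    """One cell per changed column, all sharing the change's version."""
    row = make_row_key(table, primary_key)
    return [Cell(row, name, version, value) for name, value in sorted(columns.items())]


def read_latest(cells: List[Cell], row: str, column: str) -> Optional[str]:
    """Value with the highest version for (row, column), if any."""
    versions = [c for c in cells if c.row == row and c.column == column]
    return max(versions, key=lambda c: c.version).value if versions else None
```

Keeping older versions next to the newest one is what lets the same table serve both a current view and a view as of an earlier version.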
Case Study: Streaming DB Export
Binlog Order: TXN 1 (Commit_TS: 101), TXN 2 (Commit_TS: 102), TXN 3 (Commit_TS: 103), …, TXN N (Commit_TS: N’)

Case Study: Streaming DB Export
Binlog Order, labeled NTP: TXN 1 (Commit_TS: 101), TXN 2 (Commit_TS: 103), TXN 3 (Commit_TS: 102), …, TXN N (Commit_TS: N’)

Case Study: Streaming DB Export
Binlog Order: TXN 1 (Commit_TS: 101), TXN 2 (Commit_TS: 103), TXN 3 (Commit_TS: 102), …, TXN N (Commit_TS: N’)
Point-in-Time Restore on TS 102

Case Study: Streaming DB Export
Rows                               CF: Columns   Version   Value
<ShardKey><DB_TABLE_#1><PK_a=A>    id            bin100    101
<ShardKey><DB_TABLE_#1><PK_a=A>    city          bin101    San Francisco
<ShardKey><DB_TABLE_#1><PK_a=A>    city          bin102    New York
<ShardKey><DB_TABLE_#2><PK_a=A’>   id            bin100    1

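Why the version column moves from commit timestamps to binlog offsets can be shown with the numbers from the preceding slides, where TXN 3 follows TXN 2 in binlog order but carries an earlier commit timestamp (presumably from a clock adjustment, given the NTP label). The offsets 100 to 102 and the city values are illustrative assumptions.

```python
from typing import Callable, List, NamedTuple


class Txn(NamedTuple):
    name: str
    binlog_offset: int  # position in binlog order (assumed values)
    commit_ts: int      # commit timestamp as shown on the slides
    city: str           # illustrative column value written by the transaction


# Listed in binlog order; commit timestamps are 101, 103, 102.
BINLOG: List[Txn] = [
    Txn("TXN 1", 100, 101, "Seattle"),
    Txn("TXN 2", 101, 103, "San Francisco"),
    Txn("TXN 3", 102, 102, "New York"),
]


def visible_value(txns: List[Txn], version_of: Callable[[Txn], int]) -> str:
    """Value a reader sees when the highest version wins."""
    return max(txns, key=version_of).city


# Versioning by timestamp resurrects TXN 2's write over the later TXN 3.
BY_COMMIT_TS = visible_value(BINLOG, lambda t: t.commit_ts)
# Versioning by binlog offset agrees with the order the database applied.
BY_BINLOG_OFFSET = visible_value(BINLOG, lambda t: t.binlog_offset)
```

Under these assumptions the timestamp-versioned read returns a stale value, while the offset-versioned read matches the last transaction in the binlog.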
Case Study: Streaming DB Export
Rows                                          Version (Logical Offset)   Value
<ShardKey><DB_TABLE_#1><2016-05-23 23><100>   100                        mysql-bin.00000:100
<ShardKey><DB_TABLE_#1><2016-05-23 23><101>   101                        mysql-bin.00000:101
<ShardKey><DB_TABLE_#1><2016-05-23 23><103>   103                        mysql-bin.00000:103
<ShardKey><DB_TABLE_#1><2016-05-24 00><102>   102                        mysql-bin.00000:102

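The slide does not explain how this index table is queried. One plausible reading, sketched here purely as an assumption, is that a point-in-time restore selects the logical offsets whose hour bucket is not later than the target hour and then replays only those changes.

```python
from typing import List, NamedTuple


class IndexEntry(NamedTuple):
    table: str
    hour: str        # hour bucket, formatted "YYYY-MM-DD HH"
    offset: int      # logical offset, also used as the version
    binlog_pos: str  # binlog file and position


# The four rows shown on the slide.
INDEX: List[IndexEntry] = [
    IndexEntry("DB_TABLE_#1", "2016-05-23 23", 100, "mysql-bin.00000:100"),
    IndexEntry("DB_TABLE_#1", "2016-05-23 23", 101, "mysql-bin.00000:101"),
    IndexEntry("DB_TABLE_#1", "2016-05-23 23", 103, "mysql-bin.00000:103"),
    IndexEntry("DB_TABLE_#1", "2016-05-24 00", 102, "mysql-bin.00000:102"),
]


def offsets_up_to_hour(index: List[IndexEntry], table: str, hour: str) -> List[int]:
    """Logical offsets for `table` whose hour bucket is <= `hour`, in offset order.

    Plain string comparison is enough because the bucket format sorts
    lexicographically in time order.
    """
    return sorted(e.offset for e in index if e.table == table and e.hour <= hour)
```

For the target hour "2016-05-23 23" this selects offsets 100, 101 and 103 and leaves out 102, whose timestamp falls in the next hour even though it sits earlier in the binlog.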
Summary
• Scalable and Reliable
• Rich Stateful Computation
• Rich Integration with Hadoop EcoSystem
• Easy Operation


Editor's Notes

  • #4: Disaster recovery. High/Slow SLA job isolation.
  • #15: Slide on why stateful processing versus stateless.
  • #17: Use a diagram to show the operators.
  • #22: Realtime ingestion provides a fast feedback loop. Advanced monitoring infrastructure. Tracking changes instead of taking a full snapshot for the RDS dump.
  • #23: What is the goal of realtime ingestion? A fast feedback loop for experiments, to reduce the testing cycle; and a realtime view of the production database for many offline workloads (for example, machine learning).
  • #24: Table mapping provides a unified view for accessing realtime ingested data.
  • #25: Taking a snapshot with a scan takes 10-30 minutes per table, which does not scale. Linking and restoring takes 10 minutes, and all tables can be accessed afterward.
  • #27: Restoring a backup-based DB export takes 9-12 hours and is subject to AWS network conditions: long latency and fragile. We just need to track changes and apply them to a snapshot. This provides a near-realtime snapshot of the DB and unifies across MySQL and DynamoDB.