Edward Zhang,
Software Engineer Manager, Data Service & Solution (eBay)
ADBMS to Apache Spark
Auto Migration Framework
#SAISDD7
Who We Are
• Data Service & Solution team at eBay
• Responsible for big data processing and data application development
• Focus on batch auto migration and Spark core optimization
Why Migrate to Spark
• More complex big data processing needs
• Streaming, graph computation, and machine learning use cases
• Need for extreme performance optimization
What We Do
• ~90% batch workload auto migration
• Tool sets to enable manual migration
Agenda
• Auto Migration Scope
• Auto Migration Strategy
• Auto Migration Components
• Key Components
• Tool Sets
• Major Challenges
• Be part of the community
Auto Migration Scope
• ~5K target tables
• ~20K intermediate/working tables
• ~22 PB of data in target tables
• ~40 PB of relational data processed every day
• ~1 year timeline
Auto Migration Strategy
Auto Migration Framework
• Migration Planner
• Metadata
• Migration Engine
• Controller: Process Manager, Task Invoker, Task Monitor
• DDL Generator, SQL Converter, Job Optimizer, Pipeline Generator
• Release Assistant, Data Mover, Data Validator
Auto Migration Components
Migration Planner
• Analyze and identify auto migration candidates
• Determine the order of table migration
Metadata
• Define and collect the metadata that drives the auto migration engine
• Includes table profile, data lineage, job lineage, SQL file profile, and pipeline profile
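One plausible way to determine a table migration order from data lineage is a topological sort, so that each table is migrated only after the tables it reads from. The slides do not say which algorithm the planner uses; the lineage shape and the table names here are hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def migration_order(lineage):
    """Return table names ordered so every table follows its upstream tables.

    `lineage` maps a table name to the set of tables it reads from
    (a hypothetical shape for the data lineage metadata).
    Raises graphlib.CycleError if the lineage contains a cycle.
    """
    return list(TopologicalSorter(lineage).static_order())


# Hypothetical lineage: dw_orders is built from two staging tables.
lineage = {
    "stg_orders": set(),
    "stg_users": set(),
    "dw_orders": {"stg_orders", "stg_users"},
    "rpt_daily_sales": {"dw_orders"},
}
order = migration_order(lineage)
```

A cycle in the lineage surfaces as an exception instead of a silently wrong order, which is useful when the metadata is still being collected.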
Controller
• Manages the end-to-end migration process
• Includes sub-components such as the process manager, task invoker, and task monitor
DDL Generator
• A data modeler that generates Spark DDL for target tables, working tables, and views
• Also sets the table format, bucketing, and partitioning
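A minimal sketch of the DDL Generator's job: turning a table description into a Spark SQL `CREATE TABLE` statement with a format, partition columns, and buckets. The function name, its parameters, and the example table are assumptions for illustration, not the framework's actual interface.

```python
def spark_ddl(table, columns, fmt="parquet", partition_by=(), bucket_by=(), num_buckets=0):
    """Build a Spark SQL CREATE TABLE statement (data source syntax).

    `columns` is a list of (name, type) pairs. Partition and bucket columns
    must also appear in `columns`. All names here are hypothetical.
    """
    cols = ",\n  ".join(f"{name} {dtype}" for name, dtype in columns)
    ddl = [f"CREATE TABLE {table} (\n  {cols}\n)", f"USING {fmt}"]
    if partition_by:
        ddl.append(f"PARTITIONED BY ({', '.join(partition_by)})")
    if bucket_by and num_buckets > 0:
        ddl.append(f"CLUSTERED BY ({', '.join(bucket_by)}) INTO {num_buckets} BUCKETS")
    return "\n".join(ddl)


# Hypothetical target table, partitioned by date and bucketed by user.
ddl = spark_ddl(
    "dw.orders",
    [("order_id", "BIGINT"), ("user_id", "BIGINT"), ("dt", "DATE")],
    partition_by=["dt"],
    bucket_by=["user_id"],
    num_buckets=64,
)
print(ddl)
```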
SQL Converter
• Splits original SQL files into table transform and merge steps
• Parses the original ADBMS SQL into an abstract syntax tree and reassembles it as Spark SQL
• Applies special rules to handle SQL dialect differences and UDFs
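To give a feel for a dialect rule, here is a small sketch that rewrites calls to a source-dialect function into a Spark SQL expression. It is a string-level rewrite with parenthesis matching, which is much simpler than the AST-based conversion described above. The function name `ZERO_IF_NULL` and the `{arg}` template convention are hypothetical.

```python
import re


def rewrite_call(sql, name, template):
    """Rewrite every `name(<arg>)` call in `sql` using `template`.

    `template` uses `{arg}` for the original argument text. Nested calls are
    rewritten recursively. Limitation of this sketch: parentheses inside
    string literals are not handled.
    """
    pattern = re.compile(r"\b" + re.escape(name) + r"\s*\(", re.IGNORECASE)
    out, pos = [], 0
    while True:
        m = pattern.search(sql, pos)
        if m is None:
            out.append(sql[pos:])
            return "".join(out)
        # Walk forward to the parenthesis that closes this call.
        depth, j = 1, m.end()
        while j < len(sql) and depth > 0:
            if sql[j] == "(":
                depth += 1
            elif sql[j] == ")":
                depth -= 1
            j += 1
        if depth != 0:
            raise ValueError(f"unbalanced parentheses in call to {name}")
        arg = sql[m.end():j - 1]  # text between the call's parentheses
        out.append(sql[pos:m.start()])
        out.append(template.format(arg=rewrite_call(arg, name, template)))
        pos = j


converted = rewrite_call(
    "SELECT ZERO_IF_NULL(a + zero_if_null(b)) FROM t",
    "ZERO_IF_NULL",            # hypothetical source-dialect function
    "COALESCE({arg}, 0)",      # Spark SQL replacement expression
)
```

Working on a syntax tree avoids the literal-handling limitation noted in the docstring, which is one reason a parser-based converter scales better than text rules.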
Job Optimizer
• Pre-generates Spark job execution configurations based on table size and Spark cluster scale (typically spark.sql.shuffle.partitions)
• Leverages Spark Adaptive Execution to optimize the execution plan at run time
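A rough sketch of pre-generating `spark.sql.shuffle.partitions` from table size and cluster scale. The heuristic (about 128 MiB per partition, at least one partition per core, capped at an upper bound) and every default value are assumptions for illustration; the slides do not give the actual rule.

```python
import math


def shuffle_partitions(input_bytes, total_cores,
                       target_partition_bytes=128 * 1024**2,
                       max_partitions=8000):
    """Pick a shuffle partition count from input size and cluster cores.

    Hypothetical heuristic: size-based count, never fewer than the number
    of cores, and capped at `max_partitions` for large inputs.
    """
    by_size = math.ceil(input_bytes / target_partition_bytes)
    return max(total_cores, min(by_size, max_partitions))


def job_conf(input_bytes, total_cores):
    """Return the Spark configuration entries to attach to a job."""
    return {"spark.sql.shuffle.partitions": str(shuffle_partitions(input_bytes, total_cores))}


# A 100 GiB input on a 200-core cluster.
conf = job_conf(100 * 1024**3, total_cores=200)
```

A static value like this is only a starting point; adaptive execution can still adjust the plan while the job runs.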
Pipeline Generator
• Generates the workflow that sets the execution steps and schedule of the Spark SQL files
Release Assistant
• Pushes code to the production environment and the GitHub repo, and handles table creation ...
Data Mover
• Moves data across platforms, for snapshot data preparation on DEV and historical data initialization on PROD
Data Validator
• Cross-platform data checksums on both DEV and PROD
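A cross-platform checksum has to ignore row order, since the two platforms return rows in different orders. The sketch below sums per-row SHA-256 digests modulo 2^256 and also returns the row count. The row encoding is an assumption; in practice values would first need normalizing (for example timestamp and decimal formats) so both platforms render them identically.

```python
import hashlib


def table_checksum(rows):
    """Return (row_count, hex_checksum) for an iterable of row tuples.

    Per-row digests are added modulo 2**256, so the result does not depend
    on row order, and duplicate rows do not cancel out as they would with XOR.
    """
    total, count = 0, 0
    for row in rows:
        # Hypothetical canonical encoding: NULL marker vs. "v"-prefixed value,
        # joined by a separator unlikely to appear in the data.
        canonical = "\x1f".join("\\N" if v is None else f"v{v}" for v in row)
        digest = hashlib.sha256(canonical.encode("utf-8")).digest()
        total = (total + int.from_bytes(digest, "big")) % (1 << 256)
        count += 1
    return count, f"{total:064x}"


source_rows = [(1, "x", None), (2, "y", "z")]
spark_rows = [(2, "y", "z"), (1, "x", None)]  # same data, different order
matches = table_checksum(source_rows) == table_checksum(spark_rows)
```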
Key Components
• Metadata
• SQL Converter
Metadata - Overview
Neo4j, MySQL
Table Profile, SQL File Profile, Pipeline Profile, Data Lineage, Job Lineage
Metadata – Data Lineage
SQL Converter - Overview
SQL Converter – Conversion Rules
• Split original SQL files into table transformation and final table merge
• Identify ACID steps (merge update/delete/insert into one insert-overwrite step)
• Multiple update/delete cases: store the intermediate results in a temp view and do a single final merge
• Special handling for cases like case sensitivity, date/timestamp calculations, column name aliases …
• Adapt to known Spark issues
• Internal function & UDF translation
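The ACID rule above can be pictured as follows: instead of applying UPDATE, DELETE, and INSERT statements one by one, the converted job computes the full new contents of the table once and writes them in a single overwrite. The sketch models that with plain Python rows; the function name, the row shape, and the `key` column are hypothetical.

```python
def rebuild_table(current_rows, upserts, deleted_keys, key="id"):
    """Return the complete new table contents for one overwrite step.

    Rows whose key is updated or deleted are dropped from the current data,
    then the updated and newly inserted rows are appended. If a key appears
    in both `upserts` and `deleted_keys`, the upsert wins in this sketch.
    """
    replaced = {row[key] for row in upserts} | set(deleted_keys)
    kept = [row for row in current_rows if row[key] not in replaced]
    return kept + list(upserts)


current = [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}, {"id": 3, "v": "c"}]
upserts = [{"id": 2, "v": "B"}, {"id": 4, "v": "d"}]  # one update, one insert
new_rows = rebuild_table(current, upserts, deleted_keys=[3])
```

When a source script contains several update/delete steps, the same idea applies to each intermediate result, with only the final result written to the target table.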
SQL Converter – Sample
Tool Sets
• DDL Generator
• SQL Converter
• SQL Optimizer
• Pipeline Generator
• Release Assistant
• Data Mover
• Data Validator
• + Dev Suite
Major Challenges
• Metadata Definition & Collection
- You do not know what you do not know
• Data Validation
- Upstream data quality issues
- Differences in SQL behavior or data format on Spark
• Non-SQL Jobs
- Logic in shell scripts or pipeline command lines cannot be covered
Be part of the community
~50 issues reported to the community during migration
Case-insensitive field resolution
• SPARK-25132 Case-insensitive field resolution when reading from Parquet
• SPARK-25175 Field resolution should fail if there's ambiguity for ORC native reader
• SPARK-25207 Case-insensitive field resolution for filter pushdown when reading Parquet
Parquet filter pushdown
• SPARK-23727 Support DATE predict push down in parquet
• SPARK-24716 Refactor ParquetFilters
• SPARK-24706 Support ByteType and ShortType pushdown to parquet
• SPARK-24549 Support DecimalType push down to the parquet data sources
• SPARK-24718 Timestamp support pushdown to parquet data source
• SPARK-24638 StringStartsWith support push down
• SPARK-17091 Convert IN predicate to equivalent Parquet filter
UDF Improvement
• SPARK-23900 format_number udf should take user specifed format as argument
• SPARK-23903 Add support for date extract
• SPARK-23905 Add UDF weekday
Bugs
• SPARK-24076 very bad performance when shuffle.partition = 8192
• SPARK-24556 ReusedExchange should rewrite output partitioning also when child's partitioning is RangePartitioning
• SPARK-25084 "distribute by" on multiple columns may lead to codegen issue
• SPARK-25368 Incorrect constraint inference returns wrong result
Q & A
Thank You!

Analytical DBMS to Apache Spark Auto Migration Framework with Edward Zhang and Lipeng Zhu
