SQOOP on SPARK
for Data Ingestion
Veena Basavaraj & Vinoth Chandar
@Uber
Currently @Uber on streaming systems. @Cloudera on ingestion for Hadoop. @Linkedin on front-end service infra.
Currently @Uber, focused on building a real-time pipeline for ingestion to Hadoop. @Linkedin, lead on Voldemort. In the past, worked on log-based replication, HPC and stream processing.
Agenda
• Sqoop for Data Ingestion
• Why Sqoop on Spark?
• Sqoop Jobs on Spark
• Insights & Next Steps
Sqoop Before
SQL -> HADOOP
Data Ingestion
• Data ingestion needs have evolved
– Non-SQL-like data sources
– Messaging systems as data sources
– Multi-stage pipelines
Sqoop Now
• Generic data transfer service
– FROM ANY source
– TO ANY target
FROM    TO
MYSQL   KAFKA
HDFS    MONGO
FTP     HDFS
KAFKA   MEMSQL
Sqoop How?
• Connectors represent pluggable data sources
• Connectors are configurable
– LINK configs
– JOB configs
Sqoop Connector API
Source -> partition() -> extract() -> load() -> Target
**No Transform (T) stage yet!
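The partition / extract / load stages named on this slide can be sketched in plain Java. This is only an illustration: `FromConnector`, `ToConnector`, `RangeSource`, `ListTarget` and `ConnectorSketch` are hypothetical names invented here, not the real Sqoop connector interfaces, and the driver loop runs sequentially where an execution engine would run each partition in parallel.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified stand-ins for the stages named on the slide.
// These are NOT the real Sqoop connector interfaces.
interface FromConnector<P, R> {
    List<P> partition(int maxPartitions); // split the source into units of work
    List<R> extract(P partition);         // read the records of one partition
}

interface ToConnector<R> {
    void load(List<R> records);           // write one batch of records to the target
}

// Toy source: the integers [0, size), split into contiguous ranges.
class RangeSource implements FromConnector<int[], Integer> {
    private final int size;

    RangeSource(int size) { this.size = size; }

    public List<int[]> partition(int maxPartitions) {
        if (maxPartitions < 1) throw new IllegalArgumentException("maxPartitions must be >= 1");
        List<int[]> parts = new ArrayList<>();
        int step = (size + maxPartitions - 1) / maxPartitions; // ceiling division
        for (int start = 0; start < size; start += step) {
            parts.add(new int[] {start, Math.min(start + step, size)});
        }
        return parts;
    }

    public List<Integer> extract(int[] range) {
        List<Integer> out = new ArrayList<>();
        for (int i = range[0]; i < range[1]; i++) out.add(i);
        return out;
    }
}

// Toy target: collects everything in memory.
class ListTarget implements ToConnector<Integer> {
    final List<Integer> stored = new ArrayList<>();

    public void load(List<Integer> records) { stored.addAll(records); }
}

class ConnectorSketch {
    // Drives partition -> extract -> load one partition at a time.
    static <P, R> void run(FromConnector<P, R> from, ToConnector<R> to, int maxPartitions) {
        for (P p : from.partition(maxPartitions)) {
            to.load(from.extract(p));
        }
    }

    public static void main(String[] args) {
        ListTarget target = new ListTarget();
        run(new RangeSource(10), target, 3);
        System.out.println(target.stored); // the ten integers 0..9, in order
    }
}
```

Because the source and target only meet through the record type `R`, any source can be paired with any target, which is the FROM ANY / TO ANY idea in miniature.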
It turns out…
• MapReduce is slow!
• We need Connector APIs to support (T) transformations, not just EL
• Good news! The execution engine is also pluggable
Why Apache Spark?
• Why not? ETL can be expressed as Spark jobs
• Faster than MapReduce
• Growing community embracing Apache Spark
Why Not Use Spark Data Sources?
Sure we can! But …
• Spark Data Sources are a recent addition
• Run MR Sqoop jobs on Spark with a simple config change
• Leverage incremental EL & job management within Sqoop
Sqoop on Spark
• Creating a Job
• Job Submission
• Job Execution
Sqoop Job API
• Create Sqoop Job
– Create FROM and TO job configs
– Create JOB associating FROM and TO configs
• SparkContext holds Sqoop Jobs
• Invoke SqoopSparkJob.execute(conf, context)
Spark Job Submission
• We explored a few options:
– Invoke Spark in-process within the Sqoop Server to execute the job
– Use the Remote Spark Context used by Hive on Spark to submit
– Sqoop Job as a driver for the spark-submit command
Spark Job Submission
• Build an "uber.jar" with the driver and all the Sqoop dependencies
• Programmatically using the Spark Yarn Client (non-public), or directly via the command line, submit the driver program to yarn client/
• bin/spark-submit --class org.apache.sqoop.spark.SqoopJDBCHDFSJobDriver --master yarn /path/to/uber.jar --confDir /path/to/sqoop/server/conf/ --jdbcString jdbc://myhost:3306/test --u uber --p hadoop --outputDir hdfs://path/to/output --numE 4 --numL 4
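For the command-line route, the argument list can be assembled in code before handing it to a process launcher. The sketch below is a rough illustration only: `SubmitCommandSketch` and `build` are hypothetical names, and it assumes the driver's own flags take a `--` prefix (the slide's dashes were mangled in extraction). It only builds and prints the command; it does not call any Spark API.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class SubmitCommandSketch {
    // Builds the argument list for bin/spark-submit: Spark options first,
    // then the application jar, then the driver's own arguments.
    // Assumption: driver flags are written as "--name value".
    static List<String> build(String mainClass, String master, String jar,
                              Map<String, String> driverArgs) {
        List<String> cmd = new ArrayList<>();
        cmd.add("bin/spark-submit");
        cmd.add("--class");
        cmd.add(mainClass);
        cmd.add("--master");
        cmd.add(master);
        cmd.add(jar);
        for (Map.Entry<String, String> e : driverArgs.entrySet()) {
            cmd.add("--" + e.getKey());
            cmd.add(e.getValue());
        }
        return cmd;
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps the flags in insertion order.
        Map<String, String> driverArgs = new LinkedHashMap<>();
        driverArgs.put("confDir", "/path/to/sqoop/server/conf/");
        driverArgs.put("numE", "4");
        driverArgs.put("numL", "4");
        List<String> cmd = build("org.apache.sqoop.spark.SqoopJDBCHDFSJobDriver",
                "yarn", "/path/to/uber.jar", driverArgs);
        System.out.println(String.join(" ", cmd));
        // To actually launch it: new ProcessBuilder(cmd).inheritIO().start();
    }
}
```

Keeping the command as a list of tokens rather than one string avoids shell-quoting problems when a path or JDBC string contains spaces or special characters.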
Spark Job Execution
MySQL -> partitionRDD -> .map() -> extractRDD -> .map() -> loadRDD.collect()
Spark Job Execution
SqoopSparkJob.execute(…)
// 1. Compute the partitions and distribute them, one per RDD partition
List<Partition> sp = getPartitions(request, numMappers);
JavaRDD<Partition> partitionRDD = sc.parallelize(sp, sp.size());
// 2. Extract the records of each partition
JavaRDD<List<IntermediateDataFormat<?>>> extractRDD = partitionRDD.map(new SqoopExtractFunction(request));
// 3. Load the extracted records; collect() triggers the job
extractRDD.map(new SqoopLoadFunction(request)).collect();
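The slides do not show what `getPartitions(request, numMappers)` does internally. One common approach for a JDBC source is to split the range of a numeric key into near-equal contiguous pieces; the sketch below shows that idea only and is not the Sqoop implementation. `RangePartitionSketch`, `KeyRange` and `split` are hypothetical names, and the key range is assumed to be known up front.

```java
import java.util.ArrayList;
import java.util.List;

class RangePartitionSketch {
    // One unit of extract work: keys in [lo, hi], both inclusive.
    static final class KeyRange {
        final long lo;
        final long hi;

        KeyRange(long lo, long hi) { this.lo = lo; this.hi = hi; }

        @Override public String toString() { return "[" + lo + ", " + hi + "]"; }
    }

    // Splits [min, max] into at most n contiguous ranges of near-equal width.
    static List<KeyRange> split(long min, long max, int n) {
        if (n < 1 || max < min) throw new IllegalArgumentException("bad range or n");
        long total = max - min + 1;
        long parts = Math.min(n, total);   // never create empty ranges
        long base = total / parts;         // minimum width of each range
        long extra = total % parts;        // the first `extra` ranges get one more key
        List<KeyRange> out = new ArrayList<>();
        long lo = min;
        for (long i = 0; i < parts; i++) {
            long width = base + (i < extra ? 1 : 0);
            out.add(new KeyRange(lo, lo + width - 1));
            lo += width;
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(split(1, 10, 4)); // prints [[1, 3], [4, 6], [7, 8], [9, 10]]
    }
}
```

Even-width ranges only give balanced extractors when the keys are roughly uniformly distributed; a skewed key column would need a different split strategy.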
Spark Job Execution
MySQL
1. Compute partitions
2. Extract from MySQL: .map()
3. Repartition to limit files on HDFS: .repartition()
4. Load into HDFS: .mapPartitions()
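Why repartitioning limits the number of output files can be shown without Spark: if each partition reaching the load stage writes one file, redistributing the extracted records into a fixed number of partitions fixes the file count. The following is a plain-Java simplification under that assumption, not Spark's shuffle; `RepartitionSketch` and `repartition` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class RepartitionSketch {
    // Redistributes records from any number of input partitions into exactly
    // numOutputs partitions, round-robin, so a load stage that writes one
    // file per partition produces numOutputs files.
    static <T> List<List<T>> repartition(List<List<T>> inputs, int numOutputs) {
        if (numOutputs < 1) throw new IllegalArgumentException("numOutputs must be >= 1");
        List<List<T>> outputs = new ArrayList<>();
        for (int i = 0; i < numOutputs; i++) outputs.add(new ArrayList<>());
        int next = 0;
        for (List<T> part : inputs) {
            for (T record : part) {
                outputs.get(next).add(record);
                next = (next + 1) % numOutputs;
            }
        }
        return outputs;
    }

    public static void main(String[] args) {
        // Four extract partitions of uneven size.
        List<List<Integer>> extracted = Arrays.asList(
                Arrays.asList(1, 2, 3), Arrays.asList(4, 5),
                Arrays.asList(6), Arrays.asList(7, 8));
        System.out.println(repartition(extracted, 2)); // prints [[1, 3, 5, 7], [2, 4, 6, 8]]
    }
}
```

Round-robin placement also evens out partition sizes, at the cost of moving every record, which mirrors the trade-off of a full shuffle.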
Micro Benchmark: MySQL to HDFS
[Chart: table w/ 300K records, numExtractors = numLoaders]
Micro Benchmark: MySQL to HDFS
[Chart: table w/ 2.8M records, numExtractors = numLoaders; annotated "good partitioning!!"]
What was Easy?
• NO changes to the Connector API required
• Inbuilt support for Standalone and Yarn Cluster mode for quick end-to-end testing and faster iteration
• Scheduling Spark Sqoop jobs via Oozie
What was not Easy?
• No clean public Spark job submit API; we used the Yarn UI for job status and health
• A bunch of Sqoop core classes, such as IDF (IntermediateDataFormat), had to be made serializable
• Managing Hadoop and Spark dependencies together in Sqoop caused some pain
Next Steps!
• Explore alternative ways for Spark Sqoop job submission with Spark 1.4 additions
• Connector Filter API (filter, data masking)
• SQOOP-1532
– https://github.com/vybs/sqoop-on-spark
Sqoop Connector ETL
Source -> partition() -> extract() -> Transform() -> load() -> Target
**With Transform (T) stage!
Questions!
• Thanks to the folks @Cloudera and @Uber!
• You can reach us @vybs, @byte_array
