From Oracle to Hadoop: Unlocking Hadoop for Your RDBMS with Apache Sqoop and Other Tools
Guy Harrison, David Robson, Kate Ting 
{guy.harrison, david.robson}@software.dell.com, 
kate@cloudera.com 
October 16, 2014
About Guy, David, & Kate 
Guy Harrison @guyharrison 
- Executive Director of R&D @ Dell 
- Author of Oracle Performance Survival Guide & MySQL Stored Procedure Programming 
David Robson @DavidR021 
- Principal Technologist @ Dell 
- Sqoop Committer, Lead on Toad for Hadoop & OraOop 
Kate Ting @kate_ting 
- Technical Account Mgr @ Cloudera 
- Sqoop Committer/PMC, Co-author of Apache Sqoop Cookbook
RDBMS and Hadoop
• The relational database reigned supreme for more than two decades
• Hadoop and other non-relational tools have overthrown that hegemony
• We are unlikely to return to a “one size fits all” model based on Hadoop
- Though some will try :)
• For the foreseeable future, enterprise information architectures will include relational and non-relational stores
Scenarios 
1. We need to access RDBMS to make sense of Hadoop data
[Diagram labels: Weblogs, Flume, Products, RDBMS, SQOOP, HDFS, YARN/MR1, Analytic output]
Scenarios 
1. Reference data is in the RDBMS
2. We want to run analysis outside of the RDBMS
[Diagram labels: Products, Sales, RDBMS, SQOOP, HDFS, YARN/MR1, Analytic output]
Scenarios 
1. Reference data is in the RDBMS
2. We want to run analysis outside of the RDBMS
3. Feeding YARN/MR output into RDBMS
[Diagram labels: Weblogs, Flume, HDFS, YARN/MR1, Analytic output, SQOOP, Weblog Summary, RDBMS]
Scenarios 
1. We need to access RDBMS to make sense of Hadoop data
2. We want to use Hadoop to analyse RDBMS data
3. Hadoop output belongs in RDBMS Data warehouse
4. We archive old RDBMS data to Hadoop
[Diagram labels: BI platform, SQL, Sales, RDBMS, SQOOP, HQL, Old Sales, HDFS]
SQOOP
• SQOOP was created in 2009 by Aaron Kimball as a means of moving data between SQL databases and Hadoop
• It provided a generic implementation for moving data
• It also provided a framework for implementing database-specific optimized connectors
How SQOOP works (import) 
[Diagram labels: RDBMS (table metadata, table data), SQOOP, Table.java, Hive DDL, map tasks (DataDrivenDBInputFormat, FileOutputFormat), HDFS files, Hive table]
SQOOP & Oracle
SQOOP issues with Oracle
• SQOOP uses primary key ranges to divide up data between mappers
• However, deletes hit older key values harder, making key ranges unbalanced
• Data is almost never arranged on disk in key order, so index scans collide on disk
• Load is unbalanced, and IO block requests far exceed the number of blocks in the table
[Diagram: Oracle table on disk; two mappers, each with its own Oracle session, range-scan index blocks, one for ID > 0 and ID < MAX/2, the other for ID > MAX/2]
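The imbalance described on this slide can be illustrated with a minimal sketch. It assumes an equal-width split of the primary key range between mappers and a hypothetical table in which most of the older rows have been deleted; the function names and data are illustrative only and are not Sqoop's actual implementation.

```python
def split_key_range(min_id, max_id, num_mappers):
    """Divide [min_id, max_id] into num_mappers contiguous, equal-width key ranges."""
    width = (max_id - min_id + 1) / num_mappers
    bounds = [min_id + round(i * width) for i in range(num_mappers)] + [max_id + 1]
    return [(bounds[i], bounds[i + 1] - 1) for i in range(num_mappers)]


def rows_per_range(ids, ranges):
    """Count how many surviving rows fall into each mapper's key range."""
    return [sum(lo <= i <= hi for i in ids) for lo, hi in ranges]


# Hypothetical table: IDs 1..1000, where 90% of the older half has been deleted.
surviving_ids = [i for i in range(1, 1001) if i > 500 or i % 10 == 0]

ranges = split_key_range(1, 1000, 2)
counts = rows_per_range(surviving_ids, ranges)
print(ranges)  # key range handed to each mapper
print(counts)  # rows each mapper actually has to move
```

Both mappers receive key ranges of the same width, yet in this made-up data set the second mapper moves ten times as many rows as the first.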
Other problems
• Oracle might run each mapper using a full scan – clobbering the database
• Oracle might run each mapper in parallel – clobbering the database
• Sqoop may clobber the database cache
[Chart: Elapsed time (s) vs. number of mappers]
[Chart: Database load, database time (s) vs. number of mappers]
High speed connector design
• Partition data based on physical storage
• By-pass Oracle buffering
• By-pass Oracle parallelism
• Do not require or use indexes
• Never read the same data block more than once
• Support Oracle datatypes
[Diagram labels: Oracle table, three Hadoop mappers each with its own Oracle session, HDFS]
Imports (Oracle->Hadoop)
• Uses Oracle block/extent map to equally divide IO
• Uses Oracle direct path (non-buffered) IO for all reads
• Round-robin, sequential or random allocation
• All mappers get an equal number of blocks & no block is read twice
• If table is partitioned, each mapper can work on a separate partition – results in partitioned output
[Diagram labels: Oracle table, three Hadoop mappers each with its own Oracle session, HDFS]
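The three allocation strategies named on this slide can be sketched as follows. This is a simplified illustration rather than the connector's code: the extent map is a made-up list of block ranges, and `allocate_chunks` and its parameters are hypothetical names.

```python
import random


def allocate_chunks(chunks, num_mappers, method="roundrobin", seed=0):
    """Assign every chunk of blocks to exactly one mapper."""
    assignment = [[] for _ in range(num_mappers)]
    if method == "roundrobin":
        # Deal chunks out like cards: 0, 1, 2, 0, 1, 2, ...
        for i, chunk in enumerate(chunks):
            assignment[i % num_mappers].append(chunk)
    elif method == "sequential":
        # Each mapper gets one contiguous run of chunks.
        per_mapper = -(-len(chunks) // num_mappers)  # ceiling division
        for i, chunk in enumerate(chunks):
            assignment[i // per_mapper].append(chunk)
    elif method == "random":
        # Shuffle first, then deal out, so the counts stay balanced.
        shuffled = list(chunks)
        random.Random(seed).shuffle(shuffled)
        for i, chunk in enumerate(shuffled):
            assignment[i % num_mappers].append(chunk)
    else:
        raise ValueError(f"unknown method: {method}")
    return assignment


# Hypothetical extent map: 12 chunks of 128 blocks, as (first_block, last_block).
chunks = [(start, start + 127) for start in range(0, 12 * 128, 128)]

for method in ("roundrobin", "sequential", "random"):
    plan = allocate_chunks(chunks, 3, method)
    print(method, [len(mapper_chunks) for mapper_chunks in plan])
```

Whichever strategy is chosen, each mapper ends up with the same number of chunks and no chunk is assigned twice, which is the property the slide emphasizes.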
Exports (Hadoop->Oracle)
• Optionally leverages Oracle partitions and temporary tables for parallel writes
• Performs MERGE into Oracle table (updates existing rows, inserts new rows)
• Optionally uses Oracle NOLOGGING (faster but unrecoverable)
[Diagram labels: Oracle table, three Hadoop mappers each with its own Oracle session, HDFS]
Import – Oracle to Hadoop
• When data is unclustered (randomly distributed by PK), old SQOOP scales poorly
• Clustered data shows better scalability but is still much slower than the direct approach
• New SQOOP typically outperforms old SQOOP by 5-20 times
• Limiting factors we've seen:
- Data IO bandwidth, or
- Network out of DB, or
- Hadoop CPU
[Chart: Elapsed time (s) vs. number of mappers; series: direct=false (unclustered data), direct=false (clustered data), direct=true]
Import - Database overhead
• As you increase mappers in old Sqoop, database load increases rapidly
- (sometimes non-linearly)
• In new Sqoop, queuing occurs only after IO bandwidth is exceeded
[Chart: DB time (minutes) vs. number of mappers; series: Sqoop, Direct]
Export – Hadoop to Oracle
• On export, old SQOOP would hit the database writer bottleneck early on and fail to parallelize
• New SQOOP uses partitioning and direct path inserts
• Typically bottlenecks on write IO on Oracle side
[Chart: Elapsed time (minutes) vs. number of mappers; series: Sqoop, Direct]
Reduction in database load
• 45% reduction in DB CPU
• 83% reduction in elapsed time
• 90% reduction in total database time
• 99.9% reduction in database IO
8 node Hadoop cluster, 1B rows, 310GB
[Chart: % reduction]
- CPU time: 55.31
- Elapsed time: 83.45
- DB time: 90.59
- IO requests: 99.98
- IO time: 99.28
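A percentage reduction can be converted into an equivalent speedup factor, which makes these figures easier to compare with the "5x-20x" claim elsewhere in the deck. The sketch below is plain arithmetic on two of the chart values (elapsed time and DB time); the function name is illustrative.

```python
def speedup_from_reduction(percent_reduction):
    """Convert 'X% less time' into 'N times faster': N = 1 / (1 - X/100)."""
    return 1.0 / (1.0 - percent_reduction / 100.0)


for metric, pct in [("Elapsed time", 83.45), ("DB time", 90.59)]:
    factor = speedup_from_reduction(pct)
    print(f"{metric}: {pct}% reduction is about {factor:.1f}x")
```

On these numbers, the 83.45% drop in elapsed time corresponds to roughly a 6x speedup, and the 90.59% drop in database time to roughly 10.6x.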
Replication
• No matter how fast we make SQOOP, it’s a drag to have to run a SQOOP job before every Hadoop job.
• Replicating data into Hadoop cuts down on SQOOP overhead on both sides and avoids stale data.
Shareplex® for Oracle and Hadoop
Sqoop 1.4.5 Summary 
Sqoop 1.4.5 without --direct | Sqoop 1.4.5 with --direct
Minimal privileges required | Access to DBA views required
Works on most object types: e.g. IOT | 5x-20x faster performance on tables
Favors Sqoop terminology | Favors Oracle terminology
Database load increases non-linearly | Up to 99% reduction in database IO
Future of SQOOP
Sqoop 1 Import Architecture 
sqoop import \
--connect jdbc:mysql://mysql.example.com/sqoop \
--username sqoop --password sqoop \
--table cities
Sqoop 1 Export Architecture 
sqoop export \
--connect jdbc:mysql://mysql.example.com/sqoop \
--username sqoop --password sqoop \
--table cities \
--export-dir /temp/cities
Sqoop 1 Challenges
• Concerns with usability
- Cryptic, contextual command line arguments
• Concerns with security
- Client access to Hadoop bin/config, DB
• Concerns with extensibility
- Connectors tightly coupled with data format
Sqoop 2 Design Goals
• Ease of use
- REST API and Java API
• Ease of security
- Separation of responsibilities
• Ease of extensibility
- Connector SDK, focus on pluggability
Ease of Use 
Sqoop 1 Sqoop 2 
sqoop import \
-Dmapred.child.java.opts="-Djava.security.egd=file:///dev/urandom" \
-Ddfs.replication=1 \
-Dmapred.map.tasks.speculative.execution=false \
--num-mappers 4 \
--hive-import --hive-table CUSTOMERS --create-hive-table \
--connect jdbc:oracle:thin:@//localhost:1521/g12c \
--username OPSG --password opsg --table OPSG.CUSTOMERS \
--target-dir CUSTOMERS.CUSTOMERS
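One way to tame a command like this is to assemble it programmatically. The sketch below is an illustration only: `build_sqoop_import` and the password file path are hypothetical, while the flags themselves are existing Sqoop 1 options. It relies on two Sqoop 1 conventions: generic Hadoop `-D` properties must come before tool-specific arguments, and `--password-file` can replace an inline `--password`.

```python
import shlex


def build_sqoop_import(connect, username, table, hive_table,
                       num_mappers=4, hadoop_props=None, password_file=None):
    """Return the argv list for a Sqoop 1 Hive import (hypothetical helper)."""
    argv = ["sqoop", "import"]
    # Generic Hadoop options (-D...) go right after the tool name,
    # before any tool-specific argument such as --connect.
    for key, value in (hadoop_props or {}).items():
        argv.append(f"-D{key}={value}")
    argv += ["--num-mappers", str(num_mappers)]
    argv += ["--hive-import", "--hive-table", hive_table, "--create-hive-table"]
    argv += ["--connect", connect, "--username", username]
    if password_file:
        # Read the password from a protected file rather than passing it inline.
        argv += ["--password-file", password_file]
    argv += ["--table", table]
    return argv


argv = build_sqoop_import(
    connect="jdbc:oracle:thin:@//localhost:1521/g12c",
    username="OPSG",
    table="OPSG.CUSTOMERS",
    hive_table="CUSTOMERS",
    hadoop_props={
        "dfs.replication": 1,
        "mapred.map.tasks.speculative.execution": "false",
    },
    password_file="/user/opsg/.sqoop-password",  # hypothetical path
)
print(shlex.join(argv))
```

Building the argument list in one place keeps the ordering rule out of the operator's head and avoids putting the password on the command line, which touches on both the usability and the security concerns raised for Sqoop 1.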
Ease of Security 
Sqoop 1 Sqoop 2 
sqoop import \
-Dmapred.child.java.opts="-Djava.security.egd=file:///dev/urandom" \
-Ddfs.replication=1 \
-Dmapred.map.tasks.speculative.execution=false \
--num-mappers 4 \
--hive-import --hive-table CUSTOMERS --create-hive-table \
--connect jdbc:oracle:thin:@//localhost:1521/g12c \
--username OPSG --password opsg --table OPSG.CUSTOMERS \
--target-dir CUSTOMERS.CUSTOMERS
• Role-based access to connection objects 
• Prevents misuse and abuse 
• Administrators create, edit, delete 
• Operators use
Ease of Extensibility 
Sqoop 1 Sqoop 2 
Tight Coupling 
• Connectors fetch and store data from db
• Framework handles serialization, format conversion, integration
Takeaway
• Apache Sqoop
- Bulk data transfer tool between external structured datastores and Hadoop
• Sqoop 1.4.5 now with a --direct parameter option for Oracle
- 5x-20x performance improvement on Oracle table imports
• Sqoop 2
- Ease of use, security, extensibility
Questions? 
Guy Harrison @guyharrison 
David Robson @DavidR021 
Kate Ting @kate_ting 
Visit Dell at Booth #102 
Visit Cloudera at Booth #305 
Book Signing: Today @ 3:15pm 
Office Hours: Tomorrow @ 11am



Editor's Notes

  • #4: When you think about Dell you probably think about laptops
  • #5: Or servers that might run databases or a Hadoop cluster, but you probably don't think of Dell as having expertise in either Oracle or Hadoop
  • #6: But actually Dell now has a billion-dollar software arm which includes the world's number one independent database tool – Toad – used by millions of users and supporting almost every data platform
  • #7: Guy to improve diagram