From Oracle to Hadoop with Sqoop and Other Tools

From Oracle to Hadoop:
Unlocking Hadoop for Your RDBMS with
Apache Sqoop and Other Tools
Guy Harrison, David Robson, Kate Ting
{guy.harrison, david.robson}@software.dell.com,
kate@cloudera.com
October 16, 2014

About Guy, David, & Kate
Guy Harrison @guyharrison
- Executive Director of R&D @ Dell
- Author of Oracle Performance Survival Guide & MySQL Stored Procedure Programming
David Robson @DavidR021
- Principal Technologist @ Dell
- Sqoop Committer, Lead on Toad for Hadoop & OraOop
Kate Ting @kate_ting
- Technical Account Mgr @ Cloudera
- Sqoop Committer/PMC, Co-author of Apache Sqoop Cookbook

RDBMS and Hadoop
• The relational database reigned supreme for more than two decades
• Hadoop and other non-relational tools have overthrown that hegemony
• We are unlikely to return to a “one size fits all” model based on Hadoop
- Though some will try
• For the foreseeable future, enterprise information architectures will include relational and non-relational stores

Scenarios
1. We need to access RDBMS to make sense of Hadoop data
[Diagram labels: Weblogs, Flume, Products, RDBMS, SQOOP, HDFS, YARN/MR1, Analytic output]

Scenarios
1. Reference data is in the RDBMS
2. We want to run analysis outside of the RDBMS
[Diagram labels: Products, Sales, RDBMS, SQOOP, HDFS, YARN/MR1, Analytic output]

Scenarios
1. Reference data is in the RDBMS
2. We want to run analysis outside of the RDBMS
3. Feeding YARN/MR output into RDBMS
[Diagram labels: Weblogs, Flume, HDFS, YARN/MR1, Analytic output, SQOOP, Weblog Summary, RDBMS]

Scenarios
1. We need to access RDBMS to make sense of Hadoop data
2. We want to use Hadoop to analyse RDBMS data
3. Hadoop output belongs in RDBMS data warehouse
4. We archive old RDBMS data to Hadoop
[Diagram labels: Sales, RDBMS, SQOOP, Old Sales, HDFS, HQL, SQL, BI platform]

SQOOP
• SQOOP was created in 2009 by Aaron Kimball as a means of moving data between SQL databases and Hadoop
• It provided a generic implementation for moving data
• It also provided a framework for implementing database-specific optimized connectors

How SQOOP works (import)
[Diagram labels: RDBMS (Table Metadata, Table Data), SQOOP, Table.java, Hive DDL, two Map Tasks (each with DataDrivenDBInputFormat and FileOutputFormat), HDFS files, Hive Table]

SQOOP issues with Oracle
• SQOOP uses primary key ranges to divide up data between mappers
• However, deletes tend to hit older key values harder, leaving the key ranges unbalanced
• Data is almost never arranged on disk in key order, so the mappers' index range scans collide on disk
• Load is unbalanced, and the number of IO block requests far exceeds (>>) the number of blocks in the table
[Diagram: ORACLE TABLE on DISK; two MAPPERs, each with an ORACLE SESSION running an index RANGE SCAN, one for ID > 0 and ID < MAX/2, the other for ID > MAX/2]
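
The imbalance described above can be illustrated with a minimal Python sketch. It is not Sqoop's actual splitter code: the table contents, the deletion pattern and the helper names (split_key_ranges, rows_per_range) are all hypothetical, chosen only to show what equal-width key ranges do once older keys have been thinned out.

```python
def split_key_ranges(min_id, max_id, num_mappers):
    """Divide [min_id, max_id] into equal-width, contiguous key ranges."""
    total = max_id - min_id + 1
    bounds = [min_id + (total * i) // num_mappers for i in range(num_mappers + 1)]
    return [(bounds[i], bounds[i + 1] - 1) for i in range(num_mappers)]


def rows_per_range(ids, ranges):
    """Count how many surviving rows fall into each mapper's key range."""
    return [sum(1 for key in ids if lo <= key <= hi) for lo, hi in ranges]


# Hypothetical table: keys 1..1000 were inserted, then 90% of the older
# half (keys 1..500) was deleted, keeping only every 10th key.
surviving_ids = [k for k in range(1, 501) if k % 10 == 0] + list(range(501, 1001))

ranges = split_key_ranges(min(surviving_ids), max(surviving_ids), 4)
counts = rows_per_range(surviving_ids, ranges)
print(ranges)
print(counts)
```

With this made-up data, the two mappers that own the older key ranges receive a few dozen rows each, while the two that own the newer ranges receive about 250 each, so the job finishes only when the most heavily loaded mapper does.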

Other problems
• Oracle might run each mapper using a full scan – clobbering the database
• Oracle might run each mapper in parallel – clobbering the database
• Sqoop may clobber the database cache
[Chart: Elapsed time (s) vs. Number of mappers]
[Chart: Database load, Database Time (s) vs. Number of mappers]

High speed connector design
• Partition data based on physical storage
• Bypass Oracle buffering
• Bypass Oracle parallelism
• Do not require or use indexes
• Never read the same data block more than once
• Support Oracle datatypes
[Diagram labels: ORACLE TABLE, three HADOOP MAPPERs each with its own ORACLE SESSION, HDFS]

Imports (Oracle->Hadoop)
• Uses Oracle block/extent map to equally divide IO
• Uses Oracle direct path (non-buffered) IO for all reads
• Round-robin, sequential or random allocation
• All mappers get an equal number of blocks & no block is read twice
• If table is partitioned, each mapper can work on a separate partition – results in partitioned output
[Diagram labels: ORACLE TABLE, three HADOOP MAPPERs each with its own ORACLE SESSION, HDFS]
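
As an illustration of the round-robin option above, here is a minimal Python sketch of dealing fixed-size block chunks out to mappers. It is not the connector's implementation: the extent map is invented, and chunk_extents and allocate_round_robin are hypothetical helper names.

```python
def chunk_extents(extents, blocks_per_chunk):
    """Cut each (file_id, start_block, num_blocks) extent into chunks of
    at most blocks_per_chunk contiguous blocks."""
    chunks = []
    for file_id, start, count in extents:
        offset = 0
        while offset < count:
            size = min(blocks_per_chunk, count - offset)
            chunks.append((file_id, start + offset, size))
            offset += size
    return chunks


def allocate_round_robin(chunks, num_mappers):
    """Deal chunks out to mappers in turn, like cards."""
    assignment = [[] for _ in range(num_mappers)]
    for i, chunk in enumerate(chunks):
        assignment[i % num_mappers].append(chunk)
    return assignment


# Hypothetical extent map: (file_id, start_block, num_blocks).
extents = [(1, 128, 64), (1, 512, 64), (2, 0, 128)]
chunks = chunk_extents(extents, blocks_per_chunk=16)
plan = allocate_round_robin(chunks, num_mappers=4)
blocks_per_mapper = [sum(size for _, _, size in part) for part in plan]
print(blocks_per_mapper)
```

Because the split follows physical blocks rather than key values, every mapper in this example ends up with the same number of blocks, and each block belongs to exactly one chunk, which is the "no block is read twice" property.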

Exports (Hadoop->Oracle)
• Optionally leverages Oracle partitions and temporary tables for parallel writes
• Performs MERGE into Oracle table (updates existing rows, inserts new rows)
• Optionally uses Oracle NOLOGGING (faster but unrecoverable)
[Diagram labels: ORACLE TABLE, three HADOOP MAPPERs each with its own ORACLE SESSION, HDFS]
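
The MERGE behaviour above (update rows that already exist, insert the rest) can be sketched in a few lines of Python. This models only the semantics on an in-memory dictionary; it is not the SQL the connector issues, and merge_rows and the sample rows are hypothetical.

```python
def merge_rows(target, incoming, key):
    """Upsert: update rows whose key already exists, insert the rest.
    Mutates target in place and returns (updated, inserted) counts."""
    updated = inserted = 0
    for row in incoming:
        k = row[key]
        if k in target:
            target[k].update(row)
            updated += 1
        else:
            target[k] = dict(row)
            inserted += 1
    return updated, inserted


# Hypothetical table contents, keyed by id.
table = {1: {"id": 1, "city": "Palo Alto"}, 2: {"id": 2, "city": "Brno"}}
exported = [{"id": 2, "city": "Prague"}, {"id": 3, "city": "Melbourne"}]
result = merge_rows(table, exported, key="id")
print(result, sorted(table))
```

Here the row with id 2 is updated and the row with id 3 is inserted, so re-running the same export leaves the table unchanged rather than failing on duplicate keys.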

Import – Oracle to Hadoop
• When data is unclustered (randomly distributed by PK), old SQOOP scales poorly
• Clustered data shows better scalability but is still much slower than the direct approach
• New SQOOP typically outperforms old SQOOP by 5-20 times
• We’ve seen the limiting factor as:
- Data IO bandwidth, or
- Network out of DB, or
- Hadoop CPU
[Chart: Elapsed time (s) vs. Number of mappers; series: direct=false unclustered data, direct=false clustered data, direct=true]

Import - Database overhead
• As you increase mappers in old Sqoop, database load increases rapidly
- (sometimes non-linearly)
• In new Sqoop, queuing occurs only after IO bandwidth is exceeded
[Chart: DB time (minutes) vs. Number of mappers; series: Sqoop, Direct]

Export – Hadoop to Oracle
• On export, old SQOOP would hit the database writer bottleneck early on and fail to parallelize
• New SQOOP uses partitioning and direct path inserts
• Typically bottlenecks on write IO on the Oracle side
[Chart: Elapsed time (minutes) vs. Number of mappers; series: Sqoop, Direct]

Reduction in database load
• 45% reduction in DB CPU
• 83% reduction in elapsed time
• 90% reduction in total database time
• 99.9% reduction in database IO
8 node Hadoop cluster, 1B rows, 310GB
[Chart: % reduction by metric; bar labels: IO time, IO requests, DB time, Elapsed time, CPU time; bar values as extracted (pairing with labels not preserved): 55.31, 83.45, 90.59, 99.98, 99.28]

Replication
• No matter how fast we make SQOOP, it’s a drag to have to run a SQOOP job before every Hadoop job.
• Replicating data into Hadoop cuts down on SQOOP overhead on both sides and avoids stale data.
Shareplex® for Oracle and Hadoop

Sqoop 1.4.5 Summary
Sqoop 1.4.5 without --direct          | Sqoop 1.4.5 with --direct
Minimal privileges required           | Access to DBA views required
Works on most object types: e.g. IOT  | 5x-20x faster performance on tables
Favors Sqoop terminology              | Favors Oracle terminology
Database load increases non-linearly  | Up to 99% reduction in database IO
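
On the command line, the two modes compared above differ only by the --direct flag. The following Python sketch assembles the argument list for each case without executing anything. The helper build_import_command is hypothetical, the connection string and table name are reused from the Oracle example elsewhere in this deck, and -P (prompt for the password) is substituted for a literal password.

```python
import shlex


def build_import_command(connect, username, table, num_mappers=4, direct=False):
    """Assemble argv for a Sqoop 1 import; nothing is executed here."""
    args = [
        "sqoop", "import",
        "--connect", connect,
        "--username", username,
        "-P",  # prompt for the password instead of putting it on the command line
        "--table", table,
        "--num-mappers", str(num_mappers),
    ]
    if direct:
        # Selects the database-specific fast path instead of generic JDBC.
        args.append("--direct")
    return args


# Assumed connection details, borrowed from the deck's Oracle example.
oracle_url = "jdbc:oracle:thin:@//localhost:1521/g12c"
generic = build_import_command(oracle_url, "OPSG", "OPSG.CUSTOMERS")
fast = build_import_command(oracle_url, "OPSG", "OPSG.CUSTOMERS", direct=True)
print(shlex.join(generic))
print(shlex.join(fast))
```

Keeping the two invocations identical apart from the flag makes it easy to time one against the other on the same table.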

Sqoop 1 Import Architecture
sqoop import \
  --connect jdbc:mysql://mysql.example.com/sqoop \
  --username sqoop --password sqoop \
  --table cities

Sqoop 1 Export Architecture
sqoop export \
  --connect jdbc:mysql://mysql.example.com/sqoop \
  --username sqoop --password sqoop \
  --table cities \
  --export-dir /temp/cities

Sqoop 1 Challenges
• Concerns with usability
- Cryptic, contextual command line arguments
• Concerns with security
- Client access to Hadoop bin/config, DB
• Concerns with extensibility
- Connectors tightly coupled with data format

Sqoop 2 Design Goals
• Ease of use
- REST API and Java API
• Ease of security
- Separation of responsibilities
• Ease of extensibility
- Connector SDK, focus on pluggability

Ease of Use
Sqoop 1 Sqoop 2
sqoop import \
  -Dmapred.child.java.opts="-Djava.security.egd=file:///dev/urandom" \
  -Ddfs.replication=1 \
  -Dmapred.map.tasks.speculative.execution=false \
  --num-mappers 4 \
  --hive-import --hive-table CUSTOMERS --create-hive-table \
  --connect jdbc:oracle:thin:@//localhost:1521/g12c \
  --username OPSG --password opsg --table OPSG.CUSTOMERS \
  --target-dir CUSTOMERS.CUSTOMERS

Ease of Security
Sqoop 1 Sqoop 2
sqoop import \
  -Dmapred.child.java.opts="-Djava.security.egd=file:///dev/urandom" \
  -Ddfs.replication=1 \
  -Dmapred.map.tasks.speculative.execution=false \
  --num-mappers 4 \
  --hive-import --hive-table CUSTOMERS --create-hive-table \
  --connect jdbc:oracle:thin:@//localhost:1521/g12c \
  --username OPSG --password opsg --table OPSG.CUSTOMERS \
  --target-dir CUSTOMERS.CUSTOMERS
• Role-based access to connection objects
• Prevents misuse and abuse
• Administrators create, edit, delete
• Operators use

Ease of Extensibility
Sqoop 1 Sqoop 2
Tight Coupling
• Connectors fetch and store data from db
• Framework handles serialization, format conversion, integration

Takeaway
• Apache Sqoop
- Bulk data transfer tool between external structured datastores and Hadoop
• Sqoop 1.4.5 now with a --direct parameter option for Oracle
- 5x-20x performance improvement on Oracle table imports
• Sqoop 2
- Ease of use, security, extensibility

Questions?
Guy Harrison @guyharrison
David Robson @DavidR021
Kate Ting @kate_ting
Visit Dell at Booth #102
Visit Cloudera at Booth #305
Book Signing: Today @ 3:15pm
Office Hours: Tomorrow @ 11am
