©Continuent 2014
Real-Time Loading from
MySQL to Hadoop
Featuring Continuent Tungsten
MC Brown, Senior Information Architect
Introducing Continuent
• The leading provider of clustering and replication for open source DBMS
• Our Product: Continuent Tungsten
  • Clustering - Commercial-grade HA, performance scaling and data management for MySQL
  • Replication - Flexible, high-performance data movement
Quick Continuent Facts
• Largest Tungsten installation processes over 700 million transactions daily on 225 terabytes of data
• Tungsten Replicator was application of the year at the 2011 MySQL User Conference
• Wide variety of topologies including MySQL, Oracle, Vertica, and MongoDB are in production now
• MySQL to Hadoop deployments are now in progress with multiple customers
Continuent Tungsten Customers
Five Minute Hadoop
Introduction
What Is Hadoop, Exactly?
a. A distributed file system
b. A method of processing massive quantities of data in parallel
c. The Cutting family’s stuffed elephant
d. All of the above
Hadoop Distributed File System
Clients: Java client, Hive, Pig, hadoop command
NameNode (directory): find file
DataNodes (replicated data): read block(s)
Map/Reduce
Input split 1:
Acme,2013,4.75
Spitze,2013,25.00
Acme,2013,55.25
Excelsior,2013,1.00
Spitze,2013,5.00
Input split 2:
Spitze,2014,60.00
Spitze,2014,9.50
Acme,2014,1.00
Acme,2014,4.00
Excelsior,2014,1.00
Excelsior,2014,9.00
MAP output for split 1:
Acme,60.00
Excelsior,1.00
Spitze,30.00
MAP output for split 2:
Acme,5.00
Excelsior,10.00
Spitze,69.50
REDUCE output:
Acme,65.00
Excelsior,11.00
Spitze,99.50
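The Map/Reduce figures above can be reproduced with a small in-memory sketch. It is only an illustration of the idea, not Hadoop code: the function names `map_split` and `reduce_partials` are made up for this example, and the per-split summing in the map step mirrors the partial totals shown on the slide.

```python
from collections import defaultdict
from decimal import Decimal

# The two input splits from the slide: company, year, amount.
SPLIT_1 = ["Acme,2013,4.75", "Spitze,2013,25.00", "Acme,2013,55.25",
           "Excelsior,2013,1.00", "Spitze,2013,5.00"]
SPLIT_2 = ["Spitze,2014,60.00", "Spitze,2014,9.50", "Acme,2014,1.00",
           "Acme,2014,4.00", "Excelsior,2014,1.00", "Excelsior,2014,9.00"]


def map_split(lines):
    """MAP (hypothetical helper): total the amounts per company in one split."""
    totals = defaultdict(Decimal)
    for line in lines:
        company, _year, amount = line.split(",")
        totals[company] += Decimal(amount)
    return dict(totals)


def reduce_partials(partials):
    """REDUCE (hypothetical helper): merge the per-split totals per company."""
    totals = defaultdict(Decimal)
    for partial in partials:
        for company, amount in partial.items():
            totals[company] += amount
    return dict(sorted(totals.items()))


partials = [map_split(SPLIT_1), map_split(SPLIT_2)]
result = reduce_partials(partials)
for company, total in result.items():
    print(f"{company},{total:.2f}")
```

In a real cluster each split would be processed on a different node; here both map calls simply run in the same process.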
Typical MySQL to Hadoop Use Case
Diagram: Transaction Processing -> Hadoop Cluster -> Hive (Analytics)
Open questions: Initial load? Latency? App changes? Materialized views? Changes? App load?
Options for Loading Data
• Manual Loading (CSV Files)
• Sqoop
• Tungsten Replicator
Comparing Methods in Detail
Method | Manual via CSV | Sqoop | Tungsten Replicator
Process | Manual/Scripted | Manual/Scripted | Fully automated
Incremental Loading | Possible with DDL changes | Requires DDL changes | Fully supported
Latency | Full-load | Intermittent | Real-time
Extraction Requirements | Full table scan | Full and partial table scans | Low-impact binlog scan
Replicating MySQL Data
to Hadoop using
Tungsten Replicator
What is Tungsten Replicator?
A real-time, high-performance, open source database replication engine
GPL V2 license - 100% open source
Download from https://code.google.com/p/tungsten-replicator/
Annual support subscription available from Continuent
“GoldenGate without the Price Tag”®
Tungsten Replicator Overview
Master: Replicator extracts transactions from the DBMS logs into the THL (Transactions + Metadata)
Slave: Replicator reads the THL (Transactions + Metadata) and applies the transactions
Tungsten Replicator 3.0 & Hadoop
• Extract from MySQL or Oracle
• Base Hadoop support
• Platforms: Cloudera, HortonWorks, MapR, Amazon EMR, IBM InfoSphere BigInsights
• Provision using Sqoop or parallel extraction
• Automatic replication of incremental changes
• Transformation to preferred HDFS formats
• Schema generation for Hive
• Tools for generating materialized views
Hadoop Support
Hadoop Hadoop-BaseFS
Apache Hadoop Yes Yes
Cloudera Yes (Certified) Yes (Certified)
MapR Yes
HortonWorks Yes (Awaiting Certification)
IBM InfoSphere BigInsights Yes
Amazon EMR Yes
Basic MySQL to Hadoop Replication
Diagram: MySQL (binlog_format=row) -> MySQL Binlog -> Tungsten Master Replicator (hadoop) -> Tungsten Slave Replicator (hadoop) -> CSV Files -> Hadoop Cluster
Master-Side Filtering
* pkey - Fill in pkey info
* colnames - Fill in names
* cdc - Add update type and schema/table info
* source - Add source DBMS
* replicate - Subset tables to be replicated
Steps:
* Extract from MySQL binlog
* Load raw CSV to HDFS (e.g., via LOAD DATA to Hive)
* Access via Hive
Hadoop Data Loading - Gory Details
Diagram: Replicator (hadoop) -> CSV Files -> Staging “Tables” -> Base Tables / Materialized Views
* Replicator receives transactions from master
* Javascript load script (e.g. hadoop.js) writes data to CSV
* CSV files are loaded using the hadoop command
* Staging “tables”: (Generate Table Definitions)
* Base tables: (Generate Table Definitions)
* Materialized views: (Run Map/Reduce)
Demo #1
Replicating sysbench data
Viewing MySQL Data
in Hadoop
Generating Staging Table Schema
$ ddlscan -template ddl-mysql-hive-0.10-staging.vm \
    -user tungsten -pass secret \
    -url jdbc:mysql:thin://logos1:3306/db01 -db db01
...
DROP TABLE IF EXISTS db01.stage_xxx_sbtest;

CREATE EXTERNAL TABLE db01.stage_xxx_sbtest
(
tungsten_opcode STRING ,
tungsten_seqno INT ,
tungsten_row_id INT ,
id INT ,
k INT ,
c STRING ,
pad STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\001' ESCAPED BY '\\'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE LOCATION '/user/tungsten/staging/db01/sbtest';
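To make the staging layout concrete, here is a minimal sketch of how one line of a staging file maps onto the columns of the staging table. It assumes the field delimiter is the control character 0x01 and that each row ends with a newline. The helper name `parse_staging_line`, the sample values, and the opcode `I` are hypothetical, and escape handling is deliberately left out.

```python
# Columns of db01.stage_xxx_sbtest, in the order of the staging table definition.
COLUMNS = ["tungsten_opcode", "tungsten_seqno", "tungsten_row_id",
           "id", "k", "c", "pad"]
INT_COLUMNS = {"tungsten_seqno", "tungsten_row_id", "id", "k"}

FIELD_DELIMITER = "\x01"  # assumption: fields are separated by control character 0x01


def parse_staging_line(line):
    """Split one staging line into a column -> value dict (no escape handling)."""
    values = line.rstrip("\n").split(FIELD_DELIMITER)
    if len(values) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} fields, got {len(values)}")
    row = dict(zip(COLUMNS, values))
    for name in INT_COLUMNS:
        row[name] = int(row[name])
    return row


# Hypothetical sample row: an insert at sequence number 42.
sample = FIELD_DELIMITER.join(["I", "42", "1", "7", "500", "c-value", "pad-value"]) + "\n"
row = parse_staging_line(sample)
print(row)
```

The three `tungsten_` columns carry the change metadata; the remaining columns are the original table columns.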
Generating Base Table Schema
$ ddlscan -template ddl-mysql-hive-0.10.vm -user tungsten \
    -pass secret -url jdbc:mysql:thin://logos1:3306/db01 -db db01
...
DROP TABLE IF EXISTS db01.sbtest;

CREATE TABLE db01.sbtest
(
id INT ,
k INT ,
c STRING ,
pad STRING )
;
Creating a Materialized View in Theory
Input: Log #1, Log #2, ... Log #N
MAP: Sort by key(s), transaction order
REDUCE: Emit last row per key if not a delete
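The map and reduce steps can be sketched in plain Python over a handful of in-memory change rows. This is an illustration of the rule only, not the tungsten-reduce script. The opcode values `I` (insert) and `D` (delete), the function name `materialize`, and the sample rows are assumptions made for the example.

```python
from itertools import groupby
from operator import itemgetter

DELETE_OPCODE = "D"  # assumption: deletes are marked with opcode "D"


def materialize(staging_rows, key="id"):
    """Return the last non-deleted version of each row, keyed by `key`."""
    # MAP: sort by key, then by transaction order (seqno, row id).
    ordered = sorted(
        staging_rows,
        key=lambda r: (r[key], r["tungsten_seqno"], r["tungsten_row_id"]),
    )
    view = []
    # REDUCE: keep the last change per key unless it is a delete.
    for _key, changes in groupby(ordered, key=itemgetter(key)):
        last = list(changes)[-1]
        if last["tungsten_opcode"] != DELETE_OPCODE:
            view.append({name: value for name, value in last.items()
                         if not name.startswith("tungsten_")})
    return view


# Hypothetical change log: row 1 is replaced, row 2 is inserted and then deleted.
staging = [
    {"tungsten_opcode": "I", "tungsten_seqno": 1, "tungsten_row_id": 1, "id": 1, "k": 10},
    {"tungsten_opcode": "I", "tungsten_seqno": 1, "tungsten_row_id": 2, "id": 2, "k": 20},
    {"tungsten_opcode": "D", "tungsten_seqno": 2, "tungsten_row_id": 1, "id": 1, "k": 10},
    {"tungsten_opcode": "I", "tungsten_seqno": 2, "tungsten_row_id": 2, "id": 1, "k": 11},
    {"tungsten_opcode": "D", "tungsten_seqno": 3, "tungsten_row_id": 1, "id": 2, "k": 20},
]
view = materialize(staging)
print(view)
```

Only row 1 survives, with its latest value of `k`, because the last change recorded for row 2 is a delete.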
Creating a Materialized View in Hive
$ hive
...
hive> ADD FILE /home/rhodges/github/continuent-tools-hadoop/bin/tungsten-reduce;
hive> FROM (
  -- MAP
  SELECT sbx.*
  FROM db01.stage_xxx_sbtest sbx
  DISTRIBUTE BY id
  SORT BY id,tungsten_seqno,tungsten_row_id
) map1
-- REDUCE
INSERT OVERWRITE TABLE db01.sbtest
  SELECT TRANSFORM(
    tungsten_opcode,tungsten_seqno,tungsten_row_id,id,k,c,pad)
  USING 'perl tungsten-reduce -k id -c tungsten_opcode,tungsten_seqno,tungsten_row_id,id,k,c,pad'
  AS id INT,k INT,c STRING,pad STRING;
...
Comparing MySQL and Hadoop Data
$ export TUNGSTEN_EXT_LIBS=/usr/lib/hive/lib
...
$ /opt/continuent/tungsten/bristlecone/bin/dc \
    -url1 jdbc:mysql:thin://logos1:3306/db01 \
    -user1 tungsten -password1 secret \
    -url2 jdbc:hive2://localhost:10000 \
    -user2 'tungsten' -password2 'secret' -schema db01 \
    -table sbtest -verbose -keys id \
    -driver org.apache.hive.jdbc.HiveDriver
22:33:08,093 INFO DC - Data comparison utility
...
22:33:24,526 INFO Tables compare OK
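What a keyed table comparison checks can be sketched as follows. This is not how the dc utility is implemented; it only illustrates comparing two row sets on a key column such as `id`. The function name `compare_tables` and the sample rows are hypothetical.

```python
def compare_tables(source_rows, target_rows, key="id"):
    """Report keys missing from the target, extra in the target, or differing."""
    source = {row[key]: row for row in source_rows}
    target = {row[key]: row for row in target_rows}
    return {
        "missing": sorted(source.keys() - target.keys()),
        "extra": sorted(target.keys() - source.keys()),
        "different": sorted(k for k in source.keys() & target.keys()
                            if source[k] != target[k]),
    }


# Hypothetical rows standing in for the MySQL table and the Hive table.
mysql_rows = [{"id": 1, "k": 11}, {"id": 2, "k": 20}, {"id": 3, "k": 30}]
hive_rows = [{"id": 1, "k": 11}, {"id": 2, "k": 21}, {"id": 4, "k": 40}]

report = compare_tables(mysql_rows, hive_rows)
tables_match = not any(report.values())
print(report, "OK" if tables_match else "MISMATCH")
```

An empty report for all three lists corresponds to the two tables comparing OK.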
Doing it all at once
$ git clone \
    https://github.com/continuent/continuent-tools-hadoop.git

$ cd continuent-tools-hadoop

$ bin/load-reduce-check \
    -U jdbc:mysql:thin://logos1:3306/db01 \
    -s db01 --verbose
Demo #2
Constructing and Checking a Materialized View
Scaling It Up!
MySQL to Hadoop Fan-In Architecture
Masters: m1 (master), m2 (master), m3 (master), each with its own Replicator sending RBR (row-based replication) data
Slaves: one Replicator running m1 (slave), m2 (slave), m3 (slave)
Target: Hadoop Cluster (many nodes)
Integration with Provisioning
Diagram: MySQL (binlog_format=row) -> MySQL Binlog -> Tungsten Master (hadoop) -> Tungsten Slave (hadoop) -> CSV Files -> Hadoop Cluster
Sqoop/ETL (Initial provisioning run)
Access via Hive
On-Demand Provisioning via Parallel Extract
Diagram: MySQL (binlog_format=row), MySQL Binlog -> Tungsten Master Replicator (hadoop) -> Tungsten Slave Replicator (hadoop) -> CSV Files -> Hadoop Cluster
Master-Side Filtering
* pkey - Fill in pkey info
* colnames - Fill in names
* cdc - Add update type and schema/table info
* source - Add source DBMS
* replicate - Subset tables to be replicated
(other filters as needed)
Steps:
* Extract from MySQL tables
* Load raw CSV to HDFS (e.g., via LOAD DATA to Hive)
* Access via Hive
Tungsten Replicator Roadmap
• Parallel CSV file loading (supported)
• Partition loaded data by commit time (supported)
• Expanded Data format support (CSV, JSON)
• Replication out of Hadoop
Continuent Hadoop Tools Roadmap
• HBase Data Support & Materialization
• Impala Data Support & Materialization
• Integration with emerging real-time analytics (e.g. Storm, Spark, Shark, Stinger, …)
• Point-in-Time Table Generation
• Time-Series Generation
• Rolling and Managed Materialization
• Replicator driven data manipulation (e.g. denormalisation, combining, …)
Getting Started with
Continuent Tungsten
Where Is Everything?
• Tungsten Replicator 3.0 builds are now available on code.google.com
  http://code.google.com/p/tungsten-replicator/
• Replicator 3.0 documentation is available on the Continuent website
  http://docs.continuent.com/tungsten-replicator-3.0/deployment-hadoop.html
• Tungsten Hadoop tools are available on GitHub
  https://github.com/continuent/continuent-tools-hadoop
Contact Continuent for support
Commercial Terms
• Replicator features are open source (GPL V2)
• Investment Elements
  • POC / Development (Walk Away Option)
  • Production Deployment
  • Annual Support Subscription
• Governing Principles
  • Annual Subscription Required
  • More Upfront Investment -> Less Annual Subscription
We Do Clustering Too!
Tungsten clusters combine off-the-shelf open source MySQL servers into data services with:
• 24x7 data access
• Scaling of load on replicas
• Simple management commands
...without app changes or data migration
Diagram labels: Amazon US West, apache/php, GonzoPortal.com, Connector, Connector
In Conclusion: Tungsten Offers...
• Fully automated, real-time replication from MySQL into Hadoop
• Support for automatic transformation to HDFS data formats and creation of full materialized views
• Positions users to take advantage of evolving real-time features in Hadoop
Continuent Web Page: http://www.continuent.com
Tungsten Replicator: http://code.google.com/p/tungsten-replicator
Our Blogs:
http://scale-out-blog.blogspot.com
http://mcslp.wordpress.com
http://www.continuent.com/news/blogs
560 S. Winchester Blvd., Suite 500
San Jose, CA 95128
Tel +1 (866) 998-3642
Fax +1 (408) 668-1009
e-mail: sales@continuent.com

Set Up & Operate Real-Time Data Loading into Hadoop
