1
Distributed, fault-tolerant, transactional
Real-Time Integration: MongoDB and SQL Databases
Eugene Dvorkin
Architect, WebMD
2
WebMD: A lot of data; a lot of traffic
~900 million page views a month
~100 million unique visitors a month
3
How We Use MongoDB
User Activity
4
Why Move Data to RDBMS?
Preserve existing investment in BI and data warehouse
Use an analytical database such as Vertica
Use SQL
5
Why Move Data In Real-time?
Batch process is slow
No ad-hoc queries
No real-time reports
6
Challenge in moving data
Transform Document to Relational Structure
Insert into RDBMS at high rate
7
Challenge in moving data
Scale easily as data volume and velocity increase
8
Our Solution to move data in Real-time: Storm
Storm: open source distributed real-time computation system.
Developed by Nathan Marz at BackType, which was acquired by Twitter
9
Our Solution to move data in Real-time: Storm
Hadoop vs. Storm
10
Why STORM?
JVM-based framework
Guaranteed data processing
Supports development in multiple languages
Scalable and transactional
11
Overview of Storm cluster
Master node (Nimbus)
Cluster coordination (ZooKeeper)
Supervisors run worker processes
12
Storm Abstractions
Tuples, Streams, Spouts, Bolts and Topologies
13
Tuples
("ns:events", "email:edvorkin@gmail.com")
Ordered list of elements
14
Stream
Unbounded sequence of tuples
Example: stream of messages from a message queue
15
Spout
Reads from a source of data: queues, web logs, API calls, MongoDB oplog
Emits documents as tuples
Source of streams
16
Bolts
Process tuples and create new streams
17
Bolts
Apply functions/transforms
Calculate and aggregate data (word count!)
Access DB, API, etc.
Filter data
Map/Reduce
Process tuples and create new streams
18
Topology
19
Topology
Storm is transforming and moving data
20
MongoDB
How To Read All Incoming Data
from MongoDB?
21
MongoDB
How To Read All Incoming Data
from MongoDB?
Use MongoDB OpLog
22
What is OpLog?
Replication mechanism in MongoDB
It is a capped collection
23
Spout: reading from OpLog
Located in the local database, in the oplog.rs collection
24
Spout: reading from OpLog
Operations: Insert, Update, Delete
25
Spout: reading from OpLog
Namespace (ns): database and collection name, which maps to the SQL table
26
Spout: reading from OpLog
Data object (o): the inserted document, or the update or delete specification
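The slides above name the parts of an oplog entry a spout cares about: the operation (op), the namespace (ns) and the data object (o). The following is a minimal sketch, in plain Java with a Map standing in for the driver's document type, of how a consumer might classify an entry. The class and method names, and the database name "mydb", are hypothetical and not taken from the original project.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper for illustration; not part of the original deck. */
class OplogEntrySketch {
    enum Operation { INSERT, UPDATE, DELETE, OTHER }

    /** Maps the oplog "op" code to an operation: "i" insert, "u" update, "d" delete. */
    static Operation operationOf(Map<String, ?> entry) {
        Object op = entry.get("op");
        if ("i".equals(op)) return Operation.INSERT;
        if ("u".equals(op)) return Operation.UPDATE;
        if ("d".equals(op)) return Operation.DELETE;
        return Operation.OTHER; // anything else, for example a no-op or a command
    }

    /** The namespace has the form "database.collection"; returns the collection part. */
    static String collectionOf(Map<String, ?> entry) {
        String ns = String.valueOf(entry.get("ns"));
        int dot = ns.indexOf('.');
        return dot < 0 ? ns : ns.substring(dot + 1);
    }

    public static void main(String[] args) {
        Map<String, Object> entry = new LinkedHashMap<>();
        entry.put("op", "i");
        entry.put("ns", "mydb.people"); // "mydb" is an assumed database name
        System.out.println(operationOf(entry) + " on " + collectionOf(entry));
    }
}
```

The collection part of the namespace is what later selects the target table on the SQL side.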
27
Sharded cluster
28
Automatic discovery of sharded cluster
29
Example: Shard vs Replica set discovery
30
Example: Shard discovery
31
Spout: Reading data from OpLog
How to Read data continuously
from OpLog?
32
Spout: Reading data from OpLog
How to Read data continuously
from OpLog?
Use Tailable Cursor
33
Example: Tailable cursor, like tail -f
34
Manage timestamps
Use the ts field (the timestamp in each oplog entry) to track processed records
If the system restarts, resume from the recorded ts
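The deck does not show how the last processed ts is stored, so the following is only a sketch under stated assumptions: the timestamp is reduced to a single long value and persisted in a local file. A real oplog ts is a BSON timestamp, and the store could just as well be a MongoDB collection. The class name OplogCheckpointSketch is hypothetical.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical checkpoint store; the storage format is an assumption. */
class OplogCheckpointSketch {
    private final Path file;

    OplogCheckpointSketch(Path file) {
        this.file = file;
    }

    /** Records the ts of the last fully processed oplog entry. */
    void save(long ts) {
        try {
            Files.write(file, Long.toString(ts).getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Returns the recorded ts, or defaultTs when nothing has been recorded yet. */
    long loadOrDefault(long defaultTs) {
        if (!Files.exists(file)) {
            return defaultTs;
        }
        try {
            String text = new String(Files.readAllBytes(file), StandardCharsets.UTF_8).trim();
            return text.isEmpty() ? defaultTs : Long.parseLong(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

On startup a spout would call loadOrDefault and query the oplog for entries with a ts greater than the returned value. Saving only after an entry has been fully processed means a restart may replay some entries but should not skip any.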
35
Spout: reading from OpLog
36
SPOUT – Code Example
37
TOPOLOGY
38
Working With Embedded Arrays
An array represents a one-to-many relationship in an RDBMS
39
Example: Working with embedded arrays
40
Example: Working with embedded arrays
{ _id: 1,
  ns: "person_awards",
  o: { award: 'National Medal of Science',
       year: 1975,
       by: 'National Science Foundation' }
}
{ _id: 1,
  ns: "person_awards",
  o: { award: 'Turing Award',
       year: 1977,
       by: 'ACM' }
}
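The two documents above can be produced by walking the embedded array and wrapping each element with the parent's _id and a child namespace. This is a minimal sketch in plain Java, with Map and List standing in for BasicDBObject and BasicDBList. The class name and the choice to pass the child namespace as a parameter are assumptions, not the project's actual processArray.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for the array extraction step. */
class ArrayExtractorSketch {
    /**
     * Turns each element of an embedded array into its own document that
     * carries the parent's id, so it can become a row in a child table.
     */
    static List<Map<String, Object>> extract(Object parentId, String childNs, List<?> array) {
        List<Map<String, Object>> out = new ArrayList<>();
        for (Object element : array) {
            Map<String, Object> child = new LinkedHashMap<>();
            child.put("_id", parentId); // the parent id links the child row to its parent
            child.put("ns", childNs);   // later selects the target SQL table
            child.put("o", element);    // the array element itself
            out.add(child);
        }
        return out;
    }
}
```

Calling extract(1, "person_awards", awards) on a two-element awards list yields two documents shaped like the ones on this slide.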
41
Example: Working with embedded arrays
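public void execute(Tuple tuple) {
    .........
    if (field instanceof BasicDBList) {
        // each array element becomes its own document
        BasicDBObject arrayElement = processArray(field);
        ......
        // emit on the "documents" stream, anchored to the input tuple
        outputCollector.emit("documents", tuple, new Values(arrayElement));
    }
}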
42
Parse documents with Bolt
43
Input (oplog entry):
{ "ns": "people", "op": "i",
  o: {
    _id: 1,
    name: { first: 'John', last: 'Backus' },
    birth: 'Dec 03, 1924'
  }
}
Output (flattened tuple):
["ns": "people", "op": "i",
  ["id": 1,
   "name_first": "John",
   "name_last": "Backus",
   "birth": "Dec 03, 1924"
  ]
]
Parse documents with Bolt
44
@Override
public void execute(Tuple tuple) {
    ......
    // the incoming tuple carries the oplog entry in its "document" field
    final BasicDBObject oplogObject = (BasicDBObject) tuple.getValueByField("document");
    // "o" holds the document that was written
    final BasicDBObject document = (BasicDBObject) oplogObject.get("o");
    ......
    outputValues.add(flattenDocument(document));
    outputCollector.emit(tuple, outputValues);
}
Parse documents with Bolt
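The slide calls flattenDocument without showing its body. One possible shape, sketched here in plain Java with Map standing in for BasicDBObject, is a recursive walk that joins nested keys with an underscore, which matches the name_first and name_last columns shown earlier. The class name and the underscore separator are assumptions; arrays are assumed to have been split out by the previous bolt.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch of document flattening; not the project's actual code. */
class FlattenSketch {
    /** Flattens nested sub-documents into a single level of column-like keys. */
    static Map<String, Object> flatten(Map<String, ?> document) {
        Map<String, Object> out = new LinkedHashMap<>();
        flattenInto("", document, out);
        return out;
    }

    private static void flattenInto(String prefix, Map<String, ?> node, Map<String, Object> out) {
        for (Map.Entry<String, ?> e : node.entrySet()) {
            // nested keys are joined with "_", for example name.first becomes name_first
            String key = prefix.isEmpty() ? e.getKey() : prefix + "_" + e.getKey();
            Object value = e.getValue();
            if (value instanceof Map) {
                @SuppressWarnings("unchecked")
                Map<String, ?> nested = (Map<String, ?>) value;
                flattenInto(key, nested, out); // recurse into the sub-document
            } else {
                out.put(key, value);
            }
        }
    }
}
```

Recursion keeps the sketch short; for shallow documents a loop works just as well.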
45
Write to SQL with SQLWriter Bolt
46
Write to SQL with SQLWriter Bolt
["ns": "people", "op": "i",
  ["id": 1,
   "name_first": "John",
   "name_last": "Backus",
   "birth": "Dec 03, 1924"
  ]
]
insert into people (_id, name_first, name_last, birth)
  values (1, 'John', 'Backus', 'Dec 03, 1924');
insert into people_awards (_id, awards_award, awards_year, awards_by)
  values (1, 'Turing Award', 1977, 'ACM');
insert into people_awards (_id, awards_award, awards_year, awards_by)
  values (1, 'National Medal of Science', 1975, 'National Science Foundation');
47
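@Override
public void prepare(.....) {
    ....
    // load the Vertica JDBC driver and open one connection per bolt instance
    Class.forName("com.vertica.jdbc.Driver");
    con = DriverManager.getConnection(dBUrl, username, password);
}

@Override
public void execute(Tuple tuple) {
    String insertStatement = createInsertStatement(tuple);
    // try-with-resources closes the statement even if the insert fails
    try (Statement stmt = con.createStatement()) {
        stmt.execute(insertStatement);
    } catch (SQLException e) {
        // error handling is elided on the slide
    }
}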
Write to SQL with SQLWriter Bolt
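The deck does not show createInsertStatement. As one possible alternative to building a literal SQL string, the sketch below generates a parameterized statement from a flattened row, which could then be used with the standard JDBC PreparedStatement and its addBatch and executeBatch methods. The class and method names are hypothetical, and the table and column names are assumed to come from trusted namespaces and field names, since they are concatenated into the SQL text.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch; not the project's createInsertStatement. */
class InsertStatementSketch {
    /** Builds "insert into <table> (<columns>) values (?, ?, ...)" for a flattened row. */
    static String parameterizedInsert(String table, Map<String, ?> row) {
        List<String> columns = new ArrayList<>(row.keySet());
        // one "?" placeholder per column; values are bound later by the JDBC driver
        String placeholders = String.join(", ", Collections.nCopies(columns.size(), "?"));
        return "insert into " + table
                + " (" + String.join(", ", columns) + ")"
                + " values (" + placeholders + ")";
    }
}
```

Binding values through placeholders avoids quoting problems in string fields, and batching several rows per round trip is one way to keep up with a high insert rate.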
48
Topology Definition
TopologyBuilder builder = new TopologyBuilder();
// define our spout
builder.setSpout(spoutId, new MongoOpLogSpout("mongodb://", opslog_progress));
// split embedded arrays into separate documents (parallelism hint of 5)
builder.setBolt(arrayExtractorId, new ArrayFieldExtractorBolt(), 5)
       .shuffleGrouping(spoutId);
// flatten documents arriving on the documents stream
builder.setBolt(mongoDocParserId, new MongoDocumentParserBolt())
       .shuffleGrouping(arrayExtractorId, documentsStreamId);
// write the flattened tuples to the RDBMS
builder.setBolt(sqlWriterId, new SQLWriterBolt(rdbmsUrl, rdbmsUserName, rdbmsPassword))
       .shuffleGrouping(mongoDocParserId);
// local mode: run the topology in-process for testing
LocalCluster cluster = new LocalCluster();
cluster.submitTopology("test", conf, builder.createTopology());
51
Topology Definition
TopologyBuilder builder = new TopologyBuilder();
// define our spout
builder.setSpout(spoutId, new MongoOpLogSpout("mongodb://", opslog_progress));
// split embedded arrays into separate documents (parallelism hint of 5)
builder.setBolt(arrayExtractorId, new ArrayFieldExtractorBolt(), 5)
       .shuffleGrouping(spoutId);
// flatten documents arriving on the documents stream
builder.setBolt(mongoDocParserId, new MongoDocumentParserBolt())
       .shuffleGrouping(arrayExtractorId, documentsStreamId);
// write the flattened tuples to the RDBMS
builder.setBolt(sqlWriterId, new SQLWriterBolt(rdbmsUrl, rdbmsUserName, rdbmsPassword))
       .shuffleGrouping(mongoDocParserId);
// production mode: submit the topology to the Storm cluster
StormSubmitter.submitTopology("OfflineEventProcess", conf, builder.createTopology());
52
Lesson learned
By leveraging the MongoDB oplog (or another capped collection), a tailable cursor, and the Storm framework, you can build a fast, scalable, real-time data processing pipeline.
53
Resources
Book: Getting started with Storm
Storm Project wiki
Storm starter project
Storm contributions project
Running a Multi-Node Storm cluster tutorial
Implementing real-time trending topic
A Hadoop Alternative: Building a real-time data pipeline with Storm
Storm Use cases
54
Resources (cont’d)
Understanding the Parallelism of a Storm Topology
Trident – high level Storm abstraction
A practical Storm’s Trident API
Storm online forum
Mongo connector from 10gen Labs
MoSQL streaming Translator in Ruby
Project source code
New York City Storm Meetup
55
Questions
Eugene Dvorkin, Architect, WebMD edvorkin@webmd.net
Twitter: @edvorkin LinkedIn: eugenedvorkin
58
Next Sessions at 2:50
5th Floor:
West Side Ballroom 3 & 4: Data Modeling Examples from the Real World
West Side Ballroom 1 & 2: Growing Up MongoDB
Juilliard Complex: Business Track: MetLife Leapfrogs Insurance Industry with MongoDB-Powered Big Data Application
Lyceum Complex: Ask the Experts: MongoDB Monitoring and Backup Service Session
7th Floor:
Empire Complex: How We Fixed Our MongoDB Problems
SoHo Complex: High Performance, High Scale MongoDB on AWS: A Hands On Guide

Real-Time Integration Between MongoDB and SQL Databases

Editor's Notes

  • #3: Leading source of health and medical information.
  • #4: Data is raw. Data is immutable; data is true. Dynamic personalized marketing campaigns.
  • #24: The oplog is a capped collection that lives in a database called local on every replicating node and records all changes to the data. Every time a client writes to the primary, an entry with enough information to reproduce the write is automatically added to the primary’s oplog. Once the write is replicated to a given secondary, that secondary’s oplog also stores a record of the write. Each oplog entry is identified with a BSON timestamp, and all secondaries use the timestamp to keep track of the latest entry they’ve applied.
  • #29: How do you know if you are connected to a sharded cluster?
  • #36: Use the MongoDB oplog as a queue
  • #37: The spout extends an interface
  • #40: The awards array in the Person document is converted into two documents, each carrying the parent document's id.
  • #41: The awards array is converted into two documents, each carrying the parent document's id. The namespace is used later to insert the data into the correct table on the SQL side.
  • #42: Instance of BasicDBList in Java
  • #44: Flatten out your document structure: use a loop or recursion. Hopefully you don't have deeply nested documents, which go against MongoDB guidelines for schema design.
  • #48: Use tick tuples and update in batches
  • #49: Local mode vs prod mode
  • #50: Increasing the parallelism of a bolt. Say you want 5 instances of the bolt to process your arrays because that is the more time-consuming operation, or you want more SQLWriterBolts because inserting data takes a long time; then use the parallelism hint parameter in the bolt definition. The system will create the corresponding number of executors (threads) to process your request.
  • #51: Local mode vs prod mode
  • #52: Local mode vs prod mode