OPEN SOURCE ANALYTICS on MONGODB with Schema
By OCTO & The Refiners
Pierre-Alain Jachiet - Aurélien Gervasi
Pierre-Alain Jachiet, Aurélien Gervasi
DATA SCIENTIST
Data strategist, applied mathematician
Analysts, with developer skills
DATA PROCESSOR
“The major activity in the data science process is identifying, accessing and preparing data for analysis”
From MongoDB data … to Superset Colors
OCTO TECHNOLOGY > THERE IS A BETTER WAY
So, what's the point of MongoDB?
MongoDB - The Leading NoSQL Database
[Figure: MongoDB compared with Cassandra, Redis and HBase]
MongoDB - A NoSQL database in the big leagues of RDBMS
[Figure: popularity score by db-engines.com, 2013 to 2017. Source: https://db-engines.com/en/ranking]
Why MongoDB?
Semi-structured data? Performance? Scalability? Yes!
And more generally because it is natural for developers:
“a pleasure to use from the developer perspective”
“MongoDB is fast to get started”
Developers speak JSON…
[Figure: XML vs. JSON trend, vertical axis 25 to 100, horizontal axis 2008 to 2017]
(= document with schema)
… the modern data exchange format …
… and MongoDB eats JSON
MongoDB, a common technology to store data
So far, so good.
But time goes on and data goes in.
And, one day, someone has a dream… AI
Hey! Please! An analyst for this data!
But NoSQL / JSON data is not natural for analysts.
Analysts use SQL
MongoDB: aggregation framework
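To make the contrast concrete, here is a minimal sketch of one question ("how many restaurants per cuisine?") asked both ways. The table name `restaurants`, the field `cuisine` and the sample rows are illustrative assumptions, not taken from the slides. The SQL side runs on an in-memory SQLite database from the Python standard library; the MongoDB pipeline is only shown as the data structure one would hand to pymongo's `collection.aggregate`, and is not executed here.

```python
import sqlite3

# Hypothetical sample rows (name, cuisine); not data from the talk.
ROWS = [
    ("Restaurant A", "Bakery"),
    ("Restaurant B", "Hamburgers"),
    ("Restaurant C", "American"),
    ("Restaurant D", "Bakery"),
]

# What an analyst writes: plain SQL over a table.
SQL_QUERY = """
    SELECT cuisine, COUNT(*) AS n
    FROM restaurants
    GROUP BY cuisine
    ORDER BY n DESC, cuisine ASC
"""

# The aggregation-framework equivalent: a list of stages.
# With pymongo it would be passed as db.restaurants.aggregate(MONGO_PIPELINE).
MONGO_PIPELINE = [
    {"$group": {"_id": "$cuisine", "n": {"$sum": 1}}},
    {"$sort": {"n": -1, "_id": 1}},
]


def count_by_cuisine(rows):
    """Run SQL_QUERY on an in-memory SQLite table built from rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE restaurants (name TEXT, cuisine TEXT)")
    conn.executemany("INSERT INTO restaurants VALUES (?, ?)", rows)
    result = conn.execute(SQL_QUERY).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    print(count_by_cuisine(ROWS))
```

Both forms express the same grouping; the difference is that the first is a language analysts already know, while the second is a developer-oriented API.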
Analyst land: analysts work with tables… and relations
Developer land: developers like JSON… and nesting
Relational database
Developer land: application code with a model layer (= API to data). Developers have a data model in the code.
Analyst land: data schema (= map + contract). Analysts work with a data schema.
MongoDB
Developer land: application code with a model layer (= API to data). Developers have a data model in the code.
Analyst land: data schema (= map + contract). Analysts work with a data schema.
> But MongoDB is schema-less
The usual reaction…
Hack a pipeline to flatten the MongoDB data
[Diagram: MongoDB → Pymongo + scripts / Python notebooks → CSV file → Excel, Access, SAS]
Difficulties
☉ Hard job for the analyst
☉ Batch / no real time
☉ Not robust to changes
=> Difficult to industrialize
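The kind of hand-written flattening script this slide describes might look like the sketch below. It is an illustration under assumptions, not code from the talk: the documents are hard-coded here, whereas a real script would read them from MongoDB (for example with pymongo's `collection.find()`), and the helper names `flatten` and `to_csv` are hypothetical.

```python
import csv
import io


def flatten(doc, prefix=""):
    """Flatten nested dicts into a single-level dict with dotted keys.

    Lists are squeezed into one ';'-separated string, which loses structure:
    one reason this kind of script is hard on the analyst.
    """
    flat = {}
    for key, value in doc.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}."))
        elif isinstance(value, list):
            flat[name] = ";".join(str(item) for item in value)
        else:
            flat[name] = value
    return flat


def to_csv(docs):
    """Return CSV text whose columns are the union of all flattened keys."""
    rows = [flatten(doc) for doc in docs]
    fieldnames = sorted({key for row in rows for key in row})
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical documents; the second one introduces a field the first lacks.
DOCS = [
    {"_id": 1, "name": "A", "address": {"zip": "10001"}, "tags": ["x", "y"]},
    {"_id": 2, "name": "B", "phone": "555"},
]

if __name__ == "__main__":
    print(to_csv(DOCS))
```

The column set depends on whichever documents happen to be in the batch, so a new field silently changes the CSV layout, and the export has to be rerun to pick up new data. These are the "batch" and "not robust to changes" difficulties listed above.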
MongoDB enterprise solution
Mongo BI Connector
Developed for integration with SQL-based BI tools
An SQL compatibility layer to MongoDB
[Diagram: Tableau ↔ MySQL wire protocol (data as tables) ↔ Mongo SQLD (SQL translator, post-processor) ↔ data as JSON ↔ MongoDB; the data model is produced by Mongo DRDL*]
* DRDL = Document-Relational Definition Language
Mongo BI Connector - Pros & Cons
Pros
☉ Official & supported
☉ Install it and go
Cons
☉ Commercial → MongoDB Enterprise license
☉ Closed-source → black box
☉ Limited performance?
☉ Mandatory use of SQL wire protocol
An open-source solution?
Open-source bricks put together!
[Diagram: MongoDB → Mongo Connector → Doc-manager (PostgreSQL) → PostgreSQL, with Pymongo-Schema providing the mapping]
Streaming data from MongoDB to PostgreSQL
Mongo Connector: Connect them all!
Developed by MongoDB Labs
Python 2.6, 2.7, 3.3+
MongoDB 2.4, 2.6, 3.0, 3.2, and 3.4
Apache License 2.0
https://github.com/mongodb-labs/mongo-connector
Synchronize a MongoDB database with another database
☉ MongoDB
☉ Solr
☉ Elasticsearch
Mongo Connector: Connect them all!
[Diagram: changes in the DB write new events to the oplog file on the primary; the oplog drives (differential) replication to the two secondaries; Mongo Connector reads the oplog and propagates changes to the other DB]
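The oplog mechanism in the diagram can be sketched as a replay loop: read change events in order and apply each one to the target. The snippet below is a simplified simulation under assumptions, not Mongo Connector's code. Entries use the familiar oplog field names (`op`, `ns`, `o`, `o2`), but timestamps are plain integers, updates are limited to `$set` / `$unset`, and the target is an in-memory dict rather than a real database. The function names are hypothetical.

```python
def apply_oplog_entry(target, entry):
    """Apply one simplified oplog entry to target.

    target maps a namespace ("db.collection") to {_id: document}.
    """
    table = target.setdefault(entry["ns"], {})
    op = entry["op"]
    if op == "i":
        # Insert: 'o' is the full document.
        table[entry["o"]["_id"]] = dict(entry["o"])
    elif op == "u":
        # Update: 'o2' selects the document, 'o' holds the changes.
        doc_id = entry["o2"]["_id"]
        doc = table.setdefault(doc_id, {"_id": doc_id})
        doc.update(entry["o"].get("$set", {}))
        for field in entry["o"].get("$unset", {}):
            doc.pop(field, None)
    elif op == "d":
        # Delete: 'o' identifies the document to remove.
        table.pop(entry["o"]["_id"], None)
    # Other operations (no-ops, commands) are ignored in this sketch.


def replay(oplog, target=None):
    """Apply entries in timestamp order and return the resulting target."""
    target = {} if target is None else target
    for entry in sorted(oplog, key=lambda e: e["ts"]):
        apply_oplog_entry(target, entry)
    return target


# Hypothetical change stream for one collection.
OPLOG = [
    {"ts": 1, "op": "i", "ns": "demo.restaurants",
     "o": {"_id": 12, "name": "A", "cuisine": "Bakery"}},
    {"ts": 2, "op": "u", "ns": "demo.restaurants",
     "o2": {"_id": 12}, "o": {"$set": {"cuisine": "Pizza"}}},
    {"ts": 3, "op": "i", "ns": "demo.restaurants", "o": {"_id": 13, "name": "B"}},
    {"ts": 4, "op": "d", "ns": "demo.restaurants", "o": {"_id": 13}},
]

if __name__ == "__main__":
    print(replay(OPLOG))
```

Because only the differences are replayed, the target stays in sync continuously instead of being rebuilt by a batch export.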
Doc-manager: Do you speak PostgreSQL?
Developed by Hopwork
Python 2.7, 3.4+
PostgreSQL 9.5
Apache License 2.0
https://github.com/Hopwork/mongo-connector-postgresql
☉ Translate a modification request from Mongo Connector to the target database
☉ Speak the target database language
Doc-manager: Do you speak PostgreSQL?

MongoDB world

{
  _id: "12",
  f1: "fu",
  f2: true,
  f3: 42,
  f4: {
    sf1: "pyparis",
    sf2: 2017
  },
  f5: [
    "fu",
    "bar",
    "fubar"
  ]
}

SQL world

_id   f1     f2     f3
12    'fu'   true   42

f4.sf1      f4.sf2
'pyparis'   2017

_id   value     id_parent
1     'fu'      12
2     'bar'     12
3     'fubar'   12
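The document-to-tables translation on this slide can be sketched in a few lines of Python. This is an illustration of the idea, not the doc-manager's implementation: the function name `split_document`, the table name `main` and the choice to put nested-object fields in the parent row as dotted columns are all assumptions. Arrays of scalars go to a child table whose rows point back to the parent through `id_parent`, as on the slide; arrays of sub-documents are not handled.

```python
def split_document(doc, table="main"):
    """Split one document into rows for a parent table and child tables.

    Returns {table_name: [row, ...]}. Nested objects become dotted columns of
    the parent row; each array of scalars becomes a child table named
    <table>__<field>, with one row per element.
    """
    parent_row = {}
    tables = {table: [parent_row]}

    def visit(value, column):
        if isinstance(value, dict):
            for key, sub in value.items():
                visit(sub, f"{column}.{key}" if column else key)
        elif isinstance(value, list):
            child_name = f"{table}__{column.replace('.', '__')}"
            child = tables.setdefault(child_name, [])
            for position, item in enumerate(value, start=1):
                child.append(
                    {"_id": position, "value": item, "id_parent": doc["_id"]}
                )
        else:
            parent_row[column] = value

    visit(doc, "")
    return tables


# The example document from the slide.
DOC = {
    "_id": "12",
    "f1": "fu",
    "f2": True,
    "f3": 42,
    "f4": {"sf1": "pyparis", "sf2": 2017},
    "f5": ["fu", "bar", "fubar"],
}

if __name__ == "__main__":
    for name, rows in split_document(DOC).items():
        print(name, rows)
```

The child-table naming with a double underscore mirrors the `restaurants__address__coord` table name that appears later in the demo.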
Pymongo-Schema: A mapping to rule them all
“Homemade”
Python 2.7
Apache License 2.0
https://github.com/pajachiet/pymongo-schema
☉ Scan the entire database to define its data model schema
☉ Generate a mapping file flattening the MongoDB schema into an SQL-compatible schema
Demo
☉ MongoDB example dataset: Restaurants in New York
> Address & coordinates
> Cuisine type
> List of grades
☉ Nested data structure
EXTRACT
Read the entire database to extract its data model schema
Returns:
☉ Field name and field nesting
☉ Field completion (frequency and ratio)
☉ Field type
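A minimal sketch of what such an extraction step computes is given below. It is not pymongo-schema's code or output format: the function name `extract_schema`, the result keys (`count`, `ratio`, `types`) and the sample documents are assumptions chosen for illustration. It walks every document, records each field under its dotted path, and reports how often the field occurs and which Python types were seen.

```python
from collections import defaultdict


def extract_schema(docs):
    """Return {dotted_field: {"count": n, "ratio": n / total, "types": {...}}}."""
    total = 0
    counts = defaultdict(int)
    types = defaultdict(lambda: defaultdict(int))

    def visit(value, path):
        counts[path] += 1
        types[path][type(value).__name__] += 1
        if isinstance(value, dict):
            # Nested object: descend and extend the dotted path.
            for key, sub in value.items():
                visit(sub, f"{path}.{key}")

    for doc in docs:
        total += 1
        for key, value in doc.items():
            visit(value, key)

    return {
        path: {
            "count": counts[path],
            "ratio": counts[path] / total,
            "types": dict(types[path]),
        }
        for path in sorted(counts)
    }


# Hypothetical documents with missing fields and an inconsistent type.
DOCS = [
    {"_id": 1, "cuisine": "Bakery", "address": {"zipcode": "10462"}},
    {"_id": 2, "cuisine": "Pizza"},
    {"_id": 3, "address": {"zipcode": 10001}},
    {"_id": 4, "cuisine": "Bakery", "address": {}},
]

if __name__ == "__main__":
    for field, stats in extract_schema(DOCS).items():
        print(field, stats)
```

A report like this is what lets an analyst see, for a schema-less collection, which fields exist, how complete they are, and where types disagree.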
TOSQL
Read a schema to generate a MongoDB/SQL mapping.
Returns:
☉ Mapping file used by the doc-manager
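The idea of deriving a relational mapping from a schema can be sketched as follows. The mapping layout below is purely hypothetical and is not the file format the doc-manager reads; the function name `to_sql_mapping`, the `SQL_TYPES` table and the input schema shape (dotted field name to type name) are assumptions made for this illustration. Scalar fields become columns of the collection's table, and each list field gets its own child table.

```python
# Assumed type translation; unknown types fall back to TEXT.
SQL_TYPES = {
    "str": "TEXT",
    "int": "INTEGER",
    "float": "DOUBLE PRECISION",
    "bool": "BOOLEAN",
}


def to_sql_mapping(collection, schema):
    """Build a hypothetical mapping from a {dotted_field: type_name} schema.

    type_name is a key of SQL_TYPES, or "list" for array fields.
    """
    mapping = {collection: {"pk": "_id", "columns": {}}}
    for field, type_name in schema.items():
        if type_name == "list":
            # Array field: new table named <collection>__<path with __>.
            child = f"{collection}__{field.replace('.', '__')}"
            mapping[child] = {
                "pk": "_id",
                "foreign_key": "id_parent",
                # Simplification: element values are always stored as TEXT.
                "columns": {"value": "TEXT"},
            }
        else:
            # Scalar field: column in the same table.
            mapping[collection]["columns"][field] = SQL_TYPES.get(type_name, "TEXT")
    return mapping


# Hypothetical schema for the restaurants example.
SCHEMA = {
    "_id": "str",
    "cuisine": "str",
    "address.zipcode": "str",
    "address.coord": "list",
}

if __name__ == "__main__":
    for table, spec in to_sql_mapping("restaurants", SCHEMA).items():
        print(table, spec)
```

Generating the mapping from the extracted schema, rather than writing it by hand, is what keeps it in step with schema evolutions.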
Same table: column “cuisine”
New table: “restaurants__address__coord”
Mongo Connector: check for updates in the oplog file, send update commands with data
Doc-manager: translate commands and make SQL requests
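The "translate command and make SQL requests" step can be illustrated with an upsert. The sketch below is not the doc-manager's code: `upsert_sql` and `sync` are hypothetical helpers, and SQLite from the Python standard library stands in for PostgreSQL so the example runs anywhere (it needs SQLite 3.24 or later for `ON CONFLICT ... DO UPDATE`, a clause PostgreSQL also supports). With a PostgreSQL driver such as psycopg2 the placeholders would be `%s` instead of `?`.

```python
import sqlite3


def upsert_sql(table, row, key="_id"):
    """Build a parameterized upsert statement for one flattened row.

    Table and column names are interpolated into the SQL text, so they must
    come from a trusted mapping, never from user input.
    """
    columns = list(row)
    quoted = [f'"{column}"' for column in columns]
    column_list = ", ".join(quoted)
    placeholders = ", ".join("?" for _ in columns)
    updates = ", ".join(
        f"{q} = excluded.{q}" for column, q in zip(columns, quoted) if column != key
    )
    action = f"DO UPDATE SET {updates}" if updates else "DO NOTHING"
    sql = (
        f'INSERT INTO "{table}" ({column_list}) VALUES ({placeholders}) '
        f'ON CONFLICT ("{key}") {action}'
    )
    return sql, [row[column] for column in columns]


def sync(conn, table, rows):
    """Apply each row as an upsert and return the table content."""
    for row in rows:
        sql, params = upsert_sql(table, row)
        conn.execute(sql, params)
    return conn.execute(f'SELECT * FROM "{table}" ORDER BY "_id"').fetchall()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute('CREATE TABLE restaurants ("_id" TEXT PRIMARY KEY, "cuisine" TEXT)')
    changes = [
        {"_id": "12", "cuisine": "Bakery"},   # first seen: insert
        {"_id": "12", "cuisine": "Pizza"},    # seen again: update
    ]
    print(sync(conn, "restaurants", changes))
```

Using an upsert means the same statement serves both inserts and updates coming from the oplog, which keeps the translation layer small.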
Time to play with your analytics tools
Adding an open-source BI tool...
Now, in Superset colors!
“Superset is a data exploration platform designed to be visual, intuitive and interactive.”
Developed by Airbnb
Python 2.7, 3.4, 3.5
Apache License 2.0
https://github.com/airbnb/superset
Open-Source Analytics Stack on MongoDB, with Schema, Pierre-Alain Jachiet and Aurélien Gervasi
SQL lab
Wrap up
Take-home message
☉ Issues for analysts with NoSQL frameworks
> Developer-oriented languages
> Nested data structure
> Schema-less
☉ An open-source stack to unlock analysis of MongoDB data
> Extract a MongoDB schema
> Normalize the data model
> Real-time synchronization to PostgreSQL
☉ Currently running in production environments
Come, use and contribute! :)
pajachiet@octo.com
agervasi@octo.com
https://github.com/mongodb-labs/mongo-connector
https://github.com/Hopwork/mongo-connector-postgresql
https://github.com/pajachiet/pymongo-schema
https://github.com/airbnb/superset
Remember to stress that this is an open-source stack
☉ Collaborative
☉ Free of charge
But hey! It’s open source!
“I analyze my data to understand myself”
“I automatically learn to perform complex tasks from data”
“I equip myself with advanced tools that enable complex, interactive analyses”
Dataviz
Search
Statistics
Learning
Data-driven organization
MongoDB popularity
https://db-engines.com/en/ranking
Analysts use SQL
The Mongo way: aggregation framework
Superset
Visualization architecture
[Diagram: PostgreSQL tables → Datasource (SQLa tables) → Visualizations → Dashboard]
Please! An analyst for this data!
But analysts don’t speak JSON…
Should we call the developer?


Editor's Notes

  • #2: TODO: add the Python logo; add the PyParis logo or a mention of PyParis.
  • #3: Should we introduce each other? It is worth having both of us speak from the very start. TODO: remove the question. We love Python. Easy enough for data scientists.
  • #4: What do they do? Master analysts who control the world? Crazy mathematicians who build artificial intelligence? TODO: add an animated red strike-through over the image; “What our moms think we do”, “What we think we do”. Add or reuse the image of a data scientist in front of a screen, speaking SQL?
  • #5: What do they do? Master analysts who control the world? Crazy mathematicians who build artificial intelligence? TODO: add an animated red strike-through over the image; “What our moms think we do”, “What we think we do”. Add or reuse the image of a data scientist in front of a screen, speaking SQL?
  • #6: But in fact, this is the main challenge. Always… 80-95% of the time (add to the slide?). TODO: pie chart image? The proportion between the 3 phases may vary a lot. Data may be lost and difficult to know about. Accessing data might be technically difficult. But it’s often a technical nightmare but even more often
  • #7: Subject of the talk: how to identify, access and prepare MongoDB data for analysis… from a data scientist's point of view. MongoDB data ~= JSON. Nested and flexible => fractal. Prepare with open-source Python. Advanced analysis: first-level interactive dashboards. TODO: add the open-source logo; add the Python logo; animate the slide (cabbage for fractal, rainbow, Superset, etc.).
  • #9: TODO: improve the chart and its readability.
  • #11: Open-source, and “free”. Great APIs & documentation. Works well with object-oriented programming. Schema-less database: adapt your model as you go.
  • #20: Model layer: http://seldo.com/weblog/2011/08/11/orm_is_an_antipattern
  • #21: Model layer: http://seldo.com/weblog/2011/08/11/orm_is_an_antipattern
  • #22: Model layer: http://seldo.com/weblog/2011/08/11/orm_is_an_antipattern
  • #23: Several usual reactions: prepare the data for the analysis tool you know… with the transform tool you know. Traditional analysts: mixed graphical / code tools, e.g. a SAS stream. Data scientists: custom Python scripts. TODO: animate.
  • #28: Stream from MongoDB to Postgres. A STREAM process, not a BATCH one.
  • #29: Mongo Connector synchronizes a MongoDB database with another DB. Developed by MongoDB Labs, open-source license. Duplicates a MongoDB database to another database (originally NoSQL). Reads the oplog (drawing) and propagates modifications to another DB.
  • #31: Translates a modification request from Mongo Connector to the target database. Implements some general methods: Upsert, Bulk Upsert, Update, Remove, etc. Already exists for MongoDB, Solr and Elasticsearch (NoSQL DBs). Open-source; contributions welcome. Unlike the other doc-managers, it requires a mapping file, which is used to flatten the NoSQL DB schema.
  • #33: Written by us, inspired by Variety, + license. Contributions welcome. Goal: generate the mapping for the doc-manager. It could have been written by hand, but that is very error-prone. Simplifies schema evolutions.
  • #35: Story to present: a startup that grades restaurants in NY.
  • #37: Story to present: a startup that grades restaurants in NY. Source of the dataset. Presentation of one JSON element.
  • #39: Story to present: a startup that grades restaurants in NY. Source of the dataset. Presentation of one JSON element. Usually done by hand.
  • #41: Story to present: a startup that grades restaurants in NY. Source of the dataset. Presentation of one JSON element.
  • #44: Stream from MongoDB to Postgres. A STREAM process, not a BATCH one.
  • #45: Far from bare-bones: frequent releases, user and permission management, documentation. Web interface based on Flask.
  • #55: Ranking of interest in the main databases.