EndoMine System
Jewish General Hospital

by David Lauzon
and Anton Zakharov
Big Data Montreal #9
February 5th 2013         1 / 18
Presentation

•   Our Objectives
•   Requirements and context
•   Project scope
•   Hadoop Solution
    –   Big Data Solution Overview
    –   Hive Table Schema
    –   Compression Performance
    –   Data Architecture in Hadoop
    –   Hadoop/Impala Prototype Demo
• Oracle Solution
• Hadoop vs Oracle comparison
• What are expensive queries?

                                       2 / 18
Our Objectives


• Lead an end-of-study project in an
  industrial context
  – Requirements elicitation
  – Implement a « proof-of-concept » prototype


• Experiment with big data technologies
  – Compare with RDBMS



                                                 3 / 18
Requirements and context

• Department of Medical Diagnostic
  (medical test results DB, e.g. blood, urine, ...)
   – Dr. Shaun Eintracht
      • « ad hoc » Query
      • ETL Query
   – Dr. Elizabeth Mac Namara
      • « business intelligence » requirements
      • Real-time dashboard

• Department of Endocrinology
   – Dr. Mark Trifiro
      • Data mining

                                                      4 / 18
Project scope


• First iteration = improve ad-hoc queries
  – Slow analytical queries and ETL (MS Access)
  – Risk of « crashing » production DB
  – Some queries impossible to process




                                                  5 / 18
Production DB (Oracle)




                         6 / 18
Solutions


• Solution 1 : Hadoop + Impala

• Solution 2 : Tune the existing Oracle RDBMS




                                                7 / 18
Big Data Solution Overview




                             8 / 18
Hive Table Schema




                    9 / 18
Compression Performance

[Bar chart: Oracle, Hive, and Impala compared across storage formats (Oracle FS, Text File, Sequence File, SeqFile + Gzip, SeqFile + Snappy); vertical axis from 0 to 250, units not given]


                                                                    10 / 18
Data Architecture in Hadoop

• All big tables are pre-joined
   – With specimen (1)
   – Without specimen (2)
• Partitioned using two schemes
   – Year-month (3)
   – Year and Test (4)
• 4 different versions of the same data:
   –   stay_order_results_yearmonth
   –   stay_order_results_year_and_test
   –   stay_order_results_specimen_yearmonth
   –   stay_order_results_specimen_year_and_test
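
As a rough illustration of how a query layer might choose among the four table versions, here is a minimal Python sketch. The table names come from the slide; the helper `pick_table` and its selection rule (use the year-and-test partitioning when the query filters on a test, otherwise year-month) are assumptions made for illustration, not something the presentation specifies.

```python
# Hypothetical helper: the table names are from the slide, the selection rule is assumed.
def pick_table(needs_specimen: bool, filters_on_test: bool) -> str:
    """Return the pre-joined table variant that best matches a query."""
    base = "stay_order_results"
    if needs_specimen:
        # Variants (1): pre-joined with specimen data.
        base += "_specimen"
    # Assumed rule: queries that filter on a test benefit from the
    # year-and-test partitioning (4); others use year-month (3).
    scheme = "year_and_test" if filters_on_test else "yearmonth"
    return f"{base}_{scheme}"


if __name__ == "__main__":
    print(pick_table(needs_specimen=True, filters_on_test=False))
```

Keeping four copies trades storage and ETL time for the ability to prune partitions on whichever column a query filters by.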


                                                   11 / 18
Hadoop Prototype Demo




                        12 / 18
Oracle Solution


• Same tables as source DB
  – A big pre-joined table is not a good solution
• Techniques explored :
  – Partitioning
     • Partitions automatically created
  – Compression
     • Inefficient for joins
  – Clustering
  – Join multiple partitioned tables


                                                    13 / 18
Oracle Solution (continued)


• Avoid too many indexes on the big tables:
  – Takes a lot of memory
  – Slow to create
  – May not be used if a query reads more than 5% of the rows
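
The 5% rule of thumb above can be written as a quick selectivity check. This is only a sketch of the slide's heuristic: the function name `index_likely_used` and its parameters are hypothetical, and a real query optimizer weighs many more factors than a fixed threshold.

```python
# Sketch of the slide's rule of thumb; not how any real optimizer decides.
def index_likely_used(matching_rows: int, total_rows: int,
                      threshold: float = 0.05) -> bool:
    """Return True if the query touches a small enough fraction of the rows."""
    if total_rows <= 0:
        raise ValueError("total_rows must be positive")
    selectivity = matching_rows / total_rows
    # Above the threshold, a full scan is assumed to be cheaper than index lookups.
    return selectivity <= threshold


if __name__ == "__main__":
    print(index_likely_used(1_000, 1_000_000))    # 0.1% of the rows
    print(index_likely_used(100_000, 1_000_000))  # 10% of the rows
```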




                                                  14 / 18
Comparison: Hadoop Solution


• Pro
  – Crunches massive amounts of data
  – Scalability
  – Free software
• Cons
  – Needs better UI and tune-ups
  – Maintenance cost
  – Requires ETL time to merge data into one table
  – Big joins should be avoided

                                                    15 / 18
Comparison: Oracle Solution


• Pro
  – Just need to create a slave DB (just?)
  – Faster random lookups
  – Easier to find expertise
• Cons
  – Scales only up to a certain point
  – Synchronisation with master DB:
        • Rebuilding indexes would take hours


                                                16 / 18
What are expensive queries?


• If possible, avoid these constructs on
  large result sets
  – SELECT DISTINCT
  – ORDER BY
  – GROUP BY
  – JOIN big table with another big table
     • JOIN big table with multiple small tables should be OK
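
One way to see why joining a big table with small tables is usually acceptable: the small table fits in memory as a lookup, so the big table only needs a single streaming pass. The Python sketch below illustrates that idea under assumed, hypothetical column names (`patient_id`, `test_code`, `value`); it is not the join strategy of Impala, Hive, or Oracle specifically.

```python
from typing import Dict, Iterable, Iterator, Tuple

# Hypothetical row layout for the big table: (patient_id, test_code, value).
BigRow = Tuple[int, str, float]


def join_with_lookup(big_rows: Iterable[BigRow],
                     small_table: Dict[str, str]) -> Iterator[Tuple[int, str, float]]:
    """Inner-join a streamed big table against an in-memory small table."""
    for patient_id, test_code, value in big_rows:
        test_name = small_table.get(test_code)
        if test_name is not None:  # rows without a match are dropped (inner join)
            yield (patient_id, test_name, value)


if __name__ == "__main__":
    tests = {"GLU": "Glucose", "TSH": "Thyrotropin"}
    results = [(1, "GLU", 5.4), (2, "TSH", 2.1), (3, "XXX", 0.0)]
    for row in join_with_lookup(results, tests):
        print(row)
```

When both sides are large, neither fits in memory as a lookup, which is what makes big-to-big joins expensive.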




                                                            17 / 18
Conclusion


• Recommendation to use a “classic” RDBMS
  – The database fits on a single node
  – Existing expertise in-house
  – Acceptable performance with appropriate
    tune-ups
  – Stop using MS Access
• Disadvantage : limited scalability



                                              18 / 18


BDM9 - Comparison of Oracle RDBMS and Cloudera Impala for a hospital use case


Editor's Notes

  • #4: Choose Shaun: smaller scale, immediate need, lets us test the technology
  • #5: Choose Shaun: smaller scale, immediate need, lets us test the technology
  • #6: Database containing the test analysis data for patient specimens, along with the results. Running analytical queries on the production database is very slow and can interfere with normal operation
  • #8: Database containing the test analysis data for patient specimens, along with the results. Running analytical queries on the production database is very slow and can interfere with normal operation
  • #10: WILL NOT COVER: requirements elicitation
  • #11: 25% faster with Snappy compression (5.5x compression). Impala is 80% faster than Oracle
  • #13: Choose Shaun: smaller scale, immediate need, lets us test the technology