BDM9 - Comparison of Oracle RDBMS and Cloudera Impala for a hospital use case

The document discusses a project aimed at improving database performance at a medical facility using big data technologies, specifically comparing Hadoop with Oracle solutions. The objectives include enhancing ad-hoc query processing and addressing performance issues related to slow analytical queries. Recommendations lean towards using a classic RDBMS for its existing in-house expertise and acceptable performance despite limited scalability.

EndoMine System
Jewish General Hospital

by David Lauzon and Anton Zakharov
Big Data Montreal #9
February 5th, 2013

1 / 18

Presentation

• Our Objectives
• Requirements and context
• Project scope
• Hadoop Solution
– Big Data Solution Overview
– Hive Table Schema
– Compression Performance
– Data Architecture in Hadoop
– Hadoop/Impala Prototype Demo
• Oracle Solution
• Hadoop vs Oracle comparison
• What are expensive queries?

2 / 18

Our Objectives

• Lead an end-of-study project in an
industrial context
– Requirements elicitation
– Implement a « proof-of-concept » prototype

• Experiment with big data technologies
– Compare with RDBMS

3 / 18

Requirements and context

• Department of Medical Diagnostic
(medical test results DB, e.g. blood, urine, ...)
– Dr. Shaun Eintracht
• « ad hoc » Query
• ETL Query
– Dr. Elizabeth Mac Namara
• « business intelligence » requirements
• Realtime Dashboard

• Department of Endocrinology
– Dr. Mark Trifiro
• Data mining

4 / 18

Project scope

• First iteration = improve ad-hoc queries
– Slow analytical queries and ETL (MS Access)
– Risk of « crashing » production DB
– Some queries impossible to process

5 / 18

Solutions

• Solution 1 : Hadoop + Impala

• Solution 2 : Tune the existing Oracle RDBMS

7 / 18

Compression Performance

[Bar chart comparing Impala, Hive and Oracle across five storage formats: Oracle FS, Text File, Sequence File, SeqFile + Gzip, SeqFile + Snappy. Vertical axis from 0 to 250; units not shown.]
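
To make the idea of a compression ratio concrete, here is a minimal sketch that measures one with Python's standard `gzip` module. The sample rows are hypothetical, not hospital data, and Snappy is left out only because it is not in the standard library, so the figures it prints say nothing about the results in the chart.

```python
import gzip

# Hypothetical lab-result rows; real data will compress differently.
rows = [
    f"{patient},glucose,{5.0 + (patient % 7) / 10:.1f},mmol/L"
    for patient in range(10_000)
]
raw = "\n".join(rows).encode("utf-8")

# Compress the whole payload in memory and compute the size ratio.
packed = gzip.compress(raw)
ratio = len(raw) / len(packed)

# Round-trip check: decompression must give back the original bytes.
restored = gzip.decompress(packed)

print(f"raw={len(raw)} bytes, gzip={len(packed)} bytes, ratio={ratio:.1f}x")
```

A higher ratio means less data to read from disk, at the price of CPU time spent decompressing, which is the trade-off the storage formats above are compared on.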

10 / 18

Data Architecture in Hadoop

• All big tables are pre-joined
– With specimen (1)
– Without specimen (2)
• Partitioned using two schemes
– Year-month (3)
– Year and Test (4)
• 4 different versions of the same data:
– stay_order_results_yearmonth
– stay_order_results_year_and_test
– stay_order_results_specimen_yearmonth
– stay_order_results_specimen_year_and_test
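
The two partitioning schemes can be sketched as follows. This is only an illustration: the field names (`result_date`, `test`, `value`) and the sample rows are assumptions, and the `key=value` naming follows the usual Hive directory convention rather than anything shown on the slide.

```python
from collections import defaultdict
from datetime import date

# Hypothetical pre-joined rows: one row per test result.
# Field names are illustrative only.
ROWS = [
    {"result_date": date(2012, 11, 3), "test": "glucose", "value": 5.4},
    {"result_date": date(2012, 11, 19), "test": "tsh", "value": 2.1},
    {"result_date": date(2013, 1, 7), "test": "glucose", "value": 6.0},
]


def yearmonth_key(row):
    """Partition key for the year-month scheme, e.g. 'yearmonth=201211'."""
    d = row["result_date"]
    return f"yearmonth={d.year:04d}{d.month:02d}"


def year_and_test_key(row):
    """Partition key for the year-and-test scheme, e.g. 'year=2012/test=glucose'."""
    return f"year={row['result_date'].year:04d}/test={row['test']}"


def partition(rows, key_fn):
    """Group rows by partition key, as a partitioned table groups files by directory."""
    parts = defaultdict(list)
    for row in rows:
        parts[key_fn(row)].append(row)
    return dict(parts)


by_month = partition(ROWS, yearmonth_key)
by_year_test = partition(ROWS, year_and_test_key)

print(sorted(by_month))
print(sorted(by_year_test))
```

A query filtered on a single month only has to read one partition of the year-month layout, while a query on one test over a whole year is better served by the year-and-test layout. That is presumably why the same data is stored both ways.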

11 / 18

Oracle Solution

• Same tables as source DB
– A big pre-joined table is not a good solution
• Techniques explored :
– Partitioning
• Partitions automatically created
– Compression
• Inefficient for joins
– Clustering
– Join multiple partitioned tables
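
Why partitioning helps can be shown with a small sketch of partition pruning: given a date filter, only the matching partitions need to be scanned. Monthly partitions are an assumption here (the slide only says partitions are created automatically), and the function names are hypothetical.

```python
from datetime import date


def months_in_range(start, end):
    """Yield (year, month) pairs covering the dates from start to end, inclusive."""
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        yield year, month
        month += 1
        if month > 12:
            year, month = year + 1, 1


def partitions_to_scan(available, start, end):
    """Keep only the partitions that can contain rows in the date range."""
    wanted = set(months_in_range(start, end))
    return sorted(p for p in available if p in wanted)


# Hypothetical monthly partitions of a results table.
AVAILABLE = [(2012, 10), (2012, 11), (2012, 12), (2013, 1), (2013, 2)]

selected = partitions_to_scan(AVAILABLE, date(2012, 12, 15), date(2013, 1, 10))
print(selected)
```

Here two of the five partitions are read and the other three are skipped without touching their data.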

13 / 18

Oracle Solution (continued)

• Avoid too many indexes on the big tables:
– Takes a lot of memory
– Slow to create
– May not be used if a query touches more than 5% of the rows
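
The 5% rule of thumb can be written down as a one-line check. This is a simplification for illustration: the function name and the sample row counts are hypothetical, and a real cost-based optimizer decides from table statistics rather than a fixed cutoff.

```python
def index_likely_used(matching_rows, total_rows, threshold=0.05):
    """Return True when the query touches a small enough fraction of the
    table for an index lookup to be worthwhile.

    The 5% default mirrors the slide; it is a rule of thumb, not a
    documented optimizer setting.
    """
    if total_rows <= 0:
        raise ValueError("total_rows must be positive")
    selectivity = matching_rows / total_rows
    return selectivity <= threshold


# Hypothetical table of 10 million result rows.
TOTAL = 10_000_000

print(index_likely_used(1_000, TOTAL))      # lookup for a single patient
print(index_likely_used(2_000_000, TOTAL))  # analytical query over 20% of rows
```

Analytical queries tend to fall in the second case, so extra indexes on the big tables cost memory and build time without speeding those queries up.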

14 / 18

Comparison: Hadoop Solution

• Pro
– Crunches massive amounts of data
– Scalability
– Free software
• Cons
– Needs better UI and tune-ups
– Maintenance cost
– Requires ETL time to merge data into one table
– Big joins should be avoided

15 / 18

Comparison: Oracle Solution

• Pro
– Just need to create a slave DB (just?)
– Faster random-lookup
– Easier to find expertise
• Cons
– Scales only up to a certain point
– Synchronisation with master DB:
• Rebuilding indexes would take hours

16 / 18

What are expensive queries?

• If possible, avoid these constructs on
large result sets
– SELECT DISTINCT
– ORDER BY
– GROUP BY
– JOIN big table with another big table
• JOIN big table with multiple small tables should be OK
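
One way to see why joining a big table with several small tables is cheap: the small tables fit in memory as lookup dictionaries, so the big table is streamed only once. The sketch below assumes hypothetical table and column names and is not how Impala or Oracle implement joins internally.

```python
# Hypothetical small lookup tables, held in memory as dictionaries.
TESTS = {1: "glucose", 2: "tsh"}
UNITS = {10: "mmol/L", 11: "mIU/L"}

# Hypothetical big table, streamed row by row.
RESULTS = [
    {"patient": "A", "test_id": 1, "unit_id": 10, "value": 5.4},
    {"patient": "B", "test_id": 2, "unit_id": 11, "value": 2.1},
    {"patient": "C", "test_id": 9, "unit_id": 10, "value": 1.0},  # unknown test
]


def join_with_small_tables(big_rows, tests, units):
    """Stream the big table once, resolving ids against the small tables."""
    for row in big_rows:
        test = tests.get(row["test_id"])
        unit = units.get(row["unit_id"])
        if test is None or unit is None:
            continue  # inner-join semantics: drop rows without a match
        yield {
            "patient": row["patient"],
            "test": test,
            "value": row["value"],
            "unit": unit,
        }


joined = list(join_with_small_tables(RESULTS, TESTS, UNITS))
for row in joined:
    print(row)
```

Each big-table row costs two dictionary lookups, whatever the size of the big table. Joining two big tables offers no such shortcut, because neither side fits in memory as a lookup structure.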

17 / 18

Conclusion

• Recommendation to use a “classic” RDBMS
– The database fits on a single node
– Existing expertise in-house
– Acceptable performance with appropriate
tune-ups
– Stop using MS Access
• Disadvantage : limited scalability

18 / 18

Editor's Notes

#4: Choose Shaun: smaller scale, immediate need, lets us test the technology
#5: Choose Shaun: smaller scale, immediate need, lets us test the technology
#6: Database containing the test analysis data for patient specimens, with the results. Running analytical queries on the production database is very slow and can interfere with normal operation with
#8: Database containing the test analysis data for patient specimens, with the results. Running analytical queries on the production database is very slow and can interfere with normal operation with
#10: WILL NOT TALK ABOUT: requirements elicitation
#11: 25% faster with Snappy compression (5.5X compression). Impala 80% faster than Oracle
#13: Choose Shaun: smaller scale, immediate need, lets us test the technology