A STUDY OF SQL OPTIMIZER AND HIVE
COEN 380 Project
Project Group : 1
Bhide, Aishwarya
Patnaik, Anita
Sekar, Vishaka Balasubramanian
Yoloye, Mose
Project Goal
● Understand how SQL Optimizer works
● Generate query plans using Oracle Explain
● Understand the basic principles of Hive
● Execute queries on Hive
● Compare query execution using Oracle and Hive
The SQL Optimizer
● Why do we need the optimizer?
SELECT * FROM Books WHERE author = 'Ernest Hemingway';
Two ways to execute it –
• Full table scan
• Index on author
Is there a difference?
• 10 rows
• 10 million rows
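The difference between the two access paths can be observed directly. The sketch below is an illustration only: it uses SQLite (via Python's standard `sqlite3` module) as a stand-in for Oracle, and the `Books` rows and the index name `idx_books_author` are made up for the example. It asks the engine for its plan before and after an index on `author` exists.

```python
import sqlite3

# SQLite is a stand-in here; the slides use Oracle. Rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Books (title TEXT, author TEXT)")
conn.executemany(
    "INSERT INTO Books VALUES (?, ?)",
    [
        (f"Book {i}", "Ernest Hemingway" if i % 1000 == 0 else f"Author {i}")
        for i in range(10_000)
    ],
)

QUERY = "SELECT * FROM Books WHERE author = 'Ernest Hemingway'"


def plan(connection, sql):
    """Return the human-readable detail column of EXPLAIN QUERY PLAN."""
    return [row[3] for row in connection.execute("EXPLAIN QUERY PLAN " + sql)]


# Without an index the only option is to read every row.
plan_without_index = plan(conn, QUERY)

# With an index on author the engine can look up matching rows directly.
conn.execute("CREATE INDEX idx_books_author ON Books(author)")
plan_with_index = plan(conn, QUERY)

print(plan_without_index)
print(plan_with_index)
```

On a 10-row table the two plans cost about the same; on 10 million rows the scan touches every row while the index lookup touches only the matches, which is why the choice is left to the optimizer.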
The SQL Optimizer
● SQL is a declarative language
○ Query specifies what, the SQL engine decides how
○ How does understanding SQL optimizer help?
Data Set Up
Reference: Database System Concepts - Silberschatz, Korth, Sudarshan
Data Set Up
● Queries
○ single relation
○ join ( 2-way and 3-way join)
○ aggregate function
○ Aggregates with grouping
○ Set operations – Union, Except
○ Sub queries
○ Sub queries using WITH clause
○ Update and Delete
Project Execution
● Set up Oracle database
● Generate query optimizer plans using Oracle Explain
● Set up tables and insert data in Hive
● Execute queries on Hive
Oracle Query Plan Results - 1
Query using single relation
SELECT title FROM course WHERE dept_name = 'Comp. Sci.' AND credits = 3;
Oracle Query Plan Results - 2
Query using 2-way join
SELECT DISTINCT ID FROM takes WHERE (takes.course_id, takes.sec_id, takes.semester,
takes.year) IN (SELECT course_id, sec_id, semester, year FROM teaches NATURAL JOIN
instructor WHERE name = 'Einstein');
Relational Algebra Expression:
[Figures: plan based on the Oracle-generated query plan; self-created query plan]
Oracle Query Plan Results - 3
Query using 3-way join
SELECT name, title FROM (instructor NATURAL JOIN teaches) JOIN course USING
(course_id);
Relational Algebra Expression:
Equivalent Expression:
[Figures: plan based on the Oracle-generated query plan; self-created query plan]
Schema:
instructor(ID, name, dept_name, salary)
teaches(ID, course_id, sec_id, semester, year)
course(course_id, title, dept_name, credits)
Oracle Query Plan Results - 4
Query using aggregate function
SELECT MAX(salary) FROM instructor;
Relational Algebra Expression:
Oracle Query Plan Results - 5
Query for aggregate with grouping
SELECT COUNT(ID), course_id, sec_id FROM section NATURAL JOIN takes
WHERE semester='Fall' AND year=2009 GROUP BY course_id, sec_id;
Oracle Query Plan Results - 6
Query using union operation
(SELECT course_id FROM section WHERE semester = 'Fall' AND year = 2009) UNION
(SELECT course_id FROM section WHERE semester='Spring' AND year=2010);
[Figures: expected plan; Oracle plan]
Oracle Query Plan Results - 7
Query using intersect operation
(SELECT course_id FROM section WHERE semester = 'Fall' AND year = 2009)
INTERSECT (SELECT course_id FROM section WHERE semester='Spring' AND
year=2010);
[Figures: expected plan; Oracle plan]
Oracle Query Plan Results - 8
Query using a subquery
SELECT name FROM instructor WHERE salary = (SELECT MAX(salary) FROM
instructor);
[Figures: expected plan; Oracle plan]
Oracle Query Plan Results - 9
Query using subquery and rename operation
SELECT MAX(enrollment), course_id FROM (SELECT Count(ID) as enrollment, sec_id, course_id
FROM takes WHERE year=2009 and semester='Fall' GROUP BY sec_id, course_id) GROUP BY
course_id;
Query Plan # 9 - using subquery
[Figure: self-created plan]
Matches Oracle's plan
Oracle Query Plan Results - 10
Find the maximum enrollment across all sections in Fall 2009
WITH enrollment(course_id, sec_id, total) AS (SELECT course_id, sec_id, COUNT(ID) FROM
section NATURAL JOIN takes WHERE semester='Fall' AND year=2009 GROUP BY course_id,
sec_id) SELECT MAX(total) FROM enrollment;
Query # 10 subquery and aggregation
[Figure: self-created plan. Inner step: SELECT COUNT(ID) AS id FROM section NATURAL JOIN takes WHERE semester='Fall' AND year=2009 GROUP BY course_id, sec_id. Outer step: SELECT MAX(id)]
Matches Oracle's plan
Oracle Query Plan Results - 11
Increase the salary of each instructor in the Comp. Sci. department by 10%
UPDATE instructor SET salary = salary * 1.10 WHERE dept_name = 'Comp. Sci.';
Query #11 update query
$$instructor \leftarrow \Pi_{ID,\ name,\ dept\_name,\ salary \times 1.10}\left(\sigma_{dept\_name = \text{'Comp. Sci.'}}(instructor)\right) \cup \sigma_{dept\_name \neq \text{'Comp. Sci.'}}(instructor)$$
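The update can be read as "project a raised salary over the selected rows, then union with the untouched rows". As a rough check of that reading, the sketch below compares the two forms in SQLite (a stand-in for Oracle, via Python's `sqlite3`). The three instructor rows are hypothetical, and the multiplier is 1.10 to match the UPDATE statement.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE instructor (ID TEXT, name TEXT, dept_name TEXT, salary REAL)"
)
# Hypothetical rows for illustration only.
conn.executemany(
    "INSERT INTO instructor VALUES (?, ?, ?, ?)",
    [
        ("1", "Ada", "Comp. Sci.", 60000.0),
        ("2", "Ben", "Physics", 80000.0),
        ("3", "Cy", "Comp. Sci.", 90000.0),
    ],
)

# Relational-algebra form: projection over the selected rows,
# unioned with the rows the selection excludes.
algebra_result = conn.execute(
    """
    SELECT ID, name, dept_name, salary * 1.10 FROM instructor
    WHERE dept_name = 'Comp. Sci.'
    UNION
    SELECT ID, name, dept_name, salary FROM instructor
    WHERE dept_name <> 'Comp. Sci.'
    ORDER BY 1
    """
).fetchall()

# SQL form from the slide.
conn.execute(
    "UPDATE instructor SET salary = salary * 1.10 WHERE dept_name = 'Comp. Sci.'"
)
update_result = conn.execute(
    "SELECT ID, name, dept_name, salary FROM instructor ORDER BY ID"
).fetchall()


def rounded(rows):
    """Round salaries so floating-point noise does not affect the comparison."""
    return [(i, n, d, round(s, 2)) for i, n, d, s in rows]


print(rounded(algebra_result) == rounded(update_result))
```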
Oracle Query Plan Results - 12
Delete all courses that have never been offered
DELETE FROM course
WHERE course_id IN (SELECT course_id FROM course MINUS SELECT course_id FROM course
NATURAL JOIN section);
Query #12 join
Oracle Optimizer - Summary
● The purpose of the Oracle Optimizer is to determine the most efficient execution plan for each query
● EXPLAIN PLAN is the most direct tool for seeing the chosen plan and, through its cost and cardinality estimates, why it was chosen
● The optimizer chooses the best plan by weighing four key elements of a query: cardinality, access methods, join methods, and join order
Hive
● Why Hive?
○ Rapidly increasing size of datasets - 700TB data set
○ Warehouse built using RDBMS failed to scale
○ Need for scalable analysis on large data sets
○ Hadoop was not easy for end users
○ Need for improved querying capability
○ Need for diverse applications and users
Hive is NOT
● A relational database
● A design for OnLine Transaction Processing (OLTP)
● A language for real-time queries and row-level updates
Hive - Features
● Features of Hive
○ It stores the schema in a database and the processed data in HDFS.
○ It is designed for OLAP.
○ It provides an SQL-like query language called HiveQL (HQL).
○ It is familiar, fast, scalable, and extensible.
HiveQL - Query Language
Query Language (HiveQL)
○ Subset of SQL queries - SQL-like language
○ Metadata browsing capabilities
○ Explain plan capabilities (naive rule-based optimizer)
○ Seamless plugging in of map-reduce programs, e.g.
FROM (
  MAP doctext USING 'python wc_mapper.py' AS (word, cnt)
  FROM docs
  CLUSTER BY word
) a
REDUCE word, cnt USING 'python wc_reduce.py';
Data Model and Query Language
HiveQL - Limitations
○ No support for WHERE-clause subqueries in the initial version
○ Only equality predicates are supported for joins
○ No support for inserting into an existing table (UPDATE, DELETE, and INSERT INTO are not supported)
Why is this not a problem at Facebook?
○ Almost all queries can be expressed using equi-joins
○ Data is loaded in separate partitions
○ No complex locking protocol is required
Hive Query Execution
○ Parse the query
○ Type checking and semantic analysis
○ Optimization
  • Performs a chain of transformations
  • Walks the DAG, checks whether each rule's condition is fulfilled, and executes the rule
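A rule-based pass of this kind can be sketched in a few lines. The code below is a toy illustration, not Hive's implementation: the `Scan`, `Join`, and `Filter` node classes and the `optimize` walker are hypothetical names, and predicate pushdown is used as the single example rule. The walker visits the plan bottom-up and, at each node, lets every rule check its condition and rewrite the node if it applies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scan:
    table: str
    columns: tuple


@dataclass(frozen=True)
class Join:
    left: object
    right: object


@dataclass(frozen=True)
class Filter:
    column: str
    value: object
    child: object


def columns_of(node):
    """Columns produced by a plan node."""
    if isinstance(node, Scan):
        return set(node.columns)
    if isinstance(node, Join):
        return columns_of(node.left) | columns_of(node.right)
    return columns_of(node.child)  # Filter passes its child's columns through


def push_down_filter(node):
    """Rule: move a Filter below a Join when only one input has its column."""
    if isinstance(node, Filter) and isinstance(node.child, Join):
        join = node.child
        in_left = node.column in columns_of(join.left)
        in_right = node.column in columns_of(join.right)
        if in_left and not in_right:
            return Join(Filter(node.column, node.value, join.left), join.right)
        if in_right and not in_left:
            return Join(join.left, Filter(node.column, node.value, join.right))
    return node  # condition not fulfilled: leave the node unchanged


def optimize(node, rules):
    """Walk the plan bottom-up, applying each rule at every node."""
    if isinstance(node, Join):
        node = Join(optimize(node.left, rules), optimize(node.right, rules))
    elif isinstance(node, Filter):
        node = Filter(node.column, node.value, optimize(node.child, rules))
    for rule in rules:
        node = rule(node)
    return node


teaches = Scan("teaches", ("ID", "course_id", "sec_id", "semester", "year"))
instructor = Scan("instructor", ("ID", "name", "dept_name", "salary"))
plan = Filter("name", "Einstein", Join(teaches, instructor))
optimized = optimize(plan, [push_down_filter])
print(optimized)
```

Because `name` exists only in `instructor`, the filter moves below the join onto that input. A filter on `ID`, which both inputs share, stays where it is. This single pass does not re-visit a filter after moving it, so a real optimizer would need to repeat the walk or recurse into the rewritten node.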
Hive - Query Optimizer
Query Optimizer - Transformations
○ Column pruning
○ Predicate pushdown
○ Partition pruning
○ Map-side joins
  • Small tables are kept in every mapper's memory
  • Minimizes the cost of sorting and merging
○ Join reordering
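The map-side join idea can be sketched in plain Python. This is only an in-process illustration of the technique, not Hive code: the function name `map_side_join` and the sample rows are hypothetical. The small table is loaded into an in-memory hash map once, and each row of the large table is joined by a lookup, so no sort or merge phase is needed.

```python
from collections import defaultdict


def map_side_join(large_rows, small_rows, key):
    """Join two lists of dict rows on `key` using an in-memory hash table.

    The small side plays the role of the table that each mapper keeps in
    memory; the large side is streamed through one row at a time.
    """
    lookup = defaultdict(list)
    for row in small_rows:
        lookup[row[key]].append(row)

    for row in large_rows:
        for match in lookup.get(row[key], []):
            yield {**match, **row}


# Hypothetical sample rows for illustration only.
takes = [
    {"ID": "s1", "course_id": "CS-101"},
    {"ID": "s2", "course_id": "PHY-101"},
    {"ID": "s3", "course_id": "CS-999"},  # no matching course
]
course = [
    {"course_id": "CS-101", "title": "Intro. to Computer Science"},
    {"course_id": "PHY-101", "title": "Physical Principles"},
]

joined = list(map_side_join(takes, course, "course_id"))
for row in joined:
    print(row)
```

The approach only works while the small table fits in memory, which is why it is applied selectively.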
Hive: Comparison with RDBMS
● Hive
○ Designed for analytics performed on static data
○ Lacks record-level update/delete functionality
○ Write once, read many times
○ Processes massive amounts of data
○ Supports a subset of SQL queries
● RDBMS
○ Designed for transaction processing and analytics on dynamic data
○ Supports record-level update/delete
○ Read and write many times
Hive Query Execution Results
(Simple Select Query)
SELECT title FROM course WHERE dept_name = 'Comp. Sci.' AND credits = 3
Hive Query Execution Results
(subquery in FROM clause)
Hive Query Execution Results
(Aggregation & Join)
Hive Query Execution Results
(subquery)
SELECT name, salary FROM instructor i WHERE salary = (SELECT MAX(salary) FROM instructor)
Hive Query Execution Inference
● Queries that include subqueries in the WHERE or HAVING clause failed, e.g.
SELECT t.sec_id, t.course_id FROM takes t WHERE t.year=2009 AND
t.semester='Fall' HAVING count(t.ID) IN (SELECT MAX(enrollment) FROM
(SELECT COUNT(tin.ID) AS enrollment, tin.sec_id, tin.course_id FROM takes
tin WHERE tin.year=2009 AND tin.semester='Fall' GROUP BY
tin.sec_id, tin.course_id))
● Queries that include subqueries in the FROM clause executed, e.g.
SELECT MAX(enrollment), s.course_id FROM (SELECT COUNT(t.ID) AS
enrollment, t.sec_id, t.course_id FROM takes t WHERE t.year=2009 AND
t.semester='Fall' GROUP BY t.sec_id, t.course_id) s GROUP BY s.course_id
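One way to work within this restriction is to move the subquery into the FROM clause and connect it with an equi-join. The sketch below checks that the two forms return the same rows for the maximum-salary query. It runs on SQLite through Python's `sqlite3` as a stand-in, with hypothetical rows; whether a particular Hive version accepts the rewritten form is an assumption that was not tested here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE instructor (ID TEXT, name TEXT, dept_name TEXT, salary REAL)"
)
# Hypothetical rows for illustration only.
conn.executemany(
    "INSERT INTO instructor VALUES (?, ?, ?, ?)",
    [
        ("1", "Ada", "Comp. Sci.", 60000.0),
        ("2", "Ben", "Physics", 80000.0),
        ("3", "Cy", "Comp. Sci.", 90000.0),
    ],
)

# Original form: scalar subquery in the WHERE clause.
where_form = """
    SELECT name, salary FROM instructor
    WHERE salary = (SELECT MAX(salary) FROM instructor)
"""

# Rewritten form: the subquery becomes a derived table in FROM,
# joined with an equality predicate only.
from_form = """
    SELECT i.name, i.salary
    FROM instructor i
    JOIN (SELECT MAX(salary) AS max_salary FROM instructor) m
      ON i.salary = m.max_salary
"""

where_rows = sorted(conn.execute(where_form).fetchall())
from_rows = sorted(conn.execute(from_form).fetchall())
print(where_rows == from_rows)
```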
Hive - Use cases
● Hive should be used for analytical querying of data collected over a period of
time, for instance to calculate trends or analyze website logs.
● Hive should not be used for real-time querying.
● It provides data warehousing facilities on top of an existing Hadoop
cluster, along with an SQL-like interface that makes the work easier.
● Tables can be created in Hive and data stored there. Existing HBase
tables can also be mapped to Hive and queried.
Hive Query execution inference
Data Size: 20MB
Hardware configuration
○ Hadoop
  • Environment: Cloudera CDH-5.6 - YARN (MapReduce v2) and Spark (1.5)
  • Worker nodes: 24
  • Cores: 96 (4 cores per node)
  • Threads: 192
  • RAM: 768GB
○ Oracle
  • AMD A8-4555M APU with Radeon HD Graphics, 1.60 GHz
  • 4 cores
  • 8GB RAM
  • 64-bit operating system
Average execution time of queries
○ Hadoop: 31.85 seconds
○ Oracle: 1 second
Hive Query execution inference
Executed Queries
● Simple SELECT queries
● Join
● Subqueries within FROM clause
● Union
● Intersection (subqueries within FROM clause)
● Aggregation with grouping
Failed Queries
● Update
● Delete
● Queries with 'WITH' clause
● Subqueries within WHERE clause
Demo
● Oracle Explain
● Hive
Thank you!
