SlideShare a Scribd company logo
How to understand and analyze
Apache Hive query execution plan
for performance debugging
© Hortonworks Inc. 2011 – 2015. All Rights Reserved
Pengcheng Xiong and Ashutosh Chauhan
Hortonworks Inc., Apache Hive
Community
{pxiong,ashutosh}@hortonworks.com
Page2 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Goals:
• WHY old style Hive explain plan is hard to read
• Compare the old style explain with Postgres over a body of 500+ realistic SQL queries.
• WHAT is new style Hive explain plan
• Show orchestration of the tasks and operator trees, join sequences and algorithms, operator
execution costs
• HOW to performance debug a real query by analyzing the new Hive
explain plan
• Identify the potential improvement by changing join sequence, join algorithm and etc
• Show the real improvement by running the query in real cluster
• Integration/interaction with other system/tools
• Future work
Page3 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
WHY old style Hive explain plan is hard to read
• M**** company’s schema and queries.
• Comparison of explain plans between Hive and Postgres for the 528
queries they can both execute.
• Hive: “explain”, Postgres: “explain verbose”
Hive “old
style”, 233.5
postgres,
53.8
Hive “old
style”, 1289
postgres,
328.6
Average Lines Per Explain Plan Average Words Per Explain Plan
N = 528 queries
Page4 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
We can see that Hive old style explain is quite
verbose, is it necessary?
select a11.PBTNAME PBTNAME
from PMT_INVENTORY a11
join LU_MONTH a12
on (a11.QUARTER_ID = a12.QUARTER_ID)
where a12.MONTH_ID in (200607, 200606)
group by a11.PBTNAME;
Page5 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
High level plan comparison: Postgres
QUERY PLAN
---------------------------------------------------------------------------------
Group (cost=3.83..3.84 rows=2 width=18)
Output: a11.pbtname
Group Key: a11.pbtname
-> Sort (cost=3.83..3.84 rows=2 width=18)
Output: a11.pbtname
Sort Key: a11.pbtname
-> Hash Join (cost=2.62..3.83 rows=2 width=18)
Output: a11.pbtname
Hash Cond: (a11.quarter_id = a12.quarter_id)
-> Seq Scan on public.pmt_inventory a11 (cost=0.00..1.12 rows=12 width=22)
Output: a11.quarter_id, a11.pbtname
-> Hash (cost=2.60..2.60 rows=2 width=4)
Output: a12.quarter_id
-> Seq Scan on public.lu_month a12 (cost=0.00..2.60 rows=2 width=4)
Output: a12.quarter_id
Filter: (a12.month_id = ANY ('{200607,200606}'::integer[]))
16 lines
Page6 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
High level plan comparison: Hive
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-0 depends on stages: Stage-1
STAGE PLANS:
Stage: Stage-1
Tez
Edges:
Map 2 <- Map 1 (BROADCAST_EDGE)
Reducer 3 <- Map 2 (SIMPLE_EDGE)
DagName: carter_20151114133018_2f2f0101-d14d-4688-bbb1-db67055016c3:946
Vertices:
Map 1
Map Operator Tree:
TableScan
alias: a11
filterExpr: quarter_id is not null (type: boolean)
Statistics: Num rows: 12 Data size: 1260 Basic stats: COMPLETE Column stats: NONE
Filter Operator
predicate: quarter_id is not null (type: boolean)
Statistics: Num rows: 6 Data size: 630 Basic stats: COMPLETE Column stats: NONE
Reduce Output Operator
key expressions: quarter_id (type: int)
sort order: +
Map-reduce partition columns: quarter_id (type: int)
Statistics: Num rows: 6 Data size: 630 Basic stats: COMPLETE Column stats: NONE
value expressions: pbtname (type: string)
Execution mode: vectorized
Map 2
Map Operator Tree:
TableScan
alias: a12
filterExpr: (quarter_id is not null and (month_id) IN (200607, 200606)) (type: boolean)
Statistics: Num rows: 48 Data size: 46752 Basic stats: COMPLETE Column stats: NONE
Filter Operator
predicate: (quarter_id is not null and (month_id) IN (200607, 200606)) (type: boolean)
Statistics: Num rows: 12 Data size: 11688 Basic stats: COMPLETE Column stats: NONE
Map Join Operator
condition map:
Inner Join 0 to 1
keys:
0 quarter_id (type: int)
1 quarter_id (type: int)
outputColumnNames: _col1
input vertices:
0 Map 1
Statistics: Num rows: 13 Data size: 12856 Basic stats: COMPLETE Column stats: NONE
HybridGraceHashJoin: true
Group By Operator
keys: _col1 (type: string)
mode: hash
outputColumnNames: _col0
Statistics: Num rows: 13 Data size: 12856 Basic stats: COMPLETE Column stats: NONE
Reduce Output Operator
key expressions: _col0 (type: string)
sort order: +
Map-reduce partition columns: _col0 (type: string)
Statistics: Num rows: 13 Data size: 12856 Basic stats: COMPLETE Column stats: NONE
Execution mode: vectorized
Reducer 3
Reduce Operator Tree:
Group By Operator
keys: KEY._col0 (type: string)
mode: mergepartial
outputColumnNames: _col0
Statistics: Num rows: 6 Data size: 5933 Basic stats: COMPLETE Column stats: NONE
File Output Operator
compressed: false
Statistics: Num rows: 6 Data size: 5933 Basic stats: COMPLETE Column stats: NONE
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Stage: Stage-0
Fetch Operator
limit: -1
Processor Tree:
ListSink
Too verbose, need a magnifier!
80+ lines
Page7 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Map 2 Operator
Map 2
Map Operator Tree:
TableScan
alias: a12
filterExpr: (quarter_id is not null and (month_id) IN (200607, 200606)) (type: boolean)
Statistics: Num rows: 48 Data size: 46752 Basic stats: COMPLETE Column stats: NONE
Filter Operator
predicate: (quarter_id is not null and (month_id) IN (200607, 200606)) (type: boolean)
Statistics: Num rows: 12 Data size: 11688 Basic stats: COMPLETE Column stats: NONE
Map Join Operator
condition map:
Inner Join 0 to 1
keys:
0 quarter_id (type: int)
1 quarter_id (type: int)
outputColumnNames: _col1
input vertices:
0 Map 1
Statistics: Num rows: 13 Data size: 12856 Basic stats: COMPLETE Column stats: NONE
HybridGraceHashJoin: true
Group By Operator
keys: _col1 (type: string)
mode: hash
outputColumnNames: _col0
Statistics: Num rows: 13 Data size: 12856 Basic stats: COMPLETE Column stats: NONE
Reduce Output Operator
key expressions: _col0 (type: string)
sort order: +
Map-reduce partition columns: _col0 (type: string)
Statistics: Num rows: 13 Data size: 12856 Basic stats: COMPLETE Column stats: NONE
Execution mode: vectorized
Data flows from top to bottom
Each operator has 0 or 1
child
Page8 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Map 2 Operator
Map 2
Map Operator Tree:
TableScan
alias: a12
filterExpr: (quarter_id is not null and (month_id) IN (200607, 200606)) (type: boolean)
Statistics: Num rows: 48 Data size: 46752 Basic stats: COMPLETE Column stats: NONE
Filter Operator
predicate: (quarter_id is not null and (month_id) IN (200607, 200606)) (type: boolean)
Statistics: Num rows: 12 Data size: 11688 Basic stats: COMPLETE Column stats: NONE
Map Join Operator
condition map:
Inner Join 0 to 1
keys:
0 quarter_id (type: int)
1 quarter_id (type: int)
outputColumnNames: _col1
input vertices:
0 Map 1
Statistics: Num rows: 13 Data size: 12856 Basic stats: COMPLETE Column stats: NONE
HybridGraceHashJoin: true
Group By Operator
keys: _col1 (type: string)
mode: hash
outputColumnNames: _col0
Statistics: Num rows: 13 Data size: 12856 Basic stats: COMPLETE Column stats: NONE
Reduce Output Operator
key expressions: _col0 (type: string)
sort order: +
Map-reduce partition columns: _col0 (type: string)
Statistics: Num rows: 13 Data size: 12856 Basic stats: COMPLETE Column stats: NONE
Execution mode: vectorized
Must scroll to another part
of the plan to see what this is
Page9 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Map 1 Operator
Map 1
Map Operator Tree:
TableScan
alias: a11
filterExpr: quarter_id is not null (type: boolean)
Statistics: Num rows: 12 Data size: 1260 Basic stats: COMPLETE Column stats: NONE
Filter Operator
predicate: quarter_id is not null (type: boolean)
Statistics: Num rows: 6 Data size: 630 Basic stats: COMPLETE Column stats: NONE
Reduce Output Operator
key expressions: quarter_id (type: int)
sort order: +
Map-reduce partition columns: quarter_id (type: int)
Statistics: Num rows: 6 Data size: 630 Basic stats: COMPLETE Column stats: NONE
value expressions: pbtname (type: string)
Execution mode: vectorized
Actual table name (PMT_INVENTORY)
not mentioned anywhere, only
the alias
Page10 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Map 1 Operator
Map 1
Map Operator Tree:
TableScan
alias: a11
filterExpr: quarter_id is not null (type: boolean)
Statistics: Num rows: 12 Data size: 1260 Basic stats: COMPLETE Column stats: NONE
Filter Operator
predicate: quarter_id is not null (type: boolean)
Statistics: Num rows: 6 Data size: 630 Basic stats: COMPLETE Column stats: NONE
Reduce Output Operator
key expressions: quarter_id (type: int)
sort order: +
Map-reduce partition columns: quarter_id (type: int)
Statistics: Num rows: 6 Data size: 630 Basic stats: COMPLETE Column stats: NONE
value expressions: pbtname (type: string)
Execution mode: vectorized
How much of this information is really
necessary to SQL users?
Page11 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Back to Postgres
QUERY PLAN
---------------------------------------------------------------------------------
Group (cost=3.83..3.84 rows=2 width=18)
Output: a11.pbtname
Group Key: a11.pbtname
-> Sort (cost=3.83..3.84 rows=2 width=18)
Output: a11.pbtname
Sort Key: a11.pbtname
-> Hash Join (cost=2.62..3.83 rows=2 width=18)
Output: a11.pbtname
Hash Cond: (a11.quarter_id = a12.quarter_id)
-> Seq Scan on public.pmt_inventory a11 (cost=0.00..1.12 rows=12 width=22)
Output: a11.quarter_id, a11.pbtname
-> Hash (cost=2.60..2.60 rows=2 width=4)
Output: a12.quarter_id
-> Seq Scan on public.lu_month a12 (cost=0.00..2.60 rows=2 width=4)
Output: a12.quarter_id
Filter: (a12.month_id = ANY ('{200607,200606}'::integer[]))
Data flows bottom to top
Page12 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Back to Postgres
QUERY PLAN
---------------------------------------------------------------------------------
Group (cost=3.83..3.84 rows=2 width=18)
Output: a11.pbtname
Group Key: a11.pbtname
-> Sort (cost=3.83..3.84 rows=2 width=18)
Output: a11.pbtname
Sort Key: a11.pbtname
-> Hash Join (cost=2.62..3.83 rows=2 width=18)
Output: a11.pbtname
Hash Cond: (a11.quarter_id = a12.quarter_id)
-> Seq Scan on public.pmt_inventory a11 (cost=0.00..1.12 rows=12 width=22)
Output: a11.quarter_id, a11.pbtname
-> Hash (cost=2.60..2.60 rows=2 width=4)
Output: a12.quarter_id
-> Seq Scan on public.lu_month a12 (cost=0.00..2.60 rows=2 width=4)
Output: a12.quarter_id
Filter: (a12.month_id = ANY ('{200607,200606}'::integer[]))
Operators have multiple
children when it makes
sense
Page13 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Back to Postgres
QUERY PLAN
---------------------------------------------------------------------------------
Group (cost=3.83..3.84 rows=2 width=18)
Output: a11.pbtname
Group Key: a11.pbtname
-> Sort (cost=3.83..3.84 rows=2 width=18)
Output: a11.pbtname
Sort Key: a11.pbtname
-> Hash Join (cost=2.62..3.83 rows=2 width=18)
Output: a11.pbtname
Hash Cond: (a11.quarter_id = a12.quarter_id)
-> Seq Scan on public.pmt_inventory a11 (cost=0.00..1.12 rows=12 width=22)
Output: a11.quarter_id, a11.pbtname
-> Hash (cost=2.60..2.60 rows=2 width=4)
Output: a12.quarter_id
-> Seq Scan on public.lu_month a12 (cost=0.00..2.60 rows=2 width=4)
Output: a12.quarter_id
Filter: (a12.month_id = ANY ('{200607,200606}'::integer[]))
Join is done using a scan of pmt_inventory and a hash
following a scan of lu_month.
All this info is available without referring to a stage plan.
IOW you don’t have to jump around in the plan.
Page14 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Back to Postgres
QUERY PLAN
---------------------------------------------------------------------------------
Group (cost=3.83..3.84 rows=2 width=18)
Output: a11.pbtname
Group Key: a11.pbtname
-> Sort (cost=3.83..3.84 rows=2 width=18)
Output: a11.pbtname
Sort Key: a11.pbtname
-> Hash Join (cost=2.62..3.83 rows=2 width=18)
Output: a11.pbtname
Hash Cond: (a11.quarter_id = a12.quarter_id)
-> Seq Scan on public.pmt_inventory a11 (cost=0.00..1.12 rows=12 width=22)
Output: a11.quarter_id, a11.pbtname
-> Hash (cost=2.60..2.60 rows=2 width=4)
Output: a12.quarter_id
-> Seq Scan on public.lu_month a12 (cost=0.00..2.60 rows=2 width=4)
Output: a12.quarter_id
Filter: (a12.month_id = ANY ('{200607,200606}'::integer[]))
Actual schema / table names
visible
Page15 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Back to Postgres
QUERY PLAN
---------------------------------------------------------------------------------
Group (cost=3.83..3.84 rows=2 width=18)
Output: a11.pbtname
Group Key: a11.pbtname
-> Sort (cost=3.83..3.84 rows=2 width=18)
Output: a11.pbtname
Sort Key: a11.pbtname
-> Hash Join (cost=2.62..3.83 rows=2 width=18)
Output: a11.pbtname
Hash Cond: (a11.quarter_id = a12.quarter_id)
-> Seq Scan on public.pmt_inventory a11 (cost=0.00..1.12 rows=12 width=22)
Output: a11.quarter_id, a11.pbtname
-> Hash (cost=2.60..2.60 rows=2 width=4)
Output: a12.quarter_id
-> Seq Scan on public.lu_month a12 (cost=0.00..2.60 rows=2 width=4)
Output: a12.quarter_id
Filter: (a12.month_id = ANY ('{200607,200606}'::integer[]))
Cost mentioned once per operator
Cost monotonically increases
as you go up
Page16 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
WHAT is new style Hive explain plan (HIVE-9780)
• Set hive.explain.user=true; (by default). Use Tez, LLAP, etc
Stage-1
Reducer 3
File Output Operator [FS_14]
Group By Operator [GBY_12] (rows=8 width=101)
Output:["_col0"],keys:KEY._col0
<-Map 2 [SIMPLE_EDGE]
SHUFFLE [RS_11]
PartitionCols:_col0
Group By Operator [GBY_10] (rows=8 width=101)
Output:["_col0"],keys:_col1
Map Join Operator [MAPJOIN_19] (rows=33 width=101)
Conds:RS_6._col0=SEL_5._col1(Inner),HybridGraceHashJoin:true,Output:["_col1"]
<-Map 1 [BROADCAST_EDGE]
BROADCAST [RS_6]
PartitionCols:_col0
Select Operator [SEL_2] (rows=12 width=105)
Output:["_col0","_col1"]
Filter Operator [FIL_17] (rows=12 width=105)
predicate:quarter_id is not null
TableScan [TS_0] (rows=12 width=105)
m****@pmt_inventory,a11,Tbl:COMPLETE,Col:COMPLETE,Output:["quarter_id","pbtname"]
<-Select Operator [SEL_5] (rows=48 width=8)
Output:["_col1"]
Filter Operator [FIL_18] (rows=48 width=8)
predicate:((month_id) IN (200607, 200606) and quarter_id is not null)
TableScan [TS_3] (rows=48 width=8)
m****@lu_month,a12,Tbl:COMPLETE,Col:COMPLETE,Output:["month_id","quarter_id"]
Immediate Notes:
1. Much smaller
2. Can be read in order
Page17 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
WHAT is new style Hive explain plan (HIVE-9780)
Stage-1
Reducer 3
File Output Operator [FS_14]
Group By Operator [GBY_12] (rows=8 width=101)
Output:["_col0"],keys:KEY._col0
<-Map 2 [SIMPLE_EDGE]
SHUFFLE [RS_11]
PartitionCols:_col0
Group By Operator [GBY_10] (rows=8 width=101)
Output:["_col0"],keys:_col1
Map Join Operator [MAPJOIN_19] (rows=33 width=101)
Conds:RS_6._col0=SEL_5._col1(Inner),HybridGraceHashJoin:true,Output:["_col1"]
<-Map 1 [BROADCAST_EDGE]
BROADCAST [RS_6]
PartitionCols:_col0
Select Operator [SEL_2] (rows=12 width=105)
Output:["_col0","_col1"]
Filter Operator [FIL_17] (rows=12 width=105)
predicate:quarter_id is not null
TableScan [TS_0] (rows=12 width=105)
m****@pmt_inventory,a11,Tbl:COMPLETE,Col:COMPLETE,Output:["quarter_id","pbtname"]
<-Select Operator [SEL_5] (rows=48 width=8)
Output:["_col1"]
Filter Operator [FIL_18] (rows=48 width=8)
predicate:((month_id) IN (200607, 200606) and quarter_id is not null)
TableScan [TS_3] (rows=48 width=8)
m****@lu_month,a12,Tbl:COMPLETE,Col:COMPLETE,Output:["month_id","quarter_id"]
Data flows bottom to top
Page18 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
WHAT is new style Hive explain plan (HIVE-9780)
Stage-1
Reducer 3
File Output Operator [FS_14]
Group By Operator [GBY_12] (rows=8 width=101)
Output:["_col0"],keys:KEY._col0
<-Map 2 [SIMPLE_EDGE]
SHUFFLE [RS_11]
PartitionCols:_col0
Group By Operator [GBY_10] (rows=8 width=101)
Output:["_col0"],keys:_col1
Map Join Operator [MAPJOIN_19] (rows=33 width=101)
Conds:RS_6._col0=SEL_5._col1(Inner),HybridGraceHashJoin:true,Output:["_col1"]
<-Map 1 [BROADCAST_EDGE]
BROADCAST [RS_6]
PartitionCols:_col0
Select Operator [SEL_2] (rows=12 width=105)
Output:["_col0","_col1"]
Filter Operator [FIL_17] (rows=12 width=105)
predicate:quarter_id is not null
TableScan [TS_0] (rows=12 width=105)
m****@pmt_inventory,a11,Tbl:COMPLETE,Col:COMPLETE,Output:["quarter_id","pbtname"]
<-Select Operator [SEL_5] (rows=48 width=8)
Output:["_col1"]
Filter Operator [FIL_18] (rows=48 width=8)
predicate:((month_id) IN (200607, 200606) and quarter_id is not null)
TableScan [TS_3] (rows=48 width=8)
m****@lu_month,a12,Tbl:COMPLETE,Col:COMPLETE,Output:["month_id","quarter_id"]
Operators have multiple
children when it makes
sense
Page19 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
WHAT is new style Hive explain plan (HIVE-9780)
Stage-1
Reducer 3
File Output Operator [FS_14]
Group By Operator [GBY_12] (rows=8 width=101)
Output:["_col0"],keys:KEY._col0
<-Map 2 [SIMPLE_EDGE]
SHUFFLE [RS_11]
PartitionCols:_col0
Group By Operator [GBY_10] (rows=8 width=101)
Output:["_col0"],keys:_col1
Map Join Operator [MAPJOIN_19] (rows=33 width=101)
Conds:RS_6._col0=SEL_5._col1(Inner),HybridGraceHashJoin:true,Output:["_col1"]
<-Map 1 [BROADCAST_EDGE]
BROADCAST [RS_6]
PartitionCols:_col0
Select Operator [SEL_2] (rows=12 width=105)
Output:["_col0","_col1"]
Filter Operator [FIL_17] (rows=12 width=105)
predicate:quarter_id is not null
TableScan [TS_0] (rows=12 width=105)
m****@pmt_inventory,a11,Tbl:COMPLETE,Col:COMPLETE,Output:["quarter_id","pbtname"]
<-Select Operator [SEL_5] (rows=48 width=8)
Output:["_col1"]
Filter Operator [FIL_18] (rows=48 width=8)
predicate:((month_id) IN (200607, 200606) and quarter_id is not null)
TableScan [TS_3] (rows=48 width=8)
m****@lu_month,a12,Tbl:COMPLETE,Col:COMPLETE,Output:["month_id","quarter_id"]
Join’s information is clear
pmt_inventory is broadcasted to lu_month
and a MapJoin is done
Page20 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
WHAT is new style Hive explain plan (HIVE-9780)
Stage-1
Reducer 3
File Output Operator [FS_14]
Group By Operator [GBY_12] (rows=8 width=101)
Output:["_col0"],keys:KEY._col0
<-Map 2 [SIMPLE_EDGE]
SHUFFLE [RS_11]
PartitionCols:_col0
Group By Operator [GBY_10] (rows=8 width=101)
Output:["_col0"],keys:_col1
Map Join Operator [MAPJOIN_19] (rows=33 width=101)
Conds:RS_6._col0=SEL_5._col1(Inner),HybridGraceHashJoin:true,Output:["_col1"]
<-Map 1 [BROADCAST_EDGE]
BROADCAST [RS_6]
PartitionCols:_col0
Select Operator [SEL_2] (rows=12 width=105)
Output:["_col0","_col1"]
Filter Operator [FIL_17] (rows=12 width=105)
predicate:quarter_id is not null
TableScan [TS_0] (rows=12 width=105)
m****@pmt_inventory,a11,Tbl:COMPLETE,Col:COMPLETE,Output:["quarter_id","pbtname"]
<-Select Operator [SEL_5] (rows=48 width=8)
Output:["_col1"]
Filter Operator [FIL_18] (rows=48 width=8)
predicate:((month_id) IN (200607, 200606) and quarter_id is not null)
TableScan [TS_3] (rows=48 width=8)
m****@lu_month,a12,Tbl:COMPLETE,Col:COMPLETE,Output:["month_id","quarter_id"]
Cost mentioned once per operator
Page21 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
HOW to performance debug a real query (TPC-DS Q3, 1TB)
select
dt.d_year, item.i_brand_id brand_id, item.i_brand brand, sum(ss_ext_sales_price) sum_agg
from
date_dim dt, store_sales, item
where
dt.d_date_sk = store_sales.ss_sold_date_sk
and store_sales.ss_item_sk = item.i_item_sk
and item.i_manufact_id = 436
and dt.d_moy = 12
group by dt.d_year , item.i_brand , item.i_brand_id
order by dt.d_year , sum_agg desc , brand_id
limit 10
partitioned by ss_sold_date_sk
Page22 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
<-Reducer 3 [SIMPLE_EDGE]
SHUFFLE [RS_15]
PartitionCols:_col0, _col1, _col2
Group By Operator [GBY_14] (rows=1 width=116)
Output:["_col0","_col1","_col2","_col3"],aggregations:["sum(_col45)"],keys:_col6, _col65, _col64
Select Operator [SEL_13] (rows=76515 width=128)
Output:["_col6","_col65","_col64","_col45"]
Filter Operator [FIL_23] (rows=76515 width=128)
predicate:((_col0 = _col53) and (_col32 = _col57))
Merge Join Operator [MERGEJOIN_28] (rows=306061 width=128)
Conds:RS_8._col32=RS_34.i_item_sk(Inner),Output:["_col0","_col6","_col32","_col45","_col53","_col57","_col64","_col65"]
<-Map 7 [SIMPLE_EDGE] vectorized
SHUFFLE [RS_34]
PartitionCols:i_item_sk
Filter Operator [FIL_33] (rows=434 width=111)
predicate:(i_item_sk is not null and (i_manufact_id = 436))
TableScan [TS_2] (rows=300000 width=111)
tpcds_bin_partitioned_orc_1000@item,item,Tbl:COMPLETE,Col:COMPLETE,Output:["i_item_sk","i_brand_id","i_brand","i_manufact_id"]
<-Reducer 2 [SIMPLE_EDGE]
SHUFFLE [RS_8]
PartitionCols:_col32
Merge Join Operator [MERGEJOIN_27] (rows=211562452 width=20)
Conds:RS_30.d_date_sk=RS_32.ss_sold_date_sk(Inner),Output:["_col0","_col6","_col32","_col45","_col53"]
<-Map 1 [SIMPLE_EDGE] vectorized
SHUFFLE [RS_30]
PartitionCols:d_date_sk
Filter Operator [FIL_29] (rows=5619 width=12)
predicate:(d_date_sk is not null and (d_moy = 12))
TableScan [TS_0] (rows=73049 width=12)
tpcds_bin_partitioned_orc_1000@date_dim,dt,Tbl:COMPLETE,Col:COMPLETE,Output:["d_date_sk","d_year","d_moy"]
<-Map 6 [SIMPLE_EDGE] vectorized
SHUFFLE [RS_32]
PartitionCols:ss_sold_date_sk
Filter Operator [FIL_31] (rows=2750387156 width=11)
predicate:ss_item_sk is not null
TableScan [TS_1] (rows=2750387156 width=11)
tpcds_bin_partitioned_orc_1000@store_sales,store_sales,Tbl:COMPLETE,Col:COMPLETE,Output:["ss_item_sk","ss_ext_sales_price"]
Original plan runs 163.33s. Sounds
like column pruning and predicate
push down are working fine.
However, the join sequence
store_sales✖date_dim✖item is not
good enough. A better one is
store_sales✖item✖date_dim
Table Cardinality Cardinality after filter Selectivity
date_dim 73K 5619 7.6%
item 300K 434 0.14%
Page23 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
<-Reducer 3 [SIMPLE_EDGE]
SHUFFLE [RS_17]
PartitionCols:_col0, _col1, _col2
Group By Operator [GBY_16] (rows=9 width=116)
Output:["_col0","_col1","_col2","_col3"],aggregations:["sum(_col1)"],keys:_col8, _col4, _col5
Select Operator [SEL_15] (rows=306061 width=112)
Output:["_col8","_col4","_col5","_col1"]
Merge Join Operator [MERGEJOIN_29] (rows=306061 width=112)
Conds:RS_12._col2=RS_38._col0(Inner),Output:["_col1","_col4","_col5","_col8"]
<-Map 7 [SIMPLE_EDGE] vectorized
SHUFFLE [RS_38]
PartitionCols:_col0
Select Operator [SEL_37] (rows=5619 width=12)
Output:["_col0","_col1"]
Filter Operator [FIL_36] (rows=5619 width=12)
predicate:((d_moy = 12) and d_date_sk is not null)
TableScan [TS_6] (rows=73049 width=12)
tpcds_bin_partitioned_orc_1000@date_dim,dt,Tbl:COMPLETE,Col:COMPLETE,Output:["d_date_sk","d_year","d_moy"]
<-Reducer 2 [SIMPLE_EDGE]
SHUFFLE [RS_12]
PartitionCols:_col2
Merge Join Operator [MERGEJOIN_28] (rows=3978894 width=112)
Conds:RS_32._col0=RS_35._col0(Inner),Output:["_col1","_col2","_col4","_col5"]
<-Map 1 [SIMPLE_EDGE] vectorized
SHUFFLE [RS_32]
PartitionCols:_col0
Select Operator [SEL_31] (rows=2750387156 width=11)
Output:["_col0","_col1","_col2"]
Filter Operator [FIL_30] (rows=2750387156 width=11)
predicate:ss_item_sk is not null
TableScan [TS_0] (rows=2750387156 width=11)
tpcds_bin_partitioned_orc_1000@store_sales,store_sales,Tbl:COMPLETE,Col:COMPLETE,Output:["ss_item_sk","ss_ext_sales_price"]
<-Map 6 [SIMPLE_EDGE] vectorized
SHUFFLE [RS_35]
PartitionCols:_col0
Select Operator [SEL_34] (rows=434 width=111)
Output:["_col0","_col1","_col2"]
Filter Operator [FIL_33] (rows=434 width=111)
predicate:((i_manufact_id = 436) and i_item_sk is not null)
TableScan [TS_3] (rows=300000 width=111)
tpcds_bin_partitioned_orc_1000@item,item,Tbl:COMPLETE,Col:COMPLETE,Output:["i_item_sk","i_brand_id","i_brand","i_manufact_id"]
CBO on, new plan runs 143.97s
with new join sequence
store_sales✖item✖date_dim.
The input data size of one branch of
join is pretty small, should use map
join, rather than merge join.
Page24 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
SHUFFLE [RS_44]
PartitionCols:_col0, _col1, _col2
Group By Operator [GBY_43] (rows=9 width=116)
Output:["_col0","_col1","_col2","_col3"],aggregations:["sum(_col1)"],keys:_col8, _col4, _col5
Select Operator [SEL_42] (rows=306061 width=112)
Output:["_col8","_col4","_col5","_col1"]
Map Join Operator [MAPJOIN_41] (rows=306061 width=112)
Conds:MAPJOIN_40._col2=RS_37._col0(Inner),HybridGraceHashJoin:true,Output:["_col1","_col4","_col5","_col8"]
<-Map 5 [BROADCAST_EDGE] vectorized
BROADCAST [RS_37]
PartitionCols:_col0
Select Operator [SEL_36] (rows=5619 width=12)
Output:["_col0","_col1"]
Filter Operator [FIL_35] (rows=5619 width=12)
predicate:((d_moy = 12) and d_date_sk is not null)
TableScan [TS_6] (rows=73049 width=12)
tpcds_bin_partitioned_orc_1000@date_dim,dt,Tbl:COMPLETE,Col:COMPLETE,Output:["d_date_sk","d_year","d_moy"]
<-Map Join Operator [MAPJOIN_40] (rows=3978894 width=112)
Conds:SEL_39._col0=RS_34._col0(Inner),HybridGraceHashJoin:true,Output:["_col1","_col2","_col4","_col5"]
<-Map 4 [BROADCAST_EDGE] vectorized
BROADCAST [RS_34]
PartitionCols:_col0
Select Operator [SEL_33] (rows=434 width=111)
Output:["_col0","_col1","_col2"]
Filter Operator [FIL_32] (rows=434 width=111)
predicate:((i_manufact_id = 436) and i_item_sk is not null)
TableScan [TS_3] (rows=300000 width=111)
tpcds_bin_partitioned_orc_1000@item,item,Tbl:COMPLETE,Col:COMPLETE,Output:["i_item_sk","i_brand_id","i_brand","i_manufact_id"]
<-Select Operator [SEL_39] (rows=2750387156 width=11)
Output:["_col0","_col1","_col2"]
Filter Operator [FIL_38] (rows=2750387156 width=11)
predicate:ss_item_sk is not null
TableScan [TS_0] (rows=2750387156 width=11)
tpcds_bin_partitioned_orc_1000@store_sales,store_sales,Tbl:COMPLETE,Col:COMPLETE,Output:["ss_item_sk","ss_ext_sales_price"]
Increase
hive.auto.convert.join.noconditionaltask.size=
1,359,688,499, we can see it is now using map
join operators. New plan runs 45.84s.
store_sales is a partitioned table on
the join key ss_sold_date_sk with
date_dim table.
Page25 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
SHUFFLE [RS_55]
PartitionCols:_col0, _col1, _col2
Group By Operator [GBY_54] (rows=9 width=116)
Output:["_col0","_col1","_col2","_col3"],aggregations:["sum(_col1)"],keys:_col8, _col4, _col5
Select Operator [SEL_53] (rows=306061 width=112)
Output:["_col8","_col4","_col5","_col1"]
Map Join Operator [MAPJOIN_52] (rows=306061 width=112)
Conds:MAPJOIN_51._col2=RS_45._col0(Inner),HybridGraceHashJoin:true,Output:["_col1","_col4","_col5","_col8"]
<-Map 5 [BROADCAST_EDGE] vectorized
BROADCAST [RS_45]
PartitionCols:_col0
Select Operator [SEL_44] (rows=5619 width=12)
Output:["_col0","_col1"]
Filter Operator [FIL_43] (rows=5619 width=12)
predicate:((d_moy = 12) and d_date_sk is not null)
TableScan [TS_6] (rows=73049 width=12)
tpcds_bin_partitioned_orc_1000@date_dim,dt,Tbl:COMPLETE,Col:COMPLETE,Output:["d_date_sk","d_year","d_moy"]
Dynamic Partitioning Event Operator [EVENT_48] (rows=2809 width=12)
Group By Operator [GBY_47] (rows=2809 width=12)
Output:["_col0"],keys:_col0
Select Operator [SEL_46] (rows=5619 width=12)
Output:["_col0"]
Please refer to the previous Select Operator [SEL_44]
<-Map Join Operator [MAPJOIN_51] (rows=3978894 width=112)
Conds:SEL_50._col0=RS_42._col0(Inner),HybridGraceHashJoin:true,Output:["_col1","_col2","_col4","_col5"]
<-Map 4 [BROADCAST_EDGE] vectorized
BROADCAST [RS_42]
PartitionCols:_col0
Select Operator [SEL_41] (rows=434 width=111)
Output:["_col0","_col1","_col2"]
Filter Operator [FIL_40] (rows=434 width=111)
predicate:((i_manufact_id = 436) and i_item_sk is not null)
TableScan [TS_3] (rows=300000 width=111)
tpcds_bin_partitioned_orc_1000@item,item,Tbl:COMPLETE,Col:COMPLETE,Output:["i_item_sk","i_brand_id","i_brand","i_manufact_id"]
<-Select Operator [SEL_50] (rows=2750387156 width=11)
Output:["_col0","_col1","_col2"]
Filter Operator [FIL_49] (rows=2750387156 width=11)
predicate:ss_item_sk is not null
TableScan [TS_0] (rows=2750387156 width=11)
tpcds_bin_partitioned_orc_1000@store_sales,store_sales,Tbl:COMPLETE,Col:COMPLETE,Output:["ss_item_sk","ss_ext_sales_price"]
By setting
hive.tez.dynamic.partition.pruning=true,
we can see dynamic partitioning
event operators. See more about this
in HIVE-7826. New plan runs 31.35s.
In the run time, dynamic partition event
operator will send values needed to
prune to the application master - where
splits are generated and tasks are
submitted. Using these values we can
strip out any unneeded partitions
dynamically, while the query is running.
Page26 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Performance debugging summary (TPC-DS Q3, 1TB)
0
20
40
60
80
100
120
140
160
180
Original Join re-order Join selection Dynamic partition pruning
Queryexecutiontime(s)
Query execution time
Continuous improvement
Page27 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Integration with Apache Ambari
1. Type the query here
2. Click “explain”
3. explain plan will be shown
Page28 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
 “set hive.tez.exec.print.summary=true;”
 Get more insights on query performance
Can be used along with Tez vertex runtime stats
Runtime stats
Page29 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Summary
• Show old style Hive explain plan is hard to read.
• Verbose with too much redundant information, hard to follow how data
flows, cost of operator is unclear
• Compare with Postgres over a body of 500+ realistic SQL queries and
identify the candidate improving points
• Introduce new style Hive explain plan
• Use a concrete example to help understand the explain: execution cost, join
sequence and orchestration of the operator tree
• Use the new Hive explain plan to performance debug TPC-DS Q3
• Show the improvement after join re-ordering, join selection, and dynamic
partition pruning
• Integration/interaction with other system/tools
Page30 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Future work -- Some gaps remain after HIVE-9780
• Put the real schema, table and column names in the
explain plan, e.g., no more _col0 etc.
• This will help users to understand the plan better
• HIVE-8681: CBO: Column names are missing from join expression
in Map join with CBO enabled
• Get an equivalent of “EXPLAIN ANALYZE” – such as
operator level runtime stats and warnings.
• This will help users to find out the gap between estimated cost and
real cost
• HIVE-14362: Support explain analyze in Hive
Page31 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
SHUFFLE [RS_55]
PartitionCols:d_year, i_brand, i_brand_id
Group By Operator [GBY_54] (rows=9/10 width=116)
Output:["d_year","sum_agg","i_brand","i_brand_id"],aggregations:["sum(ss_ext_sales_price)"],keys:d_year, i_brand, i_brand_id
Select Operator [SEL_53] (rows=306061/324651 width=112)
Output:["d_year","i_brand","i_brand_id","ss_ext_sales_price"]
Map Join Operator [MAPJOIN_52] (rows=306061/324651 width=112)
Conds:MAPJOIN_51. ss_sold_date_sk=RS_45. d_date_sk(Inner),HybridGraceHashJoin:true,Output:["d_year","i_brand","i_brand_id","ss_ext_sales_price"]
<-Map 5 [BROADCAST_EDGE] vectorized
BROADCAST [RS_45]
PartitionCols:d_date_sk
Select Operator [SEL_44] (rows=5619/6034 width=12)
Output:["d_date_sk","d_year"]
Filter Operator [FIL_43] (rows=5619/6034 width=12)
predicate:((d_moy = 12) and d_date_sk is not null)
TableScan [TS_6] (rows=73049/73049 width=12)
tpcds_bin_partitioned_orc_1000@date_dim,dt,Tbl:COMPLETE,Col:COMPLETE,Output:["d_date_sk","d_year","d_moy"]
Dynamic Partitioning Event Operator [EVENT_48] (rows=2809 width=12)
Group By Operator [GBY_47] (rows=2809/2324 width=12)
Output:["d_date_sk"],keys:d_date_sk
Select Operator [SEL_46] (rows=5619/6034 width=12)
Output:["d_date_sk"]
Please refer to the previous Select Operator [SEL_44]
<-Map Join Operator [MAPJOIN_51] (rows=3978894/4202377 width=112)
Conds:SEL_50.ss_item_sk=RS_42. i_item_sk(Inner),HybridGraceHashJoin:true,Output:["ss_ext_sales_price",” ss_sold_date_sk","i_brand","i_brand_id"]
<-Map 4 [BROADCAST_EDGE] vectorized
BROADCAST [RS_42]
PartitionCols:i_item_sk
Select Operator [SEL_41] (rows=434/453 width=111)
Output:[” i_item_sk","i_brand_id","i_brand"]
Filter Operator [FIL_40] (rows=434/453 width=111)
predicate:((i_manufact_id = 436) and i_item_sk is not null)
TableScan [TS_3] (rows=300000/300000 width=111)
tpcds_bin_partitioned_orc_1000@item,item,Tbl:COMPLETE,Col:COMPLETE,Output:["i_item_sk","i_brand_id","i_brand","i_manufact_id"]
<-Select Operator [SEL_50] (rows=2750387156/2750387156 width=11)
Output:["ss_item_sk","ss_ext_sales_price","ss_sold_date_sk"]
Filter Operator [FIL_49] (rows=2750387156/2750387156 width=11)
predicate:ss_item_sk is not null
TableScan [TS_0] (rows=2750387156/2750387156 width=11)
tpcds_bin_partitioned_orc_1000@store_sales,store_sales,Tbl:COMPLETE,Col:COMPLETE,Output:["ss_item_sk","ss_ext_sales_price"]
Page32 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Acknowledgement
• We thank all the anonymous reviewers’ votes to give us this
opportunity to share our work.
• Part of the slides are borrowed from or modified based on Carter
Shanklin and Rajesh Balamohan’s slides.
• We thank Gunther Hagleitner for all the support and inputs.
• We thank Sapin Amin for setting up the testing cluster.
Page33 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Thank you! Questions?

More Related Content

What's hot (20)

PDF
Top 5 Mistakes When Writing Spark Applications
Spark Summit
 
PDF
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Databricks
 
PPTX
Performance Optimizations in Apache Impala
Cloudera, Inc.
 
PDF
Hive Bucketing in Apache Spark with Tejas Patil
Databricks
 
PPTX
Hive: Loading Data
Benjamin Leonhardi
 
PPTX
Apache Spark Architecture
Alexey Grishchenko
 
PDF
Efficient Data Storage for Analytics with Apache Parquet 2.0
Cloudera, Inc.
 
PPTX
Optimizing Apache Spark SQL Joins
Databricks
 
PDF
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
Spark Summit
 
PDF
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Databricks
 
PDF
Spark shuffle introduction
colorant
 
PDF
Apache Spark Core – Practical Optimization
Databricks
 
PDF
Introduction to elasticsearch
hypto
 
PDF
Introduction to Impala
markgrover
 
PDF
Introduction to Spark Internals
Pietro Michiardi
 
PPTX
Introduction to HiveQL
kristinferrier
 
PDF
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Noritaka Sekiyama
 
PDF
Building Robust ETL Pipelines with Apache Spark
Databricks
 
PDF
Hive Anatomy
nzhang
 
PPTX
LLAP: long-lived execution in Hive
DataWorks Summit
 
Top 5 Mistakes When Writing Spark Applications
Spark Summit
 
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Databricks
 
Performance Optimizations in Apache Impala
Cloudera, Inc.
 
Hive Bucketing in Apache Spark with Tejas Patil
Databricks
 
Hive: Loading Data
Benjamin Leonhardi
 
Apache Spark Architecture
Alexey Grishchenko
 
Efficient Data Storage for Analytics with Apache Parquet 2.0
Cloudera, Inc.
 
Optimizing Apache Spark SQL Joins
Databricks
 
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
Spark Summit
 
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Databricks
 
Spark shuffle introduction
colorant
 
Apache Spark Core – Practical Optimization
Databricks
 
Introduction to elasticsearch
hypto
 
Introduction to Impala
markgrover
 
Introduction to Spark Internals
Pietro Michiardi
 
Introduction to HiveQL
kristinferrier
 
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Noritaka Sekiyama
 
Building Robust ETL Pipelines with Apache Spark
Databricks
 
Hive Anatomy
nzhang
 
LLAP: long-lived execution in Hive
DataWorks Summit
 

Viewers also liked (9)

PDF
Optimizing Hive Queries
Owen O'Malley
 
PDF
Analytical Queries with Hive: SQL Windowing and Table Functions
DataWorks Summit
 
PPTX
Benchmarking Top NoSQL Databases: Apache Cassandra, Apache HBase and MongoDB
Athiq Ahamed
 
PDF
Deep Dive: Memory Management in Apache Spark
Databricks
 
PDF
Data Storage Tips for Optimal Spark Performance-(Vida Ha, Databricks)
Spark Summit
 
PPTX
Introduction to Apache Spark
Rahul Jain
 
PDF
SQL to Hive Cheat Sheet
Hortonworks
 
PDF
Apache Spark 2.0: Faster, Easier, and Smarter
Databricks
 
PPTX
HBase Vs Cassandra Vs MongoDB - Choosing the right NoSQL database
Edureka!
 
Optimizing Hive Queries
Owen O'Malley
 
Analytical Queries with Hive: SQL Windowing and Table Functions
DataWorks Summit
 
Benchmarking Top NoSQL Databases: Apache Cassandra, Apache HBase and MongoDB
Athiq Ahamed
 
Deep Dive: Memory Management in Apache Spark
Databricks
 
Data Storage Tips for Optimal Spark Performance-(Vida Ha, Databricks)
Spark Summit
 
Introduction to Apache Spark
Rahul Jain
 
SQL to Hive Cheat Sheet
Hortonworks
 
Apache Spark 2.0: Faster, Easier, and Smarter
Databricks
 
HBase Vs Cassandra Vs MongoDB - Choosing the right NoSQL database
Edureka!
 
Ad

Similar to How to understand and analyze Apache Hive query execution plan for performance debugging (20)

PPTX
PPT on Data Science Using Python
NishantKumar1179
 
PPTX
Query Compilation in Impala
Cloudera, Inc.
 
PPTX
Lecture 9.pptx
MathewJohnSinoCruz
 
PDF
Enhancing Spark SQL Optimizer with Reliable Statistics
Jen Aman
 
PPTX
interenship.pptx
Naveen316549
 
PPT
Myth busters - performance tuning 101 2007
paulguerin
 
PDF
Cost-Based Optimizer in Apache Spark 2.2 Ron Hu, Sameer Agarwal, Wenchen Fan ...
Databricks
 
PPTX
Basic Analysis using Python
Sankhya_Analytics
 
PDF
R programming & Machine Learning
AmanBhalla14
 
PDF
Cost-Based Optimizer in Apache Spark 2.2
Databricks
 
PPTX
Flink Forward San Francisco 2019: Build a Table-centric Apache Flink Ecosyste...
Flink Forward
 
PPTX
Flink Forward San Francisco 2019: Build a Table-centric Apache Flink Ecosyste...
Flink Forward
 
PPTX
Presentation_BigData_NenaMarin
n5712036
 
PPT
MYSQL database presentation slides with examples
dhanishev1
 
PDF
Chapter 5
EasyStudy3
 
PPTX
Python-for-Data-Analysis.pptx
ParveenShaik21
 
PPTX
Data Visualization_pandas in hadoop.pptx
Rahul Borate
 
PPT
DATA REPRESENTATIONS and Data codes and formats.ppt
gnvivekananda4u
 
PDF
Tactical data engineering
Julian Hyde
 
PPTX
Meetup Junio Data Analysis with python 2018
DataLab Community
 
PPT on Data Science Using Python
NishantKumar1179
 
Query Compilation in Impala
Cloudera, Inc.
 
Lecture 9.pptx
MathewJohnSinoCruz
 
Enhancing Spark SQL Optimizer with Reliable Statistics
Jen Aman
 
interenship.pptx
Naveen316549
 
Myth busters - performance tuning 101 2007
paulguerin
 
Cost-Based Optimizer in Apache Spark 2.2 Ron Hu, Sameer Agarwal, Wenchen Fan ...
Databricks
 
Basic Analysis using Python
Sankhya_Analytics
 
R programming & Machine Learning
AmanBhalla14
 
Cost-Based Optimizer in Apache Spark 2.2
Databricks
 
Flink Forward San Francisco 2019: Build a Table-centric Apache Flink Ecosyste...
Flink Forward
 
Flink Forward San Francisco 2019: Build a Table-centric Apache Flink Ecosyste...
Flink Forward
 
Presentation_BigData_NenaMarin
n5712036
 
MYSQL database presentation slides with examples
dhanishev1
 
Chapter 5
EasyStudy3
 
Python-for-Data-Analysis.pptx
ParveenShaik21
 
Data Visualization_pandas in hadoop.pptx
Rahul Borate
 
DATA REPRESENTATIONS and Data codes and formats.ppt
gnvivekananda4u
 
Tactical data engineering
Julian Hyde
 
Meetup Junio Data Analysis with python 2018
DataLab Community
 
Ad

More from DataWorks Summit/Hadoop Summit (20)

PPT
Running Apache Spark & Apache Zeppelin in Production
DataWorks Summit/Hadoop Summit
 
PPT
State of Security: Apache Spark & Apache Zeppelin
DataWorks Summit/Hadoop Summit
 
PDF
Unleashing the Power of Apache Atlas with Apache Ranger
DataWorks Summit/Hadoop Summit
 
PDF
Enabling Digital Diagnostics with a Data Science Platform
DataWorks Summit/Hadoop Summit
 
PDF
Revolutionize Text Mining with Spark and Zeppelin
DataWorks Summit/Hadoop Summit
 
PDF
Double Your Hadoop Performance with Hortonworks SmartSense
DataWorks Summit/Hadoop Summit
 
PDF
Hadoop Crash Course
DataWorks Summit/Hadoop Summit
 
PDF
Data Science Crash Course
DataWorks Summit/Hadoop Summit
 
PDF
Apache Spark Crash Course
DataWorks Summit/Hadoop Summit
 
PDF
Dataflow with Apache NiFi
DataWorks Summit/Hadoop Summit
 
PPTX
Schema Registry - Set you Data Free
DataWorks Summit/Hadoop Summit
 
PPTX
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...
DataWorks Summit/Hadoop Summit
 
PDF
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...
DataWorks Summit/Hadoop Summit
 
PPTX
Mool - Automated Log Analysis using Data Science and ML
DataWorks Summit/Hadoop Summit
 
PPTX
How Hadoop Makes the Natixis Pack More Efficient
DataWorks Summit/Hadoop Summit
 
PPTX
HBase in Practice
DataWorks Summit/Hadoop Summit
 
PPTX
The Challenge of Driving Business Value from the Analytics of Things (AOT)
DataWorks Summit/Hadoop Summit
 
PDF
Breaking the 1 Million OPS/SEC Barrier in HOPS Hadoop
DataWorks Summit/Hadoop Summit
 
PPTX
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...
DataWorks Summit/Hadoop Summit
 
PPTX
Backup and Disaster Recovery in Hadoop
DataWorks Summit/Hadoop Summit
 
Running Apache Spark & Apache Zeppelin in Production
DataWorks Summit/Hadoop Summit
 
State of Security: Apache Spark & Apache Zeppelin
DataWorks Summit/Hadoop Summit
 
Unleashing the Power of Apache Atlas with Apache Ranger
DataWorks Summit/Hadoop Summit
 
Enabling Digital Diagnostics with a Data Science Platform
DataWorks Summit/Hadoop Summit
 
Revolutionize Text Mining with Spark and Zeppelin
DataWorks Summit/Hadoop Summit
 
Double Your Hadoop Performance with Hortonworks SmartSense
DataWorks Summit/Hadoop Summit
 
Hadoop Crash Course
DataWorks Summit/Hadoop Summit
 
Data Science Crash Course
DataWorks Summit/Hadoop Summit
 
Apache Spark Crash Course
DataWorks Summit/Hadoop Summit
 
Dataflow with Apache NiFi
DataWorks Summit/Hadoop Summit
 
Schema Registry - Set you Data Free
DataWorks Summit/Hadoop Summit
 
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...
DataWorks Summit/Hadoop Summit
 
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...
DataWorks Summit/Hadoop Summit
 
Mool - Automated Log Analysis using Data Science and ML
DataWorks Summit/Hadoop Summit
 
How Hadoop Makes the Natixis Pack More Efficient
DataWorks Summit/Hadoop Summit
 
HBase in Practice
DataWorks Summit/Hadoop Summit
 
The Challenge of Driving Business Value from the Analytics of Things (AOT)
DataWorks Summit/Hadoop Summit
 
Breaking the 1 Million OPS/SEC Barrier in HOPS Hadoop
DataWorks Summit/Hadoop Summit
 
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...
DataWorks Summit/Hadoop Summit
 
Backup and Disaster Recovery in Hadoop
DataWorks Summit/Hadoop Summit
 

Recently uploaded (20)

PDF
Book industry state of the nation 2025 - Tech Forum 2025
BookNet Canada
 
DOCX
Cryptography Quiz: test your knowledge of this important security concept.
Rajni Bhardwaj Grover
 
DOCX
Python coding for beginners !! Start now!#
Rajni Bhardwaj Grover
 
PPTX
Agentforce World Tour Toronto '25 - Supercharge MuleSoft Development with Mod...
Alexandra N. Martinez
 
PPTX
Digital Circuits, important subject in CS
contactparinay1
 
PDF
Automating Feature Enrichment and Station Creation in Natural Gas Utility Net...
Safe Software
 
PDF
[Newgen] NewgenONE Marvin Brochure 1.pdf
darshakparmar
 
PPT
Ericsson LTE presentation SEMINAR 2010.ppt
npat3
 
PPTX
Seamless Tech Experiences Showcasing Cross-Platform App Design.pptx
presentifyai
 
PPTX
The Project Compass - GDG on Campus MSIT
dscmsitkol
 
PDF
The 2025 InfraRed Report - Redpoint Ventures
Razin Mustafiz
 
PDF
NASA A Researcher’s Guide to International Space Station : Physical Sciences ...
Dr. PANKAJ DHUSSA
 
PPTX
AI Penetration Testing Essentials: A Cybersecurity Guide for 2025
defencerabbit Team
 
PDF
ICONIQ State of AI Report 2025 - The Builder's Playbook
Razin Mustafiz
 
PDF
Mastering Financial Management in Direct Selling
Epixel MLM Software
 
PDF
UiPath DevConnect 2025: Agentic Automation Community User Group Meeting
DianaGray10
 
PDF
NLJUG Speaker academy 2025 - first session
Bert Jan Schrijver
 
PDF
Newgen 2022-Forrester Newgen TEI_13 05 2022-The-Total-Economic-Impact-Newgen-...
darshakparmar
 
PDF
Transcript: Book industry state of the nation 2025 - Tech Forum 2025
BookNet Canada
 
PDF
🚀 Let’s Build Our First Slack Workflow! 🔧.pdf
SanjeetMishra29
 
Book industry state of the nation 2025 - Tech Forum 2025
BookNet Canada
 
Cryptography Quiz: test your knowledge of this important security concept.
Rajni Bhardwaj Grover
 
Python coding for beginners !! Start now!#
Rajni Bhardwaj Grover
 
Agentforce World Tour Toronto '25 - Supercharge MuleSoft Development with Mod...
Alexandra N. Martinez
 
Digital Circuits, important subject in CS
contactparinay1
 
Automating Feature Enrichment and Station Creation in Natural Gas Utility Net...
Safe Software
 
[Newgen] NewgenONE Marvin Brochure 1.pdf
darshakparmar
 
Ericsson LTE presentation SEMINAR 2010.ppt
npat3
 
Seamless Tech Experiences Showcasing Cross-Platform App Design.pptx
presentifyai
 
The Project Compass - GDG on Campus MSIT
dscmsitkol
 
The 2025 InfraRed Report - Redpoint Ventures
Razin Mustafiz
 
NASA A Researcher’s Guide to International Space Station : Physical Sciences ...
Dr. PANKAJ DHUSSA
 
AI Penetration Testing Essentials: A Cybersecurity Guide for 2025
defencerabbit Team
 
ICONIQ State of AI Report 2025 - The Builder's Playbook
Razin Mustafiz
 
Mastering Financial Management in Direct Selling
Epixel MLM Software
 
UiPath DevConnect 2025: Agentic Automation Community User Group Meeting
DianaGray10
 
NLJUG Speaker academy 2025 - first session
Bert Jan Schrijver
 
Newgen 2022-Forrester Newgen TEI_13 05 2022-The-Total-Economic-Impact-Newgen-...
darshakparmar
 
Transcript: Book industry state of the nation 2025 - Tech Forum 2025
BookNet Canada
 
🚀 Let’s Build Our First Slack Workflow! 🔧.pdf
SanjeetMishra29
 

How to understand and analyze Apache Hive query execution plan for performance debugging

  • 1. How to understand and analyze Apache Hive query execution plan for performance debugging © Hortonworks Inc. 2011 – 2015. All Rights Reserved Pengcheng Xiong and Ashutosh Chauhan Hortonworks Inc., Apache Hive Community {pxiong,ashutosh}@hortonworks.com
  • 2. Page2 © Hortonworks Inc. 2011 – 2014. All Rights Reserved Goals: • WHY old style Hive explain plan is hard to read • Compare the old style explain with Postgres over a body of 500+ realistic SQL queries. • WHAT is new style Hive explain plan • Show orchestration of the tasks and operator trees, join sequences and algorithms, operator execution costs • HOW to performance debug a real query by analyzing the new Hive explain plan • Identify the potential improvement by changing join sequence, join algorithm and etc • Show the real improvement by running the query in real cluster • Integration/interaction with other system/tools • Future work
  • 3. Page3 © Hortonworks Inc. 2011 – 2014. All Rights Reserved WHY old style Hive explain plan is hard to read • M**** company’s schema and queries. • Comparison of explain plans between Hive and Postgres for the 528 queries they can both execute. • Hive: “explain”, Postgres: “explain verbose” Hive “old style”, 233.5 postgres, 53.8 Hive “old style”, 1289 postgres, 328.6 Average Lines Per Explain Plan Average Words Per Explain Plan N = 528 queries
  • 4. Page4 © Hortonworks Inc. 2011 – 2014. All Rights Reserved We can see that Hive old style explain is quite verbose, is it necessary? select a11.PBTNAME PBTNAME from PMT_INVENTORY a11 join LU_MONTH a12 on (a11.QUARTER_ID = a12.QUARTER_ID) where a12.MONTH_ID in (200607, 200606) group by a11.PBTNAME;
  • 5. Page5 © Hortonworks Inc. 2011 – 2014. All Rights Reserved High level plan comparison: Postgres QUERY PLAN --------------------------------------------------------------------------------- Group (cost=3.83..3.84 rows=2 width=18) Output: a11.pbtname Group Key: a11.pbtname -> Sort (cost=3.83..3.84 rows=2 width=18) Output: a11.pbtname Sort Key: a11.pbtname -> Hash Join (cost=2.62..3.83 rows=2 width=18) Output: a11.pbtname Hash Cond: (a11.quarter_id = a12.quarter_id) -> Seq Scan on public.pmt_inventory a11 (cost=0.00..1.12 rows=12 width=22) Output: a11.quarter_id, a11.pbtname -> Hash (cost=2.60..2.60 rows=2 width=4) Output: a12.quarter_id -> Seq Scan on public.lu_month a12 (cost=0.00..2.60 rows=2 width=4) Output: a12.quarter_id Filter: (a12.month_id = ANY ('{200607,200606}'::integer[])) 16 lines
  • 6. Page6 © Hortonworks Inc. 2011 – 2014. All Rights Reserved High level plan comparison: Hive STAGE DEPENDENCIES: Stage-1 is a root stage Stage-0 depends on stages: Stage-1 STAGE PLANS: Stage: Stage-1 Tez Edges: Map 2 <- Map 1 (BROADCAST_EDGE) Reducer 3 <- Map 2 (SIMPLE_EDGE) DagName: carter_20151114133018_2f2f0101-d14d-4688-bbb1-db67055016c3:946 Vertices: Map 1 Map Operator Tree: TableScan alias: a11 filterExpr: quarter_id is not null (type: boolean) Statistics: Num rows: 12 Data size: 1260 Basic stats: COMPLETE Column stats: NONE Filter Operator predicate: quarter_id is not null (type: boolean) Statistics: Num rows: 6 Data size: 630 Basic stats: COMPLETE Column stats: NONE Reduce Output Operator key expressions: quarter_id (type: int) sort order: + Map-reduce partition columns: quarter_id (type: int) Statistics: Num rows: 6 Data size: 630 Basic stats: COMPLETE Column stats: NONE value expressions: pbtname (type: string) Execution mode: vectorized Map 2 Map Operator Tree: TableScan alias: a12 filterExpr: (quarter_id is not null and (month_id) IN (200607, 200606)) (type: boolean) Statistics: Num rows: 48 Data size: 46752 Basic stats: COMPLETE Column stats: NONE Filter Operator predicate: (quarter_id is not null and (month_id) IN (200607, 200606)) (type: boolean) Statistics: Num rows: 12 Data size: 11688 Basic stats: COMPLETE Column stats: NONE Map Join Operator condition map: Inner Join 0 to 1 keys: 0 quarter_id (type: int) 1 quarter_id (type: int) outputColumnNames: _col1 input vertices: 0 Map 1 Statistics: Num rows: 13 Data size: 12856 Basic stats: COMPLETE Column stats: NONE HybridGraceHashJoin: true Group By Operator keys: _col1 (type: string) mode: hash outputColumnNames: _col0 Statistics: Num rows: 13 Data size: 12856 Basic stats: COMPLETE Column stats: NONE Reduce Output Operator key expressions: _col0 (type: string) sort order: + Map-reduce partition columns: _col0 (type: string) Statistics: Num rows: 13 Data size: 12856 Basic stats: COMPLETE Column stats: NONE Execution mode: vectorized Reducer 3 Reduce Operator Tree: Group By Operator keys: KEY._col0 (type: string) mode: mergepartial outputColumnNames: _col0 Statistics: Num rows: 6 Data size: 5933 Basic stats: COMPLETE Column stats: NONE File Output Operator compressed: false Statistics: Num rows: 6 Data size: 5933 Basic stats: COMPLETE Column stats: NONE table: input format: org.apache.hadoop.mapred.TextInputFormat output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe Stage: Stage-0 Fetch Operator limit: -1 Processor Tree: ListSink Too verbose, need a magnifier! 80+ lines
  • 7. Page7 © Hortonworks Inc. 2011 – 2014. All Rights Reserved Map 2 Operator Map 2 Map Operator Tree: TableScan alias: a12 filterExpr: (quarter_id is not null and (month_id) IN (200607, 200606)) (type: boolean) Statistics: Num rows: 48 Data size: 46752 Basic stats: COMPLETE Column stats: NONE Filter Operator predicate: (quarter_id is not null and (month_id) IN (200607, 200606)) (type: boolean) Statistics: Num rows: 12 Data size: 11688 Basic stats: COMPLETE Column stats: NONE Map Join Operator condition map: Inner Join 0 to 1 keys: 0 quarter_id (type: int) 1 quarter_id (type: int) outputColumnNames: _col1 input vertices: 0 Map 1 Statistics: Num rows: 13 Data size: 12856 Basic stats: COMPLETE Column stats: NONE HybridGraceHashJoin: true Group By Operator keys: _col1 (type: string) mode: hash outputColumnNames: _col0 Statistics: Num rows: 13 Data size: 12856 Basic stats: COMPLETE Column stats: NONE Reduce Output Operator key expressions: _col0 (type: string) sort order: + Map-reduce partition columns: _col0 (type: string) Statistics: Num rows: 13 Data size: 12856 Basic stats: COMPLETE Column stats: NONE Execution mode: vectorized Data flows from top to bottom Each operator has 0 or 1 child
  • 8. Page8 © Hortonworks Inc. 2011 – 2014. All Rights Reserved Map 2 Operator Map 2 Map Operator Tree: TableScan alias: a12 filterExpr: (quarter_id is not null and (month_id) IN (200607, 200606)) (type: boolean) Statistics: Num rows: 48 Data size: 46752 Basic stats: COMPLETE Column stats: NONE Filter Operator predicate: (quarter_id is not null and (month_id) IN (200607, 200606)) (type: boolean) Statistics: Num rows: 12 Data size: 11688 Basic stats: COMPLETE Column stats: NONE Map Join Operator condition map: Inner Join 0 to 1 keys: 0 quarter_id (type: int) 1 quarter_id (type: int) outputColumnNames: _col1 input vertices: 0 Map 1 Statistics: Num rows: 13 Data size: 12856 Basic stats: COMPLETE Column stats: NONE HybridGraceHashJoin: true Group By Operator keys: _col1 (type: string) mode: hash outputColumnNames: _col0 Statistics: Num rows: 13 Data size: 12856 Basic stats: COMPLETE Column stats: NONE Reduce Output Operator key expressions: _col0 (type: string) sort order: + Map-reduce partition columns: _col0 (type: string) Statistics: Num rows: 13 Data size: 12856 Basic stats: COMPLETE Column stats: NONE Execution mode: vectorized Must scroll to another part of the plan to see what this is
  • 9. Page9 © Hortonworks Inc. 2011 – 2014. All Rights Reserved Map 1 Operator Map 1 Map Operator Tree: TableScan alias: a11 filterExpr: quarter_id is not null (type: boolean) Statistics: Num rows: 12 Data size: 1260 Basic stats: COMPLETE Column stats: NONE Filter Operator predicate: quarter_id is not null (type: boolean) Statistics: Num rows: 6 Data size: 630 Basic stats: COMPLETE Column stats: NONE Reduce Output Operator key expressions: quarter_id (type: int) sort order: + Map-reduce partition columns: quarter_id (type: int) Statistics: Num rows: 6 Data size: 630 Basic stats: COMPLETE Column stats: NONE value expressions: pbtname (type: string) Execution mode: vectorized Actual table name (PMT_INVENTORY) not mentioned anywhere, only the alias
  • 10. Page10 © Hortonworks Inc. 2011 – 2014. All Rights Reserved Map 1 Operator Map 1 Map Operator Tree: TableScan alias: a11 filterExpr: quarter_id is not null (type: boolean) Statistics: Num rows: 12 Data size: 1260 Basic stats: COMPLETE Column stats: NONE Filter Operator predicate: quarter_id is not null (type: boolean) Statistics: Num rows: 6 Data size: 630 Basic stats: COMPLETE Column stats: NONE Reduce Output Operator key expressions: quarter_id (type: int) sort order: + Map-reduce partition columns: quarter_id (type: int) Statistics: Num rows: 6 Data size: 630 Basic stats: COMPLETE Column stats: NONE value expressions: pbtname (type: string) Execution mode: vectorized How much of this information is really necessary to SQL users?
  • 11. Page11 © Hortonworks Inc. 2011 – 2014. All Rights Reserved Back to Postgres QUERY PLAN --------------------------------------------------------------------------------- Group (cost=3.83..3.84 rows=2 width=18) Output: a11.pbtname Group Key: a11.pbtname -> Sort (cost=3.83..3.84 rows=2 width=18) Output: a11.pbtname Sort Key: a11.pbtname -> Hash Join (cost=2.62..3.83 rows=2 width=18) Output: a11.pbtname Hash Cond: (a11.quarter_id = a12.quarter_id) -> Seq Scan on public.pmt_inventory a11 (cost=0.00..1.12 rows=12 width=22) Output: a11.quarter_id, a11.pbtname -> Hash (cost=2.60..2.60 rows=2 width=4) Output: a12.quarter_id -> Seq Scan on public.lu_month a12 (cost=0.00..2.60 rows=2 width=4) Output: a12.quarter_id Filter: (a12.month_id = ANY ('{200607,200606}'::integer[])) Data flows bottom to top
  • 12. Page12 © Hortonworks Inc. 2011 – 2014. All Rights Reserved Back to Postgres QUERY PLAN --------------------------------------------------------------------------------- Group (cost=3.83..3.84 rows=2 width=18) Output: a11.pbtname Group Key: a11.pbtname -> Sort (cost=3.83..3.84 rows=2 width=18) Output: a11.pbtname Sort Key: a11.pbtname -> Hash Join (cost=2.62..3.83 rows=2 width=18) Output: a11.pbtname Hash Cond: (a11.quarter_id = a12.quarter_id) -> Seq Scan on public.pmt_inventory a11 (cost=0.00..1.12 rows=12 width=22) Output: a11.quarter_id, a11.pbtname -> Hash (cost=2.60..2.60 rows=2 width=4) Output: a12.quarter_id -> Seq Scan on public.lu_month a12 (cost=0.00..2.60 rows=2 width=4) Output: a12.quarter_id Filter: (a12.month_id = ANY ('{200607,200606}'::integer[])) Operators have multiple children when it makes sense
  • 13. Page13 © Hortonworks Inc. 2011 – 2014. All Rights Reserved Back to Postgres QUERY PLAN --------------------------------------------------------------------------------- Group (cost=3.83..3.84 rows=2 width=18) Output: a11.pbtname Group Key: a11.pbtname -> Sort (cost=3.83..3.84 rows=2 width=18) Output: a11.pbtname Sort Key: a11.pbtname -> Hash Join (cost=2.62..3.83 rows=2 width=18) Output: a11.pbtname Hash Cond: (a11.quarter_id = a12.quarter_id) -> Seq Scan on public.pmt_inventory a11 (cost=0.00..1.12 rows=12 width=22) Output: a11.quarter_id, a11.pbtname -> Hash (cost=2.60..2.60 rows=2 width=4) Output: a12.quarter_id -> Seq Scan on public.lu_month a12 (cost=0.00..2.60 rows=2 width=4) Output: a12.quarter_id Filter: (a12.month_id = ANY ('{200607,200606}'::integer[])) Join is done using a scan of pmt_inventory and a hash following a scan of lu_month. All this info is available without referring to a stage plan. IOW you don’t have to jump around in the plan.
  • 14. Page14 © Hortonworks Inc. 2011 – 2014. All Rights Reserved Back to Postgres QUERY PLAN --------------------------------------------------------------------------------- Group (cost=3.83..3.84 rows=2 width=18) Output: a11.pbtname Group Key: a11.pbtname -> Sort (cost=3.83..3.84 rows=2 width=18) Output: a11.pbtname Sort Key: a11.pbtname -> Hash Join (cost=2.62..3.83 rows=2 width=18) Output: a11.pbtname Hash Cond: (a11.quarter_id = a12.quarter_id) -> Seq Scan on public.pmt_inventory a11 (cost=0.00..1.12 rows=12 width=22) Output: a11.quarter_id, a11.pbtname -> Hash (cost=2.60..2.60 rows=2 width=4) Output: a12.quarter_id -> Seq Scan on public.lu_month a12 (cost=0.00..2.60 rows=2 width=4) Output: a12.quarter_id Filter: (a12.month_id = ANY ('{200607,200606}'::integer[])) Actual schema / table names visible
  • 15. Page15 © Hortonworks Inc. 2011 – 2014. All Rights Reserved Back to Postgres QUERY PLAN --------------------------------------------------------------------------------- Group (cost=3.83..3.84 rows=2 width=18) Output: a11.pbtname Group Key: a11.pbtname -> Sort (cost=3.83..3.84 rows=2 width=18) Output: a11.pbtname Sort Key: a11.pbtname -> Hash Join (cost=2.62..3.83 rows=2 width=18) Output: a11.pbtname Hash Cond: (a11.quarter_id = a12.quarter_id) -> Seq Scan on public.pmt_inventory a11 (cost=0.00..1.12 rows=12 width=22) Output: a11.quarter_id, a11.pbtname -> Hash (cost=2.60..2.60 rows=2 width=4) Output: a12.quarter_id -> Seq Scan on public.lu_month a12 (cost=0.00..2.60 rows=2 width=4) Output: a12.quarter_id Filter: (a12.month_id = ANY ('{200607,200606}'::integer[])) Cost mentioned once per operator Cost monotonically increases as you go up
  • 16. Page16 © Hortonworks Inc. 2011 – 2014. All Rights Reserved WHAT is new style Hive explain plan (HIVE-9780) • Set hive.explain.user=true; (by default). Use Tez, LLAP, etc Stage-1 Reducer 3 File Output Operator [FS_14] Group By Operator [GBY_12] (rows=8 width=101) Output:["_col0"],keys:KEY._col0 <-Map 2 [SIMPLE_EDGE] SHUFFLE [RS_11] PartitionCols:_col0 Group By Operator [GBY_10] (rows=8 width=101) Output:["_col0"],keys:_col1 Map Join Operator [MAPJOIN_19] (rows=33 width=101) Conds:RS_6._col0=SEL_5._col1(Inner),HybridGraceHashJoin:true,Output:["_col1"] <-Map 1 [BROADCAST_EDGE] BROADCAST [RS_6] PartitionCols:_col0 Select Operator [SEL_2] (rows=12 width=105) Output:["_col0","_col1"] Filter Operator [FIL_17] (rows=12 width=105) predicate:quarter_id is not null TableScan [TS_0] (rows=12 width=105) m****@pmt_inventory,a11,Tbl:COMPLETE,Col:COMPLETE,Output:["quarter_id","pbtname"] <-Select Operator [SEL_5] (rows=48 width=8) Output:["_col1"] Filter Operator [FIL_18] (rows=48 width=8) predicate:((month_id) IN (200607, 200606) and quarter_id is not null) TableScan [TS_3] (rows=48 width=8) m****@lu_month,a12,Tbl:COMPLETE,Col:COMPLETE,Output:["month_id","quarter_id"] Immediate Notes: 1. Much smaller 2. Can be read in order
  • 17. Page17 © Hortonworks Inc. 2011 – 2014. All Rights Reserved WHAT is new style Hive explain plan (HIVE-9780) Stage-1 Reducer 3 File Output Operator [FS_14] Group By Operator [GBY_12] (rows=8 width=101) Output:["_col0"],keys:KEY._col0 <-Map 2 [SIMPLE_EDGE] SHUFFLE [RS_11] PartitionCols:_col0 Group By Operator [GBY_10] (rows=8 width=101) Output:["_col0"],keys:_col1 Map Join Operator [MAPJOIN_19] (rows=33 width=101) Conds:RS_6._col0=SEL_5._col1(Inner),HybridGraceHashJoin:true,Output:["_col1"] <-Map 1 [BROADCAST_EDGE] BROADCAST [RS_6] PartitionCols:_col0 Select Operator [SEL_2] (rows=12 width=105) Output:["_col0","_col1"] Filter Operator [FIL_17] (rows=12 width=105) predicate:quarter_id is not null TableScan [TS_0] (rows=12 width=105) m****@pmt_inventory,a11,Tbl:COMPLETE,Col:COMPLETE,Output:["quarter_id","pbtname"] <-Select Operator [SEL_5] (rows=48 width=8) Output:["_col1"] Filter Operator [FIL_18] (rows=48 width=8) predicate:((month_id) IN (200607, 200606) and quarter_id is not null) TableScan [TS_3] (rows=48 width=8) m****@lu_month,a12,Tbl:COMPLETE,Col:COMPLETE,Output:["month_id","quarter_id"] Data flows bottom to top
  • 18. Page18 © Hortonworks Inc. 2011 – 2014. All Rights Reserved WHAT is new style Hive explain plan (HIVE-9780) Stage-1 Reducer 3 File Output Operator [FS_14] Group By Operator [GBY_12] (rows=8 width=101) Output:["_col0"],keys:KEY._col0 <-Map 2 [SIMPLE_EDGE] SHUFFLE [RS_11] PartitionCols:_col0 Group By Operator [GBY_10] (rows=8 width=101) Output:["_col0"],keys:_col1 Map Join Operator [MAPJOIN_19] (rows=33 width=101) Conds:RS_6._col0=SEL_5._col1(Inner),HybridGraceHashJoin:true,Output:["_col1"] <-Map 1 [BROADCAST_EDGE] BROADCAST [RS_6] PartitionCols:_col0 Select Operator [SEL_2] (rows=12 width=105) Output:["_col0","_col1"] Filter Operator [FIL_17] (rows=12 width=105) predicate:quarter_id is not null TableScan [TS_0] (rows=12 width=105) m****@pmt_inventory,a11,Tbl:COMPLETE,Col:COMPLETE,Output:["quarter_id","pbtname"] <-Select Operator [SEL_5] (rows=48 width=8) Output:["_col1"] Filter Operator [FIL_18] (rows=48 width=8) predicate:((month_id) IN (200607, 200606) and quarter_id is not null) TableScan [TS_3] (rows=48 width=8) m****@lu_month,a12,Tbl:COMPLETE,Col:COMPLETE,Output:["month_id","quarter_id"] Operators have multiple children when it makes sense
  • 19. Page19 © Hortonworks Inc. 2011 – 2014. All Rights Reserved WHAT is new style Hive explain plan (HIVE-9780) Stage-1 Reducer 3 File Output Operator [FS_14] Group By Operator [GBY_12] (rows=8 width=101) Output:["_col0"],keys:KEY._col0 <-Map 2 [SIMPLE_EDGE] SHUFFLE [RS_11] PartitionCols:_col0 Group By Operator [GBY_10] (rows=8 width=101) Output:["_col0"],keys:_col1 Map Join Operator [MAPJOIN_19] (rows=33 width=101) Conds:RS_6._col0=SEL_5._col1(Inner),HybridGraceHashJoin:true,Output:["_col1"] <-Map 1 [BROADCAST_EDGE] BROADCAST [RS_6] PartitionCols:_col0 Select Operator [SEL_2] (rows=12 width=105) Output:["_col0","_col1"] Filter Operator [FIL_17] (rows=12 width=105) predicate:quarter_id is not null TableScan [TS_0] (rows=12 width=105) m****@pmt_inventory,a11,Tbl:COMPLETE,Col:COMPLETE,Output:["quarter_id","pbtname"] <-Select Operator [SEL_5] (rows=48 width=8) Output:["_col1"] Filter Operator [FIL_18] (rows=48 width=8) predicate:((month_id) IN (200607, 200606) and quarter_id is not null) TableScan [TS_3] (rows=48 width=8) m****@lu_month,a12,Tbl:COMPLETE,Col:COMPLETE,Output:["month_id","quarter_id"] Join’s information is clear pmt_inventory is broadcasted to lu_month and a MapJoin is done
  • 20. Page20 © Hortonworks Inc. 2011 – 2014. All Rights Reserved WHAT is new style Hive explain plan (HIVE-9780) Stage-1 Reducer 3 File Output Operator [FS_14] Group By Operator [GBY_12] (rows=8 width=101) Output:["_col0"],keys:KEY._col0 <-Map 2 [SIMPLE_EDGE] SHUFFLE [RS_11] PartitionCols:_col0 Group By Operator [GBY_10] (rows=8 width=101) Output:["_col0"],keys:_col1 Map Join Operator [MAPJOIN_19] (rows=33 width=101) Conds:RS_6._col0=SEL_5._col1(Inner),HybridGraceHashJoin:true,Output:["_col1"] <-Map 1 [BROADCAST_EDGE] BROADCAST [RS_6] PartitionCols:_col0 Select Operator [SEL_2] (rows=12 width=105) Output:["_col0","_col1"] Filter Operator [FIL_17] (rows=12 width=105) predicate:quarter_id is not null TableScan [TS_0] (rows=12 width=105) m****@pmt_inventory,a11,Tbl:COMPLETE,Col:COMPLETE,Output:["quarter_id","pbtname"] <-Select Operator [SEL_5] (rows=48 width=8) Output:["_col1"] Filter Operator [FIL_18] (rows=48 width=8) predicate:((month_id) IN (200607, 200606) and quarter_id is not null) TableScan [TS_3] (rows=48 width=8) m****@lu_month,a12,Tbl:COMPLETE,Col:COMPLETE,Output:["month_id","quarter_id"] Cost mentioned once per operator
  • 21. Page21 © Hortonworks Inc. 2011 – 2014. All Rights Reserved HOW to performance debug a real query (TPC-DS Q3, 1TB) select dt.d_year, item.i_brand_id brand_id, item.i_brand brand, sum(ss_ext_sales_price) sum_agg from date_dim dt, store_sales, item where dt.d_date_sk = store_sales.ss_sold_date_sk and store_sales.ss_item_sk = item.i_item_sk and item.i_manufact_id = 436 and dt.d_moy = 12 group by dt.d_year , item.i_brand , item.i_brand_id order by dt.d_year , sum_agg desc , brand_id limit 10 partitioned by ss_sold_date_sk
  • 22. Page22 © Hortonworks Inc. 2011 – 2014. All Rights Reserved <-Reducer 3 [SIMPLE_EDGE] SHUFFLE [RS_15] PartitionCols:_col0, _col1, _col2 Group By Operator [GBY_14] (rows=1 width=116) Output:["_col0","_col1","_col2","_col3"],aggregations:["sum(_col45)"],keys:_col6, _col65, _col64 Select Operator [SEL_13] (rows=76515 width=128) Output:["_col6","_col65","_col64","_col45"] Filter Operator [FIL_23] (rows=76515 width=128) predicate:((_col0 = _col53) and (_col32 = _col57)) Merge Join Operator [MERGEJOIN_28] (rows=306061 width=128) Conds:RS_8._col32=RS_34.i_item_sk(Inner),Output:["_col0","_col6","_col32","_col45","_col53","_col57","_col64","_col65"] <-Map 7 [SIMPLE_EDGE] vectorized SHUFFLE [RS_34] PartitionCols:i_item_sk Filter Operator [FIL_33] (rows=434 width=111) predicate:(i_item_sk is not null and (i_manufact_id = 436)) TableScan [TS_2] (rows=300000 width=111) tpcds_bin_partitioned_orc_1000@item,item,Tbl:COMPLETE,Col:COMPLETE,Output:["i_item_sk","i_brand_id","i_brand","i_manufact_id"] <-Reducer 2 [SIMPLE_EDGE] SHUFFLE [RS_8] PartitionCols:_col32 Merge Join Operator [MERGEJOIN_27] (rows=211562452 width=20) Conds:RS_30.d_date_sk=RS_32.ss_sold_date_sk(Inner),Output:["_col0","_col6","_col32","_col45","_col53"] <-Map 1 [SIMPLE_EDGE] vectorized SHUFFLE [RS_30] PartitionCols:d_date_sk Filter Operator [FIL_29] (rows=5619 width=12) predicate:(d_date_sk is not null and (d_moy = 12)) TableScan [TS_0] (rows=73049 width=12) tpcds_bin_partitioned_orc_1000@date_dim,dt,Tbl:COMPLETE,Col:COMPLETE,Output:["d_date_sk","d_year","d_moy"] <-Map 6 [SIMPLE_EDGE] vectorized SHUFFLE [RS_32] PartitionCols:ss_sold_date_sk Filter Operator [FIL_31] (rows=2750387156 width=11) predicate:ss_item_sk is not null TableScan [TS_1] (rows=2750387156 width=11) tpcds_bin_partitioned_orc_1000@store_sales,store_sales,Tbl:COMPLETE,Col:COMPLETE,Output:["ss_item_sk","ss_ext_sales_price"] Original plan runs 163.33s. Sounds like column pruning and predicate push down are working fine. However, the join sequence store_sales✖date_dim✖item is not good enough. A better one is store_sales✖item✖date_dim Table Cardinality Cardinality after filter Selectivity date_dim 73K 5619 7.6% item 300K 434 0.14%
  • 23. Page23 © Hortonworks Inc. 2011 – 2014. All Rights Reserved <-Reducer 3 [SIMPLE_EDGE] SHUFFLE [RS_17] PartitionCols:_col0, _col1, _col2 Group By Operator [GBY_16] (rows=9 width=116) Output:["_col0","_col1","_col2","_col3"],aggregations:["sum(_col1)"],keys:_col8, _col4, _col5 Select Operator [SEL_15] (rows=306061 width=112) Output:["_col8","_col4","_col5","_col1"] Merge Join Operator [MERGEJOIN_29] (rows=306061 width=112) Conds:RS_12._col2=RS_38._col0(Inner),Output:["_col1","_col4","_col5","_col8"] <-Map 7 [SIMPLE_EDGE] vectorized SHUFFLE [RS_38] PartitionCols:_col0 Select Operator [SEL_37] (rows=5619 width=12) Output:["_col0","_col1"] Filter Operator [FIL_36] (rows=5619 width=12) predicate:((d_moy = 12) and d_date_sk is not null) TableScan [TS_6] (rows=73049 width=12) tpcds_bin_partitioned_orc_1000@date_dim,dt,Tbl:COMPLETE,Col:COMPLETE,Output:["d_date_sk","d_year","d_moy"] <-Reducer 2 [SIMPLE_EDGE] SHUFFLE [RS_12] PartitionCols:_col2 Merge Join Operator [MERGEJOIN_28] (rows=3978894 width=112) Conds:RS_32._col0=RS_35._col0(Inner),Output:["_col1","_col2","_col4","_col5"] <-Map 1 [SIMPLE_EDGE] vectorized SHUFFLE [RS_32] PartitionCols:_col0 Select Operator [SEL_31] (rows=2750387156 width=11) Output:["_col0","_col1","_col2"] Filter Operator [FIL_30] (rows=2750387156 width=11) predicate:ss_item_sk is not null TableScan [TS_0] (rows=2750387156 width=11) tpcds_bin_partitioned_orc_1000@store_sales,store_sales,Tbl:COMPLETE,Col:COMPLETE,Output:["ss_item_sk","ss_ext_sales_price"] <-Map 6 [SIMPLE_EDGE] vectorized SHUFFLE [RS_35] PartitionCols:_col0 Select Operator [SEL_34] (rows=434 width=111) Output:["_col0","_col1","_col2"] Filter Operator [FIL_33] (rows=434 width=111) predicate:((i_manufact_id = 436) and i_item_sk is not null) TableScan [TS_3] (rows=300000 width=111) tpcds_bin_partitioned_orc_1000@item,item,Tbl:COMPLETE,Col:COMPLETE,Output:["i_item_sk","i_brand_id","i_brand","i_manufact_id"] CBO on, new plan runs 143.97s with new join sequence store_sales✖item✖date_dim. The input data size of one branch of join is pretty small, should use map join, rather than merge join.
  • 24. Page24 © Hortonworks Inc. 2011 – 2014. All Rights Reserved SHUFFLE [RS_44] PartitionCols:_col0, _col1, _col2 Group By Operator [GBY_43] (rows=9 width=116) Output:["_col0","_col1","_col2","_col3"],aggregations:["sum(_col1)"],keys:_col8, _col4, _col5 Select Operator [SEL_42] (rows=306061 width=112) Output:["_col8","_col4","_col5","_col1"] Map Join Operator [MAPJOIN_41] (rows=306061 width=112) Conds:MAPJOIN_40._col2=RS_37._col0(Inner),HybridGraceHashJoin:true,Output:["_col1","_col4","_col5","_col8"] <-Map 5 [BROADCAST_EDGE] vectorized BROADCAST [RS_37] PartitionCols:_col0 Select Operator [SEL_36] (rows=5619 width=12) Output:["_col0","_col1"] Filter Operator [FIL_35] (rows=5619 width=12) predicate:((d_moy = 12) and d_date_sk is not null) TableScan [TS_6] (rows=73049 width=12) tpcds_bin_partitioned_orc_1000@date_dim,dt,Tbl:COMPLETE,Col:COMPLETE,Output:["d_date_sk","d_year","d_moy"] <-Map Join Operator [MAPJOIN_40] (rows=3978894 width=112) Conds:SEL_39._col0=RS_34._col0(Inner),HybridGraceHashJoin:true,Output:["_col1","_col2","_col4","_col5"] <-Map 4 [BROADCAST_EDGE] vectorized BROADCAST [RS_34] PartitionCols:_col0 Select Operator [SEL_33] (rows=434 width=111) Output:["_col0","_col1","_col2"] Filter Operator [FIL_32] (rows=434 width=111) predicate:((i_manufact_id = 436) and i_item_sk is not null) TableScan [TS_3] (rows=300000 width=111) tpcds_bin_partitioned_orc_1000@item,item,Tbl:COMPLETE,Col:COMPLETE,Output:["i_item_sk","i_brand_id","i_brand","i_manufact_id"] <-Select Operator [SEL_39] (rows=2750387156 width=11) Output:["_col0","_col1","_col2"] Filter Operator [FIL_38] (rows=2750387156 width=11) predicate:ss_item_sk is not null TableScan [TS_0] (rows=2750387156 width=11) tpcds_bin_partitioned_orc_1000@store_sales,store_sales,Tbl:COMPLETE,Col:COMPLETE,Output:["ss_item_sk","ss_ext_sales_price"] Increase hive.auto.convert.join.noconditionaltask.size= 1,359,688,499, we can see it is now using map join operators. New plan runs 45.84s. store_sales is a partitioned table on the join key ss_sold_date_sk with date_dim table.
  • 25. Page25 © Hortonworks Inc. 2011 – 2014. All Rights Reserved SHUFFLE [RS_55] PartitionCols:_col0, _col1, _col2 Group By Operator [GBY_54] (rows=9 width=116) Output:["_col0","_col1","_col2","_col3"],aggregations:["sum(_col1)"],keys:_col8, _col4, _col5 Select Operator [SEL_53] (rows=306061 width=112) Output:["_col8","_col4","_col5","_col1"] Map Join Operator [MAPJOIN_52] (rows=306061 width=112) Conds:MAPJOIN_51._col2=RS_45._col0(Inner),HybridGraceHashJoin:true,Output:["_col1","_col4","_col5","_col8"] <-Map 5 [BROADCAST_EDGE] vectorized BROADCAST [RS_45] PartitionCols:_col0 Select Operator [SEL_44] (rows=5619 width=12) Output:["_col0","_col1"] Filter Operator [FIL_43] (rows=5619 width=12) predicate:((d_moy = 12) and d_date_sk is not null) TableScan [TS_6] (rows=73049 width=12) tpcds_bin_partitioned_orc_1000@date_dim,dt,Tbl:COMPLETE,Col:COMPLETE,Output:["d_date_sk","d_year","d_moy"] Dynamic Partitioning Event Operator [EVENT_48] (rows=2809 width=12) Group By Operator [GBY_47] (rows=2809 width=12) Output:["_col0"],keys:_col0 Select Operator [SEL_46] (rows=5619 width=12) Output:["_col0"] Please refer to the previous Select Operator [SEL_44] <-Map Join Operator [MAPJOIN_51] (rows=3978894 width=112) Conds:SEL_50._col0=RS_42._col0(Inner),HybridGraceHashJoin:true,Output:["_col1","_col2","_col4","_col5"] <-Map 4 [BROADCAST_EDGE] vectorized BROADCAST [RS_42] PartitionCols:_col0 Select Operator [SEL_41] (rows=434 width=111) Output:["_col0","_col1","_col2"] Filter Operator [FIL_40] (rows=434 width=111) predicate:((i_manufact_id = 436) and i_item_sk is not null) TableScan [TS_3] (rows=300000 width=111) tpcds_bin_partitioned_orc_1000@item,item,Tbl:COMPLETE,Col:COMPLETE,Output:["i_item_sk","i_brand_id","i_brand","i_manufact_id"] <-Select Operator [SEL_50] (rows=2750387156 width=11) Output:["_col0","_col1","_col2"] Filter Operator [FIL_49] (rows=2750387156 width=11) predicate:ss_item_sk is not null TableScan [TS_0] (rows=2750387156 width=11) tpcds_bin_partitioned_orc_1000@store_sales,store_sales,Tbl:COMPLETE,Col:COMPLETE,Output:["ss_item_sk","ss_ext_sales_price"] By setting hive.tez.dynamic.partition.pruning=true, we can see dynamic partitioning event operators. See more about this in HIVE-7826. New plan runs 31.35s. In the run time, dynamic partition event operator will send values needed to prune to the application master - where splits are generated and tasks are submitted. Using these values we can strip out any unneeded partitions dynamically, while the query is running.
  • 26. Page26 © Hortonworks Inc. 2011 – 2014. All Rights Reserved Performance debugging summary (TPC-DS Q3, 1TB) 0 20 40 60 80 100 120 140 160 180 Original Join re-order Join selection Dynamic partition pruning Queryexecutiontime(s) Query execution time Continuous improvement
  • 27. Page27 © Hortonworks Inc. 2011 – 2014. All Rights Reserved Integration with Apache Ambari 1. Type the query here 2. Click “explain” 3. explain plan will be shown
  • 28. Page28 © Hortonworks Inc. 2011 – 2014. All Rights Reserved  “set hive.tez.exec.print.summary=true;”  Get more insights on query performance Can be used along with Tez vertex runtime stats Runtime stats
  • 29. Page29 © Hortonworks Inc. 2011 – 2014. All Rights Reserved Summary • Show old style Hive explain plan is hard to read. • Verbose with too much redundant information, hard to follow how data flows, cost of operator is unclear • Compare with Postgres over a body of 500+ realistic SQL queries and identify the candidate improving points • Introduce new style Hive explain plan • Use a concrete example to help understand the explain: execution cost, join sequence and orchestration of the operator tree • Use the new Hive explain plan to performance debug TPC-DS Q3 • Show the improvement after join re-ordering, join selection, and dynamic partition pruning • Integration/interaction with other system/tools
  • 30. Page30 © Hortonworks Inc. 2011 – 2014. All Rights Reserved Future work -- Some gaps remain after HIVE-9780 • Put the real schema, table and column names in the explain plan, e.g., no more _col0 etc. • This will help users to understand the plan better • HIVE-8681: CBO: Column names are missing from join expression in Map join with CBO enabled • Get an equivalent of “EXPLAIN ANALYZE” – such as operator level runtime stats and warnings. • This will help users to find out the gap between estimated cost and real cost • HIVE-14362: Support explain analyze in Hive
  • 31. Page31 © Hortonworks Inc. 2011 – 2014. All Rights Reserved SHUFFLE [RS_55] PartitionCols:d_year, i_brand, i_brand_id Group By Operator [GBY_54] (rows=9/10 width=116) Output:["d_year","sum_agg","i_brand","i_brand_id"],aggregations:["sum(ss_ext_sales_price)"],keys:d_year, i_brand, i_brand_id Select Operator [SEL_53] (rows=306061/324651 width=112) Output:["d_year","i_brand","i_brand_id","ss_ext_sales_price"] Map Join Operator [MAPJOIN_52] (rows=306061/324651 width=112) Conds:MAPJOIN_51. ss_sold_date_sk=RS_45. d_date_sk(Inner),HybridGraceHashJoin:true,Output:["d_year","i_brand","i_brand_id","ss_ext_sales_price"] <-Map 5 [BROADCAST_EDGE] vectorized BROADCAST [RS_45] PartitionCols:d_date_sk Select Operator [SEL_44] (rows=5619/6034 width=12) Output:["d_date_sk","d_year"] Filter Operator [FIL_43] (rows=5619/6034 width=12) predicate:((d_moy = 12) and d_date_sk is not null) TableScan [TS_6] (rows=73049/73049 width=12) tpcds_bin_partitioned_orc_1000@date_dim,dt,Tbl:COMPLETE,Col:COMPLETE,Output:["d_date_sk","d_year","d_moy"] Dynamic Partitioning Event Operator [EVENT_48] (rows=2809 width=12) Group By Operator [GBY_47] (rows=2809/2324 width=12) Output:["d_date_sk"],keys:d_date_sk Select Operator [SEL_46] (rows=5619/6034 width=12) Output:["d_date_sk"] Please refer to the previous Select Operator [SEL_44] <-Map Join Operator [MAPJOIN_51] (rows=3978894/4202377 width=112) Conds:SEL_50.ss_item_sk=RS_42. i_item_sk(Inner),HybridGraceHashJoin:true,Output:["ss_ext_sales_price",” ss_sold_date_sk","i_brand","i_brand_id"] <-Map 4 [BROADCAST_EDGE] vectorized BROADCAST [RS_42] PartitionCols:i_item_sk Select Operator [SEL_41] (rows=434/453 width=111) Output:[” i_item_sk","i_brand_id","i_brand"] Filter Operator [FIL_40] (rows=434/453 width=111) predicate:((i_manufact_id = 436) and i_item_sk is not null) TableScan [TS_3] (rows=300000/300000 width=111) tpcds_bin_partitioned_orc_1000@item,item,Tbl:COMPLETE,Col:COMPLETE,Output:["i_item_sk","i_brand_id","i_brand","i_manufact_id"] <-Select Operator [SEL_50] (rows=2750387156/2750387156 width=11) Output:["ss_item_sk","ss_ext_sales_price","ss_sold_date_sk"] Filter Operator [FIL_49] (rows=2750387156/2750387156 width=11) predicate:ss_item_sk is not null TableScan [TS_0] (rows=2750387156/2750387156 width=11) tpcds_bin_partitioned_orc_1000@store_sales,store_sales,Tbl:COMPLETE,Col:COMPLETE,Output:["ss_item_sk","ss_ext_sales_price"]
  • 32. Page32 © Hortonworks Inc. 2011 – 2014. All Rights Reserved Acknowledgement • We thank all the anonymous reviewers’ votes to give us this opportunity to share our work. • Part of the slides are borrowed from or modified based on Carter Shanklin and Rajesh Balamohan’s slides. • We thank Gunther Hagleitner for all the support and inputs. • We thank Sapin Amin for setting up the testing cluster.
  • 33. Page33 © Hortonworks Inc. 2011 – 2014. All Rights Reserved Thank you! Questions?

Editor's Notes

  • #2: Hive contributors have striven to improve the capability of Hive in terms of both performance and functionality. We assert that understanding and analyzing Apache Hive query execution plan is crucial for performance debugging. In this talk, we study why Apache Hive’s query execution plan today are so difficult to analyze. We identify a set of pain points from our Apache Hive performance engineers, development engineers as well as real users/customers. We propose and show a new presentation data model that can well address the pain points. The three most critical parts of the presentation are (1) the estimated query execution cost, which is the planner's guess at how long it will take to run the query (measured in #rows); (2) the orchestration of the operator tree across consecutive M/R (Tez) jobs; and (3) integration and extension support with other presentation tools, e.g., Apache Ambari.
  • #4: We can see that Hive old style explain is quite verbose, is it necessary?
  • #5: We pick a real query out of the 500 queries.
  • #6: 16 lines
  • #7: 80 lines
  • #11: Although Reduce Output Operator is crucial as it defines the boundary between a map task and a reduce task, we are thinking how much information of the reduce output operator is.... SQL users care about more about relational operators rather than MR operators.
  • #13: We can also see more details about this join.
  • #16: It seems that Postgres sets up a good example for hive to improve on.
  • #21: So this is the basic introduction of the new style hive explain plan
  • #22: This query involves 3 tables.
  • #25: splits
  • #29: For Query 87
  • #32: The next version of Hive explain plan will look like this. We highlight the changes that we want to make in red color.
  • #33: We thank the audience for your attention.
  • #34: I am ready to take any questions.