Copyright 2013 by Hortonworks and Microsoft
ORC File & Vectorization
Improving Hive Data Storage and Query Performance
June 2013
Page 1
Owen O’Malley
owen@hortonworks.com
@owen_omalley
Jitendra Pandey
jitendra@hortonworks.com
Eric Hanson
ehans@microsoft.com
ORC – Optimized RC File
Page 2
History
Page 3
Remaining Challenges
Page 4
Requirements
Page 5
File Structure
Page 6
Stripe Structure
Page 7
File Layout
Page 8
[Figure: ORC file layout. The file holds a series of 256 MB stripes, each made up of Index Data, Row Data, and a Stripe Footer, plus a File Footer and a Postscript. Inside a stripe, the index data and row data are laid out per column (Column 1 through Column 8), and a column's data is split into streams (Stream 2.1 through Stream 2.4 in the figure).]
Compression
Page 9
Integer Column Serialization
Page 10
String Column Serialization
Page 11
Hive Compound Types
Page 12
[Figure: type tree with column IDs: 0 Struct, 1 Int, 2 Map, 3 String, 4 Struct, 5 String, 6 Double, 7 Time]
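The ID and type labels in the figure are consistent with numbering the columns by a pre-order walk of the type tree. The sketch below illustrates that numbering. The tree shape (a root struct with an int, a map from string to a struct of string and double, and a time field) is an assumption reconstructed from the labels, and TypeNode and ColumnIdSketch are hypothetical names, not ORC classes.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical node of a type tree; not an ORC class. */
class TypeNode {
    final String kind;
    final List<TypeNode> children = new ArrayList<>();
    int id = -1; // filled in by ColumnIdSketch.assignIds

    TypeNode(String kind, TypeNode... kids) {
        this.kind = kind;
        for (TypeNode k : kids) {
            children.add(k);
        }
    }
}

class ColumnIdSketch {
    /** Assigns IDs in pre-order starting at nextId; returns the next unused ID. */
    static int assignIds(TypeNode node, int nextId) {
        node.id = nextId++;
        for (TypeNode child : node.children) {
            nextId = assignIds(child, nextId);
        }
        return nextId;
    }

    /** Collects "id kind" labels in pre-order. */
    static void collect(TypeNode node, List<String> out) {
        out.add(node.id + " " + node.kind);
        for (TypeNode child : node.children) {
            collect(child, out);
        }
    }

    /** Builds the assumed tree, numbers it, and returns the labels. */
    static List<String> demo() {
        TypeNode root = new TypeNode("Struct",
            new TypeNode("Int"),
            new TypeNode("Map",
                new TypeNode("String"),
                new TypeNode("Struct",
                    new TypeNode("String"),
                    new TypeNode("Double"))),
            new TypeNode("Time"));
        assignIds(root, 0);
        List<String> out = new ArrayList<>();
        collect(root, out);
        return out;
    }

    public static void main(String[] args) {
        for (String label : demo()) {
            System.out.println(label);
        }
    }
}
```

Under this assumed shape, every nested type gets its own column ID, so a reader can address the map's value struct or one of its fields separately.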
Compound Type Serialization
Page 13
Generic Compression
Page 14
Column Projection
Page 15
How Do You Use ORC
Page 16
Managing Memory
Page 17
TPC-DS File Sizes
Page 18
ORC Predicate Pushdown
Page 19
Additional Details
Page 20
Current work for Hive 0.12
Page 21
Future Work
Page 22
Comparison
Page 23
                               RC File   Trevni   Parquet     ORC
Hive Integration               Y         N        N           Y
Active Development             N         N        Y           Y
Hive Type Model                N         N        N           Y
Shred complex columns          N         Y        Y           Y
Splits found quickly           N         Y        Y           Y
Files per bucket               1         many     1 or many   1
Versioned metadata             N         Y        Y           Y
Run length data encoding       N         N        Y           Y
Store strings in dictionary    N         N        Y           Y
Store min, max, sum, count     N         N        N           Y
Store internal indexes         N         N        N           Y
No overhead for non-null       N         N        N           Y (≥ 0.12)
Predicate Pushdown             N         N        N           Y (≥ 0.12)
Vectorization
Page 24
Vectorization
Page 25
Why row-at-a-time execution is slow
Page 26
• Hive uses Object Inspectors to work on a row
• Enables a level of abstraction
• Costs significant performance
• Exacerbated by using lazy serdes
• Inner loop has many method calls, new() allocations, and if-then-else branches
• Lots of CPU instructions
• Pipeline stalls and poor instructions per cycle
• Poor cache locality
How the code works (simplified)
Page 27
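import org.apache.hadoop.hive.ql.exec.vector.LongColumnVector;
import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;

// Adds a constant to every value of a long column, one batch at a time.
class LongColumnAddLongScalarExpression {
  int inputColumn;   // index of the column to read
  int outputColumn;  // index of the column to write
  long scalar;       // constant added to each value

  void evaluate(VectorizedRowBatch batch) {
    long[] inVector =
        ((LongColumnVector) batch.columns[inputColumn]).vector;
    long[] outVector =
        ((LongColumnVector) batch.columns[outputColumn]).vector;
    if (batch.selectedInUse) {
      // Only the rows listed in batch.selected are live.
      for (int j = 0; j < batch.size; j++) {
        int i = batch.selected[j];
        outVector[i] = inVector[i] + scalar;
      }
    } else {
      // All of the first batch.size rows are live.
      for (int i = 0; i < batch.size; i++) {
        outVector[i] = inVector[i] + scalar;
      }
    }
  }
}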
• No method calls
• Low instruction count
• Cache locality to 1024 values
• No pipeline stalls
• SIMD in Java 8
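To try the loop without a Hive dependency, here is a minimal self-contained sketch of the same pattern. SimpleLongColumn, SimpleBatch, and VectorAddSketch are hypothetical stand-ins written for this example, not Hive classes. The batch capacity of 1024 is taken from the "cache locality to 1024 values" note above.

```java
import java.util.Arrays;

/** Hypothetical stand-in for a long column vector. */
class SimpleLongColumn {
    final long[] vector;

    SimpleLongColumn(int capacity) {
        vector = new long[capacity];
    }
}

/** Hypothetical stand-in for a row batch with an optional selection vector. */
class SimpleBatch {
    static final int DEFAULT_SIZE = 1024;
    final SimpleLongColumn[] columns;
    final int[] selected = new int[DEFAULT_SIZE];
    boolean selectedInUse = false;
    int size = 0;

    SimpleBatch(int numColumns) {
        columns = new SimpleLongColumn[numColumns];
        for (int c = 0; c < numColumns; c++) {
            columns[c] = new SimpleLongColumn(DEFAULT_SIZE);
        }
    }
}

class VectorAddSketch {
    /** Adds scalar to column `in` and writes the result to column `out`. */
    static void addScalar(SimpleBatch batch, int in, int out, long scalar) {
        long[] inVector = batch.columns[in].vector;
        long[] outVector = batch.columns[out].vector;
        if (batch.selectedInUse) {
            // Touch only the rows that survived an earlier filter.
            for (int j = 0; j < batch.size; j++) {
                int i = batch.selected[j];
                outVector[i] = inVector[i] + scalar;
            }
        } else {
            for (int i = 0; i < batch.size; i++) {
                outVector[i] = inVector[i] + scalar;
            }
        }
    }

    /** Loads 10, 20, 30, 40 into column 0, adds 5 into column 1, returns column 1. */
    static long[] demo(boolean useSelection) {
        SimpleBatch batch = new SimpleBatch(2);
        long[] input = {10, 20, 30, 40};
        System.arraycopy(input, 0, batch.columns[0].vector, 0, input.length);
        batch.size = input.length;
        if (useSelection) {
            // Pretend a filter kept only rows 1 and 3.
            batch.selectedInUse = true;
            batch.selected[0] = 1;
            batch.selected[1] = 3;
            batch.size = 2;
        }
        addScalar(batch, 0, 1, 5L);
        return Arrays.copyOf(batch.columns[1].vector, input.length);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(demo(false))); // [15, 25, 35, 45]
        System.out.println(Arrays.toString(demo(true)));  // [0, 25, 0, 45]
    }
}
```

In the second call the filtered-out rows stay at zero in the output column. The selection vector tells downstream operators which slots are live, so a filter does not need to copy rows.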
Vectorization project
Page 28
Preliminary performance results
• NOT a benchmark
• 218 million row fact table of real data, 25 columns
• 18GB raw data
• 6 core, 12 thread workstation, 1 disk, 16GB RAM
• select a, b, count(*) from t where c >= const group by a, b -- 53 row result
Page 29
Warm start times    RC non-vectorized            ORC non-vectorized       ORC vectorized
                    (default, not compressed)    (default, compressed)    (default, compressed)
Runtime (sec)       261                          58                       43
Total CPU (sec)     381                          159                      42
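The relative gains can be read off the runtime and CPU figures with simple division. The sketch below does only that arithmetic on the numbers reported above. SpeedupSketch and ratio are hypothetical names, and since the slide says these results are not a benchmark, the ratios are rough indications only.

```java
class SpeedupSketch {
    /** Returns how many times smaller `improved` is than `baseline`. */
    static double ratio(double baseline, double improved) {
        return baseline / improved;
    }

    public static void main(String[] args) {
        // Wall-clock runtime in seconds, from the results above.
        double rcRuntime = 261, orcRuntime = 58, orcVecRuntime = 43;
        // Total CPU in seconds, from the results above.
        double rcCpu = 381, orcCpu = 159, orcVecCpu = 42;

        System.out.printf("Runtime, RC vs ORC non-vectorized:  %.2fx%n",
            ratio(rcRuntime, orcRuntime));
        System.out.printf("Runtime, ORC non-vectorized vs vectorized: %.2fx%n",
            ratio(orcRuntime, orcVecRuntime));
        System.out.printf("CPU, RC vs ORC non-vectorized:      %.2fx%n",
            ratio(rcCpu, orcCpu));
        System.out.printf("CPU, ORC non-vectorized vs vectorized:     %.2fx%n",
            ratio(orcCpu, orcVecCpu));
    }
}
```

On these figures, vectorization cuts CPU time by a much larger factor than runtime. With a single disk in the test machine, this may mean the vectorized run is limited by I/O rather than CPU.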
Thanks to contributors!
Page 30
• Microsoft Big Data:
• Eric Hanson, Remus Rusanu, Sarvesh Sakalanaga, Tony Murphy, Ashit Gosalia
• Hortonworks:
• Jitendra Pandey, Owen O’Malley, Gopal V
• Others:
• Teddy Choi, Tim Chen
Jitendra/Eric are joint leads

ORC File and Vectorization - Hadoop Summit 2013