Fast Spark Access To Your Data -
Avro, JSON, ORC, and Parquet
Owen O’Malley
owen@hortonworks.com
@owen_omalley
September 2018
Who Am I?
Worked on Hadoop since Jan 2006
MapReduce, Security, Hive, and ORC
Worked on different file formats
–Sequence File, RCFile, ORC File, T-File, and Avro requirements
Goal
Benchmark for Spark SQL
–Use Spark’s FileFormat API
Seeking to discover unknowns
–How do the different formats perform?
–What could they do better?
Use real & diverse data sets
–Over-reliance on artificial datasets leads to weakness
Open & reviewed benchmarks
Benchmarking is Hard
Is this a good benchmark?
long start = System.nanoTime();
testMethod(new A());
long middle = System.nanoTime();
testMethod(new B());
long end = System.nanoTime();
JMH to the Rescue
Interfaces with the JVM
Launches forks as requested
Runs warmup iterations
Runs multiple iterations
Provides parameter sweeps
Provides blackholes
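For readers who have not used JMH, a minimal harness in the spirit of this benchmark might look like the sketch below (Java; the class name, data path, and parameter values are illustrative, not the actual benchmark code). JMH supplies the warmup, forking, parameter sweep, and blackhole that the naive nanoTime() snippet on the previous slide lacks.

import java.util.concurrent.TimeUnit;
import org.apache.spark.sql.SparkSession;
import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.infra.Blackhole;

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MILLISECONDS)
@Fork(1)                        // separate JVM per trial
@Warmup(iterations = 2)         // let the JIT settle before measuring
@Measurement(iterations = 5)
@State(Scope.Benchmark)
public class FormatBenchmark {

  @Param({"orc", "parquet", "json"})   // parameter sweep over the formats
  String format;

  SparkSession spark;

  @Setup
  public void setup() {
    spark = SparkSession.builder().master("local[1]")
        .appName("format-benchmark").getOrCreate();
  }

  @TearDown
  public void tearDown() {
    spark.stop();
  }

  @Benchmark
  public void countRows(Blackhole hole) {
    // The blackhole keeps the JIT from eliminating the "unused" result.
    hole.consume(spark.read().format(format)
        .load("data/taxi." + format).count());
  }
}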
The File Formats
Avro
Cross-language file format for Hadoop
Schema evolution was primary goal
Schema segregated from data
–Unlike Protobuf and Thrift
Row major format
JSON
Serialization format for HTTP & JavaScript
Text-format with MANY parsers
Schema completely integrated with data
Row major format
Compression applied on top
ORC
Originally part of Hive to replace RCFile
–Now top-level project
Schema segregated into footer
Column major format with stripes
Rich type model, stored top-down
Integrated compression, indexes, & stats
Parquet
Design based on Google’s Dremel paper
Schema segregated into footer
Column major format with stripes
Simpler type-model with logical types
All data pushed to leaves of the tree
Data Sets
NYC Taxi Data
Every taxi cab ride in NYC from 2009
–Publicly available
–http://tinyurl.com/nyc-taxi-analysis
18 columns with no null values
–Doubles, integers, decimals, & strings
2 months of data – 22.7 million rows
Sales
Generated data
–Real schema from a production Hive deployment
–Random data based on the data statistics
55 columns with lots of nulls
–A little structure
–Timestamps, strings, longs, booleans, list, & struct
25 million rows
Github Logs
All actions on Github public repositories
–Publicly available
–https://www.githubarchive.org/
704 columns with a lot of structure & nulls
–Pretty much the kitchen sink
1/2 month of data – 10.5 million rows
Finding the Github Schema
The data is all in JSON.
No schema for the data is published.
We wrote a JSON schema discoverer.
–Scans the document and figures out the types
Available in ORC tool jar.
Schema is huge (12k)
Software
Software Versions
All of these projects are evolving rapidly
–Spark 2.3.1
–Avro 1.8.2
–ORC 1.5.1
–Parquet 1.8.2
–Spark-Avro 4.0.0
Dependency hell 👿
Configuration
Spark Configuration
–spark.sql.orc.filterPushdown = true
–spark.sql.orc.impl = native
Hadoop Configuration
–session.sparkContext().hadoopConfiguration()
–avro.mapred.ignore.inputs.without.extension = false
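A sketch of how those settings might be applied when the session is built (Java; the class name and app name are illustrative):

import org.apache.spark.sql.SparkSession;

public class BenchmarkSession {
  public static SparkSession create() {
    SparkSession spark = SparkSession.builder()
        .appName("format-benchmark")
        .master("local[*]")
        // use the new native ORC reader and enable its filter pushdown
        .config("spark.sql.orc.impl", "native")
        .config("spark.sql.orc.filterPushdown", "true")
        .getOrCreate();
    // Hadoop-level setting so spark-avro reads files without an .avro extension
    spark.sparkContext().hadoopConfiguration()
        .set("avro.mapred.ignore.inputs.without.extension", "false");
    return spark;
  }
}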
Spark-Avro
Benchmark uses Spark SQL’s FileFormat
–JSON, ORC, and Parquet all in Spark
–Avro is provided by Databricks via spark-avro
It maps Spark types to Avro types differently
–Timestamp as long vs int96
–Decimal as string vs bytes
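Because Avro is not built into Spark 2.3, reads go through the external data source name; a sketch (Java; the path is illustrative):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ReadAvro {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("read-avro").master("local[*]").getOrCreate();
    // spark-avro 4.0.0 registers itself under this data source name
    Dataset<Row> taxi = spark.read()
        .format("com.databricks.spark.avro")
        .load("data/taxi.avro");
    // the printed schema reflects spark-avro's mapping
    // (timestamps stored as long, decimals as string)
    taxi.printSchema();
    spark.stop();
  }
}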
Storage costs
Compression
Data size matters!
–Hadoop stores all your data, but requires hardware
–Is one factor in read speed (HDFS ~15 MB/sec)
ORC and Parquet use RLE & Dictionaries
All the formats have general compression
–ZLIB (GZip) – tight compression, slower
–Snappy – some compression, faster
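As an illustration, the benchmark tables could be written with the different codecs through Spark's per-format "compression" write option (Java sketch; the source and output paths are illustrative):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class WriteCompressed {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("write-compressed").master("local[*]").getOrCreate();
    Dataset<Row> taxi = spark.read().parquet("data/taxi-source");

    // ORC and Parquet apply RLE and dictionary encoding first,
    // then the general-purpose codec on top.
    taxi.write().option("compression", "zlib").orc("out/taxi-orc-zlib");
    taxi.write().option("compression", "snappy").parquet("out/taxi-parquet-snappy");

    // JSON has no internal encoding, so the codec is all it gets.
    taxi.write().option("compression", "gzip").json("out/taxi-json-gzip");
    spark.stop();
  }
}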
Taxi Size Analysis
Don’t use JSON
Use either Snappy or Zlib compression
Avro’s small compression window hurts
Parquet Zlib is smaller than ORC
Sales Size Analysis
ORC did better than expected
–String columns have small cardinality
–Lots of timestamp columns
–No doubles
Github Size Analysis
Surprising win for JSON and Avro
–Worst when uncompressed
–Best with zlib
Many partially shared strings
–ORC and Parquet don’t compress across columns
Need to investigate Zstd with dictionary
Use Cases
Full Table Scans
Read all columns & rows
All formats except JSON are splittable
–Different workers do different parts of file
Taxi schema supports ColumnarBatch
–All primitive types
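A minimal sketch of the full-scan use case through Spark SQL (Java; the path and view name are illustrative). foreach() is used so that every column of every row is actually materialized rather than pruned away:

import org.apache.spark.api.java.function.ForeachFunction;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class FullScan {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("full-scan").master("local[*]").getOrCreate();

    // Register the file-backed table, then materialize every row and column.
    spark.read().orc("out/taxi-orc-zlib").createOrReplaceTempView("taxi");
    spark.sql("select * from taxi")
        .foreach((ForeachFunction<Row>) row -> { /* consume the row */ });
    spark.stop();
  }
}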
[Chart: Taxi Times – read times by format (ORC, Parquet, JSON) and compression (none, zlib, snappy); vertical axis 0–350]
Taxi Read Performance Analysis
JSON is very slow to read
–Large storage size for this data set
–Needs to do a LOT of string parsing
Parquet is faster
–ORC is going through an extra layer
–VectorizedRowBatch -> OrcStruct -> ColumnarBatch
[Chart: Sales Times – read times by format (ORC, Parquet, JSON) and compression (none, zlib, snappy); vertical axis 0–400]
Sales Read Performance Analysis
Read performance is dominated by format
–Compression matters less for this data set
–Straight ordering: ORC, Parquet, & JSON
Uses Row instead of ColumnarBatch
[Chart: Github Times – read times by format (ORC, Parquet, JSON) and compression (none, zlib, snappy); vertical axis 0–1000]
Github Read Performance Analysis
JSON did really well
A lot of columns need more space
–We need bigger stripes (add min rows in ORC-190)
–Rows/stripe - ORC: 18.6k, Parquet: 88.1k
Parquet struggles
–Twitter recommends against Parquet for this case
Column Projection
Often just need a few columns
–Only ORC & Parquet are columnar
–Only read, decompress, & deserialize some columns
Spark FileFormat passes in desired schema
–Drop columns that aren’t needed
–JSON and Avro read everything first and then drop columns
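A sketch of the projection use case (Java; the column names are stand-ins, not the real taxi schema):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ProjectColumns {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("column-projection").master("local[*]").getOrCreate();

    // Spark passes the pruned schema to the FileFormat, so ORC and Parquet
    // only read, decompress, and decode the two requested columns; the other
    // 16 taxi columns are skipped entirely.
    Dataset<Row> projected = spark.read().orc("out/taxi-orc-zlib")
        .select("fare_amount", "trip_distance");   // illustrative column names
    projected.show(5);
    spark.stop();
  }
}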
[Chart: Column Projection % Sizes – projected read size as a percentage of the full file for ORC and Parquet (none, snappy, zlib) on the github, sales, and taxi data sets; vertical axis 0–20]
Predicate Pushdown
Query:
–select first_name, last_name from employees where hire_date between ‘01/01/2017’ and ‘01/03/2017’
Predicate:
–hire_date between ‘01/01/2017’ and ‘01/03/2017’
Given to FileFormat via filters
For benchmark, filter on a sorted column
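Expressed through the DataFrame API, the same predicate reaches the FileFormat as pushed filters; a sketch (Java; the employees path and the ISO date literals are illustrative):

import static org.apache.spark.sql.functions.col;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class PushdownQuery {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("pushdown").master("local[*]")
        .config("spark.sql.orc.impl", "native")
        .config("spark.sql.orc.filterPushdown", "true")
        .getOrCreate();

    Dataset<Row> hires = spark.read().orc("out/employees-orc")
        .where(col("hire_date").between("2017-01-01", "2017-01-03"))
        .select("first_name", "last_name");

    hires.explain();   // "PushedFilters" in the physical plan shows what reached the reader
    hires.show();
    spark.stop();
  }
}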
Predicate Pushdown
ORC & Parquet indexes with min & max
–Sorted data is critical!
ORC has optional bloom filters
Reader filters out sections of file
–Entire file
–Stripe
–Row group (only ORC, default 10k rows)
Engine needs to apply row level filter
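Since the min/max indexes only help when values are clustered, the data is sorted on the filter column before writing; a sketch, with an ORC bloom filter requested through the Hadoop configuration (Java; the column name is illustrative, and passing orc.bloom.filter.columns this way is an assumption about how the writer picks up orc.* settings):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class WriteSorted {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("write-sorted").master("local[*]").getOrCreate();

    // Assumption: the ORC writer reads orc.* settings from the Hadoop conf.
    spark.sparkContext().hadoopConfiguration()
        .set("orc.bloom.filter.columns", "hire_date");   // illustrative column

    Dataset<Row> employees = spark.read().parquet("data/employees-source");
    // Sorting on the filter column keeps min/max ranges tight per stripe and
    // row group, so whole sections of the file can be skipped at read time.
    employees.sort("hire_date")
        .write().option("compression", "zlib").orc("out/employees-orc");
    spark.stop();
  }
}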
[Chart: Predicate Pushdown Rows – rows read by ORC and Parquet with pushdown versus the total row count, for the taxi, sales, and github data sets; vertical axis 0–30,000,000]
Predicate Pushdown
Parquet doesn’t push down timestamp filters
–Taxi and Github filters were on timestamps.
Spark defaults ORC predicate pushdown to off.
Small ORC stripes for Github lead to sub-10k row reads.
Because predicate pushdown is an optimization, it isn’t clear when it isn’t being used.
Metadata Access
ORC & Parquet store metadata
–Stored in file footer
–File schema
–Number of records
–Min, max, count of each column
Provides O(1) Access
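A sketch of reading that footer metadata directly with the ORC Java API (the file name is illustrative; Parquet exposes similar footer information through its own reader):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.orc.ColumnStatistics;
import org.apache.orc.OrcFile;
import org.apache.orc.Reader;

public class ShowFooter {
  public static void main(String[] args) throws java.io.IOException {
    Configuration conf = new Configuration();
    // Only the footer is read, so this stays O(1) regardless of file size.
    Reader reader = OrcFile.createReader(
        new Path("out/taxi-orc-zlib/part-00000.orc"),
        OrcFile.readerOptions(conf));
    System.out.println("schema: " + reader.getSchema());
    System.out.println("rows:   " + reader.getNumberOfRows());
    ColumnStatistics[] stats = reader.getStatistics();   // min/max/count per column
    for (int i = 0; i < stats.length; ++i) {
      System.out.println("column " + i + ": " + stats[i]);
    }
  }
}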
Conclusions
Recommendations
Disclaimer – Everything changes!
–Both these benchmarks and the formats will change.
Evaluate needs
–Column projection and predicate pushdown are only in ORC & Parquet
–Determine how to sort data
–Are bloom filters useful?
Thank you!
Twitter: @owen_omalley
Email: owen@hortonworks.com