Parquet at Datadog
How we use Parquet for tons of metrics data
Doug Daniels, Director of Engineering
Outline
• Monitor everything
• Our data / why we chose Parquet
• A bit about Parquet
• Our pipeline
• What we see in production
Datadog is a monitoring service for large-scale cloud applications
Collect Everything
Integrations for 100+ components
Monitor Everything
• Alert on critical issues
• Collaborate to fix them together
We collect a lot of data…
the biggest and most important of which is metric timeseries data
• timestamp: 1447020511
• metric: system.cpu.idle
• value: 98.16687
We collect hundreds of billions of these per day
…and growing every week
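To put that volume in perspective, here is a rough back-of-the-envelope sketch. The MetricPoint type simply mirrors the three fields shown above. The figure of 200 billion points per day is an assumed stand-in for "hundreds of billions", not a number from the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetricPoint:
    """One metric timeseries point, with the three fields from the slide."""
    timestamp: int   # Unix epoch seconds
    metric: str      # metric name
    value: float     # sampled value


SECONDS_PER_DAY = 24 * 60 * 60


def points_per_second(points_per_day: float) -> float:
    """Average ingest rate implied by a daily point count."""
    return points_per_day / SECONDS_PER_DAY


example = MetricPoint(1447020511, "system.cpu.idle", 98.16687)

# Assumption for illustration only: "hundreds of billions" taken as 200 billion.
ASSUMED_POINTS_PER_DAY = 200e9
rate = points_per_second(ASSUMED_POINTS_PER_DAY)
print(f"{rate:,.0f} points/second on average")
```

Under that assumption the average works out to roughly 2.3 million points per second, before accounting for any peaks.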
And we do massive computation on them
• Statistical analysis
• Machine learning
• Ad-hoc queries
• Reporting and aggregation
• Metering and billing
One size does not fit all.
• ETL and aggregation: Pig / Hive
• ML and iterative algorithms: Spark
• Interactive SQL: Presto
We want the best framework for each job
How do we do that without
• Duplicating data storage
• Writing redundant glue code
• Copying data definitions and schema
1. Separate Compute and Storage
• Amazon S3 as data system-of-record
• Ephemeral, job-specific clusters
• Write storage once, read everywhere
2. Standard Data Format
• Supported by major frameworks
• Schema-aware
• Fast to read
• Strong community
DataEngConf: Parquet at Datadog: Fast, Efficient, Portable Storage for Big Data
Parquet is a column-oriented data storage format
What we love about Parquet
• Interoperable!
• Stores our data super efficiently
• Proven at scale on S3
• Strong community
Quick Parquet primer
• Row Group 0
  • Column A: Page 0, Page 1, Page 2
  • Column B: Page 0, Page 1
• Footer
  • File Meta Data
    • Row Group 0 Metadata: Column A Metadata, Column B Metadata
    • …
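One way to picture the layout above is as a few nested data structures. This is a minimal in-memory sketch, not the on-disk Parquet format. All class and function names here are hypothetical. It shows why a column-oriented layout lets a reader fetch one column's pages without touching the others.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ColumnChunk:
    name: str
    pages: List[List[float]]  # each page holds a run of values for one column


@dataclass
class RowGroup:
    columns: Dict[str, ColumnChunk]


@dataclass
class ColumnMetadata:
    name: str
    num_pages: int
    num_values: int


@dataclass
class Footer:
    # One list of column metadata per row group, in row-group order.
    row_groups: List[List[ColumnMetadata]] = field(default_factory=list)


@dataclass
class ParquetLikeFile:
    row_groups: List[RowGroup]
    footer: Footer


def build_file(row_groups: List[RowGroup]) -> ParquetLikeFile:
    """Write the footer last, summarizing every column chunk in every row group."""
    footer = Footer()
    for rg in row_groups:
        footer.row_groups.append([
            ColumnMetadata(c.name, len(c.pages), sum(len(p) for p in c.pages))
            for c in rg.columns.values()
        ])
    return ParquetLikeFile(row_groups, footer)


def read_column(f: ParquetLikeFile, name: str) -> List[float]:
    """Read a single column: only that column's pages are visited."""
    values: List[float] = []
    for rg in f.row_groups:
        for page in rg.columns[name].pages:
            values.extend(page)
    return values


# Row Group 0 from the slide: Column A has three pages, Column B has two.
rg0 = RowGroup({
    "A": ColumnChunk("A", [[1.0, 2.0], [3.0, 4.0], [5.0]]),
    "B": ColumnChunk("B", [[10.0, 20.0, 30.0], [40.0, 50.0]]),
})
parquet_like = build_file([rg0])
column_a = read_column(parquet_like, "A")
```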
Efficient storage and fast reads
• Space efficiencies (per page)
  • Type-specific encodings: run-length, delta, …
  • Compression
• Query efficiencies (support varies by framework)
  • Projection pushdown (skip columns)
  • Predicate pushdown (skip row groups)
  • Vectorized read (many rows at a time)
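The ideas in the list above can be sketched in a few lines. These are deliberately simplified versions: real Parquet encodings are bit-packed and more involved, and real readers take row-group statistics from the footer metadata. The function names and sample values are made up for illustration.

```python
from itertools import accumulate
from typing import List, Sequence, Tuple


def run_length_encode(values: Sequence[str]) -> List[Tuple[str, int]]:
    """Collapse consecutive repeats into (value, count) pairs."""
    runs: List[Tuple[str, int]] = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1] = (v, runs[-1][1] + 1)
        else:
            runs.append((v, 1))
    return runs


def delta_encode(values: Sequence[int]) -> List[int]:
    """Keep the first value, then store only differences between neighbors."""
    if not values:
        return []
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]


def delta_decode(deltas: Sequence[int]) -> List[int]:
    """Invert delta_encode with a running sum."""
    return list(accumulate(deltas))


def row_groups_to_read(stats: Sequence[Tuple[int, int]], lo: int, hi: int) -> List[int]:
    """Predicate pushdown: keep only row groups whose (min, max) overlaps [lo, hi]."""
    return [i for i, (mn, mx) in enumerate(stats) if mx >= lo and mn <= hi]


# Sorted metric names repeat a lot, so run-length encoding shrinks them.
names = ["system.cpu.idle"] * 3 + ["system.cpu.user"] * 2
name_runs = run_length_encode(names)

# Regularly spaced timestamps turn into small, repetitive deltas.
timestamps = [1447020511, 1447020521, 1447020531]
ts_deltas = delta_encode(timestamps)

# Three row groups with (min, max) timestamps; a query for 250..320 skips the first.
selected = row_groups_to_read([(100, 199), (200, 299), (300, 399)], 250, 320)
```

Both encodings work best on sorted data, which fits a pipeline that sorts before writing.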
Broad ecosystem support
Our Parquet pipeline
• Kafka
• Go: Buffer, Sort, Dedupe, Upload (csv-gz to Amazon S3)
• Luigi/Pig: Partition, Write Parquet, Update Metastore (Parquet to Amazon S3, metadata to Hive Metastore)
• Hadoop, Spark, Presto read from Amazon S3 (PrestoS3FileSystem, EMRFS)
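The buffer, sort, dedupe and partition steps can be sketched as follows. The production stage is written in Go, so this Python version is only an illustration. The sort key, the dedupe rule (first point wins per metric and timestamp) and the daily partition key are all assumptions, since the slide does not specify them.

```python
from datetime import datetime, timezone
from typing import Dict, Iterable, List, Tuple

Point = Tuple[int, str, float]  # (timestamp, metric, value)


def sort_and_dedupe(buffered: Iterable[Point]) -> List[Point]:
    """Sort by (metric, timestamp) and drop repeats of the same key.

    sorted() is stable, so among duplicates the first point received is kept.
    """
    ordered = sorted(buffered, key=lambda p: (p[1], p[0]))
    out: List[Point] = []
    for p in ordered:
        if out and (out[-1][1], out[-1][0]) == (p[1], p[0]):
            continue  # same metric and timestamp as the previous point
        out.append(p)
    return out


def partition_by_day(points: Iterable[Point]) -> Dict[str, List[Point]]:
    """Group points by UTC day (assumed partition key, for illustration)."""
    parts: Dict[str, List[Point]] = {}
    for p in points:
        day = datetime.fromtimestamp(p[0], tz=timezone.utc).strftime("%Y-%m-%d")
        parts.setdefault(day, []).append(p)
    return parts


buffered = [
    (1447020521, "system.cpu.idle", 97.5),
    (1447020511, "system.cpu.idle", 98.16687),
    (1447020511, "system.cpu.idle", 98.16687),  # duplicate delivery
    (1447020511, "system.cpu.user", 1.2),
]
clean = sort_and_dedupe(buffered)
partitions = partition_by_day(clean)
```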
What we see in production
Excellent storage efficiency
• For just 5 columns:
  • 3.5X less storage than gz-compressed CSV
  • 2.5X less than internal query-optimized columnar format
…a little too efficient
• One 80MB parquet file with 160M rows / row group
• Creates long-running map tasks
• Added PARQUET-344 to limit rows per row group
• Want to switch this to limit by uncompressed size
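The numbers above imply about half a byte per row on disk, which is why a single row group can hold so many rows. The sketch below works through that arithmetic and shows a hypothetical writer buffer that closes a row group on a row-count limit or an uncompressed-size limit. It is not the PARQUET-344 implementation. The class, its parameters and the 20-byte row size are invented for illustration.

```python
from typing import List, Optional

# 80 MB file holding 160M rows: compressed bytes per row.
bytes_per_row = 80 * 1024 * 1024 / 160e6


class RowGroupBuffer:
    """Buffers rows and closes a row group when either limit is reached."""

    def __init__(self, max_rows: Optional[int] = None,
                 max_uncompressed_bytes: Optional[int] = None) -> None:
        self.max_rows = max_rows
        self.max_uncompressed_bytes = max_uncompressed_bytes
        self.rows: List[object] = []
        self.uncompressed_bytes = 0
        self.flushed: List[int] = []  # row count of each closed row group

    def add(self, row: object, size_bytes: int) -> None:
        self.rows.append(row)
        self.uncompressed_bytes += size_bytes
        if self._full():
            self.flush()

    def _full(self) -> bool:
        if self.max_rows is not None and len(self.rows) >= self.max_rows:
            return True
        if (self.max_uncompressed_bytes is not None
                and self.uncompressed_bytes >= self.max_uncompressed_bytes):
            return True
        return False

    def flush(self) -> None:
        """Close the current row group, if it has any rows."""
        if self.rows:
            self.flushed.append(len(self.rows))
            self.rows = []
            self.uncompressed_bytes = 0


# Limit by row count.
by_rows = RowGroupBuffer(max_rows=1000)
for i in range(2500):
    by_rows.add(i, size_bytes=20)
by_rows.flush()

# Limit by uncompressed size instead (assumed 20 bytes per row).
by_size = RowGroupBuffer(max_uncompressed_bytes=10_000)
for i in range(2500):
    by_size.add(i, size_bytes=20)
by_size.flush()
```

A size-based limit keeps row groups comparable in read cost even when the data compresses unusually well, which a fixed row count cannot guarantee.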
Slower read performance with AvroParquet
[Bar chart: runtime for our test job (mins), on a 0 to 40 min scale, for CSV + gz, AvroParquet + gz, AvroParquet + snappy, and Parquet + gz]
• Tried reading schema w/ AvroReader
• Saw 3x slower reads with AvroParquet (YMMV) on jobs
• Using HCatalog reader + Hive metastore for schema in production
Our Parquet configuration
• Parquet block size (and dfs block size): 128 MB
• Page size: 1 MB
• Compression: gzip
• Schema Metadata: Pig (we actually use Hive metastore)
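As a rough sketch, the settings above map onto the writer properties below. The property names are the standard parquet-mr and HDFS keys. How they are actually passed to the job (Pig script, job configuration, and so on) is not stated in the slides, so the key=value rendering here is only an assumption.

```python
from typing import Dict, List, Union

MB = 1024 * 1024

# Values from the slide, expressed in bytes where applicable.
parquet_settings: Dict[str, Union[int, str]] = {
    "parquet.block.size": 128 * MB,   # row group size
    "dfs.blocksize": 128 * MB,        # matching DFS block size
    "parquet.page.size": 1 * MB,
    "parquet.compression": "GZIP",
}


def render_properties(settings: Dict[str, Union[int, str]]) -> List[str]:
    """Render settings as sorted key=value lines for a job configuration."""
    return [f"{key}={value}" for key, value in sorted(settings.items())]


lines = render_properties(parquet_settings)
for line in lines:
    print(line)
```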
Thanks!
Want to work with us on Spark, Hadoop, Kafka, Parquet, Presto, and more?
DM me @ddaniels888 or doug@datadoghq.com
