DataEngConf: Parquet at Datadog: Fast, Efficient, Portable Storage for Big Data


The document outlines how Datadog uses Parquet to collect and process vast amounts of metrics data from cloud applications. It covers their data pipeline, the benefits of Parquet, and production findings on storage efficiency and read performance. The discussion emphasizes separating compute from storage and adopting a standard data format.


Parquet at Datadog
How we use Parquet for tons of metrics data
Doug Daniels, Director of Engineering

Outline
• Monitor everything
• Our data / why we chose Parquet
• A bit about Parquet
• Our pipeline
• What we see in production

Datadog is a monitoring service for large-scale cloud applications

Collect Everything
Integrations for 100+ components

Alert on Critical Issues
Collaborate to Fix Them Together
Monitor Everything

We collect a lot of data… the biggest and most important of which is:

Metric timeseries data
timestamp 1447020511
metric system.cpu.idle
value 98.16687
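
As an illustration only, one such point could be modeled as a small record. The `MetricPoint` name and the field types are assumptions, as is reading the timestamp as seconds since the Unix epoch; the slide shows only the three fields and their sample values.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class MetricPoint:
    """Hypothetical record for one timeseries point (name and types are assumptions)."""
    timestamp: int  # assumed to be seconds since the Unix epoch
    metric: str     # dotted metric name
    value: float


point = MetricPoint(timestamp=1447020511, metric="system.cpu.idle", value=98.16687)

# Interpret the timestamp as UTC to get a human-readable time.
when = datetime.fromtimestamp(point.timestamp, tz=timezone.utc)
print(point.metric, point.value, when.isoformat())
```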

We collect hundreds of billions of these per day
…and growing every week

• Statistical analysis
• Machine learning
• Ad-hoc queries
• Reporting and aggregation
• Metering and billing

• ETL and aggregation: Pig / Hive
• ML and iterative algorithms: Spark
• Interactive SQL: Presto
We want the best framework for each job

How do we do that without:
• Duplicating data storage
• Writing redundant glue code
• Copying data definitions and schema

1. Separate Compute and Storage
• Amazon S3 as data system-of-record
• Ephemeral, job-specific clusters
• Write storage once, read everywhere

2. Standard Data Format
• Supported by major frameworks
• Schema-aware
• Fast to read
• Strong community

Parquet is a column-oriented data storage format

What we love about Parquet
• Interoperable!
• Stores our data super efficiently
• Proven at scale on S3
• Strong community

Quick Parquet primer
File layout:
  Row Group 0
    Column A: Page 0, Page 1, Page 2
    Column B: Page 0, Page 1
  Footer
    File Meta Data
      Row Group 0 Metadata
        Column A Metadata
        Column B Metadata
      …
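
To make the footer concrete: in a plain (unencrypted) Parquet file, the serialized file metadata sits at the end, followed by a 4-byte little-endian length and the magic bytes `PAR1`. The sketch below locates that metadata in a byte string. The `footer_metadata_range` helper and the synthetic file contents are illustrative assumptions; a real reader would go on to decode the Thrift-encoded metadata.

```python
import struct

MAGIC = b"PAR1"


def footer_metadata_range(data: bytes) -> tuple[int, int]:
    """Return (start, end) byte offsets of the serialized file metadata."""
    if len(data) < 12 or data[:4] != MAGIC or data[-4:] != MAGIC:
        raise ValueError("not a plain Parquet file")
    # The 4 bytes before the trailing magic hold the metadata length.
    (meta_len,) = struct.unpack("<I", data[-8:-4])
    end = len(data) - 8
    start = end - meta_len
    if start < 4:
        raise ValueError("corrupt footer length")
    return start, end


# Synthetic stand-in: real files hold Thrift-encoded metadata here.
fake_meta = b"<file metadata>"
blob = (
    MAGIC
    + b"<row group 0 pages>"
    + fake_meta
    + struct.pack("<I", len(fake_meta))
    + MAGIC
)
start, end = footer_metadata_range(blob)
print(blob[start:end])
```

Because the metadata is at the end, a reader can fetch the last few bytes first, learn where every row group and column lives, and then read only the parts it needs.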

Efficient storage and fast reads
• Space efficiencies (per page)
    • Type-specific encodings: run-length, delta, …
    • Compression
• Query efficiencies (support varies by framework)
    • Projection pushdown (skip columns)
    • Predicate pushdown (skip row groups)
    • Vectorized read (many rows at a time)
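
A toy sketch of three of the ideas above: run-length encoding, delta encoding, and skipping row groups by their min/max statistics. These are simplified illustrations, not Parquet's actual on-disk encodings; the function names, the second metric name, and the per-row-group statistics are all made up for the example.

```python
from itertools import groupby


def run_length_encode(values):
    """Collapse consecutive repeats into (value, count) pairs."""
    return [(v, len(list(group))) for v, group in groupby(values)]


def delta_encode(values):
    """Keep the first value, then store differences between neighbours."""
    if not values:
        return []
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]


def row_groups_to_read(row_group_stats, lo, hi):
    """Predicate pushdown: keep row groups whose [min, max] overlaps [lo, hi]."""
    return [i for i, (mn, mx) in enumerate(row_group_stats) if mx >= lo and mn <= hi]


# Sorted, repetitive columns compress well under these schemes.
metrics = ["system.cpu.idle"] * 4 + ["system.mem.used"] * 2  # second name is hypothetical
timestamps = [1447020511, 1447020521, 1447020531, 1447020541]

print(run_length_encode(metrics))
print(delta_encode(timestamps))

# Hypothetical (min, max) timestamp statistics, one pair per row group.
stats = [
    (1447020000, 1447020499),
    (1447020500, 1447020999),
    (1447021000, 1447021499),
]
print(row_groups_to_read(stats, 1447020511, 1447020541))
```

Only the row group whose timestamp range overlaps the query range has to be read; the other two can be skipped without touching their pages.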

Our Parquet pipeline
Kafka → Go → Amazon S3 (csv-gz) → Luigi/Pig → Amazon S3 (Parquet)
Go
- Buffer
- Sort
- Dedupe
- Upload
Luigi/Pig
- Partition
- Write Parquet
- Update Metastore
Metadata: Hive Metastore
Readers: Hadoop, Spark, Presto
S3 access: PrestoS3FileSystem, EMRFS
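
The buffer, sort, and dedupe steps can be sketched roughly as follows. The talk's stage is written in Go; Python is used here purely for illustration. The dedupe key of (metric, timestamp), the tuple layout, and the helper names are assumptions, and the upload to S3 is left out.

```python
import csv
import gzip
import io


def sort_and_dedupe(points):
    """Sort points by (metric, timestamp) and drop repeats of the same key."""
    seen = set()
    out = []
    for point in sorted(points, key=lambda p: (p[0], p[1])):
        key = (point[0], point[1])
        if key not in seen:
            seen.add(key)
            out.append(point)
    return out


def to_csv_gz(points):
    """Serialize points as gzip-compressed CSV bytes, ready for upload."""
    text = io.StringIO()
    csv.writer(text).writerows(points)
    return gzip.compress(text.getvalue().encode("utf-8"))


# A small in-memory buffer standing in for messages read from Kafka.
buffered = [
    ("system.cpu.idle", 1447020521, 97.5),
    ("system.cpu.idle", 1447020511, 98.16687),
    ("system.cpu.idle", 1447020521, 97.5),  # duplicate delivery
]
clean = sort_and_dedupe(buffered)
payload = to_csv_gz(clean)
print(len(clean), "points,", len(payload), "bytes")
```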

Excellent storage efficiency
• For just 5 columns:
    • 3.5X less storage than gz-compressed CSV
    • 2.5X less than internal query-optimized columnar format

…a little too efficient
• One 80 MB Parquet file with 160M rows per row group
• Creates long-running map tasks
• Added PARQUET-344 to limit rows per row group
• Want to switch this to limit by uncompressed size
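
Taking the slide's figures at face value, the arithmetic shows how dense the file is. Treating "MB" as 2^20 bytes is an assumption, and the row cap below is a made-up value chosen only to show the effect of a per-row-group limit; the talk does not state the limit used.

```python
file_bytes = 80 * 1024 * 1024  # 80 MB, assumed to mean MiB
rows = 160_000_000

# Average compressed size of one row across all columns.
bytes_per_row = file_bytes / rows
print(f"{bytes_per_row:.2f} bytes per row")

# Hypothetical cap on rows per row group.
max_rows_per_group = 10_000_000
row_groups = -(-rows // max_rows_per_group)  # ceiling division
print(row_groups, "row groups")
```

At roughly half a byte per row, a size-based limit alone would not split the data, so all 160M rows end up in one row group. A row cap breaks the same data into smaller units of work.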

Slower read performance with AvroParquet
[Bar chart: runtime for our test job (mins), axis from 0 to 40 min, with bars for CSV+gz, AvroParquet+gz, AvroParquet+snappy, and Parquet+gz]
• Tried reading schema w/ AvroReader
• Saw 3x slower reads with AvroParquet (YMMV) on jobs
• Using HCatalog reader + hive metastore for schema in production

Our Parquet configuration
• Parquet block size (and dfs block size): 128 MB
• Page size: 1 MB
• Compression: gzip
• Schema Metadata: pig (we actually use hive metastore)
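
For reference, the sizes above work out to the byte values below. The property names are the Hadoop and parquet-mr configuration keys that these settings usually map to; which keys were actually set in this pipeline is an assumption, since the slide lists only the values.

```python
MB = 1024 * 1024

# Assumed property names; the slide gives only the values.
parquet_conf = {
    "parquet.block.size": 128 * MB,  # row group ("block") size
    "dfs.blocksize": 128 * MB,       # filesystem block size, kept equal
    "parquet.page.size": 1 * MB,
    "parquet.compression": "GZIP",
}

for key, value in parquet_conf.items():
    print(f"{key}={value}")
```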

Thanks!
Want to work with us on Spark, Hadoop, Kafka, Parquet, Presto, and more?
DM me @ddaniels888 or doug@datadoghq.com
