SlideShare a Scribd company logo
© 2017 Dremio Corporation @DremioHQ
Using Apache Arrow, Calcite and Parquet to build a
Relational Cache
Halloween 2017
@DataEngConf
Jacques Nadeau
© 2017 Dremio Corporation @DremioHQ
Who?
Jacques Nadeau
@intjesus
• CTO & Co-founder of Dremio
• Apache member
• VP Apache Arrow
• PMCs: Arrow, Calcite, Incubator, Heron (incubating)
© 2017 Dremio Corporation @DremioHQ
Agenda
• Tech Backgrounder
• Caching Techniques
• Relational Caching In Depth
• Definition and Matching
• Dealing with Updates
• Closing Words
© 2017 Dremio Corporation @DremioHQ
Tech Backgrounder
© 2017 Dremio Corporation @DremioHQ
What is Apache Arrow
• Columnar In-memory Data processing
library
• Designed to work with any programming
language
• Support for both relational and complex
data as-is
• Used by Pandas, Spark, Dremio
© 2017 Dremio Corporation @DremioHQ
What is Apache Calcite
• SQL parser, Relational Algebra &
Optimizer
• Understands Materialized Views and
Lattices
• Used by many to add SQL functionality
including Apex, Drill, Hive, Flink, Kylin,
Phoenix, Samza, Storm, Cascading &
Dremio
© 2017 Dremio Corporation @DremioHQ
What is Apache Parquet
• OSS implementation of Google Dremel
disk format for complex columnar data
• Support high-level of data-ware
columnar compression, vectorized
columnar readback
• Defacto standard for Analytical data on
disk in Big Data ecosystem
© 2017 Dremio Corporation @DremioHQ
Caching Techniques
© 2017 Dremio Corporation @DremioHQ
What does Caching Mean?
• Caching: Reduce the distance to data (DTD).
• Distance: How much time and resources it takes to
access data?
– How fast is the medium? How near is it?
– Is the data designed for efficient consumption?
– How similar is the data to what you need to answer a
question?
Perf & Proximity
Relevance
Consumability
Ways to reduce DTD
© 2017 Dremio Corporation @DremioHQ
Types of Caching
• In-Memory File Pinning
• Columnar Disk Caching
• In-Memory Block Caching
• Near-CPU Data Caching
• Cube Relational Caching
• Arbitrary Relational Caching
© 2017 Dremio Corporation @DremioHQ
In-Memory File Pinning
• Hold a File in Memory for frequent retrieval
• Pros
– Simple, standard and well-defined interface
– Improves the performance of the medium.
– If you’re performance is primarily bound by disk IO,
this might be a good option.
• Cons
– File structure not necessarily best in-memory
structure.
– Data manipulation almost always requires a copy of
data to also be held in memory (because the file
format is not directly consumable).
© 2017 Dremio Corporation @DremioHQ
Columnar Disk Caching
• Store the data in an optimized columnar
format.
• Pros
– Better compression reduces IO
– Good structure improves processing
– Benefits selective workloads (needed
subset of all columns)
• Cons
– Requires duplicating data
– Typically manual/semi-automated (e.g.
MapReduce/Spark to ETL persist/update)
© 2017 Dremio Corporation @DremioHQ
In-Memory Block Caching
• Maintain portions of on-disk data in
Memory (e.g. Linux page cache, HBase
block cache)
• Pros
– Very mature and usually had for free
• Cons
– Not easy to control/influence.
– Very disconnected from workloads.
© 2017 Dremio Corporation @DremioHQ
Near-CPU Data Caching (memory or disk)
• Hold the data directly in a representation that can
be processed without restructuring (e.g. Arrow
format)
• Pros
– Processing can be done without interpretation of
format
– Very efficient to consume
– Possible to consume data by multiple consumers
without duplicating memory
• Cons
– Larger than compressed formats
– Requires applications to agree on format
© 2017 Dremio Corporation @DremioHQ
Cube-Based Relational Caching
• Create several partially aggregated cuboids that can
satisfy a range of aggregation queries
• Pros
– Low-latency performance for common aggregate
query patterns
– Cube storage requirements can be small fraction of
original dataset size
• Cons
– Analysis latency is bi-modal: cube hit is great but a
miss is either unserved or served slowly
– Difficult or impossible to satisfy arbitrary queries
© 2017 Dremio Corporation @DremioHQ
Arbitrary Relational Caching
• Create arbitrary data fragments combined
with partitioning and sorting schemes to
speed any query
• Pros
– Base case is easy to understand
– Can improve the performance of any query
• Cons
– Complex to match to arbitrary queries
– Can be large depending on needs
© 2017 Dremio Corporation @DremioHQ
Types of Caching: The combination we found useful
• In-Memory File Pinning
– Too non-specific given memory scarcity
• ✔ Columnar Disk Caching
– Make sure everything is in Parquet (for any non-ephemeral data)
• ✔ In-Memory Block Caching
– Leverage existing page-cache, avoid additional memory cache layers
• ✔ Near-CPU Data Caching
– Used primarily for ephemeral/short-term persistence to avoid overhead
• ✔ Cube Relational Caching
– Useful for aggregation patterns
• ✔ Arbitrary Relational Caching
– Useful for unusual aggregation and non-aggregation needs
© 2017 Dremio Corporation @DremioHQ
Relational Caching In Depth
© 2017 Dremio Corporation @DremioHQ
Relational Algebra Refresher
• Relations: Source of data (a table)
• Operators: Define a set of transformations
– Join, Project, Scan, Filter, Aggregate, Window, etc
• Properties: Defining traits of data at a particular
relation
– Sorted by X, Hash distributed by Y, etc.
• Rules: Defining equality conditions between a
collection of operations
– Project > Filter can be changed to Filter > Project, A scan
doesn’t need to project columns that aren’t used later,
etc.
• Graph/Tree: A collection of operators that define a
particular dataset in a DAG
Project
Scan
Filter
Filter
Scan
Project
© 2017 Dremio Corporation @DremioHQ
Relational Caching: Basic Concept
• Store derived data that is
between what you want
and original dataset
• Shortens Distance to
Data (DTD)
• Reduces resource
requirements & latency
Original Data
What you
Want
What you
Want
What you
Want
Persisted Shared
Intermediate State
originalDTD
newDTDcostreduction
© 2017 Dremio Corporation @DremioHQ
You Probably Already Do This!
Data Alternatives (Manually Created)
• Sessionized
• Cleansed
• Partitioned by time or region
• Summarized for a particular
purpose
Users Choose Depending on Need
• Analysts trained on using different
tables depending on use case
• Custom datasets built for
reporting
• Summarization and/or extraction
for dashboards
© 2017 Dremio Corporation @DremioHQ
Benefit of Relational Caching over “Copy and Pick”
“Copy and Pick” Relational Caching
Physical
Optimizations
(transform, sort, partition,
aggregate)
Logical Model
Source Table
????
User picks best
optimization
Cache picks best optimization
Cache maintains
representations
Admin picks manage
maintenance
© 2017 Dremio Corporation @DremioHQ
Key Components of Relational Caching
• How to Express Transformations/States: SQL
• Hold and Match Relational algebra: Calcite
• Persist alternative datasets: Parquet
• A way to process: Arrow + Sabot
• And a lot of code to put it all together…
© 2017 Dremio Corporation @DremioHQ
Query Planner
Our Approach
Data Processing
System (Sabot)
End User Queries
UI to Define
Cached Patterns
Source Storage Interface (Arrow)
HDFS S3 Elastic
Relational Pattern
Matching System
Relational
Pattern
Database
Change
Detection
Database
Cache
Persistence
Parquet
Arrow
Refresh
System
© 2017 Dremio Corporation @DremioHQ
Definition and Matching
© 2017 Dremio Corporation @DremioHQ
Coming Back to Calcite
• Calcite is a Planner & Optimizer
• Comes with a prebuilt selection of
operators, rules, properties (called
traits) and ways to express relations
• Also has a basic Materialized View
facility (relevant!)
Perfect
Foundation
for Relational
Caching
© 2017 Dremio Corporation @DremioHQ
How We Built Caching: Reflections
• Reflection: A persisted alternative view of data in Parquet
format
– Raw Reflection: Persist all records of underlying dataset, controlling
partitioning and sortedness
– Aggregate Reflection: Persist a partially aggregated dataset based on a
selection of dimensions and measures, still controlling partitioning and
sortedness
• Reflections can be built on either source tables or arbitrarily
defined Virtual Datasets
© 2017 Dremio Corporation @DremioHQ
Cache Matching: Aggregation Rollup
Given a user query, try to create an alternative version of the
query that matches the cached target.
P(a,c)
F(c’ < 10)
S(t1)
S(t1)
A(a, sum(c) as c’)
A(a,b, sum(c))
S(r1)
User Query Reflection Definition Alternative Plan
F(c’ < 10)
S(r1)
A(a, sum(c) as c’)
Target
Materialization
© 2017 Dremio Corporation @DremioHQ
Cache Matching: Join/Aggregation Transposition
Join(t1.id=t2.id)
S(t1)
S(t1)
A(a, sum(c) as c’)
A(id, sum(c))
S(r1)
User Query Reflection Definition Alternative Plan
Target
MaterializationS(t2)
Join(r1.id=t2.id)
S(r1)
A(a, sum(c) as c’)
S(t2)
© 2017 Dremio Corporation @DremioHQ
Cache Matching: Costing and Partitioning Benefits
F(a)
S(t1)
S(t1)
S(r1)
Part by a
User Query
Target
Materialization
S(t1)
S(r1)
Part by b
Target
Materialization
S(r1)
pruned on a
© 2017 Dremio Corporation @DremioHQ
Relational Matching, Other Examples
• Physical Property Matching
• Predicate Promotion
• Predicate Inference
• Join Decomposition
• Join Promotion
© 2017 Dremio Corporation @DremioHQ
Dealing with Updates
© 2017 Dremio Corporation @DremioHQ
Refresh Management
Importance of Cache Creation Ordering
• Not all updating
orderings are equal
• Want to order
updates based on
“Refresh Graph” and
dependencies
• Multiple orders
possible, cost against
each other to
minimize update cost
Freshness Management
• Underlying data
may change
• User Should define
refresh frequency
• Separately Define
Absolute TTL
Physical
dataset
1H refresh
3H expiration
Raw Reflection
Aggregate Reflection
© 2017 Dremio Corporation @DremioHQ
Multiple Update Modes (Depending on Mutation Pattern)
• Full: Always rebuild reflections from scratch (highly mutating)
• Incremental (files): Incrementally builds reflections based on new
files and folders (append-only)
• Incremental (rowstores): Incrementally builds reflections based on
monotonically increasing field (append-only)
• Partitioned Refresh: Maintains reflections based on source
partitions (e.g. Filesystem directories, Hive partitions). (partially
mutating)
© 2017 Dremio Corporation @DremioHQ
Closing Words
© 2017 Dremio Corporation @DremioHQ
What We’ve Seen Using these Techniques
• Frequent 10x-100x+ performance improvements in multiple
workloads
• Vast reduction in resources required to achieve performance
levels
• In many cases, a reduction in disk space
– Due to avoidance of excessive unused or rarely used physical copies
© 2017 Dremio Corporation @DremioHQ
Find out More and Get Involved
• Drop by my office hours (East Room Lounge - now)
• Drop by the Dremio table behind you
• Join us at @ApacheArrow meetup at @enigma_data Midtown
– Wes Mckinney, creator of Pandas and myself, tech deep dive
• Join the Dremio community (Relational Caching)
– github.com/dremio/dremio-oss (Apache Licensed)
– dremio.com
– community.dremio.com
• Find out more about the Building Blocks
– dev@[arrow|calcite|parquet].apache.org
– http://github.com/apache/[arrow|calcite|parquet-mr]
– http://[arrow|calcite|parquet].apache.org
• Follow @DremioHQ, @intjesus, @ApacheArrow, @ApacheCalcite,
@ApacheParquet

More Related Content

What's hot (20)

PDF
ClickHouse Query Performance Tips and Tricks, by Robert Hodges, Altinity CEO
Altinity Ltd
 
PPTX
Hive + Tez: A Performance Deep Dive
DataWorks Summit
 
PPTX
How to understand and analyze Apache Hive query execution plan for performanc...
DataWorks Summit/Hadoop Summit
 
PPTX
Hive 3 - a new horizon
Thejas Nair
 
PDF
Iceberg: a fast table format for S3
DataWorks Summit
 
PDF
The Parquet Format and Performance Optimization Opportunities
Databricks
 
PPTX
Apache Arrow Flight Overview
Jacques Nadeau
 
PDF
Clickhouse Capacity Planning for OLAP Workloads, Mik Kocikowski of CloudFlare
Altinity Ltd
 
PDF
State of the Trino Project
Martin Traverso
 
PDF
SeaweedFS introduction
chrislusf
 
PDF
Apache Spark Core—Deep Dive—Proper Optimization
Databricks
 
PDF
ClickHouse Deep Dive, by Aleksei Milovidov
Altinity Ltd
 
PPTX
The columnar roadmap: Apache Parquet and Apache Arrow
DataWorks Summit
 
PDF
Understanding Query Plans and Spark UIs
Databricks
 
PDF
Databricks Delta Lake and Its Benefits
Databricks
 
PPTX
Druid deep dive
Kashif Khan
 
PDF
Using ClickHouse for Experimentation
Gleb Kanterov
 
PDF
Apache Flink internals
Kostas Tzoumas
 
PDF
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Databricks
 
PDF
Parquet performance tuning: the missing guide
Ryan Blue
 
ClickHouse Query Performance Tips and Tricks, by Robert Hodges, Altinity CEO
Altinity Ltd
 
Hive + Tez: A Performance Deep Dive
DataWorks Summit
 
How to understand and analyze Apache Hive query execution plan for performanc...
DataWorks Summit/Hadoop Summit
 
Hive 3 - a new horizon
Thejas Nair
 
Iceberg: a fast table format for S3
DataWorks Summit
 
The Parquet Format and Performance Optimization Opportunities
Databricks
 
Apache Arrow Flight Overview
Jacques Nadeau
 
Clickhouse Capacity Planning for OLAP Workloads, Mik Kocikowski of CloudFlare
Altinity Ltd
 
State of the Trino Project
Martin Traverso
 
SeaweedFS introduction
chrislusf
 
Apache Spark Core—Deep Dive—Proper Optimization
Databricks
 
ClickHouse Deep Dive, by Aleksei Milovidov
Altinity Ltd
 
The columnar roadmap: Apache Parquet and Apache Arrow
DataWorks Summit
 
Understanding Query Plans and Spark UIs
Databricks
 
Databricks Delta Lake and Its Benefits
Databricks
 
Druid deep dive
Kashif Khan
 
Using ClickHouse for Experimentation
Gleb Kanterov
 
Apache Flink internals
Kostas Tzoumas
 
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Databricks
 
Parquet performance tuning: the missing guide
Ryan Blue
 

Viewers also liked (12)

PPTX
Building a Virtual Data Lake with Apache Arrow
Dremio Corporation
 
PPTX
Options for Data Prep - A Survey of the Current Market
Dremio Corporation
 
PDF
The twins that everyone loved too much
Julian Hyde
 
PPTX
Apache Arrow - An Overview
Dremio Corporation
 
PDF
Data Science Languages and Industry Analytics
Wes McKinney
 
PDF
Apache Calcite: One planner fits all
Julian Hyde
 
PDF
Bi on Big Data - Strata 2016 in London
Dremio Corporation
 
PDF
Don’t optimize my queries, optimize my data!
Julian Hyde
 
PDF
SQL on everything, in memory
Julian Hyde
 
PPTX
Apache Calcite overview
Julian Hyde
 
PDF
Oracle対応アプリケーションのDockerize事始め
Satoshi Nagayasu
 
PPTX
はじめてのDockerパーフェクトガイド(2017年版)
Hiroshi Hayakawa
 
Building a Virtual Data Lake with Apache Arrow
Dremio Corporation
 
Options for Data Prep - A Survey of the Current Market
Dremio Corporation
 
The twins that everyone loved too much
Julian Hyde
 
Apache Arrow - An Overview
Dremio Corporation
 
Data Science Languages and Industry Analytics
Wes McKinney
 
Apache Calcite: One planner fits all
Julian Hyde
 
Bi on Big Data - Strata 2016 in London
Dremio Corporation
 
Don’t optimize my queries, optimize my data!
Julian Hyde
 
SQL on everything, in memory
Julian Hyde
 
Apache Calcite overview
Julian Hyde
 
Oracle対応アプリケーションのDockerize事始め
Satoshi Nagayasu
 
はじめてのDockerパーフェクトガイド(2017年版)
Hiroshi Hayakawa
 
Ad

Similar to Using Apache Arrow, Calcite, and Parquet to Build a Relational Cache (20)

PPTX
Efficient Data Formats for Analytics with Parquet and Arrow
DataWorks Summit/Hadoop Summit
 
PPTX
Mule soft mar 2017 Parquet Arrow
Julien Le Dem
 
PPTX
The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Ana...
DataWorks Summit/Hadoop Summit
 
PPTX
The columnar roadmap: Apache Parquet and Apache Arrow
Julien Le Dem
 
PPTX
Strata NY 2016: The future of column-oriented data processing with Arrow and ...
Julien Le Dem
 
PPTX
Data Eng Conf NY Nov 2016 Parquet Arrow
Julien Le Dem
 
PDF
Strata London 2016: The future of column oriented data processing with Arrow ...
Julien Le Dem
 
PPTX
Part 1: Cloudera’s Analytic Database: BI & SQL Analytics in a Hybrid Cloud World
Cloudera, Inc.
 
PDF
Improving Python and Spark Performance and Interoperability with Apache Arrow...
Databricks
 
PPTX
Improving Python and Spark Performance and Interoperability with Apache Arrow
Julien Le Dem
 
PPTX
Webinar: Is Your Storage Ready for Disaster?
Storage Switzerland
 
PDF
Meta scale kognitio hadoop webinar
Kognitio
 
PPTX
IBM Spectrum Scale Overview november 2015
Doug O'Flaherty
 
PDF
Application Architectures with Hadoop - Big Data TechCon SF 2014
hadooparchbook
 
PPTX
12 Architectural Requirements for Protecting Business Data in the Cloud
Buurst
 
PDF
Processing Drone data @Scale
Dr Hajji Hicham
 
PPTX
Webinar: Cut Disaster Recovery Expenses – Improve Recovery Times
Storage Switzerland
 
PPTX
Cloud computing UNIT 2.1 presentation in
RahulBhole12
 
PDF
Lower Cost and Complexity with Azure and StorSimple Hybrid Cloud Solutions
Perficient, Inc.
 
PPTX
Webinar: Performance vs. Cost - Solving The HPC Storage Tug-of-War
Storage Switzerland
 
Efficient Data Formats for Analytics with Parquet and Arrow
DataWorks Summit/Hadoop Summit
 
Mule soft mar 2017 Parquet Arrow
Julien Le Dem
 
The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Ana...
DataWorks Summit/Hadoop Summit
 
The columnar roadmap: Apache Parquet and Apache Arrow
Julien Le Dem
 
Strata NY 2016: The future of column-oriented data processing with Arrow and ...
Julien Le Dem
 
Data Eng Conf NY Nov 2016 Parquet Arrow
Julien Le Dem
 
Strata London 2016: The future of column oriented data processing with Arrow ...
Julien Le Dem
 
Part 1: Cloudera’s Analytic Database: BI & SQL Analytics in a Hybrid Cloud World
Cloudera, Inc.
 
Improving Python and Spark Performance and Interoperability with Apache Arrow...
Databricks
 
Improving Python and Spark Performance and Interoperability with Apache Arrow
Julien Le Dem
 
Webinar: Is Your Storage Ready for Disaster?
Storage Switzerland
 
Meta scale kognitio hadoop webinar
Kognitio
 
IBM Spectrum Scale Overview november 2015
Doug O'Flaherty
 
Application Architectures with Hadoop - Big Data TechCon SF 2014
hadooparchbook
 
12 Architectural Requirements for Protecting Business Data in the Cloud
Buurst
 
Processing Drone data @Scale
Dr Hajji Hicham
 
Webinar: Cut Disaster Recovery Expenses – Improve Recovery Times
Storage Switzerland
 
Cloud computing UNIT 2.1 presentation in
RahulBhole12
 
Lower Cost and Complexity with Azure and StorSimple Hybrid Cloud Solutions
Perficient, Inc.
 
Webinar: Performance vs. Cost - Solving The HPC Storage Tug-of-War
Storage Switzerland
 
Ad

Recently uploaded (20)

PPTX
Writing Better Code - Helping Developers make Decisions.pptx
Lorraine Steyn
 
PPTX
The Role of a PHP Development Company in Modern Web Development
SEO Company for School in Delhi NCR
 
PDF
Digger Solo: Semantic search and maps for your local files
seanpedersen96
 
PDF
Streamline Contractor Lifecycle- TECH EHS Solution
TECH EHS Solution
 
PDF
Build It, Buy It, or Already Got It? Make Smarter Martech Decisions
bbedford2
 
PPTX
Tally_Basic_Operations_Presentation.pptx
AditiBansal54083
 
PPTX
How Apagen Empowered an EPC Company with Engineering ERP Software
SatishKumar2651
 
PPTX
An Introduction to ZAP by Checkmarx - Official Version
Simon Bennetts
 
PDF
vMix Pro 28.0.0.42 Download vMix Registration key Bundle
kulindacore
 
PPTX
Tally software_Introduction_Presentation
AditiBansal54083
 
PDF
Alarm in Android-Scheduling Timed Tasks Using AlarmManager in Android.pdf
Nabin Dhakal
 
PDF
Revenue streams of the Wazirx clone script.pdf
aaronjeffray
 
PDF
Linux Certificate of Completion - LabEx Certificate
VICTOR MAESTRE RAMIREZ
 
PPTX
A Complete Guide to Salesforce SMS Integrations Build Scalable Messaging With...
360 SMS APP
 
PDF
MiniTool Partition Wizard 12.8 Crack License Key LATEST
hashhshs786
 
PDF
Capcut Pro Crack For PC Latest Version {Fully Unlocked} 2025
hashhshs786
 
PPTX
Fundamentals_of_Microservices_Architecture.pptx
MuhammadUzair504018
 
PDF
Automate Cybersecurity Tasks with Python
VICTOR MAESTRE RAMIREZ
 
PPTX
Comprehensive Guide: Shoviv Exchange to Office 365 Migration Tool 2025
Shoviv Software
 
PPTX
Equipment Management Software BIS Safety UK.pptx
BIS Safety Software
 
Writing Better Code - Helping Developers make Decisions.pptx
Lorraine Steyn
 
The Role of a PHP Development Company in Modern Web Development
SEO Company for School in Delhi NCR
 
Digger Solo: Semantic search and maps for your local files
seanpedersen96
 
Streamline Contractor Lifecycle- TECH EHS Solution
TECH EHS Solution
 
Build It, Buy It, or Already Got It? Make Smarter Martech Decisions
bbedford2
 
Tally_Basic_Operations_Presentation.pptx
AditiBansal54083
 
How Apagen Empowered an EPC Company with Engineering ERP Software
SatishKumar2651
 
An Introduction to ZAP by Checkmarx - Official Version
Simon Bennetts
 
vMix Pro 28.0.0.42 Download vMix Registration key Bundle
kulindacore
 
Tally software_Introduction_Presentation
AditiBansal54083
 
Alarm in Android-Scheduling Timed Tasks Using AlarmManager in Android.pdf
Nabin Dhakal
 
Revenue streams of the Wazirx clone script.pdf
aaronjeffray
 
Linux Certificate of Completion - LabEx Certificate
VICTOR MAESTRE RAMIREZ
 
A Complete Guide to Salesforce SMS Integrations Build Scalable Messaging With...
360 SMS APP
 
MiniTool Partition Wizard 12.8 Crack License Key LATEST
hashhshs786
 
Capcut Pro Crack For PC Latest Version {Fully Unlocked} 2025
hashhshs786
 
Fundamentals_of_Microservices_Architecture.pptx
MuhammadUzair504018
 
Automate Cybersecurity Tasks with Python
VICTOR MAESTRE RAMIREZ
 
Comprehensive Guide: Shoviv Exchange to Office 365 Migration Tool 2025
Shoviv Software
 
Equipment Management Software BIS Safety UK.pptx
BIS Safety Software
 

Using Apache Arrow, Calcite, and Parquet to Build a Relational Cache

  • 1. © 2017 Dremio Corporation @DremioHQ Using Apache Arrow, Calcite and Parquet to build a Relational Cache Halloween 2017 @DataEngConf Jacques Nadeau
  • 2. © 2017 Dremio Corporation @DremioHQ Who? Jacques Nadeau @intjesus • CTO & Co-founder of Dremio • Apache member • VP Apache Arrow • PMCs: Arrow, Calcite, Incubator, Heron (incubating)
  • 3. © 2017 Dremio Corporation @DremioHQ Agenda • Tech Backgrounder • Caching Techniques • Relational Caching In Depth • Definition and Matching • Dealing with Updates • Closing Words
  • 4. © 2017 Dremio Corporation @DremioHQ Tech Backgrounder
  • 5. © 2017 Dremio Corporation @DremioHQ What is Apache Arrow • Columnar In-memory Data processing library • Designed to work with any programming language • Support for both relational and complex data as-is • Used by Pandas, Spark, Dremio
  • 6. © 2017 Dremio Corporation @DremioHQ What is Apache Calcite • SQL parser, Relational Algebra & Optimizer • Understands Materialized Views and Lattices • Used by many to add SQL functionality including Apex, Drill, Hive, Flink, Kylin, Phoenix, Samza, Storm, Cascading & Dremio
  • 7. © 2017 Dremio Corporation @DremioHQ What is Apache Parquet • OSS implementation of Google Dremel disk format for complex columnar data • Support high-level of data-ware columnar compression, vectorized columnar readback • Defacto standard for Analytical data on disk in Big Data ecosystem
  • 8. © 2017 Dremio Corporation @DremioHQ Caching Techniques
  • 9. © 2017 Dremio Corporation @DremioHQ What does Caching Mean? • Caching: Reduce the distance to data (DTD). • Distance: How much time and resources it takes to access data? – How fast is the medium? How near is it? – Is the data designed for efficient consumption? – How similar is the data to what you need to answer a question? Perf & Proximity Relevance Consumability Ways to reduce DTD
  • 10. © 2017 Dremio Corporation @DremioHQ Types of Caching • In-Memory File Pinning • Columnar Disk Caching • In-Memory Block Caching • Near-CPU Data Caching • Cube Relational Caching • Arbitrary Relational Caching
  • 11. © 2017 Dremio Corporation @DremioHQ In-Memory File Pinning • Hold a File in Memory for frequent retrieval • Pros – Simple, standard and well-defined interface – Improves the performance of the medium. – If you’re performance is primarily bound by disk IO, this might be a good option. • Cons – File structure not necessarily best in-memory structure. – Data manipulation almost always requires a copy of data to also be held in memory (because the file format is not directly consumable).
  • 12. © 2017 Dremio Corporation @DremioHQ Columnar Disk Caching • Store the data in an optimized columnar format. • Pros – Better compression reduces IO – Good structure improves processing – Benefits selective workloads (needed subset of all columns) • Cons – Requires duplicating data – Typically manual/semi-automated (e.g. MapReduce/Spark to ETL persist/update)
  • 13. © 2017 Dremio Corporation @DremioHQ In-Memory Block Caching • Maintain portions of on-disk data in Memory (e.g. Linux page cache, HBase block cache) • Pros – Very mature and usually had for free • Cons – Not easy to control/influence. – Very disconnected from workloads.
  • 14. © 2017 Dremio Corporation @DremioHQ Near-CPU Data Caching (memory or disk) • Hold the data directly in a representation that can be processed without restructuring (e.g. Arrow format) • Pros – Processing can be done without interpretation of format – Very efficient to consume – Possible to consume data by multiple consumers without duplicating memory • Cons – Larger than compressed formats – Requires applications to agree on format
  • 15. © 2017 Dremio Corporation @DremioHQ Cube-Based Relational Caching • Create several partially aggregated cuboids that can satisfy a range of aggregation queries • Pros – Low-latency performance for common aggregate query patterns – Cube storage requirements can be small fraction of original dataset size • Cons – Analysis latency is bi-modal: cube hit is great but a miss is either unserved or served slowly – Difficult or impossible to satisfy arbitrary queries
  • 16. © 2017 Dremio Corporation @DremioHQ Arbitrary Relational Caching • Create arbitrary data fragments combined with partitioning and sorting schemes to speed any query • Pros – Base case is easy to understand – Can improve the performance of any query • Cons – Complex to match to arbitrary queries – Can be large depending on needs
  • 17. © 2017 Dremio Corporation @DremioHQ Types of Caching: The combination we found useful • In-Memory File Pinning – Too non-specific given memory scarcity • ✔ Columnar Disk Caching – Make sure everything is in Parquet (for any non-ephemeral data) • ✔ In-Memory Block Caching – Leverage existing page-cache, avoid additional memory cache layers • ✔ Near-CPU Data Caching – Used primarily for ephemeral/short-term persistence to avoid overhead • ✔ Cube Relational Caching – Useful for aggregation patterns • ✔ Arbitrary Relational Caching – Useful for unusual aggregation and non-aggregation needs
  • 18. © 2017 Dremio Corporation @DremioHQ Relational Caching In Depth
  • 19. © 2017 Dremio Corporation @DremioHQ Relational Algebra Refresher • Relations: Source of data (a table) • Operators: Define a set of transformations – Join, Project, Scan, Filter, Aggregate, Window, etc • Properties: Defining traits of data at a particular relation – Sorted by X, Hash distributed by Y, etc. • Rules: Defining equality conditions between a collection of operations – Project > Filter can be changed to Filter > Project, A scan doesn’t need to project columns that aren’t used later, etc. • Graph/Tree: A collection of operators that define a particular dataset in a DAG Project Scan Filter Filter Scan Project
  • 20. © 2017 Dremio Corporation @DremioHQ Relational Caching: Basic Concept • Store derived data that is between what you want and original dataset • Shortens Distance to Data (DTD) • Reduces resource requirements & latency Original Data What you Want What you Want What you Want Persisted Shared Intermediate State originalDTD newDTDcostreduction
  • 21. © 2017 Dremio Corporation @DremioHQ You Probably Already Do This! Data Alternatives (Manually Created) • Sessionized • Cleansed • Partitioned by time or region • Summarized for a particular purpose Users Choose Depending on Need • Analysts trained on using different tables depending on use case • Custom datasets built for reporting • Summarization and/or extraction for dashboards
  • 22. © 2017 Dremio Corporation @DremioHQ Benefit of Relational Caching over “Copy and Pick” “Copy and Pick” Relational Caching Physical Optimizations (transform, sort, partition, aggregate) Logical Model Source Table ???? User picks best optimization Cache picks best optimization Cache maintains representations Admin picks manage maintenance
  • 23. © 2017 Dremio Corporation @DremioHQ Key Components of Relational Caching • How to Express Transformations/States: SQL • Hold and Match Relational algebra: Calcite • Persist alternative datasets: Parquet • A way to process: Arrow + Sabot • And a lot of code to put it all together…
  • 24. © 2017 Dremio Corporation @DremioHQ Query Planner Our Approach Data Processing System (Sabot) End User Queries UI to Define Cached Patterns Source Storage Interface (Arrow) HDFS S3 Elastic Relational Pattern Matching System Relational Pattern Database Change Detection Database Cache Persistence Parquet Arrow Refresh System
  • 25. © 2017 Dremio Corporation @DremioHQ Definition and Matching
  • 26. © 2017 Dremio Corporation @DremioHQ Coming Back to Calcite • Calcite is a Planner & Optimizer • Comes with a prebuilt selection of operators, rules, properties (called traits) and ways to express relations • Also has a basic Materialized View facility (relevant!) Perfect Foundation for Relational Caching
  • 27. © 2017 Dremio Corporation @DremioHQ How We Built Caching: Reflections • Reflection: A persisted alternative view of data in Parquet format – Raw Reflection: Persist all records of underlying dataset, controlling partitioning and sortedness – Aggregate Reflection: Persist a partially aggregated dataset based on a selection of dimensions and measures, still controlling partitioning and sortedness • Reflections can be built on either source tables or arbitrarily defined Virtual Datasets
  • 28. © 2017 Dremio Corporation @DremioHQ Cache Matching: Aggregation Rollup Given a user query, try to create an alternative version of the query that matches the cached target. P(a,c) F(c’ < 10) S(t1) S(t1) A(a, sum(c) as c’) A(a,b, sum(c)) S(r1) User Query Reflection Definition Alternative Plan F(c’ < 10) S(r1) A(a, sum(c) as c’) Target Materialization
  • 29. © 2017 Dremio Corporation @DremioHQ Cache Matching: Join/Aggregation Transposition Join(t1.id=t2.id) S(t1) S(t1) A(a, sum(c) as c’) A(id, sum(c)) S(r1) User Query Reflection Definition Alternative Plan Target MaterializationS(t2) Join(r1.id=t2.id) S(r1) A(a, sum(c) as c’) S(t2)
  • 30. © 2017 Dremio Corporation @DremioHQ Cache Matching: Costing and Partitioning Benefits F(a) S(t1) S(t1) S(r1) Part by a User Query Target Materialization S(t1) S(r1) Part by b Target Materialization S(r1) pruned on a
  • 31. © 2017 Dremio Corporation @DremioHQ Relational Matching, Other Examples • Physical Property Matching • Predicate Promotion • Predicate Inference • Join Decomposition • Join Promotion
  • 32. © 2017 Dremio Corporation @DremioHQ Dealing with Updates
  • 33. © 2017 Dremio Corporation @DremioHQ Refresh Management Importance of Cache Creation Ordering • Not all updating orderings are equal • Want to order updates based on “Refresh Graph” and dependencies • Multiple orders possible, cost against each other to minimize update cost Freshness Management • Underlying data may change • User Should define refresh frequency • Separately Define Absolute TTL Physical dataset 1H refresh 3H expiration Raw Reflection Aggregate Reflection
  • 34. © 2017 Dremio Corporation @DremioHQ Multiple Update Modes (Depending on Mutation Pattern) • Full: Always rebuild reflections from scratch (highly mutating) • Incremental (files): Incrementally builds reflections based on new files and folders (append-only) • Incremental (rowstores): Incrementally builds reflections based on monotonically increasing field (append-only) • Partitioned Refresh: Maintains reflections based on source partitions (e.g. Filesystem directories, Hive partitions). (partially mutating)
  • 35. © 2017 Dremio Corporation @DremioHQ Closing Words
  • 36. © 2017 Dremio Corporation @DremioHQ What We’ve Seen Using these Techniques • Frequent 10x-100x+ performance improvements in multiple workloads • Vast reduction in resources required to achieve performance levels • In many cases, a reduction in disk space – Due to avoidance of excessive unused or rarely used physical copies
  • 37. © 2017 Dremio Corporation @DremioHQ Find out More and Get Involved • Drop by my office hours (East Room Lounge - now) • Drop by the Dremio table behind you • Join us at @ApacheArrow meetup at @enigma_data Midtown – Wes Mckinney, creator of Pandas and myself, tech deep dive • Join the Dremio community (Relational Caching) – github.com/dremio/dremio-oss (Apache Licensed) – dremio.com – community.dremio.com • Find out more about the Building Blocks – dev@[arrow|calcite|parquet].apache.org – http://github.com/apache/[arrow|calcite|parquet-mr] – http://[arrow|calcite|parquet].apache.org • Follow @DremioHQ, @intjesus, @ApacheArrow, @ApacheCalcite, @ApacheParquet