Uwe L. Korn
PyData Paris 14th June 2016
How Apache Arrow and Parquet
boost cross-language interop
About me
• Data Scientist at Blue Yonder (@BlueYonderTech)
• We optimize Replenishment and Pricing for the Retail
industry with Predictive Analytics
• Contributor to Apache {Arrow, Parquet}
• Work in Python, Cython, C++11 and SQL
Agenda
The Problem
Arrow
Parquet
Outlook
Why is columnar better?
Image source: https://arrow.apache.org/img/simd.png ( https://arrow.apache.org/ )
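As a rough illustration of the question above, the following sketch stores the same three records once row-wise and once column-wise, using only the Python standard library. The field names and values are hypothetical and the code does not use Arrow itself; it only shows that an aggregation over one field touches a single contiguous buffer in the columnar case.

```python
import struct
from array import array

# Three hypothetical records: (session_id, timestamp, price)
rows = [(1, 1000, 9.5), (2, 1001, 12.0), (3, 1002, 7.25)]

# Row-wise layout: the fields of one record sit next to each other.
ROW_FORMAT = "<qqd"  # int64, int64, float64
row_buffer = b"".join(struct.pack(ROW_FORMAT, *r) for r in rows)

# Columnar layout: all values of one field sit next to each other.
session_ids = array("q", (r[0] for r in rows))
timestamps = array("q", (r[1] for r in rows))
prices = array("d", (r[2] for r in rows))


def total_price_rowwise(buf):
    """Sum the price field; every record has to be visited and unpacked."""
    return sum(price for _, _, price in struct.iter_unpack(ROW_FORMAT, buf))


def total_price_columnar(col):
    """Sum the price field; only one contiguous buffer is read."""
    return sum(col)
```

Both functions return the same total, but the columnar version never looks at session_id or timestamp, which is the access pattern that SIMD instructions and CPU caches favour.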
Different Systems - Varying
Python Support
• Various levels of Python Support
• Built in Python
• Python API
• No Python at all
• Each tool/algorithm works on
columnar data
• Separate conversion routines for
each pair
• causes overhead
• there’s no one-size-fits-all solution
Image source: https://arrow.apache.org/img/copy2.png ( https://arrow.apache.org/ )
Apache Arrow
• Specification for in-memory
columnar data layout
• No overhead for cross-system /
cross-language communication
• Designed for efficiency (exploit
SIMD, cache locality, ..)
• Supports nested data structures
Image source: https://arrow.apache.org/img/shared2.png ( https://arrow.apache.org/ )
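To make the layout idea more concrete, here is a minimal sketch of two Arrow-style arrays built with the standard library only: a nullable int32 array (validity bitmap plus values buffer) and a nested list of int32 (offsets buffer plus flat child values). The function names are hypothetical, padding and alignment rules of the real specification are left out, and array("i") is assumed to be 32 bits wide, as it is on common platforms.

```python
from array import array


def build_int32_array(values):
    """Return (validity_bitmap, values_buffer) for a list that may hold None."""
    bitmap = bytearray((len(values) + 7) // 8)
    data = array("i")
    for i, v in enumerate(values):
        if v is None:
            data.append(0)  # the slot is still allocated; its content is unspecified
        else:
            bitmap[i // 8] |= 1 << (i % 8)  # least-significant bit first
            data.append(v)
    return bytes(bitmap), data


def get_value(bitmap, data, i):
    """Read slot i, honouring the validity bitmap."""
    is_valid = (bitmap[i // 8] >> (i % 8)) & 1
    return data[i] if is_valid else None


def build_list_array(lists):
    """Nested data: element i spans child[offsets[i]:offsets[i + 1]]."""
    offsets = array("i", [0])
    child = array("i")
    for item in lists:
        child.extend(item)
        offsets.append(len(child))
    return offsets, child
```

Because every buffer is a flat, typed block of memory, a consumer written in another language can read it in place, given the same layout rules.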
Apache Arrow - The Impact
• An example: Retrieve a dataset from an MPP database
and analyze it in Pandas
• Run a query in the DB
• Pass it in columnar form to the DB driver
• The ODBC layer transforms it into row-wise form
• Pandas makes it columnar again
• Ugly real-life solution: export as CSV, bypass ODBC
• In the future: use Arrow as the interface between the DB and
Pandas
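The round trip described on this slide can be sketched in a few lines. The two helpers below are hypothetical stand-ins for what a row-oriented driver layer and a data frame constructor do conceptually; they are not taken from any real ODBC driver or from Pandas.

```python
def columns_to_rows(columns):
    """Driver side: pivot columnar results into row tuples (one copy)."""
    names = list(columns)
    rows = list(zip(*(columns[name] for name in names)))
    return names, rows


def rows_to_columns(names, rows):
    """Client side: pivot the row tuples back into columns (a second copy)."""
    if not rows:
        return {name: [] for name in names}
    return {name: list(col) for name, col in zip(names, zip(*rows))}


# Hypothetical query result as the database holds it internally.
db_result = {"store": [1, 2, 3], "sales": [10.0, 12.5, 7.0]}

names, rows = columns_to_rows(db_result)
frame = rows_to_columns(names, rows)
```

The data ends up in the same shape it started in, after being copied and pivoted twice. With a shared columnar format, both steps would be unnecessary.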
Apache Arrow
• Top-level Apache project from the beginning
• Not only a specification: also includes C++ / Java /
Python / .. code.
• Arrow structures / classes
• RPC (upcoming) & IPC (alpha) support
• Conversion code for Parquet, Pandas, ..
• Combined effort from developers of more than 13 major OSS
projects
• Impala, Kudu, Spark, Cassandra, Drill, Pandas, R, ..
• Spec: https://github.com/apache/arrow/blob/master/format/Layout.md
Arrow in Action: Feather
• Language-agnostic file format for
binary data frame storage
• Read performance close to raw
disk I/O
• by Wes McKinney (Python) and
Hadley Wickham (R)
• Julia Support in progress
[Figure: Feather file layout, consisting of Arrow arrays and Feather metadata (flatbuffers)]
Apache Parquet
Apache Parquet
• Binary file format for nested columnar data
• Inspired by the Google Dremel paper
• space and query efficient
• multiple encodings
• predicate pushdown
• column-wise compression
• many tools use Parquet as the default input format
• very popular in the JVM/Hadoop-based world
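Why encodings pay off on columnar data can be shown with a small sketch of dictionary encoding followed by run-length encoding. This is a simplified illustration with hypothetical function names; Parquet's actual encodings (for example its hybrid of run-length encoding and bit-packing) are more elaborate.

```python
from itertools import groupby


def dictionary_encode(values):
    """Replace each value by a small integer index into a dictionary."""
    dictionary = []
    index_of = {}
    indices = []
    for v in values:
        if v not in index_of:
            index_of[v] = len(dictionary)
            dictionary.append(v)
        indices.append(index_of[v])
    return dictionary, indices


def run_length_encode(values):
    """Collapse consecutive repeats into (value, run_length) pairs."""
    return [(v, len(list(run))) for v, run in groupby(values)]


def decode(dictionary, runs):
    """Invert both steps to recover the original column."""
    return [dictionary[index] for index, count in runs for _ in range(count)]
```

A column such as a country code, where neighbouring values repeat often, shrinks to a short dictionary and a handful of runs. The same values interleaved with other fields in a row-wise layout would not form such runs.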
The Basics
• 1 File, includes metadata
• Several row groups
• all with the same number of column chunks
• n pages per column chunk
• Benefits:
• pre-partitioned for fast distributed access
• statistics in the metadata for predicate pushdown
Blog post by Julien Le Dem: https://blog.twitter.com/2013/dremel-made-simple-with-parquet
[Figure: nesting of File > Row Group > Column Chunk > Page]
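The benefit of statistics in the metadata can be sketched as follows. The RowGroup class and both functions are hypothetical and model a single integer column only; a real Parquet reader keeps such statistics per column chunk in the file metadata.

```python
from dataclasses import dataclass


@dataclass
class RowGroup:
    values: list
    min_value: int
    max_value: int


def write_row_groups(values, rows_per_group):
    """Split a column into row groups and record min/max for each one."""
    groups = []
    for start in range(0, len(values), rows_per_group):
        chunk = values[start:start + rows_per_group]
        groups.append(RowGroup(chunk, min(chunk), max(chunk)))
    return groups


def scan_greater_than(groups, threshold):
    """Return matching values and the number of row groups actually read."""
    matches = []
    groups_read = 0
    for group in groups:
        if group.max_value <= threshold:
            continue  # the statistics prove that no row can match
        groups_read += 1
        matches.extend(v for v in group.values if v > threshold)
    return matches, groups_read
```

For a filter like "value > 7" over sorted data, most row groups are skipped by looking at the metadata alone, without decoding any pages.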
Using Parquet in Python
• You can already use it today from Python:
• sqlContext.read.parquet("..").toPandas()
• Needs to pass through Spark, very slow
• Native Python support on its way:
• Parquet I/O to Arrow
• Arrow provides NumPy conversion
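The native route outlined above is available today in the pyarrow package. The following is a minimal sketch, assuming pyarrow and pandas are installed; the helper name read_parquet_as_pandas is hypothetical, while read_table and to_pandas are pyarrow calls.

```python
def read_parquet_as_pandas(path, columns=None):
    """Read a Parquet file into a pandas DataFrame via an Arrow table."""
    # Imported here so that this sketch can be loaded without pyarrow installed.
    import pyarrow.parquet as pq

    table = pq.read_table(path, columns=columns)  # Parquet -> Arrow
    return table.to_pandas()                      # Arrow -> pandas / NumPy
```

Passing a list of column names reads only those column chunks from disk, which is one of the practical advantages of the columnar file layout.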
State of Arrow & Parquet
Arrow
in-memory spec for columnar data
• Java (beta)
• C++ (in progress)
• Python (in progress)
• Planned:
• Julia
• R
Parquet
columnar on-disk storage
• Java (mature)
• C++ (in progress)
• Python (in progress)
• Planned:
• Julia
• R
Upcoming
• Parquet <-Arrow-> Pandas
• IPC on its way
• alpha implementation using memory mapped files
• JVM <-> native with shared reference counting
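The idea behind IPC over memory-mapped files can be sketched with the standard library. This is not the Arrow IPC format: the file below holds nothing but a raw int32 buffer, the function names are hypothetical, and both sides are assumed to run on the same machine with the same integer width and byte order.

```python
import mmap
from array import array


def write_column(path, values):
    """Producer: dump the raw int32 buffer to a file, with no serialization step."""
    with open(path, "wb") as f:
        array("i", values).tofile(f)


def sum_column_mmap(path):
    """Consumer: map the file and read the values in place.

    The file must not be empty, because an empty file cannot be mapped.
    """
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            # Reinterpret the mapped bytes as int32 without copying them.
            with memoryview(mapped) as raw, raw.cast("i") as view:
                return sum(view)
```

The consumer never parses or copies the data; it only reinterprets the mapped bytes. That works because producer and consumer agree on the memory layout, which is what a shared in-memory specification provides.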
Get Involved!
• dev@arrow.apache.org & dev@parquet.apache.org
• https://apachearrowslackin.herokuapp.com/
• https://arrow.apache.org/
• https://parquet.apache.org/
• @ApacheArrow & @ApacheParquet
Questions?!

