pd.{read/to}_sql is simple but
not fast
Uwe Korn – QuantCo – November 2020
About me
• Engineering at QuantCo

• Apache {Arrow, Parquet} PMC

• Turbodbc Maintainer

• Other OSS stuff
@xhochy
@xhochy
mail@uwekorn.com
https://uwekorn.com
Our setting
• We like tabular data

• Thus we use pandas

• We want large amounts of this data in pandas
• The traditional storage for it is SQL databases

• How do we get from one to another?
SQL
• Very very brief intro:

• „domain-specific language for accessing data held in a relational
database management system“

• The one language in data systems that predates the Python, R,
Julia, … we use as our „main“ language, and it has a much wider
user base

• SELECT * FROM table

INSERT INTO table
• Two main arguments:

• sql: SQL query to be executed or a table name.

• con: SQLAlchemy connectable, str, or sqlite3 connection
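A minimal sketch of the read path against an in-memory SQLite database; the table and values are made up for illustration:

```python
import sqlite3

import pandas as pd

# In-memory SQLite database with a tiny example table (hypothetical data).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE trades (id INTEGER, price REAL)")
con.executemany("INSERT INTO trades VALUES (?, ?)", [(1, 9.99), (2, 12.50)])

# sql: a query string (or a table name), con: here a raw sqlite3 connection.
df = pd.read_sql("SELECT * FROM trades", con)
print(df.shape)  # (2, 2)
```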
• Two main arguments:

• name: Name of SQL table.

• con: SQLAlchemy connectable, str, or sqlite3 connection
• Let’s look at the other nice bits („additional arguments“)

• if_exists: „What should we do when the target already exists?“

• fail

• replace

• append
• index: „What should we do with this one magical column?“ (bool)

• index_label

• chunksize: „Write less data at once“

• dtype: „Which SQL type should each column get?“ (dict/scalar)

• method: „Supply some magic insertion hook“ (callable)
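Putting the main and additional arguments together, a small sketch of the write path against in-memory SQLite (table name and values are made up):

```python
import sqlite3

import pandas as pd

df = pd.DataFrame({"id": [1, 2], "price": [9.99, 12.50]})
con = sqlite3.connect(":memory:")

# name + con are the two main arguments; the rest tune the behaviour.
df.to_sql("trades", con, if_exists="replace", index=False, chunksize=1000)
# A second call with if_exists="append" adds to the existing table.
df.to_sql("trades", con, if_exists="append", index=False)

print(con.execute("SELECT COUNT(*) FROM trades").fetchone()[0])  # 4
```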
SQLAlchemy
• SQLAlchemy is a Python SQL toolkit and Object Relational Mapper
(ORM)

• We only use the toolkit part for:

• Metadata about schema and tables (incl. creation)

• Engine for connecting to various databases using a uniform
interface
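A sketch of the toolkit part pandas relies on, assuming SQLAlchemy 1.4+ and an in-memory SQLite engine (the table name is illustrative):

```python
import pandas as pd
import sqlalchemy as sa

# One uniform Engine interface; the URL scheme selects the backend.
engine = sa.create_engine("sqlite://")

pd.DataFrame({"x": [1, 2, 3]}).to_sql("t", engine, index=False)

# The same metadata/reflection machinery pandas uses for schema handling:
meta = sa.MetaData()
table = sa.Table("t", meta, autoload_with=engine)
print([c.name for c in table.columns])  # ['x']
```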
Under the bonnet
pandas.(to/from)_sql is simple but not fast
How does it work (read_sql)?
• pandas.read_sql [1] calls SQLDatabase.read_query [2]

• This then does



• Depending on whether a chunksize was given, this fetches all or
parts of the result
[1] https://github.com/pandas-dev/pandas/blob/d9fff2792bf16178d4e450fe7384244e50635733/pandas/io/sql.py#L509-L516
[2] https://github.com/pandas-dev/pandas/blob/d9fff2792bf16178d4e450fe7384244e50635733/pandas/io/sql.py#L1243
How does it work (read_sql)?
• Passes in the data into the from_records constructor


• Optionally parses dates and sets an index
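The two steps above can be sketched by hand against sqlite3; this is a simplified approximation of pandas' internal code path, not the actual implementation:

```python
import sqlite3

import pandas as pd

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (a INTEGER, b TEXT)")
con.executemany("INSERT INTO t VALUES (?, ?)", [(1, "x"), (2, "y")])

# Step 1: execute the query and fetch everything as Python row tuples.
cursor = con.execute("SELECT * FROM t")
columns = [desc[0] for desc in cursor.description]
data = cursor.fetchall()  # list of Python tuples, one per row

# Step 2: hand the records to the from_records constructor.
df = pd.DataFrame.from_records(data, columns=columns)
print(df["a"].tolist())  # [1, 2]
```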
How does it work (to_sql)?
• This is trickier, as we modify the database.

• to_sql [1] may need to create the target

• If the table does not exist, it will issue a CREATE TABLE [2]

• Afterwards, we INSERT [3] into the (new) table

• The insertion step is where we convert from DataFrame back into
records [4]



[1] https://github.com/pandas-dev/pandas/blob/d9fff2792bf16178d4e450fe7384244e50635733/pandas/io/sql.py#L1320
[2] https://github.com/pandas-dev/pandas/blob/d9fff2792bf16178d4e450fe7384244e50635733/pandas/io/sql.py#L1383-L1393
[3] https://github.com/pandas-dev/pandas/blob/d9fff2792bf16178d4e450fe7384244e50635733/pandas/io/sql.py#L1398
[4] https://github.com/pandas-dev/pandas/blob/d9fff2792bf16178d4e450fe7384244e50635733/pandas/io/sql.py#L734-L747
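Roughly, the insertion step boils down to turning the DataFrame back into Python row tuples and handing them to executemany; a simplified sqlite3 sketch (not pandas' actual code):

```python
import sqlite3

import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": ["x", "y"]})
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (a INTEGER, b TEXT)")  # the CREATE TABLE step

# The insertion step: DataFrame -> Python row tuples -> executemany.
rows = list(df.itertuples(index=False, name=None))
con.executemany("INSERT INTO t VALUES (?, ?)", rows)
print(con.execute("SELECT COUNT(*) FROM t").fetchone()[0])  # 2
```

Every value passes through a Python object on the way in, which is exactly where the overhead comes from.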
Why is it slow?
No benchmarks yet, theory first.

Why is it slow?
Thanks
Slides will come after PyData Global

Follow me on Twitter: @xhochy
How to get fast?
ODBC
• Open Database Connectivity (ODBC) is a standard API for accessing
databases

• Most databases provide an ODBC interface, some of them are
efficient

• Two popular Python libraries for that:

• https://github.com/mkleehammer/pyodbc

• https://github.com/blue-yonder/turbodbc
ODBC
Turbodbc has support for Apache Arrow: https://arrow.apache.org/
blog/2017/06/16/turbodbc-arrow/
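A hedged sketch of that route; the DSN name is a placeholder, and running it requires the turbodbc package plus a configured ODBC data source:

```python
def fetch_with_turbodbc(dsn: str, query: str):
    """Fetch a query result as a pandas.DataFrame via turbodbc + Arrow.

    ``dsn`` is a placeholder for a configured ODBC data source name.
    """
    import turbodbc  # imported lazily: needs turbodbc + an ODBC driver

    con = turbodbc.connect(dsn=dsn)
    cursor = con.cursor()
    cursor.execute(query)
    table = cursor.fetchallarrow()  # result as a pyarrow.Table, no Python rows
    return table.to_pandas()        # fast Arrow -> pandas conversion
```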
ODBC
• With turbodbc + Arrow we get the following performance
improvements:

• 3-4x for MS SQL, see https://youtu.be/B-uj8EDcjLY?t=1208

• 3-4x speedup for Exasol, see https://youtu.be/B-uj8EDcjLY?t=1390
Snowflake
• Turbodbc is a solution that retrofits performance

• Snowflake drivers already come with built-in speed

• Default response is JSON-based, BUT:

• The database server can answer directly with Arrow

• Client only needs the Arrow->pandas conversion (lightning fast⚡)

• Up to 10x faster, see https://www.snowflake.com/blog/fetching-
query-results-from-snowflake-just-got-a-lot-faster-with-apache-
arrow/
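A sketch of that fast path, assuming snowflake-connector-python with the pandas extras installed; the credentials are placeholders:

```python
def fetch_with_snowflake(query: str, **credentials):
    """Fetch a query result as pandas via the connector's Arrow path.

    ``credentials`` (account, user, password, ...) are placeholders here.
    """
    import snowflake.connector  # needs snowflake-connector-python[pandas]

    con = snowflake.connector.connect(**credentials)
    cur = con.cursor()
    cur.execute(query)
    # The server answers with Arrow; this converts directly to pandas:
    return cur.fetch_pandas_all()
```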
JDBC
• Blogged about this at: https://uwekorn.com/2019/11/17/fast-jdbc-
access-in-python-using-pyarrow-jvm.html

• Not yet as convenient, and read-only

• First, you need all your Java dependencies, incl. arrow-jdbc, on
your classpath

• Start the JVM, load the driver, and set up Arrow Java
JDBC
• Then:

• Fetch result using the Arrow Java JDBC adapter

• Use pyarrow.jvm to get a Python reference to the JVM memory

• Convert to pandas: a 136x speedup!
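These steps can be sketched as follows. Assumptions: jpype, pyarrow, and a JVM classpath containing the JDBC driver plus arrow-jdbc are already set up; API names follow the blog post linked above:

```python
def fetch_with_jdbc(jdbc_url: str, query: str):
    """Sketch of the pyarrow.jvm + arrow-jdbc route (read-only)."""
    import jpype
    import jpype.imports
    import pyarrow.jvm

    jpype.startJVM()  # classpath must already include arrow-jdbc + driver
    from java.sql import DriverManager
    from org.apache.arrow.adapter.jdbc import JdbcToArrow
    from org.apache.arrow.memory import RootAllocator

    con = DriverManager.getConnection(jdbc_url)
    # Fetch the result set into Arrow Java memory ...
    root = JdbcToArrow.sqlToArrow(con, query, RootAllocator())
    # ... then view that JVM memory from Python without copying:
    batch = pyarrow.jvm.record_batch(root)
    return batch.to_pandas()
```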
Postgres
Not yet open-sourced, but this is how it works:
How do we get this
into pandas.read_sql?
API troubles
• pandas’ simple API: 



• turbodbc

API troubles
• pandas’ simple API: 



• Snowflake

API troubles
• pandas’ simple API: 



• pyarrow.jvm + JDBC

Building a better API
• We want to use pandas’ simple API but with the nice performance
benefits

• One idea: Dispatching based on the connection class



• User doesn’t need to learn a new API

• Performance improvements come via optional packages
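One way that dispatch could look, sketched with functools.singledispatch; all names here are hypothetical, not the actual pandas proposal:

```python
import sqlite3
from functools import singledispatch

import pandas as pd


# singledispatch picks an implementation based on the type of the first
# argument, so internally we dispatch on the connection object.
@singledispatch
def _read_sql(con, query):
    return pd.read_sql(query, con)  # generic fallback: today's code path


class FastConnection:  # stand-in for e.g. a turbodbc/Snowflake connection
    pass


# An optional package could register a faster path for its connection type:
@_read_sql.register(FastConnection)
def _(con, query):
    return "fast Arrow-based path"  # placeholder for the optimized reader


def read_sql(query, con):
    """User-facing API stays exactly the same as pandas today."""
    return _read_sql(con, query)


con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (a INTEGER)")
con.execute("INSERT INTO t VALUES (1)")
print(read_sql("SELECT * FROM t", con)["a"].tolist())  # [1]
print(read_sql("ignored", FastConnection()))           # fast Arrow-based path
```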

Building a better API
Alternative idea:
Building a better API
Discussion in https://github.com/pandas-dev/pandas/issues/36893
Thanks
Follow me on Twitter: @xhochy
