What’s new in pandas and the SciPy stack for financial users
Wes McKinney
Me
•   AQR: August 2007 - July 2010

•   Duke Statistics: 2010 - present (now on leave)

•   My plans

    •   Improving Python libs for statistics and finance

    •   Building a financial software + consulting business
        based on said tools
Core Python stack for finance
• NumPy, SciPy (heavy lifting)
• pandas (data handling / computation)
• IPython (dev and research env)
• Cython (perf optimization)
• matplotlib (visualization)
• statsmodels (statistics / econometrics)
General sentiments
•   Scientific Python growing solidly in finance and
    in many other fields

    •   Though good sci-pythonistas are still scarce

•   Important work happening in many of the core
    projects

•   Growing consensus: a new computational
    model is needed to better cope with “big data”
NumPy
• Significantly refactored C internals
• Great progress on native datetime64 type
 • Will significantly improve date-handling
    performance and usability
 • Extensible business day / holiday logic
    planned / in progress
• Addition of low-level missing data (NA)
  support in the works
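
As a rough illustration of the datetime64 and business-day points above, here is a minimal sketch using the API as it appears in later NumPy releases. The slide describes this work as still in progress, so treat the calls as an assumption about where it landed. The dates and the single holiday are made up for the example.

```python
import numpy as np

start = np.datetime64("2011-07-01")   # a Friday
end = np.datetime64("2011-07-15")

# Daily datetime64 array covering [start, end).
days = np.arange(start, end)

# Subtracting two datetime64 values gives a timedelta64.
elapsed = end - start

# Mon-Fri business days in [start, end), with one holiday passed explicitly.
n_bdays = np.busday_count(start, end, holidays=["2011-07-04"])

print(len(days), elapsed, n_bdays)
```

The holiday list is plain data passed by the caller, which is one way to read the "extensible" holiday logic mentioned on the slide.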
IPython

• One of Python’s killer apps gets even better
• Rich Qt GUI console with inline plotting
• New and improved architecture for high perf
  parallel / distributed computing
• See Fernando Pérez’s SciPy 2011 talk / video
Cython
• Still the first tool you should reach for to get
  better performance
• New: OpenMP integration (for multi-core)
     from cython.parallel import prange
     with nogil:
         for i in prange(n):
             # do something in parallel

• Supports (almost) all of standard Python now
  (some things, like closures, used to not work)
statsmodels
•   Statistics and econometrics in Python

•   Major work in time series models over last year+

    •   VAR, SVAR models, eventually (V)ECM models
        for cointegrated time series

    •   AR/ARMA, Kalman Filter, various macro filters
        (e.g. Hodrick-Prescott) implemented

    •   Soon: Bayesian state space models (DLMs),
        ARCH/GARCH models, etc.
statsmodels
• Major criticism: weak user interface
 • No R-style formula framework
 • pandas not integrated (need to pass raw
    NumPy arrays)
• I have begun work on pandas integration;
  formulas have been implemented and will
  hopefully arrive within the next few months
pandas
• Still the Python data hacker’s best friend?
• Most recent release: 0.3.0 on 2/20/2011
• However, last 4 months have been the most
  active development period in the library’s
  history
• ~375 commits since 0.3.0 release (more than
  the entire prior open source history)
The state of data structures
Ambitious big picture

• I want to make pandas the cornerstone of the
  “next generation” statistical computing
  environment
• Ease-of-use, performance, flexibility all equally
  important
Ambitious big picture

• Taking the best features of other languages (R
  and friends) and making them better and
  easier to use
• See my recent blog article “A Roadmap for
  Rich Scientific Data Structures in Python”
pandas: under the hood
• Complete redesign of DataFrame internals
 • Now a single class for 2D data retaining
     optimal performance of old DataFrame and
     DataMatrix classes
 •   Significantly improved mixed-type and missing
     data handling
 •   Plan to use internal data structure to
     implement “NDFrame” for n-dimensional data
Fancy indexing
• Index a Series / DataFrame in a matrix-like
  way via special .ix attribute, use:
  • Slices with integers or labels
  • Lists of integers, labels, or boolean vecs
  • Integer or label locations
  df.ix[0]
  df.ix[date1:date2]
  df.ix[:5, 'A':'F']

  df.ix[df['A'] > 0, ['B', 'C', 'D']] = nan
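
The .ix attribute shown on this slide was deprecated and later removed from pandas. The sketch below reproduces the same four selections with the .loc (label) and .iloc (integer position) indexers from later releases. The frame and its values are hypothetical, chosen only so the result is easy to check.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 8 business days, columns A..F.
dates = pd.date_range("2011-01-03", periods=8, freq="B")
df = pd.DataFrame(
    np.arange(48, dtype=float).reshape(8, 6) - 20.0,
    index=dates,
    columns=list("ABCDEF"),
)

first_row = df.iloc[0]                  # integer location
window = df.loc[dates[1]:dates[3]]      # label slice, both endpoints included
block = df.iloc[:5].loc[:, "A":"F"]     # integer rows, then label columns

# Boolean row mask plus a list of column labels, assigned in place.
df.loc[df["A"] > 0, ["B", "C", "D"]] = np.nan

print(df)
```

Splitting label-based and position-based access into two indexers removes the ambiguity .ix had when the index itself contained integers.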
Misc new features
• “Sparse” (mostly NA) versions of Series,
  DataFrame, WidePanel
• Many new functions on Series/DataFrame
 • describe, quantile, select, drop, dropna,
    corrwith, ...
• New moving window methods: rolling_quantile
  and rolling_apply
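
A small sketch of the moving-window operations named above. It assumes a later pandas release, where rolling_quantile and rolling_apply are spelled as methods on the object returned by .rolling(...) rather than as top-level functions. The input series is made up.

```python
import pandas as pd

s = pd.Series([1.0, 3.0, 2.0, 5.0, 4.0])

# Rolling median over a 3-observation window (quantile 0.5).
rolling_median = s.rolling(window=3).quantile(0.5)

# Arbitrary function over each window: here the window's range (max - min).
rolling_range = s.rolling(window=3).apply(lambda w: w.max() - w.min(), raw=True)

# describe() is one of the new summary methods listed on the slide.
summary = s.describe()

print(rolling_median.tolist())
print(rolling_range.tolist())
```

The first two entries of each rolling result are NaN because a full window of three observations is not yet available.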
Improved IO
• read_csv, read_table functions more
  flexible and robust, better type inference

 df = read_table('foo.txt', skiprows=[0,1],
                 na_values=['#N/A'])


• ExcelFile class for reading multiple sheets
  out of .xls files
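
To make the read_table call above concrete, here is a self-contained sketch that reads from an in-memory string instead of foo.txt. The file contents are invented for illustration: two junk header lines followed by tab-separated data with one '#N/A' cell.

```python
import io
import pandas as pd

# Hypothetical file contents standing in for foo.txt.
raw = (
    "exported by some vendor tool\n"
    "generated 2011-07-01\n"
    "ticker\tprice\tvolume\n"
    "AAA\t10.5\t100\n"
    "BBB\t#N/A\t250\n"
)

# Skip the two junk lines and treat '#N/A' as missing.
df = pd.read_table(io.StringIO(raw), skiprows=[0, 1], na_values=["#N/A"])

print(df)
print(df.dtypes)
```

The price column comes back as floating point with a missing value, and volume as integers, which is the type inference the slide refers to.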
Improved IO
• HDFStore class provides a complete, tested
  dict-like PyTables storage container
       store = HDFStore('mydata.h5')
       store['x'] = x
       store['y'] = y
       y = store['y']

• Experimental: store as Table and query
      store.put('df', df, table=True)
      piece = store.select('df',
          [{'field' : 'index', 'op' : '>=',
            'value' : date}])
Group by enhancements
• Can group by multiple columns or key
  functions, SQL-like but more general
• Syntactic sugar to invoke aggregation
  functions on groups
• Automatic exclusion of “nuisance”
  columns of DataFrames
• Various other usability enhancements
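
A minimal sketch of grouping by multiple columns and by a key function, using a hypothetical trade blotter. It selects the numeric column explicitly rather than relying on automatic exclusion of "nuisance" columns, because newer pandas releases are stricter about non-numeric columns in aggregations.

```python
import pandas as pd

# Hypothetical trade blotter indexed by trade date.
trades = pd.DataFrame(
    {
        "sector": ["tech", "tech", "energy", "energy", "tech"],
        "side": ["buy", "sell", "buy", "buy", "buy"],
        "qty": [100, 50, 200, 300, 25],
        "note": ["a", "b", "c", "d", "e"],  # non-numeric "nuisance" column
    },
    index=pd.to_datetime(
        ["2011-01-03", "2011-01-04", "2011-02-01", "2011-02-02", "2011-02-03"]
    ),
)

# Group by two columns, then aggregate one numeric column.
by_cols = trades.groupby(["sector", "side"])["qty"].sum()

# Group by a key function applied to each index label (here: month number).
by_month = trades.groupby(lambda ts: ts.month)["qty"].sum()

print(by_cols)
print(by_month)
```

The key function receives each index label in turn, so any computation on the label can define a group, which is what makes this more general than a SQL GROUP BY on stored columns.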
Very soon: hierarchical indexing

• Enable axis ticks to be identified by multiple
  labels instead of a single label
• Easily select subsets of data by “level”
• Create Excel-style pivot tables / cross-
  tabulations in a sensible way
• Will integrate naturally with groupby
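
The slide describes hierarchical indexing as upcoming. The sketch below shows the same ideas with the MultiIndex API from later pandas releases, which is an assumption about how the feature ended up. The level names and return values are hypothetical.

```python
import pandas as pd

# Two-level index: each observation is labeled by (month, ticker).
idx = pd.MultiIndex.from_product(
    [["2011-01", "2011-02"], ["AAA", "BBB"]], names=["month", "ticker"]
)
returns = pd.Series([0.01, -0.02, 0.03, 0.04], index=idx)

# Select a subset of the data by level.
aaa = returns.xs("AAA", level="ticker")

# Pivot the inner level into columns: a month-by-ticker table.
table = returns.unstack("ticker")

# Aggregate by level through groupby.
n_by_month = returns.groupby(level="month").count()

print(table)
```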
Other misc things

• Flexible binary operators
 • a.add(b, fill_value=0.)
• Some timezone support in DateRange
• Numerous performance optimizations
• See the (long) release notes =)
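
To show what the fill_value argument in a.add(b, fill_value=0.) changes, here is a short sketch with two made-up Series whose labels only partly overlap.

```python
import pandas as pd

a = pd.Series([1.0, 2.0, 3.0], index=["x", "y", "z"])
b = pd.Series([10.0, 20.0], index=["y", "w"])

# Plain addition aligns on labels; a label missing on either side gives NaN.
plain = a + b

# With fill_value, the missing side is treated as 0 before adding.
filled = a.add(b, fill_value=0.0)

print(plain)
print(filled)
```

Only the shared label "y" survives plain addition, while the filled version keeps every label from both inputs.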
Planned work

• Fast time series up/downsampling
• Improved support and perf for HF/tick data
• Even more sophisticated group by tools
• Better documentation, online screencast
  tutorials / examples
Thanks

• Email: wesmckinn@gmail.com
• Twitter: @wesmckinn
• Blog: http://blog.wesmckinney.com
• pandas: http://github.com/wesm/pandas
• statsmodels: http://statsmodels.sourceforge.net
