Time Series Analysis in Python with statsmodels

                   Wes McKinney (1), Josef Perktold (2), Skipper Seabold (3)

                   (1) Department of Statistical Science, Duke University
                   (2) Department of Economics, University of North Carolina at Chapel Hill
                   (3) Department of Economics, American University


                       10th Python in Science Conference, 13 July 2011



McKinney, Perktold, Seabold (statsmodels)        Python Time Series Analysis          SciPy Conference 2011   1 / 29
What is statsmodels?




          A library for statistical modeling, implementing standard statistical
          models in Python using NumPy and SciPy
          Includes:
                  Linear (regression) models of many forms
                  Descriptive statistics
                  Statistical tests
                  Time series analysis
                  ...and much more




What is Time Series Analysis?




          Statistical modeling of time-ordered data observations
          Inferring structure, forecasting and simulation, and testing
          distributional assumptions about the data
          Modeling dynamic relationships among multiple time series
          Broad applications e.g. in economics, finance, neuroscience, signal
          processing...




Talk Overview



          Brief update on statsmodels development
          Aside: user interface and data structures
          Descriptive statistics and tests
          Auto-regressive moving average models (ARMA)
          Vector autoregression (VAR) models
          Filtering tools (Hodrick-Prescott and others)
          Near future: Bayesian dynamic linear models (DLMs), ARCH /
          GARCH volatility models and beyond




Statsmodels development update



          We’re now on GitHub! Join us:

                         http://github.com/statsmodels/statsmodels

          Check out the slick Sphinx docs:

                                http://statsmodels.sourceforge.net

          Development focus has been largely computational, i.e. writing
          correct, tested implementations of all the common classes of
          statistical models




Statsmodels development update




          Major work to be done on providing a nice integrated user interface
          We must work together to close the gap between R and Python!
          Some important areas:
                  Formula framework, for specifying model design matrices
                  Need integrated rich statistical data structures (pandas)
                  Data visualization of results should always be a few keystrokes away
                  Write a “Statsmodels for R users” guide




Aside: statistical data structures and user interface



          While I have a captive audience...
          Controversial fact: pandas is the only Python library currently
          providing data structures matching (and in many places exceeding)
          the richness of R’s data structures (for statistics)
                  Let’s have a BoF session so I can justify this statement
          Feedback I hear is that end users find the fragmented, incohesive set
          of Python tools for data analysis and statistics to be confusing,
          frustrating, and certainly not compelling them to use Python...
                  (Not to mention the packaging headaches)




Aside: statistical data structures and user interface




          We need to “commit” ASAP (not 12 months from now) to a high
          level data structure(s) as the “primary data structure(s) for statistical
          data analysis” and communicate that clearly to end users
                  Or we might as well all start programming in R...




Example data: EEG trace data


          [Figure: EEG trace time series]




Example data: Macroeconomic data


          [Figure: time series of cpi, m1, and realgdp (three panels), 1960 to 2008]




Example data: Stock data


          [Figure: stock prices for AAPL, GOOG, MSFT, and YHOO, 2001 to 2009]




Descriptive statistics
            Autocorrelation, partial autocorrelation plots
            Commonly used for identification in ARMA(p,q) and ARIMA(p,d,q)
            models
            acf = tsa.acf(eeg, 50)
            pacf = tsa.pacf(eeg, 50)

          [Figure: autocorrelation (left) and partial autocorrelation (right) of the EEG series, lags 0 to 50]
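
To make the two quantities concrete, here is a minimal NumPy sketch of the sample autocorrelation and of the partial autocorrelation obtained from it by the Durbin-Levinson recursion. The function names `sample_acf` and `sample_pacf` and the simulated AR(1) series are illustrative assumptions, not the statsmodels implementation, which may use different estimators by default.

```python
import numpy as np

def sample_acf(x, nlags):
    """Sample autocorrelations r_0..r_nlags (biased estimator: same denominator at every lag)."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    n = len(x)
    denom = np.dot(x, x)
    return np.array([np.dot(x[: n - k], x[k:]) / denom for k in range(nlags + 1)])

def sample_pacf(x, nlags):
    """Partial autocorrelations via the Durbin-Levinson recursion on the sample ACF."""
    r = sample_acf(x, nlags)
    pacf = [1.0]
    phi_prev = np.zeros(0)  # coefficients of the AR(k-1) fit
    for k in range(1, nlags + 1):
        num = r[k] - np.dot(phi_prev, r[k - 1:0:-1])
        den = 1.0 - np.dot(phi_prev, r[1:k])
        phi_kk = num / den
        phi_prev = np.concatenate([phi_prev - phi_kk * phi_prev[::-1], [phi_kk]])
        pacf.append(phi_kk)
    return np.array(pacf)

# Hypothetical test series: AR(1) with coefficient 0.8
rng = np.random.default_rng(0)
noise = rng.standard_normal(500)
series = np.zeros(500)
for t in range(1, 500):
    series[t] = 0.8 * series[t - 1] + noise[t]

acf_vals = sample_acf(series, 10)
pacf_vals = sample_pacf(series, 10)
```

For an AR(1) process the ACF decays geometrically while the PACF cuts off after lag 1, which is the pattern used to identify the order p.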

Statistical tests




          Ljung-Box test for zero autocorrelation
          Unit root test for cointegration (Augmented Dickey-Fuller test)
          Granger-causality
          Whiteness (iid-ness) and normality
          See our conference paper (when the proceedings get published!)
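
As an illustration of the first test in the list, the Ljung-Box statistic can be sketched in a few lines of NumPy. It is Q = n(n+2) Σ r_k² / (n−k) over lags k = 1..h, and under the null of no autocorrelation it is approximately chi-square with h degrees of freedom. The function name `ljung_box_q` and the simulated data are assumptions for illustration; the p-value computation is left out.

```python
import numpy as np

def ljung_box_q(x, nlags):
    """Ljung-Box Q statistic over lags 1..nlags."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    n = len(x)
    denom = np.dot(x, x)
    total = 0.0
    for k in range(1, nlags + 1):
        r_k = np.dot(x[:-k], x[k:]) / denom  # lag-k sample autocorrelation
        total += r_k ** 2 / (n - k)
    return n * (n + 2) * total

rng = np.random.default_rng(0)
white = rng.standard_normal(500)

# Strongly autocorrelated series: AR(1) with coefficient 0.8
ar1 = np.zeros(500)
for t in range(1, 500):
    ar1[t] = 0.8 * ar1[t - 1] + white[t]

q_white = ljung_box_q(white, 10)
q_ar1 = ljung_box_q(ar1, 10)
# Compare each Q against a chi-square(10) critical value (about 18.31 at the 5% level).
```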




Autoregressive moving average (ARMA) models
          One of the most common univariate time series models:

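                   \[ y_t = \mu + a_1 y_{t-1} + \dots + a_p y_{t-p} + \epsilon_t + b_1 \epsilon_{t-1} + \dots + b_q \epsilon_{t-q} \]

                   where \( E(\epsilon_t \epsilon_s) = 0 \) for \( t \neq s \) and \( \epsilon_t \sim N(0, \sigma^2) \)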


          Exact log-likelihood can be evaluated via the Kalman filter, but the
          “conditional” likelihood is easier and commonly used
          statsmodels has tools for simulating ARMA processes with known
          coefficients \( a_i \), \( b_i \) and for estimating them given specified lag orders
              import scikits.statsmodels.tsa.arima_process as ap
              ar_coef = [1, .75, -.25]; ma_coef = [1, -.5]
              nobs = 100
              y = ap.arma_generate_sample(ar_coef, ma_coef, nobs)
              y += 4 # add in constant
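
For readers who want to see the recursion itself, the following is a minimal NumPy sketch that simulates an ARMA(p, q) process directly from the equation above. The function `simulate_arma`, its parameters, and the coefficient values are illustrative assumptions. Note that `arma_generate_sample` expects lag-polynomial coefficients (a leading 1, with the AR signs flipped), so its arguments are not the a_i and b_i of the equation directly.

```python
import numpy as np

def simulate_arma(ar, ma, nobs, mu=0.0, sigma=1.0, burnin=200, seed=0):
    """Simulate y_t = mu + sum_i ar[i-1]*y_{t-i} + e_t + sum_j ma[j-1]*e_{t-j}."""
    rng = np.random.default_rng(seed)
    p, q = len(ar), len(ma)
    total = nobs + burnin
    eps = rng.normal(0.0, sigma, size=total)
    y = np.zeros(total)
    for t in range(total):
        ar_part = sum(ar[i] * y[t - 1 - i] for i in range(p) if t - 1 - i >= 0)
        ma_part = sum(ma[j] * eps[t - 1 - j] for j in range(q) if t - 1 - j >= 0)
        y[t] = mu + ar_part + eps[t] + ma_part
    # Discard the burn-in so the arbitrary zero start-up values do not matter
    return y[burnin:]

# Hypothetical stationary ARMA(2, 1): a = (0.5, -0.2), b = (0.4), mu = 1
y_sim = simulate_arma(ar=[0.5, -0.2], ma=[0.4], nobs=1000, mu=1.0)

# The unconditional mean of a stationary ARMA process is mu / (1 - a_1 - ... - a_p)
theoretical_mean = 1.0 / (1.0 - 0.5 + 0.2)
```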

ARMA Estimation



          Several likelihood-based estimators implemented (see docs)
              model = tsa.ARMA(y)
              result = model.fit(order=(2, 1), trend='c',
                                 method='css-mle', disp=-1)
              result.params
              # array([ 3.97, -0.97, -0.05, -0.13])


          Standard model diagnostics, standard errors, information criteria
          (AIC, BIC, ...), etc. are available in the returned ARMAResults object
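
The idea behind the "conditional" approach is easiest to see in the pure AR case, where conditioning on the first p observations reduces estimation to ordinary least squares. The sketch below assumes a hypothetical helper `fit_ar_ols` and simulated data; it is not the statsmodels estimator, and MA terms would require a nonlinear optimizer instead of a single least-squares solve.

```python
import numpy as np

def fit_ar_ols(y, p):
    """Conditional least squares for an AR(p) with constant; returns (coefficients, sigma^2)."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    # Columns: constant, y_{t-1}, ..., y_{t-p}; rows correspond to t = p..n-1
    design = np.column_stack(
        [np.ones(n - p)] + [y[p - i : n - i] for i in range(1, p + 1)]
    )
    target = y[p:]
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    resid = target - design @ coef
    sigma2 = resid @ resid / len(target)
    return coef, sigma2

# Hypothetical AR(2): y_t = 1 + 0.5 y_{t-1} - 0.2 y_{t-2} + e_t
rng = np.random.default_rng(3)
n_sim = 2000
y_ar = np.zeros(n_sim)
for t in range(2, n_sim):
    y_ar[t] = 1.0 + 0.5 * y_ar[t - 1] - 0.2 * y_ar[t - 2] + rng.standard_normal()

coef, sigma2 = fit_ar_ols(y_ar[200:], p=2)  # drop start-up values
```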




Vector Autoregression (VAR) models



          Widely used model for modeling multiple (K-variate) time series,
          especially in macroeconomics:

                           \[ Y_t = A_1 Y_{t-1} + \dots + A_p Y_{t-p} + \epsilon_t, \quad \epsilon_t \sim N(0, \Sigma) \]

          Matrices \( A_i \) are \( K \times K \).
          \( Y_t \) must be a stationary process (sometimes achieved by
          differencing). A related class of models (VECM) handles
          nonstationary (including cointegrated) processes
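
The stationarity requirement can be checked numerically: a VAR(p) is stable when every eigenvalue of its companion matrix lies strictly inside the unit circle. The sketch below assumes hypothetical helpers `companion_matrix` and `is_stable` and made-up coefficient matrices.

```python
import numpy as np

def companion_matrix(coefs):
    """Stack the K x K matrices A_1..A_p into the (K*p) x (K*p) companion form."""
    p = len(coefs)
    k = coefs[0].shape[0]
    top = np.hstack(coefs)
    if p == 1:
        return top
    bottom = np.hstack([np.eye(k * (p - 1)), np.zeros((k * (p - 1), k))])
    return np.vstack([top, bottom])

def is_stable(coefs):
    """True when all companion eigenvalues have modulus below 1."""
    eigvals = np.linalg.eigvals(companion_matrix(coefs))
    return bool(np.all(np.abs(eigvals) < 1.0))

# Hypothetical stable VAR(2) in two variables
a1 = np.array([[0.5, 0.1], [0.0, 0.3]])
a2 = np.array([[0.1, 0.0], [0.0, 0.1]])
stable_example = is_stable([a1, a2])

# A bivariate random walk has unit roots and is not stable
random_walk = is_stable([np.eye(2)])
```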




Vector Autoregression (VAR) models

   >>> model = VAR(data); model.select_order(8)
                    VAR Order Selection
   =====================================================
              aic          bic          fpe         hqic
   -----------------------------------------------------
   0       -27.83       -27.78    8.214e-13       -27.81
   1       -28.77       -28.57    3.189e-13       -28.69
   2       -29.00      -28.64*    2.556e-13       -28.85
   3       -29.10       -28.60    2.304e-13      -28.90*
   4       -29.09       -28.43    2.330e-13       -28.82
   5       -29.13       -28.33    2.228e-13       -28.81
   6      -29.14*       -28.18   2.213e-13*       -28.75
   7       -29.07       -27.96    2.387e-13       -28.62
   =====================================================
   * Minimum
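
A table like the one above can be approximated by fitting each lag order by least squares on a common sample and penalizing the log-determinant of the residual covariance. The sketch below uses one common textbook form of AIC and BIC; the helper names and simulated data are assumptions, and the exact penalty terms in statsmodels may differ in detail.

```python
import numpy as np

def var_design(data, p, maxlags):
    """Regressors (constant + p lags) and targets; the first maxlags rows are
    dropped so that every order is fitted on the same sample."""
    n = data.shape[0]
    cols = [np.ones((n - maxlags, 1))]
    for lag in range(1, p + 1):
        cols.append(data[maxlags - lag : n - lag])
    return np.hstack(cols), data[maxlags:]

def order_criteria(data, maxlags):
    """AIC and BIC for lag orders 0..maxlags."""
    n, k = data.shape
    t_eff = n - maxlags
    aic, bic = [], []
    for p in range(maxlags + 1):
        design, target = var_design(data, p, maxlags)
        beta, *_ = np.linalg.lstsq(design, target, rcond=None)
        resid = target - design @ beta
        sigma = resid.T @ resid / t_eff  # ML estimate of the residual covariance
        _, logdet = np.linalg.slogdet(sigma)
        n_lag_params = p * k * k
        aic.append(logdet + 2.0 * n_lag_params / t_eff)
        bic.append(logdet + np.log(t_eff) * n_lag_params / t_eff)
    return np.array(aic), np.array(bic)

# Hypothetical data from a stable bivariate VAR(1)
rng = np.random.default_rng(1)
a_true = np.array([[0.5, 0.2], [0.1, 0.4]])
sim = np.zeros((1000, 2))
for t in range(1, 1000):
    sim[t] = a_true @ sim[t - 1] + rng.standard_normal(2)

aic, bic = order_criteria(sim, maxlags=4)
bic_order = int(np.argmin(bic))
```

BIC penalizes extra lags more heavily than AIC, which is why the two criteria can disagree, as they do in the table.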

Vector Autoregression (VAR) models

   >>> result = model.fit(2)
   >>> result.summary() # print summary for each variable
   <snip>
   Results for equation m1
   ====================================================
               coefficient    std. error  t-stat   prob
   ----------------------------------------------------
   const          0.004968      0.001850   2.685  0.008
   L1.m1          0.363636      0.071307   5.100  0.000
   L1.realgdp    -0.077460      0.092975  -0.833  0.406
   L1.cpi        -0.052387      0.128161  -0.409  0.683
   L2.m1          0.250589      0.072050   3.478  0.001
   L2.realgdp    -0.085874      0.092032  -0.933  0.352
   L2.cpi         0.169803      0.128376   1.323  0.188
   ====================================================
   <snip>


Vector Autoregression (VAR) models




   >>> result = model.fit(2)
   >>> result.summary() # print summary for each variable
   <snip>
   Correlation matrix of residuals
                    m1   realgdp       cpi
   m1         1.000000 -0.055690 -0.297494
   realgdp   -0.055690  1.000000  0.115597
   cpi       -0.297494  0.115597  1.000000




VAR: Impulse Response analysis
          Analyze systematic impact of unit “shock” to a single variable

   irf = result.irf(10)
   irf.plot()

          [Figure: impulse responses over 10 periods, 3 x 3 grid of panels:
           m1 → m1, realgdp → m1, cpi → m1;
           m1 → realgdp, realgdp → realgdp, cpi → realgdp;
           m1 → cpi, realgdp → cpi, cpi → cpi]
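
The (non-orthogonalized) impulse responses follow from the moving-average representation of the VAR: Φ_0 = I and Φ_h = Σ_j Φ_{h−j} A_j for j = 1..min(h, p). A minimal sketch, assuming a hypothetical function `impulse_responses` and made-up coefficient matrices:

```python
import numpy as np

def impulse_responses(coefs, horizon):
    """MA coefficient matrices Phi_0..Phi_horizon of a VAR with lag matrices coefs.

    Phi_h[i, j] is the response of variable i, h periods after a unit shock to variable j.
    """
    k = coefs[0].shape[0]
    phis = [np.eye(k)]
    for h in range(1, horizon + 1):
        phi_h = np.zeros((k, k))
        for j, a_j in enumerate(coefs, start=1):
            if j <= h:
                phi_h += phis[h - j] @ a_j
        phis.append(phi_h)
    return np.array(phis)

# Hypothetical coefficient matrices
a1 = np.array([[0.5, 0.2], [0.1, 0.4]])
a2 = np.array([[0.1, 0.0], [0.0, 0.1]])

irf_var1 = impulse_responses([a1], horizon=10)      # for a VAR(1), Phi_h = A_1^h
irf_var2 = impulse_responses([a1, a2], horizon=10)
```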



VAR: Forecast Error Variance Decomposition
          Analyze contribution of each variable to forecasting error

   fevd = result.fevd(20)
   fevd.plot()

          [Figure: forecast error variance decomposition (FEVD) over 20 periods,
           one panel each for m1, realgdp, and cpi, showing the shares
           attributed to m1, realgdp, and cpi]
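
One standard way to compute such a decomposition is to orthogonalize the shocks with a Cholesky factor of the residual covariance and accumulate squared orthogonalized responses. The sketch below does this for a VAR(1) only; the function `fevd_var1` and the numbers are illustrative assumptions, and the result depends on the ordering of the variables.

```python
import numpy as np

def fevd_var1(a, sigma, horizon):
    """FEVD for a VAR(1) with coefficient matrix a and residual covariance sigma.

    Returns an array of shape (horizon, K, K); entry [h, i, j] is the share of
    variable i's (h+1)-step forecast error variance attributed to shock j.
    """
    k = a.shape[0]
    chol = np.linalg.cholesky(sigma)  # lower triangular, sigma = chol @ chol.T
    shares = np.zeros((horizon, k, k))
    running = np.zeros((k, k))
    phi = np.eye(k)
    for h in range(horizon):
        theta = phi @ chol            # orthogonalized response at step h
        running += theta ** 2
        shares[h] = running / running.sum(axis=1, keepdims=True)
        phi = a @ phi                 # Phi_{h+1} = A^(h+1) for a VAR(1)
    return shares

# Hypothetical inputs
a = np.array([[0.5, 0.2], [0.1, 0.4]])
sigma = np.array([[1.0, 0.3], [0.3, 0.5]])
decomp = fevd_var1(a, sigma, horizon=20)
```

With a Cholesky ordering, the first variable's one-step forecast error is attributed entirely to its own shock.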



VAR: Statistical tests



   In [137]: result.test_causality('m1', ['cpi', 'realgdp'])
   Granger causality f-test
   =========================================================
      Test statistic   Critical Value      p-value        df
   ---------------------------------------------------------
            1.248787         2.387325        0.289 (4, 579)
   =========================================================
   H_0: ['cpi', 'realgdp'] do not Granger-cause m1
   Conclusion: fail to reject H_0 at 5.00% significance level
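
The logic of the test can be illustrated with a single-equation version: regress y on its own lags, then add lags of x, and compare the residual sums of squares with an F statistic. This is a simplified sketch with hypothetical helper names and simulated data, not the system-wide test that `test_causality` performs on a fitted VAR.

```python
import numpy as np

def lagmat(series, p):
    """Columns series_{t-1}, ..., series_{t-p} aligned with series[p:]."""
    n = len(series)
    return np.column_stack([series[p - i : n - i] for i in range(1, p + 1)])

def granger_f_stat(y, x, p):
    """F statistic for H0: lags 1..p of x do not help predict y."""
    y = np.asarray(y, dtype=float)
    x = np.asarray(x, dtype=float)
    target = y[p:]
    const = np.ones((len(target), 1))
    restricted = np.hstack([const, lagmat(y, p)])
    unrestricted = np.hstack([restricted, lagmat(x, p)])

    def rss(design):
        beta, *_ = np.linalg.lstsq(design, target, rcond=None)
        resid = target - design @ beta
        return resid @ resid

    rss_r, rss_u = rss(restricted), rss(unrestricted)
    df_denom = len(target) - unrestricted.shape[1]
    return ((rss_r - rss_u) / p) / (rss_u / df_denom)

# Hypothetical data in which x drives y but not the other way around
rng = np.random.default_rng(2)
n_obs = 500
x_sim = rng.standard_normal(n_obs)
y_sim = np.zeros(n_obs)
for t in range(1, n_obs):
    y_sim[t] = 0.3 * y_sim[t - 1] + 0.8 * x_sim[t - 1] + rng.standard_normal()

f_x_to_y = granger_f_stat(y_sim, x_sim, p=2)
f_y_to_x = granger_f_stat(x_sim, y_sim, p=2)
```

A large F statistic relative to the F(p, df) critical value leads to rejecting H_0, the opposite of the conclusion in the output above.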




Filtering

          Hodrick-Prescott (HP) filter separates a time series \( y_t \) into a trend \( \tau_t \)
          and a cyclical component \( \zeta_t \), so that \( y_t = \tau_t + \zeta_t \).

          [Figure: inflation with its HP-filtered cyclical and trend components, 1962 to 2006]
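
The HP trend minimizes the squared deviations from the data plus λ times the squared second differences of the trend, which gives the closed form τ = (I + λ D'D)⁻¹ y with D the second-difference matrix. The dense-matrix sketch below is an illustration under that formulation; the function name `hp_filter` and the default λ = 1600 (a conventional choice for quarterly data) are assumptions, and a production implementation would use sparse matrices.

```python
import numpy as np

def hp_filter(y, lamb=1600.0):
    """Return (cycle, trend) from the Hodrick-Prescott filter."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    # Second-difference operator: (D tau)_i = tau_i - 2 tau_{i+1} + tau_{i+2}
    d = np.zeros((n - 2, n))
    for i in range(n - 2):
        d[i, i : i + 3] = [1.0, -2.0, 1.0]
    trend = np.linalg.solve(np.eye(n) + lamb * d.T @ d, y)
    return y - trend, trend

# A straight line has zero second differences, so it is returned unchanged as trend
line = 2.0 + 0.5 * np.arange(50)
line_cycle, line_trend = hp_filter(line)

# Hypothetical noisy series: linear trend plus a sine-wave cycle
t_idx = np.arange(120)
noisy = 0.1 * t_idx + np.sin(2 * np.pi * t_idx / 20)
cycle, trend = hp_filter(noisy)
```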

Filtering

          In addition to the HP filter, two other filters popular in finance and
          economics, Baxter-King and Christiano-Fitzgerald, are available
          We refer you to our paper and the documentation for details on these:

          [Figure: Inflation and Unemployment, BK filtered (left) and CF filtered (right); series INFL and UNEMP]

Preview: Bayesian dynamic linear models (DLM)



          A state space model by another name:

                                      \[ y_t = F_t \theta_t + \nu_t, \quad \nu_t \sim N(0, V_t) \]
                                      \[ \theta_t = G \theta_{t-1} + \omega_t, \quad \omega_t \sim N(0, W_t) \]

          Estimation of the basic model by Kalman filter recursions. Provides
          an elegant way to do time-varying linear regressions for forecasting
          Extensions: multivariate DLMs, stochastic volatility (SV) models,
          MCMC-based posterior sampling, mixtures of DLMs
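
The Kalman recursions are shortest in the scalar "local level" special case, where F_t = 1 and G = 1 and the variances V and W are known constants. The sketch below is a minimal illustration under those assumptions; the function name and prior values are hypothetical and unrelated to the DLM class previewed in this talk.

```python
def kalman_local_level(y, v, w, m0=0.0, c0=1e6):
    """Filtered means and variances for y_t = theta_t + nu_t, theta_t = theta_{t-1} + omega_t.

    v and w are the (known, constant) observation and state variances;
    m0 and c0 describe a diffuse prior on the initial state.
    """
    m, c = m0, c0
    means, variances = [], []
    for obs in y:
        a, r = m, c + w         # prior for the state at time t
        f, q = a, r + v         # one-step-ahead forecast and its variance
        gain = r / q            # Kalman gain
        m = a + gain * (obs - f)
        c = r - gain * r        # posterior variance
        means.append(m)
        variances.append(c)
    return means, variances

# With constant observations, the filtered level should settle at that constant
means, variances = kalman_local_level([5.0] * 50, v=1.0, w=0.1)
```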




Preview: DLM Example (Constant+Trend model)

   model = Polynomial(2)
   dlm = DLM(close_px['AAPL'], model.F, G=model.G, # model
             m0=m0, C0=C0, n0=n0, s0=s0, # priors
             state_discount=.95) # discount factor
          [Figure: Constant + Trend DLM, Nov 2008 to Nov 2009]

Preview: Stochastic volatility models


          [Figure: JPY-USD Exchange Rate Volatility Process]



Future: sandbox and beyond




          ARCH / GARCH models for volatility
          Structural VAR and error correction models (ECM) for cointegrated
          processes
          Models with non-normally distributed errors
          Better data description, visualization, and interactive research tools
          More sophisticated Bayesian time series models




Conclusions




          We’ve implemented many foundational models for time series
          analysis, but the field is very broad
          User interface can and should be much improved
          Repo: http://github.com/statsmodels/statsmodels
          Docs: http://statsmodels.sourceforge.net
          Contact: pystatsmodels@googlegroups.com





More Related Content

What's hot (20)

PPTX
Introduction to Big Data/Machine Learning
Lars Marius Garshol
 
PPTX
Machine Learning-Linear regression
kishanthkumaar
 
PPTX
3 data visualization
ThilinaWanshathilaka
 
PDF
Tableau online training
suresh
 
PPTX
Python Seaborn Data Visualization
Sourabh Sahu
 
PDF
Confusion Matrix Explained
Stockholm University
 
PDF
Tableau Drive, A new methodology for scaling your analytic culture
Tableau Software
 
PDF
Feature Engineering
HJ van Veen
 
PPTX
Introduction to pandas
Piyush rai
 
PPTX
Data Analysis with Python Pandas
Neeru Mittal
 
PDF
Machine Learning Deep Learning AI and Data Science
Venkata Reddy Konasani
 
PDF
An introduction to Bayesian Statistics using Python
freshdatabos
 
PPTX
Introduction to predictive modeling v1
Venkata Reddy Konasani
 
PDF
The matplotlib Library
Haim Michael
 
PPTX
Pca ppt
Dheeraj Dwivedi
 
PDF
Data Visualization(s) Using Python
Aniket Maithani
 
PDF
Python Matplotlib Tutorial | Matplotlib Tutorial | Python Tutorial | Python T...
Edureka!
 
PDF
Data visualisation & analytics with Tableau
Outreach Digital
 
PPTX
Naive bayes
Ashraf Uddin
 
PPTX
Scikit Learn intro
9xdot
 
Introduction to Big Data/Machine Learning
Lars Marius Garshol
 
Machine Learning-Linear regression
kishanthkumaar
 
3 data visualization
ThilinaWanshathilaka
 
Tableau online training
suresh
 
Python Seaborn Data Visualization
Sourabh Sahu
 
Confusion Matrix Explained
Stockholm University
 
Tableau Drive, A new methodology for scaling your analytic culture
Tableau Software
 
Feature Engineering
HJ van Veen
 
Introduction to pandas
Piyush rai
 
Data Analysis with Python Pandas
Neeru Mittal
 
Machine Learning Deep Learning AI and Data Science
Venkata Reddy Konasani
 
An introduction to Bayesian Statistics using Python
freshdatabos
 
Introduction to predictive modeling v1
Venkata Reddy Konasani
 
The matplotlib Library
Haim Michael
 
Data Visualization(s) Using Python
Aniket Maithani
 
Python Matplotlib Tutorial | Matplotlib Tutorial | Python Tutorial | Python T...
Edureka!
 
Data visualisation & analytics with Tableau
Outreach Digital
 
Naive bayes
Ashraf Uddin
 
Scikit Learn intro
9xdot
 

Viewers also liked (20)

PDF
Python for Financial Data Analysis with pandas
Wes McKinney
 
PDF
pandas: a Foundational Python Library for Data Analysis and Statistics
Wes McKinney
 
PDF
Data Structures for Statistical Computing in Python
Wes McKinney
 
PDF
Time travel and time series analysis with pandas + statsmodels
Alexander Hendorf
 
PPTX
Revenue Growth through Machine Learning
DataWorks Summit
 
PDF
SciPy 2011 pandas lightning talk
Wes McKinney
 
PPTX
PyDataDC- Forecasting critical food violations at restaurants using open data
Nicole A. Donnelly, CMCP
 
PDF
ET_with_EEG
Xuan Guo
 
PPTX
How Chile used social media during the Earthquake
Sebastian Salazar
 
PDF
Laughing Squid Opportunity Analysis Project
Wildfire Interactive, Inc.
 
PDF
Structured Data Challenges in Finance and Statistics
Wes McKinney
 
PDF
Multivariate time series
Luigi Piva CQF
 
PDF
What's new in pandas and the SciPy stack for financial users
Wes McKinney
 
PDF
Productive Data Tools for Quants
Wes McKinney
 
PDF
Analysis of EEG data Using ICA and Algorithm Development for Energy Comparison
ijsrd.com
 
PPT
Time series Forecasting using svm
Institute of Technology Telkom
 
PDF
Predicting Stock Market Price Using Support Vector Regression
Chittagong Independent University
 
PDF
Time series database, InfluxDB & PHP
Corley S.r.l.
 
PPTX
ForecastIT 4. Holt's Exponential Smoothing
DeepThought, Inc.
 
Python for Financial Data Analysis with pandas
Wes McKinney
 
pandas: a Foundational Python Library for Data Analysis and Statistics
Wes McKinney
 
Data Structures for Statistical Computing in Python
Wes McKinney
 
Time travel and time series analysis with pandas + statsmodels
Alexander Hendorf
 
Revenue Growth through Machine Learning
DataWorks Summit
 
SciPy 2011 pandas lightning talk
Wes McKinney
 
PyDataDC- Forecasting critical food violations at restaurants using open data
Nicole A. Donnelly, CMCP
 
ET_with_EEG
Xuan Guo
 
How Chile used social media during the Earthquake
Sebastian Salazar
 
Laughing Squid Opportunity Analysis Project
Wildfire Interactive, Inc.
 
Structured Data Challenges in Finance and Statistics
Wes McKinney
 
Multivariate time series
Luigi Piva CQF
 
What's new in pandas and the SciPy stack for financial users
Wes McKinney
 
Productive Data Tools for Quants
Wes McKinney
 
Analysis of EEG data Using ICA and Algorithm Development for Energy Comparison
ijsrd.com
 
Time series Forecasting using svm
Institute of Technology Telkom
 
Predicting Stock Market Price Using Support Vector Regression
Chittagong Independent University
 
Time series database, InfluxDB & PHP
Corley S.r.l.
 
ForecastIT 4. Holt's Exponential Smoothing
DeepThought, Inc.
 
Ad

Scipy 2011 Time Series Analysis in Python

  • 1. Time Series Analysis in Python with statsmodels. Wes McKinney (1), Josef Perktold (2), Skipper Seabold (3). (1) Department of Statistical Science, Duke University; (2) Department of Economics, University of North Carolina at Chapel Hill; (3) Department of Economics, American University. 10th Python in Science Conference, 13 July 2011. McKinney, Perktold, Seabold (statsmodels) Python Time Series Analysis SciPy Conference 2011 1 / 29
  • 2. What is statsmodels? A library for statistical modeling, implementing standard statistical models in Python using NumPy and SciPy. Includes: linear (regression) models of many forms; descriptive statistics; statistical tests; time series analysis; ...and much more.
  • 3. What is Time Series Analysis? Statistical modeling of time-ordered observations. Inferring structure, forecasting and simulation, and testing distributional assumptions about the data. Modeling dynamic relationships among multiple time series. Broad applications, e.g. in economics, finance, neuroscience, signal processing...
  • 4. Talk Overview: brief update on statsmodels development; aside: user interface and data structures; descriptive statistics and tests; auto-regressive moving average models (ARMA); vector autoregression (VAR) models; filtering tools (Hodrick-Prescott and others); near future: Bayesian dynamic linear models (DLMs), ARCH / GARCH volatility models and beyond.
  • 5. Statsmodels development update. We’re now on GitHub! Join us: http://github.com/statsmodels/statsmodels Check out the slick Sphinx docs: http://statsmodels.sourceforge.net Development focus has been largely computational, i.e. writing correct, tested implementations of all the common classes of statistical models.
  • 6. Statsmodels development update. Major work remains to be done on providing a nice integrated user interface. We must work together to close the gap between R and Python! Some important areas: a formula framework for specifying model design matrices; integrated rich statistical data structures (pandas); data visualization of results should always be a few keystrokes away; write a “Statsmodels for R users” guide.
  • 7. Aside: statistical data structures and user interface. While I have a captive audience... Controversial fact: pandas is the only Python library currently providing data structures matching (and in many places exceeding) the richness of R’s data structures (for statistics). Let’s have a BoF session so I can justify this statement. Feedback I hear is that end users find the fragmented, incohesive set of Python tools for data analysis and statistics to be confusing, frustrating, and certainly not compelling them to use Python... (Not to mention the packaging headaches.)
  • 8. Aside: statistical data structures and user interface. We need to “commit” ASAP (not 12 months from now) to a high level data structure(s) as the “primary data structure(s) for statistical data analysis” and communicate that clearly to end users. Or we might as well all start programming in R...
  • 9. Example data: EEG trace data [Figure: EEG trace time series plot]
  • 10. Example data: Macroeconomic data [Figure: three stacked panels plotting cpi, m1, and realgdp over time]
  • 11. Example data: Stock data [Figure: price series for AAPL, GOOG, MSFT, and YHOO]
  • 12. Descriptive statistics. Autocorrelation and partial autocorrelation plots, commonly used for identification in ARMA(p,q) and ARIMA(p,d,q) models.
      acf = tsa.acf(eeg, 50)
      pacf = tsa.pacf(eeg, 50)
    [Figure: two panels, Autocorrelation and Partial Autocorrelation, for lags 0 to 50]
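To make the two quantities on this slide concrete, here is a minimal pure-Python sketch. It is not the statsmodels implementation: `sample_acf` and `pacf_from_acf` are hypothetical helper names, the ACF uses the biased (1/n) estimator, and the PACF is obtained from the ACF by the Durbin-Levinson recursion.

```python
def sample_acf(x, nlags):
    """Sample autocorrelations r_0..r_nlags (biased 1/n estimator)."""
    n = len(x)
    mean = sum(x) / n
    dev = [v - mean for v in x]
    c0 = sum(d * d for d in dev) / n
    acf = []
    for k in range(nlags + 1):
        ck = sum(dev[t] * dev[t - k] for t in range(k, n)) / n
        acf.append(ck / c0)
    return acf


def pacf_from_acf(r):
    """Partial autocorrelations from autocorrelations r_0..r_m
    via the Durbin-Levinson recursion."""
    pacf = [1.0]
    prev = []  # prev[j] holds phi_{k-1, j+1}
    for k in range(1, len(r)):
        num = r[k] - sum(prev[j] * r[k - 1 - j] for j in range(k - 1))
        den = 1.0 - sum(prev[j] * r[j + 1] for j in range(k - 1))
        phi_kk = num / den
        cur = [prev[j] - phi_kk * prev[k - 2 - j] for j in range(k - 1)]
        cur.append(phi_kk)
        pacf.append(phi_kk)
        prev = cur
    return pacf


# Illustrative data (an assumption, not the EEG series from the talk):
# an alternating series has strongly negative lag-1 autocorrelation.
x = [1.0, -1.0] * 50
r = sample_acf(x, 2)
p = pacf_from_acf(r)
```

For an AR(p) process the PACF cuts off after lag p while the ACF decays gradually, which is why the pair is used for order identification.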
  • 13. Statistical tests: Ljung-Box test for zero autocorrelation; unit root test for cointegration (Augmented Dickey-Fuller test); Granger-causality; whiteness (iid-ness) and normality. See our conference paper (when the proceedings get published!).
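As an illustration of the first test listed, the Ljung-Box statistic is Q = n(n+2) Σ_{k=1..h} r_k² / (n − k), where r_k is the lag-k sample autocorrelation. The sketch below only computes Q; the function name `ljung_box_q` is hypothetical and this is not the statsmodels routine. Under the null of no autocorrelation, Q is compared against a chi-square distribution with h degrees of freedom.

```python
def ljung_box_q(x, h):
    """Ljung-Box Q statistic using lags 1..h."""
    n = len(x)
    mean = sum(x) / n
    dev = [v - mean for v in x]
    c0 = sum(d * d for d in dev)
    q = 0.0
    for k in range(1, h + 1):
        # lag-k sample autocorrelation (the 1/n factors cancel in the ratio)
        r_k = sum(dev[t] * dev[t - k] for t in range(k, n)) / c0
        q += r_k * r_k / (n - k)
    return n * (n + 2) * q


# Illustrative data (an assumption): a perfectly alternating series.
x = [1.0, -1.0] * 50
q1 = ljung_box_q(x, 1)
# The 5% critical value of chi-square with 1 degree of freedom is about 3.84,
# so a Q this large rejects the hypothesis of zero autocorrelation.
```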
  • 14. Autoregressive moving average (ARMA) models. One of the most common univariate time series models: $y_t = \mu + a_1 y_{t-1} + \dots + a_p y_{t-p} + \epsilon_t + b_1 \epsilon_{t-1} + \dots + b_q \epsilon_{t-q}$, where $E(\epsilon_t, \epsilon_s) = 0$ for $t \neq s$ and $\epsilon_t \sim N(0, \sigma^2)$. The exact log-likelihood can be evaluated via the Kalman filter, but the “conditional” likelihood is easier and commonly used. statsmodels has tools for simulating ARMA processes with known coefficients $a_i$, $b_i$ and also for estimation given specified lag orders.
      import scikits.statsmodels.tsa.arima_process as ap
      ar_coef = [1, .75, -.25]; ma_coef = [1, -.5]
      nobs = 100
      y = ap.arma_generate_sample(ar_coef, ma_coef, nobs)
      y += 4  # add in constant
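The ARMA recursion on this slide can also be simulated directly. The following is a rough standard-library sketch, not the library routine: `simulate_arma`, its `burnin` and `seed` parameters, and the coefficient values are assumptions for illustration. The coefficients are plugged into the equation exactly as written on the slide, which may differ in sign convention from the lag-polynomial arrays that `arma_generate_sample` expects.

```python
import random


def simulate_arma(a, b, nobs, mu=0.0, sigma=1.0, burnin=200, seed=0):
    """Simulate y_t = mu + sum_i a[i]*y_{t-1-i} + e_t + sum_j b[j]*e_{t-1-j}."""
    rng = random.Random(seed)
    p, q = len(a), len(b)
    y = [0.0] * p    # pre-sample values of the series
    eps = [0.0] * q  # pre-sample values of the shocks
    for _ in range(nobs + burnin):
        e = rng.gauss(0.0, sigma)
        ar_part = sum(a[i] * y[-1 - i] for i in range(p))
        ma_part = sum(b[j] * eps[-1 - j] for j in range(q))
        y.append(mu + ar_part + e + ma_part)
        eps.append(e)
    # drop the burn-in so the zero start-up values have little influence
    return y[-nobs:]


# Illustrative stationary ARMA(2, 1) with hypothetical coefficients
y = simulate_arma(a=[0.75, -0.25], b=[-0.5], nobs=100, mu=2.0)
```

Note that `mu` here is the intercept of the recursion; the long-run mean of the simulated series is mu / (1 − a_1 − ... − a_p).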
  • 15. ARMA Estimation. Several likelihood-based estimators are implemented (see docs).
      model = tsa.ARMA(y)
      result = model.fit(order=(2, 1), trend='c', method='css-mle', disp=-1)
      result.params  # array([ 3.97, -0.97, -0.05, -0.13])
    Standard model diagnostics, standard errors, information criteria (AIC, BIC, ...), etc. are available in the returned ARMAResults object.
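To give a feel for the “conditional” approach, here is a minimal sketch for the simplest case, an AR(1) with a constant, where minimizing the conditional sum of squares reduces to ordinary least squares of y_t on y_{t-1}. This is an illustration under that simplifying assumption, not how `ARMA.fit` is implemented; `fit_ar1_css` is a hypothetical name.

```python
import random


def fit_ar1_css(y):
    """Conditional least squares for y_t = c + phi * y_{t-1} + e_t,
    conditioning on the first observation."""
    x = y[:-1]  # lagged values
    z = y[1:]   # current values
    n = len(x)
    xbar = sum(x) / n
    zbar = sum(z) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxz = sum((xi - xbar) * (zi - zbar) for xi, zi in zip(x, z))
    phi = sxz / sxx
    c = zbar - phi * xbar
    resid = [zi - c - phi * xi for xi, zi in zip(x, z)]
    sigma2 = sum(r * r for r in resid) / n
    return c, phi, sigma2


# Simulated AR(1) data with hypothetical true values c = 1.0, phi = 0.5
rng = random.Random(42)
y = [2.0]
for _ in range(2000):
    y.append(1.0 + 0.5 * y[-1] + rng.gauss(0.0, 1.0))
c_hat, phi_hat, sigma2_hat = fit_ar1_css(y)
```

With moving average terms the residuals depend nonlinearly on the parameters, so a numerical optimizer is needed instead of this closed form.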
  • 16. Vector Autoregression (VAR) models. Widely used for modeling multiple (K-variate) time series, especially in macroeconomics: $Y_t = A_1 Y_{t-1} + \dots + A_p Y_{t-p} + \epsilon_t$, $\epsilon_t \sim N(0, \Sigma)$. The matrices $A_i$ are $K \times K$. $Y_t$ must be a stationary process (sometimes achieved by differencing). A related class of models (VECM) handles nonstationary (including cointegrated) processes.
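The stationarity requirement can be checked from the coefficients: a VAR(1) is stable when every eigenvalue of $A_1$ has modulus below one (for a VAR(p), the same check applies to the companion matrix). Below is a small sketch for the bivariate VAR(1) case only; the helper names and the example matrices are assumptions for illustration.

```python
import cmath


def eig2x2(A):
    """Eigenvalues of a 2 x 2 matrix from its trace and determinant."""
    tr = A[0][0] + A[1][1]
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    disc = cmath.sqrt(tr * tr - 4.0 * det)
    return (tr + disc) / 2.0, (tr - disc) / 2.0


def is_stable_var1(A):
    """True if the bivariate VAR(1) Y_t = A Y_{t-1} + e_t is stationary."""
    return all(abs(lam) < 1.0 for lam in eig2x2(A))


# Hypothetical coefficient matrices
A_stable = [[0.5, 0.1], [0.2, 0.3]]
A_unit_root = [[1.0, 0.0], [0.0, 0.5]]  # first variable is a random walk

stable_ok = is_stable_var1(A_stable)
unit_root_ok = is_stable_var1(A_unit_root)
```

The second matrix fails the check because one eigenvalue equals one; differencing the first variable would be the usual remedy.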
  • 17. Vector Autoregression (VAR) models
      >>> model = VAR(data); model.select_order(8)
                       VAR Order Selection
      =====================================================
                 aic          bic          fpe         hqic
      -----------------------------------------------------
      0       -27.83       -27.78    8.214e-13       -27.81
      1       -28.77       -28.57    3.189e-13       -28.69
      2       -29.00       -28.64*   2.556e-13       -28.85
      3       -29.10       -28.60    2.304e-13       -28.90*
      4       -29.09       -28.43    2.330e-13       -28.82
      5       -29.13       -28.33    2.228e-13       -28.81
      6       -29.14*      -28.18    2.213e-13*      -28.75
      7       -29.07       -27.96    2.387e-13       -28.62
      =====================================================
      * Minimum
  • 18. Vector Autoregression (VAR) models
      >>> result = model.fit(2)
      >>> result.summary()  # print summary for each variable
      <snip>
      Results for equation m1
      ====================================================
                  coefficient  std. error  t-stat   prob
      ----------------------------------------------------
      const          0.004968    0.001850   2.685  0.008
      L1.m1          0.363636    0.071307   5.100  0.000
      L1.realgdp    -0.077460    0.092975  -0.833  0.406
      L1.cpi        -0.052387    0.128161  -0.409  0.683
      L2.m1          0.250589    0.072050   3.478  0.001
      L2.realgdp    -0.085874    0.092032  -0.933  0.352
      L2.cpi         0.169803    0.128376   1.323  0.188
      ====================================================
      <snip>
  • 19. Vector Autoregression (VAR) models
      >>> result = model.fit(2)
      >>> result.summary()  # print summary for each variable
      <snip>
      Correlation matrix of residuals
                      m1    realgdp        cpi
      m1        1.000000  -0.055690  -0.297494
      realgdp  -0.055690   1.000000   0.115597
      cpi      -0.297494   0.115597   1.000000
  • 20. VAR: Impulse Response analysis. Analyze the systematic impact of a unit “shock” to a single variable.
      irf = result.irf(10)
      irf.plot()
    [Figure: Impulse responses, a 3 × 3 grid of panels, one per shock → response pair among m1, realgdp, and cpi, over horizons 0 to 10]
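The (non-orthogonalized) impulse responses follow from the VAR coefficients through the recursion $\Phi_0 = I$, $\Phi_i = \sum_{j=1}^{\min(i,p)} \Phi_{i-j} A_j$, where entry (r, c) of $\Phi_i$ is the response of variable r at horizon i to a unit shock in variable c. The sketch below implements that recursion with plain lists; it is not the code behind `result.irf`, and the function names and coefficient values are assumptions.

```python
def matmul(X, Y):
    """Product of two matrices stored as lists of rows."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]


def matadd(X, Y):
    """Elementwise sum of two equally sized matrices."""
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(X, Y)]


def impulse_responses(coefs, horizon):
    """Return [Phi_0, ..., Phi_horizon] for a VAR with lag matrices coefs."""
    K = len(coefs[0])
    p = len(coefs)
    identity = [[1.0 if i == j else 0.0 for j in range(K)] for i in range(K)]
    phis = [identity]
    for i in range(1, horizon + 1):
        acc = [[0.0] * K for _ in range(K)]
        for j in range(1, min(i, p) + 1):
            acc = matadd(acc, matmul(phis[i - j], coefs[j - 1]))
        phis.append(acc)
    return phis


# Hypothetical bivariate VAR(1): variable 2 feeds into variable 1
A1 = [[0.5, 0.1], [0.0, 0.5]]
phis = impulse_responses([A1], horizon=10)
```

For a stable VAR the responses die out as the horizon grows, which is the decay pattern visible in plots of this kind.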
  • 21. VAR: Forecast Error Variance Decomposition. Analyze the contribution of each variable to the forecasting error.
      fevd = result.fevd(20)
      fevd.plot()
    [Figure: Forecast error variance decomposition (FEVD), one panel each for m1, realgdp, and cpi, showing the shares attributed to m1, realgdp, and cpi over horizons 0 to 20]
  • 22. VAR: Statistical tests
      In [137]: result.test_causality('m1', ['cpi', 'realgdp'])
      Granger causality f-test
      =========================================================
      Test statistic   Critical Value   p-value          df
      ---------------------------------------------------------
            1.248787         2.387325     0.289    (4, 579)
      =========================================================
      H_0: ['cpi', 'realgdp'] do not Granger-cause m1
      Conclusion: fail to reject H_0 at 5.00% significance level
  • 23. Filtering. The Hodrick-Prescott (HP) filter separates a time series $y_t$ into a trend $\tau_t$ and a cyclical component $\zeta_t$, so that $y_t = \tau_t + \zeta_t$. [Figure: Inflation with its cyclical component and trend component]
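The HP trend minimizes $\sum_t (y_t - \tau_t)^2 + \lambda \sum_t (\tau_{t+1} - 2\tau_t + \tau_{t-1})^2$, whose solution is $\tau = (I + \lambda D'D)^{-1} y$ with $D$ the second-difference operator. A dense, pure-Python sketch of that closed form follows. It is meant for short series only and is not the statsmodels implementation; the helper names are hypothetical, and the default of 1600 is the conventional choice for quarterly data.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def hp_filter(y, lamb=1600.0):
    """Return (cycle, trend) with trend = (I + lamb * D'D)^{-1} y."""
    n = len(y)
    A = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    weights = (1.0, -2.0, 1.0)  # one row of the second-difference operator D
    for r in range(n - 2):
        for a in range(3):
            for b in range(3):
                A[r + a][r + b] += lamb * weights[a] * weights[b]
    trend = solve(A, y)
    cycle = [yi - ti for yi, ti in zip(y, trend)]
    return cycle, trend


# A straight line has zero second differences, so it is all trend.
line = [2.0 * t + 1.0 for t in range(20)]
cycle, trend = hp_filter(line)
```

Larger values of `lamb` penalize curvature more heavily and give a smoother trend; a production version would exploit the banded structure rather than a dense solve.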
  • 24. Filtering. In addition to the HP filter, two other filters popular in finance and economics, Baxter-King and Christiano-Fitzgerald, are available. We refer you to our paper and the documentation for details on these. [Figure: two panels, “Inflation and Unemployment: BK Filtered” and “Inflation and Unemployment: CF Filtered”, each plotting INFL and UNEMP]
  • 25. Preview: Bayesian dynamic linear models (DLM). A state space model by another name: $y_t = F_t \theta_t + \nu_t$, $\nu_t \sim N(0, V_t)$; $\theta_t = G \theta_{t-1} + \omega_t$, $\omega_t \sim N(0, W_t)$. Estimation of the basic model is by Kalman filter recursions. Provides an elegant way to do time-varying linear regressions for forecasting. Extensions: multivariate DLMs, stochastic volatility (SV) models, MCMC-based posterior sampling, mixtures of DLMs.
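For intuition, the Kalman recursions can be written out for the simplest DLM, a scalar local-level model with $F_t = 1$ and $G = 1$, and constant known variances. This is a sketch under those assumptions, not the DLM class previewed in the talk; `local_level_filter` and its arguments are hypothetical. The optional `discount` argument replaces the explicit state variance W by inflating the prior variance, R_t = C_{t-1} / discount, in the style of discount-factor DLMs.

```python
def local_level_filter(y, m0, C0, V, W=0.0, discount=None):
    """Kalman filter for y_t = theta_t + nu_t, theta_t = theta_{t-1} + omega_t.

    Returns the filtered state means m_t and variances C_t.
    """
    m, C = m0, C0
    means, variances = [], []
    for obs in y:
        a = m                                        # prior mean of the state
        R = C / discount if discount else C + W      # prior variance of the state
        f = a                                        # one-step forecast
        Q = R + V                                    # forecast variance
        gain = R / Q                                 # adaptive coefficient
        m = a + gain * (obs - f)                     # posterior mean
        C = R - gain * gain * Q                      # posterior variance
        means.append(m)
        variances.append(C)
    return means, variances


# Illustrative data (an assumption): a level that shifts halfway through
y = [1.0] * 10 + [3.0] * 10
means, variances = local_level_filter(y, m0=0.0, C0=1.0, V=1.0, discount=0.9)
```

A discount factor closer to one makes the filtered level adapt more slowly; the multivariate case replaces the scalars with vectors and matrices but keeps the same steps.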
  • 26. Preview: DLM Example (Constant+Trend model)
      model = Polynomial(2)
      dlm = DLM(close_px['AAPL'], model.F, G=model.G,  # model
                m0=m0, C0=C0, n0=n0, s0=s0,            # priors
                state_discount=.95)                    # discount factor
    [Figure: Constant + Trend DLM]
  • 27. Preview: Stochastic volatility models [Figure: JPY-USD Exchange Rate Volatility Process]
  • 28. Future: sandbox and beyond. ARCH / GARCH models for volatility; structural VAR and error correction models (ECM) for cointegrated processes; models with non-normally distributed errors; better data description, visualization, and interactive research tools; more sophisticated Bayesian time series models.
  • 29. Conclusions. We’ve implemented many foundational models for time series analysis, but the field is very broad. User interface can and should be much improved. Repo: http://github.com/statsmodels/statsmodels Docs: http://statsmodels.sourceforge.net Contact: [email protected]