diff --git a/.circleci/config.yml b/.circleci/config.yml
new file mode 100644
index 0000000000000..dc357101e79fd
--- /dev/null
+++ b/.circleci/config.yml
@@ -0,0 +1,21 @@
+version: 2.1
+
+jobs:
+ test-arm:
+ machine:
+ image: ubuntu-2004:202101-01
+ resource_class: arm.medium
+ environment:
+ ENV_FILE: ci/deps/circle-38-arm64.yaml
+ PYTEST_WORKERS: auto
+ PATTERN: "not slow and not network and not clipboard and not arm_slow"
+ PYTEST_TARGET: "pandas"
+ steps:
+ - checkout
+ - run: ci/setup_env.sh
+ - run: PATH=$HOME/miniconda3/envs/pandas-dev/bin:$HOME/miniconda3/condabin:$PATH ci/run_tests.sh
+
+workflows:
+ test:
+ jobs:
+ - test-arm
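
The `PATTERN`, `PYTEST_WORKERS`, and `PYTEST_TARGET` variables above are consumed by `ci/run_tests.sh`. Roughly, they translate into a pytest call like the following sketch (simplified; the real script also adds coverage and reporting flags):

```bash
# Simplified sketch of the invocation ci/run_tests.sh builds from the
# environment above; the actual script adds coverage/reporting options.
PATTERN="not slow and not network and not clipboard and not arm_slow"
PYTEST_WORKERS=auto
PYTEST_TARGET=pandas
pytest -m "$PATTERN" -n "$PYTEST_WORKERS" "$PYTEST_TARGET"
```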
diff --git a/.github/CONTRIBUTING.md b/.github/CONTRIBUTING.md
index 49200523df40f..d27eab5b9c95c 100644
--- a/.github/CONTRIBUTING.md
+++ b/.github/CONTRIBUTING.md
@@ -1,23 +1,3 @@
# Contributing to pandas
-Whether you are a novice or experienced software developer, all contributions and suggestions are welcome!
-
-Our main contributing guide can be found [in this repo](https://github.com/pandas-dev/pandas/blob/master/doc/source/development/contributing.rst) or [on the website](https://pandas.pydata.org/docs/dev/development/contributing.html). If you do not want to read it in its entirety, we will summarize the main ways in which you can contribute and point to relevant sections of that document for further information.
-
-## Getting Started
-
-If you are looking to contribute to the *pandas* codebase, the best place to start is the [GitHub "issues" tab](https://github.com/pandas-dev/pandas/issues). This is also a great place for filing bug reports and making suggestions for ways in which we can improve the code and documentation.
-
-If you have additional questions, feel free to ask them on the [mailing list](https://groups.google.com/forum/?fromgroups#!forum/pydata) or on [Gitter](https://gitter.im/pydata/pandas). Further information can also be found in the "[Where to start?](https://github.com/pandas-dev/pandas/blob/master/doc/source/development/contributing.rst#where-to-start)" section.
-
-## Filing Issues
-
-If you notice a bug in the code or documentation, or have suggestions for how we can improve either, feel free to create an issue on the [GitHub "issues" tab](https://github.com/pandas-dev/pandas/issues) using [GitHub's "issue" form](https://github.com/pandas-dev/pandas/issues/new). The form contains some questions that will help us best address your issue. For more information regarding how to file issues against *pandas*, please refer to the "[Bug reports and enhancement requests](https://github.com/pandas-dev/pandas/blob/master/doc/source/development/contributing.rst#bug-reports-and-enhancement-requests)" section.
-
-## Contributing to the Codebase
-
-The code is hosted on [GitHub](https://www.github.com/pandas-dev/pandas), so you will need to use [Git](https://git-scm.com/) to clone the project and make changes to the codebase. Once you have obtained a copy of the code, you should create a development environment that is separate from your existing Python environment so that you can make and test changes without compromising your own work environment. For more information, please refer to the "[Working with the code](https://github.com/pandas-dev/pandas/blob/master/doc/source/development/contributing.rst#working-with-the-code)" section.
-
-Before submitting your changes for review, make sure to check that your changes do not break any tests. You can find more information about our test suites in the "[Test-driven development/code writing](https://github.com/pandas-dev/pandas/blob/master/doc/source/development/contributing.rst#test-driven-development-code-writing)" section. We also have guidelines regarding coding style that will be enforced during testing, which can be found in the "[Code standards](https://github.com/pandas-dev/pandas/blob/master/doc/source/development/contributing.rst#code-standards)" section.
-
-Once your changes are ready to be submitted, make sure to push your changes to GitHub before creating a pull request. Details about how to do that can be found in the "[Contributing your changes to pandas](https://github.com/pandas-dev/pandas/blob/master/doc/source/development/contributing.rst#contributing-your-changes-to-pandas)" section. We will review your changes, and you will most likely be asked to make additional changes before it is finally ready to merge. However, once it's ready, we will merge it, and you will have successfully contributed to the codebase!
+A detailed overview on how to contribute can be found in the **[contributing guide](https://pandas.pydata.org/docs/dev/development/contributing.html)**.
diff --git a/.github/ISSUE_TEMPLATE/bug_report.md b/.github/ISSUE_TEMPLATE/bug_report.md
deleted file mode 100644
index 765c1b8bff62e..0000000000000
--- a/.github/ISSUE_TEMPLATE/bug_report.md
+++ /dev/null
@@ -1,39 +0,0 @@
----
-
-name: Bug Report
-about: Create a bug report to help us improve pandas
-title: "BUG:"
-labels: "Bug, Needs Triage"
-
----
-
-- [ ] I have checked that this issue has not already been reported.
-
-- [ ] I have confirmed this bug exists on the latest version of pandas.
-
-- [ ] (optional) I have confirmed this bug exists on the master branch of pandas.
-
----
-
-**Note**: Please read [this guide](https://matthewrocklin.com/blog/work/2018/02/28/minimal-bug-reports) detailing how to provide the necessary information for us to reproduce your bug.
-
-#### Code Sample, a copy-pastable example
-
-```python
-# Your code here
-
-```
-
-#### Problem description
-
-[this should explain **why** the current behaviour is a problem and why the expected output is a better solution]
-
-#### Expected Output
-
-#### Output of ``pd.show_versions()``
-
-
-
-[paste the output of ``pd.show_versions()`` here leaving a blank line after the details tag]
-
-
diff --git a/.github/ISSUE_TEMPLATE/bug_report.yaml b/.github/ISSUE_TEMPLATE/bug_report.yaml
new file mode 100644
index 0000000000000..36bc8dcf02bae
--- /dev/null
+++ b/.github/ISSUE_TEMPLATE/bug_report.yaml
@@ -0,0 +1,68 @@
+name: Bug Report
+description: Report incorrect behavior in the pandas library
+title: "BUG: "
+labels: [Bug, Needs Triage]
+
+body:
+ - type: checkboxes
+ id: checks
+ attributes:
+ label: Pandas version checks
+ options:
+ - label: >
+ I have checked that this issue has not already been reported.
+ required: true
+ - label: >
+ I have confirmed this bug exists on the
+ [latest version](https://pandas.pydata.org/docs/whatsnew/index.html) of pandas.
+ required: true
+ - label: >
+ I have confirmed this bug exists on the main branch of pandas.
+ - type: textarea
+ id: example
+ attributes:
+ label: Reproducible Example
+ description: >
+ Please follow [this guide](https://matthewrocklin.com/blog/work/2018/02/28/minimal-bug-reports) on how to
+ provide a minimal, copy-pastable example.
+ placeholder: >
+ import pandas as pd
+
+ df = pd.DataFrame(range(5))
+
+ ...
+ render: python
+ validations:
+ required: true
+ - type: textarea
+ id: problem
+ attributes:
+ label: Issue Description
+ description: >
+ Please provide a description of the issue shown in the reproducible example.
+ validations:
+ required: true
+ - type: textarea
+ id: expected-behavior
+ attributes:
+ label: Expected Behavior
+ description: >
+ Please describe or show a code example of the expected behavior.
+ validations:
+ required: true
+ - type: textarea
+ id: version
+ attributes:
+ label: Installed Versions
+ description: >
+ Please paste the output of ``pd.show_versions()``
+ value: >
+
+
+
+ Replace this line with the output of pd.show_versions()
+
+
+
+ validations:
+ required: true
diff --git a/.github/ISSUE_TEMPLATE/documentation_improvement.md b/.github/ISSUE_TEMPLATE/documentation_improvement.md
deleted file mode 100644
index 3351ff9581121..0000000000000
--- a/.github/ISSUE_TEMPLATE/documentation_improvement.md
+++ /dev/null
@@ -1,22 +0,0 @@
----
-
-name: Documentation Improvement
-about: Report wrong or missing documentation
-title: "DOC:"
-labels: "Docs, Needs Triage"
-
----
-
-#### Location of the documentation
-
-[this should provide the location of the documentation, e.g. "pandas.read_csv" or the URL of the documentation, e.g. "https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html"]
-
-**Note**: You can check the latest versions of the docs on `master` [here](https://pandas.pydata.org/docs/dev/).
-
-#### Documentation problem
-
-[this should provide a description of what documentation you believe needs to be fixed/improved]
-
-#### Suggested fix for documentation
-
-[this should explain the suggested fix and **why** it's better than the existing documentation]
diff --git a/.github/ISSUE_TEMPLATE/documentation_improvement.yaml b/.github/ISSUE_TEMPLATE/documentation_improvement.yaml
new file mode 100644
index 0000000000000..b89600f8598e7
--- /dev/null
+++ b/.github/ISSUE_TEMPLATE/documentation_improvement.yaml
@@ -0,0 +1,41 @@
+name: Documentation Improvement
+description: Report wrong or missing documentation
+title: "DOC: "
+labels: [Docs, Needs Triage]
+
+body:
+ - type: checkboxes
+ attributes:
+ label: Pandas version checks
+ options:
+ - label: >
+ I have checked that the issue still exists on the latest versions of the docs
+ on `main` [here](https://pandas.pydata.org/docs/dev/)
+ required: true
+ - type: textarea
+ id: location
+ attributes:
+ label: Location of the documentation
+ description: >
+ Please provide the location of the documentation, e.g. "pandas.read_csv" or the
+ URL of the documentation, e.g.
+ "https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html"
+ placeholder: https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html
+ validations:
+ required: true
+ - type: textarea
+ id: problem
+ attributes:
+ label: Documentation problem
+ description: >
+ Please provide a description of what documentation you believe needs to be fixed/improved
+ validations:
+ required: true
+ - type: textarea
+ id: suggested-fix
+ attributes:
+ label: Suggested fix for documentation
+ description: >
+ Please explain the suggested fix and **why** it's better than the existing documentation
+ validations:
+ required: true
diff --git a/.github/ISSUE_TEMPLATE/installation_issue.yaml b/.github/ISSUE_TEMPLATE/installation_issue.yaml
new file mode 100644
index 0000000000000..a80269ff0f12d
--- /dev/null
+++ b/.github/ISSUE_TEMPLATE/installation_issue.yaml
@@ -0,0 +1,66 @@
+name: Installation Issue
+description: Report issues installing the pandas library on your system
+title: "BUILD: "
+labels: [Build, Needs Triage]
+
+body:
+ - type: checkboxes
+ id: checks
+ attributes:
+ label: Installation check
+ options:
+ - label: >
+ I have read the [installation guide](https://pandas.pydata.org/pandas-docs/stable/getting_started/install.html#installing-pandas).
+ required: true
+ - type: input
+ id: platform
+ attributes:
+ label: Platform
+ description: >
+ Please provide the output of ``import platform; print(platform.platform())``
+ validations:
+ required: true
+ - type: dropdown
+ id: method
+ attributes:
+ label: Installation Method
+ description: >
+ Please select how you tried to install pandas from a clean environment.
+ options:
+ - pip install
+ - conda install
+ - apt-get install
+ - Built from source
+ - Other
+ validations:
+ required: true
+ - type: input
+ id: pandas
+ attributes:
+ label: pandas Version
+ description: >
+ Please provide the version of pandas you are trying to install.
+ validations:
+ required: true
+ - type: input
+ id: python
+ attributes:
+ label: Python Version
+ description: >
+ Please provide the installed version of Python.
+ validations:
+ required: true
+ - type: textarea
+ id: logs
+ attributes:
+ label: Installation Logs
+ description: >
+ If possible, please copy and paste the installation logs when attempting to install pandas.
+ value: >
+
+
+
+ Replace this line with the installation logs.
+
+
+
diff --git a/.github/ISSUE_TEMPLATE/performance_issue.yaml b/.github/ISSUE_TEMPLATE/performance_issue.yaml
new file mode 100644
index 0000000000000..096e012f4ee0f
--- /dev/null
+++ b/.github/ISSUE_TEMPLATE/performance_issue.yaml
@@ -0,0 +1,53 @@
+name: Performance Issue
+description: Report slow performance or memory issues when running pandas code
+title: "PERF: "
+labels: [Performance, Needs Triage]
+
+body:
+ - type: checkboxes
+ id: checks
+ attributes:
+ label: Pandas version checks
+ options:
+ - label: >
+ I have checked that this issue has not already been reported.
+ required: true
+ - label: >
+ I have confirmed this issue exists on the
+ [latest version](https://pandas.pydata.org/docs/whatsnew/index.html) of pandas.
+ required: true
+ - label: >
+ I have confirmed this issue exists on the main branch of pandas.
+ - type: textarea
+ id: example
+ attributes:
+ label: Reproducible Example
+ description: >
+ Please provide a minimal, copy-pastable example that quantifies
+ [slow runtime](https://docs.python.org/3/library/timeit.html) or
+ [memory](https://pypi.org/project/memory-profiler/) issues.
+ validations:
+ required: true
+ - type: textarea
+ id: version
+ attributes:
+ label: Installed Versions
+ description: >
+ Please paste the output of ``pd.show_versions()``
+ value: >
+
+
+
+ Replace this line with the output of pd.show_versions()
+
+
+
+ validations:
+ required: true
+ - type: textarea
+ id: prior-performance
+ attributes:
+ label: Prior Performance
+ description: >
+ If applicable, please provide the prior version of pandas and output
+ of the same reproducible example where the performance issue did not exist.
diff --git a/.github/ISSUE_TEMPLATE/submit_question.md b/.github/ISSUE_TEMPLATE/submit_question.md
deleted file mode 100644
index 9b48918ff2f6d..0000000000000
--- a/.github/ISSUE_TEMPLATE/submit_question.md
+++ /dev/null
@@ -1,24 +0,0 @@
----
-
-name: Submit Question
-about: Ask a general question about pandas
-title: "QST:"
-labels: "Usage Question, Needs Triage"
-
----
-
-- [ ] I have searched the [[pandas] tag](https://stackoverflow.com/questions/tagged/pandas) on StackOverflow for similar questions.
-
-- [ ] I have asked my usage related question on [StackOverflow](https://stackoverflow.com).
-
----
-
-#### Question about pandas
-
-**Note**: If you'd still like to submit a question, please read [this guide](
-https://matthewrocklin.com/blog/work/2018/02/28/minimal-bug-reports) detailing how to provide the necessary information for us to reproduce your question.
-
-```python
-# Your code here, if applicable
-
-```
diff --git a/.github/ISSUE_TEMPLATE/submit_question.yml b/.github/ISSUE_TEMPLATE/submit_question.yml
new file mode 100644
index 0000000000000..6f73041b0f527
--- /dev/null
+++ b/.github/ISSUE_TEMPLATE/submit_question.yml
@@ -0,0 +1,44 @@
+name: Submit Question
+description: Ask a general question about pandas
+title: "QST: "
+labels: [Usage Question, Needs Triage]
+
+body:
+ - type: markdown
+ attributes:
+ value: >
+ Since [StackOverflow](https://stackoverflow.com) is better suited to answering
+ usage questions, we ask that all usage questions be asked there first.
+ - type: checkboxes
+ attributes:
+ label: Research
+ options:
+ - label: >
+ I have searched the [[pandas] tag](https://stackoverflow.com/questions/tagged/pandas)
+ on StackOverflow for similar questions.
+ required: true
+ - label: >
+ I have asked my usage related question on [StackOverflow](https://stackoverflow.com).
+ required: true
+ - type: input
+ id: question-link
+ attributes:
+ label: Link to question on StackOverflow
+ validations:
+ required: true
+ - type: markdown
+ attributes:
+ value: ---
+ - type: textarea
+ id: question
+ attributes:
+ label: Question about pandas
+ description: >
+ **Note**: If you'd still like to submit a question, please read [this guide](
+ https://matthewrocklin.com/blog/work/2018/02/28/minimal-bug-reports) detailing
+ how to provide the necessary information for us to reproduce your question.
+ placeholder: |
+ ```python
+ # Your code here, if applicable
+
+ ```
diff --git a/.github/PULL_REQUEST_TEMPLATE.md b/.github/PULL_REQUEST_TEMPLATE.md
index 7fb5a6ddf2024..42017db8a05b1 100644
--- a/.github/PULL_REQUEST_TEMPLATE.md
+++ b/.github/PULL_REQUEST_TEMPLATE.md
@@ -1,4 +1,4 @@
- [ ] closes #xxxx
- [ ] tests added / passed
-- [ ] Ensure all linting tests pass, see [here](https://pandas.pydata.org/pandas-docs/dev/development/contributing.html#code-standards) for how to run them
+- [ ] Ensure all linting tests pass, see [here](https://pandas.pydata.org/pandas-docs/dev/development/contributing_codebase.html#pre-commit) for how to run them
- [ ] whatsnew entry
diff --git a/.github/actions/build_pandas/action.yml b/.github/actions/build_pandas/action.yml
index d4777bcd1d079..2e4bfea165316 100644
--- a/.github/actions/build_pandas/action.yml
+++ b/.github/actions/build_pandas/action.yml
@@ -13,5 +13,5 @@ runs:
- name: Build Pandas
run: |
python setup.py build_ext -j 2
- python -m pip install -e . --no-build-isolation --no-use-pep517
+ python -m pip install -e . --no-build-isolation --no-use-pep517 --no-index
shell: bash -l {0}
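
The added `--no-index` is a guard rather than an optimization: it removes PyPI from pip's resolver entirely, so the editable install can only use build dependencies already present in the conda environment. The effect is easy to see in isolation (the package name below is hypothetical):

```bash
# With --no-index, anything not already available locally fails fast
# instead of being silently downloaded from PyPI.
python -m pip install --no-index some-package-not-in-the-env  # errors out
```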
diff --git a/.github/workflows/asv-bot.yml b/.github/workflows/asv-bot.yml
new file mode 100644
index 0000000000000..f3946aeb84a63
--- /dev/null
+++ b/.github/workflows/asv-bot.yml
@@ -0,0 +1,81 @@
+name: "ASV Bot"
+
+on:
+ issue_comment: # Pull requests are issues
+ types:
+ - created
+
+env:
+ ENV_FILE: environment.yml
+ COMMENT: ${{github.event.comment.body}}
+
+jobs:
+ autotune:
+ name: "Run benchmarks"
+ # TODO: Support more benchmarking options later: against different branches, against self, etc.
+ if: startsWith(github.event.comment.body, '@github-actions benchmark')
+ runs-on: ubuntu-latest
+ defaults:
+ run:
+ shell: bash -l {0}
+
+ concurrency:
+ # Set concurrency to prevent abuse (full runs are ~5.5 hours!):
+ # each user can only run one concurrent benchmark bot at a time.
+ # We don't cancel in-progress jobs, so benchmarking multiple PRs
+ # means waiting for the previous run to finish.
+ group: ${{ github.actor }}-asv
+ cancel-in-progress: false
+
+ steps:
+ - name: Checkout
+ uses: actions/checkout@v2
+ with:
+ fetch-depth: 0
+
+ - name: Cache conda
+ uses: actions/cache@v2
+ with:
+ path: ~/conda_pkgs_dir
+ key: ${{ runner.os }}-conda-${{ hashFiles('${{ env.ENV_FILE }}') }}
+
+ # Although asv sets up its own env, deps are still needed
+ # during the discovery process
+ - uses: conda-incubator/setup-miniconda@v2
+ with:
+ activate-environment: pandas-dev
+ channel-priority: strict
+ environment-file: ${{ env.ENV_FILE }}
+ use-only-tar-bz2: true
+
+ - name: Run benchmarks
+ id: bench
+ continue-on-error: true # Not a real failure: asv exits with code 1 when it detects regressions
+ run: |
+ # extracting the regex, see https://stackoverflow.com/a/36798723
+ REGEX=$(echo "$COMMENT" | sed -n "s/^.*-b\s*\(\S*\).*$/\1/p")
+ cd asv_bench
+ asv check -E existing
+ git remote add upstream https://github.com/pandas-dev/pandas.git
+ git fetch upstream
+ asv machine --yes
+ asv continuous -f 1.1 -b $REGEX upstream/main HEAD
+ echo 'BENCH_OUTPUT<<EOF' >> $GITHUB_ENV
+ asv compare -f 1.1 upstream/main HEAD >> $GITHUB_ENV
+ echo 'EOF' >> $GITHUB_ENV
+ echo "REGEX=$REGEX" >> $GITHUB_ENV
+
+ - uses: actions/github-script@v5
+ env:
+ BENCH_OUTPUT: ${{env.BENCH_OUTPUT}}
+ REGEX: ${{env.REGEX}}
+ with:
+ script: |
+ const ENV_VARS = process.env
+ const run_url = `https://github.com/${context.repo.owner}/${context.repo.repo}/actions/runs/${context.runId}`
+ github.rest.issues.createComment({
+ issue_number: context.issue.number,
+ owner: context.repo.owner,
+ repo: context.repo.repo,
+ body: '\nBenchmarks completed. View runner logs here: ' + run_url + '\nRegex used: ' + ENV_VARS["REGEX"] + '\n' + ENV_VARS["BENCH_OUTPUT"]
+ })
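
Two shell idioms in the "Run benchmarks" step are worth unpacking: the `sed` call extracts the token following `-b` from the triggering comment, and the `BENCH_OUTPUT<<EOF` / `EOF` pair is GitHub Actions' heredoc convention for writing a multi-line value into `$GITHUB_ENV`. A sketch (the example comment is hypothetical):

```bash
# 1) Regex extraction, with the sed command copied from the step above:
COMMENT='@github-actions benchmark -b groupby'
REGEX=$(echo "$COMMENT" | sed -n "s/^.*-b\s*\(\S*\).*$/\1/p")
echo "$REGEX"  # -> groupby

# 2) Multi-line environment values: Actions reads NAME<<DELIMITER ... DELIMITER
#    from the file pointed to by $GITHUB_ENV.
echo 'BENCH_OUTPUT<<EOF' >> "$GITHUB_ENV"
echo 'multi-line asv compare output goes here' >> "$GITHUB_ENV"
echo 'EOF' >> "$GITHUB_ENV"
```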
diff --git a/.github/workflows/autoupdate-pre-commit-config.yml b/.github/workflows/autoupdate-pre-commit-config.yml
index 801e063f72726..3696cba8cf2e6 100644
--- a/.github/workflows/autoupdate-pre-commit-config.yml
+++ b/.github/workflows/autoupdate-pre-commit-config.yml
@@ -2,7 +2,7 @@ name: "Update pre-commit config"
on:
schedule:
- - cron: "0 7 * * 1" # At 07:00 on each Monday.
+ - cron: "0 7 1 * *" # At 07:00 on 1st of every month.
workflow_dispatch:
jobs:
diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml
deleted file mode 100644
index a5a802c678e20..0000000000000
--- a/.github/workflows/ci.yml
+++ /dev/null
@@ -1,171 +0,0 @@
-name: CI
-
-on:
- push:
- branches: [master]
- pull_request:
- branches:
- - master
- - 1.2.x
-
-env:
- ENV_FILE: environment.yml
- PANDAS_CI: 1
-
-jobs:
- checks:
- name: Checks
- runs-on: ubuntu-latest
- defaults:
- run:
- shell: bash -l {0}
-
- steps:
- - name: Checkout
- uses: actions/checkout@v2
- with:
- fetch-depth: 0
-
- - name: Looking for unwanted patterns
- run: ci/code_checks.sh patterns
- if: always()
-
- - name: Cache conda
- uses: actions/cache@v2
- with:
- path: ~/conda_pkgs_dir
- key: ${{ runner.os }}-conda-${{ hashFiles('${{ env.ENV_FILE }}') }}
-
- - uses: conda-incubator/setup-miniconda@v2
- with:
- activate-environment: pandas-dev
- channel-priority: strict
- environment-file: ${{ env.ENV_FILE }}
- use-only-tar-bz2: true
-
- - name: Build Pandas
- uses: ./.github/actions/build_pandas
-
- - name: Linting
- run: ci/code_checks.sh lint
- if: always()
-
- - name: Checks on imported code
- run: ci/code_checks.sh code
- if: always()
-
- - name: Running doctests
- run: ci/code_checks.sh doctests
- if: always()
-
- - name: Docstring validation
- run: ci/code_checks.sh docstrings
- if: always()
-
- - name: Typing validation
- run: ci/code_checks.sh typing
- if: always()
-
- - name: Testing docstring validation script
- run: pytest scripts
- if: always()
-
- - name: Running benchmarks
- run: |
- cd asv_bench
- asv check -E existing
- git remote add upstream https://github.com/pandas-dev/pandas.git
- git fetch upstream
- asv machine --yes
- asv dev | sed "/failed$/ s/^/##[error]/" | tee benchmarks.log
- if grep "failed" benchmarks.log > /dev/null ; then
- exit 1
- fi
- if: always()
-
- - name: Publish benchmarks artifact
- uses: actions/upload-artifact@master
- with:
- name: Benchmarks log
- path: asv_bench/benchmarks.log
- if: failure()
-
- web_and_docs:
- name: Web and docs
- runs-on: ubuntu-latest
- steps:
-
- - name: Checkout
- uses: actions/checkout@v2
- with:
- fetch-depth: 0
-
- - name: Set up pandas
- uses: ./.github/actions/setup
-
- - name: Build website
- run: |
- source activate pandas-dev
- python web/pandas_web.py web/pandas --target-path=web/build
- - name: Build documentation
- run: |
- source activate pandas-dev
- doc/make.py --warnings-are-errors | tee sphinx.log ; exit ${PIPESTATUS[0]}
-
- # This can be removed when the ipython directive fails when there are errors,
- # including the `tee sphinx.log` in the previous step (https://github.com/ipython/ipython/issues/11547)
- - name: Check ipython directive errors
- run: "! grep -B10 \"^<<<-------------------------------------------------------------------------$\" sphinx.log"
-
- - name: Install ssh key
- run: |
- mkdir -m 700 -p ~/.ssh
- echo "${{ secrets.server_ssh_key }}" > ~/.ssh/id_rsa
- chmod 600 ~/.ssh/id_rsa
- echo "${{ secrets.server_ip }} ecdsa-sha2-nistp256 AAAAE2VjZHNhLXNoYTItbmlzdHAyNTYAAAAIbmlzdHAyNTYAAABBBE1Kkopomm7FHG5enATf7SgnpICZ4W2bw+Ho+afqin+w7sMcrsa0je7sbztFAV8YchDkiBKnWTG4cRT+KZgZCaY=" > ~/.ssh/known_hosts
- if: github.event_name == 'push'
-
- - name: Upload web
- run: rsync -az --delete --exclude='pandas-docs' --exclude='docs' --exclude='Pandas_Cheat_Sheet*' web/build/ docs@${{ secrets.server_ip }}:/usr/share/nginx/pandas
- if: github.event_name == 'push'
-
- - name: Upload dev docs
- run: rsync -az --delete doc/build/html/ docs@${{ secrets.server_ip }}:/usr/share/nginx/pandas/pandas-docs/dev
- if: github.event_name == 'push'
-
- - name: Move docs into site directory
- run: mv doc/build/html web/build/docs
- - name: Save website as an artifact
- uses: actions/upload-artifact@v2
- with:
- name: website
- path: web/build
- retention-days: 14
-
- data_manager:
- name: Test experimental data manager
- runs-on: ubuntu-latest
- strategy:
- matrix:
- pattern: ["not slow and not network and not clipboard", "slow"]
- steps:
-
- - name: Checkout
- uses: actions/checkout@v2
- with:
- fetch-depth: 0
-
- - name: Set up pandas
- uses: ./.github/actions/setup
-
- - name: Run tests
- env:
- PANDAS_DATA_MANAGER: array
- PATTERN: ${{ matrix.pattern }}
- PYTEST_WORKERS: "auto"
- run: |
- source activate pandas-dev
- ci/run_tests.sh
-
- - name: Print skipped tests
- run: python ci/print_skipped.py
diff --git a/.github/workflows/code-checks.yml b/.github/workflows/code-checks.yml
new file mode 100644
index 0000000000000..7141b02cac376
--- /dev/null
+++ b/.github/workflows/code-checks.yml
@@ -0,0 +1,158 @@
+name: Code Checks
+
+on:
+ push:
+ branches:
+ - main
+ - 1.4.x
+ pull_request:
+ branches:
+ - main
+ - 1.4.x
+
+env:
+ ENV_FILE: environment.yml
+ PANDAS_CI: 1
+
+jobs:
+ pre_commit:
+ name: pre-commit
+ runs-on: ubuntu-latest
+ concurrency:
+ # https://github.community/t/concurrecy-not-work-for-push/183068/7
+ group: ${{ github.event_name == 'push' && github.run_number || github.ref }}-pre-commit
+ cancel-in-progress: true
+ steps:
+ - name: Checkout
+ uses: actions/checkout@v2
+
+ - name: Install Python
+ uses: actions/setup-python@v2
+ with:
+ python-version: '3.9.7'
+
+ - name: Run pre-commit
+ uses: pre-commit/action@v2.0.3
+
+ typing_and_docstring_validation:
+ name: Docstring and typing validation
+ runs-on: ubuntu-latest
+ defaults:
+ run:
+ shell: bash -l {0}
+
+ concurrency:
+ # https://github.community/t/concurrecy-not-work-for-push/183068/7
+ group: ${{ github.event_name == 'push' && github.run_number || github.ref }}-code-checks
+ cancel-in-progress: true
+
+ steps:
+ - name: Checkout
+ uses: actions/checkout@v2
+ with:
+ fetch-depth: 0
+
+ - name: Cache conda
+ uses: actions/cache@v2
+ with:
+ path: ~/conda_pkgs_dir
+ key: ${{ runner.os }}-conda-${{ hashFiles('${{ env.ENV_FILE }}') }}
+
+ - uses: conda-incubator/setup-miniconda@v2
+ with:
+ mamba-version: "*"
+ channels: conda-forge
+ activate-environment: pandas-dev
+ channel-priority: strict
+ environment-file: ${{ env.ENV_FILE }}
+ use-only-tar-bz2: true
+
+ - name: Install node.js (for pyright)
+ uses: actions/setup-node@v2
+ with:
+ node-version: "16"
+
+ - name: Install pyright
+ # note: keep version in sync with .pre-commit-config.yaml
+ run: npm install -g pyright@1.1.202
+
+ - name: Build Pandas
+ id: build
+ uses: ./.github/actions/build_pandas
+
+ - name: Run checks on imported code
+ run: ci/code_checks.sh code
+ if: ${{ steps.build.outcome == 'success' }}
+
+ - name: Run doctests
+ run: ci/code_checks.sh doctests
+ if: ${{ steps.build.outcome == 'success' }}
+
+ - name: Run docstring validation
+ run: ci/code_checks.sh docstrings
+ if: ${{ steps.build.outcome == 'success' }}
+
+ - name: Run typing validation
+ run: ci/code_checks.sh typing
+ if: ${{ steps.build.outcome == 'success' }}
+
+ - name: Run docstring validation script tests
+ run: pytest scripts
+ if: ${{ steps.build.outcome == 'success' }}
+
+ asv-benchmarks:
+ name: ASV Benchmarks
+ runs-on: ubuntu-latest
+ defaults:
+ run:
+ shell: bash -l {0}
+
+ concurrency:
+ # https://github.community/t/concurrecy-not-work-for-push/183068/7
+ group: ${{ github.event_name == 'push' && github.run_number || github.ref }}-asv-benchmarks
+ cancel-in-progress: true
+
+ steps:
+ - name: Checkout
+ uses: actions/checkout@v2
+ with:
+ fetch-depth: 0
+
+ - name: Cache conda
+ uses: actions/cache@v2
+ with:
+ path: ~/conda_pkgs_dir
+ key: ${{ runner.os }}-conda-${{ hashFiles('${{ env.ENV_FILE }}') }}
+
+ - uses: conda-incubator/setup-miniconda@v2
+ with:
+ mamba-version: "*"
+ channels: conda-forge
+ activate-environment: pandas-dev
+ channel-priority: strict
+ environment-file: ${{ env.ENV_FILE }}
+ use-only-tar-bz2: true
+
+ - name: Build Pandas
+ id: build
+ uses: ./.github/actions/build_pandas
+
+ - name: Run ASV benchmarks
+ run: |
+ cd asv_bench
+ asv check -E existing
+ git remote add upstream https://github.com/pandas-dev/pandas.git
+ git fetch upstream
+ asv machine --yes
+ asv dev | sed "/failed$/ s/^/##[error]/" | tee benchmarks.log
+ if grep "failed" benchmarks.log > /dev/null ; then
+ exit 1
+ fi
+ if: ${{ steps.build.outcome == 'success' }}
+
+ - name: Publish benchmarks artifact
+ uses: actions/upload-artifact@v2
+ with:
+ name: Benchmarks log
+ path: asv_bench/benchmarks.log
+ if: failure()
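
The `group` keys above lean on GitHub's `&&`/`||` expression chaining to emulate a ternary: a push gets a unique group (its run number), so pushes never cancel each other, while a pull request reuses its ref, so a new push to the PR cancels the stale run. A bash analogue of the same logic:

```bash
# Bash analogue of the expression
#   ${{ github.event_name == 'push' && github.run_number || github.ref }}
if [ "$GITHUB_EVENT_NAME" = "push" ]; then
  GROUP="$GITHUB_RUN_NUMBER"  # unique per run: pushes never cancel each other
else
  GROUP="$GITHUB_REF"         # shared per PR: new pushes cancel stale runs
fi
echo "concurrency group: ${GROUP}-pre-commit"
```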
diff --git a/.github/workflows/comment_bot.yml b/.github/workflows/comment_bot.yml
index dc396be753269..8f610fd5781ef 100644
--- a/.github/workflows/comment_bot.yml
+++ b/.github/workflows/comment_bot.yml
@@ -13,7 +13,7 @@ jobs:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v2
- - uses: r-lib/actions/pr-fetch@master
+ - uses: r-lib/actions/pr-fetch@v2
with:
repo-token: ${{ secrets.GITHUB_TOKEN }}
- name: Cache multiple paths
@@ -29,12 +29,12 @@ jobs:
- name: Install-pre-commit
run: python -m pip install --upgrade pre-commit
- name: Run pre-commit
- run: pre-commit run --from-ref=origin/master --to-ref=HEAD --all-files || (exit 0)
+ run: pre-commit run --from-ref=origin/main --to-ref=HEAD --all-files || (exit 0)
- name: Commit results
run: |
git config user.name "$(git log -1 --pretty=format:%an)"
git config user.email "$(git log -1 --pretty=format:%ae)"
git commit -a -m 'Fixes from pre-commit [automated commit]' || echo "No changes to commit"
- - uses: r-lib/actions/pr-push@master
+ - uses: r-lib/actions/pr-push@v2
with:
repo-token: ${{ secrets.GITHUB_TOKEN }}
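
The fixer the bot runs is also reproducible locally; `pre-commit`'s `--from-ref`/`--to-ref` flags restrict hooks to files changed between two revisions (this sketch assumes a remote named `upstream` pointing at pandas-dev/pandas):

```bash
# Local equivalent of the bot's pre-commit step.
python -m pip install --upgrade pre-commit
pre-commit run --from-ref=upstream/main --to-ref=HEAD
```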
diff --git a/.github/workflows/database.yml b/.github/workflows/database.yml
deleted file mode 100644
index 292598dfcab73..0000000000000
--- a/.github/workflows/database.yml
+++ /dev/null
@@ -1,106 +0,0 @@
-name: Database
-
-on:
- push:
- branches: [master]
- pull_request:
- branches:
- - master
- - 1.2.x
- paths-ignore:
- - "doc/**"
-
-env:
- PYTEST_WORKERS: "auto"
- PANDAS_CI: 1
- PATTERN: ((not slow and not network and not clipboard) or (single and db))
- COVERAGE: true
-
-jobs:
- Linux_py37_IO:
- runs-on: ubuntu-latest
- defaults:
- run:
- shell: bash -l {0}
-
- strategy:
- matrix:
- ENV_FILE: [ci/deps/actions-37-db-min.yaml, ci/deps/actions-37-db.yaml]
- fail-fast: false
-
- services:
- mysql:
- image: mysql
- env:
- MYSQL_ALLOW_EMPTY_PASSWORD: yes
- MYSQL_DATABASE: pandas
- options: >-
- --health-cmd "mysqladmin ping"
- --health-interval 10s
- --health-timeout 5s
- --health-retries 5
- ports:
- - 3306:3306
-
- postgres:
- image: postgres
- env:
- POSTGRES_USER: postgres
- POSTGRES_PASSWORD: postgres
- POSTGRES_DB: pandas
- options: >-
- --health-cmd pg_isready
- --health-interval 10s
- --health-timeout 5s
- --health-retries 5
- ports:
- - 5432:5432
-
- steps:
- - name: Checkout
- uses: actions/checkout@v2
- with:
- fetch-depth: 0
-
- - name: Cache conda
- uses: actions/cache@v2
- env:
- CACHE_NUMBER: 0
- with:
- path: ~/conda_pkgs_dir
- key: ${{ runner.os }}-conda-${{ env.CACHE_NUMBER }}-${{
- hashFiles('${{ matrix.ENV_FILE }}') }}
-
- - uses: conda-incubator/setup-miniconda@v2
- with:
- activate-environment: pandas-dev
- channel-priority: flexible
- environment-file: ${{ matrix.ENV_FILE }}
- use-only-tar-bz2: true
-
- - name: Build Pandas
- uses: ./.github/actions/build_pandas
-
- - name: Test
- run: pytest -m "${{ env.PATTERN }}" -n 2 --dist=loadfile --cov=pandas --cov-report=xml pandas/tests/io
- if: always()
-
- - name: Build Version
- run: pushd /tmp && python -c "import pandas; pandas.show_versions();" && popd
-
- - name: Publish test results
- uses: actions/upload-artifact@master
- with:
- name: Test results
- path: test-data.xml
- if: failure()
-
- - name: Print skipped tests
- run: python ci/print_skipped.py
-
- - name: Upload coverage to Codecov
- uses: codecov/codecov-action@v1
- with:
- flags: unittests
- name: codecov-pandas
- fail_ci_if_error: true
diff --git a/.github/workflows/datamanger.yml b/.github/workflows/datamanger.yml
new file mode 100644
index 0000000000000..3fc515883a225
--- /dev/null
+++ b/.github/workflows/datamanger.yml
@@ -0,0 +1,57 @@
+name: Data Manager
+
+on:
+ push:
+ branches:
+ - main
+ - 1.4.x
+ pull_request:
+ branches:
+ - main
+ - 1.4.x
+
+env:
+ ENV_FILE: environment.yml
+ PANDAS_CI: 1
+
+jobs:
+ data_manager:
+ name: Test experimental data manager
+ runs-on: ubuntu-latest
+ services:
+ moto:
+ image: motoserver/moto
+ env:
+ AWS_ACCESS_KEY_ID: foobar_key
+ AWS_SECRET_ACCESS_KEY: foobar_secret
+ ports:
+ - 5000:5000
+ strategy:
+ matrix:
+ pattern: ["not slow and not network and not clipboard", "slow"]
+ concurrency:
+ # https://github.community/t/concurrecy-not-work-for-push/183068/7
+ group: ${{ github.event_name == 'push' && github.run_number || github.ref }}-data_manager-${{ matrix.pattern }}
+ cancel-in-progress: true
+
+ steps:
+ - name: Checkout
+ uses: actions/checkout@v2
+ with:
+ fetch-depth: 0
+
+ - name: Set up pandas
+ uses: ./.github/actions/setup
+
+ - name: Run tests
+ env:
+ PANDAS_DATA_MANAGER: array
+ PATTERN: ${{ matrix.pattern }}
+ PYTEST_WORKERS: "auto"
+ PYTEST_TARGET: pandas
+ run: |
+ source activate pandas-dev
+ ci/run_tests.sh
+
+ - name: Print skipped tests
+ run: python ci/print_skipped.py
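
`PANDAS_DATA_MANAGER=array` is the switch that routes the suite through the experimental ArrayManager. One matrix cell can be approximated locally (a sketch, assuming a built and activated pandas-dev environment):

```bash
# Approximate local reproduction of the "not slow ..." matrix cell.
export PANDAS_DATA_MANAGER=array
export PATTERN="not slow and not network and not clipboard"
export PYTEST_WORKERS=auto
export PYTEST_TARGET=pandas
ci/run_tests.sh
```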
diff --git a/.github/workflows/docbuild-and-upload.yml b/.github/workflows/docbuild-and-upload.yml
new file mode 100644
index 0000000000000..e8ed6d4545194
--- /dev/null
+++ b/.github/workflows/docbuild-and-upload.yml
@@ -0,0 +1,77 @@
+name: Doc Build and Upload
+
+on:
+ push:
+ branches:
+ - main
+ - 1.4.x
+ pull_request:
+ branches:
+ - main
+ - 1.4.x
+
+env:
+ ENV_FILE: environment.yml
+ PANDAS_CI: 1
+
+jobs:
+ web_and_docs:
+ name: Doc Build and Upload
+ runs-on: ubuntu-latest
+
+ concurrency:
+ # https://github.community/t/concurrecy-not-work-for-push/183068/7
+ group: ${{ github.event_name == 'push' && github.run_number || github.ref }}-web-docs
+ cancel-in-progress: true
+
+ steps:
+ - name: Checkout
+ uses: actions/checkout@v2
+ with:
+ fetch-depth: 0
+
+ - name: Set up pandas
+ uses: ./.github/actions/setup
+
+ - name: Build website
+ run: |
+ source activate pandas-dev
+ python web/pandas_web.py web/pandas --target-path=web/build
+ - name: Build documentation
+ run: |
+ source activate pandas-dev
+ doc/make.py --warnings-are-errors | tee sphinx.log ; exit ${PIPESTATUS[0]}
+
+ # This step (along with the `tee sphinx.log` in the previous step) can be removed
+ # once the ipython directive fails on errors (https://github.com/ipython/ipython/issues/11547)
+ - name: Check ipython directive errors
+ run: "! grep -B10 \"^<<<-------------------------------------------------------------------------$\" sphinx.log"
+
+ - name: Install ssh key
+ run: |
+ mkdir -m 700 -p ~/.ssh
+ echo "${{ secrets.server_ssh_key }}" > ~/.ssh/id_rsa
+ chmod 600 ~/.ssh/id_rsa
+ echo "${{ secrets.server_ip }} ecdsa-sha2-nistp256 AAAAE2VjZHNhLXNoYTItbmlzdHAyNTYAAAAIbmlzdHAyNTYAAABBBE1Kkopomm7FHG5enATf7SgnpICZ4W2bw+Ho+afqin+w7sMcrsa0je7sbztFAV8YchDkiBKnWTG4cRT+KZgZCaY=" > ~/.ssh/known_hosts
+ if: ${{github.event_name == 'push' && github.ref == 'refs/heads/main'}}
+
+ - name: Copy cheatsheets into site directory
+ run: cp doc/cheatsheet/Pandas_Cheat_Sheet* web/build/
+
+ - name: Upload web
+ run: rsync -az --delete --exclude='pandas-docs' --exclude='docs' web/build/ docs@${{ secrets.server_ip }}:/usr/share/nginx/pandas
+ if: ${{github.event_name == 'push' && github.ref == 'refs/heads/main'}}
+
+ - name: Upload dev docs
+ run: rsync -az --delete doc/build/html/ docs@${{ secrets.server_ip }}:/usr/share/nginx/pandas/pandas-docs/dev
+ if: ${{github.event_name == 'push' && github.ref == 'refs/heads/main'}}
+
+ - name: Move docs into site directory
+ run: mv doc/build/html web/build/docs
+
+ - name: Save website as an artifact
+ uses: actions/upload-artifact@v2
+ with:
+ name: website
+ path: web/build
+ retention-days: 14
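
The `exit ${PIPESTATUS[0]}` in the documentation step matters: a pipeline's exit status is normally that of its last command, so piping through `tee` would otherwise hide a `doc/make.py` failure. A minimal demonstration of the idiom:

```bash
false | tee /tmp/sphinx.log; echo "$?"               # -> 0 (tee's status)
false | tee /tmp/sphinx.log; echo "${PIPESTATUS[0]}" # -> 1 (first command's status)
```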
diff --git a/.github/workflows/posix.yml b/.github/workflows/posix.yml
index cb7d3fb5cabcf..135ca0703de8b 100644
--- a/.github/workflows/posix.yml
+++ b/.github/workflows/posix.yml
@@ -2,11 +2,13 @@ name: Posix
on:
push:
- branches: [master]
+ branches:
+ - main
+ - 1.4.x
pull_request:
branches:
- - master
- - 1.2.x
+ - main
+ - 1.4.x
paths-ignore:
- "doc/**"
@@ -23,19 +25,22 @@ jobs:
strategy:
matrix:
settings: [
- [actions-37-minimum_versions.yaml, "not slow and not network and not clipboard", "", "", "", "", ""],
- [actions-37.yaml, "not slow and not network and not clipboard", "", "", "", "", ""],
- [actions-37-locale_slow.yaml, "slow", "language-pack-it xsel", "it_IT.utf8", "it_IT.utf8", "", ""],
- [actions-37-slow.yaml, "slow", "", "", "", "", ""],
- [actions-38.yaml, "not slow and not network and not clipboard", "", "", "", "", ""],
- [actions-38-slow.yaml, "slow", "", "", "", "", ""],
- [actions-38-locale.yaml, "not slow and not network", "language-pack-zh-hans xsel", "zh_CN.utf8", "zh_CN.utf8", "", ""],
- [actions-38-numpydev.yaml, "not slow and not network", "xsel", "", "", "deprecate", "-W error"],
- [actions-39.yaml, "not slow and not network and not clipboard", "", "", "", "", ""]
+ [actions-38-downstream_compat.yaml, "not slow and not network and not clipboard", "", "", "", "", ""],
+ [actions-38-minimum_versions.yaml, "slow", "", "", "", "", ""],
+ [actions-38-minimum_versions.yaml, "not slow and not network and not clipboard", "", "", "", "", ""],
+ [actions-38.yaml, "not slow and not network", "language-pack-it xsel", "it_IT.utf8", "it_IT.utf8", "", ""],
+ [actions-38.yaml, "not slow and not network", "language-pack-zh-hans xsel", "zh_CN.utf8", "zh_CN.utf8", "", ""],
+ [actions-38.yaml, "not slow and not clipboard", "", "", "", "", ""],
+ [actions-38.yaml, "slow", "", "", "", "", ""],
+ [actions-pypy-38.yaml, "not slow and not clipboard", "", "", "", "", "--max-worker-restart 0"],
+ [actions-39.yaml, "slow", "", "", "", "", ""],
+ [actions-39.yaml, "not slow and not clipboard", "", "", "", "", ""],
+ [actions-310-numpydev.yaml, "not slow and not network", "xclip", "", "", "deprecate", "-W error"],
+ [actions-310.yaml, "not slow and not clipboard", "", "", "", "", ""],
+ [actions-310.yaml, "slow", "", "", "", "", ""],
]
fail-fast: false
env:
- COVERAGE: true
ENV_FILE: ci/deps/${{ matrix.settings[0] }}
PATTERN: ${{ matrix.settings[1] }}
EXTRA_APT: ${{ matrix.settings[2] }}
@@ -43,6 +48,50 @@ jobs:
LC_ALL: ${{ matrix.settings[4] }}
PANDAS_TESTING_MODE: ${{ matrix.settings[5] }}
TEST_ARGS: ${{ matrix.settings[6] }}
+ PYTEST_TARGET: pandas
+ IS_PYPY: ${{ contains(matrix.settings[0], 'pypy') }}
+ # TODO: re-enable coverage on pypy, it's slow
+ COVERAGE: ${{ !contains(matrix.settings[0], 'pypy') }}
+ concurrency:
+ # https://github.community/t/concurrecy-not-work-for-push/183068/7
+ group: ${{ github.event_name == 'push' && github.run_number || github.ref }}-${{ matrix.settings[0] }}-${{ matrix.settings[1] }}
+ cancel-in-progress: true
+
+ services:
+ mysql:
+ image: mysql
+ env:
+ MYSQL_ALLOW_EMPTY_PASSWORD: yes
+ MYSQL_DATABASE: pandas
+ options: >-
+ --health-cmd "mysqladmin ping"
+ --health-interval 10s
+ --health-timeout 5s
+ --health-retries 5
+ ports:
+ - 3306:3306
+
+ postgres:
+ image: postgres
+ env:
+ POSTGRES_USER: postgres
+ POSTGRES_PASSWORD: postgres
+ POSTGRES_DB: pandas
+ options: >-
+ --health-cmd pg_isready
+ --health-interval 10s
+ --health-timeout 5s
+ --health-retries 5
+ ports:
+ - 5432:5432
+
+ moto:
+ image: motoserver/moto
+ env:
+ AWS_ACCESS_KEY_ID: foobar_key
+ AWS_SECRET_ACCESS_KEY: foobar_secret
+ ports:
+ - 5000:5000
steps:
- name: Checkout
@@ -64,23 +113,42 @@ jobs:
- uses: conda-incubator/setup-miniconda@v2
with:
+ mamba-version: "*"
+ channels: conda-forge
activate-environment: pandas-dev
channel-priority: flexible
environment-file: ${{ env.ENV_FILE }}
use-only-tar-bz2: true
+ if: ${{ env.IS_PYPY == 'false' }} # No pypy3.8 support
+
+ - name: Setup PyPy
+ uses: actions/setup-python@v2
+ with:
+ python-version: "pypy-3.8"
+ if: ${{ env.IS_PYPY == 'true' }}
+
+ - name: Setup PyPy dependencies
+ shell: bash
+ run: |
+ # TODO: re-enable coverage, it's slowing the tests down though
+ # TODO: Unpin Cython, the new Cython 0.29.26 is causing compilation errors
+ pip install Cython==0.29.25 numpy python-dateutil pytz "pytest>=6.0" "pytest-xdist>=1.31.0" "hypothesis>=5.5.3"
+ if: ${{ env.IS_PYPY == 'true' }}
- name: Build Pandas
uses: ./.github/actions/build_pandas
- name: Test
run: ci/run_tests.sh
+ # TODO: Don't continue on error for PyPy
+ continue-on-error: ${{ env.IS_PYPY == 'true' }}
if: always()
- name: Build Version
run: pushd /tmp && python -c "import pandas; pandas.show_versions();" && popd
- name: Publish test results
- uses: actions/upload-artifact@master
+ uses: actions/upload-artifact@v2
with:
name: Test results
path: test-data.xml
@@ -90,7 +158,7 @@ jobs:
run: python ci/print_skipped.py
- name: Upload coverage to Codecov
- uses: codecov/codecov-action@v1
+ uses: codecov/codecov-action@v2
with:
flags: unittests
name: codecov-pandas
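
Each `settings` row is positional: environment file, pytest pattern, extra apt packages, locale values, testing mode, and extra pytest args, as mapped to env vars just below the matrix. Reproducing the Italian-locale row locally might look like this sketch (values copied from the matrix; assumes a built pandas-dev environment):

```bash
sudo apt-get install language-pack-it xsel
export LANG=it_IT.utf8 LC_ALL=it_IT.utf8
export PATTERN="not slow and not network"
export PYTEST_TARGET=pandas
ci/run_tests.sh
```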
diff --git a/.github/workflows/pre-commit.yml b/.github/workflows/pre-commit.yml
deleted file mode 100644
index 723347913ac38..0000000000000
--- a/.github/workflows/pre-commit.yml
+++ /dev/null
@@ -1,14 +0,0 @@
-name: pre-commit
-
-on:
- pull_request:
- push:
- branches: [master]
-
-jobs:
- pre-commit:
- runs-on: ubuntu-latest
- steps:
- - uses: actions/checkout@v2
- - uses: actions/setup-python@v2
- - uses: pre-commit/action@v2.0.0
diff --git a/.github/workflows/python-dev.yml b/.github/workflows/python-dev.yml
index 38b1aa9ae7047..fa1eee2db6fc3 100644
--- a/.github/workflows/python-dev.yml
+++ b/.github/workflows/python-dev.yml
@@ -1,20 +1,48 @@
+# This file is purposely frozen (it does not run). DO NOT DELETE IT.
+# Unfreeze (by commenting out the `if: false` condition) once the next
+# Python dev version has released beta 1 and both Cython and numpy support it.
+# After that Python version has been released, migrate the workflows to the
+# posix GHA workflows/Azure pipelines and "freeze" this file again by
+# restoring the `if: false` condition.
+# Feel free to modify this comment as necessary.
+
name: Python Dev
on:
push:
branches:
- - master
+ - main
+ - 1.4.x
pull_request:
branches:
- - master
+ - main
+ - 1.4.x
paths-ignore:
- "doc/**"
+env:
+ PYTEST_WORKERS: "auto"
+ PANDAS_CI: 1
+ PATTERN: "not slow and not network and not clipboard"
+ COVERAGE: true
+ PYTEST_TARGET: pandas
+
jobs:
build:
- runs-on: ubuntu-latest
- name: actions-310-dev
- timeout-minutes: 60
+ if: false # Comment this line out to "unfreeze"
+ runs-on: ${{ matrix.os }}
+ strategy:
+ fail-fast: false
+ matrix:
+ os: [ubuntu-latest, macOS-latest, windows-latest]
+
+ name: actions-311-dev
+ timeout-minutes: 80
+
+ concurrency:
+ # https://github.community/t/concurrecy-not-work-for-push/183068/7
+ group: ${{ github.event_name == 'push' && github.run_number || github.ref }}-${{ matrix.os }}-${{ matrix.pytest_target }}-dev
+ cancel-in-progress: true
steps:
- uses: actions/checkout@v2
@@ -24,15 +52,16 @@ jobs:
- name: Set up Python Dev Version
uses: actions/setup-python@v2
with:
- python-version: '3.10-dev'
+ python-version: '3.11-dev'
+ # TODO: GH#44980 https://github.com/pypa/setuptools/issues/2941
- name: Install dependencies
+ shell: bash
run: |
- python -m pip install --upgrade pip setuptools wheel
- pip install git+https://github.com/numpy/numpy.git
- pip install git+https://github.com/pytest-dev/pytest.git
+ python -m pip install --upgrade pip "setuptools<60.0.0" wheel
+ pip install -i https://pypi.anaconda.org/scipy-wheels-nightly/simple numpy
pip install git+https://github.com/nedbat/coveragepy.git
- pip install cython python-dateutil pytz hypothesis pytest-xdist
+ pip install cython python-dateutil pytz hypothesis "pytest>=6.2.5" pytest-xdist pytest-cov
pip list
- name: Build Pandas
@@ -45,12 +74,12 @@ jobs:
python -c "import pandas; pandas.show_versions();"
- name: Test with pytest
+ shell: bash
run: |
- coverage run -m pytest -m 'not slow and not network and not clipboard' pandas
- continue-on-error: true
+ ci/run_tests.sh
- name: Publish test results
- uses: actions/upload-artifact@master
+ uses: actions/upload-artifact@v2
with:
name: Test results
path: test-data.xml
@@ -65,7 +94,7 @@ jobs:
coverage report -m
- name: Upload coverage to Codecov
- uses: codecov/codecov-action@v1
+ uses: codecov/codecov-action@v2
with:
flags: unittests
name: codecov-pandas
diff --git a/.github/workflows/sdist.yml b/.github/workflows/sdist.yml
new file mode 100644
index 0000000000000..dd030f1aacc44
--- /dev/null
+++ b/.github/workflows/sdist.yml
@@ -0,0 +1,83 @@
+name: sdist
+
+on:
+ push:
+ branches:
+ - main
+ - 1.4.x
+ pull_request:
+ branches:
+ - main
+ - 1.4.x
+ paths-ignore:
+ - "doc/**"
+
+jobs:
+ build:
+ runs-on: ubuntu-latest
+ timeout-minutes: 60
+ defaults:
+ run:
+ shell: bash -l {0}
+
+ strategy:
+ fail-fast: false
+ matrix:
+ python-version: ["3.8", "3.9", "3.10"]
+ concurrency:
+ # https://github.community/t/concurrecy-not-work-for-push/183068/7
+ group: ${{ github.event_name == 'push' && github.run_number || github.ref }}-${{matrix.python-version}}-sdist
+ cancel-in-progress: true
+
+ steps:
+ - uses: actions/checkout@v2
+ with:
+ fetch-depth: 0
+
+ - name: Set up Python
+ uses: actions/setup-python@v2
+ with:
+ python-version: ${{ matrix.python-version }}
+
+ # TODO: GH#44980 https://github.com/pypa/setuptools/issues/2941
+ - name: Install dependencies
+ run: |
+ python -m pip install --upgrade pip "setuptools<60.0.0" wheel
+
+ # GH 39416
+ pip install numpy
+
+ - name: Build pandas sdist
+ run: |
+ pip list
+ python setup.py sdist --formats=gztar
+
+ - uses: conda-incubator/setup-miniconda@v2
+ with:
+ activate-environment: pandas-sdist
+ channels: conda-forge
+ python-version: '${{ matrix.python-version }}'
+
+ # TODO: GH#44980 https://github.com/pypa/setuptools/issues/2941
+ - name: Install pandas from sdist
+ run: |
+ python -m pip install --upgrade "setuptools<60.0.0"
+ pip list
+ python -m pip install dist/*.gz
+
+ - name: Force oldest supported NumPy
+ run: |
+ case "${{matrix.python-version}}" in
+ 3.8)
+ pip install numpy==1.18.5 ;;
+ 3.9)
+ pip install numpy==1.19.3 ;;
+ 3.10)
+ pip install numpy==1.21.2 ;;
+ esac
+
+ - name: Import pandas
+ run: |
+ cd ..
+ conda list
+ python -c "import pandas; pandas.show_versions();"
diff --git a/.gitignore b/.gitignore
index 2c337be60e94e..87224f1d6060f 100644
--- a/.gitignore
+++ b/.gitignore
@@ -50,6 +50,8 @@ dist
*.egg-info
.eggs
.pypirc
+# type checkers
+pandas/py.typed
# tox testing tool
.tox
diff --git a/.pre-commit-config.yaml b/.pre-commit-config.yaml
index d580fcf4fc545..5232b76a6388d 100644
--- a/.pre-commit-config.yaml
+++ b/.pre-commit-config.yaml
@@ -9,17 +9,17 @@ repos:
- id: absolufy-imports
files: ^pandas/
- repo: https://github.com/python/black
- rev: 21.5b2
+ rev: 21.12b0
hooks:
- id: black
- repo: https://github.com/codespell-project/codespell
- rev: v2.0.0
+ rev: v2.1.0
hooks:
- id: codespell
types_or: [python, rst, markdown]
files: ^(pandas|doc)/
- repo: https://github.com/pre-commit/pre-commit-hooks
- rev: v4.0.1
+ rev: v4.1.0
hooks:
- id: debug-statements
- id: end-of-file-fixer
@@ -35,34 +35,26 @@ repos:
# we can lint all header files since they aren't "generated" like C files are.
exclude: ^pandas/_libs/src/(klib|headers)/
args: [--quiet, '--extensions=c,h', '--headers=h', --recursive, '--filter=-readability/casting,-runtime/int,-build/include_subdir']
-- repo: https://gitlab.com/pycqa/flake8
- rev: 3.9.2
+- repo: https://github.com/PyCQA/flake8
+ rev: 4.0.1
hooks:
- id: flake8
- additional_dependencies:
- - flake8-comprehensions==3.1.0
- - flake8-bugbear==21.3.2
- - pandas-dev-flaker==0.2.0
- - id: flake8
- name: flake8 (cython)
- types: [cython]
- args: [--append-config=flake8/cython.cfg]
- - id: flake8
- name: flake8 (cython template)
- files: \.pxi\.in$
- types: [text]
- args: [--append-config=flake8/cython-template.cfg]
+ additional_dependencies: &flake8_dependencies
+ - flake8==4.0.1
+ - flake8-comprehensions==3.7.0
+ - flake8-bugbear==21.3.2
+ - pandas-dev-flaker==0.2.0
- repo: https://github.com/PyCQA/isort
- rev: 5.8.0
+ rev: 5.10.1
hooks:
- id: isort
- repo: https://github.com/asottile/pyupgrade
- rev: v2.18.3
+ rev: v2.31.0
hooks:
- id: pyupgrade
- args: [--py37-plus]
+ args: [--py38-plus]
- repo: https://github.com/pre-commit/pygrep-hooks
- rev: v1.8.0
+ rev: v1.9.0
hooks:
- id: rst-backticks
- id: rst-directive-colons
@@ -72,14 +64,21 @@ repos:
types: [text] # overwrite types: [rst]
types_or: [python, rst]
- repo: https://github.com/asottile/yesqa
- rev: v1.2.3
+ rev: v1.3.0
hooks:
- id: yesqa
- additional_dependencies:
- - flake8==3.9.2
- - flake8-comprehensions==3.1.0
- - flake8-bugbear==21.3.2
- - pandas-dev-flaker==0.2.0
+ additional_dependencies: *flake8_dependencies
+- repo: local
+ hooks:
+ - id: pyright
+ name: pyright
+ entry: pyright
+ language: node
+ pass_filenames: false
+ types: [python]
+ stages: [manual]
+ # note: keep version in sync with .github/workflows/ci.yml
+ additional_dependencies: ['pyright@1.1.202']
- repo: local
hooks:
- id: flake8-rst
@@ -102,7 +101,42 @@ repos:
# Incorrect code-block / IPython directives
|\.\.\ code-block\ ::
|\.\.\ ipython\ ::
+ # directive should not have a space before ::
+ |\.\.\ \w+\ ::
+
+ # Check for deprecated messages without sphinx directive
+ |(DEPRECATED|DEPRECATE|Deprecated)(:|,|\.)
types_or: [python, cython, rst]
+ - id: cython-casting
+ name: Check Cython casting is `<type>obj`, not `<type> obj`
+ language: pygrep
+ entry: '[a-zA-Z0-9*]> '
+ files: (\.pyx|\.pxi.in)$
+ - id: incorrect-backticks
+ name: Check for backticks incorrectly rendering because of missing spaces
+ language: pygrep
+ entry: '[a-zA-Z0-9]\`\`?[a-zA-Z0-9]'
+ types: [rst]
+ files: ^doc/source/
+ - id: seed-check-asv
+ name: Check for unnecessary random seeds in asv benchmarks
+ language: pygrep
+ entry: 'np\.random\.seed'
+ files: ^asv_bench/benchmarks
+ exclude: ^asv_bench/benchmarks/pandas_vb_common\.py
+ - id: np-testing-array-equal
+ name: Check for usage of numpy testing or array_equal
+ language: pygrep
+ entry: '(numpy|np)(\.testing|\.array_equal)'
+ files: ^pandas/tests/
+ types: [python]
+ - id: invalid-ea-testing
+ name: Check for invalid EA testing
+ language: pygrep
+ entry: 'tm\.assert_(series|frame)_equal'
+ files: ^pandas/tests/extension/base
+ types: [python]
+ exclude: ^pandas/tests/extension/base/base\.py
- id: pip-to-conda
name: Generate pip dependency from conda
description: This hook checks if the conda environment.yml and requirements-dev.txt are equal
@@ -110,7 +144,7 @@ repos:
entry: python scripts/generate_pip_deps_from_conda.py
files: ^(environment.yml|requirements-dev.txt)$
pass_filenames: false
- additional_dependencies: [pyyaml]
+ additional_dependencies: [pyyaml, toml]
- id: sync-flake8-versions
name: Check flake8 version is synced across flake8, yesqa, and environment.yml
language: python
@@ -136,3 +170,19 @@ repos:
entry: python scripts/no_bool_in_generic.py
language: python
files: ^pandas/core/generic\.py$
+ - id: pandas-errors-documented
+ name: Ensure pandas errors are documented in doc/source/reference/general_utility_functions.rst
+ entry: python scripts/pandas_errors_documented.py
+ language: python
+ files: ^pandas/errors/__init__.py$
+ - id: pg8000-not-installed-CI
+ name: Check for pg8000 not installed on CI for test_pg8000_sqlalchemy_passthrough_error
+ language: pygrep
+ entry: 'pg8000'
+ files: ^ci/deps
+ types: [yaml]
+ - id: validate-min-versions-in-sync
+ name: Check minimum version of dependencies are aligned
+ entry: python scripts/validate_min_versions_in_sync.py
+ language: python
+ files: ^(ci/deps/actions-.*-minimum_versions\.yaml|pandas/compat/_optional\.py)$
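
The new pygrep hooks are plain regex scans, so any of them can be sanity-checked with grep. For example, the `cython-casting` pattern flags a space between the closing `>` of a cast and its operand (the sample lines are hypothetical):

```bash
echo "<object> obj = value" | grep -E '[a-zA-Z0-9*]> ' && echo "flagged"
echo "<object>obj = value"  | grep -E '[a-zA-Z0-9*]> ' || echo "clean"
```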
diff --git a/Dockerfile b/Dockerfile
index de1c564921de9..8887e80566772 100644
--- a/Dockerfile
+++ b/Dockerfile
@@ -28,7 +28,7 @@ RUN mkdir "$pandas_home" \
&& git clone "https://github.com/$gh_username/pandas.git" "$pandas_home" \
&& cd "$pandas_home" \
&& git remote add upstream "https://github.com/pandas-dev/pandas.git" \
- && git pull upstream master
+ && git pull upstream main
# Because it is surprisingly difficult to activate a conda environment inside a DockerFile
# (from personal experience and per https://github.com/ContinuumIO/docker-images/issues/89),
diff --git a/MANIFEST.in b/MANIFEST.in
index d0d93f2cdba8c..78464c9aaedc8 100644
--- a/MANIFEST.in
+++ b/MANIFEST.in
@@ -17,28 +17,38 @@ global-exclude *.h5
global-exclude *.html
global-exclude *.json
global-exclude *.jsonl
+global-exclude *.msgpack
global-exclude *.pdf
global-exclude *.pickle
global-exclude *.png
global-exclude *.pptx
-global-exclude *.pyc
-global-exclude *.pyd
global-exclude *.ods
global-exclude *.odt
+global-exclude *.orc
global-exclude *.sas7bdat
global-exclude *.sav
global-exclude *.so
global-exclude *.xls
+global-exclude *.xlsb
global-exclude *.xlsm
global-exclude *.xlsx
global-exclude *.xpt
+global-exclude *.cpt
global-exclude *.xz
global-exclude *.zip
+global-exclude *.zst
global-exclude *~
global-exclude .DS_Store
global-exclude .git*
global-exclude \#*
+global-exclude *.c
+global-exclude *.cpp
+global-exclude *.h
+
+global-exclude *.py[ocd]
+global-exclude *.pxi
+
# GH 39321
# csv_dir_path fixture checks the existence of the directory
# exclude the whole directory to avoid running related tests in sdist
@@ -47,3 +57,6 @@ prune pandas/tests/io/parser/data
include versioneer.py
include pandas/_version.py
include pandas/io/formats/templates/*.tpl
+
+graft pandas/_libs/src
+graft pandas/_libs/tslibs/src
diff --git a/Makefile b/Makefile
index 1fdd3cfdcf027..c0aa685ed47ac 100644
--- a/Makefile
+++ b/Makefile
@@ -12,7 +12,7 @@ build: clean_pyc
python setup.py build_ext
lint-diff:
- git diff upstream/master --name-only -- "*.py" | xargs flake8
+ git diff upstream/main --name-only -- "*.py" | xargs flake8
black:
black .
diff --git a/README.md b/README.md
index 04b346c198e90..26aed081de4af 100644
--- a/README.md
+++ b/README.md
@@ -9,10 +9,10 @@
[](https://anaconda.org/anaconda/pandas/)
[](https://doi.org/10.5281/zenodo.3509134)
[](https://pypi.org/project/pandas/)
-[](https://github.com/pandas-dev/pandas/blob/master/LICENSE)
-[](https://dev.azure.com/pandas-dev/pandas/_build/latest?definitionId=1&branch=master)
-[](https://codecov.io/gh/pandas-dev/pandas)
-[](https://pandas.pydata.org)
+[](https://github.com/pandas-dev/pandas/blob/main/LICENSE)
+[](https://dev.azure.com/pandas-dev/pandas/_build/latest?definitionId=1&branch=main)
+[](https://codecov.io/gh/pandas-dev/pandas)
+[](https://pepy.tech/project/pandas)
[](https://gitter.im/pydata/pandas)
[](https://numfocus.org)
[](https://github.com/psf/black)
@@ -160,7 +160,7 @@ Most development discussions take place on GitHub in this repo. Further, the [pa
All contributions, bug reports, bug fixes, documentation improvements, enhancements, and ideas are welcome.
-A detailed overview on how to contribute can be found in the **[contributing guide](https://pandas.pydata.org/docs/dev/development/contributing.html)**. There is also an [overview](.github/CONTRIBUTING.md) on GitHub.
+A detailed overview on how to contribute can be found in the **[contributing guide](https://pandas.pydata.org/docs/dev/development/contributing.html)**.
If you are simply looking to start working with the pandas codebase, navigate to the [GitHub "issues" tab](https://github.com/pandas-dev/pandas/issues) and start looking through interesting issues. There are a number of issues listed under [Docs](https://github.com/pandas-dev/pandas/issues?labels=Docs&sort=updated&state=open) and [good first issue](https://github.com/pandas-dev/pandas/issues?labels=good+first+issue&sort=updated&state=open) where you could start out.
@@ -170,4 +170,4 @@ Or maybe through using pandas you have an idea of your own or are looking for so
Feel free to ask questions on the [mailing list](https://groups.google.com/forum/?fromgroups#!forum/pydata) or on [Gitter](https://gitter.im/pydata/pandas).
-As contributors and maintainers to this project, you are expected to abide by pandas' code of conduct. More information can be found at: [Contributor Code of Conduct](https://github.com/pandas-dev/pandas/blob/master/.github/CODE_OF_CONDUCT.md)
+As contributors and maintainers to this project, you are expected to abide by pandas' code of conduct. More information can be found at: [Contributor Code of Conduct](https://github.com/pandas-dev/pandas/blob/main/.github/CODE_OF_CONDUCT.md)
diff --git a/asv_bench/asv.conf.json b/asv_bench/asv.conf.json
index e8e82edabbfa3..daf2834c50d6a 100644
--- a/asv_bench/asv.conf.json
+++ b/asv_bench/asv.conf.json
@@ -13,6 +13,10 @@
// benchmarked
"repo": "..",
+ // List of branches to benchmark. If not provided, defaults to "master"
+ // (for git) or "default" (for mercurial).
+ "branches": ["main"],
+
// The tool to use to create environments. May be "conda",
// "virtualenv" or other value depending on the plugins in use.
// If missing or the empty string, the tool will be automatically
@@ -25,7 +29,6 @@
// The Pythons you'd like to test against. If not provided, defaults
// to the current version of Python used to run `asv`.
- // "pythons": ["2.7", "3.4"],
"pythons": ["3.8"],
// The matrix of dependencies to test. Each key is the name of a
@@ -39,24 +42,21 @@
// followed by the pip installed packages).
"matrix": {
"numpy": [],
- "Cython": ["0.29.21"],
+ "Cython": ["0.29.24"],
"matplotlib": [],
"sqlalchemy": [],
"scipy": [],
"numba": [],
"numexpr": [],
"pytables": [null, ""], // platform dependent, see excludes below
+ "pyarrow": [],
"tables": [null, ""],
"openpyxl": [],
"xlsxwriter": [],
"xlrd": [],
"xlwt": [],
"odfpy": [],
- "pytest": [],
"jinja2": [],
- // If using Windows with python 2.7 and want to build using the
- // mingw toolchain (rather than MSVC), uncomment the following line.
- // "libpython": [],
},
"conda_channels": ["defaults", "conda-forge"],
// Combinations of libraries/python versions can be excluded/included
diff --git a/asv_bench/benchmarks/algorithms.py b/asv_bench/benchmarks/algorithms.py
index e48a2060a3b34..2e43827232ae5 100644
--- a/asv_bench/benchmarks/algorithms.py
+++ b/asv_bench/benchmarks/algorithms.py
@@ -44,9 +44,9 @@ def setup(self, unique, sort, dtype):
raise NotImplementedError
data = {
- "int": pd.Int64Index(np.arange(N)),
- "uint": pd.UInt64Index(np.arange(N)),
- "float": pd.Float64Index(np.random.randn(N)),
+ "int": pd.Index(np.arange(N), dtype="int64"),
+ "uint": pd.Index(np.arange(N), dtype="uint64"),
+ "float": pd.Index(np.random.randn(N), dtype="float64"),
"object": string_index,
"datetime64[ns]": pd.date_range("2011-01-01", freq="H", periods=N),
"datetime64[ns, tz]": pd.date_range(
@@ -76,9 +76,9 @@ class Duplicated:
def setup(self, unique, keep, dtype):
N = 10 ** 5
data = {
- "int": pd.Int64Index(np.arange(N)),
- "uint": pd.UInt64Index(np.arange(N)),
- "float": pd.Float64Index(np.random.randn(N)),
+ "int": pd.Index(np.arange(N), dtype="int64"),
+ "uint": pd.Index(np.arange(N), dtype="uint64"),
+ "float": pd.Index(np.random.randn(N), dtype="float64"),
"string": tm.makeStringIndex(N),
"datetime64[ns]": pd.date_range("2011-01-01", freq="H", periods=N),
"datetime64[ns, tz]": pd.date_range(
diff --git a/asv_bench/benchmarks/algos/isin.py b/asv_bench/benchmarks/algos/isin.py
index 296101c9f9800..37fa0b490bd9e 100644
--- a/asv_bench/benchmarks/algos/isin.py
+++ b/asv_bench/benchmarks/algos/isin.py
@@ -1,9 +1,8 @@
import numpy as np
-from pandas.compat.numpy import np_version_under1p20
-
from pandas import (
Categorical,
+ Index,
NaT,
Series,
date_range,
@@ -280,10 +279,6 @@ class IsInLongSeriesLookUpDominates:
def setup(self, dtype, MaxNumber, series_type):
N = 10 ** 7
- # https://github.com/pandas-dev/pandas/issues/39844
- if not np_version_under1p20 and dtype in ("Int64", "Float64"):
- raise NotImplementedError
-
if series_type == "random_hits":
array = np.random.randint(0, MaxNumber, N)
if series_type == "random_misses":
@@ -294,7 +289,8 @@ def setup(self, dtype, MaxNumber, series_type):
array = np.arange(N) + MaxNumber
self.series = Series(array).astype(dtype)
- self.values = np.arange(MaxNumber).astype(dtype)
+
+ self.values = np.arange(MaxNumber).astype(dtype.lower())
def time_isin(self, dtypes, MaxNumber, series_type):
self.series.isin(self.values)
@@ -310,18 +306,37 @@ class IsInLongSeriesValuesDominate:
def setup(self, dtype, series_type):
N = 10 ** 7
- # https://github.com/pandas-dev/pandas/issues/39844
- if not np_version_under1p20 and dtype in ("Int64", "Float64"):
- raise NotImplementedError
-
if series_type == "random":
vals = np.random.randint(0, 10 * N, N)
if series_type == "monotone":
vals = np.arange(N)
- self.values = vals.astype(dtype)
+ self.values = vals.astype(dtype.lower())
M = 10 ** 6 + 1
self.series = Series(np.arange(M)).astype(dtype)
def time_isin(self, dtypes, series_type):
self.series.isin(self.values)
+
+
+class IsInWithLongTuples:
+ def setup(self):
+ t = tuple(range(1000))
+ self.series = Series([t] * 1000)
+ self.values = [t]
+
+ def time_isin(self):
+ self.series.isin(self.values)
+
+
+class IsInIndexes:
+ def setup(self):
+ self.range_idx = Index(range(1000))
+ self.index = Index(list(range(1000)))
+ self.series = Series(np.random.randint(100_000, size=1000))
+
+ def time_isin_range_index(self):
+ self.series.isin(self.range_idx)
+
+ def time_isin_index(self):
+ self.series.isin(self.index)
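
# The new IsInIndexes cases time Series.isin against a RangeIndex versus an
# equivalent materialized integer Index, which may take different lookup
# paths. In miniature:
import numpy as np
from pandas import Index, Series

ser = Series(np.random.randint(100_000, size=1_000))
print(ser.isin(Index(range(1_000))).sum())        # RangeIndex-backed values
print(ser.isin(Index(list(range(1_000)))).sum())  # Int64-backed values
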
diff --git a/asv_bench/benchmarks/arithmetic.py b/asv_bench/benchmarks/arithmetic.py
index bfb1be8705495..edd1132116f76 100644
--- a/asv_bench/benchmarks/arithmetic.py
+++ b/asv_bench/benchmarks/arithmetic.py
@@ -144,7 +144,7 @@ def setup(self, op, shape):
# should already be the case, but just to be sure
df._consolidate_inplace()
- # TODO: GH#33198 the setting here shoudlnt need two steps
+ # TODO: GH#33198 the setting here shouldn't need two steps
arr1 = np.random.randn(n_rows, max(n_cols // 4, 3)).astype("f8")
arr2 = np.random.randn(n_rows, n_cols // 2).astype("i8")
arr3 = np.random.randn(n_rows, n_cols // 4).astype("f8")
diff --git a/asv_bench/benchmarks/dtypes.py b/asv_bench/benchmarks/dtypes.py
index c561b80ed1ca6..55f6be848aa13 100644
--- a/asv_bench/benchmarks/dtypes.py
+++ b/asv_bench/benchmarks/dtypes.py
@@ -50,15 +50,26 @@ def time_pandas_dtype_invalid(self, dtype):
class SelectDtypes:
- params = [
- tm.ALL_INT_DTYPES
- + tm.ALL_EA_INT_DTYPES
- + tm.FLOAT_DTYPES
- + tm.COMPLEX_DTYPES
- + tm.DATETIME64_DTYPES
- + tm.TIMEDELTA64_DTYPES
- + tm.BOOL_DTYPES
- ]
+ try:
+ params = [
+ tm.ALL_INT_NUMPY_DTYPES
+ + tm.ALL_INT_EA_DTYPES
+ + tm.FLOAT_NUMPY_DTYPES
+ + tm.COMPLEX_DTYPES
+ + tm.DATETIME64_DTYPES
+ + tm.TIMEDELTA64_DTYPES
+ + tm.BOOL_DTYPES
+ ]
+ except AttributeError:
+ params = [
+ tm.ALL_INT_DTYPES
+ + tm.ALL_EA_INT_DTYPES
+ + tm.FLOAT_DTYPES
+ + tm.COMPLEX_DTYPES
+ + tm.DATETIME64_DTYPES
+ + tm.TIMEDELTA64_DTYPES
+ + tm.BOOL_DTYPES
+ ]
param_names = ["dtype"]
def setup(self, dtype):
diff --git a/asv_bench/benchmarks/frame_ctor.py b/asv_bench/benchmarks/frame_ctor.py
index 7fbe249788a98..eace665ba0bac 100644
--- a/asv_bench/benchmarks/frame_ctor.py
+++ b/asv_bench/benchmarks/frame_ctor.py
@@ -2,6 +2,7 @@
import pandas as pd
from pandas import (
+ Categorical,
DataFrame,
MultiIndex,
Series,
@@ -18,7 +19,10 @@
)
except ImportError:
# For compatibility with older versions
- from pandas.core.datetools import * # noqa
+ from pandas.core.datetools import (
+ Hour,
+ Nano,
+ )
class FromDicts:
@@ -31,6 +35,9 @@ def setup(self):
self.dict_list = frame.to_dict(orient="records")
self.data2 = {i: {j: float(j) for j in range(100)} for i in range(2000)}
+ # arrays which we won't consolidate
+ self.dict_of_categoricals = {i: Categorical(np.arange(N)) for i in range(K)}
+
def time_list_of_dict(self):
DataFrame(self.dict_list)
@@ -50,6 +57,10 @@ def time_nested_dict_int64(self):
# nested dict, integer indexes, regression described in #621
DataFrame(self.data2)
+ def time_dict_of_categoricals(self):
+ # dict of arrays that we won't consolidate
+ DataFrame(self.dict_of_categoricals)
+
class FromSeries:
def setup(self):
@@ -171,4 +182,21 @@ def time_frame_from_arrays_sparse(self):
)
+class From3rdParty:
+ # GH#44616
+
+ def setup(self):
+ try:
+ import torch
+ except ImportError:
+ raise NotImplementedError
+
+ row = 700000
+ col = 64
+ self.val_tensor = torch.randn(row, col)
+
+ def time_from_torch(self):
+ DataFrame(self.val_tensor)
+
+
from .pandas_vb_common import setup # noqa: F401 isort:skip
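
# From3rdParty times DataFrame construction from a third-party array such as a
# torch tensor (GH#44616), which pandas consumes through the array protocol.
# A guarded sketch, skipping quietly when torch is absent:
import pandas as pd

try:
    import torch  # optional dependency in this benchmark
except ImportError:
    torch = None

if torch is not None:
    df = pd.DataFrame(torch.randn(4, 3))  # tensor converted via np.asarray
    print(df.shape)  # (4, 3)
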
diff --git a/asv_bench/benchmarks/frame_methods.py b/asv_bench/benchmarks/frame_methods.py
index c32eda4928da7..16925b7959e6a 100644
--- a/asv_bench/benchmarks/frame_methods.py
+++ b/asv_bench/benchmarks/frame_methods.py
@@ -76,7 +76,7 @@ def time_reindex_axis1_missing(self):
self.df.reindex(columns=self.idx)
def time_reindex_both_axes(self):
- self.df.reindex(index=self.idx, columns=self.idx)
+ self.df.reindex(index=self.idx, columns=self.idx_cols)
def time_reindex_upcast(self):
self.df2.reindex(np.random.permutation(range(1200)))
@@ -232,6 +232,22 @@ def time_to_html_mixed(self):
self.df2.to_html()
+class ToDict:
+ params = [["dict", "list", "series", "split", "records", "index"]]
+ param_names = ["orient"]
+
+ def setup(self, orient):
+ data = np.random.randint(0, 1000, size=(10000, 4))
+ self.int_df = DataFrame(data)
+ self.datetimelike_df = self.int_df.astype("timedelta64[ns]")
+
+ def time_to_dict_ints(self, orient):
+ self.int_df.to_dict(orient=orient)
+
+ def time_to_dict_datetimelike(self, orient):
+ self.datetimelike_df.to_dict(orient=orient)
+
+
class ToNumpy:
def setup(self):
N = 10000
@@ -522,8 +538,12 @@ class Interpolate:
def setup(self, downcast):
N = 10000
# this is the worst case, where every column has NaNs.
- self.df = DataFrame(np.random.randn(N, 100))
- self.df.values[::2] = np.nan
+ arr = np.random.randn(N, 100)
+ # NB: we need to set values in array, not in df.values, otherwise
+ # the benchmark will be misleading for ArrayManager
+ arr[::2] = np.nan
+
+ self.df = DataFrame(arr)
self.df2 = DataFrame(
{
@@ -711,17 +731,6 @@ def time_dataframe_describe(self):
self.df.describe()
-class SelectDtypes:
- params = [100, 1000]
- param_names = ["n"]
-
- def setup(self, n):
- self.df = DataFrame(np.random.randn(10, n))
-
- def time_select_dtypes(self, n):
- self.df.select_dtypes(include="int")
-
-
class MemoryUsage:
def setup(self):
self.df = DataFrame(np.random.randn(100000, 2), columns=list("AB"))
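
# The new ToDict benchmark sweeps the supported `orient` values; on a tiny
# frame the shapes they produce look like this:
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
print(df.to_dict(orient="dict"))     # {'a': {0: 1, 1: 2}, 'b': {0: 3, 1: 4}}
print(df.to_dict(orient="records"))  # [{'a': 1, 'b': 3}, {'a': 2, 'b': 4}]
print(df.to_dict(orient="split"))    # {'index': [0, 1], 'columns': ['a', 'b'],
                                     #  'data': [[1, 3], [2, 4]]}
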
diff --git a/asv_bench/benchmarks/groupby.py b/asv_bench/benchmarks/groupby.py
index 1648985a56b91..ff58e382a9ba2 100644
--- a/asv_bench/benchmarks/groupby.py
+++ b/asv_bench/benchmarks/groupby.py
@@ -369,6 +369,18 @@ def time_category_size(self):
self.draws.groupby(self.cats).size()
+class Shift:
+ def setup(self):
+ N = 18
+ self.df = DataFrame({"g": ["a", "b"] * 9, "v": list(range(N))})
+
+ def time_defaults(self):
+ self.df.groupby("g").shift()
+
+ def time_fill_value(self):
+ self.df.groupby("g").shift(fill_value=99)
+
+
class FillNA:
def setup(self):
N = 100
@@ -391,7 +403,7 @@ def time_srs_bfill(self):
class GroupByMethods:
- param_names = ["dtype", "method", "application"]
+ param_names = ["dtype", "method", "application", "ncols"]
params = [
["int", "float", "object", "datetime", "uint"],
[
@@ -431,15 +443,39 @@ class GroupByMethods:
"var",
],
["direct", "transformation"],
+ [1, 5],
]
- def setup(self, dtype, method, application):
+ def setup(self, dtype, method, application, ncols):
if method in method_blocklist.get(dtype, {}):
raise NotImplementedError # skip benchmark
- ngroups = 1000
+
+ if ncols != 1 and method in ["value_counts", "unique"]:
+ # DataFrameGroupBy doesn't have these methods
+ raise NotImplementedError
+
+ if application == "transformation" and method in [
+ "describe",
+ "head",
+ "tail",
+ "unique",
+ "value_counts",
+ "size",
+ ]:
+ # transform doesn't support these methods
+ raise NotImplementedError
+
+ if method == "describe":
+ ngroups = 20
+ elif method in ["mad", "skew"]:
+ ngroups = 100
+ else:
+ ngroups = 1000
size = ngroups * 2
- rng = np.arange(ngroups)
- values = rng.take(np.random.randint(0, ngroups, size=size))
+ rng = np.arange(ngroups).reshape(-1, 1)
+ rng = np.broadcast_to(rng, (len(rng), ncols))
+ taker = np.random.randint(0, ngroups, size=size)
+ values = rng.take(taker, axis=0)
if dtype == "int":
key = np.random.randint(0, size, size=size)
elif dtype == "uint":
@@ -453,22 +489,24 @@ def setup(self, dtype, method, application):
elif dtype == "datetime":
key = date_range("1/1/2011", periods=size, freq="s")
- df = DataFrame({"values": values, "key": key})
+ cols = [f"values{n}" for n in range(ncols)]
+ df = DataFrame(values, columns=cols)
+ df["key"] = key
- if application == "transform":
- if method == "describe":
- raise NotImplementedError
+ if len(cols) == 1:
+ cols = cols[0]
- self.as_group_method = lambda: df.groupby("key")["values"].transform(method)
- self.as_field_method = lambda: df.groupby("values")["key"].transform(method)
+ if application == "transformation":
+ self.as_group_method = lambda: df.groupby("key")[cols].transform(method)
+ self.as_field_method = lambda: df.groupby(cols)["key"].transform(method)
else:
- self.as_group_method = getattr(df.groupby("key")["values"], method)
- self.as_field_method = getattr(df.groupby("values")["key"], method)
+ self.as_group_method = getattr(df.groupby("key")[cols], method)
+ self.as_field_method = getattr(df.groupby(cols)["key"], method)
- def time_dtype_as_group(self, dtype, method, application):
+ def time_dtype_as_group(self, dtype, method, application, ncols):
self.as_group_method()
- def time_dtype_as_field(self, dtype, method, application):
+ def time_dtype_as_field(self, dtype, method, application, ncols):
self.as_field_method()
@@ -568,6 +606,38 @@ def time_sum(self):
self.df.groupby(["a"])["b"].sum()
+class String:
+ # GH#41596
+ param_names = ["dtype", "method"]
+ params = [
+ ["str", "string[python]"],
+ [
+ "sum",
+ "prod",
+ "min",
+ "max",
+ "mean",
+ "median",
+ "var",
+ "first",
+ "last",
+ "any",
+ "all",
+ ],
+ ]
+
+ def setup(self, dtype, method):
+ cols = list("abcdefghjkl")
+ self.df = DataFrame(
+ np.random.randint(0, 100, size=(1_000_000, len(cols))),
+ columns=cols,
+ dtype=dtype,
+ )
+
+ def time_str_func(self, dtype, method):
+ self.df.groupby("a")[self.df.columns[1:]].agg(method)
+
+
class Categories:
def setup(self):
N = 10 ** 5
@@ -832,4 +902,18 @@ def function(values):
self.grouper.agg(function, engine="cython")
+class Sample:
+ def setup(self):
+ N = 10 ** 3
+ self.df = DataFrame({"a": np.zeros(N)})
+ self.groups = np.arange(0, N)
+ self.weights = np.ones(N)
+
+ def time_sample(self):
+ self.df.groupby(self.groups).sample(n=1)
+
+ def time_sample_weights(self):
+ self.df.groupby(self.groups).sample(n=1, weights=self.weights)
+
+
from .pandas_vb_common import setup # noqa: F401 isort:skip
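
# Among the additions here, Shift and Sample cover GroupBy.shift with a
# fill_value and weighted GroupBy.sample; scaled down:
import numpy as np
import pandas as pd

df = pd.DataFrame({"g": ["a", "b"] * 3, "v": range(6)})
print(df.groupby("g").shift(fill_value=99))             # leading rows become 99
print(df.groupby("g").sample(n=1, weights=np.ones(6)))  # one row per group
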
diff --git a/asv_bench/benchmarks/index_object.py b/asv_bench/benchmarks/index_object.py
index 9c05019c70396..2b2302a796730 100644
--- a/asv_bench/benchmarks/index_object.py
+++ b/asv_bench/benchmarks/index_object.py
@@ -86,6 +86,12 @@ def time_iter_dec(self):
for _ in self.idx_dec:
pass
+ def time_sort_values_asc(self):
+ self.idx_inc.sort_values()
+
+ def time_sort_values_des(self):
+ self.idx_inc.sort_values(ascending=False)
+
class IndexEquals:
def setup(self):
diff --git a/asv_bench/benchmarks/indexing.py b/asv_bench/benchmarks/indexing.py
index 10fb926ee4d03..58f2a73d82842 100644
--- a/asv_bench/benchmarks/indexing.py
+++ b/asv_bench/benchmarks/indexing.py
@@ -366,11 +366,20 @@ class InsertColumns:
def setup(self):
self.N = 10 ** 3
self.df = DataFrame(index=range(self.N))
+ self.df2 = DataFrame(np.random.randn(self.N, 2))
def time_insert(self):
for i in range(100):
self.df.insert(0, i, np.random.randn(self.N), allow_duplicates=True)
+ def time_insert_middle(self):
+ # same as time_insert but inserting into a middle column rather than
+ # front or back (which have fast-paths)
+ for i in range(100):
+ self.df2.insert(
+ 1, "colname", np.random.randn(self.N), allow_duplicates=True
+ )
+
def time_assign_with_setitem(self):
for i in range(100):
self.df[i] = np.random.randn(self.N)
@@ -390,12 +399,14 @@ class ChainIndexing:
def setup(self, mode):
self.N = 1000000
+ self.df = DataFrame({"A": np.arange(self.N), "B": "foo"})
def time_chained_indexing(self, mode):
+ df = self.df
+ N = self.N
with warnings.catch_warnings(record=True):
with option_context("mode.chained_assignment", mode):
- df = DataFrame({"A": np.arange(self.N), "B": "foo"})
- df2 = df[df.A > self.N // 2]
+ df2 = df[df.A > N // 2]
df2["C"] = 1.0
diff --git a/asv_bench/benchmarks/indexing_engines.py b/asv_bench/benchmarks/indexing_engines.py
index 30ef7f63dc0dc..60e07a9d1469c 100644
--- a/asv_bench/benchmarks/indexing_engines.py
+++ b/asv_bench/benchmarks/indexing_engines.py
@@ -1,5 +1,5 @@
"""
-Benchmarks in this fiel depend exclusively on code in _libs/
+Benchmarks in this file depend exclusively on code in _libs/
If a PR does not edit anything in _libs, it is very unlikely that benchmarks
in this file will be affected.
@@ -35,25 +35,49 @@ class NumericEngineIndexing:
params = [
_get_numeric_engines(),
["monotonic_incr", "monotonic_decr", "non_monotonic"],
+ [True, False],
+ [10 ** 5, 2 * 10 ** 6], # 2e6 is above SIZE_CUTOFF
]
- param_names = ["engine_and_dtype", "index_type"]
+ param_names = ["engine_and_dtype", "index_type", "unique", "N"]
- def setup(self, engine_and_dtype, index_type):
+ def setup(self, engine_and_dtype, index_type, unique, N):
engine, dtype = engine_and_dtype
- N = 10 ** 5
- values = list([1] * N + [2] * N + [3] * N)
- arr = {
- "monotonic_incr": np.array(values, dtype=dtype),
- "monotonic_decr": np.array(list(reversed(values)), dtype=dtype),
- "non_monotonic": np.array([1, 2, 3] * N, dtype=dtype),
- }[index_type]
- self.data = engine(lambda: arr, len(arr))
+ if index_type == "monotonic_incr":
+ if unique:
+ arr = np.arange(N * 3, dtype=dtype)
+ else:
+ values = list([1] * N + [2] * N + [3] * N)
+ arr = np.array(values, dtype=dtype)
+ elif index_type == "monotonic_decr":
+ if unique:
+ arr = np.arange(N * 3, dtype=dtype)[::-1]
+ else:
+ values = list([1] * N + [2] * N + [3] * N)
+ arr = np.array(values, dtype=dtype)[::-1]
+ else:
+ assert index_type == "non_monotonic"
+ if unique:
+ arr = np.empty(N * 3, dtype=dtype)
+ arr[:N] = np.arange(N * 2, N * 3, dtype=dtype)
+ arr[N:] = np.arange(N * 2, dtype=dtype)
+ else:
+ arr = np.array([1, 2, 3] * N, dtype=dtype)
+
+ self.data = engine(arr)
# code below avoids populating the mapping etc. while timing.
self.data.get_loc(2)
- def time_get_loc(self, engine_and_dtype, index_type):
- self.data.get_loc(2)
+ self.key_middle = arr[len(arr) // 2]
+ self.key_early = arr[2]
+
+ def time_get_loc(self, engine_and_dtype, index_type, unique, N):
+ self.data.get_loc(self.key_early)
+
+ def time_get_loc_near_middle(self, engine_and_dtype, index_type, unique, N):
+ # searchsorted performance may be different near the middle of a range
+ # vs near an endpoint
+ self.data.get_loc(self.key_middle)
class ObjectEngineIndexing:
@@ -70,7 +94,7 @@ def setup(self, index_type):
"non_monotonic": np.array(list("abc") * N, dtype=object),
}[index_type]
- self.data = libindex.ObjectEngine(lambda: arr, len(arr))
+ self.data = libindex.ObjectEngine(arr)
# code below avoids populating the mapping etc. while timing.
self.data.get_loc("b")
diff --git a/asv_bench/benchmarks/inference.py b/asv_bench/benchmarks/inference.py
index 0aa924dabd469..a5a7bc5b5c8bd 100644
--- a/asv_bench/benchmarks/inference.py
+++ b/asv_bench/benchmarks/inference.py
@@ -115,19 +115,27 @@ def time_maybe_convert_objects(self):
class ToDatetimeFromIntsFloats:
def setup(self):
self.ts_sec = Series(range(1521080307, 1521685107), dtype="int64")
+ self.ts_sec_uint = Series(range(1521080307, 1521685107), dtype="uint64")
self.ts_sec_float = self.ts_sec.astype("float64")
self.ts_nanosec = 1_000_000 * self.ts_sec
+ self.ts_nanosec_uint = 1_000_000 * self.ts_sec_uint
self.ts_nanosec_float = self.ts_nanosec.astype("float64")
- # speed of int64 and float64 paths should be comparable
+ # speed of int64, uint64 and float64 paths should be comparable
def time_nanosec_int64(self):
to_datetime(self.ts_nanosec, unit="ns")
+ def time_nanosec_uint64(self):
+ to_datetime(self.ts_nanosec_uint, unit="ns")
+
def time_nanosec_float64(self):
to_datetime(self.ts_nanosec_float, unit="ns")
+ def time_sec_uint64(self):
+ to_datetime(self.ts_sec_uint, unit="s")
+
def time_sec_int64(self):
to_datetime(self.ts_sec, unit="s")
@@ -165,6 +173,7 @@ def setup(self):
self.strings_tz_space = [
x.strftime("%Y-%m-%d %H:%M:%S") + " -0800" for x in rng
]
+ self.strings_zero_tz = [x.strftime("%Y-%m-%d %H:%M:%S") + "Z" for x in rng]
def time_iso8601(self):
to_datetime(self.strings)
@@ -181,6 +190,10 @@ def time_iso8601_format_no_sep(self):
def time_iso8601_tz_spaceformat(self):
to_datetime(self.strings_tz_space)
+ def time_iso8601_infer_zero_tz_format(self):
+ # GH 41047
+ to_datetime(self.strings_zero_tz, infer_datetime_format=True)
+
class ToDatetimeNONISO8601:
def setup(self):
@@ -264,6 +277,16 @@ def time_dup_string_tzoffset_dates(self, cache):
to_datetime(self.dup_string_with_tz, cache=cache)
+# GH 43901
+class ToDatetimeInferDatetimeFormat:
+ def setup(self):
+ rng = date_range(start="1/1/2000", periods=100000, freq="H")
+ self.strings = rng.strftime("%Y-%m-%d %H:%M:%S").tolist()
+
+ def time_infer_datetime_format(self):
+ to_datetime(self.strings, infer_datetime_format=True)
+
+
class ToTimedelta:
def setup(self):
self.ints = np.random.randint(0, 60, size=10000)
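
# The added uint64 cases mirror the existing int64/float64 epoch paths of
# to_datetime; all three dtypes should resolve to the same instants:
import pandas as pd

secs = pd.Series([1521080307, 1521080308], dtype="uint64")
print(pd.to_datetime(secs, unit="s"))  # 2018-03-15 02:18:27 and 02:18:28
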
diff --git a/asv_bench/benchmarks/io/csv.py b/asv_bench/benchmarks/io/csv.py
index 5ff9431fbf8e4..0b443b29116a2 100644
--- a/asv_bench/benchmarks/io/csv.py
+++ b/asv_bench/benchmarks/io/csv.py
@@ -10,6 +10,7 @@
from pandas import (
Categorical,
DataFrame,
+ concat,
date_range,
read_csv,
to_datetime,
@@ -54,6 +55,26 @@ def time_frame(self, kind):
self.df.to_csv(self.fname)
+class ToCSVMultiIndexUnusedLevels(BaseIO):
+
+ fname = "__test__.csv"
+
+ def setup(self):
+ df = DataFrame({"a": np.random.randn(100_000), "b": 1, "c": 1})
+ self.df = df.set_index(["a", "b"])
+ self.df_unused_levels = self.df.iloc[:10_000]
+ self.df_single_index = df.set_index(["a"]).iloc[:10_000]
+
+ def time_full_frame(self):
+ self.df.to_csv(self.fname)
+
+ def time_sliced_frame(self):
+ self.df_unused_levels.to_csv(self.fname)
+
+ def time_single_index_frame(self):
+ self.df_single_index.to_csv(self.fname)
+
+
class ToCSVDatetime(BaseIO):
fname = "__test__.csv"
@@ -66,6 +87,21 @@ def time_frame_date_formatting(self):
self.data.to_csv(self.fname, date_format="%Y%m%d")
+class ToCSVDatetimeIndex(BaseIO):
+
+ fname = "__test__.csv"
+
+ def setup(self):
+ rng = date_range("2000", periods=100_000, freq="S")
+ self.data = DataFrame({"a": 1}, index=rng)
+
+ def time_frame_date_formatting_index(self):
+ self.data.to_csv(self.fname, date_format="%Y-%m-%d %H:%M:%S")
+
+ def time_frame_date_no_format_index(self):
+ self.data.to_csv(self.fname)
+
+
class ToCSVDatetimeBig(BaseIO):
fname = "__test__.csv"
@@ -206,7 +242,7 @@ def time_read_csv(self, bad_date_value):
class ReadCSVSkipRows(BaseIO):
fname = "__test__.csv"
- params = ([None, 10000], ["c", "python"])
+ params = ([None, 10000], ["c", "python", "pyarrow"])
param_names = ["skiprows", "engine"]
def setup(self, skiprows, engine):
@@ -291,7 +327,8 @@ class ReadCSVFloatPrecision(StringIORewind):
def setup(self, sep, decimal, float_precision):
floats = [
- "".join(random.choice(string.digits) for _ in range(28)) for _ in range(15)
+ "".join([random.choice(string.digits) for _ in range(28)])
+ for _ in range(15)
]
rows = sep.join([f"0{decimal}" + "{}"] * 3) + "\n"
data = rows * 5
@@ -319,7 +356,7 @@ def time_read_csv_python_engine(self, sep, decimal, float_precision):
class ReadCSVEngine(StringIORewind):
- params = ["c", "python"]
+ params = ["c", "python", "pyarrow"]
param_names = ["engine"]
def setup(self, engine):
@@ -395,7 +432,7 @@ class ReadCSVCachedParseDates(StringIORewind):
param_names = ["do_cache", "engine"]
def setup(self, do_cache, engine):
- data = ("\n".join(f"10/{year}" for year in range(2000, 2100)) + "\n") * 10
+ data = ("\n".join([f"10/{year}" for year in range(2000, 2100)]) + "\n") * 10
self.StringIO_input = StringIO(data)
def time_read_csv_cached(self, do_cache, engine):
@@ -458,6 +495,34 @@ def time_read_special_date(self, value, engine):
)
+class ReadCSVMemMapUTF8:
+
+ fname = "__test__.csv"
+ number = 5
+
+ def setup(self):
+ lines = []
+ line_length = 128
+ start_char = " "
+ end_char = "\U00010080"
+ # This for loop creates a list of 128-char strings
+ # consisting of consecutive Unicode chars
+ for lnum in range(ord(start_char), ord(end_char), line_length):
+ line = "".join([chr(c) for c in range(lnum, lnum + 0x80)]) + "\n"
+ try:
+ line.encode("utf-8")
+ except UnicodeEncodeError:
+ # Some 16-bit words are not valid Unicode chars and must be skipped
+ continue
+ lines.append(line)
+ df = DataFrame(lines)
+ df = concat([df for n in range(100)], ignore_index=True)
+ df.to_csv(self.fname, index=False, header=False, encoding="utf-8")
+
+ def time_read_memmapped_utf8(self):
+ read_csv(self.fname, header=None, memory_map=True, encoding="utf-8", engine="c")
+
+
class ParseDateComparison(StringIORewind):
params = ([False, True],)
param_names = ["cache_dates"]
@@ -495,4 +560,14 @@ def time_to_datetime_format_DD_MM_YYYY(self, cache_dates):
to_datetime(df["date"], cache=cache_dates, format="%d-%m-%Y")
+class ReadCSVIndexCol(StringIORewind):
+ def setup(self):
+ count_elem = 100_000
+ data = "a,b\n" + "1,2\n" * count_elem
+ self.StringIO_input = StringIO(data)
+
+ def time_read_csv_index_col(self):
+ read_csv(self.StringIO_input, index_col="a")
+
+
from ..pandas_vb_common import setup # noqa: F401 isort:skip
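
# Several read_csv benchmarks above gain the pyarrow engine, which is opt-in
# and requires pyarrow to be installed. A guarded usage sketch:
from io import BytesIO

from pandas import read_csv

payload = b"a,b\n1,2\n3,4\n"
try:
    df = read_csv(BytesIO(payload), engine="pyarrow")
except ImportError:
    df = read_csv(BytesIO(payload), engine="c")  # default C engine fallback
print(df)
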
diff --git a/asv_bench/benchmarks/io/json.py b/asv_bench/benchmarks/io/json.py
index d9d27ce7e5d8c..d1468a238c491 100644
--- a/asv_bench/benchmarks/io/json.py
+++ b/asv_bench/benchmarks/io/json.py
@@ -172,15 +172,19 @@ def time_to_json(self, orient, frame):
def peakmem_to_json(self, orient, frame):
getattr(self, frame).to_json(self.fname, orient=orient)
- def time_to_json_wide(self, orient, frame):
+
+class ToJSONWide(ToJSON):
+ def setup(self, orient, frame):
+ super().setup(orient, frame)
base_df = getattr(self, frame).copy()
- df = concat([base_df.iloc[:100]] * 1000, ignore_index=True, axis=1)
- df.to_json(self.fname, orient=orient)
+ df_wide = concat([base_df.iloc[:100]] * 1000, ignore_index=True, axis=1)
+ self.df_wide = df_wide
+
+ def time_to_json_wide(self, orient, frame):
+ self.df_wide.to_json(self.fname, orient=orient)
def peakmem_to_json_wide(self, orient, frame):
- base_df = getattr(self, frame).copy()
- df = concat([base_df.iloc[:100]] * 1000, ignore_index=True, axis=1)
- df.to_json(self.fname, orient=orient)
+ self.df_wide.to_json(self.fname, orient=orient)
class ToJSONISO(BaseIO):
diff --git a/asv_bench/benchmarks/io/style.py b/asv_bench/benchmarks/io/style.py
index 82166a2a95c76..f0902c9c2c328 100644
--- a/asv_bench/benchmarks/io/style.py
+++ b/asv_bench/benchmarks/io/style.py
@@ -34,13 +34,29 @@ def peakmem_classes_render(self, cols, rows):
self._style_classes()
self.st._render_html(True, True)
+ def time_tooltips_render(self, cols, rows):
+ self._style_tooltips()
+ self.st._render_html(True, True)
+
+ def peakmem_tooltips_render(self, cols, rows):
+ self._style_tooltips()
+ self.st._render_html(True, True)
+
def time_format_render(self, cols, rows):
self._style_format()
- self.st.render()
+ self.st._render_html(True, True)
def peakmem_format_render(self, cols, rows):
self._style_format()
- self.st.render()
+ self.st._render_html(True, True)
+
+ def time_apply_format_hide_render(self, cols, rows):
+ self._style_apply_format_hide()
+ self.st._render_html(True, True)
+
+ def peakmem_apply_format_hide_render(self, cols, rows):
+ self._style_apply_format_hide()
+ self.st._render_html(True, True)
def _style_apply(self):
def _apply_func(s):
@@ -63,3 +79,15 @@ def _style_format(self):
self.st = self.df.style.format(
"{:,.3f}", subset=IndexSlice["row_1":f"row_{ir}", "float_1":f"float_{ic}"]
)
+
+ def _style_apply_format_hide(self):
+ self.st = self.df.style.applymap(lambda v: "color: red;")
+ self.st.format("{:.3f}")
+ self.st.hide_index(self.st.index[1:])
+ self.st.hide_columns(self.st.columns[1:])
+
+ def _style_tooltips(self):
+ ttips = DataFrame("abc", index=self.df.index[::2], columns=self.df.columns[::2])
+ self.st = self.df.style.set_tooltips(ttips)
+ self.st.hide_index(self.st.index[12:])
+ self.st.hide_columns(self.st.columns[12:])
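
# _style_apply_format_hide chains applymap/format with the subset-aware
# hide_index/hide_columns available to Styler in the pandas version this diff
# targets; roughly:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(4, 4))
st = df.style.applymap(lambda v: "color: red;").format("{:.3f}")
st = st.hide_index(st.index[1:]).hide_columns(st.columns[1:])
html = st.to_html()  # only the first row and column remain visible
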
diff --git a/asv_bench/benchmarks/join_merge.py b/asv_bench/benchmarks/join_merge.py
index 27eaecff09d0f..ad40adc75c567 100644
--- a/asv_bench/benchmarks/join_merge.py
+++ b/asv_bench/benchmarks/join_merge.py
@@ -262,12 +262,24 @@ def setup(self):
Z=self.right_object["Z"].astype("category")
)
+ self.left_cat_col = self.left_object.astype({"X": "category"})
+ self.right_cat_col = self.right_object.astype({"X": "category"})
+
+ self.left_cat_idx = self.left_cat_col.set_index("X")
+ self.right_cat_idx = self.right_cat_col.set_index("X")
+
def time_merge_object(self):
merge(self.left_object, self.right_object, on="X")
def time_merge_cat(self):
merge(self.left_cat, self.right_cat, on="X")
+ def time_merge_on_cat_col(self):
+ merge(self.left_cat_col, self.right_cat_col, on="X")
+
+ def time_merge_on_cat_idx(self):
+ merge(self.left_cat_idx, self.right_cat_idx, on="X")
+
class MergeOrdered:
def setup(self):
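
# time_merge_on_cat_col and time_merge_on_cat_idx compare merging on a
# categorical key held as a column versus as the index; in miniature:
import pandas as pd

left = pd.DataFrame({"X": list("abab"), "Y": range(4)}).astype({"X": "category"})
right = pd.DataFrame({"X": list("abba"), "Z": range(4)}).astype({"X": "category"})
print(pd.merge(left, right, on="X").shape)  # (8, 3)
print(
    pd.merge(
        left.set_index("X"),
        right.set_index("X"),
        left_index=True,
        right_index=True,
    ).shape  # (8, 2)
)
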
diff --git a/asv_bench/benchmarks/pandas_vb_common.py b/asv_bench/benchmarks/pandas_vb_common.py
index ed44102700dc6..d3168bde0a783 100644
--- a/asv_bench/benchmarks/pandas_vb_common.py
+++ b/asv_bench/benchmarks/pandas_vb_common.py
@@ -17,7 +17,7 @@
try:
import pandas._testing as tm
except ImportError:
- import pandas.util.testing as tm # noqa
+ import pandas.util.testing as tm # noqa:F401
numeric_dtypes = [
diff --git a/asv_bench/benchmarks/reshape.py b/asv_bench/benchmarks/reshape.py
index 232aabfb87c58..c83cd9a925f6d 100644
--- a/asv_bench/benchmarks/reshape.py
+++ b/asv_bench/benchmarks/reshape.py
@@ -102,6 +102,7 @@ def setup(self, dtype):
columns = np.arange(n)
if dtype == "int":
values = np.arange(m * m * n).reshape(m * m, n)
+ self.df = DataFrame(values, index, columns)
else:
# the category branch is ~20x slower than int. So we
# cut down the size a bit. Now it's only ~3x slower.
@@ -111,7 +112,10 @@ def setup(self, dtype):
values = np.take(list(string.ascii_letters), indices)
values = [pd.Categorical(v) for v in values.T]
- self.df = DataFrame(values, index, columns)
+ self.df = DataFrame(
+ {i: cat for i, cat in enumerate(values)}, index, columns
+ )
+
self.df2 = self.df.iloc[:-1]
def time_full_product(self, dtype):
diff --git a/asv_bench/benchmarks/rolling.py b/asv_bench/benchmarks/rolling.py
index d35770b720f7a..1c53d4adc8c25 100644
--- a/asv_bench/benchmarks/rolling.py
+++ b/asv_bench/benchmarks/rolling.py
@@ -1,3 +1,5 @@
+import warnings
+
import numpy as np
import pandas as pd
@@ -7,22 +9,24 @@ class Methods:
params = (
["DataFrame", "Series"],
- [10, 1000],
+ [("rolling", {"window": 10}), ("rolling", {"window": 1000}), ("expanding", {})],
["int", "float"],
- ["median", "mean", "max", "min", "std", "count", "skew", "kurt", "sum"],
+ ["median", "mean", "max", "min", "std", "count", "skew", "kurt", "sum", "sem"],
)
- param_names = ["constructor", "window", "dtype", "method"]
+ param_names = ["constructor", "window_kwargs", "dtype", "method"]
- def setup(self, constructor, window, dtype, method):
+ def setup(self, constructor, window_kwargs, dtype, method):
N = 10 ** 5
+ window, kwargs = window_kwargs
arr = (100 * np.random.random(N)).astype(dtype)
- self.roll = getattr(pd, constructor)(arr).rolling(window)
+ obj = getattr(pd, constructor)(arr)
+ self.window = getattr(obj, window)(**kwargs)
- def time_rolling(self, constructor, window, dtype, method):
- getattr(self.roll, method)()
+ def time_method(self, constructor, window_kwargs, dtype, method):
+ getattr(self.window, method)()
- def peakmem_rolling(self, constructor, window, dtype, method):
- getattr(self.roll, method)()
+ def peakmem_method(self, constructor, window_kwargs, dtype, method):
+ getattr(self.window, method)()
class Apply:
@@ -44,77 +48,116 @@ def time_rolling(self, constructor, window, dtype, function, raw):
self.roll.apply(function, raw=raw)
-class Engine:
+class NumbaEngineMethods:
params = (
["DataFrame", "Series"],
["int", "float"],
- [np.sum, lambda x: np.sum(x) + 5],
- ["cython", "numba"],
- ["sum", "max", "min", "median", "mean"],
+ [("rolling", {"window": 10}), ("expanding", {})],
+ ["sum", "max", "min", "median", "mean", "var", "std"],
+ [True, False],
+ [None, 100],
)
- param_names = ["constructor", "dtype", "function", "engine", "method"]
-
- def setup(self, constructor, dtype, function, engine, method):
+ param_names = [
+ "constructor",
+ "dtype",
+ "window_kwargs",
+ "method",
+ "parallel",
+ "cols",
+ ]
+
+ def setup(self, constructor, dtype, window_kwargs, method, parallel, cols):
N = 10 ** 3
- arr = (100 * np.random.random(N)).astype(dtype)
- self.data = getattr(pd, constructor)(arr)
-
- def time_rolling_apply(self, constructor, dtype, function, engine, method):
- self.data.rolling(10).apply(function, raw=True, engine=engine)
-
- def time_expanding_apply(self, constructor, dtype, function, engine, method):
- self.data.expanding().apply(function, raw=True, engine=engine)
-
- def time_rolling_methods(self, constructor, dtype, function, engine, method):
- getattr(self.data.rolling(10), method)(engine=engine)
-
-
-class ExpandingMethods:
-
+ window, kwargs = window_kwargs
+ shape = (N, cols) if cols is not None and constructor != "Series" else N
+ arr = (100 * np.random.random(shape)).astype(dtype)
+ data = getattr(pd, constructor)(arr)
+
+ # Warm the cache
+ with warnings.catch_warnings(record=True):
+ # Catch parallel=True not being applicable e.g. 1D data
+ self.window = getattr(data, window)(**kwargs)
+ getattr(self.window, method)(
+ engine="numba", engine_kwargs={"parallel": parallel}
+ )
+
+ def time_method(self, constructor, dtype, window_kwargs, method, parallel, cols):
+ with warnings.catch_warnings(record=True):
+ getattr(self.window, method)(
+ engine="numba", engine_kwargs={"parallel": parallel}
+ )
+
+
+class NumbaEngineApply:
params = (
["DataFrame", "Series"],
["int", "float"],
- ["median", "mean", "max", "min", "std", "count", "skew", "kurt", "sum"],
+ [("rolling", {"window": 10}), ("expanding", {})],
+ [np.sum, lambda x: np.sum(x) + 5],
+ [True, False],
+ [None, 100],
)
- param_names = ["constructor", "window", "dtype", "method"]
-
- def setup(self, constructor, dtype, method):
- N = 10 ** 5
- N_groupby = 100
- arr = (100 * np.random.random(N)).astype(dtype)
- self.expanding = getattr(pd, constructor)(arr).expanding()
- self.expanding_groupby = (
- pd.DataFrame({"A": arr[:N_groupby], "B": range(N_groupby)})
- .groupby("B")
- .expanding()
- )
-
- def time_expanding(self, constructor, dtype, method):
- getattr(self.expanding, method)()
-
- def time_expanding_groupby(self, constructor, dtype, method):
- getattr(self.expanding_groupby, method)()
+ param_names = [
+ "constructor",
+ "dtype",
+ "window_kwargs",
+ "function",
+ "parallel",
+ "cols",
+ ]
+
+ def setup(self, constructor, dtype, window_kwargs, function, parallel, cols):
+ N = 10 ** 3
+ window, kwargs = window_kwargs
+ shape = (N, cols) if cols is not None and constructor != "Series" else N
+ arr = (100 * np.random.random(shape)).astype(dtype)
+ data = getattr(pd, constructor)(arr)
+
+ # Warm the cache
+ with warnings.catch_warnings(record=True):
+ # Catch parallel=True not being applicable e.g. 1D data
+ self.window = getattr(data, window)(**kwargs)
+ self.window.apply(
+ function, raw=True, engine="numba", engine_kwargs={"parallel": parallel}
+ )
+
+ def time_method(self, constructor, dtype, window_kwargs, function, parallel, cols):
+ with warnings.catch_warnings(record=True):
+ self.window.apply(
+ function, raw=True, engine="numba", engine_kwargs={"parallel": parallel}
+ )
class EWMMethods:
- params = (["DataFrame", "Series"], [10, 1000], ["int", "float"], ["mean", "std"])
- param_names = ["constructor", "window", "dtype", "method"]
+ params = (
+ ["DataFrame", "Series"],
+ [
+ ({"halflife": 10}, "mean"),
+ ({"halflife": 10}, "std"),
+ ({"halflife": 1000}, "mean"),
+ ({"halflife": 1000}, "std"),
+ (
+ {
+ "halflife": "1 Day",
+ "times": pd.date_range("1900", periods=10 ** 5, freq="23s"),
+ },
+ "mean",
+ ),
+ ],
+ ["int", "float"],
+ )
+ param_names = ["constructor", "kwargs_method", "dtype"]
- def setup(self, constructor, window, dtype, method):
+ def setup(self, constructor, kwargs_method, dtype):
N = 10 ** 5
+ kwargs, method = kwargs_method
arr = (100 * np.random.random(N)).astype(dtype)
- times = pd.date_range("1900", periods=N, freq="23s")
- self.ewm = getattr(pd, constructor)(arr).ewm(halflife=window)
- self.ewm_times = getattr(pd, constructor)(arr).ewm(
- halflife="1 Day", times=times
- )
-
- def time_ewm(self, constructor, window, dtype, method):
- getattr(self.ewm, method)()
+ self.method = method
+ self.ewm = getattr(pd, constructor)(arr).ewm(**kwargs)
- def time_ewm_times(self, constructor, window, dtype, method):
- self.ewm_times.mean()
+ def time_ewm(self, constructor, kwargs_method, dtype):
+ getattr(self.ewm, self.method)()
class VariableWindowMethods(Methods):
@@ -122,7 +165,7 @@ class VariableWindowMethods(Methods):
["DataFrame", "Series"],
["50s", "1h", "1d"],
["int", "float"],
- ["median", "mean", "max", "min", "std", "count", "skew", "kurt", "sum"],
+ ["median", "mean", "max", "min", "std", "count", "skew", "kurt", "sum", "sem"],
)
param_names = ["constructor", "window", "dtype", "method"]
@@ -130,35 +173,35 @@ def setup(self, constructor, window, dtype, method):
N = 10 ** 5
arr = (100 * np.random.random(N)).astype(dtype)
index = pd.date_range("2017-01-01", periods=N, freq="5s")
- self.roll = getattr(pd, constructor)(arr, index=index).rolling(window)
+ self.window = getattr(pd, constructor)(arr, index=index).rolling(window)
class Pairwise:
- params = ([10, 1000, None], ["corr", "cov"], [True, False])
- param_names = ["window", "method", "pairwise"]
+ params = (
+ [({"window": 10}, "rolling"), ({"window": 1000}, "rolling"), ({}, "expanding")],
+ ["corr", "cov"],
+ [True, False],
+ )
+ param_names = ["window_kwargs", "method", "pairwise"]
- def setup(self, window, method, pairwise):
+ def setup(self, kwargs_window, method, pairwise):
N = 10 ** 4
n_groups = 20
+ kwargs, window = kwargs_window
groups = [i for _ in range(N // n_groups) for i in range(n_groups)]
arr = np.random.random(N)
self.df = pd.DataFrame(arr)
- self.df_group = pd.DataFrame({"A": groups, "B": arr}).groupby("A")
+ self.window = getattr(self.df, window)(**kwargs)
+ self.window_group = getattr(
+ pd.DataFrame({"A": groups, "B": arr}).groupby("A"), window
+ )(**kwargs)
- def time_pairwise(self, window, method, pairwise):
- if window is None:
- r = self.df.expanding()
- else:
- r = self.df.rolling(window=window)
- getattr(r, method)(self.df, pairwise=pairwise)
+ def time_pairwise(self, kwargs_window, method, pairwise):
+ getattr(self.window, method)(self.df, pairwise=pairwise)
- def time_groupby(self, window, method, pairwise):
- if window is None:
- r = self.df_group.expanding()
- else:
- r = self.df_group.rolling(window=window)
- getattr(r, method)(self.df, pairwise=pairwise)
+ def time_groupby(self, kwargs_window, method, pairwise):
+ getattr(self.window_group, method)(self.df, pairwise=pairwise)
class Quantile:
@@ -180,6 +223,33 @@ def time_quantile(self, constructor, window, dtype, percentile, interpolation):
self.roll.quantile(percentile, interpolation=interpolation)
+class Rank:
+ params = (
+ ["DataFrame", "Series"],
+ [10, 1000],
+ ["int", "float"],
+ [True, False],
+ [True, False],
+ ["min", "max", "average"],
+ )
+ param_names = [
+ "constructor",
+ "window",
+ "dtype",
+ "percentile",
+ "ascending",
+ "method",
+ ]
+
+ def setup(self, constructor, window, dtype, percentile, ascending, method):
+ N = 10 ** 5
+ arr = np.random.random(N).astype(dtype)
+ self.roll = getattr(pd, constructor)(arr).rolling(window)
+
+ def time_rank(self, constructor, window, dtype, percentile, ascending, method):
+ self.roll.rank(pct=percentile, ascending=ascending, method=method)
+
+
class PeakMemFixedWindowMinMax:
params = ["min", "max"]
@@ -218,10 +288,18 @@ def peakmem_rolling(self, constructor, window_size, dtype, method):
class Groupby:
- params = ["sum", "median", "mean", "max", "min", "kurt", "sum"]
+ params = (
+ ["sum", "median", "mean", "max", "min", "kurt", "sum"],
+ [
+ ("rolling", {"window": 2}),
+ ("rolling", {"window": "30s", "on": "C"}),
+ ("expanding", {}),
+ ],
+ )
- def setup(self, method):
+ def setup(self, method, window_kwargs):
N = 1000
+ window, kwargs = window_kwargs
df = pd.DataFrame(
{
"A": [str(i) for i in range(N)] * 10,
@@ -229,14 +307,10 @@ def setup(self, method):
"C": pd.date_range(start="1900-01-01", freq="1min", periods=N * 10),
}
)
- self.groupby_roll_int = df.groupby("A").rolling(window=2)
- self.groupby_roll_offset = df.groupby("A").rolling(window="30s", on="C")
-
- def time_rolling_int(self, method):
- getattr(self.groupby_roll_int, method)()
+ self.groupby_window = getattr(df.groupby("A"), window)(**kwargs)
- def time_rolling_offset(self, method):
- getattr(self.groupby_roll_offset, method)()
+ def time_method(self, method, window_kwargs):
+ getattr(self.groupby_window, method)()
class GroupbyLargeGroups:
@@ -296,5 +370,8 @@ def time_apply(self, method):
table_method_func, raw=True, engine="numba"
)
+ def time_ewm_mean(self, method):
+ self.df.ewm(1, method=method).mean(engine="numba")
+
from .pandas_vb_common import setup # noqa: F401 isort:skip
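
# The reworked window benchmarks drive rolling and expanding through a single
# (window, kwargs) parametrization and add "sem" to the timed aggregations:
import numpy as np
import pandas as pd

ser = pd.Series(np.random.random(10))
for window, kwargs in [("rolling", {"window": 3}), ("expanding", {})]:
    win = getattr(ser, window)(**kwargs)
    print(window, win.sem().iloc[-1])  # standard error of the mean
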
diff --git a/asv_bench/benchmarks/series_methods.py b/asv_bench/benchmarks/series_methods.py
index 7592ce54e3712..d8578ed604ae3 100644
--- a/asv_bench/benchmarks/series_methods.py
+++ b/asv_bench/benchmarks/series_methods.py
@@ -27,6 +27,19 @@ def time_constructor(self, data):
Series(data=self.data, index=self.idx)
+class ToFrame:
+ params = [["int64", "datetime64[ns]", "category", "Int64"], [None, "foo"]]
+ param_names = ["dtype", "name"]
+
+ def setup(self, dtype, name):
+ arr = np.arange(10 ** 5)
+ ser = Series(arr, dtype=dtype)
+ self.ser = ser
+
+ def time_to_frame(self, dtype, name):
+ self.ser.to_frame(name)
+
+
class NSort:
params = ["first", "last", "all"]
@@ -139,6 +152,18 @@ def time_value_counts(self, N, dtype):
self.s.value_counts()
+class ValueCountsObjectDropNAFalse:
+
+ params = [10 ** 3, 10 ** 4, 10 ** 5]
+ param_names = ["N"]
+
+ def setup(self, N):
+ self.s = Series(np.random.randint(0, N, size=10 * N)).astype("object")
+
+ def time_value_counts(self, N):
+ self.s.value_counts(dropna=False)
+
+
class Mode:
params = [[10 ** 3, 10 ** 4, 10 ** 5], ["int", "uint", "float", "object"]]
@@ -151,6 +176,18 @@ def time_mode(self, N, dtype):
self.s.mode()
+class ModeObjectDropNAFalse:
+
+ params = [10 ** 3, 10 ** 4, 10 ** 5]
+ param_names = ["N"]
+
+ def setup(self, N):
+ self.s = Series(np.random.randint(0, N, size=10 * N)).astype("object")
+
+ def time_mode(self, N):
+ self.s.mode(dropna=False)
+
+
class Dir:
def setup(self):
self.s = Series(index=tm.makeStringIndex(10000))
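
# The new object-dtype cases time value_counts/mode with dropna=False, where
# missing values have to be tallied as their own group:
import numpy as np
from pandas import Series

ser = Series([1, 1, np.nan, np.nan, np.nan], dtype="object")
print(ser.value_counts(dropna=False))  # NaN appears with count 3
print(ser.mode(dropna=False))          # NaN is the mode here
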
diff --git a/asv_bench/benchmarks/sparse.py b/asv_bench/benchmarks/sparse.py
index 35e5818cd3b2b..ec704896f5726 100644
--- a/asv_bench/benchmarks/sparse.py
+++ b/asv_bench/benchmarks/sparse.py
@@ -67,16 +67,42 @@ def time_sparse_series_from_coo(self):
class ToCoo:
- def setup(self):
+ params = [True, False]
+ param_names = ["sort_labels"]
+
+ def setup(self, sort_labels):
s = Series([np.nan] * 10000)
s[0] = 3.0
s[100] = -1.0
s[999] = 12.1
- s.index = MultiIndex.from_product([range(10)] * 4)
- self.ss = s.astype("Sparse")
- def time_sparse_series_to_coo(self):
- self.ss.sparse.to_coo(row_levels=[0, 1], column_levels=[2, 3], sort_labels=True)
+ s_mult_lvl = s.set_axis(MultiIndex.from_product([range(10)] * 4))
+ self.ss_mult_lvl = s_mult_lvl.astype("Sparse")
+
+ s_two_lvl = s.set_axis(MultiIndex.from_product([range(100)] * 2))
+ self.ss_two_lvl = s_two_lvl.astype("Sparse")
+
+ def time_sparse_series_to_coo(self, sort_labels):
+ self.ss_mult_lvl.sparse.to_coo(
+ row_levels=[0, 1], column_levels=[2, 3], sort_labels=sort_labels
+ )
+
+ def time_sparse_series_to_coo_single_level(self, sort_labels):
+ self.ss_two_lvl.sparse.to_coo(sort_labels=sort_labels)
+
+
+class ToCooFrame:
+ def setup(self):
+ N = 10000
+ k = 10
+ arr = np.zeros((N, k), dtype=float)
+ arr[0, 0] = 3.0
+ arr[12, 7] = -1.0
+ arr[0, 9] = 11.2
+ self.df = pd.DataFrame(arr, dtype=pd.SparseDtype("float", fill_value=0.0))
+
+ def time_to_coo(self):
+ self.df.sparse.to_coo()
class Arithmetic:
@@ -140,4 +166,68 @@ def time_division(self, fill_value):
self.arr1 / self.arr2
+class MinMax:
+
+ params = (["min", "max"], [0.0, np.nan])
+ param_names = ["func", "fill_value"]
+
+ def setup(self, func, fill_value):
+ N = 1_000_000
+ arr = make_array(N, 1e-5, fill_value, np.float64)
+ self.sp_arr = SparseArray(arr, fill_value=fill_value)
+
+ def time_min_max(self, func, fill_value):
+ getattr(self.sp_arr, func)()
+
+
+class Take:
+
+ params = ([np.array([0]), np.arange(100_000), np.full(100_000, -1)], [True, False])
+ param_names = ["indices", "allow_fill"]
+
+ def setup(self, indices, allow_fill):
+ N = 1_000_000
+ fill_value = 0.0
+ arr = make_array(N, 1e-5, fill_value, np.float64)
+ self.sp_arr = SparseArray(arr, fill_value=fill_value)
+
+ def time_take(self, indices, allow_fill):
+ self.sp_arr.take(indices, allow_fill=allow_fill)
+
+
+class GetItem:
+ def setup(self):
+ N = 1_000_000
+ d = 1e-5
+ arr = make_array(N, d, np.nan, np.float64)
+ self.sp_arr = SparseArray(arr)
+
+ def time_integer_indexing(self):
+ self.sp_arr[78]
+
+ def time_slice(self):
+ self.sp_arr[1:]
+
+
+class GetItemMask:
+
+ params = [True, False, np.nan]
+ param_names = ["fill_value"]
+
+ def setup(self, fill_value):
+ N = 1_000_000
+ d = 1e-5
+ arr = make_array(N, d, np.nan, np.float64)
+ self.sp_arr = SparseArray(arr)
+ b_arr = np.full(shape=N, fill_value=fill_value, dtype=np.bool8)
+ fv_inds = np.unique(
+ np.random.randint(low=0, high=N - 1, size=int(N * d), dtype=np.int32)
+ )
+ b_arr[fv_inds] = True if pd.isna(fill_value) else not fill_value
+ self.sp_b_arr = SparseArray(b_arr, dtype=np.bool8, fill_value=fill_value)
+
+ def time_mask(self, fill_value):
+ self.sp_arr[self.sp_b_arr]
+
+
from .pandas_vb_common import setup # noqa: F401 isort:skip
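
# The added MinMax and Take cases operate directly on SparseArray; in brief:
import numpy as np
from pandas.arrays import SparseArray

arr = SparseArray([0.0, 0.0, 3.0, 0.0, -1.0], fill_value=0.0)
print(arr.min(), arr.max())           # -1.0 3.0
print(arr.take(np.array([0, 2, 4])))  # values at positions 0, 2 and 4
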
diff --git a/asv_bench/benchmarks/tslibs/fields.py b/asv_bench/benchmarks/tslibs/fields.py
index 0607a799ec707..23ae73811204c 100644
--- a/asv_bench/benchmarks/tslibs/fields.py
+++ b/asv_bench/benchmarks/tslibs/fields.py
@@ -12,7 +12,7 @@
class TimeGetTimedeltaField:
params = [
_sizes,
- ["days", "h", "s", "seconds", "ms", "microseconds", "us", "ns", "nanoseconds"],
+ ["days", "seconds", "microseconds", "nanoseconds"],
]
param_names = ["size", "field"]
diff --git a/azure-pipelines.yml b/azure-pipelines.yml
index 956feaef5f83e..9c04d10707a64 100644
--- a/azure-pipelines.yml
+++ b/azure-pipelines.yml
@@ -2,43 +2,48 @@
trigger:
branches:
include:
- - master
- - 1.2.x
+ - main
+ - 1.4.x
paths:
exclude:
- 'doc/*'
pr:
-- master
-- 1.2.x
+ autoCancel: true
+ branches:
+ include:
+ - main
+ - 1.4.x
variables:
PYTEST_WORKERS: auto
+ PYTEST_TARGET: pandas
jobs:
# Mac and Linux use the same template
- template: ci/azure/posix.yml
parameters:
name: macOS
- vmImage: macOS-10.14
+ vmImage: macOS-10.15
- template: ci/azure/windows.yml
parameters:
name: Windows
- vmImage: vs2017-win2016
+ vmImage: windows-2019
-- job: py37_32bit
+- job: py38_32bit
pool:
vmImage: ubuntu-18.04
steps:
+ # TODO: GH#44980 https://github.com/pypa/setuptools/issues/2941
- script: |
docker pull quay.io/pypa/manylinux2014_i686
docker run -v $(pwd):/pandas quay.io/pypa/manylinux2014_i686 \
/bin/bash -xc "cd pandas && \
- /opt/python/cp37-cp37m/bin/python -m venv ~/virtualenvs/pandas-dev && \
+ /opt/python/cp38-cp38/bin/python -m venv ~/virtualenvs/pandas-dev && \
. ~/virtualenvs/pandas-dev/bin/activate && \
- python -m pip install --no-deps -U pip wheel setuptools && \
+ python -m pip install --no-deps -U pip wheel 'setuptools<60.0.0' && \
pip install cython numpy python-dateutil pytz pytest pytest-xdist hypothesis pytest-azurepipelines && \
python setup.py build_ext -q -j2 && \
python -m pip install --no-build-isolation -e . && \
@@ -50,4 +55,4 @@ jobs:
inputs:
testResultsFiles: '**/test-*.xml'
failTaskOnFailedTests: true
- testRunTitle: 'Publish test results for Python 3.7-32 bit full Linux'
+ testRunTitle: 'Publish test results for Python 3.8-32 bit full Linux'
diff --git a/ci/azure/posix.yml b/ci/azure/posix.yml
index 2caacf3a07290..02a4a9ad44865 100644
--- a/ci/azure/posix.yml
+++ b/ci/azure/posix.yml
@@ -8,11 +8,36 @@ jobs:
vmImage: ${{ parameters.vmImage }}
strategy:
matrix:
- ${{ if eq(parameters.name, 'macOS') }}:
- py37_macos:
- ENV_FILE: ci/deps/azure-macos-37.yaml
- CONDA_PY: "37"
- PATTERN: "not slow and not network"
+ py38_macos_1:
+ ENV_FILE: ci/deps/azure-macos-38.yaml
+ CONDA_PY: "38"
+ PATTERN: "not slow"
+ PYTEST_TARGET: "pandas/tests/[a-h]*"
+ py38_macos_2:
+ ENV_FILE: ci/deps/azure-macos-38.yaml
+ CONDA_PY: "38"
+ PATTERN: "not slow"
+ PYTEST_TARGET: "pandas/tests/[i-z]*"
+ py39_macos_1:
+ ENV_FILE: ci/deps/azure-macos-39.yaml
+ CONDA_PY: "39"
+ PATTERN: "not slow"
+ PYTEST_TARGET: "pandas/tests/[a-h]*"
+ py39_macos_2:
+ ENV_FILE: ci/deps/azure-macos-39.yaml
+ CONDA_PY: "39"
+ PATTERN: "not slow"
+ PYTEST_TARGET: "pandas/tests/[i-z]*"
+ py310_macos_1:
+ ENV_FILE: ci/deps/azure-macos-310.yaml
+ CONDA_PY: "310"
+ PATTERN: "not slow"
+ PYTEST_TARGET: "pandas/tests/[a-h]*"
+ py310_macos_2:
+ ENV_FILE: ci/deps/azure-macos-310.yaml
+ CONDA_PY: "310"
+ PATTERN: "not slow"
+ PYTEST_TARGET: "pandas/tests/[i-z]*"
steps:
- script: echo '##vso[task.prependpath]$(HOME)/miniconda3/bin'
diff --git a/ci/azure/windows.yml b/ci/azure/windows.yml
index 5644ad46714d5..7061a266f28c7 100644
--- a/ci/azure/windows.yml
+++ b/ci/azure/windows.yml
@@ -8,41 +8,70 @@ jobs:
vmImage: ${{ parameters.vmImage }}
strategy:
matrix:
- py37_np17:
- ENV_FILE: ci/deps/azure-windows-37.yaml
- CONDA_PY: "37"
- PATTERN: "not slow and not network"
+ py38_np18_1:
+ ENV_FILE: ci/deps/azure-windows-38.yaml
+ CONDA_PY: "38"
+ PATTERN: "not slow"
+ PYTEST_WORKERS: 2 # GH-42236
+ PYTEST_TARGET: "pandas/tests/[a-h]*"
- py38_np18:
+ py38_np18_2:
ENV_FILE: ci/deps/azure-windows-38.yaml
CONDA_PY: "38"
- PATTERN: "not slow and not network and not high_memory"
+ PATTERN: "not slow"
+ PYTEST_WORKERS: 2 # GH-42236
+ PYTEST_TARGET: "pandas/tests/[i-z]*"
+
+ py39_1:
+ ENV_FILE: ci/deps/azure-windows-39.yaml
+ CONDA_PY: "39"
+ PATTERN: "not slow and not high_memory"
+ PYTEST_WORKERS: 2 # GH-42236
+ PYTEST_TARGET: "pandas/tests/[a-h]*"
+
+ py39_2:
+ ENV_FILE: ci/deps/azure-windows-39.yaml
+ CONDA_PY: "39"
+ PATTERN: "not slow and not high_memory"
+ PYTEST_WORKERS: 2 # GH-42236
+ PYTEST_TARGET: "pandas/tests/[i-z]*"
+
+ py310_1:
+ ENV_FILE: ci/deps/azure-windows-310.yaml
+ CONDA_PY: "310"
+ PATTERN: "not slow and not high_memory"
+ PYTEST_WORKERS: 2 # GH-42236
+ PYTEST_TARGET: "pandas/tests/[a-h]*"
+
+ py310_2:
+ ENV_FILE: ci/deps/azure-windows-310.yaml
+ CONDA_PY: "310"
+ PATTERN: "not slow and not high_memory"
+ PYTEST_WORKERS: 2 # GH-42236
+ PYTEST_TARGET: "pandas/tests/[i-z]*"
steps:
- powershell: |
Write-Host "##vso[task.prependpath]$env:CONDA\Scripts"
Write-Host "##vso[task.prependpath]$HOME/miniconda3/bin"
displayName: 'Add conda to PATH'
-
- script: conda update -q -n base conda
displayName: 'Update conda'
- bash: |
conda env create -q --file ci\\deps\\azure-windows-$(CONDA_PY).yaml
displayName: 'Create anaconda environment'
-
- bash: |
source activate pandas-dev
conda list
python setup.py build_ext -q -j 4
python -m pip install --no-build-isolation -e .
displayName: 'Build'
-
- bash: |
source activate pandas-dev
+ wmic.exe cpu get caption, deviceid, name, numberofcores, maxclockspeed
ci/run_tests.sh
displayName: 'Test'
-
- task: PublishTestResults@2
condition: succeededOrFailed()
inputs:
diff --git a/ci/code_checks.sh b/ci/code_checks.sh
index 1844cb863c183..4498585e36ce5 100755
--- a/ci/code_checks.sh
+++ b/ci/code_checks.sh
@@ -3,22 +3,18 @@
# Run checks related to code quality.
#
# This script is intended for both the CI and to check locally that code standards are
-# respected. We are currently linting (PEP-8 and similar), looking for patterns of
-# common mistakes (sphinx directives with missing blank lines, old style classes,
-# unwanted imports...), we run doctests here (currently some files only), and we
+# respected. We run doctests here (currently some files only), and we
# validate formatting error in docstrings.
#
# Usage:
# $ ./ci/code_checks.sh # run all checks
-# $ ./ci/code_checks.sh lint # run linting only
-# $ ./ci/code_checks.sh patterns # check for patterns that should not exist
# $ ./ci/code_checks.sh code # checks on imported code
# $ ./ci/code_checks.sh doctests # run doctests
# $ ./ci/code_checks.sh docstrings # validate docstring errors
# $ ./ci/code_checks.sh typing # run static type analysis
-[[ -z "$1" || "$1" == "lint" || "$1" == "patterns" || "$1" == "code" || "$1" == "doctests" || "$1" == "docstrings" || "$1" == "typing" ]] || \
- { echo "Unknown command $1. Usage: $0 [lint|patterns|code|doctests|docstrings|typing]"; exit 9999; }
+[[ -z "$1" || "$1" == "code" || "$1" == "doctests" || "$1" == "docstrings" || "$1" == "typing" ]] || \
+ { echo "Unknown command $1. Usage: $0 [code|doctests|docstrings|typing]"; exit 9999; }
BASE_DIR="$(dirname $0)/.."
RET=0
@@ -38,49 +34,7 @@ function invgrep {
}
if [[ "$GITHUB_ACTIONS" == "true" ]]; then
- FLAKE8_FORMAT="##[error]%(path)s:%(row)s:%(col)s:%(code)s:%(text)s"
INVGREP_PREPEND="##[error]"
-else
- FLAKE8_FORMAT="default"
-fi
-
-### LINTING ###
-if [[ -z "$CHECK" || "$CHECK" == "lint" ]]; then
-
- # Check that cython casting is of the form `<type>obj` as opposed to `<type> obj`;
- # it doesn't make a difference, but we want to be internally consistent.
- # Note: this grep pattern is (intended to be) equivalent to the python
- # regex r'(?<![ ->])> '
- MSG='Linting .pyx code for spacing conventions in casting' ; echo $MSG
- invgrep -r -E --include '*.pyx' --include '*.pxi.in' '[a-zA-Z0-9*]> ' pandas/_libs
- RET=$(($RET + $?)) ; echo $MSG "DONE"
-
- # readability/casting: Warnings about C casting instead of C++ casting
- # runtime/int: Warnings about using C number types instead of C++ ones
- # build/include_subdir: Warnings about prefacing included header files with directory
-
-fi
-
-### PATTERNS ###
-if [[ -z "$CHECK" || "$CHECK" == "patterns" ]]; then
-
- # Check for the following code in the extension array base tests: `tm.assert_frame_equal` and `tm.assert_series_equal`
- MSG='Check for invalid EA testing' ; echo $MSG
- invgrep -r -E --include '*.py' --exclude base.py 'tm.assert_(series|frame)_equal' pandas/tests/extension/base
- RET=$(($RET + $?)) ; echo $MSG "DONE"
-
- MSG='Check for deprecated messages without sphinx directive' ; echo $MSG
- invgrep -R --include="*.py" --include="*.pyx" -E "(DEPRECATED|DEPRECATE|Deprecated)(:|,|\.)" pandas
- RET=$(($RET + $?)) ; echo $MSG "DONE"
-
- MSG='Check for backticks incorrectly rendering because of missing spaces' ; echo $MSG
- invgrep -R --include="*.rst" -E "[a-zA-Z0-9]\`\`?[a-zA-Z0-9]" doc/source/
- RET=$(($RET + $?)) ; echo $MSG "DONE"
-
- MSG='Check for unnecessary random seeds in asv benchmarks' ; echo $MSG
- invgrep -R --exclude pandas_vb_common.py -E 'np.random.seed' asv_bench/benchmarks/
- RET=$(($RET + $?)) ; echo $MSG "DONE"
-
fi
### CODE ###
@@ -110,45 +64,13 @@ fi
### DOCTESTS ###
if [[ -z "$CHECK" || "$CHECK" == "doctests" ]]; then
- MSG='Doctests for individual files' ; echo $MSG
- pytest -q --doctest-modules \
- pandas/core/accessor.py \
- pandas/core/aggregation.py \
- pandas/core/algorithms.py \
- pandas/core/base.py \
- pandas/core/construction.py \
- pandas/core/frame.py \
- pandas/core/generic.py \
- pandas/core/indexers.py \
- pandas/core/nanops.py \
- pandas/core/series.py \
- pandas/io/sql.py
+ MSG='Doctests' ; echo $MSG
+ # Ignore test_*.py files or else the unit tests will run
+ python -m pytest --doctest-modules --ignore-glob="**/test_*.py" pandas
RET=$(($RET + $?)) ; echo $MSG "DONE"
- MSG='Doctests for directories' ; echo $MSG
- pytest -q --doctest-modules \
- pandas/_libs/ \
- pandas/api/ \
- pandas/arrays/ \
- pandas/compat/ \
- pandas/core/array_algos/ \
- pandas/core/arrays/ \
- pandas/core/computation/ \
- pandas/core/dtypes/ \
- pandas/core/groupby/ \
- pandas/core/indexes/ \
- pandas/core/ops/ \
- pandas/core/reshape/ \
- pandas/core/strings/ \
- pandas/core/tools/ \
- pandas/core/window/ \
- pandas/errors/ \
- pandas/io/clipboard/ \
- pandas/io/json/ \
- pandas/io/excel/ \
- pandas/io/parsers/ \
- pandas/io/sas/ \
- pandas/tseries/
+ MSG='Cython Doctests' ; echo $MSG
+ python -m pytest --doctest-cython pandas/_libs
RET=$(($RET + $?)) ; echo $MSG "DONE"
fi
@@ -156,8 +78,8 @@ fi
### DOCSTRINGS ###
if [[ -z "$CHECK" || "$CHECK" == "docstrings" ]]; then
- MSG='Validate docstrings (GL03, GL04, GL05, GL06, GL07, GL09, GL10, SS01, SS02, SS04, SS05, PR03, PR04, PR05, PR10, EX04, RT01, RT04, RT05, SA02, SA03)' ; echo $MSG
- $BASE_DIR/scripts/validate_docstrings.py --format=actions --errors=GL03,GL04,GL05,GL06,GL07,GL09,GL10,SS02,SS04,SS05,PR03,PR04,PR05,PR10,EX04,RT01,RT04,RT05,SA02,SA03
+ MSG='Validate docstrings (GL01, GL02, GL03, GL04, GL05, GL06, GL07, GL09, GL10, SS02, SS03, SS04, SS05, PR03, PR04, PR05, PR06, PR08, PR09, PR10, EX04, RT01, RT04, RT05, SA02, SA03)' ; echo $MSG
+ $BASE_DIR/scripts/validate_docstrings.py --format=actions --errors=GL01,GL02,GL03,GL04,GL05,GL06,GL07,GL09,GL10,SS02,SS03,SS04,SS05,PR03,PR04,PR05,PR06,PR08,PR09,PR10,EX04,RT01,RT04,RT05,SA02,SA03
RET=$(($RET + $?)) ; echo $MSG "DONE"
fi
@@ -169,8 +91,15 @@ if [[ -z "$CHECK" || "$CHECK" == "typing" ]]; then
mypy --version
MSG='Performing static analysis using mypy' ; echo $MSG
- mypy pandas
+ mypy
RET=$(($RET + $?)) ; echo $MSG "DONE"
+
+ # run pyright, if it is installed
+ if command -v pyright &> /dev/null ; then
+ MSG='Performing static analysis using pyright' ; echo $MSG
+ pyright
+ RET=$(($RET + $?)) ; echo $MSG "DONE"
+ fi
fi
exit $RET
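
The typing block above only invokes pyright when the executable exists on ``PATH`` (``command -v pyright``), so contributors without a node.js toolchain are not penalized. The same guard expressed in Python with ``shutil.which``, as a minimal sketch:

.. code-block:: python

    import shutil
    import subprocess

    # Mirror the shell's `command -v pyright` check
    if shutil.which("pyright") is not None:
        result = subprocess.run(["pyright"], check=False)
        print("pyright exit status:", result.returncode)
    else:
        print("pyright not installed; skipping static analysis")
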
diff --git a/ci/deps/actions-38-numpydev.yaml b/ci/deps/actions-310-numpydev.yaml
similarity index 64%
rename from ci/deps/actions-38-numpydev.yaml
rename to ci/deps/actions-310-numpydev.yaml
index 6eed2daac0c3b..3e32665d5433f 100644
--- a/ci/deps/actions-38-numpydev.yaml
+++ b/ci/deps/actions-310-numpydev.yaml
@@ -2,20 +2,20 @@ name: pandas-dev
channels:
- defaults
dependencies:
- - python=3.8.*
+ - python=3.10
# tools
- pytest>=6.0
- pytest-cov
- - pytest-xdist>=1.21
- - hypothesis>=3.58.0
+ - pytest-xdist>=1.31
+ - hypothesis>=5.5.3
# pandas dependencies
+ - python-dateutil
- pytz
- pip
- pip:
- - cython==0.29.21 # GH#34014
- - "git+git://github.com/dateutil/dateutil.git"
+ - cython==0.29.24 # GH#34014
- "--extra-index-url https://pypi.anaconda.org/scipy-wheels-nightly/simple"
- "--pre"
- "numpy"
diff --git a/ci/deps/actions-310.yaml b/ci/deps/actions-310.yaml
new file mode 100644
index 0000000000000..9829380620f86
--- /dev/null
+++ b/ci/deps/actions-310.yaml
@@ -0,0 +1,51 @@
+name: pandas-dev
+channels:
+ - conda-forge
+dependencies:
+ - python=3.10
+
+ # test dependencies
+ - cython=0.29.24
+ - pytest>=6.0
+ - pytest-cov
+ - pytest-xdist>=1.31
+ - hypothesis>=5.5.3
+ - psutil
+
+ # required dependencies
+ - python-dateutil
+ - numpy
+ - pytz
+
+ # optional dependencies
+ - beautifulsoup4
+ - blosc
+ - bottleneck
+ - fastparquet
+ - fsspec
+ - html5lib
+ - gcsfs
+ - jinja2
+ - lxml
+ - matplotlib
+ # TODO: uncomment after numba supports py310
+ #- numba
+ - numexpr
+ - openpyxl
+ - odfpy
+ - pandas-gbq
+ - psycopg2
+ - pymysql
+ - pytables
+ - pyarrow
+ - pyreadstat
+ - pyxlsb
+ - s3fs
+ - scipy
+ - sqlalchemy
+ - tabulate
+ - xarray
+ - xlrd
+ - xlsxwriter
+ - xlwt
+ - zstandard
diff --git a/ci/deps/actions-37-db-min.yaml b/ci/deps/actions-37-db-min.yaml
deleted file mode 100644
index cae4361ca37a7..0000000000000
--- a/ci/deps/actions-37-db-min.yaml
+++ /dev/null
@@ -1,48 +0,0 @@
-name: pandas-dev
-channels:
- - conda-forge
-dependencies:
- - python=3.7.*
-
- # tools
- - cython>=0.29.21
- - pytest>=6.0
- - pytest-cov
- - pytest-xdist>=1.21
- - hypothesis>=3.58.0
-
- # required
- - numpy<1.20 # GH#39541 compat for pyarrow<3
- - python-dateutil
- - pytz
-
- # optional
- - beautifulsoup4
- - blosc=1.17.0
- - python-blosc
- - fastparquet=0.4.0
- - html5lib
- - ipython
- - jinja2
- - lxml=4.3.0
- - matplotlib
- - nomkl
- - numexpr
- - openpyxl
- - pandas-gbq
- - google-cloud-bigquery>=1.27.2 # GH 36436
- - protobuf>=3.12.4
- - pyarrow=0.17.1 # GH 38803
- - pytables>=3.5.1
- - scipy
- - xarray=0.12.3
- - xlrd<2.0
- - xlsxwriter
- - xlwt
- - moto
- - flask
-
- # sql
- - psycopg2=2.7
- - pymysql=0.8.1
- - sqlalchemy=1.3.0
diff --git a/ci/deps/actions-37-locale_slow.yaml b/ci/deps/actions-37-locale_slow.yaml
deleted file mode 100644
index c6eb3b00a63ac..0000000000000
--- a/ci/deps/actions-37-locale_slow.yaml
+++ /dev/null
@@ -1,30 +0,0 @@
-name: pandas-dev
-channels:
- - defaults
- - conda-forge
-dependencies:
- - python=3.7.*
-
- # tools
- - cython>=0.29.21
- - pytest>=6.0
- - pytest-cov
- - pytest-xdist>=1.21
- - hypothesis>=3.58.0
-
- # pandas dependencies
- - beautifulsoup4=4.6.0
- - bottleneck=1.2.*
- - lxml
- - matplotlib=3.0.0
- - numpy=1.17.*
- - openpyxl=3.0.0
- - python-dateutil
- - python-blosc
- - pytz=2017.3
- - scipy
- - sqlalchemy=1.3.0
- - xlrd=1.2.0
- - xlsxwriter=1.0.2
- - xlwt=1.3.0
- - html5lib=1.0.1
diff --git a/ci/deps/actions-37-minimum_versions.yaml b/ci/deps/actions-37-minimum_versions.yaml
deleted file mode 100644
index b97601d18917c..0000000000000
--- a/ci/deps/actions-37-minimum_versions.yaml
+++ /dev/null
@@ -1,31 +0,0 @@
-name: pandas-dev
-channels:
- - conda-forge
-dependencies:
- - python=3.7.1
-
- # tools
- - cython=0.29.21
- - pytest>=6.0
- - pytest-cov
- - pytest-xdist>=1.21
- - hypothesis>=3.58.0
- - psutil
-
- # pandas dependencies
- - beautifulsoup4=4.6.0
- - bottleneck=1.2.1
- - jinja2=2.10
- - numba=0.46.0
- - numexpr=2.7.0
- - numpy=1.17.3
- - openpyxl=3.0.0
- - pytables=3.5.1
- - python-dateutil=2.7.3
- - pytz=2017.3
- - pyarrow=0.17.0
- - scipy=1.2
- - xlrd=1.2.0
- - xlsxwriter=1.0.2
- - xlwt=1.3.0
- - html5lib=1.0.1
diff --git a/ci/deps/actions-37.yaml b/ci/deps/actions-37.yaml
deleted file mode 100644
index 0effe6f80df86..0000000000000
--- a/ci/deps/actions-37.yaml
+++ /dev/null
@@ -1,28 +0,0 @@
-name: pandas-dev
-channels:
- - defaults
- - conda-forge
-dependencies:
- - python=3.7.*
-
- # tools
- - cython>=0.29.21
- - pytest>=6.0
- - pytest-cov
- - pytest-xdist>=1.21
- - hypothesis>=3.58.0
-
- # pandas dependencies
- - botocore>=1.11
- - fsspec>=0.7.4
- - numpy=1.19
- - python-dateutil
- - nomkl
- - pyarrow
- - pytz
- - s3fs>=0.4.0
- - moto>=1.3.14
- - flask
- - tabulate
- - pyreadstat
- - pip
diff --git a/ci/deps/actions-37-db.yaml b/ci/deps/actions-38-downstream_compat.yaml
similarity index 51%
rename from ci/deps/actions-37-db.yaml
rename to ci/deps/actions-38-downstream_compat.yaml
index e568f8615a8df..af4f7dee851d5 100644
--- a/ci/deps/actions-37-db.yaml
+++ b/ci/deps/actions-38-downstream_compat.yaml
@@ -1,54 +1,66 @@
+# Packages that are not direct pandas dependencies, but that pandas utilizes or that have compatibility with pandas objects
name: pandas-dev
channels:
- conda-forge
dependencies:
- - python=3.7.*
+ - python=3.8
+ - pip
- # tools
- - cython>=0.29.21
+ # test dependencies
+ - cython>=0.29.24
- pytest>=6.0
- - pytest-xdist>=1.21
- - hypothesis>=3.58.0
+ - pytest-xdist>=1.31
+ - hypothesis>=5.5.3
- pytest-cov>=2.10.1 # this is only needed in the coverage build, ref: GH 35737
+ - nomkl
+
+ # required dependencies
+ - numpy
+ - python-dateutil
+ - pytz
- # pandas dependencies
+ # optional dependencies
- beautifulsoup4
- - botocore>=1.11
- - dask
+ - blosc
- fastparquet>=0.4.0
- fsspec>=0.7.4
- - gcsfs>=0.6.0
- - geopandas
+ - gcsfs
- html5lib
+ - jinja2
+ - lxml
- matplotlib
- - moto>=1.3.14
- - flask
- - nomkl
- numexpr
- - numpy=1.17.*
- odfpy
- openpyxl
- pandas-gbq
- - google-cloud-bigquery>=1.27.2 # GH 36436
- psycopg2
- - pyarrow>=0.17.0
+ - pyarrow>=1.0.1
- pymysql
- pytables
- - python-snappy
- - python-dateutil
- - pytz
+ - pyxlsb
- s3fs>=0.4.0
- - scikit-learn
- scipy
- sqlalchemy
- - statsmodels
- xarray
- - xlrd<2.0
+ - xlrd
- xlsxwriter
- xlwt
- - pip
+
+ # downstream packages
+ - aiobotocore<2.0.0 # GH#44311 pinned to fix docbuild
+ - boto3
+ - botocore>=1.11
+ - dask
+ - ipython
+ - geopandas
+ - python-snappy
+ - seaborn
+ - scikit-learn
+ - statsmodels
+ - brotlipy
+ - coverage
+ - pandas-datareader
+ - pyyaml
+ - py
- pip:
- - brotlipy
- - coverage
- - pandas-datareader
- - pyxlsb
+ - torch
diff --git a/ci/deps/actions-38-locale.yaml b/ci/deps/actions-38-locale.yaml
deleted file mode 100644
index 34a6860936550..0000000000000
--- a/ci/deps/actions-38-locale.yaml
+++ /dev/null
@@ -1,41 +0,0 @@
-name: pandas-dev
-channels:
- - conda-forge
-dependencies:
- - python=3.8.*
-
- # tools
- - cython>=0.29.21
- - pytest>=6.0
- - pytest-cov
- - pytest-xdist>=1.21
- - pytest-asyncio>=0.12.0
- - hypothesis>=3.58.0
-
- # pandas dependencies
- - beautifulsoup4
- - flask
- - html5lib
- - ipython
- - jinja2
- - jedi<0.18.0
- - lxml
- - matplotlib<3.3.0
- - moto
- - nomkl
- - numexpr
- - numpy<1.20 # GH#39541 compat with pyarrow<3
- - openpyxl
- - pytables
- - python-dateutil
- - pytz
- - scipy
- - xarray
- - xlrd<2.0
- - xlsxwriter
- - xlwt
- - moto
- - pyarrow=1.0.0
- - pip
- - pip:
- - pyxlsb
diff --git a/ci/deps/actions-38-minimum_versions.yaml b/ci/deps/actions-38-minimum_versions.yaml
new file mode 100644
index 0000000000000..467402bb6ef7f
--- /dev/null
+++ b/ci/deps/actions-38-minimum_versions.yaml
@@ -0,0 +1,52 @@
+# Minimum versions of required + optional dependencies
+# Aligned with getting_started/install.rst and compat/_optional.py
+name: pandas-dev
+channels:
+ - conda-forge
+dependencies:
+ - python=3.8.0
+
+ # test dependencies
+ - cython=0.29.24
+ - pytest>=6.0
+ - pytest-cov
+ - pytest-xdist>=1.31
+ - hypothesis>=5.5.3
+ - psutil
+
+ # required dependencies
+ - python-dateutil=2.8.1
+ - numpy=1.18.5
+ - pytz=2020.1
+
+ # optional dependencies
+ - beautifulsoup4=4.8.2
+ - blosc=1.20.1
+ - bottleneck=1.3.1
+ - fastparquet=0.4.0
+ - fsspec=0.7.4
+ - html5lib=1.1
+ - gcsfs=0.6.0
+ - jinja2=2.11
+ - lxml=4.5.0
+ - matplotlib=3.3.2
+ - numba=0.50.1
+ - numexpr=2.7.1
+ - odfpy=1.4.1
+ - openpyxl=3.0.3
+ - pandas-gbq=0.14.0
+ - psycopg2=2.8.4
+ - pymysql=0.10.1
+ - pytables=3.6.1
+ - pyarrow=1.0.1
+ - pyreadstat=1.1.0
+ - pyxlsb=1.0.6
+ - s3fs=0.4.0
+ - scipy=1.4.1
+ - sqlalchemy=1.4.0
+ - tabulate=0.8.7
+ - xarray=0.15.1
+ - xlrd=2.0.1
+ - xlsxwriter=1.2.2
+ - xlwt=1.3.0
+ - zstandard=0.15.2
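
This environment pins every dependency at its documented floor so the suite proves those minimums still work. A simplified sketch of the kind of runtime check ``compat/_optional.py`` performs (the helper name, message, and version table here are illustrative, not pandas' actual implementation):

.. code-block:: python

    from importlib.metadata import PackageNotFoundError, version

    from packaging.version import Version

    # Illustrative subset of the minimum-version table
    MIN_VERSIONS = {"bottleneck": "1.3.1", "numexpr": "2.7.1"}

    def check_minimum(name: str) -> None:
        """Raise if an installed optional dependency is below the supported floor."""
        try:
            installed = version(name)
        except PackageNotFoundError:
            return  # not installed; optional dependencies may be absent
        if Version(installed) < Version(MIN_VERSIONS[name]):
            raise ImportError(
                f"{name}>={MIN_VERSIONS[name]} is required, found {installed}"
            )

    check_minimum("numexpr")
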
diff --git a/ci/deps/actions-38-slow.yaml b/ci/deps/actions-38-slow.yaml
deleted file mode 100644
index afba60e451b90..0000000000000
--- a/ci/deps/actions-38-slow.yaml
+++ /dev/null
@@ -1,38 +0,0 @@
-name: pandas-dev
-channels:
- - conda-forge
-dependencies:
- - python=3.8.*
-
- # tools
- - cython>=0.29.21
- - pytest>=6.0
- - pytest-cov
- - pytest-xdist>=1.21
- - hypothesis>=3.58.0
-
- # pandas dependencies
- - beautifulsoup4
- - fsspec>=0.7.4
- - html5lib
- - lxml
- - matplotlib
- - numexpr
- - numpy
- - openpyxl
- - patsy
- - psycopg2
- - pymysql
- - pytables
- - python-dateutil
- - pytz
- - s3fs>=0.4.0
- - moto>=1.3.14
- - scipy
- - sqlalchemy
- - xlrd>=2.0
- - xlsxwriter
- - xlwt
- - moto
- - flask
- - numba
diff --git a/ci/deps/actions-38.yaml b/ci/deps/actions-38.yaml
index 11daa92046eb4..b23f686d845e9 100644
--- a/ci/deps/actions-38.yaml
+++ b/ci/deps/actions-38.yaml
@@ -1,20 +1,50 @@
name: pandas-dev
channels:
- - defaults
- conda-forge
dependencies:
- - python=3.8.*
+ - python=3.8
- # tools
- - cython>=0.29.21
+ # test dependencies
+ - cython=0.29.24
- pytest>=6.0
- pytest-cov
- - pytest-xdist>=1.21
- - hypothesis>=3.58.0
+ - pytest-xdist>=1.31
+ - hypothesis>=5.5.3
+ - psutil
- # pandas dependencies
- - numpy
+ # required dependencies
- python-dateutil
- - nomkl
+ - numpy
- pytz
- - tabulate==0.8.7
+
+ # optional dependencies
+ - beautifulsoup4
+ - blosc
+ - bottleneck
+ - fastparquet
+ - fsspec
+ - html5lib
+ - gcsfs
+ - jinja2
+ - lxml
+ - matplotlib
+ - numba
+ - numexpr
+ - openpyxl
+ - odfpy
+ - pandas-gbq
+ - psycopg2
+ - pymysql
+ - pytables
+ - pyarrow=3
+ - pyreadstat
+ - pyxlsb
+ - s3fs
+ - scipy
+ - sqlalchemy
+ - tabulate
+ - xarray
+ - xlrd
+ - xlsxwriter
+ - xlwt
+ - zstandard
diff --git a/ci/deps/actions-39.yaml b/ci/deps/actions-39.yaml
index b74f1af8ee0f6..631ef40b02e33 100644
--- a/ci/deps/actions-39.yaml
+++ b/ci/deps/actions-39.yaml
@@ -2,21 +2,49 @@ name: pandas-dev
channels:
- conda-forge
dependencies:
- - python=3.9.*
+ - python=3.9
- # tools
- - cython>=0.29.21
+ # test dependencies
+ - cython=0.29.24
- pytest>=6.0
- pytest-cov
- - pytest-xdist>=1.21
- - hypothesis>=3.58.0
+ - pytest-xdist>=1.31
+ - hypothesis>=5.5.3
+ - psutil
- # pandas dependencies
- - numpy
+ # required dependencies
- python-dateutil
+ - numpy
- pytz
# optional dependencies
+ - beautifulsoup4
+ - blosc
+ - bottleneck
+ - fastparquet
+ - fsspec
+ - html5lib
+ - gcsfs
+ - jinja2
+ - lxml
+ - matplotlib
+ - numba
+ - numexpr
+ - openpyxl
+ - odfpy
+ - pandas-gbq
+ - psycopg2
+ - pymysql
- pytables
+ - pyarrow=5
+ - pyreadstat
+ - pyxlsb
+ - s3fs
- scipy
- - pyarrow=1.0
+ - sqlalchemy
+ - tabulate
+ - xarray
+ - xlrd
+ - xlsxwriter
+ - xlwt
+ - zstandard
diff --git a/ci/deps/actions-pypy-38.yaml b/ci/deps/actions-pypy-38.yaml
new file mode 100644
index 0000000000000..ad05d2ab2dacc
--- /dev/null
+++ b/ci/deps/actions-pypy-38.yaml
@@ -0,0 +1,20 @@
+name: pandas-dev
+channels:
+ - conda-forge
+dependencies:
+ # TODO: Add the rest of the dependencies here
+ # once the other plentiful failures/segfaults
+ # with base pandas have been dealt with
+ - python=3.8[build=*_pypy] # TODO: use this once pypy3.8 is available
+
+ # tools
+ - cython>=0.29.24
+ - pytest>=6.0
+ - pytest-cov
+ - pytest-xdist>=1.31
+ - hypothesis>=5.5.3
+
+ # required
+ - numpy
+ - python-dateutil
+ - pytz
diff --git a/ci/deps/azure-macos-37.yaml b/ci/deps/azure-macos-310.yaml
similarity index 57%
rename from ci/deps/azure-macos-37.yaml
rename to ci/deps/azure-macos-310.yaml
index 43e1055347f17..312fac8091db6 100644
--- a/ci/deps/azure-macos-37.yaml
+++ b/ci/deps/azure-macos-310.yaml
@@ -3,12 +3,13 @@ channels:
- defaults
- conda-forge
dependencies:
- - python=3.7.*
+ - python=3.10
# tools
+ - cython>=0.29.24
- pytest>=6.0
- - pytest-xdist>=1.21
- - hypothesis>=3.58.0
+ - pytest-xdist>=1.31
+ - hypothesis>=5.5.3
- pytest-azurepipelines
# pandas dependencies
@@ -17,21 +18,19 @@ dependencies:
- html5lib
- jinja2
- lxml
- - matplotlib=2.2.3
+ - matplotlib
- nomkl
- numexpr
- - numpy=1.17.3
+ - numpy
- openpyxl
- - pyarrow=0.17
+ - pyarrow
+ - pyreadstat
- pytables
- - python-dateutil==2.7.3
+ - python-dateutil==2.8.1
- pytz
+ - pyxlsb
- xarray
- - xlrd<2.0
+ - xlrd
- xlsxwriter
- xlwt
- - pip
- - pip:
- - cython>=0.29.21
- - pyreadstat
- - pyxlsb
+ - zstandard
diff --git a/ci/deps/azure-macos-38.yaml b/ci/deps/azure-macos-38.yaml
new file mode 100644
index 0000000000000..422aa86c57fc7
--- /dev/null
+++ b/ci/deps/azure-macos-38.yaml
@@ -0,0 +1,36 @@
+name: pandas-dev
+channels:
+ - defaults
+ - conda-forge
+dependencies:
+ - python=3.8
+
+ # tools
+ - cython>=0.29.24
+ - pytest>=6.0
+ - pytest-xdist>=1.31
+ - hypothesis>=5.5.3
+ - pytest-azurepipelines
+
+ # pandas dependencies
+ - beautifulsoup4
+ - bottleneck
+ - html5lib
+ - jinja2
+ - lxml
+ - matplotlib=3.3.2
+ - nomkl
+ - numexpr
+ - numpy=1.18.5
+ - openpyxl
+ - pyarrow=1.0.1
+ - pyreadstat
+ - pytables
+ - python-dateutil==2.8.1
+ - pytz
+ - pyxlsb
+ - xarray
+ - xlrd
+ - xlsxwriter
+ - xlwt
+ - zstandard
diff --git a/ci/deps/azure-macos-39.yaml b/ci/deps/azure-macos-39.yaml
new file mode 100644
index 0000000000000..140d67796452c
--- /dev/null
+++ b/ci/deps/azure-macos-39.yaml
@@ -0,0 +1,36 @@
+name: pandas-dev
+channels:
+ - defaults
+ - conda-forge
+dependencies:
+ - python=3.9
+
+ # tools
+ - cython>=0.29.24
+ - pytest>=6.0
+ - pytest-xdist>=1.31
+ - hypothesis>=5.5.3
+ - pytest-azurepipelines
+
+ # pandas dependencies
+ - beautifulsoup4
+ - bottleneck
+ - html5lib
+ - jinja2
+ - lxml
+ - matplotlib=3.3.2
+ - nomkl
+ - numexpr
+ - numpy=1.21.3
+ - openpyxl
+ - pyarrow=4
+ - pyreadstat
+ - pytables
+ - python-dateutil==2.8.1
+ - pytz
+ - pyxlsb
+ - xarray
+ - xlrd
+ - xlsxwriter
+ - xlwt
+ - zstandard
diff --git a/ci/deps/azure-windows-37.yaml b/ci/deps/azure-windows-310.yaml
similarity index 62%
rename from ci/deps/azure-windows-37.yaml
rename to ci/deps/azure-windows-310.yaml
index 5cbc029f8c03d..8e6f4deef6057 100644
--- a/ci/deps/azure-windows-37.yaml
+++ b/ci/deps/azure-windows-310.yaml
@@ -1,42 +1,41 @@
name: pandas-dev
channels:
- - defaults
- conda-forge
+ - defaults
dependencies:
- - python=3.7.*
+ - python=3.10
# tools
- - cython>=0.29.21
+ - cython>=0.29.24
- pytest>=6.0
- - pytest-xdist>=1.21
- - hypothesis>=3.58.0
+ - pytest-xdist>=1.31
+ - hypothesis>=5.5.3
- pytest-azurepipelines
# pandas dependencies
- beautifulsoup4
- bottleneck
- fsspec>=0.8.0
- - gcsfs>=0.6.0
+ - gcsfs
- html5lib
- jinja2
- lxml
- - matplotlib=2.2.*
- - moto>=1.3.14
- - flask
+ - matplotlib
+ # TODO: uncomment after numba supports py310
+ #- numba
- numexpr
- - numpy=1.17.*
+ - numpy
- openpyxl
- - pyarrow=0.17.0
+ - pyarrow
- pytables
- python-dateutil
- pytz
- s3fs>=0.4.2
- scipy
- sqlalchemy
- - xlrd>=2.0
+ - xlrd
- xlsxwriter
- xlwt
- pyreadstat
- - pip
- - pip:
- - pyxlsb
+ - pyxlsb
+ - zstandard
diff --git a/ci/deps/azure-windows-38.yaml b/ci/deps/azure-windows-38.yaml
index 7fdecae626f9d..eb533524147d9 100644
--- a/ci/deps/azure-windows-38.yaml
+++ b/ci/deps/azure-windows-38.yaml
@@ -3,34 +3,33 @@ channels:
- conda-forge
- defaults
dependencies:
- - python=3.8.*
+ - python=3.8
# tools
- - cython>=0.29.21
+ - cython>=0.29.24
- pytest>=6.0
- - pytest-xdist>=1.21
- - hypothesis>=3.58.0
+ - pytest-xdist>=1.31
+ - hypothesis>=5.5.3
- pytest-azurepipelines
# pandas dependencies
- blosc
- bottleneck
- fastparquet>=0.4.0
- - flask
- fsspec>=0.8.0
- - matplotlib=3.1.3
- - moto>=1.3.14
+ - matplotlib=3.3.2
- numba
- numexpr
- - numpy=1.18.*
+ - numpy=1.18
- openpyxl
- jinja2
- - pyarrow>=0.17.0
+ - pyarrow=2
- pytables
- python-dateutil
- pytz
- s3fs>=0.4.0
- scipy
- - xlrd<2.0
+ - xlrd
- xlsxwriter
- xlwt
+ - zstandard
diff --git a/ci/deps/actions-37-slow.yaml b/ci/deps/azure-windows-39.yaml
similarity index 56%
rename from ci/deps/actions-37-slow.yaml
rename to ci/deps/azure-windows-39.yaml
index 166f2237dcad3..6f820b1c2aedb 100644
--- a/ci/deps/actions-37-slow.yaml
+++ b/ci/deps/azure-windows-39.yaml
@@ -1,39 +1,40 @@
name: pandas-dev
channels:
- - defaults
- conda-forge
+ - defaults
dependencies:
- - python=3.7.*
+ - python=3.9
# tools
- - cython>=0.29.21
+ - cython>=0.29.24
- pytest>=6.0
- - pytest-cov
- - pytest-xdist>=1.21
- - hypothesis>=3.58.0
+ - pytest-xdist>=1.31
+ - hypothesis>=5.5.3
+ - pytest-azurepipelines
# pandas dependencies
- beautifulsoup4
- - fsspec>=0.7.4
+ - bottleneck
+ - fsspec>=0.8.0
+ - gcsfs
- html5lib
+ - jinja2
- lxml
- matplotlib
+ - numba
- numexpr
- numpy
- openpyxl
- - patsy
- - psycopg2
- - pymysql
+ - pyarrow=6
- pytables
- python-dateutil
- pytz
- - s3fs>=0.4.0
- - moto>=1.3.14
+ - s3fs>=0.4.2
- scipy
- sqlalchemy
- - xlrd<2.0
+ - xlrd
- xlsxwriter
- xlwt
- - moto
- - flask
- - numba
+ - pyreadstat
+ - pyxlsb
+ - zstandard
diff --git a/ci/deps/circle-37-arm64.yaml b/ci/deps/circle-38-arm64.yaml
similarity index 64%
rename from ci/deps/circle-37-arm64.yaml
rename to ci/deps/circle-38-arm64.yaml
index 995ebda1f97e7..60608c3ee1a86 100644
--- a/ci/deps/circle-37-arm64.yaml
+++ b/ci/deps/circle-38-arm64.yaml
@@ -2,20 +2,20 @@ name: pandas-dev
channels:
- conda-forge
dependencies:
- - python=3.7.*
+ - python=3.8
# tools
- - cython>=0.29.21
+ - cython>=0.29.24
- pytest>=6.0
- - pytest-xdist>=1.21
- - hypothesis>=3.58.0
+ - pytest-xdist>=1.31
+ - hypothesis>=5.5.3
# pandas dependencies
- botocore>=1.11
+ - flask
+ - moto
- numpy
- python-dateutil
- pytz
+ - zstandard
- pip
- - flask
- - pip:
- - moto
diff --git a/ci/run_tests.sh b/ci/run_tests.sh
index 0d6f26d8c29f8..203f8fe293a06 100755
--- a/ci/run_tests.sh
+++ b/ci/run_tests.sh
@@ -5,12 +5,17 @@
# https://github.com/pytest-dev/pytest/issues/1075
export PYTHONHASHSEED=$(python -c 'import random; print(random.randint(1, 4294967295))')
+# Echo the seed so flaky CI builds can be reproduced by setting it in subsequent runs
+echo PYTHONHASHSEED=$PYTHONHASHSEED
+
if [[ "not network" == *"$PATTERN"* ]]; then
export http_proxy=http://1.2.3.4 https_proxy=http://1.2.3.4;
fi
-if [ "$COVERAGE" ]; then
+if [[ "$COVERAGE" == "true" ]]; then
COVERAGE="-s --cov=pandas --cov-report=xml --cov-append"
+else
+ COVERAGE="" # We need to reset this for COVERAGE="false" case
fi
# If no X server is found, we use xvfb to emulate it
@@ -19,18 +24,19 @@ if [[ $(uname) == "Linux" && -z $DISPLAY ]]; then
XVFB="xvfb-run "
fi
-PYTEST_CMD="${XVFB}pytest -m \"$PATTERN\" -n $PYTEST_WORKERS --dist=loadfile $TEST_ARGS $COVERAGE pandas"
+PYTEST_CMD="${XVFB}pytest -m \"$PATTERN\" -n $PYTEST_WORKERS --dist=loadfile $TEST_ARGS $COVERAGE $PYTEST_TARGET"
if [[ $(uname) != "Linux" && $(uname) != "Darwin" ]]; then
- # GH#37455 windows py38 build appears to be running out of memory
- # skip collection of window tests
- PYTEST_CMD="$PYTEST_CMD --ignore=pandas/tests/window/moments --ignore=pandas/tests/plotting/"
+ PYTEST_CMD="$PYTEST_CMD --ignore=pandas/tests/plotting/"
fi
echo $PYTEST_CMD
sh -c "$PYTEST_CMD"
-PYTEST_AM_CMD="PANDAS_DATA_MANAGER=array pytest -m \"$PATTERN and arraymanager\" -n $PYTEST_WORKERS --dist=loadfile $TEST_ARGS $COVERAGE pandas"
+if [[ "$PANDAS_DATA_MANAGER" != "array" ]]; then
+ # The ArrayManager tests should have already been run by PYTEST_CMD if PANDAS_DATA_MANAGER was already set to array
+ PYTEST_AM_CMD="PANDAS_DATA_MANAGER=array pytest -m \"$PATTERN and arraymanager\" -n $PYTEST_WORKERS --dist=loadfile $TEST_ARGS $COVERAGE pandas"
-echo $PYTEST_AM_CMD
-sh -c "$PYTEST_AM_CMD"
+ echo $PYTEST_AM_CMD
+ sh -c "$PYTEST_AM_CMD"
+fi
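
Printing ``PYTHONHASHSEED`` is useful because Python randomizes string hashing per process, so hash-order-dependent test failures may only appear for particular seeds; re-exporting the logged value reproduces the exact ordering. A small demonstration (the seed values are arbitrary):

.. code-block:: python

    import os
    import subprocess
    import sys

    def hash_in_fresh_interpreter(seed: str) -> str:
        # PYTHONHASHSEED only takes effect at interpreter startup,
        # so spawn a new Python process with it set in the environment
        env = dict(os.environ, PYTHONHASHSEED=seed)
        out = subprocess.run(
            [sys.executable, "-c", "print(hash('pandas'))"],
            env=env, capture_output=True, text=True, check=True,
        )
        return out.stdout.strip()

    # The same seed always yields the same hash, hence the same set/dict ordering
    assert hash_in_fresh_interpreter("42") == hash_in_fresh_interpreter("42")
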
diff --git a/ci/setup_env.sh b/ci/setup_env.sh
index 2e16bc6545161..d51ff98b241a6 100755
--- a/ci/setup_env.sh
+++ b/ci/setup_env.sh
@@ -48,6 +48,7 @@ conda config --set ssl_verify false
conda config --set quiet true --set always_yes true --set changeps1 false
conda install pip conda # create conda to create a historical artifact for pip & setuptools
conda update -n base conda
+conda install -y -c conda-forge mamba
echo "conda info -a"
conda info -a
@@ -62,8 +63,8 @@ conda list
conda remove --all -q -y -n pandas-dev
echo
-echo "conda env create -q --file=${ENV_FILE}"
-time conda env create -q --file="${ENV_FILE}"
+echo "mamba env create -q --file=${ENV_FILE}"
+time mamba env create -q --file="${ENV_FILE}"
if [[ "$BITS32" == "yes" ]]; then
@@ -86,11 +87,6 @@ echo "w/o removing anything else"
conda remove pandas -y --force || true
pip uninstall -y pandas || true
-echo
-echo "remove postgres if has been installed with conda"
-echo "we use the one from the CI"
-conda remove postgresql -y --force || true
-
echo
echo "remove qt"
echo "causes problems with the clipboard, we use xsel for that"
@@ -106,7 +102,8 @@ echo "[Build extensions]"
python setup.py build_ext -q -j2
echo "[Updating pip]"
-python -m pip install --no-deps -U pip wheel setuptools
+# TODO: GH#44980 https://github.com/pypa/setuptools/issues/2941
+python -m pip install --no-deps -U pip wheel "setuptools<60.0.0"
echo "[Install pandas]"
python -m pip install --no-build-isolation -e .
@@ -115,13 +112,4 @@ echo
echo "conda list"
conda list
-# Install DB for Linux
-
-if [[ -n ${SQL:0} ]]; then
- echo "installing dbs"
- mysql -e 'create database pandas_nosetest;'
- psql -c 'create database pandas_nosetest;' -U postgres
-else
- echo "not using dbs on non-linux Travis builds or Azure Pipelines"
-fi
echo "done"
diff --git a/codecov.yml b/codecov.yml
index 893e40db004a6..d893bdbdc9298 100644
--- a/codecov.yml
+++ b/codecov.yml
@@ -1,5 +1,5 @@
codecov:
- branch: master
+ branch: main
notify:
after_n_builds: 10
comment: false
@@ -12,6 +12,7 @@ coverage:
patch:
default:
target: '50'
+ informational: true
github_checks:
annotations: false
diff --git a/doc/source/_static/style/appmaphead1.png b/doc/source/_static/style/appmaphead1.png
new file mode 100644
index 0000000000000..905bcaa63e900
Binary files /dev/null and b/doc/source/_static/style/appmaphead1.png differ
diff --git a/doc/source/_static/style/appmaphead2.png b/doc/source/_static/style/appmaphead2.png
new file mode 100644
index 0000000000000..9adde61908378
Binary files /dev/null and b/doc/source/_static/style/appmaphead2.png differ
diff --git a/doc/source/_static/style/df_pipe.png b/doc/source/_static/style/df_pipe.png
new file mode 100644
index 0000000000000..071a481ad5acc
Binary files /dev/null and b/doc/source/_static/style/df_pipe.png differ
diff --git a/doc/source/_static/style/latex_stocks.png b/doc/source/_static/style/latex_stocks.png
new file mode 100644
index 0000000000000..c8906c33b810b
Binary files /dev/null and b/doc/source/_static/style/latex_stocks.png differ
diff --git a/doc/source/_static/style/latex_stocks_html.png b/doc/source/_static/style/latex_stocks_html.png
new file mode 100644
index 0000000000000..11b30faddf47c
Binary files /dev/null and b/doc/source/_static/style/latex_stocks_html.png differ
diff --git a/doc/source/conf.py b/doc/source/conf.py
index 8df048ce65582..e8cd85e3369f7 100644
--- a/doc/source/conf.py
+++ b/doc/source/conf.py
@@ -225,11 +225,24 @@
# Theme options are theme-specific and customize the look and feel of a theme
# further. For a list of options available for each theme, see the
# documentation.
+
+switcher_version = version
+if ".dev" in version:
+ switcher_version = "dev"
+elif "rc" in version:
+ switcher_version = version.split("rc")[0] + " (rc)"
+
html_theme_options = {
"external_links": [],
"github_url": "https://github.com/pandas-dev/pandas",
"twitter_url": "https://twitter.com/pandas_dev",
"google_analytics_id": "UA-27880019-2",
+ "navbar_end": ["version-switcher", "navbar-icon-links"],
+ "switcher": {
+ "json_url": "https://pandas.pydata.org/versions.json",
+ "url_template": "https://pandas.pydata.org/{version}/",
+ "version_match": switcher_version,
+ },
}
# Add any paths that contain custom themes here, relative to this directory.
@@ -461,7 +474,6 @@
# eg pandas.Series.str and pandas.Series.dt (see GH9322)
import sphinx # isort:skip
-from sphinx.util import rpartition # isort:skip
from sphinx.ext.autodoc import ( # isort:skip
AttributeDocumenter,
Documenter,
@@ -521,8 +533,8 @@ def resolve_name(self, modname, parents, path, base):
# HACK: this is added in comparison to ClassLevelDocumenter
# mod_cls still exists of class.accessor, so an extra
# rpartition is needed
- modname, accessor = rpartition(mod_cls, ".")
- modname, cls = rpartition(modname, ".")
+ modname, _, accessor = mod_cls.rpartition(".")
+ modname, _, cls = modname.rpartition(".")
parents = [cls, accessor]
# if the module name is still missing, get it like above
if not modname:
@@ -652,7 +664,7 @@ def linkcode_resolve(domain, info):
fn = os.path.relpath(fn, start=os.path.dirname(pandas.__file__))
if "+" in pandas.__version__:
- return f"https://github.com/pandas-dev/pandas/blob/master/pandas/{fn}{linespec}"
+ return f"https://github.com/pandas-dev/pandas/blob/main/pandas/{fn}{linespec}"
else:
return (
f"https://github.com/pandas-dev/pandas/blob/"
diff --git a/doc/source/development/code_style.rst b/doc/source/development/code_style.rst
index 77c8d56765e5e..7bbfc010fbfb2 100644
--- a/doc/source/development/code_style.rst
+++ b/doc/source/development/code_style.rst
@@ -28,7 +28,7 @@ Testing
Failing tests
--------------
-See https://docs.pytest.org/en/latest/skipping.html for background.
+See https://docs.pytest.org/en/latest/how-to/skipping.html for background.
Do not use ``pytest.xfail``
---------------------------
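
For contrast with the forbidden ``pytest.xfail`` call (which stops the test at the call site), the ``pytest.mark.xfail`` decorator still executes the test body and records an expected failure. A minimal sketch, with a hypothetical issue number:

.. code-block:: python

    import pytest

    @pytest.mark.xfail(reason="GH#12345: known bug, fix pending")  # hypothetical issue
    def test_known_failure():
        # The body still runs; an unexpected pass is reported as XPASS
        assert 1 == 2
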
diff --git a/doc/source/development/contributing.rst b/doc/source/development/contributing.rst
index f4a09e0daa750..1d745d21dacae 100644
--- a/doc/source/development/contributing.rst
+++ b/doc/source/development/contributing.rst
@@ -59,7 +59,7 @@ will allow others to reproduce the bug and provide insight into fixing. See
`this blogpost `_
for tips on writing a good bug report.
-Trying the bug-producing code out on the *master* branch is often a worthwhile exercise
+Trying the bug-producing code out on the *main* branch is often a worthwhile exercise
to confirm the bug still exists. It is also worth searching existing bug reports and pull requests
to see if the issue has already been reported and/or fixed.
@@ -143,7 +143,7 @@ as the version number cannot be computed anymore.
Creating a branch
-----------------
-You want your master branch to reflect only production-ready code, so create a
+You want your main branch to reflect only production-ready code, so create a
feature branch for making your changes. For example::
git branch shiny-new-feature
@@ -158,14 +158,14 @@ changes in this branch specific to one bug or feature so it is clear
what the branch brings to pandas. You can have many shiny-new-features
and switch in between them using the git checkout command.
-When creating this branch, make sure your master branch is up to date with
-the latest upstream master version. To update your local master branch, you
+When creating this branch, make sure your main branch is up to date with
+the latest upstream main version. To update your local main branch, you
can do::
- git checkout master
- git pull upstream master --ff-only
+ git checkout main
+ git pull upstream main --ff-only
-When you want to update the feature branch with changes in master after
+When you want to update the feature branch with changes in main after
you created the branch, check the section on
:ref:`updating a PR `.
@@ -256,7 +256,7 @@ double check your branch changes against the branch it was based on:
#. Navigate to your repository on GitHub -- https://github.com/your-user-name/pandas
#. Click on ``Branches``
#. Click on the ``Compare`` button for your feature branch
-#. Select the ``base`` and ``compare`` branches, if necessary. This will be ``master`` and
+#. Select the ``base`` and ``compare`` branches, if necessary. This will be ``main`` and
``shiny-new-feature``, respectively.
Finally, make the pull request
@@ -264,8 +264,8 @@ Finally, make the pull request
If everything looks good, you are ready to make a pull request. A pull request is how
code from a local repository becomes available to the GitHub community and can be looked
-at and eventually merged into the master version. This pull request and its associated
-changes will eventually be committed to the master branch and available in the next
+at and eventually merged into the main version. This pull request and its associated
+changes will eventually be committed to the main branch and available in the next
release. To submit a pull request:
#. Navigate to your repository on GitHub
@@ -294,14 +294,14 @@ This will automatically update your pull request with the latest code and restar
:any:`Continuous Integration ` tests.
Another reason you might need to update your pull request is to solve conflicts
-with changes that have been merged into the master branch since you opened your
+with changes that have been merged into the main branch since you opened your
pull request.
-To do this, you need to "merge upstream master" in your branch::
+To do this, you need to "merge upstream main" in your branch::
git checkout shiny-new-feature
git fetch upstream
- git merge upstream/master
+ git merge upstream/main
If there are no conflicts (or they could be fixed automatically), a file with a
default commit message will open, and you can simply save and quit this file.
@@ -313,7 +313,7 @@ Once the conflicts are merged and the files where the conflicts were solved are
added, you can run ``git commit`` to save those fixes.
If you have uncommitted changes at the moment you want to update the branch with
-master, you will need to ``stash`` them prior to updating (see the
+main, you will need to ``stash`` them prior to updating (see the
`stash docs `__).
This will effectively store your changes and they can be reapplied after updating.
@@ -331,18 +331,23 @@ can comment::
@github-actions pre-commit
-on that pull request. This will trigger a workflow which will autofix formatting errors.
+on that pull request. This will trigger a workflow which will autofix formatting
+errors.
+
+To automatically fix formatting errors on each commit you make, you can
+set up pre-commit yourself. First, create a Python :ref:`environment
+` and then set up :ref:`pre-commit `.
Delete your merged branch (optional)
------------------------------------
Once your feature branch is accepted into upstream, you'll probably want to get rid of
-the branch. First, merge upstream master into your branch so git knows it is safe to
+the branch. First, merge upstream main into your branch so git knows it is safe to
delete your branch::
git fetch upstream
- git checkout master
- git merge upstream/master
+ git checkout main
+ git merge upstream/main
Then you can do::
diff --git a/doc/source/development/contributing_codebase.rst b/doc/source/development/contributing_codebase.rst
index e812aaa760a8f..4826921d4866b 100644
--- a/doc/source/development/contributing_codebase.rst
+++ b/doc/source/development/contributing_codebase.rst
@@ -23,11 +23,10 @@ contributing them to the project::
./ci/code_checks.sh
-The script verifies the linting of code files, it looks for common mistake patterns
-(like missing spaces around sphinx directives that make the documentation not
-being rendered properly) and it also validates the doctests. It is possible to
-run the checks independently by using the parameters ``lint``, ``patterns`` and
-``doctests`` (e.g. ``./ci/code_checks.sh lint``).
+The script validates the doctests, formatting in docstrings, static typing, and
+imported modules. It is possible to run the checks independently by using the
+parameters ``docstrings``, ``code``, ``typing``, and ``doctests``
+(e.g. ``./ci/code_checks.sh doctests``).
In addition, because a lot of people use our library, it is important that we
do not make sudden changes to the code that could have the potential to break
@@ -70,9 +69,9 @@ to run its checks with::
without needing to have done ``pre-commit install`` beforehand.
-If you want to run checks on all recently committed files on upstream/master you can use::
+If you want to run checks on all recently committed files on upstream/main you can use::
- pre-commit run --from-ref=upstream/master --to-ref=HEAD --all-files
+ pre-commit run --from-ref=upstream/main --to-ref=HEAD --all-files
without needing to have done ``pre-commit install`` beforehand.
@@ -156,7 +155,7 @@ Python (PEP8 / black)
pandas follows the `PEP8 `_ standard
and uses `Black `_ and
-`Flake8 `_ to ensure a consistent code
+`Flake8 `_ to ensure a consistent code
format throughout the project. We encourage you to use :ref:`pre-commit `.
:ref:`Continuous Integration ` will run those tools and
@@ -164,7 +163,7 @@ report any stylistic errors in your code. Therefore, it is helpful before
submitting code to run the check yourself::
black pandas
- git diff upstream/master -u -- "*.py" | flake8 --diff
+ git diff upstream/main -u -- "*.py" | flake8 --diff
to auto-format your code. Additionally, many editors have plugins that will
apply ``black`` as you edit files.
@@ -172,7 +171,7 @@ apply ``black`` as you edit files.
You should use a ``black`` version 21.5b2 as previous versions are not compatible
with the pandas codebase.
-One caveat about ``git diff upstream/master -u -- "*.py" | flake8 --diff``: this
+One caveat about ``git diff upstream/main -u -- "*.py" | flake8 --diff``: this
command will catch any stylistic errors in your changes specifically, but
be aware it may not catch all of them. For example, if you delete the only
usage of an imported function, it is stylistically incorrect to import an
@@ -180,18 +179,18 @@ unused function. However, style-checking the diff will not catch this because
the actual import is not part of the diff. Thus, for completeness, you should
run this command, though it may take longer::
- git diff upstream/master --name-only -- "*.py" | xargs -r flake8
+ git diff upstream/main --name-only -- "*.py" | xargs -r flake8
-Note that on OSX, the ``-r`` flag is not available, so you have to omit it and
+Note that on macOS, the ``-r`` flag is not available, so you have to omit it and
run this slightly modified command::
- git diff upstream/master --name-only -- "*.py" | xargs flake8
+ git diff upstream/main --name-only -- "*.py" | xargs flake8
Windows does not support the ``xargs`` command (unless installed for example
via the `MinGW `__ toolchain), but one can imitate the
behaviour as follows::
- for /f %i in ('git diff upstream/master --name-only -- "*.py"') do flake8 %i
+ for /f %i in ('git diff upstream/main --name-only -- "*.py"') do flake8 %i
This will get all the files being changed by the PR (and ending with ``.py``),
and run ``flake8`` on them, one after the other.
@@ -205,7 +204,7 @@ Import formatting
pandas uses `isort `__ to standardise import
formatting across the codebase.
-A guide to import layout as per pep8 can be found `here `__.
+A guide to import layout as per pep8 can be found `here `__.
A summary of our current import sections (in order):
@@ -243,9 +242,9 @@ to automatically format imports correctly. This will modify your local copy of t
Alternatively, you can run a command similar to what was suggested for ``black`` and ``flake8`` :ref:`right above `::
- git diff upstream/master --name-only -- "*.py" | xargs -r isort
+ git diff upstream/main --name-only -- "*.py" | xargs -r isort
-Where similar caveats apply if you are on OSX or Windows.
+Where similar caveats apply if you are on macOS or Windows.
You can then verify the changes look ok, then git :any:`commit ` and :any:`push `.
@@ -304,7 +303,7 @@ pandas strongly encourages the use of :pep:`484` style type hints. New developme
Style guidelines
~~~~~~~~~~~~~~~~
-Types imports should follow the ``from typing import ...`` convention. So rather than
+Type imports should follow the ``from typing import ...`` convention. Since :pep:`585`, some builtin constructs, such as ``list`` and ``tuple``, can be used directly for type annotations and no longer need to be imported. So rather than
.. code-block:: python
@@ -316,21 +315,31 @@ You should write
.. code-block:: python
- from typing import List, Optional, Union
+ primes: list[int] = []
- primes: List[int] = []
+``Optional`` should be avoided in favor of the shorter ``| None``, so instead of
-``Optional`` should be used where applicable, so instead of
+.. code-block:: python
+
+ from typing import Union
+
+ maybe_primes: list[Union[int, None]] = []
+
+or
.. code-block:: python
- maybe_primes: List[Union[int, None]] = []
+ from typing import Optional
+
+ maybe_primes: list[Optional[int]] = []
You should write
.. code-block:: python
- maybe_primes: List[Optional[int]] = []
+ from __future__ import annotations # noqa: F404
+
+ maybe_primes: list[int | None] = []
In some cases in the code base classes may define class variables that shadow builtins. This causes an issue as described in `Mypy 1775 `_. The defensive solution here is to create an unambiguous alias of the builtin and use that without your annotation. For example, if you come across a definition like
@@ -380,7 +389,7 @@ With custom types and inference this is not always possible so exceptions are ma
pandas-specific types
~~~~~~~~~~~~~~~~~~~~~
-Commonly used types specific to pandas will appear in `pandas._typing `_ and you should use these where applicable. This module is private for now but ultimately this should be exposed to third party libraries who want to implement type checking against pandas.
+Commonly used types specific to pandas will appear in `pandas._typing `_ and you should use these where applicable. This module is private for now but ultimately this should be exposed to third party libraries who want to implement type checking against pandas.
For example, quite a few functions in pandas accept a ``dtype`` argument. This can be expressed as a string like ``"object"``, a ``numpy.dtype`` like ``np.int64`` or even a pandas ``ExtensionDtype`` like ``pd.CategoricalDtype``. Rather than burden the user with having to constantly annotate all of those options, this can simply be imported and reused from the pandas._typing module
@@ -396,14 +405,41 @@ This module will ultimately house types for repeatedly used concepts like "path-
Validating type hints
~~~~~~~~~~~~~~~~~~~~~
-pandas uses `mypy `_ to statically analyze the code base and type hints. After making any change you can ensure your type hints are correct by running
+pandas uses `mypy `_ and `pyright `_ to statically analyze the code base and type hints. After making any change you can ensure your type hints are correct by running
.. code-block:: shell
- mypy pandas
+ mypy
+
+ # let pre-commit setup and run pyright
+ pre-commit run --hook-stage manual --all-files pyright
+ # or if pyright is installed (requires node.js)
+ pyright
+
+A recent version of ``numpy`` (>=1.21.0) is required for type validation.
.. _contributing.ci:
+Testing type hints in code using pandas
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. warning::
+
+ * pandas is not yet a py.typed library (:pep:`561`)!
+ The primary purpose of locally declaring pandas as a py.typed library is to test and
+ improve the pandas-builtin type annotations.
+
+Until pandas becomes a py.typed library, it is possible to easily experiment with the type
+annotations shipped with pandas by creating an empty file named "py.typed" in the pandas
+installation folder:
+
+.. code-block:: none
+
+ python -c "import pandas; import pathlib; (pathlib.Path(pandas.__path__[0]) / 'py.typed').touch()"
+
+The existence of the py.typed file signals to type checkers that pandas is already a py.typed
+library. This makes type checkers aware of the type annotations shipped with pandas.
+
Testing with continuous integration
-----------------------------------
@@ -413,7 +449,7 @@ continuous integration services, once your pull request is submitted.
However, if you wish to run the test suite on a branch prior to submitting the pull request,
then the continuous integration services need to be hooked to your GitHub repository. Instructions are here
for `GitHub Actions `__ and
-`Azure Pipelines `__.
+`Azure Pipelines `__.
A pull-request will be considered for merging when you have an all 'green' build. If any tests are failing,
then you will get a red 'X', where you can click through to see the individual failed tests.
@@ -454,8 +490,7 @@ Writing tests
All tests should go into the ``tests`` subdirectory of the specific package.
This folder contains many current examples of tests, and we suggest looking to these for
inspiration. If your test requires working with files or
-network connectivity, there is more information on the `testing page
-`_ of the wiki.
+network connectivity, there is more information on the :wiki:`Testing` page of the wiki.
The ``pandas._testing`` module has many special ``assert`` functions that
make it easier to make statements about whether Series or DataFrame objects are
@@ -741,10 +776,10 @@ Running the performance test suite
Performance matters and it is worth considering whether your code has introduced
performance regressions. pandas is in the process of migrating to
-`asv benchmarks `__
+`asv benchmarks `__
to enable easy monitoring of the performance of critical pandas operations.
These benchmarks are all found in the ``pandas/asv_bench`` directory, and the
-test results can be found `here `__.
+test results can be found `here `__.
To use all features of asv, you will need either ``conda`` or
``virtualenv``. For more details please check the `asv installation
@@ -752,18 +787,18 @@ webpage `_.
To install asv::
- pip install git+https://github.com/spacetelescope/asv
+ pip install git+https://github.com/airspeed-velocity/asv
If you need to run a benchmark, change your directory to ``asv_bench/`` and run::
- asv continuous -f 1.1 upstream/master HEAD
+ asv continuous -f 1.1 upstream/main HEAD
You can replace ``HEAD`` with the name of the branch you are working on,
and report benchmarks that changed by more than 10%.
The command uses ``conda`` by default for creating the benchmark
environments. If you want to use virtualenv instead, write::
- asv continuous -f 1.1 -E virtualenv upstream/master HEAD
+ asv continuous -f 1.1 -E virtualenv upstream/main HEAD
The ``-E virtualenv`` option should be added to all ``asv`` commands
that run benchmarks. The default value is defined in ``asv.conf.json``.
@@ -775,12 +810,12 @@ do not cause unexpected performance regressions. You can run specific benchmark
using the ``-b`` flag, which takes a regular expression. For example, this will
only run benchmarks from a ``pandas/asv_bench/benchmarks/groupby.py`` file::
- asv continuous -f 1.1 upstream/master HEAD -b ^groupby
+ asv continuous -f 1.1 upstream/main HEAD -b ^groupby
If you want to only run a specific group of benchmarks from a file, you can do it
using ``.`` as a separator. For example::
- asv continuous -f 1.1 upstream/master HEAD -b groupby.GroupByMethods
+ asv continuous -f 1.1 upstream/main HEAD -b groupby.GroupByMethods
will only run the ``GroupByMethods`` benchmark defined in ``groupby.py``.
@@ -812,7 +847,21 @@ Changes should be reflected in the release notes located in ``doc/source/whatsne
This file contains an ongoing change log for each release. Add an entry to this file to
document your fix, enhancement or (unavoidable) breaking change. Make sure to include the
GitHub issue number when adding your entry (using ``:issue:`1234``` where ``1234`` is the
-issue/pull request number).
+issue/pull request number). Your entry should be written using full sentences and proper
+grammar.
+
+When mentioning parts of the API, use a Sphinx ``:func:``, ``:meth:``, or ``:class:``
+directive as appropriate. Not all public API functions and methods have a
+documentation page; ideally links would only be added if they resolve. You can
+usually find similar examples by checking the release notes for one of the previous
+versions.
+
+If your code is a bugfix, add your entry to the relevant bugfix section. Avoid
+adding to the ``Other`` section; only in rare cases should entries go there.
+The description of the bug should be as concise as possible and include how the
+user may encounter it, together with an indication of the bug itself, e.g.
+"produces incorrect results" or "incorrectly raises". It may also be necessary
+to indicate the new behavior.
If your code is an enhancement, it is most likely necessary to add usage
examples to the existing documentation. This can be done following the section
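
For the pandas-specific types discussed above, a short sketch of reusing an alias from ``pandas._typing`` in a signature (``Dtype`` is the alias named in that section; the helper function itself is hypothetical):

.. code-block:: python

    from __future__ import annotations  # noqa: F404

    import pandas as pd
    from pandas._typing import Dtype  # private module, as noted above

    def coerce(values: list[int], dtype: Dtype) -> pd.Series:
        # Dtype accepts a string like "int64", a numpy dtype, or an ExtensionDtype
        return pd.Series(values, dtype=dtype)

    s = coerce([1, 2, 3], "int64")
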
diff --git a/doc/source/development/contributing_docstring.rst b/doc/source/development/contributing_docstring.rst
index 623d1e8d45565..a87d8d5ad44bf 100644
--- a/doc/source/development/contributing_docstring.rst
+++ b/doc/source/development/contributing_docstring.rst
@@ -68,7 +68,7 @@ explained in this document:
* `numpydoc docstring guide `_
(which is based in the original `Guide to NumPy/SciPy documentation
- `_)
+ `_)
numpydoc is a Sphinx extension to support the NumPy docstring convention.
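
As a quick reference, a minimal docstring following the numpydoc convention (the function is a toy example):

.. code-block:: python

    def add(num1: int, num2: int) -> int:
        """
        Add two numbers.

        Parameters
        ----------
        num1 : int
            First number to add.
        num2 : int
            Second number to add.

        Returns
        -------
        int
            The sum of ``num1`` and ``num2``.

        Examples
        --------
        >>> add(2, 3)
        5
        """
        return num1 + num2
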
diff --git a/doc/source/development/contributing_documentation.rst b/doc/source/development/contributing_documentation.rst
index a4a4f781d9dad..39bc582511148 100644
--- a/doc/source/development/contributing_documentation.rst
+++ b/doc/source/development/contributing_documentation.rst
@@ -202,10 +202,10 @@ And you'll have the satisfaction of seeing your new and improved documentation!
.. _contributing.dev_docs:
-Building master branch documentation
+Building main branch documentation
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-When pull requests are merged into the pandas ``master`` branch, the main parts of
+When pull requests are merged into the pandas ``main`` branch, the main parts of
the documentation are also built by Travis-CI. These docs are then hosted `here
`__, see also
the :any:`Continuous Integration ` section.
diff --git a/doc/source/development/contributing_environment.rst b/doc/source/development/contributing_environment.rst
index bc0a3556b9ac1..5f36a2a609c9f 100644
--- a/doc/source/development/contributing_environment.rst
+++ b/doc/source/development/contributing_environment.rst
@@ -47,7 +47,7 @@ Enable Docker support and use the Services tool window to build and manage image
run and interact with containers.
See https://www.jetbrains.com/help/pycharm/docker.html for details.
-Note that you might need to rebuild the C extensions if/when you merge with upstream/master using::
+Note that you might need to rebuild the C extensions if/when you merge with upstream/main using::
python setup.py build_ext -j 4
@@ -72,7 +72,7 @@ These packages will automatically be installed by using the ``pandas``
**Windows**
-You will need `Build Tools for Visual Studio 2017
+You will need `Build Tools for Visual Studio 2019
`_.
.. warning::
@@ -82,7 +82,7 @@ You will need `Build Tools for Visual Studio 2017
In the installer, select the "C++ build tools" workload.
You can install the necessary components on the commandline using
-`vs_buildtools.exe `_:
+`vs_buildtools.exe `_:
.. code::
@@ -133,14 +133,13 @@ compiler installation instructions.
Let us know if you have any difficulties by opening an issue or reaching out on `Gitter `_.
-
Creating a Python environment
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Now create an isolated pandas development environment:
-* Install either `Anaconda `_, `miniconda
- `_, or `miniforge `_
+* Install either `Anaconda `_, `miniconda
+ `_, or `miniforge `_
* Make sure your conda is up to date (``conda update conda``)
* Make sure that you have :any:`cloned the repository `
* ``cd`` to the pandas source directory
@@ -166,7 +165,7 @@ We'll now kick off a three-step process:
At this point you should be able to import pandas from your locally built version::
- $ python # start an interpreter
+ $ python
>>> import pandas
>>> print(pandas.__version__)
0.22.0.dev0+29.g4ad6d4d74
@@ -182,18 +181,15 @@ To return to your root environment::
conda deactivate
-See the full conda docs `here `__.
+See the full conda docs `here `__.
Creating a Python environment (pip)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
If you aren't using conda for your development environment, follow these instructions.
-You'll need to have at least the :ref:`minimum Python version ` that pandas supports. If your Python version
-is 3.8.0 (or later), you might need to update your ``setuptools`` to version 42.0.0 (or later)
-in your development environment before installing the build dependencies::
-
- pip install --upgrade setuptools
+You'll need to have at least the :ref:`minimum Python version ` that pandas supports.
+You also need to have ``setuptools`` 51.0.0 or later to build pandas.
**Unix**/**macOS with virtualenv**
@@ -242,7 +238,7 @@ Consult the docs for setting up pyenv `here `__.
Below is a brief overview on how to set-up a virtual environment with Powershell
under Windows. For details please refer to the
-`official virtualenv user guide `__
+`official virtualenv user guide `__
Use an ENV_DIR of your choice. We'll use ~\\virtualenvs\\pandas-dev where
'~' is the folder pointed to by either $env:USERPROFILE (Powershell) or
diff --git a/doc/source/development/debugging_extensions.rst b/doc/source/development/debugging_extensions.rst
index 894277d304020..7ba2091e18853 100644
--- a/doc/source/development/debugging_extensions.rst
+++ b/doc/source/development/debugging_extensions.rst
@@ -80,7 +80,7 @@ Once the process launches, simply type ``run`` and the test suite will begin, st
Checking memory leaks with valgrind
===================================
-You can use `Valgrind `_ to check for and log memory leaks in extensions. For instance, to check for a memory leak in a test from the suite you can run:
+You can use `Valgrind `_ to check for and log memory leaks in extensions. For instance, to check for a memory leak in a test from the suite you can run:
.. code-block:: sh
diff --git a/doc/source/development/developer.rst b/doc/source/development/developer.rst
index d701208792a4c..6de237b70f08d 100644
--- a/doc/source/development/developer.rst
+++ b/doc/source/development/developer.rst
@@ -180,7 +180,7 @@ As an example of fully-formed metadata:
'numpy_type': 'int64',
'metadata': None}
],
- 'pandas_version': '0.20.0',
+ 'pandas_version': '1.4.0',
'creator': {
'library': 'pyarrow',
'version': '0.13.0'
diff --git a/doc/source/development/extending.rst b/doc/source/development/extending.rst
index d5b45f5953453..5347aab2c731a 100644
--- a/doc/source/development/extending.rst
+++ b/doc/source/development/extending.rst
@@ -50,7 +50,7 @@ decorate a class, providing the name of attribute to add. The class's
Now users can access your methods using the ``geo`` namespace:
- >>> ds = pd.Dataframe(
+ >>> ds = pd.DataFrame(
... {"longitude": np.linspace(0, 10), "latitude": np.linspace(0, 20)}
... )
>>> ds.geo.center
@@ -106,7 +106,7 @@ extension array for IP Address data, this might be ``ipaddress.IPv4Address``.
See the `extension dtype source`_ for interface definition.
-:class:`pandas.api.extension.ExtensionDtype` can be registered to pandas to allow creation via a string dtype name.
+:class:`pandas.api.extensions.ExtensionDtype` can be registered to pandas to allow creation via a string dtype name.
This allows one to instantiate ``Series`` and ``.astype()`` with a registered string name, for
example ``'category'`` is a registered string accessor for the ``CategoricalDtype``.
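
A minimal sketch of such a registration, assuming a hypothetical ``IPv4Dtype``/``IPv4Array``
pair (only ``register_extension_dtype`` and the ``ExtensionDtype`` attributes shown here are
actual pandas API):

.. code-block:: python

    from pandas.api.extensions import ExtensionDtype, register_extension_dtype

    @register_extension_dtype
    class IPv4Dtype(ExtensionDtype):
        name = "ipv4"  # enables Series(..., dtype="ipv4") and .astype("ipv4")
        type = object  # scalar type of the elements

        @classmethod
        def construct_array_type(cls):
            return IPv4Array  # the matching (hypothetical) ExtensionArray subclass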
@@ -125,7 +125,7 @@ data. We do require that your array be convertible to a NumPy array, even if
this is relatively expensive (as it is for ``Categorical``).
They may be backed by none, one, or many NumPy arrays. For example,
-``pandas.Categorical`` is an extension array backed by two arrays,
+:class:`pandas.Categorical` is an extension array backed by two arrays,
one for codes and one for categories. An array of IPv6 addresses may
be backed by a NumPy structured array with two fields, one for the
lower 64 bits and one for the upper 64 bits. Or they may be backed
@@ -231,7 +231,7 @@ Testing extension arrays
We provide a test suite for ensuring that your extension arrays satisfy the expected
behavior. To use the test suite, you must provide several pytest fixtures and inherit
from the base test class. The required fixtures are found in
-https://github.com/pandas-dev/pandas/blob/master/pandas/tests/extension/conftest.py.
+https://github.com/pandas-dev/pandas/blob/main/pandas/tests/extension/conftest.py.
To use a test, subclass it:
@@ -244,7 +244,7 @@ To use a test, subclass it:
pass
-See https://github.com/pandas-dev/pandas/blob/master/pandas/tests/extension/base/__init__.py
+See https://github.com/pandas-dev/pandas/blob/main/pandas/tests/extension/base/__init__.py
for a list of all the tests available.
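
For example, a conformance test module might look like this (a sketch;
``BaseDtypeTests`` is one of the base classes listed there, and the required fixtures live
in an accompanying ``conftest.py``):

.. code-block:: python

    from pandas.tests.extension import base

    class TestMyDtype(base.BaseDtypeTests):
        # the ``dtype``, ``data``, ... fixtures from the conftest.py linked
        # above must be defined alongside this class
        pass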
.. _extending.extension.arrow:
@@ -290,9 +290,9 @@ See more in the `Arrow documentation `__
+Libraries implementing the plotting backend should use `entry points `__
to make their backend discoverable to pandas. The key is ``"pandas_plotting_backends"``. For example, pandas
registers the default "matplotlib" backend as follows.
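
(The pandas-internal registration is elided by this hunk; a third-party backend would
declare a comparable entry point in its packaging metadata. A hypothetical ``setup.py``
sketch, with the package and module names invented for illustration:)

.. code-block:: python

    from setuptools import setup

    setup(
        name="pandas-mybackend",  # hypothetical backend package
        version="0.1",
        packages=["pandas_mybackend"],
        entry_points={
            "pandas_plotting_backends": [
                # pd.options.plotting.backend = "mybackend" imports this module
                "mybackend = pandas_mybackend",
            ],
        },
    )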
@@ -486,4 +486,4 @@ registers the default "matplotlib" backend as follows.
More information on how to implement a third-party plotting backend can be found at
-https://github.com/pandas-dev/pandas/blob/master/pandas/plotting/__init__.py#L1.
+https://github.com/pandas-dev/pandas/blob/main/pandas/plotting/__init__.py#L1.
diff --git a/doc/source/development/maintaining.rst b/doc/source/development/maintaining.rst
index a0e9ba53acd00..a8521039c5427 100644
--- a/doc/source/development/maintaining.rst
+++ b/doc/source/development/maintaining.rst
@@ -237,4 +237,4 @@ a milestone before tagging, you can request the bot to backport it with:
.. _governance documents: https://github.com/pandas-dev/pandas-governance
-.. _list of permissions: https://help.github.com/en/github/setting-up-and-managing-organizations-and-teams/repository-permission-levels-for-an-organization
+.. _list of permissions: https://docs.github.com/en/organizations/managing-access-to-your-organizations-repositories/repository-roles-for-an-organization
diff --git a/doc/source/development/roadmap.rst b/doc/source/development/roadmap.rst
index 37e45bf5a42b5..ccdb4f1fafae4 100644
--- a/doc/source/development/roadmap.rst
+++ b/doc/source/development/roadmap.rst
@@ -74,8 +74,7 @@ types. This includes consistent behavior in all operations (indexing, arithmetic
operations, comparisons, etc.). There has been discussion of eventually making
the new semantics the default.
-This has been discussed at
-`github #28095 `__ (and
+This has been discussed at :issue:`28095` (and
linked issues), and described in more detail in this
`design doc `__.
@@ -129,8 +128,7 @@ We propose that it should only work with positional indexing, and the translatio
to positions should be entirely done at a higher level.
Indexing is a complicated API with many subtleties. This refactor will require care
-and attention. More details are discussed at
-https://github.com/pandas-dev/pandas/wiki/(Tentative)-rules-for-restructuring-indexing-code
+and attention. More details are discussed at :wiki:`(Tentative)-rules-for-restructuring-indexing-code`.
Numba-accelerated operations
----------------------------
@@ -205,4 +203,4 @@ We improved the pandas documentation
* :ref:`getting_started` contains a number of resources intended for new
pandas users coming from a variety of backgrounds (:issue:`26831`).
-.. _pydata-sphinx-theme: https://github.com/pandas-dev/pydata-sphinx-theme
+.. _pydata-sphinx-theme: https://github.com/pydata/pydata-sphinx-theme
diff --git a/doc/source/ecosystem.rst b/doc/source/ecosystem.rst
index ee061e7b7d3e6..16cae9bbfbf46 100644
--- a/doc/source/ecosystem.rst
+++ b/doc/source/ecosystem.rst
@@ -19,7 +19,7 @@ development to remain focused around it's original requirements.
This is an inexhaustive list of projects that build on pandas in order to provide
tools in the PyData space. For a list of projects that depend on pandas,
see the
-`libraries.io usage page for pandas `_
+`GitHub network dependents for pandas `_
or `search pypi for pandas `_.
We'd like to make it easier for users to find these projects, if you know of other
@@ -30,16 +30,18 @@ substantial projects that you feel should be on this list, please let us know.
Data cleaning and validation
----------------------------
-`Pyjanitor `__
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+`Pyjanitor `__
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Pyjanitor provides a clean API for cleaning data, using method chaining.
-`Engarde `__
+`Pandera `__
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-Engarde is a lightweight library used to explicitly state assumptions about your datasets
-and check that they're *actually* true.
+Pandera provides a flexible and expressive API for performing data validation on dataframes
+to make data processing pipelines more readable and robust.
+Dataframes contain information that pandera explicitly validates at runtime. This is useful in
+production-critical data pipelines or reproducible research settings.
`pandas-path `__
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -69,19 +71,19 @@ a long-standing special relationship with pandas. Statsmodels provides powerful
econometrics, analysis and modeling functionality that is out of pandas' scope.
Statsmodels leverages pandas objects as the underlying data container for computation.
-`sklearn-pandas `__
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+`sklearn-pandas `__
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Use pandas DataFrames in your `scikit-learn `__
ML pipeline.
`Featuretools `__
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Featuretools is a Python library for automated feature engineering built on top of pandas. It excels at transforming temporal and relational datasets into feature matrices for machine learning using reusable feature engineering "primitives". Users can contribute their own primitives in Python and share them with the rest of the community.
`Compose `__
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Compose is a machine learning tool for labeling data and prediction engineering. It allows you to structure the labeling process by parameterizing prediction problems and transforming time-driven relational data into target values with cutoff times that can be used for supervised learning.
@@ -113,8 +115,8 @@ simplicity produces beautiful and effective visualizations with a
minimal amount of code. Altair works with pandas DataFrames.
-`Bokeh `__
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+`Bokeh `__
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Bokeh is a Python interactive visualization library for large datasets that natively uses
the latest web technologies. Its goal is to provide elegant, concise construction of novel
@@ -145,7 +147,7 @@ estimation while plotting, aggregating across observations and visualizing the
fit of statistical models to emphasize patterns in a dataset.
`plotnine `__
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Hadley Wickham's `ggplot2 `__ is a foundational exploratory visualization package for the R language.
Based on `"The Grammar of Graphics" `__ it
@@ -159,10 +161,10 @@ A good implementation for Python users is `has2k1/plotnine `__ leverages `Vega
`__ to create plots within Jupyter Notebook.
-`Plotly `__
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+`Plotly `__
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-`Plotly’s `__ `Python API `__ enables interactive figures and web shareability. Maps, 2D, 3D, and live-streaming graphs are rendered with WebGL and `D3.js `__. The library supports plotting directly from a pandas DataFrame and cloud-based collaboration. Users of `matplotlib, ggplot for Python, and Seaborn `__ can convert figures into interactive web-based plots. Plots can be drawn in `IPython Notebooks `__ , edited with R or MATLAB, modified in a GUI, or embedded in apps and dashboards. Plotly is free for unlimited sharing, and has `cloud `__, `offline `__, or `on-premise `__ accounts for private use.
+`Plotly’s `__ `Python API `__ enables interactive figures and web shareability. Maps, 2D, 3D, and live-streaming graphs are rendered with WebGL and `D3.js `__. The library supports plotting directly from a pandas DataFrame and cloud-based collaboration. Users of `matplotlib, ggplot for Python, and Seaborn `__ can convert figures into interactive web-based plots. Plots can be drawn in `IPython Notebooks `__, edited with R or MATLAB, modified in a GUI, or embedded in apps and dashboards. Plotly is free for unlimited sharing, and has `offline `__ or `on-premise `__ accounts for private use.
`Lux `__
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -177,7 +179,7 @@ A good implementation for Python users is `has2k1/plotnine `__ that highlights interesting trends and patterns in the dataframe. Users can leverage any existing pandas commands without modifying their code, while being able to visualize their pandas data structures (e.g., DataFrame, Series, Index) at the same time. Lux also offers a `powerful, intuitive language `__ that allow users to create `Altair `__, `matplotlib `__, or `Vega-Lite `__ visualizations without having to think at the level of code.
+By printing out a dataframe, Lux automatically `recommends a set of visualizations `__ that highlights interesting trends and patterns in the dataframe. Users can leverage any existing pandas commands without modifying their code, while being able to visualize their pandas data structures (e.g., DataFrame, Series, Index) at the same time. Lux also offers a `powerful, intuitive language `__ that allows users to create `Altair `__, `matplotlib `__, or `Vega-Lite `__ visualizations without having to think at the level of code.
`Qtpandas `__
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -202,8 +204,7 @@ invoked with the following command
dtale.show(df)
D-Tale integrates seamlessly with Jupyter notebooks, Python terminals, Kaggle
-& Google Colab. Here are some demos of the `grid `__
-and `chart-builder `__.
+& Google Colab. Here is a demo of the `grid `__.
`hvplot `__
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -218,7 +219,7 @@ It can be loaded as a native pandas plotting backend via
.. _ecosystem.ide:
IDE
-------
+---
`IPython `__
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -262,7 +263,7 @@ debugging and profiling functionality of a software development tool with the
data exploration, interactive execution, deep inspection and rich visualization
capabilities of a scientific environment like MATLAB or Rstudio.
-Its `Variable Explorer `__
+Its `Variable Explorer `__
allows users to view, manipulate and edit pandas ``Index``, ``Series``,
and ``DataFrame`` objects like a "spreadsheet", including copying and modifying
values, sorting, displaying a "heatmap", converting data types and more.
@@ -272,9 +273,9 @@ Spyder can also import data from a variety of plain text and binary files
or the clipboard into a new pandas DataFrame via a sophisticated import wizard.
Most pandas classes, methods and data attributes can be autocompleted in
-Spyder's `Editor `__ and
-`IPython Console `__,
-and Spyder's `Help pane `__ can retrieve
+Spyder's `Editor `__ and
+`IPython Console `__,
+and Spyder's `Help pane `__ can retrieve
and render Numpydoc documentation on pandas objects in rich text with Sphinx
both automatically and on-demand.
@@ -310,8 +311,8 @@ The following data feeds are available:
* Stooq Index Data
* MOEX Data
-`Quandl/Python `__
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+`Quandl/Python `__
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Quandl API for Python wraps the Quandl REST API to return
pandas DataFrames with timeseries indexes.
@@ -322,8 +323,8 @@ PyDatastream is a Python interface to the
REST API to return indexed pandas DataFrames with financial data.
This package requires valid credentials for this API (non free).
-`pandaSDMX `__
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+`pandaSDMX `__
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
pandaSDMX is a library to retrieve and acquire statistical data
and metadata disseminated in
`SDMX `_ 2.1, an ISO-standard
@@ -355,8 +356,8 @@ with pandas.
Domain specific
---------------
-`Geopandas `__
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+`Geopandas `__
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Geopandas extends pandas data objects to include geographic information which support
geometric operations. If your work entails maps and geographical coordinates, and
@@ -396,7 +397,7 @@ any Delta table into Pandas dataframe.
.. _ecosystem.out-of-core:
Out-of-core
--------------
+-----------
`Blaze `__
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -434,8 +435,8 @@ can selectively scale parts of their pandas DataFrame applications.
print(df3)
-`Dask `__
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+`Dask `__
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Dask is a flexible parallel computing library for analytics. Dask
provides a familiar ``DataFrame`` interface for out-of-core, parallel and distributed computing.
@@ -445,6 +446,12 @@ provides a familiar ``DataFrame`` interface for out-of-core, parallel and distri
Dask-ML enables parallel and distributed machine learning using Dask alongside existing machine learning libraries like Scikit-Learn, XGBoost, and TensorFlow.
+`Ibis `__
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Ibis offers a standard way to write analytics code that can be run on multiple engines. It helps bridge the gap between local Python environments (like pandas) and remote storage and execution systems such as Hadoop components (HDFS, Impala, Hive, Spark) and SQL databases (Postgres, etc.).
+
+
`Koalas `__
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -467,8 +474,8 @@ time-consuming tasks like ingesting data (``read_csv``, ``read_excel``,
df = pd.read_csv("big.csv") # use all your cores!
-`Odo `__
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+`Odo `__
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Odo provides a uniform API for moving data between different formats. It uses
pandas own ``read_csv`` for CSV IO and leverages many existing packages such as
@@ -492,8 +499,8 @@ If also displays progress bars.
df.parallel_apply(func)
-`Vaex `__
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+`Vaex `__
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Increasingly, packages are being built on top of pandas to address specific needs in data preparation, analysis and visualization. Vaex is a Python library for Out-of-Core DataFrames (similar to pandas), to visualize and explore big tabular datasets. It can calculate statistics such as mean, sum, count, standard deviation etc, on an N-dimensional grid up to a billion (10\ :sup:`9`) objects/rows per second. Visualization is done using histograms, density plots and 3d volume rendering, allowing interactive exploration of big data. Vaex uses memory mapping, zero memory copy policy and lazy computations for best performance (no memory wasted).
@@ -567,5 +574,18 @@ Library Accessor Classes Description
.. _pathlib.Path: https://docs.python.org/3/library/pathlib.html
.. _pint-pandas: https://github.com/hgrecco/pint-pandas
.. _composeml: https://github.com/alteryx/compose
-.. _datatest: https://datatest.readthedocs.io/
+.. _datatest: https://datatest.readthedocs.io/en/stable/
.. _woodwork: https://github.com/alteryx/woodwork
+
+Development tools
+-----------------
+
+`pandas-stubs `__
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+While the pandas repository is partially typed, the package itself doesn't expose this information for external use.
+Install pandas-stubs to enable basic type coverage of the pandas API.
+
+Learn more by reading through :issue:`14468`, :issue:`26766`, :issue:`28142`.
+
+See installation and usage instructions on the `GitHub page `__.
diff --git a/doc/source/getting_started/comparison/comparison_with_r.rst b/doc/source/getting_started/comparison/comparison_with_r.rst
index 864081002086b..f91f4218c3429 100644
--- a/doc/source/getting_started/comparison/comparison_with_r.rst
+++ b/doc/source/getting_started/comparison/comparison_with_r.rst
@@ -31,7 +31,7 @@ Quick reference
We'll start off with a quick reference guide pairing some common R
operations using `dplyr
-`__ with
+`__ with
pandas equivalents.
@@ -326,8 +326,8 @@ table below shows how these data structures could be mapped in Python.
| data.frame | dataframe |
+------------+-------------------------------+
-|ddply|_
-~~~~~~~~
+ddply
+~~~~~
An expression using a data.frame called ``df`` in R where you want to
summarize ``x`` by ``month``:
@@ -372,8 +372,8 @@ For more details and examples see :ref:`the groupby documentation
reshape / reshape2
------------------
-|meltarray|_
-~~~~~~~~~~~~~
+meltarray
+~~~~~~~~~
An expression using a 3 dimensional array called ``a`` in R where you want to
melt it into a data.frame:
@@ -390,8 +390,8 @@ In Python, since ``a`` is a list, you can simply use list comprehension.
a = np.array(list(range(1, 24)) + [np.NAN]).reshape(2, 3, 4)
pd.DataFrame([tuple(list(x) + [val]) for x, val in np.ndenumerate(a)])
-|meltlist|_
-~~~~~~~~~~~~
+meltlist
+~~~~~~~~
An expression using a list called ``a`` in R where you want to melt it
into a data.frame:
@@ -412,8 +412,8 @@ In Python, this list would be a list of tuples, so
For more details and examples see :ref:`the Into to Data Structures
documentation `.
-|meltdf|_
-~~~~~~~~~~~~~~~~
+meltdf
+~~~~~~
An expression using a data.frame called ``cheese`` in R where you want to
reshape the data.frame:
@@ -447,8 +447,8 @@ In Python, the :meth:`~pandas.melt` method is the R equivalent:
For more details and examples see :ref:`the reshaping documentation
`.
-|cast|_
-~~~~~~~
+cast
+~~~~
In R ``acast`` is an expression using a data.frame called ``df`` in R to cast
into a higher dimensional array:
@@ -577,20 +577,5 @@ For more details and examples see :ref:`categorical introduction `
.. |subset| replace:: ``subset``
.. _subset: https://stat.ethz.ch/R-manual/R-patched/library/base/html/subset.html
-.. |ddply| replace:: ``ddply``
-.. _ddply: https://cran.r-project.org/web/packages/plyr/plyr.pdf#Rfn.ddply.1
-
-.. |meltarray| replace:: ``melt.array``
-.. _meltarray: https://cran.r-project.org/web/packages/reshape2/reshape2.pdf#Rfn.melt.array.1
-
-.. |meltlist| replace:: ``melt.list``
-.. meltlist: https://cran.r-project.org/web/packages/reshape2/reshape2.pdf#Rfn.melt.list.1
-
-.. |meltdf| replace:: ``melt.data.frame``
-.. meltdf: https://cran.r-project.org/web/packages/reshape2/reshape2.pdf#Rfn.melt.data.frame.1
-
-.. |cast| replace:: ``cast``
-.. cast: https://cran.r-project.org/web/packages/reshape2/reshape2.pdf#Rfn.cast.1
-
.. |factor| replace:: ``factor``
.. _factor: https://stat.ethz.ch/R-manual/R-devel/library/base/html/factor.html
diff --git a/doc/source/getting_started/comparison/comparison_with_sas.rst b/doc/source/getting_started/comparison/comparison_with_sas.rst
index 54b45dc20db20..5a624c9c55782 100644
--- a/doc/source/getting_started/comparison/comparison_with_sas.rst
+++ b/doc/source/getting_started/comparison/comparison_with_sas.rst
@@ -96,7 +96,7 @@ Reading external data
Like SAS, pandas provides utilities for reading in data from
many formats. The ``tips`` dataset, found within the pandas
-tests (`csv `_)
+tests (`csv `_)
will be used in many of the following examples.
SAS provides ``PROC IMPORT`` to read csv data into a data set.
@@ -113,7 +113,7 @@ The pandas method is :func:`read_csv`, which works similarly.
url = (
"https://raw.github.com/pandas-dev/"
- "pandas/master/pandas/tests/io/data/csv/tips.csv"
+ "pandas/main/pandas/tests/io/data/csv/tips.csv"
)
tips = pd.read_csv(url)
tips
@@ -335,7 +335,7 @@ Extracting substring by position
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
SAS extracts a substring from a string based on its position with the
-`SUBSTR `__ function.
+`SUBSTR `__ function.
.. code-block:: sas
@@ -538,7 +538,7 @@ This means that the size of data able to be loaded in pandas is limited by your
machine's memory, but also that the operations on that data may be faster.
If out of core processing is needed, one possibility is the
-`dask.dataframe `_
+`dask.dataframe `_
library (currently in development) which
provides a subset of pandas functionality for an on-disk ``DataFrame``
diff --git a/doc/source/getting_started/comparison/comparison_with_spreadsheets.rst b/doc/source/getting_started/comparison/comparison_with_spreadsheets.rst
index bdd0f7d8cfddf..a7148405ba8a0 100644
--- a/doc/source/getting_started/comparison/comparison_with_spreadsheets.rst
+++ b/doc/source/getting_started/comparison/comparison_with_spreadsheets.rst
@@ -11,7 +11,7 @@ of how various spreadsheet operations would be performed using pandas. This page
terminology and link to documentation for Excel, but much will be the same/similar in
`Google Sheets `_,
`LibreOffice Calc `_,
-`Apple Numbers `_, and other
+`Apple Numbers `_, and other
Excel-compatible spreadsheet software.
.. include:: includes/introduction.rst
@@ -85,14 +85,14 @@ In a spreadsheet, `values can be typed directly into cells `__
+Both `Excel `__
and :ref:`pandas <10min_tut_02_read_write>` can import data from various sources in various
formats.
CSV
'''
-Let's load and display the `tips `_
+Let's load and display the `tips `_
dataset from the pandas tests, which is a CSV file. In Excel, you would download and then
`open the CSV `_.
In pandas, you pass the URL or local path of the CSV file to :func:`~pandas.read_csv`:
@@ -101,7 +101,7 @@ In pandas, you pass the URL or local path of the CSV file to :func:`~pandas.read
url = (
"https://raw.github.com/pandas-dev"
- "/pandas/master/pandas/tests/io/data/csv/tips.csv"
+ "/pandas/main/pandas/tests/io/data/csv/tips.csv"
)
tips = pd.read_csv(url)
tips
@@ -435,13 +435,14 @@ The equivalent in pandas:
Adding a row
~~~~~~~~~~~~
-Assuming we are using a :class:`~pandas.RangeIndex` (numbered ``0``, ``1``, etc.), we can use :meth:`DataFrame.append` to add a row to the bottom of a ``DataFrame``.
+Assuming we are using a :class:`~pandas.RangeIndex` (numbered ``0``, ``1``, etc.), we can use :func:`concat` to add a row to the bottom of a ``DataFrame``.
.. ipython:: python
df
- new_row = {"class": "E", "student_count": 51, "all_pass": True}
- df.append(new_row, ignore_index=True)
+ new_row = pd.DataFrame([["E", 51, True]],
+ columns=["class", "student_count", "all_pass"])
+ pd.concat([df, new_row], ignore_index=True)
Find and Replace
diff --git a/doc/source/getting_started/comparison/comparison_with_sql.rst b/doc/source/getting_started/comparison/comparison_with_sql.rst
index 49a21f87382b3..0a891a4c6d2d7 100644
--- a/doc/source/getting_started/comparison/comparison_with_sql.rst
+++ b/doc/source/getting_started/comparison/comparison_with_sql.rst
@@ -18,7 +18,7 @@ structure.
url = (
"https://raw.github.com/pandas-dev"
- "/pandas/master/pandas/tests/io/data/csv/tips.csv"
+ "/pandas/main/pandas/tests/io/data/csv/tips.csv"
)
tips = pd.read_csv(url)
tips
@@ -233,6 +233,12 @@ default, :meth:`~pandas.DataFrame.join` will join the DataFrames on their indice
parameters allowing you to specify the type of join to perform (``LEFT``, ``RIGHT``, ``INNER``,
``FULL``) or the columns to join on (column names or indices).
+.. warning::
+
+ If both key columns contain rows where the key is a null value, those
+ rows will be matched against each other. This is different from usual SQL
+ join behaviour and can lead to unexpected results.
+
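
(A quick sketch of the behaviour the warning describes; the frames here are invented for
illustration and are separate from the example below:)

.. code-block:: python

    import numpy as np
    import pandas as pd

    left = pd.DataFrame({"key": ["A", np.nan], "lval": [1, 2]})
    right = pd.DataFrame({"key": ["A", np.nan], "rval": [3, 4]})
    # unlike SQL, the NaN keys are matched against each other
    pd.merge(left, right, on="key")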
.. ipython:: python
df1 = pd.DataFrame({"key": ["A", "B", "C", "D"], "value": np.random.randn(4)})
diff --git a/doc/source/getting_started/comparison/comparison_with_stata.rst b/doc/source/getting_started/comparison/comparison_with_stata.rst
index 94c45adcccc82..636778a2ca32e 100644
--- a/doc/source/getting_started/comparison/comparison_with_stata.rst
+++ b/doc/source/getting_started/comparison/comparison_with_stata.rst
@@ -92,7 +92,7 @@ Reading external data
Like Stata, pandas provides utilities for reading in data from
many formats. The ``tips`` data set, found within the pandas
-tests (`csv `_)
+tests (`csv `_)
will be used in many of the following examples.
Stata provides ``import delimited`` to read csv data into a data set in memory.
@@ -109,7 +109,7 @@ the data set if presented with a url.
url = (
"https://raw.github.com/pandas-dev"
- "/pandas/master/pandas/tests/io/data/csv/tips.csv"
+ "/pandas/main/pandas/tests/io/data/csv/tips.csv"
)
tips = pd.read_csv(url)
tips
@@ -496,6 +496,6 @@ Disk vs memory
pandas and Stata both operate exclusively in memory. This means that the size of
data able to be loaded in pandas is limited by your machine's memory.
If out of core processing is needed, one possibility is the
-`dask.dataframe `_
+`dask.dataframe `_
library, which provides a subset of pandas functionality for an
on-disk ``DataFrame``.
diff --git a/doc/source/getting_started/comparison/includes/nth_word.rst b/doc/source/getting_started/comparison/includes/nth_word.rst
index 7af0285005d5b..20e2ec47a8c9d 100644
--- a/doc/source/getting_started/comparison/includes/nth_word.rst
+++ b/doc/source/getting_started/comparison/includes/nth_word.rst
@@ -5,5 +5,5 @@ word by index. Note there are more powerful approaches should you need them.
firstlast = pd.DataFrame({"String": ["John Smith", "Jane Cook"]})
firstlast["First_Name"] = firstlast["String"].str.split(" ", expand=True)[0]
- firstlast["Last_Name"] = firstlast["String"].str.rsplit(" ", expand=True)[0]
+ firstlast["Last_Name"] = firstlast["String"].str.rsplit(" ", expand=True)[1]
firstlast
diff --git a/doc/source/getting_started/install.rst b/doc/source/getting_started/install.rst
index 88e54421daa11..df9c258f4aa6d 100644
--- a/doc/source/getting_started/install.rst
+++ b/doc/source/getting_started/install.rst
@@ -12,7 +12,7 @@ cross platform distribution for data analysis and scientific computing.
This is the recommended installation method for most users.
Instructions for installing from source,
-`PyPI `__, `ActivePython `__, various Linux distributions, or a
+`PyPI `__, `ActivePython `__, various Linux distributions, or a
`development version `__ are also provided.
.. _install.version:
@@ -20,7 +20,7 @@ Instructions for installing from source,
Python version support
----------------------
-Officially Python 3.7.1 and above, 3.8, and 3.9.
+Officially Python 3.8 and 3.9.
Installing pandas
-----------------
@@ -47,7 +47,7 @@ rest of the `SciPy `__ stack without needing to install
anything else, and without needing to wait for any software to be compiled.
Installation instructions for `Anaconda `__
-`can be found here `__.
+`can be found here `__.
A full list of the packages available as part of the
`Anaconda `__ distribution
@@ -70,18 +70,18 @@ and involves downloading the installer which is a few hundred megabytes in size.
If you want to have more control on which packages, or have a limited internet
bandwidth, then installing pandas with
-`Miniconda `__ may be a better solution.
+`Miniconda `__ may be a better solution.
-`Conda `__ is the package manager that the
+`Conda `__ is the package manager that the
`Anaconda `__ distribution is built upon.
It is a package manager that is both cross-platform and language agnostic
(it can play a similar role to a pip and virtualenv combination).
`Miniconda `__ allows you to create a
minimal self contained Python installation, and then use the
-`Conda `__ command to install additional packages.
+`Conda `__ command to install additional packages.
-First you will need `Conda `__ to be installed and
+First you will need `Conda `__ to be installed and
downloading and running the `Miniconda
`__
will do this for you. The installer
@@ -132,6 +132,9 @@ Installing from PyPI
pandas can be installed via pip from
`PyPI `__.
+.. note::
+ You must have ``pip>=19.3`` to install from PyPI.
+
::
pip install pandas
@@ -140,8 +143,8 @@ Installing with ActivePython
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Installation instructions for
-`ActivePython `__ can be found
-`here `__. Versions
+`ActivePython `__ can be found
+`here `__. Versions
2.7, 3.5 and 3.6 include pandas.
Installing using your Linux distribution's package manager.
@@ -155,10 +158,10 @@ The commands in this table will install pandas for Python 3 from your distributi
Debian, stable, `official Debian repository `__ , ``sudo apt-get install python3-pandas``
- Debian & Ubuntu, unstable (latest packages), `NeuroDebian `__ , ``sudo apt-get install python3-pandas``
+ Debian & Ubuntu, unstable (latest packages), `NeuroDebian `__ , ``sudo apt-get install python3-pandas``
Ubuntu, stable, `official Ubuntu repository `__ , ``sudo apt-get install python3-pandas``
OpenSuse, stable, `OpenSuse Repository `__ , ``zypper in python3-pandas``
- Fedora, stable, `official Fedora repository `__ , ``dnf install python3-pandas``
+ Fedora, stable, `official Fedora repository `__ , ``dnf install python3-pandas``
Centos/RHEL, stable, `EPEL repository `__ , ``yum install python3-pandas``
**However**, the packages in the linux package managers are often a few versions behind, so
@@ -196,7 +199,7 @@ the code base as of this writing. To run it on your machine to verify that
everything is working (and that you have all of the dependencies, soft and hard,
installed), make sure you have `pytest
`__ >= 6.0 and `Hypothesis
-`__ >= 3.58, then run:
+`__ >= 3.58, then run:
::
@@ -221,9 +224,9 @@ Dependencies
================================================================ ==========================
Package Minimum supported version
================================================================ ==========================
-`NumPy `__ 1.17.3
-`python-dateutil `__ 2.7.3
-`pytz `__ 2017.3
+`NumPy `__ 1.18.5
+`python-dateutil `__ 2.8.1
+`pytz `__ 2020.1
================================================================ ==========================
.. _install.recommended_dependencies:
@@ -233,11 +236,11 @@ Recommended dependencies
* `numexpr `__: for accelerating certain numerical operations.
``numexpr`` uses multiple cores as well as smart chunking and caching to achieve large speedups.
- If installed, must be Version 2.7.0 or higher.
+ If installed, must be Version 2.7.1 or higher.
* `bottleneck `__: for accelerating certain types of ``nan``
evaluations. ``bottleneck`` uses specialized cython routines to achieve large speedups. If installed,
- must be Version 1.2.1 or higher.
+ must be Version 1.3.1 or higher.
.. note::
@@ -262,9 +265,8 @@ Visualization
========================= ================== =============================================================
Dependency Minimum Version Notes
========================= ================== =============================================================
-setuptools 38.6.0 Utils for entry points of plotting backend
-matplotlib 2.2.3 Plotting library
-Jinja2 2.10 Conditional formatting with DataFrame.style
+matplotlib 3.3.2 Plotting library
+Jinja2 2.11 Conditional formatting with DataFrame.style
tabulate 0.8.7 Printing in Markdown-friendly format (see `tabulate`_)
========================= ================== =============================================================
@@ -274,10 +276,10 @@ Computation
========================= ================== =============================================================
Dependency Minimum Version Notes
========================= ================== =============================================================
-SciPy 1.12.0 Miscellaneous statistical functions
-numba 0.46.0 Alternative execution engine for rolling operations
+SciPy 1.14.1 Miscellaneous statistical functions
+numba 0.50.1 Alternative execution engine for rolling operations
(see :ref:`Enhancing Performance `)
-xarray 0.12.3 pandas-like API for N-dimensional data
+xarray 0.15.1 pandas-like API for N-dimensional data
========================= ================== =============================================================
Excel files
@@ -286,10 +288,10 @@ Excel files
========================= ================== =============================================================
Dependency Minimum Version Notes
========================= ================== =============================================================
-xlrd 1.2.0 Reading Excel
+xlrd 2.0.1 Reading Excel
xlwt 1.3.0 Writing Excel
-xlsxwriter 1.0.2 Writing Excel
-openpyxl 3.0.0 Reading / writing for xlsx files
+xlsxwriter 1.2.2 Writing Excel
+openpyxl 3.0.3 Reading / writing for xlsx files
pyxlsb 1.0.6 Reading for xlsb files
========================= ================== =============================================================
@@ -299,9 +301,9 @@ HTML
========================= ================== =============================================================
Dependency Minimum Version Notes
========================= ================== =============================================================
-BeautifulSoup4 4.6.0 HTML parser for read_html
-html5lib 1.0.1 HTML parser for read_html
-lxml 4.3.0 HTML parser for read_html
+BeautifulSoup4 4.8.2 HTML parser for read_html
+html5lib 1.1 HTML parser for read_html
+lxml 4.5.0 HTML parser for read_html
========================= ================== =============================================================
One of the following combinations of libraries is needed to use the
@@ -334,7 +336,7 @@ XML
========================= ================== =============================================================
Dependency Minimum Version Notes
========================= ================== =============================================================
-lxml 4.3.0 XML parser for read_xml and tree builder for to_xml
+lxml 4.5.0 XML parser for read_xml and tree builder for to_xml
========================= ================== =============================================================
SQL databases
@@ -343,9 +345,9 @@ SQL databases
========================= ================== =============================================================
Dependency Minimum Version Notes
========================= ================== =============================================================
-SQLAlchemy 1.3.0 SQL support for databases other than sqlite
-psycopg2 2.7 PostgreSQL engine for sqlalchemy
-pymysql 0.8.1 MySQL engine for sqlalchemy
+SQLAlchemy 1.4.0 SQL support for databases other than sqlite
+psycopg2 2.8.4 PostgreSQL engine for sqlalchemy
+pymysql 0.10.1 MySQL engine for sqlalchemy
========================= ================== =============================================================
Other data sources
@@ -354,12 +356,12 @@ Other data sources
========================= ================== =============================================================
Dependency Minimum Version Notes
========================= ================== =============================================================
-PyTables 3.5.1 HDF5-based reading / writing
-blosc 1.17.0 Compression for HDF5
+PyTables 3.6.1 HDF5-based reading / writing
+blosc 1.20.1 Compression for HDF5
zlib Compression for HDF5
fastparquet 0.4.0 Parquet reading / writing
-pyarrow 0.17.0 Parquet, ORC, and feather reading / writing
-pyreadstat SPSS files (.sav) reading
+pyarrow 1.0.1 Parquet, ORC, and feather reading / writing
+pyreadstat 1.1.0 SPSS files (.sav) reading
========================= ================== =============================================================
.. _install.warn_orc:
@@ -385,7 +387,7 @@ Dependency Minimum Version Notes
========================= ================== =============================================================
fsspec 0.7.4 Handling files aside from simple local and HTTP
gcsfs 0.6.0 Google Cloud Storage access
-pandas-gbq 0.12.0 Google Big Query access
+pandas-gbq 0.14.0 Google Big Query access
s3fs 0.4.0 Amazon S3 access
========================= ================== =============================================================
@@ -400,3 +402,13 @@ qtpy Clipboard I/O
xclip Clipboard I/O on linux
xsel Clipboard I/O on linux
========================= ================== =============================================================
+
+
+Compression
+^^^^^^^^^^^
+
+========================= ================== =============================================================
+Dependency Minimum Version Notes
+========================= ================== =============================================================
+Zstandard Zstandard compression
+========================= ================== =============================================================
diff --git a/doc/source/getting_started/intro_tutorials/06_calculate_statistics.rst b/doc/source/getting_started/intro_tutorials/06_calculate_statistics.rst
index fcf754e340ab2..caa37d69f2945 100644
--- a/doc/source/getting_started/intro_tutorials/06_calculate_statistics.rst
+++ b/doc/source/getting_started/intro_tutorials/06_calculate_statistics.rst
@@ -82,7 +82,7 @@ return a ``DataFrame``, see the :ref:`subset data tutorial <10min_tut_03_subset>
The aggregating statistic can be calculated for multiple columns at the
-same time. Remember the ``describe`` function from :ref:`first tutorial <10min_tut_01_tableoriented>` tutorial?
+same time. Remember the ``describe`` function from the :ref:`first tutorial <10min_tut_01_tableoriented>`?
.. ipython:: python
diff --git a/doc/source/getting_started/intro_tutorials/07_reshape_table_layout.rst b/doc/source/getting_started/intro_tutorials/07_reshape_table_layout.rst
index bd4a617fe753b..d09511143787a 100644
--- a/doc/source/getting_started/intro_tutorials/07_reshape_table_layout.rst
+++ b/doc/source/getting_started/intro_tutorials/07_reshape_table_layout.rst
@@ -67,7 +67,7 @@ measurement.
.. raw:: html
- To raw data
+ To raw data
diff --git a/doc/source/getting_started/intro_tutorials/08_combine_dataframes.rst b/doc/source/getting_started/intro_tutorials/08_combine_dataframes.rst
index be4c284912db4..0b165c4aaa94e 100644
--- a/doc/source/getting_started/intro_tutorials/08_combine_dataframes.rst
+++ b/doc/source/getting_started/intro_tutorials/08_combine_dataframes.rst
@@ -34,7 +34,7 @@ Westminster* in respectively Paris, Antwerp and London.
.. raw:: html
- To raw data
+ To raw data
@@ -69,7 +69,7 @@ Westminster* in respectively Paris, Antwerp and London.
.. raw:: html
- To raw data
+ To raw data
diff --git a/doc/source/getting_started/intro_tutorials/09_timeseries.rst b/doc/source/getting_started/intro_tutorials/09_timeseries.rst
index b9cab0747196e..1b3c3f2a601e8 100644
--- a/doc/source/getting_started/intro_tutorials/09_timeseries.rst
+++ b/doc/source/getting_started/intro_tutorials/09_timeseries.rst
@@ -35,7 +35,7 @@ Westminster* in respectively Paris, Antwerp and London.
.. raw:: html
- To raw data
+ To raw data
diff --git a/doc/source/getting_started/intro_tutorials/includes/air_quality_no2.rst b/doc/source/getting_started/intro_tutorials/includes/air_quality_no2.rst
index a5a5442330e43..410062cf46344 100644
--- a/doc/source/getting_started/intro_tutorials/includes/air_quality_no2.rst
+++ b/doc/source/getting_started/intro_tutorials/includes/air_quality_no2.rst
@@ -17,6 +17,6 @@ in respectively Paris, Antwerp and London.
.. raw:: html
- To raw data
+ To raw data
diff --git a/doc/source/getting_started/intro_tutorials/includes/titanic.rst b/doc/source/getting_started/intro_tutorials/includes/titanic.rst
index 7032b70b3f1cf..1267a33d605ed 100644
--- a/doc/source/getting_started/intro_tutorials/includes/titanic.rst
+++ b/doc/source/getting_started/intro_tutorials/includes/titanic.rst
@@ -27,6 +27,6 @@ consists of the following data columns:
.. raw:: html
- To raw data
+ To raw data
diff --git a/doc/source/getting_started/overview.rst b/doc/source/getting_started/overview.rst
index 7084b67cf9424..320d2da01418c 100644
--- a/doc/source/getting_started/overview.rst
+++ b/doc/source/getting_started/overview.rst
@@ -29,7 +29,7 @@ and :class:`DataFrame` (2-dimensional), handle the vast majority of typical use
cases in finance, statistics, social science, and many areas of
engineering. For R users, :class:`DataFrame` provides everything that R's
``data.frame`` provides and much more. pandas is built on top of `NumPy
-`__ and is intended to integrate well within a scientific
+`__ and is intended to integrate well within a scientific
computing environment with many other 3rd party libraries.
Here are just a few of the things that pandas does well:
@@ -75,7 +75,7 @@ Some other notes
specialized tool.
- pandas is a dependency of `statsmodels
- `__, making it an important part of the
+ `__, making it an important part of the
statistical computing ecosystem in Python.
- pandas has been used extensively in production in financial applications.
@@ -168,7 +168,7 @@ The list of the Core Team members and more detailed information can be found on
Institutional partners
----------------------
-The information about current institutional partners can be found on `pandas website page `__.
+The information about current institutional partners can be found on the `pandas website page `__.
License
-------
diff --git a/doc/source/getting_started/tutorials.rst b/doc/source/getting_started/tutorials.rst
index b8940d2efed2f..a4c555ac227e6 100644
--- a/doc/source/getting_started/tutorials.rst
+++ b/doc/source/getting_started/tutorials.rst
@@ -18,6 +18,19 @@ entails.
For the table of contents, see the `pandas-cookbook GitHub
repository `_.
+pandas workshop by Stefanie Molin
+---------------------------------
+
+An introductory workshop by `Stefanie Molin `_
+designed to quickly get you up to speed with pandas using real-world datasets.
+It covers getting started with pandas, data wrangling, and data visualization
+(with some exposure to matplotlib and seaborn). The
+`pandas-workshop GitHub repository `_
+features detailed environment setup instructions (including a Binder environment),
+slides and notebooks for following along, and exercises to practice the concepts.
+There is also a lab with new exercises on a dataset not covered in the workshop for
+additional practice.
+
Learn pandas by Hernan Rojas
----------------------------
@@ -77,11 +90,11 @@ Video tutorials
* `Data analysis in Python with pandas `_
(2016-2018)
`GitHub repo `__ and
- `Jupyter Notebook `__
+ `Jupyter Notebook `__
* `Best practices with pandas `_
(2018)
`GitHub repo `__ and
- `Jupyter Notebook `__
+ `Jupyter Notebook `__
Various tutorials
diff --git a/doc/source/index.rst.template b/doc/source/index.rst.template
index 51a6807b30e2a..3b440122c2b97 100644
--- a/doc/source/index.rst.template
+++ b/doc/source/index.rst.template
@@ -12,6 +12,9 @@ pandas documentation
**Download documentation**: `PDF Version `__ | `Zipped HTML `__
+**Previous versions**: Documentation of previous pandas versions is available at
+`pandas.pydata.org `__.
+
**Useful links**:
`Binary Installers `__ |
`Source Repository `__ |
diff --git a/doc/source/reference/arrays.rst b/doc/source/reference/arrays.rst
index c6fda85b0486d..38792c46e5f54 100644
--- a/doc/source/reference/arrays.rst
+++ b/doc/source/reference/arrays.rst
@@ -2,9 +2,9 @@
.. _api.arrays:
-=============
-pandas arrays
-=============
+======================================
+pandas arrays, scalars, and data types
+======================================
.. currentmodule:: pandas
@@ -141,11 +141,11 @@ Methods
Timestamp.weekday
A collection of timestamps may be stored in a :class:`arrays.DatetimeArray`.
-For timezone-aware data, the ``.dtype`` of a ``DatetimeArray`` is a
+For timezone-aware data, the ``.dtype`` of a :class:`arrays.DatetimeArray` is a
:class:`DatetimeTZDtype`. For timezone-naive data, ``np.dtype("datetime64[ns]")``
is used.
-If the data are tz-aware, then every value in the array must have the same timezone.
+If the data are timezone-aware, then every value in the array must have the same timezone.
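
For example (a short illustration):

.. code-block:: python

    import pandas as pd

    arr = pd.array(pd.date_range("2021-01-01", periods=2, tz="UTC"))
    arr.dtype  # datetime64[ns, UTC], a DatetimeTZDtype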
.. autosummary::
:toctree: api/
@@ -206,7 +206,7 @@ Methods
Timedelta.to_numpy
Timedelta.total_seconds
-A collection of timedeltas may be stored in a :class:`TimedeltaArray`.
+A collection of :class:`Timedelta` may be stored in a :class:`TimedeltaArray`.
.. autosummary::
:toctree: api/
@@ -267,8 +267,8 @@ Methods
Period.strftime
Period.to_timestamp
-A collection of timedeltas may be stored in a :class:`arrays.PeriodArray`.
-Every period in a ``PeriodArray`` must have the same ``freq``.
+A collection of :class:`Period` may be stored in a :class:`arrays.PeriodArray`.
+Every period in a :class:`arrays.PeriodArray` must have the same ``freq``.
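
For instance:

.. code-block:: python

    import pandas as pd

    arr = pd.period_range("2021-01", periods=3, freq="M").array
    arr.dtype  # period[M]; every element shares the monthly freq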
.. autosummary::
:toctree: api/
@@ -383,8 +383,8 @@ Categorical data
----------------
pandas defines a custom data type for representing data that can take only a
-limited, fixed set of values. The dtype of a ``Categorical`` can be described by
-a :class:`pandas.api.types.CategoricalDtype`.
+limited, fixed set of values. The dtype of a :class:`Categorical` can be described by
+a :class:`CategoricalDtype`.
.. autosummary::
:toctree: api/
@@ -414,7 +414,7 @@ have the categories and integer codes already:
Categorical.from_codes
-The dtype information is available on the ``Categorical``
+The dtype information is available on the :class:`Categorical`
.. autosummary::
:toctree: api/
@@ -425,21 +425,21 @@ The dtype information is available on the ``Categorical``
Categorical.codes
``np.asarray(categorical)`` works by implementing the array interface. Be aware, that this converts
-the Categorical back to a NumPy array, so categories and order information is not preserved!
+the :class:`Categorical` back to a NumPy array, so categories and order information is not preserved!
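
For example:

.. code-block:: python

    import numpy as np
    import pandas as pd

    cat = pd.Categorical(["b", "a", "b"], categories=["b", "a", "c"], ordered=True)
    np.asarray(cat)  # array(['b', 'a', 'b'], dtype=object): the ordering and
                     # the unused category "c" are lost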
.. autosummary::
:toctree: api/
Categorical.__array__
-A ``Categorical`` can be stored in a ``Series`` or ``DataFrame``.
+A :class:`Categorical` can be stored in a :class:`Series` or :class:`DataFrame`.
To create a Series of dtype ``category``, use ``cat = s.astype(dtype)`` or
``Series(..., dtype=dtype)`` where ``dtype`` is either
* the string ``'category'``
-* an instance of :class:`~pandas.api.types.CategoricalDtype`.
+* an instance of :class:`CategoricalDtype`.
-If the Series is of dtype ``CategoricalDtype``, ``Series.cat`` can be used to change the categorical
+If the :class:`Series` is of dtype :class:`CategoricalDtype`, ``Series.cat`` can be used to change the categorical
data. See :ref:`api.series.cat` for more.
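
Putting those options together (a brief sketch):

.. code-block:: python

    import pandas as pd
    from pandas.api.types import CategoricalDtype

    s = pd.Series(["a", "b", "a"])
    s.astype("category")  # via the registered string alias
    dtype = CategoricalDtype(["a", "b", "c"], ordered=True)
    cat_s = s.astype(dtype)  # via an explicit CategoricalDtype
    cat_s.cat.categories  # Series.cat exposes the categorical data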
.. _api.arrays.sparse:
@@ -488,7 +488,7 @@ we recommend using :class:`StringDtype` (with the alias ``"string"``).
StringDtype
-The ``Series.str`` accessor is available for ``Series`` backed by a :class:`arrays.StringArray`.
+The ``Series.str`` accessor is available for :class:`Series` backed by a :class:`arrays.StringArray`.
See :ref:`api.series.str` for more.
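
For example:

.. code-block:: python

    import pandas as pd

    s = pd.Series(["pandas", None], dtype="string")
    s.str.upper()  # string-dtype result; the missing value stays <NA>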
@@ -498,7 +498,7 @@ Boolean data with missing values
--------------------------------
The boolean dtype (with the alias ``"boolean"``) provides support for storing
-boolean data (True, False values) with missing values, which is not possible
+boolean data (``True``, ``False``) with missing values, which is not possible
with a bool :class:`numpy.ndarray`.
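
For example, a nullable boolean array can be constructed directly:

.. code-block:: python

    import pandas as pd

    pd.array([True, False, None], dtype="boolean")
    # <BooleanArray>
    # [True, False, <NA>]
    # Length: 3, dtype: boolean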
.. autosummary::
diff --git a/doc/source/reference/extensions.rst b/doc/source/reference/extensions.rst
index 7b451ed3bf296..ce8d8d5c2ca10 100644
--- a/doc/source/reference/extensions.rst
+++ b/doc/source/reference/extensions.rst
@@ -48,6 +48,7 @@ objects.
api.extensions.ExtensionArray.equals
api.extensions.ExtensionArray.factorize
api.extensions.ExtensionArray.fillna
+ api.extensions.ExtensionArray.insert
api.extensions.ExtensionArray.isin
api.extensions.ExtensionArray.isna
api.extensions.ExtensionArray.ravel
@@ -60,6 +61,7 @@ objects.
api.extensions.ExtensionArray.nbytes
api.extensions.ExtensionArray.ndim
api.extensions.ExtensionArray.shape
+ api.extensions.ExtensionArray.tolist
Additionally, we have some utility methods for ensuring your object
behaves correctly.
diff --git a/doc/source/reference/general_functions.rst b/doc/source/reference/general_functions.rst
index b5832cb8aa591..dde16fb7fac71 100644
--- a/doc/source/reference/general_functions.rst
+++ b/doc/source/reference/general_functions.rst
@@ -37,15 +37,15 @@ Top-level missing data
notna
notnull
-Top-level conversions
-~~~~~~~~~~~~~~~~~~~~~
+Top-level dealing with numeric data
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autosummary::
:toctree: api/
to_numeric
-Top-level dealing with datetimelike
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+Top-level dealing with datetimelike data
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autosummary::
:toctree: api/
@@ -57,8 +57,8 @@ Top-level dealing with datetimelike
timedelta_range
infer_freq
-Top-level dealing with intervals
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+Top-level dealing with Interval data
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autosummary::
:toctree: api/
diff --git a/doc/source/reference/general_utility_functions.rst b/doc/source/reference/general_utility_functions.rst
index 37fe980dbf68c..ee17ef3831164 100644
--- a/doc/source/reference/general_utility_functions.rst
+++ b/doc/source/reference/general_utility_functions.rst
@@ -35,14 +35,17 @@ Exceptions and warnings
.. autosummary::
:toctree: api/
+ errors.AbstractMethodError
errors.AccessorRegistrationWarning
errors.DtypeWarning
errors.DuplicateLabelError
errors.EmptyDataError
errors.InvalidIndexError
+ errors.IntCastingNaNError
errors.MergeError
errors.NullFrequencyError
errors.NumbaUtilError
+ errors.OptionError
errors.OutOfBoundsDatetime
errors.OutOfBoundsTimedelta
errors.ParserError
diff --git a/doc/source/reference/groupby.rst b/doc/source/reference/groupby.rst
index ccf130d03418c..2bb0659264eb0 100644
--- a/doc/source/reference/groupby.rst
+++ b/doc/source/reference/groupby.rst
@@ -122,6 +122,7 @@ application to columns of a specific data type.
DataFrameGroupBy.skew
DataFrameGroupBy.take
DataFrameGroupBy.tshift
+ DataFrameGroupBy.value_counts
The following methods are available only for ``SeriesGroupBy`` objects.
diff --git a/doc/source/reference/indexing.rst b/doc/source/reference/indexing.rst
index 1a8c21a2c1a74..ddfef14036ef3 100644
--- a/doc/source/reference/indexing.rst
+++ b/doc/source/reference/indexing.rst
@@ -406,6 +406,7 @@ Methods
:toctree: api/
DatetimeIndex.mean
+ DatetimeIndex.std
TimedeltaIndex
--------------
diff --git a/doc/source/reference/io.rst b/doc/source/reference/io.rst
index 442631de50c7a..44ee09f2a5e6b 100644
--- a/doc/source/reference/io.rst
+++ b/doc/source/reference/io.rst
@@ -13,6 +13,7 @@ Pickling
:toctree: api/
read_pickle
+ DataFrame.to_pickle
Flat file
~~~~~~~~~
@@ -21,6 +22,7 @@ Flat file
read_table
read_csv
+ DataFrame.to_csv
read_fwf
Clipboard
@@ -29,6 +31,7 @@ Clipboard
:toctree: api/
read_clipboard
+ DataFrame.to_clipboard
Excel
~~~~~
@@ -36,14 +39,26 @@ Excel
:toctree: api/
read_excel
+ DataFrame.to_excel
ExcelFile.parse
+.. currentmodule:: pandas.io.formats.style
+
+.. autosummary::
+ :toctree: api/
+
+ Styler.to_excel
+
+.. currentmodule:: pandas
+
.. autosummary::
:toctree: api/
:template: autosummary/class_without_autosummary.rst
ExcelWriter
+.. currentmodule:: pandas
+
JSON
~~~~
.. autosummary::
@@ -51,6 +66,7 @@ JSON
read_json
json_normalize
+ DataFrame.to_json
.. currentmodule:: pandas.io.json
@@ -67,6 +83,16 @@ HTML
:toctree: api/
read_html
+ DataFrame.to_html
+
+.. currentmodule:: pandas.io.formats.style
+
+.. autosummary::
+ :toctree: api/
+
+ Styler.to_html
+
+.. currentmodule:: pandas
XML
~~~~
@@ -74,6 +100,23 @@ XML
:toctree: api/
read_xml
+ DataFrame.to_xml
+
+Latex
+~~~~~
+.. autosummary::
+ :toctree: api/
+
+ DataFrame.to_latex
+
+.. currentmodule:: pandas.io.formats.style
+
+.. autosummary::
+ :toctree: api/
+
+ Styler.to_latex
+
+.. currentmodule:: pandas
HDFStore: PyTables (HDF5)
~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -92,7 +135,7 @@ HDFStore: PyTables (HDF5)
.. warning::
- One can store a subclass of ``DataFrame`` or ``Series`` to HDF5,
+ One can store a subclass of :class:`DataFrame` or :class:`Series` to HDF5,
but the type of the subclass is lost upon storing.
Feather
@@ -101,6 +144,7 @@ Feather
:toctree: api/
read_feather
+ DataFrame.to_feather
Parquet
~~~~~~~
@@ -108,6 +152,7 @@ Parquet
:toctree: api/
read_parquet
+ DataFrame.to_parquet
ORC
~~~
@@ -138,6 +183,7 @@ SQL
read_sql_table
read_sql_query
read_sql
+ DataFrame.to_sql
Google BigQuery
~~~~~~~~~~~~~~~
@@ -152,6 +198,7 @@ STATA
:toctree: api/
read_stata
+ DataFrame.to_stata
.. currentmodule:: pandas.io.stata
diff --git a/doc/source/reference/series.rst b/doc/source/reference/series.rst
index 3ff3b2bb53fda..a60dab549e66d 100644
--- a/doc/source/reference/series.rst
+++ b/doc/source/reference/series.rst
@@ -427,6 +427,8 @@ strings and apply several methods to it. These can be accessed like
Series.str.normalize
Series.str.pad
Series.str.partition
+ Series.str.removeprefix
+ Series.str.removesuffix
Series.str.repeat
Series.str.replace
Series.str.rfind
diff --git a/doc/source/reference/style.rst b/doc/source/reference/style.rst
index 5a2ff803f0323..a739993e4d376 100644
--- a/doc/source/reference/style.rst
+++ b/doc/source/reference/style.rst
@@ -24,6 +24,8 @@ Styler properties
Styler.env
Styler.template_html
+ Styler.template_html_style
+ Styler.template_html_table
Styler.template_latex
Styler.loader
@@ -34,13 +36,17 @@ Style application
Styler.apply
Styler.applymap
- Styler.where
+ Styler.apply_index
+ Styler.applymap_index
Styler.format
+ Styler.format_index
+ Styler.hide
Styler.set_td_classes
Styler.set_table_styles
Styler.set_table_attributes
Styler.set_tooltips
Styler.set_caption
+ Styler.set_sticky
Styler.set_properties
Styler.set_uuid
Styler.clear
@@ -65,9 +71,8 @@ Style export and import
.. autosummary::
:toctree: api/
- Styler.render
- Styler.export
- Styler.use
Styler.to_html
- Styler.to_excel
Styler.to_latex
+ Styler.to_excel
+ Styler.export
+ Styler.use
diff --git a/doc/source/reference/window.rst b/doc/source/reference/window.rst
index a255b3ae8081e..0be3184a9356c 100644
--- a/doc/source/reference/window.rst
+++ b/doc/source/reference/window.rst
@@ -35,6 +35,7 @@ Rolling window functions
Rolling.aggregate
Rolling.quantile
Rolling.sem
+ Rolling.rank
.. _api.functions_window:
@@ -75,6 +76,7 @@ Expanding window functions
Expanding.aggregate
Expanding.quantile
Expanding.sem
+ Expanding.rank
.. _api.functions_ewm:
@@ -86,6 +88,7 @@ Exponentially-weighted window functions
:toctree: api/
ExponentialMovingWindow.mean
+ ExponentialMovingWindow.sum
ExponentialMovingWindow.std
ExponentialMovingWindow.var
ExponentialMovingWindow.corr
diff --git a/doc/source/user_guide/10min.rst b/doc/source/user_guide/10min.rst
index 2b329ef362354..08488a33936f0 100644
--- a/doc/source/user_guide/10min.rst
+++ b/doc/source/user_guide/10min.rst
@@ -19,7 +19,7 @@ Customarily, we import as follows:
Object creation
---------------
-See the :ref:`Data Structure Intro section `.
+See the :ref:`Intro to data structures section `.
Creating a :class:`Series` by passing a list of values, letting pandas create
a default integer index:
@@ -39,7 +39,8 @@ and labeled columns:
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list("ABCD"))
df
-Creating a :class:`DataFrame` by passing a dict of objects that can be converted to series-like.
+Creating a :class:`DataFrame` by passing a dictionary of objects that can be
+converted into a series-like structure:
.. ipython:: python
@@ -56,7 +57,7 @@ Creating a :class:`DataFrame` by passing a dict of objects that can be converted
df2
The columns of the resulting :class:`DataFrame` have different
-:ref:`dtypes `.
+:ref:`dtypes `:
.. ipython:: python
@@ -116,14 +117,14 @@ of the dtypes in the DataFrame. This may end up being ``object``, which requires
casting every value to a Python object.
For ``df``, our :class:`DataFrame` of all floating-point values,
-:meth:`DataFrame.to_numpy` is fast and doesn't require copying data.
+:meth:`DataFrame.to_numpy` is fast and doesn't require copying data:
.. ipython:: python
df.to_numpy()
For ``df2``, the :class:`DataFrame` with multiple dtypes,
-:meth:`DataFrame.to_numpy` is relatively expensive.
+:meth:`DataFrame.to_numpy` is relatively expensive:
.. ipython:: python
@@ -180,7 +181,7 @@ equivalent to ``df.A``:
df["A"]
-Selecting via ``[]``, which slices the rows.
+Selecting via ``[]``, which slices the rows:
.. ipython:: python
@@ -278,13 +279,13 @@ For getting fast access to a scalar (equivalent to the prior method):
Boolean indexing
~~~~~~~~~~~~~~~~
-Using a single column's values to select data.
+Using a single column's values to select data:
.. ipython:: python
df[df["A"] > 0]
-Selecting values from a DataFrame where a boolean condition is met.
+Selecting values from a DataFrame where a boolean condition is met:
.. ipython:: python
@@ -303,7 +304,7 @@ Setting
~~~~~~~
Setting a new column automatically aligns the data
-by the indexes.
+by the indexes:
.. ipython:: python
@@ -329,13 +330,13 @@ Setting by assigning with a NumPy array:
df.loc[:, "D"] = np.array([5] * len(df))
-The result of the prior setting operations.
+The result of the prior setting operations:
.. ipython:: python
df
-A ``where`` operation with setting.
+A ``where`` operation with setting:
.. ipython:: python
@@ -352,7 +353,7 @@ default not included in computations. See the :ref:`Missing Data section
`.
Reindexing allows you to change/add/delete the index on a specified axis. This
-returns a copy of the data.
+returns a copy of the data:
.. ipython:: python
@@ -360,19 +361,19 @@ returns a copy of the data.
df1.loc[dates[0] : dates[1], "E"] = 1
df1
-To drop any rows that have missing data.
+To drop any rows that have missing data:
.. ipython:: python
df1.dropna(how="any")
-Filling missing data.
+Filling missing data:
.. ipython:: python
df1.fillna(value=5)
-To get the boolean mask where values are ``nan``.
+To get the boolean mask where values are ``nan``:
.. ipython:: python
@@ -402,7 +403,7 @@ Same operation on the other axis:
df.mean(1)
Operating with objects that have different dimensionality and need alignment.
-In addition, pandas automatically broadcasts along the specified dimension.
+In addition, pandas automatically broadcasts along the specified dimension:
.. ipython:: python
@@ -477,7 +478,6 @@ Concatenating pandas objects together with :func:`concat`:
a row requires a copy, and may be expensive. We recommend passing a
pre-built list of records to the :class:`DataFrame` constructor instead
of building a :class:`DataFrame` by iteratively appending records to it.
- See :ref:`Appending to dataframe ` for more.
Join
~~~~
@@ -527,14 +527,14 @@ See the :ref:`Grouping section `.
df
Grouping and then applying the :meth:`~pandas.core.groupby.GroupBy.sum` function to the resulting
-groups.
+groups:
.. ipython:: python
df.groupby("A").sum()
Grouping by multiple columns forms a hierarchical index, and again we can
-apply the :meth:`~pandas.core.groupby.GroupBy.sum` function.
+apply the :meth:`~pandas.core.groupby.GroupBy.sum` function:
.. ipython:: python
@@ -565,7 +565,7 @@ Stack
df2
The :meth:`~DataFrame.stack` method "compresses" a level in the DataFrame's
-columns.
+columns:
.. ipython:: python
@@ -673,7 +673,7 @@ pandas can include categorical data in a :class:`DataFrame`. For full docs, see
-Convert the raw grades to a categorical data type.
+Converting the raw grades to a categorical data type:
.. ipython:: python
@@ -681,13 +681,13 @@ Convert the raw grades to a categorical data type.
df["grade"]
Rename the categories to more meaningful names (assigning to
-:meth:`Series.cat.categories` is in place!).
+:meth:`Series.cat.categories` is in place!):
.. ipython:: python
df["grade"].cat.categories = ["very good", "good", "very bad"]
-Reorder the categories and simultaneously add the missing categories (methods under :meth:`Series.cat` return a new :class:`Series` by default).
+Reorder the categories and simultaneously add the missing categories (methods under :meth:`Series.cat` return a new :class:`Series` by default):
.. ipython:: python
@@ -696,13 +696,13 @@ Reorder the categories and simultaneously add the missing categories (methods un
)
df["grade"]
-Sorting is per order in the categories, not lexical order.
+Sorting is per order in the categories, not lexical order:
.. ipython:: python
df.sort_values(by="grade")
-Grouping by a categorical column also shows empty categories.
+Grouping by a categorical column also shows empty categories:
.. ipython:: python
@@ -722,7 +722,7 @@ We use the standard convention for referencing the matplotlib API:
plt.close("all")
-The :meth:`~plt.close` method is used to `close `__ a figure window.
+The :meth:`~plt.close` method is used to `close `__ a figure window:
.. ipython:: python
@@ -732,6 +732,14 @@ The :meth:`~plt.close` method is used to `close `__ to show it or
+`matplotlib.pyplot.savefig `__ to write it to a file.
+
+.. ipython:: python
+
+ plt.show();
+
On a DataFrame, the :meth:`~DataFrame.plot` method is a convenience to plot all
of the columns with labels:
@@ -754,13 +762,13 @@ Getting data in/out
CSV
~~~
-:ref:`Writing to a csv file. `
+:ref:`Writing to a csv file: `
.. ipython:: python
df.to_csv("foo.csv")
-:ref:`Reading from a csv file. `
+:ref:`Reading from a csv file: `
.. ipython:: python
@@ -778,13 +786,13 @@ HDF5
Reading and writing to :ref:`HDFStores `.
-Writing to a HDF5 Store.
+Writing to a HDF5 Store:
.. ipython:: python
df.to_hdf("foo.h5", "df")
-Reading from a HDF5 Store.
+Reading from a HDF5 Store:
.. ipython:: python
@@ -800,13 +808,13 @@ Excel
Reading and writing to :ref:`MS Excel `.
-Writing to an excel file.
+Writing to an excel file:
.. ipython:: python
df.to_excel("foo.xlsx", sheet_name="Sheet1")
-Reading from an excel file.
+Reading from an excel file:
.. ipython:: python
diff --git a/doc/source/user_guide/advanced.rst b/doc/source/user_guide/advanced.rst
index 3b33ebe701037..b8df21ab5a5b4 100644
--- a/doc/source/user_guide/advanced.rst
+++ b/doc/source/user_guide/advanced.rst
@@ -7,7 +7,7 @@ MultiIndex / advanced indexing
******************************
This section covers :ref:`indexing with a MultiIndex `
-and :ref:`other advanced indexing features `.
+and :ref:`other advanced indexing features `.
See the :ref:`Indexing and Selecting Data ` for general indexing documentation.
@@ -738,7 +738,7 @@ faster than fancy indexing.
%timeit ser.iloc[indexer]
%timeit ser.take(indexer)
-.. _indexing.index_types:
+.. _advanced.index_types:
Index types
-----------
@@ -749,7 +749,7 @@ and documentation about ``TimedeltaIndex`` is found :ref:`here `__.
-.. _indexing.float64index:
+.. _advanced.float64index:
Float64Index
~~~~~~~~~~~~
+.. deprecated:: 1.4.0
+ :class:`Index` will become the default index type for numeric types in the future
+ instead of ``Int64Index``, ``Float64Index`` and ``UInt64Index``; those index types
+ are therefore deprecated and will be removed in a future version of pandas.
+ ``RangeIndex`` will not be removed, as it represents an optimized version of an integer index.
+
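+A short illustration of the behaviour being deprecated (values invented for illustration):
+
+.. code-block:: python
+
+   import pandas as pd
+
+   # Currently yields a Float64Index; in a future version this will be a
+   # plain Index with a float64 dtype
+   pd.Index([1.5, 2.0, 3.5])
+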
By default a :class:`Float64Index` will be automatically created when passing floating, or mixed-integer-floating values in index creation.
This enables a pure label-based slicing paradigm that makes ``[],ix,loc`` for scalar indexing and slicing work exactly the
same.
@@ -956,6 +968,7 @@ If you need integer based selection, you should use ``iloc``:
dfir.iloc[0:5]
+
.. _advanced.intervalindex:
IntervalIndex
@@ -1233,5 +1246,5 @@ This is because the (re)indexing operations above silently inserts ``NaNs`` and
changes accordingly. This can cause some issues when using ``numpy`` ``ufuncs``
such as ``numpy.logical_and``.
-See the `this old issue `__ for a more
+See :issue:`2388` for a more
detailed discussion.
diff --git a/doc/source/user_guide/basics.rst b/doc/source/user_guide/basics.rst
index 82c8a27bec3a5..a34d4891b9d77 100644
--- a/doc/source/user_guide/basics.rst
+++ b/doc/source/user_guide/basics.rst
@@ -848,8 +848,8 @@ have introduced the popular ``(%>%)`` (read pipe) operator for R_.
The implementation of ``pipe`` here is quite clean and feels right at home in Python.
We encourage you to view the source code of :meth:`~DataFrame.pipe`.
-.. _dplyr: https://github.com/hadley/dplyr
-.. _magrittr: https://github.com/smbache/magrittr
+.. _dplyr: https://github.com/tidyverse/dplyr
+.. _magrittr: https://github.com/tidyverse/magrittr
.. _R: https://www.r-project.org
@@ -1045,6 +1045,9 @@ not noted for a particular column will be ``NaN``:
Mixed dtypes
++++++++++++
+.. deprecated:: 1.4.0
+ Attempting to determine which columns cannot be aggregated and silently dropping them from the results is deprecated and will be removed in a future version. If any portion of the columns or operations provided fails, the call to ``.agg`` will raise.
+
When presented with mixed dtypes that cannot aggregate, ``.agg`` will only take the valid
aggregations. This is similar to how ``.groupby.agg`` works.
@@ -1061,6 +1064,7 @@ aggregations. This is similar to how ``.groupby.agg`` works.
mdf.dtypes
.. ipython:: python
+ :okwarning:
mdf.agg(["min", "sum"])
@@ -2047,32 +2051,33 @@ The following table lists all of pandas extension types. For methods requiring `
arguments, strings can be specified as indicated. See the respective
documentation sections for more on each type.
-+-------------------+---------------------------+--------------------+-------------------------------+-----------------------------------------+-------------------------------+
-| Kind of Data | Data Type | Scalar | Array | String Aliases | Documentation |
-+===================+===========================+====================+===============================+=========================================+===============================+
-| tz-aware datetime | :class:`DatetimeTZDtype` | :class:`Timestamp` | :class:`arrays.DatetimeArray` | ``'datetime64[ns, ]'`` | :ref:`timeseries.timezone` |
-+-------------------+---------------------------+--------------------+-------------------------------+-----------------------------------------+-------------------------------+
-| Categorical | :class:`CategoricalDtype` | (none) | :class:`Categorical` | ``'category'`` | :ref:`categorical` |
-+-------------------+---------------------------+--------------------+-------------------------------+-----------------------------------------+-------------------------------+
-| period | :class:`PeriodDtype` | :class:`Period` | :class:`arrays.PeriodArray` | ``'period[]'``, | :ref:`timeseries.periods` |
-| (time spans) | | | | ``'Period[]'`` | |
-+-------------------+---------------------------+--------------------+-------------------------------+-----------------------------------------+-------------------------------+
-| sparse | :class:`SparseDtype` | (none) | :class:`arrays.SparseArray` | ``'Sparse'``, ``'Sparse[int]'``, | :ref:`sparse` |
-| | | | | ``'Sparse[float]'`` | |
-+-------------------+---------------------------+--------------------+-------------------------------+-----------------------------------------+-------------------------------+
-| intervals | :class:`IntervalDtype` | :class:`Interval` | :class:`arrays.IntervalArray` | ``'interval'``, ``'Interval'``, | :ref:`advanced.intervalindex` |
-| | | | | ``'Interval[]'``, | |
-| | | | | ``'Interval[datetime64[ns, ]]'``, | |
-| | | | | ``'Interval[timedelta64[]]'`` | |
-+-------------------+---------------------------+--------------------+-------------------------------+-----------------------------------------+-------------------------------+
-| nullable integer + :class:`Int64Dtype`, ... | (none) | :class:`arrays.IntegerArray` | ``'Int8'``, ``'Int16'``, ``'Int32'``, | :ref:`integer_na` |
-| | | | | ``'Int64'``, ``'UInt8'``, ``'UInt16'``, | |
-| | | | | ``'UInt32'``, ``'UInt64'`` | |
-+-------------------+---------------------------+--------------------+-------------------------------+-----------------------------------------+-------------------------------+
-| Strings | :class:`StringDtype` | :class:`str` | :class:`arrays.StringArray` | ``'string'`` | :ref:`text` |
-+-------------------+---------------------------+--------------------+-------------------------------+-----------------------------------------+-------------------------------+
-| Boolean (with NA) | :class:`BooleanDtype` | :class:`bool` | :class:`arrays.BooleanArray` | ``'boolean'`` | :ref:`api.arrays.bool` |
-+-------------------+---------------------------+--------------------+-------------------------------+-----------------------------------------+-------------------------------+
++-------------------------------------------------+---------------------------+--------------------+-------------------------------+----------------------------------------+
+| Kind of Data | Data Type | Scalar | Array | String Aliases |
++=================================================+===========================+====================+===============================+========================================+
+| :ref:`tz-aware datetime ` | :class:`DatetimeTZDtype` | :class:`Timestamp` | :class:`arrays.DatetimeArray` | ``'datetime64[ns, ]'`` |
++-------------------------------------------------+---------------------------+--------------------+-------------------------------+----------------------------------------+
+| :ref:`Categorical ` | :class:`CategoricalDtype` | (none) | :class:`Categorical` | ``'category'`` |
++-------------------------------------------------+---------------------------+--------------------+-------------------------------+----------------------------------------+
+| :ref:`period (time spans) ` | :class:`PeriodDtype` | :class:`Period` | :class:`arrays.PeriodArray` | ``'period[]'``, |
+| | | | ``'Period[]'`` | |
++-------------------------------------------------+---------------------------+--------------------+-------------------------------+----------------------------------------+
+| :ref:`sparse ` | :class:`SparseDtype` | (none) | :class:`arrays.SparseArray` | ``'Sparse'``, ``'Sparse[int]'``, |
+| | | | | ``'Sparse[float]'`` |
++-------------------------------------------------+---------------------------+--------------------+-------------------------------+----------------------------------------+
+| :ref:`intervals ` | :class:`IntervalDtype` | :class:`Interval` | :class:`arrays.IntervalArray` | ``'interval'``, ``'Interval'``, |
+| | | | | ``'Interval[]'``, |
+| | | | | ``'Interval[datetime64[ns, ]]'``, |
+| | | | | ``'Interval[timedelta64[]]'`` |
++-------------------------------------------------+---------------------------+--------------------+-------------------------------+----------------------------------------+
+| :ref:`nullable integer ` | :class:`Int64Dtype`, ... | (none) | :class:`arrays.IntegerArray` | ``'Int8'``, ``'Int16'``, ``'Int32'``, |
+| | | | | ``'Int64'``, ``'UInt8'``, ``'UInt16'``,|
+| | | | | ``'UInt32'``, ``'UInt64'`` |
++-------------------------------------------------+---------------------------+--------------------+-------------------------------+----------------------------------------+
+| :ref:`Strings ` | :class:`StringDtype` | :class:`str` | :class:`arrays.StringArray` | ``'string'`` |
++-------------------------------------------------+---------------------------+--------------------+-------------------------------+----------------------------------------+
+| :ref:`Boolean (with NA) ` | :class:`BooleanDtype` | :class:`bool` | :class:`arrays.BooleanArray` | ``'boolean'`` |
++-------------------------------------------------+---------------------------+--------------------+-------------------------------+----------------------------------------+
pandas has two ways to store strings.
diff --git a/doc/source/user_guide/boolean.rst b/doc/source/user_guide/boolean.rst
index 76c922fcef638..54c67674b890c 100644
--- a/doc/source/user_guide/boolean.rst
+++ b/doc/source/user_guide/boolean.rst
@@ -12,6 +12,11 @@
Nullable Boolean data type
**************************
+.. note::
+
+ BooleanArray is currently experimental. Its API or implementation may
+ change without warning.
+
.. versionadded:: 1.0.0
diff --git a/doc/source/user_guide/categorical.rst b/doc/source/user_guide/categorical.rst
index f65638cd78a2b..0105cf99193dd 100644
--- a/doc/source/user_guide/categorical.rst
+++ b/doc/source/user_guide/categorical.rst
@@ -777,8 +777,8 @@ value is included in the ``categories``:
df
try:
df.iloc[2:4, :] = [["c", 3], ["c", 3]]
- except ValueError as e:
- print("ValueError:", str(e))
+ except TypeError as e:
+ print("TypeError:", str(e))
Setting values by assigning categorical data will also check that the ``categories`` match:
@@ -788,8 +788,8 @@ Setting values by assigning categorical data will also check that the ``categori
df
try:
df.loc["j":"k", "cats"] = pd.Categorical(["b", "b"], categories=["a", "b", "c"])
- except ValueError as e:
- print("ValueError:", str(e))
+ except TypeError as e:
+ print("TypeError:", str(e))
Assigning a ``Categorical`` to parts of a column of other types will use the values:
@@ -1141,7 +1141,7 @@ Categorical index
``CategoricalIndex`` is a type of index that is useful for supporting
indexing with duplicates. This is a container around a ``Categorical``
and allows efficient indexing and storage of an index with a large number of duplicated elements.
-See the :ref:`advanced indexing docs ` for a more detailed
+See the :ref:`advanced indexing docs ` for a more detailed
explanation.
Setting the index will create a ``CategoricalIndex``:
diff --git a/doc/source/user_guide/cookbook.rst b/doc/source/user_guide/cookbook.rst
index e1aae0fd481b1..f88f4a9708c45 100644
--- a/doc/source/user_guide/cookbook.rst
+++ b/doc/source/user_guide/cookbook.rst
@@ -193,8 +193,7 @@ The :ref:`indexing ` docs.
df[(df.AAA <= 6) & (df.index.isin([0, 2, 4]))]
-`Use loc for label-oriented slicing and iloc positional slicing
-`__
Use ``loc`` for label-oriented slicing and ``iloc`` for positional slicing :issue:`2904`
.. ipython:: python
@@ -229,7 +228,7 @@ Ambiguity arises when an index consists of integers with a non-zero start or non
df2.loc[1:3] # Label-oriented
`Using inverse operator (~) to take the complement of a mask
-`__
+`__
.. ipython:: python
@@ -259,7 +258,7 @@ New columns
df
`Keep other columns when using min() with groupby
-`__
+`__
.. ipython:: python
@@ -389,14 +388,13 @@ Sorting
*******
`Sort by specific column or an ordered list of columns, with a MultiIndex
-`__
+`__
.. ipython:: python
df.sort_values(by=("Labs", "II"), ascending=False)
-`Partial selection, the need for sortedness;
-`__
+Partial selection, the need for sortedness :issue:`2995`
Levels
******
@@ -405,7 +403,7 @@ Levels
`__
`Flatten Hierarchical columns
-`__
+`__
.. _cookbook.missing_data:
@@ -556,7 +554,7 @@ Unlike agg, apply's callable is passed a sub-DataFrame which gives you access to
ts
`Create a value counts column and reassign back to the DataFrame
-`__
+`__
.. ipython:: python
@@ -663,7 +661,7 @@ Pivot
The :ref:`Pivot ` docs.
`Partial sums and subtotals
-`__
+`__
.. ipython:: python
@@ -870,7 +868,7 @@ Timeseries
`__
`Constructing a datetime range that excludes weekends and includes only certain times
-`__
+`__
`Vectorized Lookup
`__
@@ -910,8 +908,7 @@ Valid frequency arguments to Grouper :ref:`Timeseries `__
-`Using TimeGrouper and another grouping to create subgroups, then apply a custom function
-`__
+Using TimeGrouper and another grouping to create subgroups, then apply a custom function :issue:`3791`
`Resampling with custom periods
`__
@@ -929,9 +926,9 @@ Valid frequency arguments to Grouper :ref:`Timeseries ` docs. The :ref:`Join ` docs.
+The :ref:`Join ` docs.
-`Append two dataframes with overlapping index (emulate R rbind)
+`Concatenate two dataframes with overlapping index (emulate R rbind)
`__
.. ipython:: python
@@ -944,11 +941,10 @@ Depending on df construction, ``ignore_index`` may be needed
.. ipython:: python
- df = df1.append(df2, ignore_index=True)
+ df = pd.concat([df1, df2], ignore_index=True)
df
-`Self Join of a DataFrame
-`__
+Self Join of a DataFrame :issue:`2996`
.. ipython:: python
@@ -1038,7 +1034,7 @@ Data in/out
-----------
`Performance comparison of SQL vs HDF5
-`__
+`__
.. _cookbook.csv:
@@ -1070,14 +1066,7 @@ using that handle to read.
`Inferring dtypes from a file
`__
-`Dealing with bad lines
-`__
-
-`Dealing with bad lines II
-`__
-
-`Reading CSV with Unix timestamps and converting to local timezone
-`__
+Dealing with bad lines :issue:`2886`
`Write a multi-row index CSV without writing duplicates
`__
@@ -1211,6 +1200,8 @@ The :ref:`Excel ` docs
`Modifying formatting in XlsxWriter output
`__
+Loading only visible sheets :issue:`19842#issuecomment-892150745`
+
.. _cookbook.html:
HTML
@@ -1229,8 +1220,7 @@ The :ref:`HDFStores ` docs
`Simple queries with a Timestamp Index
`__
-`Managing heterogeneous data using a linked multiple table hierarchy
-`__
+Managing heterogeneous data using a linked multiple table hierarchy :issue:`3032`
`Merging on-disk tables with millions of rows
`__
@@ -1250,7 +1240,7 @@ csv file and creating a store by chunks, with date parsing as well.
`__
`Large Data work flows
-`__
+`__
`Reading in a sequence of files, then providing a global unique index to a store while appending
`__
@@ -1300,7 +1290,7 @@ is closed.
.. ipython:: python
- store = pd.HDFStore("test.h5", "w", diver="H5FD_CORE")
+ store = pd.HDFStore("test.h5", "w", driver="H5FD_CORE")
df = pd.DataFrame(np.random.randn(8, 3))
store["test"] = df
@@ -1381,7 +1371,7 @@ Computation
-----------
`Numerical integration (sample-based) of a time series
-`__
+`__
Correlation
***********
diff --git a/doc/source/user_guide/duplicates.rst b/doc/source/user_guide/duplicates.rst
index 7cda067fb24ad..36c2ec53d58b4 100644
--- a/doc/source/user_guide/duplicates.rst
+++ b/doc/source/user_guide/duplicates.rst
@@ -28,6 +28,7 @@ duplicates present. The output can't be determined, and so pandas raises.
.. ipython:: python
:okexcept:
+ :okwarning:
s1 = pd.Series([0, 1, 2], index=["a", "b", "b"])
s1.reindex(["a", "b", "c"])
diff --git a/doc/source/user_guide/enhancingperf.rst b/doc/source/user_guide/enhancingperf.rst
index aa9a1ba6d6bf0..eef41eb4be80e 100644
--- a/doc/source/user_guide/enhancingperf.rst
+++ b/doc/source/user_guide/enhancingperf.rst
@@ -35,7 +35,7 @@ by trying to remove for-loops and making use of NumPy vectorization. It's always
optimising in Python first.
This tutorial walks through a "typical" process of cythonizing a slow computation.
-We use an `example from the Cython documentation `__
+We use an `example from the Cython documentation `__
but in the context of pandas. Our final cythonized solution is around 100 times
faster than the pure Python solution.
@@ -302,28 +302,63 @@ For more about ``boundscheck`` and ``wraparound``, see the Cython docs on
.. _enhancingperf.numba:
-Using Numba
------------
+Numba (JIT compilation)
+-----------------------
-A recent alternative to statically compiling Cython code, is to use a *dynamic jit-compiler*, Numba.
+An alternative to statically compiling Cython code is to use a dynamic just-in-time (JIT) compiler with `Numba `__.
-Numba gives you the power to speed up your applications with high performance functions written directly in Python. With a few annotations, array-oriented and math-heavy Python code can be just-in-time compiled to native machine instructions, similar in performance to C, C++ and Fortran, without having to switch languages or Python interpreters.
+Numba allows you to write a pure Python function which can be JIT compiled to native machine instructions, similar in performance to C, C++ and Fortran,
+by decorating your function with ``@jit``.
-Numba works by generating optimized machine code using the LLVM compiler infrastructure at import time, runtime, or statically (using the included pycc tool). Numba supports compilation of Python to run on either CPU or GPU hardware, and is designed to integrate with the Python scientific software stack.
+Numba works by generating optimized machine code using the LLVM compiler infrastructure at import time, runtime, or statically (using the included pycc tool).
+Numba supports compilation of Python to run on either CPU or GPU hardware and is designed to integrate with the Python scientific software stack.
.. note::
- You will need to install Numba. This is easy with ``conda``, by using: ``conda install numba``, see :ref:`installing using miniconda`.
+ The ``@jit`` compilation will add overhead to the runtime of the function, so performance benefits may not be realized, especially when using small data sets.
+ Consider `caching `__ your function to avoid compilation overhead each time your function is run.
-.. note::
+Numba can be used in two ways with pandas:
+
+#. Specify the ``engine="numba"`` keyword in select pandas methods
+#. Define your own Python function decorated with ``@jit`` and pass the underlying NumPy array of :class:`Series` or :class:`DataFrame` (using ``to_numpy()``) into the function
+
+pandas Numba Engine
+~~~~~~~~~~~~~~~~~~~
+
+If Numba is installed, one can specify ``engine="numba"`` in select pandas methods to execute the method using Numba.
+Methods that support ``engine="numba"`` will also have an ``engine_kwargs`` keyword that accepts a dictionary that allows one to specify
+``"nogil"``, ``"nopython"`` and ``"parallel"`` keys with boolean values to pass into the ``@jit`` decorator.
+If ``engine_kwargs`` is not specified, it defaults to ``{"nogil": False, "nopython": True, "parallel": False}``.
+
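+A minimal sketch of passing ``engine_kwargs`` (the applied function and data here are
+invented for illustration):
+
+.. code-block:: python
+
+   import numpy as np
+   import pandas as pd
+
+   data = pd.Series(range(1_000))
+
+   def f(x):
+       return np.sum(x) + 5
+
+   # These keys map directly onto the options of the @jit decorator
+   data.rolling(10).apply(
+       f,
+       engine="numba",
+       engine_kwargs={"nopython": True, "nogil": False, "parallel": False},
+       raw=True,
+   )
+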
+In terms of performance, **the first time a function is run using the Numba engine will be slow**
+as Numba will have some function compilation overhead. However, the JIT compiled functions are cached,
+and subsequent calls will be fast. In general, the Numba engine is performant with
+a larger amount of data points (e.g. 1+ million).
- As of Numba version 0.20, pandas objects cannot be passed directly to Numba-compiled functions. Instead, one must pass the NumPy array underlying the pandas object to the Numba-compiled function as demonstrated below.
+.. code-block:: ipython
+
+ In [1]: data = pd.Series(range(1_000_000)) # noqa: E225
+
+ In [2]: roll = data.rolling(10)
-Jit
-~~~
+ In [3]: def f(x):
+ ...: return np.sum(x) + 5
+ # Run the first time, compilation time will affect performance
+ In [4]: %timeit -r 1 -n 1 roll.apply(f, engine='numba', raw=True)
+ 1.23 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
+ # Function is cached and performance will improve
+ In [5]: %timeit roll.apply(f, engine='numba', raw=True)
+ 188 ms ± 1.93 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
-We demonstrate how to use Numba to just-in-time compile our code. We simply
-take the plain Python code from above and annotate with the ``@jit`` decorator.
+ In [6]: %timeit roll.apply(f, engine='cython', raw=True)
+ 3.92 s ± 59 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
+
+Custom Function Examples
+~~~~~~~~~~~~~~~~~~~~~~~~
+
+A custom Python function decorated with ``@jit`` can be used with pandas objects by passing their NumPy array
+representations with ``to_numpy()``.
.. code-block:: python
@@ -360,8 +395,6 @@ take the plain Python code from above and annotate with the ``@jit`` decorator.
)
return pd.Series(result, index=df.index, name="result")
-Note that we directly pass NumPy arrays to the Numba function. ``compute_numba`` is just a wrapper that provides a
-nicer interface by passing/returning pandas objects.
.. code-block:: ipython
@@ -370,19 +403,9 @@ nicer interface by passing/returning pandas objects.
In this example, using Numba was faster than Cython.
-Numba as an argument
-~~~~~~~~~~~~~~~~~~~~
-
-Additionally, we can leverage the power of `Numba `__
-by calling it as an argument in :meth:`~Rolling.apply`. See :ref:`Computation tools
-` for an extensive example.
-
-Vectorize
-~~~~~~~~~
-
Numba can also be used to write vectorized functions that do not require the user to explicitly
loop over the observations of a vector; a vectorized function will be applied to each row automatically.
-Consider the following toy example of doubling each observation:
+Consider the following example of doubling each observation:
.. code-block:: python
@@ -414,25 +437,23 @@ Consider the following toy example of doubling each observation:
Caveats
~~~~~~~
-.. note::
-
- Numba will execute on any function, but can only accelerate certain classes of functions.
-
Numba is best at accelerating functions that apply numerical functions to NumPy
-arrays. When passed a function that only uses operations it knows how to
-accelerate, it will execute in ``nopython`` mode.
-
-If Numba is passed a function that includes something it doesn't know how to
-work with -- a category that currently includes sets, lists, dictionaries, or
-string functions -- it will revert to ``object mode``. In ``object mode``,
-Numba will execute but your code will not speed up significantly. If you would
+arrays. If you try to ``@jit`` a function that contains unsupported `Python `__
+or `NumPy `__
+code, compilation will revert to `object mode `__, which
+will most likely not speed up your function. If you would
prefer that Numba throw an error if it cannot compile a function in a way that
speeds up your code, pass Numba the argument
-``nopython=True`` (e.g. ``@numba.jit(nopython=True)``). For more on
+``nopython=True`` (e.g. ``@jit(nopython=True)``). For more on
troubleshooting Numba modes, see the `Numba troubleshooting page
`__.
-Read more in the `Numba docs `__.
+Using ``parallel=True`` (e.g. ``@jit(parallel=True)``) may result in a ``SIGABRT`` if the threading layer leads to unsafe
+behavior. You can first `specify a safe threading layer `__
+before running a JIT function with ``parallel=True``.
+
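+A sketch of selecting a threading layer before running a parallel JIT function
+(``config.THREADING_LAYER`` is the knob documented by Numba; the ``"safe"`` layer
+additionally requires TBB to be installed):
+
+.. code-block:: python
+
+   import numpy as np
+   from numba import config, njit
+
+   # Choose a thread-safe layer before any parallel function is compiled
+   config.THREADING_LAYER = "safe"
+
+   @njit(parallel=True)
+   def double(values):
+       return values * 2
+
+   double(np.arange(10.0))
+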
+Generally, if you encounter a segfault (``SIGSEGV``) while using Numba, please report the issue
+to the `Numba issue tracker `__.
.. _enhancingperf.eval:
diff --git a/doc/source/user_guide/gotchas.rst b/doc/source/user_guide/gotchas.rst
index 1de978b195382..bf764316df373 100644
--- a/doc/source/user_guide/gotchas.rst
+++ b/doc/source/user_guide/gotchas.rst
@@ -341,7 +341,7 @@ Why not make NumPy like R?
Many people have suggested that NumPy should simply emulate the ``NA`` support
present in the more domain-specific statistical programming language `R
-`__. Part of the reason is the NumPy type hierarchy:
+`__. Part of the reason is the NumPy type hierarchy:
.. csv-table::
:header: "Typeclass","Dtypes"
diff --git a/doc/source/user_guide/groupby.rst b/doc/source/user_guide/groupby.rst
index 870ec6763c72f..0fb59c50efa74 100644
--- a/doc/source/user_guide/groupby.rst
+++ b/doc/source/user_guide/groupby.rst
@@ -391,7 +391,6 @@ something different for each of the columns. Thus, using ``[]`` similar to
getting a column from a DataFrame, you can do:
.. ipython:: python
- :suppress:
df = pd.DataFrame(
{
@@ -402,7 +401,7 @@ getting a column from a DataFrame, you can do:
}
)
-.. ipython:: python
+ df
grouped = df.groupby(["A"])
grouped_C = grouped["C"]
@@ -579,7 +578,7 @@ column, which produces an aggregated result with a hierarchical index:
.. ipython:: python
- grouped.agg([np.sum, np.mean, np.std])
+ grouped[["C", "D"]].agg([np.sum, np.mean, np.std])
The resulting aggregations are named for the functions themselves. If you
@@ -598,7 +597,7 @@ For a grouped ``DataFrame``, you can rename in a similar manner:
.. ipython:: python
(
- grouped.agg([np.sum, np.mean, np.std]).rename(
+ grouped[["C", "D"]].agg([np.sum, np.mean, np.std]).rename(
columns={"sum": "foo", "mean": "bar", "std": "baz"}
)
)
@@ -1106,11 +1105,9 @@ Numba Accelerated Routines
.. versionadded:: 1.1
If `Numba `__ is installed as an optional dependency, the ``transform`` and
-``aggregate`` methods support ``engine='numba'`` and ``engine_kwargs`` arguments. The ``engine_kwargs``
-argument is a dictionary of keyword arguments that will be passed into the
-`numba.jit decorator `__.
-These keyword arguments will be applied to the passed function. Currently only ``nogil``, ``nopython``,
-and ``parallel`` are supported, and their default values are set to ``False``, ``True`` and ``False`` respectively.
+``aggregate`` methods support ``engine='numba'`` and ``engine_kwargs`` arguments.
+See :ref:`enhancing performance with Numba ` for general usage of the arguments
+and performance considerations.
The function signature must start with ``values, index`` **exactly** as the data belonging to each group
will be passed into ``values``, and the group index will be passed into ``index``.
@@ -1121,52 +1118,6 @@ will be passed into ``values``, and the group index will be passed into ``index`
data and group index will be passed as NumPy arrays to the JITed user defined function, and no
alternative execution attempts will be tried.
-.. note::
-
- In terms of performance, **the first time a function is run using the Numba engine will be slow**
- as Numba will have some function compilation overhead. However, the compiled functions are cached,
- and subsequent calls will be fast. In general, the Numba engine is performant with
- a larger amount of data points (e.g. 1+ million).
-
-.. code-block:: ipython
-
- In [1]: N = 10 ** 3
-
- In [2]: data = {0: [str(i) for i in range(100)] * N, 1: list(range(100)) * N}
-
- In [3]: df = pd.DataFrame(data, columns=[0, 1])
-
- In [4]: def f_numba(values, index):
- ...: total = 0
- ...: for i, value in enumerate(values):
- ...: if i % 2:
- ...: total += value + 5
- ...: else:
- ...: total += value * 2
- ...: return total
- ...:
-
- In [5]: def f_cython(values):
- ...: total = 0
- ...: for i, value in enumerate(values):
- ...: if i % 2:
- ...: total += value + 5
- ...: else:
- ...: total += value * 2
- ...: return total
- ...:
-
- In [6]: groupby = df.groupby(0)
- # Run the first time, compilation time will affect performance
- In [7]: %timeit -r 1 -n 1 groupby.aggregate(f_numba, engine='numba') # noqa: E225
- 2.14 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
- # Function is cached and performance will improve
- In [8]: %timeit groupby.aggregate(f_numba, engine='numba')
- 4.93 ms ± 32.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
-
- In [9]: %timeit groupby.aggregate(f_cython, engine='cython')
- 18.6 ms ± 84.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
-
Other useful features
---------------------
diff --git a/doc/source/user_guide/indexing.rst b/doc/source/user_guide/indexing.rst
index dc66303a44f53..e41f938170417 100644
--- a/doc/source/user_guide/indexing.rst
+++ b/doc/source/user_guide/indexing.rst
@@ -701,7 +701,7 @@ Having a duplicated index will raise for a ``.reindex()``:
.. code-block:: ipython
In [17]: s.reindex(labels)
- ValueError: cannot reindex from a duplicate axis
+ ValueError: cannot reindex on an axis with duplicate labels
Generally, you can intersect the desired labels with the current
axis, and then reindex.
@@ -717,7 +717,7 @@ However, this would *still* raise if your resulting index is duplicated.
In [41]: labels = ['a', 'd']
In [42]: s.loc[s.index.intersection(labels)].reindex(labels)
- ValueError: cannot reindex from a duplicate axis
+ ValueError: cannot reindex on an axis with duplicate labels
.. _indexing.basics.partial_setting:
@@ -997,6 +997,15 @@ a list of items you want to check for.
df.isin(values)
+To return the DataFrame of booleans showing where the values are *not* contained in
+the passed sequence or dict, use the ``~`` operator:
+
+.. ipython:: python
+
+ values = {'ids': ['a', 'b'], 'vals': [1, 3]}
+
+ ~df.isin(values)
+
Combine DataFrame's ``isin`` with the ``any()`` and ``all()`` methods to
quickly select subsets of your data that meet a given criteria.
To select a row where each column meets its own criterion:
@@ -1523,8 +1532,8 @@ Looking up values by index/column labels
----------------------------------------
Sometimes you want to extract a set of values given a sequence of row labels
-and column labels, this can be achieved by ``DataFrame.melt`` combined by filtering the corresponding
-rows with ``DataFrame.loc``. For instance:
+and column labels, this can be achieved by ``pandas.factorize`` and NumPy indexing.
+For instance:
.. ipython:: python
@@ -1532,9 +1541,8 @@ rows with ``DataFrame.loc``. For instance:
'A': [80, 23, np.nan, 22],
'B': [80, 55, 76, 67]})
df
- melt = df.melt('col')
- melt = melt.loc[melt['col'] == melt['variable'], 'value']
- melt.reset_index(drop=True)
+ idx, cols = pd.factorize(df['col'])
+ df.reindex(cols, axis=1).to_numpy()[np.arange(len(df)), idx]
Formerly this could be achieved with the dedicated ``DataFrame.lookup`` method
which was deprecated in version 1.2.0.
diff --git a/doc/source/user_guide/io.rst b/doc/source/user_guide/io.rst
index c2b030d732ba9..be761bb97f320 100644
--- a/doc/source/user_guide/io.rst
+++ b/doc/source/user_guide/io.rst
@@ -26,7 +26,7 @@ The pandas I/O API is a set of top level ``reader`` functions accessed like
text;`XML `__;:ref:`read_xml`;:ref:`to_xml`
text; Local clipboard;:ref:`read_clipboard`;:ref:`to_clipboard`
binary;`MS Excel `__;:ref:`read_excel`;:ref:`to_excel`
- binary;`OpenDocument `__;:ref:`read_excel`;
+ binary;`OpenDocument `__;:ref:`read_excel`;
binary;`HDF5 Format `__;:ref:`read_hdf`;:ref:`to_hdf`
binary;`Feather Format `__;:ref:`read_feather`;:ref:`to_feather`
binary;`Parquet Format `__;:ref:`read_parquet`;:ref:`to_parquet`
@@ -102,7 +102,7 @@ header : int or list of ints, default ``'infer'``
names : array-like, default ``None``
List of column names to use. If file contains no header row, then you should
explicitly pass ``header=None``. Duplicates in this list are not allowed.
-index_col : int, str, sequence of int / str, or False, default ``None``
+index_col : int, str, sequence of int / str, or False, optional, default ``None``
Column(s) to use as the row labels of the ``DataFrame``, either given as
string name or column index. If a sequence of int / str is given, a
MultiIndex is used.
@@ -116,11 +116,19 @@ index_col : int, str, sequence of int / str, or False, default ``None``
of the data file, then a default index is used. If it is larger, then
the first columns are used as index so that the remaining number of fields in
the body are equal to the number of fields in the header.
+
+ The first row after the header is used to determine the number of columns
+ that will go into the index. If the subsequent rows contain fewer columns
+ than the first row, they are filled with ``NaN``.
+
+ This can be avoided through ``usecols``, which ensures that the columns are
+ taken as-is and the trailing data are ignored.
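+
+ A minimal sketch of this inference (data invented for illustration):
+
+ .. code-block:: python
+
+    from io import StringIO
+
+    import pandas as pd
+
+    data = "a,b\n1,2,3\n4,5,6"
+    # Three fields per row but only two header names: the first field
+    # of each row is inferred to be the index
+    pd.read_csv(StringIO(data))
+    # Per the note above, usecols keeps the columns as-is and ignores
+    # the trailing data
+    pd.read_csv(StringIO(data), usecols=["a", "b"])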
usecols : list-like or callable, default ``None``
Return a subset of the columns. If list-like, all elements must either
be positional (i.e. integer indices into the document columns) or strings
that correspond to column names provided either by the user in ``names`` or
- inferred from the document header row(s). For example, a valid list-like
+ inferred from the document header row(s). If ``names`` are given, the document
+ header row(s) are not taken into account. For example, a valid list-like
``usecols`` parameter would be ``[0, 1, 2]`` or ``['foo', 'bar', 'baz']``.
Element order is ignored, so ``usecols=[0, 1]`` is the same as ``[1, 0]``. To
@@ -142,11 +150,29 @@ usecols : list-like or callable, default ``None``
pd.read_csv(StringIO(data))
pd.read_csv(StringIO(data), usecols=lambda x: x.upper() in ["COL1", "COL3"])
- Using this parameter results in much faster parsing time and lower memory usage.
+ Using this parameter results in much faster parsing time and lower memory usage
+ when using the C engine. The Python engine loads the data first before deciding
+ which columns to drop.
squeeze : boolean, default ``False``
If the parsed data only contains one column then return a ``Series``.
+
+ .. deprecated:: 1.4.0
+ Append ``.squeeze("columns")`` to the call to ``{func_name}`` to squeeze
+ the data.
prefix : str, default ``None``
Prefix to add to column numbers when no header, e.g. 'X' for X0, X1, ...
+
+ .. deprecated:: 1.4.0
+ Use a list comprehension on the DataFrame's columns after calling ``read_csv``.
+
+ .. ipython:: python
+
+ data = "col1,col2,col3\na,b,1"
+
+ df = pd.read_csv(StringIO(data))
+ df.columns = [f"pre_{col}" for col in df.columns]
+ df
+
mangle_dupe_cols : boolean, default ``True``
Duplicate columns will be specified as 'X', 'X.1'...'X.N', rather than 'X'...'X'.
Passing in ``False`` will cause data to be overwritten if there are duplicate
@@ -160,9 +186,15 @@ dtype : Type name or dict of column -> type, default ``None``
(unsupported with ``engine='python'``). Use ``str`` or ``object`` together
with suitable ``na_values`` settings to preserve and
not interpret dtype.
-engine : {``'c'``, ``'python'``}
- Parser engine to use. The C engine is faster while the Python engine is
- currently more feature-complete.
+engine : {``'c'``, ``'python'``, ``'pyarrow'``}
+ Parser engine to use. The C and pyarrow engines are faster, while the python engine
+ is currently more feature-complete. Multithreading is currently only supported by
+ the pyarrow engine.
+
+ .. versionadded:: 1.4.0
+
+ The "pyarrow" engine was added as an *experimental* engine, and some features
+ are unsupported, or may not work correctly, with this engine.
converters : dict, default ``None``
Dict of functions for converting values in certain columns. Keys can either be
integers or column labels.
@@ -284,14 +316,14 @@ chunksize : int, default ``None``
Quoting, compression, and file format
+++++++++++++++++++++++++++++++++++++
-compression : {``'infer'``, ``'gzip'``, ``'bz2'``, ``'zip'``, ``'xz'``, ``None``, ``dict``}, default ``'infer'``
+compression : {``'infer'``, ``'gzip'``, ``'bz2'``, ``'zip'``, ``'xz'``, ``'zstd'``, ``None``, ``dict``}, default ``'infer'``
For on-the-fly decompression of on-disk data. If 'infer', then use gzip,
- bz2, zip, or xz if ``filepath_or_buffer`` is path-like ending in '.gz', '.bz2',
- '.zip', or '.xz', respectively, and no decompression otherwise. If using 'zip',
+ bz2, zip, xz, or zstandard if ``filepath_or_buffer`` is path-like ending in '.gz', '.bz2',
+ '.zip', '.xz', '.zst', respectively, and no decompression otherwise. If using 'zip',
the ZIP file must contain only one data file to be read in.
Set to ``None`` for no decompression. Can also be a dict with key ``'method'``
- set to one of {``'zip'``, ``'gzip'``, ``'bz2'``} and other key-value pairs are
- forwarded to ``zipfile.ZipFile``, ``gzip.GzipFile``, or ``bz2.BZ2File``.
+ set to one of {``'zip'``, ``'gzip'``, ``'bz2'``, ``'zstd'``} and other key-value pairs are
+ forwarded to ``zipfile.ZipFile``, ``gzip.GzipFile``, ``bz2.BZ2File``, or ``zstandard.ZstdDecompressor``.
As an example, the following could be passed for faster compression and to
create a reproducible gzip archive:
``compression={'method': 'gzip', 'compresslevel': 1, 'mtime': 1}``.
@@ -342,7 +374,7 @@ dialect : str or :class:`python:csv.Dialect` instance, default ``None``
Error handling
++++++++++++++
-error_bad_lines : boolean, default ``None``
+error_bad_lines : boolean, optional, default ``None``
Lines with too many fields (e.g. a csv line with too many commas) will by
default cause an exception to be raised, and no ``DataFrame`` will be
returned. If ``False``, then these "bad lines" will dropped from the
@@ -352,7 +384,7 @@ error_bad_lines : boolean, default ``None``
.. deprecated:: 1.3.0
The ``on_bad_lines`` parameter should be used instead to specify behavior upon
encountering a bad line.
-warn_bad_lines : boolean, default ``None``
+warn_bad_lines : boolean, optional, default ``None``
If error_bad_lines is ``False``, and warn_bad_lines is ``True``, a warning for
each "bad line" will be output.
@@ -1202,6 +1234,10 @@ Returning Series
Using the ``squeeze`` keyword, the parser will return output with a single column
as a ``Series``:
+.. deprecated:: 1.4.0
+ Users should append ``.squeeze("columns")`` to the DataFrame returned by
+ ``read_csv`` instead.
+
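+A minimal sketch of the replacement pattern (one-column data invented for illustration):
+
+.. code-block:: python
+
+   from io import StringIO
+
+   import pandas as pd
+
+   data = "a\n1\n2\n3"
+   # Equivalent to the deprecated read_csv(..., squeeze=True)
+   ser = pd.read_csv(StringIO(data)).squeeze("columns")
+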
.. ipython:: python
:suppress:
@@ -1211,6 +1247,7 @@ as a ``Series``:
fh.write(data)
.. ipython:: python
+ :okwarning:
print(open("tmp.csv").read())
@@ -1268,19 +1305,57 @@ You can elect to skip bad lines:
0 1 2 3
1 8 9 10
+Or pass a callable function to handle the bad line if ``engine="python"``.
+The bad line will be a list of strings, split by the ``sep``:
+
+.. code-block:: ipython
+
+ In [29]: external_list = []
+
+ In [30]: def bad_lines_func(line):
+ ...: external_list.append(line)
+ ...: return line[-3:]
+
+ In [31]: pd.read_csv(StringIO(data), on_bad_lines=bad_lines_func, engine="python")
+ Out[31]:
+ a b c
+ 0 1 2 3
+ 1 5 6 7
+ 2 8 9 10
+
+ In [32]: external_list
+ Out[32]: [['4', '5', '6', '7']]
+
+ .. versionadded:: 1.4.0
+
+
You can also use the ``usecols`` parameter to eliminate extraneous column
data that appear in some lines but not others:
.. code-block:: ipython
- In [30]: pd.read_csv(StringIO(data), usecols=[0, 1, 2])
+ In [33]: pd.read_csv(StringIO(data), usecols=[0, 1, 2])
- Out[30]:
+ Out[33]:
a b c
0 1 2 3
1 4 5 6
2 8 9 10
+If you want to keep all data, including the lines with too many fields, you can
+specify a sufficient number of ``names``. This ensures that lines without enough
+fields are filled with ``NaN``:
+
+.. code-block:: ipython
+
+ In [34]: pd.read_csv(StringIO(data), names=['a', 'b', 'c', 'd'])
+
+ Out[34]:
+ a b c d
+ 0 1 2 3 NaN
+ 1 4 5 6 7
+ 2 8 9 10 NaN
+
.. _io.dialect:
Dialect
@@ -1622,11 +1697,17 @@ Specifying ``iterator=True`` will also return the ``TextFileReader`` object:
Specifying the parser engine
''''''''''''''''''''''''''''
-Under the hood pandas uses a fast and efficient parser implemented in C as well
-as a Python implementation which is currently more feature-complete. Where
-possible pandas uses the C parser (specified as ``engine='c'``), but may fall
-back to Python if C-unsupported options are specified. Currently, C-unsupported
-options include:
+pandas currently supports three engines: the C engine, the python engine, and an experimental
+pyarrow engine (requires the ``pyarrow`` package). In general, the pyarrow engine is fastest
+on larger workloads and is equivalent in speed to the C engine on most other workloads.
+The python engine tends to be slower than the pyarrow and C engines on most workloads. However,
+the pyarrow engine is much less robust than the C engine, which in turn lacks a few features
+compared to the python engine.
+
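+A hypothetical invocation selecting the experimental engine (data invented for
+illustration; requires the ``pyarrow`` package):
+
+.. code-block:: python
+
+   from io import StringIO
+
+   import pandas as pd
+
+   data = "a,b\n1,2\n3,4"
+   # Added in pandas 1.4.0; options the engine does not support raise a ValueError
+   pd.read_csv(StringIO(data), engine="pyarrow")
+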
+Where possible, pandas uses the C parser (specified as ``engine='c'``), but it may fall
+back to Python if C-unsupported options are specified.
+
+Currently, options unsupported by the C and pyarrow engines include:
* ``sep`` other than a single character (e.g. regex separators)
* ``skipfooter``
@@ -1635,6 +1716,32 @@ options include:
Specifying any of the above options will produce a ``ParserWarning`` unless the
python engine is selected explicitly using ``engine='python'``.
+Options that are unsupported by the pyarrow engine which are not covered by the list above include:
+
+* ``float_precision``
+* ``chunksize``
+* ``comment``
+* ``nrows``
+* ``thousands``
+* ``memory_map``
+* ``dialect``
+* ``warn_bad_lines``
+* ``error_bad_lines``
+* ``on_bad_lines``
+* ``delim_whitespace``
+* ``quoting``
+* ``lineterminator``
+* ``converters``
+* ``decimal``
+* ``iterator``
+* ``dayfirst``
+* ``infer_datetime_format``
+* ``verbose``
+* ``skipinitialspace``
+* ``low_memory``
+
+Specifying these options with ``engine='pyarrow'`` will raise a ``ValueError``.
+
.. _io.remote:
Reading/writing remote files
@@ -1820,6 +1927,7 @@ with optional parameters:
``index``; dict like {index -> {column -> value}}
``columns``; dict like {column -> {index -> value}}
``values``; just the values array
+ ``table``; adhering to the JSON `Table Schema`_
* ``date_format`` : string, type of date conversion, 'epoch' for timestamp, 'iso' for ISO8601.
* ``double_precision`` : The number of decimal places to use when encoding floating point values, default 10.
@@ -2394,7 +2502,6 @@ A few notes on the generated table schema:
* For ``MultiIndex``, ``mi.names`` is used. If any level has no name,
then ``level_`` is used.
-
``read_json`` also accepts ``orient='table'`` as an argument. This allows for
the preservation of metadata such as dtypes and index names in a
round-trippable manner.
@@ -2436,8 +2543,18 @@ indicate missing values and the subsequent read cannot distinguish the intent.
os.remove("test.json")
+When using ``orient='table'`` along with a user-defined ``ExtensionArray``,
+the generated schema will contain an additional ``extDtype`` key in the respective
+``fields`` element. This extra key is not standard but does enable JSON roundtrips
+for extension types (e.g. ``read_json(df.to_json(orient="table"), orient="table")``).
+
+The ``extDtype`` key carries the name of the extension. If you have properly registered
+the ``ExtensionDtype``, pandas will use that name to look up the dtype in the registry
+and re-convert the serialized data into your custom dtype.
+
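+A minimal sketch of such a roundtrip, using the nullable integer dtype that ships
+with pandas (any properly registered ``ExtensionDtype`` behaves the same way):
+
+.. code-block:: python
+
+   import pandas as pd
+
+   df = pd.DataFrame({"a": pd.array([1, 2, None], dtype="Int64")})
+   # "Int64" is recorded under the non-standard "extDtype" key in the schema
+   payload = df.to_json(orient="table")
+   # On read, the name is looked up in the registry and the dtype restored
+   pd.read_json(payload, orient="table").dtypes
+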
.. _Table Schema: https://specs.frictionlessdata.io/table-schema/
+
HTML
----
@@ -2464,14 +2581,16 @@ Read a URL with no options:
.. ipython:: python
- url = (
- "https://raw.githubusercontent.com/pandas-dev/pandas/master/"
- "pandas/tests/io/data/html/spam.html"
- )
+ url = "https://www.fdic.gov/resources/resolutions/bank-failures/failed-bank-list"
dfs = pd.read_html(url)
dfs
-Read in the content of the "banklist.html" file and pass it to ``read_html``
+.. note::
+
+ The data from the above URL changes every Monday so the resulting data above
+ and the data below may be slightly different.
+
+Read in the content of the file from the above URL and pass it to ``read_html``
as a string:
.. ipython:: python
@@ -2503,7 +2622,7 @@ You can even pass in an instance of ``StringIO`` if you so desire:
that having so many network-accessing functions slows down the documentation
build. If you spot an error or an example that doesn't run, please do not
hesitate to report it over on `pandas GitHub issues page
- `__.
+ `__.
Read a URL and match a table that contains specific text:
@@ -2977,6 +3096,7 @@ Read in the content of the "books.xml" as instance of ``StringIO`` or
Even read XML from AWS S3 buckets such as Python Software Foundation's IRS 990 Form:
.. ipython:: python
+ :okwarning:
df = pd.read_xml(
"s3://irs-form-990/201923199349319487_public.xml",
@@ -3460,9 +3580,9 @@ with ``on_demand=True``.
Specifying sheets
+++++++++++++++++
-.. note :: The second argument is ``sheet_name``, not to be confused with ``ExcelFile.sheet_names``.
+.. note:: The second argument is ``sheet_name``, not to be confused with ``ExcelFile.sheet_names``.
-.. note :: An ExcelFile's attribute ``sheet_names`` provides access to a list of sheets.
+.. note:: An ExcelFile's attribute ``sheet_names`` provides access to a list of sheets.
* The arguments ``sheet_name`` allows specifying the sheet or sheets to read.
* The default value for ``sheet_name`` is 0, indicating to read the first sheet
@@ -3936,18 +4056,18 @@ Compressed pickle files
'''''''''''''''''''''''
:func:`read_pickle`, :meth:`DataFrame.to_pickle` and :meth:`Series.to_pickle` can read
-and write compressed pickle files. The compression types of ``gzip``, ``bz2``, ``xz`` are supported for reading and writing.
+and write compressed pickle files. The compression types of ``gzip``, ``bz2``, ``xz``, ``zstd`` are supported for reading and writing.
The ``zip`` file format only supports reading and must contain only one data file
to be read.
The compression type can be an explicit parameter or be inferred from the file extension.
-If 'infer', then use ``gzip``, ``bz2``, ``zip``, or ``xz`` if filename ends in ``'.gz'``, ``'.bz2'``, ``'.zip'``, or
-``'.xz'``, respectively.
+If 'infer', then use ``gzip``, ``bz2``, ``zip``, ``xz``, ``zstd`` if filename ends in ``'.gz'``, ``'.bz2'``, ``'.zip'``,
+``'.xz'``, or ``'.zst'``, respectively.
The compression parameter can also be a ``dict`` in order to pass options to the
compression protocol. It must have a ``'method'`` key set to the name
of the compression protocol, which must be one of
-{``'zip'``, ``'gzip'``, ``'bz2'``}. All other key-value pairs are passed to
+{``'zip'``, ``'gzip'``, ``'bz2'``, ``'xz'``, ``'zstd'``}. All other key-value pairs are passed to
the underlying compression library.
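+
+A minimal sketch of the ``dict`` form (the file name is invented for illustration):
+
+.. code-block:: python
+
+   import pandas as pd
+
+   df = pd.DataFrame({"a": range(3)})
+   # 'compresslevel' is forwarded to the underlying gzip library
+   df.to_pickle("data.pkl.gz", compression={"method": "gzip", "compresslevel": 1})
+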
.. ipython:: python
@@ -4872,7 +4992,7 @@ control compression: ``complevel`` and ``complib``.
rates but is somewhat slow.
- `lzo `_: Fast
compression and decompression.
- - `bzip2 `_: Good compression rates.
+ - `bzip2 `_: Good compression rates.
- `blosc `_: Fast compression and
decompression.
@@ -4881,10 +5001,10 @@ control compression: ``complevel`` and ``complib``.
- `blosc:blosclz `_ This is the
default compressor for ``blosc``
- `blosc:lz4
- `_:
+ `_:
A compact, very popular and fast compressor.
- `blosc:lz4hc
- `_:
+ `_:
A tweaked version of LZ4, produces better
compression ratios at the expense of speed.
- `blosc:snappy `_:
@@ -5226,15 +5346,6 @@ Several caveats:
See the `Full Documentation `__.
-.. ipython:: python
- :suppress:
-
- import warnings
-
- # This can be removed once building with pyarrow >=0.15.0
- warnings.filterwarnings("ignore", "The Sparse", FutureWarning)
-
-
.. ipython:: python
df = pd.DataFrame(
@@ -5477,7 +5588,7 @@ SQL queries
The :mod:`pandas.io.sql` module provides a collection of query wrappers to both
facilitate data retrieval and to reduce dependency on DB-specific API. Database abstraction
is provided by SQLAlchemy if installed. In addition you will need a driver library for
-your database. Examples of such drivers are `psycopg2 `__
+your database. Examples of such drivers are `psycopg2 `__
for PostgreSQL or `pymysql `__ for MySQL.
For `SQLite `__ this is
included in Python's standard library by default.
@@ -5509,7 +5620,7 @@ The key functions are:
the provided input (database table name or sql query).
Table names do not need to be quoted if they have special characters.
-In the following example, we use the `SQlite `__ SQL database
+In the following example, we use the `SQlite `__ SQL database
engine. You can use a temporary SQLite database where data are stored in
"memory".
@@ -5526,13 +5637,23 @@ below and the SQLAlchemy `documentation `__
+for an explanation of how the database connection is handled.
.. code-block:: python
with engine.connect() as conn, conn.begin():
data = pd.read_sql_table("data", conn)
+.. warning::
+
+ When you open a connection to a database you are also responsible for closing it.
+ Side effects of leaving a connection open may include locking the database or
+ other breaking behaviour.
+
Writing DataFrames
''''''''''''''''''
@@ -5663,7 +5784,7 @@ Possible values are:
specific backend dialect features.
Example of a callable using PostgreSQL `COPY clause
-`__::
+`__::
# Alternative to_sql() *method* for DBs that support COPY FROM
import csv
@@ -5689,7 +5810,7 @@ Example of a callable using PostgreSQL `COPY clause
writer.writerows(data_iter)
s_buf.seek(0)
- columns = ', '.join('"{}"'.format(k) for k in keys)
+ columns = ', '.join(['"{}"'.format(k) for k in keys])
if table.schema:
table_name = '{}.{}'.format(table.schema, table.name)
else:
@@ -5925,7 +6046,7 @@ pandas integrates with this external package. if ``pandas-gbq`` is installed, yo
use the pandas methods ``pd.read_gbq`` and ``DataFrame.to_gbq``, which will call the
respective functions from ``pandas-gbq``.
-Full documentation can be found `here `__.
+Full documentation can be found `here `__.
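
As a minimal sketch (``my-project`` is a placeholder project id; the optional
``pandas-gbq`` package and valid credentials are assumed):

.. code-block:: python

   df = pd.read_gbq("SELECT 1 AS x", project_id="my-project")
   df.to_gbq("my_dataset.my_table", project_id="my-project")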
.. _io.stata:
@@ -6133,7 +6254,7 @@ Obtain an iterator and read an XPORT file 100,000 lines at a time:
The specification_ for the xport file format is available from the SAS
web site.
-.. _specification: https://support.sas.com/techsup/technote/ts140.pdf
+.. _specification: https://support.sas.com/content/dam/SAS/support/en/technical-papers/record-layout-of-a-sas-version-5-or-6-data-set-in-sas-transport-xport-format.pdf
No official documentation is available for the SAS7BDAT format.
@@ -6175,7 +6296,7 @@ avoid converting categorical columns into ``pd.Categorical``:
More information about the SAV and ZSAV file formats is available here_.
-.. _here: https://www.ibm.com/support/knowledgecenter/en/SSLVMB_22.0.0/com.ibm.spss.statistics.help/spss/base/savedatatypes.htm
+.. _here: https://www.ibm.com/docs/en/spss-statistics/22.0.0
.. _io.other:
@@ -6193,7 +6314,7 @@ xarray_ provides data structures inspired by the pandas ``DataFrame`` for workin
with multi-dimensional datasets, with a focus on the netCDF file format and
easy conversion to and from pandas.
-.. _xarray: https://xarray.pydata.org/
+.. _xarray: https://xarray.pydata.org/en/stable/
.. _io.perf:
diff --git a/doc/source/user_guide/merging.rst b/doc/source/user_guide/merging.rst
index 09b3d3a8c96df..bbca5773afdfe 100644
--- a/doc/source/user_guide/merging.rst
+++ b/doc/source/user_guide/merging.rst
@@ -237,59 +237,6 @@ Similarly, we could index before the concatenation:
p.plot([df1, df4], result, labels=["df1", "df4"], vertical=False);
plt.close("all");
-.. _merging.concatenation:
-
-Concatenating using ``append``
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-A useful shortcut to :func:`~pandas.concat` are the :meth:`~DataFrame.append`
-instance methods on ``Series`` and ``DataFrame``. These methods actually predated
-``concat``. They concatenate along ``axis=0``, namely the index:
-
-.. ipython:: python
-
- result = df1.append(df2)
-
-.. ipython:: python
- :suppress:
-
- @savefig merging_append1.png
- p.plot([df1, df2], result, labels=["df1", "df2"], vertical=True);
- plt.close("all");
-
-In the case of ``DataFrame``, the indexes must be disjoint but the columns do not
-need to be:
-
-.. ipython:: python
-
- result = df1.append(df4, sort=False)
-
-.. ipython:: python
- :suppress:
-
- @savefig merging_append2.png
- p.plot([df1, df4], result, labels=["df1", "df4"], vertical=True);
- plt.close("all");
-
-``append`` may take multiple objects to concatenate:
-
-.. ipython:: python
-
- result = df1.append([df2, df3])
-
-.. ipython:: python
- :suppress:
-
- @savefig merging_append3.png
- p.plot([df1, df2, df3], result, labels=["df1", "df2", "df3"], vertical=True);
- plt.close("all");
-
-.. note::
-
- Unlike the :py:meth:`~list.append` method, which appends to the original list
- and returns ``None``, :meth:`~DataFrame.append` here **does not** modify
- ``df1`` and returns its copy with ``df2`` appended.
-
.. _merging.ignore_index:
Ignoring indexes on the concatenation axis
@@ -309,19 +256,6 @@ do this, use the ``ignore_index`` argument:
p.plot([df1, df4], result, labels=["df1", "df4"], vertical=True);
plt.close("all");
-This is also a valid argument to :meth:`DataFrame.append`:
-
-.. ipython:: python
-
- result = df1.append(df4, ignore_index=True, sort=False)
-
-.. ipython:: python
- :suppress:
-
- @savefig merging_append_ignore_index.png
- p.plot([df1, df4], result, labels=["df1", "df4"], vertical=True);
- plt.close("all");
-
.. _merging.mixed_ndims:
Concatenating with mixed ndims
@@ -473,14 +407,13 @@ like GroupBy where the order of a categorical variable is meaningful.
Appending rows to a DataFrame
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-While not especially efficient (since a new object must be created), you can
-append a single row to a ``DataFrame`` by passing a ``Series`` or dict to
-``append``, which returns a new ``DataFrame`` as above.
+If you have a ``Series`` that you want to append as a single row to a ``DataFrame``, you can convert the row into a
+``DataFrame`` and use ``concat``:
.. ipython:: python
s2 = pd.Series(["X0", "X1", "X2", "X3"], index=["A", "B", "C", "D"])
- result = df1.append(s2, ignore_index=True)
+ result = pd.concat([df1, s2.to_frame().T], ignore_index=True)
.. ipython:: python
:suppress:
@@ -493,20 +426,6 @@ You should use ``ignore_index`` with this method to instruct DataFrame to
discard its index. If you wish to preserve the index, you should construct an
appropriately-indexed DataFrame and append or concatenate those objects.
-You can also pass a list of dicts or Series:
-
-.. ipython:: python
-
- dicts = [{"A": 1, "B": 2, "C": 3, "X": 4}, {"A": 5, "B": 6, "C": 7, "Y": 8}]
- result = df1.append(dicts, ignore_index=True, sort=False)
-
-.. ipython:: python
- :suppress:
-
- @savefig merging_append_dits.png
- p.plot([df1, pd.DataFrame(dicts)], result, labels=["df1", "dicts"], vertical=True);
- plt.close("all");
-
.. _merging.join:
Database-style DataFrame or named Series joining/merging
@@ -562,7 +481,7 @@ all standard database join operations between ``DataFrame`` or named ``Series``
(hierarchical), the number of levels must match the number of join keys
from the right DataFrame or Series.
* ``right_index``: Same usage as ``left_index`` for the right DataFrame or Series
-* ``how``: One of ``'left'``, ``'right'``, ``'outer'``, ``'inner'``. Defaults
+* ``how``: One of ``'left'``, ``'right'``, ``'outer'``, ``'inner'``, ``'cross'``. Defaults
to ``inner``. See below for more detailed description of each method.
* ``sort``: Sort the result DataFrame by the join keys in lexicographical
order. Defaults to ``True``, setting to ``False`` will improve performance
@@ -707,6 +626,7 @@ either the left or right tables, the values in the joined table will be
``right``, ``RIGHT OUTER JOIN``, Use keys from right frame only
``outer``, ``FULL OUTER JOIN``, Use union of keys from both frames
``inner``, ``INNER JOIN``, Use intersection of keys from both frames
+ ``cross``, ``CROSS JOIN``, Create the cartesian product of rows of both frames
.. ipython:: python
@@ -751,6 +671,17 @@ either the left or right tables, the values in the joined table will be
p.plot([left, right], result, labels=["left", "right"], vertical=False);
plt.close("all");
+.. ipython:: python
+
+ result = pd.merge(left, right, how="cross")
+
+.. ipython:: python
+ :suppress:
+
+ @savefig merging_merge_cross.png
+ p.plot([left, right], result, labels=["left", "right"], vertical=False);
+ plt.close("all");
+
You can merge a multi-indexed Series and a DataFrame, if the names of
the MultiIndex correspond to the columns from the DataFrame. Transform
the Series to a DataFrame using :meth:`Series.reset_index` before merging,
diff --git a/doc/source/user_guide/missing_data.rst b/doc/source/user_guide/missing_data.rst
index 1621b37f31b23..3052ee3001681 100644
--- a/doc/source/user_guide/missing_data.rst
+++ b/doc/source/user_guide/missing_data.rst
@@ -470,7 +470,7 @@ at the new values.
interp_s = ser.reindex(new_index).interpolate(method="pchip")
interp_s[49:51]
-.. _scipy: https://www.scipy.org
+.. _scipy: https://scipy.org/
.. _documentation: https://docs.scipy.org/doc/scipy/reference/interpolate.html#univariate-interpolation
.. _guide: https://docs.scipy.org/doc/scipy/reference/tutorial/interpolate.html
@@ -580,7 +580,7 @@ String/regular expression replacement
backslashes than strings without this prefix. Backslashes in raw strings
will be interpreted as an escaped backslash, e.g., ``r'\' == '\\'``. You
should `read about them
- `__
+ `__
if this is unclear.
Replace the '.' with ``NaN`` (str -> str):
diff --git a/doc/source/user_guide/options.rst b/doc/source/user_guide/options.rst
index 62a347acdaa34..f6e98b68afdc9 100644
--- a/doc/source/user_guide/options.rst
+++ b/doc/source/user_guide/options.rst
@@ -31,18 +31,18 @@ namespace:
* :func:`~pandas.option_context` - execute a codeblock with a set of options
that revert to prior settings after execution.
-**Note:** Developers can check out `pandas/core/config_init.py `_ for more information.
+**Note:** Developers can check out `pandas/core/config_init.py `_ for more information.
All of the functions above accept a regexp pattern (``re.search`` style) as an argument,
and so passing in a substring will work - as long as it is unambiguous:
.. ipython:: python
- pd.get_option("display.max_rows")
- pd.set_option("display.max_rows", 101)
- pd.get_option("display.max_rows")
- pd.set_option("max_r", 102)
- pd.get_option("display.max_rows")
+ pd.get_option("display.chop_threshold")
+ pd.set_option("display.chop_threshold", 2)
+ pd.get_option("display.chop_threshold")
+ pd.set_option("chop", 4)
+ pd.get_option("display.chop_threshold")
The following will **not work** because it matches multiple option names, e.g.
@@ -52,7 +52,7 @@ The following will **not work** because it matches multiple option names, e.g.
:okexcept:
try:
- pd.get_option("column")
+ pd.get_option("max")
except KeyError as e:
print(e)
@@ -138,7 +138,7 @@ More information can be found in the `IPython documentation
import pandas as pd
pd.set_option("display.max_rows", 999)
- pd.set_option("precision", 5)
+ pd.set_option("display.precision", 5)
.. _options.frequently_used:
@@ -153,27 +153,27 @@ lines are replaced by an ellipsis.
.. ipython:: python
df = pd.DataFrame(np.random.randn(7, 2))
- pd.set_option("max_rows", 7)
+ pd.set_option("display.max_rows", 7)
df
- pd.set_option("max_rows", 5)
+ pd.set_option("display.max_rows", 5)
df
- pd.reset_option("max_rows")
+ pd.reset_option("display.max_rows")
Once the ``display.max_rows`` is exceeded, the ``display.min_rows`` option
determines how many rows are shown in the truncated repr.
.. ipython:: python
- pd.set_option("max_rows", 8)
- pd.set_option("min_rows", 4)
+ pd.set_option("display.max_rows", 8)
+ pd.set_option("display.min_rows", 4)
# below max_rows -> all rows shown
df = pd.DataFrame(np.random.randn(7, 2))
df
# above max_rows -> only min_rows (4) rows shown
df = pd.DataFrame(np.random.randn(9, 2))
df
- pd.reset_option("max_rows")
- pd.reset_option("min_rows")
+ pd.reset_option("display.max_rows")
+ pd.reset_option("display.min_rows")
``display.expand_frame_repr`` allows the representation of
dataframes to stretch across pages, wrapped over the full column vs row-wise.
@@ -193,13 +193,13 @@ dataframes to stretch across pages, wrapped over the full column vs row-wise.
.. ipython:: python
df = pd.DataFrame(np.random.randn(10, 10))
- pd.set_option("max_rows", 5)
+ pd.set_option("display.max_rows", 5)
pd.set_option("large_repr", "truncate")
df
pd.set_option("large_repr", "info")
df
pd.reset_option("large_repr")
- pd.reset_option("max_rows")
+ pd.reset_option("display.max_rows")
``display.max_colwidth`` sets the maximum width of columns. Cells
of this length or longer will be truncated with an ellipsis.
@@ -253,9 +253,9 @@ This is only a suggestion.
.. ipython:: python
df = pd.DataFrame(np.random.randn(5, 5))
- pd.set_option("precision", 7)
+ pd.set_option("display.precision", 7)
df
- pd.set_option("precision", 4)
+ pd.set_option("display.precision", 4)
df
``display.chop_threshold`` sets at what level pandas rounds to zero when
@@ -430,6 +430,10 @@ display.html.use_mathjax True When True, Jupyter notebook
table contents using MathJax, rendering
mathematical expressions enclosed by the
dollar symbol.
+display.max_dir_items 100 The number of columns from a dataframe that
+ are added to dir. These columns can then be
+ suggested by tab completion. 'None' value means
+ unlimited.
io.excel.xls.writer xlwt The default Excel writer engine for
'xls' files.
@@ -487,8 +491,32 @@ styler.sparse.index True "Sparsify" MultiIndex displ
elements in outer levels within groups).
styler.sparse.columns True "Sparsify" MultiIndex display for columns
in Styler output.
+styler.render.repr html Standard output format for Styler rendered in Jupyter Notebook.
+ Should be one of "html" or "latex".
styler.render.max_elements 262144 Maximum number of datapoints that Styler will render
trimming either rows, columns or both to fit.
+styler.render.max_rows None Maximum number of rows that Styler will render. By default
+ this is dynamic based on ``max_elements``.
+styler.render.max_columns None Maximum number of columns that Styler will render. By default
+ this is dynamic based on ``max_elements``.
+styler.render.encoding utf-8 Default encoding for output HTML or LaTeX files.
+styler.format.formatter None Object to specify formatting functions to ``Styler.format``.
+styler.format.na_rep None String representation for missing data.
+styler.format.precision 6 Precision to display floating point and complex numbers.
+styler.format.decimal . String representation for decimal point separator for floating
+ point and complex numbers.
+styler.format.thousands None String representation for thousands separator for
+ integers, and floating point and complex numbers.
+styler.format.escape None Whether to escape "html" or "latex" special
+ characters in the display representation.
+styler.html.mathjax True If set to False will render specific CSS classes to
+ table attributes that will prevent Mathjax from rendering
+ in Jupyter Notebook.
+styler.latex.multicol_align r Alignment of headers in a merged column due to sparsification. Can be in {"r", "c", "l"}.
+styler.latex.multirow_align c Alignment of index labels in a merged row due to sparsification. Can be in {"c", "t", "b"}.
+styler.latex.environment None If given will replace the default ``\\begin{table}`` environment. If "longtable" is specified
+ this will render with a specific "longtable" template with longtable features.
+styler.latex.hrules False If set to True will render ``\\toprule``, ``\\midrule``, and ``\\bottomrule`` by default.
======================================= ============ ==================================
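
As a quick sketch of how a few of the options above behave (the values chosen
here are illustrative only):

.. code-block:: python

   # limit how many DataFrame columns are suggested by tab completion
   pd.set_option("display.max_dir_items", 10)

   # the styler options are set and reset like any other option
   pd.set_option("styler.format.precision", 4)
   pd.reset_option("styler.format.precision")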
diff --git a/doc/source/user_guide/reshaping.rst b/doc/source/user_guide/reshaping.rst
index 7d1d03fe020a6..e74272c825e46 100644
--- a/doc/source/user_guide/reshaping.rst
+++ b/doc/source/user_guide/reshaping.rst
@@ -474,7 +474,15 @@ rows and columns:
.. ipython:: python
- df.pivot_table(index=["A", "B"], columns="C", margins=True, aggfunc=np.std)
+ table = df.pivot_table(index=["A", "B"], columns="C", margins=True, aggfunc=np.std)
+ table
+
+Additionally, you can call :meth:`DataFrame.stack` to display a pivoted DataFrame
+as having a multi-level index:
+
+.. ipython:: python
+
+ table.stack()
.. _reshaping.crosstabulations:
diff --git a/doc/source/user_guide/sparse.rst b/doc/source/user_guide/sparse.rst
index 52d99533c1f60..b2b3678e48534 100644
--- a/doc/source/user_guide/sparse.rst
+++ b/doc/source/user_guide/sparse.rst
@@ -294,7 +294,7 @@ To convert back to sparse SciPy matrix in COO format, you can use the :meth:`Dat
sdf.sparse.to_coo()
-meth:`Series.sparse.to_coo` is implemented for transforming a ``Series`` with sparse values indexed by a :class:`MultiIndex` to a :class:`scipy.sparse.coo_matrix`.
+:meth:`Series.sparse.to_coo` is implemented for transforming a ``Series`` with sparse values indexed by a :class:`MultiIndex` to a :class:`scipy.sparse.coo_matrix`.
The method requires a ``MultiIndex`` with two or more levels.
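
A minimal sketch (the data values here are illustrative):

.. code-block:: python

   s = pd.Series([3.0, np.nan, 1.0, 3.0, np.nan, np.nan])
   s.index = pd.MultiIndex.from_tuples(
       [
           (1, 2, "a", 0),
           (1, 2, "a", 1),
           (1, 1, "b", 0),
           (1, 1, "b", 1),
           (2, 1, "b", 0),
           (2, 1, "b", 1),
       ],
       names=["A", "B", "C", "D"],
   )
   ss = s.astype("Sparse")

   # rows are formed from levels A/B, columns from levels C/D
   A, rows, columns = ss.sparse.to_coo(
       row_levels=["A", "B"], column_levels=["C", "D"], sort_labels=True
   )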
diff --git a/doc/source/user_guide/style.ipynb b/doc/source/user_guide/style.ipynb
index 7d8d8e90dfbda..2dc40e67338b4 100644
--- a/doc/source/user_guide/style.ipynb
+++ b/doc/source/user_guide/style.ipynb
@@ -11,7 +11,7 @@
"\n",
"[styler]: ../reference/api/pandas.io.formats.style.Styler.rst\n",
"[viz]: visualization.rst\n",
- "[download]: https://nbviewer.ipython.org/github/pandas-dev/pandas/blob/master/doc/source/user_guide/style.ipynb"
+ "[download]: https://nbviewer.ipython.org/github/pandas-dev/pandas/blob/main/doc/source/user_guide/style.ipynb"
]
},
{
@@ -49,6 +49,7 @@
"source": [
"import pandas as pd\n",
"import numpy as np\n",
+ "import matplotlib as mpl\n",
"\n",
"df = pd.DataFrame([[38.0, 2.0, 18.0, 22.0, 21, np.nan],[19, 439, 6, 452, 226,232]], \n",
" index=pd.Index(['Tumour (Positive)', 'Non-Tumour (Negative)'], name='Actual Label:'), \n",
@@ -60,9 +61,9 @@
"cell_type": "markdown",
"metadata": {},
"source": [
- "The above output looks very similar to the standard DataFrame HTML representation. But the HTML here has already attached some CSS classes to each cell, even if we haven't yet created any styles. We can view these by calling the [.render()][render] method, which returns the raw HTML as string, which is useful for further processing or adding to a file - read on in [More about CSS and HTML](#More-About-CSS-and-HTML). Below we will show how we can use these to format the DataFrame to be more communicative. For example how we can build `s`:\n",
+ "The above output looks very similar to the standard DataFrame HTML representation. But the HTML here has already attached some CSS classes to each cell, even if we haven't yet created any styles. We can view these by calling the [.to_html()][tohtml] method, which returns the raw HTML as string, which is useful for further processing or adding to a file - read on in [More about CSS and HTML](#More-About-CSS-and-HTML). Below we will show how we can use these to format the DataFrame to be more communicative. For example how we can build `s`:\n",
"\n",
- "[render]: ../reference/api/pandas.io.formats.style.Styler.render.rst"
+ "[tohtml]: ../reference/api/pandas.io.formats.style.Styler.to_html.rst"
]
},
{
@@ -150,15 +151,14 @@
"\n",
"### Formatting Values\n",
"\n",
- "Before adding styles it is useful to show that the [Styler][styler] can distinguish the *display* value from the *actual* value. To control the display value, the text is printed in each cell, and we can use the [.format()][formatfunc] method to manipulate this according to a [format spec string][format] or a callable that takes a single value and returns a string. It is possible to define this for the whole table or for individual columns. \n",
+ "Before adding styles it is useful to show that the [Styler][styler] can distinguish the *display* value from the *actual* value, in both datavlaues and index or columns headers. To control the display value, the text is printed in each cell as string, and we can use the [.format()][formatfunc] and [.format_index()][formatfuncindex] methods to manipulate this according to a [format spec string][format] or a callable that takes a single value and returns a string. It is possible to define this for the whole table, or index, or for individual columns, or MultiIndex levels. \n",
"\n",
- "Additionally, the format function has a **precision** argument to specifically help formatting floats, an **na_rep** argument to display missing data, and an **escape** argument to help displaying safe-HTML. The default formatter is configured to adopt pandas' regular `display.precision` option, controllable using `with pd.option_context('display.precision', 2):`\n",
- "\n",
- "Here is an example of using the multiple options to control the formatting generally and with specific column formatters.\n",
+ "Additionally, the format function has a **precision** argument to specifically help formatting floats, as well as **decimal** and **thousands** separators to support other locales, an **na_rep** argument to display missing data, and an **escape** argument to help displaying safe-HTML or safe-LaTeX. The default formatter is configured to adopt pandas' `styler.format.precision` option, controllable using `with pd.option_context('format.precision', 2):` \n",
"\n",
"[styler]: ../reference/api/pandas.io.formats.style.Styler.rst\n",
"[format]: https://docs.python.org/3/library/string.html#format-specification-mini-language\n",
- "[formatfunc]: ../reference/api/pandas.io.formats.style.Styler.format.rst"
+ "[formatfunc]: ../reference/api/pandas.io.formats.style.Styler.format.rst\n",
+ "[formatfuncindex]: ../reference/api/pandas.io.formats.style.Styler.format_index.rst"
]
},
{
@@ -167,28 +167,72 @@
"metadata": {},
"outputs": [],
"source": [
- "df.style.format(precision=0, na_rep='MISSING', \n",
+ "df.style.format(precision=0, na_rep='MISSING', thousands=\" \",\n",
" formatter={('Decision Tree', 'Tumour'): \"{:.2f}\",\n",
- " ('Regression', 'Non-Tumour'): lambda x: \"$ {:,.1f}\".format(x*-1e3)\n",
+ " ('Regression', 'Non-Tumour'): lambda x: \"$ {:,.1f}\".format(x*-1e6)\n",
" })"
]
},
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Using Styler to manipulate the display is a useful feature because maintaining the indexing and datavalues for other purposes gives greater control. You do not have to overwrite your DataFrame to display it how you like. Here is an example of using the formatting functions whilst still relying on the underlying data for indexing and calculations."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "weather_df = pd.DataFrame(np.random.rand(10,2)*5, \n",
+ " index=pd.date_range(start=\"2021-01-01\", periods=10),\n",
+ " columns=[\"Tokyo\", \"Beijing\"])\n",
+ "\n",
+ "def rain_condition(v): \n",
+ " if v < 1.75:\n",
+ " return \"Dry\"\n",
+ " elif v < 2.75:\n",
+ " return \"Rain\"\n",
+ " return \"Heavy Rain\"\n",
+ "\n",
+ "def make_pretty(styler):\n",
+ " styler.set_caption(\"Weather Conditions\")\n",
+ " styler.format(rain_condition)\n",
+ " styler.format_index(lambda v: v.strftime(\"%A\"))\n",
+ " styler.background_gradient(axis=None, vmin=1, vmax=5, cmap=\"YlGnBu\")\n",
+ " return styler\n",
+ "\n",
+ "weather_df"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "weather_df.loc[\"2021-01-04\":\"2021-01-08\"].style.pipe(make_pretty)"
+ ]
+ },
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Hiding Data\n",
"\n",
- "The index can be hidden from rendering by calling [.hide_index()][hideidx], which might be useful if your index is integer based.\n",
+ "The index and column headers can be completely hidden, as well subselecting rows or columns that one wishes to exclude. Both these options are performed using the same methods.\n",
"\n",
- "Columns can be hidden from rendering by calling [.hide_columns()][hidecols] and passing in the name of a column, or a slice of columns.\n",
+ "The index can be hidden from rendering by calling [.hide()][hideidx] without any arguments, which might be useful if your index is integer based. Similarly column headers can be hidden by calling [.hide(axis=\"columns\")][hideidx] without any further arguments.\n",
"\n",
- "Hiding does not change the integer arrangement of CSS classes, e.g. hiding the first two columns of a DataFrame means the column class indexing will start at `col2`, since `col0` and `col1` are simply ignored.\n",
+ "Specific rows or columns can be hidden from rendering by calling the same [.hide()][hideidx] method and passing in a row/column label, a list-like or a slice of row/column labels to for the ``subset`` argument.\n",
"\n",
- "We can update our `Styler` object to hide some data and format the values.\n",
+ "Hiding does not change the integer arrangement of CSS classes, e.g. hiding the first two columns of a DataFrame means the column class indexing will still start at `col2`, since `col0` and `col1` are simply ignored.\n",
"\n",
- "[hideidx]: ../reference/api/pandas.io.formats.style.Styler.hide_index.rst\n",
- "[hidecols]: ../reference/api/pandas.io.formats.style.Styler.hide_columns.rst"
+ "We can update our `Styler` object from before to hide some data and format the values.\n",
+ "\n",
+ "[hideidx]: ../reference/api/pandas.io.formats.style.Styler.hide.rst"
]
},
{
@@ -197,7 +241,7 @@
"metadata": {},
"outputs": [],
"source": [
- "s = df.style.format('{:.0f}').hide_columns([('Random', 'Tumour'), ('Random', 'Non-Tumour')])\n",
+ "s = df.style.format('{:.0f}').hide([('Random', 'Tumour'), ('Random', 'Non-Tumour')], axis=\"columns\")\n",
"s"
]
},
@@ -223,13 +267,15 @@
"\n",
"- Using [.set_table_styles()][table] to control broader areas of the table with specified internal CSS. Although table styles allow the flexibility to add CSS selectors and properties controlling all individual parts of the table, they are unwieldy for individual cell specifications. Also, note that table styles cannot be exported to Excel. \n",
"- Using [.set_td_classes()][td_class] to directly link either external CSS classes to your data cells or link the internal CSS classes created by [.set_table_styles()][table]. See [here](#Setting-Classes-and-Linking-to-External-CSS). These cannot be used on column header rows or indexes, and also won't export to Excel. \n",
- "- Using the [.apply()][apply] and [.applymap()][applymap] functions to add direct internal CSS to specific data cells. See [here](#Styler-Functions). These cannot be used on column header rows or indexes, but only these methods add styles that will export to Excel. These methods work in a similar way to [DataFrame.apply()][dfapply] and [DataFrame.applymap()][dfapplymap].\n",
+ "- Using the [.apply()][apply] and [.applymap()][applymap] functions to add direct internal CSS to specific data cells. See [here](#Styler-Functions). As of v1.4.0 there are also methods that work directly on column header rows or indexes; [.apply_index()][applyindex] and [.applymap_index()][applymapindex]. Note that only these methods add styles that will export to Excel. These methods work in a similar way to [DataFrame.apply()][dfapply] and [DataFrame.applymap()][dfapplymap].\n",
"\n",
"[table]: ../reference/api/pandas.io.formats.style.Styler.set_table_styles.rst\n",
"[styler]: ../reference/api/pandas.io.formats.style.Styler.rst\n",
"[td_class]: ../reference/api/pandas.io.formats.style.Styler.set_td_classes.rst\n",
"[apply]: ../reference/api/pandas.io.formats.style.Styler.apply.rst\n",
"[applymap]: ../reference/api/pandas.io.formats.style.Styler.applymap.rst\n",
+ "[applyindex]: ../reference/api/pandas.io.formats.style.Styler.apply_index.rst\n",
+ "[applymapindex]: ../reference/api/pandas.io.formats.style.Styler.applymap_index.rst\n",
"[dfapply]: ../reference/api/pandas.DataFrame.apply.rst\n",
"[dfapplymap]: ../reference/api/pandas.DataFrame.applymap.rst"
]
@@ -377,7 +423,7 @@
"metadata": {},
"outputs": [],
"source": [
- "out = s.set_table_attributes('class=\"my-table-cls\"').render()\n",
+ "out = s.set_table_attributes('class=\"my-table-cls\"').to_html()\n",
"print(out[out.find('
"
]
},
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### Acting on the Index and Column Headers\n",
+ "\n",
+ "Similar application is acheived for headers by using:\n",
+ " \n",
+ "- [.applymap_index()][applymapindex] (elementwise): accepts a function that takes a single value and returns a string with the CSS attribute-value pair.\n",
+ "- [.apply_index()][applyindex] (level-wise): accepts a function that takes a Series and returns a Series, or numpy array with an identical shape where each element is a string with a CSS attribute-value pair. This method passes each level of your Index one-at-a-time. To style the index use `axis=0` and to style the column headers use `axis=1`.\n",
+ "\n",
+ "You can select a `level` of a `MultiIndex` but currently no similar `subset` application is available for these methods.\n",
+ "\n",
+ "[applyindex]: ../reference/api/pandas.io.formats.style.Styler.apply_index.rst\n",
+ "[applymapindex]: ../reference/api/pandas.io.formats.style.Styler.applymap_index.rst"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "s2.applymap_index(lambda v: \"color:pink;\" if v>4 else \"color:darkblue;\", axis=0)\n",
+ "s2.apply_index(lambda s: np.where(s.isin([\"A\", \"B\"]), \"color:pink;\", \"color:darkblue;\"), axis=1)"
+ ]
+ },
{
"cell_type": "markdown",
"metadata": {},
@@ -959,7 +1046,7 @@
"source": [
"### 5. If every byte counts use string replacement\n",
"\n",
- "You can remove unnecessary HTML, or shorten the default class names with string replace functions."
+ "You can remove unnecessary HTML, or shorten the default class names by replacing the default css dict. You can read a little more about CSS [below](#More-About-CSS-and-HTML)."
]
},
{
@@ -968,21 +1055,24 @@
"metadata": {},
"outputs": [],
"source": [
- "html = Styler(df4, uuid_len=0, cell_ids=False)\\\n",
- " .set_table_styles([{'selector': 'td', 'props': props},\n",
- " {'selector': '.col1', 'props': 'color:green;'},\n",
- " {'selector': '.level0', 'props': 'color:blue;'}])\\\n",
- " .render()\\\n",
- " .replace('blank', '')\\\n",
- " .replace('data', '')\\\n",
- " .replace('level0', 'l0')\\\n",
- " .replace('col_heading', '')\\\n",
- " .replace('row_heading', '')\n",
- "\n",
- "import re\n",
- "html = re.sub(r'col[0-9]+', lambda x: x.group().replace('col', 'c'), html)\n",
- "html = re.sub(r'row[0-9]+', lambda x: x.group().replace('row', 'r'), html)\n",
- "print(html)"
+ "my_css = {\n",
+ " \"row_heading\": \"\",\n",
+ " \"col_heading\": \"\",\n",
+ " \"index_name\": \"\",\n",
+ " \"col\": \"c\",\n",
+ " \"row\": \"r\",\n",
+ " \"col_trim\": \"\",\n",
+ " \"row_trim\": \"\",\n",
+ " \"level\": \"l\",\n",
+ " \"data\": \"\",\n",
+ " \"blank\": \"\",\n",
+ "}\n",
+ "html = Styler(df4, uuid_len=0, cell_ids=False)\n",
+ "html.set_table_styles([{'selector': 'td', 'props': props},\n",
+ " {'selector': '.c1', 'props': 'color:green;'},\n",
+ " {'selector': '.l0', 'props': 'color:blue;'}],\n",
+ " css_class_names=my_css)\n",
+ "print(html.to_html())"
]
},
{
@@ -991,8 +1081,7 @@
"metadata": {},
"outputs": [],
"source": [
- "from IPython.display import HTML\n",
- "HTML(html)"
+ "html"
]
},
{
@@ -1107,7 +1196,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
- "You can create \"heatmaps\" with the `background_gradient` and `text_gradient` methods. These require matplotlib, and we'll use [Seaborn](https://stanford.edu/~mwaskom/software/seaborn/) to get a nice colormap."
+ "You can create \"heatmaps\" with the `background_gradient` and `text_gradient` methods. These require matplotlib, and we'll use [Seaborn](http://seaborn.pydata.org/) to get a nice colormap."
]
},
{
@@ -1188,9 +1277,9 @@
"cell_type": "markdown",
"metadata": {},
"source": [
- "In version 0.20.0 the ability to customize the bar chart further was given. You can now have the `df.style.bar` be centered on zero or midpoint value (in addition to the already existing way of having the min value at the left side of the cell), and you can pass a list of `[color_negative, color_positive]`.\n",
+ "Additional keyword arguments give more control on centering and positioning, and you can pass a list of `[color_negative, color_positive]` to highlight lower and higher values or a matplotlib colormap.\n",
"\n",
- "Here's how you can change the above with the new `align='mid'` option:"
+ "To showcase an example here's how you can change the above with the new `align` option, combined with setting `vmin` and `vmax` limits, the `width` of the figure, and underlying css `props` of cells, leaving space to display the text and the bars. We also use `text_gradient` to color the text the same as the bars using a matplotlib colormap (although in this case the visualization is probably better without this additional effect)."
]
},
{
@@ -1199,7 +1288,10 @@
"metadata": {},
"outputs": [],
"source": [
- "df2.style.bar(subset=['A', 'B'], align='mid', color=['#d65f5f', '#5fba7d'])"
+ "df2.style.format('{:.3f}', na_rep=\"\")\\\n",
+ " .bar(align=0, vmin=-2.5, vmax=2.5, cmap=\"bwr\", height=50,\n",
+ " width=60, props=\"width: 120px; border-right: 1px solid black;\")\\\n",
+ " .text_gradient(cmap=\"bwr\", vmin=-2.5, vmax=2.5)"
]
},
{
@@ -1223,30 +1315,33 @@
"\n",
"# Test series\n",
"test1 = pd.Series([-100,-60,-30,-20], name='All Negative')\n",
- "test2 = pd.Series([10,20,50,100], name='All Positive')\n",
- "test3 = pd.Series([-10,-5,0,90], name='Both Pos and Neg')\n",
+ "test2 = pd.Series([-10,-5,0,90], name='Both Pos and Neg')\n",
+ "test3 = pd.Series([10,20,50,100], name='All Positive')\n",
+ "test4 = pd.Series([100, 103, 101, 102], name='Large Positive')\n",
+ "\n",
"\n",
"head = \"\"\"\n",
"
\".format(align)\n",
- " for series in [test1,test2,test3]:\n",
+ " for series in [test1,test2,test3, test4]:\n",
" s = series.copy()\n",
" s.name=''\n",
- " row += \"
'\n",
" head += row\n",
" \n",
@@ -1284,8 +1379,12 @@
"metadata": {},
"outputs": [],
"source": [
- "style1 = df2.style.applymap(style_negative, props='color:red;')\\\n",
- " .applymap(lambda v: 'opacity: 20%;' if (v < 0.3) and (v > -0.3) else None)"
+ "style1 = df2.style\\\n",
+ " .applymap(style_negative, props='color:red;')\\\n",
+ " .applymap(lambda v: 'opacity: 20%;' if (v < 0.3) and (v > -0.3) else None)\\\n",
+ " .set_table_styles([{\"selector\": \"th\", \"props\": \"color: blue;\"}])\\\n",
+ " .hide(axis=\"index\")\n",
+ "style1"
]
},
{
@@ -1312,13 +1411,10 @@
"source": [
"## Limitations\n",
"\n",
- "- DataFrame only `(use Series.to_frame().style)`\n",
- "- The index and columns must be unique\n",
+ "- DataFrame only (use `Series.to_frame().style`)\n",
+ "- The index and columns do not need to be unique, but certain styling functions can only work with unique indexes.\n",
"- No large repr, and construction performance isn't great; although we have some [HTML optimizations](#Optimization)\n",
- "- You can only style the *values*, not the index or columns (except with `table_styles` above)\n",
- "- You can only apply styles, you can't insert new HTML entities\n",
- "\n",
- "Some of these might be addressed in the future. "
+ "- You can only apply styles, you can't insert new HTML entities, except via subclassing."
]
},
{
@@ -1403,7 +1499,9 @@
"source": [
"### Sticky Headers\n",
"\n",
- "If you display a large matrix or DataFrame in a notebook, but you want to always see the column and row headers you can use the following CSS to make them stick. We might make this into an API function later."
+ "If you display a large matrix or DataFrame in a notebook, but you want to always see the column and row headers you can use the [.set_sticky][sticky] method which manipulates the table styles CSS.\n",
+ "\n",
+ "[sticky]: ../reference/api/pandas.io.formats.style.Styler.set_sticky.rst"
]
},
{
@@ -1412,20 +1510,15 @@
"metadata": {},
"outputs": [],
"source": [
- "bigdf = pd.DataFrame(np.random.randn(15, 100))\n",
- "bigdf.style.set_table_styles([\n",
- " {'selector': 'thead th', 'props': 'position: sticky; top:0; background-color:salmon;'},\n",
- " {'selector': 'tbody th', 'props': 'position: sticky; left:0; background-color:lightgreen;'} \n",
- "])"
+ "bigdf = pd.DataFrame(np.random.randn(16, 100))\n",
+ "bigdf.style.set_sticky(axis=\"index\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
- "### Hiding Headers\n",
- "\n",
- "We don't yet have any API to hide headers so a quick fix is:"
+ "It is also possible to stick MultiIndexes and even only specific levels."
]
},
{
@@ -1434,7 +1527,8 @@
"metadata": {},
"outputs": [],
"source": [
- "df3.style.set_table_styles([{'selector': 'thead tr', 'props': 'display: none;'}]) # or 'thead th'"
+ "bigdf.index = pd.MultiIndex.from_product([[\"A\",\"B\"],[0,1],[0,1,2,3]])\n",
+ "bigdf.style.set_sticky(axis=\"index\", pixel_size=18, levels=[1,2])"
]
},
{
@@ -1524,6 +1618,17 @@
"\n"
]
},
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Export to LaTeX\n",
+ "\n",
+ "There is support (*since version 1.3.0*) to export `Styler` to LaTeX. The documentation for the [.to_latex][latex] method gives further detail and numerous examples.\n",
+ "\n",
+ "[latex]: ../reference/api/pandas.io.formats.style.Styler.to_latex.rst"
+ ]
+ },
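+ {
+  "cell_type": "markdown",
+  "metadata": {},
+  "source": [
+   "As a minimal sketch (the formatting arguments here are illustrative only):"
+  ]
+ },
+ {
+  "cell_type": "code",
+  "execution_count": null,
+  "metadata": {},
+  "outputs": [],
+  "source": [
+   "print(df.style.format(precision=1).to_latex())"
+  ]
+ },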
{
"cell_type": "markdown",
"metadata": {},
@@ -1555,12 +1660,13 @@
" + `row`, where `m` is the numeric position of the cell.\n",
" + `col`, where `n` is the numeric position of the cell.\n",
"- Blank cells include `blank`\n",
+ "- Trimmed cells include `col_trim` or `row_trim`\n",
"\n",
"The structure of the `id` is `T_uuid_level_row_col` where `level` is used only on headings, and headings will only have either `row` or `col` whichever is needed. By default we've also prepended each row/column identifier with a UUID unique to each DataFrame so that the style from one doesn't collide with the styling from another within the same notebook or page. You can read more about the use of UUIDs in [Optimization](#Optimization).\n",
"\n",
- "We can see example of the HTML by calling the [.render()][render] method.\n",
+ "We can see example of the HTML by calling the [.to_html()][tohtml] method.\n",
"\n",
- "[render]: ../reference/api/pandas.io.formats.style.Styler.render.rst"
+ "[tohtml]: ../reference/api/pandas.io.formats.style.Styler.to_html.rst"
]
},
{
@@ -1569,7 +1675,7 @@
"metadata": {},
"outputs": [],
"source": [
- "print(pd.DataFrame([[1,2],[3,4]], index=['i1', 'i2'], columns=['c1', 'c2']).style.render())"
+ "print(pd.DataFrame([[1,2],[3,4]], index=['i1', 'i2'], columns=['c1', 'c2']).style.to_html())"
]
},
{
@@ -1769,7 +1875,7 @@
" Styler.loader, # the default\n",
" ])\n",
" )\n",
- " template_html = env.get_template(\"myhtml.tpl\")"
+ " template_html_table = env.get_template(\"myhtml.tpl\")"
]
},
{
@@ -1796,7 +1902,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
- "Our custom template accepts a `table_title` keyword. We can provide the value in the `.render` method."
+ "Our custom template accepts a `table_title` keyword. We can provide the value in the `.to_html` method."
]
},
{
@@ -1805,7 +1911,7 @@
"metadata": {},
"outputs": [],
"source": [
- "HTML(MyStyler(df3).render(table_title=\"Extending Example\"))"
+ "HTML(MyStyler(df3).to_html(table_title=\"Extending Example\"))"
]
},
{
@@ -1822,14 +1928,63 @@
"outputs": [],
"source": [
"EasyStyler = Styler.from_custom_template(\"templates\", \"myhtml.tpl\")\n",
- "EasyStyler(df3)"
+ "HTML(EasyStyler(df3).to_html(table_title=\"Another Title\"))"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "#### Template Structure\n",
+ "\n",
+ "Here's the template structure for the both the style generation template and the table generation template:"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Style template:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "nbsphinx": "hidden"
+ },
+ "outputs": [],
+ "source": [
+ "with open(\"templates/html_style_structure.html\") as f:\n",
+ " style_structure = f.read()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "HTML(style_structure)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
- "Here's the template structure:"
+ "Table template:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "nbsphinx": "hidden"
+ },
+ "outputs": [],
+ "source": [
+ "with open(\"templates/html_table_structure.html\") as f:\n",
+ " table_structure = f.read()"
]
},
{
@@ -1838,10 +1993,7 @@
"metadata": {},
"outputs": [],
"source": [
- "with open(\"templates/template_structure.html\") as f:\n",
- " structure = f.read()\n",
- " \n",
- "HTML(structure)"
+ "HTML(table_structure)"
]
},
{
@@ -1871,7 +2023,7 @@
],
"metadata": {
"kernelspec": {
- "display_name": "Python 3",
+ "display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
@@ -1885,7 +2037,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
- "version": "3.8.6"
+ "version": "3.9.5"
}
},
"nbformat": 4,
diff --git a/doc/source/user_guide/templates/html_style_structure.html b/doc/source/user_guide/templates/html_style_structure.html
new file mode 100644
index 0000000000000..dc0c03ac363a9
--- /dev/null
+++ b/doc/source/user_guide/templates/html_style_structure.html
@@ -0,0 +1,35 @@
+
+
+
+
before_style
+
style
+
<style type="text/css">
+
table_styles
+
before_cellstyle
+
cellstyle
+
</style>
+
diff --git a/doc/source/user_guide/templates/template_structure.html b/doc/source/user_guide/templates/html_table_structure.html
similarity index 80%
rename from doc/source/user_guide/templates/template_structure.html
rename to doc/source/user_guide/templates/html_table_structure.html
index 0778d8e2e6f18..e03f9591d2a35 100644
--- a/doc/source/user_guide/templates/template_structure.html
+++ b/doc/source/user_guide/templates/html_table_structure.html
@@ -25,15 +25,6 @@
}
-
{{ super() }}
diff --git a/doc/source/user_guide/text.rst b/doc/source/user_guide/text.rst
index db9485f3f2348..d350351075cb6 100644
--- a/doc/source/user_guide/text.rst
+++ b/doc/source/user_guide/text.rst
@@ -335,6 +335,19 @@ regular expression object will raise a ``ValueError``.
---------------------------------------------------------------------------
ValueError: case and flags cannot be set when pat is a compiled regex
+``removeprefix`` and ``removesuffix`` have the same effect as ``str.removeprefix`` and ``str.removesuffix`` added in Python 3.9
+`__:
+
+.. versionadded:: 1.4.0
+
+.. ipython:: python
+
+ s = pd.Series(["str_foo", "str_bar", "no_prefix"])
+ s.str.removeprefix("str_")
+
+ s = pd.Series(["foo_str", "bar_str", "no_suffix"])
+ s.str.removesuffix("_str")
+
.. _text.concatenate:
Concatenation
@@ -742,6 +755,8 @@ Method summary
:meth:`~Series.str.get_dummies`;Split strings on the delimiter returning DataFrame of dummy variables
:meth:`~Series.str.contains`;Return boolean array if each string contains pattern/regex
:meth:`~Series.str.replace`;Replace occurrences of pattern/regex/string with some other string or the return value of a callable given the occurrence
+ :meth:`~Series.str.removeprefix`;Remove prefix from string, i.e. only remove if string starts with prefix.
+ :meth:`~Series.str.removesuffix`;Remove suffix from string, i.e. only remove if string ends with suffix.
:meth:`~Series.str.repeat`;Duplicate values (``s.str.repeat(3)`` equivalent to ``x * 3``)
:meth:`~Series.str.pad`;"Add whitespace to left, right, or both sides of strings"
:meth:`~Series.str.center`;Equivalent to ``str.center``
diff --git a/doc/source/user_guide/timedeltas.rst b/doc/source/user_guide/timedeltas.rst
index 0b4ddaaa8a42a..180de1df53f9e 100644
--- a/doc/source/user_guide/timedeltas.rst
+++ b/doc/source/user_guide/timedeltas.rst
@@ -88,13 +88,19 @@ or a list/array of strings:
pd.to_timedelta(["1 days 06:05:01.00003", "15.5us", "nan"])
-The ``unit`` keyword argument specifies the unit of the Timedelta:
+The ``unit`` keyword argument specifies the unit of the Timedelta if the input
+is numeric:
.. ipython:: python
pd.to_timedelta(np.arange(5), unit="s")
pd.to_timedelta(np.arange(5), unit="d")
+.. warning::
+ If a string or array of strings is passed as an input then the ``unit`` keyword
+ argument will be ignored. If a string without units is passed then the default
+ unit of nanoseconds is assumed.
+
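A minimal sketch of the behaviour described in the warning above:

.. code-block:: python

   # numeric input: ``unit`` is honoured
   pd.to_timedelta(1, unit="d")    # Timedelta('1 days 00:00:00')

   # string input: ``unit`` is ignored, and a bare number is read as nanoseconds
   pd.to_timedelta("1", unit="d")  # Timedelta('0 days 00:00:00.000000001')
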
.. _timedeltas.limitations:
Timedelta limitations
diff --git a/doc/source/user_guide/timeseries.rst b/doc/source/user_guide/timeseries.rst
index 6f005f912fe37..6df234a027ee9 100644
--- a/doc/source/user_guide/timeseries.rst
+++ b/doc/source/user_guide/timeseries.rst
@@ -204,6 +204,7 @@ If you use dates which start with the day first (i.e. European style),
you can pass the ``dayfirst`` flag:
.. ipython:: python
+ :okwarning:
pd.to_datetime(["04-01-2012 10:00"], dayfirst=True)
@@ -211,9 +212,10 @@ you can pass the ``dayfirst`` flag:
.. warning::
- You see in the above example that ``dayfirst`` isn't strict, so if a date
+ You see in the above example that ``dayfirst`` isn't strict. If a date
can't be parsed with the day being first it will be parsed as if
- ``dayfirst`` were False.
+ ``dayfirst`` were False, and in the case of parsing delimited date strings
+ (e.g. ``31-12-2012``) a warning will also be raised.
If you pass a single string to ``to_datetime``, it returns a single ``Timestamp``.
``Timestamp`` can also accept string input, but it doesn't accept string parsing
@@ -850,7 +852,7 @@ savings time. However, all :class:`DateOffset` subclasses that are an hour or sm
The basic :class:`DateOffset` acts similar to ``dateutil.relativedelta`` (`relativedelta documentation`_)
that shifts a date time by the corresponding calendar duration specified. The
-arithmetic operator (``+``) or the ``apply`` method can be used to perform the shift.
+arithmetic operator (``+``) can be used to perform the shift.
.. ipython:: python
@@ -864,7 +866,6 @@ arithmetic operator (``+``) or the ``apply`` method can be used to perform the s
friday.day_name()
# Add 2 business days (Friday --> Tuesday)
two_business_days = 2 * pd.offsets.BDay()
- two_business_days.apply(friday)
friday + two_business_days
(friday + two_business_days).day_name()
@@ -936,14 +937,14 @@ in the operation).
ts = pd.Timestamp("2014-01-01 09:00")
day = pd.offsets.Day()
- day.apply(ts)
- day.apply(ts).normalize()
+ day + ts
+ (day + ts).normalize()
ts = pd.Timestamp("2014-01-01 22:00")
hour = pd.offsets.Hour()
- hour.apply(ts)
- hour.apply(ts).normalize()
- hour.apply(pd.Timestamp("2014-01-01 23:30")).normalize()
+ hour + ts
+ (hour + ts).normalize()
+ (hour + pd.Timestamp("2014-01-01 23:30")).normalize()
.. _relativedelta documentation: https://dateutil.readthedocs.io/en/stable/relativedelta.html
@@ -1183,16 +1184,16 @@ under the default business hours (9:00 - 17:00), there is no gap (0 minutes) bet
pd.offsets.BusinessHour().rollback(pd.Timestamp("2014-08-02 15:00"))
pd.offsets.BusinessHour().rollforward(pd.Timestamp("2014-08-02 15:00"))
- # It is the same as BusinessHour().apply(pd.Timestamp('2014-08-01 17:00')).
- # And it is the same as BusinessHour().apply(pd.Timestamp('2014-08-04 09:00'))
- pd.offsets.BusinessHour().apply(pd.Timestamp("2014-08-02 15:00"))
+ # It is the same as BusinessHour() + pd.Timestamp('2014-08-01 17:00').
+ # And it is the same as BusinessHour() + pd.Timestamp('2014-08-04 09:00')
+ pd.offsets.BusinessHour() + pd.Timestamp("2014-08-02 15:00")
# BusinessDay results (for reference)
pd.offsets.BusinessHour().rollforward(pd.Timestamp("2014-08-02"))
- # It is the same as BusinessDay().apply(pd.Timestamp('2014-08-01'))
+ # It is the same as BusinessDay() + pd.Timestamp('2014-08-01')
# The result is the same as rollforward because BusinessDay offsets never overlap.
- pd.offsets.BusinessHour().apply(pd.Timestamp("2014-08-02"))
+ pd.offsets.BusinessHour() + pd.Timestamp("2014-08-02")
``BusinessHour`` regards Saturday and Sunday as holidays. To use arbitrary
holidays, you can use ``CustomBusinessHour`` offset, as explained in the
@@ -1269,6 +1270,36 @@ frequencies. We will refer to these aliases as *offset aliases*.
"U, us", "microseconds"
"N", "nanoseconds"
+.. note::
+
+ When using the offset aliases above, it should be noted that functions
+ such as :func:`date_range` and :func:`bdate_range` will only return
+ timestamps that are in the interval defined by ``start_date`` and
+ ``end_date``. If the ``start_date`` does not correspond to the frequency,
+ the returned timestamps will start at the next valid timestamp; likewise,
+ if the ``end_date`` does not correspond to the frequency, the returned
+ timestamps will stop at the previous valid timestamp.
+
+ For example, for the offset ``MS``, if the ``start_date`` is not the first
+ of the month, the returned timestamps will start with the first day of the
+ next month. If ``end_date`` is not the first day of a month, the last
+ returned timestamp will be the first day of the corresponding month.
+
+ .. ipython:: python
+
+ dates_lst_1 = pd.date_range("2020-01-06", "2020-04-03", freq="MS")
+ dates_lst_1
+
+ dates_lst_2 = pd.date_range("2020-01-01", "2020-04-01", freq="MS")
+ dates_lst_2
+
+ We can see in the above example that :func:`date_range` and
+ :func:`bdate_range` will only return the valid timestamps between the
+ ``start_date`` and ``end_date``. If these are not valid timestamps for the
+ given frequency, it will roll to the next valid timestamp for ``start_date``
+ (respectively the previous one for ``end_date``).
+
+
Combining aliases
~~~~~~~~~~~~~~~~~
@@ -2079,7 +2110,6 @@ The ``period`` dtype can be used in ``.astype(...)``. It allows one to change th
dti
dti.astype("period[M]")
-
PeriodIndex partial string indexing
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -2389,7 +2419,7 @@ you can use the ``tz_convert`` method.
For ``pytz`` time zones, it is incorrect to pass a time zone object directly into
the ``datetime.datetime`` constructor
- (e.g., ``datetime.datetime(2011, 1, 1, tz=pytz.timezone('US/Eastern'))``.
+ (e.g., ``datetime.datetime(2011, 1, 1, tzinfo=pytz.timezone('US/Eastern'))``.
Instead, the datetime needs to be localized using the ``localize`` method
on the ``pytz`` time zone object.
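
A minimal sketch of the correct pattern:

.. code-block:: python

   import datetime

   import pytz

   tz = pytz.timezone("US/Eastern")

   # localize a naive datetime on the time zone object rather than
   # passing ``tzinfo`` to the ``datetime.datetime`` constructor
   dt = tz.localize(datetime.datetime(2011, 1, 1))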
diff --git a/doc/source/user_guide/visualization.rst b/doc/source/user_guide/visualization.rst
index 1c02be989eeeb..404914dbc7a69 100644
--- a/doc/source/user_guide/visualization.rst
+++ b/doc/source/user_guide/visualization.rst
@@ -272,7 +272,7 @@ horizontal and cumulative histograms can be drawn by
plt.close("all")
See the :meth:`hist ` method and the
-`matplotlib hist documentation `__ for more.
+`matplotlib hist documentation `__ for more.
The existing interface ``DataFrame.hist`` to plot histograms can still be used.
@@ -316,6 +316,34 @@ The ``by`` keyword can be specified to plot grouped histograms:
@savefig grouped_hist.png
data.hist(by=np.random.randint(0, 4, 1000), figsize=(6, 4));
+.. ipython:: python
+ :suppress:
+
+ plt.close("all")
+ np.random.seed(123456)
+
+In addition, the ``by`` keyword can be specified in :meth:`DataFrame.plot.hist`.
+
+.. versionchanged:: 1.4.0
+
+.. ipython:: python
+
+ data = pd.DataFrame(
+ {
+ "a": np.random.choice(["x", "y", "z"], 1000),
+ "b": np.random.choice(["e", "f", "g"], 1000),
+ "c": np.random.randn(1000),
+ "d": np.random.randn(1000) - 1,
+ },
+ )
+
+ @savefig grouped_hist_by.png
+ data.plot.hist(by=["a", "b"], figsize=(10, 5));
+
+.. ipython:: python
+ :suppress:
+
+ plt.close("all")
.. _visualization.box:
@@ -382,7 +410,7 @@ For example, horizontal and custom-positioned boxplot can be drawn by
See the :meth:`boxplot ` method and the
-`matplotlib boxplot documentation `__ for more.
+`matplotlib boxplot documentation `__ for more.
The existing interface ``DataFrame.boxplot`` to plot boxplots can still be used.
@@ -448,6 +476,32 @@ columns:
plt.close("all")
+You could also create groupings with :meth:`DataFrame.plot.box`, for instance:
+
+.. versionchanged:: 1.4.0
+
+.. ipython:: python
+ :suppress:
+
+ plt.close("all")
+ np.random.seed(123456)
+
+.. ipython:: python
+ :okwarning:
+
+ df = pd.DataFrame(np.random.rand(10, 3), columns=["Col1", "Col2", "Col3"])
+ df["X"] = pd.Series(["A", "A", "A", "A", "A", "B", "B", "B", "B", "B"])
+
+ plt.figure();
+
+ @savefig box_plot_ex4.png
+ bp = df.plot.box(column=["Col1", "Col2"], by="X")
+
+.. ipython:: python
+ :suppress:
+
+ plt.close("all")
+
.. _visualization.box.return:
In ``boxplot``, the return type can be controlled by the ``return_type`` keyword. The valid choices are ``{"axes", "dict", "both", None}``.
@@ -620,7 +674,7 @@ bubble chart using a column of the ``DataFrame`` as the bubble size.
plt.close("all")
See the :meth:`scatter ` method and the
-`matplotlib scatter documentation `__ for more.
+`matplotlib scatter documentation `__ for more.
.. _visualization.hexbin:
@@ -680,7 +734,7 @@ given by column ``z``. The bins are aggregated with NumPy's ``max`` function.
plt.close("all")
See the :meth:`hexbin ` method and the
-`matplotlib hexbin documentation `__ for more.
+`matplotlib hexbin documentation `__ for more.
.. _visualization.pie:
@@ -785,7 +839,7 @@ If you pass values whose sum total is less than 1.0, matplotlib draws a semicirc
@savefig series_pie_plot_semi.png
series.plot.pie(figsize=(6, 6));
-See the `matplotlib pie documentation `__ for more.
+See the `matplotlib pie documentation `__ for more.
.. ipython:: python
:suppress:
@@ -902,7 +956,7 @@ for more information. By coloring these curves differently for each class
it is possible to visualize data clustering. Curves belonging to samples
of the same class will usually be closer together and form larger structures.
-**Note**: The "Iris" dataset is available `here `__.
+**Note**: The "Iris" dataset is available `here `__.
.. ipython:: python
@@ -1059,10 +1113,10 @@ unit interval). The point in the plane, where our sample settles to (where the
forces acting on our sample are at an equilibrium) is where a dot representing
our sample will be drawn. Depending on which class that sample belongs to, it will
be colored differently.
-See the R package `Radviz `__
+See the R package `Radviz `__
for more information.
-**Note**: The "Iris" dataset is available `here `__.
+**Note**: The "Iris" dataset is available `here `__.
.. ipython:: python
@@ -1330,7 +1384,7 @@ tick locator methods, it is useful to call the automatic
date tick adjustment from matplotlib for figures whose ticklabels overlap.
See the :meth:`autofmt_xdate ` method and the
-`matplotlib documentation `__ for more.
+`matplotlib documentation `__ for more.
Subplots
~~~~~~~~
@@ -1566,7 +1620,7 @@ as seen in the example below.
There also exists a helper function ``pandas.plotting.table``, which creates a
table from :class:`DataFrame` or :class:`Series`, and adds it to a
``matplotlib.Axes`` instance. This function can accept keywords which the
-matplotlib `table `__ has.
+matplotlib `table `__ has.
.. ipython:: python
@@ -1597,7 +1651,7 @@ remedy this, ``DataFrame`` plotting supports the use of the ``colormap`` argumen
which accepts either a Matplotlib `colormap `__
or a string that is a name of a colormap registered with Matplotlib. A
visualization of the default matplotlib colormaps is available `here
-`__.
+`__.
As matplotlib does not directly support colormaps for line-based plots, the
colors are selected based on an even spacing determined by the number of columns
@@ -1740,7 +1794,7 @@ Starting in version 0.25, pandas can be extended with third-party plotting backe
main idea is letting users select a plotting backend different than the provided
one based on Matplotlib.
-This can be done by passsing 'backend.module' as the argument ``backend`` in ``plot``
+This can be done by passing 'backend.module' as the argument ``backend`` in the ``plot``
function. For example:
.. code-block:: python
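
   # a minimal sketch; "backend.module" is a placeholder for the module name
   # of a real third-party plotting backend
   pd.Series([1, 2, 3]).plot(backend="backend.module")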
diff --git a/doc/source/user_guide/window.rst b/doc/source/user_guide/window.rst
index 0d6dcaa3726e6..d1244f62cc1e4 100644
--- a/doc/source/user_guide/window.rst
+++ b/doc/source/user_guide/window.rst
@@ -262,26 +262,24 @@ and we want to use an expanding window where ``use_expanding`` is ``True`` other
.. code-block:: ipython
In [2]: from pandas.api.indexers import BaseIndexer
- ...:
- ...: class CustomIndexer(BaseIndexer):
- ...:
- ...: def get_window_bounds(self, num_values, min_periods, center, closed):
- ...: start = np.empty(num_values, dtype=np.int64)
- ...: end = np.empty(num_values, dtype=np.int64)
- ...: for i in range(num_values):
- ...: if self.use_expanding[i]:
- ...: start[i] = 0
- ...: end[i] = i + 1
- ...: else:
- ...: start[i] = i
- ...: end[i] = i + self.window_size
- ...: return start, end
- ...:
-
- In [3]: indexer = CustomIndexer(window_size=1, use_expanding=use_expanding)
-
- In [4]: df.rolling(indexer).sum()
- Out[4]:
+
+ In [3]: class CustomIndexer(BaseIndexer):
+ ...: def get_window_bounds(self, num_values, min_periods, center, closed):
+ ...: start = np.empty(num_values, dtype=np.int64)
+ ...: end = np.empty(num_values, dtype=np.int64)
+ ...: for i in range(num_values):
+ ...: if self.use_expanding[i]:
+ ...: start[i] = 0
+ ...: end[i] = i + 1
+ ...: else:
+ ...: start[i] = i
+ ...: end[i] = i + self.window_size
+ ...: return start, end
+
+ In [4]: indexer = CustomIndexer(window_size=1, use_expanding=use_expanding)
+
+ In [5]: df.rolling(indexer).sum()
+ Out[5]:
values
0 0.0
1 1.0
@@ -289,7 +287,7 @@ and we want to use an expanding window where ``use_expanding`` is ``True`` other
3 3.0
4 10.0
-You can view other examples of ``BaseIndexer`` subclasses `here `__
+You can view other examples of ``BaseIndexer`` subclasses `here `__
.. versionadded:: 1.1
@@ -365,45 +363,21 @@ Numba engine
Additionally, :meth:`~Rolling.apply` can leverage `Numba `__
if installed as an optional dependency. The apply aggregation can be executed using Numba by specifying
``engine='numba'`` and ``engine_kwargs`` arguments (``raw`` must also be set to ``True``).
+See :ref:`enhancing performance with Numba ` for general usage of the arguments and performance considerations.
+
Numba will be applied in potentially two routines:
#. If ``func`` is a standard Python function, the engine will `JIT `__ the passed function. ``func`` can also be a JITed function in which case the engine will not JIT the function again.
#. The engine will JIT the for loop where the apply function is applied to each window.
-.. versionadded:: 1.3.0
-
-``mean``, ``median``, ``max``, ``min``, and ``sum`` also support the ``engine`` and ``engine_kwargs`` arguments.
-
The ``engine_kwargs`` argument is a dictionary of keyword arguments that will be passed into the
`numba.jit decorator `__.
These keyword arguments will be applied to *both* the passed function (if a standard Python function)
-and the apply for loop over each window. Currently only ``nogil``, ``nopython``, and ``parallel`` are supported,
-and their default values are set to ``False``, ``True`` and ``False`` respectively.
-
-.. note::
+and the apply for loop over each window.
- In terms of performance, **the first time a function is run using the Numba engine will be slow**
- as Numba will have some function compilation overhead. However, the compiled functions are cached,
- and subsequent calls will be fast. In general, the Numba engine is performant with
- a larger amount of data points (e.g. 1+ million).
-
-.. code-block:: ipython
-
- In [1]: data = pd.Series(range(1_000_000))
-
- In [2]: roll = data.rolling(10)
-
- In [3]: def f(x):
- ...: return np.sum(x) + 5
- # Run the first time, compilation time will affect performance
- In [4]: %timeit -r 1 -n 1 roll.apply(f, engine='numba', raw=True) # noqa: E225, E999
- 1.23 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
- # Function is cached and performance will improve
- In [5]: %timeit roll.apply(f, engine='numba', raw=True)
- 188 ms ± 1.93 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
+.. versionadded:: 1.3.0
- In [6]: %timeit roll.apply(f, engine='cython', raw=True)
- 3.92 s ± 59 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
+``mean``, ``median``, ``max``, ``min``, and ``sum`` also support the ``engine`` and ``engine_kwargs`` arguments.
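+
+A minimal sketch of these keywords in use (assuming the optional Numba
+dependency is installed; the data and window size are illustrative):
+
+.. code-block:: python
+
+   data = pd.Series(range(1_000))
+
+   def f(x):
+       return np.sum(x) + 5
+
+   # engine_kwargs is forwarded to the numba.jit decorator; raw=True is
+   # required when engine="numba".
+   data.rolling(10).apply(f, engine="numba", raw=True,
+                          engine_kwargs={"nogil": True, "parallel": False})
+
+   # The built-in aggregations accept the same keywords (new in 1.3.0).
+   data.rolling(10).mean(engine="numba")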
.. _window.cov_corr:
diff --git a/doc/source/whatsnew/index.rst b/doc/source/whatsnew/index.rst
index 986cf43b80494..df33174804a33 100644
--- a/doc/source/whatsnew/index.rst
+++ b/doc/source/whatsnew/index.rst
@@ -10,12 +10,25 @@ This is the list of changes to pandas between each release. For full details,
see the `commit logs `_. For install and
upgrade instructions, see :ref:`install`.
+Version 1.4
+-----------
+
+.. toctree::
+ :maxdepth: 2
+
+ v1.4.0
+
Version 1.3
-----------
.. toctree::
:maxdepth: 2
+ v1.3.5
+ v1.3.4
+ v1.3.3
+ v1.3.2
+ v1.3.1
v1.3.0
Version 1.2
diff --git a/doc/source/whatsnew/v0.10.0.rst b/doc/source/whatsnew/v0.10.0.rst
index aa2749c85a232..bd47e6e4bc025 100644
--- a/doc/source/whatsnew/v0.10.0.rst
+++ b/doc/source/whatsnew/v0.10.0.rst
@@ -181,6 +181,7 @@ labeled the aggregated group with the end of the interval: the next day).
``X0``, ``X1``, ...) can be reproduced by specifying ``prefix='X'``:
.. ipython:: python
+ :okwarning:
import io
@@ -197,11 +198,25 @@ labeled the aggregated group with the end of the interval: the next day).
though this can be controlled by new ``true_values`` and ``false_values``
arguments:
-.. ipython:: python
+.. code-block:: ipython
- print(data)
- pd.read_csv(io.StringIO(data))
- pd.read_csv(io.StringIO(data), true_values=["Yes"], false_values=["No"])
+ In [4]: print(data)
+
+ a,b,c
+ 1,Yes,2
+ 3,No,4
+
+ In [5]: pd.read_csv(io.StringIO(data))
+ Out[5]:
+ a b c
+ 0 1 Yes 2
+ 1 3 No 4
+
+ In [6]: pd.read_csv(io.StringIO(data), true_values=["Yes"], false_values=["No"])
+ Out[6]:
+ a b c
+ 0 1 True 2
+ 1 3 False 4
- The file parsers will not recognize non-string values arising from a
converter function as NA if passed in the ``na_values`` argument. It's better
diff --git a/doc/source/whatsnew/v0.13.0.rst b/doc/source/whatsnew/v0.13.0.rst
index 3c6b70fb21383..b2596358d0c9d 100644
--- a/doc/source/whatsnew/v0.13.0.rst
+++ b/doc/source/whatsnew/v0.13.0.rst
@@ -310,7 +310,7 @@ Float64Index API change
- Added a new index type, ``Float64Index``. This will be automatically created when passing floating values in index creation.
This enables a pure label-based slicing paradigm that makes ``[],ix,loc`` for scalar indexing and slicing work exactly the
- same. See :ref:`the docs`, (:issue:`263`)
+ same. See :ref:`the docs` (:issue:`263`)
Construction is by default for floating type values.
diff --git a/doc/source/whatsnew/v0.16.1.rst b/doc/source/whatsnew/v0.16.1.rst
index 269854111373f..cbf5b7703bd79 100644
--- a/doc/source/whatsnew/v0.16.1.rst
+++ b/doc/source/whatsnew/v0.16.1.rst
@@ -168,7 +168,7 @@ values NOT in the categories, similarly to how you can reindex ANY pandas index.
ordered=False, name='B',
dtype='category')
-See the :ref:`documentation ` for more. (:issue:`7629`, :issue:`10038`, :issue:`10039`)
+See the :ref:`documentation ` for more. (:issue:`7629`, :issue:`10038`, :issue:`10039`)
.. _whatsnew_0161.enhancements.sample:
diff --git a/doc/source/whatsnew/v0.16.2.rst b/doc/source/whatsnew/v0.16.2.rst
index 37e8c64ea9ced..c6c134a383e11 100644
--- a/doc/source/whatsnew/v0.16.2.rst
+++ b/doc/source/whatsnew/v0.16.2.rst
@@ -62,6 +62,7 @@ When the function you wish to apply takes its data anywhere other than the first
of ``(function, keyword)`` indicating where the DataFrame should flow. For example:
.. ipython:: python
+ :okwarning:
import statsmodels.formula.api as sm
@@ -82,7 +83,7 @@ popular ``(%>%)`` pipe operator for R_.
See the :ref:`documentation ` for more. (:issue:`10129`)
-.. _dplyr: https://github.com/hadley/dplyr
+.. _dplyr: https://github.com/tidyverse/dplyr
.. _magrittr: https://github.com/smbache/magrittr
.. _R: http://www.r-project.org
diff --git a/doc/source/whatsnew/v0.17.1.rst b/doc/source/whatsnew/v0.17.1.rst
index 6b0a28ec47568..774d17e6ff6b0 100644
--- a/doc/source/whatsnew/v0.17.1.rst
+++ b/doc/source/whatsnew/v0.17.1.rst
@@ -37,9 +37,7 @@ Conditional HTML formatting
.. warning::
This is a new feature and is under active development.
We'll be adding features and possibly making breaking changes in future
- releases. Feedback is welcome_.
-
-.. _welcome: https://github.com/pandas-dev/pandas/issues/11610
+ releases. Feedback is welcome in :issue:`11610`.
We've added *experimental* support for conditional HTML formatting:
the visual styling of a DataFrame based on the data.
diff --git a/doc/source/whatsnew/v0.18.0.rst b/doc/source/whatsnew/v0.18.0.rst
index 829c04dac9f2d..a05b9bb1a88ef 100644
--- a/doc/source/whatsnew/v0.18.0.rst
+++ b/doc/source/whatsnew/v0.18.0.rst
@@ -669,9 +669,9 @@ New signature
.. ipython:: python
- pd.Series([0,1]).rank(axis=0, method='average', numeric_only=None,
+ pd.Series([0,1]).rank(axis=0, method='average', numeric_only=False,
na_option='keep', ascending=True, pct=False)
- pd.DataFrame([0,1]).rank(axis=0, method='average', numeric_only=None,
+ pd.DataFrame([0,1]).rank(axis=0, method='average', numeric_only=False,
na_option='keep', ascending=True, pct=False)
diff --git a/doc/source/whatsnew/v0.19.2.rst b/doc/source/whatsnew/v0.19.2.rst
index bba89d78be869..db9d9e65c923d 100644
--- a/doc/source/whatsnew/v0.19.2.rst
+++ b/doc/source/whatsnew/v0.19.2.rst
@@ -18,7 +18,7 @@ We recommend that all users upgrade to this version.
Highlights include:
- Compatibility with Python 3.6
-- Added a `Pandas Cheat Sheet `__. (:issue:`13202`).
+- Added a `Pandas Cheat Sheet `__. (:issue:`13202`).
.. contents:: What's new in v0.19.2
diff --git a/doc/source/whatsnew/v0.20.0.rst b/doc/source/whatsnew/v0.20.0.rst
index 733995cc718dd..faf4b1ac44d5b 100644
--- a/doc/source/whatsnew/v0.20.0.rst
+++ b/doc/source/whatsnew/v0.20.0.rst
@@ -105,6 +105,7 @@ aggregations. This is similar to how groupby ``.agg()`` works. (:issue:`15015`)
df.dtypes
.. ipython:: python
+ :okwarning:
df.agg(['min', 'sum'])
@@ -187,7 +188,7 @@ support for bz2 compression in the python 2 C-engine improved (:issue:`14874`).
url = ('https://github.com/{repo}/raw/{branch}/{path}'
.format(repo='pandas-dev/pandas',
- branch='master',
+ branch='main',
path='pandas/tests/io/parser/data/salaries.csv.bz2'))
# default, infer compression
df = pd.read_csv(url, sep='\t', compression='infer')
@@ -248,11 +249,12 @@ or purely non-negative, integers. Previously, handling these integers would
result in improper rounding or data-type casting, leading to incorrect results.
Notably, a new numerical index, ``UInt64Index``, has been created (:issue:`14937`)
-.. ipython:: python
+.. code-block:: ipython
- idx = pd.UInt64Index([1, 2, 3])
- df = pd.DataFrame({'A': ['a', 'b', 'c']}, index=idx)
- df.index
+ In [1]: idx = pd.UInt64Index([1, 2, 3])
+ In [2]: df = pd.DataFrame({'A': ['a', 'b', 'c']}, index=idx)
+ In [3]: df.index
+ Out[3]: UInt64Index([1, 2, 3], dtype='uint64')
- Bug in converting object elements of array-like objects to unsigned 64-bit integers (:issue:`4471`, :issue:`14982`)
- Bug in ``Series.unique()`` in which unsigned 64-bit integers were causing overflow (:issue:`14721`)
@@ -326,7 +328,7 @@ more information about the data.
You must enable this by setting the ``display.html.table_schema`` option to ``True``.
.. _Table Schema: http://specs.frictionlessdata.io/json-table-schema/
-.. _nteract: http://nteract.io/
+.. _nteract: https://nteract.io/
.. _whatsnew_0200.enhancements.scipy_sparse:
diff --git a/doc/source/whatsnew/v0.23.0.rst b/doc/source/whatsnew/v0.23.0.rst
index f4caea9d363eb..be84c562b3c32 100644
--- a/doc/source/whatsnew/v0.23.0.rst
+++ b/doc/source/whatsnew/v0.23.0.rst
@@ -861,21 +861,21 @@ Previous behavior:
Current behavior:
-.. ipython:: python
+.. code-block:: ipython
- index = pd.Int64Index([-1, 0, 1])
+ In [12]: index = pd.Int64Index([-1, 0, 1])
# division by zero gives -infinity where negative,
# +infinity where positive, and NaN for 0 / 0
- index / 0
+ In [13]: index / 0
+ Out[13]: Float64Index([-inf, nan, inf], dtype='float64')
# The result of division by zero should not depend on
# whether the zero is int or float
- index / 0.0
+ In [14]: index / 0.0
+ Out[14]: Float64Index([-inf, nan, inf], dtype='float64')
- index = pd.UInt64Index([0, 1])
- index / np.array([0, 0], dtype=np.uint64)
+ In [15]: index = pd.UInt64Index([0, 1])
+ In [16]: index / np.array([0, 0], dtype=np.uint64)
+ Out[16]: Float64Index([nan, inf], dtype='float64')
- pd.RangeIndex(1, 5) / 0
+ In [17]: pd.RangeIndex(1, 5) / 0
+ Out[17]: Float64Index([inf, inf, inf, inf], dtype='float64')
.. _whatsnew_0230.api_breaking.extract:
diff --git a/doc/source/whatsnew/v0.25.0.rst b/doc/source/whatsnew/v0.25.0.rst
index 89c003f34f0cc..9cbfa49cc8c5c 100644
--- a/doc/source/whatsnew/v0.25.0.rst
+++ b/doc/source/whatsnew/v0.25.0.rst
@@ -473,10 +473,12 @@ considered commutative, such that ``A.union(B) == B.union(A)`` (:issue:`23525`).
*New behavior*:
-.. ipython:: python
+.. code-block:: ipython
- pd.period_range('19910905', periods=2).union(pd.Int64Index([1, 2, 3]))
- pd.Index([], dtype=object).union(pd.Index([1, 2, 3]))
+ In [3]: pd.period_range('19910905', periods=2).union(pd.Int64Index([1, 2, 3]))
+ Out[3]: Index([1991-09-05, 1991-09-06, 1, 2, 3], dtype='object')
+ In [4]: pd.Index([], dtype=object).union(pd.Index([1, 2, 3]))
+ Out[4]: Index([1, 2, 3], dtype='object')
Note that integer- and floating-dtype indexes are considered "compatible". The integer
values are coerced to floating point, which may result in loss of precision. See
diff --git a/doc/source/whatsnew/v0.5.0.rst b/doc/source/whatsnew/v0.5.0.rst
index 8757d9c887785..129b86dc1ce5b 100644
--- a/doc/source/whatsnew/v0.5.0.rst
+++ b/doc/source/whatsnew/v0.5.0.rst
@@ -28,7 +28,6 @@ New features
- :ref:`Added ` convenience ``set_index`` function for creating a DataFrame index from its existing columns
- :ref:`Implemented ` ``groupby`` hierarchical index level name (:issue:`223`)
- :ref:`Added ` support for different delimiters in ``DataFrame.to_csv`` (:issue:`244`)
-- TODO: DOCS ABOUT TAKE METHODS
Performance enhancements
~~~~~~~~~~~~~~~~~~~~~~~~
diff --git a/doc/source/whatsnew/v0.6.1.rst b/doc/source/whatsnew/v0.6.1.rst
index 139c6e2d1cb0c..4e72a630ad9f1 100644
--- a/doc/source/whatsnew/v0.6.1.rst
+++ b/doc/source/whatsnew/v0.6.1.rst
@@ -6,7 +6,7 @@ Version 0.6.1 (December 13, 2011)
New features
~~~~~~~~~~~~
-- Can :ref:`append single rows ` (as Series) to a DataFrame
+- Can append single rows (as Series) to a DataFrame
- Add Spearman and Kendall rank :ref:`correlation `
options to Series.corr and DataFrame.corr (:issue:`428`)
- :ref:`Added ` ``get_value`` and ``set_value`` methods to
diff --git a/doc/source/whatsnew/v0.7.0.rst b/doc/source/whatsnew/v0.7.0.rst
index 52747f2992dc4..1b947030ab8ab 100644
--- a/doc/source/whatsnew/v0.7.0.rst
+++ b/doc/source/whatsnew/v0.7.0.rst
@@ -19,7 +19,7 @@ New features
intersection of the other axes. Improves performance of ``Series.append`` and
``DataFrame.append`` (:issue:`468`, :issue:`479`, :issue:`273`)
-- :ref:`Can ` pass multiple DataFrames to
+- Can pass multiple DataFrames to
``DataFrame.append`` to concatenate (stack) and multiple Series to
``Series.append`` too
diff --git a/doc/source/whatsnew/v1.0.0.rst b/doc/source/whatsnew/v1.0.0.rst
index b87274307431b..03dfe475475a1 100755
--- a/doc/source/whatsnew/v1.0.0.rst
+++ b/doc/source/whatsnew/v1.0.0.rst
@@ -338,19 +338,20 @@ maps labels to their new names along the default axis, is allowed to be passed b
*pandas 0.25.x*
-.. code-block:: python
+.. code-block:: ipython
- >>> df = pd.DataFrame([[1]])
- >>> df.rename({0: 1}, {0: 2})
+ In [1]: df = pd.DataFrame([[1]])
+ In [2]: df.rename({0: 1}, {0: 2})
+ Out[2]:
FutureWarning: ...Use named arguments to resolve ambiguity...
2
1 1
*pandas 1.0.0*
-.. code-block:: python
+.. code-block:: ipython
- >>> df.rename({0: 1}, {0: 2})
+ In [3]: df.rename({0: 1}, {0: 2})
Traceback (most recent call last):
...
TypeError: rename() takes from 1 to 2 positional arguments but 3 were given
@@ -359,26 +360,28 @@ Note that errors will now be raised when conflicting or potentially ambiguous ar
*pandas 0.25.x*
-.. code-block:: python
+.. code-block:: ipython
- >>> df.rename({0: 1}, index={0: 2})
+ In [4]: df.rename({0: 1}, index={0: 2})
+ Out[4]:
0
1 1
- >>> df.rename(mapper={0: 1}, index={0: 2})
+ In [5]: df.rename(mapper={0: 1}, index={0: 2})
+ Out[5]:
0
2 1
*pandas 1.0.0*
-.. code-block:: python
+.. code-block:: ipython
- >>> df.rename({0: 1}, index={0: 2})
+ In [6]: df.rename({0: 1}, index={0: 2})
Traceback (most recent call last):
...
TypeError: Cannot specify both 'mapper' and any of 'index' or 'columns'
- >>> df.rename(mapper={0: 1}, index={0: 2})
+ In [7]: df.rename(mapper={0: 1}, index={0: 2})
Traceback (most recent call last):
...
TypeError: Cannot specify both 'mapper' and any of 'index' or 'columns'
@@ -405,12 +408,12 @@ Extended verbose info output for :class:`~pandas.DataFrame`
*pandas 0.25.x*
-.. code-block:: python
+.. code-block:: ipython
- >>> df = pd.DataFrame({"int_col": [1, 2, 3],
+ In [1]: df = pd.DataFrame({"int_col": [1, 2, 3],
... "text_col": ["a", "b", "c"],
... "float_col": [0.0, 0.1, 0.2]})
- >>> df.info(verbose=True)
+ In [2]: df.info(verbose=True)
RangeIndex: 3 entries, 0 to 2
Data columns (total 3 columns):
@@ -440,14 +443,16 @@ Extended verbose info output for :class:`~pandas.DataFrame`
*pandas 0.25.x*
-.. code-block:: python
+.. code-block:: ipython
- >>> pd.array(["a", None])
+ In [1]: pd.array(["a", None])
+ Out[1]:
['a', None]
Length: 2, dtype: object
- >>> pd.array([1, None])
+ In [2]: pd.array([1, None])
+ Out[2]:
[1, None]
Length: 2, dtype: object
@@ -470,15 +475,17 @@ As a reminder, you can specify the ``dtype`` to disable all inference.
*pandas 0.25.x*
-.. code-block:: python
+.. code-block:: ipython
- >>> a = pd.array([1, 2, None], dtype="Int64")
- >>> a
+ In [1]: a = pd.array([1, 2, None], dtype="Int64")
+ In [2]: a
+ Out[2]:
[1, 2, NaN]
Length: 3, dtype: Int64
- >>> a[2]
+ In [3]: a[2]
+ Out[3]:
nan
*pandas 1.0.0*
@@ -499,9 +506,10 @@ will now raise.
*pandas 0.25.x*
-.. code-block:: python
+.. code-block:: ipython
- >>> np.asarray(a, dtype="float")
+ In [1]: np.asarray(a, dtype="float")
+ Out[1]:
array([ 1., 2., nan])
*pandas 1.0.0*
@@ -525,9 +533,10 @@ will now be ``pd.NA`` instead of ``np.nan`` in presence of missing values
*pandas 0.25.x*
-.. code-block:: python
+.. code-block:: ipython
- >>> pd.Series(a).sum(skipna=False)
+ In [1]: pd.Series(a).sum(skipna=False)
+ Out[1]:
nan
*pandas 1.0.0*
@@ -543,9 +552,10 @@ integer dtype for the values.
*pandas 0.25.x*
-.. code-block:: python
+.. code-block:: ipython
- >>> pd.Series([2, 1, 1, None], dtype="Int64").value_counts().dtype
+ In [1]: pd.Series([2, 1, 1, None], dtype="Int64").value_counts().dtype
+ Out[1]:
dtype('int64')
*pandas 1.0.0*
@@ -565,15 +575,17 @@ Comparison operations on a :class:`arrays.IntegerArray` now returns a
*pandas 0.25.x*
-.. code-block:: python
+.. code-block:: ipython
- >>> a = pd.array([1, 2, None], dtype="Int64")
- >>> a
+ In [1]: a = pd.array([1, 2, None], dtype="Int64")
+ In [2]: a
+ Out[2]:
[1, 2, NaN]
Length: 3, dtype: Int64
- >>> a > 1
+ In [3]: a > 1
+ Out[3]:
array([False, True, False])
*pandas 1.0.0*
@@ -640,9 +652,10 @@ scalar values in the result are instances of the extension dtype's scalar type.
*pandas 0.25.x*
-.. code-block:: python
+.. code-block:: ipython
- >>> df.resample("2D").agg(lambda x: 'a').A.dtype
+ In [1]: df.resample("2D").agg(lambda x: 'a').A.dtype
+ Out[1]:
CategoricalDtype(categories=['a', 'b'], ordered=False)
*pandas 1.0.0*
@@ -657,9 +670,10 @@ depending on how the results are cast back to the original dtype.
*pandas 0.25.x*
-.. code-block:: python
+.. code-block:: ipython
- >>> df.resample("2D").agg(lambda x: 'c')
+ In [1]: df.resample("2D").agg(lambda x: 'c')
+ Out[1]:
A
0 NaN
@@ -871,10 +885,10 @@ matplotlib directly rather than :meth:`~DataFrame.plot`.
To use pandas formatters with a matplotlib plot, specify
-.. code-block:: python
+.. code-block:: ipython
- >>> import pandas as pd
- >>> pd.options.plotting.matplotlib.register_converters = True
+ In [1]: import pandas as pd
+ In [2]: pd.options.plotting.matplotlib.register_converters = True
Note that plots created by :meth:`DataFrame.plot` and :meth:`Series.plot` *do* register the converters
automatically. The only behavior change is when plotting a date-like object via ``matplotlib.pyplot.plot``
diff --git a/doc/source/whatsnew/v1.1.0.rst b/doc/source/whatsnew/v1.1.0.rst
index 9f3ccb3e14116..ebd76d97e78b3 100644
--- a/doc/source/whatsnew/v1.1.0.rst
+++ b/doc/source/whatsnew/v1.1.0.rst
@@ -265,7 +265,7 @@ SSH, FTP, dropbox and github. For docs and capabilities, see the `fsspec docs`_.
The existing capability to interface with S3 and GCS will be unaffected by this
change, as ``fsspec`` will still bring in the same packages as before.
-.. _Azure Datalake and Blob: https://github.com/dask/adlfs
+.. _Azure Datalake and Blob: https://github.com/fsspec/adlfs
.. _fsspec docs: https://filesystem-spec.readthedocs.io/en/latest/
diff --git a/doc/source/whatsnew/v1.2.0.rst b/doc/source/whatsnew/v1.2.0.rst
index 36b591c3c3142..3d3ec53948a01 100644
--- a/doc/source/whatsnew/v1.2.0.rst
+++ b/doc/source/whatsnew/v1.2.0.rst
@@ -150,6 +150,7 @@ and a short caption (:issue:`36267`).
The keyword ``position`` has been added to set the position.
.. ipython:: python
+ :okwarning:
data = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
table = data.to_latex(position='ht')
@@ -161,6 +162,7 @@ one can optionally provide a tuple ``(full_caption, short_caption)``
to add a short caption macro.
.. ipython:: python
+ :okwarning:
data = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
table = data.to_latex(caption=('the full long caption', 'short caption'))
diff --git a/doc/source/whatsnew/v1.2.1.rst b/doc/source/whatsnew/v1.2.1.rst
index bfe30d52e2aff..34e28eab6d4bf 100644
--- a/doc/source/whatsnew/v1.2.1.rst
+++ b/doc/source/whatsnew/v1.2.1.rst
@@ -52,20 +52,23 @@ DataFrame / Series combination) would ignore the indices, only match
the inputs by shape, and use the index/columns of the first DataFrame for
the result:
-.. code-block:: python
+.. code-block:: ipython
- >>> df1 = pd.DataFrame({"a": [1, 2], "b": [3, 4]}, index=[0, 1])
- ... df2 = pd.DataFrame({"a": [1, 2], "b": [3, 4]}, index=[1, 2])
- >>> df1
+ In [1]: df1 = pd.DataFrame({"a": [1, 2], "b": [3, 4]}, index=[0, 1])
+ In [2]: df2 = pd.DataFrame({"a": [1, 2], "b": [3, 4]}, index=[1, 2])
+ In [3]: df1
+ Out[3]:
a b
0 1 3
1 2 4
- >>> df2
+ In [4]: df2
+ Out[4]:
a b
1 1 3
2 2 4
- >>> np.add(df1, df2)
+ In [5]: np.add(df1, df2)
+ Out[5]:
a b
0 2 6
1 4 8
@@ -73,9 +76,10 @@ the result:
This contrasts with how other pandas operations work, which first align
the inputs:
-.. code-block:: python
+.. code-block:: ipython
- >>> df1 + df2
+ In [6]: df1 + df2
+ Out[6]:
a b
0 NaN NaN
1 3.0 7.0
@@ -94,9 +98,10 @@ objects (eg ``np.add(s1, s2)``) already aligns and continues to do so.
To avoid the warning and keep the current behaviour of ignoring the indices,
convert one of the arguments to a NumPy array:
-.. code-block:: python
+.. code-block:: ipython
- >>> np.add(df1, np.asarray(df2))
+ In [7]: np.add(df1, np.asarray(df2))
+ Out[7]:
a b
0 2 6
1 4 8
@@ -104,10 +109,11 @@ convert one of the arguments to a NumPy array:
To obtain the future behaviour and silence the warning, you can align manually
before passing the arguments to the ufunc:
-.. code-block:: python
+.. code-block:: ipython
- >>> df1, df2 = df1.align(df2)
- >>> np.add(df1, df2)
+ In [8]: df1, df2 = df1.align(df2)
+ In [9]: np.add(df1, df2)
+ Out[9]:
a b
0 NaN NaN
1 3.0 7.0
diff --git a/doc/source/whatsnew/v1.2.5.rst b/doc/source/whatsnew/v1.2.5.rst
index d0af23b48b1f7..d3ceb2b919b5d 100644
--- a/doc/source/whatsnew/v1.2.5.rst
+++ b/doc/source/whatsnew/v1.2.5.rst
@@ -1,7 +1,7 @@
.. _whatsnew_125:
-What's new in 1.2.5 (May ??, 2021)
-----------------------------------
+What's new in 1.2.5 (June 22, 2021)
+-----------------------------------
These are the changes in pandas 1.2.5. See :ref:`release` for a full changelog
including other versions of pandas.
@@ -14,32 +14,12 @@ including other versions of pandas.
Fixed regressions
~~~~~~~~~~~~~~~~~
-- Regression in :func:`concat` between two :class:`DataFrames` where one has an :class:`Index` that is all-None and the other is :class:`DatetimeIndex` incorrectly raising (:issue:`40841`)
+- Fixed regression in :func:`concat` between two :class:`DataFrame` where one has an :class:`Index` that is all-None and the other is :class:`DatetimeIndex` incorrectly raising (:issue:`40841`)
- Fixed regression in :meth:`DataFrame.sum` and :meth:`DataFrame.prod` when ``min_count`` and ``numeric_only`` are both given (:issue:`41074`)
-- Regression in :func:`read_csv` when using ``memory_map=True`` with an non-UTF8 encoding (:issue:`40986`)
-- Regression in :meth:`DataFrame.replace` and :meth:`Series.replace` when the values to replace is a NumPy float array (:issue:`40371`)
-- Regression in :func:`ExcelFile` when a corrupt file is opened but not closed (:issue:`41778`)
-
-.. ---------------------------------------------------------------------------
-
-
-.. _whatsnew_125.bug_fixes:
-
-Bug fixes
-~~~~~~~~~
-
--
--
-
-.. ---------------------------------------------------------------------------
-
-.. _whatsnew_125.other:
-
-Other
-~~~~~
-
--
--
+- Fixed regression in :func:`read_csv` when using ``memory_map=True`` with a non-UTF8 encoding (:issue:`40986`)
+- Fixed regression in :meth:`DataFrame.replace` and :meth:`Series.replace` when the values to replace are a NumPy float array (:issue:`40371`)
+- Fixed regression in :func:`ExcelFile` when a corrupt file is opened but not closed (:issue:`41778`)
+- Fixed regression in :meth:`DataFrame.astype` with ``dtype=str`` failing to convert ``NaN`` in categorical columns (:issue:`41797`)
.. ---------------------------------------------------------------------------
diff --git a/doc/source/whatsnew/v1.3.0.rst b/doc/source/whatsnew/v1.3.0.rst
index dd95f9088e3da..a392aeb5274c2 100644
--- a/doc/source/whatsnew/v1.3.0.rst
+++ b/doc/source/whatsnew/v1.3.0.rst
@@ -1,7 +1,7 @@
.. _whatsnew_130:
-What's new in 1.3.0 (??)
-------------------------
+What's new in 1.3.0 (July 2, 2021)
+----------------------------------
These are the changes in pandas 1.3.0. See :ref:`release` for a full changelog
including other versions of pandas.
@@ -124,7 +124,7 @@ which has been revised and improved (:issue:`39720`, :issue:`39317`, :issue:`404
- The methods :meth:`.Styler.highlight_null`, :meth:`.Styler.highlight_min`, and :meth:`.Styler.highlight_max` now allow custom CSS highlighting instead of the default background coloring (:issue:`40242`)
- :meth:`.Styler.apply` now accepts functions that return an ``ndarray`` when ``axis=None``, making it now consistent with the ``axis=0`` and ``axis=1`` behavior (:issue:`39359`)
- When incorrectly formatted CSS is given via :meth:`.Styler.apply` or :meth:`.Styler.applymap`, an error is now raised upon rendering (:issue:`39660`)
- - :meth:`.Styler.format` now accepts the keyword argument ``escape`` for optional HTML and LaTex escaping (:issue:`40388`, :issue:`41619`)
+ - :meth:`.Styler.format` now accepts the keyword argument ``escape`` for optional HTML and LaTeX escaping (:issue:`40388`, :issue:`41619`)
- :meth:`.Styler.background_gradient` has gained the argument ``gmap`` to supply a specific gradient map for shading (:issue:`22727`)
- :meth:`.Styler.clear` now clears :attr:`Styler.hidden_index` and :attr:`Styler.hidden_columns` as well (:issue:`40484`)
- Added the method :meth:`.Styler.highlight_between` (:issue:`39821`)
@@ -136,8 +136,9 @@ which has been revised and improved (:issue:`39720`, :issue:`39317`, :issue:`404
- Many features of the :class:`.Styler` class are now either partially or fully usable on a DataFrame with non-unique indexes or columns (:issue:`41143`)
- One has greater control of the display through separate sparsification of the index or columns using the :ref:`new styler options `, which are also usable via :func:`option_context` (:issue:`41142`)
- Added the option ``styler.render.max_elements`` to avoid browser overload when styling large DataFrames (:issue:`40712`)
- - Added the method :meth:`.Styler.to_latex` (:issue:`21673`)
+ - Added the method :meth:`.Styler.to_latex` (:issue:`21673`, :issue:`42320`), which also allows some limited CSS conversion (:issue:`40731`)
- Added the method :meth:`.Styler.to_html` (:issue:`13379`)
+ - Added the method :meth:`.Styler.set_sticky` to make index and column headers permanently visible in scrolling HTML frames (:issue:`29072`)
.. _whatsnew_130.enhancements.dataframe_honors_copy_with_dict:
@@ -246,11 +247,12 @@ Other enhancements
- Improved error message when ``usecols`` and ``names`` do not match for :func:`read_csv` and ``engine="c"`` (:issue:`29042`)
- Improved consistency of error messages when passing an invalid ``win_type`` argument in :ref:`Window methods ` (:issue:`15969`)
- :func:`read_sql_query` now accepts a ``dtype`` argument to cast the columnar data from the SQL database based on user input (:issue:`10285`)
+- :func:`read_csv` now raises a ``ParserWarning`` if the length of the header or the given names does not match the length of the data when ``usecols`` is not specified (:issue:`21768`)
- Improved integer type mapping from pandas to SQLAlchemy when using :meth:`DataFrame.to_sql` (:issue:`35076`)
- :func:`to_numeric` now supports downcasting of nullable ``ExtensionDtype`` objects (:issue:`33013`)
- Added support for dict-like names in :class:`MultiIndex.set_names` and :class:`MultiIndex.rename` (:issue:`20421`)
- :func:`read_excel` can now auto-detect .xlsb files and older .xls files (:issue:`35416`, :issue:`41225`)
-- :class:`ExcelWriter` now accepts an ``if_sheet_exists`` parameter to control the behaviour of append mode when writing to existing sheets (:issue:`40230`)
+- :class:`ExcelWriter` now accepts an ``if_sheet_exists`` parameter to control the behavior of append mode when writing to existing sheets (:issue:`40230`)
- :meth:`.Rolling.sum`, :meth:`.Expanding.sum`, :meth:`.Rolling.mean`, :meth:`.Expanding.mean`, :meth:`.ExponentialMovingWindow.mean`, :meth:`.Rolling.median`, :meth:`.Expanding.median`, :meth:`.Rolling.max`, :meth:`.Expanding.max`, :meth:`.Rolling.min`, and :meth:`.Expanding.min` now support `Numba `_ execution with the ``engine`` keyword (:issue:`38895`, :issue:`41267`)
- :meth:`DataFrame.apply` can now accept NumPy unary operators as strings, e.g. ``df.apply("sqrt")``, which was already the case for :meth:`Series.apply` (:issue:`39116`)
- :meth:`DataFrame.apply` can now accept non-callable DataFrame properties as strings, e.g. ``df.apply("size")``, which was already the case for :meth:`Series.apply` (:issue:`39116`)
@@ -267,12 +269,16 @@ Other enhancements
- :meth:`read_csv` and :meth:`read_json` expose the argument ``encoding_errors`` to control how encoding errors are handled (:issue:`39450`)
- :meth:`.GroupBy.any` and :meth:`.GroupBy.all` use Kleene logic with nullable data types (:issue:`37506`)
- :meth:`.GroupBy.any` and :meth:`.GroupBy.all` return a ``BooleanDtype`` for columns with nullable data types (:issue:`33449`)
+- :meth:`.GroupBy.any` and :meth:`.GroupBy.all` no longer raise with ``object`` data containing ``pd.NA`` even when ``skipna=True`` (:issue:`37501`)
- :meth:`.GroupBy.rank` now supports object-dtype data (:issue:`38278`)
- Constructing a :class:`DataFrame` or :class:`Series` with the ``data`` argument being a Python iterable that is *not* a NumPy ``ndarray`` consisting of NumPy scalars will now result in a dtype with a precision the maximum of the NumPy scalars; this was already the case when ``data`` is a NumPy ``ndarray`` (:issue:`40908`)
- Add keyword ``sort`` to :func:`pivot_table` to allow non-sorting of the result (:issue:`39143`)
- Add keyword ``dropna`` to :meth:`DataFrame.value_counts` to allow counting rows that include ``NA`` values (:issue:`41325`)
- :meth:`Series.replace` will now cast results to ``PeriodDtype`` where possible instead of ``object`` dtype (:issue:`41526`)
- Improved error message in ``corr`` and ``cov`` methods on :class:`.Rolling`, :class:`.Expanding`, and :class:`.ExponentialMovingWindow` when ``other`` is not a :class:`DataFrame` or :class:`Series` (:issue:`41741`)
+- :meth:`Series.between` can now accept ``left`` or ``right`` as arguments to ``inclusive`` to include only the left or right boundary (:issue:`40245`)
+- :meth:`DataFrame.explode` now supports exploding multiple columns. Its ``column`` argument now also accepts a list of str or tuples for exploding on multiple columns at the same time; see the sketch after this list (:issue:`39240`)
+- :meth:`DataFrame.sample` now accepts the ``ignore_index`` argument to reset the index after sampling, similar to :meth:`DataFrame.drop_duplicates` and :meth:`DataFrame.sort_values` (:issue:`38581`)
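+
+For example, a minimal sketch of the multi-column :meth:`DataFrame.explode`
+(the frame below is illustrative):
+
+.. code-block:: python
+
+   df = pd.DataFrame({"A": [[0, 1], [2, 3]], "B": [[4, 5], [6, 7]]})
+   # Matching list elements in "A" and "B" are unpacked in parallel, row by row.
+   df.explode(["A", "B"])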
.. ---------------------------------------------------------------------------
@@ -301,7 +307,7 @@ As an example of this, given:
original = pd.Series(cat)
unique = original.unique()
-*pandas < 1.3.0*:
+*Previous behavior*:
.. code-block:: ipython
@@ -311,7 +317,7 @@ As an example of this, given:
In [2]: original.dtype == unique.dtype
False
-*pandas >= 1.3.0*
+*New behavior*:
.. ipython:: python
@@ -333,7 +339,7 @@ Preserve dtypes in :meth:`DataFrame.combine_first`
df2
combined = df1.combine_first(df2)
-*pandas 1.2.x*
+*Previous behavior*:
.. code-block:: ipython
@@ -344,7 +350,7 @@ Preserve dtypes in :meth:`DataFrame.combine_first`
C float64
dtype: object
-*pandas 1.3.0*
+*New behavior*:
.. ipython:: python
@@ -367,7 +373,7 @@ values as measured by ``np.allclose``. Now no such casting occurs.
df = pd.DataFrame({'key': [1, 1], 'a': [True, False], 'b': [True, True]})
df
-*pandas 1.2.x*
+*Previous behavior*:
.. code-block:: ipython
@@ -377,7 +383,7 @@ values as measured by ``np.allclose``. Now no such casting occurs.
key
1 True 2
-*pandas 1.3.0*
+*New behavior*:
.. ipython:: python
@@ -395,7 +401,7 @@ Now, these methods will always return a float dtype. (:issue:`41137`)
df = pd.DataFrame({'a': [True], 'b': [1], 'c': [1.0]})
-*pandas 1.2.x*
+*Previous behavior*:
.. code-block:: ipython
@@ -404,7 +410,7 @@ Now, these methods will always return a float dtype. (:issue:`41137`)
a b c
0 True 1 1.0
-*pandas 1.3.0*
+*New behavior*:
.. ipython:: python
@@ -428,7 +434,7 @@ insert the values into the existing data rather than create an entirely new arra
In both the new and old behavior, the data in ``values`` is overwritten, but in
the old behavior the dtype of ``df["A"]`` changed to ``int64``.
-*pandas 1.2.x*
+*Previous behavior*:
.. code-block:: ipython
@@ -443,7 +449,7 @@ the old behavior the dtype of ``df["A"]`` changed to ``int64``.
In pandas 1.3.0, ``df`` continues to share data with ``values``
-*pandas 1.3.0*
+*New behavior*:
.. ipython:: python
@@ -470,7 +476,7 @@ never casting to the dtypes of the existing arrays.
In the old behavior, ``5`` was cast to ``float64`` and inserted into the existing
array backing ``df``:
-*pandas 1.2.x*
+*Previous behavior*:
.. code-block:: ipython
@@ -480,7 +486,7 @@ array backing ``df``:
In the new behavior, we get a new array, and retain an integer-dtyped ``5``:
-*pandas 1.3.0*
+*New behavior*:
.. ipython:: python
@@ -503,7 +509,7 @@ casts to ``dtype=object`` (:issue:`38709`)
ser2 = orig.copy()
ser2.iloc[1] = 2.0
-*pandas 1.2.x*
+*Previous behavior*:
.. code-block:: ipython
@@ -519,7 +525,7 @@ casts to ``dtype=object`` (:issue:`38709`)
1 2.0
dtype: object
-*pandas 1.3.0*
+*New behavior*:
.. ipython:: python
@@ -637,7 +643,7 @@ If installed, we now require:
+-----------------+-----------------+----------+---------+
| pytest (dev) | 6.0 | | X |
+-----------------+-----------------+----------+---------+
-| mypy (dev) | 0.800 | | X |
+| mypy (dev) | 0.812 | | X |
+-----------------+-----------------+----------+---------+
| setuptools | 38.6.0 | | X |
+-----------------+-----------------+----------+---------+
@@ -701,6 +707,8 @@ Other API changes
- Added new ``engine`` and ``**engine_kwargs`` parameters to :meth:`DataFrame.to_sql` to support other future "SQL engines". Currently we still only use ``SQLAlchemy`` under the hood, but more engines are planned to be supported such as `turbodbc `_ (:issue:`36893`)
- Removed redundant ``freq`` from :class:`PeriodIndex` string representation (:issue:`41653`)
- :meth:`ExtensionDtype.construct_array_type` is now a required method instead of an optional one for :class:`ExtensionDtype` subclasses (:issue:`24860`)
+- Calling ``hash`` on non-hashable pandas objects will now raise ``TypeError`` with the built-in error message (e.g. ``unhashable type: 'Series'``). Previously it would raise a custom message such as ``'Series' objects are mutable, thus they cannot be hashed``. Furthermore, ``isinstance(, collections.abc.Hashable)`` will now return ``False``; see the sketch after this list (:issue:`40013`)
+- :meth:`.Styler.from_custom_template` now has two new arguments for template names, and removed the old ``name``, due to template inheritance having been introduced for better parsing (:issue:`42053`). Corresponding modifications to ``Styler`` attributes are also needed when subclassing.
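+
+A short sketch of the new ``hash`` behavior (the traceback is abbreviated):
+
+.. code-block:: ipython
+
+   In [1]: hash(pd.Series([1]))
+   Traceback (most recent call last):
+   ...
+   TypeError: unhashable type: 'Series'
+
+   In [2]: from collections import abc
+
+   In [3]: isinstance(pd.Series([1]), abc.Hashable)
+   Out[3]: False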
.. _whatsnew_130.api_breaking.build:
@@ -714,64 +722,6 @@ Build
Deprecations
~~~~~~~~~~~~
-- Deprecated allowing scalars to be passed to the :class:`Categorical` constructor (:issue:`38433`)
-- Deprecated constructing :class:`CategoricalIndex` without passing list-like data (:issue:`38944`)
-- Deprecated allowing subclass-specific keyword arguments in the :class:`Index` constructor, use the specific subclass directly instead (:issue:`14093`, :issue:`21311`, :issue:`22315`, :issue:`26974`)
-- Deprecated the :meth:`astype` method of datetimelike (``timedelta64[ns]``, ``datetime64[ns]``, ``Datetime64TZDtype``, ``PeriodDtype``) to convert to integer dtypes, use ``values.view(...)`` instead (:issue:`38544`)
-- Deprecated :meth:`MultiIndex.is_lexsorted` and :meth:`MultiIndex.lexsort_depth`, use :meth:`MultiIndex.is_monotonic_increasing` instead (:issue:`32259`)
-- Deprecated keyword ``try_cast`` in :meth:`Series.where`, :meth:`Series.mask`, :meth:`DataFrame.where`, :meth:`DataFrame.mask`; cast results manually if desired (:issue:`38836`)
-- Deprecated comparison of :class:`Timestamp` objects with ``datetime.date`` objects. Instead of e.g. ``ts <= mydate`` use ``ts <= pd.Timestamp(mydate)`` or ``ts.date() <= mydate`` (:issue:`36131`)
-- Deprecated :attr:`Rolling.win_type` returning ``"freq"`` (:issue:`38963`)
-- Deprecated :attr:`Rolling.is_datetimelike` (:issue:`38963`)
-- Deprecated :class:`DataFrame` indexer for :meth:`Series.__setitem__` and :meth:`DataFrame.__setitem__` (:issue:`39004`)
-- Deprecated :meth:`ExponentialMovingWindow.vol` (:issue:`39220`)
-- Using ``.astype`` to convert between ``datetime64[ns]`` dtype and :class:`DatetimeTZDtype` is deprecated and will raise in a future version, use ``obj.tz_localize`` or ``obj.dt.tz_localize`` instead (:issue:`38622`)
-- Deprecated casting ``datetime.date`` objects to ``datetime64`` when used as ``fill_value`` in :meth:`DataFrame.unstack`, :meth:`DataFrame.shift`, :meth:`Series.shift`, and :meth:`DataFrame.reindex`, pass ``pd.Timestamp(dateobj)`` instead (:issue:`39767`)
-- Deprecated :meth:`.Styler.set_na_rep` and :meth:`.Styler.set_precision` in favour of :meth:`.Styler.format` with ``na_rep`` and ``precision`` as existing and new input arguments respectively (:issue:`40134`, :issue:`40425`)
-- Deprecated allowing partial failure in :meth:`Series.transform` and :meth:`DataFrame.transform` when ``func`` is list-like or dict-like and raises anything but ``TypeError``; ``func`` raising anything but a ``TypeError`` will raise in a future version (:issue:`40211`)
-- Deprecated arguments ``error_bad_lines`` and ``warn_bad_lines`` in :meth:`read_csv` and :meth:`read_table` in favor of argument ``on_bad_lines`` (:issue:`15122`)
-- Deprecated support for ``np.ma.mrecords.MaskedRecords`` in the :class:`DataFrame` constructor, pass ``{name: data[name] for name in data.dtype.names}`` instead (:issue:`40363`)
-- Deprecated using :func:`merge`, :meth:`DataFrame.merge`, and :meth:`DataFrame.join` on a different number of levels (:issue:`34862`)
-- Deprecated the use of ``**kwargs`` in :class:`.ExcelWriter`; use the keyword argument ``engine_kwargs`` instead (:issue:`40430`)
-- Deprecated the ``level`` keyword for :class:`DataFrame` and :class:`Series` aggregations; use groupby instead (:issue:`39983`)
-- Deprecated the ``inplace`` parameter of :meth:`Categorical.remove_categories`, :meth:`Categorical.add_categories`, :meth:`Categorical.reorder_categories`, :meth:`Categorical.rename_categories`, :meth:`Categorical.set_categories` and will be removed in a future version (:issue:`37643`)
-- Deprecated :func:`merge` producing duplicated columns through the ``suffixes`` keyword and already existing columns (:issue:`22818`)
-- Deprecated setting :attr:`Categorical._codes`, create a new :class:`Categorical` with the desired codes instead (:issue:`40606`)
-- Deprecated the ``convert_float`` optional argument in :func:`read_excel` and :meth:`ExcelFile.parse` (:issue:`41127`)
-- Deprecated behavior of :meth:`DatetimeIndex.union` with mixed timezones; in a future version both will be cast to UTC instead of object dtype (:issue:`39328`)
-- Deprecated using ``usecols`` with out of bounds indices for :func:`read_csv` with ``engine="c"`` (:issue:`25623`)
-- Deprecated special treatment of lists with first element a Categorical in the :class:`DataFrame` constructor; pass as ``pd.DataFrame({col: categorical, ...})`` instead (:issue:`38845`)
-- Deprecated behavior of :class:`DataFrame` constructor when a ``dtype`` is passed and the data cannot be cast to that dtype. In a future version, this will raise instead of being silently ignored (:issue:`24435`)
-- Deprecated the :attr:`Timestamp.freq` attribute. For the properties that use it (``is_month_start``, ``is_month_end``, ``is_quarter_start``, ``is_quarter_end``, ``is_year_start``, ``is_year_end``), when you have a ``freq``, use e.g. ``freq.is_month_start(ts)`` (:issue:`15146`)
-- Deprecated construction of :class:`Series` or :class:`DataFrame` with ``DatetimeTZDtype`` data and ``datetime64[ns]`` dtype. Use ``Series(data).dt.tz_localize(None)`` instead (:issue:`41555`, :issue:`33401`)
-- Deprecated behavior of :class:`Series` construction with large-integer values and small-integer dtype silently overflowing; use ``Series(data).astype(dtype)`` instead (:issue:`41734`)
-- Deprecated behavior of :class:`DataFrame` construction with floating data and integer dtype casting even when lossy; in a future version this will remain floating, matching :class:`Series` behavior (:issue:`41770`)
-- Deprecated inference of ``timedelta64[ns]``, ``datetime64[ns]``, or ``DatetimeTZDtype`` dtypes in :class:`Series` construction when data containing strings is passed and no ``dtype`` is passed (:issue:`33558`)
-- In a future version, constructing :class:`Series` or :class:`DataFrame` with ``datetime64[ns]`` data and ``DatetimeTZDtype`` will treat the data as wall-times instead of as UTC times (matching DatetimeIndex behavior). To treat the data as UTC times, use ``pd.Series(data).dt.tz_localize("UTC").dt.tz_convert(dtype.tz)`` or ``pd.Series(data.view("int64"), dtype=dtype)`` (:issue:`33401`)
-- Deprecated passing lists as ``key`` to :meth:`DataFrame.xs` and :meth:`Series.xs` (:issue:`41760`)
-- Deprecated passing arguments as positional for all of the following, with exceptions noted (:issue:`41485`):
- - :func:`concat` (other than ``objs``)
- - :func:`read_csv` (other than ``filepath_or_buffer``)
- - :func:`read_table` (other than ``filepath_or_buffer``)
- - :meth:`DataFrame.clip` and :meth:`Series.clip` (other than ``upper`` and ``lower``)
- - :meth:`DataFrame.drop_duplicates` (except for ``subset``), :meth:`Series.drop_duplicates`, :meth:`Index.drop_duplicates` and :meth:`MultiIndex.drop_duplicates`
- - :meth:`DataFrame.drop` (other than ``labels``) and :meth:`Series.drop`
- - :meth:`DataFrame.dropna` and :meth:`Series.dropna`
- - :meth:`DataFrame.ffill`, :meth:`Series.ffill`, :meth:`DataFrame.bfill`, and :meth:`Series.bfill`
- - :meth:`DataFrame.fillna` and :meth:`Series.fillna` (apart from ``value``)
- - :meth:`DataFrame.interpolate` and :meth:`Series.interpolate` (other than ``method``)
- - :meth:`DataFrame.mask` and :meth:`Series.mask` (other than ``cond`` and ``other``)
- - :meth:`DataFrame.reset_index` (other than ``level``) and :meth:`Series.reset_index`
- - :meth:`DataFrame.set_axis` and :meth:`Series.set_axis` (other than ``labels``)
- - :meth:`DataFrame.set_index` (other than ``keys``)
- - :meth:`DataFrame.sort_index` and :meth:`Series.sort_index`
- - :meth:`DataFrame.sort_values` (other than ``by``) and :meth:`Series.sort_values`
- - :meth:`DataFrame.where` and :meth:`Series.where` (other than ``cond`` and ``other``)
- - :meth:`Index.set_names` and :meth:`MultiIndex.set_names` (except for ``names``)
- - :meth:`MultiIndex.codes` (except for ``codes``)
- - :meth:`MultiIndex.set_levels` (except for ``levels``)
- - :meth:`Resampler.interpolate` (other than ``method``)
-
.. _whatsnew_130.deprecations.nuisance_columns:
@@ -840,6 +790,8 @@ For example:
1 2
2 12
+*Future behavior*:
+
.. code-block:: ipython
In [5]: gb.prod(numeric_only=False)
@@ -852,6 +804,72 @@ For example:
1 2
2 12
+.. _whatsnew_130.deprecations.other:
+
+Other Deprecations
+^^^^^^^^^^^^^^^^^^
+- Deprecated allowing scalars to be passed to the :class:`Categorical` constructor (:issue:`38433`)
+- Deprecated constructing :class:`CategoricalIndex` without passing list-like data (:issue:`38944`)
+- Deprecated allowing subclass-specific keyword arguments in the :class:`Index` constructor, use the specific subclass directly instead (:issue:`14093`, :issue:`21311`, :issue:`22315`, :issue:`26974`)
+- Deprecated the :meth:`astype` method of datetimelike (``timedelta64[ns]``, ``datetime64[ns]``, ``Datetime64TZDtype``, ``PeriodDtype``) to convert to integer dtypes, use ``values.view(...)`` instead (:issue:`38544`). This deprecation was later reverted in pandas 1.4.0.
+- Deprecated :meth:`MultiIndex.is_lexsorted` and :meth:`MultiIndex.lexsort_depth`, use :meth:`MultiIndex.is_monotonic_increasing` instead (:issue:`32259`)
+- Deprecated keyword ``try_cast`` in :meth:`Series.where`, :meth:`Series.mask`, :meth:`DataFrame.where`, :meth:`DataFrame.mask`; cast results manually if desired (:issue:`38836`)
+- Deprecated comparison of :class:`Timestamp` objects with ``datetime.date`` objects. Instead of e.g. ``ts <= mydate`` use ``ts <= pd.Timestamp(mydate)`` or ``ts.date() <= mydate`` (:issue:`36131`)
+- Deprecated :attr:`Rolling.win_type` returning ``"freq"`` (:issue:`38963`)
+- Deprecated :attr:`Rolling.is_datetimelike` (:issue:`38963`)
+- Deprecated :class:`DataFrame` indexer for :meth:`Series.__setitem__` and :meth:`DataFrame.__setitem__` (:issue:`39004`)
+- Deprecated :meth:`ExponentialMovingWindow.vol` (:issue:`39220`)
+- Using ``.astype`` to convert between ``datetime64[ns]`` dtype and :class:`DatetimeTZDtype` is deprecated and will raise in a future version, use ``obj.tz_localize`` or ``obj.dt.tz_localize`` instead (:issue:`38622`)
+- Deprecated casting ``datetime.date`` objects to ``datetime64`` when used as ``fill_value`` in :meth:`DataFrame.unstack`, :meth:`DataFrame.shift`, :meth:`Series.shift`, and :meth:`DataFrame.reindex`, pass ``pd.Timestamp(dateobj)`` instead (:issue:`39767`)
+- Deprecated :meth:`.Styler.set_na_rep` and :meth:`.Styler.set_precision` in favor of :meth:`.Styler.format` with ``na_rep`` and ``precision`` as existing and new input arguments respectively (:issue:`40134`, :issue:`40425`)
+- Deprecated :meth:`.Styler.where` in favor of using an alternative formulation with :meth:`Styler.applymap` (:issue:`40821`)
+- Deprecated allowing partial failure in :meth:`Series.transform` and :meth:`DataFrame.transform` when ``func`` is list-like or dict-like and raises anything but ``TypeError``; ``func`` raising anything but a ``TypeError`` will raise in a future version (:issue:`40211`)
+- Deprecated arguments ``error_bad_lines`` and ``warn_bad_lines`` in :meth:`read_csv` and :meth:`read_table` in favor of argument ``on_bad_lines`` (:issue:`15122`)
+- Deprecated support for ``np.ma.mrecords.MaskedRecords`` in the :class:`DataFrame` constructor, pass ``{name: data[name] for name in data.dtype.names}`` instead (:issue:`40363`)
+- Deprecated using :func:`merge`, :meth:`DataFrame.merge`, and :meth:`DataFrame.join` on a different number of levels (:issue:`34862`)
+- Deprecated the use of ``**kwargs`` in :class:`.ExcelWriter`; use the keyword argument ``engine_kwargs`` instead (:issue:`40430`)
+- Deprecated the ``level`` keyword for :class:`DataFrame` and :class:`Series` aggregations; use groupby instead (:issue:`39983`)
+- Deprecated the ``inplace`` parameter of :meth:`Categorical.remove_categories`, :meth:`Categorical.add_categories`, :meth:`Categorical.reorder_categories`, :meth:`Categorical.rename_categories`, :meth:`Categorical.set_categories` and will be removed in a future version (:issue:`37643`)
+- Deprecated :func:`merge` producing duplicated columns through the ``suffixes`` keyword and already existing columns (:issue:`22818`)
+- Deprecated setting :attr:`Categorical._codes`, create a new :class:`Categorical` with the desired codes instead (:issue:`40606`)
+- Deprecated the ``convert_float`` optional argument in :func:`read_excel` and :meth:`ExcelFile.parse` (:issue:`41127`)
+- Deprecated behavior of :meth:`DatetimeIndex.union` with mixed timezones; in a future version both will be cast to UTC instead of object dtype (:issue:`39328`)
+- Deprecated using ``usecols`` with out of bounds indices for :func:`read_csv` with ``engine="c"`` (:issue:`25623`)
+- Deprecated special treatment of lists with first element a Categorical in the :class:`DataFrame` constructor; pass as ``pd.DataFrame({col: categorical, ...})`` instead (:issue:`38845`)
+- Deprecated behavior of :class:`DataFrame` constructor when a ``dtype`` is passed and the data cannot be cast to that dtype. In a future version, this will raise instead of being silently ignored (:issue:`24435`)
+- Deprecated the :attr:`Timestamp.freq` attribute. For the properties that use it (``is_month_start``, ``is_month_end``, ``is_quarter_start``, ``is_quarter_end``, ``is_year_start``, ``is_year_end``), when you have a ``freq``, use e.g. ``freq.is_month_start(ts)`` (:issue:`15146`)
+- Deprecated construction of :class:`Series` or :class:`DataFrame` with ``DatetimeTZDtype`` data and ``datetime64[ns]`` dtype. Use ``Series(data).dt.tz_localize(None)`` instead (:issue:`41555`, :issue:`33401`)
+- Deprecated behavior of :class:`Series` construction with large-integer values and small-integer dtype silently overflowing; use ``Series(data).astype(dtype)`` instead (:issue:`41734`)
+- Deprecated behavior of :class:`DataFrame` construction with floating data and integer dtype casting even when lossy; in a future version this will remain floating, matching :class:`Series` behavior (:issue:`41770`)
+- Deprecated inference of ``timedelta64[ns]``, ``datetime64[ns]``, or ``DatetimeTZDtype`` dtypes in :class:`Series` construction when data containing strings is passed and no ``dtype`` is passed (:issue:`33558`)
+- In a future version, constructing :class:`Series` or :class:`DataFrame` with ``datetime64[ns]`` data and ``DatetimeTZDtype`` will treat the data as wall-times instead of as UTC times (matching DatetimeIndex behavior). To treat the data as UTC times, use ``pd.Series(data).dt.tz_localize("UTC").dt.tz_convert(dtype.tz)`` or ``pd.Series(data.view("int64"), dtype=dtype)`` (:issue:`33401`)
+- Deprecated passing lists as ``key`` to :meth:`DataFrame.xs` and :meth:`Series.xs` (:issue:`41760`)
+- Deprecated passing boolean values for ``inclusive`` in :meth:`Series.between`; use the standard argument values ``{"left", "right", "neither", "both"}`` instead (:issue:`40628`)
+- Deprecated passing arguments as positional for all of the following, with exceptions noted (:issue:`41485`):
+
+ - :func:`concat` (other than ``objs``)
+ - :func:`read_csv` (other than ``filepath_or_buffer``)
+ - :func:`read_table` (other than ``filepath_or_buffer``)
+ - :meth:`DataFrame.clip` and :meth:`Series.clip` (other than ``upper`` and ``lower``)
+ - :meth:`DataFrame.drop_duplicates` (except for ``subset``), :meth:`Series.drop_duplicates`, :meth:`Index.drop_duplicates` and :meth:`MultiIndex.drop_duplicates`
+ - :meth:`DataFrame.drop` (other than ``labels``) and :meth:`Series.drop`
+ - :meth:`DataFrame.dropna` and :meth:`Series.dropna`
+ - :meth:`DataFrame.ffill`, :meth:`Series.ffill`, :meth:`DataFrame.bfill`, and :meth:`Series.bfill`
+ - :meth:`DataFrame.fillna` and :meth:`Series.fillna` (apart from ``value``)
+ - :meth:`DataFrame.interpolate` and :meth:`Series.interpolate` (other than ``method``)
+ - :meth:`DataFrame.mask` and :meth:`Series.mask` (other than ``cond`` and ``other``)
+ - :meth:`DataFrame.reset_index` (other than ``level``) and :meth:`Series.reset_index`
+ - :meth:`DataFrame.set_axis` and :meth:`Series.set_axis` (other than ``labels``)
+ - :meth:`DataFrame.set_index` (other than ``keys``)
+ - :meth:`DataFrame.sort_index` and :meth:`Series.sort_index`
+ - :meth:`DataFrame.sort_values` (other than ``by``) and :meth:`Series.sort_values`
+ - :meth:`DataFrame.where` and :meth:`Series.where` (other than ``cond`` and ``other``)
+ - :meth:`Index.set_names` and :meth:`MultiIndex.set_names` (except for ``names``)
+ - :meth:`MultiIndex.codes` (except for ``codes``)
+ - :meth:`MultiIndex.set_levels` (except for ``levels``)
+ - :meth:`Resampler.interpolate` (other than ``method``)
+
+
.. ---------------------------------------------------------------------------
@@ -873,7 +891,7 @@ Performance improvements
- Performance improvement in :class:`.Styler` where render times are more than 50% reduced and now matches :meth:`DataFrame.to_html` (:issue:`39972` :issue:`39952`, :issue:`40425`)
- The method :meth:`.Styler.set_td_classes` is now as performant as :meth:`.Styler.apply` and :meth:`.Styler.applymap`, and even more so in some cases (:issue:`40453`)
- Performance improvement in :meth:`.ExponentialMovingWindow.mean` with ``times`` (:issue:`39784`)
-- Performance improvement in :meth:`.GroupBy.apply` when requiring the python fallback implementation (:issue:`40176`)
+- Performance improvement in :meth:`.GroupBy.apply` when requiring the Python fallback implementation (:issue:`40176`)
- Performance improvement in the conversion of a PyArrow Boolean array to a pandas nullable Boolean array (:issue:`41051`)
- Performance improvement for concatenation of data with type :class:`CategoricalDtype` (:issue:`40193`)
- Performance improvement in :meth:`.GroupBy.cummin` and :meth:`.GroupBy.cummax` with nullable data types (:issue:`37493`)
@@ -905,6 +923,7 @@ Datetimelike
- Bug in constructing a :class:`DataFrame` or :class:`Series` with mismatched ``datetime64`` data and ``timedelta64`` dtype, or vice-versa, failing to raise a ``TypeError`` (:issue:`38575`, :issue:`38764`, :issue:`38792`)
- Bug in constructing a :class:`Series` or :class:`DataFrame` with a ``datetime`` object out of bounds for ``datetime64[ns]`` dtype or a ``timedelta`` object out of bounds for ``timedelta64[ns]`` dtype (:issue:`38792`, :issue:`38965`)
- Bug in :meth:`DatetimeIndex.intersection`, :meth:`DatetimeIndex.symmetric_difference`, :meth:`PeriodIndex.intersection`, :meth:`PeriodIndex.symmetric_difference` always returning object-dtype when operating with :class:`CategoricalIndex` (:issue:`38741`)
+- Bug in :meth:`DatetimeIndex.intersection` giving incorrect results with non-Tick frequencies with ``n != 1`` (:issue:`42104`)
- Bug in :meth:`Series.where` incorrectly casting ``datetime64`` values to ``int64`` (:issue:`37682`)
- Bug in :class:`Categorical` incorrectly typecasting ``datetime`` object to ``Timestamp`` (:issue:`38878`)
- Bug in comparisons between :class:`Timestamp` object and ``datetime64`` objects just outside the implementation bounds for nanosecond ``datetime64`` (:issue:`39221`)
@@ -912,6 +931,7 @@ Datetimelike
- Bug in :meth:`Timedelta.round`, :meth:`Timedelta.floor`, :meth:`Timedelta.ceil` for values near the implementation bounds of :class:`Timedelta` (:issue:`38964`)
- Bug in :func:`date_range` incorrectly creating :class:`DatetimeIndex` containing ``NaT`` instead of raising ``OutOfBoundsDatetime`` in corner cases (:issue:`24124`)
- Bug in :func:`infer_freq` incorrectly fails to infer 'H' frequency of :class:`DatetimeIndex` if the latter has a timezone and crosses DST boundaries (:issue:`39556`)
+- Bug in :class:`Series` backed by :class:`DatetimeArray` or :class:`TimedeltaArray` sometimes failing to set the array's ``freq`` to ``None`` (:issue:`41425`)
Timedelta
^^^^^^^^^
@@ -941,6 +961,9 @@ Numeric
- Bug in :meth:`Series.count` would result in an ``int32`` result on 32-bit platforms when argument ``level=None`` (:issue:`40908`)
- Bug in :class:`Series` and :class:`DataFrame` reductions with methods ``any`` and ``all`` not returning Boolean results for object data (:issue:`12863`, :issue:`35450`, :issue:`27709`)
- Bug in :meth:`Series.clip` would fail if the Series contains NA values and has nullable int or float as a data type (:issue:`40851`)
+- Bug in :meth:`UInt64Index.where` and :meth:`UInt64Index.putmask` with an ``np.int64`` dtype ``other`` incorrectly raising ``TypeError`` (:issue:`41974`)
+- Bug in :meth:`DataFrame.agg` not sorting the aggregated axis in the order of the provided aggregation functions when one or more of the aggregation functions fail to produce results (:issue:`33634`)
+- Bug in :meth:`DataFrame.clip` not interpreting missing values as no threshold (:issue:`40420`)
Conversion
^^^^^^^^^^
@@ -956,6 +979,12 @@ Conversion
- Bug in :class:`DataFrame` and :class:`Series` construction with ``datetime64[ns]`` data and ``dtype=object`` resulting in ``datetime`` objects instead of :class:`Timestamp` objects (:issue:`41599`)
- Bug in :class:`DataFrame` and :class:`Series` construction with ``timedelta64[ns]`` data and ``dtype=object`` resulting in ``np.timedelta64`` objects instead of :class:`Timedelta` objects (:issue:`41599`)
- Bug in :class:`DataFrame` construction when given a two-dimensional object-dtype ``np.ndarray`` of :class:`Period` or :class:`Interval` objects failing to cast to :class:`PeriodDtype` or :class:`IntervalDtype`, respectively (:issue:`41812`)
+- Bug in constructing a :class:`Series` from a list and a :class:`PandasDtype` (:issue:`39357`)
+- Bug in creating a :class:`Series` from a ``range`` object that does not fit in the bounds of ``int64`` dtype (:issue:`30173`)
+- Bug in creating a :class:`Series` from a ``dict`` with all-tuple keys and an :class:`Index` that requires reindexing (:issue:`41707`)
+- Bug in :func:`.infer_dtype` not recognizing Series, Index, or array with a Period dtype (:issue:`23553`)
+- Bug in :func:`.infer_dtype` raising an error for general :class:`.ExtensionArray` objects. It will now return ``"unknown-array"`` instead of raising (:issue:`37367`)
+- Bug in :meth:`DataFrame.convert_dtypes` incorrectly raised a ``ValueError`` when called on an empty DataFrame (:issue:`40393`)
Strings
^^^^^^^
@@ -976,6 +1005,7 @@ Indexing
^^^^^^^^
- Bug in :meth:`Index.union` and :meth:`MultiIndex.union` dropping duplicate ``Index`` values when ``Index`` was not monotonic or ``sort`` was set to ``False`` (:issue:`36289`, :issue:`31326`, :issue:`40862`)
- Bug in :meth:`CategoricalIndex.get_indexer` failing to raise ``InvalidIndexError`` when non-unique (:issue:`38372`)
+- Bug in :meth:`IntervalIndex.get_indexer` when ``target`` has ``CategoricalDtype`` and both the index and the target contain NA values (:issue:`41934`)
- Bug in :meth:`Series.loc` raising a ``ValueError`` when input was filtered with a Boolean list and values to set were a list with lower dimension (:issue:`20438`)
- Bug in inserting many new columns into a :class:`DataFrame` causing incorrect subsequent indexing behavior (:issue:`38380`)
- Bug in :meth:`DataFrame.__setitem__` raising a ``ValueError`` when setting multiple values to duplicate columns (:issue:`15695`)
@@ -1007,12 +1037,17 @@ Indexing
- Bug in :meth:`DataFrame.loc.__setitem__` when setting-with-expansion incorrectly raising when the index in the expanding axis contained duplicates (:issue:`40096`)
- Bug in :meth:`DataFrame.loc.__getitem__` with :class:`MultiIndex` casting to float when at least one index column has float dtype and we retrieve a scalar (:issue:`41369`)
- Bug in :meth:`DataFrame.loc` incorrectly matching non-Boolean index elements (:issue:`20432`)
+- Bug in indexing with ``np.nan`` on a :class:`Series` or :class:`DataFrame` with a :class:`CategoricalIndex` incorrectly raising ``KeyError`` when ``np.nan`` keys are present (:issue:`41933`)
- Bug in :meth:`Series.__delitem__` with ``ExtensionDtype`` incorrectly casting to ``ndarray`` (:issue:`40386`)
+- Bug in :meth:`DataFrame.at` with a :class:`CategoricalIndex` returning incorrect results when passed integer keys (:issue:`41846`)
- Bug in :meth:`DataFrame.loc` returning a :class:`MultiIndex` in the wrong order if an indexer has duplicates (:issue:`40978`)
- Bug in :meth:`DataFrame.__setitem__` raising a ``TypeError`` when using a ``str`` subclass as the column name with a :class:`DatetimeIndex` (:issue:`37366`)
- Bug in :meth:`PeriodIndex.get_loc` failing to raise a ``KeyError`` when given a :class:`Period` with a mismatched ``freq`` (:issue:`41670`)
- Bug in ``.loc.__getitem__`` with a :class:`UInt64Index` and negative-integer keys raising ``OverflowError`` instead of ``KeyError`` in some cases, wrapping around to positive integers in others (:issue:`41777`)
- Bug in :meth:`Index.get_indexer` failing to raise ``ValueError`` in some cases with invalid ``method``, ``limit``, or ``tolerance`` arguments (:issue:`41918`)
+- Bug in slicing a :class:`Series` or :class:`DataFrame` with a :class:`TimedeltaIndex` where passing an invalid string raised ``ValueError`` instead of ``TypeError`` (:issue:`41821`)
+- Bug in :class:`Index` constructor sometimes silently ignoring a specified ``dtype`` (:issue:`38879`)
+- :meth:`Index.where` behavior now mirrors :meth:`Index.putmask` behavior, i.e. ``index.where(mask, other)`` matches ``index.putmask(~mask, other)`` (:issue:`39412`)
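+
+A quick sketch of the equivalence (index, mask and fill value illustrative):
+
+.. code-block:: python
+
+    import numpy as np
+    import pandas as pd
+
+    idx = pd.Index([1, 2, 3])
+    mask = np.array([True, False, True])
+    # keep values where ``mask`` is True, fill elsewhere -- both spellings agree
+    assert idx.where(mask, 0).equals(idx.putmask(~mask, 0))
+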
Missing
^^^^^^^
@@ -1021,6 +1056,7 @@ Missing
- Bug in :meth:`DataFrame.fillna` not accepting a dictionary for the ``downcast`` keyword (:issue:`40809`)
- Bug in :func:`isna` not returning a copy of the mask for nullable types, causing any subsequent mask modification to change the original array (:issue:`40935`)
- Bug in :class:`DataFrame` construction with float data containing ``NaN`` and an integer ``dtype`` casting instead of retaining the ``NaN`` (:issue:`26919`)
+- Bug in :meth:`Series.isin` and :meth:`MultiIndex.isin` not treating all NaN values as equivalent when they appear inside tuples (:issue:`41836`)
MultiIndex
^^^^^^^^^^
@@ -1028,6 +1064,7 @@ MultiIndex
- Bug in :meth:`MultiIndex.intersection` duplicating ``NaN`` in the result (:issue:`38623`)
- Bug in :meth:`MultiIndex.equals` incorrectly returning ``True`` when the :class:`MultiIndex` contained ``NaN`` even when they are differently ordered (:issue:`38439`)
- Bug in :meth:`MultiIndex.intersection` always returning an empty result when intersecting with :class:`CategoricalIndex` (:issue:`38653`)
+- Bug in :meth:`MultiIndex.difference` incorrectly raising ``TypeError`` when indexes contain non-sortable entries (:issue:`41915`)
- Bug in :meth:`MultiIndex.reindex` raising a ``ValueError`` when used on an empty :class:`MultiIndex` and indexing only a specific level (:issue:`41170`)
- Bug in :meth:`MultiIndex.reindex` raising ``TypeError`` when reindexing against a flat :class:`Index` (:issue:`41707`)
@@ -1067,6 +1104,7 @@ I/O
- Bug in the conversion from PyArrow to pandas (e.g. for reading Parquet) with nullable dtypes and a PyArrow array whose data buffer size is not a multiple of the dtype size (:issue:`40896`)
- Bug in :func:`read_excel` would raise an error when pandas could not determine the file type even though the user specified the ``engine`` argument (:issue:`41225`)
- Bug in :func:`read_clipboard` shifting values into the wrong column when copying from an Excel file with null values in the first column (:issue:`41108`)
+- Bug in :meth:`DataFrame.to_hdf` and :meth:`Series.to_hdf` raising a ``TypeError`` when trying to append a string column to an incompatible column (:issue:`41897`)
Period
^^^^^^
@@ -1126,6 +1164,8 @@ Groupby/resample/rolling
- Bug in :class:`DataFrameGroupBy` aggregations incorrectly failing to drop columns with invalid dtypes for that aggregation when there are no valid columns (:issue:`41291`)
- Bug in :meth:`DataFrame.rolling.__iter__` where ``on`` was not assigned to the index of the resulting objects (:issue:`40373`)
- Bug in :meth:`.DataFrameGroupBy.transform` and :meth:`.DataFrameGroupBy.agg` with ``engine="numba"`` where ``*args`` were being cached with the user passed function (:issue:`41647`)
+- Bug in :class:`DataFrameGroupBy` methods ``agg``, ``transform``, ``sum``, ``bfill``, ``ffill``, ``pad``, ``pct_change``, ``shift``, ``ohlc`` dropping ``.columns.names`` (:issue:`41497`)
+
Reshaping
^^^^^^^^^
@@ -1148,6 +1188,8 @@ Reshaping
- Bug in :func:`to_datetime` raising an error when the input sequence contained unhashable items (:issue:`39756`)
- Bug in :meth:`Series.explode` preserving the index when ``ignore_index`` was ``True`` and values were scalars (:issue:`40487`)
- Bug in :func:`to_datetime` raising a ``ValueError`` when :class:`Series` contains ``None`` and ``NaT`` and has more than 50 elements (:issue:`39882`)
+- Bug in :meth:`Series.unstack` and :meth:`DataFrame.unstack` with object-dtype values containing timezone-aware datetime objects incorrectly raising ``TypeError`` (:issue:`41875`)
+- Bug in :meth:`DataFrame.melt` raising ``InvalidIndexError`` when :class:`DataFrame` has duplicate columns used as ``value_vars`` (:issue:`41951`)
Sparse
^^^^^^
@@ -1175,24 +1217,14 @@ Styler
Other
^^^^^
-- Bug in :class:`Index` constructor sometimes silently ignoring a specified ``dtype`` (:issue:`38879`)
-- Bug in :func:`.infer_dtype` not recognizing Series, Index, or array with a Period dtype (:issue:`23553`)
-- Bug in :func:`.infer_dtype` raising an error for general :class:`.ExtensionArray` objects. It will now return ``"unknown-array"`` instead of raising (:issue:`37367`)
-- Bug in constructing a :class:`Series` from a list and a :class:`PandasDtype` (:issue:`39357`)
- ``inspect.getmembers(Series)`` no longer raises an ``AbstractMethodError`` (:issue:`38782`)
- Bug in :meth:`Series.where` with numeric dtype and ``other=None`` not casting to ``nan`` (:issue:`39761`)
-- :meth:`Index.where` behavior now mirrors :meth:`Index.putmask` behavior, i.e. ``index.where(mask, other)`` matches ``index.putmask(~mask, other)`` (:issue:`39412`)
- Bug in :func:`.assert_series_equal`, :func:`.assert_frame_equal`, :func:`.assert_index_equal` and :func:`.assert_extension_array_equal` incorrectly raising when an attribute has an unrecognized NA type (:issue:`39461`)
- Bug in :func:`.assert_index_equal` with ``exact=True`` not raising when comparing :class:`CategoricalIndex` instances with ``Int64Index`` and ``RangeIndex`` categories (:issue:`41263`)
- Bug in :meth:`DataFrame.equals`, :meth:`Series.equals`, and :meth:`Index.equals` with object-dtype containing ``np.datetime64("NaT")`` or ``np.timedelta64("NaT")`` (:issue:`39650`)
- Bug in :func:`show_versions` where console JSON output was not proper JSON (:issue:`39701`)
- pandas can now compile on z/OS when using ``xlc`` (:issue:`35826`)
-- Bug in :meth:`DataFrame.convert_dtypes` incorrectly raised a ``ValueError`` when called on an empty DataFrame (:issue:`40393`)
-- Bug in :meth:`DataFrame.agg()` not sorting the aggregated axis in the order of the provided aggragation functions when one or more aggregation function fails to produce results (:issue:`33634`)
-- Bug in :meth:`DataFrame.clip` not interpreting missing values as no threshold (:issue:`40420`)
-- Bug in :class:`Series` backed by :class:`DatetimeArray` or :class:`TimedeltaArray` sometimes failing to set the array's ``freq`` to ``None`` (:issue:`41425`)
-- Bug in creating a :class:`Series` from a ``range`` object that does not fit in the bounds of ``int64`` dtype (:issue:`30173`)
-- Bug in creating a :class:`Series` from a ``dict`` with all-tuple keys and an :class:`Index` that requires reindexing (:issue:`41707`)
+- Bug in :func:`pandas.util.hash_pandas_object` not recognizing ``hash_key``, ``encoding`` and ``categorize`` when the input object type is a :class:`DataFrame` (:issue:`41404`)
.. ---------------------------------------------------------------------------
@@ -1201,4 +1233,4 @@ Other
Contributors
~~~~~~~~~~~~
-.. contributors:: v1.2.4..v1.3.0|HEAD
+.. contributors:: v1.2.5..v1.3.0
diff --git a/doc/source/whatsnew/v1.3.1.rst b/doc/source/whatsnew/v1.3.1.rst
new file mode 100644
index 0000000000000..a57995eb0db9a
--- /dev/null
+++ b/doc/source/whatsnew/v1.3.1.rst
@@ -0,0 +1,51 @@
+.. _whatsnew_131:
+
+What's new in 1.3.1 (July 25, 2021)
+-----------------------------------
+
+These are the changes in pandas 1.3.1. See :ref:`release` for a full changelog
+including other versions of pandas.
+
+{{ header }}
+
+.. ---------------------------------------------------------------------------
+
+.. _whatsnew_131.regressions:
+
+Fixed regressions
+~~~~~~~~~~~~~~~~~
+- Pandas could not be built on PyPy (:issue:`42355`)
+- :class:`DataFrame` constructed with an older version of pandas could not be unpickled (:issue:`42345`)
+- Performance regression in constructing a :class:`DataFrame` from a dictionary of dictionaries (:issue:`42248`)
+- Fixed regression in :meth:`DataFrame.agg` dropping values when the DataFrame had an Extension Array dtype, a duplicate index, and ``axis=1`` (:issue:`42380`)
+- Fixed regression in :meth:`DataFrame.astype` changing the order of noncontiguous data (:issue:`42396`)
+- Performance regression in :class:`DataFrame` reduction operations requiring casting, such as :meth:`DataFrame.mean` on integer data (:issue:`38592`)
+- Performance regression in :meth:`DataFrame.to_dict` and :meth:`Series.to_dict` when the ``orient`` argument is one of ``"records"``, ``"dict"``, or ``"split"`` (:issue:`42352`)
+- Fixed regression in indexing with a ``list`` subclass incorrectly raising ``TypeError`` (:issue:`42433`, :issue:`42461`)
+- Fixed regression in :meth:`DataFrame.isin` and :meth:`Series.isin` raising ``TypeError`` with nullable data containing at least one missing value (:issue:`42405`)
+- Regression in :func:`concat` between objects with bool dtype and integer dtype casting to object instead of to integer (:issue:`42092`)
+- Bug in :class:`Series` constructor not accepting a ``dask.Array`` (:issue:`38645`)
+- Fixed regression for ``SettingWithCopyWarning`` displaying incorrect stacklevel (:issue:`42570`)
+- Fixed regression for :func:`merge_asof` raising ``KeyError`` when one of the ``by`` columns is in the index (:issue:`34488`)
+- Fixed regression in :func:`to_datetime` returning ``pd.NaT`` for inputs that produce duplicated values when ``cache=True`` (:issue:`42259`)
+- Fixed regression in :meth:`SeriesGroupBy.value_counts` that resulted in an ``IndexError`` when called on a Series with one row (:issue:`42618`)
+
+.. ---------------------------------------------------------------------------
+
+.. _whatsnew_131.bug_fixes:
+
+Bug fixes
+~~~~~~~~~
+- Fixed bug in :meth:`DataFrame.transpose` dropping values when the DataFrame had an Extension Array dtype and a duplicate index (:issue:`42380`)
+- Fixed bug in :meth:`DataFrame.to_xml` raising ``KeyError`` when called with ``index=False`` and an offset index (:issue:`42458`)
+- Fixed bug in :meth:`.Styler.set_sticky` not handling index names correctly for single index columns case (:issue:`42537`)
+- Fixed bug in :meth:`DataFrame.copy` failing to consolidate blocks in the result (:issue:`42579`)
+
+.. ---------------------------------------------------------------------------
+
+.. _whatsnew_131.contributors:
+
+Contributors
+~~~~~~~~~~~~
+
+.. contributors:: v1.3.0..v1.3.1
diff --git a/doc/source/whatsnew/v1.3.2.rst b/doc/source/whatsnew/v1.3.2.rst
new file mode 100644
index 0000000000000..e3c6268547dd2
--- /dev/null
+++ b/doc/source/whatsnew/v1.3.2.rst
@@ -0,0 +1,51 @@
+.. _whatsnew_132:
+
+What's new in 1.3.2 (August 15, 2021)
+-------------------------------------
+
+These are the changes in pandas 1.3.2. See :ref:`release` for a full changelog
+including other versions of pandas.
+
+{{ header }}
+
+.. ---------------------------------------------------------------------------
+
+.. _whatsnew_132.regressions:
+
+Fixed regressions
+~~~~~~~~~~~~~~~~~
+- Performance regression in :meth:`DataFrame.isin` and :meth:`Series.isin` for nullable data types (:issue:`42714`)
+- Regression in updating values of a :class:`Series` created via :meth:`DataFrame.pop` when using a boolean index (:issue:`42530`)
+- Regression in :meth:`DataFrame.from_records` with empty records (:issue:`42456`)
+- Fixed regression in :meth:`DataFrame.shift` raising ``TypeError`` when shifting a DataFrame created by concatenating slices and filling with values (:issue:`42719`)
+- Regression in :meth:`DataFrame.agg` when the ``func`` argument returned lists and ``axis=1`` (:issue:`42727`)
+- Regression in :meth:`DataFrame.drop` doing nothing if the :class:`MultiIndex` has duplicates and the indexer is a tuple or list of tuples (:issue:`42771`)
+- Fixed regression where :func:`read_csv` raised a ``ValueError`` when parameters ``names`` and ``prefix`` were both set to ``None`` (:issue:`42387`)
+- Fixed regression in comparisons between :class:`Timestamp` object and ``datetime64`` objects outside the implementation bounds for nanosecond ``datetime64`` (:issue:`42794`)
+- Fixed regression in :meth:`.Styler.highlight_min` and :meth:`.Styler.highlight_max` where ``pandas.NA`` was not successfully ignored (:issue:`42650`)
+- Fixed regression in :func:`concat` where ``copy=False`` was not honored in ``axis=1`` Series concatenation (:issue:`42501`)
+- Regression in :meth:`Series.nlargest` and :meth:`Series.nsmallest` with nullable integer or float dtype (:issue:`42816`)
+- Fixed regression in :meth:`Series.quantile` with :class:`Int64Dtype` (:issue:`42626`)
+- Fixed regression in :meth:`Series.groupby` and :meth:`DataFrame.groupby` where supplying the ``by`` argument with a Series named with a tuple would incorrectly raise (:issue:`42731`)
+
+.. ---------------------------------------------------------------------------
+
+.. _whatsnew_132.bug_fixes:
+
+Bug fixes
+~~~~~~~~~
+- Bug in :func:`read_excel` modifying the dtypes dictionary when reading a file with duplicate columns (:issue:`42462`)
+- Bug where 1D slices over extension types turned into N-dimensional slices over ExtensionArrays (:issue:`42430`)
+- Fixed bug in :meth:`Series.rolling` and :meth:`DataFrame.rolling` not calculating window bounds correctly for the first row when ``center=True`` and ``window`` is an offset that covers all the rows (:issue:`42753`)
+- :meth:`.Styler.hide_columns` now hides the index name header row as well as column headers (:issue:`42101`)
+- :meth:`.Styler.set_sticky` has amended CSS to control the column/index names and ensure the correct sticky positions (:issue:`42537`)
+- Bug in de-serializing datetime indexes in ``PYTHONOPTIMIZE`` mode (:issue:`42866`)
+
+.. ---------------------------------------------------------------------------
+
+.. _whatsnew_132.contributors:
+
+Contributors
+~~~~~~~~~~~~
+
+.. contributors:: v1.3.1..v1.3.2
diff --git a/doc/source/whatsnew/v1.3.3.rst b/doc/source/whatsnew/v1.3.3.rst
new file mode 100644
index 0000000000000..ecec6d975ccb7
--- /dev/null
+++ b/doc/source/whatsnew/v1.3.3.rst
@@ -0,0 +1,57 @@
+.. _whatsnew_133:
+
+What's new in 1.3.3 (September 12, 2021)
+----------------------------------------
+
+These are the changes in pandas 1.3.3. See :ref:`release` for a full changelog
+including other versions of pandas.
+
+{{ header }}
+
+.. ---------------------------------------------------------------------------
+
+.. _whatsnew_133.regressions:
+
+Fixed regressions
+~~~~~~~~~~~~~~~~~
+- Fixed regression in the :class:`DataFrame` constructor failing to broadcast for a defined :class:`Index` and a length-one list of :class:`Timestamp` (:issue:`42810`)
+- Fixed regression in :meth:`.GroupBy.agg` incorrectly raising in some cases (:issue:`42390`)
+- Fixed regression in :meth:`.GroupBy.apply` where ``nan`` values were dropped even with ``dropna=False`` (:issue:`43205`)
+- Fixed regression in :meth:`.GroupBy.quantile` which was failing with ``pandas.NA`` (:issue:`42849`)
+- Fixed regression in :meth:`merge` where ``on`` columns with ``ExtensionDtype`` or ``bool`` data types were cast to ``object`` in ``right`` and ``outer`` merge (:issue:`40073`)
+- Fixed regression in :meth:`RangeIndex.where` and :meth:`RangeIndex.putmask` raising ``AssertionError`` when result did not represent a :class:`RangeIndex` (:issue:`43240`)
+- Fixed regression in :meth:`read_parquet` where the ``fastparquet`` engine would not work properly with fastparquet 0.7.0 (:issue:`43075`)
+- Fixed regression in :meth:`DataFrame.loc.__setitem__` raising ``ValueError`` when setting an array as a cell value (:issue:`43422`)
+- Fixed regression in :func:`is_list_like` where objects with ``__iter__`` set to ``None`` would be identified as iterable (:issue:`43373`)
+- Fixed regression in :meth:`DataFrame.__getitem__` raising an error for a slice of :class:`DatetimeIndex` when the index is non-monotonic (:issue:`43223`)
+- Fixed regression in :meth:`.Resampler.aggregate` raising when used after column selection if ``func`` is a list of aggregation functions (:issue:`42905`)
+- Fixed regression in :meth:`DataFrame.corr` where Kendall correlation would produce incorrect results for columns with repeated values (:issue:`43401`)
+- Fixed regression in :meth:`DataFrame.groupby` where aggregation on columns with object types dropped results on those columns (:issue:`42395`, :issue:`43108`)
+- Fixed regression in :meth:`Series.fillna` raising ``TypeError`` when filling a ``float`` :class:`Series` with a list-like fill value having a dtype that couldn't be cast losslessly (like ``float32`` filled with ``float64``) (:issue:`43424`)
+- Fixed regression in :func:`read_csv` raising ``AttributeError`` when the file handle is a ``tempfile.SpooledTemporaryFile`` object (:issue:`43439`)
+- Fixed performance regression in :meth:`core.window.ewm.ExponentialMovingWindow.mean` (:issue:`42333`)
+
+.. ---------------------------------------------------------------------------
+
+.. _whatsnew_133.performance:
+
+Performance improvements
+~~~~~~~~~~~~~~~~~~~~~~~~
+- Performance improvement for :meth:`DataFrame.__setitem__` when the key or value is not a :class:`DataFrame`, or key is not list-like (:issue:`43274`)
+
+.. ---------------------------------------------------------------------------
+
+.. _whatsnew_133.bug_fixes:
+
+Bug fixes
+~~~~~~~~~
+- Fixed bug in :meth:`.DataFrameGroupBy.agg` and :meth:`.DataFrameGroupBy.transform` with ``engine="numba"`` where ``index`` data was not being correctly passed into ``func`` (:issue:`43133`)
+
+.. ---------------------------------------------------------------------------
+
+.. _whatsnew_133.contributors:
+
+Contributors
+~~~~~~~~~~~~
+
+.. contributors:: v1.3.2..v1.3.3
diff --git a/doc/source/whatsnew/v1.3.4.rst b/doc/source/whatsnew/v1.3.4.rst
new file mode 100644
index 0000000000000..b46744d51d74d
--- /dev/null
+++ b/doc/source/whatsnew/v1.3.4.rst
@@ -0,0 +1,57 @@
+.. _whatsnew_134:
+
+What's new in 1.3.4 (October 17, 2021)
+--------------------------------------
+
+These are the changes in pandas 1.3.4. See :ref:`release` for a full changelog
+including other versions of pandas.
+
+{{ header }}
+
+.. ---------------------------------------------------------------------------
+
+.. _whatsnew_134.regressions:
+
+Fixed regressions
+~~~~~~~~~~~~~~~~~
+- Fixed regression in :meth:`DataFrame.convert_dtypes` incorrectly converting byte strings to strings (:issue:`43183`)
+- Fixed regression in :meth:`.GroupBy.agg` where it was failing silently with mixed data types along ``axis=1`` and :class:`MultiIndex` (:issue:`43209`)
+- Fixed regression in :func:`merge` with integer and ``NaN`` keys failing with ``outer`` merge (:issue:`43550`)
+- Fixed regression in :meth:`DataFrame.corr` raising ``ValueError`` with ``method="spearman"`` on 32-bit platforms (:issue:`43588`)
+- Fixed performance regression in :meth:`MultiIndex.equals` (:issue:`43549`)
+- Fixed performance regression in :meth:`.GroupBy.first` and :meth:`.GroupBy.last` with :class:`StringDtype` (:issue:`41596`)
+- Fixed regression in :meth:`Series.cat.reorder_categories` failing to update the categories on the ``Series`` (:issue:`43232`)
+- Fixed regression in :meth:`Series.cat.categories` setter failing to update the categories on the ``Series`` (:issue:`43334`)
+- Fixed regression in :func:`read_csv` raising ``UnicodeDecodeError`` exception when ``memory_map=True`` (:issue:`43540`)
+- Fixed regression in :meth:`DataFrame.explode` raising ``AssertionError`` when ``column`` is any scalar which is not a string (:issue:`43314`)
+- Fixed regression in :meth:`Series.aggregate` attempting to pass ``args`` and ``kwargs`` multiple times to the user supplied ``func`` in certain cases (:issue:`43357`)
+- Fixed regression when iterating over a :class:`DataFrame.groupby.rolling` object causing the resulting DataFrames to have an incorrect index if the input groupings were not sorted (:issue:`43386`)
+- Fixed regression in :meth:`DataFrame.groupby.rolling.cov` and :meth:`DataFrame.groupby.rolling.corr` computing incorrect results if the input groupings were not sorted (:issue:`43386`)
+
+.. ---------------------------------------------------------------------------
+
+.. _whatsnew_134.bug_fixes:
+
+Bug fixes
+~~~~~~~~~
+- Fixed bug in :meth:`pandas.DataFrame.groupby.rolling` and :class:`pandas.api.indexers.FixedForwardWindowIndexer` leading to segfaults and window endpoints being mixed across groups (:issue:`43267`)
+- Fixed bug in :meth:`.GroupBy.mean` with datetimelike values including ``NaT`` values returning incorrect results (:issue:`43132`)
+- Fixed bug in :meth:`Series.aggregate` not passing the first ``args`` to the user supplied ``func`` in certain cases (:issue:`43357`)
+- Fixed memory leaks in :meth:`Series.rolling.quantile` and :meth:`Series.rolling.median` (:issue:`43339`)
+
+.. ---------------------------------------------------------------------------
+
+.. _whatsnew_134.other:
+
+Other
+~~~~~
+- The minimum version of Cython needed to compile pandas is now ``0.29.24`` (:issue:`43729`)
+
+.. ---------------------------------------------------------------------------
+
+.. _whatsnew_134.contributors:
+
+Contributors
+~~~~~~~~~~~~
+
+.. contributors:: v1.3.3..v1.3.4
diff --git a/doc/source/whatsnew/v1.3.5.rst b/doc/source/whatsnew/v1.3.5.rst
new file mode 100644
index 0000000000000..339bd7debf945
--- /dev/null
+++ b/doc/source/whatsnew/v1.3.5.rst
@@ -0,0 +1,34 @@
+.. _whatsnew_135:
+
+What's new in 1.3.5 (December 12, 2021)
+---------------------------------------
+
+These are the changes in pandas 1.3.5. See :ref:`release` for a full changelog
+including other versions of pandas.
+
+{{ header }}
+
+.. ---------------------------------------------------------------------------
+
+.. _whatsnew_135.regressions:
+
+Fixed regressions
+~~~~~~~~~~~~~~~~~
+- Fixed regression in :meth:`Series.equals` when comparing floats with dtype object to None (:issue:`44190`)
+- Fixed regression in :func:`merge_asof` raising error when array was supplied as join key (:issue:`42844`)
+- Fixed regression when resampling a :class:`DataFrame` with a :class:`DatetimeIndex` with empty groups and ``uint8``, ``uint16`` or ``uint32`` columns incorrectly raising ``RuntimeError`` (:issue:`43329`)
+- Fixed regression in creating a :class:`DataFrame` from a timezone-aware :class:`Timestamp` scalar near a Daylight Saving Time transition (:issue:`42505`)
+- Fixed performance regression in :func:`read_csv` (:issue:`44106`)
+- Fixed regression in :meth:`Series.duplicated` and :meth:`Series.drop_duplicates` when Series has :class:`Categorical` dtype with boolean categories (:issue:`44351`)
+- Fixed regression in :meth:`.GroupBy.sum` with ``timedelta64[ns]`` dtype containing ``NaT`` failing to treat that value as NA (:issue:`42659`)
+- Fixed regression in :meth:`.RollingGroupby.cov` and :meth:`.RollingGroupby.corr` incorrectly returning superfluous groups in the result when ``other`` had the same shape as each group (:issue:`42915`)
+
+
+.. ---------------------------------------------------------------------------
+
+.. _whatsnew_135.contributors:
+
+Contributors
+~~~~~~~~~~~~
+
+.. contributors:: v1.3.4..v1.3.5|HEAD
diff --git a/doc/source/whatsnew/v1.4.0.rst b/doc/source/whatsnew/v1.4.0.rst
new file mode 100644
index 0000000000000..363d4b57544a9
--- /dev/null
+++ b/doc/source/whatsnew/v1.4.0.rst
@@ -0,0 +1,1093 @@
+.. _whatsnew_140:
+
+What's new in 1.4.0 (January 22, 2022)
+--------------------------------------
+
+These are the changes in pandas 1.4.0. See :ref:`release` for a full changelog
+including other versions of pandas.
+
+{{ header }}
+
+.. ---------------------------------------------------------------------------
+
+.. _whatsnew_140.enhancements:
+
+Enhancements
+~~~~~~~~~~~~
+
+.. _whatsnew_140.enhancements.warning_lineno:
+
+Improved warning messages
+^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Previously, warning messages may have pointed to lines within the pandas
+library. Running the script ``setting_with_copy_warning.py``
+
+.. code-block:: python
+
+ import pandas as pd
+
+ df = pd.DataFrame({'a': [1, 2, 3]})
+ df[:2].loc[:, 'a'] = 5
+
+with pandas 1.3 resulted in::
+
+ .../site-packages/pandas/core/indexing.py:1951: SettingWithCopyWarning:
+ A value is trying to be set on a copy of a slice from a DataFrame.
+
+This made it difficult to determine where the warning was being generated from.
+Now pandas will inspect the call stack, reporting the first line outside of the
+pandas library that gave rise to the warning. The output of the above script is
+now::
+
+ setting_with_copy_warning.py:4: SettingWithCopyWarning:
+ A value is trying to be set on a copy of a slice from a DataFrame.
+
+
+
+
+.. _whatsnew_140.enhancements.ExtensionIndex:
+
+Index can hold arbitrary ExtensionArrays
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Until now, passing a custom :class:`ExtensionArray` to ``pd.Index`` would cast
+the array to ``object`` dtype. Now :class:`Index` can directly hold arbitrary
+ExtensionArrays (:issue:`43930`).
+
+.. ipython:: python
+
+ arr = pd.array([1, 2, pd.NA])
+ idx = pd.Index(arr)
+
+In the old behavior, ``idx`` would be object-dtype:
+
+*Previous behavior*:
+
+.. code-block:: ipython
+
+ In [1]: idx
+ Out[1]: Index([1, 2, <NA>], dtype='object')
+
+With the new behavior, we keep the original dtype:
+
+*New behavior*:
+
+.. ipython:: python
+
+ idx
+
+One exception to this is ``SparseArray``, which will continue to cast to numpy
+dtype until pandas 2.0. At that point it will retain its dtype like other
+ExtensionArrays.
+
+.. _whatsnew_140.enhancements.styler:
+
+Styler
+^^^^^^
+
+:class:`.Styler` has been further developed in 1.4.0. The following general enhancements have been made:
+
+ - Styling and formatting of indexes has been added, with :meth:`.Styler.apply_index`, :meth:`.Styler.applymap_index` and :meth:`.Styler.format_index`. These mirror the signature of the methods already used to style and format data values, and work with HTML, LaTeX and Excel formats (:issue:`41893`, :issue:`43101`, :issue:`41993`, :issue:`41995`)
+ - The new method :meth:`.Styler.hide` deprecates :meth:`.Styler.hide_index` and :meth:`.Styler.hide_columns` (:issue:`43758`); see the sketch after these lists
+ - The keyword arguments ``level`` and ``names`` have been added to :meth:`.Styler.hide` (and implicitly to the deprecated methods :meth:`.Styler.hide_index` and :meth:`.Styler.hide_columns`) for additional control of the visibility of MultiIndexes and of Index names (:issue:`25475`, :issue:`43404`, :issue:`43346`)
+ - :meth:`.Styler.export` and :meth:`.Styler.use` have been updated to address all of the functionality added in v1.2.0 and v1.3.0 (:issue:`40675`)
+ - Global options under the category ``pd.options.styler`` have been extended to configure default ``Styler`` properties which address formatting, encoding, and HTML and LaTeX rendering. Note that formerly ``Styler`` relied on ``display.html.use_mathjax``, which has now been replaced by ``styler.html.mathjax`` (:issue:`41395`)
+ - Validation of certain keyword arguments, e.g. ``caption`` (:issue:`43368`)
+ - Various bug fixes as recorded below
+
+Additionally, there are enhancements specific to HTML rendering:
+
+ - :meth:`.Styler.bar` introduces additional arguments to control alignment and display (:issue:`26070`, :issue:`36419`), and it also validates the input arguments ``width`` and ``height`` (:issue:`42511`)
+ - :meth:`.Styler.to_html` introduces keyword arguments ``sparse_index``, ``sparse_columns``, ``bold_headers``, ``caption``, ``max_rows`` and ``max_columns`` (:issue:`41946`, :issue:`43149`, :issue:`42972`)
+ - :meth:`.Styler.to_html` omits CSSStyle rules for hidden table elements as a performance enhancement (:issue:`43619`)
+ - Custom CSS classes can now be directly specified without string replacement (:issue:`43686`)
+ - Ability to render hyperlinks automatically via a new ``hyperlinks`` formatting keyword argument (:issue:`45058`)
+
+There are also some LaTeX specific enhancements:
+
+ - :meth:`.Styler.to_latex` introduces keyword argument ``environment``, which also allows a specific "longtable" entry through a separate jinja2 template (:issue:`41866`)
+ - Naive sparsification is now possible for LaTeX without needing to include the multirow package (:issue:`43369`)
+ - *cline* support has been added for :class:`MultiIndex` row sparsification through a keyword argument (:issue:`45138`)
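+
+A minimal sketch of the new index-hiding and index-formatting API referenced
+above (the frame is illustrative):
+
+.. code-block:: python
+
+    import pandas as pd
+
+    df = pd.DataFrame({"x": [1.0, 2.0]}, index=pd.Index(["a", "b"], name="rows"))
+    styler = df.style.format_index(str.upper, axis="index")  # format index labels
+    styler = styler.hide(axis="columns")  # replaces the deprecated hide_columns()
+    html = styler.to_html()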
+
+.. _whatsnew_140.enhancements.pyarrow_csv_engine:
+
+Multi-threaded CSV reading with a new CSV Engine based on pyarrow
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+:func:`pandas.read_csv` now accepts ``engine="pyarrow"`` (requires at least
+``pyarrow`` 1.0.1) as an argument, allowing for faster CSV parsing on multicore
+machines with pyarrow installed. See the I/O docs for more info
+(:issue:`23697`, :issue:`43706`).
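+
+A minimal sketch (assumes ``pyarrow`` is installed; ``"data.csv"`` is a
+placeholder path):
+
+.. code-block:: python
+
+    import pandas as pd
+
+    # parse with pyarrow's multi-threaded CSV reader
+    df = pd.read_csv("data.csv", engine="pyarrow")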
+
+.. _whatsnew_140.enhancements.window_rank:
+
+Rank function for rolling and expanding windows
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Added ``rank`` function to :class:`Rolling` and :class:`Expanding`. The new
+function supports the ``method``, ``ascending``, and ``pct`` flags of
+:meth:`DataFrame.rank`. The ``method`` argument supports ``min``, ``max``, and
+``average`` ranking methods.
+Example:
+
+.. ipython:: python
+
+ s = pd.Series([1, 4, 2, 3, 5, 3])
+ s.rolling(3).rank()
+
+ s.rolling(3).rank(method="max")
+
+.. _whatsnew_140.enhancements.groupby_indexing:
+
+Groupby positional indexing
+^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+It is now possible to specify positional ranges relative to the ends of each
+group.
+
+Negative arguments for :meth:`.GroupBy.head` and :meth:`.GroupBy.tail` now work
+correctly and result in ranges relative to the end and start of each group,
+respectively. Previously, negative arguments returned empty frames.
+
+.. ipython:: python
+
+ df = pd.DataFrame([["g", "g0"], ["g", "g1"], ["g", "g2"], ["g", "g3"],
+ ["h", "h0"], ["h", "h1"]], columns=["A", "B"])
+ df.groupby("A").head(-1)
+
+
+:meth:`.GroupBy.nth` now accepts a slice or list of integers and slices.
+
+.. ipython:: python
+
+ df.groupby("A").nth(slice(1, -1))
+ df.groupby("A").nth([slice(None, 1), slice(-1, None)])
+
+:meth:`.GroupBy.nth` now accepts index notation.
+
+.. ipython:: python
+
+ df.groupby("A").nth[1, -1]
+ df.groupby("A").nth[1:-1]
+ df.groupby("A").nth[:1, -1:]
+
+.. _whatsnew_140.dict_tight:
+
+DataFrame.from_dict and DataFrame.to_dict have new ``'tight'`` option
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+A new ``'tight'`` dictionary format that preserves :class:`MultiIndex` entries
+and names is now available with the :meth:`DataFrame.from_dict` and
+:meth:`DataFrame.to_dict` methods and can be used with the standard ``json``
+library to produce a tight representation of :class:`DataFrame` objects
+(:issue:`4889`).
+
+.. ipython:: python
+
+ df = pd.DataFrame.from_records(
+ [[1, 3], [2, 4]],
+ index=pd.MultiIndex.from_tuples([("a", "b"), ("a", "c")],
+ names=["n1", "n2"]),
+ columns=pd.MultiIndex.from_tuples([("x", 1), ("y", 2)],
+ names=["z1", "z2"]),
+ )
+ df
+ df.to_dict(orient='tight')
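+
+Because the ``'tight'`` format is built from JSON-serializable lists, it can
+round-trip through the standard ``json`` module. A sketch, reusing ``df`` from
+above:
+
+.. code-block:: python
+
+    import json
+
+    payload = json.dumps(df.to_dict(orient="tight"))
+    df2 = pd.DataFrame.from_dict(json.loads(payload), orient="tight")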
+
+.. _whatsnew_140.enhancements.other:
+
+Other enhancements
+^^^^^^^^^^^^^^^^^^
+- :meth:`concat` will preserve the ``attrs`` when it is the same for all objects and discard the ``attrs`` when they are different (:issue:`41828`)
+- :class:`DataFrameGroupBy` operations with ``as_index=False`` now correctly retain ``ExtensionDtype`` dtypes for columns being grouped on (:issue:`41373`)
+- Add support for assigning values to ``by`` argument in :meth:`DataFrame.plot.hist` and :meth:`DataFrame.plot.box` (:issue:`15079`)
+- :meth:`Series.sample`, :meth:`DataFrame.sample`, and :meth:`.GroupBy.sample` now accept a ``np.random.Generator`` as input to ``random_state``. A generator will be more performant, especially with ``replace=False`` (:issue:`38100`)
+- :meth:`Series.ewm` and :meth:`DataFrame.ewm` now support a ``method`` argument with a ``'table'`` option that performs the windowing operation over an entire :class:`DataFrame`. See :ref:`Window Overview ` for performance and functional benefits (:issue:`42273`)
+- :meth:`.GroupBy.cummin` and :meth:`.GroupBy.cummax` now support the argument ``skipna`` (:issue:`34047`)
+- :meth:`read_table` now supports the argument ``storage_options`` (:issue:`39167`)
+- :meth:`DataFrame.to_stata` and :meth:`StataWriter` now accept the keyword only argument ``value_labels`` to save labels for non-categorical columns (:issue:`38454`)
+- Methods that rely on hashmap-based algorithms, such as :meth:`DataFrameGroupBy.value_counts`, :meth:`DataFrameGroupBy.count` and :func:`factorize`, now take the imaginary component of complex numbers into account, where previously it was ignored (:issue:`17927`)
+- Add :meth:`Series.str.removeprefix` and :meth:`Series.str.removesuffix` introduced in Python 3.9 to remove pre-/suffixes from string-type :class:`Series` (:issue:`36944`)
+- Attempting to write into a file in a missing parent directory with :meth:`DataFrame.to_csv`, :meth:`DataFrame.to_html`, :meth:`DataFrame.to_excel`, :meth:`DataFrame.to_feather`, :meth:`DataFrame.to_parquet`, :meth:`DataFrame.to_stata`, :meth:`DataFrame.to_json`, :meth:`DataFrame.to_pickle`, and :meth:`DataFrame.to_xml` now explicitly mentions the missing parent directory; the same is true for the :class:`Series` counterparts (:issue:`24306`)
+- Indexing with ``.loc`` and ``.iloc`` now supports ``Ellipsis`` (:issue:`37750`)
+- :meth:`IntegerArray.all`, :meth:`IntegerArray.any`, :meth:`FloatingArray.any`, and :meth:`FloatingArray.all` use Kleene logic (:issue:`41967`)
+- Added support for nullable boolean and integer types in :meth:`DataFrame.to_stata`, :class:`~pandas.io.stata.StataWriter`, :class:`~pandas.io.stata.StataWriter117`, and :class:`~pandas.io.stata.StataWriterUTF8` (:issue:`40855`)
+- :meth:`DataFrame.__pos__` and :meth:`DataFrame.__neg__` now retain ``ExtensionDtype`` dtypes (:issue:`43883`)
+- The error raised when an optional dependency can't be imported now includes the original exception, for easier investigation (:issue:`43882`)
+- Added :meth:`.ExponentialMovingWindow.sum` (:issue:`13297`)
+- :meth:`Series.str.split` now supports a ``regex`` argument that explicitly specifies whether the pattern is a regular expression. Default is ``None`` (:issue:`43563`, :issue:`32835`, :issue:`25549`)
+- :meth:`DataFrame.dropna` now accepts a single label as ``subset`` along with array-like (:issue:`41021`)
+- Added :meth:`DataFrameGroupBy.value_counts` (:issue:`43564`)
+- :func:`read_csv` now accepts a ``callable`` function in ``on_bad_lines`` when ``engine="python"`` for custom handling of bad lines; see the sketch after this list (:issue:`5686`)
+- :class:`ExcelWriter` argument ``if_sheet_exists="overlay"`` option added (:issue:`40231`)
+- :meth:`read_excel` now accepts a ``decimal`` argument that allows the user to specify the decimal point when parsing string columns to numeric (:issue:`14403`)
+- :meth:`.GroupBy.mean`, :meth:`.GroupBy.std`, :meth:`.GroupBy.var`, and :meth:`.GroupBy.sum` now support `Numba <https://numba.pydata.org/>`_ execution with the ``engine`` keyword (:issue:`43731`, :issue:`44862`, :issue:`44939`)
+- :meth:`Timestamp.isoformat` now handles the ``timespec`` argument from the base ``datetime`` class (:issue:`26131`)
+- :meth:`NaT.to_numpy` ``dtype`` argument is now respected, so ``np.timedelta64`` can be returned (:issue:`44460`)
+- New option ``display.max_dir_items`` customizes the number of columns added to :meth:`DataFrame.__dir__` and suggested for tab completion (:issue:`37996`)
+- Added "Juneteenth National Independence Day" to ``USFederalHolidayCalendar`` (:issue:`44574`)
+- :meth:`.Rolling.var`, :meth:`.Expanding.var`, :meth:`.Rolling.std`, and :meth:`.Expanding.std` now support `Numba <https://numba.pydata.org/>`_ execution with the ``engine`` keyword (:issue:`44461`)
+- :meth:`Series.info` has been added, for compatibility with :meth:`DataFrame.info` (:issue:`5167`)
+- Implemented :meth:`IntervalArray.min` and :meth:`IntervalArray.max`, as a result of which ``min`` and ``max`` now work for :class:`IntervalIndex`, :class:`Series` and :class:`DataFrame` with ``IntervalDtype`` (:issue:`44746`)
+- :meth:`UInt64Index.map` now retains ``dtype`` where possible (:issue:`44609`)
+- :meth:`read_json` can now parse unsigned long long integers (:issue:`26068`)
+- :meth:`DataFrame.take` now raises a ``TypeError`` when passed a scalar for the indexer (:issue:`42875`)
+- :meth:`is_list_like` now identifies duck-arrays as list-like unless ``.ndim == 0`` (:issue:`35131`)
+- :class:`ExtensionDtype` and :class:`ExtensionArray` are now (de)serialized when exporting a :class:`DataFrame` with :meth:`DataFrame.to_json` using ``orient='table'`` (:issue:`20612`, :issue:`44705`)
+- Add support for `Zstandard <https://facebook.github.io/zstd/>`_ compression to :meth:`DataFrame.to_pickle`/:meth:`read_pickle` and friends (:issue:`43925`)
+- :meth:`DataFrame.to_sql` now returns an ``int`` of the number of written rows (:issue:`23998`)
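+
+A short sketch of the callable ``on_bad_lines`` hook mentioned above (handler
+and data are illustrative):
+
+.. code-block:: python
+
+    import io
+
+    import pandas as pd
+
+    data = "a,b\n1,2\n3,4,5\n6,7"
+
+    def keep_first_two(bad_line):
+        # called once per malformed row; return a repaired row, or None to skip it
+        return bad_line[:2]
+
+    df = pd.read_csv(io.StringIO(data), engine="python", on_bad_lines=keep_first_two)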
+
+.. ---------------------------------------------------------------------------
+
+.. _whatsnew_140.notable_bug_fixes:
+
+Notable bug fixes
+~~~~~~~~~~~~~~~~~
+
+These are bug fixes that might have notable behavior changes.
+
+.. _whatsnew_140.notable_bug_fixes.inconsistent_date_string_parsing:
+
+Inconsistent date string parsing
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+The ``dayfirst`` option of :func:`to_datetime` isn't strict, and this can lead
+to surprising behavior:
+
+.. ipython:: python
+ :okwarning:
+
+ pd.to_datetime(["31-12-2021"], dayfirst=False)
+
+Now, a warning will be raised if a date string cannot be parsed in accordance with
+the given ``dayfirst`` value when the value is a delimited date string (e.g.
+``31-12-2012``).
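+
+Passing a ``dayfirst`` value that is consistent with the string parses without
+a warning; a sketch:
+
+.. code-block:: python
+
+    import pandas as pd
+
+    # matches the day-first string, so no warning is emitted
+    pd.to_datetime(["31-12-2021"], dayfirst=True)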
+
+.. _whatsnew_140.notable_bug_fixes.concat_with_empty_or_all_na:
+
+Ignoring dtypes in concat with empty or all-NA columns
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+When using :func:`concat` to concatenate two or more :class:`DataFrame` objects,
+if one of the DataFrames was empty or had all-NA values, its dtype was
+*sometimes* ignored when finding the concatenated dtype. These are now
+consistently *not* ignored (:issue:`43507`).
+
+.. ipython:: python
+
+ df1 = pd.DataFrame({"bar": [pd.Timestamp("2013-01-01")]}, index=range(1))
+ df2 = pd.DataFrame({"bar": np.nan}, index=range(1, 2))
+ res = pd.concat([df1, df2])
+
+Previously, the float-dtype in ``df2`` would be ignored so the result dtype
+would be ``datetime64[ns]``. As a result, the ``np.nan`` would be cast to
+``NaT``.
+
+*Previous behavior*:
+
+.. code-block:: ipython
+
+ In [4]: res
+ Out[4]:
+ bar
+ 0 2013-01-01
+ 1 NaT
+
+Now the float-dtype is respected. Since the common dtype for these DataFrames is
+object, the ``np.nan`` is retained.
+
+*New behavior*:
+
+.. ipython:: python
+
+ res
+
+.. _whatsnew_140.notable_bug_fixes.value_counts_and_mode_do_not_coerce_to_nan:
+
+Null-values are no longer coerced to NaN-value in value_counts and mode
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+:meth:`Series.value_counts` and :meth:`Series.mode` no longer coerce ``None``,
+``NaT`` and other null-values to a NaN-value for ``np.object``-dtype. This
+behavior is now consistent with ``unique``, ``isin`` and others
+(:issue:`42688`).
+
+.. ipython:: python
+
+ s = pd.Series([True, None, pd.NaT, None, pd.NaT, None])
+ res = s.value_counts(dropna=False)
+
+Previously, all null-values were replaced by a NaN-value.
+
+*Previous behavior*:
+
+.. code-block:: ipython
+
+ In [3]: res
+ Out[3]:
+ NaN 5
+ True 1
+ dtype: int64
+
+Now null-values are no longer mangled.
+
+*New behavior*:
+
+.. ipython:: python
+
+ res
+
+.. _whatsnew_140.notable_bug_fixes.read_csv_mangle_dup_cols:
+
+mangle_dupe_cols in read_csv no longer renames unique columns conflicting with target names
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+:func:`read_csv` no longer renames unique column labels which conflict with the target
+names of duplicated columns. Already existing columns are skipped, i.e. the next
+available index is used for the target column name (:issue:`14704`).
+
+.. ipython:: python
+
+ import io
+
+ data = "a,a,a.1\n1,2,3"
+ res = pd.read_csv(io.StringIO(data))
+
+Previously, the second column was called ``a.1``, while the third column was
+also renamed to ``a.1.1``.
+
+*Previous behavior*:
+
+.. code-block:: ipython
+
+ In [3]: res
+ Out[3]:
+ a a.1 a.1.1
+ 0 1 2 3
+
+Now the renaming checks whether ``a.1`` already exists when changing the name of the
+second column and skips this index. The second column is instead renamed to
+``a.2``.
+
+*New behavior*:
+
+.. ipython:: python
+
+ res
+
+.. _whatsnew_140.notable_bug_fixes.unstack_pivot_int32_limit:
+
+unstack and pivot_table no longer raises ValueError for result that would exceed int32 limit
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Previously :meth:`DataFrame.pivot_table` and :meth:`DataFrame.unstack` would
+raise a ``ValueError`` if the operation could produce a result with more than
+``2**31 - 1`` elements. This operation now raises a
+:class:`errors.PerformanceWarning` instead (:issue:`26314`).
+
+*Previous behavior*:
+
+.. code-block:: ipython
+
+ In [3]: df = pd.DataFrame({"ind1": np.arange(2 ** 16), "ind2": np.arange(2 ** 16), "count": 0})
+ In [4]: df.pivot_table(index="ind1", columns="ind2", values="count", aggfunc="count")
+ ValueError: Unstacked DataFrame is too big, causing int32 overflow
+
+*New behavior*:
+
+.. code-block:: ipython
+
+ In [4]: df.pivot_table(index="ind1", columns="ind2", values="count", aggfunc="count")
+ PerformanceWarning: The following operation may generate 4294967296 cells in the resulting pandas object.
+
+.. ---------------------------------------------------------------------------
+
+.. _whatsnew_140.notable_bug_fixes.groupby_apply_mutation:
+
+groupby.apply consistent transform detection
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+:meth:`.GroupBy.apply` is designed to be flexible, allowing users to perform
+aggregations, transformations, filters, and use it with user-defined functions
+that might not fall into any of these categories. As part of this, apply will
+attempt to detect when an operation is a transform, and in such a case, the
+result will have the same index as the input. In order to determine if the
+operation is a transform, pandas compares the input's index to the result's and
+determines if it has been mutated. Previously in pandas 1.3, different code
+paths used different definitions of "mutated": some would use Python's ``is``
+whereas others would test only up to equality.
+
+This inconsistency has been removed; pandas now tests up to equality.
+
+.. ipython:: python
+
+ def func(x):
+ return x.copy()
+
+ df = pd.DataFrame({'a': [1, 2], 'b': [3, 4], 'c': [5, 6]})
+ df
+
+*Previous behavior*:
+
+.. code-block:: ipython
+
+ In [3]: df.groupby(['a']).apply(func)
+ Out[3]:
+ a b c
+ a
+ 1 0 1 3 5
+ 2 1 2 4 6
+
+ In [4]: df.set_index(['a', 'b']).groupby(['a']).apply(func)
+ Out[4]:
+ c
+ a b
+ 1 3 5
+ 2 4 6
+
+In the examples above, the first uses a code path where pandas uses ``is`` and
+determines that ``func`` is not a transform whereas the second tests up to
+equality and determines that ``func`` is a transform. In the first case, the
+result's index is not the same as the input's.
+
+*New behavior*:
+
+.. ipython:: python
+
+ df.groupby(['a']).apply(func)
+ df.set_index(['a', 'b']).groupby(['a']).apply(func)
+
+Now in both cases it is determined that ``func`` is a transform. In each case,
+the result has the same index as the input.
+
+.. _whatsnew_140.api_breaking:
+
+Backwards incompatible API changes
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. _whatsnew_140.api_breaking.python:
+
+Increased minimum version for Python
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+pandas 1.4.0 supports Python 3.8 and higher.
+
+.. _whatsnew_140.api_breaking.deps:
+
+Increased minimum versions for dependencies
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+Some minimum supported versions of dependencies were updated.
+If installed, we now require:
+
++-----------------+-----------------+----------+---------+
+| Package | Minimum Version | Required | Changed |
++=================+=================+==========+=========+
+| numpy | 1.18.5 | X | X |
++-----------------+-----------------+----------+---------+
+| pytz | 2020.1 | X | X |
++-----------------+-----------------+----------+---------+
+| python-dateutil | 2.8.1 | X | X |
++-----------------+-----------------+----------+---------+
+| bottleneck | 1.3.1 | | X |
++-----------------+-----------------+----------+---------+
+| numexpr | 2.7.1 | | X |
++-----------------+-----------------+----------+---------+
+| pytest (dev) | 6.0 | | |
++-----------------+-----------------+----------+---------+
+| mypy (dev) | 0.930 | | X |
++-----------------+-----------------+----------+---------+
+
+For optional libraries the general
+recommendation is to use the latest version. The following table lists the
+lowest version per library that is currently being tested throughout the
+development of pandas. Optional libraries below the lowest tested version may
+still work, but are not considered supported.
+
++-----------------+-----------------+---------+
+| Package | Minimum Version | Changed |
++=================+=================+=========+
+| beautifulsoup4 | 4.8.2 | X |
++-----------------+-----------------+---------+
+| fastparquet | 0.4.0 | |
++-----------------+-----------------+---------+
+| fsspec | 0.7.4 | |
++-----------------+-----------------+---------+
+| gcsfs | 0.6.0 | |
++-----------------+-----------------+---------+
+| lxml | 4.5.0 | X |
++-----------------+-----------------+---------+
+| matplotlib | 3.3.2 | X |
++-----------------+-----------------+---------+
+| numba | 0.50.1 | X |
++-----------------+-----------------+---------+
+| openpyxl | 3.0.3 | X |
++-----------------+-----------------+---------+
+| pandas-gbq | 0.14.0 | X |
++-----------------+-----------------+---------+
+| pyarrow | 1.0.1 | X |
++-----------------+-----------------+---------+
+| pymysql | 0.10.1 | X |
++-----------------+-----------------+---------+
+| pytables | 3.6.1 | X |
++-----------------+-----------------+---------+
+| s3fs | 0.4.0 | |
++-----------------+-----------------+---------+
+| scipy | 1.4.1 | X |
++-----------------+-----------------+---------+
+| sqlalchemy | 1.4.0 | X |
++-----------------+-----------------+---------+
+| tabulate | 0.8.7 | |
++-----------------+-----------------+---------+
+| xarray | 0.15.1 | X |
++-----------------+-----------------+---------+
+| xlrd | 2.0.1 | X |
++-----------------+-----------------+---------+
+| xlsxwriter | 1.2.2 | X |
++-----------------+-----------------+---------+
+| xlwt | 1.3.0 | |
++-----------------+-----------------+---------+
+
+See :ref:`install.dependencies` and :ref:`install.optional_dependencies` for more.
+
+.. _whatsnew_140.api_breaking.other:
+
+Other API changes
+^^^^^^^^^^^^^^^^^
+- :meth:`Index.get_indexer_for` no longer accepts keyword arguments (other than ``target``); in the past these would be silently ignored if the index was not unique (:issue:`42310`)
+- Change in the position of the ``min_rows`` argument in :meth:`DataFrame.to_string` due to a change in the docstring (:issue:`44304`)
+- Reduction operations for :class:`DataFrame` or :class:`Series` now raise a ``ValueError`` when ``None`` is passed for ``skipna`` (:issue:`44178`)
+- :func:`read_csv` and :func:`read_html` no longer raise an error when one of the header rows consists only of ``Unnamed:`` columns (:issue:`13054`)
+- Changed the ``name`` attribute of several holidays in
+ ``USFederalHolidayCalendar`` to match official federal holiday
+ names, specifically (see the sketch after this list):
+
+ - "New Year's Day" gains the possessive apostrophe
+ - "Presidents Day" becomes "Washington's Birthday"
+ - "Martin Luther King Jr. Day" is now "Birthday of Martin Luther King, Jr."
+ - "July 4th" is now "Independence Day"
+ - "Thanksgiving" is now "Thanksgiving Day"
+ - "Christmas" is now "Christmas Day"
+ - Added "Juneteenth National Independence Day"
+
+.. ---------------------------------------------------------------------------
+
+.. _whatsnew_140.deprecations:
+
+Deprecations
+~~~~~~~~~~~~
+
+.. _whatsnew_140.deprecations.int64_uint64_float64index:
+
+Deprecated Int64Index, UInt64Index & Float64Index
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+:class:`Int64Index`, :class:`UInt64Index` and :class:`Float64Index` have been
+deprecated in favor of the base :class:`Index` class and will be removed in
+pandas 2.0 (:issue:`43028`).
+
+For constructing a numeric index, you can use the base :class:`Index` class
+instead, specifying the data type (which will also work on older pandas
+releases):
+
+.. code-block:: python
+
+ # replace
+ pd.Int64Index([1, 2, 3])
+ # with
+ pd.Index([1, 2, 3], dtype="int64")
+
+For checking the data type of an index object, you can replace ``isinstance``
+checks with checking the ``dtype``:
+
+.. code-block:: python
+
+ # replace
+ isinstance(idx, pd.Int64Index)
+ # with
+ idx.dtype == "int64"
+
+Currently, in order to maintain backward compatibility, calls to :class:`Index`
+will continue to return :class:`Int64Index`, :class:`UInt64Index` and
+:class:`Float64Index` when given numeric data, but in the future, an
+:class:`Index` will be returned.
+
+*Current behavior*:
+
+.. code-block:: ipython
+
+ In [1]: pd.Index([1, 2, 3], dtype="int32")
+ Out[1]: Int64Index([1, 2, 3], dtype='int64')
+ In [2]: pd.Index([1, 2, 3], dtype="uint64")
+ Out[2]: UInt64Index([1, 2, 3], dtype='uint64')
+
+*Future behavior*:
+
+.. code-block:: ipython
+
+ In [3]: pd.Index([1, 2, 3], dtype="int32")
+ Out[3]: Index([1, 2, 3], dtype='int32')
+ In [4]: pd.Index([1, 2, 3], dtype="uint64")
+ Out[4]: Index([1, 2, 3], dtype='uint64')
+
+
+.. _whatsnew_140.deprecations.frame_series_append:
+
+Deprecated Frame.append and Series.append
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+:meth:`DataFrame.append` and :meth:`Series.append` have been deprecated and will
+be removed in pandas 2.0. Use :func:`pandas.concat` instead (:issue:`35407`).
+
+*Deprecated syntax*
+
+.. code-block:: ipython
+
+ In [1]: pd.Series([1, 2]).append(pd.Series([3, 4]))
+ Out [1]:
+ <stdin>:1: FutureWarning: The series.append method is deprecated and will be removed from pandas in a future version. Use pandas.concat instead.
+ 0 1
+ 1 2
+ 0 3
+ 1 4
+ dtype: int64
+
+ In [2]: df1 = pd.DataFrame([[1, 2], [3, 4]], columns=list('AB'))
+ In [3]: df2 = pd.DataFrame([[5, 6], [7, 8]], columns=list('AB'))
+ In [4]: df1.append(df2)
+ Out [4]:
+ <stdin>:1: FutureWarning: The frame.append method is deprecated and will be removed from pandas in a future version. Use pandas.concat instead.
+ A B
+ 0 1 2
+ 1 3 4
+ 0 5 6
+ 1 7 8
+
+*Recommended syntax*
+
+.. ipython:: python
+
+ pd.concat([pd.Series([1, 2]), pd.Series([3, 4])])
+
+ df1 = pd.DataFrame([[1, 2], [3, 4]], columns=list('AB'))
+ df2 = pd.DataFrame([[5, 6], [7, 8]], columns=list('AB'))
+ pd.concat([df1, df2])
+
+
+.. _whatsnew_140.deprecations.other:
+
+Other Deprecations
+^^^^^^^^^^^^^^^^^^
+- Deprecated :meth:`Index.is_type_compatible` (:issue:`42113`)
+- Deprecated ``method`` argument in :meth:`Index.get_loc`, use ``index.get_indexer([label], method=...)`` instead (:issue:`42269`)
+- Deprecated treating integer keys in :meth:`Series.__setitem__` as positional when the index is a :class:`Float64Index` not containing the key, an :class:`IntervalIndex` with no entries containing the key, or a :class:`MultiIndex` with a leading :class:`Float64Index` level not containing the key (:issue:`33469`)
+- Deprecated treating ``numpy.datetime64`` objects as UTC times when passed to the :class:`Timestamp` constructor along with a timezone. In a future version, these will be treated as wall-times. To retain the old behavior, use ``Timestamp(dt64).tz_localize("UTC").tz_convert(tz)`` (:issue:`24559`)
+- Deprecated ignoring missing labels when indexing with a sequence of labels on a level of a :class:`MultiIndex` (:issue:`42351`)
+- Creating an empty :class:`Series` without a ``dtype`` will now raise a more visible ``FutureWarning`` instead of a ``DeprecationWarning`` (:issue:`30017`)
+- Deprecated the ``kind`` argument in :meth:`Index.get_slice_bound`, :meth:`Index.slice_indexer`, and :meth:`Index.slice_locs`; in a future version passing ``kind`` will raise (:issue:`42857`)
+- Deprecated dropping of nuisance columns in :class:`Rolling`, :class:`Expanding`, and :class:`EWM` aggregations (:issue:`42738`)
+- Deprecated :meth:`Index.reindex` with a non-unique :class:`Index` (:issue:`42568`)
+- Deprecated :meth:`.Styler.render` in favor of :meth:`.Styler.to_html` (:issue:`42140`)
+- Deprecated :meth:`.Styler.hide_index` and :meth:`.Styler.hide_columns` in favor of :meth:`.Styler.hide` (:issue:`43758`)
+- Deprecated passing in a string column label into ``times`` in :meth:`DataFrame.ewm` (:issue:`43265`)
+- Deprecated the ``include_start`` and ``include_end`` arguments in :meth:`DataFrame.between_time`; in a future version passing ``include_start`` or ``include_end`` will raise (:issue:`40245`)
+- Deprecated the ``squeeze`` argument to :meth:`read_csv`, :meth:`read_table`, and :meth:`read_excel`. Users should squeeze the :class:`DataFrame` afterwards with ``.squeeze("columns")`` instead (:issue:`43242`)
+- Deprecated the ``index`` argument to :class:`SparseArray` construction (:issue:`23089`)
+- Deprecated the ``closed`` argument in :meth:`date_range` and :meth:`bdate_range` in favor of the ``inclusive`` argument; in a future version passing ``closed`` will raise (:issue:`40245`)
+- Deprecated :meth:`.Rolling.validate`, :meth:`.Expanding.validate`, and :meth:`.ExponentialMovingWindow.validate` (:issue:`43665`)
+- Deprecated silent dropping of columns that raised a ``TypeError`` in :class:`Series.transform` and :class:`DataFrame.transform` when used with a dictionary (:issue:`43740`)
+- Deprecated silent dropping of columns that raised a ``TypeError``, ``DataError``, and some cases of ``ValueError`` in :meth:`Series.aggregate`, :meth:`DataFrame.aggregate`, :meth:`Series.groupby.aggregate`, and :meth:`DataFrame.groupby.aggregate` when used with a list (:issue:`43740`)
+- Deprecated casting behavior when setting timezone-aware value(s) into a timezone-aware :class:`Series` or :class:`DataFrame` column when the timezones do not match. Previously this cast to object dtype. In a future version, the values being inserted will be converted to the series or column's existing timezone (:issue:`37605`)
+- Deprecated casting behavior when passing an item with a mismatched timezone to :meth:`DatetimeIndex.insert`, :meth:`DatetimeIndex.putmask`, :meth:`DatetimeIndex.where`, :meth:`DatetimeIndex.fillna`, :meth:`Series.mask`, :meth:`Series.where`, :meth:`Series.fillna`, :meth:`Series.shift`, :meth:`Series.replace`, :meth:`Series.reindex` (and :class:`DataFrame` column analogues). In the past this has cast to object ``dtype``. In a future version, these will cast the passed item to the index or series's timezone (:issue:`37605`, :issue:`44940`)
+- Deprecated the ``prefix`` keyword argument in :func:`read_csv` and :func:`read_table`; in a future version the argument will be removed (:issue:`43396`)
+- Deprecated passing a non-boolean argument to ``sort`` in :func:`concat` (:issue:`41518`)
+- Deprecated passing arguments as positional for :func:`read_fwf` other than ``filepath_or_buffer`` (:issue:`41485`)
+- Deprecated passing arguments as positional for :func:`read_xml` other than ``path_or_buffer`` (:issue:`45133`)
+- Deprecated passing ``skipna=None`` for :meth:`DataFrame.mad` and :meth:`Series.mad`, pass ``skipna=True`` instead (:issue:`44580`)
+- Deprecated the behavior of :func:`to_datetime` with the string "now" with ``utc=False``; in a future version this will match ``Timestamp("now")``, which in turn matches :meth:`Timestamp.now` returning the local time (:issue:`18705`)
+- Deprecated :meth:`DateOffset.apply`, use ``offset + other`` instead (:issue:`44522`)
+- Deprecated parameter ``names`` in :meth:`Index.copy` (:issue:`44916`)
+- A deprecation warning is now shown for :meth:`DataFrame.to_latex`, indicating that the argument signature may change in future versions to more closely emulate the arguments to :meth:`.Styler.to_latex` (:issue:`44411`)
+- Deprecated behavior of :func:`concat` between objects with bool-dtype and numeric-dtypes; in a future version these will cast to object dtype instead of coercing bools to numeric values (:issue:`39817`)
+- Deprecated :meth:`Categorical.replace`, use :meth:`Series.replace` instead (:issue:`44929`)
+- Deprecated passing ``set`` or ``dict`` as indexer for :meth:`DataFrame.loc.__setitem__`, :meth:`DataFrame.loc.__getitem__`, :meth:`Series.loc.__setitem__`, :meth:`Series.loc.__getitem__`, :meth:`DataFrame.__getitem__`, :meth:`Series.__getitem__` and :meth:`Series.__setitem__` (:issue:`42825`)
+- Deprecated :meth:`Index.__getitem__` with a bool key; use ``index.values[key]`` to get the old behavior (:issue:`44051`)
+- Deprecated downcasting column-by-column in :meth:`DataFrame.where` with integer-dtypes (:issue:`44597`)
+- Deprecated :meth:`DatetimeIndex.union_many`, use :meth:`DatetimeIndex.union` instead (:issue:`44091`)
+- Deprecated :meth:`.Groupby.pad` in favor of :meth:`.Groupby.ffill` (:issue:`33396`)
+- Deprecated :meth:`.Groupby.backfill` in favor of :meth:`.Groupby.bfill` (:issue:`33396`)
+- Deprecated :meth:`.Resample.pad` in favor of :meth:`.Resample.ffill` (:issue:`33396`)
+- Deprecated :meth:`.Resample.backfill` in favor of :meth:`.Resample.bfill` (:issue:`33396`)
+- Deprecated ``numeric_only=None`` in :meth:`DataFrame.rank`; in a future version ``numeric_only`` must be either ``True`` or ``False`` (the default) (:issue:`45036`)
+- Deprecated the behavior of :meth:`Timestamp.utcfromtimestamp`, in the future it will return a timezone-aware UTC :class:`Timestamp` (:issue:`22451`)
+- Deprecated :meth:`NaT.freq` (:issue:`45071`)
+- Deprecated behavior of :class:`Series` and :class:`DataFrame` construction when passed float-dtype data containing ``NaN`` and an integer dtype ignoring the dtype argument; in a future version this will raise (:issue:`40110`)
+- Deprecated the behavior of :meth:`Series.to_frame` and :meth:`Index.to_frame` of ignoring the ``name`` argument when ``name=None``. Currently, this means the existing name is preserved, but in the future explicitly passing ``name=None`` will set ``None`` as the name of the column in the resulting DataFrame (:issue:`44212`)
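+
+As a minimal, illustrative sketch of two of the deprecations above (the file name ``data.csv`` is a placeholder, not part of the original notes):
+
+.. code-block:: python
+
+ # Deprecated: the ``closed`` keyword emits a FutureWarning under 1.4
+ pd.date_range("2022-01-01", "2022-01-04", closed="left")
+ # Replacement: the new ``inclusive`` keyword
+ pd.date_range("2022-01-01", "2022-01-04", inclusive="left")
+
+ # Deprecated: ``squeeze`` in read_csv
+ ser = pd.read_csv("data.csv", squeeze=True)
+ # Replacement: squeeze the resulting DataFrame explicitly
+ ser = pd.read_csv("data.csv").squeeze("columns")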
+
+.. ---------------------------------------------------------------------------
+
+.. _whatsnew_140.performance:
+
+Performance improvements
+~~~~~~~~~~~~~~~~~~~~~~~~
+- Performance improvement in :meth:`.GroupBy.sample`, especially when ``weights`` argument provided (:issue:`34483`)
+- Performance improvement when converting non-string arrays to string arrays (:issue:`34483`)
+- Performance improvement in :meth:`.GroupBy.transform` for user-defined functions (:issue:`41598`)
+- Performance improvement in constructing :class:`DataFrame` objects (:issue:`42631`, :issue:`43142`, :issue:`43147`, :issue:`43307`, :issue:`43144`, :issue:`44826`)
+- Performance improvement in :meth:`GroupBy.shift` when ``fill_value`` argument is provided (:issue:`26615`)
+- Performance improvement in :meth:`DataFrame.corr` for ``method=pearson`` on data without missing values (:issue:`40956`)
+- Performance improvement in some :meth:`GroupBy.apply` operations (:issue:`42992`, :issue:`43578`)
+- Performance improvement in :func:`read_stata` (:issue:`43059`, :issue:`43227`)
+- Performance improvement in :func:`read_sas` (:issue:`43333`)
+- Performance improvement in :meth:`to_datetime` with ``uint`` dtypes (:issue:`42606`)
+- Performance improvement in :meth:`to_datetime` with ``infer_datetime_format`` set to ``True`` (:issue:`43901`)
+- Performance improvement in :meth:`Series.sparse.to_coo` (:issue:`42880`)
+- Performance improvement in indexing with a :class:`UInt64Index` (:issue:`43862`)
+- Performance improvement in indexing with a :class:`Float64Index` (:issue:`43705`)
+- Performance improvement in indexing with a non-unique :class:`Index` (:issue:`43792`)
+- Performance improvement in indexing with a listlike indexer on a :class:`MultiIndex` (:issue:`43370`)
+- Performance improvement in indexing with a :class:`MultiIndex` indexer on another :class:`MultiIndex` (:issue:`43370`)
+- Performance improvement in :meth:`GroupBy.quantile` (:issue:`43469`, :issue:`43725`)
+- Performance improvement in :meth:`GroupBy.count` (:issue:`43730`, :issue:`43694`)
+- Performance improvement in :meth:`GroupBy.any` and :meth:`GroupBy.all` (:issue:`43675`, :issue:`42841`)
+- Performance improvement in :meth:`GroupBy.std` (:issue:`43115`, :issue:`43576`)
+- Performance improvement in :meth:`GroupBy.cumsum` (:issue:`43309`)
+- :meth:`SparseArray.min` and :meth:`SparseArray.max` no longer require converting to a dense array (:issue:`43526`)
+- Indexing into a :class:`SparseArray` with a ``slice`` with ``step=1`` no longer requires converting to a dense array (:issue:`43777`)
+- Performance improvement in :meth:`SparseArray.take` with ``allow_fill=False`` (:issue:`43654`)
+- Performance improvement in :meth:`.Rolling.mean`, :meth:`.Expanding.mean`, :meth:`.Rolling.sum`, :meth:`.Expanding.sum`, :meth:`.Rolling.max`, :meth:`.Expanding.max`, :meth:`.Rolling.min` and :meth:`.Expanding.min` with ``engine="numba"`` (:issue:`43612`, :issue:`44176`, :issue:`45170`)
+- Improved performance of :meth:`pandas.read_csv` with ``memory_map=True`` when file encoding is UTF-8 (:issue:`43787`)
+- Performance improvement in :meth:`RangeIndex.sort_values` overriding :meth:`Index.sort_values` (:issue:`43666`)
+- Performance improvement in :meth:`RangeIndex.insert` (:issue:`43988`)
+- Performance improvement in :meth:`Index.insert` (:issue:`43953`)
+- Performance improvement in :meth:`DatetimeIndex.tolist` (:issue:`43823`)
+- Performance improvement in :meth:`DatetimeIndex.union` (:issue:`42353`)
+- Performance improvement in :meth:`Series.nsmallest` (:issue:`43696`)
+- Performance improvement in :meth:`DataFrame.insert` (:issue:`42998`)
+- Performance improvement in :meth:`DataFrame.dropna` (:issue:`43683`)
+- Performance improvement in :meth:`DataFrame.fillna` (:issue:`43316`)
+- Performance improvement in :meth:`DataFrame.values` (:issue:`43160`)
+- Performance improvement in :meth:`DataFrame.select_dtypes` (:issue:`42611`)
+- Performance improvement in :class:`DataFrame` reductions (:issue:`43185`, :issue:`43243`, :issue:`43311`, :issue:`43609`)
+- Performance improvement in :meth:`Series.unstack` and :meth:`DataFrame.unstack` (:issue:`43335`, :issue:`43352`, :issue:`42704`, :issue:`43025`)
+- Performance improvement in :meth:`Series.to_frame` (:issue:`43558`)
+- Performance improvement in :meth:`Series.mad` (:issue:`43010`)
+- Performance improvement in :func:`merge` (:issue:`43332`)
+- Performance improvement in :func:`to_csv` when index column is a datetime and is formatted (:issue:`39413`)
+- Performance improvement in :func:`to_csv` when :class:`MultiIndex` contains a lot of unused levels (:issue:`37484`)
+- Performance improvement in :func:`read_csv` when ``index_col`` was set with a numeric column (:issue:`44158`)
+- Performance improvement in :func:`concat` (:issue:`43354`)
+- Performance improvement in :meth:`SparseArray.__getitem__` (:issue:`23122`)
+- Performance improvement in constructing a :class:`DataFrame` from array-like objects such as a ``PyTorch`` tensor (:issue:`44616`)
+
+.. ---------------------------------------------------------------------------
+
+.. _whatsnew_140.bug_fixes:
+
+Bug fixes
+~~~~~~~~~
+
+Categorical
+^^^^^^^^^^^
+- Bug in setting dtype-incompatible values into a :class:`Categorical` (or ``Series`` or ``DataFrame`` backed by ``Categorical``) raising ``ValueError`` instead of ``TypeError`` (:issue:`41919`)
+- Bug in :meth:`Categorical.searchsorted` when passing a dtype-incompatible value raising ``KeyError`` instead of ``TypeError`` (:issue:`41919`)
+- Bug in :meth:`Categorical.astype` casting datetimes and :class:`Timestamp` to int for dtype ``object`` (:issue:`44930`)
+- Bug in :meth:`Series.where` with ``CategoricalDtype`` when passing a dtype-incompatible value raising ``ValueError`` instead of ``TypeError`` (:issue:`41919`)
+- Bug in :meth:`Categorical.fillna` when passing a dtype-incompatible value raising ``ValueError`` instead of ``TypeError`` (:issue:`41919`)
+- Bug in :meth:`Categorical.fillna` with a tuple-like category raising ``ValueError`` instead of ``TypeError`` when filling with a non-category tuple (:issue:`41919`)
+
+Datetimelike
+^^^^^^^^^^^^
+- Bug in :class:`DataFrame` constructor unnecessarily copying non-datetimelike 2D object arrays (:issue:`39272`)
+- Bug in :func:`to_datetime` with ``format`` and ``pandas.NA`` raising ``ValueError`` (:issue:`42957`)
+- :func:`to_datetime` would silently swap ``MM/DD/YYYY`` and ``DD/MM/YYYY`` formats if the given ``dayfirst`` option could not be respected - now, a warning is raised in the case of delimited date strings (e.g. ``31-12-2012``) (:issue:`12585`)
+- Bug in :meth:`date_range` and :meth:`bdate_range` not returning the right bound when ``start`` = ``end`` and the interval is closed on one side (:issue:`43394`)
+- Bug in inplace addition and subtraction of :class:`DatetimeIndex` or :class:`TimedeltaIndex` with :class:`DatetimeArray` or :class:`TimedeltaArray` (:issue:`43904`)
+- Bug in calling ``np.isnan``, ``np.isfinite``, or ``np.isinf`` on a timezone-aware :class:`DatetimeIndex` incorrectly raising ``TypeError`` (:issue:`43917`)
+- Bug in constructing a :class:`Series` from datetime-like strings with mixed timezones incorrectly partially-inferring datetime values (:issue:`40111`)
+- Bug in addition of a :class:`Tick` object and a ``np.timedelta64`` object incorrectly raising instead of returning :class:`Timedelta` (:issue:`44474`)
+- ``np.maximum.reduce`` and ``np.minimum.reduce`` now correctly return :class:`Timestamp` and :class:`Timedelta` objects when operating on :class:`Series`, :class:`DataFrame`, or :class:`Index` with ``datetime64[ns]`` or ``timedelta64[ns]`` dtype (:issue:`43923`)
+- Bug in adding a ``np.timedelta64`` object to a :class:`BusinessDay` or :class:`CustomBusinessDay` object incorrectly raising (:issue:`44532`)
+- Bug in :meth:`Index.insert` for inserting ``np.datetime64``, ``np.timedelta64`` or ``tuple`` into :class:`Index` with ``dtype='object'`` with negative loc adding ``None`` and replacing existing value (:issue:`44509`)
+- Bug in :meth:`Timestamp.to_pydatetime` failing to retain the ``fold`` attribute (:issue:`45087`)
+- Bug in :meth:`Series.mode` with ``DatetimeTZDtype`` incorrectly returning timezone-naive and ``PeriodDtype`` incorrectly raising (:issue:`41927`)
+- Fixed regression in :meth:`~Series.reindex` raising an error when using an incompatible fill value with a datetime-like dtype (or not raising a deprecation warning for using a ``datetime.date`` as fill value) (:issue:`42921`)
+- Bug in :class:`DateOffset` addition with :class:`Timestamp` where ``offset.nanoseconds`` would not be included in the result (:issue:`43968`, :issue:`36589`)
+- Bug in :meth:`Timestamp.fromtimestamp` not supporting the ``tz`` argument (:issue:`45083`)
+- Bug in :class:`DataFrame` construction from dict of :class:`Series` with mismatched index dtypes sometimes raising depending on the ordering of the passed dict (:issue:`44091`)
+- Bug where hashing a :class:`Timestamp` during some DST transitions caused a segmentation fault (:issue:`33931` and :issue:`40817`)
+
+Timedelta
+^^^^^^^^^
+- Bug in division of an all-``NaT`` :class:`TimedeltaIndex`, :class:`Series` or :class:`DataFrame` column by an object-dtype array-like of numbers failing to infer the result as timedelta64-dtype (:issue:`39750`)
+- Bug in floor division of ``timedelta64[ns]`` data with a scalar returning garbage values (:issue:`44466`)
+- :class:`Timedelta` now properly takes into account any nanoseconds contribution from any kwarg (:issue:`43764`, :issue:`45227`)
+
+Time Zones
+^^^^^^^^^^
+- Bug in :func:`to_datetime` with ``infer_datetime_format=True`` failing to parse zero UTC offset (``Z``) correctly (:issue:`41047`)
+- Bug in :meth:`Series.dt.tz_convert` resetting index in a :class:`Series` with :class:`CategoricalIndex` (:issue:`43080`)
+- Bug in ``Timestamp`` and ``DatetimeIndex`` incorrectly raising a ``TypeError`` when subtracting two timezone-aware objects with mismatched timezones (:issue:`31793`)
+
+Numeric
+^^^^^^^
+- Bug in floor-dividing a list or tuple of integers by a :class:`Series` incorrectly raising (:issue:`44674`)
+- Bug in :meth:`DataFrame.rank` raising ``ValueError`` with ``object`` columns and ``method="first"`` (:issue:`41931`)
+- Bug in :meth:`DataFrame.rank` treating missing values and extreme values as equal (for example ``np.nan`` and ``np.inf``), causing incorrect results when ``na_option="bottom"`` or ``na_option="top"`` is used (:issue:`41931`)
+- Bug in ``numexpr`` engine still being used when the option ``compute.use_numexpr`` is set to ``False`` (:issue:`32556`)
+- Bug in :class:`DataFrame` arithmetic ops with a subclass whose :meth:`_constructor` attribute is a callable other than the subclass itself (:issue:`43201`)
+- Bug in arithmetic operations involving :class:`RangeIndex` where the result would have the incorrect ``name`` (:issue:`43962`)
+- Bug in arithmetic operations involving :class:`Series` where the result could have the incorrect ``name`` when the operands have matching NA or matching tuple names (:issue:`44459`)
+- Bug in division with ``IntegerDtype`` or ``BooleanDtype`` array and NA scalar incorrectly raising (:issue:`44685`)
+- Bug in multiplying a :class:`Series` with ``FloatingDtype`` with a timedelta-like scalar incorrectly raising (:issue:`44772`)
+
+Conversion
+^^^^^^^^^^
+- Bug in :class:`UInt64Index` constructor when passing a list containing both positive integers small enough to cast to int64 and integers too large to hold in int64 (:issue:`42201`)
+- Bug in :class:`Series` constructor returning 0 for missing values with dtype ``int64`` and ``False`` for dtype ``bool`` (:issue:`43017`, :issue:`43018`)
+- Bug in constructing a :class:`DataFrame` from a :class:`PandasArray` containing :class:`Series` objects behaving differently than an equivalent ``np.ndarray`` (:issue:`43986`)
+- Bug in :class:`IntegerDtype` not allowing coercion from string dtype (:issue:`25472`)
+- Bug in :func:`to_datetime` raising ``TypeError`` when ``arg`` is an ``xr.DataArray`` and ``unit="ns"`` is specified (:issue:`44053`)
+- Bug in :meth:`DataFrame.convert_dtypes` not returning the correct type when a subclass does not overload :meth:`_constructor_sliced` (:issue:`43201`)
+- Bug in :meth:`DataFrame.astype` not propagating ``attrs`` from the original :class:`DataFrame` (:issue:`44414`)
+- Bug in :meth:`DataFrame.convert_dtypes` result losing ``columns.names`` (:issue:`41435`)
+- Bug in constructing an ``IntegerArray`` from pyarrow data failing to validate dtypes (:issue:`44891`)
+- Bug in :meth:`Series.astype` not allowing converting from a ``PeriodDtype`` to ``datetime64`` dtype, inconsistent with the :class:`PeriodIndex` behavior (:issue:`45038`)
+
+Strings
+^^^^^^^
+- Bug in checking for ``string[pyarrow]`` dtype incorrectly raising an ``ImportError`` when pyarrow is not installed (:issue:`44276`)
+
+Interval
+^^^^^^^^
+- Bug in :meth:`Series.where` with ``IntervalDtype`` incorrectly raising when the ``where`` call should not replace anything (:issue:`44181`)
+
+Indexing
+^^^^^^^^
+- Bug in :meth:`Series.rename` with a :class:`MultiIndex` when ``level`` is provided (:issue:`43659`)
+- Bug in :meth:`DataFrame.truncate` and :meth:`Series.truncate` when the object's :class:`Index` has a length greater than one but only one unique value (:issue:`42365`)
+- Bug in :meth:`Series.loc` and :meth:`DataFrame.loc` with a :class:`MultiIndex` when indexing with a tuple in which one of the levels is also a tuple (:issue:`27591`)
+- Bug in :meth:`Series.loc` with a :class:`MultiIndex` whose first level contains only ``np.nan`` values (:issue:`42055`)
+- Bug in indexing on a :class:`Series` or :class:`DataFrame` with a :class:`DatetimeIndex` when passing a string, the return type depended on whether the index was monotonic (:issue:`24892`)
+- Bug in indexing on a :class:`MultiIndex` failing to drop scalar levels when the indexer is a tuple containing a datetime-like string (:issue:`42476`)
+- Bug in :meth:`DataFrame.sort_values` and :meth:`Series.sort_values` when passing an ``ascending`` value, failing to raise or incorrectly raising ``ValueError`` (:issue:`41634`)
+- Bug in updating values of a :class:`pandas.Series` using a boolean index created by :meth:`pandas.DataFrame.pop` (:issue:`42530`)
+- Bug in :meth:`Index.get_indexer_non_unique` when index contains multiple ``np.nan`` (:issue:`35392`)
+- Bug in :meth:`DataFrame.query` did not handle the degree sign in a backticked column name, such as \`Temp(°C)\`, used in an expression to query a :class:`DataFrame` (:issue:`42826`)
+- Bug in :meth:`DataFrame.drop` where the error message did not show missing labels with commas when raising ``KeyError`` (:issue:`42881`)
+- Bug in :meth:`DataFrame.query` where method calls in query strings led to errors when the ``numexpr`` package was installed (:issue:`22435`)
+- Bug in :meth:`DataFrame.nlargest` and :meth:`Series.nlargest` where sorted result did not count indexes containing ``np.nan`` (:issue:`28984`)
+- Bug in indexing on a non-unique object-dtype :class:`Index` with an NA scalar (e.g. ``np.nan``) (:issue:`43711`)
+- Bug in :meth:`DataFrame.__setitem__` incorrectly writing into an existing column's array rather than setting a new array when the new dtype and the old dtype match (:issue:`43406`)
+- Bug in setting floating-dtype values into a :class:`Series` with integer dtype failing to set inplace when those values can be losslessly converted to integers (:issue:`44316`)
+- Bug in :meth:`Series.__setitem__` with object dtype when setting an array with matching size and ``dtype='datetime64[ns]'`` or ``dtype='timedelta64[ns]'`` incorrectly converting the datetimes/timedeltas to integers (:issue:`43868`)
+- Bug in :meth:`DataFrame.sort_index` where ``ignore_index=True`` was not being respected when the index was already sorted (:issue:`43591`)
+- Bug in :meth:`Index.get_indexer_non_unique` when index contains multiple ``np.datetime64("NaT")`` and ``np.timedelta64("NaT")`` (:issue:`43869`)
+- Bug in setting a scalar :class:`Interval` value into a :class:`Series` with ``IntervalDtype`` when the scalar's sides are floats and the values' sides are integers (:issue:`44201`)
+- Bug when setting string-backed :class:`Categorical` values that can be parsed to datetimes into a :class:`DatetimeArray` or :class:`Series` or :class:`DataFrame` column backed by :class:`DatetimeArray` failing to parse these strings (:issue:`44236`)
+- Bug in :meth:`Series.__setitem__` with an integer dtype other than ``int64`` setting with a ``range`` object unnecessarily upcasting to ``int64`` (:issue:`44261`)
+- Bug in :meth:`Series.__setitem__` with a boolean mask indexer setting a listlike value of length 1 incorrectly broadcasting that value (:issue:`44265`)
+- Bug in :meth:`Series.reset_index` not ignoring ``name`` argument when ``drop`` and ``inplace`` are set to ``True`` (:issue:`44575`)
+- Bug in :meth:`DataFrame.loc.__setitem__` and :meth:`DataFrame.iloc.__setitem__` with mixed dtypes sometimes failing to operate in-place (:issue:`44345`)
+- Bug in :meth:`DataFrame.loc.__getitem__` incorrectly raising ``KeyError`` when selecting a single column with a boolean key (:issue:`44322`).
+- Bug in setting :meth:`DataFrame.iloc` with a single ``ExtensionDtype`` column and setting 2D values e.g. ``df.iloc[:] = df.values`` incorrectly raising (:issue:`44514`)
+- Bug in setting values with :meth:`DataFrame.iloc` with a single ``ExtensionDtype`` column and a tuple of arrays as the indexer (:issue:`44703`)
+- Bug in indexing on columns with ``loc`` or ``iloc`` using a slice with a negative step with ``ExtensionDtype`` columns incorrectly raising (:issue:`44551`)
+- Bug in :meth:`DataFrame.loc.__setitem__` changing dtype when indexer was completely ``False`` (:issue:`37550`)
+- Bug in :meth:`IntervalIndex.get_indexer_non_unique` returning a boolean mask instead of an array of integers for a non-unique and non-monotonic index (:issue:`44084`)
+- Bug in :meth:`IntervalIndex.get_indexer_non_unique` not handling targets of ``dtype`` 'object' with NaNs correctly (:issue:`44482`)
+- Fixed regression where a single column ``np.matrix`` was no longer coerced to a 1d ``np.ndarray`` when added to a :class:`DataFrame` (:issue:`42376`)
+- Bug in :meth:`Series.__getitem__` with a :class:`CategoricalIndex` of integers treating lists of integers as positional indexers, inconsistent with the behavior with a single scalar integer (:issue:`15470`, :issue:`14865`)
+- Bug in :meth:`Series.__setitem__` when setting floats or integers into integer-dtype :class:`Series` failing to upcast when necessary to retain precision (:issue:`45121`)
+- Bug in :meth:`DataFrame.iloc.__setitem__` ignoring the ``axis`` argument (:issue:`45032`)
+
+Missing
+^^^^^^^
+- Bug in :meth:`DataFrame.fillna` with ``limit`` and no ``method`` ignoring ``axis='columns'`` or ``axis=1`` (:issue:`40989`, :issue:`17399`)
+- Bug in :meth:`DataFrame.fillna` not replacing missing values when using a dict-like ``value`` and duplicate column names (:issue:`43476`)
+- Bug in constructing a :class:`DataFrame` from a dictionary with an ``np.datetime64`` as a value and ``dtype='timedelta64[ns]'``, or vice-versa, incorrectly casting instead of raising (:issue:`44428`)
+- Bug in :meth:`Series.interpolate` and :meth:`DataFrame.interpolate` with ``inplace=True`` not writing to the underlying array(s) in-place (:issue:`44749`)
+- Bug in :meth:`Index.fillna` incorrectly returning an unfilled :class:`Index` when NA values are present and ``downcast`` argument is specified. This now raises ``NotImplementedError`` instead; do not pass ``downcast`` argument (:issue:`44873`)
+- Bug in :meth:`DataFrame.dropna` changing :class:`Index` even if no entries were dropped (:issue:`41965`)
+- Bug in :meth:`Series.fillna` with an object-dtype incorrectly ignoring ``downcast="infer"`` (:issue:`44241`)
+
+MultiIndex
+^^^^^^^^^^
+- Bug in :meth:`MultiIndex.get_loc` where the first level is a :class:`DatetimeIndex` and a string key is passed (:issue:`42465`)
+- Bug in :meth:`MultiIndex.reindex` when passing a ``level`` that corresponds to an ``ExtensionDtype`` level (:issue:`42043`)
+- Bug in :meth:`MultiIndex.get_loc` raising ``TypeError`` instead of ``KeyError`` on nested tuple (:issue:`42440`)
+- Bug in :meth:`MultiIndex.union` setting wrong ``sortorder`` causing errors in subsequent indexing operations with slices (:issue:`44752`)
+- Bug in :meth:`MultiIndex.putmask` where the other value was also a :class:`MultiIndex` (:issue:`43212`)
+- Bug in :meth:`MultiIndex.dtypes` with duplicate level names returning only one dtype per name (:issue:`45174`)
+
+I/O
+^^^
+- Bug in :func:`read_excel` attempting to read chart sheets from .xlsx files (:issue:`41448`)
+- Bug in :func:`json_normalize` where ``errors=ignore`` could fail to ignore missing values of ``meta`` when ``record_path`` has a length greater than one (:issue:`41876`)
+- Bug in :func:`read_csv` with multi-header input and arguments referencing column names as tuples (:issue:`42446`)
+- Bug in :func:`read_fwf`, where difference in lengths of ``colspecs`` and ``names`` was not raising ``ValueError`` (:issue:`40830`)
+- Bug in :func:`Series.to_json` and :func:`DataFrame.to_json` where some attributes were skipped when serializing plain Python objects to JSON (:issue:`42768`, :issue:`33043`)
+- Bug where column headers were dropped when constructing a :class:`DataFrame` from a sqlalchemy ``Row`` object (:issue:`40682`)
+- Bug in unpickling an :class:`Index` with object dtype incorrectly inferring numeric dtypes (:issue:`43188`)
+- Bug in :func:`read_csv` where reading multi-header input with unequal lengths incorrectly raised ``IndexError`` (:issue:`43102`)
+- Bug in :func:`read_csv` raising ``ParserError`` when reading file in chunks and some chunk blocks have fewer columns than header for ``engine="c"`` (:issue:`21211`)
+- Bug in :func:`read_csv`: the exception class raised when expecting a file path name or file-like object was changed from ``OSError`` to ``TypeError`` (:issue:`43366`)
+- Bug in :func:`read_csv` and :func:`read_fwf` ignoring all ``skiprows`` except first when ``nrows`` is specified for ``engine='python'`` (:issue:`44021`, :issue:`10261`)
+- Bug in :func:`read_csv` keeping the original column in object format when ``keep_date_col=True`` is set (:issue:`13378`)
+- Bug in :func:`read_json` not handling non-numpy dtypes correctly (especially ``category``) (:issue:`21892`, :issue:`33205`)
+- Bug in :func:`json_normalize` where multi-character ``sep`` parameter is incorrectly prefixed to every key (:issue:`43831`)
+- Bug in :func:`json_normalize` where reading data with missing multi-level metadata would not respect ``errors="ignore"`` (:issue:`44312`)
+- Bug in :func:`read_csv` using the second row to guess the implicit index if ``header`` was set to ``None`` for ``engine="python"`` (:issue:`22144`)
+- Bug in :func:`read_csv` not recognizing bad lines when ``names`` were given for ``engine="c"`` (:issue:`22144`)
+- Bug in :func:`read_csv` with ``float_precision="round_trip"`` which did not skip initial/trailing whitespace (:issue:`43713`)
+- Bug when Python is built without the lzma module: a warning was raised at pandas import time even if the lzma capability isn't used (:issue:`43495`)
+- Bug in :func:`read_csv` not applying dtype for ``index_col`` (:issue:`9435`)
+- Bug in dumping/loading a :class:`DataFrame` with ``yaml.dump(frame)`` (:issue:`42748`)
+- Bug in :func:`read_csv` raising ``ValueError`` when ``names`` was longer than ``header`` but equal to data rows for ``engine="python"`` (:issue:`38453`)
+- Bug in :class:`ExcelWriter`, where ``engine_kwargs`` were not passed through to all engines (:issue:`43442`)
+- Bug in :func:`read_csv` raising ``ValueError`` when ``parse_dates`` was used with :class:`MultiIndex` columns (:issue:`8991`)
+- Bug in :func:`read_csv` not raising a ``ValueError`` when ``\n`` was specified as ``delimiter`` or ``sep``, which conflicts with ``lineterminator`` (:issue:`43528`)
+- Bug in :func:`to_csv` converting datetimes in categorical :class:`Series` to integers (:issue:`40754`)
+- Bug in :func:`read_csv` converting columns to numeric after date parsing failed (:issue:`11019`)
+- Bug in :func:`read_csv` not replacing ``NaN`` values with ``np.nan`` before attempting date conversion (:issue:`26203`)
+- Bug in :func:`read_csv` raising ``AttributeError`` when attempting to read a .csv file and infer the index column dtype from a nullable integer type (:issue:`44079`)
+- Bug in :func:`to_csv` always coercing datetime columns with different formats to the same format (:issue:`21734`)
+- :meth:`DataFrame.to_csv` and :meth:`Series.to_csv` with ``compression`` set to ``'zip'`` no longer create a zip file containing a file ending with ".zip". Instead, they try to infer the inner file name more smartly; see the example following this list (:issue:`39465`)
+- Bug in :func:`read_csv` where reading a mixed column of booleans and missing values to a float type results in the missing values becoming 1.0 rather than NaN (:issue:`42808`, :issue:`34120`)
+- Bug in :func:`to_xml` raising error for ``pd.NA`` with extension array dtype (:issue:`43903`)
+- Bug in :func:`read_csv` where, when a parser was passed in ``date_parser`` together with ``parse_dates=False``, parsing was still called (:issue:`44366`)
+- Bug in :func:`read_csv` not setting name of :class:`MultiIndex` columns correctly when ``index_col`` is not the first column (:issue:`38549`)
+- Bug in :func:`read_csv` silently ignoring errors when failing to create a memory-mapped file (:issue:`44766`)
+- Bug in :func:`read_csv` when passing a ``tempfile.SpooledTemporaryFile`` opened in binary mode (:issue:`44748`)
+- Bug in :func:`read_json` raising ``ValueError`` when attempting to parse json strings containing "://" (:issue:`36271`)
+- Bug in :func:`read_csv` when ``engine="c"`` and ``encoding_errors=None`` which caused a segfault (:issue:`45180`)
+- Bug in :func:`read_csv` where an invalid value of ``usecols`` led to an unclosed file handle (:issue:`45384`)
+- Fixed a memory leak in :meth:`DataFrame.to_json` (:issue:`43877`)
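+
+A minimal sketch of the ``'zip'`` inner-file naming fix mentioned above (the exact inferred member name, assumed here to become ``out.csv``, may differ):
+
+.. code-block:: python
+
+ df = pd.DataFrame({"a": [1, 2]})
+ # The archive member name is now inferred (e.g. "out.csv") rather than
+ # the file inside out.zip itself being named "out.zip"
+ df.to_csv("out.zip", compression="zip")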
+
+Period
+^^^^^^
+- Bug in adding a :class:`Period` object to a ``np.timedelta64`` object incorrectly raising ``TypeError`` (:issue:`44182`)
+- Bug in :meth:`PeriodIndex.to_timestamp` when the index has ``freq="B"`` inferring ``freq="D"`` for its result instead of ``freq="B"`` (:issue:`44105`)
+- Bug in :class:`Period` constructor incorrectly allowing ``np.timedelta64("NaT")`` (:issue:`44507`)
+- Bug in :meth:`PeriodIndex.to_timestamp` giving incorrect values for indexes with non-contiguous data (:issue:`44100`)
+- Bug in :meth:`Series.where` with ``PeriodDtype`` incorrectly raising when the ``where`` call should not replace anything (:issue:`45135`)
+
+Plotting
+^^^^^^^^
+- When given non-numeric data, :meth:`DataFrame.boxplot` now raises a ``ValueError`` rather than a cryptic ``KeyError`` or ``ZeroDivisionError``, in line with other plotting functions like :meth:`DataFrame.hist` (:issue:`43480`)
+
+Groupby/resample/rolling
+^^^^^^^^^^^^^^^^^^^^^^^^
+- Bug in :meth:`SeriesGroupBy.apply` where passing an unrecognized string argument failed to raise ``TypeError`` when the underlying ``Series`` is empty (:issue:`42021`)
+- Bug in :meth:`Series.rolling.apply`, :meth:`DataFrame.rolling.apply`, :meth:`Series.expanding.apply` and :meth:`DataFrame.expanding.apply` with ``engine="numba"`` where ``*args`` were being cached with the user passed function (:issue:`42287`)
+- Bug in :meth:`GroupBy.max` and :meth:`GroupBy.min` with nullable integer dtypes losing precision (:issue:`41743`)
+- Bug in :meth:`DataFrame.groupby.rolling.var` calculating the rolling variance only on the first group (:issue:`42442`)
+- Bug in :meth:`GroupBy.shift` that would return the grouping columns if ``fill_value`` was not ``None`` (:issue:`41556`)
+- Bug in :meth:`SeriesGroupBy.nlargest` and :meth:`SeriesGroupBy.nsmallest` would have an inconsistent index when the input :class:`Series` was sorted and ``n`` was greater than or equal to all group sizes (:issue:`15272`, :issue:`16345`, :issue:`29129`)
+- Bug in :meth:`pandas.DataFrame.ewm`, where non-float64 dtypes were silently failing (:issue:`42452`)
+- Bug in :meth:`pandas.DataFrame.rolling` operating along rows (``axis=1``) incorrectly omitting columns containing ``float16`` and ``float32`` (:issue:`41779`)
+- Bug in :meth:`Resampler.aggregate` not allowing the use of named aggregation (:issue:`32803`)
+- Bug in :meth:`Series.rolling` when the :class:`Series` ``dtype`` was ``Int64`` (:issue:`43016`)
+- Bug in :meth:`DataFrame.rolling.corr` when the :class:`DataFrame` columns were a :class:`MultiIndex` (:issue:`21157`)
+- Bug in :meth:`DataFrame.groupby.rolling` when specifying ``on`` and calling ``__getitem__`` would subsequently return incorrect results (:issue:`43355`)
+- Bug in :meth:`GroupBy.apply` with time-based :class:`Grouper` objects incorrectly raising ``ValueError`` in corner cases where the grouping vector contains a ``NaT`` (:issue:`43500`, :issue:`43515`)
+- Bug in :meth:`GroupBy.mean` failing with ``complex`` dtype (:issue:`43701`)
+- Bug in :meth:`Series.rolling` and :meth:`DataFrame.rolling` not calculating window bounds correctly for the first row when ``center=True`` and index is decreasing (:issue:`43927`)
+- Bug in :meth:`Series.rolling` and :meth:`DataFrame.rolling` for centered datetimelike windows with uneven nanosecond spacing (:issue:`43997`)
+- Bug in :meth:`GroupBy.mean` raising ``KeyError`` when column was selected at least twice (:issue:`44924`)
+- Bug in :meth:`GroupBy.nth` failing on ``axis=1`` (:issue:`43926`)
+- Bug in :meth:`Series.rolling` and :meth:`DataFrame.rolling` not respecting the right bound on centered datetime-like windows if the index contains duplicates (:issue:`3944`)
+- Bug in :meth:`Series.rolling` and :meth:`DataFrame.rolling` when using a :class:`pandas.api.indexers.BaseIndexer` subclass that returned unequal start and end arrays would segfault instead of raising a ``ValueError`` (:issue:`44470`)
+- Bug in :meth:`GroupBy.nunique` not respecting ``observed=True`` for ``categorical`` grouping columns (:issue:`45128`)
+- Bug in :meth:`GroupBy.head` and :meth:`GroupBy.tail` not dropping groups with ``NaN`` when ``dropna=True`` (:issue:`45089`)
+- Bug in :meth:`GroupBy.__iter__` after selecting a subset of columns in a :class:`GroupBy` object, which returned all columns instead of the chosen subset (:issue:`44821`)
+- Bug in :meth:`GroupBy.rolling` failing to correctly raise ``ValueError`` when non-monotonic data is passed (:issue:`43909`)
+- Bug where grouping by a :class:`Series` that has a ``categorical`` data type and length unequal to the axis of grouping raised ``ValueError`` (:issue:`44179`)
+
+Reshaping
+^^^^^^^^^
+- Improved error message when creating a :class:`DataFrame` column from a multi-dimensional :class:`numpy.ndarray` (:issue:`42463`)
+- Bug in :func:`concat` creating :class:`MultiIndex` with duplicate level entries when concatenating a :class:`DataFrame` with duplicates in :class:`Index` and multiple keys (:issue:`42651`)
+- Bug in :meth:`pandas.cut` on :class:`Series` with duplicate indices and non-exact :meth:`pandas.CategoricalIndex` (:issue:`42185`, :issue:`42425`)
+- Bug in :meth:`DataFrame.append` failing to retain dtypes when appended columns do not match (:issue:`43392`)
+- Bug in :func:`concat` of ``bool`` and ``boolean`` dtypes resulting in ``object`` dtype instead of ``boolean`` dtype (:issue:`42800`)
+- Bug in :func:`crosstab` when inputs are categorical :class:`Series`, there are categories that are not present in one or both of the :class:`Series`, and ``margins=True``. Previously the margin value for missing categories was ``NaN``. It is now correctly reported as 0 (:issue:`43505`)
+- Bug in :func:`concat` would fail when the ``objs`` argument all had the same index and the ``keys`` argument contained duplicates (:issue:`43595`)
+- Bug in :func:`concat` which ignored the ``sort`` parameter (:issue:`43375`)
+- Bug in :func:`merge` with :class:`MultiIndex` as column index for the ``on`` argument returning an error when assigning a column internally (:issue:`43734`)
+- Bug in :func:`crosstab` would fail when inputs are lists or tuples (:issue:`44076`)
+- Bug in :meth:`DataFrame.append` failing to retain ``index.name`` when appending a list of :class:`Series` objects (:issue:`44109`)
+- Fixed metadata propagation in the :meth:`DataFrame.apply` method, consequently fixing the same issue for :meth:`DataFrame.transform`, :meth:`DataFrame.nunique` and :meth:`DataFrame.mode` (:issue:`28283`)
+- Bug in :func:`concat` casting levels of :class:`MultiIndex` to float if all levels only consist of missing values (:issue:`44900`)
+- Bug in :meth:`DataFrame.stack` with ``ExtensionDtype`` columns incorrectly raising (:issue:`43561`)
+- Bug in :func:`merge` raising ``KeyError`` when joining over differently named indexes with on keywords (:issue:`45094`)
+- Bug in :meth:`Series.unstack` with object dtype doing unwanted type inference on resulting columns (:issue:`44595`)
+- Bug in :meth:`MultiIndex.join` with overlapping ``IntervalIndex`` levels (:issue:`44096`)
+- Bug in :meth:`DataFrame.replace` and :meth:`Series.replace` returning results with different ``dtype`` based on the ``regex`` parameter (:issue:`44864`)
+- Bug in :meth:`DataFrame.pivot` with ``index=None`` when the :class:`DataFrame` index was a :class:`MultiIndex` (:issue:`23955`)
+
+Sparse
+^^^^^^
+- Bug in :meth:`DataFrame.sparse.to_coo` raising ``AttributeError`` when column names are not unique (:issue:`29564`)
+- Bug in :meth:`SparseArray.max` and :meth:`SparseArray.min` raising ``ValueError`` for arrays with 0 non-null elements (:issue:`43527`)
+- Bug in :meth:`DataFrame.sparse.to_coo` silently converting non-zero fill values to zero (:issue:`24817`)
+- Bug in :class:`SparseArray` comparison methods with an array-like operand of mismatched length raising ``AssertionError`` or unclear ``ValueError`` depending on the input (:issue:`43863`)
+- Bug in :class:`SparseArray` arithmetic methods ``floordiv`` and ``mod`` not matching the non-sparse :class:`Series` behavior when dividing by zero (:issue:`38172`)
+- Bug in :class:`SparseArray` unary methods as well as :meth:`SparseArray.isna` not recalculating indexes (:issue:`44955`)
+
+ExtensionArray
+^^^^^^^^^^^^^^
+- Bug in :func:`array` failing to preserve :class:`PandasArray` (:issue:`43887`)
+- NumPy ufuncs ``np.abs``, ``np.positive``, ``np.negative`` now correctly preserve dtype when called on ExtensionArrays that implement ``__abs__, __pos__, __neg__``, respectively. In particular this is fixed for :class:`TimedeltaArray` (:issue:`43899`, :issue:`23316`)
+- NumPy ufuncs ``np.minimum.reduce``, ``np.maximum.reduce``, ``np.add.reduce``, and ``np.prod.reduce`` now work correctly instead of raising ``NotImplementedError`` on :class:`Series` with ``IntegerDtype`` or ``FloatingDtype`` (:issue:`43923`, :issue:`44793`)
+- NumPy ufuncs with ``out`` keyword are now supported by arrays with ``IntegerDtype`` and ``FloatingDtype`` (:issue:`45122`)
+- Avoid raising ``PerformanceWarning`` about fragmented :class:`DataFrame` when using many columns with an extension dtype (:issue:`44098`)
+- Bug in :class:`IntegerArray` and :class:`FloatingArray` construction incorrectly coercing mismatched NA values (e.g. ``np.timedelta64("NaT")``) to numeric NA (:issue:`44514`)
+- Bug in :meth:`BooleanArray.__eq__` and :meth:`BooleanArray.__ne__` raising ``TypeError`` on comparison with an incompatible type (like a string). This caused :meth:`DataFrame.replace` to sometimes raise a ``TypeError`` if a nullable boolean column was included (:issue:`44499`)
+- Bug in :func:`array` incorrectly raising when passed a ``ndarray`` with ``float16`` dtype (:issue:`44715`)
+- Bug in calling ``np.sqrt`` on :class:`BooleanArray` returning a malformed :class:`FloatingArray` (:issue:`44715`)
+- Bug in :meth:`Series.where` with ``ExtensionDtype`` when ``other`` is a NA scalar incompatible with the :class:`Series` dtype (e.g. ``NaT`` with a numeric dtype) incorrectly casting to a compatible NA value (:issue:`44697`)
+- Bug in :meth:`Series.replace` where explicitly passing ``value=None`` was treated as if no ``value`` was passed, so that ``None`` was not present in the result (:issue:`36984`, :issue:`19998`)
+- Bug in :meth:`Series.replace` performing unwanted downcasting in no-op replacements (:issue:`44498`)
+- Bug in :meth:`Series.replace` with ``FloatingDtype``, ``string[python]``, or ``string[pyarrow]`` dtype not being preserved when possible (:issue:`33484`, :issue:`40732`, :issue:`31644`, :issue:`41215`, :issue:`25438`)
+
+Styler
+^^^^^^
+- Bug in :class:`.Styler` where the ``uuid`` at initialization maintained a floating underscore (:issue:`43037`)
+- Bug in :meth:`.Styler.to_html` where the ``Styler`` object was updated if the ``to_html`` method was called with some args (:issue:`43034`)
+- Bug in :meth:`.Styler.copy` where ``uuid`` was not previously copied (:issue:`40675`)
+- Bug in :meth:`Styler.apply` where functions which returned :class:`Series` objects were not correctly handled in terms of aligning their index labels (:issue:`13657`, :issue:`42014`)
+- Bug when rendering an empty :class:`DataFrame` with a named :class:`Index` (:issue:`43305`)
+- Bug when rendering a single level :class:`MultiIndex` (:issue:`43383`)
+- Bug when combining non-sparse rendering and :meth:`.Styler.hide_columns` or :meth:`.Styler.hide_index` (:issue:`43464`)
+- Bug setting a table style when using multiple selectors in :class:`.Styler` (:issue:`44011`)
+- Bugs where row trimming and column trimming failed to reflect hidden rows (:issue:`43703`, :issue:`44247`)
+
+Other
+^^^^^
+- Bug in :meth:`DataFrame.astype` with non-unique columns and a :class:`Series` ``dtype`` argument (:issue:`44417`)
+- Bug in :meth:`CustomBusinessMonthBegin.__add__` (:meth:`CustomBusinessMonthEnd.__add__`) not applying the extra ``offset`` parameter when beginning (end) of the target month is already a business day (:issue:`41356`)
+- Bug in :meth:`RangeIndex.union` with another ``RangeIndex`` with matching (even) ``step`` and starts differing by strictly less than ``step / 2`` (:issue:`44019`)
+- Bug in :meth:`RangeIndex.difference` with ``sort=None`` and ``step<0`` failing to sort (:issue:`44085`)
+- Bug in :meth:`Series.replace` and :meth:`DataFrame.replace` with ``value=None`` and ExtensionDtypes (:issue:`44270`, :issue:`37899`)
+- Bug in :meth:`FloatingArray.equals` failing to consider two arrays equal if they contain ``np.nan`` values (:issue:`44382`)
+- Bug in :meth:`DataFrame.shift` with ``axis=1`` and ``ExtensionDtype`` columns incorrectly raising when an incompatible ``fill_value`` is passed (:issue:`44564`)
+- Bug in :meth:`DataFrame.shift` with ``axis=1`` and ``periods`` larger than ``len(frame.columns)`` producing an invalid :class:`DataFrame` (:issue:`44978`)
+- Bug in :meth:`DataFrame.diff` when passing a NumPy integer object instead of an ``int`` object (:issue:`44572`)
+- Bug in :meth:`Series.replace` raising ``ValueError`` when using ``regex=True`` with a :class:`Series` containing ``np.nan`` values (:issue:`43344`)
+- Bug in :meth:`DataFrame.to_records` where an incorrect ``n`` was used when missing names were replaced by ``level_n`` (:issue:`44818`)
+- Bug in :meth:`DataFrame.eval` where ``resolvers`` argument was overriding the default resolvers (:issue:`34966`)
+- :meth:`Series.__repr__` and :meth:`DataFrame.__repr__` no longer replace all null-values in indexes with "NaN" but use their real string-representations. "NaN" is used only for ``float("nan")`` (:issue:`45263`)
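+
+A minimal illustration of the repr change above (the exact output is indicative; only ``float("nan")`` is rendered as "NaN"):
+
+.. code-block:: python
+
+ # pd.NaT and None in an object-dtype index now display as "NaT" and
+ # "None"; only float("nan") is shown as "NaN"
+ pd.Series([1, 2, 3], index=[pd.NaT, None, float("nan")])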
+
+.. ---------------------------------------------------------------------------
+
+.. _whatsnew_140.contributors:
+
+Contributors
+~~~~~~~~~~~~
+
+.. contributors:: v1.3.5..v1.4.0|HEAD
diff --git a/environment.yml b/environment.yml
index 788b88ef16ad6..15dd329f80deb 100644
--- a/environment.yml
+++ b/environment.yml
@@ -3,9 +3,9 @@ channels:
- conda-forge
dependencies:
# required
- - numpy>=1.17.3
+ - numpy>=1.18.5
- python=3.8
- - python-dateutil>=2.7.3
+ - python-dateutil>=2.8.1
- pytz
# benchmarks
@@ -15,16 +15,16 @@ dependencies:
# The compiler packages are meta-packages and install the correct compiler (activation) packages on the respective platforms.
- c-compiler
- cxx-compiler
- - cython>=0.29.21
+ - cython>=0.29.24
# code checks
- black=21.5b2
- cpplint
- - flake8=3.9.2
+ - flake8=4.0.1
- flake8-bugbear=21.3.2 # used by flake8, find likely bugs
- - flake8-comprehensions=3.1.0 # used by flake8, linting of unnecessary comprehensions
+ - flake8-comprehensions=3.7.0 # used by flake8, linting of unnecessary comprehensions
- isort>=5.2.1 # check that imports are in the right order
- - mypy=0.812
+ - mypy=0.930
- pre-commit>=2.9.2
- pycodestyle # used by flake8
- pyupgrade
@@ -34,6 +34,11 @@ dependencies:
- gitdb
- sphinx
- sphinx-panels
+ - numpydoc < 1.2 # 2021-02-09 1.2dev breaking CI
+ - types-python-dateutil
+ - types-PyMySQL
+ - types-pytz
+ - types-setuptools
# documentation (jupyter notebooks)
- nbconvert>=5.4.1
@@ -55,12 +60,12 @@ dependencies:
# testing
- boto3
- botocore>=1.11
- - hypothesis>=3.82
+ - hypothesis>=5.5.3
- moto # mock S3
- flask
- - pytest>=5.0.1
+ - pytest>=6.0
- pytest-cov
- - pytest-xdist>=1.21
+ - pytest-xdist>=1.31
- pytest-asyncio
- pytest-instafail
@@ -71,24 +76,24 @@ dependencies:
# unused (required indirectly may be?)
- ipywidgets
- nbformat
- - notebook>=5.7.5
+ - notebook>=6.0.3
- pip
# optional
- blosc
- - bottleneck>=1.2.1
+ - bottleneck>=1.3.1
- ipykernel
- ipython>=7.11.1
- jinja2 # pandas.Styler
- - matplotlib>=2.2.2 # pandas.plotting, Series.plot, DataFrame.plot
- - numexpr>=2.7.0
- - scipy>=1.2
- - numba>=0.46.0
+ - matplotlib>=3.3.2 # pandas.plotting, Series.plot, DataFrame.plot
+ - numexpr>=2.7.1
+ - scipy>=1.4.1
+ - numba>=0.50.1
# optional for io
# ---------------
# pd.read_html
- - beautifulsoup4>=4.6.0
+ - beautifulsoup4>=4.8.2
- html5lib
- lxml
@@ -99,22 +104,22 @@ dependencies:
- xlwt
- odfpy
- - fastparquet>=0.3.2 # pandas.read_parquet, DataFrame.to_parquet
- - pyarrow>=0.17.0 # pandas.read_parquet, DataFrame.to_parquet, pandas.read_feather, DataFrame.to_feather
+ - fastparquet>=0.4.0 # pandas.read_parquet, DataFrame.to_parquet
+ - pyarrow>2.0.1 # pandas.read_parquet, DataFrame.to_parquet, pandas.read_feather, DataFrame.to_feather
- python-snappy # required by pyarrow
- - pyqt>=5.9.2 # pandas.read_clipboard
- - pytables>=3.5.1 # pandas.read_hdf, DataFrame.to_hdf
+ - pytables>=3.6.1 # pandas.read_hdf, DataFrame.to_hdf
- s3fs>=0.4.0 # file IO when using 's3://...' path
+ - aiobotocore<2.0.0 # GH#44311 pinned to fix docbuild
- fsspec>=0.7.4 # for generic remote file operations
- gcsfs>=0.6.0 # file IO when using 'gcs://...' path
- sqlalchemy # pandas.read_sql, DataFrame.to_sql
- - xarray # DataFrame.to_xarray
+ - xarray<0.19 # DataFrame.to_xarray
- cftime # Needed for downstream xarray.CFTimeIndex test
- pyreadstat # pandas.read_spss
- tabulate>=0.8.3 # DataFrame.to_markdown
- natsort # DataFrame.sort_values
- pip:
- git+https://github.com/pydata/pydata-sphinx-theme.git@master
- - numpydoc < 1.2 # 2021-02-09 1.2dev breaking CI
- pandas-dev-flaker==0.2.0
+ - pytest-cython
diff --git a/flake8/cython-template.cfg b/flake8/cython-template.cfg
deleted file mode 100644
index 3d7b288fd8055..0000000000000
--- a/flake8/cython-template.cfg
+++ /dev/null
@@ -1,3 +0,0 @@
-[flake8]
-filename = *.pxi.in
-select = E501,E302,E203,E111,E114,E221,E303,E231,E126,F403
diff --git a/flake8/cython.cfg b/flake8/cython.cfg
deleted file mode 100644
index 2dfe47b60b4c1..0000000000000
--- a/flake8/cython.cfg
+++ /dev/null
@@ -1,3 +0,0 @@
-[flake8]
-filename = *.pyx,*.pxd
-select=E501,E302,E203,E111,E114,E221,E303,E128,E231,E126,E265,E305,E301,E127,E261,E271,E129,W291,E222,E241,E123,F403,C400,C401,C402,C403,C404,C405,C406,C407,C408,C409,C410,C411
diff --git a/pandas/__init__.py b/pandas/__init__.py
index db4043686bcbb..1b18af0f69cf2 100644
--- a/pandas/__init__.py
+++ b/pandas/__init__.py
@@ -19,21 +19,19 @@
del hard_dependencies, dependency, missing_dependencies
# numpy compat
-from pandas.compat import (
- np_version_under1p18 as _np_version_under1p18,
- is_numpy_dev as _is_numpy_dev,
-)
+from pandas.compat import is_numpy_dev as _is_numpy_dev
try:
from pandas._libs import hashtable as _hashtable, lib as _lib, tslib as _tslib
-except ImportError as e: # pragma: no cover
- # hack but overkill to use re
- module = str(e).replace("cannot import name ", "")
+except ImportError as err: # pragma: no cover
+ module = err.name
raise ImportError(
f"C extension: {module} not built. If you want to import "
"pandas from the source directory, you may need to run "
"'python setup.py build_ext --force' to build the C extensions first."
- ) from e
+ ) from err
+else:
+ del _tslib, _lib, _hashtable
from pandas._config import (
get_option,
@@ -74,10 +72,7 @@
# indexes
Index,
CategoricalIndex,
- Int64Index,
- UInt64Index,
RangeIndex,
- Float64Index,
MultiIndex,
IntervalIndex,
TimedeltaIndex,
@@ -137,7 +132,7 @@
qcut,
)
-import pandas.api
+from pandas import api, arrays, errors, io, plotting, testing, tseries
from pandas.util._print_versions import show_versions
from pandas.io.api import (
@@ -176,8 +171,6 @@
from pandas.io.json import _json_normalize as json_normalize
from pandas.util._tester import test
-import pandas.testing
-import pandas.arrays
# use the closest tagged version if possible
from pandas._version import get_versions
@@ -187,12 +180,36 @@
__git_version__ = v.get("full-revisionid")
del get_versions, v
-
# GH 27101
+__deprecated_num_index_names = ["Float64Index", "Int64Index", "UInt64Index"]
+
+
+def __dir__():
+ # GH43028
+ # Int64Index etc. are deprecated, but we still want them to be available in the dir.
+ # Remove in Pandas 2.0, when we remove Int64Index etc. from the code base.
+ return list(globals().keys()) + __deprecated_num_index_names
+
+
def __getattr__(name):
import warnings
- if name == "datetime":
+ if name in __deprecated_num_index_names:
+ warnings.warn(
+ f"pandas.{name} is deprecated "
+ "and will be removed from pandas in a future version. "
+ "Use pandas.Index with the appropriate dtype instead.",
+ FutureWarning,
+ stacklevel=2,
+ )
+ from pandas.core.api import Float64Index, Int64Index, UInt64Index
+
+ return {
+ "Float64Index": Float64Index,
+ "Int64Index": Int64Index,
+ "UInt64Index": UInt64Index,
+ }[name]
+ elif name == "datetime":
warnings.warn(
"The pandas.datetime class is deprecated "
"and will be removed from pandas in a future version. "
@@ -210,7 +227,7 @@ def __getattr__(name):
warnings.warn(
"The pandas.np module is deprecated "
"and will be removed from pandas in a future version. "
- "Import numpy directly instead",
+ "Import numpy directly instead.",
FutureWarning,
stacklevel=2,
)
@@ -221,7 +238,7 @@ def __getattr__(name):
elif name in {"SparseSeries", "SparseDataFrame"}:
warnings.warn(
f"The {name} class is removed from pandas. Accessing it from "
- "the top-level namespace will also be removed in the next version",
+ "the top-level namespace will also be removed in the next version.",
FutureWarning,
stacklevel=2,
)
@@ -284,3 +301,121 @@ def __getattr__(name):
- Time series-specific functionality: date range generation and frequency
conversion, moving window statistics, date shifting and lagging.
"""
+
+# Use __all__ to let type checkers know what is part of the public API.
+# Pandas is not (yet) a py.typed library: the public API is determined
+# based on the documentation.
+__all__ = [
+ "BooleanDtype",
+ "Categorical",
+ "CategoricalDtype",
+ "CategoricalIndex",
+ "DataFrame",
+ "DateOffset",
+ "DatetimeIndex",
+ "DatetimeTZDtype",
+ "ExcelFile",
+ "ExcelWriter",
+ "Flags",
+ "Float32Dtype",
+ "Float64Dtype",
+ "Grouper",
+ "HDFStore",
+ "Index",
+ "IndexSlice",
+ "Int16Dtype",
+ "Int32Dtype",
+ "Int64Dtype",
+ "Int8Dtype",
+ "Interval",
+ "IntervalDtype",
+ "IntervalIndex",
+ "MultiIndex",
+ "NA",
+ "NaT",
+ "NamedAgg",
+ "Period",
+ "PeriodDtype",
+ "PeriodIndex",
+ "RangeIndex",
+ "Series",
+ "SparseDtype",
+ "StringDtype",
+ "Timedelta",
+ "TimedeltaIndex",
+ "Timestamp",
+ "UInt16Dtype",
+ "UInt32Dtype",
+ "UInt64Dtype",
+ "UInt8Dtype",
+ "api",
+ "array",
+ "arrays",
+ "bdate_range",
+ "concat",
+ "crosstab",
+ "cut",
+ "date_range",
+ "describe_option",
+ "errors",
+ "eval",
+ "factorize",
+ "get_dummies",
+ "get_option",
+ "infer_freq",
+ "interval_range",
+ "io",
+ "isna",
+ "isnull",
+ "json_normalize",
+ "lreshape",
+ "melt",
+ "merge",
+ "merge_asof",
+ "merge_ordered",
+ "notna",
+ "notnull",
+ "offsets",
+ "option_context",
+ "options",
+ "period_range",
+ "pivot",
+ "pivot_table",
+ "plotting",
+ "qcut",
+ "read_clipboard",
+ "read_csv",
+ "read_excel",
+ "read_feather",
+ "read_fwf",
+ "read_gbq",
+ "read_hdf",
+ "read_html",
+ "read_json",
+ "read_orc",
+ "read_parquet",
+ "read_pickle",
+ "read_sas",
+ "read_spss",
+ "read_sql",
+ "read_sql_query",
+ "read_sql_table",
+ "read_stata",
+ "read_table",
+ "read_xml",
+ "reset_option",
+ "set_eng_float_format",
+ "set_option",
+ "show_versions",
+ "test",
+ "testing",
+ "timedelta_range",
+ "to_datetime",
+ "to_numeric",
+ "to_pickle",
+ "to_timedelta",
+ "tseries",
+ "unique",
+ "value_counts",
+ "wide_to_long",
+]
diff --git a/pandas/_config/config.py b/pandas/_config/config.py
index be3498dc0829b..5a0f58266c203 100644
--- a/pandas/_config/config.py
+++ b/pandas/_config/config.py
@@ -50,7 +50,6 @@
from __future__ import annotations
-from collections import namedtuple
from contextlib import (
ContextDecorator,
contextmanager,
@@ -60,14 +59,28 @@
Any,
Callable,
Iterable,
+ NamedTuple,
cast,
)
import warnings
from pandas._typing import F
-DeprecatedOption = namedtuple("DeprecatedOption", "key msg rkey removal_ver")
-RegisteredOption = namedtuple("RegisteredOption", "key defval doc validator cb")
+
+class DeprecatedOption(NamedTuple):
+ key: str
+ msg: str | None
+ rkey: str | None
+ removal_ver: str | None
+
+
+class RegisteredOption(NamedTuple):
+ key: str
+ defval: object
+ doc: str
+ validator: Callable[[object], Any] | None
+ cb: Callable[[str], Any] | None
+
# holds deprecated option metadata
_deprecated_options: dict[str, DeprecatedOption] = {}
@@ -85,7 +98,7 @@
class OptionError(AttributeError, KeyError):
"""
Exception for pandas.options, backwards compatible with KeyError
- checks
+ checks.
"""
@@ -157,7 +170,7 @@ def _describe_option(pat: str = "", _print_desc: bool = True):
if len(keys) == 0:
raise OptionError("No such keys(s)")
- s = "\n".join(_build_option_description(k) for k in keys)
+ s = "\n".join([_build_option_description(k) for k in keys])
if _print_desc:
print(s)
@@ -320,7 +333,7 @@ def __doc__(self):
Prints the description for one or more registered options.
-Call with not arguments to get a listing for all registered options.
+Call with no arguments to get a listing for all registered options.
Available options:
@@ -398,7 +411,7 @@ class option_context(ContextDecorator):
Examples
--------
>>> with option_context('display.max_rows', 10, 'display.max_columns', 5):
- ... ...
+ ... pass
"""
def __init__(self, *args):
@@ -425,7 +438,7 @@ def register_option(
key: str,
defval: object,
doc: str = "",
- validator: Callable[[Any], Any] | None = None,
+ validator: Callable[[object], Any] | None = None,
cb: Callable[[str], Any] | None = None,
) -> None:
"""
@@ -497,7 +510,10 @@ def register_option(
def deprecate_option(
- key: str, msg: str | None = None, rkey: str | None = None, removal_ver=None
+ key: str,
+ msg: str | None = None,
+ rkey: str | None = None,
+ removal_ver: str | None = None,
) -> None:
"""
Mark option `key` as deprecated, if code attempts to access this option,
@@ -523,7 +539,7 @@ def deprecate_option(
re-routed to `rkey` including set/get/reset.
rkey must be a fully-qualified option name (e.g "x.y.z.rkey").
used by the default message if no `msg` is specified.
- removal_ver : optional
+ removal_ver : str, optional
Specifies the version in which this option will
be removed. used by the default message if no `msg` is specified.
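
As a usage sketch for the signature above (the option keys are hypothetical, chosen only for illustration): register a replacement option, then mark the old key deprecated so that access warns and re-routes.

```python
from pandas._config import config as cf

# Hypothetical keys, not real pandas options.
cf.register_option("example.new_key", 42, doc="replacement option")
cf.deprecate_option(
    "example.old_key",
    rkey="example.new_key",  # get/set/reset are re-routed here
    removal_ver="2.0",       # quoted in the default warning message
)
```
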
@@ -626,7 +642,6 @@ def _warn_if_deprecated(key: str) -> bool:
d = _get_deprecated_option(key)
if d:
if d.msg:
- print(d.msg)
warnings.warn(d.msg, FutureWarning)
else:
msg = f"'{key}' is deprecated"
@@ -747,10 +762,12 @@ def inner(key: str, *args, **kwds):
set_option = wrap(set_option)
get_option = wrap(get_option)
register_option = wrap(register_option)
- yield None
- set_option = _set_option
- get_option = _get_option
- register_option = _register_option
+ try:
+ yield
+ finally:
+ set_option = _set_option
+ get_option = _get_option
+ register_option = _register_option
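
The try/finally rewrite above closes a real gap: previously, an exception raised inside the block would propagate before the three restore assignments ran, leaving the wrapped functions in place. The underlying pattern, as a self-contained sketch with illustrative names:

```python
from contextlib import contextmanager

@contextmanager
def swapped_attr(obj, name, replacement):
    # Swap in a replacement attribute; the finally clause guarantees the
    # original is restored even if the caller's with-block raises.
    original = getattr(obj, name)
    setattr(obj, name, replacement)
    try:
        yield
    finally:
        setattr(obj, name, original)
```
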
# These factories and methods are handy for use as the validator
@@ -823,7 +840,7 @@ def inner(x) -> None:
return inner
-def is_nonnegative_int(value: int | None) -> None:
+def is_nonnegative_int(value: object) -> None:
"""
Verify that value is None or a positive int.
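
Widening the annotation to `object` matches how validators are invoked: with whatever value the user hands to `set_option`. A sketch of a body consistent with the docstring above (the implementation itself is not shown in this hunk, so this is an assumption):

```python
def is_nonnegative_int(value: object) -> None:
    # Accept None or an int >= 0; reject everything else.
    if value is None:
        return
    if isinstance(value, int) and value >= 0:
        return
    raise ValueError("Value must be a nonnegative integer or None")
```
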
diff --git a/pandas/_config/localization.py b/pandas/_config/localization.py
index bc76aca93da2a..2a487fa4b6877 100644
--- a/pandas/_config/localization.py
+++ b/pandas/_config/localization.py
@@ -3,16 +3,24 @@
Name `localization` is chosen to avoid overlap with builtin `locale` module.
"""
+from __future__ import annotations
+
from contextlib import contextmanager
import locale
import re
import subprocess
+from typing import (
+ Callable,
+ Iterator,
+)
from pandas._config.config import options
@contextmanager
-def set_locale(new_locale, lc_var: int = locale.LC_ALL):
+def set_locale(
+ new_locale: str | tuple[str, str], lc_var: int = locale.LC_ALL
+) -> Iterator[str | tuple[str, str]]:
"""
Context manager for temporarily setting a locale.
@@ -71,7 +79,7 @@ def can_set_locale(lc: str, lc_var: int = locale.LC_ALL) -> bool:
return True
-def _valid_locales(locales, normalize):
+def _valid_locales(locales: list[str] | str, normalize: bool) -> list[str]:
"""
Return a list of normalized locales that do not throw an ``Exception``
when set.
@@ -98,11 +106,15 @@ def _valid_locales(locales, normalize):
]
-def _default_locale_getter():
+def _default_locale_getter() -> bytes:
return subprocess.check_output(["locale -a"], shell=True)
-def get_locales(prefix=None, normalize=True, locale_getter=_default_locale_getter):
+def get_locales(
+ prefix: str | None = None,
+ normalize: bool = True,
+ locale_getter: Callable[[], bytes] = _default_locale_getter,
+) -> list[str] | None:
"""
Get all the locales that are available on the system.
@@ -142,9 +154,9 @@ def get_locales(prefix=None, normalize=True, locale_getter=_default_locale_gette
# raw_locales is "\n" separated list of locales
# it may contain non-decodable parts, so split
# extract what we can and then rejoin.
- raw_locales = raw_locales.split(b"\n")
+ split_raw_locales = raw_locales.split(b"\n")
out_locales = []
- for x in raw_locales:
+ for x in split_raw_locales:
try:
out_locales.append(str(x, encoding=options.display.encoding))
except UnicodeError:
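
The newly annotated `set_locale` follows the standard save/set/restore dance around `locale.setlocale`. A simplified, self-contained sketch of that pattern (not the pandas implementation itself, which also handles tuple locales and normalization):

```python
import locale
from contextlib import contextmanager
from typing import Iterator

@contextmanager
def set_locale(new_locale: str, lc_var: int = locale.LC_ALL) -> Iterator[str]:
    # Remember the current setting, apply the new one, always restore.
    saved = locale.setlocale(lc_var)
    try:
        yield locale.setlocale(lc_var, new_locale)
    finally:
        locale.setlocale(lc_var, saved)
```
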
diff --git a/pandas/_libs/algos.pxd b/pandas/_libs/algos.pxd
index 7e87f4767c86d..fdeff2ed11805 100644
--- a/pandas/_libs/algos.pxd
+++ b/pandas/_libs/algos.pxd
@@ -1,4 +1,12 @@
-from pandas._libs.util cimport numeric
+from pandas._libs.dtypes cimport numeric_t
-cdef numeric kth_smallest_c(numeric* arr, Py_ssize_t k, Py_ssize_t n) nogil
+cdef numeric_t kth_smallest_c(numeric_t* arr, Py_ssize_t k, Py_ssize_t n) nogil
+
+cdef enum TiebreakEnumType:
+ TIEBREAK_AVERAGE
+    TIEBREAK_MIN
+ TIEBREAK_MAX
+ TIEBREAK_FIRST
+ TIEBREAK_FIRST_DESCENDING
+ TIEBREAK_DENSE
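
The `numeric` → `numeric_t` switch tracks the consolidation of fused types into `pandas/_libs/dtypes.pxd`. A Cython fused type is loosely analogous to a constrained `TypeVar`: one definition, specialized per concrete numeric type. A Python-level analogy only (the real mechanism is C code generation at compile time):

```python
from typing import TypeVar

# Analogy: like a fused type, a single signature is validated once per
# allowed concrete type instead of being written out N times.
NumericT = TypeVar("NumericT", int, float)

def kth_item(values: list[NumericT], k: int) -> NumericT:
    return sorted(values)[k]
```
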
diff --git a/pandas/_libs/algos.pyi b/pandas/_libs/algos.pyi
index d0f664c323a89..df8ac3f3b0696 100644
--- a/pandas/_libs/algos.pyi
+++ b/pandas/_libs/algos.pyi
@@ -1,8 +1,11 @@
-# Note: this covers algos.pyx and algos_common_helper but NOT algos_take_helper
+from __future__ import annotations
+
from typing import Any
import numpy as np
+from pandas._typing import npt
+
class Infinity:
"""
Provide a positive Infinity comparison method for ranking.
@@ -30,7 +33,7 @@ class NegInfinity:
def unique_deltas(
arr: np.ndarray, # const int64_t[:]
) -> np.ndarray: ... # np.ndarray[np.int64, ndim=1]
-def is_lexsorted(list_of_arrays: list[np.ndarray]) -> bool: ...
+def is_lexsorted(list_of_arrays: list[npt.NDArray[np.int64]]) -> bool: ...
def groupsort_indexer(
index: np.ndarray, # const int64_t[:]
ngroups: int,
@@ -47,18 +50,14 @@ def kth_smallest(
# Pairwise correlation/covariance
def nancorr(
- mat: np.ndarray, # const float64_t[:, :]
- cov: bool = False,
- minp=None,
-) -> np.ndarray: ... # ndarray[float64_t, ndim=2]
+ mat: npt.NDArray[np.float64], # const float64_t[:, :]
+ cov: bool = ...,
+ minp: int | None = ...,
+) -> npt.NDArray[np.float64]: ... # ndarray[float64_t, ndim=2]
def nancorr_spearman(
- mat: np.ndarray, # ndarray[float64_t, ndim=2]
- minp: int = 1,
-) -> np.ndarray: ... # ndarray[float64_t, ndim=2]
-def nancorr_kendall(
- mat: np.ndarray, # ndarray[float64_t, ndim=2]
- minp: int = 1,
-) -> np.ndarray: ... # ndarray[float64_t, ndim=2]
+ mat: npt.NDArray[np.float64], # ndarray[float64_t, ndim=2]
+ minp: int = ...,
+) -> npt.NDArray[np.float64]: ... # ndarray[float64_t, ndim=2]
# ----------------------------------------------------------------------
@@ -75,36 +74,36 @@ def nancorr_kendall(
# uint16_t
# uint8_t
-def validate_limit(nobs: int | None, limit=None) -> int: ...
+def validate_limit(nobs: int | None, limit=...) -> int: ...
def pad(
old: np.ndarray, # ndarray[algos_t]
new: np.ndarray, # ndarray[algos_t]
- limit=None,
-) -> np.ndarray: ... # np.ndarray[np.intp, ndim=1]
+ limit=...,
+) -> npt.NDArray[np.intp]: ... # np.ndarray[np.intp, ndim=1]
def pad_inplace(
values: np.ndarray, # algos_t[:]
mask: np.ndarray, # uint8_t[:]
- limit=None,
+ limit=...,
) -> None: ...
def pad_2d_inplace(
values: np.ndarray, # algos_t[:, :]
mask: np.ndarray, # const uint8_t[:, :]
- limit=None,
+ limit=...,
) -> None: ...
def backfill(
old: np.ndarray, # ndarray[algos_t]
new: np.ndarray, # ndarray[algos_t]
- limit=None,
-) -> np.ndarray: ... # np.ndarray[np.intp, ndim=1]
+ limit=...,
+) -> npt.NDArray[np.intp]: ... # np.ndarray[np.intp, ndim=1]
def backfill_inplace(
values: np.ndarray, # algos_t[:]
mask: np.ndarray, # uint8_t[:]
- limit=None,
+ limit=...,
) -> None: ...
def backfill_2d_inplace(
values: np.ndarray, # algos_t[:, :]
mask: np.ndarray, # const uint8_t[:, :]
- limit=None,
+ limit=...,
) -> None: ...
def is_monotonic(
arr: np.ndarray, # ndarray[algos_t, ndim=1]
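
The recurring `limit=None` → `limit=...` edits follow stub-file convention: a `.pyi` declares signatures, not behavior, so default values are spelled `...` and the real default lives only in the implementation, where it cannot drift out of sync. For instance, in a hypothetical stub:

```python
# clip.pyi (hypothetical): the stub records *that* lower and upper have
# defaults without restating what those defaults are.
def clip(value: float, lower: float = ..., upper: float = ...) -> float: ...
```
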
@@ -123,7 +122,7 @@ def is_monotonic(
def rank_1d(
values: np.ndarray, # ndarray[rank_t, ndim=1]
- labels: np.ndarray, # const int64_t[:]
+ labels: np.ndarray | None = ..., # const int64_t[:]=None
is_datetimelike: bool = ...,
ties_method=...,
ascending: bool = ...,
@@ -146,243 +145,302 @@ def diff_2d(
axis: int,
datetimelike: bool = ...,
) -> None: ...
-def ensure_platform_int(arr: object) -> np.ndarray: ...
-def ensure_object(arr: object) -> np.ndarray: ...
-def ensure_float64(arr: object, copy=True) -> np.ndarray: ...
-def ensure_float32(arr: object, copy=True) -> np.ndarray: ...
-def ensure_int8(arr: object, copy=True) -> np.ndarray: ...
-def ensure_int16(arr: object, copy=True) -> np.ndarray: ...
-def ensure_int32(arr: object, copy=True) -> np.ndarray: ...
-def ensure_int64(arr: object, copy=True) -> np.ndarray: ...
-def ensure_uint8(arr: object, copy=True) -> np.ndarray: ...
-def ensure_uint16(arr: object, copy=True) -> np.ndarray: ...
-def ensure_uint32(arr: object, copy=True) -> np.ndarray: ...
-def ensure_uint64(arr: object, copy=True) -> np.ndarray: ...
+def ensure_platform_int(arr: object) -> npt.NDArray[np.intp]: ...
+def ensure_object(arr: object) -> npt.NDArray[np.object_]: ...
+def ensure_complex64(arr: object, copy=...) -> npt.NDArray[np.complex64]: ...
+def ensure_complex128(arr: object, copy=...) -> npt.NDArray[np.complex128]: ...
+def ensure_float64(arr: object, copy=...) -> npt.NDArray[np.float64]: ...
+def ensure_float32(arr: object, copy=...) -> npt.NDArray[np.float32]: ...
+def ensure_int8(arr: object, copy=...) -> npt.NDArray[np.int8]: ...
+def ensure_int16(arr: object, copy=...) -> npt.NDArray[np.int16]: ...
+def ensure_int32(arr: object, copy=...) -> npt.NDArray[np.int32]: ...
+def ensure_int64(arr: object, copy=...) -> npt.NDArray[np.int64]: ...
+def ensure_uint8(arr: object, copy=...) -> npt.NDArray[np.uint8]: ...
+def ensure_uint16(arr: object, copy=...) -> npt.NDArray[np.uint16]: ...
+def ensure_uint32(arr: object, copy=...) -> npt.NDArray[np.uint32]: ...
+def ensure_uint64(arr: object, copy=...) -> npt.NDArray[np.uint64]: ...
def take_1d_int8_int8(
- values: np.ndarray, indexer: np.ndarray, out: np.ndarray, fill_value=...
+ values: np.ndarray, indexer: npt.NDArray[np.intp], out: np.ndarray, fill_value=...
) -> None: ...
def take_1d_int8_int32(
- values: np.ndarray, indexer: np.ndarray, out: np.ndarray, fill_value=...
+ values: np.ndarray, indexer: npt.NDArray[np.intp], out: np.ndarray, fill_value=...
) -> None: ...
def take_1d_int8_int64(
- values: np.ndarray, indexer: np.ndarray, out: np.ndarray, fill_value=...
+ values: np.ndarray, indexer: npt.NDArray[np.intp], out: np.ndarray, fill_value=...
) -> None: ...
def take_1d_int8_float64(
- values: np.ndarray, indexer: np.ndarray, out: np.ndarray, fill_value=...
+ values: np.ndarray, indexer: npt.NDArray[np.intp], out: np.ndarray, fill_value=...
) -> None: ...
def take_1d_int16_int16(
- values: np.ndarray, indexer: np.ndarray, out: np.ndarray, fill_value=...
+ values: np.ndarray, indexer: npt.NDArray[np.intp], out: np.ndarray, fill_value=...
) -> None: ...
def take_1d_int16_int32(
- values: np.ndarray, indexer: np.ndarray, out: np.ndarray, fill_value=...
+ values: np.ndarray, indexer: npt.NDArray[np.intp], out: np.ndarray, fill_value=...
) -> None: ...
def take_1d_int16_int64(
- values: np.ndarray, indexer: np.ndarray, out: np.ndarray, fill_value=...
+ values: np.ndarray, indexer: npt.NDArray[np.intp], out: np.ndarray, fill_value=...
) -> None: ...
def take_1d_int16_float64(
- values: np.ndarray, indexer: np.ndarray, out: np.ndarray, fill_value=...
+ values: np.ndarray, indexer: npt.NDArray[np.intp], out: np.ndarray, fill_value=...
) -> None: ...
def take_1d_int32_int32(
- values: np.ndarray, indexer: np.ndarray, out: np.ndarray, fill_value=...
+ values: np.ndarray, indexer: npt.NDArray[np.intp], out: np.ndarray, fill_value=...
) -> None: ...
def take_1d_int32_int64(
- values: np.ndarray, indexer: np.ndarray, out: np.ndarray, fill_value=...
+ values: np.ndarray, indexer: npt.NDArray[np.intp], out: np.ndarray, fill_value=...
) -> None: ...
def take_1d_int32_float64(
- values: np.ndarray, indexer: np.ndarray, out: np.ndarray, fill_value=...
+ values: np.ndarray, indexer: npt.NDArray[np.intp], out: np.ndarray, fill_value=...
) -> None: ...
def take_1d_int64_int64(
- values: np.ndarray, indexer: np.ndarray, out: np.ndarray, fill_value=...
+ values: np.ndarray, indexer: npt.NDArray[np.intp], out: np.ndarray, fill_value=...
) -> None: ...
def take_1d_int64_float64(
- values: np.ndarray, indexer: np.ndarray, out: np.ndarray, fill_value=...
+ values: np.ndarray, indexer: npt.NDArray[np.intp], out: np.ndarray, fill_value=...
) -> None: ...
def take_1d_float32_float32(
- values: np.ndarray, indexer: np.ndarray, out: np.ndarray, fill_value=...
+ values: np.ndarray, indexer: npt.NDArray[np.intp], out: np.ndarray, fill_value=...
) -> None: ...
def take_1d_float32_float64(
- values: np.ndarray, indexer: np.ndarray, out: np.ndarray, fill_value=...
+ values: np.ndarray, indexer: npt.NDArray[np.intp], out: np.ndarray, fill_value=...
) -> None: ...
def take_1d_float64_float64(
- values: np.ndarray, indexer: np.ndarray, out: np.ndarray, fill_value=...
+ values: np.ndarray, indexer: npt.NDArray[np.intp], out: np.ndarray, fill_value=...
) -> None: ...
def take_1d_object_object(
- values: np.ndarray, indexer: np.ndarray, out: np.ndarray, fill_value=...
+ values: np.ndarray, indexer: npt.NDArray[np.intp], out: np.ndarray, fill_value=...
) -> None: ...
def take_1d_bool_bool(
- values: np.ndarray, indexer: np.ndarray, out: np.ndarray, fill_value=...
+ values: np.ndarray, indexer: npt.NDArray[np.intp], out: np.ndarray, fill_value=...
) -> None: ...
def take_1d_bool_object(
- values: np.ndarray, indexer: np.ndarray, out: np.ndarray, fill_value=...
+ values: np.ndarray, indexer: npt.NDArray[np.intp], out: np.ndarray, fill_value=...
) -> None: ...
def take_2d_axis0_int8_int8(
- values: np.ndarray, indexer: np.ndarray, out: np.ndarray, fill_value=...
+ values: np.ndarray, indexer: npt.NDArray[np.intp], out: np.ndarray, fill_value=...
) -> None: ...
def take_2d_axis0_int8_int32(
- values: np.ndarray, indexer: np.ndarray, out: np.ndarray, fill_value=...
+ values: np.ndarray, indexer: npt.NDArray[np.intp], out: np.ndarray, fill_value=...
) -> None: ...
def take_2d_axis0_int8_int64(
- values: np.ndarray, indexer: np.ndarray, out: np.ndarray, fill_value=...
+ values: np.ndarray, indexer: npt.NDArray[np.intp], out: np.ndarray, fill_value=...
) -> None: ...
def take_2d_axis0_int8_float64(
- values: np.ndarray, indexer: np.ndarray, out: np.ndarray, fill_value=...
+ values: np.ndarray, indexer: npt.NDArray[np.intp], out: np.ndarray, fill_value=...
) -> None: ...
def take_2d_axis0_int16_int16(
- values: np.ndarray, indexer: np.ndarray, out: np.ndarray, fill_value=...
+ values: np.ndarray, indexer: npt.NDArray[np.intp], out: np.ndarray, fill_value=...
) -> None: ...
def take_2d_axis0_int16_int32(
- values: np.ndarray, indexer: np.ndarray, out: np.ndarray, fill_value=...
+ values: np.ndarray, indexer: npt.NDArray[np.intp], out: np.ndarray, fill_value=...
) -> None: ...
def take_2d_axis0_int16_int64(
- values: np.ndarray, indexer: np.ndarray, out: np.ndarray, fill_value=...
+ values: np.ndarray, indexer: npt.NDArray[np.intp], out: np.ndarray, fill_value=...
) -> None: ...
def take_2d_axis0_int16_float64(
- values: np.ndarray, indexer: np.ndarray, out: np.ndarray, fill_value=...
+ values: np.ndarray, indexer: npt.NDArray[np.intp], out: np.ndarray, fill_value=...
) -> None: ...
def take_2d_axis0_int32_int32(
- values: np.ndarray, indexer: np.ndarray, out: np.ndarray, fill_value=...
+ values: np.ndarray, indexer: npt.NDArray[np.intp], out: np.ndarray, fill_value=...
) -> None: ...
def take_2d_axis0_int32_int64(
- values: np.ndarray, indexer: np.ndarray, out: np.ndarray, fill_value=...
+ values: np.ndarray, indexer: npt.NDArray[np.intp], out: np.ndarray, fill_value=...
) -> None: ...
def take_2d_axis0_int32_float64(
- values: np.ndarray, indexer: np.ndarray, out: np.ndarray, fill_value=...
+ values: np.ndarray, indexer: npt.NDArray[np.intp], out: np.ndarray, fill_value=...
) -> None: ...
def take_2d_axis0_int64_int64(
- values: np.ndarray, indexer: np.ndarray, out: np.ndarray, fill_value=...
+ values: np.ndarray, indexer: npt.NDArray[np.intp], out: np.ndarray, fill_value=...
) -> None: ...
def take_2d_axis0_int64_float64(
- values: np.ndarray, indexer: np.ndarray, out: np.ndarray, fill_value=...
+ values: np.ndarray, indexer: npt.NDArray[np.intp], out: np.ndarray, fill_value=...
) -> None: ...
def take_2d_axis0_float32_float32(
- values: np.ndarray, indexer: np.ndarray, out: np.ndarray, fill_value=...
+ values: np.ndarray, indexer: npt.NDArray[np.intp], out: np.ndarray, fill_value=...
) -> None: ...
def take_2d_axis0_float32_float64(
- values: np.ndarray, indexer: np.ndarray, out: np.ndarray, fill_value=...
+ values: np.ndarray, indexer: npt.NDArray[np.intp], out: np.ndarray, fill_value=...
) -> None: ...
def take_2d_axis0_float64_float64(
- values: np.ndarray, indexer: np.ndarray, out: np.ndarray, fill_value=...
+ values: np.ndarray, indexer: npt.NDArray[np.intp], out: np.ndarray, fill_value=...
) -> None: ...
def take_2d_axis0_object_object(
- values: np.ndarray, indexer: np.ndarray, out: np.ndarray, fill_value=...
+ values: np.ndarray, indexer: npt.NDArray[np.intp], out: np.ndarray, fill_value=...
) -> None: ...
def take_2d_axis0_bool_bool(
- values: np.ndarray, indexer: np.ndarray, out: np.ndarray, fill_value=...
+ values: np.ndarray, indexer: npt.NDArray[np.intp], out: np.ndarray, fill_value=...
) -> None: ...
def take_2d_axis0_bool_object(
- values: np.ndarray, indexer: np.ndarray, out: np.ndarray, fill_value=...
+ values: np.ndarray, indexer: npt.NDArray[np.intp], out: np.ndarray, fill_value=...
) -> None: ...
def take_2d_axis1_int8_int8(
- values: np.ndarray, indexer: np.ndarray, out: np.ndarray, fill_value=...
+ values: np.ndarray, indexer: npt.NDArray[np.intp], out: np.ndarray, fill_value=...
) -> None: ...
def take_2d_axis1_int8_int32(
- values: np.ndarray, indexer: np.ndarray, out: np.ndarray, fill_value=...
+ values: np.ndarray, indexer: npt.NDArray[np.intp], out: np.ndarray, fill_value=...
) -> None: ...
def take_2d_axis1_int8_int64(
- values: np.ndarray, indexer: np.ndarray, out: np.ndarray, fill_value=...
+ values: np.ndarray, indexer: npt.NDArray[np.intp], out: np.ndarray, fill_value=...
) -> None: ...
def take_2d_axis1_int8_float64(
- values: np.ndarray, indexer: np.ndarray, out: np.ndarray, fill_value=...
+ values: np.ndarray, indexer: npt.NDArray[np.intp], out: np.ndarray, fill_value=...
) -> None: ...
def take_2d_axis1_int16_int16(
- values: np.ndarray, indexer: np.ndarray, out: np.ndarray, fill_value=...
+ values: np.ndarray, indexer: npt.NDArray[np.intp], out: np.ndarray, fill_value=...
) -> None: ...
def take_2d_axis1_int16_int32(
- values: np.ndarray, indexer: np.ndarray, out: np.ndarray, fill_value=...
+ values: np.ndarray, indexer: npt.NDArray[np.intp], out: np.ndarray, fill_value=...
) -> None: ...
def take_2d_axis1_int16_int64(
- values: np.ndarray, indexer: np.ndarray, out: np.ndarray, fill_value=...
+ values: np.ndarray, indexer: npt.NDArray[np.intp], out: np.ndarray, fill_value=...
) -> None: ...
def take_2d_axis1_int16_float64(
- values: np.ndarray, indexer: np.ndarray, out: np.ndarray, fill_value=...
+ values: np.ndarray, indexer: npt.NDArray[np.intp], out: np.ndarray, fill_value=...
) -> None: ...
def take_2d_axis1_int32_int32(
- values: np.ndarray, indexer: np.ndarray, out: np.ndarray, fill_value=...
+ values: np.ndarray, indexer: npt.NDArray[np.intp], out: np.ndarray, fill_value=...
) -> None: ...
def take_2d_axis1_int32_int64(
- values: np.ndarray, indexer: np.ndarray, out: np.ndarray, fill_value=...
+ values: np.ndarray, indexer: npt.NDArray[np.intp], out: np.ndarray, fill_value=...
) -> None: ...
def take_2d_axis1_int32_float64(
- values: np.ndarray, indexer: np.ndarray, out: np.ndarray, fill_value=...
+ values: np.ndarray, indexer: npt.NDArray[np.intp], out: np.ndarray, fill_value=...
) -> None: ...
def take_2d_axis1_int64_int64(
- values: np.ndarray, indexer: np.ndarray, out: np.ndarray, fill_value=...
+ values: np.ndarray, indexer: npt.NDArray[np.intp], out: np.ndarray, fill_value=...
) -> None: ...
def take_2d_axis1_int64_float64(
- values: np.ndarray, indexer: np.ndarray, out: np.ndarray, fill_value=...
+ values: np.ndarray, indexer: npt.NDArray[np.intp], out: np.ndarray, fill_value=...
) -> None: ...
def take_2d_axis1_float32_float32(
- values: np.ndarray, indexer: np.ndarray, out: np.ndarray, fill_value=...
+ values: np.ndarray, indexer: npt.NDArray[np.intp], out: np.ndarray, fill_value=...
) -> None: ...
def take_2d_axis1_float32_float64(
- values: np.ndarray, indexer: np.ndarray, out: np.ndarray, fill_value=...
+ values: np.ndarray, indexer: npt.NDArray[np.intp], out: np.ndarray, fill_value=...
) -> None: ...
def take_2d_axis1_float64_float64(
- values: np.ndarray, indexer: np.ndarray, out: np.ndarray, fill_value=...
+ values: np.ndarray, indexer: npt.NDArray[np.intp], out: np.ndarray, fill_value=...
) -> None: ...
def take_2d_axis1_object_object(
- values: np.ndarray, indexer: np.ndarray, out: np.ndarray, fill_value=...
+ values: np.ndarray, indexer: npt.NDArray[np.intp], out: np.ndarray, fill_value=...
) -> None: ...
def take_2d_axis1_bool_bool(
- values: np.ndarray, indexer: np.ndarray, out: np.ndarray, fill_value=...
+ values: np.ndarray, indexer: npt.NDArray[np.intp], out: np.ndarray, fill_value=...
) -> None: ...
def take_2d_axis1_bool_object(
- values: np.ndarray, indexer: np.ndarray, out: np.ndarray, fill_value=...
+ values: np.ndarray, indexer: npt.NDArray[np.intp], out: np.ndarray, fill_value=...
) -> None: ...
def take_2d_multi_int8_int8(
- values: np.ndarray, indexer, out: np.ndarray, fill_value=...
+ values: np.ndarray,
+ indexer: tuple[npt.NDArray[np.intp], npt.NDArray[np.intp]],
+ out: np.ndarray,
+ fill_value=...,
) -> None: ...
def take_2d_multi_int8_int32(
- values: np.ndarray, indexer, out: np.ndarray, fill_value=...
+ values: np.ndarray,
+ indexer: tuple[npt.NDArray[np.intp], npt.NDArray[np.intp]],
+ out: np.ndarray,
+ fill_value=...,
) -> None: ...
def take_2d_multi_int8_int64(
- values: np.ndarray, indexer, out: np.ndarray, fill_value=...
+ values: np.ndarray,
+ indexer: tuple[npt.NDArray[np.intp], npt.NDArray[np.intp]],
+ out: np.ndarray,
+ fill_value=...,
) -> None: ...
def take_2d_multi_int8_float64(
- values: np.ndarray, indexer, out: np.ndarray, fill_value=...
+ values: np.ndarray,
+ indexer: tuple[npt.NDArray[np.intp], npt.NDArray[np.intp]],
+ out: np.ndarray,
+ fill_value=...,
) -> None: ...
def take_2d_multi_int16_int16(
- values: np.ndarray, indexer, out: np.ndarray, fill_value=...
+ values: np.ndarray,
+ indexer: tuple[npt.NDArray[np.intp], npt.NDArray[np.intp]],
+ out: np.ndarray,
+ fill_value=...,
) -> None: ...
def take_2d_multi_int16_int32(
- values: np.ndarray, indexer, out: np.ndarray, fill_value=...
+ values: np.ndarray,
+ indexer: tuple[npt.NDArray[np.intp], npt.NDArray[np.intp]],
+ out: np.ndarray,
+ fill_value=...,
) -> None: ...
def take_2d_multi_int16_int64(
- values: np.ndarray, indexer, out: np.ndarray, fill_value=...
+ values: np.ndarray,
+ indexer: tuple[npt.NDArray[np.intp], npt.NDArray[np.intp]],
+ out: np.ndarray,
+ fill_value=...,
) -> None: ...
def take_2d_multi_int16_float64(
- values: np.ndarray, indexer, out: np.ndarray, fill_value=...
+ values: np.ndarray,
+ indexer: tuple[npt.NDArray[np.intp], npt.NDArray[np.intp]],
+ out: np.ndarray,
+ fill_value=...,
) -> None: ...
def take_2d_multi_int32_int32(
- values: np.ndarray, indexer, out: np.ndarray, fill_value=...
+ values: np.ndarray,
+ indexer: tuple[npt.NDArray[np.intp], npt.NDArray[np.intp]],
+ out: np.ndarray,
+ fill_value=...,
) -> None: ...
def take_2d_multi_int32_int64(
- values: np.ndarray, indexer, out: np.ndarray, fill_value=...
+ values: np.ndarray,
+ indexer: tuple[npt.NDArray[np.intp], npt.NDArray[np.intp]],
+ out: np.ndarray,
+ fill_value=...,
) -> None: ...
def take_2d_multi_int32_float64(
- values: np.ndarray, indexer, out: np.ndarray, fill_value=...
+ values: np.ndarray,
+ indexer: tuple[npt.NDArray[np.intp], npt.NDArray[np.intp]],
+ out: np.ndarray,
+ fill_value=...,
) -> None: ...
def take_2d_multi_int64_float64(
- values: np.ndarray, indexer, out: np.ndarray, fill_value=...
+ values: np.ndarray,
+ indexer: tuple[npt.NDArray[np.intp], npt.NDArray[np.intp]],
+ out: np.ndarray,
+ fill_value=...,
) -> None: ...
def take_2d_multi_float32_float32(
- values: np.ndarray, indexer, out: np.ndarray, fill_value=...
+ values: np.ndarray,
+ indexer: tuple[npt.NDArray[np.intp], npt.NDArray[np.intp]],
+ out: np.ndarray,
+ fill_value=...,
) -> None: ...
def take_2d_multi_float32_float64(
- values: np.ndarray, indexer, out: np.ndarray, fill_value=...
+ values: np.ndarray,
+ indexer: tuple[npt.NDArray[np.intp], npt.NDArray[np.intp]],
+ out: np.ndarray,
+ fill_value=...,
) -> None: ...
def take_2d_multi_float64_float64(
- values: np.ndarray, indexer, out: np.ndarray, fill_value=...
+ values: np.ndarray,
+ indexer: tuple[npt.NDArray[np.intp], npt.NDArray[np.intp]],
+ out: np.ndarray,
+ fill_value=...,
) -> None: ...
def take_2d_multi_object_object(
- values: np.ndarray, indexer, out: np.ndarray, fill_value=...
+ values: np.ndarray,
+ indexer: tuple[npt.NDArray[np.intp], npt.NDArray[np.intp]],
+ out: np.ndarray,
+ fill_value=...,
) -> None: ...
def take_2d_multi_bool_bool(
- values: np.ndarray, indexer, out: np.ndarray, fill_value=...
+ values: np.ndarray,
+ indexer: tuple[npt.NDArray[np.intp], npt.NDArray[np.intp]],
+ out: np.ndarray,
+ fill_value=...,
) -> None: ...
def take_2d_multi_bool_object(
- values: np.ndarray, indexer, out: np.ndarray, fill_value=...
+ values: np.ndarray,
+ indexer: tuple[npt.NDArray[np.intp], npt.NDArray[np.intp]],
+ out: np.ndarray,
+ fill_value=...,
) -> None: ...
def take_2d_multi_int64_int64(
- values: np.ndarray, indexer, out: np.ndarray, fill_value=...
+ values: np.ndarray,
+ indexer: tuple[npt.NDArray[np.intp], npt.NDArray[np.intp]],
+ out: np.ndarray,
+ fill_value=...,
) -> None: ...
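
The change repeated throughout this stub, `np.ndarray` → `npt.NDArray[np.intp]` for indexers, bakes the element dtype into the annotation so a checker can flag wrongly-typed indices at call sites. A self-contained illustration using `numpy.typing` directly (which is what `pandas._typing.npt` aliases):

```python
import numpy as np
import numpy.typing as npt

def take(
    values: npt.NDArray[np.float64], indexer: npt.NDArray[np.intp]
) -> npt.NDArray[np.float64]:
    # The annotations promise float64 values and platform-int indices.
    return values[indexer]

out = take(np.array([1.0, 2.0, 3.0]), np.array([2, 0], dtype=np.intp))
```
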
diff --git a/pandas/_libs/algos.pyx b/pandas/_libs/algos.pyx
index 03f4ce273de6e..3d099a53163bc 100644
--- a/pandas/_libs/algos.pyx
+++ b/pandas/_libs/algos.pyx
@@ -15,6 +15,8 @@ import numpy as np
cimport numpy as cnp
from numpy cimport (
+ NPY_COMPLEX64,
+ NPY_COMPLEX128,
NPY_FLOAT32,
NPY_FLOAT64,
NPY_INT8,
@@ -43,6 +45,11 @@ from numpy cimport (
cnp.import_array()
cimport pandas._libs.util as util
+from pandas._libs.dtypes cimport (
+ iu_64_floating_obj_t,
+ numeric_object_t,
+ numeric_t,
+)
from pandas._libs.khash cimport (
kh_destroy_int64,
kh_get_int64,
@@ -52,10 +59,7 @@ from pandas._libs.khash cimport (
kh_resize_int64,
khiter_t,
)
-from pandas._libs.util cimport (
- get_nat,
- numeric,
-)
+from pandas._libs.util cimport get_nat
import pandas._libs.missing as missing
@@ -64,13 +68,6 @@ cdef:
float64_t NaN = np.NaN
int64_t NPY_NAT = get_nat()
-cdef enum TiebreakEnumType:
- TIEBREAK_AVERAGE
- TIEBREAK_MIN,
- TIEBREAK_MAX
- TIEBREAK_FIRST
- TIEBREAK_FIRST_DESCENDING
- TIEBREAK_DENSE
tiebreakers = {
"average": TIEBREAK_AVERAGE,
@@ -122,7 +119,7 @@ cpdef ndarray[int64_t, ndim=1] unique_deltas(const int64_t[:] arr):
Parameters
----------
- arr : ndarray[in64_t]
+ arr : ndarray[int64_t]
Returns
-------
@@ -217,8 +214,8 @@ def groupsort_indexer(const intp_t[:] index, Py_ssize_t ngroups):
This is a reverse of the label factorization process.
"""
cdef:
- Py_ssize_t i, loc, label, n
- ndarray[intp_t] indexer, where, counts
+ Py_ssize_t i, label, n
+ intp_t[::1] indexer, where, counts
counts = np.zeros(ngroups + 1, dtype=np.intp)
n = len(index)
@@ -241,12 +238,12 @@ def groupsort_indexer(const intp_t[:] index, Py_ssize_t ngroups):
indexer[where[label]] = i
where[label] += 1
- return indexer, counts
+ return indexer.base, counts.base
-cdef inline Py_ssize_t swap(numeric *a, numeric *b) nogil:
+cdef inline Py_ssize_t swap(numeric_t *a, numeric_t *b) nogil:
cdef:
- numeric t
+ numeric_t t
# cython doesn't allow pointer dereference so use array syntax
t = a[0]
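
On the `groupsort_indexer` hunk above: a Cython typed memoryview such as `intp_t[::1]` retains a reference to the object it wraps, and its `.base` attribute returns that object, so `return indexer.base, counts.base` still hands plain ndarrays to callers. The nearest pure-Python analogue (CPython spells the back-reference `memoryview.obj` rather than `.base`):

```python
import numpy as np

arr = np.zeros(4, dtype=np.intp)
view = memoryview(arr)  # lightweight view over the array's buffer
assert view.obj is arr  # the wrapped ndarray is recoverable from the view
```
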
@@ -255,41 +252,47 @@ cdef inline Py_ssize_t swap(numeric *a, numeric *b) nogil:
return 0
-cdef inline numeric kth_smallest_c(numeric* arr, Py_ssize_t k, Py_ssize_t n) nogil:
+cdef inline numeric_t kth_smallest_c(numeric_t* arr, Py_ssize_t k, Py_ssize_t n) nogil:
"""
See kth_smallest.__doc__. The additional parameter n specifies the maximum
number of elements considered in arr, needed for compatibility with usage
in groupby.pyx
"""
cdef:
- Py_ssize_t i, j, l, m
- numeric x
+ Py_ssize_t i, j, left, m
+ numeric_t x
- l = 0
+ left = 0
m = n - 1
- while l < m:
+ while left < m:
x = arr[k]
- i = l
+ i = left
j = m
while 1:
- while arr[i] < x: i += 1
- while x < arr[j]: j -= 1
+ while arr[i] < x:
+ i += 1
+ while x < arr[j]:
+ j -= 1
if i <= j:
swap(&arr[i], &arr[j])
- i += 1; j -= 1
+ i += 1
+ j -= 1
- if i > j: break
+ if i > j:
+ break
- if j < k: l = i
- if k < i: m = j
+ if j < k:
+ left = i
+ if k < i:
+ m = j
return arr[k]
@cython.boundscheck(False)
@cython.wraparound(False)
-def kth_smallest(numeric[::1] arr, Py_ssize_t k) -> numeric:
+def kth_smallest(numeric_t[::1] arr, Py_ssize_t k) -> numeric_t:
"""
Compute the kth smallest value in arr. Note that the input
array will be modified.
@@ -307,7 +310,7 @@ def kth_smallest(numeric[::1] arr, Py_ssize_t k) -> numeric:
The kth smallest value in arr
"""
cdef:
- numeric result
+ numeric_t result
with nogil:
result = kth_smallest_c(&arr[0], k, arr.shape[0])
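
The restyled `kth_smallest_c` is a textbook in-place quickselect. Restated in pure Python, mirroring the Cython control flow (with `left` and `m` bracketing the still-unresolved partition):

```python
def kth_smallest(arr: list, k: int):
    # In-place quickselect: repeatedly partition around arr[k]
    # until position k holds its final, sorted-order value.
    left, m = 0, len(arr) - 1
    while left < m:
        x = arr[k]
        i, j = left, m
        while True:
            while arr[i] < x:
                i += 1
            while x < arr[j]:
                j -= 1
            if i <= j:
                arr[i], arr[j] = arr[j], arr[i]
                i += 1
                j -= 1
            if i > j:
                break
        if j < k:
            left = i
        if k < i:
            m = j
    return arr[k]


# e.g. kth_smallest([5, 1, 4, 2, 3], 2) == 3
```
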
@@ -325,11 +328,14 @@ def nancorr(const float64_t[:, :] mat, bint cov=False, minp=None):
cdef:
Py_ssize_t i, j, xi, yi, N, K
bint minpv
- ndarray[float64_t, ndim=2] result
+ float64_t[:, ::1] result
+        # Initialize to None; these are only used when there are no missing values
+ float64_t[::1] means=None, ssqds=None
ndarray[uint8_t, ndim=2] mask
+ bint no_nans
int64_t nobs = 0
- float64_t vx, vy, meanx, meany, divisor, prev_meany, prev_meanx, ssqdmx
- float64_t ssqdmy, covxy
+ float64_t mean, ssqd, val
+ float64_t vx, vy, dx, dy, meanx, meany, divisor, ssqdmx, ssqdmy, covxy
N, K = (