diff --git a/.circleci/config.yml b/.circleci/config.yml new file mode 100644 index 0000000000000..dc357101e79fd --- /dev/null +++ b/.circleci/config.yml @@ -0,0 +1,21 @@ +version: 2.1 + +jobs: + test-arm: + machine: + image: ubuntu-2004:202101-01 + resource_class: arm.medium + environment: + ENV_FILE: ci/deps/circle-38-arm64.yaml + PYTEST_WORKERS: auto + PATTERN: "not slow and not network and not clipboard and not arm_slow" + PYTEST_TARGET: "pandas" + steps: + - checkout + - run: ci/setup_env.sh + - run: PATH=$HOME/miniconda3/envs/pandas-dev/bin:$HOME/miniconda3/condabin:$PATH ci/run_tests.sh + +workflows: + test: + jobs: + - test-arm diff --git a/.github/CONTRIBUTING.md b/.github/CONTRIBUTING.md index 49200523df40f..d27eab5b9c95c 100644 --- a/.github/CONTRIBUTING.md +++ b/.github/CONTRIBUTING.md @@ -1,23 +1,3 @@ # Contributing to pandas -Whether you are a novice or experienced software developer, all contributions and suggestions are welcome! - -Our main contributing guide can be found [in this repo](https://github.com/pandas-dev/pandas/blob/master/doc/source/development/contributing.rst) or [on the website](https://pandas.pydata.org/docs/dev/development/contributing.html). If you do not want to read it in its entirety, we will summarize the main ways in which you can contribute and point to relevant sections of that document for further information. - -## Getting Started - -If you are looking to contribute to the *pandas* codebase, the best place to start is the [GitHub "issues" tab](https://github.com/pandas-dev/pandas/issues). This is also a great place for filing bug reports and making suggestions for ways in which we can improve the code and documentation. - -If you have additional questions, feel free to ask them on the [mailing list](https://groups.google.com/forum/?fromgroups#!forum/pydata) or on [Gitter](https://gitter.im/pydata/pandas). Further information can also be found in the "[Where to start?](https://github.com/pandas-dev/pandas/blob/master/doc/source/development/contributing.rst#where-to-start)" section. - -## Filing Issues - -If you notice a bug in the code or documentation, or have suggestions for how we can improve either, feel free to create an issue on the [GitHub "issues" tab](https://github.com/pandas-dev/pandas/issues) using [GitHub's "issue" form](https://github.com/pandas-dev/pandas/issues/new). The form contains some questions that will help us best address your issue. For more information regarding how to file issues against *pandas*, please refer to the "[Bug reports and enhancement requests](https://github.com/pandas-dev/pandas/blob/master/doc/source/development/contributing.rst#bug-reports-and-enhancement-requests)" section. - -## Contributing to the Codebase - -The code is hosted on [GitHub](https://www.github.com/pandas-dev/pandas), so you will need to use [Git](https://git-scm.com/) to clone the project and make changes to the codebase. 
Once you have obtained a copy of the code, you should create a development environment that is separate from your existing Python environment so that you can make and test changes without compromising your own work environment. For more information, please refer to the "[Working with the code](https://github.com/pandas-dev/pandas/blob/master/doc/source/development/contributing.rst#working-with-the-code)" section. - -Before submitting your changes for review, make sure to check that your changes do not break any tests. You can find more information about our test suites in the "[Test-driven development/code writing](https://github.com/pandas-dev/pandas/blob/master/doc/source/development/contributing.rst#test-driven-development-code-writing)" section. We also have guidelines regarding coding style that will be enforced during testing, which can be found in the "[Code standards](https://github.com/pandas-dev/pandas/blob/master/doc/source/development/contributing.rst#code-standards)" section. - -Once your changes are ready to be submitted, make sure to push your changes to GitHub before creating a pull request. Details about how to do that can be found in the "[Contributing your changes to pandas](https://github.com/pandas-dev/pandas/blob/master/doc/source/development/contributing.rst#contributing-your-changes-to-pandas)" section. We will review your changes, and you will most likely be asked to make additional changes before it is finally ready to merge. However, once it's ready, we will merge it, and you will have successfully contributed to the codebase! +A detailed overview on how to contribute can be found in the **[contributing guide](https://pandas.pydata.org/docs/dev/development/contributing.html)**. diff --git a/.github/ISSUE_TEMPLATE/bug_report.md b/.github/ISSUE_TEMPLATE/bug_report.md deleted file mode 100644 index 765c1b8bff62e..0000000000000 --- a/.github/ISSUE_TEMPLATE/bug_report.md +++ /dev/null @@ -1,39 +0,0 @@ ---- - -name: Bug Report -about: Create a bug report to help us improve pandas -title: "BUG:" -labels: "Bug, Needs Triage" - ---- - -- [ ] I have checked that this issue has not already been reported. - -- [ ] I have confirmed this bug exists on the latest version of pandas. - -- [ ] (optional) I have confirmed this bug exists on the master branch of pandas. - ---- - -**Note**: Please read [this guide](https://matthewrocklin.com/blog/work/2018/02/28/minimal-bug-reports) detailing how to provide the necessary information for us to reproduce your bug. - -#### Code Sample, a copy-pastable example - -```python -# Your code here - -``` - -#### Problem description - -[this should explain **why** the current behaviour is a problem and why the expected output is a better solution] - -#### Expected Output - -#### Output of ``pd.show_versions()`` - -
-<details> - -[paste the output of ``pd.show_versions()`` here leaving a blank line after the details tag] - -</details>
diff --git a/.github/ISSUE_TEMPLATE/bug_report.yaml b/.github/ISSUE_TEMPLATE/bug_report.yaml new file mode 100644 index 0000000000000..36bc8dcf02bae --- /dev/null +++ b/.github/ISSUE_TEMPLATE/bug_report.yaml @@ -0,0 +1,68 @@ +name: Bug Report +description: Report incorrect behavior in the pandas library +title: "BUG: " +labels: [Bug, Needs Triage] + +body: + - type: checkboxes + id: checks + attributes: + label: Pandas version checks + options: + - label: > + I have checked that this issue has not already been reported. + required: true + - label: > + I have confirmed this bug exists on the + [latest version](https://pandas.pydata.org/docs/whatsnew/index.html) of pandas. + required: true + - label: > + I have confirmed this bug exists on the main branch of pandas. + - type: textarea + id: example + attributes: + label: Reproducible Example + description: > + Please follow [this guide](https://matthewrocklin.com/blog/work/2018/02/28/minimal-bug-reports) on how to + provide a minimal, copy-pastable example. + placeholder: > + import pandas as pd + + df = pd.DataFrame(range(5)) + + ... + render: python + validations: + required: true + - type: textarea + id: problem + attributes: + label: Issue Description + description: > + Please provide a description of the issue shown in the reproducible example. + validations: + required: true + - type: textarea + id: expected-behavior + attributes: + label: Expected Behavior + description: > + Please describe or show a code example of the expected behavior. + validations: + required: true + - type: textarea + id: version + attributes: + label: Installed Versions + description: > + Please paste the output of ``pd.show_versions()`` + value: > +
+ <details> + + + Replace this line with the output of pd.show_versions() + + + </details>
+ validations: + required: true diff --git a/.github/ISSUE_TEMPLATE/documentation_improvement.md b/.github/ISSUE_TEMPLATE/documentation_improvement.md deleted file mode 100644 index 3351ff9581121..0000000000000 --- a/.github/ISSUE_TEMPLATE/documentation_improvement.md +++ /dev/null @@ -1,22 +0,0 @@ ---- - -name: Documentation Improvement -about: Report wrong or missing documentation -title: "DOC:" -labels: "Docs, Needs Triage" - ---- - -#### Location of the documentation - -[this should provide the location of the documentation, e.g. "pandas.read_csv" or the URL of the documentation, e.g. "https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html"] - -**Note**: You can check the latest versions of the docs on `master` [here](https://pandas.pydata.org/docs/dev/). - -#### Documentation problem - -[this should provide a description of what documentation you believe needs to be fixed/improved] - -#### Suggested fix for documentation - -[this should explain the suggested fix and **why** it's better than the existing documentation] diff --git a/.github/ISSUE_TEMPLATE/documentation_improvement.yaml b/.github/ISSUE_TEMPLATE/documentation_improvement.yaml new file mode 100644 index 0000000000000..b89600f8598e7 --- /dev/null +++ b/.github/ISSUE_TEMPLATE/documentation_improvement.yaml @@ -0,0 +1,41 @@ +name: Documentation Improvement +description: Report wrong or missing documentation +title: "DOC: " +labels: [Docs, Needs Triage] + +body: + - type: checkboxes + attributes: + label: Pandas version checks + options: + - label: > + I have checked that the issue still exists on the latest versions of the docs + on `main` [here](https://pandas.pydata.org/docs/dev/) + required: true + - type: textarea + id: location + attributes: + label: Location of the documentation + description: > + Please provide the location of the documentation, e.g. "pandas.read_csv" or the + URL of the documentation, e.g. + "https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html" + placeholder: https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html + validations: + required: true + - type: textarea + id: problem + attributes: + label: Documentation problem + description: > + Please provide a description of what documentation you believe needs to be fixed/improved + validations: + required: true + - type: textarea + id: suggested-fix + attributes: + label: Suggested fix for documentation + description: > + Please explain the suggested fix and **why** it's better than the existing documentation + validations: + required: true diff --git a/.github/ISSUE_TEMPLATE/installation_issue.yaml b/.github/ISSUE_TEMPLATE/installation_issue.yaml new file mode 100644 index 0000000000000..a80269ff0f12d --- /dev/null +++ b/.github/ISSUE_TEMPLATE/installation_issue.yaml @@ -0,0 +1,66 @@ +name: Installation Issue +description: Report issues installing the pandas library on your system +title: "BUILD: " +labels: [Build, Needs Triage] + +body: + - type: checkboxes + id: checks + attributes: + label: Installation check + options: + - label: > + I have read the [installation guide](https://pandas.pydata.org/pandas-docs/stable/getting_started/install.html#installing-pandas). 
+ required: true + - type: input + id: platform + attributes: + label: Platform + description: > + Please provide the output of ``import platform; print(platform.platform())`` + validations: + required: true + - type: dropdown + id: method + attributes: + label: Installation Method + description: > + Please provide how you tried to install pandas from a clean environment. + options: + - pip install + - conda install + - apt-get install + - Built from source + - Other + validations: + required: true + - type: input + id: pandas + attributes: + label: pandas Version + description: > + Please provide the version of pandas you are trying to install. + validations: + required: true + - type: input + id: python + attributes: + label: Python Version + description: > + Please provide the installed version of Python. + validations: + required: true + - type: textarea + id: logs + attributes: + label: Installation Logs + description: > + If possible, please copy and paste the installation logs when attempting to install pandas. + value: > +
+ <details> + + + Replace this line with the installation logs. + + + </details>
diff --git a/.github/ISSUE_TEMPLATE/performance_issue.yaml b/.github/ISSUE_TEMPLATE/performance_issue.yaml new file mode 100644 index 0000000000000..096e012f4ee0f --- /dev/null +++ b/.github/ISSUE_TEMPLATE/performance_issue.yaml @@ -0,0 +1,53 @@ +name: Performance Issue +description: Report slow performance or memory issues when running pandas code +title: "PERF: " +labels: [Performance, Needs Triage] + +body: + - type: checkboxes + id: checks + attributes: + label: Pandas version checks + options: + - label: > + I have checked that this issue has not already been reported. + required: true + - label: > + I have confirmed this issue exists on the + [latest version](https://pandas.pydata.org/docs/whatsnew/index.html) of pandas. + required: true + - label: > + I have confirmed this issue exists on the main branch of pandas. + - type: textarea + id: example + attributes: + label: Reproducible Example + description: > + Please provide a minimal, copy-pastable example that quantifies + [slow runtime](https://docs.python.org/3/library/timeit.html) or + [memory](https://pypi.org/project/memory-profiler/) issues. + validations: + required: true + - type: textarea + id: version + attributes: + label: Installed Versions + description: > + Please paste the output of ``pd.show_versions()`` + value: > +
+ <details> + + + Replace this line with the output of pd.show_versions() + + + </details>
+ validations: + required: true + - type: textarea + id: prior-performance + attributes: + label: Prior Performance + description: > + If applicable, please provide the prior version of pandas and output + of the same reproducible example where the performance issue did not exist. diff --git a/.github/ISSUE_TEMPLATE/submit_question.md b/.github/ISSUE_TEMPLATE/submit_question.md deleted file mode 100644 index 9b48918ff2f6d..0000000000000 --- a/.github/ISSUE_TEMPLATE/submit_question.md +++ /dev/null @@ -1,24 +0,0 @@ ---- - -name: Submit Question -about: Ask a general question about pandas -title: "QST:" -labels: "Usage Question, Needs Triage" - ---- - -- [ ] I have searched the [[pandas] tag](https://stackoverflow.com/questions/tagged/pandas) on StackOverflow for similar questions. - -- [ ] I have asked my usage related question on [StackOverflow](https://stackoverflow.com). - ---- - -#### Question about pandas - -**Note**: If you'd still like to submit a question, please read [this guide]( -https://matthewrocklin.com/blog/work/2018/02/28/minimal-bug-reports) detailing how to provide the necessary information for us to reproduce your question. - -```python -# Your code here, if applicable - -``` diff --git a/.github/ISSUE_TEMPLATE/submit_question.yml b/.github/ISSUE_TEMPLATE/submit_question.yml new file mode 100644 index 0000000000000..6f73041b0f527 --- /dev/null +++ b/.github/ISSUE_TEMPLATE/submit_question.yml @@ -0,0 +1,44 @@ +name: Submit Question +description: Ask a general question about pandas +title: "QST: " +labels: [Usage Question, Needs Triage] + +body: + - type: markdown + attributes: + value: > + Since [StackOverflow](https://stackoverflow.com) is better suited towards answering + usage questions, we ask that all usage questions are first asked on StackOverflow. + - type: checkboxes + attributes: + label: Research + options: + - label: > + I have searched the [[pandas] tag](https://stackoverflow.com/questions/tagged/pandas) + on StackOverflow for similar questions. + required: true + - label: > + I have asked my usage related question on [StackOverflow](https://stackoverflow.com). + required: true + - type: input + id: question-link + attributes: + label: Link to question on StackOverflow + validations: + required: true + - type: markdown + attributes: + value: --- + - type: textarea + id: question + attributes: + label: Question about pandas + description: > + **Note**: If you'd still like to submit a question, please read [this guide]( + https://matthewrocklin.com/blog/work/2018/02/28/minimal-bug-reports) detailing + how to provide the necessary information for us to reproduce your question. 
+ placeholder: | + ```python + # Your code here, if applicable + + ``` diff --git a/.github/PULL_REQUEST_TEMPLATE.md b/.github/PULL_REQUEST_TEMPLATE.md index 7fb5a6ddf2024..42017db8a05b1 100644 --- a/.github/PULL_REQUEST_TEMPLATE.md +++ b/.github/PULL_REQUEST_TEMPLATE.md @@ -1,4 +1,4 @@ - [ ] closes #xxxx - [ ] tests added / passed -- [ ] Ensure all linting tests pass, see [here](https://pandas.pydata.org/pandas-docs/dev/development/contributing.html#code-standards) for how to run them +- [ ] Ensure all linting tests pass, see [here](https://pandas.pydata.org/pandas-docs/dev/development/contributing_codebase.html#pre-commit) for how to run them - [ ] whatsnew entry diff --git a/.github/actions/build_pandas/action.yml b/.github/actions/build_pandas/action.yml index d4777bcd1d079..2e4bfea165316 100644 --- a/.github/actions/build_pandas/action.yml +++ b/.github/actions/build_pandas/action.yml @@ -13,5 +13,5 @@ runs: - name: Build Pandas run: | python setup.py build_ext -j 2 - python -m pip install -e . --no-build-isolation --no-use-pep517 + python -m pip install -e . --no-build-isolation --no-use-pep517 --no-index shell: bash -l {0} diff --git a/.github/workflows/asv-bot.yml b/.github/workflows/asv-bot.yml new file mode 100644 index 0000000000000..f3946aeb84a63 --- /dev/null +++ b/.github/workflows/asv-bot.yml @@ -0,0 +1,81 @@ +name: "ASV Bot" + +on: + issue_comment: # Pull requests are issues + types: + - created + +env: + ENV_FILE: environment.yml + COMMENT: ${{github.event.comment.body}} + +jobs: + autotune: + name: "Run benchmarks" + # TODO: Support more benchmarking options later, against different branches, against self, etc + if: startsWith(github.event.comment.body, '@github-actions benchmark') + runs-on: ubuntu-latest + defaults: + run: + shell: bash -l {0} + + concurrency: + # Set concurrency to prevent abuse(full runs are ~5.5 hours !!!) 
+ # each user can only run one concurrent benchmark bot at a time + # We don't cancel in-progress jobs, but if you want to benchmark multiple PRs, you're going to have + # to wait + group: ${{ github.actor }}-asv + cancel-in-progress: false + + steps: + - name: Checkout + uses: actions/checkout@v2 + with: + fetch-depth: 0 + + - name: Cache conda + uses: actions/cache@v2 + with: + path: ~/conda_pkgs_dir + key: ${{ runner.os }}-conda-${{ hashFiles('${{ env.ENV_FILE }}') }} + + # Although asv sets up its own env, deps are still needed + # during the discovery process + - uses: conda-incubator/setup-miniconda@v2 + with: + activate-environment: pandas-dev + channel-priority: strict + environment-file: ${{ env.ENV_FILE }} + use-only-tar-bz2: true + + - name: Run benchmarks + id: bench + continue-on-error: true # This is a fake failure, asv will exit with code 1 for regressions + run: | + # extracting the regex, see https://stackoverflow.com/a/36798723 + REGEX=$(echo "$COMMENT" | sed -n "s/^.*-b\s*\(\S*\).*$/\1/p") + cd asv_bench + asv check -E existing + git remote add upstream https://github.com/pandas-dev/pandas.git + git fetch upstream + asv machine --yes + asv continuous -f 1.1 -b $REGEX upstream/main HEAD + echo 'BENCH_OUTPUT<<EOF' >> $GITHUB_ENV + asv compare -f 1.1 upstream/main HEAD >> $GITHUB_ENV + echo 'EOF' >> $GITHUB_ENV + echo "REGEX=$REGEX" >> $GITHUB_ENV + + - uses: actions/github-script@v5 + env: + BENCH_OUTPUT: ${{env.BENCH_OUTPUT}} + REGEX: ${{env.REGEX}} + with: + script: | + const ENV_VARS = process.env + const run_url = `https://github.com/${context.repo.owner}/${context.repo.repo}/actions/runs/${context.runId}` + github.rest.issues.createComment({ + issue_number: context.issue.number, + owner: context.repo.owner, + repo: context.repo.repo, + body: '\nBenchmarks completed. View runner logs here.' + run_url + '\nRegex used: '+ 'regex ' + ENV_VARS["REGEX"] + '\n' + ENV_VARS["BENCH_OUTPUT"] + }) diff --git a/.github/workflows/autoupdate-pre-commit-config.yml b/.github/workflows/autoupdate-pre-commit-config.yml index 801e063f72726..3696cba8cf2e6 100644 --- a/.github/workflows/autoupdate-pre-commit-config.yml +++ b/.github/workflows/autoupdate-pre-commit-config.yml @@ -2,7 +2,7 @@ name: "Update pre-commit config" on: schedule: - - cron: "0 7 * * 1" # At 07:00 on each Monday. + - cron: "0 7 1 * *" # At 07:00 on the 1st of every month. 
workflow_dispatch: jobs: diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml deleted file mode 100644 index a5a802c678e20..0000000000000 --- a/.github/workflows/ci.yml +++ /dev/null @@ -1,171 +0,0 @@ -name: CI - -on: - push: - branches: [master] - pull_request: - branches: - - master - - 1.2.x - -env: - ENV_FILE: environment.yml - PANDAS_CI: 1 - -jobs: - checks: - name: Checks - runs-on: ubuntu-latest - defaults: - run: - shell: bash -l {0} - - steps: - - name: Checkout - uses: actions/checkout@v2 - with: - fetch-depth: 0 - - - name: Looking for unwanted patterns - run: ci/code_checks.sh patterns - if: always() - - - name: Cache conda - uses: actions/cache@v2 - with: - path: ~/conda_pkgs_dir - key: ${{ runner.os }}-conda-${{ hashFiles('${{ env.ENV_FILE }}') }} - - - uses: conda-incubator/setup-miniconda@v2 - with: - activate-environment: pandas-dev - channel-priority: strict - environment-file: ${{ env.ENV_FILE }} - use-only-tar-bz2: true - - - name: Build Pandas - uses: ./.github/actions/build_pandas - - - name: Linting - run: ci/code_checks.sh lint - if: always() - - - name: Checks on imported code - run: ci/code_checks.sh code - if: always() - - - name: Running doctests - run: ci/code_checks.sh doctests - if: always() - - - name: Docstring validation - run: ci/code_checks.sh docstrings - if: always() - - - name: Typing validation - run: ci/code_checks.sh typing - if: always() - - - name: Testing docstring validation script - run: pytest scripts - if: always() - - - name: Running benchmarks - run: | - cd asv_bench - asv check -E existing - git remote add upstream https://github.com/pandas-dev/pandas.git - git fetch upstream - asv machine --yes - asv dev | sed "/failed$/ s/^/##[error]/" | tee benchmarks.log - if grep "failed" benchmarks.log > /dev/null ; then - exit 1 - fi - if: always() - - - name: Publish benchmarks artifact - uses: actions/upload-artifact@master - with: - name: Benchmarks log - path: asv_bench/benchmarks.log - if: failure() - - web_and_docs: - name: Web and docs - runs-on: ubuntu-latest - steps: - - - name: Checkout - uses: actions/checkout@v2 - with: - fetch-depth: 0 - - - name: Set up pandas - uses: ./.github/actions/setup - - - name: Build website - run: | - source activate pandas-dev - python web/pandas_web.py web/pandas --target-path=web/build - - name: Build documentation - run: | - source activate pandas-dev - doc/make.py --warnings-are-errors | tee sphinx.log ; exit ${PIPESTATUS[0]} - - # This can be removed when the ipython directive fails when there are errors, - # including the `tee sphinx.log` in te previous step (https://github.com/ipython/ipython/issues/11547) - - name: Check ipython directive errors - run: "! 
grep -B10 \"^<<<-------------------------------------------------------------------------$\" sphinx.log" - - - name: Install ssh key - run: | - mkdir -m 700 -p ~/.ssh - echo "${{ secrets.server_ssh_key }}" > ~/.ssh/id_rsa - chmod 600 ~/.ssh/id_rsa - echo "${{ secrets.server_ip }} ecdsa-sha2-nistp256 AAAAE2VjZHNhLXNoYTItbmlzdHAyNTYAAAAIbmlzdHAyNTYAAABBBE1Kkopomm7FHG5enATf7SgnpICZ4W2bw+Ho+afqin+w7sMcrsa0je7sbztFAV8YchDkiBKnWTG4cRT+KZgZCaY=" > ~/.ssh/known_hosts - if: github.event_name == 'push' - - - name: Upload web - run: rsync -az --delete --exclude='pandas-docs' --exclude='docs' --exclude='Pandas_Cheat_Sheet*' web/build/ docs@${{ secrets.server_ip }}:/usr/share/nginx/pandas - if: github.event_name == 'push' - - - name: Upload dev docs - run: rsync -az --delete doc/build/html/ docs@${{ secrets.server_ip }}:/usr/share/nginx/pandas/pandas-docs/dev - if: github.event_name == 'push' - - - name: Move docs into site directory - run: mv doc/build/html web/build/docs - - name: Save website as an artifact - uses: actions/upload-artifact@v2 - with: - name: website - path: web/build - retention-days: 14 - - data_manager: - name: Test experimental data manager - runs-on: ubuntu-latest - strategy: - matrix: - pattern: ["not slow and not network and not clipboard", "slow"] - steps: - - - name: Checkout - uses: actions/checkout@v2 - with: - fetch-depth: 0 - - - name: Set up pandas - uses: ./.github/actions/setup - - - name: Run tests - env: - PANDAS_DATA_MANAGER: array - PATTERN: ${{ matrix.pattern }} - PYTEST_WORKERS: "auto" - run: | - source activate pandas-dev - ci/run_tests.sh - - - name: Print skipped tests - run: python ci/print_skipped.py diff --git a/.github/workflows/code-checks.yml b/.github/workflows/code-checks.yml new file mode 100644 index 0000000000000..7141b02cac376 --- /dev/null +++ b/.github/workflows/code-checks.yml @@ -0,0 +1,158 @@ +name: Code Checks + +on: + push: + branches: + - main + - 1.4.x + pull_request: + branches: + - main + - 1.4.x + +env: + ENV_FILE: environment.yml + PANDAS_CI: 1 + +jobs: + pre_commit: + name: pre-commit + runs-on: ubuntu-latest + concurrency: + # https://github.community/t/concurrecy-not-work-for-push/183068/7 + group: ${{ github.event_name == 'push' && github.run_number || github.ref }}-pre-commit + cancel-in-progress: true + steps: + - name: Checkout + uses: actions/checkout@v2 + + - name: Install Python + uses: actions/setup-python@v2 + with: + python-version: '3.9.7' + + - name: Run pre-commit + uses: pre-commit/action@v2.0.3 + + typing_and_docstring_validation: + name: Docstring and typing validation + runs-on: ubuntu-latest + defaults: + run: + shell: bash -l {0} + + concurrency: + # https://github.community/t/concurrecy-not-work-for-push/183068/7 + group: ${{ github.event_name == 'push' && github.run_number || github.ref }}-code-checks + cancel-in-progress: true + + steps: + - name: Checkout + uses: actions/checkout@v2 + with: + fetch-depth: 0 + + - name: Cache conda + uses: actions/cache@v2 + with: + path: ~/conda_pkgs_dir + key: ${{ runner.os }}-conda-${{ hashFiles('${{ env.ENV_FILE }}') }} + + - uses: conda-incubator/setup-miniconda@v2 + with: + mamba-version: "*" + channels: conda-forge + activate-environment: pandas-dev + channel-priority: strict + environment-file: ${{ env.ENV_FILE }} + use-only-tar-bz2: true + + - name: Install node.js (for pyright) + uses: actions/setup-node@v2 + with: + node-version: "16" + + - name: Install pyright + # 
note: keep version in sync with .pre-commit-config.yaml + run: npm install -g pyright@1.1.202 + + - name: Build Pandas + id: build + uses: ./.github/actions/build_pandas + + - name: Run checks on imported code + run: ci/code_checks.sh code + if: ${{ steps.build.outcome == 'success' }} + + - name: Run doctests + run: ci/code_checks.sh doctests + if: ${{ steps.build.outcome == 'success' }} + + - name: Run docstring validation + run: ci/code_checks.sh docstrings + if: ${{ steps.build.outcome == 'success' }} + + - name: Run typing validation + run: ci/code_checks.sh typing + if: ${{ steps.build.outcome == 'success' }} + + - name: Run docstring validation script tests + run: pytest scripts + if: ${{ steps.build.outcome == 'success' }} + + asv-benchmarks: + name: ASV Benchmarks + runs-on: ubuntu-latest + defaults: + run: + shell: bash -l {0} + + concurrency: + # https://github.community/t/concurrecy-not-work-for-push/183068/7 + group: ${{ github.event_name == 'push' && github.run_number || github.ref }}-asv-benchmarks + cancel-in-progress: true + + steps: + - name: Checkout + uses: actions/checkout@v2 + with: + fetch-depth: 0 + + - name: Cache conda + uses: actions/cache@v2 + with: + path: ~/conda_pkgs_dir + key: ${{ runner.os }}-conda-${{ hashFiles('${{ env.ENV_FILE }}') }} + + - uses: conda-incubator/setup-miniconda@v2 + with: + mamba-version: "*" + channels: conda-forge + activate-environment: pandas-dev + channel-priority: strict + environment-file: ${{ env.ENV_FILE }} + use-only-tar-bz2: true + + - name: Build Pandas + id: build + uses: ./.github/actions/build_pandas + + - name: Run ASV benchmarks + run: | + cd asv_bench + asv check -E existing + git remote add upstream https://github.com/pandas-dev/pandas.git + git fetch upstream + asv machine --yes + asv dev | sed "/failed$/ s/^/##[error]/" | tee benchmarks.log + if grep "failed" benchmarks.log > /dev/null ; then + exit 1 + fi + if: ${{ steps.build.outcome == 'success' }} + + - name: Publish benchmarks artifact + uses: actions/upload-artifact@v2 + with: + name: Benchmarks log + path: asv_bench/benchmarks.log + if: failure() diff --git a/.github/workflows/comment_bot.yml b/.github/workflows/comment_bot.yml index dc396be753269..8f610fd5781ef 100644 --- a/.github/workflows/comment_bot.yml +++ b/.github/workflows/comment_bot.yml @@ -13,7 +13,7 @@ jobs: runs-on: ubuntu-latest steps: - uses: actions/checkout@v2 - - uses: r-lib/actions/pr-fetch@master + - uses: r-lib/actions/pr-fetch@v2 with: repo-token: ${{ secrets.GITHUB_TOKEN }} - name: Cache multiple paths @@ -29,12 +29,12 @@ jobs: - name: Install-pre-commit run: python -m pip install --upgrade pre-commit - name: Run pre-commit - run: pre-commit run --from-ref=origin/master --to-ref=HEAD --all-files || (exit 0) + run: pre-commit run --from-ref=origin/main --to-ref=HEAD --all-files || (exit 0) - name: Commit results run: | git config user.name "$(git log -1 --pretty=format:%an)" git config user.email "$(git log -1 --pretty=format:%ae)" git commit -a -m 'Fixes from pre-commit [automated commit]' || echo "No changes to commit" - - uses: r-lib/actions/pr-push@master + - uses: r-lib/actions/pr-push@v2 with: repo-token: ${{ secrets.GITHUB_TOKEN }} diff --git a/.github/workflows/database.yml b/.github/workflows/database.yml deleted file mode 100644 index 292598dfcab73..0000000000000 --- a/.github/workflows/database.yml +++ /dev/null @@ -1,106 +0,0 @@ -name: Database - -on: - push: - branches: [master] 
- pull_request: - branches: - - master - - 1.2.x - paths-ignore: - - "doc/**" - -env: - PYTEST_WORKERS: "auto" - PANDAS_CI: 1 - PATTERN: ((not slow and not network and not clipboard) or (single and db)) - COVERAGE: true - -jobs: - Linux_py37_IO: - runs-on: ubuntu-latest - defaults: - run: - shell: bash -l {0} - - strategy: - matrix: - ENV_FILE: [ci/deps/actions-37-db-min.yaml, ci/deps/actions-37-db.yaml] - fail-fast: false - - services: - mysql: - image: mysql - env: - MYSQL_ALLOW_EMPTY_PASSWORD: yes - MYSQL_DATABASE: pandas - options: >- - --health-cmd "mysqladmin ping" - --health-interval 10s - --health-timeout 5s - --health-retries 5 - ports: - - 3306:3306 - - postgres: - image: postgres - env: - POSTGRES_USER: postgres - POSTGRES_PASSWORD: postgres - POSTGRES_DB: pandas - options: >- - --health-cmd pg_isready - --health-interval 10s - --health-timeout 5s - --health-retries 5 - ports: - - 5432:5432 - - steps: - - name: Checkout - uses: actions/checkout@v2 - with: - fetch-depth: 0 - - - name: Cache conda - uses: actions/cache@v2 - env: - CACHE_NUMBER: 0 - with: - path: ~/conda_pkgs_dir - key: ${{ runner.os }}-conda-${{ env.CACHE_NUMBER }}-${{ - hashFiles('${{ matrix.ENV_FILE }}') }} - - - uses: conda-incubator/setup-miniconda@v2 - with: - activate-environment: pandas-dev - channel-priority: flexible - environment-file: ${{ matrix.ENV_FILE }} - use-only-tar-bz2: true - - - name: Build Pandas - uses: ./.github/actions/build_pandas - - - name: Test - run: pytest -m "${{ env.PATTERN }}" -n 2 --dist=loadfile --cov=pandas --cov-report=xml pandas/tests/io - if: always() - - - name: Build Version - run: pushd /tmp && python -c "import pandas; pandas.show_versions();" && popd - - - name: Publish test results - uses: actions/upload-artifact@master - with: - name: Test results - path: test-data.xml - if: failure() - - - name: Print skipped tests - run: python ci/print_skipped.py - - - name: Upload coverage to Codecov - uses: codecov/codecov-action@v1 - with: - flags: unittests - name: codecov-pandas - fail_ci_if_error: true diff --git a/.github/workflows/datamanger.yml b/.github/workflows/datamanger.yml new file mode 100644 index 0000000000000..3fc515883a225 --- /dev/null +++ b/.github/workflows/datamanger.yml @@ -0,0 +1,57 @@ +name: Data Manager + +on: + push: + branches: + - main + - 1.4.x + pull_request: + branches: + - main + - 1.4.x + +env: + ENV_FILE: environment.yml + PANDAS_CI: 1 + +jobs: + data_manager: + name: Test experimental data manager + runs-on: ubuntu-latest + services: + moto: + image: motoserver/moto + env: + AWS_ACCESS_KEY_ID: foobar_key + AWS_SECRET_ACCESS_KEY: foobar_secret + ports: + - 5000:5000 + strategy: + matrix: + pattern: ["not slow and not network and not clipboard", "slow"] + concurrency: + # https://github.community/t/concurrecy-not-work-for-push/183068/7 + group: ${{ github.event_name == 'push' && github.run_number || github.ref }}-data_manager-${{ matrix.pattern }} + cancel-in-progress: true + + steps: + - name: Checkout + uses: actions/checkout@v2 + with: + fetch-depth: 0 + + - name: Set up pandas + uses: ./.github/actions/setup + + - name: Run tests + env: + PANDAS_DATA_MANAGER: array + PATTERN: ${{ matrix.pattern }} + PYTEST_WORKERS: "auto" + PYTEST_TARGET: pandas + run: | + source activate pandas-dev + ci/run_tests.sh + + - name: Print skipped tests + run: python ci/print_skipped.py diff --git a/.github/workflows/docbuild-and-upload.yml b/.github/workflows/docbuild-and-upload.yml new file mode 100644 index 
0000000000000..e8ed6d4545194 --- /dev/null +++ b/.github/workflows/docbuild-and-upload.yml @@ -0,0 +1,77 @@ +name: Doc Build and Upload + +on: + push: + branches: + - main + - 1.4.x + pull_request: + branches: + - main + - 1.4.x + +env: + ENV_FILE: environment.yml + PANDAS_CI: 1 + +jobs: + web_and_docs: + name: Doc Build and Upload + runs-on: ubuntu-latest + + concurrency: + # https://github.community/t/concurrecy-not-work-for-push/183068/7 + group: ${{ github.event_name == 'push' && github.run_number || github.ref }}-web-docs + cancel-in-progress: true + + steps: + - name: Checkout + uses: actions/checkout@v2 + with: + fetch-depth: 0 + + - name: Set up pandas + uses: ./.github/actions/setup + + - name: Build website + run: | + source activate pandas-dev + python web/pandas_web.py web/pandas --target-path=web/build + - name: Build documentation + run: | + source activate pandas-dev + doc/make.py --warnings-are-errors | tee sphinx.log ; exit ${PIPESTATUS[0]} + + # This can be removed when the ipython directive fails when there are errors, + # including the `tee sphinx.log` in te previous step (https://github.com/ipython/ipython/issues/11547) + - name: Check ipython directive errors + run: "! grep -B10 \"^<<<-------------------------------------------------------------------------$\" sphinx.log" + + - name: Install ssh key + run: | + mkdir -m 700 -p ~/.ssh + echo "${{ secrets.server_ssh_key }}" > ~/.ssh/id_rsa + chmod 600 ~/.ssh/id_rsa + echo "${{ secrets.server_ip }} ecdsa-sha2-nistp256 AAAAE2VjZHNhLXNoYTItbmlzdHAyNTYAAAAIbmlzdHAyNTYAAABBBE1Kkopomm7FHG5enATf7SgnpICZ4W2bw+Ho+afqin+w7sMcrsa0je7sbztFAV8YchDkiBKnWTG4cRT+KZgZCaY=" > ~/.ssh/known_hosts + if: ${{github.event_name == 'push' && github.ref == 'refs/heads/main'}} + + - name: Copy cheatsheets into site directory + run: cp doc/cheatsheet/Pandas_Cheat_Sheet* web/build/ + + - name: Upload web + run: rsync -az --delete --exclude='pandas-docs' --exclude='docs' web/build/ docs@${{ secrets.server_ip }}:/usr/share/nginx/pandas + if: ${{github.event_name == 'push' && github.ref == 'refs/heads/main'}} + + - name: Upload dev docs + run: rsync -az --delete doc/build/html/ docs@${{ secrets.server_ip }}:/usr/share/nginx/pandas/pandas-docs/dev + if: ${{github.event_name == 'push' && github.ref == 'refs/heads/main'}} + + - name: Move docs into site directory + run: mv doc/build/html web/build/docs + + - name: Save website as an artifact + uses: actions/upload-artifact@v2 + with: + name: website + path: web/build + retention-days: 14 diff --git a/.github/workflows/posix.yml b/.github/workflows/posix.yml index cb7d3fb5cabcf..135ca0703de8b 100644 --- a/.github/workflows/posix.yml +++ b/.github/workflows/posix.yml @@ -2,11 +2,13 @@ name: Posix on: push: - branches: [master] + branches: + - main + - 1.4.x pull_request: branches: - - master - - 1.2.x + - main + - 1.4.x paths-ignore: - "doc/**" @@ -23,19 +25,22 @@ jobs: strategy: matrix: settings: [ - [actions-37-minimum_versions.yaml, "not slow and not network and not clipboard", "", "", "", "", ""], - [actions-37.yaml, "not slow and not network and not clipboard", "", "", "", "", ""], - [actions-37-locale_slow.yaml, "slow", "language-pack-it xsel", "it_IT.utf8", "it_IT.utf8", "", ""], - [actions-37-slow.yaml, "slow", "", "", "", "", ""], - [actions-38.yaml, "not slow and not network and not clipboard", "", "", "", "", ""], - [actions-38-slow.yaml, "slow", "", "", "", "", ""], - [actions-38-locale.yaml, "not 
slow and not network", "language-pack-zh-hans xsel", "zh_CN.utf8", "zh_CN.utf8", "", ""], - [actions-38-numpydev.yaml, "not slow and not network", "xsel", "", "", "deprecate", "-W error"], - [actions-39.yaml, "not slow and not network and not clipboard", "", "", "", "", ""] + [actions-38-downstream_compat.yaml, "not slow and not network and not clipboard", "", "", "", "", ""], + [actions-38-minimum_versions.yaml, "slow", "", "", "", "", ""], + [actions-38-minimum_versions.yaml, "not slow and not network and not clipboard", "", "", "", "", ""], + [actions-38.yaml, "not slow and not network", "language-pack-it xsel", "it_IT.utf8", "it_IT.utf8", "", ""], + [actions-38.yaml, "not slow and not network", "language-pack-zh-hans xsel", "zh_CN.utf8", "zh_CN.utf8", "", ""], + [actions-38.yaml, "not slow and not clipboard", "", "", "", "", ""], + [actions-38.yaml, "slow", "", "", "", "", ""], + [actions-pypy-38.yaml, "not slow and not clipboard", "", "", "", "", "--max-worker-restart 0"], + [actions-39.yaml, "slow", "", "", "", "", ""], + [actions-39.yaml, "not slow and not clipboard", "", "", "", "", ""], + [actions-310-numpydev.yaml, "not slow and not network", "xclip", "", "", "deprecate", "-W error"], + [actions-310.yaml, "not slow and not clipboard", "", "", "", "", ""], + [actions-310.yaml, "slow", "", "", "", "", ""], ] fail-fast: false env: - COVERAGE: true ENV_FILE: ci/deps/${{ matrix.settings[0] }} PATTERN: ${{ matrix.settings[1] }} EXTRA_APT: ${{ matrix.settings[2] }} @@ -43,6 +48,50 @@ jobs: LC_ALL: ${{ matrix.settings[4] }} PANDAS_TESTING_MODE: ${{ matrix.settings[5] }} TEST_ARGS: ${{ matrix.settings[6] }} + PYTEST_TARGET: pandas + IS_PYPY: ${{ contains(matrix.settings[0], 'pypy') }} + # TODO: re-enable coverage on pypy, its slow + COVERAGE: ${{ !contains(matrix.settings[0], 'pypy') }} + concurrency: + # https://github.community/t/concurrecy-not-work-for-push/183068/7 + group: ${{ github.event_name == 'push' && github.run_number || github.ref }}-${{ matrix.settings[0] }}-${{ matrix.settings[1] }} + cancel-in-progress: true + + services: + mysql: + image: mysql + env: + MYSQL_ALLOW_EMPTY_PASSWORD: yes + MYSQL_DATABASE: pandas + options: >- + --health-cmd "mysqladmin ping" + --health-interval 10s + --health-timeout 5s + --health-retries 5 + ports: + - 3306:3306 + + postgres: + image: postgres + env: + POSTGRES_USER: postgres + POSTGRES_PASSWORD: postgres + POSTGRES_DB: pandas + options: >- + --health-cmd pg_isready + --health-interval 10s + --health-timeout 5s + --health-retries 5 + ports: + - 5432:5432 + + moto: + image: motoserver/moto + env: + AWS_ACCESS_KEY_ID: foobar_key + AWS_SECRET_ACCESS_KEY: foobar_secret + ports: + - 5000:5000 steps: - name: Checkout @@ -64,23 +113,42 @@ jobs: - uses: conda-incubator/setup-miniconda@v2 with: + mamba-version: "*" + channels: conda-forge activate-environment: pandas-dev channel-priority: flexible environment-file: ${{ env.ENV_FILE }} use-only-tar-bz2: true + if: ${{ env.IS_PYPY == 'false' }} # No pypy3.8 support + + - name: Setup PyPy + uses: actions/setup-python@v2 + with: + python-version: "pypy-3.8" + if: ${{ env.IS_PYPY == 'true' }} + + - name: Setup PyPy dependencies + shell: bash + run: | + # TODO: re-enable cov, its slowing the tests down though + # TODO: Unpin Cython, the new Cython 0.29.26 is causing compilation errors + pip install Cython==0.29.25 numpy python-dateutil pytz pytest>=6.0 pytest-xdist>=1.31.0 hypothesis>=5.5.3 + if: ${{ env.IS_PYPY == 'true' }} - name: Build Pandas uses: 
./.github/actions/build_pandas - name: Test run: ci/run_tests.sh + # TODO: Don't continue on error for PyPy + continue-on-error: ${{ env.IS_PYPY == 'true' }} if: always() - name: Build Version run: pushd /tmp && python -c "import pandas; pandas.show_versions();" && popd - name: Publish test results - uses: actions/upload-artifact@master + uses: actions/upload-artifact@v2 with: name: Test results path: test-data.xml @@ -90,7 +158,7 @@ jobs: run: python ci/print_skipped.py - name: Upload coverage to Codecov - uses: codecov/codecov-action@v1 + uses: codecov/codecov-action@v2 with: flags: unittests name: codecov-pandas diff --git a/.github/workflows/pre-commit.yml b/.github/workflows/pre-commit.yml deleted file mode 100644 index 723347913ac38..0000000000000 --- a/.github/workflows/pre-commit.yml +++ /dev/null @@ -1,14 +0,0 @@ -name: pre-commit - -on: - pull_request: - push: - branches: [master] - -jobs: - pre-commit: - runs-on: ubuntu-latest - steps: - - uses: actions/checkout@v2 - - uses: actions/setup-python@v2 - - uses: pre-commit/action@v2.0.0 diff --git a/.github/workflows/python-dev.yml b/.github/workflows/python-dev.yml index 38b1aa9ae7047..fa1eee2db6fc3 100644 --- a/.github/workflows/python-dev.yml +++ b/.github/workflows/python-dev.yml @@ -1,20 +1,48 @@ +# This file is purposely frozen(does not run). DO NOT DELETE IT +# Unfreeze(by commentingthe if: false() condition) once the +# next Python Dev version has released beta 1 and both Cython and numpy support it +# After that Python has released, migrate the workflows to the +# posix GHA workflows/Azure pipelines and "freeze" this file by +# uncommenting the if: false() condition +# Feel free to modify this comment as necessary. + name: Python Dev on: push: branches: - - master + - main + - 1.4.x pull_request: branches: - - master + - main + - 1.4.x paths-ignore: - "doc/**" +env: + PYTEST_WORKERS: "auto" + PANDAS_CI: 1 + PATTERN: "not slow and not network and not clipboard" + COVERAGE: true + PYTEST_TARGET: pandas + jobs: build: - runs-on: ubuntu-latest - name: actions-310-dev - timeout-minutes: 60 + if: false # Comment this line out to "unfreeze" + runs-on: ${{ matrix.os }} + strategy: + fail-fast: false + matrix: + os: [ubuntu-latest, macOS-latest, windows-latest] + + name: actions-311-dev + timeout-minutes: 80 + + concurrency: + #https://github.community/t/concurrecy-not-work-for-push/183068/7 + group: ${{ github.event_name == 'push' && github.run_number || github.ref }}-${{ matrix.os }}-${{ matrix.pytest_target }}-dev + cancel-in-progress: true steps: - uses: actions/checkout@v2 @@ -24,15 +52,16 @@ jobs: - name: Set up Python Dev Version uses: actions/setup-python@v2 with: - python-version: '3.10-dev' + python-version: '3.11-dev' + # TODO: GH#44980 https://github.com/pypa/setuptools/issues/2941 - name: Install dependencies + shell: bash run: | - python -m pip install --upgrade pip setuptools wheel - pip install git+https://github.com/numpy/numpy.git - pip install git+https://github.com/pytest-dev/pytest.git + python -m pip install --upgrade pip "setuptools<60.0.0" wheel + pip install -i https://pypi.anaconda.org/scipy-wheels-nightly/simple numpy pip install git+https://github.com/nedbat/coveragepy.git - pip install cython python-dateutil pytz 
hypothesis pytest-xdist + pip install cython python-dateutil pytz hypothesis pytest>=6.2.5 pytest-xdist pytest-cov pip list - name: Build Pandas @@ -45,12 +74,12 @@ jobs: python -c "import pandas; pandas.show_versions();" - name: Test with pytest + shell: bash run: | - coverage run -m pytest -m 'not slow and not network and not clipboard' pandas - continue-on-error: true + ci/run_tests.sh - name: Publish test results - uses: actions/upload-artifact@master + uses: actions/upload-artifact@v2 with: name: Test results path: test-data.xml @@ -65,7 +94,7 @@ jobs: coverage report -m - name: Upload coverage to Codecov - uses: codecov/codecov-action@v1 + uses: codecov/codecov-action@v2 with: flags: unittests name: codecov-pandas diff --git a/.github/workflows/sdist.yml b/.github/workflows/sdist.yml new file mode 100644 index 0000000000000..dd030f1aacc44 --- /dev/null +++ b/.github/workflows/sdist.yml @@ -0,0 +1,83 @@ +name: sdist + +on: + push: + branches: + - main + - 1.4.x + pull_request: + branches: + - main + - 1.4.x + paths-ignore: + - "doc/**" + +jobs: + build: + runs-on: ubuntu-latest + timeout-minutes: 60 + defaults: + run: + shell: bash -l {0} + + strategy: + fail-fast: false + matrix: + python-version: ["3.8", "3.9", "3.10"] + concurrency: + # https://github.community/t/concurrecy-not-work-for-push/183068/7 + group: ${{ github.event_name == 'push' && github.run_number || github.ref }}-${{matrix.python-version}}-sdist + cancel-in-progress: true + + steps: + - uses: actions/checkout@v2 + with: + fetch-depth: 0 + + - name: Set up Python + uses: actions/setup-python@v2 + with: + python-version: ${{ matrix.python-version }} + + # TODO: GH#44980 https://github.com/pypa/setuptools/issues/2941 + - name: Install dependencies + run: | + python -m pip install --upgrade pip "setuptools<60.0.0" wheel + + # GH 39416 + pip install numpy + + - name: Build pandas sdist + run: | + pip list + python setup.py sdist --formats=gztar + + - uses: conda-incubator/setup-miniconda@v2 + with: + activate-environment: pandas-sdist + channels: conda-forge + python-version: '${{ matrix.python-version }}' + + # TODO: GH#44980 https://github.com/pypa/setuptools/issues/2941 + - name: Install pandas from sdist + run: | + python -m pip install --upgrade "setuptools<60.0.0" + pip list + python -m pip install dist/*.gz + + - name: Force oldest supported NumPy + run: | + case "${{matrix.python-version}}" in + 3.8) + pip install numpy==1.18.5 ;; + 3.9) + pip install numpy==1.19.3 ;; + 3.10) + pip install numpy==1.21.2 ;; + esac + + - name: Import pandas + run: | + cd .. 
+ conda list + python -c "import pandas; pandas.show_versions();" diff --git a/.gitignore b/.gitignore index 2c337be60e94e..87224f1d6060f 100644 --- a/.gitignore +++ b/.gitignore @@ -50,6 +50,8 @@ dist *.egg-info .eggs .pypirc +# type checkers +pandas/py.typed # tox testing tool .tox diff --git a/.pre-commit-config.yaml b/.pre-commit-config.yaml index d580fcf4fc545..5232b76a6388d 100644 --- a/.pre-commit-config.yaml +++ b/.pre-commit-config.yaml @@ -9,17 +9,17 @@ repos: - id: absolufy-imports files: ^pandas/ - repo: https://github.com/python/black - rev: 21.5b2 + rev: 21.12b0 hooks: - id: black - repo: https://github.com/codespell-project/codespell - rev: v2.0.0 + rev: v2.1.0 hooks: - id: codespell types_or: [python, rst, markdown] files: ^(pandas|doc)/ - repo: https://github.com/pre-commit/pre-commit-hooks - rev: v4.0.1 + rev: v4.1.0 hooks: - id: debug-statements - id: end-of-file-fixer @@ -35,34 +35,26 @@ repos: # we can lint all header files since they aren't "generated" like C files are. exclude: ^pandas/_libs/src/(klib|headers)/ args: [--quiet, '--extensions=c,h', '--headers=h', --recursive, '--filter=-readability/casting,-runtime/int,-build/include_subdir'] -- repo: https://gitlab.com/pycqa/flake8 - rev: 3.9.2 +- repo: https://github.com/PyCQA/flake8 + rev: 4.0.1 hooks: - id: flake8 - additional_dependencies: - - flake8-comprehensions==3.1.0 - - flake8-bugbear==21.3.2 - - pandas-dev-flaker==0.2.0 - - id: flake8 - name: flake8 (cython) - types: [cython] - args: [--append-config=flake8/cython.cfg] - - id: flake8 - name: flake8 (cython template) - files: \.pxi\.in$ - types: [text] - args: [--append-config=flake8/cython-template.cfg] + additional_dependencies: &flake8_dependencies + - flake8==4.0.1 + - flake8-comprehensions==3.7.0 + - flake8-bugbear==21.3.2 + - pandas-dev-flaker==0.2.0 - repo: https://github.com/PyCQA/isort - rev: 5.8.0 + rev: 5.10.1 hooks: - id: isort - repo: https://github.com/asottile/pyupgrade - rev: v2.18.3 + rev: v2.31.0 hooks: - id: pyupgrade - args: [--py37-plus] + args: [--py38-plus] - repo: https://github.com/pre-commit/pygrep-hooks - rev: v1.8.0 + rev: v1.9.0 hooks: - id: rst-backticks - id: rst-directive-colons @@ -72,14 +64,21 @@ repos: types: [text] # overwrite types: [rst] types_or: [python, rst] - repo: https://github.com/asottile/yesqa - rev: v1.2.3 + rev: v1.3.0 hooks: - id: yesqa - additional_dependencies: - - flake8==3.9.2 - - flake8-comprehensions==3.1.0 - - flake8-bugbear==21.3.2 - - pandas-dev-flaker==0.2.0 + additional_dependencies: *flake8_dependencies +- repo: local + hooks: + - id: pyright + name: pyright + entry: pyright + language: node + pass_filenames: false + types: [python] + stages: [manual] + # note: keep version in sync with .github/workflows/ci.yml + additional_dependencies: ['pyright@1.1.202'] - repo: local hooks: - id: flake8-rst @@ -102,7 +101,42 @@ repos: # Incorrect code-block / IPython directives |\.\.\ code-block\ :: |\.\.\ ipython\ :: + # directive should not have a space before :: + |\.\.\ \w+\ :: + + # Check for deprecated messages without sphinx directive + 
|(DEPRECATED|DEPRECATE|Deprecated)(:|,|\.) types_or: [python, cython, rst] + - id: cython-casting + name: Check Cython casting is `obj`, not ` obj` + language: pygrep + entry: '[a-zA-Z0-9*]> ' + files: (\.pyx|\.pxi.in)$ + - id: incorrect-backticks + name: Check for backticks incorrectly rendering because of missing spaces + language: pygrep + entry: '[a-zA-Z0-9]\`\`?[a-zA-Z0-9]' + types: [rst] + files: ^doc/source/ + - id: seed-check-asv + name: Check for unnecessary random seeds in asv benchmarks + language: pygrep + entry: 'np\.random\.seed' + files: ^asv_bench/benchmarks + exclude: ^asv_bench/benchmarks/pandas_vb_common\.py + - id: np-testing-array-equal + name: Check for usage of numpy testing or array_equal + language: pygrep + entry: '(numpy|np)(\.testing|\.array_equal)' + files: ^pandas/tests/ + types: [python] + - id: invalid-ea-testing + name: Check for invalid EA testing + language: pygrep + entry: 'tm\.assert_(series|frame)_equal' + files: ^pandas/tests/extension/base + types: [python] + exclude: ^pandas/tests/extension/base/base\.py - id: pip-to-conda name: Generate pip dependency from conda description: This hook checks if the conda environment.yml and requirements-dev.txt are equal @@ -110,7 +144,7 @@ repos: entry: python scripts/generate_pip_deps_from_conda.py files: ^(environment.yml|requirements-dev.txt)$ pass_filenames: false - additional_dependencies: [pyyaml] + additional_dependencies: [pyyaml, toml] - id: sync-flake8-versions name: Check flake8 version is synced across flake8, yesqa, and environment.yml language: python @@ -136,3 +170,19 @@ repos: entry: python scripts/no_bool_in_generic.py language: python files: ^pandas/core/generic\.py$ + - id: pandas-errors-documented + name: Ensure pandas errors are documented in doc/source/reference/general_utility_functions.rst + entry: python scripts/pandas_errors_documented.py + language: python + files: ^pandas/errors/__init__.py$ + - id: pg8000-not-installed-CI + name: Check for pg8000 not installed on CI for test_pg8000_sqlalchemy_passthrough_error + language: pygrep + entry: 'pg8000' + files: ^ci/deps + types: [yaml] + - id: validate-min-versions-in-sync + name: Check minimum version of dependencies are aligned + entry: python scripts/validate_min_versions_in_sync.py + language: python + files: ^(ci/deps/actions-.*-minimum_versions\.yaml|pandas/compat/_optional\.py)$ diff --git a/Dockerfile b/Dockerfile index de1c564921de9..8887e80566772 100644 --- a/Dockerfile +++ b/Dockerfile @@ -28,7 +28,7 @@ RUN mkdir "$pandas_home" \ && git clone "https://github.com/$gh_username/pandas.git" "$pandas_home" \ && cd "$pandas_home" \ && git remote add upstream "https://github.com/pandas-dev/pandas.git" \ - && git pull upstream master + && git pull upstream main # Because it is surprisingly difficult to activate a conda environment inside a DockerFile # (from personal experience and per https://github.com/ContinuumIO/docker-images/issues/89), diff --git a/MANIFEST.in b/MANIFEST.in index d0d93f2cdba8c..78464c9aaedc8 100644 --- a/MANIFEST.in +++ b/MANIFEST.in @@ -17,28 +17,38 @@ global-exclude *.h5 global-exclude *.html global-exclude *.json global-exclude *.jsonl +global-exclude *.msgpack global-exclude *.pdf global-exclude *.pickle global-exclude *.png global-exclude *.pptx -global-exclude *.pyc -global-exclude *.pyd global-exclude *.ods global-exclude *.odt +global-exclude *.orc global-exclude 
*.sas7bdat global-exclude *.sav global-exclude *.so global-exclude *.xls +global-exclude *.xlsb global-exclude *.xlsm global-exclude *.xlsx global-exclude *.xpt +global-exclude *.cpt global-exclude *.xz global-exclude *.zip +global-exclude *.zst global-exclude *~ global-exclude .DS_Store global-exclude .git* global-exclude \#* +global-exclude *.c +global-exclude *.cpp +global-exclude *.h + +global-exclude *.py[ocd] +global-exclude *.pxi + # GH 39321 # csv_dir_path fixture checks the existence of the directory # exclude the whole directory to avoid running related tests in sdist @@ -47,3 +57,6 @@ prune pandas/tests/io/parser/data include versioneer.py include pandas/_version.py include pandas/io/formats/templates/*.tpl + +graft pandas/_libs/src +graft pandas/_libs/tslibs/src diff --git a/Makefile b/Makefile index 1fdd3cfdcf027..c0aa685ed47ac 100644 --- a/Makefile +++ b/Makefile @@ -12,7 +12,7 @@ build: clean_pyc python setup.py build_ext lint-diff: - git diff upstream/master --name-only -- "*.py" | xargs flake8 + git diff upstream/main --name-only -- "*.py" | xargs flake8 black: black . diff --git a/README.md b/README.md index 04b346c198e90..26aed081de4af 100644 --- a/README.md +++ b/README.md @@ -9,10 +9,10 @@ [![Conda Latest Release](https://anaconda.org/conda-forge/pandas/badges/version.svg)](https://anaconda.org/anaconda/pandas/) [![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.3509134.svg)](https://doi.org/10.5281/zenodo.3509134) [![Package Status](https://img.shields.io/pypi/status/pandas.svg)](https://pypi.org/project/pandas/) -[![License](https://img.shields.io/pypi/l/pandas.svg)](https://github.com/pandas-dev/pandas/blob/master/LICENSE) -[![Azure Build Status](https://dev.azure.com/pandas-dev/pandas/_apis/build/status/pandas-dev.pandas?branch=master)](https://dev.azure.com/pandas-dev/pandas/_build/latest?definitionId=1&branch=master) -[![Coverage](https://codecov.io/github/pandas-dev/pandas/coverage.svg?branch=master)](https://codecov.io/gh/pandas-dev/pandas) -[![Downloads](https://anaconda.org/conda-forge/pandas/badges/downloads.svg)](https://pandas.pydata.org) +[![License](https://img.shields.io/pypi/l/pandas.svg)](https://github.com/pandas-dev/pandas/blob/main/LICENSE) +[![Azure Build Status](https://dev.azure.com/pandas-dev/pandas/_apis/build/status/pandas-dev.pandas?branch=main)](https://dev.azure.com/pandas-dev/pandas/_build/latest?definitionId=1&branch=main) +[![Coverage](https://codecov.io/github/pandas-dev/pandas/coverage.svg?branch=main)](https://codecov.io/gh/pandas-dev/pandas) 
+[![Downloads](https://static.pepy.tech/personalized-badge/pandas?period=month&units=international_system&left_color=black&right_color=orange&left_text=PyPI%20downloads%20per%20month)](https://pepy.tech/project/pandas) [![Gitter](https://badges.gitter.im/Join%20Chat.svg)](https://gitter.im/pydata/pandas) [![Powered by NumFOCUS](https://img.shields.io/badge/powered%20by-NumFOCUS-orange.svg?style=flat&colorA=E1523D&colorB=007D8A)](https://numfocus.org) [![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black) @@ -160,7 +160,7 @@ Most development discussions take place on GitHub in this repo. Further, the [pa All contributions, bug reports, bug fixes, documentation improvements, enhancements, and ideas are welcome. -A detailed overview on how to contribute can be found in the **[contributing guide](https://pandas.pydata.org/docs/dev/development/contributing.html)**. There is also an [overview](.github/CONTRIBUTING.md) on GitHub. +A detailed overview on how to contribute can be found in the **[contributing guide](https://pandas.pydata.org/docs/dev/development/contributing.html)**. If you are simply looking to start working with the pandas codebase, navigate to the [GitHub "issues" tab](https://github.com/pandas-dev/pandas/issues) and start looking through interesting issues. There are a number of issues listed under [Docs](https://github.com/pandas-dev/pandas/issues?labels=Docs&sort=updated&state=open) and [good first issue](https://github.com/pandas-dev/pandas/issues?labels=good+first+issue&sort=updated&state=open) where you could start out. @@ -170,4 +170,4 @@ Or maybe through using pandas you have an idea of your own or are looking for so Feel free to ask questions on the [mailing list](https://groups.google.com/forum/?fromgroups#!forum/pydata) or on [Gitter](https://gitter.im/pydata/pandas). -As contributors and maintainers to this project, you are expected to abide by pandas' code of conduct. More information can be found at: [Contributor Code of Conduct](https://github.com/pandas-dev/pandas/blob/master/.github/CODE_OF_CONDUCT.md) +As contributors and maintainers to this project, you are expected to abide by pandas' code of conduct. More information can be found at: [Contributor Code of Conduct](https://github.com/pandas-dev/pandas/blob/main/.github/CODE_OF_CONDUCT.md) diff --git a/asv_bench/asv.conf.json b/asv_bench/asv.conf.json index e8e82edabbfa3..daf2834c50d6a 100644 --- a/asv_bench/asv.conf.json +++ b/asv_bench/asv.conf.json @@ -13,6 +13,10 @@ // benchmarked "repo": "..", + // List of branches to benchmark. If not provided, defaults to "master" + // (for git) or "default" (for mercurial). 
+ "branches": ["main"], + // The tool to use to create environments. May be "conda", // "virtualenv" or other value depending on the plugins in use. // If missing or the empty string, the tool will be automatically @@ -25,7 +29,6 @@ // The Pythons you'd like to test against. If not provided, defaults // to the current version of Python used to run `asv`. - // "pythons": ["2.7", "3.4"], "pythons": ["3.8"], // The matrix of dependencies to test. Each key is the name of a @@ -39,24 +42,21 @@ // followed by the pip installed packages). "matrix": { "numpy": [], - "Cython": ["0.29.21"], + "Cython": ["0.29.24"], "matplotlib": [], "sqlalchemy": [], "scipy": [], "numba": [], "numexpr": [], "pytables": [null, ""], // platform dependent, see excludes below + "pyarrow": [], "tables": [null, ""], "openpyxl": [], "xlsxwriter": [], "xlrd": [], "xlwt": [], "odfpy": [], - "pytest": [], "jinja2": [], - // If using Windows with python 2.7 and want to build using the - // mingw toolchain (rather than MSVC), uncomment the following line. - // "libpython": [], }, "conda_channels": ["defaults", "conda-forge"], // Combinations of libraries/python versions can be excluded/included diff --git a/asv_bench/benchmarks/algorithms.py b/asv_bench/benchmarks/algorithms.py index e48a2060a3b34..2e43827232ae5 100644 --- a/asv_bench/benchmarks/algorithms.py +++ b/asv_bench/benchmarks/algorithms.py @@ -44,9 +44,9 @@ def setup(self, unique, sort, dtype): raise NotImplementedError data = { - "int": pd.Int64Index(np.arange(N)), - "uint": pd.UInt64Index(np.arange(N)), - "float": pd.Float64Index(np.random.randn(N)), + "int": pd.Index(np.arange(N), dtype="int64"), + "uint": pd.Index(np.arange(N), dtype="uint64"), + "float": pd.Index(np.random.randn(N), dtype="float64"), "object": string_index, "datetime64[ns]": pd.date_range("2011-01-01", freq="H", periods=N), "datetime64[ns, tz]": pd.date_range( @@ -76,9 +76,9 @@ class Duplicated: def setup(self, unique, keep, dtype): N = 10 ** 5 data = { - "int": pd.Int64Index(np.arange(N)), - "uint": pd.UInt64Index(np.arange(N)), - "float": pd.Float64Index(np.random.randn(N)), + "int": pd.Index(np.arange(N), dtype="int64"), + "uint": pd.Index(np.arange(N), dtype="uint64"), + "float": pd.Index(np.random.randn(N), dtype="float64"), "string": tm.makeStringIndex(N), "datetime64[ns]": pd.date_range("2011-01-01", freq="H", periods=N), "datetime64[ns, tz]": pd.date_range( diff --git a/asv_bench/benchmarks/algos/isin.py b/asv_bench/benchmarks/algos/isin.py index 296101c9f9800..37fa0b490bd9e 100644 --- a/asv_bench/benchmarks/algos/isin.py +++ b/asv_bench/benchmarks/algos/isin.py @@ -1,9 +1,8 @@ import numpy as np -from pandas.compat.numpy import np_version_under1p20 - from pandas import ( Categorical, + Index, NaT, Series, date_range, @@ -280,10 +279,6 @@ class IsInLongSeriesLookUpDominates: def setup(self, dtype, MaxNumber, series_type): N = 10 ** 7 - # https://github.com/pandas-dev/pandas/issues/39844 - if not np_version_under1p20 and dtype in ("Int64", "Float64"): - raise NotImplementedError - if series_type == "random_hits": array = np.random.randint(0, MaxNumber, N) if series_type == "random_misses": @@ -294,7 +289,8 @@ def setup(self, dtype, MaxNumber, series_type): array = np.arange(N) + MaxNumber self.series = Series(array).astype(dtype) - self.values = np.arange(MaxNumber).astype(dtype) + + self.values = np.arange(MaxNumber).astype(dtype.lower()) def time_isin(self, dtypes, MaxNumber, series_type): self.series.isin(self.values) @@ -310,18 +306,37 
@@ class IsInLongSeriesValuesDominate: def setup(self, dtype, series_type): N = 10 ** 7 - # https://github.com/pandas-dev/pandas/issues/39844 - if not np_version_under1p20 and dtype in ("Int64", "Float64"): - raise NotImplementedError - if series_type == "random": vals = np.random.randint(0, 10 * N, N) if series_type == "monotone": vals = np.arange(N) - self.values = vals.astype(dtype) + self.values = vals.astype(dtype.lower()) M = 10 ** 6 + 1 self.series = Series(np.arange(M)).astype(dtype) def time_isin(self, dtypes, series_type): self.series.isin(self.values) + + +class IsInWithLongTupples: + def setup(self): + t = tuple(range(1000)) + self.series = Series([t] * 1000) + self.values = [t] + + def time_isin(self): + self.series.isin(self.values) + + +class IsInIndexes: + def setup(self): + self.range_idx = Index(range(1000)) + self.index = Index(list(range(1000))) + self.series = Series(np.random.randint(100_000, size=1000)) + + def time_isin_range_index(self): + self.series.isin(self.range_idx) + + def time_isin_index(self): + self.series.isin(self.index) diff --git a/asv_bench/benchmarks/arithmetic.py b/asv_bench/benchmarks/arithmetic.py index bfb1be8705495..edd1132116f76 100644 --- a/asv_bench/benchmarks/arithmetic.py +++ b/asv_bench/benchmarks/arithmetic.py @@ -144,7 +144,7 @@ def setup(self, op, shape): # should already be the case, but just to be sure df._consolidate_inplace() - # TODO: GH#33198 the setting here shoudlnt need two steps + # TODO: GH#33198 the setting here shouldn't need two steps arr1 = np.random.randn(n_rows, max(n_cols // 4, 3)).astype("f8") arr2 = np.random.randn(n_rows, n_cols // 2).astype("i8") arr3 = np.random.randn(n_rows, n_cols // 4).astype("f8") diff --git a/asv_bench/benchmarks/dtypes.py b/asv_bench/benchmarks/dtypes.py index c561b80ed1ca6..55f6be848aa13 100644 --- a/asv_bench/benchmarks/dtypes.py +++ b/asv_bench/benchmarks/dtypes.py @@ -50,15 +50,26 @@ def time_pandas_dtype_invalid(self, dtype): class SelectDtypes: - params = [ - tm.ALL_INT_DTYPES - + tm.ALL_EA_INT_DTYPES - + tm.FLOAT_DTYPES - + tm.COMPLEX_DTYPES - + tm.DATETIME64_DTYPES - + tm.TIMEDELTA64_DTYPES - + tm.BOOL_DTYPES - ] + try: + params = [ + tm.ALL_INT_NUMPY_DTYPES + + tm.ALL_INT_EA_DTYPES + + tm.FLOAT_NUMPY_DTYPES + + tm.COMPLEX_DTYPES + + tm.DATETIME64_DTYPES + + tm.TIMEDELTA64_DTYPES + + tm.BOOL_DTYPES + ] + except AttributeError: + params = [ + tm.ALL_INT_DTYPES + + tm.ALL_EA_INT_DTYPES + + tm.FLOAT_DTYPES + + tm.COMPLEX_DTYPES + + tm.DATETIME64_DTYPES + + tm.TIMEDELTA64_DTYPES + + tm.BOOL_DTYPES + ] param_names = ["dtype"] def setup(self, dtype): diff --git a/asv_bench/benchmarks/frame_ctor.py b/asv_bench/benchmarks/frame_ctor.py index 7fbe249788a98..eace665ba0bac 100644 --- a/asv_bench/benchmarks/frame_ctor.py +++ b/asv_bench/benchmarks/frame_ctor.py @@ -2,6 +2,7 @@ import pandas as pd from pandas import ( + Categorical, DataFrame, MultiIndex, Series, @@ -18,7 +19,10 @@ ) except ImportError: # For compatibility with older versions - from pandas.core.datetools import * # noqa + from pandas.core.datetools import ( + Hour, + Nano, + ) class FromDicts: @@ -31,6 +35,9 @@ def setup(self): self.dict_list = frame.to_dict(orient="records") self.data2 = {i: {j: float(j) for j in range(100)} for i in range(2000)} + # arrays which we wont consolidate + self.dict_of_categoricals = {i: Categorical(np.arange(N)) for i in range(K)} + def time_list_of_dict(self): DataFrame(self.dict_list) @@ -50,6 +57,10 @@ def time_nested_dict_int64(self): # nested 
dict, integer indexes, regression described in #621 DataFrame(self.data2) + def time_dict_of_categoricals(self): + # dict of arrays that we wont consolidate + DataFrame(self.dict_of_categoricals) + class FromSeries: def setup(self): @@ -171,4 +182,21 @@ def time_frame_from_arrays_sparse(self): ) +class From3rdParty: + # GH#44616 + + def setup(self): + try: + import torch + except ImportError: + raise NotImplementedError + + row = 700000 + col = 64 + self.val_tensor = torch.randn(row, col) + + def time_from_torch(self): + DataFrame(self.val_tensor) + + from .pandas_vb_common import setup # noqa: F401 isort:skip diff --git a/asv_bench/benchmarks/frame_methods.py b/asv_bench/benchmarks/frame_methods.py index c32eda4928da7..16925b7959e6a 100644 --- a/asv_bench/benchmarks/frame_methods.py +++ b/asv_bench/benchmarks/frame_methods.py @@ -76,7 +76,7 @@ def time_reindex_axis1_missing(self): self.df.reindex(columns=self.idx) def time_reindex_both_axes(self): - self.df.reindex(index=self.idx, columns=self.idx) + self.df.reindex(index=self.idx, columns=self.idx_cols) def time_reindex_upcast(self): self.df2.reindex(np.random.permutation(range(1200))) @@ -232,6 +232,22 @@ def time_to_html_mixed(self): self.df2.to_html() +class ToDict: + params = [["dict", "list", "series", "split", "records", "index"]] + param_names = ["orient"] + + def setup(self, orient): + data = np.random.randint(0, 1000, size=(10000, 4)) + self.int_df = DataFrame(data) + self.datetimelike_df = self.int_df.astype("timedelta64[ns]") + + def time_to_dict_ints(self, orient): + self.int_df.to_dict(orient=orient) + + def time_to_dict_datetimelike(self, orient): + self.datetimelike_df.to_dict(orient=orient) + + class ToNumpy: def setup(self): N = 10000 @@ -522,8 +538,12 @@ class Interpolate: def setup(self, downcast): N = 10000 # this is the worst case, where every column has NaNs. 
- self.df = DataFrame(np.random.randn(N, 100)) - self.df.values[::2] = np.nan + arr = np.random.randn(N, 100) + # NB: we need to set values in array, not in df.values, otherwise + # the benchmark will be misleading for ArrayManager + arr[::2] = np.nan + + self.df = DataFrame(arr) self.df2 = DataFrame( { @@ -711,17 +731,6 @@ def time_dataframe_describe(self): self.df.describe() -class SelectDtypes: - params = [100, 1000] - param_names = ["n"] - - def setup(self, n): - self.df = DataFrame(np.random.randn(10, n)) - - def time_select_dtypes(self, n): - self.df.select_dtypes(include="int") - - class MemoryUsage: def setup(self): self.df = DataFrame(np.random.randn(100000, 2), columns=list("AB")) diff --git a/asv_bench/benchmarks/groupby.py b/asv_bench/benchmarks/groupby.py index 1648985a56b91..ff58e382a9ba2 100644 --- a/asv_bench/benchmarks/groupby.py +++ b/asv_bench/benchmarks/groupby.py @@ -369,6 +369,18 @@ def time_category_size(self): self.draws.groupby(self.cats).size() +class Shift: + def setup(self): + N = 18 + self.df = DataFrame({"g": ["a", "b"] * 9, "v": list(range(N))}) + + def time_defaults(self): + self.df.groupby("g").shift() + + def time_fill_value(self): + self.df.groupby("g").shift(fill_value=99) + + class FillNA: def setup(self): N = 100 @@ -391,7 +403,7 @@ def time_srs_bfill(self): class GroupByMethods: - param_names = ["dtype", "method", "application"] + param_names = ["dtype", "method", "application", "ncols"] params = [ ["int", "float", "object", "datetime", "uint"], [ @@ -431,15 +443,39 @@ class GroupByMethods: "var", ], ["direct", "transformation"], + [1, 5], ] - def setup(self, dtype, method, application): + def setup(self, dtype, method, application, ncols): if method in method_blocklist.get(dtype, {}): raise NotImplementedError # skip benchmark - ngroups = 1000 + + if ncols != 1 and method in ["value_counts", "unique"]: + # DataFrameGroupBy doesn't have these methods + raise NotImplementedError + + if application == "transformation" and method in [ + "describe", + "head", + "tail", + "unique", + "value_counts", + "size", + ]: + # DataFrameGroupBy doesn't have these methods + raise NotImplementedError + + if method == "describe": + ngroups = 20 + elif method in ["mad", "skew"]: + ngroups = 100 + else: + ngroups = 1000 size = ngroups * 2 - rng = np.arange(ngroups) - values = rng.take(np.random.randint(0, ngroups, size=size)) + rng = np.arange(ngroups).reshape(-1, 1) + rng = np.broadcast_to(rng, (len(rng), ncols)) + taker = np.random.randint(0, ngroups, size=size) + values = rng.take(taker, axis=0) if dtype == "int": key = np.random.randint(0, size, size=size) elif dtype == "uint": @@ -453,22 +489,24 @@ def setup(self, dtype, method, application): elif dtype == "datetime": key = date_range("1/1/2011", periods=size, freq="s") - df = DataFrame({"values": values, "key": key}) + cols = [f"values{n}" for n in range(ncols)] + df = DataFrame(values, columns=cols) + df["key"] = key - if application == "transform": - if method == "describe": - raise NotImplementedError + if len(cols) == 1: + cols = cols[0] - self.as_group_method = lambda: df.groupby("key")["values"].transform(method) - self.as_field_method = lambda: df.groupby("values")["key"].transform(method) + if application == "transformation": + self.as_group_method = lambda: df.groupby("key")[cols].transform(method) + self.as_field_method = lambda: df.groupby(cols)["key"].transform(method) else: - self.as_group_method = getattr(df.groupby("key")["values"], method) - self.as_field_method = 
getattr(df.groupby("values")["key"], method) + self.as_group_method = getattr(df.groupby("key")[cols], method) + self.as_field_method = getattr(df.groupby(cols)["key"], method) - def time_dtype_as_group(self, dtype, method, application): + def time_dtype_as_group(self, dtype, method, application, ncols): self.as_group_method() - def time_dtype_as_field(self, dtype, method, application): + def time_dtype_as_field(self, dtype, method, application, ncols): self.as_field_method() @@ -568,6 +606,38 @@ def time_sum(self): self.df.groupby(["a"])["b"].sum() +class String: + # GH#41596 + param_names = ["dtype", "method"] + params = [ + ["str", "string[python]"], + [ + "sum", + "prod", + "min", + "max", + "mean", + "median", + "var", + "first", + "last", + "any", + "all", + ], + ] + + def setup(self, dtype, method): + cols = list("abcdefghjkl") + self.df = DataFrame( + np.random.randint(0, 100, size=(1_000_000, len(cols))), + columns=cols, + dtype=dtype, + ) + + def time_str_func(self, dtype, method): + self.df.groupby("a")[self.df.columns[1:]].agg(method) + + class Categories: def setup(self): N = 10 ** 5 @@ -832,4 +902,18 @@ def function(values): self.grouper.agg(function, engine="cython") +class Sample: + def setup(self): + N = 10 ** 3 + self.df = DataFrame({"a": np.zeros(N)}) + self.groups = np.arange(0, N) + self.weights = np.ones(N) + + def time_sample(self): + self.df.groupby(self.groups).sample(n=1) + + def time_sample_weights(self): + self.df.groupby(self.groups).sample(n=1, weights=self.weights) + + from .pandas_vb_common import setup # noqa: F401 isort:skip diff --git a/asv_bench/benchmarks/index_object.py b/asv_bench/benchmarks/index_object.py index 9c05019c70396..2b2302a796730 100644 --- a/asv_bench/benchmarks/index_object.py +++ b/asv_bench/benchmarks/index_object.py @@ -86,6 +86,12 @@ def time_iter_dec(self): for _ in self.idx_dec: pass + def time_sort_values_asc(self): + self.idx_inc.sort_values() + + def time_sort_values_des(self): + self.idx_inc.sort_values(ascending=False) + class IndexEquals: def setup(self): diff --git a/asv_bench/benchmarks/indexing.py b/asv_bench/benchmarks/indexing.py index 10fb926ee4d03..58f2a73d82842 100644 --- a/asv_bench/benchmarks/indexing.py +++ b/asv_bench/benchmarks/indexing.py @@ -366,11 +366,20 @@ class InsertColumns: def setup(self): self.N = 10 ** 3 self.df = DataFrame(index=range(self.N)) + self.df2 = DataFrame(np.random.randn(self.N, 2)) def time_insert(self): for i in range(100): self.df.insert(0, i, np.random.randn(self.N), allow_duplicates=True) + def time_insert_middle(self): + # same as time_insert but inserting to a middle column rather than + # front or back (which have fast-paths) + for i in range(100): + self.df2.insert( + 1, "colname", np.random.randn(self.N), allow_duplicates=True + ) + def time_assign_with_setitem(self): for i in range(100): self.df[i] = np.random.randn(self.N) @@ -390,12 +399,14 @@ class ChainIndexing: def setup(self, mode): self.N = 1000000 + self.df = DataFrame({"A": np.arange(self.N), "B": "foo"}) def time_chained_indexing(self, mode): + df = self.df + N = self.N with warnings.catch_warnings(record=True): with option_context("mode.chained_assignment", mode): - df = DataFrame({"A": np.arange(self.N), "B": "foo"}) - df2 = df[df.A > self.N // 2] + df2 = df[df.A > N // 2] df2["C"] = 1.0 diff --git a/asv_bench/benchmarks/indexing_engines.py b/asv_bench/benchmarks/indexing_engines.py index 30ef7f63dc0dc..60e07a9d1469c 100644 --- a/asv_bench/benchmarks/indexing_engines.py +++ b/asv_bench/benchmarks/indexing_engines.py 
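All of the benchmark classes touched in these asv_bench files follow the same asv conventions: `params` holds one list of candidate values per name in `param_names`, asv calls `setup` once for every combination of those values, a `NotImplementedError` raised inside `setup` marks that combination as skipped, and the `time_*` / `peakmem_*` methods are what actually get measured. A minimal sketch of that pattern, with a made-up class name and a hypothetical driver standing in for what `asv run` does:

# Sketch of the asv conventions used by the benchmark classes in this diff.
# The class name and the driver loop below are illustrative only.
from itertools import product

import numpy as np
import pandas as pd


class GroupByShiftSketch:
    # one list of values per entry in param_names; asv benchmarks the cross-product
    params = [["int", "float"], [1, 5]]
    param_names = ["dtype", "ncols"]

    def setup(self, dtype, ncols):
        if dtype == "float" and ncols == 5:
            # raising NotImplementedError in setup tells asv to skip this combination
            raise NotImplementedError
        values = np.arange(1000, dtype=dtype)
        data = {f"values{i}": values for i in range(ncols)}
        data["key"] = np.random.randint(0, 10, size=1000)
        self.df = pd.DataFrame(data)

    def time_shift(self, dtype, ncols):
        # the statement being timed, analogous to the time_* methods above
        self.df.groupby("key").shift()


# Hypothetical driver that very roughly mimics what `asv run` does,
# so the sketch can be exercised without asv installed.
if __name__ == "__main__":
    bench = GroupByShiftSketch()
    for combo in product(*bench.params):
        try:
            bench.setup(*combo)
        except NotImplementedError:
            continue  # skipped combination
        bench.time_shift(*combo)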
@@ -1,5 +1,5 @@ """ -Benchmarks in this fiel depend exclusively on code in _libs/ +Benchmarks in this file depend exclusively on code in _libs/ If a PR does not edit anything in _libs, it is very unlikely that benchmarks in this file will be affected. @@ -35,25 +35,49 @@ class NumericEngineIndexing: params = [ _get_numeric_engines(), ["monotonic_incr", "monotonic_decr", "non_monotonic"], + [True, False], + [10 ** 5, 2 * 10 ** 6], # 2e6 is above SIZE_CUTOFF ] - param_names = ["engine_and_dtype", "index_type"] + param_names = ["engine_and_dtype", "index_type", "unique", "N"] - def setup(self, engine_and_dtype, index_type): + def setup(self, engine_and_dtype, index_type, unique, N): engine, dtype = engine_and_dtype - N = 10 ** 5 - values = list([1] * N + [2] * N + [3] * N) - arr = { - "monotonic_incr": np.array(values, dtype=dtype), - "monotonic_decr": np.array(list(reversed(values)), dtype=dtype), - "non_monotonic": np.array([1, 2, 3] * N, dtype=dtype), - }[index_type] - self.data = engine(lambda: arr, len(arr)) + if index_type == "monotonic_incr": + if unique: + arr = np.arange(N * 3, dtype=dtype) + else: + values = list([1] * N + [2] * N + [3] * N) + arr = np.array(values, dtype=dtype) + elif index_type == "monotonic_decr": + if unique: + arr = np.arange(N * 3, dtype=dtype)[::-1] + else: + values = list([1] * N + [2] * N + [3] * N) + arr = np.array(values, dtype=dtype)[::-1] + else: + assert index_type == "non_monotonic" + if unique: + arr = np.empty(N * 3, dtype=dtype) + arr[:N] = np.arange(N * 2, N * 3, dtype=dtype) + arr[N:] = np.arange(N * 2, dtype=dtype) + else: + arr = np.array([1, 2, 3] * N, dtype=dtype) + + self.data = engine(arr) # code belows avoids populating the mapping etc. while timing. self.data.get_loc(2) - def time_get_loc(self, engine_and_dtype, index_type): - self.data.get_loc(2) + self.key_middle = arr[len(arr) // 2] + self.key_early = arr[2] + + def time_get_loc(self, engine_and_dtype, index_type, unique, N): + self.data.get_loc(self.key_early) + + def time_get_loc_near_middle(self, engine_and_dtype, index_type, unique, N): + # searchsorted performance may be different near the middle of a range + # vs near an endpoint + self.data.get_loc(self.key_middle) class ObjectEngineIndexing: @@ -70,7 +94,7 @@ def setup(self, index_type): "non_monotonic": np.array(list("abc") * N, dtype=object), }[index_type] - self.data = libindex.ObjectEngine(lambda: arr, len(arr)) + self.data = libindex.ObjectEngine(arr) # code belows avoids populating the mapping etc. while timing. 
self.data.get_loc("b") diff --git a/asv_bench/benchmarks/inference.py b/asv_bench/benchmarks/inference.py index 0aa924dabd469..a5a7bc5b5c8bd 100644 --- a/asv_bench/benchmarks/inference.py +++ b/asv_bench/benchmarks/inference.py @@ -115,19 +115,27 @@ def time_maybe_convert_objects(self): class ToDatetimeFromIntsFloats: def setup(self): self.ts_sec = Series(range(1521080307, 1521685107), dtype="int64") + self.ts_sec_uint = Series(range(1521080307, 1521685107), dtype="uint64") self.ts_sec_float = self.ts_sec.astype("float64") self.ts_nanosec = 1_000_000 * self.ts_sec + self.ts_nanosec_uint = 1_000_000 * self.ts_sec_uint self.ts_nanosec_float = self.ts_nanosec.astype("float64") - # speed of int64 and float64 paths should be comparable + # speed of int64, uint64 and float64 paths should be comparable def time_nanosec_int64(self): to_datetime(self.ts_nanosec, unit="ns") + def time_nanosec_uint64(self): + to_datetime(self.ts_nanosec_uint, unit="ns") + def time_nanosec_float64(self): to_datetime(self.ts_nanosec_float, unit="ns") + def time_sec_uint64(self): + to_datetime(self.ts_sec_uint, unit="s") + def time_sec_int64(self): to_datetime(self.ts_sec, unit="s") @@ -165,6 +173,7 @@ def setup(self): self.strings_tz_space = [ x.strftime("%Y-%m-%d %H:%M:%S") + " -0800" for x in rng ] + self.strings_zero_tz = [x.strftime("%Y-%m-%d %H:%M:%S") + "Z" for x in rng] def time_iso8601(self): to_datetime(self.strings) @@ -181,6 +190,10 @@ def time_iso8601_format_no_sep(self): def time_iso8601_tz_spaceformat(self): to_datetime(self.strings_tz_space) + def time_iso8601_infer_zero_tz_fromat(self): + # GH 41047 + to_datetime(self.strings_zero_tz, infer_datetime_format=True) + class ToDatetimeNONISO8601: def setup(self): @@ -264,6 +277,16 @@ def time_dup_string_tzoffset_dates(self, cache): to_datetime(self.dup_string_with_tz, cache=cache) +# GH 43901 +class ToDatetimeInferDatetimeFormat: + def setup(self): + rng = date_range(start="1/1/2000", periods=100000, freq="H") + self.strings = rng.strftime("%Y-%m-%d %H:%M:%S").tolist() + + def time_infer_datetime_format(self): + to_datetime(self.strings, infer_datetime_format=True) + + class ToTimedelta: def setup(self): self.ints = np.random.randint(0, 60, size=10000) diff --git a/asv_bench/benchmarks/io/csv.py b/asv_bench/benchmarks/io/csv.py index 5ff9431fbf8e4..0b443b29116a2 100644 --- a/asv_bench/benchmarks/io/csv.py +++ b/asv_bench/benchmarks/io/csv.py @@ -10,6 +10,7 @@ from pandas import ( Categorical, DataFrame, + concat, date_range, read_csv, to_datetime, @@ -54,6 +55,26 @@ def time_frame(self, kind): self.df.to_csv(self.fname) +class ToCSVMultiIndexUnusedLevels(BaseIO): + + fname = "__test__.csv" + + def setup(self): + df = DataFrame({"a": np.random.randn(100_000), "b": 1, "c": 1}) + self.df = df.set_index(["a", "b"]) + self.df_unused_levels = self.df.iloc[:10_000] + self.df_single_index = df.set_index(["a"]).iloc[:10_000] + + def time_full_frame(self): + self.df.to_csv(self.fname) + + def time_sliced_frame(self): + self.df_unused_levels.to_csv(self.fname) + + def time_single_index_frame(self): + self.df_single_index.to_csv(self.fname) + + class ToCSVDatetime(BaseIO): fname = "__test__.csv" @@ -66,6 +87,21 @@ def time_frame_date_formatting(self): self.data.to_csv(self.fname, date_format="%Y%m%d") +class ToCSVDatetimeIndex(BaseIO): + + fname = "__test__.csv" + + def setup(self): + rng = date_range("2000", periods=100_000, freq="S") + self.data = DataFrame({"a": 1}, index=rng) + + def time_frame_date_formatting_index(self): + self.data.to_csv(self.fname, 
date_format="%Y-%m-%d %H:%M:%S") + + def time_frame_date_no_format_index(self): + self.data.to_csv(self.fname) + + class ToCSVDatetimeBig(BaseIO): fname = "__test__.csv" @@ -206,7 +242,7 @@ def time_read_csv(self, bad_date_value): class ReadCSVSkipRows(BaseIO): fname = "__test__.csv" - params = ([None, 10000], ["c", "python"]) + params = ([None, 10000], ["c", "python", "pyarrow"]) param_names = ["skiprows", "engine"] def setup(self, skiprows, engine): @@ -291,7 +327,8 @@ class ReadCSVFloatPrecision(StringIORewind): def setup(self, sep, decimal, float_precision): floats = [ - "".join(random.choice(string.digits) for _ in range(28)) for _ in range(15) + "".join([random.choice(string.digits) for _ in range(28)]) + for _ in range(15) ] rows = sep.join([f"0{decimal}" + "{}"] * 3) + "\n" data = rows * 5 @@ -319,7 +356,7 @@ def time_read_csv_python_engine(self, sep, decimal, float_precision): class ReadCSVEngine(StringIORewind): - params = ["c", "python"] + params = ["c", "python", "pyarrow"] param_names = ["engine"] def setup(self, engine): @@ -395,7 +432,7 @@ class ReadCSVCachedParseDates(StringIORewind): param_names = ["do_cache", "engine"] def setup(self, do_cache, engine): - data = ("\n".join(f"10/{year}" for year in range(2000, 2100)) + "\n") * 10 + data = ("\n".join([f"10/{year}" for year in range(2000, 2100)]) + "\n") * 10 self.StringIO_input = StringIO(data) def time_read_csv_cached(self, do_cache, engine): @@ -458,6 +495,34 @@ def time_read_special_date(self, value, engine): ) +class ReadCSVMemMapUTF8: + + fname = "__test__.csv" + number = 5 + + def setup(self): + lines = [] + line_length = 128 + start_char = " " + end_char = "\U00010080" + # This for loop creates a list of 128-char strings + # consisting of consecutive Unicode chars + for lnum in range(ord(start_char), ord(end_char), line_length): + line = "".join([chr(c) for c in range(lnum, lnum + 0x80)]) + "\n" + try: + line.encode("utf-8") + except UnicodeEncodeError: + # Some 16-bit words are not valid Unicode chars and must be skipped + continue + lines.append(line) + df = DataFrame(lines) + df = concat([df for n in range(100)], ignore_index=True) + df.to_csv(self.fname, index=False, header=False, encoding="utf-8") + + def time_read_memmapped_utf8(self): + read_csv(self.fname, header=None, memory_map=True, encoding="utf-8", engine="c") + + class ParseDateComparison(StringIORewind): params = ([False, True],) param_names = ["cache_dates"] @@ -495,4 +560,14 @@ def time_to_datetime_format_DD_MM_YYYY(self, cache_dates): to_datetime(df["date"], cache=cache_dates, format="%d-%m-%Y") +class ReadCSVIndexCol(StringIORewind): + def setup(self): + count_elem = 100_000 + data = "a,b\n" + "1,2\n" * count_elem + self.StringIO_input = StringIO(data) + + def time_read_csv_index_col(self): + read_csv(self.StringIO_input, index_col="a") + + from ..pandas_vb_common import setup # noqa: F401 isort:skip diff --git a/asv_bench/benchmarks/io/json.py b/asv_bench/benchmarks/io/json.py index d9d27ce7e5d8c..d1468a238c491 100644 --- a/asv_bench/benchmarks/io/json.py +++ b/asv_bench/benchmarks/io/json.py @@ -172,15 +172,19 @@ def time_to_json(self, orient, frame): def peakmem_to_json(self, orient, frame): getattr(self, frame).to_json(self.fname, orient=orient) - def time_to_json_wide(self, orient, frame): + +class ToJSONWide(ToJSON): + def setup(self, orient, frame): + super().setup(orient, frame) base_df = getattr(self, frame).copy() - df = concat([base_df.iloc[:100]] * 1000, ignore_index=True, axis=1) - df.to_json(self.fname, orient=orient) + df_wide = 
concat([base_df.iloc[:100]] * 1000, ignore_index=True, axis=1) + self.df_wide = df_wide + + def time_to_json_wide(self, orient, frame): + self.df_wide.to_json(self.fname, orient=orient) def peakmem_to_json_wide(self, orient, frame): - base_df = getattr(self, frame).copy() - df = concat([base_df.iloc[:100]] * 1000, ignore_index=True, axis=1) - df.to_json(self.fname, orient=orient) + self.df_wide.to_json(self.fname, orient=orient) class ToJSONISO(BaseIO): diff --git a/asv_bench/benchmarks/io/style.py b/asv_bench/benchmarks/io/style.py index 82166a2a95c76..f0902c9c2c328 100644 --- a/asv_bench/benchmarks/io/style.py +++ b/asv_bench/benchmarks/io/style.py @@ -34,13 +34,29 @@ def peakmem_classes_render(self, cols, rows): self._style_classes() self.st._render_html(True, True) + def time_tooltips_render(self, cols, rows): + self._style_tooltips() + self.st._render_html(True, True) + + def peakmem_tooltips_render(self, cols, rows): + self._style_tooltips() + self.st._render_html(True, True) + def time_format_render(self, cols, rows): self._style_format() - self.st.render() + self.st._render_html(True, True) def peakmem_format_render(self, cols, rows): self._style_format() - self.st.render() + self.st._render_html(True, True) + + def time_apply_format_hide_render(self, cols, rows): + self._style_apply_format_hide() + self.st._render_html(True, True) + + def peakmem_apply_format_hide_render(self, cols, rows): + self._style_apply_format_hide() + self.st._render_html(True, True) def _style_apply(self): def _apply_func(s): @@ -63,3 +79,15 @@ def _style_format(self): self.st = self.df.style.format( "{:,.3f}", subset=IndexSlice["row_1":f"row_{ir}", "float_1":f"float_{ic}"] ) + + def _style_apply_format_hide(self): + self.st = self.df.style.applymap(lambda v: "color: red;") + self.st.format("{:.3f}") + self.st.hide_index(self.st.index[1:]) + self.st.hide_columns(self.st.columns[1:]) + + def _style_tooltips(self): + ttips = DataFrame("abc", index=self.df.index[::2], columns=self.df.columns[::2]) + self.st = self.df.style.set_tooltips(ttips) + self.st.hide_index(self.st.index[12:]) + self.st.hide_columns(self.st.columns[12:]) diff --git a/asv_bench/benchmarks/join_merge.py b/asv_bench/benchmarks/join_merge.py index 27eaecff09d0f..ad40adc75c567 100644 --- a/asv_bench/benchmarks/join_merge.py +++ b/asv_bench/benchmarks/join_merge.py @@ -262,12 +262,24 @@ def setup(self): Z=self.right_object["Z"].astype("category") ) + self.left_cat_col = self.left_object.astype({"X": "category"}) + self.right_cat_col = self.right_object.astype({"X": "category"}) + + self.left_cat_idx = self.left_cat_col.set_index("X") + self.right_cat_idx = self.right_cat_col.set_index("X") + def time_merge_object(self): merge(self.left_object, self.right_object, on="X") def time_merge_cat(self): merge(self.left_cat, self.right_cat, on="X") + def time_merge_on_cat_col(self): + merge(self.left_cat_col, self.right_cat_col, on="X") + + def time_merge_on_cat_idx(self): + merge(self.left_cat_idx, self.right_cat_idx, on="X") + class MergeOrdered: def setup(self): diff --git a/asv_bench/benchmarks/pandas_vb_common.py b/asv_bench/benchmarks/pandas_vb_common.py index ed44102700dc6..d3168bde0a783 100644 --- a/asv_bench/benchmarks/pandas_vb_common.py +++ b/asv_bench/benchmarks/pandas_vb_common.py @@ -17,7 +17,7 @@ try: import pandas._testing as tm except ImportError: - import pandas.util.testing as tm # noqa + import pandas.util.testing as tm # noqa:F401 numeric_dtypes = [ diff --git a/asv_bench/benchmarks/reshape.py b/asv_bench/benchmarks/reshape.py 
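The new time_merge_on_cat_col and time_merge_on_cat_idx cases added to join_merge.py above time merge when the join key is a category column and when it is a categorical index. A small sketch of those two call patterns on toy, made-up frames (names and sizes here are illustrative, not the benchmark's):

# Illustrative only: merging on a categorical column vs. a categorical index.
import numpy as np
import pandas as pd

keys = list("abcd") * 25
left = pd.DataFrame({"X": keys, "Y": np.arange(100)})
right = pd.DataFrame({"X": keys, "Z": np.arange(100) * 2})

# join key cast to the category dtype, then merged on that column
left_cat_col = left.astype({"X": "category"})
right_cat_col = right.astype({"X": "category"})
on_col = pd.merge(left_cat_col, right_cat_col, on="X")

# same data with the categorical key moved into the index, merged on the index
left_cat_idx = left_cat_col.set_index("X")
right_cat_idx = right_cat_col.set_index("X")
on_idx = pd.merge(left_cat_idx, right_cat_idx, left_index=True, right_index=True)

print(on_col.shape, on_idx.shape)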
index 232aabfb87c58..c83cd9a925f6d 100644 --- a/asv_bench/benchmarks/reshape.py +++ b/asv_bench/benchmarks/reshape.py @@ -102,6 +102,7 @@ def setup(self, dtype): columns = np.arange(n) if dtype == "int": values = np.arange(m * m * n).reshape(m * m, n) + self.df = DataFrame(values, index, columns) else: # the category branch is ~20x slower than int. So we # cut down the size a bit. Now it's only ~3x slower. @@ -111,7 +112,10 @@ def setup(self, dtype): values = np.take(list(string.ascii_letters), indices) values = [pd.Categorical(v) for v in values.T] - self.df = DataFrame(values, index, columns) + self.df = DataFrame( + {i: cat for i, cat in enumerate(values)}, index, columns + ) + self.df2 = self.df.iloc[:-1] def time_full_product(self, dtype): diff --git a/asv_bench/benchmarks/rolling.py b/asv_bench/benchmarks/rolling.py index d35770b720f7a..1c53d4adc8c25 100644 --- a/asv_bench/benchmarks/rolling.py +++ b/asv_bench/benchmarks/rolling.py @@ -1,3 +1,5 @@ +import warnings + import numpy as np import pandas as pd @@ -7,22 +9,24 @@ class Methods: params = ( ["DataFrame", "Series"], - [10, 1000], + [("rolling", {"window": 10}), ("rolling", {"window": 1000}), ("expanding", {})], ["int", "float"], - ["median", "mean", "max", "min", "std", "count", "skew", "kurt", "sum"], + ["median", "mean", "max", "min", "std", "count", "skew", "kurt", "sum", "sem"], ) - param_names = ["constructor", "window", "dtype", "method"] + param_names = ["constructor", "window_kwargs", "dtype", "method"] - def setup(self, constructor, window, dtype, method): + def setup(self, constructor, window_kwargs, dtype, method): N = 10 ** 5 + window, kwargs = window_kwargs arr = (100 * np.random.random(N)).astype(dtype) - self.roll = getattr(pd, constructor)(arr).rolling(window) + obj = getattr(pd, constructor)(arr) + self.window = getattr(obj, window)(**kwargs) - def time_rolling(self, constructor, window, dtype, method): - getattr(self.roll, method)() + def time_method(self, constructor, window_kwargs, dtype, method): + getattr(self.window, method)() - def peakmem_rolling(self, constructor, window, dtype, method): - getattr(self.roll, method)() + def peakmem_method(self, constructor, window_kwargs, dtype, method): + getattr(self.window, method)() class Apply: @@ -44,77 +48,116 @@ def time_rolling(self, constructor, window, dtype, function, raw): self.roll.apply(function, raw=raw) -class Engine: +class NumbaEngineMethods: params = ( ["DataFrame", "Series"], ["int", "float"], - [np.sum, lambda x: np.sum(x) + 5], - ["cython", "numba"], - ["sum", "max", "min", "median", "mean"], + [("rolling", {"window": 10}), ("expanding", {})], + ["sum", "max", "min", "median", "mean", "var", "std"], + [True, False], + [None, 100], ) - param_names = ["constructor", "dtype", "function", "engine", "method"] - - def setup(self, constructor, dtype, function, engine, method): + param_names = [ + "constructor", + "dtype", + "window_kwargs", + "method", + "parallel", + "cols", + ] + + def setup(self, constructor, dtype, window_kwargs, method, parallel, cols): N = 10 ** 3 - arr = (100 * np.random.random(N)).astype(dtype) - self.data = getattr(pd, constructor)(arr) - - def time_rolling_apply(self, constructor, dtype, function, engine, method): - self.data.rolling(10).apply(function, raw=True, engine=engine) - - def time_expanding_apply(self, constructor, dtype, function, engine, method): - self.data.expanding().apply(function, raw=True, engine=engine) - - def time_rolling_methods(self, constructor, dtype, function, engine, method): - 
getattr(self.data.rolling(10), method)(engine=engine) - - -class ExpandingMethods: - + window, kwargs = window_kwargs + shape = (N, cols) if cols is not None and constructor != "Series" else N + arr = (100 * np.random.random(shape)).astype(dtype) + data = getattr(pd, constructor)(arr) + + # Warm the cache + with warnings.catch_warnings(record=True): + # Catch parallel=True not being applicable e.g. 1D data + self.window = getattr(data, window)(**kwargs) + getattr(self.window, method)( + engine="numba", engine_kwargs={"parallel": parallel} + ) + + def test_method(self, constructor, dtype, window_kwargs, method, parallel, cols): + with warnings.catch_warnings(record=True): + getattr(self.window, method)( + engine="numba", engine_kwargs={"parallel": parallel} + ) + + +class NumbaEngineApply: params = ( ["DataFrame", "Series"], ["int", "float"], - ["median", "mean", "max", "min", "std", "count", "skew", "kurt", "sum"], + [("rolling", {"window": 10}), ("expanding", {})], + [np.sum, lambda x: np.sum(x) + 5], + [True, False], + [None, 100], ) - param_names = ["constructor", "window", "dtype", "method"] - - def setup(self, constructor, dtype, method): - N = 10 ** 5 - N_groupby = 100 - arr = (100 * np.random.random(N)).astype(dtype) - self.expanding = getattr(pd, constructor)(arr).expanding() - self.expanding_groupby = ( - pd.DataFrame({"A": arr[:N_groupby], "B": range(N_groupby)}) - .groupby("B") - .expanding() - ) - - def time_expanding(self, constructor, dtype, method): - getattr(self.expanding, method)() - - def time_expanding_groupby(self, constructor, dtype, method): - getattr(self.expanding_groupby, method)() + param_names = [ + "constructor", + "dtype", + "window_kwargs", + "function", + "parallel", + "cols", + ] + + def setup(self, constructor, dtype, window_kwargs, function, parallel, cols): + N = 10 ** 3 + window, kwargs = window_kwargs + shape = (N, cols) if cols is not None and constructor != "Series" else N + arr = (100 * np.random.random(shape)).astype(dtype) + data = getattr(pd, constructor)(arr) + + # Warm the cache + with warnings.catch_warnings(record=True): + # Catch parallel=True not being applicable e.g. 
1D data + self.window = getattr(data, window)(**kwargs) + self.window.apply( + function, raw=True, engine="numba", engine_kwargs={"parallel": parallel} + ) + + def test_method(self, constructor, dtype, window_kwargs, function, parallel, cols): + with warnings.catch_warnings(record=True): + self.window.apply( + function, raw=True, engine="numba", engine_kwargs={"parallel": parallel} + ) class EWMMethods: - params = (["DataFrame", "Series"], [10, 1000], ["int", "float"], ["mean", "std"]) - param_names = ["constructor", "window", "dtype", "method"] + params = ( + ["DataFrame", "Series"], + [ + ({"halflife": 10}, "mean"), + ({"halflife": 10}, "std"), + ({"halflife": 1000}, "mean"), + ({"halflife": 1000}, "std"), + ( + { + "halflife": "1 Day", + "times": pd.date_range("1900", periods=10 ** 5, freq="23s"), + }, + "mean", + ), + ], + ["int", "float"], + ) + param_names = ["constructor", "kwargs_method", "dtype"] - def setup(self, constructor, window, dtype, method): + def setup(self, constructor, kwargs_method, dtype): N = 10 ** 5 + kwargs, method = kwargs_method arr = (100 * np.random.random(N)).astype(dtype) - times = pd.date_range("1900", periods=N, freq="23s") - self.ewm = getattr(pd, constructor)(arr).ewm(halflife=window) - self.ewm_times = getattr(pd, constructor)(arr).ewm( - halflife="1 Day", times=times - ) - - def time_ewm(self, constructor, window, dtype, method): - getattr(self.ewm, method)() + self.method = method + self.ewm = getattr(pd, constructor)(arr).ewm(**kwargs) - def time_ewm_times(self, constructor, window, dtype, method): - self.ewm_times.mean() + def time_ewm(self, constructor, kwargs_method, dtype): + getattr(self.ewm, self.method)() class VariableWindowMethods(Methods): @@ -122,7 +165,7 @@ class VariableWindowMethods(Methods): ["DataFrame", "Series"], ["50s", "1h", "1d"], ["int", "float"], - ["median", "mean", "max", "min", "std", "count", "skew", "kurt", "sum"], + ["median", "mean", "max", "min", "std", "count", "skew", "kurt", "sum", "sem"], ) param_names = ["constructor", "window", "dtype", "method"] @@ -130,35 +173,35 @@ def setup(self, constructor, window, dtype, method): N = 10 ** 5 arr = (100 * np.random.random(N)).astype(dtype) index = pd.date_range("2017-01-01", periods=N, freq="5s") - self.roll = getattr(pd, constructor)(arr, index=index).rolling(window) + self.window = getattr(pd, constructor)(arr, index=index).rolling(window) class Pairwise: - params = ([10, 1000, None], ["corr", "cov"], [True, False]) - param_names = ["window", "method", "pairwise"] + params = ( + [({"window": 10}, "rolling"), ({"window": 1000}, "rolling"), ({}, "expanding")], + ["corr", "cov"], + [True, False], + ) + param_names = ["window_kwargs", "method", "pairwise"] - def setup(self, window, method, pairwise): + def setup(self, kwargs_window, method, pairwise): N = 10 ** 4 n_groups = 20 + kwargs, window = kwargs_window groups = [i for _ in range(N // n_groups) for i in range(n_groups)] arr = np.random.random(N) self.df = pd.DataFrame(arr) - self.df_group = pd.DataFrame({"A": groups, "B": arr}).groupby("A") + self.window = getattr(self.df, window)(**kwargs) + self.window_group = getattr( + pd.DataFrame({"A": groups, "B": arr}).groupby("A"), window + )(**kwargs) - def time_pairwise(self, window, method, pairwise): - if window is None: - r = self.df.expanding() - else: - r = self.df.rolling(window=window) - getattr(r, method)(self.df, pairwise=pairwise) + def time_pairwise(self, kwargs_window, method, pairwise): + getattr(self.window, method)(self.df, pairwise=pairwise) - def 
time_groupby(self, window, method, pairwise): - if window is None: - r = self.df_group.expanding() - else: - r = self.df_group.rolling(window=window) - getattr(r, method)(self.df, pairwise=pairwise) + def time_groupby(self, kwargs_window, method, pairwise): + getattr(self.window_group, method)(self.df, pairwise=pairwise) class Quantile: @@ -180,6 +223,33 @@ def time_quantile(self, constructor, window, dtype, percentile, interpolation): self.roll.quantile(percentile, interpolation=interpolation) +class Rank: + params = ( + ["DataFrame", "Series"], + [10, 1000], + ["int", "float"], + [True, False], + [True, False], + ["min", "max", "average"], + ) + param_names = [ + "constructor", + "window", + "dtype", + "percentile", + "ascending", + "method", + ] + + def setup(self, constructor, window, dtype, percentile, ascending, method): + N = 10 ** 5 + arr = np.random.random(N).astype(dtype) + self.roll = getattr(pd, constructor)(arr).rolling(window) + + def time_rank(self, constructor, window, dtype, percentile, ascending, method): + self.roll.rank(pct=percentile, ascending=ascending, method=method) + + class PeakMemFixedWindowMinMax: params = ["min", "max"] @@ -218,10 +288,18 @@ def peakmem_rolling(self, constructor, window_size, dtype, method): class Groupby: - params = ["sum", "median", "mean", "max", "min", "kurt", "sum"] + params = ( + ["sum", "median", "mean", "max", "min", "kurt", "sum"], + [ + ("rolling", {"window": 2}), + ("rolling", {"window": "30s", "on": "C"}), + ("expanding", {}), + ], + ) - def setup(self, method): + def setup(self, method, window_kwargs): N = 1000 + window, kwargs = window_kwargs df = pd.DataFrame( { "A": [str(i) for i in range(N)] * 10, @@ -229,14 +307,10 @@ def setup(self, method): "C": pd.date_range(start="1900-01-01", freq="1min", periods=N * 10), } ) - self.groupby_roll_int = df.groupby("A").rolling(window=2) - self.groupby_roll_offset = df.groupby("A").rolling(window="30s", on="C") - - def time_rolling_int(self, method): - getattr(self.groupby_roll_int, method)() + self.groupby_window = getattr(df.groupby("A"), window)(**kwargs) - def time_rolling_offset(self, method): - getattr(self.groupby_roll_offset, method)() + def time_method(self, method, window_kwargs): + getattr(self.groupby_window, method)() class GroupbyLargeGroups: @@ -296,5 +370,8 @@ def time_apply(self, method): table_method_func, raw=True, engine="numba" ) + def time_ewm_mean(self, method): + self.df.ewm(1, method=method).mean(engine="numba") + from .pandas_vb_common import setup # noqa: F401 isort:skip diff --git a/asv_bench/benchmarks/series_methods.py b/asv_bench/benchmarks/series_methods.py index 7592ce54e3712..d8578ed604ae3 100644 --- a/asv_bench/benchmarks/series_methods.py +++ b/asv_bench/benchmarks/series_methods.py @@ -27,6 +27,19 @@ def time_constructor(self, data): Series(data=self.data, index=self.idx) +class ToFrame: + params = [["int64", "datetime64[ns]", "category", "Int64"], [None, "foo"]] + param_names = ["dtype", "name"] + + def setup(self, dtype, name): + arr = np.arange(10 ** 5) + ser = Series(arr, dtype=dtype) + self.ser = ser + + def time_to_frame(self, dtype, name): + self.ser.to_frame(name) + + class NSort: params = ["first", "last", "all"] @@ -139,6 +152,18 @@ def time_value_counts(self, N, dtype): self.s.value_counts() +class ValueCountsObjectDropNAFalse: + + params = [10 ** 3, 10 ** 4, 10 ** 5] + param_names = ["N"] + + def setup(self, N): + self.s = Series(np.random.randint(0, N, size=10 * N)).astype("object") + + def time_value_counts(self, N): + 
self.s.value_counts(dropna=False) + + class Mode: params = [[10 ** 3, 10 ** 4, 10 ** 5], ["int", "uint", "float", "object"]] @@ -151,6 +176,18 @@ def time_mode(self, N, dtype): self.s.mode() +class ModeObjectDropNAFalse: + + params = [10 ** 3, 10 ** 4, 10 ** 5] + param_names = ["N"] + + def setup(self, N): + self.s = Series(np.random.randint(0, N, size=10 * N)).astype("object") + + def time_mode(self, N): + self.s.mode(dropna=False) + + class Dir: def setup(self): self.s = Series(index=tm.makeStringIndex(10000)) diff --git a/asv_bench/benchmarks/sparse.py b/asv_bench/benchmarks/sparse.py index 35e5818cd3b2b..ec704896f5726 100644 --- a/asv_bench/benchmarks/sparse.py +++ b/asv_bench/benchmarks/sparse.py @@ -67,16 +67,42 @@ def time_sparse_series_from_coo(self): class ToCoo: - def setup(self): + params = [True, False] + param_names = ["sort_labels"] + + def setup(self, sort_labels): s = Series([np.nan] * 10000) s[0] = 3.0 s[100] = -1.0 s[999] = 12.1 - s.index = MultiIndex.from_product([range(10)] * 4) - self.ss = s.astype("Sparse") - def time_sparse_series_to_coo(self): - self.ss.sparse.to_coo(row_levels=[0, 1], column_levels=[2, 3], sort_labels=True) + s_mult_lvl = s.set_axis(MultiIndex.from_product([range(10)] * 4)) + self.ss_mult_lvl = s_mult_lvl.astype("Sparse") + + s_two_lvl = s.set_axis(MultiIndex.from_product([range(100)] * 2)) + self.ss_two_lvl = s_two_lvl.astype("Sparse") + + def time_sparse_series_to_coo(self, sort_labels): + self.ss_mult_lvl.sparse.to_coo( + row_levels=[0, 1], column_levels=[2, 3], sort_labels=sort_labels + ) + + def time_sparse_series_to_coo_single_level(self, sort_labels): + self.ss_two_lvl.sparse.to_coo(sort_labels=sort_labels) + + +class ToCooFrame: + def setup(self): + N = 10000 + k = 10 + arr = np.zeros((N, k), dtype=float) + arr[0, 0] = 3.0 + arr[12, 7] = -1.0 + arr[0, 9] = 11.2 + self.df = pd.DataFrame(arr, dtype=pd.SparseDtype("float", fill_value=0.0)) + + def time_to_coo(self): + self.df.sparse.to_coo() class Arithmetic: @@ -140,4 +166,68 @@ def time_division(self, fill_value): self.arr1 / self.arr2 +class MinMax: + + params = (["min", "max"], [0.0, np.nan]) + param_names = ["func", "fill_value"] + + def setup(self, func, fill_value): + N = 1_000_000 + arr = make_array(N, 1e-5, fill_value, np.float64) + self.sp_arr = SparseArray(arr, fill_value=fill_value) + + def time_min_max(self, func, fill_value): + getattr(self.sp_arr, func)() + + +class Take: + + params = ([np.array([0]), np.arange(100_000), np.full(100_000, -1)], [True, False]) + param_names = ["indices", "allow_fill"] + + def setup(self, indices, allow_fill): + N = 1_000_000 + fill_value = 0.0 + arr = make_array(N, 1e-5, fill_value, np.float64) + self.sp_arr = SparseArray(arr, fill_value=fill_value) + + def time_take(self, indices, allow_fill): + self.sp_arr.take(indices, allow_fill=allow_fill) + + +class GetItem: + def setup(self): + N = 1_000_000 + d = 1e-5 + arr = make_array(N, d, np.nan, np.float64) + self.sp_arr = SparseArray(arr) + + def time_integer_indexing(self): + self.sp_arr[78] + + def time_slice(self): + self.sp_arr[1:] + + +class GetItemMask: + + params = [True, False, np.nan] + param_names = ["fill_value"] + + def setup(self, fill_value): + N = 1_000_000 + d = 1e-5 + arr = make_array(N, d, np.nan, np.float64) + self.sp_arr = SparseArray(arr) + b_arr = np.full(shape=N, fill_value=fill_value, dtype=np.bool8) + fv_inds = np.unique( + np.random.randint(low=0, high=N - 1, size=int(N * d), dtype=np.int32) + ) + b_arr[fv_inds] = True if pd.isna(fill_value) else not fill_value + 
self.sp_b_arr = SparseArray(b_arr, dtype=np.bool8, fill_value=fill_value) + + def time_mask(self, fill_value): + self.sp_arr[self.sp_b_arr] + + from .pandas_vb_common import setup # noqa: F401 isort:skip diff --git a/asv_bench/benchmarks/tslibs/fields.py b/asv_bench/benchmarks/tslibs/fields.py index 0607a799ec707..23ae73811204c 100644 --- a/asv_bench/benchmarks/tslibs/fields.py +++ b/asv_bench/benchmarks/tslibs/fields.py @@ -12,7 +12,7 @@ class TimeGetTimedeltaField: params = [ _sizes, - ["days", "h", "s", "seconds", "ms", "microseconds", "us", "ns", "nanoseconds"], + ["days", "seconds", "microseconds", "nanoseconds"], ] param_names = ["size", "field"] diff --git a/azure-pipelines.yml b/azure-pipelines.yml index 956feaef5f83e..9c04d10707a64 100644 --- a/azure-pipelines.yml +++ b/azure-pipelines.yml @@ -2,43 +2,48 @@ trigger: branches: include: - - master - - 1.2.x + - main + - 1.4.x paths: exclude: - 'doc/*' pr: -- master -- 1.2.x + autoCancel: true + branches: + include: + - main + - 1.4.x variables: PYTEST_WORKERS: auto + PYTEST_TARGET: pandas jobs: # Mac and Linux use the same template - template: ci/azure/posix.yml parameters: name: macOS - vmImage: macOS-10.14 + vmImage: macOS-10.15 - template: ci/azure/windows.yml parameters: name: Windows - vmImage: vs2017-win2016 + vmImage: windows-2019 -- job: py37_32bit +- job: py38_32bit pool: vmImage: ubuntu-18.04 steps: + # TODO: GH#44980 https://github.com/pypa/setuptools/issues/2941 - script: | docker pull quay.io/pypa/manylinux2014_i686 docker run -v $(pwd):/pandas quay.io/pypa/manylinux2014_i686 \ /bin/bash -xc "cd pandas && \ - /opt/python/cp37-cp37m/bin/python -m venv ~/virtualenvs/pandas-dev && \ + /opt/python/cp38-cp38/bin/python -m venv ~/virtualenvs/pandas-dev && \ . ~/virtualenvs/pandas-dev/bin/activate && \ - python -m pip install --no-deps -U pip wheel setuptools && \ + python -m pip install --no-deps -U pip wheel 'setuptools<60.0.0' && \ pip install cython numpy python-dateutil pytz pytest pytest-xdist hypothesis pytest-azurepipelines && \ python setup.py build_ext -q -j2 && \ python -m pip install --no-build-isolation -e . 
&& \ @@ -50,4 +55,4 @@ jobs: inputs: testResultsFiles: '**/test-*.xml' failTaskOnFailedTests: true - testRunTitle: 'Publish test results for Python 3.7-32 bit full Linux' + testRunTitle: 'Publish test results for Python 3.8-32 bit full Linux' diff --git a/ci/azure/posix.yml b/ci/azure/posix.yml index 2caacf3a07290..02a4a9ad44865 100644 --- a/ci/azure/posix.yml +++ b/ci/azure/posix.yml @@ -8,11 +8,36 @@ jobs: vmImage: ${{ parameters.vmImage }} strategy: matrix: - ${{ if eq(parameters.name, 'macOS') }}: - py37_macos: - ENV_FILE: ci/deps/azure-macos-37.yaml - CONDA_PY: "37" - PATTERN: "not slow and not network" + py38_macos_1: + ENV_FILE: ci/deps/azure-macos-38.yaml + CONDA_PY: "38" + PATTERN: "not slow" + PYTEST_TARGET: "pandas/tests/[a-h]*" + py38_macos_2: + ENV_FILE: ci/deps/azure-macos-38.yaml + CONDA_PY: "38" + PATTERN: "not slow" + PYTEST_TARGET: "pandas/tests/[i-z]*" + py39_macos_1: + ENV_FILE: ci/deps/azure-macos-39.yaml + CONDA_PY: "39" + PATTERN: "not slow" + PYTEST_TARGET: "pandas/tests/[a-h]*" + py39_macos_2: + ENV_FILE: ci/deps/azure-macos-39.yaml + CONDA_PY: "39" + PATTERN: "not slow" + PYTEST_TARGET: "pandas/tests/[i-z]*" + py310_macos_1: + ENV_FILE: ci/deps/azure-macos-310.yaml + CONDA_PY: "310" + PATTERN: "not slow" + PYTEST_TARGET: "pandas/tests/[a-h]*" + py310_macos_2: + ENV_FILE: ci/deps/azure-macos-310.yaml + CONDA_PY: "310" + PATTERN: "not slow" + PYTEST_TARGET: "pandas/tests/[i-z]*" steps: - script: echo '##vso[task.prependpath]$(HOME)/miniconda3/bin' diff --git a/ci/azure/windows.yml b/ci/azure/windows.yml index 5644ad46714d5..7061a266f28c7 100644 --- a/ci/azure/windows.yml +++ b/ci/azure/windows.yml @@ -8,41 +8,70 @@ jobs: vmImage: ${{ parameters.vmImage }} strategy: matrix: - py37_np17: - ENV_FILE: ci/deps/azure-windows-37.yaml - CONDA_PY: "37" - PATTERN: "not slow and not network" + py38_np18_1: + ENV_FILE: ci/deps/azure-windows-38.yaml + CONDA_PY: "38" + PATTERN: "not slow" + PYTEST_WORKERS: 2 # GH-42236 + PYTEST_TARGET: "pandas/tests/[a-h]*" - py38_np18: + py38_np18_2: ENV_FILE: ci/deps/azure-windows-38.yaml CONDA_PY: "38" - PATTERN: "not slow and not network and not high_memory" + PATTERN: "not slow" + PYTEST_WORKERS: 2 # GH-42236 + PYTEST_TARGET: "pandas/tests/[i-z]*" + + py39_1: + ENV_FILE: ci/deps/azure-windows-39.yaml + CONDA_PY: "39" + PATTERN: "not slow and not high_memory" + PYTEST_WORKERS: 2 # GH-42236 + PYTEST_TARGET: "pandas/tests/[a-h]*" + + py39_2: + ENV_FILE: ci/deps/azure-windows-39.yaml + CONDA_PY: "39" + PATTERN: "not slow and not high_memory" + PYTEST_WORKERS: 2 # GH-42236 + PYTEST_TARGET: "pandas/tests/[i-z]*" + + py310_1: + ENV_FILE: ci/deps/azure-windows-310.yaml + CONDA_PY: "310" + PATTERN: "not slow and not high_memory" + PYTEST_WORKERS: 2 # GH-42236 + PYTEST_TARGET: "pandas/tests/[a-h]*" + + py310_2: + ENV_FILE: ci/deps/azure-windows-310.yaml + CONDA_PY: "310" + PATTERN: "not slow and not high_memory" + PYTEST_WORKERS: 2 # GH-42236 + PYTEST_TARGET: "pandas/tests/[i-z]*" steps: - powershell: | Write-Host "##vso[task.prependpath]$env:CONDA\Scripts" Write-Host "##vso[task.prependpath]$HOME/miniconda3/bin" displayName: 'Add conda to PATH' - - script: conda update -q -n base conda displayName: 'Update conda' - bash: | conda env create -q --file ci\\deps\\azure-windows-$(CONDA_PY).yaml displayName: 'Create anaconda environment' - - bash: | source activate pandas-dev conda list python setup.py build_ext -q -j 4 python -m pip install --no-build-isolation -e . 
displayName: 'Build' - - bash: | source activate pandas-dev + wmic.exe cpu get caption, deviceid, name, numberofcores, maxclockspeed ci/run_tests.sh displayName: 'Test' - - task: PublishTestResults@2 condition: succeededOrFailed() inputs: diff --git a/ci/code_checks.sh b/ci/code_checks.sh index 1844cb863c183..4498585e36ce5 100755 --- a/ci/code_checks.sh +++ b/ci/code_checks.sh @@ -3,22 +3,18 @@ # Run checks related to code quality. # # This script is intended for both the CI and to check locally that code standards are -# respected. We are currently linting (PEP-8 and similar), looking for patterns of -# common mistakes (sphinx directives with missing blank lines, old style classes, -# unwanted imports...), we run doctests here (currently some files only), and we +# respected. We run doctests here (currently some files only), and we # validate formatting error in docstrings. # # Usage: # $ ./ci/code_checks.sh # run all checks -# $ ./ci/code_checks.sh lint # run linting only -# $ ./ci/code_checks.sh patterns # check for patterns that should not exist # $ ./ci/code_checks.sh code # checks on imported code # $ ./ci/code_checks.sh doctests # run doctests # $ ./ci/code_checks.sh docstrings # validate docstring errors # $ ./ci/code_checks.sh typing # run static type analysis -[[ -z "$1" || "$1" == "lint" || "$1" == "patterns" || "$1" == "code" || "$1" == "doctests" || "$1" == "docstrings" || "$1" == "typing" ]] || \ - { echo "Unknown command $1. Usage: $0 [lint|patterns|code|doctests|docstrings|typing]"; exit 9999; } +[[ -z "$1" || "$1" == "code" || "$1" == "doctests" || "$1" == "docstrings" || "$1" == "typing" ]] || \ + { echo "Unknown command $1. Usage: $0 [code|doctests|docstrings|typing]"; exit 9999; } BASE_DIR="$(dirname $0)/.." RET=0 @@ -38,49 +34,7 @@ function invgrep { } if [[ "$GITHUB_ACTIONS" == "true" ]]; then - FLAKE8_FORMAT="##[error]%(path)s:%(row)s:%(col)s:%(code)s:%(text)s" INVGREP_PREPEND="##[error]" -else - FLAKE8_FORMAT="default" -fi - -### LINTING ### -if [[ -z "$CHECK" || "$CHECK" == "lint" ]]; then - - # Check that cython casting is of the form `obj` as opposed to ` obj`; - # it doesn't make a difference, but we want to be internally consistent. 
- # Note: this grep pattern is (intended to be) equivalent to the python - # regex r'(?])> ' - MSG='Linting .pyx code for spacing conventions in casting' ; echo $MSG - invgrep -r -E --include '*.pyx' --include '*.pxi.in' '[a-zA-Z0-9*]> ' pandas/_libs - RET=$(($RET + $?)) ; echo $MSG "DONE" - - # readability/casting: Warnings about C casting instead of C++ casting - # runtime/int: Warnings about using C number types instead of C++ ones - # build/include_subdir: Warnings about prefacing included header files with directory - -fi - -### PATTERNS ### -if [[ -z "$CHECK" || "$CHECK" == "patterns" ]]; then - - # Check for the following code in the extension array base tests: `tm.assert_frame_equal` and `tm.assert_series_equal` - MSG='Check for invalid EA testing' ; echo $MSG - invgrep -r -E --include '*.py' --exclude base.py 'tm.assert_(series|frame)_equal' pandas/tests/extension/base - RET=$(($RET + $?)) ; echo $MSG "DONE" - - MSG='Check for deprecated messages without sphinx directive' ; echo $MSG - invgrep -R --include="*.py" --include="*.pyx" -E "(DEPRECATED|DEPRECATE|Deprecated)(:|,|\.)" pandas - RET=$(($RET + $?)) ; echo $MSG "DONE" - - MSG='Check for backticks incorrectly rendering because of missing spaces' ; echo $MSG - invgrep -R --include="*.rst" -E "[a-zA-Z0-9]\`\`?[a-zA-Z0-9]" doc/source/ - RET=$(($RET + $?)) ; echo $MSG "DONE" - - MSG='Check for unnecessary random seeds in asv benchmarks' ; echo $MSG - invgrep -R --exclude pandas_vb_common.py -E 'np.random.seed' asv_bench/benchmarks/ - RET=$(($RET + $?)) ; echo $MSG "DONE" - fi ### CODE ### @@ -110,45 +64,13 @@ fi ### DOCTESTS ### if [[ -z "$CHECK" || "$CHECK" == "doctests" ]]; then - MSG='Doctests for individual files' ; echo $MSG - pytest -q --doctest-modules \ - pandas/core/accessor.py \ - pandas/core/aggregation.py \ - pandas/core/algorithms.py \ - pandas/core/base.py \ - pandas/core/construction.py \ - pandas/core/frame.py \ - pandas/core/generic.py \ - pandas/core/indexers.py \ - pandas/core/nanops.py \ - pandas/core/series.py \ - pandas/io/sql.py + MSG='Doctests' ; echo $MSG + # Ignore test_*.py files or else the unit tests will run + python -m pytest --doctest-modules --ignore-glob="**/test_*.py" pandas RET=$(($RET + $?)) ; echo $MSG "DONE" - MSG='Doctests for directories' ; echo $MSG - pytest -q --doctest-modules \ - pandas/_libs/ \ - pandas/api/ \ - pandas/arrays/ \ - pandas/compat/ \ - pandas/core/array_algos/ \ - pandas/core/arrays/ \ - pandas/core/computation/ \ - pandas/core/dtypes/ \ - pandas/core/groupby/ \ - pandas/core/indexes/ \ - pandas/core/ops/ \ - pandas/core/reshape/ \ - pandas/core/strings/ \ - pandas/core/tools/ \ - pandas/core/window/ \ - pandas/errors/ \ - pandas/io/clipboard/ \ - pandas/io/json/ \ - pandas/io/excel/ \ - pandas/io/parsers/ \ - pandas/io/sas/ \ - pandas/tseries/ + MSG='Cython Doctests' ; echo $MSG + python -m pytest --doctest-cython pandas/_libs RET=$(($RET + $?)) ; echo $MSG "DONE" fi @@ -156,8 +78,8 @@ fi ### DOCSTRINGS ### if [[ -z "$CHECK" || "$CHECK" == "docstrings" ]]; then - MSG='Validate docstrings (GL03, GL04, GL05, GL06, GL07, GL09, GL10, SS01, SS02, SS04, SS05, PR03, PR04, PR05, PR10, EX04, RT01, RT04, RT05, SA02, SA03)' ; echo $MSG - $BASE_DIR/scripts/validate_docstrings.py --format=actions --errors=GL03,GL04,GL05,GL06,GL07,GL09,GL10,SS02,SS04,SS05,PR03,PR04,PR05,PR10,EX04,RT01,RT04,RT05,SA02,SA03 + MSG='Validate docstrings (GL01, GL02, GL03, GL04, GL05, GL06, GL07, GL09, GL10, SS01, SS02, SS03, SS04, SS05, PR03, PR04, PR05, PR06, PR08, PR09, PR10, EX04, RT01, RT04, RT05, SA02, 
SA03)' ; echo $MSG + $BASE_DIR/scripts/validate_docstrings.py --format=actions --errors=GL01,GL02,GL03,GL04,GL05,GL06,GL07,GL09,GL10,SS02,SS03,SS04,SS05,PR03,PR04,PR05,PR06,PR08,PR09,PR10,EX04,RT01,RT04,RT05,SA02,SA03 RET=$(($RET + $?)) ; echo $MSG "DONE" fi @@ -169,8 +91,15 @@ if [[ -z "$CHECK" || "$CHECK" == "typing" ]]; then mypy --version MSG='Performing static analysis using mypy' ; echo $MSG - mypy pandas + mypy RET=$(($RET + $?)) ; echo $MSG "DONE" + + # run pyright, if it is installed + if command -v pyright &> /dev/null ; then + MSG='Performing static analysis using pyright' ; echo $MSG + pyright + RET=$(($RET + $?)) ; echo $MSG "DONE" + fi fi exit $RET diff --git a/ci/deps/actions-38-numpydev.yaml b/ci/deps/actions-310-numpydev.yaml similarity index 64% rename from ci/deps/actions-38-numpydev.yaml rename to ci/deps/actions-310-numpydev.yaml index 6eed2daac0c3b..3e32665d5433f 100644 --- a/ci/deps/actions-38-numpydev.yaml +++ b/ci/deps/actions-310-numpydev.yaml @@ -2,20 +2,20 @@ name: pandas-dev channels: - defaults dependencies: - - python=3.8.* + - python=3.10 # tools - pytest>=6.0 - pytest-cov - - pytest-xdist>=1.21 - - hypothesis>=3.58.0 + - pytest-xdist>=1.31 + - hypothesis>=5.5.3 # pandas dependencies + - python-dateutil - pytz - pip - pip: - - cython==0.29.21 # GH#34014 - - "git+git://github.com/dateutil/dateutil.git" + - cython==0.29.24 # GH#34014 - "--extra-index-url https://pypi.anaconda.org/scipy-wheels-nightly/simple" - "--pre" - "numpy" diff --git a/ci/deps/actions-310.yaml b/ci/deps/actions-310.yaml new file mode 100644 index 0000000000000..9829380620f86 --- /dev/null +++ b/ci/deps/actions-310.yaml @@ -0,0 +1,51 @@ +name: pandas-dev +channels: + - conda-forge +dependencies: + - python=3.9 + + # test dependencies + - cython=0.29.24 + - pytest>=6.0 + - pytest-cov + - pytest-xdist>=1.31 + - hypothesis>=5.5.3 + - psutil + + # required dependencies + - python-dateutil + - numpy + - pytz + + # optional dependencies + - beautifulsoup4 + - blosc + - bottleneck + - fastparquet + - fsspec + - html5lib + - gcsfs + - jinja2 + - lxml + - matplotlib + # TODO: uncomment after numba supports py310 + #- numba + - numexpr + - openpyxl + - odfpy + - pandas-gbq + - psycopg2 + - pymysql + - pytables + - pyarrow + - pyreadstat + - pyxlsb + - s3fs + - scipy + - sqlalchemy + - tabulate + - xarray + - xlrd + - xlsxwriter + - xlwt + - zstandard diff --git a/ci/deps/actions-37-db-min.yaml b/ci/deps/actions-37-db-min.yaml deleted file mode 100644 index cae4361ca37a7..0000000000000 --- a/ci/deps/actions-37-db-min.yaml +++ /dev/null @@ -1,48 +0,0 @@ -name: pandas-dev -channels: - - conda-forge -dependencies: - - python=3.7.* - - # tools - - cython>=0.29.21 - - pytest>=6.0 - - pytest-cov - - pytest-xdist>=1.21 - - hypothesis>=3.58.0 - - # required - - numpy<1.20 # GH#39541 compat for pyarrow<3 - - python-dateutil - - pytz - - # optional - - beautifulsoup4 - - blosc=1.17.0 - - python-blosc - - fastparquet=0.4.0 - - html5lib - - ipython - - jinja2 - - lxml=4.3.0 - - matplotlib - - nomkl - - numexpr - - openpyxl - - pandas-gbq - - google-cloud-bigquery>=1.27.2 # GH 36436 - - protobuf>=3.12.4 - - pyarrow=0.17.1 # GH 38803 - - pytables>=3.5.1 - - scipy - - xarray=0.12.3 - - xlrd<2.0 - - xlsxwriter - - xlwt - - moto - - flask - - # sql - - psycopg2=2.7 - - pymysql=0.8.1 - - sqlalchemy=1.3.0 diff --git a/ci/deps/actions-37-locale_slow.yaml b/ci/deps/actions-37-locale_slow.yaml deleted file mode 100644 index c6eb3b00a63ac..0000000000000 --- 
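Note: the reworked checks in `ci/code_checks.sh` above replace the explicit file lists with a blanket `pytest --doctest-modules --ignore-glob="**/test_*.py" pandas` run plus Cython doctests, and expand the numpydoc error codes passed to `scripts/validate_docstrings.py`. A minimal sketch of the kind of docstring both checks exercise; the function is a made-up example, not pandas API:

```python
def double(x):
    """
    Return twice the input value.

    Parameters
    ----------
    x : int
        Value to double.

    Returns
    -------
    int
        Twice ``x``.

    Examples
    --------
    >>> double(21)
    42
    """
    return 2 * x
```

`pytest --doctest-modules` collects the ``Examples`` block as a test, while the docstring validation flags structural problems (missing sections, bad formatting) via the error codes listed in the check.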
a/ci/deps/actions-37-locale_slow.yaml +++ /dev/null @@ -1,30 +0,0 @@ -name: pandas-dev -channels: - - defaults - - conda-forge -dependencies: - - python=3.7.* - - # tools - - cython>=0.29.21 - - pytest>=6.0 - - pytest-cov - - pytest-xdist>=1.21 - - hypothesis>=3.58.0 - - # pandas dependencies - - beautifulsoup4=4.6.0 - - bottleneck=1.2.* - - lxml - - matplotlib=3.0.0 - - numpy=1.17.* - - openpyxl=3.0.0 - - python-dateutil - - python-blosc - - pytz=2017.3 - - scipy - - sqlalchemy=1.3.0 - - xlrd=1.2.0 - - xlsxwriter=1.0.2 - - xlwt=1.3.0 - - html5lib=1.0.1 diff --git a/ci/deps/actions-37-minimum_versions.yaml b/ci/deps/actions-37-minimum_versions.yaml deleted file mode 100644 index b97601d18917c..0000000000000 --- a/ci/deps/actions-37-minimum_versions.yaml +++ /dev/null @@ -1,31 +0,0 @@ -name: pandas-dev -channels: - - conda-forge -dependencies: - - python=3.7.1 - - # tools - - cython=0.29.21 - - pytest>=6.0 - - pytest-cov - - pytest-xdist>=1.21 - - hypothesis>=3.58.0 - - psutil - - # pandas dependencies - - beautifulsoup4=4.6.0 - - bottleneck=1.2.1 - - jinja2=2.10 - - numba=0.46.0 - - numexpr=2.7.0 - - numpy=1.17.3 - - openpyxl=3.0.0 - - pytables=3.5.1 - - python-dateutil=2.7.3 - - pytz=2017.3 - - pyarrow=0.17.0 - - scipy=1.2 - - xlrd=1.2.0 - - xlsxwriter=1.0.2 - - xlwt=1.3.0 - - html5lib=1.0.1 diff --git a/ci/deps/actions-37.yaml b/ci/deps/actions-37.yaml deleted file mode 100644 index 0effe6f80df86..0000000000000 --- a/ci/deps/actions-37.yaml +++ /dev/null @@ -1,28 +0,0 @@ -name: pandas-dev -channels: - - defaults - - conda-forge -dependencies: - - python=3.7.* - - # tools - - cython>=0.29.21 - - pytest>=6.0 - - pytest-cov - - pytest-xdist>=1.21 - - hypothesis>=3.58.0 - - # pandas dependencies - - botocore>=1.11 - - fsspec>=0.7.4 - - numpy=1.19 - - python-dateutil - - nomkl - - pyarrow - - pytz - - s3fs>=0.4.0 - - moto>=1.3.14 - - flask - - tabulate - - pyreadstat - - pip diff --git a/ci/deps/actions-37-db.yaml b/ci/deps/actions-38-downstream_compat.yaml similarity index 51% rename from ci/deps/actions-37-db.yaml rename to ci/deps/actions-38-downstream_compat.yaml index e568f8615a8df..af4f7dee851d5 100644 --- a/ci/deps/actions-37-db.yaml +++ b/ci/deps/actions-38-downstream_compat.yaml @@ -1,54 +1,66 @@ +# Non-dependencies that pandas utilizes or has compatibility with pandas objects name: pandas-dev channels: - conda-forge dependencies: - - python=3.7.* + - python=3.8 + - pip - # tools - - cython>=0.29.21 + # test dependencies + - cython>=0.29.24 - pytest>=6.0 - - pytest-xdist>=1.21 - - hypothesis>=3.58.0 + - pytest-xdist>=1.31 + - hypothesis>=5.5.3 - pytest-cov>=2.10.1 # this is only needed in the coverage build, ref: GH 35737 + - nomkl + + # required dependencies + - numpy + - python-dateutil + - pytz - # pandas dependencies + # optional dependencies - beautifulsoup4 - - botocore>=1.11 - - dask + - blosc - fastparquet>=0.4.0 - fsspec>=0.7.4 - - gcsfs>=0.6.0 - - geopandas + - gcsfs - html5lib + - jinja2 + - lxml - matplotlib - - moto>=1.3.14 - - flask - - nomkl - numexpr - - numpy=1.17.* - odfpy - openpyxl - pandas-gbq - - google-cloud-bigquery>=1.27.2 # GH 36436 - psycopg2 - - pyarrow>=0.17.0 + - pyarrow>=1.0.1 - pymysql - pytables - - python-snappy - - python-dateutil - - pytz + - pyxlsb - s3fs>=0.4.0 - - scikit-learn - scipy - sqlalchemy - - statsmodels - xarray - - xlrd<2.0 + - xlrd - xlsxwriter - xlwt - - pip + + # downstream packages + - aiobotocore<2.0.0 # GH#44311 pinned to fix docbuild + - boto3 + - botocore>=1.11 + - dask + - ipython + - geopandas + - python-snappy + - seaborn + 
- scikit-learn + - statsmodels + - brotlipy + - coverage + - pandas-datareader + - pyyaml + - py - pip: - - brotlipy - - coverage - - pandas-datareader - - pyxlsb + - torch diff --git a/ci/deps/actions-38-locale.yaml b/ci/deps/actions-38-locale.yaml deleted file mode 100644 index 34a6860936550..0000000000000 --- a/ci/deps/actions-38-locale.yaml +++ /dev/null @@ -1,41 +0,0 @@ -name: pandas-dev -channels: - - conda-forge -dependencies: - - python=3.8.* - - # tools - - cython>=0.29.21 - - pytest>=6.0 - - pytest-cov - - pytest-xdist>=1.21 - - pytest-asyncio>=0.12.0 - - hypothesis>=3.58.0 - - # pandas dependencies - - beautifulsoup4 - - flask - - html5lib - - ipython - - jinja2 - - jedi<0.18.0 - - lxml - - matplotlib<3.3.0 - - moto - - nomkl - - numexpr - - numpy<1.20 # GH#39541 compat with pyarrow<3 - - openpyxl - - pytables - - python-dateutil - - pytz - - scipy - - xarray - - xlrd<2.0 - - xlsxwriter - - xlwt - - moto - - pyarrow=1.0.0 - - pip - - pip: - - pyxlsb diff --git a/ci/deps/actions-38-minimum_versions.yaml b/ci/deps/actions-38-minimum_versions.yaml new file mode 100644 index 0000000000000..467402bb6ef7f --- /dev/null +++ b/ci/deps/actions-38-minimum_versions.yaml @@ -0,0 +1,52 @@ +# Minimum version of required + optional dependencies +# Aligned with getting_started/install.rst and compat/_optional.py +name: pandas-dev +channels: + - conda-forge +dependencies: + - python=3.8.0 + + # test dependencies + - cython=0.29.24 + - pytest>=6.0 + - pytest-cov + - pytest-xdist>=1.31 + - hypothesis>=5.5.3 + - psutil + + # required dependencies + - python-dateutil=2.8.1 + - numpy=1.18.5 + - pytz=2020.1 + + # optional dependencies + - beautifulsoup4=4.8.2 + - blosc=1.20.1 + - bottleneck=1.3.1 + - fastparquet=0.4.0 + - fsspec=0.7.4 + - html5lib=1.1 + - gcsfs=0.6.0 + - jinja2=2.11 + - lxml=4.5.0 + - matplotlib=3.3.2 + - numba=0.50.1 + - numexpr=2.7.1 + - odfpy=1.4.1 + - openpyxl=3.0.3 + - pandas-gbq=0.14.0 + - psycopg2=2.8.4 + - pymysql=0.10.1 + - pytables=3.6.1 + - pyarrow=1.0.1 + - pyreadstat=1.1.0 + - pyxlsb=1.0.6 + - s3fs=0.4.0 + - scipy=1.4.1 + - sqlalchemy=1.4.0 + - tabulate=0.8.7 + - xarray=0.15.1 + - xlrd=2.0.1 + - xlsxwriter=1.2.2 + - xlwt=1.3.0 + - zstandard=0.15.2 diff --git a/ci/deps/actions-38-slow.yaml b/ci/deps/actions-38-slow.yaml deleted file mode 100644 index afba60e451b90..0000000000000 --- a/ci/deps/actions-38-slow.yaml +++ /dev/null @@ -1,38 +0,0 @@ -name: pandas-dev -channels: - - conda-forge -dependencies: - - python=3.8.* - - # tools - - cython>=0.29.21 - - pytest>=6.0 - - pytest-cov - - pytest-xdist>=1.21 - - hypothesis>=3.58.0 - - # pandas dependencies - - beautifulsoup4 - - fsspec>=0.7.4 - - html5lib - - lxml - - matplotlib - - numexpr - - numpy - - openpyxl - - patsy - - psycopg2 - - pymysql - - pytables - - python-dateutil - - pytz - - s3fs>=0.4.0 - - moto>=1.3.14 - - scipy - - sqlalchemy - - xlrd>=2.0 - - xlsxwriter - - xlwt - - moto - - flask - - numba diff --git a/ci/deps/actions-38.yaml b/ci/deps/actions-38.yaml index 11daa92046eb4..b23f686d845e9 100644 --- a/ci/deps/actions-38.yaml +++ b/ci/deps/actions-38.yaml @@ -1,20 +1,50 @@ name: pandas-dev channels: - - defaults - conda-forge dependencies: - - python=3.8.* + - python=3.8 - # tools - - cython>=0.29.21 + # test dependencies + - cython=0.29.24 - pytest>=6.0 - pytest-cov - - pytest-xdist>=1.21 - - hypothesis>=3.58.0 + - pytest-xdist>=1.31 + - hypothesis>=5.5.3 + - psutil - # pandas dependencies - - numpy + # required dependencies - python-dateutil - - nomkl + - numpy - pytz - - tabulate==0.8.7 + + # optional 
dependencies + - beautifulsoup4 + - blosc + - bottleneck + - fastparquet + - fsspec + - html5lib + - gcsfs + - jinja2 + - lxml + - matplotlib + - numba + - numexpr + - openpyxl + - odfpy + - pandas-gbq + - psycopg2 + - pymysql + - pytables + - pyarrow=3 + - pyreadstat + - pyxlsb + - s3fs + - scipy + - sqlalchemy + - tabulate + - xarray + - xlrd + - xlsxwriter + - xlwt + - zstandard diff --git a/ci/deps/actions-39.yaml b/ci/deps/actions-39.yaml index b74f1af8ee0f6..631ef40b02e33 100644 --- a/ci/deps/actions-39.yaml +++ b/ci/deps/actions-39.yaml @@ -2,21 +2,49 @@ name: pandas-dev channels: - conda-forge dependencies: - - python=3.9.* + - python=3.9 - # tools - - cython>=0.29.21 + # test dependencies + - cython=0.29.24 - pytest>=6.0 - pytest-cov - - pytest-xdist>=1.21 - - hypothesis>=3.58.0 + - pytest-xdist>=1.31 + - hypothesis>=5.5.3 + - psutil - # pandas dependencies - - numpy + # required dependencies - python-dateutil + - numpy - pytz # optional dependencies + - beautifulsoup4 + - blosc + - bottleneck + - fastparquet + - fsspec + - html5lib + - gcsfs + - jinja2 + - lxml + - matplotlib + - numba + - numexpr + - openpyxl + - odfpy + - pandas-gbq + - psycopg2 + - pymysql - pytables + - pyarrow=5 + - pyreadstat + - pyxlsb + - s3fs - scipy - - pyarrow=1.0 + - sqlalchemy + - tabulate + - xarray + - xlrd + - xlsxwriter + - xlwt + - zstandard diff --git a/ci/deps/actions-pypy-38.yaml b/ci/deps/actions-pypy-38.yaml new file mode 100644 index 0000000000000..ad05d2ab2dacc --- /dev/null +++ b/ci/deps/actions-pypy-38.yaml @@ -0,0 +1,20 @@ +name: pandas-dev +channels: + - conda-forge +dependencies: + # TODO: Add the rest of the dependencies in here + # once the other plentiful failures/segfaults + # with base pandas has been dealt with + - python=3.8[build=*_pypy] # TODO: use this once pypy3.8 is available + + # tools + - cython>=0.29.24 + - pytest>=6.0 + - pytest-cov + - pytest-xdist>=1.31 + - hypothesis>=5.5.3 + + # required + - numpy + - python-dateutil + - pytz diff --git a/ci/deps/azure-macos-37.yaml b/ci/deps/azure-macos-310.yaml similarity index 57% rename from ci/deps/azure-macos-37.yaml rename to ci/deps/azure-macos-310.yaml index 43e1055347f17..312fac8091db6 100644 --- a/ci/deps/azure-macos-37.yaml +++ b/ci/deps/azure-macos-310.yaml @@ -3,12 +3,13 @@ channels: - defaults - conda-forge dependencies: - - python=3.7.* + - python=3.10 # tools + - cython>=0.29.24 - pytest>=6.0 - - pytest-xdist>=1.21 - - hypothesis>=3.58.0 + - pytest-xdist>=1.31 + - hypothesis>=5.5.3 - pytest-azurepipelines # pandas dependencies @@ -17,21 +18,19 @@ dependencies: - html5lib - jinja2 - lxml - - matplotlib=2.2.3 + - matplotlib - nomkl - numexpr - - numpy=1.17.3 + - numpy - openpyxl - - pyarrow=0.17 + - pyarrow + - pyreadstat - pytables - - python-dateutil==2.7.3 + - python-dateutil==2.8.1 - pytz + - pyxlsb - xarray - - xlrd<2.0 + - xlrd - xlsxwriter - xlwt - - pip - - pip: - - cython>=0.29.21 - - pyreadstat - - pyxlsb + - zstandard diff --git a/ci/deps/azure-macos-38.yaml b/ci/deps/azure-macos-38.yaml new file mode 100644 index 0000000000000..422aa86c57fc7 --- /dev/null +++ b/ci/deps/azure-macos-38.yaml @@ -0,0 +1,36 @@ +name: pandas-dev +channels: + - defaults + - conda-forge +dependencies: + - python=3.8 + + # tools + - cython>=0.29.24 + - pytest>=6.0 + - pytest-xdist>=1.31 + - hypothesis>=5.5.3 + - pytest-azurepipelines + + # pandas dependencies + - beautifulsoup4 + - bottleneck + - html5lib + - jinja2 + - lxml + - matplotlib=3.3.2 + - nomkl + - numexpr + - numpy=1.18.5 + - openpyxl + - pyarrow=1.0.1 + - pyreadstat 
+ - pytables + - python-dateutil==2.8.1 + - pytz + - pyxlsb + - xarray + - xlrd + - xlsxwriter + - xlwt + - zstandard diff --git a/ci/deps/azure-macos-39.yaml b/ci/deps/azure-macos-39.yaml new file mode 100644 index 0000000000000..140d67796452c --- /dev/null +++ b/ci/deps/azure-macos-39.yaml @@ -0,0 +1,36 @@ +name: pandas-dev +channels: + - defaults + - conda-forge +dependencies: + - python=3.9 + + # tools + - cython>=0.29.24 + - pytest>=6.0 + - pytest-xdist>=1.31 + - hypothesis>=5.5.3 + - pytest-azurepipelines + + # pandas dependencies + - beautifulsoup4 + - bottleneck + - html5lib + - jinja2 + - lxml + - matplotlib=3.3.2 + - nomkl + - numexpr + - numpy=1.21.3 + - openpyxl + - pyarrow=4 + - pyreadstat + - pytables + - python-dateutil==2.8.1 + - pytz + - pyxlsb + - xarray + - xlrd + - xlsxwriter + - xlwt + - zstandard diff --git a/ci/deps/azure-windows-37.yaml b/ci/deps/azure-windows-310.yaml similarity index 62% rename from ci/deps/azure-windows-37.yaml rename to ci/deps/azure-windows-310.yaml index 5cbc029f8c03d..8e6f4deef6057 100644 --- a/ci/deps/azure-windows-37.yaml +++ b/ci/deps/azure-windows-310.yaml @@ -1,42 +1,41 @@ name: pandas-dev channels: - - defaults - conda-forge + - defaults dependencies: - - python=3.7.* + - python=3.10 # tools - - cython>=0.29.21 + - cython>=0.29.24 - pytest>=6.0 - - pytest-xdist>=1.21 - - hypothesis>=3.58.0 + - pytest-xdist>=1.31 + - hypothesis>=5.5.3 - pytest-azurepipelines # pandas dependencies - beautifulsoup4 - bottleneck - fsspec>=0.8.0 - - gcsfs>=0.6.0 + - gcsfs - html5lib - jinja2 - lxml - - matplotlib=2.2.* - - moto>=1.3.14 - - flask + - matplotlib + # TODO: uncomment after numba supports py310 + #- numba - numexpr - - numpy=1.17.* + - numpy - openpyxl - - pyarrow=0.17.0 + - pyarrow - pytables - python-dateutil - pytz - s3fs>=0.4.2 - scipy - sqlalchemy - - xlrd>=2.0 + - xlrd - xlsxwriter - xlwt - pyreadstat - - pip - - pip: - - pyxlsb + - pyxlsb + - zstandard diff --git a/ci/deps/azure-windows-38.yaml b/ci/deps/azure-windows-38.yaml index 7fdecae626f9d..eb533524147d9 100644 --- a/ci/deps/azure-windows-38.yaml +++ b/ci/deps/azure-windows-38.yaml @@ -3,34 +3,33 @@ channels: - conda-forge - defaults dependencies: - - python=3.8.* + - python=3.8 # tools - - cython>=0.29.21 + - cython>=0.29.24 - pytest>=6.0 - - pytest-xdist>=1.21 - - hypothesis>=3.58.0 + - pytest-xdist>=1.31 + - hypothesis>=5.5.3 - pytest-azurepipelines # pandas dependencies - blosc - bottleneck - fastparquet>=0.4.0 - - flask - fsspec>=0.8.0 - - matplotlib=3.1.3 - - moto>=1.3.14 + - matplotlib=3.3.2 - numba - numexpr - - numpy=1.18.* + - numpy=1.18 - openpyxl - jinja2 - - pyarrow>=0.17.0 + - pyarrow=2 - pytables - python-dateutil - pytz - s3fs>=0.4.0 - scipy - - xlrd<2.0 + - xlrd - xlsxwriter - xlwt + - zstandard diff --git a/ci/deps/actions-37-slow.yaml b/ci/deps/azure-windows-39.yaml similarity index 56% rename from ci/deps/actions-37-slow.yaml rename to ci/deps/azure-windows-39.yaml index 166f2237dcad3..6f820b1c2aedb 100644 --- a/ci/deps/actions-37-slow.yaml +++ b/ci/deps/azure-windows-39.yaml @@ -1,39 +1,40 @@ name: pandas-dev channels: - - defaults - conda-forge + - defaults dependencies: - - python=3.7.* + - python=3.9 # tools - - cython>=0.29.21 + - cython>=0.29.24 - pytest>=6.0 - - pytest-cov - - pytest-xdist>=1.21 - - hypothesis>=3.58.0 + - pytest-xdist>=1.31 + - hypothesis>=5.5.3 + - pytest-azurepipelines # pandas dependencies - beautifulsoup4 - - fsspec>=0.7.4 + - bottleneck + - fsspec>=0.8.0 + - gcsfs - html5lib + - jinja2 - lxml - matplotlib + - numba - numexpr - numpy - 
openpyxl - - patsy - - psycopg2 - - pymysql + - pyarrow=6 - pytables - python-dateutil - pytz - - s3fs>=0.4.0 - - moto>=1.3.14 + - s3fs>=0.4.2 - scipy - sqlalchemy - - xlrd<2.0 + - xlrd - xlsxwriter - xlwt - - moto - - flask - - numba + - pyreadstat + - pyxlsb + - zstandard diff --git a/ci/deps/circle-37-arm64.yaml b/ci/deps/circle-38-arm64.yaml similarity index 64% rename from ci/deps/circle-37-arm64.yaml rename to ci/deps/circle-38-arm64.yaml index 995ebda1f97e7..60608c3ee1a86 100644 --- a/ci/deps/circle-37-arm64.yaml +++ b/ci/deps/circle-38-arm64.yaml @@ -2,20 +2,20 @@ name: pandas-dev channels: - conda-forge dependencies: - - python=3.7.* + - python=3.8 # tools - - cython>=0.29.21 + - cython>=0.29.24 - pytest>=6.0 - - pytest-xdist>=1.21 - - hypothesis>=3.58.0 + - pytest-xdist>=1.31 + - hypothesis>=5.5.3 # pandas dependencies - botocore>=1.11 + - flask + - moto - numpy - python-dateutil - pytz + - zstandard - pip - - flask - - pip: - - moto diff --git a/ci/run_tests.sh b/ci/run_tests.sh index 0d6f26d8c29f8..203f8fe293a06 100755 --- a/ci/run_tests.sh +++ b/ci/run_tests.sh @@ -5,12 +5,17 @@ # https://github.com/pytest-dev/pytest/issues/1075 export PYTHONHASHSEED=$(python -c 'import random; print(random.randint(1, 4294967295))') +# May help reproduce flaky CI builds if set in subsequent runs +echo PYTHONHASHSEED=$PYTHONHASHSEED + if [[ "not network" == *"$PATTERN"* ]]; then export http_proxy=http://1.2.3.4 https_proxy=http://1.2.3.4; fi -if [ "$COVERAGE" ]; then +if [[ "$COVERAGE" == "true" ]]; then COVERAGE="-s --cov=pandas --cov-report=xml --cov-append" +else + COVERAGE="" # We need to reset this for COVERAGE="false" case fi # If no X server is found, we use xvfb to emulate it @@ -19,18 +24,19 @@ if [[ $(uname) == "Linux" && -z $DISPLAY ]]; then XVFB="xvfb-run " fi -PYTEST_CMD="${XVFB}pytest -m \"$PATTERN\" -n $PYTEST_WORKERS --dist=loadfile $TEST_ARGS $COVERAGE pandas" +PYTEST_CMD="${XVFB}pytest -m \"$PATTERN\" -n $PYTEST_WORKERS --dist=loadfile $TEST_ARGS $COVERAGE $PYTEST_TARGET" if [[ $(uname) != "Linux" && $(uname) != "Darwin" ]]; then - # GH#37455 windows py38 build appears to be running out of memory - # skip collection of window tests - PYTEST_CMD="$PYTEST_CMD --ignore=pandas/tests/window/moments --ignore=pandas/tests/plotting/" + PYTEST_CMD="$PYTEST_CMD --ignore=pandas/tests/plotting/" fi echo $PYTEST_CMD sh -c "$PYTEST_CMD" -PYTEST_AM_CMD="PANDAS_DATA_MANAGER=array pytest -m \"$PATTERN and arraymanager\" -n $PYTEST_WORKERS --dist=loadfile $TEST_ARGS $COVERAGE pandas" +if [[ "$PANDAS_DATA_MANAGER" != "array" ]]; then + # The ArrayManager tests should have already been run by PYTEST_CMD if PANDAS_DATA_MANAGER was already set to array + PYTEST_AM_CMD="PANDAS_DATA_MANAGER=array pytest -m \"$PATTERN and arraymanager\" -n $PYTEST_WORKERS --dist=loadfile $TEST_ARGS $COVERAGE pandas" -echo $PYTEST_AM_CMD -sh -c "$PYTEST_AM_CMD" + echo $PYTEST_AM_CMD + sh -c "$PYTEST_AM_CMD" +fi diff --git a/ci/setup_env.sh b/ci/setup_env.sh index 2e16bc6545161..d51ff98b241a6 100755 --- a/ci/setup_env.sh +++ b/ci/setup_env.sh @@ -48,6 +48,7 @@ conda config --set ssl_verify false conda config --set quiet true --set always_yes true --set changeps1 false conda install pip conda # create conda to create a historical artifact for pip & setuptools conda update -n base conda +conda install -y -c conda-forge mamba echo "conda info -a" conda info -a @@ -62,8 +63,8 @@ 
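Note: `ci/run_tests.sh` now echoes the randomly chosen `PYTHONHASHSEED` so a flaky run can be replayed with the same hash randomization. A small editorial sketch (not part of the script) of reusing a recorded seed from Python; the seed value is arbitrary:

```python
import os
import subprocess
import sys

# Seed printed by a previous CI run; hash randomization is fixed at interpreter
# start-up, so it must be set in the environment of a fresh process.
env = dict(os.environ, PYTHONHASHSEED="12345")
out = subprocess.run(
    [sys.executable, "-c", "print(hash('pandas'))"],
    env=env,
    capture_output=True,
    text=True,
)
print(out.stdout.strip())  # identical across runs that use the same PYTHONHASHSEED
```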
conda list conda remove --all -q -y -n pandas-dev echo -echo "conda env create -q --file=${ENV_FILE}" -time conda env create -q --file="${ENV_FILE}" +echo "mamba env create -q --file=${ENV_FILE}" +time mamba env create -q --file="${ENV_FILE}" if [[ "$BITS32" == "yes" ]]; then @@ -86,11 +87,6 @@ echo "w/o removing anything else" conda remove pandas -y --force || true pip uninstall -y pandas || true -echo -echo "remove postgres if has been installed with conda" -echo "we use the one from the CI" -conda remove postgresql -y --force || true - echo echo "remove qt" echo "causes problems with the clipboard, we use xsel for that" @@ -106,7 +102,8 @@ echo "[Build extensions]" python setup.py build_ext -q -j2 echo "[Updating pip]" -python -m pip install --no-deps -U pip wheel setuptools +# TODO: GH#44980 https://github.com/pypa/setuptools/issues/2941 +python -m pip install --no-deps -U pip wheel "setuptools<60.0.0" echo "[Install pandas]" python -m pip install --no-build-isolation -e . @@ -115,13 +112,4 @@ echo echo "conda list" conda list -# Install DB for Linux - -if [[ -n ${SQL:0} ]]; then - echo "installing dbs" - mysql -e 'create database pandas_nosetest;' - psql -c 'create database pandas_nosetest;' -U postgres -else - echo "not using dbs on non-linux Travis builds or Azure Pipelines" -fi echo "done" diff --git a/codecov.yml b/codecov.yml index 893e40db004a6..d893bdbdc9298 100644 --- a/codecov.yml +++ b/codecov.yml @@ -1,5 +1,5 @@ codecov: - branch: master + branch: main notify: after_n_builds: 10 comment: false @@ -12,6 +12,7 @@ coverage: patch: default: target: '50' + informational: true github_checks: annotations: false diff --git a/doc/source/_static/style/appmaphead1.png b/doc/source/_static/style/appmaphead1.png new file mode 100644 index 0000000000000..905bcaa63e900 Binary files /dev/null and b/doc/source/_static/style/appmaphead1.png differ diff --git a/doc/source/_static/style/appmaphead2.png b/doc/source/_static/style/appmaphead2.png new file mode 100644 index 0000000000000..9adde61908378 Binary files /dev/null and b/doc/source/_static/style/appmaphead2.png differ diff --git a/doc/source/_static/style/df_pipe.png b/doc/source/_static/style/df_pipe.png new file mode 100644 index 0000000000000..071a481ad5acc Binary files /dev/null and b/doc/source/_static/style/df_pipe.png differ diff --git a/doc/source/_static/style/latex_stocks.png b/doc/source/_static/style/latex_stocks.png new file mode 100644 index 0000000000000..c8906c33b810b Binary files /dev/null and b/doc/source/_static/style/latex_stocks.png differ diff --git a/doc/source/_static/style/latex_stocks_html.png b/doc/source/_static/style/latex_stocks_html.png new file mode 100644 index 0000000000000..11b30faddf47c Binary files /dev/null and b/doc/source/_static/style/latex_stocks_html.png differ diff --git a/doc/source/conf.py b/doc/source/conf.py index 8df048ce65582..e8cd85e3369f7 100644 --- a/doc/source/conf.py +++ b/doc/source/conf.py @@ -225,11 +225,24 @@ # Theme options are theme-specific and customize the look and feel of a theme # further. For a list of options available for each theme, see the # documentation. 
+ +switcher_version = version +if ".dev" in version: + switcher_version = "dev" +elif "rc" in version: + switcher_version = version.split("rc")[0] + " (rc)" + html_theme_options = { "external_links": [], "github_url": "https://github.com/pandas-dev/pandas", "twitter_url": "https://twitter.com/pandas_dev", "google_analytics_id": "UA-27880019-2", + "navbar_end": ["version-switcher", "navbar-icon-links"], + "switcher": { + "json_url": "https://pandas.pydata.org/versions.json", + "url_template": "https://pandas.pydata.org/{version}/", + "version_match": switcher_version, + }, } # Add any paths that contain custom themes here, relative to this directory. @@ -461,7 +474,6 @@ # eg pandas.Series.str and pandas.Series.dt (see GH9322) import sphinx # isort:skip -from sphinx.util import rpartition # isort:skip from sphinx.ext.autodoc import ( # isort:skip AttributeDocumenter, Documenter, @@ -521,8 +533,8 @@ def resolve_name(self, modname, parents, path, base): # HACK: this is added in comparison to ClassLevelDocumenter # mod_cls still exists of class.accessor, so an extra # rpartition is needed - modname, accessor = rpartition(mod_cls, ".") - modname, cls = rpartition(modname, ".") + modname, _, accessor = mod_cls.rpartition(".") + modname, _, cls = modname.rpartition(".") parents = [cls, accessor] # if the module name is still missing, get it like above if not modname: @@ -652,7 +664,7 @@ def linkcode_resolve(domain, info): fn = os.path.relpath(fn, start=os.path.dirname(pandas.__file__)) if "+" in pandas.__version__: - return f"https://github.com/pandas-dev/pandas/blob/master/pandas/{fn}{linespec}" + return f"https://github.com/pandas-dev/pandas/blob/main/pandas/{fn}{linespec}" else: return ( f"https://github.com/pandas-dev/pandas/blob/" diff --git a/doc/source/development/code_style.rst b/doc/source/development/code_style.rst index 77c8d56765e5e..7bbfc010fbfb2 100644 --- a/doc/source/development/code_style.rst +++ b/doc/source/development/code_style.rst @@ -28,7 +28,7 @@ Testing Failing tests -------------- -See https://docs.pytest.org/en/latest/skipping.html for background. +See https://docs.pytest.org/en/latest/how-to/skipping.html for background. Do not use ``pytest.xfail`` --------------------------- diff --git a/doc/source/development/contributing.rst b/doc/source/development/contributing.rst index f4a09e0daa750..1d745d21dacae 100644 --- a/doc/source/development/contributing.rst +++ b/doc/source/development/contributing.rst @@ -59,7 +59,7 @@ will allow others to reproduce the bug and provide insight into fixing. See `this blogpost `_ for tips on writing a good bug report. -Trying the bug-producing code out on the *master* branch is often a worthwhile exercise +Trying the bug-producing code out on the *main* branch is often a worthwhile exercise to confirm the bug still exists. It is also worth searching existing bug reports and pull requests to see if the issue has already been reported and/or fixed. @@ -143,7 +143,7 @@ as the version number cannot be computed anymore. 
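Note: the `doc/source/conf.py` change above derives the label shown by the new version switcher from the full version string: development builds collapse to "dev" and release candidates drop the `rcN` suffix. A quick sketch of that mapping with made-up version strings:

```python
def switcher_label(version: str) -> str:
    # Mirrors the logic added to doc/source/conf.py.
    if ".dev" in version:
        return "dev"
    if "rc" in version:
        return version.split("rc")[0] + " (rc)"
    return version

print(switcher_label("1.4.0.dev0+1234.gabcdef"))  # dev
print(switcher_label("1.4.0rc1"))                 # 1.4.0 (rc)
print(switcher_label("1.3.5"))                    # 1.3.5
```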
Creating a branch ----------------- -You want your master branch to reflect only production-ready code, so create a +You want your main branch to reflect only production-ready code, so create a feature branch for making your changes. For example:: git branch shiny-new-feature @@ -158,14 +158,14 @@ changes in this branch specific to one bug or feature so it is clear what the branch brings to pandas. You can have many shiny-new-features and switch in between them using the git checkout command. -When creating this branch, make sure your master branch is up to date with -the latest upstream master version. To update your local master branch, you +When creating this branch, make sure your main branch is up to date with +the latest upstream main version. To update your local main branch, you can do:: - git checkout master - git pull upstream master --ff-only + git checkout main + git pull upstream main --ff-only -When you want to update the feature branch with changes in master after +When you want to update the feature branch with changes in main after you created the branch, check the section on :ref:`updating a PR `. @@ -256,7 +256,7 @@ double check your branch changes against the branch it was based on: #. Navigate to your repository on GitHub -- https://github.com/your-user-name/pandas #. Click on ``Branches`` #. Click on the ``Compare`` button for your feature branch -#. Select the ``base`` and ``compare`` branches, if necessary. This will be ``master`` and +#. Select the ``base`` and ``compare`` branches, if necessary. This will be ``main`` and ``shiny-new-feature``, respectively. Finally, make the pull request @@ -264,8 +264,8 @@ Finally, make the pull request If everything looks good, you are ready to make a pull request. A pull request is how code from a local repository becomes available to the GitHub community and can be looked -at and eventually merged into the master version. This pull request and its associated -changes will eventually be committed to the master branch and available in the next +at and eventually merged into the main version. This pull request and its associated +changes will eventually be committed to the main branch and available in the next release. To submit a pull request: #. Navigate to your repository on GitHub @@ -294,14 +294,14 @@ This will automatically update your pull request with the latest code and restar :any:`Continuous Integration ` tests. Another reason you might need to update your pull request is to solve conflicts -with changes that have been merged into the master branch since you opened your +with changes that have been merged into the main branch since you opened your pull request. -To do this, you need to "merge upstream master" in your branch:: +To do this, you need to "merge upstream main" in your branch:: git checkout shiny-new-feature git fetch upstream - git merge upstream/master + git merge upstream/main If there are no conflicts (or they could be fixed automatically), a file with a default commit message will open, and you can simply save and quit this file. @@ -313,7 +313,7 @@ Once the conflicts are merged and the files where the conflicts were solved are added, you can run ``git commit`` to save those fixes. If you have uncommitted changes at the moment you want to update the branch with -master, you will need to ``stash`` them prior to updating (see the +main, you will need to ``stash`` them prior to updating (see the `stash docs `__). 
This will effectively store your changes and they can be reapplied after updating. @@ -331,18 +331,23 @@ can comment:: @github-actions pre-commit -on that pull request. This will trigger a workflow which will autofix formatting errors. +on that pull request. This will trigger a workflow which will autofix formatting +errors. + +To automatically fix formatting errors on each commit you make, you can +set up pre-commit yourself. First, create a Python :ref:`environment +` and then set up :ref:`pre-commit `. Delete your merged branch (optional) ------------------------------------ Once your feature branch is accepted into upstream, you'll probably want to get rid of -the branch. First, merge upstream master into your branch so git knows it is safe to +the branch. First, merge upstream main into your branch so git knows it is safe to delete your branch:: git fetch upstream - git checkout master - git merge upstream/master + git checkout main + git merge upstream/main Then you can do:: diff --git a/doc/source/development/contributing_codebase.rst b/doc/source/development/contributing_codebase.rst index e812aaa760a8f..4826921d4866b 100644 --- a/doc/source/development/contributing_codebase.rst +++ b/doc/source/development/contributing_codebase.rst @@ -23,11 +23,10 @@ contributing them to the project:: ./ci/code_checks.sh -The script verifies the linting of code files, it looks for common mistake patterns -(like missing spaces around sphinx directives that make the documentation not -being rendered properly) and it also validates the doctests. It is possible to -run the checks independently by using the parameters ``lint``, ``patterns`` and -``doctests`` (e.g. ``./ci/code_checks.sh lint``). +The script validates the doctests, formatting in docstrings, static typing, and +imported modules. It is possible to run the checks independently by using the +parameters ``docstring``, ``code``, ``typing``, and ``doctests`` +(e.g. ``./ci/code_checks.sh doctests``). In addition, because a lot of people use our library, it is important that we do not make sudden changes to the code that could have the potential to break @@ -70,9 +69,9 @@ to run its checks with:: without needing to have done ``pre-commit install`` beforehand. -If you want to run checks on all recently committed files on upstream/master you can use:: +If you want to run checks on all recently committed files on upstream/main you can use:: - pre-commit run --from-ref=upstream/master --to-ref=HEAD --all-files + pre-commit run --from-ref=upstream/main --to-ref=HEAD --all-files without needing to have done ``pre-commit install`` beforehand. @@ -156,7 +155,7 @@ Python (PEP8 / black) pandas follows the `PEP8 `_ standard and uses `Black `_ and -`Flake8 `_ to ensure a consistent code +`Flake8 `_ to ensure a consistent code format throughout the project. We encourage you to use :ref:`pre-commit `. :ref:`Continuous Integration ` will run those tools and @@ -164,7 +163,7 @@ report any stylistic errors in your code. Therefore, it is helpful before submitting code to run the check yourself:: black pandas - git diff upstream/master -u -- "*.py" | flake8 --diff + git diff upstream/main -u -- "*.py" | flake8 --diff to auto-format your code. Additionally, many editors have plugins that will apply ``black`` as you edit files. @@ -172,7 +171,7 @@ apply ``black`` as you edit files. You should use a ``black`` version 21.5b2 as previous versions are not compatible with the pandas codebase. 
-One caveat about ``git diff upstream/master -u -- "*.py" | flake8 --diff``: this +One caveat about ``git diff upstream/main -u -- "*.py" | flake8 --diff``: this command will catch any stylistic errors in your changes specifically, but be beware it may not catch all of them. For example, if you delete the only usage of an imported function, it is stylistically incorrect to import an @@ -180,18 +179,18 @@ unused function. However, style-checking the diff will not catch this because the actual import is not part of the diff. Thus, for completeness, you should run this command, though it may take longer:: - git diff upstream/master --name-only -- "*.py" | xargs -r flake8 + git diff upstream/main --name-only -- "*.py" | xargs -r flake8 -Note that on OSX, the ``-r`` flag is not available, so you have to omit it and +Note that on macOS, the ``-r`` flag is not available, so you have to omit it and run this slightly modified command:: - git diff upstream/master --name-only -- "*.py" | xargs flake8 + git diff upstream/main --name-only -- "*.py" | xargs flake8 Windows does not support the ``xargs`` command (unless installed for example via the `MinGW `__ toolchain), but one can imitate the behaviour as follows:: - for /f %i in ('git diff upstream/master --name-only -- "*.py"') do flake8 %i + for /f %i in ('git diff upstream/main --name-only -- "*.py"') do flake8 %i This will get all the files being changed by the PR (and ending with ``.py``), and run ``flake8`` on them, one after the other. @@ -205,7 +204,7 @@ Import formatting pandas uses `isort `__ to standardise import formatting across the codebase. -A guide to import layout as per pep8 can be found `here `__. +A guide to import layout as per pep8 can be found `here `__. A summary of our current import sections ( in order ): @@ -243,9 +242,9 @@ to automatically format imports correctly. This will modify your local copy of t Alternatively, you can run a command similar to what was suggested for ``black`` and ``flake8`` :ref:`right above `:: - git diff upstream/master --name-only -- "*.py" | xargs -r isort + git diff upstream/main --name-only -- "*.py" | xargs -r isort -Where similar caveats apply if you are on OSX or Windows. +Where similar caveats apply if you are on macOS or Windows. You can then verify the changes look ok, then git :any:`commit ` and :any:`push `. @@ -304,7 +303,7 @@ pandas strongly encourages the use of :pep:`484` style type hints. New developme Style guidelines ~~~~~~~~~~~~~~~~ -Types imports should follow the ``from typing import ...`` convention. So rather than +Type imports should follow the ``from typing import ...`` convention. Some types do not need to be imported since :pep:`585` some builtin constructs, such as ``list`` and ``tuple``, can directly be used for type annotations. So rather than .. code-block:: python @@ -316,21 +315,31 @@ You should write .. code-block:: python - from typing import List, Optional, Union + primes: list[int] = [] - primes: List[int] = [] +``Optional`` should be avoided in favor of the shorter ``| None``, so instead of -``Optional`` should be used where applicable, so instead of +.. code-block:: python + + from typing import Union + + maybe_primes: list[Union[int, None]] = [] + +or .. code-block:: python - maybe_primes: List[Union[int, None]] = [] + from typing import Optional + + maybe_primes: list[Optional[int]] = [] You should write .. 
code-block:: python - maybe_primes: List[Optional[int]] = [] + from __future__ import annotations # noqa: F404 + + maybe_primes: list[int | None] = [] In some cases in the code base classes may define class variables that shadow builtins. This causes an issue as described in `Mypy 1775 `_. The defensive solution here is to create an unambiguous alias of the builtin and use that without your annotation. For example, if you come across a definition like @@ -380,7 +389,7 @@ With custom types and inference this is not always possible so exceptions are ma pandas-specific types ~~~~~~~~~~~~~~~~~~~~~ -Commonly used types specific to pandas will appear in `pandas._typing `_ and you should use these where applicable. This module is private for now but ultimately this should be exposed to third party libraries who want to implement type checking against pandas. +Commonly used types specific to pandas will appear in `pandas._typing `_ and you should use these where applicable. This module is private for now but ultimately this should be exposed to third party libraries who want to implement type checking against pandas. For example, quite a few functions in pandas accept a ``dtype`` argument. This can be expressed as a string like ``"object"``, a ``numpy.dtype`` like ``np.int64`` or even a pandas ``ExtensionDtype`` like ``pd.CategoricalDtype``. Rather than burden the user with having to constantly annotate all of those options, this can simply be imported and reused from the pandas._typing module @@ -396,14 +405,41 @@ This module will ultimately house types for repeatedly used concepts like "path- Validating type hints ~~~~~~~~~~~~~~~~~~~~~ -pandas uses `mypy `_ to statically analyze the code base and type hints. After making any change you can ensure your type hints are correct by running +pandas uses `mypy `_ and `pyright `_ to statically analyze the code base and type hints. After making any change you can ensure your type hints are correct by running .. code-block:: shell - mypy pandas + mypy + + # let pre-commit setup and run pyright + pre-commit run --hook-stage manual --all-files pyright + # or if pyright is installed (requires node.js) + pyright + +A recent version of ``numpy`` (>=1.21.0) is required for type validation. .. _contributing.ci: +Testing type hints in code using pandas +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +.. warning:: + + * Pandas is not yet a py.typed library (:pep:`561`)! + The primary purpose of locally declaring pandas as a py.typed library is to test and + improve the pandas-builtin type annotations. + +Until pandas becomes a py.typed library, it is possible to easily experiment with the type +annotations shipped with pandas by creating an empty file named "py.typed" in the pandas +installation folder: + +.. code-block:: none + + python -c "import pandas; import pathlib; (pathlib.Path(pandas.__path__[0]) / 'py.typed').touch()" + +The existence of the py.typed file signals to type checkers that pandas is already a py.typed +library. This makes type checkers aware of the type annotations shipped with pandas. + Testing with continuous integration ----------------------------------- @@ -413,7 +449,7 @@ continuous integration services, once your pull request is submitted. However, if you wish to run the test suite on a branch prior to submitting the pull request, then the continuous integration services need to be hooked to your GitHub repository. Instructions are here for `GitHub Actions `__ and -`Azure Pipelines `__. +`Azure Pipelines `__. 
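Note: the updated style guidance above favors `from __future__ import annotations` with builtin generics and `X | None` over `typing.Optional`/`typing.Union`, with pandas-specific aliases coming from the (still private) `pandas._typing` module, and validation via `mypy` and the new `pyright` pre-commit hook. A short sketch combining those guidelines; the function itself is hypothetical, not pandas API:

```python
from __future__ import annotations

from pandas._typing import Dtype  # private module, used inside pandas development


def coerce_values(values: list[int | None], dtype: Dtype | None = None) -> list[int]:
    """Drop missing entries and return the remaining values (illustrative only)."""
    return [v for v in values if v is not None]
```

Running `mypy` (or `pre-commit run --hook-stage manual --all-files pyright`) then checks annotations written in this style.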
A pull-request will be considered for merging when you have an all 'green' build. If any tests are failing, then you will get a red 'X', where you can click through to see the individual failed tests. @@ -454,8 +490,7 @@ Writing tests All tests should go into the ``tests`` subdirectory of the specific package. This folder contains many current examples of tests, and we suggest looking to these for inspiration. If your test requires working with files or -network connectivity, there is more information on the `testing page -`_ of the wiki. +network connectivity, there is more information on the :wiki:`Testing` of the wiki. The ``pandas._testing`` module has many special ``assert`` functions that make it easier to make statements about whether Series or DataFrame objects are @@ -741,10 +776,10 @@ Running the performance test suite Performance matters and it is worth considering whether your code has introduced performance regressions. pandas is in the process of migrating to -`asv benchmarks `__ +`asv benchmarks `__ to enable easy monitoring of the performance of critical pandas operations. These benchmarks are all found in the ``pandas/asv_bench`` directory, and the -test results can be found `here `__. +test results can be found `here `__. To use all features of asv, you will need either ``conda`` or ``virtualenv``. For more details please check the `asv installation @@ -752,18 +787,18 @@ webpage `_. To install asv:: - pip install git+https://github.com/spacetelescope/asv + pip install git+https://github.com/airspeed-velocity/asv If you need to run a benchmark, change your directory to ``asv_bench/`` and run:: - asv continuous -f 1.1 upstream/master HEAD + asv continuous -f 1.1 upstream/main HEAD You can replace ``HEAD`` with the name of the branch you are working on, and report benchmarks that changed by more than 10%. The command uses ``conda`` by default for creating the benchmark environments. If you want to use virtualenv instead, write:: - asv continuous -f 1.1 -E virtualenv upstream/master HEAD + asv continuous -f 1.1 -E virtualenv upstream/main HEAD The ``-E virtualenv`` option should be added to all ``asv`` commands that run benchmarks. The default value is defined in ``asv.conf.json``. @@ -775,12 +810,12 @@ do not cause unexpected performance regressions. You can run specific benchmark using the ``-b`` flag, which takes a regular expression. For example, this will only run benchmarks from a ``pandas/asv_bench/benchmarks/groupby.py`` file:: - asv continuous -f 1.1 upstream/master HEAD -b ^groupby + asv continuous -f 1.1 upstream/main HEAD -b ^groupby If you want to only run a specific group of benchmarks from a file, you can do it using ``.`` as a separator. For example:: - asv continuous -f 1.1 upstream/master HEAD -b groupby.GroupByMethods + asv continuous -f 1.1 upstream/main HEAD -b groupby.GroupByMethods will only run the ``GroupByMethods`` benchmark defined in ``groupby.py``. @@ -812,7 +847,21 @@ Changes should be reflected in the release notes located in ``doc/source/whatsne This file contains an ongoing change log for each release. Add an entry to this file to document your fix, enhancement or (unavoidable) breaking change. Make sure to include the GitHub issue number when adding your entry (using ``:issue:`1234``` where ``1234`` is the -issue/pull request number). +issue/pull request number). Your entry should be written using full sentences and proper +grammar. 
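Note: the benchmark commands above (for example `asv continuous -f 1.1 upstream/main HEAD -b groupby.GroupByMethods`) select benchmark classes defined under `asv_bench/benchmarks/`. A minimal sketch of what such a class looks like; the class name and timings are illustrative, not an existing pandas benchmark:

```python
import numpy as np
import pandas as pd


class GroupBySum:
    # asv calls setup() before timing each method whose name starts with "time_".
    def setup(self):
        self.df = pd.DataFrame(
            {
                "key": np.random.randint(0, 100, size=100_000),
                "value": np.random.randn(100_000),
            }
        )

    def time_groupby_sum(self):
        self.df.groupby("key")["value"].sum()
```

A filter such as `-b ^groupby` or `-b groupby.GroupBySum` selects it by module and class name.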
+ +When mentioning parts of the API, use a Sphinx ``:func:``, ``:meth:``, or ``:class:`` +directive as appropriate. Not all public API functions and methods have a +documentation page; ideally links would only be added if they resolve. You can +usually find similar examples by checking the release notes for one of the previous +versions. + +If your code is a bugfix, add your entry to the relevant bugfix section. Avoid +adding to the ``Other`` section; only in rare cases should entries go there. +Being as concise as possible, the description of the bug should include how the +user may encounter it and an indication of the bug itself, e.g. +"produces incorrect results" or "incorrectly raises". It may be necessary to also +indicate the new behavior. If your code is an enhancement, it is most likely necessary to add usage examples to the existing documentation. This can be done following the section diff --git a/doc/source/development/contributing_docstring.rst b/doc/source/development/contributing_docstring.rst index 623d1e8d45565..a87d8d5ad44bf 100644 --- a/doc/source/development/contributing_docstring.rst +++ b/doc/source/development/contributing_docstring.rst @@ -68,7 +68,7 @@ explained in this document: * `numpydoc docstring guide `_ (which is based in the original `Guide to NumPy/SciPy documentation - `_) + `_) numpydoc is a Sphinx extension to support the NumPy docstring convention. diff --git a/doc/source/development/contributing_documentation.rst b/doc/source/development/contributing_documentation.rst index a4a4f781d9dad..39bc582511148 100644 --- a/doc/source/development/contributing_documentation.rst +++ b/doc/source/development/contributing_documentation.rst @@ -202,10 +202,10 @@ And you'll have the satisfaction of seeing your new and improved documentation! .. _contributing.dev_docs: -Building master branch documentation +Building main branch documentation ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -When pull requests are merged into the pandas ``master`` branch, the main parts of +When pull requests are merged into the pandas ``main`` branch, the main parts of the documentation are also built by Travis-CI. These docs are then hosted `here `__, see also the :any:`Continuous Integration ` section. diff --git a/doc/source/development/contributing_environment.rst b/doc/source/development/contributing_environment.rst index bc0a3556b9ac1..5f36a2a609c9f 100644 --- a/doc/source/development/contributing_environment.rst +++ b/doc/source/development/contributing_environment.rst @@ -47,7 +47,7 @@ Enable Docker support and use the Services tool window to build and manage image run and interact with containers. See https://www.jetbrains.com/help/pycharm/docker.html for details. -Note that you might need to rebuild the C extensions if/when you merge with upstream/master using:: +Note that you might need to rebuild the C extensions if/when you merge with upstream/main using:: python setup.py build_ext -j 4 @@ -72,7 +72,7 @@ These packages will automatically be installed by using the ``pandas`` **Windows** -You will need `Build Tools for Visual Studio 2017 +You will need `Build Tools for Visual Studio 2019 `_. .. warning:: @@ -82,7 +82,7 @@ You will need `Build Tools for Visual Studio 2017 In the installer, select the "C++ build tools" workload. You can install the necessary components on the commandline using -`vs_buildtools.exe `_: +`vs_buildtools.exe `_: .. code:: @@ -133,14 +133,13 @@ compiler installation instructions. 
Let us know if you have any difficulties by opening an issue or reaching out on `Gitter `_. - Creating a Python environment ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Now create an isolated pandas development environment: -* Install either `Anaconda `_, `miniconda - `_, or `miniforge `_ +* Install either `Anaconda `_, `miniconda + `_, or `miniforge `_ * Make sure your conda is up to date (``conda update conda``) * Make sure that you have :any:`cloned the repository ` * ``cd`` to the pandas source directory @@ -166,7 +165,7 @@ We'll now kick off a three-step process: At this point you should be able to import pandas from your locally built version:: - $ python # start an interpreter + $ python >>> import pandas >>> print(pandas.__version__) 0.22.0.dev0+29.g4ad6d4d74 @@ -182,18 +181,15 @@ To return to your root environment:: conda deactivate -See the full conda docs `here `__. +See the full conda docs `here `__. Creating a Python environment (pip) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ If you aren't using conda for your development environment, follow these instructions. -You'll need to have at least the :ref:`minimum Python version ` that pandas supports. If your Python version -is 3.8.0 (or later), you might need to update your ``setuptools`` to version 42.0.0 (or later) -in your development environment before installing the build dependencies:: - - pip install --upgrade setuptools +You'll need to have at least the :ref:`minimum Python version ` that pandas supports. +You also need to have ``setuptools`` 51.0.0 or later to build pandas. **Unix**/**macOS with virtualenv** @@ -242,7 +238,7 @@ Consult the docs for setting up pyenv `here `__. Below is a brief overview on how to set-up a virtual environment with Powershell under Windows. For details please refer to the -`official virtualenv user guide `__ +`official virtualenv user guide `__ Use an ENV_DIR of your choice. We'll use ~\\virtualenvs\\pandas-dev where '~' is the folder pointed to by either $env:USERPROFILE (Powershell) or diff --git a/doc/source/development/debugging_extensions.rst b/doc/source/development/debugging_extensions.rst index 894277d304020..7ba2091e18853 100644 --- a/doc/source/development/debugging_extensions.rst +++ b/doc/source/development/debugging_extensions.rst @@ -80,7 +80,7 @@ Once the process launches, simply type ``run`` and the test suite will begin, st Checking memory leaks with valgrind =================================== -You can use `Valgrind `_ to check for and log memory leaks in extensions. For instance, to check for a memory leak in a test from the suite you can run: +You can use `Valgrind `_ to check for and log memory leaks in extensions. For instance, to check for a memory leak in a test from the suite you can run: .. code-block:: sh diff --git a/doc/source/development/developer.rst b/doc/source/development/developer.rst index d701208792a4c..6de237b70f08d 100644 --- a/doc/source/development/developer.rst +++ b/doc/source/development/developer.rst @@ -180,7 +180,7 @@ As an example of fully-formed metadata: 'numpy_type': 'int64', 'metadata': None} ], - 'pandas_version': '0.20.0', + 'pandas_version': '1.4.0', 'creator': { 'library': 'pyarrow', 'version': '0.13.0' diff --git a/doc/source/development/extending.rst b/doc/source/development/extending.rst index d5b45f5953453..5347aab2c731a 100644 --- a/doc/source/development/extending.rst +++ b/doc/source/development/extending.rst @@ -50,7 +50,7 @@ decorate a class, providing the name of attribute to add. 
The class's Now users can access your methods using the ``geo`` namespace: - >>> ds = pd.Dataframe( + >>> ds = pd.DataFrame( ... {"longitude": np.linspace(0, 10), "latitude": np.linspace(0, 20)} ... ) >>> ds.geo.center @@ -106,7 +106,7 @@ extension array for IP Address data, this might be ``ipaddress.IPv4Address``. See the `extension dtype source`_ for interface definition. -:class:`pandas.api.extension.ExtensionDtype` can be registered to pandas to allow creation via a string dtype name. +:class:`pandas.api.extensions.ExtensionDtype` can be registered to pandas to allow creation via a string dtype name. This allows one to instantiate ``Series`` and ``.astype()`` with a registered string name, for example ``'category'`` is a registered string accessor for the ``CategoricalDtype``. @@ -125,7 +125,7 @@ data. We do require that your array be convertible to a NumPy array, even if this is relatively expensive (as it is for ``Categorical``). They may be backed by none, one, or many NumPy arrays. For example, -``pandas.Categorical`` is an extension array backed by two arrays, +:class:`pandas.Categorical` is an extension array backed by two arrays, one for codes and one for categories. An array of IPv6 addresses may be backed by a NumPy structured array with two fields, one for the lower 64 bits and one for the upper 64 bits. Or they may be backed @@ -231,7 +231,7 @@ Testing extension arrays We provide a test suite for ensuring that your extension arrays satisfy the expected behavior. To use the test suite, you must provide several pytest fixtures and inherit from the base test class. The required fixtures are found in -https://github.com/pandas-dev/pandas/blob/master/pandas/tests/extension/conftest.py. +https://github.com/pandas-dev/pandas/blob/main/pandas/tests/extension/conftest.py. To use a test, subclass it: @@ -244,7 +244,7 @@ To use a test, subclass it: pass -See https://github.com/pandas-dev/pandas/blob/master/pandas/tests/extension/base/__init__.py +See https://github.com/pandas-dev/pandas/blob/main/pandas/tests/extension/base/__init__.py for a list of all the tests available. .. _extending.extension.arrow: @@ -290,9 +290,9 @@ See more in the `Arrow documentation `__ +Libraries implementing the plotting backend should use `entry points `__ to make their backend discoverable to pandas. The key is ``"pandas_plotting_backends"``. For example, pandas registers the default "matplotlib" backend as follows. @@ -486,4 +486,4 @@ registers the default "matplotlib" backend as follows. More information on how to implement a third-party plotting backend can be found at -https://github.com/pandas-dev/pandas/blob/master/pandas/plotting/__init__.py#L1. +https://github.com/pandas-dev/pandas/blob/main/pandas/plotting/__init__.py#L1. diff --git a/doc/source/development/maintaining.rst b/doc/source/development/maintaining.rst index a0e9ba53acd00..a8521039c5427 100644 --- a/doc/source/development/maintaining.rst +++ b/doc/source/development/maintaining.rst @@ -237,4 +237,4 @@ a milestone before tagging, you can request the bot to backport it with: .. _governance documents: https://github.com/pandas-dev/pandas-governance -.. 
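Note: the extending guide above documents custom accessors (with the corrected ``pd.DataFrame`` example) and custom ``ExtensionDtype``/``ExtensionArray`` implementations. A compact, self-contained sketch of the accessor mechanism using the public API; the ``center`` property is a simplified stand-in for the guide's ``geo`` example:

```python
import numpy as np
import pandas as pd


@pd.api.extensions.register_dataframe_accessor("geo")
class GeoAccessor:
    def __init__(self, pandas_obj):
        self._obj = pandas_obj

    @property
    def center(self):
        # Mean longitude/latitude of the frame, mirroring the guide's example.
        return (self._obj["longitude"].mean(), self._obj["latitude"].mean())


df = pd.DataFrame({"longitude": np.linspace(0, 10), "latitude": np.linspace(0, 20)})
print(df.geo.center)  # (5.0, 10.0)
```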
_list of permissions: https://help.github.com/en/github/setting-up-and-managing-organizations-and-teams/repository-permission-levels-for-an-organization +.. _list of permissions: https://docs.github.com/en/organizations/managing-access-to-your-organizations-repositories/repository-roles-for-an-organization diff --git a/doc/source/development/roadmap.rst b/doc/source/development/roadmap.rst index 37e45bf5a42b5..ccdb4f1fafae4 100644 --- a/doc/source/development/roadmap.rst +++ b/doc/source/development/roadmap.rst @@ -74,8 +74,7 @@ types. This includes consistent behavior in all operations (indexing, arithmetic operations, comparisons, etc.). There has been discussion of eventually making the new semantics the default. -This has been discussed at -`github #28095 `__ (and +This has been discussed at :issue:`28095` (and linked issues), and described in more detail in this `design doc `__. @@ -129,8 +128,7 @@ We propose that it should only work with positional indexing, and the translatio to positions should be entirely done at a higher level. Indexing is a complicated API with many subtleties. This refactor will require care -and attention. More details are discussed at -https://github.com/pandas-dev/pandas/wiki/(Tentative)-rules-for-restructuring-indexing-code +and attention. More details are discussed at :wiki:`(Tentative)-rules-for-restructuring-indexing-code` Numba-accelerated operations ---------------------------- @@ -205,4 +203,4 @@ We improved the pandas documentation * :ref:`getting_started` contains a number of resources intended for new pandas users coming from a variety of backgrounds (:issue:`26831`). -.. _pydata-sphinx-theme: https://github.com/pandas-dev/pydata-sphinx-theme +.. _pydata-sphinx-theme: https://github.com/pydata/pydata-sphinx-theme diff --git a/doc/source/ecosystem.rst b/doc/source/ecosystem.rst index ee061e7b7d3e6..16cae9bbfbf46 100644 --- a/doc/source/ecosystem.rst +++ b/doc/source/ecosystem.rst @@ -19,7 +19,7 @@ development to remain focused around it's original requirements. This is an inexhaustive list of projects that build on pandas in order to provide tools in the PyData space. For a list of projects that depend on pandas, see the -`libraries.io usage page for pandas `_ +`Github network dependents for pandas `_ or `search pypi for pandas `_. We'd like to make it easier for users to find these projects, if you know of other @@ -30,16 +30,18 @@ substantial projects that you feel should be on this list, please let us know. Data cleaning and validation ---------------------------- -`Pyjanitor `__ -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +`Pyjanitor `__ +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Pyjanitor provides a clean API for cleaning data, using method chaining. -`Engarde `__ +`Pandera `__ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -Engarde is a lightweight library used to explicitly state assumptions about your datasets -and check that they're *actually* true. +Pandera provides a flexible and expressive API for performing data validation on dataframes +to make data processing pipelines more readable and robust. +Dataframes contain information that pandera explicitly validates at runtime. 
This is useful in +production-critical data pipelines or reproducible research settings. `pandas-path `__ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ @@ -69,19 +71,19 @@ a long-standing special relationship with pandas. Statsmodels provides powerful econometrics, analysis and modeling functionality that is out of pandas' scope. Statsmodels leverages pandas objects as the underlying data container for computation. -`sklearn-pandas `__ -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +`sklearn-pandas `__ +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Use pandas DataFrames in your `scikit-learn `__ ML pipeline. `Featuretools `__ -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Featuretools is a Python library for automated feature engineering built on top of pandas. It excels at transforming temporal and relational datasets into feature matrices for machine learning using reusable feature engineering "primitives". Users can contribute their own primitives in Python and share them with the rest of the community. `Compose `__ -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Compose is a machine learning tool for labeling data and prediction engineering. It allows you to structure the labeling process by parameterizing prediction problems and transforming time-driven relational data into target values with cutoff times that can be used for supervised learning. @@ -113,8 +115,8 @@ simplicity produces beautiful and effective visualizations with a minimal amount of code. Altair works with pandas DataFrames. -`Bokeh `__ -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +`Bokeh `__ +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Bokeh is a Python interactive visualization library for large datasets that natively uses the latest web technologies. Its goal is to provide elegant, concise construction of novel @@ -145,7 +147,7 @@ estimation while plotting, aggregating across observations and visualizing the fit of statistical models to emphasize patterns in a dataset. `plotnine `__ -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Hadley Wickham's `ggplot2 `__ is a foundational exploratory visualization package for the R language. Based on `"The Grammar of Graphics" `__ it @@ -159,10 +161,10 @@ A good implementation for Python users is `has2k1/plotnine `__ leverages `Vega `__ to create plots within Jupyter Notebook. -`Plotly `__ -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +`Plotly `__ +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -`Plotly’s `__ `Python API `__ enables interactive figures and web shareability. Maps, 2D, 3D, and live-streaming graphs are rendered with WebGL and `D3.js `__. The library supports plotting directly from a pandas DataFrame and cloud-based collaboration. Users of `matplotlib, ggplot for Python, and Seaborn `__ can convert figures into interactive web-based plots. Plots can be drawn in `IPython Notebooks `__ , edited with R or MATLAB, modified in a GUI, or embedded in apps and dashboards. Plotly is free for unlimited sharing, and has `cloud `__, `offline `__, or `on-premise `__ accounts for private use. +`Plotly’s `__ `Python API `__ enables interactive figures and web shareability. Maps, 2D, 3D, and live-streaming graphs are rendered with WebGL and `D3.js `__. 
The library supports plotting directly from a pandas DataFrame and cloud-based collaboration. Users of `matplotlib, ggplot for Python, and Seaborn `__ can convert figures into interactive web-based plots. Plots can be drawn in `IPython Notebooks `__ , edited with R or MATLAB, modified in a GUI, or embedded in apps and dashboards. Plotly is free for unlimited sharing, and has `offline `__, or `on-premise `__ accounts for private use. `Lux `__ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ @@ -177,7 +179,7 @@ A good implementation for Python users is `has2k1/plotnine `__ that highlights interesting trends and patterns in the dataframe. Users can leverage any existing pandas commands without modifying their code, while being able to visualize their pandas data structures (e.g., DataFrame, Series, Index) at the same time. Lux also offers a `powerful, intuitive language `__ that allow users to create `Altair `__, `matplotlib `__, or `Vega-Lite `__ visualizations without having to think at the level of code. +By printing out a dataframe, Lux automatically `recommends a set of visualizations `__ that highlights interesting trends and patterns in the dataframe. Users can leverage any existing pandas commands without modifying their code, while being able to visualize their pandas data structures (e.g., DataFrame, Series, Index) at the same time. Lux also offers a `powerful, intuitive language `__ that allow users to create `Altair `__, `matplotlib `__, or `Vega-Lite `__ visualizations without having to think at the level of code. `Qtpandas `__ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ @@ -202,8 +204,7 @@ invoked with the following command dtale.show(df) D-Tale integrates seamlessly with Jupyter notebooks, Python terminals, Kaggle -& Google Colab. Here are some demos of the `grid `__ -and `chart-builder `__. +& Google Colab. Here are some demos of the `grid `__. `hvplot `__ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ @@ -218,7 +219,7 @@ It can be loaded as a native pandas plotting backend via .. _ecosystem.ide: IDE ------- +--- `IPython `__ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ @@ -262,7 +263,7 @@ debugging and profiling functionality of a software development tool with the data exploration, interactive execution, deep inspection and rich visualization capabilities of a scientific environment like MATLAB or Rstudio. -Its `Variable Explorer `__ +Its `Variable Explorer `__ allows users to view, manipulate and edit pandas ``Index``, ``Series``, and ``DataFrame`` objects like a "spreadsheet", including copying and modifying values, sorting, displaying a "heatmap", converting data types and more. @@ -272,9 +273,9 @@ Spyder can also import data from a variety of plain text and binary files or the clipboard into a new pandas DataFrame via a sophisticated import wizard. Most pandas classes, methods and data attributes can be autocompleted in -Spyder's `Editor `__ and -`IPython Console `__, -and Spyder's `Help pane `__ can retrieve +Spyder's `Editor `__ and +`IPython Console `__, +and Spyder's `Help pane `__ can retrieve and render Numpydoc documentation on pandas objects in rich text with Sphinx both automatically and on-demand. @@ -310,8 +311,8 @@ The following data feeds are available: * Stooq Index Data * MOEX Data -`Quandl/Python `__ -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +`Quandl/Python `__ +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Quandl API for Python wraps the Quandl REST API to return pandas DataFrames with timeseries indexes. 
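A minimal sketch of the Quandl usage described just above, assuming the ``quandl`` package is installed and an API key has been configured; the dataset code and dates below are placeholders only:

.. code-block:: python

    import quandl

    # Most datasets require an API key: quandl.ApiConfig.api_key = "YOUR_KEY"
    gdp = quandl.get("FRED/GDP", start_date="2015-01-01", end_date="2020-01-01")
    gdp.head()  # a DataFrame with a DatetimeIndex, as described above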
@@ -322,8 +323,8 @@ PyDatastream is a Python interface to the REST API to return indexed pandas DataFrames with financial data. This package requires valid credentials for this API (non free). -`pandaSDMX `__ -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +`pandaSDMX `__ +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ pandaSDMX is a library to retrieve and acquire statistical data and metadata disseminated in `SDMX `_ 2.1, an ISO-standard @@ -355,8 +356,8 @@ with pandas. Domain specific --------------- -`Geopandas `__ -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +`Geopandas `__ +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Geopandas extends pandas data objects to include geographic information which support geometric operations. If your work entails maps and geographical coordinates, and @@ -396,7 +397,7 @@ any Delta table into Pandas dataframe. .. _ecosystem.out-of-core: Out-of-core -------------- +----------- `Blaze `__ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ @@ -434,8 +435,8 @@ can selectively scale parts of their pandas DataFrame applications. print(df3) -`Dask `__ -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +`Dask `__ +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Dask is a flexible parallel computing library for analytics. Dask provides a familiar ``DataFrame`` interface for out-of-core, parallel and distributed computing. @@ -445,6 +446,12 @@ provides a familiar ``DataFrame`` interface for out-of-core, parallel and distri Dask-ML enables parallel and distributed machine learning using Dask alongside existing machine learning libraries like Scikit-Learn, XGBoost, and TensorFlow. +`Ibis `__ +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Ibis offers a standard way to write analytics code, that can be run in multiple engines. It helps in bridging the gap between local Python environments (like pandas) and remote storage and execution systems like Hadoop components (like HDFS, Impala, Hive, Spark) and SQL databases (Postgres, etc.). + + `Koalas `__ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ @@ -467,8 +474,8 @@ time-consuming tasks like ingesting data (``read_csv``, ``read_excel``, df = pd.read_csv("big.csv") # use all your cores! -`Odo `__ -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +`Odo `__ +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Odo provides a uniform API for moving data between different formats. It uses pandas own ``read_csv`` for CSV IO and leverages many existing packages such as @@ -492,8 +499,8 @@ If also displays progress bars. df.parallel_apply(func) -`Vaex `__ -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +`Vaex `__ +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Increasingly, packages are being built on top of pandas to address specific needs in data preparation, analysis and visualization. Vaex is a Python library for Out-of-Core DataFrames (similar to pandas), to visualize and explore big tabular datasets. It can calculate statistics such as mean, sum, count, standard deviation etc, on an N-dimensional grid up to a billion (10\ :sup:`9`) objects/rows per second. Visualization is done using histograms, density plots and 3d volume rendering, allowing interactive exploration of big data. Vaex uses memory mapping, zero memory copy policy and lazy computations for best performance (no memory wasted). @@ -567,5 +574,18 @@ Library Accessor Classes Description .. _pathlib.Path: https://docs.python.org/3/library/pathlib.html .. 
_pint-pandas: https://github.com/hgrecco/pint-pandas .. _composeml: https://github.com/alteryx/compose -.. _datatest: https://datatest.readthedocs.io/ +.. _datatest: https://datatest.readthedocs.io/en/stable/ .. _woodwork: https://github.com/alteryx/woodwork + +Development tools +----------------- + +`pandas-stubs `__ +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +While pandas repository is partially typed, the package itself doesn't expose this information for external use. +Install pandas-stubs to enable basic type coverage of pandas API. + +Learn more by reading through :issue:`14468`, :issue:`26766`, :issue:`28142`. + +See installation and usage instructions on the `github page `__. diff --git a/doc/source/getting_started/comparison/comparison_with_r.rst b/doc/source/getting_started/comparison/comparison_with_r.rst index 864081002086b..f91f4218c3429 100644 --- a/doc/source/getting_started/comparison/comparison_with_r.rst +++ b/doc/source/getting_started/comparison/comparison_with_r.rst @@ -31,7 +31,7 @@ Quick reference We'll start off with a quick reference guide pairing some common R operations using `dplyr -`__ with +`__ with pandas equivalents. @@ -326,8 +326,8 @@ table below shows how these data structures could be mapped in Python. | data.frame | dataframe | +------------+-------------------------------+ -|ddply|_ -~~~~~~~~ +ddply +~~~~~ An expression using a data.frame called ``df`` in R where you want to summarize ``x`` by ``month``: @@ -372,8 +372,8 @@ For more details and examples see :ref:`the groupby documentation reshape / reshape2 ------------------ -|meltarray|_ -~~~~~~~~~~~~~ +meltarray +~~~~~~~~~ An expression using a 3 dimensional array called ``a`` in R where you want to melt it into a data.frame: @@ -390,8 +390,8 @@ In Python, since ``a`` is a list, you can simply use list comprehension. a = np.array(list(range(1, 24)) + [np.NAN]).reshape(2, 3, 4) pd.DataFrame([tuple(list(x) + [val]) for x, val in np.ndenumerate(a)]) -|meltlist|_ -~~~~~~~~~~~~ +meltlist +~~~~~~~~ An expression using a list called ``a`` in R where you want to melt it into a data.frame: @@ -412,8 +412,8 @@ In Python, this list would be a list of tuples, so For more details and examples see :ref:`the Into to Data Structures documentation `. -|meltdf|_ -~~~~~~~~~~~~~~~~ +meltdf +~~~~~~ An expression using a data.frame called ``cheese`` in R where you want to reshape the data.frame: @@ -447,8 +447,8 @@ In Python, the :meth:`~pandas.melt` method is the R equivalent: For more details and examples see :ref:`the reshaping documentation `. -|cast|_ -~~~~~~~ +cast +~~~~ In R ``acast`` is an expression using a data.frame called ``df`` in R to cast into a higher dimensional array: @@ -577,20 +577,5 @@ For more details and examples see :ref:`categorical introduction ` .. |subset| replace:: ``subset`` .. _subset: https://stat.ethz.ch/R-manual/R-patched/library/base/html/subset.html -.. |ddply| replace:: ``ddply`` -.. _ddply: https://cran.r-project.org/web/packages/plyr/plyr.pdf#Rfn.ddply.1 - -.. |meltarray| replace:: ``melt.array`` -.. 
_meltarray: https://cran.r-project.org/web/packages/reshape2/reshape2.pdf#Rfn.melt.array.1 - -.. |meltlist| replace:: ``melt.list`` -.. meltlist: https://cran.r-project.org/web/packages/reshape2/reshape2.pdf#Rfn.melt.list.1 - -.. |meltdf| replace:: ``melt.data.frame`` -.. meltdf: https://cran.r-project.org/web/packages/reshape2/reshape2.pdf#Rfn.melt.data.frame.1 - -.. |cast| replace:: ``cast`` -.. cast: https://cran.r-project.org/web/packages/reshape2/reshape2.pdf#Rfn.cast.1 - .. |factor| replace:: ``factor`` .. _factor: https://stat.ethz.ch/R-manual/R-devel/library/base/html/factor.html diff --git a/doc/source/getting_started/comparison/comparison_with_sas.rst b/doc/source/getting_started/comparison/comparison_with_sas.rst index 54b45dc20db20..5a624c9c55782 100644 --- a/doc/source/getting_started/comparison/comparison_with_sas.rst +++ b/doc/source/getting_started/comparison/comparison_with_sas.rst @@ -96,7 +96,7 @@ Reading external data Like SAS, pandas provides utilities for reading in data from many formats. The ``tips`` dataset, found within the pandas -tests (`csv `_) +tests (`csv `_) will be used in many of the following examples. SAS provides ``PROC IMPORT`` to read csv data into a data set. @@ -113,7 +113,7 @@ The pandas method is :func:`read_csv`, which works similarly. url = ( "https://raw.github.com/pandas-dev/" - "pandas/master/pandas/tests/io/data/csv/tips.csv" + "pandas/main/pandas/tests/io/data/csv/tips.csv" ) tips = pd.read_csv(url) tips @@ -335,7 +335,7 @@ Extracting substring by position ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ SAS extracts a substring from a string based on its position with the -`SUBSTR `__ function. +`SUBSTR `__ function. .. code-block:: sas @@ -538,7 +538,7 @@ This means that the size of data able to be loaded in pandas is limited by your machine's memory, but also that the operations on that data may be faster. If out of core processing is needed, one possibility is the -`dask.dataframe `_ +`dask.dataframe `_ library (currently in development) which provides a subset of pandas functionality for an on-disk ``DataFrame`` diff --git a/doc/source/getting_started/comparison/comparison_with_spreadsheets.rst b/doc/source/getting_started/comparison/comparison_with_spreadsheets.rst index bdd0f7d8cfddf..a7148405ba8a0 100644 --- a/doc/source/getting_started/comparison/comparison_with_spreadsheets.rst +++ b/doc/source/getting_started/comparison/comparison_with_spreadsheets.rst @@ -11,7 +11,7 @@ of how various spreadsheet operations would be performed using pandas. This page terminology and link to documentation for Excel, but much will be the same/similar in `Google Sheets `_, `LibreOffice Calc `_, -`Apple Numbers `_, and other +`Apple Numbers `_, and other Excel-compatible spreadsheet software. .. include:: includes/introduction.rst @@ -85,14 +85,14 @@ In a spreadsheet, `values can be typed directly into cells `__ +Both `Excel `__ and :ref:`pandas <10min_tut_02_read_write>` can import data from various sources in various formats. CSV ''' -Let's load and display the `tips `_ +Let's load and display the `tips `_ dataset from the pandas tests, which is a CSV file. In Excel, you would download and then `open the CSV `_. 
In pandas, you pass the URL or local path of the CSV file to :func:`~pandas.read_csv`: @@ -101,7 +101,7 @@ In pandas, you pass the URL or local path of the CSV file to :func:`~pandas.read url = ( "https://raw.github.com/pandas-dev" - "/pandas/master/pandas/tests/io/data/csv/tips.csv" + "/pandas/main/pandas/tests/io/data/csv/tips.csv" ) tips = pd.read_csv(url) tips @@ -435,13 +435,14 @@ The equivalent in pandas: Adding a row ~~~~~~~~~~~~ -Assuming we are using a :class:`~pandas.RangeIndex` (numbered ``0``, ``1``, etc.), we can use :meth:`DataFrame.append` to add a row to the bottom of a ``DataFrame``. +Assuming we are using a :class:`~pandas.RangeIndex` (numbered ``0``, ``1``, etc.), we can use :func:`concat` to add a row to the bottom of a ``DataFrame``. .. ipython:: python df - new_row = {"class": "E", "student_count": 51, "all_pass": True} - df.append(new_row, ignore_index=True) + new_row = pd.DataFrame([["E", 51, True]], + columns=["class", "student_count", "all_pass"]) + pd.concat([df, new_row]) Find and Replace diff --git a/doc/source/getting_started/comparison/comparison_with_sql.rst b/doc/source/getting_started/comparison/comparison_with_sql.rst index 49a21f87382b3..0a891a4c6d2d7 100644 --- a/doc/source/getting_started/comparison/comparison_with_sql.rst +++ b/doc/source/getting_started/comparison/comparison_with_sql.rst @@ -18,7 +18,7 @@ structure. url = ( "https://raw.github.com/pandas-dev" - "/pandas/master/pandas/tests/io/data/csv/tips.csv" + "/pandas/main/pandas/tests/io/data/csv/tips.csv" ) tips = pd.read_csv(url) tips @@ -233,6 +233,12 @@ default, :meth:`~pandas.DataFrame.join` will join the DataFrames on their indice parameters allowing you to specify the type of join to perform (``LEFT``, ``RIGHT``, ``INNER``, ``FULL``) or the columns to join on (column names or indices). +.. warning:: + + If both key columns contain rows where the key is a null value, those + rows will be matched against each other. This is different from usual SQL + join behaviour and can lead to unexpected results. + .. ipython:: python df1 = pd.DataFrame({"key": ["A", "B", "C", "D"], "value": np.random.randn(4)}) diff --git a/doc/source/getting_started/comparison/comparison_with_stata.rst b/doc/source/getting_started/comparison/comparison_with_stata.rst index 94c45adcccc82..636778a2ca32e 100644 --- a/doc/source/getting_started/comparison/comparison_with_stata.rst +++ b/doc/source/getting_started/comparison/comparison_with_stata.rst @@ -92,7 +92,7 @@ Reading external data Like Stata, pandas provides utilities for reading in data from many formats. The ``tips`` data set, found within the pandas -tests (`csv `_) +tests (`csv `_) will be used in many of the following examples. Stata provides ``import delimited`` to read csv data into a data set in memory. @@ -109,7 +109,7 @@ the data set if presented with a url. url = ( "https://raw.github.com/pandas-dev" - "/pandas/master/pandas/tests/io/data/csv/tips.csv" + "/pandas/main/pandas/tests/io/data/csv/tips.csv" ) tips = pd.read_csv(url) tips @@ -496,6 +496,6 @@ Disk vs memory pandas and Stata both operate exclusively in memory. This means that the size of data able to be loaded in pandas is limited by your machine's memory. If out of core processing is needed, one possibility is the -`dask.dataframe `_ +`dask.dataframe `_ library, which provides a subset of pandas functionality for an on-disk ``DataFrame``. 
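The warning added to the SQL comparison above, that rows with null join keys are matched against each other, can be seen with a small sketch using made-up frames:

.. code-block:: python

    import numpy as np
    import pandas as pd

    left = pd.DataFrame({"key": ["A", np.nan], "lval": [1, 2]})
    right = pd.DataFrame({"key": ["B", np.nan], "rval": [3, 4]})

    # SQL would not match NULL to NULL in an inner join; pandas does match the
    # NaN keys, so the result pairs lval=2 with rval=4.
    pd.merge(left, right, on="key", how="inner")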
diff --git a/doc/source/getting_started/comparison/includes/nth_word.rst b/doc/source/getting_started/comparison/includes/nth_word.rst index 7af0285005d5b..20e2ec47a8c9d 100644 --- a/doc/source/getting_started/comparison/includes/nth_word.rst +++ b/doc/source/getting_started/comparison/includes/nth_word.rst @@ -5,5 +5,5 @@ word by index. Note there are more powerful approaches should you need them. firstlast = pd.DataFrame({"String": ["John Smith", "Jane Cook"]}) firstlast["First_Name"] = firstlast["String"].str.split(" ", expand=True)[0] - firstlast["Last_Name"] = firstlast["String"].str.rsplit(" ", expand=True)[0] + firstlast["Last_Name"] = firstlast["String"].str.rsplit(" ", expand=True)[1] firstlast diff --git a/doc/source/getting_started/install.rst b/doc/source/getting_started/install.rst index 88e54421daa11..df9c258f4aa6d 100644 --- a/doc/source/getting_started/install.rst +++ b/doc/source/getting_started/install.rst @@ -12,7 +12,7 @@ cross platform distribution for data analysis and scientific computing. This is the recommended installation method for most users. Instructions for installing from source, -`PyPI `__, `ActivePython `__, various Linux distributions, or a +`PyPI `__, `ActivePython `__, various Linux distributions, or a `development version `__ are also provided. .. _install.version: @@ -20,7 +20,7 @@ Instructions for installing from source, Python version support ---------------------- -Officially Python 3.7.1 and above, 3.8, and 3.9. +Officially Python 3.8, and 3.9. Installing pandas ----------------- @@ -47,7 +47,7 @@ rest of the `SciPy `__ stack without needing to install anything else, and without needing to wait for any software to be compiled. Installation instructions for `Anaconda `__ -`can be found here `__. +`can be found here `__. A full list of the packages available as part of the `Anaconda `__ distribution @@ -70,18 +70,18 @@ and involves downloading the installer which is a few hundred megabytes in size. If you want to have more control on which packages, or have a limited internet bandwidth, then installing pandas with -`Miniconda `__ may be a better solution. +`Miniconda `__ may be a better solution. -`Conda `__ is the package manager that the +`Conda `__ is the package manager that the `Anaconda `__ distribution is built upon. It is a package manager that is both cross-platform and language agnostic (it can play a similar role to a pip and virtualenv combination). `Miniconda `__ allows you to create a minimal self contained Python installation, and then use the -`Conda `__ command to install additional packages. +`Conda `__ command to install additional packages. -First you will need `Conda `__ to be installed and +First you will need `Conda `__ to be installed and downloading and running the `Miniconda `__ will do this for you. The installer @@ -132,6 +132,9 @@ Installing from PyPI pandas can be installed via pip from `PyPI `__. +.. note:: + You must have ``pip>=19.3`` to install from PyPI. + :: pip install pandas @@ -140,8 +143,8 @@ Installing with ActivePython ~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Installation instructions for -`ActivePython `__ can be found -`here `__. Versions +`ActivePython `__ can be found +`here `__. Versions 2.7, 3.5 and 3.6 include pandas. Installing using your Linux distribution's package manager. 
@@ -155,10 +158,10 @@ The commands in this table will install pandas for Python 3 from your distributi Debian, stable, `official Debian repository `__ , ``sudo apt-get install python3-pandas`` - Debian & Ubuntu, unstable (latest packages), `NeuroDebian `__ , ``sudo apt-get install python3-pandas`` + Debian & Ubuntu, unstable (latest packages), `NeuroDebian `__ , ``sudo apt-get install python3-pandas`` Ubuntu, stable, `official Ubuntu repository `__ , ``sudo apt-get install python3-pandas`` OpenSuse, stable, `OpenSuse Repository `__ , ``zypper in python3-pandas`` - Fedora, stable, `official Fedora repository `__ , ``dnf install python3-pandas`` + Fedora, stable, `official Fedora repository `__ , ``dnf install python3-pandas`` Centos/RHEL, stable, `EPEL repository `__ , ``yum install python3-pandas`` **However**, the packages in the linux package managers are often a few versions behind, so @@ -196,7 +199,7 @@ the code base as of this writing. To run it on your machine to verify that everything is working (and that you have all of the dependencies, soft and hard, installed), make sure you have `pytest `__ >= 6.0 and `Hypothesis -`__ >= 3.58, then run: +`__ >= 3.58, then run: :: @@ -221,9 +224,9 @@ Dependencies ================================================================ ========================== Package Minimum supported version ================================================================ ========================== -`NumPy `__ 1.17.3 -`python-dateutil `__ 2.7.3 -`pytz `__ 2017.3 +`NumPy `__ 1.18.5 +`python-dateutil `__ 2.8.1 +`pytz `__ 2020.1 ================================================================ ========================== .. _install.recommended_dependencies: @@ -233,11 +236,11 @@ Recommended dependencies * `numexpr `__: for accelerating certain numerical operations. ``numexpr`` uses multiple cores as well as smart chunking and caching to achieve large speedups. - If installed, must be Version 2.7.0 or higher. + If installed, must be Version 2.7.1 or higher. * `bottleneck `__: for accelerating certain types of ``nan`` evaluations. ``bottleneck`` uses specialized cython routines to achieve large speedups. If installed, - must be Version 1.2.1 or higher. + must be Version 1.3.1 or higher. .. 
note:: @@ -262,9 +265,8 @@ Visualization ========================= ================== ============================================================= Dependency Minimum Version Notes ========================= ================== ============================================================= -setuptools 38.6.0 Utils for entry points of plotting backend -matplotlib 2.2.3 Plotting library -Jinja2 2.10 Conditional formatting with DataFrame.style +matplotlib 3.3.2 Plotting library +Jinja2 2.11 Conditional formatting with DataFrame.style tabulate 0.8.7 Printing in Markdown-friendly format (see `tabulate`_) ========================= ================== ============================================================= @@ -274,10 +276,10 @@ Computation ========================= ================== ============================================================= Dependency Minimum Version Notes ========================= ================== ============================================================= -SciPy 1.12.0 Miscellaneous statistical functions -numba 0.46.0 Alternative execution engine for rolling operations +SciPy 1.14.1 Miscellaneous statistical functions +numba 0.50.1 Alternative execution engine for rolling operations (see :ref:`Enhancing Performance `) -xarray 0.12.3 pandas-like API for N-dimensional data +xarray 0.15.1 pandas-like API for N-dimensional data ========================= ================== ============================================================= Excel files @@ -286,10 +288,10 @@ Excel files ========================= ================== ============================================================= Dependency Minimum Version Notes ========================= ================== ============================================================= -xlrd 1.2.0 Reading Excel +xlrd 2.0.1 Reading Excel xlwt 1.3.0 Writing Excel -xlsxwriter 1.0.2 Writing Excel -openpyxl 3.0.0 Reading / writing for xlsx files +xlsxwriter 1.2.2 Writing Excel +openpyxl 3.0.3 Reading / writing for xlsx files pyxlsb 1.0.6 Reading for xlsb files ========================= ================== ============================================================= @@ -299,9 +301,9 @@ HTML ========================= ================== ============================================================= Dependency Minimum Version Notes ========================= ================== ============================================================= -BeautifulSoup4 4.6.0 HTML parser for read_html -html5lib 1.0.1 HTML parser for read_html -lxml 4.3.0 HTML parser for read_html +BeautifulSoup4 4.8.2 HTML parser for read_html +html5lib 1.1 HTML parser for read_html +lxml 4.5.0 HTML parser for read_html ========================= ================== ============================================================= One of the following combinations of libraries is needed to use the @@ -334,7 +336,7 @@ XML ========================= ================== ============================================================= Dependency Minimum Version Notes ========================= ================== ============================================================= -lxml 4.3.0 XML parser for read_xml and tree builder for to_xml +lxml 4.5.0 XML parser for read_xml and tree builder for to_xml ========================= ================== ============================================================= SQL databases @@ -343,9 +345,9 @@ SQL databases ========================= ================== ============================================================= Dependency Minimum Version Notes 
========================= ================== ============================================================= -SQLAlchemy 1.3.0 SQL support for databases other than sqlite -psycopg2 2.7 PostgreSQL engine for sqlalchemy -pymysql 0.8.1 MySQL engine for sqlalchemy +SQLAlchemy 1.4.0 SQL support for databases other than sqlite +psycopg2 2.8.4 PostgreSQL engine for sqlalchemy +pymysql 0.10.1 MySQL engine for sqlalchemy ========================= ================== ============================================================= Other data sources @@ -354,12 +356,12 @@ Other data sources ========================= ================== ============================================================= Dependency Minimum Version Notes ========================= ================== ============================================================= -PyTables 3.5.1 HDF5-based reading / writing -blosc 1.17.0 Compression for HDF5 +PyTables 3.6.1 HDF5-based reading / writing +blosc 1.20.1 Compression for HDF5 zlib Compression for HDF5 fastparquet 0.4.0 Parquet reading / writing -pyarrow 0.17.0 Parquet, ORC, and feather reading / writing -pyreadstat SPSS files (.sav) reading +pyarrow 1.0.1 Parquet, ORC, and feather reading / writing +pyreadstat 1.1.0 SPSS files (.sav) reading ========================= ================== ============================================================= .. _install.warn_orc: @@ -385,7 +387,7 @@ Dependency Minimum Version Notes ========================= ================== ============================================================= fsspec 0.7.4 Handling files aside from simple local and HTTP gcsfs 0.6.0 Google Cloud Storage access -pandas-gbq 0.12.0 Google Big Query access +pandas-gbq 0.14.0 Google Big Query access s3fs 0.4.0 Amazon S3 access ========================= ================== ============================================================= @@ -400,3 +402,13 @@ qtpy Clipboard I/O xclip Clipboard I/O on linux xsel Clipboard I/O on linux ========================= ================== ============================================================= + + +Compression +^^^^^^^^^^^ + +========================= ================== ============================================================= +Dependency Minimum Version Notes +========================= ================== ============================================================= +Zstandard Zstandard compression +========================= ================== ============================================================= diff --git a/doc/source/getting_started/intro_tutorials/06_calculate_statistics.rst b/doc/source/getting_started/intro_tutorials/06_calculate_statistics.rst index fcf754e340ab2..caa37d69f2945 100644 --- a/doc/source/getting_started/intro_tutorials/06_calculate_statistics.rst +++ b/doc/source/getting_started/intro_tutorials/06_calculate_statistics.rst @@ -82,7 +82,7 @@ return a ``DataFrame``, see the :ref:`subset data tutorial <10min_tut_03_subset> The aggregating statistic can be calculated for multiple columns at the -same time. Remember the ``describe`` function from :ref:`first tutorial <10min_tut_01_tableoriented>` tutorial? +same time. Remember the ``describe`` function from :ref:`first tutorial <10min_tut_01_tableoriented>`? .. 
ipython:: python diff --git a/doc/source/getting_started/intro_tutorials/07_reshape_table_layout.rst b/doc/source/getting_started/intro_tutorials/07_reshape_table_layout.rst index bd4a617fe753b..d09511143787a 100644 --- a/doc/source/getting_started/intro_tutorials/07_reshape_table_layout.rst +++ b/doc/source/getting_started/intro_tutorials/07_reshape_table_layout.rst @@ -67,7 +67,7 @@ measurement. .. raw:: html
- To raw data + To raw data diff --git a/doc/source/getting_started/intro_tutorials/08_combine_dataframes.rst b/doc/source/getting_started/intro_tutorials/08_combine_dataframes.rst index be4c284912db4..0b165c4aaa94e 100644 --- a/doc/source/getting_started/intro_tutorials/08_combine_dataframes.rst +++ b/doc/source/getting_started/intro_tutorials/08_combine_dataframes.rst @@ -34,7 +34,7 @@ Westminster* in respectively Paris, Antwerp and London. .. raw:: html
- To raw data + To raw data @@ -69,7 +69,7 @@ Westminster* in respectively Paris, Antwerp and London. .. raw:: html
- To raw data + To raw data diff --git a/doc/source/getting_started/intro_tutorials/09_timeseries.rst b/doc/source/getting_started/intro_tutorials/09_timeseries.rst index b9cab0747196e..1b3c3f2a601e8 100644 --- a/doc/source/getting_started/intro_tutorials/09_timeseries.rst +++ b/doc/source/getting_started/intro_tutorials/09_timeseries.rst @@ -35,7 +35,7 @@ Westminster* in respectively Paris, Antwerp and London. .. raw:: html
- To raw data + To raw data diff --git a/doc/source/getting_started/intro_tutorials/includes/air_quality_no2.rst b/doc/source/getting_started/intro_tutorials/includes/air_quality_no2.rst index a5a5442330e43..410062cf46344 100644 --- a/doc/source/getting_started/intro_tutorials/includes/air_quality_no2.rst +++ b/doc/source/getting_started/intro_tutorials/includes/air_quality_no2.rst @@ -17,6 +17,6 @@ in respectively Paris, Antwerp and London. .. raw:: html
- To raw data + To raw data diff --git a/doc/source/getting_started/intro_tutorials/includes/titanic.rst b/doc/source/getting_started/intro_tutorials/includes/titanic.rst index 7032b70b3f1cf..1267a33d605ed 100644 --- a/doc/source/getting_started/intro_tutorials/includes/titanic.rst +++ b/doc/source/getting_started/intro_tutorials/includes/titanic.rst @@ -27,6 +27,6 @@ consists of the following data columns: .. raw:: html
- To raw data + To raw data diff --git a/doc/source/getting_started/overview.rst b/doc/source/getting_started/overview.rst index 7084b67cf9424..320d2da01418c 100644 --- a/doc/source/getting_started/overview.rst +++ b/doc/source/getting_started/overview.rst @@ -29,7 +29,7 @@ and :class:`DataFrame` (2-dimensional), handle the vast majority of typical use cases in finance, statistics, social science, and many areas of engineering. For R users, :class:`DataFrame` provides everything that R's ``data.frame`` provides and much more. pandas is built on top of `NumPy -`__ and is intended to integrate well within a scientific +`__ and is intended to integrate well within a scientific computing environment with many other 3rd party libraries. Here are just a few of the things that pandas does well: @@ -75,7 +75,7 @@ Some other notes specialized tool. - pandas is a dependency of `statsmodels - `__, making it an important part of the + `__, making it an important part of the statistical computing ecosystem in Python. - pandas has been used extensively in production in financial applications. @@ -168,7 +168,7 @@ The list of the Core Team members and more detailed information can be found on Institutional partners ---------------------- -The information about current institutional partners can be found on `pandas website page `__. +The information about current institutional partners can be found on `pandas website page `__. License ------- diff --git a/doc/source/getting_started/tutorials.rst b/doc/source/getting_started/tutorials.rst index b8940d2efed2f..a4c555ac227e6 100644 --- a/doc/source/getting_started/tutorials.rst +++ b/doc/source/getting_started/tutorials.rst @@ -18,6 +18,19 @@ entails. For the table of contents, see the `pandas-cookbook GitHub repository `_. +pandas workshop by Stefanie Molin +--------------------------------- + +An introductory workshop by `Stefanie Molin `_ +designed to quickly get you up to speed with pandas using real-world datasets. +It covers getting started with pandas, data wrangling, and data visualization +(with some exposure to matplotlib and seaborn). The +`pandas-workshop GitHub repository `_ +features detailed environment setup instructions (including a Binder environment), +slides and notebooks for following along, and exercises to practice the concepts. +There is also a lab with new exercises on a dataset not covered in the workshop for +additional practice. + Learn pandas by Hernan Rojas ---------------------------- @@ -77,11 +90,11 @@ Video tutorials * `Data analysis in Python with pandas `_ (2016-2018) `GitHub repo `__ and - `Jupyter Notebook `__ + `Jupyter Notebook `__ * `Best practices with pandas `_ (2018) `GitHub repo `__ and - `Jupyter Notebook `__ + `Jupyter Notebook `__ Various tutorials diff --git a/doc/source/index.rst.template b/doc/source/index.rst.template index 51a6807b30e2a..3b440122c2b97 100644 --- a/doc/source/index.rst.template +++ b/doc/source/index.rst.template @@ -12,6 +12,9 @@ pandas documentation **Download documentation**: `PDF Version `__ | `Zipped HTML `__ +**Previous versions**: Documentation of previous pandas versions is available at +`pandas.pydata.org `__. + **Useful links**: `Binary Installers `__ | `Source Repository `__ | diff --git a/doc/source/reference/arrays.rst b/doc/source/reference/arrays.rst index c6fda85b0486d..38792c46e5f54 100644 --- a/doc/source/reference/arrays.rst +++ b/doc/source/reference/arrays.rst @@ -2,9 +2,9 @@ .. 
_api.arrays: -============= -pandas arrays -============= +====================================== +pandas arrays, scalars, and data types +====================================== .. currentmodule:: pandas @@ -141,11 +141,11 @@ Methods Timestamp.weekday A collection of timestamps may be stored in a :class:`arrays.DatetimeArray`. -For timezone-aware data, the ``.dtype`` of a ``DatetimeArray`` is a +For timezone-aware data, the ``.dtype`` of a :class:`arrays.DatetimeArray` is a :class:`DatetimeTZDtype`. For timezone-naive data, ``np.dtype("datetime64[ns]")`` is used. -If the data are tz-aware, then every value in the array must have the same timezone. +If the data are timezone-aware, then every value in the array must have the same timezone. .. autosummary:: :toctree: api/ @@ -206,7 +206,7 @@ Methods Timedelta.to_numpy Timedelta.total_seconds -A collection of timedeltas may be stored in a :class:`TimedeltaArray`. +A collection of :class:`Timedelta` may be stored in a :class:`TimedeltaArray`. .. autosummary:: :toctree: api/ @@ -267,8 +267,8 @@ Methods Period.strftime Period.to_timestamp -A collection of timedeltas may be stored in a :class:`arrays.PeriodArray`. -Every period in a ``PeriodArray`` must have the same ``freq``. +A collection of :class:`Period` may be stored in a :class:`arrays.PeriodArray`. +Every period in a :class:`arrays.PeriodArray` must have the same ``freq``. .. autosummary:: :toctree: api/ @@ -383,8 +383,8 @@ Categorical data ---------------- pandas defines a custom data type for representing data that can take only a -limited, fixed set of values. The dtype of a ``Categorical`` can be described by -a :class:`pandas.api.types.CategoricalDtype`. +limited, fixed set of values. The dtype of a :class:`Categorical` can be described by +a :class:`CategoricalDtype`. .. autosummary:: :toctree: api/ @@ -414,7 +414,7 @@ have the categories and integer codes already: Categorical.from_codes -The dtype information is available on the ``Categorical`` +The dtype information is available on the :class:`Categorical` .. autosummary:: :toctree: api/ @@ -425,21 +425,21 @@ The dtype information is available on the ``Categorical`` Categorical.codes ``np.asarray(categorical)`` works by implementing the array interface. Be aware, that this converts -the Categorical back to a NumPy array, so categories and order information is not preserved! +the :class:`Categorical` back to a NumPy array, so categories and order information is not preserved! .. autosummary:: :toctree: api/ Categorical.__array__ -A ``Categorical`` can be stored in a ``Series`` or ``DataFrame``. +A :class:`Categorical` can be stored in a :class:`Series` or :class:`DataFrame`. To create a Series of dtype ``category``, use ``cat = s.astype(dtype)`` or ``Series(..., dtype=dtype)`` where ``dtype`` is either * the string ``'category'`` -* an instance of :class:`~pandas.api.types.CategoricalDtype`. +* an instance of :class:`CategoricalDtype`. -If the Series is of dtype ``CategoricalDtype``, ``Series.cat`` can be used to change the categorical +If the :class:`Series` is of dtype :class:`CategoricalDtype`, ``Series.cat`` can be used to change the categorical data. See :ref:`api.series.cat` for more. .. _api.arrays.sparse: @@ -488,7 +488,7 @@ we recommend using :class:`StringDtype` (with the alias ``"string"``). StringDtype -The ``Series.str`` accessor is available for ``Series`` backed by a :class:`arrays.StringArray`. +The ``Series.str`` accessor is available for :class:`Series` backed by a :class:`arrays.StringArray`. 
See :ref:`api.series.str` for more. @@ -498,7 +498,7 @@ Boolean data with missing values -------------------------------- The boolean dtype (with the alias ``"boolean"``) provides support for storing -boolean data (True, False values) with missing values, which is not possible +boolean data (``True``, ``False``) with missing values, which is not possible with a bool :class:`numpy.ndarray`. .. autosummary:: diff --git a/doc/source/reference/extensions.rst b/doc/source/reference/extensions.rst index 7b451ed3bf296..ce8d8d5c2ca10 100644 --- a/doc/source/reference/extensions.rst +++ b/doc/source/reference/extensions.rst @@ -48,6 +48,7 @@ objects. api.extensions.ExtensionArray.equals api.extensions.ExtensionArray.factorize api.extensions.ExtensionArray.fillna + api.extensions.ExtensionArray.insert api.extensions.ExtensionArray.isin api.extensions.ExtensionArray.isna api.extensions.ExtensionArray.ravel @@ -60,6 +61,7 @@ objects. api.extensions.ExtensionArray.nbytes api.extensions.ExtensionArray.ndim api.extensions.ExtensionArray.shape + api.extensions.ExtensionArray.tolist Additionally, we have some utility methods for ensuring your object behaves correctly. diff --git a/doc/source/reference/general_functions.rst b/doc/source/reference/general_functions.rst index b5832cb8aa591..dde16fb7fac71 100644 --- a/doc/source/reference/general_functions.rst +++ b/doc/source/reference/general_functions.rst @@ -37,15 +37,15 @@ Top-level missing data notna notnull -Top-level conversions -~~~~~~~~~~~~~~~~~~~~~ +Top-level dealing with numeric data +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ .. autosummary:: :toctree: api/ to_numeric -Top-level dealing with datetimelike -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +Top-level dealing with datetimelike data +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ .. autosummary:: :toctree: api/ @@ -57,8 +57,8 @@ Top-level dealing with datetimelike timedelta_range infer_freq -Top-level dealing with intervals -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +Top-level dealing with Interval data +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ .. autosummary:: :toctree: api/ diff --git a/doc/source/reference/general_utility_functions.rst b/doc/source/reference/general_utility_functions.rst index 37fe980dbf68c..ee17ef3831164 100644 --- a/doc/source/reference/general_utility_functions.rst +++ b/doc/source/reference/general_utility_functions.rst @@ -35,14 +35,17 @@ Exceptions and warnings .. autosummary:: :toctree: api/ + errors.AbstractMethodError errors.AccessorRegistrationWarning errors.DtypeWarning errors.DuplicateLabelError errors.EmptyDataError errors.InvalidIndexError + errors.IntCastingNaNError errors.MergeError errors.NullFrequencyError errors.NumbaUtilError + errors.OptionError errors.OutOfBoundsDatetime errors.OutOfBoundsTimedelta errors.ParserError diff --git a/doc/source/reference/groupby.rst b/doc/source/reference/groupby.rst index ccf130d03418c..2bb0659264eb0 100644 --- a/doc/source/reference/groupby.rst +++ b/doc/source/reference/groupby.rst @@ -122,6 +122,7 @@ application to columns of a specific data type. DataFrameGroupBy.skew DataFrameGroupBy.take DataFrameGroupBy.tshift + DataFrameGroupBy.value_counts The following methods are available only for ``SeriesGroupBy`` objects. 
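One of the reference entries added above, ``DataFrameGroupBy.value_counts``, can be illustrated with a short sketch, assuming a pandas version that provides it (1.4 or later):

.. code-block:: python

    import pandas as pd

    df = pd.DataFrame(
        {
            "gender": ["male", "male", "female", "male"],
            "education": ["low", "medium", "low", "low"],
        }
    )

    # Count the unique combinations of the non-grouped columns within each group.
    df.groupby("gender").value_counts()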
diff --git a/doc/source/reference/indexing.rst b/doc/source/reference/indexing.rst index 1a8c21a2c1a74..ddfef14036ef3 100644 --- a/doc/source/reference/indexing.rst +++ b/doc/source/reference/indexing.rst @@ -406,6 +406,7 @@ Methods :toctree: api/ DatetimeIndex.mean + DatetimeIndex.std TimedeltaIndex -------------- diff --git a/doc/source/reference/io.rst b/doc/source/reference/io.rst index 442631de50c7a..44ee09f2a5e6b 100644 --- a/doc/source/reference/io.rst +++ b/doc/source/reference/io.rst @@ -13,6 +13,7 @@ Pickling :toctree: api/ read_pickle + DataFrame.to_pickle Flat file ~~~~~~~~~ @@ -21,6 +22,7 @@ Flat file read_table read_csv + DataFrame.to_csv read_fwf Clipboard @@ -29,6 +31,7 @@ Clipboard :toctree: api/ read_clipboard + DataFrame.to_clipboard Excel ~~~~~ @@ -36,14 +39,26 @@ Excel :toctree: api/ read_excel + DataFrame.to_excel ExcelFile.parse +.. currentmodule:: pandas.io.formats.style + +.. autosummary:: + :toctree: api/ + + Styler.to_excel + +.. currentmodule:: pandas + .. autosummary:: :toctree: api/ :template: autosummary/class_without_autosummary.rst ExcelWriter +.. currentmodule:: pandas + JSON ~~~~ .. autosummary:: @@ -51,6 +66,7 @@ JSON read_json json_normalize + DataFrame.to_json .. currentmodule:: pandas.io.json @@ -67,6 +83,16 @@ HTML :toctree: api/ read_html + DataFrame.to_html + +.. currentmodule:: pandas.io.formats.style + +.. autosummary:: + :toctree: api/ + + Styler.to_html + +.. currentmodule:: pandas XML ~~~~ @@ -74,6 +100,23 @@ XML :toctree: api/ read_xml + DataFrame.to_xml + +Latex +~~~~~ +.. autosummary:: + :toctree: api/ + + DataFrame.to_latex + +.. currentmodule:: pandas.io.formats.style + +.. autosummary:: + :toctree: api/ + + Styler.to_latex + +.. currentmodule:: pandas HDFStore: PyTables (HDF5) ~~~~~~~~~~~~~~~~~~~~~~~~~ @@ -92,7 +135,7 @@ HDFStore: PyTables (HDF5) .. warning:: - One can store a subclass of ``DataFrame`` or ``Series`` to HDF5, + One can store a subclass of :class:`DataFrame` or :class:`Series` to HDF5, but the type of the subclass is lost upon storing. Feather @@ -101,6 +144,7 @@ Feather :toctree: api/ read_feather + DataFrame.to_feather Parquet ~~~~~~~ @@ -108,6 +152,7 @@ Parquet :toctree: api/ read_parquet + DataFrame.to_parquet ORC ~~~ @@ -138,6 +183,7 @@ SQL read_sql_table read_sql_query read_sql + DataFrame.to_sql Google BigQuery ~~~~~~~~~~~~~~~ @@ -152,6 +198,7 @@ STATA :toctree: api/ read_stata + DataFrame.to_stata .. currentmodule:: pandas.io.stata diff --git a/doc/source/reference/series.rst b/doc/source/reference/series.rst index 3ff3b2bb53fda..a60dab549e66d 100644 --- a/doc/source/reference/series.rst +++ b/doc/source/reference/series.rst @@ -427,6 +427,8 @@ strings and apply several methods to it. 
These can be accessed like Series.str.normalize Series.str.pad Series.str.partition + Series.str.removeprefix + Series.str.removesuffix Series.str.repeat Series.str.replace Series.str.rfind diff --git a/doc/source/reference/style.rst b/doc/source/reference/style.rst index 5a2ff803f0323..a739993e4d376 100644 --- a/doc/source/reference/style.rst +++ b/doc/source/reference/style.rst @@ -24,6 +24,8 @@ Styler properties Styler.env Styler.template_html + Styler.template_html_style + Styler.template_html_table Styler.template_latex Styler.loader @@ -34,13 +36,17 @@ Style application Styler.apply Styler.applymap - Styler.where + Styler.apply_index + Styler.applymap_index Styler.format + Styler.format_index + Styler.hide Styler.set_td_classes Styler.set_table_styles Styler.set_table_attributes Styler.set_tooltips Styler.set_caption + Styler.set_sticky Styler.set_properties Styler.set_uuid Styler.clear @@ -65,9 +71,8 @@ Style export and import .. autosummary:: :toctree: api/ - Styler.render - Styler.export - Styler.use Styler.to_html - Styler.to_excel Styler.to_latex + Styler.to_excel + Styler.export + Styler.use diff --git a/doc/source/reference/window.rst b/doc/source/reference/window.rst index a255b3ae8081e..0be3184a9356c 100644 --- a/doc/source/reference/window.rst +++ b/doc/source/reference/window.rst @@ -35,6 +35,7 @@ Rolling window functions Rolling.aggregate Rolling.quantile Rolling.sem + Rolling.rank .. _api.functions_window: @@ -75,6 +76,7 @@ Expanding window functions Expanding.aggregate Expanding.quantile Expanding.sem + Expanding.rank .. _api.functions_ewm: @@ -86,6 +88,7 @@ Exponentially-weighted window functions :toctree: api/ ExponentialMovingWindow.mean + ExponentialMovingWindow.sum ExponentialMovingWindow.std ExponentialMovingWindow.var ExponentialMovingWindow.corr diff --git a/doc/source/user_guide/10min.rst b/doc/source/user_guide/10min.rst index 2b329ef362354..08488a33936f0 100644 --- a/doc/source/user_guide/10min.rst +++ b/doc/source/user_guide/10min.rst @@ -19,7 +19,7 @@ Customarily, we import as follows: Object creation --------------- -See the :ref:`Data Structure Intro section `. +See the :ref:`Intro to data structures section `. Creating a :class:`Series` by passing a list of values, letting pandas create a default integer index: @@ -39,7 +39,8 @@ and labeled columns: df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list("ABCD")) df -Creating a :class:`DataFrame` by passing a dict of objects that can be converted to series-like. +Creating a :class:`DataFrame` by passing a dictionary of objects that can be +converted into a series-like structure: .. ipython:: python @@ -56,7 +57,7 @@ Creating a :class:`DataFrame` by passing a dict of objects that can be converted df2 The columns of the resulting :class:`DataFrame` have different -:ref:`dtypes `. +:ref:`dtypes `: .. ipython:: python @@ -116,14 +117,14 @@ of the dtypes in the DataFrame. This may end up being ``object``, which requires casting every value to a Python object. For ``df``, our :class:`DataFrame` of all floating-point values, -:meth:`DataFrame.to_numpy` is fast and doesn't require copying data. +:meth:`DataFrame.to_numpy` is fast and doesn't require copying data: .. ipython:: python df.to_numpy() For ``df2``, the :class:`DataFrame` with multiple dtypes, -:meth:`DataFrame.to_numpy` is relatively expensive. +:meth:`DataFrame.to_numpy` is relatively expensive: .. ipython:: python @@ -180,7 +181,7 @@ equivalent to ``df.A``: df["A"] -Selecting via ``[]``, which slices the rows. 
+Selecting via ``[]``, which slices the rows: .. ipython:: python @@ -278,13 +279,13 @@ For getting fast access to a scalar (equivalent to the prior method): Boolean indexing ~~~~~~~~~~~~~~~~ -Using a single column's values to select data. +Using a single column's values to select data: .. ipython:: python df[df["A"] > 0] -Selecting values from a DataFrame where a boolean condition is met. +Selecting values from a DataFrame where a boolean condition is met: .. ipython:: python @@ -303,7 +304,7 @@ Setting ~~~~~~~ Setting a new column automatically aligns the data -by the indexes. +by the indexes: .. ipython:: python @@ -329,13 +330,13 @@ Setting by assigning with a NumPy array: df.loc[:, "D"] = np.array([5] * len(df)) -The result of the prior setting operations. +The result of the prior setting operations: .. ipython:: python df -A ``where`` operation with setting. +A ``where`` operation with setting: .. ipython:: python @@ -352,7 +353,7 @@ default not included in computations. See the :ref:`Missing Data section `. Reindexing allows you to change/add/delete the index on a specified axis. This -returns a copy of the data. +returns a copy of the data: .. ipython:: python @@ -360,19 +361,19 @@ returns a copy of the data. df1.loc[dates[0] : dates[1], "E"] = 1 df1 -To drop any rows that have missing data. +To drop any rows that have missing data: .. ipython:: python df1.dropna(how="any") -Filling missing data. +Filling missing data: .. ipython:: python df1.fillna(value=5) -To get the boolean mask where values are ``nan``. +To get the boolean mask where values are ``nan``: .. ipython:: python @@ -402,7 +403,7 @@ Same operation on the other axis: df.mean(1) Operating with objects that have different dimensionality and need alignment. -In addition, pandas automatically broadcasts along the specified dimension. +In addition, pandas automatically broadcasts along the specified dimension: .. ipython:: python @@ -477,7 +478,6 @@ Concatenating pandas objects together with :func:`concat`: a row requires a copy, and may be expensive. We recommend passing a pre-built list of records to the :class:`DataFrame` constructor instead of building a :class:`DataFrame` by iteratively appending records to it. - See :ref:`Appending to dataframe ` for more. Join ~~~~ @@ -527,14 +527,14 @@ See the :ref:`Grouping section `. df Grouping and then applying the :meth:`~pandas.core.groupby.GroupBy.sum` function to the resulting -groups. +groups: .. ipython:: python df.groupby("A").sum() Grouping by multiple columns forms a hierarchical index, and again we can -apply the :meth:`~pandas.core.groupby.GroupBy.sum` function. +apply the :meth:`~pandas.core.groupby.GroupBy.sum` function: .. ipython:: python @@ -565,7 +565,7 @@ Stack df2 The :meth:`~DataFrame.stack` method "compresses" a level in the DataFrame's -columns. +columns: .. ipython:: python @@ -673,7 +673,7 @@ pandas can include categorical data in a :class:`DataFrame`. For full docs, see -Convert the raw grades to a categorical data type. +Converting the raw grades to a categorical data type: .. ipython:: python @@ -681,13 +681,13 @@ Convert the raw grades to a categorical data type. df["grade"] Rename the categories to more meaningful names (assigning to -:meth:`Series.cat.categories` is in place!). +:meth:`Series.cat.categories` is in place!): .. ipython:: python df["grade"].cat.categories = ["very good", "good", "very bad"] -Reorder the categories and simultaneously add the missing categories (methods under :meth:`Series.cat` return a new :class:`Series` by default). 
+Reorder the categories and simultaneously add the missing categories (methods under :meth:`Series.cat` return a new :class:`Series` by default): .. ipython:: python @@ -696,13 +696,13 @@ Reorder the categories and simultaneously add the missing categories (methods un ) df["grade"] -Sorting is per order in the categories, not lexical order. +Sorting is per order in the categories, not lexical order: .. ipython:: python df.sort_values(by="grade") -Grouping by a categorical column also shows empty categories. +Grouping by a categorical column also shows empty categories: .. ipython:: python @@ -722,7 +722,7 @@ We use the standard convention for referencing the matplotlib API: plt.close("all") -The :meth:`~plt.close` method is used to `close `__ a figure window. +The :meth:`~plt.close` method is used to `close `__ a figure window: .. ipython:: python @@ -732,6 +732,14 @@ The :meth:`~plt.close` method is used to `close `__ to show it or +`matplotlib.pyplot.savefig `__ to write it to a file. + +.. ipython:: python + + plt.show(); + On a DataFrame, the :meth:`~DataFrame.plot` method is a convenience to plot all of the columns with labels: @@ -754,13 +762,13 @@ Getting data in/out CSV ~~~ -:ref:`Writing to a csv file. ` +:ref:`Writing to a csv file: ` .. ipython:: python df.to_csv("foo.csv") -:ref:`Reading from a csv file. ` +:ref:`Reading from a csv file: ` .. ipython:: python @@ -778,13 +786,13 @@ HDF5 Reading and writing to :ref:`HDFStores `. -Writing to a HDF5 Store. +Writing to a HDF5 Store: .. ipython:: python df.to_hdf("foo.h5", "df") -Reading from a HDF5 Store. +Reading from a HDF5 Store: .. ipython:: python @@ -800,13 +808,13 @@ Excel Reading and writing to :ref:`MS Excel `. -Writing to an excel file. +Writing to an excel file: .. ipython:: python df.to_excel("foo.xlsx", sheet_name="Sheet1") -Reading from an excel file. +Reading from an excel file: .. ipython:: python diff --git a/doc/source/user_guide/advanced.rst b/doc/source/user_guide/advanced.rst index 3b33ebe701037..b8df21ab5a5b4 100644 --- a/doc/source/user_guide/advanced.rst +++ b/doc/source/user_guide/advanced.rst @@ -7,7 +7,7 @@ MultiIndex / advanced indexing ****************************** This section covers :ref:`indexing with a MultiIndex ` -and :ref:`other advanced indexing features `. +and :ref:`other advanced indexing features `. See the :ref:`Indexing and Selecting Data ` for general indexing documentation. @@ -738,7 +738,7 @@ faster than fancy indexing. %timeit ser.iloc[indexer] %timeit ser.take(indexer) -.. _indexing.index_types: +.. _advanced.index_types: Index types ----------- @@ -749,7 +749,7 @@ and documentation about ``TimedeltaIndex`` is found :ref:`here `__. -.. _indexing.float64index: +.. _advanced.float64index: Float64Index ~~~~~~~~~~~~ +.. deprecated:: 1.4.0 + :class:`Index` will become the default index type for numeric types in the future + instead of ``Int64Index``, ``Float64Index`` and ``UInt64Index`` and those index types + are therefore deprecated and will be removed in a future version of Pandas. + ``RangeIndex`` will not be removed as it represents an optimized version of an integer index. + By default a :class:`Float64Index` will be automatically created when passing floating, or mixed-integer-floating values in index creation. This enables a pure label-based slicing paradigm that makes ``[],ix,loc`` for scalar indexing and slicing work exactly the same. @@ -956,6 +968,7 @@ If you need integer based selection, you should use ``iloc``: dfir.iloc[0:5] + .. 
_advanced.intervalindex: IntervalIndex @@ -1233,5 +1246,5 @@ This is because the (re)indexing operations above silently inserts ``NaNs`` and changes accordingly. This can cause some issues when using ``numpy`` ``ufuncs`` such as ``numpy.logical_and``. -See the `this old issue `__ for a more +See the :issue:`2388` for a more detailed discussion. diff --git a/doc/source/user_guide/basics.rst b/doc/source/user_guide/basics.rst index 82c8a27bec3a5..a34d4891b9d77 100644 --- a/doc/source/user_guide/basics.rst +++ b/doc/source/user_guide/basics.rst @@ -848,8 +848,8 @@ have introduced the popular ``(%>%)`` (read pipe) operator for R_. The implementation of ``pipe`` here is quite clean and feels right at home in Python. We encourage you to view the source code of :meth:`~DataFrame.pipe`. -.. _dplyr: https://github.com/hadley/dplyr -.. _magrittr: https://github.com/smbache/magrittr +.. _dplyr: https://github.com/tidyverse/dplyr +.. _magrittr: https://github.com/tidyverse/magrittr .. _R: https://www.r-project.org @@ -1045,6 +1045,9 @@ not noted for a particular column will be ``NaN``: Mixed dtypes ++++++++++++ +.. deprecated:: 1.4.0 + Attempting to determine which columns cannot be aggregated and silently dropping them from the results is deprecated and will be removed in a future version. If any porition of the columns or operations provided fail, the call to ``.agg`` will raise. + When presented with mixed dtypes that cannot aggregate, ``.agg`` will only take the valid aggregations. This is similar to how ``.groupby.agg`` works. @@ -1061,6 +1064,7 @@ aggregations. This is similar to how ``.groupby.agg`` works. mdf.dtypes .. ipython:: python + :okwarning: mdf.agg(["min", "sum"]) @@ -2047,32 +2051,33 @@ The following table lists all of pandas extension types. For methods requiring ` arguments, strings can be specified as indicated. See the respective documentation sections for more on each type. 
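For quick orientation, the string aliases listed in the table below can be passed anywhere a ``dtype`` is accepted; a minimal sketch (the aliases are the documented ones, the data is made up for illustration):

.. code-block:: python

    import pandas as pd

    pd.Series([1, 2, None], dtype="Int64")        # nullable integer
    pd.Series(["a", "b", None], dtype="string")   # StringDtype
    pd.Series([True, None], dtype="boolean")      # BooleanDtype
    pd.Series(["x", "y", "x"], dtype="category")  # CategoricalDtype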
-+-------------------+---------------------------+--------------------+-------------------------------+-----------------------------------------+-------------------------------+ -| Kind of Data | Data Type | Scalar | Array | String Aliases | Documentation | -+===================+===========================+====================+===============================+=========================================+===============================+ -| tz-aware datetime | :class:`DatetimeTZDtype` | :class:`Timestamp` | :class:`arrays.DatetimeArray` | ``'datetime64[ns, ]'`` | :ref:`timeseries.timezone` | -+-------------------+---------------------------+--------------------+-------------------------------+-----------------------------------------+-------------------------------+ -| Categorical | :class:`CategoricalDtype` | (none) | :class:`Categorical` | ``'category'`` | :ref:`categorical` | -+-------------------+---------------------------+--------------------+-------------------------------+-----------------------------------------+-------------------------------+ -| period | :class:`PeriodDtype` | :class:`Period` | :class:`arrays.PeriodArray` | ``'period[]'``, | :ref:`timeseries.periods` | -| (time spans) | | | | ``'Period[]'`` | | -+-------------------+---------------------------+--------------------+-------------------------------+-----------------------------------------+-------------------------------+ -| sparse | :class:`SparseDtype` | (none) | :class:`arrays.SparseArray` | ``'Sparse'``, ``'Sparse[int]'``, | :ref:`sparse` | -| | | | | ``'Sparse[float]'`` | | -+-------------------+---------------------------+--------------------+-------------------------------+-----------------------------------------+-------------------------------+ -| intervals | :class:`IntervalDtype` | :class:`Interval` | :class:`arrays.IntervalArray` | ``'interval'``, ``'Interval'``, | :ref:`advanced.intervalindex` | -| | | | | ``'Interval[]'``, | | -| | | | | ``'Interval[datetime64[ns, ]]'``, | | -| | | | | ``'Interval[timedelta64[]]'`` | | -+-------------------+---------------------------+--------------------+-------------------------------+-----------------------------------------+-------------------------------+ -| nullable integer + :class:`Int64Dtype`, ... 
| (none) | :class:`arrays.IntegerArray` | ``'Int8'``, ``'Int16'``, ``'Int32'``, | :ref:`integer_na` | -| | | | | ``'Int64'``, ``'UInt8'``, ``'UInt16'``, | | -| | | | | ``'UInt32'``, ``'UInt64'`` | | -+-------------------+---------------------------+--------------------+-------------------------------+-----------------------------------------+-------------------------------+ -| Strings | :class:`StringDtype` | :class:`str` | :class:`arrays.StringArray` | ``'string'`` | :ref:`text` | -+-------------------+---------------------------+--------------------+-------------------------------+-----------------------------------------+-------------------------------+ -| Boolean (with NA) | :class:`BooleanDtype` | :class:`bool` | :class:`arrays.BooleanArray` | ``'boolean'`` | :ref:`api.arrays.bool` | -+-------------------+---------------------------+--------------------+-------------------------------+-----------------------------------------+-------------------------------+ ++-------------------------------------------------+---------------------------+--------------------+-------------------------------+----------------------------------------+ +| Kind of Data | Data Type | Scalar | Array | String Aliases | ++=================================================+===============+===========+========+===========+===============================+========================================+ +| :ref:`tz-aware datetime ` | :class:`DatetimeTZDtype` | :class:`Timestamp` | :class:`arrays.DatetimeArray` | ``'datetime64[ns, ]'`` | +| | | | | | ++-------------------------------------------------+---------------+-----------+--------------------+-------------------------------+----------------------------------------+ +| :ref:`Categorical ` | :class:`CategoricalDtype` | (none) | :class:`Categorical` | ``'category'`` | ++-------------------------------------------------+---------------------------+--------------------+-------------------------------+----------------------------------------+ +| :ref:`period (time spans) ` | :class:`PeriodDtype` | :class:`Period` | :class:`arrays.PeriodArray` | ``'period[]'``, | +| | | | ``'Period[]'`` | | ++-------------------------------------------------+---------------------------+--------------------+-------------------------------+----------------------------------------+ +| :ref:`sparse ` | :class:`SparseDtype` | (none) | :class:`arrays.SparseArray` | ``'Sparse'``, ``'Sparse[int]'``, | +| | | | | ``'Sparse[float]'`` | ++-------------------------------------------------+---------------------------+--------------------+-------------------------------+----------------------------------------+ +| :ref:`intervals ` | :class:`IntervalDtype` | :class:`Interval` | :class:`arrays.IntervalArray` | ``'interval'``, ``'Interval'``, | +| | | | | ``'Interval[]'``, | +| | | | | ``'Interval[datetime64[ns, ]]'``, | +| | | | | ``'Interval[timedelta64[]]'`` | ++-------------------------------------------------+---------------------------+--------------------+-------------------------------+----------------------------------------+ +| :ref:`nullable integer ` | :class:`Int64Dtype`, ... 
| (none) | :class:`arrays.IntegerArray` | ``'Int8'``, ``'Int16'``, ``'Int32'``, | +| | | | | ``'Int64'``, ``'UInt8'``, ``'UInt16'``,| +| | | | | ``'UInt32'``, ``'UInt64'`` | ++-------------------------------------------------+---------------------------+--------------------+-------------------------------+----------------------------------------+ +| :ref:`Strings ` | :class:`StringDtype` | :class:`str` | :class:`arrays.StringArray` | ``'string'`` | ++-------------------------------------------------+---------------------------+--------------------+-------------------------------+----------------------------------------+ +| :ref:`Boolean (with NA) ` | :class:`BooleanDtype` | :class:`bool` | :class:`arrays.BooleanArray` | ``'boolean'`` | ++-------------------------------------------------+---------------------------+--------------------+-------------------------------+----------------------------------------+ pandas has two ways to store strings. diff --git a/doc/source/user_guide/boolean.rst b/doc/source/user_guide/boolean.rst index 76c922fcef638..54c67674b890c 100644 --- a/doc/source/user_guide/boolean.rst +++ b/doc/source/user_guide/boolean.rst @@ -12,6 +12,11 @@ Nullable Boolean data type ************************** +.. note:: + + BooleanArray is currently experimental. Its API or implementation may + change without warning. + .. versionadded:: 1.0.0 diff --git a/doc/source/user_guide/categorical.rst b/doc/source/user_guide/categorical.rst index f65638cd78a2b..0105cf99193dd 100644 --- a/doc/source/user_guide/categorical.rst +++ b/doc/source/user_guide/categorical.rst @@ -777,8 +777,8 @@ value is included in the ``categories``: df try: df.iloc[2:4, :] = [["c", 3], ["c", 3]] - except ValueError as e: - print("ValueError:", str(e)) + except TypeError as e: + print("TypeError:", str(e)) Setting values by assigning categorical data will also check that the ``categories`` match: @@ -788,8 +788,8 @@ Setting values by assigning categorical data will also check that the ``categori df try: df.loc["j":"k", "cats"] = pd.Categorical(["b", "b"], categories=["a", "b", "c"]) - except ValueError as e: - print("ValueError:", str(e)) + except TypeError as e: + print("TypeError:", str(e)) Assigning a ``Categorical`` to parts of a column of other types will use the values: @@ -1141,7 +1141,7 @@ Categorical index ``CategoricalIndex`` is a type of index that is useful for supporting indexing with duplicates. This is a container around a ``Categorical`` and allows efficient indexing and storage of an index with a large number of duplicated elements. -See the :ref:`advanced indexing docs ` for a more detailed +See the :ref:`advanced indexing docs ` for a more detailed explanation. Setting the index will create a ``CategoricalIndex``: diff --git a/doc/source/user_guide/cookbook.rst b/doc/source/user_guide/cookbook.rst index e1aae0fd481b1..f88f4a9708c45 100644 --- a/doc/source/user_guide/cookbook.rst +++ b/doc/source/user_guide/cookbook.rst @@ -193,8 +193,7 @@ The :ref:`indexing ` docs. df[(df.AAA <= 6) & (df.index.isin([0, 2, 4]))] -`Use loc for label-oriented slicing and iloc positional slicing -`__ +Use loc for label-oriented slicing and iloc positional slicing :issue:`2904` .. ipython:: python @@ -229,7 +228,7 @@ Ambiguity arises when an index consists of integers with a non-zero start or non df2.loc[1:3] # Label-oriented `Using inverse operator (~) to take the complement of a mask -`__ +`__ .. ipython:: python @@ -259,7 +258,7 @@ New columns df `Keep other columns when using min() with groupby -`__ +`__ .. 
ipython:: python @@ -389,14 +388,13 @@ Sorting ******* `Sort by specific column or an ordered list of columns, with a MultiIndex -`__ +`__ .. ipython:: python df.sort_values(by=("Labs", "II"), ascending=False) -`Partial selection, the need for sortedness; -`__ +Partial selection, the need for sortedness :issue:`2995` Levels ****** @@ -405,7 +403,7 @@ Levels `__ `Flatten Hierarchical columns -`__ +`__ .. _cookbook.missing_data: @@ -556,7 +554,7 @@ Unlike agg, apply's callable is passed a sub-DataFrame which gives you access to ts `Create a value counts column and reassign back to the DataFrame -`__ +`__ .. ipython:: python @@ -663,7 +661,7 @@ Pivot The :ref:`Pivot ` docs. `Partial sums and subtotals -`__ +`__ .. ipython:: python @@ -870,7 +868,7 @@ Timeseries `__ `Constructing a datetime range that excludes weekends and includes only certain times -`__ +`__ `Vectorized Lookup `__ @@ -910,8 +908,7 @@ Valid frequency arguments to Grouper :ref:`Timeseries `__ -`Using TimeGrouper and another grouping to create subgroups, then apply a custom function -`__ +Using TimeGrouper and another grouping to create subgroups, then apply a custom function :issue:`3791` `Resampling with custom periods `__ @@ -929,9 +926,9 @@ Valid frequency arguments to Grouper :ref:`Timeseries ` docs. The :ref:`Join ` docs. +The :ref:`Join ` docs. -`Append two dataframes with overlapping index (emulate R rbind) +`Concatenate two dataframes with overlapping index (emulate R rbind) `__ .. ipython:: python @@ -944,11 +941,10 @@ Depending on df construction, ``ignore_index`` may be needed .. ipython:: python - df = df1.append(df2, ignore_index=True) + df = pd.concat([df1, df2], ignore_index=True) df -`Self Join of a DataFrame -`__ +Self Join of a DataFrame :issue:`2996` .. ipython:: python @@ -1038,7 +1034,7 @@ Data in/out ----------- `Performance comparison of SQL vs HDF5 -`__ +`__ .. _cookbook.csv: @@ -1070,14 +1066,7 @@ using that handle to read. `Inferring dtypes from a file `__ -`Dealing with bad lines -`__ - -`Dealing with bad lines II -`__ - -`Reading CSV with Unix timestamps and converting to local timezone -`__ +Dealing with bad lines :issue:`2886` `Write a multi-row index CSV without writing duplicates `__ @@ -1211,6 +1200,8 @@ The :ref:`Excel ` docs `Modifying formatting in XlsxWriter output `__ +Loading only visible sheets :issue:`19842#issuecomment-892150745` + .. _cookbook.html: HTML @@ -1229,8 +1220,7 @@ The :ref:`HDFStores ` docs `Simple queries with a Timestamp Index `__ -`Managing heterogeneous data using a linked multiple table hierarchy -`__ +Managing heterogeneous data using a linked multiple table hierarchy :issue:`3032` `Merging on-disk tables with millions of rows `__ @@ -1250,7 +1240,7 @@ csv file and creating a store by chunks, with date parsing as well. `__ `Large Data work flows -`__ +`__ `Reading in a sequence of files, then providing a global unique index to a store while appending `__ @@ -1300,7 +1290,7 @@ is closed. .. ipython:: python - store = pd.HDFStore("test.h5", "w", diver="H5FD_CORE") + store = pd.HDFStore("test.h5", "w", driver="H5FD_CORE") df = pd.DataFrame(np.random.randn(8, 3)) store["test"] = df @@ -1381,7 +1371,7 @@ Computation ----------- `Numerical integration (sample-based) of a time series -`__ +`__ Correlation *********** diff --git a/doc/source/user_guide/duplicates.rst b/doc/source/user_guide/duplicates.rst index 7cda067fb24ad..36c2ec53d58b4 100644 --- a/doc/source/user_guide/duplicates.rst +++ b/doc/source/user_guide/duplicates.rst @@ -28,6 +28,7 @@ duplicates present. 
The output can't be determined, and so pandas raises. .. ipython:: python :okexcept: + :okwarning: s1 = pd.Series([0, 1, 2], index=["a", "b", "b"]) s1.reindex(["a", "b", "c"]) diff --git a/doc/source/user_guide/enhancingperf.rst b/doc/source/user_guide/enhancingperf.rst index aa9a1ba6d6bf0..eef41eb4be80e 100644 --- a/doc/source/user_guide/enhancingperf.rst +++ b/doc/source/user_guide/enhancingperf.rst @@ -35,7 +35,7 @@ by trying to remove for-loops and making use of NumPy vectorization. It's always optimising in Python first. This tutorial walks through a "typical" process of cythonizing a slow computation. -We use an `example from the Cython documentation `__ +We use an `example from the Cython documentation `__ but in the context of pandas. Our final cythonized solution is around 100 times faster than the pure Python solution. @@ -302,28 +302,63 @@ For more about ``boundscheck`` and ``wraparound``, see the Cython docs on .. _enhancingperf.numba: -Using Numba ------------ +Numba (JIT compilation) +----------------------- -A recent alternative to statically compiling Cython code, is to use a *dynamic jit-compiler*, Numba. +An alternative to statically compiling Cython code is to use a dynamic just-in-time (JIT) compiler with `Numba `__. -Numba gives you the power to speed up your applications with high performance functions written directly in Python. With a few annotations, array-oriented and math-heavy Python code can be just-in-time compiled to native machine instructions, similar in performance to C, C++ and Fortran, without having to switch languages or Python interpreters. +Numba allows you to write a pure Python function which can be JIT compiled to native machine instructions, similar in performance to C, C++ and Fortran, +by decorating your function with ``@jit``. -Numba works by generating optimized machine code using the LLVM compiler infrastructure at import time, runtime, or statically (using the included pycc tool). Numba supports compilation of Python to run on either CPU or GPU hardware, and is designed to integrate with the Python scientific software stack. +Numba works by generating optimized machine code using the LLVM compiler infrastructure at import time, runtime, or statically (using the included pycc tool). +Numba supports compilation of Python to run on either CPU or GPU hardware and is designed to integrate with the Python scientific software stack. .. note:: - You will need to install Numba. This is easy with ``conda``, by using: ``conda install numba``, see :ref:`installing using miniconda`. + The ``@jit`` compilation will add overhead to the runtime of the function, so performance benefits may not be realized especially when using small data sets. + Consider `caching `__ your function to avoid compilation overhead each time your function is run. -.. note:: +Numba can be used in 2 ways with pandas: + +#. Specify the ``engine="numba"`` keyword in select pandas methods +#. Define your own Python function decorated with ``@jit`` and pass the underlying NumPy array of :class:`Series` or :class:`Dataframe` (using ``to_numpy()``) into the function + +pandas Numba Engine +~~~~~~~~~~~~~~~~~~~ + +If Numba is installed, one can specify ``engine="numba"`` in select pandas methods to execute the method using Numba. +Methods that support ``engine="numba"`` will also have an ``engine_kwargs`` keyword that accepts a dictionary that allows one to specify +``"nogil"``, ``"nopython"`` and ``"parallel"`` keys with boolean values to pass into the ``@jit`` decorator. 
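As an illustration only (a minimal sketch assuming Numba is installed; any method that supports ``engine="numba"`` could stand in for ``rolling().apply``):

.. code-block:: python

    import numpy as np
    import pandas as pd

    s = pd.Series(np.random.randn(1_000_000))

    def mean_plus_one(x):
        return np.mean(x) + 1.0

    # JIT-compiled through Numba; the engine_kwargs keys mirror the @jit arguments
    s.rolling(100).apply(
        mean_plus_one,
        raw=True,
        engine="numba",
        engine_kwargs={"nogil": False, "nopython": True, "parallel": False},
    )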
+If ``engine_kwargs`` is not specified, it defaults to ``{"nogil": False, "nopython": True, "parallel": False}`` unless otherwise specified. + +In terms of performance, **the first time a function is run using the Numba engine will be slow** +as Numba will have some function compilation overhead. However, the JIT compiled functions are cached, +and subsequent calls will be fast. In general, the Numba engine is performant with +a larger amount of data points (e.g. 1+ million). - As of Numba version 0.20, pandas objects cannot be passed directly to Numba-compiled functions. Instead, one must pass the NumPy array underlying the pandas object to the Numba-compiled function as demonstrated below. +.. code-block:: ipython + + In [1]: data = pd.Series(range(1_000_000)) # noqa: E225 + + In [2]: roll = data.rolling(10) -Jit -~~~ + In [3]: def f(x): + ...: return np.sum(x) + 5 + # Run the first time, compilation time will affect performance + In [4]: %timeit -r 1 -n 1 roll.apply(f, engine='numba', raw=True) + 1.23 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each) + # Function is cached and performance will improve + In [5]: %timeit roll.apply(f, engine='numba', raw=True) + 188 ms ± 1.93 ms per loop (mean ± std. dev. of 7 runs, 10 loops each) -We demonstrate how to use Numba to just-in-time compile our code. We simply -take the plain Python code from above and annotate with the ``@jit`` decorator. + In [6]: %timeit roll.apply(f, engine='cython', raw=True) + 3.92 s ± 59 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) + +Custom Function Examples +~~~~~~~~~~~~~~~~~~~~~~~~ + +A custom Python function decorated with ``@jit`` can be used with pandas objects by passing their NumPy array +representations with ``to_numpy()``. .. code-block:: python @@ -360,8 +395,6 @@ take the plain Python code from above and annotate with the ``@jit`` decorator. ) return pd.Series(result, index=df.index, name="result") -Note that we directly pass NumPy arrays to the Numba function. ``compute_numba`` is just a wrapper that provides a -nicer interface by passing/returning pandas objects. .. code-block:: ipython @@ -370,19 +403,9 @@ nicer interface by passing/returning pandas objects. In this example, using Numba was faster than Cython. -Numba as an argument -~~~~~~~~~~~~~~~~~~~~ - -Additionally, we can leverage the power of `Numba `__ -by calling it as an argument in :meth:`~Rolling.apply`. See :ref:`Computation tools -` for an extensive example. - -Vectorize -~~~~~~~~~ - Numba can also be used to write vectorized functions that do not require the user to explicitly loop over the observations of a vector; a vectorized function will be applied to each row automatically. -Consider the following toy example of doubling each observation: +Consider the following example of doubling each observation: .. code-block:: python @@ -414,25 +437,23 @@ Consider the following toy example of doubling each observation: Caveats ~~~~~~~ -.. note:: - - Numba will execute on any function, but can only accelerate certain classes of functions. - Numba is best at accelerating functions that apply numerical functions to NumPy -arrays. When passed a function that only uses operations it knows how to -accelerate, it will execute in ``nopython`` mode. - -If Numba is passed a function that includes something it doesn't know how to -work with -- a category that currently includes sets, lists, dictionaries, or -string functions -- it will revert to ``object mode``. 
In ``object mode``, -Numba will execute but your code will not speed up significantly. If you would +arrays. If you try to ``@jit`` a function that contains unsupported `Python `__ +or `NumPy `__ +code, compilation will revert to `object mode `__ which +will most likely not speed up your function. If you would prefer that Numba throw an error if it cannot compile a function in a way that speeds up your code, pass Numba the argument -``nopython=True`` (e.g. ``@numba.jit(nopython=True)``). For more on +``nopython=True`` (e.g. ``@jit(nopython=True)``). For more on troubleshooting Numba modes, see the `Numba troubleshooting page `__. -Read more in the `Numba docs `__. +Using ``parallel=True`` (e.g. ``@jit(parallel=True)``) may result in a ``SIGABRT`` if the threading layer leads to unsafe +behavior. You can first `specify a safe threading layer `__ +before running a JIT function with ``parallel=True``. + +Generally, if you encounter a segfault (``SIGSEGV``) while using Numba, please report the issue +to the `Numba issue tracker `__. .. _enhancingperf.eval: diff --git a/doc/source/user_guide/gotchas.rst b/doc/source/user_guide/gotchas.rst index 1de978b195382..bf764316df373 100644 --- a/doc/source/user_guide/gotchas.rst +++ b/doc/source/user_guide/gotchas.rst @@ -341,7 +341,7 @@ Why not make NumPy like R? Many people have suggested that NumPy should simply emulate the ``NA`` support present in the more domain-specific statistical programming language `R -`__. Part of the reason is the NumPy type hierarchy: +`__. Part of the reason is the NumPy type hierarchy: .. csv-table:: :header: "Typeclass","Dtypes" diff --git a/doc/source/user_guide/groupby.rst b/doc/source/user_guide/groupby.rst index 870ec6763c72f..0fb59c50efa74 100644 --- a/doc/source/user_guide/groupby.rst +++ b/doc/source/user_guide/groupby.rst @@ -391,7 +391,6 @@ something different for each of the columns. Thus, using ``[]`` similar to getting a column from a DataFrame, you can do: .. ipython:: python - :suppress: df = pd.DataFrame( { @@ -402,7 +401,7 @@ getting a column from a DataFrame, you can do: } ) -.. ipython:: python + df grouped = df.groupby(["A"]) grouped_C = grouped["C"] @@ -579,7 +578,7 @@ column, which produces an aggregated result with a hierarchical index: .. ipython:: python - grouped.agg([np.sum, np.mean, np.std]) + grouped[["C", "D"]].agg([np.sum, np.mean, np.std]) The resulting aggregations are named for the functions themselves. If you @@ -598,7 +597,7 @@ For a grouped ``DataFrame``, you can rename in a similar manner: .. ipython:: python ( - grouped.agg([np.sum, np.mean, np.std]).rename( + grouped[["C", "D"]].agg([np.sum, np.mean, np.std]).rename( columns={"sum": "foo", "mean": "bar", "std": "baz"} ) ) @@ -1106,11 +1105,9 @@ Numba Accelerated Routines .. versionadded:: 1.1 If `Numba `__ is installed as an optional dependency, the ``transform`` and -``aggregate`` methods support ``engine='numba'`` and ``engine_kwargs`` arguments. The ``engine_kwargs`` -argument is a dictionary of keyword arguments that will be passed into the -`numba.jit decorator `__. -These keyword arguments will be applied to the passed function. Currently only ``nogil``, ``nopython``, -and ``parallel`` are supported, and their default values are set to ``False``, ``True`` and ``False`` respectively. +``aggregate`` methods support ``engine='numba'`` and ``engine_kwargs`` arguments. +See :ref:`enhancing performance with Numba ` for general usage of the arguments +and performance considerations.
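A minimal sketch of such a grouped aggregation (assuming Numba is installed; the ``values, index`` signature it relies on is explained next, and the frame here is invented for illustration):

.. code-block:: python

    import pandas as pd

    df = pd.DataFrame({"key": ["a", "a", "b", "b"], "data": [1.0, 2.0, 3.0, 4.0]})

    def scaled_sum(values, index):  # signature required by the numba engine
        total = 0.0
        for value in values:
            total += value * 2.0
        return total

    df.groupby("key")["data"].agg(scaled_sum, engine="numba")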
The function signature must start with ``values, index`` **exactly** as the data belonging to each group will be passed into ``values``, and the group index will be passed into ``index``. @@ -1121,52 +1118,6 @@ will be passed into ``values``, and the group index will be passed into ``index` data and group index will be passed as NumPy arrays to the JITed user defined function, and no alternative execution attempts will be tried. -.. note:: - - In terms of performance, **the first time a function is run using the Numba engine will be slow** - as Numba will have some function compilation overhead. However, the compiled functions are cached, - and subsequent calls will be fast. In general, the Numba engine is performant with - a larger amount of data points (e.g. 1+ million). - -.. code-block:: ipython - - In [1]: N = 10 ** 3 - - In [2]: data = {0: [str(i) for i in range(100)] * N, 1: list(range(100)) * N} - - In [3]: df = pd.DataFrame(data, columns=[0, 1]) - - In [4]: def f_numba(values, index): - ...: total = 0 - ...: for i, value in enumerate(values): - ...: if i % 2: - ...: total += value + 5 - ...: else: - ...: total += value * 2 - ...: return total - ...: - - In [5]: def f_cython(values): - ...: total = 0 - ...: for i, value in enumerate(values): - ...: if i % 2: - ...: total += value + 5 - ...: else: - ...: total += value * 2 - ...: return total - ...: - - In [6]: groupby = df.groupby(0) - # Run the first time, compilation time will affect performance - In [7]: %timeit -r 1 -n 1 groupby.aggregate(f_numba, engine='numba') # noqa: E225 - 2.14 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each) - # Function is cached and performance will improve - In [8]: %timeit groupby.aggregate(f_numba, engine='numba') - 4.93 ms ± 32.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) - - In [9]: %timeit groupby.aggregate(f_cython, engine='cython') - 18.6 ms ± 84.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) - Other useful features --------------------- diff --git a/doc/source/user_guide/indexing.rst b/doc/source/user_guide/indexing.rst index dc66303a44f53..e41f938170417 100644 --- a/doc/source/user_guide/indexing.rst +++ b/doc/source/user_guide/indexing.rst @@ -701,7 +701,7 @@ Having a duplicated index will raise for a ``.reindex()``: .. code-block:: ipython In [17]: s.reindex(labels) - ValueError: cannot reindex from a duplicate axis + ValueError: cannot reindex on an axis with duplicate labels Generally, you can intersect the desired labels with the current axis, and then reindex. @@ -717,7 +717,7 @@ However, this would *still* raise if your resulting index is duplicated. In [41]: labels = ['a', 'd'] In [42]: s.loc[s.index.intersection(labels)].reindex(labels) - ValueError: cannot reindex from a duplicate axis + ValueError: cannot reindex on an axis with duplicate labels .. _indexing.basics.partial_setting: @@ -997,6 +997,15 @@ a list of items you want to check for. df.isin(values) +To return the DataFrame of booleans where the values are *not* in the original DataFrame, +use the ``~`` operator: + +.. ipython:: python + + values = {'ids': ['a', 'b'], 'vals': [1, 3]} + + ~df.isin(values) + Combine DataFrame's ``isin`` with the ``any()`` and ``all()`` methods to quickly select subsets of your data that meet a given criteria. 
To select a row where each column meets its own criterion: @@ -1523,8 +1532,8 @@ Looking up values by index/column labels ---------------------------------------- Sometimes you want to extract a set of values given a sequence of row labels -and column labels, this can be achieved by ``DataFrame.melt`` combined by filtering the corresponding -rows with ``DataFrame.loc``. For instance: +and column labels, this can be achieved by ``pandas.factorize`` and NumPy indexing. +For instance: .. ipython:: python @@ -1532,9 +1541,8 @@ rows with ``DataFrame.loc``. For instance: 'A': [80, 23, np.nan, 22], 'B': [80, 55, 76, 67]}) df - melt = df.melt('col') - melt = melt.loc[melt['col'] == melt['variable'], 'value'] - melt.reset_index(drop=True) + idx, cols = pd.factorize(df['col']) + df.reindex(cols, axis=1).to_numpy()[np.arange(len(df)), idx] Formerly this could be achieved with the dedicated ``DataFrame.lookup`` method which was deprecated in version 1.2.0. diff --git a/doc/source/user_guide/io.rst b/doc/source/user_guide/io.rst index c2b030d732ba9..be761bb97f320 100644 --- a/doc/source/user_guide/io.rst +++ b/doc/source/user_guide/io.rst @@ -26,7 +26,7 @@ The pandas I/O API is a set of top level ``reader`` functions accessed like text;`XML `__;:ref:`read_xml`;:ref:`to_xml` text; Local clipboard;:ref:`read_clipboard`;:ref:`to_clipboard` binary;`MS Excel `__;:ref:`read_excel`;:ref:`to_excel` - binary;`OpenDocument `__;:ref:`read_excel`; + binary;`OpenDocument `__;:ref:`read_excel`; binary;`HDF5 Format `__;:ref:`read_hdf`;:ref:`to_hdf` binary;`Feather Format `__;:ref:`read_feather`;:ref:`to_feather` binary;`Parquet Format `__;:ref:`read_parquet`;:ref:`to_parquet` @@ -102,7 +102,7 @@ header : int or list of ints, default ``'infer'`` names : array-like, default ``None`` List of column names to use. If file contains no header row, then you should explicitly pass ``header=None``. Duplicates in this list are not allowed. -index_col : int, str, sequence of int / str, or False, default ``None`` +index_col : int, str, sequence of int / str, or False, optional, default ``None`` Column(s) to use as the row labels of the ``DataFrame``, either given as string name or column index. If a sequence of int / str is given, a MultiIndex is used. @@ -116,11 +116,19 @@ index_col : int, str, sequence of int / str, or False, default ``None`` of the data file, then a default index is used. If it is larger, then the first columns are used as index so that the remaining number of fields in the body are equal to the number of fields in the header. + + The first row after the header is used to determine the number of columns, + which will go into the index. If the subsequent rows contain less columns + than the first row, they are filled with ``NaN``. + + This can be avoided through ``usecols``. This ensures that the columns are + taken as is and the trailing data are ignored. usecols : list-like or callable, default ``None`` Return a subset of the columns. If list-like, all elements must either be positional (i.e. integer indices into the document columns) or strings that correspond to column names provided either by the user in ``names`` or - inferred from the document header row(s). For example, a valid list-like + inferred from the document header row(s). If ``names`` are given, the document + header row(s) are not taken into account. For example, a valid list-like ``usecols`` parameter would be ``[0, 1, 2]`` or ``['foo', 'bar', 'baz']``. Element order is ignored, so ``usecols=[0, 1]`` is the same as ``[1, 0]``. 
To @@ -142,11 +150,29 @@ usecols : list-like or callable, default ``None`` pd.read_csv(StringIO(data)) pd.read_csv(StringIO(data), usecols=lambda x: x.upper() in ["COL1", "COL3"]) - Using this parameter results in much faster parsing time and lower memory usage. + Using this parameter results in much faster parsing time and lower memory usage + when using the c engine. The Python engine loads the data first before deciding + which columns to drop. squeeze : boolean, default ``False`` If the parsed data only contains one column then return a ``Series``. + + .. deprecated:: 1.4.0 + Append ``.squeeze("columns")`` to the call to ``{func_name}`` to squeeze + the data. prefix : str, default ``None`` Prefix to add to column numbers when no header, e.g. 'X' for X0, X1, ... + + .. deprecated:: 1.4.0 + Use a list comprehension on the DataFrame's columns after calling ``read_csv``. + + .. ipython:: python + + data = "col1,col2,col3\na,b,1" + + df = pd.read_csv(StringIO(data)) + df.columns = [f"pre_{col}" for col in df.columns] + df + mangle_dupe_cols : boolean, default ``True`` Duplicate columns will be specified as 'X', 'X.1'...'X.N', rather than 'X'...'X'. Passing in ``False`` will cause data to be overwritten if there are duplicate @@ -160,9 +186,15 @@ dtype : Type name or dict of column -> type, default ``None`` (unsupported with ``engine='python'``). Use ``str`` or ``object`` together with suitable ``na_values`` settings to preserve and not interpret dtype. -engine : {``'c'``, ``'python'``} - Parser engine to use. The C engine is faster while the Python engine is - currently more feature-complete. +engine : {``'c'``, ``'python'``, ``'pyarrow'``} + Parser engine to use. The C and pyarrow engines are faster, while the python engine + is currently more feature-complete. Multithreading is currently only supported by + the pyarrow engine. + + .. versionadded:: 1.4.0 + + The "pyarrow" engine was added as an *experimental* engine, and some features + are unsupported, or may not work correctly, with this engine. converters : dict, default ``None`` Dict of functions for converting values in certain columns. Keys can either be integers or column labels. @@ -284,14 +316,14 @@ chunksize : int, default ``None`` Quoting, compression, and file format +++++++++++++++++++++++++++++++++++++ -compression : {``'infer'``, ``'gzip'``, ``'bz2'``, ``'zip'``, ``'xz'``, ``None``, ``dict``}, default ``'infer'`` +compression : {``'infer'``, ``'gzip'``, ``'bz2'``, ``'zip'``, ``'xz'``, ``'zstd'``, ``None``, ``dict``}, default ``'infer'`` For on-the-fly decompression of on-disk data. If 'infer', then use gzip, - bz2, zip, or xz if ``filepath_or_buffer`` is path-like ending in '.gz', '.bz2', - '.zip', or '.xz', respectively, and no decompression otherwise. If using 'zip', + bz2, zip, xz, or zstandard if ``filepath_or_buffer`` is path-like ending in '.gz', '.bz2', + '.zip', '.xz', '.zst', respectively, and no decompression otherwise. If using 'zip', the ZIP file must contain only one data file to be read in. Set to ``None`` for no decompression. Can also be a dict with key ``'method'`` - set to one of {``'zip'``, ``'gzip'``, ``'bz2'``} and other key-value pairs are - forwarded to ``zipfile.ZipFile``, ``gzip.GzipFile``, or ``bz2.BZ2File``. + set to one of {``'zip'``, ``'gzip'``, ``'bz2'``, ``'zstd'``} and other key-value pairs are + forwarded to ``zipfile.ZipFile``, ``gzip.GzipFile``, ``bz2.BZ2File``, or ``zstandard.ZstdDecompressor``. 
As an example, the following could be passed for faster compression and to create a reproducible gzip archive: ``compression={'method': 'gzip', 'compresslevel': 1, 'mtime': 1}``. @@ -342,7 +374,7 @@ dialect : str or :class:`python:csv.Dialect` instance, default ``None`` Error handling ++++++++++++++ -error_bad_lines : boolean, default ``None`` +error_bad_lines : boolean, optional, default ``None`` Lines with too many fields (e.g. a csv line with too many commas) will by default cause an exception to be raised, and no ``DataFrame`` will be returned. If ``False``, then these "bad lines" will dropped from the @@ -352,7 +384,7 @@ error_bad_lines : boolean, default ``None`` .. deprecated:: 1.3.0 The ``on_bad_lines`` parameter should be used instead to specify behavior upon encountering a bad line instead. -warn_bad_lines : boolean, default ``None`` +warn_bad_lines : boolean, optional, default ``None`` If error_bad_lines is ``False``, and warn_bad_lines is ``True``, a warning for each "bad line" will be output. @@ -1202,6 +1234,10 @@ Returning Series Using the ``squeeze`` keyword, the parser will return output with a single column as a ``Series``: +.. deprecated:: 1.4.0 + Users should append ``.squeeze("columns")`` to the DataFrame returned by + ``read_csv`` instead. + .. ipython:: python :suppress: @@ -1211,6 +1247,7 @@ as a ``Series``: fh.write(data) .. ipython:: python + :okwarning: print(open("tmp.csv").read()) @@ -1268,19 +1305,57 @@ You can elect to skip bad lines: 0 1 2 3 1 8 9 10 +Or pass a callable function to handle the bad line if ``engine="python"``. +The bad line will be a list of strings that was split by the ``sep``: + +.. code-block:: ipython + + In [29]: external_list = [] + + In [30]: def bad_lines_func(line): + ...: external_list.append(line) + ...: return line[-3:] + + In [31]: pd.read_csv(StringIO(data), on_bad_lines=bad_lines_func, engine="python") + Out[31]: + a b c + 0 1 2 3 + 1 5 6 7 + 2 8 9 10 + + In [32]: external_list + Out[32]: [4, 5, 6, 7] + + .. versionadded:: 1.4.0 + + You can also use the ``usecols`` parameter to eliminate extraneous column data that appear in some lines but not others: .. code-block:: ipython - In [30]: pd.read_csv(StringIO(data), usecols=[0, 1, 2]) + In [33]: pd.read_csv(StringIO(data), usecols=[0, 1, 2]) - Out[30]: + Out[33]: a b c 0 1 2 3 1 4 5 6 2 8 9 10 +In case you want to keep all data including the lines with too many fields, you can +specify a sufficient number of ``names``. This ensures that lines with not enough +fields are filled with ``NaN``. + +.. code-block:: ipython + + In [34]: pd.read_csv(StringIO(data), names=['a', 'b', 'c', 'd']) + + Out[34]: + a b c d + 0 1 2 3 NaN + 1 4 5 6 7 + 2 8 9 10 NaN + .. _io.dialect: Dialect @@ -1622,11 +1697,17 @@ Specifying ``iterator=True`` will also return the ``TextFileReader`` object: Specifying the parser engine '''''''''''''''''''''''''''' -Under the hood pandas uses a fast and efficient parser implemented in C as well -as a Python implementation which is currently more feature-complete. Where -possible pandas uses the C parser (specified as ``engine='c'``), but may fall -back to Python if C-unsupported options are specified. Currently, C-unsupported -options include: +Pandas currently supports three engines, the C engine, the python engine, and an experimental +pyarrow engine (requires the ``pyarrow`` package). In general, the pyarrow engine is fastest +on larger workloads and is equivalent in speed to the C engine on most other workloads. 
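For instance, opting into the pyarrow engine is a one-keyword change (a minimal sketch; ``large.csv`` is a hypothetical file and the ``pyarrow`` package must be installed):

.. code-block:: python

    import pandas as pd

    # experimental in 1.4.0; several read_csv options are unsupported with this engine
    df = pd.read_csv("large.csv", engine="pyarrow")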
+The python engine tends to be slower than the pyarrow and C engines on most workloads. However, +the pyarrow engine is much less robust than the C engine, which lacks a few features compared to the +Python engine. + +Where possible, pandas uses the C parser (specified as ``engine='c'``), but it may fall +back to Python if C-unsupported options are specified. + +Currently, options unsupported by the C and pyarrow engines include: * ``sep`` other than a single character (e.g. regex separators) * ``skipfooter`` @@ -1635,6 +1716,32 @@ options include: Specifying any of the above options will produce a ``ParserWarning`` unless the python engine is selected explicitly using ``engine='python'``. +Options that are unsupported by the pyarrow engine which are not covered by the list above include: + +* ``float_precision`` +* ``chunksize`` +* ``comment`` +* ``nrows`` +* ``thousands`` +* ``memory_map`` +* ``dialect`` +* ``warn_bad_lines`` +* ``error_bad_lines`` +* ``on_bad_lines`` +* ``delim_whitespace`` +* ``quoting`` +* ``lineterminator`` +* ``converters`` +* ``decimal`` +* ``iterator`` +* ``dayfirst`` +* ``infer_datetime_format`` +* ``verbose`` +* ``skipinitialspace`` +* ``low_memory`` + +Specifying these options with ``engine='pyarrow'`` will raise a ``ValueError``. + .. _io.remote: Reading/writing remote files @@ -1820,6 +1927,7 @@ with optional parameters: ``index``; dict like {index -> {column -> value}} ``columns``; dict like {column -> {index -> value}} ``values``; just the values array + ``table``; adhering to the JSON `Table Schema`_ * ``date_format`` : string, type of date conversion, 'epoch' for timestamp, 'iso' for ISO8601. * ``double_precision`` : The number of decimal places to use when encoding floating point values, default 10. @@ -2394,7 +2502,6 @@ A few notes on the generated table schema: * For ``MultiIndex``, ``mi.names`` is used. If any level has no name, then ``level_`` is used. - ``read_json`` also accepts ``orient='table'`` as an argument. This allows for the preservation of metadata such as dtypes and index names in a round-trippable manner. @@ -2436,8 +2543,18 @@ indicate missing values and the subsequent read cannot distinguish the intent. os.remove("test.json") +When using ``orient='table'`` along with user-defined ``ExtensionArray``, +the generated schema will contain an additional ``extDtype`` key in the respective +``fields`` element. This extra key is not standard but does enable JSON roundtrips +for extension types (e.g. ``read_json(df.to_json(orient="table"), orient="table")``). + +The ``extDtype`` key carries the name of the extension, if you have properly registered +the ``ExtensionDtype``, pandas will use said name to perform a lookup into the registry +and re-convert the serialized data into your custom dtype. + .. _Table Schema: https://specs.frictionlessdata.io/table-schema/ + HTML ---- @@ -2464,14 +2581,16 @@ Read a URL with no options: .. ipython:: python - url = ( - "https://raw.githubusercontent.com/pandas-dev/pandas/master/" - "pandas/tests/io/data/html/spam.html" - ) + url = "https://www.fdic.gov/resources/resolutions/bank-failures/failed-bank-list" dfs = pd.read_html(url) dfs -Read in the content of the "banklist.html" file and pass it to ``read_html`` +.. note:: + + The data from the above URL changes every Monday so the resulting data above + and the data below may be slightly different. 
+ +Read in the content of the file from the above URL and pass it to ``read_html`` as a string: .. ipython:: python @@ -2503,7 +2622,7 @@ You can even pass in an instance of ``StringIO`` if you so desire: that having so many network-accessing functions slows down the documentation build. If you spot an error or an example that doesn't run, please do not hesitate to report it over on `pandas GitHub issues page - `__. + `__. Read a URL and match a table that contains specific text: @@ -2977,6 +3096,7 @@ Read in the content of the "books.xml" as instance of ``StringIO`` or Even read XML from AWS S3 buckets such as Python Software Foundation's IRS 990 Form: .. ipython:: python + :okwarning: df = pd.read_xml( "s3://irs-form-990/201923199349319487_public.xml", @@ -3460,9 +3580,9 @@ with ``on_demand=True``. Specifying sheets +++++++++++++++++ -.. note :: The second argument is ``sheet_name``, not to be confused with ``ExcelFile.sheet_names``. +.. note:: The second argument is ``sheet_name``, not to be confused with ``ExcelFile.sheet_names``. -.. note :: An ExcelFile's attribute ``sheet_names`` provides access to a list of sheets. +.. note:: An ExcelFile's attribute ``sheet_names`` provides access to a list of sheets. * The arguments ``sheet_name`` allows specifying the sheet or sheets to read. * The default value for ``sheet_name`` is 0, indicating to read the first sheet @@ -3936,18 +4056,18 @@ Compressed pickle files ''''''''''''''''''''''' :func:`read_pickle`, :meth:`DataFrame.to_pickle` and :meth:`Series.to_pickle` can read -and write compressed pickle files. The compression types of ``gzip``, ``bz2``, ``xz`` are supported for reading and writing. +and write compressed pickle files. The compression types of ``gzip``, ``bz2``, ``xz``, ``zstd`` are supported for reading and writing. The ``zip`` file format only supports reading and must contain only one data file to be read. The compression type can be an explicit parameter or be inferred from the file extension. -If 'infer', then use ``gzip``, ``bz2``, ``zip``, or ``xz`` if filename ends in ``'.gz'``, ``'.bz2'``, ``'.zip'``, or -``'.xz'``, respectively. +If 'infer', then use ``gzip``, ``bz2``, ``zip``, ``xz``, ``zstd`` if filename ends in ``'.gz'``, ``'.bz2'``, ``'.zip'``, +``'.xz'``, or ``'.zst'``, respectively. The compression parameter can also be a ``dict`` in order to pass options to the compression protocol. It must have a ``'method'`` key set to the name of the compression protocol, which must be one of -{``'zip'``, ``'gzip'``, ``'bz2'``}. All other key-value pairs are passed to +{``'zip'``, ``'gzip'``, ``'bz2'``, ``'xz'``, ``'zstd'``}. All other key-value pairs are passed to the underlying compression library. .. ipython:: python @@ -4872,7 +4992,7 @@ control compression: ``complevel`` and ``complib``. rates but is somewhat slow. - `lzo `_: Fast compression and decompression. - - `bzip2 `_: Good compression rates. + - `bzip2 `_: Good compression rates. - `blosc `_: Fast compression and decompression. @@ -4881,10 +5001,10 @@ control compression: ``complevel`` and ``complib``. - `blosc:blosclz `_ This is the default compressor for ``blosc`` - `blosc:lz4 - `_: + `_: A compact, very popular and fast compressor. - `blosc:lz4hc - `_: + `_: A tweaked version of LZ4, produces better compression ratios at the expense of speed. - `blosc:snappy `_: @@ -5226,15 +5346,6 @@ Several caveats: See the `Full Documentation `__. -.. 
ipython:: python - :suppress: - - import warnings - - # This can be removed once building with pyarrow >=0.15.0 - warnings.filterwarnings("ignore", "The Sparse", FutureWarning) - - .. ipython:: python df = pd.DataFrame( @@ -5477,7 +5588,7 @@ SQL queries The :mod:`pandas.io.sql` module provides a collection of query wrappers to both facilitate data retrieval and to reduce dependency on DB-specific API. Database abstraction is provided by SQLAlchemy if installed. In addition you will need a driver library for -your database. Examples of such drivers are `psycopg2 `__ +your database. Examples of such drivers are `psycopg2 `__ for PostgreSQL or `pymysql `__ for MySQL. For `SQLite `__ this is included in Python's standard library by default. @@ -5509,7 +5620,7 @@ The key functions are: the provided input (database table name or sql query). Table names do not need to be quoted if they have special characters. -In the following example, we use the `SQlite `__ SQL database +In the following example, we use the `SQlite `__ SQL database engine. You can use a temporary SQLite database where data are stored in "memory". @@ -5526,13 +5637,23 @@ below and the SQLAlchemy `documentation `__ +for an explanation of how the database connection is handled. .. code-block:: python with engine.connect() as conn, conn.begin(): data = pd.read_sql_table("data", conn) +.. warning:: + + When you open a connection to a database you are also responsible for closing it. + Side effects of leaving a connection open may include locking the database or + other breaking behaviour. + Writing DataFrames '''''''''''''''''' @@ -5663,7 +5784,7 @@ Possible values are: specific backend dialect features. Example of a callable using PostgreSQL `COPY clause -`__:: +`__:: # Alternative to_sql() *method* for DBs that support COPY FROM import csv @@ -5689,7 +5810,7 @@ Example of a callable using PostgreSQL `COPY clause writer.writerows(data_iter) s_buf.seek(0) - columns = ', '.join('"{}"'.format(k) for k in keys) + columns = ', '.join(['"{}"'.format(k) for k in keys]) if table.schema: table_name = '{}.{}'.format(table.schema, table.name) else: @@ -5925,7 +6046,7 @@ pandas integrates with this external package. if ``pandas-gbq`` is installed, yo use the pandas methods ``pd.read_gbq`` and ``DataFrame.to_gbq``, which will call the respective functions from ``pandas-gbq``. -Full documentation can be found `here `__. +Full documentation can be found `here `__. .. _io.stata: @@ -6133,7 +6254,7 @@ Obtain an iterator and read an XPORT file 100,000 lines at a time: The specification_ for the xport file format is available from the SAS web site. -.. _specification: https://support.sas.com/techsup/technote/ts140.pdf +.. _specification: https://support.sas.com/content/dam/SAS/support/en/technical-papers/record-layout-of-a-sas-version-5-or-6-data-set-in-sas-transport-xport-format.pdf No official documentation is available for the SAS7BDAT format. @@ -6175,7 +6296,7 @@ avoid converting categorical columns into ``pd.Categorical``: More information about the SAV and ZSAV file formats is available here_. -.. _here: https://www.ibm.com/support/knowledgecenter/en/SSLVMB_22.0.0/com.ibm.spss.statistics.help/spss/base/savedatatypes.htm +.. _here: https://www.ibm.com/docs/en/spss-statistics/22.0.0 .. 
_io.other: @@ -6193,7 +6314,7 @@ xarray_ provides data structures inspired by the pandas ``DataFrame`` for workin with multi-dimensional datasets, with a focus on the netCDF file format and easy conversion to and from pandas. -.. _xarray: https://xarray.pydata.org/ +.. _xarray: https://xarray.pydata.org/en/stable/ .. _io.perf: diff --git a/doc/source/user_guide/merging.rst b/doc/source/user_guide/merging.rst index 09b3d3a8c96df..bbca5773afdfe 100644 --- a/doc/source/user_guide/merging.rst +++ b/doc/source/user_guide/merging.rst @@ -237,59 +237,6 @@ Similarly, we could index before the concatenation: p.plot([df1, df4], result, labels=["df1", "df4"], vertical=False); plt.close("all"); -.. _merging.concatenation: - -Concatenating using ``append`` -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -A useful shortcut to :func:`~pandas.concat` are the :meth:`~DataFrame.append` -instance methods on ``Series`` and ``DataFrame``. These methods actually predated -``concat``. They concatenate along ``axis=0``, namely the index: - -.. ipython:: python - - result = df1.append(df2) - -.. ipython:: python - :suppress: - - @savefig merging_append1.png - p.plot([df1, df2], result, labels=["df1", "df2"], vertical=True); - plt.close("all"); - -In the case of ``DataFrame``, the indexes must be disjoint but the columns do not -need to be: - -.. ipython:: python - - result = df1.append(df4, sort=False) - -.. ipython:: python - :suppress: - - @savefig merging_append2.png - p.plot([df1, df4], result, labels=["df1", "df4"], vertical=True); - plt.close("all"); - -``append`` may take multiple objects to concatenate: - -.. ipython:: python - - result = df1.append([df2, df3]) - -.. ipython:: python - :suppress: - - @savefig merging_append3.png - p.plot([df1, df2, df3], result, labels=["df1", "df2", "df3"], vertical=True); - plt.close("all"); - -.. note:: - - Unlike the :py:meth:`~list.append` method, which appends to the original list - and returns ``None``, :meth:`~DataFrame.append` here **does not** modify - ``df1`` and returns its copy with ``df2`` appended. - .. _merging.ignore_index: Ignoring indexes on the concatenation axis @@ -309,19 +256,6 @@ do this, use the ``ignore_index`` argument: p.plot([df1, df4], result, labels=["df1", "df4"], vertical=True); plt.close("all"); -This is also a valid argument to :meth:`DataFrame.append`: - -.. ipython:: python - - result = df1.append(df4, ignore_index=True, sort=False) - -.. ipython:: python - :suppress: - - @savefig merging_append_ignore_index.png - p.plot([df1, df4], result, labels=["df1", "df4"], vertical=True); - plt.close("all"); - .. _merging.mixed_ndims: Concatenating with mixed ndims @@ -473,14 +407,13 @@ like GroupBy where the order of a categorical variable is meaningful. Appending rows to a DataFrame ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -While not especially efficient (since a new object must be created), you can -append a single row to a ``DataFrame`` by passing a ``Series`` or dict to -``append``, which returns a new ``DataFrame`` as above. +If you have a series that you want to append as a single row to a ``DataFrame``, you can convert the row into a +``DataFrame`` and use ``concat`` .. ipython:: python s2 = pd.Series(["X0", "X1", "X2", "X3"], index=["A", "B", "C", "D"]) - result = df1.append(s2, ignore_index=True) + result = pd.concat([df1, s2.to_frame().T], ignore_index=True) .. 
ipython:: python :suppress: @@ -493,20 +426,6 @@ You should use ``ignore_index`` with this method to instruct DataFrame to discard its index. If you wish to preserve the index, you should construct an appropriately-indexed DataFrame and append or concatenate those objects. -You can also pass a list of dicts or Series: - -.. ipython:: python - - dicts = [{"A": 1, "B": 2, "C": 3, "X": 4}, {"A": 5, "B": 6, "C": 7, "Y": 8}] - result = df1.append(dicts, ignore_index=True, sort=False) - -.. ipython:: python - :suppress: - - @savefig merging_append_dits.png - p.plot([df1, pd.DataFrame(dicts)], result, labels=["df1", "dicts"], vertical=True); - plt.close("all"); - .. _merging.join: Database-style DataFrame or named Series joining/merging @@ -562,7 +481,7 @@ all standard database join operations between ``DataFrame`` or named ``Series`` (hierarchical), the number of levels must match the number of join keys from the right DataFrame or Series. * ``right_index``: Same usage as ``left_index`` for the right DataFrame or Series -* ``how``: One of ``'left'``, ``'right'``, ``'outer'``, ``'inner'``. Defaults +* ``how``: One of ``'left'``, ``'right'``, ``'outer'``, ``'inner'``, ``'cross'``. Defaults to ``inner``. See below for more detailed description of each method. * ``sort``: Sort the result DataFrame by the join keys in lexicographical order. Defaults to ``True``, setting to ``False`` will improve performance @@ -707,6 +626,7 @@ either the left or right tables, the values in the joined table will be ``right``, ``RIGHT OUTER JOIN``, Use keys from right frame only ``outer``, ``FULL OUTER JOIN``, Use union of keys from both frames ``inner``, ``INNER JOIN``, Use intersection of keys from both frames + ``cross``, ``CROSS JOIN``, Create the cartesian product of rows of both frames .. ipython:: python @@ -751,6 +671,17 @@ either the left or right tables, the values in the joined table will be p.plot([left, right], result, labels=["left", "right"], vertical=False); plt.close("all"); +.. ipython:: python + + result = pd.merge(left, right, how="cross") + +.. ipython:: python + :suppress: + + @savefig merging_merge_cross.png + p.plot([left, right], result, labels=["left", "right"], vertical=False); + plt.close("all"); + You can merge a mult-indexed Series and a DataFrame, if the names of the MultiIndex correspond to the columns from the DataFrame. Transform the Series to a DataFrame using :meth:`Series.reset_index` before merging, diff --git a/doc/source/user_guide/missing_data.rst b/doc/source/user_guide/missing_data.rst index 1621b37f31b23..3052ee3001681 100644 --- a/doc/source/user_guide/missing_data.rst +++ b/doc/source/user_guide/missing_data.rst @@ -470,7 +470,7 @@ at the new values. interp_s = ser.reindex(new_index).interpolate(method="pchip") interp_s[49:51] -.. _scipy: https://www.scipy.org +.. _scipy: https://scipy.org/ .. _documentation: https://docs.scipy.org/doc/scipy/reference/interpolate.html#univariate-interpolation .. _guide: https://docs.scipy.org/doc/scipy/reference/tutorial/interpolate.html @@ -580,7 +580,7 @@ String/regular expression replacement backslashes than strings without this prefix. Backslashes in raw strings will be interpreted as an escaped backslash, e.g., ``r'\' == '\\'``. You should `read about them - `__ + `__ if this is unclear. Replace the '.' 
with ``NaN`` (str -> str): diff --git a/doc/source/user_guide/options.rst b/doc/source/user_guide/options.rst index 62a347acdaa34..f6e98b68afdc9 100644 --- a/doc/source/user_guide/options.rst +++ b/doc/source/user_guide/options.rst @@ -31,18 +31,18 @@ namespace: * :func:`~pandas.option_context` - execute a codeblock with a set of options that revert to prior settings after execution. -**Note:** Developers can check out `pandas/core/config_init.py `_ for more information. +**Note:** Developers can check out `pandas/core/config_init.py `_ for more information. All of the functions above accept a regexp pattern (``re.search`` style) as an argument, and so passing in a substring will work - as long as it is unambiguous: .. ipython:: python - pd.get_option("display.max_rows") - pd.set_option("display.max_rows", 101) - pd.get_option("display.max_rows") - pd.set_option("max_r", 102) - pd.get_option("display.max_rows") + pd.get_option("display.chop_threshold") + pd.set_option("display.chop_threshold", 2) + pd.get_option("display.chop_threshold") + pd.set_option("chop", 4) + pd.get_option("display.chop_threshold") The following will **not work** because it matches multiple option names, e.g. @@ -52,7 +52,7 @@ The following will **not work** because it matches multiple option names, e.g. :okexcept: try: - pd.get_option("column") + pd.get_option("max") except KeyError as e: print(e) @@ -138,7 +138,7 @@ More information can be found in the `IPython documentation import pandas as pd pd.set_option("display.max_rows", 999) - pd.set_option("precision", 5) + pd.set_option("display.precision", 5) .. _options.frequently_used: @@ -153,27 +153,27 @@ lines are replaced by an ellipsis. .. ipython:: python df = pd.DataFrame(np.random.randn(7, 2)) - pd.set_option("max_rows", 7) + pd.set_option("display.max_rows", 7) df - pd.set_option("max_rows", 5) + pd.set_option("display.max_rows", 5) df - pd.reset_option("max_rows") + pd.reset_option("display.max_rows") Once the ``display.max_rows`` is exceeded, the ``display.min_rows`` options determines how many rows are shown in the truncated repr. .. ipython:: python - pd.set_option("max_rows", 8) - pd.set_option("min_rows", 4) + pd.set_option("display.max_rows", 8) + pd.set_option("display.min_rows", 4) # below max_rows -> all rows shown df = pd.DataFrame(np.random.randn(7, 2)) df # above max_rows -> only min_rows (4) rows shown df = pd.DataFrame(np.random.randn(9, 2)) df - pd.reset_option("max_rows") - pd.reset_option("min_rows") + pd.reset_option("display.max_rows") + pd.reset_option("display.min_rows") ``display.expand_frame_repr`` allows for the representation of dataframes to stretch across pages, wrapped over the full column vs row-wise. @@ -193,13 +193,13 @@ dataframes to stretch across pages, wrapped over the full column vs row-wise. .. ipython:: python df = pd.DataFrame(np.random.randn(10, 10)) - pd.set_option("max_rows", 5) + pd.set_option("display.max_rows", 5) pd.set_option("large_repr", "truncate") df pd.set_option("large_repr", "info") df pd.reset_option("large_repr") - pd.reset_option("max_rows") + pd.reset_option("display.max_rows") ``display.max_colwidth`` sets the maximum width of columns. Cells of this length or longer will be truncated with an ellipsis. @@ -253,9 +253,9 @@ This is only a suggestion. .. 
ipython:: python df = pd.DataFrame(np.random.randn(5, 5)) - pd.set_option("precision", 7) + pd.set_option("display.precision", 7) df - pd.set_option("precision", 4) + pd.set_option("display.precision", 4) df ``display.chop_threshold`` sets at what level pandas rounds to zero when @@ -430,6 +430,10 @@ display.html.use_mathjax True When True, Jupyter notebook table contents using MathJax, rendering mathematical expressions enclosed by the dollar symbol. +display.max_dir_items 100 The number of columns from a dataframe that + are added to dir. These columns can then be + suggested by tab completion. 'None' value means + unlimited. io.excel.xls.writer xlwt The default Excel writer engine for 'xls' files. @@ -487,8 +491,32 @@ styler.sparse.index True "Sparsify" MultiIndex displ elements in outer levels within groups). styler.sparse.columns True "Sparsify" MultiIndex display for columns in Styler output. +styler.render.repr html Standard output format for Styler rendered in Jupyter Notebook. + Should be one of "html" or "latex". styler.render.max_elements 262144 Maximum number of datapoints that Styler will render trimming either rows, columns or both to fit. +styler.render.max_rows None Maximum number of rows that Styler will render. By default + this is dynamic based on ``max_elements``. +styler.render.max_columns None Maximum number of columns that Styler will render. By default + this is dynamic based on ``max_elements``. +styler.render.encoding utf-8 Default encoding for output HTML or LaTeX files. +styler.format.formatter None Object to specify formatting functions to ``Styler.format``. +styler.format.na_rep None String representation for missing data. +styler.format.precision 6 Precision to display floating point and complex numbers. +styler.format.decimal . String representation for decimal point separator for floating + point and complex numbers. +styler.format.thousands None String representation for thousands separator for + integers, and floating point and complex numbers. +styler.format.escape None Whether to escape "html" or "latex" special + characters in the display representation. +styler.html.mathjax True If set to False will render specific CSS classes to + table attributes that will prevent Mathjax from rendering + in Jupyter Notebook. +styler.latex.multicol_align r Alignment of headers in a merged column due to sparsification. Can be in {"r", "c", "l"}. +styler.latex.multirow_align c Alignment of index labels in a merged row due to sparsification. Can be in {"c", "t", "b"}. +styler.latex.environment None If given will replace the default ``\\begin{table}`` environment. If "longtable" is specified + this will render with a specific "longtable" template with longtable features. +styler.latex.hrules False If set to True will render ``\\toprule``, ``\\midrule``, and ``\bottomrule`` by default. ======================================= ============ ================================== diff --git a/doc/source/user_guide/reshaping.rst b/doc/source/user_guide/reshaping.rst index 7d1d03fe020a6..e74272c825e46 100644 --- a/doc/source/user_guide/reshaping.rst +++ b/doc/source/user_guide/reshaping.rst @@ -474,7 +474,15 @@ rows and columns: .. ipython:: python - df.pivot_table(index=["A", "B"], columns="C", margins=True, aggfunc=np.std) + table = df.pivot_table(index=["A", "B"], columns="C", margins=True, aggfunc=np.std) + table + +Additionally, you can call :meth:`DataFrame.stack` to display a pivoted DataFrame +as having a multi-level index: + +.. ipython:: python + + table.stack() .. 
_reshaping.crosstabulations: diff --git a/doc/source/user_guide/sparse.rst b/doc/source/user_guide/sparse.rst index 52d99533c1f60..b2b3678e48534 100644 --- a/doc/source/user_guide/sparse.rst +++ b/doc/source/user_guide/sparse.rst @@ -294,7 +294,7 @@ To convert back to sparse SciPy matrix in COO format, you can use the :meth:`Dat sdf.sparse.to_coo() -meth:`Series.sparse.to_coo` is implemented for transforming a ``Series`` with sparse values indexed by a :class:`MultiIndex` to a :class:`scipy.sparse.coo_matrix`. +:meth:`Series.sparse.to_coo` is implemented for transforming a ``Series`` with sparse values indexed by a :class:`MultiIndex` to a :class:`scipy.sparse.coo_matrix`. The method requires a ``MultiIndex`` with two or more levels. diff --git a/doc/source/user_guide/style.ipynb b/doc/source/user_guide/style.ipynb index 7d8d8e90dfbda..2dc40e67338b4 100644 --- a/doc/source/user_guide/style.ipynb +++ b/doc/source/user_guide/style.ipynb @@ -11,7 +11,7 @@ "\n", "[styler]: ../reference/api/pandas.io.formats.style.Styler.rst\n", "[viz]: visualization.rst\n", - "[download]: https://nbviewer.ipython.org/github/pandas-dev/pandas/blob/master/doc/source/user_guide/style.ipynb" + "[download]: https://nbviewer.ipython.org/github/pandas-dev/pandas/blob/main/doc/source/user_guide/style.ipynb" ] }, { @@ -49,6 +49,7 @@ "source": [ "import pandas as pd\n", "import numpy as np\n", + "import matplotlib as mpl\n", "\n", "df = pd.DataFrame([[38.0, 2.0, 18.0, 22.0, 21, np.nan],[19, 439, 6, 452, 226,232]], \n", " index=pd.Index(['Tumour (Positive)', 'Non-Tumour (Negative)'], name='Actual Label:'), \n", @@ -60,9 +61,9 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "The above output looks very similar to the standard DataFrame HTML representation. But the HTML here has already attached some CSS classes to each cell, even if we haven't yet created any styles. We can view these by calling the [.render()][render] method, which returns the raw HTML as string, which is useful for further processing or adding to a file - read on in [More about CSS and HTML](#More-About-CSS-and-HTML). Below we will show how we can use these to format the DataFrame to be more communicative. For example how we can build `s`:\n", + "The above output looks very similar to the standard DataFrame HTML representation. But the HTML here has already attached some CSS classes to each cell, even if we haven't yet created any styles. We can view these by calling the [.to_html()][tohtml] method, which returns the raw HTML as string, which is useful for further processing or adding to a file - read on in [More about CSS and HTML](#More-About-CSS-and-HTML). Below we will show how we can use these to format the DataFrame to be more communicative. For example how we can build `s`:\n", "\n", - "[render]: ../reference/api/pandas.io.formats.style.Styler.render.rst" + "[tohtml]: ../reference/api/pandas.io.formats.style.Styler.to_html.rst" ] }, { @@ -150,15 +151,14 @@ "\n", "### Formatting Values\n", "\n", - "Before adding styles it is useful to show that the [Styler][styler] can distinguish the *display* value from the *actual* value. To control the display value, the text is printed in each cell, and we can use the [.format()][formatfunc] method to manipulate this according to a [format spec string][format] or a callable that takes a single value and returns a string. It is possible to define this for the whole table or for individual columns. 
\n",
+    "Before adding styles it is useful to show that the [Styler][styler] can distinguish the *display* value from the *actual* value, in both data values and index or column headers. To control the display value, the text is printed in each cell as a string, and we can use the [.format()][formatfunc] and [.format_index()][formatfuncindex] methods to manipulate this according to a [format spec string][format] or a callable that takes a single value and returns a string. It is possible to define this for the whole table or index, for individual columns, or for MultiIndex levels. \n",
-    "Additionally, the format function has a **precision** argument to specifically help formatting floats, an **na_rep** argument to display missing data, and an **escape** argument to help displaying safe-HTML. The default formatter is configured to adopt pandas' regular `display.precision` option, controllable using `with pd.option_context('display.precision', 2):`\n",
-    "\n",
-    "Here is an example of using the multiple options to control the formatting generally and with specific column formatters.\n",
+    "Additionally, the format function has a **precision** argument to specifically help formatting floats, as well as **decimal** and **thousands** separators to support other locales, an **na_rep** argument to display missing data, and an **escape** argument to help displaying safe-HTML or safe-LaTeX. The default formatter is configured to adopt pandas' `styler.format.precision` option, controllable using `with pd.option_context('styler.format.precision', 2):` \n",
     "\n",
     "[styler]: ../reference/api/pandas.io.formats.style.Styler.rst\n",
     "[format]: https://docs.python.org/3/library/string.html#format-specification-mini-language\n",
-    "[formatfunc]: ../reference/api/pandas.io.formats.style.Styler.format.rst"
+    "[formatfunc]: ../reference/api/pandas.io.formats.style.Styler.format.rst\n",
+    "[formatfuncindex]: ../reference/api/pandas.io.formats.style.Styler.format_index.rst"
   ]
  },
 {
@@ -167,28 +167,72 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "df.style.format(precision=0, na_rep='MISSING', \n",
+    "df.style.format(precision=0, na_rep='MISSING', thousands=\" \",\n",
     "               formatter={('Decision Tree', 'Tumour'): \"{:.2f}\",\n",
-    "               ('Regression', 'Non-Tumour'): lambda x: \"$ {:,.1f}\".format(x*-1e3)\n",
+    "               ('Regression', 'Non-Tumour'): lambda x: \"$ {:,.1f}\".format(x*-1e6)\n",
     "               })"
   ]
  },
+ {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Using Styler to manipulate the display is a useful feature because maintaining the indexing and datavalues for other purposes gives greater control. You do not have to overwrite your DataFrame to display it how you like. Here is an example of using the formatting functions whilst still relying on the underlying data for indexing and calculations."
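A minimal, hypothetical sketch of the formatting options described above — ``precision``, ``na_rep``, ``thousands`` and ``format_index`` — using an invented frame and invented labels (pandas >= 1.4 assumed); the weather example that follows then shows the same idea end to end:

```python
# Invented data purely for illustration; not part of this patch.
import numpy as np
import pandas as pd

scores = pd.DataFrame(
    {"accuracy": [0.91234, np.nan], "samples": [125000, 98000.5]},
    index=["model_a", "model_b"],
)

styled = (
    scores.style
    .format(precision=2, na_rep="MISSING", thousands=",")  # display values only
    .format_index(str.upper, axis="columns")               # header labels (pandas >= 1.4)
)

# The underlying data is untouched -- only the rendered output changes.
assert styled.data.equals(scores)
```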
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "weather_df = pd.DataFrame(np.random.rand(10,2)*5, \n",
+    "                          index=pd.date_range(start=\"2021-01-01\", periods=10),\n",
+    "                          columns=[\"Tokyo\", \"Beijing\"])\n",
+    "\n",
+    "def rain_condition(v): \n",
+    "    if v < 1.75:\n",
+    "        return \"Dry\"\n",
+    "    elif v < 2.75:\n",
+    "        return \"Rain\"\n",
+    "    return \"Heavy Rain\"\n",
+    "\n",
+    "def make_pretty(styler):\n",
+    "    styler.set_caption(\"Weather Conditions\")\n",
+    "    styler.format(rain_condition)\n",
+    "    styler.format_index(lambda v: v.strftime(\"%A\"))\n",
+    "    styler.background_gradient(axis=None, vmin=1, vmax=5, cmap=\"YlGnBu\")\n",
+    "    return styler\n",
+    "\n",
+    "weather_df"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "weather_df.loc[\"2021-01-04\":\"2021-01-08\"].style.pipe(make_pretty)"
+   ]
+  },
 {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Hiding Data\n",
    "\n",
-    "The index can be hidden from rendering by calling [.hide_index()][hideidx], which might be useful if your index is integer based.\n",
+    "The index and column headers can be completely hidden, as well as subselecting rows or columns that one wishes to exclude. Both these options are performed using the same methods.\n",
    "\n",
-    "Columns can be hidden from rendering by calling [.hide_columns()][hidecols] and passing in the name of a column, or a slice of columns.\n",
+    "The index can be hidden from rendering by calling [.hide()][hideidx] without any arguments, which might be useful if your index is integer based. Similarly, column headers can be hidden by calling [.hide(axis=\"columns\")][hideidx] without any further arguments.\n",
    "\n",
-    "Hiding does not change the integer arrangement of CSS classes, e.g. hiding the first two columns of a DataFrame means the column class indexing will start at `col2`, since `col0` and `col1` are simply ignored.\n",
+    "Specific rows or columns can be hidden from rendering by calling the same [.hide()][hideidx] method and passing in a row/column label, a list-like or a slice of row/column labels for the ``subset`` argument.\n",
    "\n",
-    "We can update our `Styler` object to hide some data and format the values.\n",
+    "Hiding does not change the integer arrangement of CSS classes, e.g. hiding the first two columns of a DataFrame means the column class indexing will still start at `col2`, since `col0` and `col1` are simply ignored.\n",
    "\n",
-    "[hideidx]: ../reference/api/pandas.io.formats.style.Styler.hide_index.rst\n",
-    "[hidecols]: ../reference/api/pandas.io.formats.style.Styler.hide_columns.rst"
+    "We can update our `Styler` object from before to hide some data and format the values.\n",
+    "\n",
+    "[hideidx]: ../reference/api/pandas.io.formats.style.Styler.hide.rst"
   ]
  },
 {
@@ -197,7 +241,7 @@
   "metadata": {},
   "outputs": [],
   "source": [
-    "s = df.style.format('{:.0f}').hide_columns([('Random', 'Tumour'), ('Random', 'Non-Tumour')])\n",
+    "s = df.style.format('{:.0f}').hide([('Random', 'Tumour'), ('Random', 'Non-Tumour')], axis=\"columns\")\n",
    "s"
   ]
  },
@@ -223,13 +267,15 @@
    "\n",
    "- Using [.set_table_styles()][table] to control broader areas of the table with specified internal CSS. Although table styles allow the flexibility to add CSS selectors and properties controlling all individual parts of the table, they are unwieldy for individual cell specifications. Also, note that table styles cannot be exported to Excel. 
\n", "- Using [.set_td_classes()][td_class] to directly link either external CSS classes to your data cells or link the internal CSS classes created by [.set_table_styles()][table]. See [here](#Setting-Classes-and-Linking-to-External-CSS). These cannot be used on column header rows or indexes, and also won't export to Excel. \n", - "- Using the [.apply()][apply] and [.applymap()][applymap] functions to add direct internal CSS to specific data cells. See [here](#Styler-Functions). These cannot be used on column header rows or indexes, but only these methods add styles that will export to Excel. These methods work in a similar way to [DataFrame.apply()][dfapply] and [DataFrame.applymap()][dfapplymap].\n", + "- Using the [.apply()][apply] and [.applymap()][applymap] functions to add direct internal CSS to specific data cells. See [here](#Styler-Functions). As of v1.4.0 there are also methods that work directly on column header rows or indexes; [.apply_index()][applyindex] and [.applymap_index()][applymapindex]. Note that only these methods add styles that will export to Excel. These methods work in a similar way to [DataFrame.apply()][dfapply] and [DataFrame.applymap()][dfapplymap].\n", "\n", "[table]: ../reference/api/pandas.io.formats.style.Styler.set_table_styles.rst\n", "[styler]: ../reference/api/pandas.io.formats.style.Styler.rst\n", "[td_class]: ../reference/api/pandas.io.formats.style.Styler.set_td_classes.rst\n", "[apply]: ../reference/api/pandas.io.formats.style.Styler.apply.rst\n", "[applymap]: ../reference/api/pandas.io.formats.style.Styler.applymap.rst\n", + "[applyindex]: ../reference/api/pandas.io.formats.style.Styler.apply_index.rst\n", + "[applymapindex]: ../reference/api/pandas.io.formats.style.Styler.applymap_index.rst\n", "[dfapply]: ../reference/api/pandas.DataFrame.apply.rst\n", "[dfapplymap]: ../reference/api/pandas.DataFrame.applymap.rst" ] @@ -377,7 +423,7 @@ "metadata": {}, "outputs": [], "source": [ - "out = s.set_table_attributes('class=\"my-table-cls\"').render()\n", + "out = s.set_table_attributes('class=\"my-table-cls\"').to_html()\n", "print(out[out.find('