
BigQuery: 'RowIterator.to_dataframe' surprisingly consumes / merges all pages. #7293

Closed
@kykrueger

Description


Environment details

python version: 3.7.2
virtual environment: Conda managed

pip freeze | grep 'google'

  • google-api-core==1.7.0
  • google-auth==1.6.2
  • google-auth-oauthlib==0.2.0
  • google-cloud-bigquery==1.9.0
  • google-cloud-core==0.29.1
  • google-resumable-media==0.3.2
  • googleapis-common-protos==1.5.6

Problem

Paginating through a table in BigQuery with RowIterator.to_dataframe() does not work as expected.
Either

  • the documentation should be updated to state that it automatically iterates over all pages,
  • or it should only fill the DataFrame with the first page, and let the user iterate and join the frames as needed (a minimal sketch of that flow follows).
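
For illustration, the second option is already within reach by hand through the iterator's pages property. Here is a minimal sketch of that flow, assuming a configured client and a destination_table reference as in the code example further down:

# Sketch of page-at-a-time DataFrame construction: what option two would let
# users do, not what to_dataframe() does today.
import pandas

table = client.get_table(destination_table)
row_iterator = client.list_rows(destination_table,
                                selected_fields=table.schema,
                                page_size=5,
                                max_results=20)
column_names = [field.name for field in row_iterator.schema]

frames = []
for page in row_iterator.pages:
    # Each loop iteration triggers exactly one tabledata.list request.
    frames.append(
        pandas.DataFrame([dict(row) for row in page], columns=column_names)
    )
    # ...work with the page-sized frame here before fetching the next one...

# Join the per-page frames only if the full result is actually needed.
df = pandas.concat(frames, ignore_index=True)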

Steps to reproduce

  1. Check the instructions for paginating through a query.
  2. Note the suggestion for how to read subsequent pages:

A page is a subset of the total number of rows. If your results are more than one page of data, the result data will have a pageToken property. To retrieve the next page of results, make another list call and include the token value as a URL parameter named pageToken.

  3. Check the response of the Client.list_rows() method.
  4. See that the iterator may be transformed into a DataFrame, and assume that it will only transform the page which was already loaded.
  5. Call RowIterator.to_dataframe() and see that it loads the rest of the pages and unions them into a single DataFrame, so more API requests are made.
  6. Be surprised that the rest of the pages were loaded without warning, and that you cannot work with one page-sized DataFrame at a time using the native implementation.
  7. Check the source code and confirm your suspicions (the sketch after the snippet shows the contrast when paging manually):
    def _to_dataframe_tabledata_list(self, dtypes):
        """Use (slower, but free) tabledata.list to construct a DataFrame."""
        column_names = [field.name for field in self.schema]
        frames = []
        for page in iter(self.pages):
            frames.append(self._to_dataframe_dtypes(page, column_names, dtypes))
        return pandas.concat(frames)

    def to_dataframe(self, bqstorage_client=None, dtypes=None):
        ...
        if pandas is None:
            raise ValueError(_NO_PANDAS_ERROR)
        if dtypes is None:
            dtypes = {}

        if bqstorage_client is not None:
            return self._to_dataframe_bqstorage(bqstorage_client, dtypes)
        else:
            return self._to_dataframe_tabledata_list(dtypes)
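
The loop over iter(self.pages) above is where the extra requests come from: every remaining page is fetched before pandas.concat merges the frames. Advancing the page iterator by hand shows the contrast (a sketch, assuming a freshly constructed row_iterator as in the code example below; num_items comes from google-api-core's Page class):

pages = row_iterator.pages       # a generator; no request has been made yet
first_page = next(pages)         # exactly one tabledata.list request
print(first_page.num_items)      # 5, i.e. one page_size worth of rows
# to_dataframe() would instead drain every remaining page at this point.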

Code example

from google.cloud import bigquery

client = bigquery.Client()
destination_table = "project.dataset.table"  # placeholder; any table reference works

table = client.get_table(destination_table)
row_iterator = client.list_rows(destination_table,
                                selected_fields=table.schema,
                                page_size=5,
                                max_results=20)
df = row_iterator.to_dataframe()
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 20 entries, 0 to X
Data columns (total X columns):
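
Note that the 20 entries show a single to_dataframe() call fetched and merged all four 5-row pages, even though page_size=5 was requested.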


Labels

api: bigquery (Issues related to the BigQuery API.)
type: docs (Improvement to the documentation for an API.)
