Closed
Description
Environment details
python version: 3.7.2
virtual environment: Conda managed
pip freeze | grep google
- google-api-core==1.7.0
- google-auth==1.6.2
- google-auth-oauthlib==0.2.0
- google-cloud-bigquery==1.9.0
- google-cloud-core==0.29.1
- google-resumable-media==0.3.2
- googleapis-common-protos==1.5.6
Problem
The strategy for paginating through a BigQuery table with RowIterator.to_dataframe() does not work as expected.
Either
- the documentation should be updated to state that to_dataframe() automatically iterates over all pages,
- or to_dataframe() should fill the DataFrame with only the first page, and let the user iterate and join the frames as needed.
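For reference, the second option could look roughly like the sketch below. This is hypothetical, not library code: page_dataframes is an invented helper name, and it relies only on the documented RowIterator.pages and schema attributes plus Row.values().

```python
import pandas

def page_dataframes(row_iterator):
    """Yield one DataFrame per result page instead of one combined frame.

    Hypothetical helper: ``row_iterator`` is anything exposing the
    RowIterator-style ``schema`` and ``pages`` attributes.
    """
    column_names = [field.name for field in row_iterator.schema]
    for page in row_iterator.pages:
        # Each page triggers exactly one tabledata.list request when it
        # is pulled off the ``pages`` generator.
        yield pandas.DataFrame(
            [list(row.values()) for row in page],
            columns=column_names,
        )
```

With client.list_rows(..., page_size=5), each yielded frame would then hold at most five rows, with one API call per iteration, which is the page-at-a-time workflow the current implementation does not offer.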
Steps to reproduce
- Check the instructions for paginating through a query.
- Note the suggestion for how to read subsequent pages:
A page is a subset of the total number of rows. If your results are more than one page of data, the result data will have a pageToken property. To retrieve the next page of results, make another list call and include the token value as a URL parameter named pageToken.
- Check the return value of the Client.list_rows() method.
- See that the iterator may be transformed to a DataFrame, and assume that it will only transform the page which was already loaded.
- Call RowIterator.to_dataframe() and see that it loads the rest of the pages and unions them into a single DataFrame: additional API calls are made.
- Be surprised that the rest of the pages were loaded without warning, and that the native implementation offers no way to work with one page's DataFrame at a time.
- Check the source code and confirm your suspicions.
def _to_dataframe_tabledata_list(self, dtypes):
    """Use (slower, but free) tabledata.list to construct a DataFrame."""
    column_names = [field.name for field in self.schema]
    frames = []
    for page in iter(self.pages):
        frames.append(self._to_dataframe_dtypes(page, column_names, dtypes))
    return pandas.concat(frames)

def to_dataframe(self, bqstorage_client=None, dtypes=None):
    ...
    if pandas is None:
        raise ValueError(_NO_PANDAS_ERROR)
    if dtypes is None:
        dtypes = {}
    if bqstorage_client is not None:
        return self._to_dataframe_bqstorage(bqstorage_client, dtypes)
    else:
        return self._to_dataframe_tabledata_list(dtypes)
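The loop over self.pages is what pulls every remaining page: each iteration issues another tabledata.list request, and pandas.concat then stitches all of the per-page frames into one. A minimal pandas-only sketch of that concat step, using fake in-memory pages with no API involved:

```python
import pandas

# Two fake "pages" standing in for the per-page frames built inside
# _to_dataframe_tabledata_list; no BigQuery client is involved here.
pages = [
    [{"x": 1}, {"x": 2}],
    [{"x": 3}, {"x": 4}],
]
frames = [pandas.DataFrame(page) for page in pages]
combined = pandas.concat(frames, ignore_index=True)
print(len(combined))  # 4 rows: every page ends up in the single frame
```

So by the time to_dataframe() returns, every page has already been fetched and merged; there is no hook to stop after the first one.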
Code example
table = client.get_table(destination_table)
row_iterator = client.list_rows(
    destination_table,
    selected_fields=table.schema,
    page_size=5,
    max_results=20,
)
df = row_iterator.to_dataframe()
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 20 entries, 0 to X
Data columns (total X columns):
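Until the behaviour is clarified, a single page can be materialised by pulling one page off the pages generator before converting. first_page_dataframe below is a hypothetical helper along those lines, again assuming only the documented schema and pages attributes and Row.values():

```python
import pandas

def first_page_dataframe(row_iterator):
    """Build a DataFrame from only the first result page.

    Hypothetical helper: pulling one item off the ``pages`` generator
    issues a single tabledata.list request for that page alone.
    """
    column_names = [field.name for field in row_iterator.schema]
    first_page = next(iter(row_iterator.pages))
    return pandas.DataFrame(
        [list(row.values()) for row in first_page],
        columns=column_names,
    )
```

For the listing above, first_page_dataframe(row_iterator) would contain at most page_size=5 rows rather than all 20.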