pandas BigQuery TypeError: to_pandas() got an unexpected keyword argument 'timestamp_as_object'

q9yhzks0 · posted 2023-01-07 · in Other
Follow (0) | Answers (2) | Views (114)

Environment details

  • OS type and version: 1.5.29-debian10
  • Python version: 3.7
  • google-cloud-bigquery version: 2.8.0

I am setting up a Dataproc cluster that pulls data from BigQuery into a pandas DataFrame. As the data grows I want to improve performance, and I heard about using the BigQuery Storage client.
I have run into this problem before and worked around it by pinning google-cloud-bigquery to version 1.26.1. If I use that version, I get the following message.

/opt/conda/default/lib/python3.7/site-packages/google/cloud/bigquery/client.py:407: UserWarning: Cannot create BigQuery Storage client, the dependency google-cloud-bigquery-storage is not installed.
 "Cannot create BigQuery Storage client, the dependency "

The snippet executes, but somewhat more slowly. If I don't pin the pip version, I run into the error below.
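For reference, the workaround amounts to a pinned install; the exact invocation is assumed here, since the question only names the version:

pip install google-cloud-bigquery==1.26.1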

Steps to reproduce

1. Create a cluster on Dataproc

gcloud dataproc clusters create testing-cluster \
  --region=europe-west1 \
  --zone=europe-west1-b \
  --master-machine-type n1-standard-16 \
  --single-node \
  --image-version 1.5-debian10 \
  --initialization-actions gs://dataproc-initialization-actions/python/pip-install.sh \
  --metadata 'PIP_PACKAGES=elasticsearch google-cloud-bigquery google-cloud-bigquery-storage pandas pandas_gbq'

2. Execute the following script on the cluster
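(The script itself was mangled in the page capture. A minimal sketch of what an equivalent direct-client version could look like, with the parameter values taken from the pandas-gbq variant below and the query string assumed:)

from google.cloud import bigquery

client = bigquery.Client()
# Placeholder query; the original SQL was not captured.
base_query = "SELECT * FROM `project.dataset.table` WHERE ts BETWEEN @query_start AND @query_end"
job_config = bigquery.QueryJobConfig(
    query_parameters=[
        bigquery.ScalarQueryParameter("query_start", "STRING", "2021-02-09 00:00:00"),
        bigquery.ScalarQueryParameter("query_end", "STRING", "2021-02-09 23:59:59.99"),
    ]
)
# to_dataframe() routes the download through the BigQuery Storage API when the
# dependency is available; this is the call where the pyarrow TypeError surfaces.
df = client.query(base_query, job_config=job_config).to_dataframe(create_bqstorage_client=True)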
Using pandas-gbq instead produces exactly the same error:

import pandas as pd

# base_query holds the parameterized SQL string (not shown in the post)
query_config = {
    'query': {
        'parameterMode': 'NAMED',         
        'queryParameters': [
            {
                'name': 'query_start',
                'parameterType': {'type': 'STRING'},
                'parameterValue': {'value': str('2021-02-09 00:00:00')}
            },
            {
                'name': 'query_end',
                'parameterType': {'type': 'STRING'},
                'parameterValue': {'value': str('2021-02-09 23:59:59.99')}
            },
        ]
    }
}
df = pd.read_gbq(base_query, 
                 configuration=query_config, 
                 progress_bar_type='tqdm',
                 use_bqstorage_api=True)
2021-02-11 09:21:19,532 - preprocessing logger initialized
2021-02-11 09:21:19,532 - arguments = [file, arg1, arg2, arg3, arg4, project_id, arg5, arg6]
started
Downloading: 100%|██████████| 3107858/3107858 [00:14<00:00, 207656.33rows/s]
Traceback (most recent call last):
  File "/tmp/1830d5bcf198440e9e030c8e42a1b870/immo_preprocessing-pageviews.py", line 98, in <module>
    use_bqstorage_api=True)
  File "/opt/conda/default/lib/python3.7/site-packages/pandas/io/gbq.py", line 193, in read_gbq
    **kwargs,
  File "/opt/conda/default/lib/python3.7/site-packages/pandas_gbq/gbq.py", line 977, in read_gbq
    dtypes=dtypes,
  File "/opt/conda/default/lib/python3.7/site-packages/pandas_gbq/gbq.py", line 536, in run_query
    user_dtypes=dtypes,
  File "/opt/conda/default/lib/python3.7/site-packages/pandas_gbq/gbq.py", line 590, in _download_results
    **to_dataframe_kwargs
  File "/opt/conda/default/lib/python3.7/site-packages/google/cloud/bigquery/table.py", line 1793, in to_dataframe
    df = record_batch.to_pandas(date_as_object=date_as_object, **extra_kwargs)
  File "pyarrow/array.pxi", line 414, in pyarrow.lib._PandasConvertible.to_pandas
TypeError: to_pandas() got an unexpected keyword argument 'timestamp_as_object'

https://github.com/googleapis/python-bigquery/issues/519
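The keyword in the traceback is forwarded by google-cloud-bigquery's to_dataframe() down to pyarrow's to_pandas(), and older pyarrow releases do not accept it (per the answers below, 0.15.0 fails while 3.0.0 works). A minimal repro of that mismatch, independent of BigQuery:

import pyarrow as pa

batch = pa.RecordBatch.from_arrays([pa.array([1, 2, 3])], ["x"])
# pyarrow 0.15: TypeError: to_pandas() got an unexpected keyword argument 'timestamp_as_object'
# pyarrow >= 1.0 (e.g. 3.0.0): succeeds
df = batch.to_pandas(timestamp_as_object=True)
print(df)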

bvn4nwqk #1

@Sam already answered this, but I figured I'd still spell out the actionable commands:
In a Jupyter notebook:
%pip install pyarrow==3.0.0
In a virtual environment:
pip install pyarrow==3.0.0
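A quick sanity check (not part of the original answer) that the interpreter actually picks up the new version:

import pyarrow
print(pyarrow.__version__)  # expect 3.0.0 after the commands above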

c7rzv4ha #2

Dataproc installs pyarrow 0.15.0 by default, while the BigQuery Storage API client needs a newer version; manually pinning pyarrow to 3.0.0 at install time solved the problem. That said, PySpark has a compatibility setting for pyarrow >= 0.15.0: https://spark.apache.org/docs/3.0.0-preview/sql-pyspark-pandas-with-arrow.html#apache-arrow-in-spark. I checked the Dataproc release notes, and that environment variable has been set by default since May 2020.
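Applied to the cluster command from the question, the fix boils down to adding the pin to PIP_PACKAGES; a sketch of the adjusted command, otherwise unchanged:

gcloud dataproc clusters create testing-cluster \
  --region=europe-west1 \
  --zone=europe-west1-b \
  --master-machine-type n1-standard-16 \
  --single-node \
  --image-version 1.5-debian10 \
  --initialization-actions gs://dataproc-initialization-actions/python/pip-install.sh \
  --metadata 'PIP_PACKAGES=elasticsearch google-cloud-bigquery google-cloud-bigquery-storage pandas pandas_gbq pyarrow==3.0.0'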
