I am trying to query data from a MySQL server and write it to Google BigQuery using the pandas.to_gbq API.
def production_to_gbq(table_name_prod, prefix, table_name_gbq, dataset, project):
    # Extract data from Production
    q = """
        SELECT *
        FROM
        {}
        """.format(table_name_prod)
    df = pd.read_sql(q, con)
    # Write to gbq
    df.to_gbq(dataset + table_name_gbq, project, chunksize=1000, verbose=True,
              reauth=False, if_exists='replace', private_key=None)
    return df
I keep getting a 400 error indicating invalid input.
Load is 100.0% Complete
---------------------------------------------------------------------------
BadRequest Traceback (most recent call last)
/usr/local/lib/python3.6/site-packages/pandas_gbq/gbq.py in load_data(self, dataframe, dataset_id, table_id, chunksize, schema)
569 self.client, dataframe, dataset_id, table_id,
--> 570 chunksize=chunksize):
571 self._print("\rLoad is {0}% Complete".format(
/usr/local/lib/python3.6/site-packages/pandas_gbq/_load.py in load_chunks(client, dataframe, dataset_id, table_id, chunksize, schema)
73 destination_table,
---> 74 job_config=job_config).result()
/usr/local/lib/python3.6/site-packages/google/cloud/bigquery/job.py in result(self, timeout)
527 # TODO: modify PollingFuture so it can pass a retry argument to done().
--> 528 return super(_AsyncJob, self).result(timeout=timeout)
529
/usr/local/lib/python3.6/site-packages/google/api_core/future/polling.py in result(self, timeout)
110 # Pylint doesn't recognize that this is valid in this case.
--> 111 raise self._exception
112
BadRequest: 400 Error while reading data, error message: CSV table encountered too many errors, giving up. Rows: 10; errors: 1. Please look into the error stream for more details.
During handling of the above exception, another exception occurred:
GenericGBQException Traceback (most recent call last)
<ipython-input-73-ef9c7cec0104> in <module>()
----> 1 departments.to_gbq(dataset + table_name_gbq, project, chunksize=1000, verbose=True, reauth=False, if_exists='replace', private_key=None)
2
/usr/local/lib/python3.6/site-packages/pandas/core/frame.py in to_gbq(self, destination_table, project_id, chunksize, verbose, reauth, if_exists, private_key)
1058 return gbq.to_gbq(self, destination_table, project_id=project_id,
1059 chunksize=chunksize, verbose=verbose, reauth=reauth,
-> 1060 if_exists=if_exists, private_key=private_key)
1061
1062 @classmethod
/usr/local/lib/python3.6/site-packages/pandas/io/gbq.py in to_gbq(dataframe, destination_table, project_id, chunksize, verbose, reauth, if_exists, private_key)
107 chunksize=chunksize,
108 verbose=verbose, reauth=reauth,
--> 109 if_exists=if_exists, private_key=private_key)
/usr/local/lib/python3.6/site-packages/pandas_gbq/gbq.py in to_gbq(dataframe, destination_table, project_id, chunksize, verbose, reauth, if_exists, private_key, auth_local_webserver, table_schema)
980 connector.load_data(
981 dataframe, dataset_id, table_id, chunksize=chunksize,
--> 982 schema=table_schema)
983
984
/usr/local/lib/python3.6/site-packages/pandas_gbq/gbq.py in load_data(self, dataframe, dataset_id, table_id, chunksize, schema)
572 ((total_rows - remaining_rows) * 100) / total_rows))
573 except self.http_error as ex:
--> 574 self.process_http_error(ex)
575
576 self._print("\n")
/usr/local/lib/python3.6/site-packages/pandas_gbq/gbq.py in process_http_error(ex)
453 # <https://cloud.google.com/bigquery/troubleshooting-errors>`__
454
--> 455 raise GenericGBQException("Reason: {0}".format(ex))
456
457 def run_query(self, query, **kwargs):
GenericGBQException: Reason: 400 Error while reading data, error message: CSV table encountered too many errors, giving up. Rows: 10; errors: 1. Please look into the error stream for more details.
I looked into the table schema:
id INTEGER NULLABLE
name STRING NULLABLE
description STRING NULLABLE
created_at INTEGER NULLABLE
modified_at FLOAT NULLABLE
It is the same as the DataFrame:
id int64
name object
description object
created_at int64
modified_at float64
The table is created in GBQ but remains empty.
I read a bit about the pandas.to_gbq API but didn't find much on it, except for this message, which seemed relevant but got no reply:
bigquery table is empty when using pandas to_gbq
I did find a potential solution concerning numbers in object-dtype columns being passed into the GBQ table without quotes, which is fixed by setting the column datatype to string:
I use to_gbq on pandas for updating Google BigQuery and get GenericGBQException
I tried:
for col in df.columns:
    if df[col].dtypes == object:
        df[col] = df[col].fillna('')
        df[col] = df[col].astype(str)
Unfortunately I still get the same error. Similarly, trying to format the missing data and set the dtypes for the int and float columns gives the same error.
Am I missing something?
4 Answers
puruo6ea1#
Found that BigQuery cannot handle **\r** properly (sometimes **\n** too). I had the same problem, localized the issue, and was really surprised when replacing **\r** with a space fixed it:
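A minimal sketch of that replacement, assuming df is the DataFrame about to be uploaded (the helper name is made up for illustration):

import pandas as pd

def replace_carriage_returns(df):
    # Swap \r and \n for spaces in every object (string) column,
    # leaving non-string values such as NaN untouched.
    for col in df.columns:
        if df[col].dtype == object:
            df[col] = df[col].apply(
                lambda x: x.replace('\r', ' ').replace('\n', ' ')
                if isinstance(x, str) else x)
    return df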
s6fujrry2#
I've ended up here several times when hitting a similar problem importing parquet files from Cloud Storage into BigQuery, yet each time I've forgotten how I solved it, so I hope it isn't too much of a breach of protocol to leave my findings here!
I realized I had some columns that were all NULL. They look like they have a datatype in pandas, but if you use pyarrow.parquet.read_schema(parquet_file), you will see that the datatype is null.
After removing the columns, the upload works fine!
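A minimal sketch of that check with pyarrow, assuming a local parquet file (the file name is illustrative):

import pandas as pd
import pyarrow.parquet as pq

schema = pq.read_schema('data.parquet')  # illustrative file name
# Columns that are all NULL show up with a parquet type of "null"
null_cols = [field.name for field in schema if str(field.type) == 'null']

df = pd.read_parquet('data.parquet')
df = df.drop(columns=null_cols)  # drop them before loading into BigQuery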
pftdvrlh3#
I had some invalid characters in my string columns (object in pandas). I used @Echochi's approach and it worked fine, but it is a bit restrictive about which characters it accepts, so I used a more general approach, since BigQuery is UTF-8 compatible per the BigQuery documentation. With r"[^\u0900-\u097F]+" you will accept the whole UTF-8-compatible charset.
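A minimal sketch of that regex cleanup, using the pattern quoted in the answer (the helper name is made up; note that re.sub deletes whatever the pattern matches):

import re

def strip_unwanted_chars(df, pattern=r"[^\u0900-\u097F]+"):
    # Delete every character run matched by the pattern from string columns
    for col in df.columns:
        if df[col].dtype == object:
            df[col] = df[col].apply(
                lambda x: re.sub(pattern, '', x) if isinstance(x, str) else x)
    return df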
tzxcd3kk4#
I faced a similar problem due to an unwanted column, "Unnamed 0", in the dataset. I removed that column and the problem was solved. If there are any empty or unwanted columns in the dataset, try checking the shape and the head of the DataFrame.
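A quick sketch of that inspection, assuming the data came through a CSV round-trip (the file name is illustrative; 'Unnamed: 0' is the stray index column pandas typically writes):

import pandas as pd

df = pd.read_csv('departments.csv')  # illustrative source file
print(df.shape)   # look for an unexpected extra column
print(df.head())  # stray index columns show up as 'Unnamed: 0'

# Drop the stray column if present, then upload
df = df.drop(columns=['Unnamed: 0'], errors='ignore')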