Scrapy: Google BigQuery UPDATE is 70x slower than INSERT, how to fix?

Posted by lp0sw83n on 2023-05-22 in Go


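I use BigQuery as the database for my Scrapy spider. Below are two pipelines that store data in the DB: one uses the insert method, the other the update method. The update method is 70 times slower than the insert (only about 20 records updated per minute): an update takes 3.560 seconds, while an insert takes only 0.05 seconds. What am I doing wrong, and how can I speed up the update method?
P.S. The table currently holds about 20k records and could grow to as many as 500,000. The records need to be updated daily.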
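To put those timings in perspective, here is a small back-of-the-envelope calculation. It uses only the figures quoted above (3.560 s per update, 0.05 s per insert, 20k and 500,000 records) and assumes each record is written one at a time, as in the pipelines below; the helper name `total_hours` is made up for this illustration.

```python
# Per-record timings quoted in the question.
UPDATE_SECONDS = 3.560
INSERT_SECONDS = 0.05


def total_hours(records: int, seconds_per_record: float) -> float:
    """Wall-clock hours needed when records are written strictly one by one."""
    return records * seconds_per_record / 3600


# Ratio between a single update and a single insert (roughly the "70x").
slowdown = UPDATE_SECONDS / INSERT_SECONDS

for records in (20_000, 500_000):
    update_h = total_hours(records, UPDATE_SECONDS)
    insert_h = total_hours(records, INSERT_SECONDS)
    print(f"{records} records: {update_h:.1f} h of updates vs {insert_h:.1f} h of inserts")
```

At the current table size a full daily refresh with one query per record would already take most of a day, which is why the per-query cost matters more than any single slow statement.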

Update method

# Define the update query
query = f"""
        UPDATE `{self.dataset_id}.{self.table_id}`
        SET `Sold Status` = '{data['Sold Status']}',
            `Amount of Views` = '{data['Amount of Views']}',
            `Amount of Likes` = '{data['Amount of Likes']}',
            `Sold Date & Time` = '{data['Sold Date & Time']}'
        WHERE `Item number` = '{data['Item number']}'
    """
start_time = time.time()
# Run the update query
job = self.client.query(query)

# Wait for the job to complete
job.result()

# Check the final job state (job.result() above already raises if the query failed)
if job.state == 'DONE':
    print('Update query executed successfully.')
else:
    print('Update query failed.')
end_time = time.time()
execution_time = end_time - start_time
logging.info(execution_time)

return item

Insert method

start_time = time.time()
data = item
slug = data['slug']
if slug in self.ids_seen:
    raise DropItem("Duplicate item found: {}".format(slug))
else:
    data.pop('slug', None)
    self.ids_seen.add(slug)
    table_ref = self.client.dataset(self.dataset_id).table(self.table_id)

    # Define the rows to be inserted
    rows = [
        data
    ]

    # Insert rows into the table
    errors = self.client.insert_rows_json(table_ref, rows)

    if not errors:
        print("Rows inserted successfully.")
    else:
        print("Encountered errors while inserting rows:", errors)
    end_time = time.time()
    execution_time = end_time - start_time
    logging.info(execution_time)
    
    return item

scrapy

Source: https://stackoverflow.com/questions/76287425/google-bigquery-update-is-70x-slower-then-insert-how-to-fix

1 answer


k3bvogb1 #1

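OK. Thanks to ChatGPT, I found a way to work around this issue. The code that used to take 9 hours now takes less than 15 minutes, so a 36x improvement.
Basically, what I do is:
1. Create a new temporary table.
2. Append all scraped data to this table in batches, as JSON.
3. Run an UPDATE command from the new table to the main table (the whole process takes just 5 seconds, instead of 5 seconds per update query. How come?).
4. Truncate the temporary table so it is ready for the next use a few hours later (FYI, you cannot use a truncated table right away; something between 2 and 15 minutes has to pass before the table is ready for inserts).
I'm rather happy with the result so far and will stick with this solution for a while.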
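The answer does not include code, so the following is only a minimal sketch of steps 2 and 3 under some assumptions. The helper names (`chunked`, `build_bulk_update_sql`, `sync_via_staging`), the batch size, and the table names in the usage note are hypothetical; `client` is assumed to be a `google.cloud.bigquery.Client`, and only its `insert_rows_json` and `query` methods are called. The column names come from the question's update query. The staging table is assumed to already exist with the same column names as the main table and to hold at most one row per key.

```python
from typing import Any, Dict, Iterator, List, Sequence


def chunked(rows: List[Dict[str, Any]], size: int) -> Iterator[List[Dict[str, Any]]]:
    """Yield consecutive slices of at most `size` rows."""
    for start in range(0, len(rows), size):
        yield rows[start:start + size]


def build_bulk_update_sql(target: str, staging: str, key: str,
                          columns: Sequence[str]) -> str:
    """Build one UPDATE ... FROM statement that copies `columns` from the
    staging table into every target row with a matching key."""
    assignments = ",\n    ".join(f"`{col}` = s.`{col}`" for col in columns)
    return (
        f"UPDATE `{target}` AS t\n"
        f"SET {assignments}\n"
        f"FROM `{staging}` AS s\n"
        f"WHERE t.`{key}` = s.`{key}`"
    )


def sync_via_staging(client: Any, rows: List[Dict[str, Any]], target: str,
                     staging: str, key: str, columns: Sequence[str],
                     batch_size: int = 500) -> str:
    """Append rows to the staging table in batches (step 2), then apply them
    to the main table with a single query (step 3). Returns the SQL it ran."""
    for batch in chunked(rows, batch_size):
        errors = client.insert_rows_json(staging, batch)
        if errors:
            raise RuntimeError(f"Staging insert failed: {errors}")
    sql = build_bulk_update_sql(target, staging, key, columns)
    client.query(sql).result()  # blocks until done; raises if the job failed
    return sql
```

In a Scrapy pipeline this could look like collecting items in a list inside `process_item` and calling something like `sync_via_staging(self.client, self.rows, "my_dataset.items", "my_dataset.items_staging", "Item number", ["Sold Status", "Amount of Views", "Amount of Likes", "Sold Date & Time"])` once from `close_spider`. Creating and truncating the temporary table (steps 1 and 4) is left out of the sketch.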

Answered 2023-05-22
