Google Spreadsheet to Pandas DataFrame via PyDrive, without downloading

Posted by uajslkp6 on 2023-01-19 in Go

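How can I read the contents of a Google spreadsheet into a Pandas DataFrame without downloading the file?
I think gspread or df2gspread may be good solutions, but I have been working with PyDrive so far and am close to a solution.
With PyDrive I managed to get the export links of my spreadsheets, as either .csv or .xlsx files.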

from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive

# authenticate via a local web server and build the Drive client
gauth = GoogleAuth()
gauth.LocalWebserverAuth()
drive = GoogleDrive(gauth)

# choose whether to export csv or xlsx
data_type = 'csv'

# get list of files in folder as dictionaries
file_list = drive.ListFile(
    {'q': "'my-folder-ID' in parents and trashed=false"}
).GetList()

export_key = 'exportLinks'

excel_key = 'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet'
csv_key = 'text/csv'

# pick the export link that matches the chosen format
if data_type == 'xlsx':
    urls = [file[export_key][excel_key] for file in file_list]
elif data_type == 'csv':
    urls = [file[export_key][csv_key] for file in file_list]
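
The list comprehensions above assume that every file in the folder carries an exportLinks entry. As far as I know, Drive only provides that field for Google-native files such as Sheets, so a folder that also holds uploaded files could raise a KeyError. A minimal sketch of a more defensive version follows; collect_export_urls is a hypothetical helper name, and the sample dictionaries are made-up stand-ins for what GetList() returns.

```python
EXCEL_MIME = 'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet'
CSV_MIME = 'text/csv'


def collect_export_urls(file_list, mime_type):
    """Return {title: export URL} for the files that offer the requested format."""
    urls = {}
    for file in file_list:
        # .get() avoids a KeyError for files that have no export links
        link = file.get('exportLinks', {}).get(mime_type)
        if link is not None:
            urls[file.get('title', file.get('id'))] = link
    return urls


# Made-up stand-ins for the dictionaries returned by drive.ListFile(...).GetList()
sample_files = [
    {
        'id': 'my-id',
        'title': 'budget',
        'exportLinks': {
            CSV_MIME: 'https://docs.google.com/spreadsheets/export?id=my-id&exportFormat=csv',
        },
    },
    {'id': 'other-id', 'title': 'notes.pdf'},  # uploaded file without exportLinks
]

csv_urls = collect_export_urls(sample_files, CSV_MIME)
print(csv_urls)
```

Keying the result by title keeps track of which URL belongs to which spreadsheet, which a bare list of URLs does not.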

The URL I get for xlsx is of the form

https://docs.google.com/spreadsheets/export?id=my-id&exportFormat=xlsx

and similarly for csv

https://docs.google.com/spreadsheets/export?id=my-id&exportFormat=csv

Now, if I click on these links (or open them with webbrowser.open(url)), the file is *downloaded*, and I can then read it into a Pandas DataFrame as usual with pandas.read_excel() or pandas.read_csv().

How can I skip the download and read the file directly from these links into a DataFrame?

I tried several solutions:

The obvious pd.read_csv(url) gives

pandas.errors.ParserError: Error tokenizing data. C error: Expected 1 fields in line 6, saw 2

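Interestingly, these numbers (1, 6, 2) do not depend on the number of rows and columns in my spreadsheet, which suggests that the script is reading something other than what it is meant to read.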
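
One possible reading of these numbers, sketched with the standard library's csv module rather than pandas' own C parser (so the match is suggestive, not proof): if the response is an HTML page instead of CSV, the first line that happens to contain a comma is split into two fields. The text below is the beginning of the HTML output printed further down in this question; first_multi_field_line is a hypothetical helper.

```python
import csv
import io

# Start of the HTML that the export URL returned in this question
html_head = (
    '\n<!DOCTYPE html>\n<html lang="de">\n  <head>\n'
    '  <meta charset="utf-8">\n'
    '  <meta content="width=300, initial-scale=1" name="viewport">\n'
)


def first_multi_field_line(text):
    """Return (line number, field count) of the first line with more than one field."""
    reader = csv.reader(io.StringIO(text))
    for row in reader:
        if len(row) > 1:
            return reader.line_num, len(row)
    return None


print(first_multi_field_line(html_head))  # (6, 2)
```

The comma inside the viewport meta tag sits on line 6 of the page, which is consistent with "Expected 1 fields in line 6, saw 2" regardless of what the spreadsheet contains.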

The analogous pd.read_excel(url) gives

ValueError: Excel file format cannot be determined, you must specify an engine manually.

and specifying e.g. engine='openpyxl' gives

zipfile.BadZipFile: File is not a zip file
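
An .xlsx file is a ZIP archive, so this error is another hint that the bytes received are not a workbook at all. A quick check with the standard library can confirm that before any Excel engine is involved. This is only a sketch: is_zip_container is a hypothetical helper, and the tiny archive built here is a stand-in for real .xlsx bytes.

```python
import io
import zipfile


def is_zip_container(payload: bytes) -> bool:
    """True if the bytes form a ZIP archive, which every .xlsx file is."""
    return zipfile.is_zipfile(io.BytesIO(payload))


# Build a tiny in-memory ZIP as a stand-in for real .xlsx bytes
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, 'w') as archive:
    archive.writestr('placeholder.txt', 'not a real workbook')
zip_bytes = buffer.getvalue()

# An HTML page, like the one the export URL returned here
html_bytes = b'\n<!DOCTYPE html>\n<html lang="de">\n</html>\n'

print(is_zip_container(zip_bytes))
print(is_zip_container(html_bytes))
```

Passing this check does not prove the data is a valid workbook, only that it is a ZIP container; failing it rules an .xlsx out.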

The BytesIO solution looked promising, but

from io import BytesIO
import pandas as pd
import requests
r = requests.get(url)
data = r.content
df = pd.read_csv(BytesIO(data))

still gives

pandas.errors.ParserError: Error tokenizing data. C error: Expected 1 fields in line 6, saw 2

If I print(data), I get hundreds of lines of HTML code

b'\n<!DOCTYPE html>\n<html lang="de">\n  <head>\n  <meta charset="utf-8">\n  <meta content="width=300, initial-scale=1" name="viewport">\n 
    ...
    ...
     </script>\n  </body>\n</html>\n'
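
The HTML output suggests that the server answered with a web page, presumably a sign-in page because the request carries no credentials (an assumption, since the output is truncated). A small guard could fail early with a clearer message than the parser errors above. ensure_tabular_payload is a hypothetical helper; it only inspects the Content-Type header value and the first bytes of the body.

```python
def ensure_tabular_payload(content_type: str, payload: bytes) -> bytes:
    """Return payload unchanged, or raise if it looks like an HTML page."""
    head = payload[:256].lstrip().lower()
    looks_like_html = head.startswith((b'<!doctype html', b'<html'))
    if 'text/html' in content_type.lower() or looks_like_html:
        raise ValueError(
            'Got an HTML page instead of spreadsheet data; '
            'the request may be missing authorization.'
        )
    return payload


# Hypothetical usage with a requests response r:
#     data = ensure_tabular_payload(r.headers.get('Content-Type', ''), r.content)

csv_bytes = b'a,b\n1,2\n'
print(ensure_tabular_payload('text/csv', csv_bytes))
```

Checking both the header and the body covers the case where a server labels a page with an unexpected content type.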

Source: https://stackoverflow.com/questions/71278523/google-spreadsheet-to-pandas-dataframe-via-pydrive-without-download

2 Answers

ccrfmcuu1#

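In your situation, how about the following modification? Here, the access token is retrieved from gauth, the spreadsheet is exported as XLSX data, and the XLSX data is loaded into a DataFrame.

Modified script: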

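from io import BytesIO

import pandas as pd
import requests
from pydrive.auth import GoogleAuth

gauth = GoogleAuth()
gauth.LocalWebserverAuth()

spreadsheet_id = "my-id"  # replace with your spreadsheet ID
url = f"https://docs.google.com/spreadsheets/export?id={spreadsheet_id}&exportFormat=xlsx"
res = requests.get(url, headers={"Authorization": "Bearer " + gauth.attr['credentials'].access_token})
values = pd.read_excel(BytesIO(res.content))
print(values)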

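This script requires import requests, in addition to pandas and BytesIO.
In this case, the first tab of the XLSX data is used.
When you want to use another tab, modify values = pd.read_excel(BytesIO(res.content)) as follows.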

sheet = "Sheet2"
values = pd.read_excel(BytesIO(res.content), sheet_name=sheet)
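
The question also asked about CSV. The same idea should carry over by requesting exportFormat=csv (the export URL shown in the question) and passing the bytes to pd.read_csv. A sketch under that assumption; export_url, bearer_header and read_sheet_as_csv are hypothetical helper names, and the access token is whatever gauth provides in the script above.

```python
from io import BytesIO
from urllib.parse import urlencode

EXPORT_ENDPOINT = 'https://docs.google.com/spreadsheets/export'


def export_url(spreadsheet_id: str, export_format: str = 'csv') -> str:
    """Build the export URL in the same form as the links in the question."""
    query = urlencode({'id': spreadsheet_id, 'exportFormat': export_format})
    return EXPORT_ENDPOINT + '?' + query


def bearer_header(access_token: str) -> dict:
    """Authorization header that carries the OAuth access token."""
    return {'Authorization': 'Bearer ' + access_token}


def read_sheet_as_csv(spreadsheet_id: str, access_token: str):
    """Fetch the CSV export with the token and parse it into a DataFrame."""
    # Third-party imports are kept local so the URL helpers above
    # work without requests and pandas installed.
    import pandas as pd
    import requests

    res = requests.get(
        export_url(spreadsheet_id),
        headers=bearer_header(access_token),
        timeout=30,
    )
    res.raise_for_status()  # fail on HTTP errors before parsing
    return pd.read_csv(BytesIO(res.content))


print(export_url('my-id'))
```

With the script above, the call would look like read_sheet_as_csv('my-id', gauth.attr['credentials'].access_token).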

3gtaxfhh2#

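I would like to contribute an additional option to @Tanaike's great answer. It is indeed quite difficult to get an Excel file (an .xlsx from Drive, not a Google Sheet) into a Python environment without publishing the content to the web. I usually use a different authentication method in Colab/Jupyter notebooks, adapted from the googleapis documentation. In my environment, wrapping the result as BytesIO(response.content) is unnecessary.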

import pandas as pd
from google.auth import default
from google.auth.transport.requests import AuthorizedSession
from google.colab import auth

# authenticate the Colab user, then load the default credentials
auth.authenticate_user()
creds, _ = default()

spreadsheet_id = 'aaaaaaaaaaaaaaaaaaaaaaaaaaa'
sheet = 'Sheet12345'

url = f'https://docs.google.com/spreadsheets/export?id={spreadsheet_id}&exportFormat=xlsx'

# AuthorizedSession attaches the credentials to each request it sends
authed_session = AuthorizedSession(creds)
response = authed_session.get(url)

# the raw bytes are passed directly, as described above
values = pd.read_excel(response.content, sheet_name=sheet)
