How do I convert multiple paginated tables into a single Pandas DataFrame?

Asked by sq1bmfud on 2023-01-01

mendelg helped me scrape a JavaScript-generated table with BeautifulSoup. However, because the table is paginated, the code only converts the 10 rows of the last page into a DataFrame instead of merging the tables from all pages into a single DataFrame.
The original code is as follows:

import requests
import pandas as pd
from bs4 import BeautifulSoup

data = {
    "action": "geteCMSList",
    "keyword": "",
    "officeId": "0",
    "contractAwardTo": "",
    "contractStartDtFrom": "",
    "contractStartDtTo": "",
    "contractEndDtFrom": "",
    "contractEndDtTo": "",
    "departmentId": "",
    "tenderId": "",
    "procurementMethod": "",
    "procurementNature": "",
    "contAwrdSearchOpt": "Contains",
    "exCertSearchOpt": "Contains",
    "exCertificateNo": "",
    "tendererId": "",
    "procType": "",
    "statusTab": "eTenders",
    "pageNo": "1",
    "size": "10",
    "workStatus": "All",
}

_columns = [
    "S. No",
    "Ministry, Division, Organization, PE",
    "Procurement Nature, Type & Method",
    "Tender/Proposal ID, Ref No., Title..",
    "Contract Awarded To",
    "Company Unique ID",
    "Experience Certificate No  ",
    "Contract Amount",
    "Contract Start & End Date",
    "Work Status",
]

for page in range(1, 11):  # <--- Increase number of pages here
    print(f"Page: {page}")
    data["pageNo"] = page

    response = requests.post(
        "https://www.eprocure.gov.bd/AdvSearcheCMSServlet", data=data
    )
    # The HTML is missing a `table` tag, so we need to fix it
    soup = BeautifulSoup("<table>" + response.text + "</table>", "html.parser")
    # NOTE: `df` is reassigned on every iteration, so only the last page survives
    df = pd.read_html(str(soup))[0]

    df.columns = _columns
    print(df.to_string())

When I increase the number of pages in the for loop, the resulting df still contains only the 10 rows of the last page (page 10 in the example above, since range(1, 11) stops at 10).
What I want instead is a single DataFrame that combines the tables from all pages.

Answer from ruarlubt:

You can use pandas.concat.
Create a list *outside* the loop:

all_data = []

and append each page's DataFrame to it *inside* the loop:

    df = pd.read_html(str(soup))[0]

    all_data.append(df)

Then, again *outside* the loop, concatenate everything:

df = pd.concat(all_data)

print(df.to_string())
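Note that each page's DataFrame carries its own 0-based index, so the concatenated frame will repeat index values across pages; if you want one continuous index, pandas.concat accepts ignore_index=True:

df = pd.concat(all_data, ignore_index=True)  # optional: renumber rows consecutively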

Full example:

import requests
import pandas as pd
from bs4 import BeautifulSoup

data = {
    "action": "geteCMSList",
    "keyword": "",
    "officeId": "0",
    "contractAwardTo": "",
    "contractStartDtFrom": "",
    "contractStartDtTo": "",
    "contractEndDtFrom": "",
    "contractEndDtTo": "",
    "departmentId": "",
    "tenderId": "",
    "procurementMethod": "",
    "procurementNature": "",
    "contAwrdSearchOpt": "Contains",
    "exCertSearchOpt": "Contains",
    "exCertificateNo": "",
    "tendererId": "",
    "procType": "",
    "statusTab": "eTenders",
    "pageNo": "1",
    "size": "10",
    "workStatus": "All",
}

_columns = [
    "S. No",
    "Ministry, Division, Organization, PE",
    "Procurement Nature, Type & Method",
    "Tender/Proposal ID, Ref No., Title..",
    "Contract Awarded To",
    "Company Unique ID",
    "Experience Certificate No  ",
    "Contract Amount",
    "Contract Start & End Date",
    "Work Status",
]

all_data = []
for page in range(1, 2):  # <--- Increase number of pages here
    print(f"Page: {page}")
    data["pageNo"] = page

    response = requests.post(
        "https://www.eprocure.gov.bd/AdvSearcheCMSServlet", data=data
    )
    # The HTML is missing a `table` tag, so we need to fix it
    soup = BeautifulSoup("<table>" + response.text + "</table>", "html.parser")
    df = pd.read_html(str(soup))[0]
    df.columns = _columns  # restore the readable column names

    all_data.append(df)

_df = pd.concat(all_data)

print(_df.to_string())
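If you also want to keep the combined table, pandas can write it straight to a file; the CSV filename below is just a placeholder:

_df.to_csv("ecms_results.csv", index=False)  # example filename; index=False omits the row index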
