mendelg helped me scrape a JavaScript-generated table with BeautifulSoup. However, since the table is paginated, the code only turns the 10 rows of the last page into a DataFrame instead of combining the tables from all pages into a single DataFrame.
The original code is as follows:
import requests
import pandas as pd
from bs4 import BeautifulSoup
data = {
    "action": "geteCMSList",
    "keyword": "",
    "officeId": "0",
    "contractAwardTo": "",
    "contractStartDtFrom": "",
    "contractStartDtTo": "",
    "contractEndDtFrom": "",
    "contractEndDtTo": "",
    "departmentId": "",
    "tenderId": "",
    "procurementMethod": "",
    "procurementNature": "",
    "contAwrdSearchOpt": "Contains",
    "exCertSearchOpt": "Contains",
    "exCertificateNo": "",
    "tendererId": "",
    "procType": "",
    "statusTab": "eTenders",
    "pageNo": "1",
    "size": "10",
    "workStatus": "All",
}
_columns = [
    "S. No",
    "Ministry, Division, Organization, PE",
    "Procurement Nature, Type & Method",
    "Tender/Proposal ID, Ref No., Title..",
    "Contract Awarded To",
    "Company Unique ID",
    "Experience Certificate No ",
    "Contract Amount",
    "Contract Start & End Date",
    "Work Status",
]
for page in range(1, 11):  # <--- Increase number of pages here
    print(f"Page: {page}")
    data["pageNo"] = page
    response = requests.post(
        "https://www.eprocure.gov.bd/AdvSearcheCMSServlet", data=data
    )
    # The HTML is missing a `table` tag, so we need to fix it
    soup = BeautifulSoup("<table>" + response.text + "</table>", "html.parser")
    df = pd.read_html(str(soup))[0]
    df.columns = _columns
    print(df.to_string())
When I change the number of pages in the for loop, the resulting df contains only the 10 rows of the last page (page 10 in the example above, since range(1, 11) stops at 10).
Instead, the output I want is a single DataFrame containing the tables from all pages.
1 Answer
ruarlubt1#
You can use pandas.concat. Create a list *outside* the loop (the name all_dfs in the snippets below is an illustrative choice):
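all_dfs = []  # illustrative name; will collect one DataFrame per page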
and append to it inside the loop:
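    all_dfs.append(df)  # inside the for loop body, instead of printing each page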
Then, once again outside the loop, concatenate everything:
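final_df = pd.concat(all_dfs, ignore_index=True)  # ignore_index=True renumbers rows across pages (illustrative; plain pd.concat(all_dfs) also works)
print(final_df.to_string())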
Full example:
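A minimal runnable sketch, assuming the same request payload and column list as in the question; the list name all_dfs and the ignore_index=True argument are illustrative:

import requests
import pandas as pd
from bs4 import BeautifulSoup

data = {
    "action": "geteCMSList",
    "keyword": "",
    "officeId": "0",
    "contractAwardTo": "",
    "contractStartDtFrom": "",
    "contractStartDtTo": "",
    "contractEndDtFrom": "",
    "contractEndDtTo": "",
    "departmentId": "",
    "tenderId": "",
    "procurementMethod": "",
    "procurementNature": "",
    "contAwrdSearchOpt": "Contains",
    "exCertSearchOpt": "Contains",
    "exCertificateNo": "",
    "tendererId": "",
    "procType": "",
    "statusTab": "eTenders",
    "pageNo": "1",
    "size": "10",
    "workStatus": "All",
}
_columns = [
    "S. No",
    "Ministry, Division, Organization, PE",
    "Procurement Nature, Type & Method",
    "Tender/Proposal ID, Ref No., Title..",
    "Contract Awarded To",
    "Company Unique ID",
    "Experience Certificate No ",
    "Contract Amount",
    "Contract Start & End Date",
    "Work Status",
]

all_dfs = []  # one DataFrame per page

for page in range(1, 11):  # <--- Increase number of pages here
    print(f"Page: {page}")
    data["pageNo"] = page
    response = requests.post(
        "https://www.eprocure.gov.bd/AdvSearcheCMSServlet", data=data
    )
    # The HTML is missing a `table` tag, so we need to fix it
    soup = BeautifulSoup("<table>" + response.text + "</table>", "html.parser")
    df = pd.read_html(str(soup))[0]
    df.columns = _columns
    all_dfs.append(df)

# Combine the per-page tables into a single DataFrame
final_df = pd.concat(all_dfs, ignore_index=True)
print(final_df.to_string())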