Scraping a table from a web page with Python

Posted by zazmityj on 2023-11-15 in Python

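I want to get the contents of the table on this website. However, the page is built in an unusual way, and my code below only gets the table from the first page.
I know that with only three pages I could copy the data by hand, but I would still like to write a script that automates the whole process.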

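import time
from bs4 import BeautifulSoup as bs
from selenium import webdriver

driver = webdriver.Chrome()
driver.get(url)  # url: address of the page that shows the table
time.sleep(5)  # give the JavaScript time to render the table
html_str = driver.page_source
soup = bs(html_str, "html.parser")
soup.find("table")  # only the first page's table is in the HTML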

This is the paginator part of the soup. I have no web development experience and do not understand what happens after Next is clicked.

<ha-paginator data-translation-block="false" data-translation-id="1442"><!-- --><nav aria-label="Page navigation" class="text-center" data-translation-block="false" data-translation-id="1443">
<ul class="pagination" data-translation-block="false" data-translation-id="1444">
<!-- -->
<!-- --><li class="active" data-translation-block="false" data-translation-id="1445">
<!-- --><a data-translated="false" data-translation-checksum="57ad7d2ec0e248914c2b0ae7efc17011d1435f99d807e43b172697027ffe46ce500c3ff64f5162eaa059c11a23fa5d8c442ab67bd219d74311601bed517cf477" href="#"> 1
        <!-- --><span class="sr-only">(current)</span>
</a>
</li><li data-translation-block="false" data-translation-id="1446">
<!-- --><a data-translated="false" data-translation-checksum="7eece0387dc3c6876397df60e2d7dbe0e2c94ecdc42d7e50d5208a4c84885caa703c487d86900ac97f10ad493893db85144cf7889d8ac8fd008dfd4c8f0e98df" href="#"> 2
        <!-- -->
</a>
</li><li data-translation-block="false" data-translation-id="1447">
<!-- --><a data-translated="false" data-translation-checksum="aa08ec665075172d835562b332e78832e7f9d3b7f3df47d5a32b8f3a1682daaed49831faf19eeaca164d8e94e3449ade2a83d83dfaa83878c832f644fea11f95" href="#"> 3
        <!-- -->
</a>
</li><!-- --><li data-translated="false" data-translation-block="true" data-translation-checksum="7d03f54e74b11d46eacd33365a0aa16a3ba2857949c7f795c2d9c07b5689fbc4230dc22c45af2303eba21a7d8016f197d9b474d4149db6d0df059ce00416e192" data-translation-id="1448">
<a href="#">
          Next
        </a>
</li>
</ul>
</nav>
<!-- --></ha-paginator>
<hr class="big" data-translation-block="true" data-translation-id="1449"/>
</div>
</div>
</div>
</ha-table-search>
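
One thing the markup above does show is that every paginator link points at `href="#"`, which suggests the page change is driven by JavaScript rather than by a URL that could be requested directly. The following is a minimal sketch, using only the standard library's `html.parser`, that lists each link's label and target. `PaginatorLinks` and `SNIPPET` are illustrative names, and `SNIPPET` is a trimmed copy of the markup above, not the live page.

```python
from html.parser import HTMLParser


class PaginatorLinks(HTMLParser):
    """Collect (label, href) pairs for every <a> element in the parsed HTML."""

    def __init__(self):
        super().__init__()
        self.links = []     # finished (label, href) pairs
        self._in_a = False  # True while an <a> element is open
        self._href = None   # href of the <a> currently open
        self._text = []     # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._in_a = True
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._in_a:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._in_a:
            # Collapse runs of whitespace so " 1 \n (current)" becomes "1 (current)".
            label = " ".join("".join(self._text).split())
            self.links.append((label, self._href))
            self._in_a = False


# Trimmed copy of the paginator markup from the question (most attributes removed).
SNIPPET = """
<ul class="pagination">
  <li class="active"><a href="#"> 1 <span class="sr-only">(current)</span></a></li>
  <li><a href="#"> 2 </a></li>
  <li><a href="#"> 3 </a></li>
  <li><a href="#"> Next </a></li>
</ul>
"""

parser = PaginatorLinks()
parser.feed(SNIPPET)
links = parser.links
for label, href in links:
    print(f"{label!r} -> {href!r}")
```

Since none of the links carries a real target, following them with `requests` would not reach pages 2 and 3; the content has to come either from a simulated click or from wherever the script behind the page gets its data.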

db2dz4w8

The data you see on the page is loaded from an external URL via JavaScript, so you can fetch it directly from there:

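import pandas as pd
import requests

# URL the page loads its table data from
url = "https://immi.homeaffairs.gov.au/_layouts/15/api/data.aspx/GetPriceList"

# POST the filter as JSON; the table rows are under data["d"]["data"]
data = requests.post(url, json={"category": "Visa", "onshore": "All"}).json()
df = pd.DataFrame(data["d"]["data"])

df.pop("note")  # drop the "note" column
print(df.head(5))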

Prints:

visaSubclassCode                                           visaSubclassText streamCode streamText onShore    basePrice  over18Price under18Price nonInternetPrice subsequentPrice
0              100  Partner (Provisional and Migrant) visa (subclass 309/100)                            No  AUD8,850.00  AUD4,430.00  AUD2,215.00              N/A             N/A
1              101                                  Child visa (subclass 101)                            No  AUD3,055.00  AUD1,530.00    AUD765.00              N/A             N/A
2              102                               Adoption visa (subclass 102)                            No  AUD3,055.00  AUD1,530.00    AUD765.00              N/A             N/A
3              117                        Orphan Relative visa (subclass 117)                            No  AUD1,870.00    AUD935.00    AUD470.00              N/A             N/A
4              124                   Distinguished Talent visa (subclass 124)                            No  AUD4,110.00  AUD2,055.00  AUD1,030.00              N/A             N/A
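
If you would rather stay with Selenium, a rough sketch of clicking through the pages might look like the following. It is not tested against the real site and rests on several assumptions: the link's visible text is exactly "Next" (as in the paginator markup from the question), the link either disappears or stops changing the page once the last page is reached, and a fixed pause is long enough for the table to re-render. `collect_page_sources` is a hypothetical helper name. The driver is passed in, so the function only needs an object with `page_source` and `find_elements`, as a Selenium WebDriver has.

```python
import time


def collect_page_sources(driver, max_pages=10, pause=2.0):
    """Click "Next" repeatedly and return the HTML of each distinct page.

    `driver` is expected to behave like a Selenium WebDriver: it needs a
    `page_source` attribute and a `find_elements(by, value)` method.
    """
    sources = [driver.page_source]
    for _ in range(max_pages - 1):
        # "link text" is the locator string behind selenium's By.LINK_TEXT.
        next_links = driver.find_elements("link text", "Next")
        if not next_links:
            break  # assumption: no "Next" link means this is the last page
        next_links[0].click()
        time.sleep(pause)  # crude fixed wait for the JavaScript re-render
        html = driver.page_source
        if html == sources[-1]:
            break  # assumption: an unchanged page means "Next" did nothing
        sources.append(html)
    return sources
```

With a real browser you would call `driver.get(url)` first, then pass the driver to `collect_page_sources` and run `bs(html, "html.parser").find("table")` on each returned string. The `max_pages` cap is only there so the loop cannot run forever if neither stop condition ever triggers.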
