- Python: Python 3.11.2
- Python editor: PyCharm 2022.3.3 (Community Edition), Build PC-223.8836.43
- Operating system: Windows 11 Pro, 22H2, 22621.1413
- Browser: Chrome 111.0.5563.65 (Official Build) (64-bit)
Still very much a baby Pythoneer, I am scraping the URL https://dockets.justia.com/search?parties=Novo+Nordisk, but I also want to scrape the ten pages it hyperlinks to (e.g., https://dockets.justia.com/docket/puerto-rico/prdce/3:2023cv01127/175963, https://dockets.justia.com/docket/california/cacdce/2:2023cv01929/878409, etc.).
How do I (1) "open" the ten hyperlinked pages, (2) scrape the information in each sub-hyperlinked docket, e.g., the table with the classes table-responsive with-gaps table-padding--small table-bordered table-padding-sides--small table-full-width, and then (3) append the captured information to the output file indexed by the parent URL?
I have looked into Selenium, which could probably open and drive the pages this way, but it doesn't seem especially well suited here. Do I really need Selenium, or is there some nice, simple way to do this?
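(One untested way to check whether Selenium is even needed: fetch a docket page with plain requests and see whether the table markup is already in the raw HTML. If it is, the page is server-rendered and no browser automation is required. The URL below is one of the example dockets from above.)

import requests

# Untested check: if the table's class name appears in the raw HTML,
# the page is server-rendered and plain requests should be enough.
url = "https://dockets.justia.com/docket/puerto-rico/prdce/3:2023cv01127/175963"
html = requests.get(url).text
print("table-bordered" in html)  # True would suggest Selenium is unnecessary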
Here is what I have so far…
from bs4 import BeautifulSoup
import requests

html_text = requests.get("https://dockets.justia.com/search?parties=Novo+Nordisk").text
soup = BeautifulSoup(html_text, "lxml")
cases = soup.find_all("div", class_="has-padding-content-block-30 -zb")

# Printing to individual files
for index, case in enumerate(cases):
    case_number = case.find("span", class_="citation").text.replace(" ", "")
    case_url = case.find("a", {"class": "case-name"})["href"]
    with open(f"posts/{index}.txt", "w") as f:
        f.write(f"Case No.: {case_number.strip()} \t")
        f.write(f"Case URL: {case_url} \n")
    print(f"File saved: {index}")

# If printing in the terminal instead
# for case in cases:
#     case_number = case.find("span", class_="citation").text.replace(" ", "")
#     case_url = case.find("a", {"class": "case-name"})["href"]
#     print(f"Case No.: {case_number.strip()}")  # strip() trims surrounding whitespace
#     print(f"Case URL: {case_url}")
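For steps (1) through (3), here is an untested sketch of the extension I have in mind, building on the loop above. It assumes the docket tables are present in the static HTML (i.e., not injected by JavaScript) and that one distinctive class from the long class string, table-bordered, is enough to locate them; urljoin is only a guard in case any case-name href turns out to be relative.

from urllib.parse import urljoin
import time

from bs4 import BeautifulSoup
import requests

search_url = "https://dockets.justia.com/search?parties=Novo+Nordisk"
soup = BeautifulSoup(requests.get(search_url).text, "lxml")
cases = soup.find_all("div", class_="has-padding-content-block-30 -zb")

for index, case in enumerate(cases):
    case_number = case.find("span", class_="citation").text.replace(" ", "")
    # urljoin is a no-op for absolute hrefs and resolves relative ones
    case_url = urljoin(search_url, case.find("a", {"class": "case-name"})["href"])

    # (1) "Open" the hyperlinked case page with a plain GET request
    case_soup = BeautifulSoup(requests.get(case_url).text, "lxml")

    # (2) Find the docket table; BeautifulSoup's class_ filter matches any
    # one of an element's classes, so a single distinctive class suffices
    table = case_soup.find("table", class_="table-bordered")

    # (3) Write the parent-URL info, then append the table rows below it
    with open(f"posts/{index}.txt", "w") as f:
        f.write(f"Case No.: {case_number.strip()}\t")
        f.write(f"Case URL: {case_url}\n")
        if table is not None:
            for row in table.find_all("tr"):
                cells = [c.get_text(strip=True) for c in row.find_all(["th", "td"])]
                f.write("\t".join(cells) + "\n")

    print(f"File saved: {index}")
    time.sleep(1)  # be polite: pause between requests

If the tables do turn out to be rendered by JavaScript, a plain GET will never see them, and Selenium (or a similar browser-automation tool) would become the fallback.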
1 Answer