How do I "click" a sub-URL in Python, scrape that URL, and append the scraped data to the parent file's output?

dfty9e19 asked on 2023-03-21 in Python
  • Python: 3.11.2
  • Editor: PyCharm 2022.3.3 (Community Edition), Build PC-223.8836.43
  • OS: Windows 11 Pro, 22H2, 22621.1413
  • Browser: Chrome 111.0.5563.65 (Official Build) (64-bit)

Still a baby Pythoneer, I am scraping the URL https://dockets.justia.com/search?parties=Novo+Nordisk, but I also want to scrape the 10 case pages it hyperlinks to (e.g., https://dockets.justia.com/docket/puerto-rico/prdce/3:2023cv01127/175963, https://dockets.justia.com/docket/california/cacdce/2:2023cv01929/878409, etc.).
How do I (1) "open" each of the 10 hyperlinked pages, (2) scrape the information in each child page (e.g., the table with classes table-responsive with-gaps table-padding--small table-bordered table-padding-sides--small table-full-width), and (3) append the captured information to the output file indexed to the parent URL? A sketch of the direction I am imagining follows my code below.
I have looked at Selenium, which could presumably open and drive the pages, but it does not seem particularly suited here. Do I really need Selenium, or is there a nice, simple way to do this?
This is what I have so far...

from bs4 import BeautifulSoup
import requests
import os

html_text = requests.get("https://dockets.justia.com/search?parties=Novo+Nordisk").text
soup = BeautifulSoup(html_text, "lxml")
cases = soup.find_all("div", class_="has-padding-content-block-30 -zb")

# Printing to individual files
os.makedirs("posts", exist_ok=True)  # the output folder must exist before writing
for index, case in enumerate(cases):
    case_number = case.find("span", class_="citation").text.replace(" ", "")
    case_url = case.find("a", {"class": "case-name"})["href"]

    with open(f"posts/{index}.txt", "w") as f:
        f.write(f"Case No.: {case_number.strip()} \t")
        f.write(f"Case URL: {case_url} \n")
    print(f"File saved: {index}")

# If printing in the terminal instead
# for case in cases:
#    case_number = case.find("span", class_="citation").text.replace(" ", "")
#    case_url = case.find("a", {"class": "case-name"})["href"]
#    print(f"Case No.: {case_number.strip()}")  # strip() trims surrounding whitespace
#    print(f"Case URL: {case_url}")

bz4sfanl #1

from aiohttp import ClientSession
from pyuseragents import random
from bs4 import BeautifulSoup
from asyncio import run

class DocketsJustia:

    def __init__(self):
        self.headers = {
            'authority': 'dockets.justia.com',
            'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7',
            'accept-language': 'ro-RO,ro;q=0.9,en-US;q=0.8,en;q=0.7',
            'cache-control': 'max-age=0',
            'referer': 'https://dockets.justia.com/search?parties=Novo+Nordisk',
            'user-agent': random(),  # pyuseragents supplies a random User-Agent string
        }

        self.PatchFile = "nametxt.txt"  # output file: one "case number <TAB> link" per line

    async def Parser(self, session):
        count = 1

        while True:
            params = {
                'parties': 'Novo Nordisk',
                'page': f'{count}',
            }

            # Let aiohttp build the query string instead of repeating it in the URL
            async with session.get('https://dockets.justia.com/search', params=params) as response:
                links = BeautifulSoup(await response.text(), "lxml").find_all("div", {"class": "has-padding-content-block-30 -zb"})

            if not links:  # an empty page means we are past the last page of results
                break

            for link in links:
                try:
                    case_link = link.find("a", {"class": "case-name"}).get("href")
                    case_number = link.find("span", {"class": "citation"}).text
                except AttributeError:  # entry without a case name or citation
                    continue

                print(case_number + "\t" + case_link + "\n")
                with open(self.PatchFile, "a", encoding='utf-8') as file:
                    file.write(case_number + "\t" + case_link + "\n")

            count += 1

    async def LoggerParser(self):
        async with ClientSession(headers=self.headers) as session:
            await self.Parser(session)

def StartDocketsJustia():
    run(DocketsJustia().LoggerParser())

if __name__ == '__main__':
    StartDocketsJustia()
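
To also open each case page and append its docket table under the parent entry (steps (1)-(3) of the question), a method along these lines could be added to the class. This is only a sketch: the table classes come from the question, and I have not verified the detail-page markup.

    async def FetchCaseTable(self, session, case_link):
        # Assumption: the classes quoted in the question identify the docket table
        selector = ("table.table-responsive.with-gaps.table-padding--small"
                    ".table-bordered.table-padding-sides--small.table-full-width")
        async with session.get(case_link) as response:
            soup = BeautifulSoup(await response.text(), "lxml")
        table = soup.select_one(selector)
        if table is None:
            return ""
        rows = []
        for row in table.find_all("tr"):
            cells = [cell.get_text(strip=True) for cell in row.find_all(["th", "td"])]
            rows.append("\t".join(cells))
        return "\n".join(rows)

Inside the for-link loop, after writing the case number and link, call table_text = await self.FetchCaseTable(session, case_link) and then file.write(table_text + "\n"), so each table lands right under its parent line in nametxt.txt.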
