Scrapy Playwright: get dates by clicking a button

izkcnapc · asked 2022-11-09 · 1 answer

I'm trying to scrape Google Flights using Scrapy and scrapy-playwright. The page has a date-selection input field; I want to read the date range from it, collect some other data from the page, then change the date and collect data again, and so on. I currently have a script that runs, but it doesn't do quite what I want.
Here is the latest code:

import scrapy
from scrapy_playwright.page import PageCoroutine
from bs4 import BeautifulSoup

class PwExSpider(scrapy.Spider):
    name = "pw_ex"

    headers = {
        "authority": "www.google.com",
        "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
        "accept-language": "en,ru;q=0.9",
        "cache-control": "max-age=0",
        # Requests sorts cookies= alphabetically
        # 'cookie': 'ANID=AHWqTUmN_Nw2Od2kmVHB-V-BPMn7lUDKjrsMYy6hJGcTF6v7U8u5YjJPArPDJI4K; SEARCH_SAMESITE=CgQIhpUB; CONSENT=YES+shp.gws-20220509-0-RC1.en+FX+229; OGPC=19022519-1:19023244-1:; SID=LwgAuUOC2U32iRLEjSQUdzx-18XGenx489M7BtkpBNDmZ_obyU799NLH7re0HlcH0tGNpg.; __Secure-1PSID=LwgAuUOC2U32iRLEjSQUdzx-18XGenx489M7BtkpBNDmZ_obMMyHAVo5IhVZXcHbzyERTw.; __Secure-3PSID=LwgAuUOC2U32iRLEjSQUdzx-18XGenx489M7BtkpBNDmZ_obxoNZznCMM25HAO4zuDeNTw.; HSID=A24bEjBTX5lo_2EDh; SSID=AXpmgSwtU6fitqkBi; APISID=PhBKYPpLmXydAQyJ/AzHdHtibgwX2VeVmr; SAPISID=bR71_zlABgKzGVWh/Ae0bo1S1RV74H5p0z; __Secure-1PAPISID=bR71_zlABgKzGVWh/Ae0bo1S1RV74H5p0z; __Secure-3PAPISID=bR71_zlABgKzGVWh/Ae0bo1S1RV74H5p0z; OTZ=6574663_36_36__36_; 1P_JAR=2022-07-02-19; NID=511=V3Tw5Rz0i058NG-nDiH7T8ePoRgiQTzp1MzxA-fzgJxrMiyJmXPbOtsbbIGWUZSY47b9zRw5E_CupzMBaUwWxUfxduldltqHJ8KDFsbW4F_WbUTzaHCFnwoQqEbckzWXG-12Sj94-L-Q8AIFd9UTpOzgi1jglT2pmEUzAdJ2uvO70QZ577hdlROJ4RMxl-FMefvoSJOhJOBEsW2_8H5vffLkJX-PNvl8U9gq_vyUqb_FYGx7zFBfZ5v8YPmQFFia523NrlK_J9VhdyEwGw5B3eaicpWZ8BPTEBFlYyPlnKr5PBhKeHCBL1jjc5N9WOrXHIko0hSPuQLAV8hIaiAwjHdt9ISJM3Lv7-MTiFhz7DJhCH7l72wxJtjpjw2p4gpDA5ewL5EfnhXss6sd; SIDCC=AJi4QfEvHIMmVfhjcEMP5ngU_yyfA1iSDYNmmbNKnGq3w0EspvCZaZ8Hd1oobxtDOIsY1LjJDS8; __Secure-1PSIDCC=AJi4QfEB_vOMIx2aSaNP7YGkLcpMBxMMJQLwZ5MuHjcFPrWipfycBV4V4yjT9dtifeYHAXLU_1I; __Secure-3PSIDCC=AJi4QfFhA4ftN_yWMxTXryTwMwdIdfLZzsAyzZM0lPkjhUrrRYnQwHzg87pPFf12QdgLEvpEFFc',
        "referer": "https://www.google.com/",
        "sec-ch-ua": '" Not A;Brand";v="99", "Chromium";v="100", "Yandex";v="22"',
        "sec-ch-ua-arch": '"x86"',
        "sec-ch-ua-bitness": '"64"',
        "sec-ch-ua-full-version": '"22.5.0.1879"',
        "sec-ch-ua-full-version-list": '" Not A;Brand";v="99.0.0.0", "Chromium";v="100.0.4896.143", "Yandex";v="22.5.0.1879"',
        "sec-ch-ua-mobile": "?0",
        "sec-ch-ua-model": '""',
        "sec-ch-ua-platform": '"Linux"',
        "sec-ch-ua-platform-version": '"5.4.0"',
        "sec-ch-ua-wow64": "?0",
        "sec-fetch-dest": "document",
        "sec-fetch-mode": "navigate",
        "sec-fetch-site": "same-origin",
        "sec-fetch-user": "?1",
        "upgrade-insecure-requests": "1",
        "user-agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.143 Safari/537.36",
    }

    def start_requests(self):
        yield scrapy.Request(
            "https://www.google.com/travel/flights/search?tfs=CBwQAhooagwIAxIIL20vMDE3N3oSCjIwMjItMDctMDNyDAgDEggvbS8wNmM2MhooagwIAxIIL20vMDZjNjISCjIwMjItMDctMjJyDAgDEggvbS8wMTc3enABggELCP___________wFAAUgBmAEB&tfu=EgYIARABGAA&curr=EUR",
            headers=self.headers,
            meta=dict(
                playwright=True,
                playwright_include_page=True,
                playwright_page_coroutines=[
                    PageCoroutine("wait_for_selector", "h3.zBTtmb.ZSxxwc"),
                ],
            ),
        )

    async def parse(self, response):
        page = response.meta["playwright_page"]

        for i in range(0, 5):
            html = response.text
            # print(html)
            soup = BeautifulSoup(html, "html.parser")
            search_date = soup.find_all("input")[-6]["value"]
            await page.click(
                "#yDmH0d > c-wiz.zQTmif.SSPGKf > div > div:nth-child(2) > c-wiz > div > c-wiz > div.PSZ8D.EA71Tc > div.Ep1EJd > div > div.rIZzse > div.bgJkKe.K0Tsu > div > div > div.dvO2xc.k0gFV > div > div > div:nth-child(1) > div > div.oSuIZ.YICvqf.kStSsc.ieVaIb > div > div.WViz0c.CKPWLe.U9gnhd.Xbfhhd > button"
            )

            yield {
                "search_date": search_date,
            }

The script above only returns "Sun, Jul 3", instead of all the dates in the range:

[
    {
        "search_date": "Sun, Jul 3"
    },
    {
        "search_date": "Sun, Jul 3"
    },
    {
        "search_date": "Sun, Jul 3"
    },
    {
        "search_date": "Sun, Jul 3"
    },
    {
        "search_date": "Sun, Jul 3"
    }
]

Desired output:

[
{"search_date": "Sun, Jul 3"},
{"search_date": "Mon, Jul 4"},
{"search_date": "Tue, Jul 5"},
{"search_date": "Wed, Jul 6"},
{"search_date": "Thu, Jul 7"}
]

Could anyone here give me a hand? I'm very new to Scrapy and Playwright. Thanks.

Answer from tkclm6bt:

for i in range(0, 5):
    html = response.text
    # print(html)
    soup = BeautifulSoup(html, "html.parser")
    search_date = soup.find_all("input")[-6]["value"]
    await page.click(
        "#yDmH0d > c-wiz.zQTmif.SSPGKf > div > div:nth-child(2) > c-wiz > div > c-wiz > div.PSZ8D.EA71Tc > div.Ep1EJd > div > div.rIZzse > div.bgJkKe.K0Tsu > div > div > div.dvO2xc.k0gFV > div > div > div:nth-child(1) > div > div.oSuIZ.YICvqf.kStSsc.ieVaIb > div > div.WViz0c.CKPWLe.U9gnhd.Xbfhhd > button"
    )

    yield {
        "search_date": search_date,
    }

This logic for extracting all the dates in the range is incorrect:
1. You make a request to the flights page.
2. You get a response back.
3. In the parse method, you try to extract the search date:

search_date = soup.find_all("input")[-6]["value"]

This line returns only a single date, not a list. Beyond that, I don't follow the logic of the for loop: this code never issues any further requests, which is what you would need in order to step through the dates extracted from the page (as mentioned in the question).
The thing to note here is that you get the whole HTML response in one go. You can then use CSS selectors or soup to extract all the selectable dates from it. Running the for loop five times doesn't solve anything, because you just extract the same information five times.
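A side note on why every iteration yields the same date: response.text is a static snapshot taken when the response first arrived, so the clicks can never show up in it. If you did want to keep the click-in-a-loop approach instead, you would have to re-read the live page after each click, e.g. with Playwright's page.content(). A minimal sketch of that variant, reusing the question's button selector and a crude fixed wait:

async def parse(self, response):
    page = response.meta["playwright_page"]

    for _ in range(5):
        # Re-read the live DOM each time; response.text never changes.
        html = await page.content()
        soup = BeautifulSoup(html, "html.parser")
        search_date = soup.find_all("input")[-6]["value"]
        yield {"search_date": search_date}

        # Advance the date picker, then give the widget time to update
        # (a fixed wait is crude; waiting for a DOM change is better).
        await page.click(
            "#yDmH0d > c-wiz.zQTmif.SSPGKf > div > div:nth-child(2) > c-wiz > div > c-wiz > div.PSZ8D.EA71Tc > div.Ep1EJd > div > div.rIZzse > div.bgJkKe.K0Tsu > div > div > div.dvO2xc.k0gFV > div > div > div:nth-child(1) > div > div.oSuIZ.YICvqf.kStSsc.ieVaIb > div > div.WViz0c.CKPWLe.U9gnhd.Xbfhhd > button"
        )
        await page.wait_for_timeout(1000)

    await page.close()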
Using response.css('<<Path to the dates in the select>>').getall() you can get all the dates you are looking for, and then process that information further.
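For example, a rough sketch; the ::attr(value) selector below is a placeholder to be replaced with the real path to the dates on the page:

async def parse(self, response):
    # Placeholder selector: point it at wherever the dates actually
    # live (inspect the date widget in the browser's dev tools).
    search_dates = response.css("input.date::attr(value)").getall()
    for search_date in search_dates:
        yield {"search_date": search_date}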

**Improving the logic:** you could restructure this. I don't see why you want to extract the whole date range when you could simply extract the page's departure and return dates and use those to make requests. Or extract only the departure date, increment it, and make another request with the incremented date to get the further information.
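A sketch of that last idea, assuming the departure date can be read from the page; build_search_url is a hypothetical helper here, since the tfs= parameter in the question's URL encodes the dates and would have to be regenerated for each new date:

from datetime import datetime, timedelta

import scrapy


class FlightDatesSpider(scrapy.Spider):
    name = "flight_dates"  # hypothetical spider, for illustration only

    def build_search_url(self, date):
        # Hypothetical helper: rebuild the Google Flights search URL
        # for the given departure date (the tfs= parameter encodes the
        # dates, so it would have to be regenerated per request).
        raise NotImplementedError

    def parse(self, response):
        # Placeholder selector; the page shows the departure date as
        # e.g. "Sun, Jul 3", so a year is appended to make it parseable.
        raw = response.css("input::attr(value)").get()
        departure = datetime.strptime(f"{raw} 2022", "%a, %b %d %Y")
        yield {"search_date": raw}

        # Increment the date and request the next day's results
        # (a real spider would also need a stopping condition).
        next_day = departure + timedelta(days=1)
        yield scrapy.Request(
            self.build_search_url(next_day), callback=self.parse
        )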
