scrapy-playwright: problems scraping a dynamically loaded page

3vpjnl9f  posted on 2022-11-23  in  Other

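I am having trouble scraping a page whose content is loaded dynamically.
The idea is to collect the type, address, neighborhood, footage, and price of each property, but after several attempts to make the code work with a scrolling PageMethod, I still cannot get any data into the .json file.
I have read the Scrapy and Playwright documentation but still have not found a way to make it work. I have already made the changes to settings.py described in the scrapy-playwright plugin documentation (https://github.com/scrapy-plugins/scrapy-playwright).
I would appreciate any pointers on using the scrapy-playwright library to scrape an infinite-scroll page.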
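
For context, the settings.py changes that the scrapy-playwright README describes are, as far as I know, along the following lines. The asker's actual file is not shown, so this is an assumed minimal configuration rather than a copy of it.

```python
# settings.py (assumed minimal scrapy-playwright configuration)

# Route HTTP and HTTPS downloads through the Playwright download handler.
DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}

# scrapy-playwright needs Twisted's asyncio-based reactor.
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
```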

Code used

import scrapy
from scrapy_playwright.page import PageMethod

class PwspiderSpider(scrapy.Spider):
    name = 'pwspider'
    allowed_domains = ['quintoandar.com.br']
    
    def start_requests(self):
        yield scrapy.Request(
            url='https://www.quintoandar.com.br/comprar/imovel/sao-paulo-sp-brasil?survey=profiling_survey_sale_v2&survey_origin=home', 
            meta=dict(
                playwright=True,
                playwright_include_page=True,
                playwright_page_methods=[
                    PageMethod('wait_for_selector', 'a.sc-j3hcja-0 kTTiFJ'),
                    PageMethod('evaluate', 'window.scrollBy(0, document.body.scrollHeight)'),
                    PageMethod('wait_for_selector', 'a.sc-j3hcja-0 kTTiFJ:nth-child(8)'),
                ],
                errback = self.errback
            )   
        )  
        

    async def parse(self, response):
        page = response.meta['playwright_page']
        for property in response.css('div.MuiBox-root jss98 sc-eCYdqJ fIxulU'):
            yield {
            'type': property.css('span.sc-gsnTZi iRsaMY sc-crXcEl jddosl CozyTypography::text').get(),
            'address': property.css('span.sc-gsnTZi dOINxy sc-crXcEl jddosl CozyTypography::text').get(),
            'neighborhood': property.css('span.sc-gsnTZi TrtQb sc-crXcEl jddosl CozyTypography::text').get(),
            'footage': property.css('small.sc-gsnTZi gvQbKz sc-crXcEl jddosl sc-hhh4j4-1 gepjiM CozyTypography::text').get(),
            'rent_price': property.css('small.sc-gsnTZi iRsaMY sc-crXcEl jddosl CozyTypography::text').get()
            }
    
            
    async def errback(self, failture):
        page = failture.request.meta['playwright_page']
        await page.close()

Terminal command and the last part of the output

>>> scrapy crawl pwspider -o properties.json

playwright._impl._api_types.TimeoutError: Timeout 30000ms exceeded.
=========================== logs ===========================
waiting for locator("a.sc-j3hcja-0 kTTiFJ") to be visible
============================================================
2022-11-19 19:40:08 [scrapy.core.engine] INFO: Closing spider (finished)
2022-11-19 19:40:08 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/exception_count': 1,
 'downloader/exception_type_count/playwright._impl._api_types.TimeoutError': 1,
 'downloader/request_bytes': 858,
 'downloader/request_count': 2,
 'downloader/request_method_count/GET': 2,
 'downloader/response_bytes': 2005,
 'downloader/response_count': 1,
 'downloader/response_status_count/200': 1,
 'elapsed_time_seconds': 39.468435,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2022, 11, 19, 22, 40, 8, 934721),
 'httpcompression/response_bytes': 2658,
 'httpcompression/response_count': 1,
 'log_count/DEBUG': 530,
 'log_count/ERROR': 1,
 'log_count/INFO': 14,
 'memusage/max': 67170304,
 'memusage/startup': 67170304,
 'playwright/context_count': 1,
 'playwright/context_count/max_concurrent': 1,
 'playwright/context_count/non-persistent': 1,
 'playwright/page_count': 1,
 'playwright/page_count/max_concurrent': 1,
 'playwright/request_count': 260,
 'playwright/request_count/method/GET': 223,
 'playwright/request_count/method/POST': 37,
 'playwright/request_count/navigation': 10,
 'playwright/request_count/resource_type/document': 10,
 'playwright/request_count/resource_type/fetch': 17,
 'playwright/request_count/resource_type/font': 3,
 'playwright/request_count/resource_type/image': 78,
 'playwright/request_count/resource_type/ping': 20,
 'playwright/request_count/resource_type/script': 115,
 'playwright/request_count/resource_type/stylesheet': 3,
 'playwright/request_count/resource_type/xhr': 14,
 'playwright/response_count': 260,
 'playwright/response_count/method/GET': 223,
 'playwright/response_count/method/POST': 37,
 'playwright/response_count/resource_type/document': 10,
 'playwright/response_count/resource_type/fetch': 17,
 'playwright/response_count/resource_type/font': 3,
 'playwright/response_count/resource_type/image': 78,
 'playwright/response_count/resource_type/ping': 20,
 'playwright/response_count/resource_type/script': 115,
 'playwright/response_count/resource_type/stylesheet': 3,
 'playwright/response_count/resource_type/xhr': 14,
 'response_received_count': 1,
 'robotstxt/request_count': 1,
 'robotstxt/response_count': 1,
 'robotstxt/response_status_count/200': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2022, 11, 19, 22, 39, 29, 466286)}
2022-11-19 19:40:08 [scrapy.core.engine] INFO: Spider closed (finished)
2022-11-19 19:40:08 [scrapy-playwright] INFO: Closing download handler
2022-11-19 19:40:08 [scrapy-playwright] INFO: Closing download handler
2022-11-19 19:40:09 [scrapy-playwright] DEBUG: Browser context closed: 'default' (persistent=False)
2022-11-19 19:40:09 [scrapy-playwright] INFO: Closing browser
Answer 1 (dfty9e19)

The output you shared indicates that the selector does not match any element on the page:

playwright._impl._api_types.TimeoutError: Timeout 30000ms exceeded.
=========================== logs ===========================
waiting for locator("a.sc-j3hcja-0 kTTiFJ") to be visible

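The space between the two class names in the selector is the problem. In CSS a space is the descendant combinator, so a.sc-j3hcja-0 kTTiFJ looks for a <kTTiFJ> element nested inside an a.sc-j3hcja-0 element, which does not exist. To match an <a> element that carries both classes, chain the class names with dots and no space: a.sc-j3hcja-0.kTTiFJ. The same applies to every multi-class selector in the parse callback.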
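
Selectors like these are usually produced by copying the value of a class attribute from the browser's developer tools. A minimal sketch of turning such a value into a compound selector is shown here; css_for_classes is a hypothetical helper written for this illustration, not part of Scrapy or Playwright, and it assumes the class names need no CSS escaping.

```python
def css_for_classes(tag: str, class_attr: str) -> str:
    """Build a compound CSS selector from a tag name and a class attribute value.

    Hypothetical helper: assumes class names contain only characters that
    are valid in a CSS identifier, so no escaping is performed.
    """
    # "sc-j3hcja-0 kTTiFJ" -> ["sc-j3hcja-0", "kTTiFJ"]
    classes = class_attr.split()
    # Join with dots and no spaces so all classes must be on the same element.
    return tag + "".join(f".{name}" for name in classes)


print(css_for_classes("a", "sc-j3hcja-0 kTTiFJ"))  # a.sc-j3hcja-0.kTTiFJ
```

Class names of this kind are generated by the site's styling toolchain, so they can change between deployments; a selector built this way may need to be refreshed.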
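Also, you don't seem to be using the Playwright page in the callback, so I suggest removing playwright_include_page=True from meta to keep things simple. Once that is gone you can remove the errback as well, because the handler closes the page itself when an error occurs. Finally, since you don't await anything in the callback, it does not need to be an async generator: you can replace async def with def.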
