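I'm having trouble scraping a page that loads its content dynamically.
The idea is to collect the type, address, neighborhood, footage, and price of each property, but after several attempts to get the code working with a scrolling PageMethod, I still can't collect any data into the .json file.
I have read the Scrapy and Playwright documentation but still haven't found a way to make it work. I have already made the changes to settings.py described in the scrapy-playwright plugin documentation (https://github.com/scrapy-plugins/scrapy-playwright).
Can anyone offer tips on scraping an infinite-scroll page with the scrapy-playwright library?
Code used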
    import scrapy
    from scrapy_playwright.page import PageMethod


    class PwspiderSpider(scrapy.Spider):
        name = 'pwspider'
        allowed_domains = ['quintoandar.com.br']

        def start_requests(self):
            yield scrapy.Request(
                url='https://www.quintoandar.com.br/comprar/imovel/sao-paulo-sp-brasil?survey=profiling_survey_sale_v2&survey_origin=home',
                meta=dict(
                    playwright=True,
                    playwright_include_page=True,
                    playwright_page_methods=[
                        PageMethod('wait_for_selector', 'a.sc-j3hcja-0 kTTiFJ'),
                        PageMethod('evaluate', 'window.scrollBy(0, document.body.scrollHeight)'),
                        PageMethod('wait_for_selector', 'a.sc-j3hcja-0 kTTiFJ:nth-child(8)'),
                    ],
                    errback = self.errback
                )
            )

        async def parse(self, response):
            page = response.meta['playwright_page']
            for property in response.css('div.MuiBox-root jss98 sc-eCYdqJ fIxulU'):
                yield {
                    'type': property.css('span.sc-gsnTZi iRsaMY sc-crXcEl jddosl CozyTypography::text').get(),
                    'address': property.css('span.sc-gsnTZi dOINxy sc-crXcEl jddosl CozyTypography::text').get(),
                    'neighborhood': property.css('span.sc-gsnTZi TrtQb sc-crXcEl jddosl CozyTypography::text').get(),
                    'footage': property.css('small.sc-gsnTZi gvQbKz sc-crXcEl jddosl sc-hhh4j4-1 gepjiM CozyTypography::text').get(),
                    'rent_price': property.css('small.sc-gsnTZi iRsaMY sc-crXcEl jddosl CozyTypography::text').get()
                }

        async def errback(self, failture):
            page = failture.request.meta['playwright_page']
            await page.close()
Terminal command and the final part of the output
>>> scrapy crawl pwspider -o properties.json
playwright._impl._api_types.TimeoutError: Timeout 30000ms exceeded.
=========================== logs ===========================
waiting for locator("a.sc-j3hcja-0 kTTiFJ") to be visible
============================================================
2022-11-19 19:40:08 [scrapy.core.engine] INFO: Closing spider (finished)
2022-11-19 19:40:08 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/exception_count': 1,
'downloader/exception_type_count/playwright._impl._api_types.TimeoutError': 1,
'downloader/request_bytes': 858,
'downloader/request_count': 2,
'downloader/request_method_count/GET': 2,
'downloader/response_bytes': 2005,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'elapsed_time_seconds': 39.468435,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2022, 11, 19, 22, 40, 8, 934721),
'httpcompression/response_bytes': 2658,
'httpcompression/response_count': 1,
'log_count/DEBUG': 530,
'log_count/ERROR': 1,
'log_count/INFO': 14,
'memusage/max': 67170304,
'memusage/startup': 67170304,
'playwright/context_count': 1,
'playwright/context_count/max_concurrent': 1,
'playwright/context_count/non-persistent': 1,
'playwright/page_count': 1,
'playwright/page_count/max_concurrent': 1,
'playwright/request_count': 260,
'playwright/request_count/method/GET': 223,
'playwright/request_count/method/POST': 37,
'playwright/request_count/navigation': 10,
'playwright/request_count/resource_type/document': 10,
'playwright/request_count/resource_type/fetch': 17,
'playwright/request_count/resource_type/font': 3,
'playwright/request_count/resource_type/image': 78,
'playwright/request_count/resource_type/ping': 20,
'playwright/request_count/resource_type/script': 115,
'playwright/request_count/resource_type/stylesheet': 3,
'playwright/request_count/resource_type/xhr': 14,
'playwright/response_count': 260,
'playwright/response_count/method/GET': 223,
'playwright/response_count/method/POST': 37,
'playwright/response_count/resource_type/document': 10,
'playwright/response_count/resource_type/fetch': 17,
'playwright/response_count/resource_type/font': 3,
'playwright/response_count/resource_type/image': 78,
'playwright/response_count/resource_type/ping': 20,
'playwright/response_count/resource_type/script': 115,
'playwright/response_count/resource_type/stylesheet': 3,
'playwright/response_count/resource_type/xhr': 14,
'response_received_count': 1,
'robotstxt/request_count': 1,
'robotstxt/response_count': 1,
'robotstxt/response_status_count/200': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2022, 11, 19, 22, 39, 29, 466286)}
2022-11-19 19:40:08 [scrapy.core.engine] INFO: Spider closed (finished)
2022-11-19 19:40:08 [scrapy-playwright] INFO: Closing download handler
2022-11-19 19:40:08 [scrapy-playwright] INFO: Closing download handler
2022-11-19 19:40:09 [scrapy-playwright] DEBUG: Browser context closed: 'default' (persistent=False)
2022-11-19 19:40:09 [scrapy-playwright] INFO: Closing browser
1 Answer
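The output you shared indicates that the selector does not match any element on the page: the log shows Playwright waiting for locator("a.sc-j3hcja-0 kTTiFJ") to be visible until the 30000ms timeout expires.
The space between the two classes in the selector is incorrect. In CSS a space is the descendant combinator, so a.sc-j3hcja-0 kTTiFJ looks for a <kTTiFJ> element nested inside an a.sc-j3hcja-0 element. To match a single element that carries both classes, join the class names with dots: a.sc-j3hcja-0.kTTiFJ. The same applies to the selectors used inside the callback.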
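As a small illustration of the point about spaces, the sketch below turns a class attribute copied from the browser's inspector into a compound selector. The helper name compound_selector is hypothetical (it is not part of Scrapy or Playwright), and it assumes the class names contain no characters that need CSS escaping.

```python
def compound_selector(tag: str, class_attr: str) -> str:
    """Build a selector matching one <tag> element that has every listed class.

    Hypothetical helper for illustration; class names are assumed to need
    no CSS escaping.
    """
    classes = class_attr.split()  # "sc-j3hcja-0 kTTiFJ" -> ["sc-j3hcja-0", "kTTiFJ"]
    return tag + "".join(f".{name}" for name in classes)


# Selector from the question: the space makes kTTiFJ a descendant *tag* name.
wrong = "a.sc-j3hcja-0 kTTiFJ"

# Compound selector: a single <a> element carrying both classes.
right = compound_selector("a", "sc-j3hcja-0 kTTiFJ")
print(right)  # a.sc-j3hcja-0.kTTiFJ
```

Note that class names such as sc-j3hcja-0 or jss98 look auto-generated and may change whenever the site is redeployed, so any selector built from them is likely to be brittle.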
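Also, you don't seem to be using the Playwright page in your callback, so I'd suggest removing playwright_include_page=True from the meta to simplify things. After that you can also drop the errback, because the handler closes the page itself when an error occurs. Finally, since you are not awaiting anything in the callback, there is no need for an async generator: you can replace async def with def.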