scrapy-playwright spider yields nothing but an error

j1dl9f46 · asked 2022-11-09 · Other

I'm learning scrapy-playwright and it is fighting me. I'm trying to gather store locations from a site using a CrawlSpider, with a process_request rule that flags requests to be run through playwright. In my callback def I can print the values found on the page, but I can't return or yield anything. I've tried stashing the data in an Item and returning/yielding a plain dict, and everything produces the same error:
ERROR: Spider must return request, item, or None, got 'Deferred'
I'm stumped.

import re
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from banners.items import StoreItem
from scrapy_playwright.page import PageCoroutine
from scrapy.http.response import Response

def set_playwright_true(request, response):
    request.meta["playwright"] = True
    request.meta["playwright_include_page"] = True
    request.meta["playwright_page_coroutines"] = ('wait_for_selector', 'span.store-name-city')
    return request

class StoreSpider(CrawlSpider):
    name = "retailer"
    allowed_domains = ['retailer.com']
    start_urls = ['https://www.retailer.com/store/0000-city-ak']
    custom_settings = {
        'ROBOTSTXT_OBEY': True ,
        #'DOWNLOAD_DELAY': .5 ,
        #'CONCURRENT_REQUESTS_PER_DOMAIN': 3 ,
        'DOWNLOAD_HANDLERS': {
            "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler" ,
            "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler" ,
        } ,
        'TWISTED_REACTOR': "twisted.internet.asyncioreactor.AsyncioSelectorReactor" ,
    }

    rules = (
        Rule(LinkExtractor(allow=('directory/ak/anchorage'))),
        Rule(LinkExtractor(allow=(r'store/[0-9]+'), deny=(r'store/[0-9]+.+/.+')), callback='parse_item', follow=False, process_request=set_playwright_true),
    )

    async def parse_item(self, response):
        items = []
        item = StoreItem()
        self.logger.info('*****Start processing ' + response.url + '.*****')
        Name = response.css('meta[itemprop=alternateName]').attrib['content'] + ' - ' + response.css('span.store-name-city::text').get()
        print(Name)

        item['Name'] = Name
        item['StoreID'] = response.css('meta[itemprop=storeID]').attrib['content']
        item['Address1'] = response.css('span.store-address-line-1::text').get()
        item['City'] = response.css('span.store-address-city::text').get()
        item['State'] = response.css('span.store-address-state::text').get()
        item['Zip'] = response.css('span.store-address-postal::text').get()
        item['Phone'] = response.css('div.store-phone::text').get()
        item['Latitude'] = response.css('meta[itemprop=latitude]').attrib['content']
        item['Longitude'] = response.css('meta[itemprop=longitude]').attrib['content']

        items.append(item)
        return(items)

sg3maiej · Answer #1

Changing parse_item from an async def to a regular def fixed the problem.

async def parse_item(self, response):

changed to

def parse_item(self, response):
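
Why this works, as far as I can tell: CrawlSpider runs rule callbacks through its own internal wrappers, and in the Scrapy version used here those wrappers do not await coroutine callbacks, so the coroutine returned by the async def surfaces as a Deferred and the scraper rejects it with the error above. Below is a minimal sketch of the corrected callback under that assumption; it reuses the question's StoreItem and selectors (trimmed to a few fields) and yields the item directly instead of building a list.

def parse_item(self, response):
    # Plain (non-async) callback: CrawlSpider's rule machinery can consume
    # the yielded item directly.
    item = StoreItem()
    self.logger.info('*****Start processing ' + response.url + '.*****')

    item['Name'] = (
        response.css('meta[itemprop=alternateName]').attrib['content']
        + ' - '
        + response.css('span.store-name-city::text').get()
    )
    item['StoreID'] = response.css('meta[itemprop=storeID]').attrib['content']
    item['Address1'] = response.css('span.store-address-line-1::text').get()
    item['City'] = response.css('span.store-address-city::text').get()
    item['Phone'] = response.css('div.store-phone::text').get()

    # The playwright-rendered HTML is already in `response`, so nothing
    # needs to be awaited here.
    yield item

One caveat: with a plain def you can no longer await the Playwright page object, so you may want to drop the playwright_include_page flag from set_playwright_true unless you actually need the page, since you would otherwise have no way to await page.close().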
