Scrapy CrawlSpider: getting data before extracting the link

6ss1mwsb  posted on 2022-11-09  in Other

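In a CrawlSpider, how can I extract the field marked "4 days ago" in the image before following each link? The CrawlSpider shown below works fine, but in 'parse_item' I want to add a new field called 'Add posted' that holds the value marked in the image.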

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class PropertySpider(CrawlSpider):
    name = 'property'

    allowed_domains = ['www.openrent.co.uk']
    start_urls = [
        'https://www.openrent.co.uk/properties-to-rent/london?term=London&skip='+ str(x) for x in range(0, 5, 20)
        ]

    rules = (
        Rule(LinkExtractor(restrict_xpaths="//div[@id='property-data']/a"), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        yield {
            'Title': response.xpath("//h1[@class='property-title']/text()").get(),
            'Price': response.xpath("//h3[@class='perMonthPrice price-title']/text()").get(),
            'Links': response.url,
            'Add posted': None,  # ? how do I get the "4 days ago" value here
        }

llycmphe1#

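When you use a Rule object in a Scrapy CrawlSpider, the text of each extracted link is stored in the request's meta under the key link_text. You can read that value in the parse_item method and use a regex to pull out the time information. The Scrapy documentation has more details. See the example below.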

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
import re

class PropertySpider(CrawlSpider):
    name = 'property'

    allowed_domains = ['www.openrent.co.uk']
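    # Note: range(0, 5, 20) yields only 0, so this builds a single start URL;
    # raise the stop value to crawl more result pages.
    start_urls = [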
        'https://www.openrent.co.uk/properties-to-rent/london?term=London&skip='+ str(x) for x in range(0, 5, 20)
        ]

    rules = (
        Rule(LinkExtractor(restrict_xpaths="//div[@id='property-data']/a"), callback='parse_item', follow=True),
    )

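    def parse_item(self, response):
        # CrawlSpider stores the anchor text of the followed link in meta["link_text"].
        link_text = response.request.meta.get("link_text") or ""
        # Default to None so the field is still defined when nothing matches.
        posted = None
        m = re.search(r"(Last Updated.*ago)", link_text)
        if m:
            # Replace non-breaking spaces with ordinary spaces.
            posted = m.group(1).replace("\xa0", " ")

        yield {
            'Title': response.xpath("//h1[@class='property-title']/text()").get(),
            'Price': response.xpath("//h3[@class='perMonthPrice price-title']/text()").get(),
            'Links': response.url,
            'Add posted': posted,
        }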
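
The regex step can be tried on its own without running the spider. The following is only a sketch: the helper name extract_posted and the sample link text are made up for illustration, and the real link text on the site may look different.

```python
import re
from typing import Optional

# Same pattern as in the answer above, compiled once.
POSTED_RE = re.compile(r"(Last Updated.*ago)")


def extract_posted(link_text: Optional[str]) -> Optional[str]:
    """Return the 'Last Updated ... ago' fragment, or None if it is absent.

    Hypothetical helper; it mirrors the logic inside parse_item.
    """
    if not link_text:
        # meta.get("link_text") may return None, so bail out early.
        return None
    m = POSTED_RE.search(link_text)
    if m is None:
        return None
    # Replace non-breaking spaces with ordinary spaces.
    return m.group(1).replace("\xa0", " ")


# Made-up link text, assumed to resemble what the anchor contains.
sample = "2 Bed Flat, London\n £1,500 per month\n Last Updated\xa04 days ago\n"
print(extract_posted(sample))
print(extract_posted(None))
```

Returning None when there is no match keeps the 'Add posted' field present in every item, so a missing timestamp shows up as an empty value rather than as an exception.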

xwmevbvl2#

To get it in a loop, you can use the XPath below to retrieve that data point:

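# Select every timestamp block on the listing page.
x = response.xpath('//div[@class="timeStamp"]')
for i in x:
    # The text node after the <i> icon holds the timestamp; default to "" so strip() never hits None.
    yield {'result': i.xpath("./i/following-sibling::text()").get(default="").strip()}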
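
To see roughly what "./i/following-sibling::text()" picks up, here is a standard-library sketch that needs neither Scrapy nor a network connection. It is a stand-in, not the same code path: xml.etree.ElementTree does not support the following-sibling axis, so the sketch reads the .tail of the <i> element instead, which is the text sitting right after the icon. The markup and the timestamps function name are invented for the example.

```python
import xml.etree.ElementTree as ET

# Made-up markup, assumed to resemble the listing page's timestamp blocks.
SAMPLE = """<root>
  <div class="timeStamp"><i class="icon"></i> Last Updated 4 days ago </div>
  <div class="timeStamp"><i class="icon"></i> Last Updated 2 hours ago </div>
  <div class="timeStamp"><i class="icon"></i></div>
</root>"""


def timestamps(markup: str) -> list:
    """Collect the text that follows the <i> icon in each timeStamp div."""
    root = ET.fromstring(markup)
    results = []
    for div in root.findall(".//div[@class='timeStamp']"):
        icon = div.find("i")
        if icon is None:
            continue
        # .tail is the text immediately after </i>; it is None when empty.
        text = (icon.tail or "").strip()
        if text:
            results.append(text)
    return results


for value in timestamps(SAMPLE):
    print(value)
```

The third div has no text after the icon and is skipped, which is the same situation the default="" guard protects against in the Scrapy version.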
