Scraping information with Scrapy

2uluyalo  posted on 2022-11-09  in  Other

I am trying to scrape the information shown below, but my spider produces the wrong output. What am I doing wrong? This is the page link: https://www.thegrommet.com/products/the-vintage-pearlmini-peas-in-the-pod-necklace

from scrapy import Spider
from scrapy.http import Request

class AuthorSpider(Spider):
    name = 'book'
    start_urls = ['https://www.thegrommet.com/gifts/by-type/personalized-gifts']
    custom_settings = {
        'CONCURRENT_REQUESTS_PER_DOMAIN': 1,
        'DOWNLOAD_DELAY': 1,
        'USER_AGENT': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36'
    }

    def parse(self, response):
        books = response.xpath("//div[@class='flex-grow | p-t-s']//@href").extract()
        for book in books:
            url = response.urljoin(book)
            yield Request(url, callback=self.parse_book)

    def parse_book(self, response):
        title=response.xpath("//div[@class='f-heading-xl']//text()").get()
        title=title.strip()
        d3=response.xpath("//div[@class='accordion-section | p-t-s p-b-m']")
        for pro in d3:
            data=[tup for tup in pro.xpath('//div//text()')]
            try:
                trip=data[1].get()
            except:
                trip=''
            trip=trip.strip()
            try:
                tuck=data[2].get()
            except:
                tuck=''
            tuck=tuck.strip()
            try:
                tup=data[3].get()
            except:
                tup=''
            tup=tup.strip()

        yield{ 
            'title':title,
            'd1':trip,
            'd2':tuck,
            'd3':tup,

            }

Answer 1 (bjp0bcyl):

You can select d1, d2 and d3 with the XPath expressions below, and you don't need try/except, since Scrapy's `.get()` simply returns None when nothing matches. You can also use the XPath `normalize-space()` function to remove leading and trailing whitespace and newlines.
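For reference, `normalize-space()` trims both ends and collapses internal runs of whitespace into single spaces. A pure-Python equivalent (a sketch, independent of Scrapy) is:

```python
def normalize_space(text):
    """Mimic XPath normalize-space(): strip leading/trailing whitespace
    and collapse internal runs of spaces, tabs, and newlines to one space."""
    return " ".join(text.split())

print(normalize_space("  Care:\n   wipe clean  "))  # prints "Care: wipe clean"
```

This is why the `.strip()` calls and try/except blocks in the original spider become unnecessary once the normalization happens inside the XPath expression itself.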

Full working code:

from scrapy import Spider
from scrapy.http import Request

class AuthorSpider(Spider):
    name = 'book'
    start_urls = ['https://www.thegrommet.com/gifts/by-type/personalized-gifts']
    custom_settings = {
        'CONCURRENT_REQUESTS_PER_DOMAIN': 1,
        'DOWNLOAD_DELAY': 1,
        'USER_AGENT': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.5005.62 Safari/537.36'
    }

    def parse(self, response):
        books = response.xpath("//div[@class='flex-grow | p-t-s']//@href").extract()
        for book in books:
            url = response.urljoin(book)
            yield Request(url, callback=self.parse_book)

    def parse_book(self, response):
        title = response.xpath("//div[@class='f-heading-xl']//text()").get()
        title = title.strip() if title else ''  # guard: .get() returns None if nothing matches

        yield {
            'title': title,
            'd1': response.xpath('normalize-space((//*[@class="accordion-section | p-t-s p-b-m"]/div)[1]/text()[1])').get(),
            'd2': response.xpath('normalize-space((//*[@class="accordion-section | p-t-s p-b-m"]/div)[2]/text()[1])').get(),
            'd3': response.xpath('normalize-space((//*[@class="accordion-section | p-t-s p-b-m"]/div)[3]/text()[1])').get(),
            'url': response.url,
        }
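A likely cause of the wrong output in the original loop is the inner query `pro.xpath('//div//text()')`: an expression starting with `//` searches the whole document, so every iteration returned the same nodes regardless of which section `pro` pointed at. A query relative to the current node must start with `.//`. The scoping idea, sketched with the standard library's ElementTree (not Scrapy, and with made-up sample markup) rather than a live page:

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document with two accordion-like sections.
html = """
<root>
  <section><div>Materials: pearl</div></section>
  <section><div>Care: wipe clean</div></section>
</root>
"""

root = ET.fromstring(html)
for section in root.findall('.//section'):
    # The leading '.' scopes the query to this section only; in a Scrapy
    # selector, omitting it would search the entire page on every pass.
    texts = [d.text for d in section.findall('.//div')]
    print(texts)  # prints ['Materials: pearl'] then ['Care: wipe clean']
```

The accepted answer sidesteps the issue entirely by using absolute, positionally indexed expressions (`(...)[1]`, `(...)[2]`, `(...)[3]`) instead of a loop.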
