Scrapy web scraping returns incorrect results

lf5gs5x2 posted on 2023-06-23 in Other

When I scrape the site, the spider runs without errors, but the output contains many blank rows and some incorrect data.

import scrapy

class AudibleSpider(scrapy.Spider):
    name = 'audible'
    allowed_domains = ['www.audible.com']
    start_urls = ['https://www.audible.com/search/']

    def parse(self, response):
        # Getting the box that contains all the info we want (title, author, length)
        product_container = response.xpath('//div[@class="adbl-impression-container "]//ul')

        # Looping through each product listed in the product_container box
        for product in product_container:
            book_title = product.xpath('.//h3[contains(@class, "bc-heading")]/a/text()').get()
            book_author = product.xpath('.//li[contains(@class, "authorLabel")]/span/a/text()').getall()
            book_length = product.xpath('.//li[contains(@class, "runtimeLabel")]/span/text()').get()

            # Return data extracted
            yield {
                'title': book_title,
                'author': book_author,
                'length': book_length,
            }
            

        pagination = response.xpath('//ul[contains(@class, "pagingElements")]')
        next_page_url = pagination.xpath('.//span[contains(@class, "nextButton")]/a/@href').get()
        if next_page_url:
            yield response.follow(url=next_page_url, callback=self.parse)

I expect one result per audiobook on each page, with its title, author, and length. The actual result is: [1]: https://i.stack.imgur.com/st2lm.png

zdwk9cvp1#

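If you use a more specific selector for the product container, you will get the results you want. The original expression ends in //ul, which matches every ul nested anywhere inside the container, including lists that hold no title, author, or runtime; that is the likely source of the blank rows.
For example: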

    def parse(self, response):
        # Getting the box that contains all the info we want (title, author, length)
        product_container = response.xpath('//*[@class="adbl-impression-container "]//ul//div[contains(@class, "bc-col-responsive")]/span//ul')

        # Looping through each product listed in the product_container box
        for product in product_container:
            book_title = product.xpath('.//h3[contains(@class, "bc-heading")]/a/text()').get()
            book_author = product.xpath('.//li[contains(@class, "authorLabel")]/span/a/text()').getall()
            book_length = product.xpath('.//li[contains(@class, "runtimeLabel")]/span/text()').get()

            # Return data extracted
            yield {
                'title': book_title,
                'author': book_author,
                'length': book_length,
            }
            

        pagination = response.xpath('//ul[contains(@class, "pagingElements")]')
        next_page_url = pagination.xpath('.//span[contains(@class, "nextButton")]/a/@href').get()
        if next_page_url:
            yield response.follow(url=next_page_url, callback=self.parse)
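
To see why the broader selector produces blank and incorrect rows, here is a minimal sketch that can be run without Scrapy. It swaps in the standard library's xml.etree.ElementTree in place of Scrapy selectors, and the markup is a hypothetical, simplified stand-in, not Audible's real page structure. The class names results, details, and badges and the helper extract are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup (an assumption, not the real Audible page):
# an outer list of products, each with a nested details list and a badges list.
PAGE = """
<div class="adbl-impression-container ">
  <ul class="results">
    <li>
      <div class="bc-col-responsive"><span>
        <ul class="details">
          <li><h3 class="bc-heading"><a>Book One</a></h3></li>
          <li class="authorLabel"><span><a>Author A</a></span></li>
          <li class="runtimeLabel"><span>Length: 5 hrs</span></li>
        </ul>
      </span></div>
      <ul class="badges"><li>Bestseller</li></ul>
    </li>
    <li>
      <div class="bc-col-responsive"><span>
        <ul class="details">
          <li><h3 class="bc-heading"><a>Book Two</a></h3></li>
          <li class="authorLabel"><span><a>Author B</a></span></li>
          <li class="runtimeLabel"><span>Length: 7 hrs</span></li>
        </ul>
      </span></div>
      <ul class="badges"><li>New</li></ul>
    </li>
  </ul>
</div>
"""


def extract(ul):
    """Pull title, authors, and length from one candidate <ul> (illustrative helper)."""
    title = ul.find(".//h3/a")
    length = ul.find(".//li[@class='runtimeLabel']/span")
    return {
        "title": title.text if title is not None else None,
        "author": [a.text for a in ul.findall(".//li[@class='authorLabel']/span/a")],
        "length": length.text if length is not None else None,
    }


root = ET.fromstring(PAGE)

# Broad selector: every <ul> at any depth, like the question's '//ul'.
broad = [extract(ul) for ul in root.findall(".//ul")]

# Narrow selector: only the details list inside each product column,
# mirroring the shape of the answer's expression.
specific = [extract(ul) for ul in root.findall(".//div[@class='bc-col-responsive']/span//ul")]

print(len(broad), "rows from the broad selector")
print(len(specific), "rows from the specific selector")
for row in broad:
    print(row)
```

In this sketch the broad selector also visits the outer results list, which mixes fields from several books into one row, and the badges lists, which yield rows with no title or length. The narrow selector visits only the lists that actually hold book details.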
