Trying to add multiple yields to a single JSON file using Scrapy

vcudknz3 · asked 2022-11-09

I am trying to figure out whether my Scrapy spider is correctly hitting the product_link for the request callback, 'yield scrapy.Request(product_link, callback=self.parse_new_item)'. product_link should be 'https://www.antaira.com/products/10-100Mbps/LNX-500A', but I have not been able to confirm that my program actually moves on to the next step I created, so that I can retrieve the correct yield. Thank you!
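A quick way to confirm whether the callback is actually reached (a minimal sketch, not from the original post; it assumes it is placed at the top of the parse_new_item method in the spider below and uses the spider's built-in logger) is to log the response URL inside the callback:

    # Verification sketch: goes inside the spider class, at the top of the callback
    def parse_new_item(self, response):
        # If this line never appears in the crawl log, the request was
        # dropped (e.g. filtered by allowed_domains) or never yielded.
        self.logger.info("parse_new_item reached: %s", response.url)

The callback can also be exercised on a single URL from the command line with Scrapy's parse command, e.g. scrapy parse --spider=productJumper -c parse_new_item 'https://www.antaira.com/products/10-100Mbps/LNX-500A'.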


# Import the required libraries
import scrapy

# Import the Item class with fields
# mentioned in the items.py file
from ..items import AntairaItem

# Spider class name
class productJumper(scrapy.Spider):

    # Name of the spider
    name = 'productJumper'

    # The domain to be scraped
    allowed_domains = ['antaira.com']

    # The URLs to be scraped from the domain
    start_urls = ['https://www.antaira.com/products/10-100Mbps']
    #target_url = ['https://www.antaira.com/products/10-100Mbps/LNX-500A']

    # First Step: Find every div with the class 'product-container' and step into the links
    def parse(self, response):
        #product_link = response.urljoin(rel_product_link)

        # creating items dictionary
        items = AntairaItem()

        rel_product_link = response.css('div.center767')
        for url in rel_product_link:
            rel_product_link = response.xpath('//div[@class="product-container"]//a/@href').get(),
            product_link = response.urljoin('rel_product_link'),
            items['rel_product_link'] = rel_product_link,
            items['product_link'] = product_link

            #yield items

    # 2nd Step: Return a list of all the product links that will be scraped
            #yield {
            #       take the first relative product link
            #        'rel_product_link' : rel_product_link,
            #        'product_link'  :   product_link,
            #}

            yield scrapy.Request(product_link, callback=self.parse_new_item)

    # Final Step: Run through each product and Yield the results
        def parse_new_item(self, response):
            for product in response.css('main.products'):

                name = product.css(('h1.product-name::text').strip(' \t\n\r')).get()
                features = product.css('section.features h3 + ul').getall()
                overview =   product.css('.products .product-overview::text').getall()
                main_image = product.css('div.selectors img::attr(src)').get()
                rel_links = product.xpath("//script/@src[contains(., '/app/site/hosting/scriptlet.nl')]").getall()

                items['name'] = name,
                items['features'] = features,
                items['overview'] = overview,
                items['main_image'] = main_image,
                items['rel_links'] = rel_links,

                yield items
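For reference, the spider imports AntairaItem from items.py, which was not included in the question; a minimal sketch of what that item definition presumably looks like, with the field names inferred from the assignments above:

    # items.py (assumed sketch; field names inferred from the spider above)
    import scrapy

    class AntairaItem(scrapy.Item):
        rel_product_link = scrapy.Field()
        product_link = scrapy.Field()
        name = scrapy.Field()
        features = scrapy.Field()
        overview = scrapy.Field()
        main_image = scrapy.Field()
        rel_links = scrapy.Field()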

41zrol4v · 1#

You have a few issues:

  1. Scrapy items are essentially dictionaries and are therefore mutable, so you need to create a unique item for each and every yield statement (the short sketch after this list illustrates the pitfall).
  2. Your second parse callback references a variable items that it has no access to, because it was defined in your first parse callback.
  3. In the urljoin method you are using a string literal instead of the rel_product_link variable.
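To see why issue 1 matters, here is a minimal standalone sketch (illustration only, not part of the original answer): because the loop mutates and appends the same dict object, the final mutation overwrites every earlier one.

    # Illustration of issue 1: one shared mutable item, mutated in a loop
    item = {}
    collected = []
    for value in ('first', 'second'):
        item['name'] = value
        collected.append(item)   # appends a reference to the SAME object
    print(collected)             # [{'name': 'second'}, {'name': 'second'}]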
In the example below I have fixed these issues, with some additional comments:
import scrapy
from ..items import AntairaItem

class ProductJumper(scrapy.Spider):  # classes should be TitleCase

    name = 'productJumper'
    allowed_domains = ['antaira.com']
    start_urls = ['https://www.antaira.com/products/10-100Mbps']

    def parse(self, response):
        # iterate through each of the relative urls
        for url in response.xpath('//div[@class="product-container"]//a/@href').getall():
            product_link = response.urljoin(url)  # use variable
            yield scrapy.Request(product_link, callback=self.parse_new_item)

    def parse_new_item(self, response):
        for product in response.css('main.products'):
            items = AntairaItem() # Unique item for each iteration
            items['product_link'] = response.url # get the product link from response
            name = product.css('h1.product-name::text').get().strip()
            features = product.css('section.features h3 + ul').getall()
            overview = product.css('.products .product-overview::text').getall()
            main_image = product.css('div.selectors img::attr(src)').get()
            rel_links = product.xpath("//script/@src[contains(., '/app/site/hosting/scriptlet.nl')]").getall()
            items['name'] = name       # no trailing commas, which would wrap each value in a 1-tuple
            items['features'] = features
            items['overview'] = overview
            items['main_image'] = main_image
            items['rel_links'] = rel_links
            yield items
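To collect every yielded item into a single JSON file, which is what the question title asks for, the crawl can be run with Scrapy's feed exports from the command line (a usage sketch; the output filename is arbitrary, and the capital -O flag requires Scrapy 2.1 or later):

    # Overwrite products.json on each run with all yielded items
    scrapy crawl productJumper -O products.json

    # Or append to a JSON-lines file instead (appending to .json would
    # produce invalid JSON across runs, so .jl is safer for -o)
    scrapy crawl productJumper -o products.jl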
