Scrapy - trying to iterate through to the product pages

Asked by c6ubokkw on 2022-11-09

So I'm trying to iterate from the all-products page › category › series › product page. I get an error in the log showing that I'm not retrieving the expected id, but I think it has to do with how I iterate through the pages; I suspect I'm not making it all the way to the product pages.
Start requests

def start_requests(self):
        urls = [
            'https://www.moxa.com/en/products',
        ]
        for url in urls:
            yield scrapy.Request(url, callback=self.parse)

Initial parse into the category pages

def parse(self, response):
        # iterate through each of the relative urls
        for explore_products in response.css('li.alphabet-list--no-margin a.alphabet-list__link::attr(href)').getall():
            category_url = response.urljoin(explore_products)  # use variable
            logging.info("Category_links: " + category_url)
            yield scrapy.Request(category_url, callback=self.parse_categories)

Second parse for the series

def parse_categories(self, response):
        for category_url in response.css('a.series-card__wrapper::attr(href)').getall():
            series_url = response.urljoin(category_url)
            logging.info("Series_links: " + series_url)
            yield scrapy.Request(series_url, callback=self.parse_series)

The third parse reaches the product pages themselves (I think this is where it breaks). I would like it to check whether the `target_id` is inside the `series_url` and only pass the matching results into the `product_url` list. Example - with target_id: TN-5916-WV-T and product URL: https://www.moxa.com/Products/INDUSTRIAL-NETWORK-INFRASTRUCTURE/Secure-Routers/EN-50155-Routers/TN-5900-Series/TN-5916-WV-T, it should pass as true and go into the product_links list. But with product_url: https://www.moxa.com/en/products/quotation-list, it should not pass and should not be returned into the list.
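The intended filter can be sketched in isolation, using the two sample URLs from the description above (a minimal sketch; `filter_product_links` is a hypothetical helper name, and `target_id` is assumed to come from the table's `data-id` attribute as in the code below):

```python
def filter_product_links(urls, target_id):
    """Keep only the URLs whose path contains the target model id."""
    return [u for u in urls if target_id in u]

candidates = [
    "https://www.moxa.com/Products/INDUSTRIAL-NETWORK-INFRASTRUCTURE/"
    "Secure-Routers/EN-50155-Routers/TN-5900-Series/TN-5916-WV-T",
    "https://www.moxa.com/en/products/quotation-list",
]
product_links = filter_product_links(candidates, "TN-5916-WV-T")
print(product_links)  # only the TN-5916-WV-T URL survives
```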

def parse_series(self, response):
        for series_url in response.css('.model-table a::attr(href)').getall():
            target_list = response.xpath('//table[@class="model-table"]//a/@href').getall()
            target_id = response.css('table.model-table th::attr(data-id)').get()            
            target_path = [p for p in target_list if target_id in p]
            product_url = response.urljoin(series_url)
            self.logger.info("target_id: " + target_id)
            self.logger.info("product_url: " + product_url)
            logging.info("Product_links: " + product_url)
            yield scrapy.Request(product_url, callback=self.parse_new_item)

Returning the expected item results

def parse_new_item(self, response):
        for product in response.css('section.main-section'):
            items = MoxaItem() # Unique item for each iteration
            items['product_link'] = response.url # get the product link from response
            name_dirty = product.css('h5.series-card__heading.series-card__heading--big::text').get()
            product_sku = name_dirty.strip()
            product_store_description = product.css('p.series-card__intro').get() 
            product_sub_title = product_sku + ' ' + product_store_description
            summary = product.css(('section.features h3 + ul')).getall()
            datasheet = product.css(('li.side-section__item a::attr(href)'))
            description =   product.css('.products .product-overview::text').getall()
            specification = product.css('div.series-card__table').getall()
            products_zoom_image = name_dirty.strip() + '.jpg'
            main_image = response.urljoin(product.css('div.selectors img::attr(src)').get())
            # weight = product.xpath('//div[@class="series-card__table"]//p[@class="title-list__heading"]/text()[contains(., "Weight")]following-sibiling::div//text()').get()
            response.xpath("//div[@class='grdcpnsmllnks']//li[i[contains(@class, 'fa-clock-o')]]/text()").re_first(r"Valid till\s+(\d+/\d+/\d+)")
            rel_links = product.xpath("//script/@src[contains(., '/app/site/hosting/scriptlet.nl')]").getall()

            items['product_sku'] = product_sku,
            items['product_sub_title'] = product_sub_title,
            items['summary'] = summary,
            items['description'] = description,
            items['specification'] = specification,
            items['products_zoom_image'] = products_zoom_image
            items['main_image'] = main_image,
            # items['weight'] = weight,
            #items['rel_links'] = rel_links,
            items['datasheet'] = datasheet,
            yield items

Log showing the error

File "/home/joel/Desktop/moxa/moxa/spiders/product_series.py", line 57, in parse_new_item
    logging.info("name_dirty: " + name_dirty)
TypeError: can only concatenate str (not "NoneType") to str
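The traceback means `name_dirty` was `None`: `.get()` returns `None` when a selector matches nothing (for instance on a non-product page like the quotation list), and `"str" + None` raises `TypeError`. A minimal sketch of two guards that avoid the crash (illustrative only, outside of any spider):

```python
import logging

logging.basicConfig(level=logging.INFO)

# What response.css(...).get() yields when the selector matches nothing:
name_dirty = None

# %s-style logging renders None as the text "None" instead of raising
logging.info("name_dirty: %s", name_dirty)

# Guard before calling .strip(); in a real callback you could `return` here
product_sku = name_dirty.strip() if name_dirty is not None else ""
```

In the spider itself, returning early from `parse_new_item` when the heading selector comes back empty keeps non-product pages from producing items at all.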
ecbunoof1#

Try it like this... I think it's a better way to make sure you crawl the correct links. Doing it this way should also eliminate many of the duplicates that were being generated before.

def parse_series(self, response):
    for column in response.xpath("//table[@class='model-table']//th"):  # iterate columns
        data_id = column.xpath("./@data-id")  # grab the column's data-id
        if not data_id:     # skip the first column: it holds the row
            continue        # labels and has no data-id attribute
        data_id = data_id.get()    # get the data-id text
        for link in column.xpath(".//a/@href").getall():  # links inside this column
            if data_id not in link:     # keep only links whose URL
                continue                # contains the data-id
            url = response.urljoin(link)    # merge with the domain
            yield scrapy.Request(url, callback=self.parse_new_item)   # yield request
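As a side note, Scrapy's default dupefilter also drops repeated requests to the same URL within a crawl; conceptually it behaves like this order-preserving de-duplication (a simplified pure-Python sketch with made-up paths, not Scrapy's actual fingerprint code):

```python
def dedupe(urls):
    """Order-preserving de-duplication, roughly what Scrapy's default
    dupefilter does with request fingerprints."""
    seen = set()
    unique = []
    for url in urls:
        if url not in seen:
            seen.add(url)
            unique.append(url)
    return unique

links = [
    "/products/TN-5916-WV-T",
    "/products/quotation-list",
    "/products/TN-5916-WV-T",  # duplicate surfaced by another column
]
print(dedupe(links))  # → ['/products/TN-5916-WV-T', '/products/quotation-list']
```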
