So I am trying to iterate through every page: all-products page › category › series › product page. I get an error in the log showing that I am not retrieving the expected id, but I think it has to do with how I iterate down to the pages; I suspect I am never travelling all the way to the product pages.
Start requests
def start_requests(self):
    urls = [
        'https://www.moxa.com/en/products',
    ]
    for url in urls:
        yield scrapy.Request(url, callback=self.parse)
Initial parse of the all-products page
def parse(self, response):
    # iterate through each of the relative urls
    for explore_products in response.css('li.alphabet-list--no-margin a.alphabet-list__link::attr(href)').getall():
        category_url = response.urljoin(explore_products)  # use variable
        logging.info("Category_links: " + category_url)
        yield scrapy.Request(category_url, callback=self.parse_categories)
Second parse, for the series
def parse_categories(self, response):
    for category_url in response.css('a.series-card__wrapper::attr(href)').getall():
        series_url = response.urljoin(category_url)
        logging.info("Series_links: " + series_url)
        yield scrapy.Request(series_url, callback=self.parse_series)
Third parse, reaching the product pages themselves (I think this is where it breaks). I would like it to check whether "target_id" appears inside series_url and only pass the matching results on to the "product_links" list. Example: with target_id TN-5916-WV-T and product_url https://www.moxa.com/Products/INDUSTRIAL-NETWORK-INFRASTRUCTURE/Secure-Routers/EN-50155-Routers/TN-5900-Series/TN-5916-WV-T, the link should pass as true and go into the product_links list. But with product_url https://www.moxa.com/en/products/quotation-list, it should not pass and should not be returned in the list.
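In plain Python, the intended filter is just a substring test over the collected hrefs. The sample URLs below are the two from the question; the list comprehension is a sketch of the desired behaviour, not the spider's actual code:

```python
# Sample data from the question: one real product URL and one link that
# should be filtered out.
target_id = 'TN-5916-WV-T'
target_list = [
    'https://www.moxa.com/Products/INDUSTRIAL-NETWORK-INFRASTRUCTURE/'
    'Secure-Routers/EN-50155-Routers/TN-5900-Series/TN-5916-WV-T',
    'https://www.moxa.com/en/products/quotation-list',
]

# Keep only the links that contain the target id.
product_links = [url for url in target_list if target_id in url]
# product_links now holds just the TN-5916-WV-T URL; quotation-list is dropped.
```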
def parse_series(self, response):
    for series_url in response.css('.model-table a::attr(href)').getall():
        target_list = response.xpath('//table[@class="model-table"]//a/@href').getall()
        target_id = response.css('table.model-table th::attr(data-id)').get()
        target_path = [p for p in target_list if target_id in p]
        product_url = response.urljoin(series_url)
        self.logger.info("target_id: " + target_id)
        self.logger.info("product_url: " + product_url)
        logging.info("Product_links: " + product_url)
        yield scrapy.Request(product_url, callback=self.parse_new_item)
Returning the expected item results
def parse_new_item(self, response):
    for product in response.css('section.main-section'):
        items = MoxaItem()  # Unique item for each iteration
        items['product_link'] = response.url  # get the product link from response
        name_dirty = product.css('h5.series-card__heading.series-card__heading--big::text').get()
        product_sku = name_dirty.strip()
        product_store_description = product.css('p.series-card__intro').get()
        product_sub_title = product_sku + ' ' + product_store_description
        summary = product.css('section.features h3 + ul').getall()
        datasheet = product.css('li.side-section__item a::attr(href)')
        description = product.css('.products .product-overview::text').getall()
        specification = product.css('div.series-card__table').getall()
        products_zoom_image = name_dirty.strip() + '.jpg'
        main_image = response.urljoin(product.css('div.selectors img::attr(src)').get())
        # weight = product.xpath('//div[@class="series-card__table"]//p[@class="title-list__heading"]/text()[contains(., "Weight")]following-sibiling::div//text()').get()
        response.xpath("//div[@class='grdcpnsmllnks']//li[i[contains(@class, 'fa-clock-o')]]/text()").re_first(r"Valid till\s+(\d+/\d+/\d+)")
        rel_links = product.xpath("//script/@src[contains(., '/app/site/hosting/scriptlet.nl')]").getall()
        items['product_sku'] = product_sku,
        items['product_sub_title'] = product_sub_title,
        items['summary'] = summary,
        items['description'] = description,
        items['specification'] = specification,
        items['products_zoom_image'] = products_zoom_image
        items['main_image'] = main_image,
        # items['weight'] = weight,
        # items['rel_links'] = rel_links,
        items['datasheet'] = datasheet,
        yield items
Log showing the error
File "/home/joel/Desktop/moxa/moxa/spiders/product_series.py", line 57, in parse_new_item
logging.info("name_dirty: " + name_dirty)
TypeError: can only concatenate str (not "NoneType") to str
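The traceback is easy to reproduce outside Scrapy: a selector's `.get()` returns None when nothing matches, and concatenating None to a str raises exactly this TypeError. A minimal sketch of the failure and one defensive fix (the fallback string here is my own choice, not from the question):

```python
# .get() returns None when the CSS selector matches nothing on the page.
name_dirty = None

# This is the failing pattern from the traceback:
try:
    logging_msg = "name_dirty: " + name_dirty
except TypeError:
    # Guard against the missing value instead of crashing.
    logging_msg = "name_dirty: " + (name_dirty or "<no match>")
```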
1 Answer
ecbunoof:
Try something like this... I think it is a better way to make sure you grab the correct links. Doing it this way also eliminates a lot of the duplicates that were being generated before.
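The answer's code is not included in this excerpt. As a rough sketch only (the helper below and its matching rule are my assumptions, not the answerer's actual approach), de-duplicating while keeping only links whose final path segment is a known model id could look like:

```python
def select_product_links(hrefs, model_ids):
    """Keep hrefs whose last path segment is a known model id,
    dropping duplicates while preserving order."""
    seen = set()
    kept = []
    for href in hrefs:
        # Compare the final path segment against the known model ids.
        model = href.rstrip('/').rsplit('/', 1)[-1]
        if model in model_ids and href not in seen:
            seen.add(href)
            kept.append(href)
    return kept
```

Here model_ids would come from the series page's model-table `data-id` attributes, so a link like /en/products/quotation-list fails the membership test and is skipped, and repeated links are yielded only once.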