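I am trying to scrape the Amazon website with Scrapy. I can scrape items such as the product title and price, but I don't know how to extract the product URL (marked in the image at the bottom). My parse function currently looks like this: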
def parse(self, response):
    items = BigItem()
    all_boxes = response.css('.s-widget-spacing-small > .sg-col-inner')
    for boxes in all_boxes:
        name = boxes.css('.s-link-style .a-text-normal').css('::text').extract()
        author = boxes.css('.a-color-secondary .a-size-base:nth-child(2)').css('::text').extract()
        price = boxes.css('.s-price-instructions-style .a-price-whole').css('::text').extract()
        imagelink = boxes.css('.s-image::attr(src)').extract()
        rating = boxes.css('.a-spacing-top-small .aok-align-bottom').css('::text').extract()
        valuation = boxes.css('.a-spacing-top-small .s-link-style .s-underline-text').css('::text').extract()
        link = boxes.css('a-link-normal s-underline-text s-underline-link-text s-link-style a-text-normal::attr(href)').extract()
        items['name'] = name
        items['author'] = author
        items['price'] = price
        items['imagelink'] = imagelink
        items['rating'] = rating
        items['valuation'] = valuation
        items['link'] = link
        yield items
I also tried extracting it as ::text, and chaining .css(::text) and .css(::href) on the outside, but it doesn't work.
1 Answer
Use the .extract_first() or .get() method.
Update (full working code):
Output:
...and so on
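As a minimal illustration of the .get() advice (a sketch only, not the answer's original code): the link selector in the question lists class names without leading dots, so CSS treats them as element names and matches nothing. The helper names classes_to_css and extract_link below are hypothetical, the class list is copied from the question and may no longer match Amazon's markup, and the sketch assumes the href values are relative and need joining with the page URL.

```python
from urllib.parse import urljoin


def classes_to_css(class_attr: str, tag: str = "a") -> str:
    """Turn an HTML class attribute value into a compound CSS selector.

    Each class name gets a leading dot and is attached to the tag name.
    """
    return tag + "".join("." + name for name in class_attr.split())


# Class list copied from the question's selector (assumption: it still
# matches the product link; Amazon may change its markup at any time).
LINK_CLASSES = (
    "a-link-normal s-underline-text s-underline-link-text "
    "s-link-style a-text-normal"
)
LINK_SELECTOR = classes_to_css(LINK_CLASSES) + "::attr(href)"


def extract_link(box, page_url: str):
    """Return the absolute product URL for one result box, or None.

    `box` is expected to be a Scrapy/parsel Selector; `page_url` would
    typically be `response.url`.
    """
    href = box.css(LINK_SELECTOR).get()  # first match, or None if no match
    return urljoin(page_url, href) if href else None
```

Inside the loop from the question this could be called as items['link'] = extract_link(boxes, response.url). Unlike .extract(), which returns a list, .get() returns a single string or None.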
P/S: You must inject a user agent.
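One way to do that, sketched under assumptions: Scrapy reads the USER_AGENT setting from the project's settings.py, and a spider can override it through its custom_settings class attribute. The user-agent string below is only a placeholder example of a browser-like value, and the names BROWSER_USER_AGENT and CUSTOM_SETTINGS are illustrative.

```python
# Placeholder browser-like User-Agent string (an example value only;
# substitute a current one).
BROWSER_USER_AGENT = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
)

# Option 1: project-wide, as a line in settings.py.
USER_AGENT = BROWSER_USER_AGENT

# Option 2: per spider, by assigning this dict to the spider class
# attribute named `custom_settings`.
CUSTOM_SETTINGS = {"USER_AGENT": BROWSER_USER_AGENT}
```

Setting a user agent does not guarantee that requests will be accepted; it only replaces Scrapy's default identifier.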