I recently started learning Scrapy and decided to scrape this site.
The page shows 24 products at first and loads more as you scroll down.
There should be around 334 products on this page in total.
I used Scrapy to try to scrape the products and their details, but I can't get it to scrape more than 24 products.
I think I need Selenium or Splash to render the page and scroll to the bottom, and then I would be able to scrape everything.
Here is the code that scrapes the 24 products:
import scrapy


class BookSpider(scrapy.Spider):
    name = 'basics2'

    custom_settings = {
        'USER_AGENT': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36 OPR/92.0.0.0'
    }

    # Defined but not used below.
    api_url = 'https://www.zara.com/ru/ru/zhenshchiny-novinki-l1180.html?v1=2111785&page'
    start_urls = ['https://www.zara.com/ru/ru/zhenshchiny-novinki-l1180.html?v1=2111785&page=1']

    # parse() follows the href of every product on the listing page.
    def parse(self, response):
        product_link_xpaths = [
            "//div[@class='product-grid-product-info__main-info']//a",
            "//ul[@class='carousel__items']//li[@class='product-grid-product _product product-grid-product--ZOOM1-columns product-grid-product--0th-column']//a",
            "//ul[@class='carousel__items']//li[@class='product-grid-product _product product-grid-product--ZOOM1-columns product-grid-product--1th-column']//a",
            "//ul[@class='carousel__items']//li[@class='product-grid-product _product product-grid-product--ZOOM1-columns product-grid-product--th-column']//a",
            "//ul[@class='carousel__items']//li[@class='product-grid-product _product carousel__item product-grid-product--ZOOM1-columns product-grid-product--0th-column']//a",
            "//ul[@class='product-grid-product-info__main-info']//a",
        ]
        for xpath in product_link_xpaths:
            for link in response.xpath(xpath):
                yield response.follow(link, callback=self.parse_book)

    # parse_book() extracts the details from each product page.
    def parse_book(self, response):
        yield {
            'title': response.xpath("//div[@class='product-detail-info__header']/h1/text()").get(),
            'normal_price': response.xpath("//div[@class='money-amount price-formatted__price-amount']//span//text()").get(),
            'discounted_price': response.xpath("(//span[@class='price__amount price__amount--on-sale price-current--with-background']//div[@class='money-amount price-formatted__price-amount']//span)[1]").get(),
            'Reference': response.xpath("//div[@class='product-detail-color-selector product-detail-info__color-selector']//p[@class='product-detail-selected-color product-detail-color-selector__selected-color-name']//text()").get(),
            'Description': response.xpath("//div[@class='expandable-text__inner-content']//p//text()").get(),
            'Image': response.xpath("//picture[@class='media-image']//source//@srcset").extract(),
            'item_url': response.url,
            # 'User-Agent': response.request.headers['User-Agent'],
        }
1 Answer
stszievb:
There is no need for Selenium, which is slow and complicated; you can scrape all the data you need directly from the site's API.
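A minimal sketch of that approach in Scrapy is below. The endpoint pattern https://www.zara.com/ru/ru/category/<id>/products?ajax=true (using the id from the page's v1 query parameter, 2111785 here) and the productGroups -> elements -> commercialComponents nesting are assumptions about what the site's JSON usually looks like, so verify both in the browser's network tab before relying on them.

import scrapy


class ZaraApiSpider(scrapy.Spider):
    name = 'zara_api'

    custom_settings = {
        'USER_AGENT': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36 OPR/92.0.0.0'
    }

    # Assumption: the category id is the v1 value from the HTML page URL and the
    # JSON endpoint follows this pattern; confirm it in the network tab first.
    start_urls = ['https://www.zara.com/ru/ru/category/2111785/products?ajax=true']

    def parse(self, response):
        data = response.json()
        # Assumption: products are nested as productGroups -> elements ->
        # commercialComponents; adjust these keys to match the real payload.
        for group in data.get('productGroups', []):
            for element in group.get('elements', []):
                for product in element.get('commercialComponents', []):
                    yield {
                        'title': product.get('name'),
                        'price': product.get('price'),  # typically in minor currency units
                        'reference': product.get('detail', {}).get('reference'),
                        'item_id': product.get('id'),
                    }

If the whole category comes back in one JSON response, there is no scrolling or pagination to emulate, which is why a headless browser is unnecessary here.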
**Update:** see the updated answer for how to extract the image URLs from this site's API response data as well.
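For the image URLs, a product entry in that kind of payload usually carries a list of media descriptors. The helper below is a sketch under the assumption that each product has an xmedia list whose items expose path and name keys and that the files are served from static.zara.net; check all three against the actual response.

def extract_image_urls(product):
    # Assumption: 'xmedia' entries carry 'path' and 'name'; the URL template
    # below is a guess at the CDN layout and must be checked against real data.
    urls = []
    for media in product.get('xmedia', []):
        path = media.get('path')
        name = media.get('name')
        if path and name:
            urls.append('https://static.zara.net/photos/%s/%s.jpg' % (path, name))
    return urls

In the spider sketch above this would just be one more field in the yielded item, for example 'images': extract_image_urls(product).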