I am trying to scrape the Best Sellers top 100 products for a particular category on Amazon. For example:
https://www.amazon.com/Best-Sellers-Home-Kitchen/zgbs/home-garden/ref=zg_bs_nav_0
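The 100 products are split across two pages, with 50 products on each page.
Previously the page was static and all 50 products were present in the page. Now the page is dynamic, and I need to scroll down before all 50 products appear on the page.
I was using Scrapy to scrape the page earlier. I would really appreciate it if you could help me with this. Thanks!
My code is below: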
    import scrapy
    from scrapy_splash import SplashRequest


    class BsrNewSpider(scrapy.Spider):
        name = 'bsr_new'
        allowed_domains = ['www.amazon.in']
        #start_urls = ['https://www.amazon.in/gp/bestsellers/kitchen/ref=zg_bs_nav_0']

        # Lua script run by Splash: load the page, wait briefly, return the rendered HTML
        script = '''
            function main(splash, args)
                splash.private_mode_enabled = false
                url = args.url
                assert(splash:go(url))
                assert(splash:wait(0.5))
                return splash:html()
            end
        '''

        def start_requests(self):
            url = 'https://www.amazon.in/gp/bestsellers/kitchen/ref=zg_bs_nav_0'
            yield SplashRequest(url, callback=self.parse, endpoint="execute", args={
                'lua_source': self.script
            })

        def parse(self, response):
            for rev in response.xpath("//div[@id='gridItemRoot']"):
                yield {
                    'Segment': "Home",  # Enter name of the segment here
                    #'Sub-segment': segment,
                    'ASIN': rev.xpath(".//div/div[@class='zg-grid-general-faceout']/div/a[@class='a-link-normal']/@href").re(r'\S*/dp/(\S+)_\S+')[0][:10],
                    'Rank': rev.xpath(".//span[@class='zg-bdg-text']/text()").get(),
                    'Name': rev.xpath("normalize-space(.//a[@class='a-link-normal']/span/div/text())").get(),
                    'No. of Ratings': rev.xpath(".//span[contains(@class,'a-size-small')]/text()").get(),
                    'Rating': rev.xpath(".//span[@class='a-icon-alt']/text()").get(),
                    'Price': rev.xpath(".//span[@class='a-size-base a-color-price']//text()").get()
                }

            # Follow the link to the second page of 50 products, if present
            next_page = response.xpath("//a[text()='Next page']/@href").get()
            if next_page:
                url = response.urljoin(next_page)
                yield SplashRequest(url, callback=self.parse, endpoint="execute", args={
                    'lua_source': self.script
                })
Regards,
Sreejan
1 Answer
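Here is an alternative approach that does not need Splash.
The ASINs of all 50 products are hidden in the HTML of the first page itself. You can extract these ASINs and build the URLs for all 50 products.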
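The code that originally accompanied this answer did not survive on this page, so what follows is only a minimal sketch of the idea, not the answerer's actual code. It assumes the ASINs sit in a JSON array stored in a `data-client-recs-list` attribute on the product grid element, with each entry carrying the ASIN under an `"id"` key. That attribute name, the key, and the `/dp/<ASIN>` URL pattern are assumptions about Amazon's markup and should be checked against the page source you actually receive. The helper names `extract_asins` and `build_product_urls` are hypothetical.

```python
import json
from html.parser import HTMLParser


class RecsListParser(HTMLParser):
    """Collects the raw value of every data-client-recs-list attribute.

    The attribute name is an assumption about the page markup.
    """

    def __init__(self):
        super().__init__()
        self.payloads = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "data-client-recs-list" and value:
                self.payloads.append(value)


def extract_asins(html):
    """Return the ASINs found in the embedded JSON, in page order, without duplicates."""
    parser = RecsListParser()
    parser.feed(html)
    asins = []
    for payload in parser.payloads:
        for item in json.loads(payload):
            asin = item.get("id")  # assumed key holding the ASIN
            if asin and asin not in asins:
                asins.append(asin)
    return asins


def build_product_urls(asins, base="https://www.amazon.in/dp/"):
    """Build one product URL per ASIN (the /dp/ pattern is an assumption)."""
    return [base + asin for asin in asins]
```

In a Scrapy spider you could call `extract_asins(response.text)` inside `parse` on a plain `scrapy.Request` response and then yield one request per URL from `build_product_urls`, which avoids rendering the page with Splash. Rank, name, rating, and price would then be read from each product page rather than from the grid.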