Best way to scrape a dynamic website (built with React) using Python Scrapy

Posted by 5cnsuln7 on 2022-11-09 in Python

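I have been trying to scrape this website (Link) using scrapy and scrapy-splash. As far as I can tell, the site is built with React. response.xpath always returns an empty list, whatever class name I query. Please suggest a way to scrape this React website. I set up Splash by following this link, and I can scrape some other websites in the same project, but not this React-based one. The spider code is shown below: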

import scrapy
from scrapy_splash import SplashRequest

class NykaaFashionbrandsSpider(scrapy.Spider):
    name = 'nykaa_fashionbrands'

    start_urls = ["https://www.nykaafashion.com/"]
    custom_settings = {
        'FEED_FORMAT': 'csv',
        'FEED_URI': 'fashion_brands.csv'
    }

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(url, self.parse,
                                endpoint='render.html',
                                args={'wait': 3},
                                )

    def parse(self, response):
        print(response.xpath('//*[@class="br-inner"]/ul/li/text()').extract())
        # I am trying to get the list items
Answer 1 (czq61nw1)

If you need to scrape all products, or the products in a specific category, you can call the site's API directly with a URL like this:

https://www.nykaafashion.com/rest/appapi/V2/categories/products?categoryId=6151&PageSize=12&sort=popularity&currentPage=2&filter_format=v2

It returns a response like this (excerpt):

"products": [
            {
                "sku": "CTWK0648",
                "imageUrl": "https://adn-static1.nykaa.com/nykdesignstudio-images/pub/media/catalog/product/3/8/3884c_1.jpg?rnd=20200526195200",
                "isOutOfStock": 0,
                "subTitle": "Black Embellished Sandals",
                "title": "Catwalk",
                "price": 1995,
                "tag": {},
                "offerCount": 0,
                "categoryId": [
                    "102",
                    "3528",
                    "3522",
                    "2",
                    "6151",
                    "6557"
                ],
                "discount": 55,
                "offers": null,
                "discountedPrice": 899,
                "actionUrl": "/catwalk-black-embellished-sandals-3/p/537684",
                "aspectRatio": 0.75,
                "sizeVariation": [
                    {
                        "id": "537678",
                        "title": "4"
                    },
                    {
                        "id": "537679",
                        "title": "5"
                    },
                    {
                        "id": "537680",
                        "title": "6"
                    },
                    {
                        "id": "537681",
                        "title": "7"
                    }
                ],
                "type": "configurable",
                "id": "537684"
            },

You don't need Splash for this website.
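As a rough illustration of the API approach, here is a minimal sketch using only the standard library. The endpoint and query parameters are copied from the URL above. The helper names (`build_page_url`, `find_products`, `to_item`) are hypothetical. The answer shows only a fragment of the response, so the position of "products" in the full payload is an assumption, and the sketch searches for it recursively instead of hard-coding a path.

```python
import json
from urllib.parse import urlencode

# Endpoint and parameters copied from the URL in the answer.
API_URL = "https://www.nykaafashion.com/rest/appapi/V2/categories/products"


def build_page_url(category_id, page, page_size=12):
    """Build the listing URL for one page of a category."""
    params = {
        "categoryId": category_id,
        "PageSize": page_size,
        "sort": "popularity",
        "currentPage": page,
        "filter_format": "v2",
    }
    return f"{API_URL}?{urlencode(params)}"


def find_products(node):
    """Return the first non-empty "products" list found in decoded JSON.

    Assumption: the full payload layout is unknown, so every nested
    dict and list is searched instead of relying on a fixed path.
    """
    if isinstance(node, dict):
        if isinstance(node.get("products"), list):
            return node["products"]
        children = node.values()
    elif isinstance(node, list):
        children = node
    else:
        return []
    for child in children:
        found = find_products(child)
        if found:
            return found
    return []


def to_item(product):
    """Keep a few of the fields that appear in the sample record."""
    return {
        "sku": product.get("sku"),
        "title": product.get("title"),
        "subtitle": product.get("subTitle"),
        "price": product.get("price"),
        "discounted_price": product.get("discountedPrice"),
        "url": "https://www.nykaafashion.com" + product.get("actionUrl", ""),
        # "sizeVariation" may be missing or null, so fall back to an empty list.
        "sizes": [s.get("title") for s in product.get("sizeVariation") or []],
    }


# Trimmed copy of the record shown in the answer. The bare outer object
# is an assumption made only so the sample is valid JSON.
SAMPLE = """
{"products": [{
    "sku": "CTWK0648",
    "subTitle": "Black Embellished Sandals",
    "title": "Catwalk",
    "price": 1995,
    "offers": null,
    "discountedPrice": 899,
    "actionUrl": "/catwalk-black-embellished-sandals-3/p/537684",
    "sizeVariation": [{"id": "537678", "title": "4"},
                      {"id": "537679", "title": "5"}]
}]}
"""

items = [to_item(p) for p in find_products(json.loads(SAMPLE))]
for item in items:
    print(item)
```

In a Scrapy spider, the same helpers could be wired in by yielding `scrapy.Request(build_page_url(6151, page))` and calling `find_products(response.json())` in the callback (`response.json()` is available on text responses from Scrapy 2.2 onward). Stopping once a page returns no products is a plausible way to end pagination, but the API's behavior past the last page is not shown in the answer, so verify it first.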

Answer 2 (whhtz7ly)

I'd definitely recommend giving cloudscraper a try. I recently tested it by scraping OpenSea and it worked very well.
Install it by running:

pip install cloudscraper

To scrape the data, do the following:

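import cloudscraper

# create_scraper() returns a requests-style session
scraper = cloudscraper.create_scraper(browser="chrome")
url = "https://www.nykaafashion.com/"
response = scraper.get(url)  # fetch the page once and reuse the response
scraped_status = response.status_code  # HTTP status code
scraped = response.text  # response body as text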
