Dynamically scraping a website with multiple categories and pages using Scrapy

Asked by gt0wga4j on 2023-04-06

I am trying to scrape data from an online shop. I can currently scrape a single page of a single category using the site's data API. However, I would like to scrape all categories and products, and record the category name as a new column value so I can use it as a category identifier. I am not sure whether I should use Selenium or Splash, because the pages have no hrefs and the product list updates dynamically, showing 50/100/150/200 products depending on the filter option you choose.
The website is: https://eshop.nomin.mn/
The category list was shown in a screenshot (not reproduced here).

An example category (screenshot not reproduced here).

As you can see, the site's categories have different URLs and also different data API requests. The next-page button has no href, and the products are refreshed/updated dynamically. The HTML of the next-page button is:

<a class="pagination-nextLinkNo-HSU" tabindex="0" role="button" aria-disabled="false" aria-label="Next page" rel="next"><img src="/rightArrow-jkZ.png" alt="1"></a>

Basically, I want to scrape all product names, prices, and product descriptions from all categories of the website (with the category inserted as a category ID value). I have tried to follow other posts without success. Any and all help is greatly appreciated. Thank you very much.
My current code is:

import scrapy
from scrapy import Request
from scrapy.crawler import CrawlerProcess
from datetime import datetime

dt_today = datetime.now().strftime('%Y%m%d')
filename = dt_today + ' Nomin CPI Foods Data'

class NominCPIFoodsSpider(scrapy.Spider):
    name = 'nomin_cpi_foods'
    allowed_domains = ['eshop.nomin.mn']  # allowed_domains takes bare domains, not URLs
    custom_settings = {
        "FEEDS": {
            f'{filename}.csv': {
                'format': 'csv',
                'overwrite': True}}
    }

    # function used for start url
    def start_requests(self):
        urls = ['https://eshop.nomin.mn/graphql?query=query+category($pageSize:Int!$currentPage:Int!$filters:ProductAttributeFilterInput!$sort:ProductAttributeSortInput){products(pageSize:$pageSize+currentPage:$currentPage+filter:$filters+sort:$sort){items{id+name+sku+brand+salable_qty+brand_name+c21_available+c21_business_type+c21_reference+c21_street+c21_area+c21_bed_room+mp_daily_deal{created_at+date_from+date_to+deal_id+deal_price+remaining_time+deal_qty+discount_label+is_featured+product_id+product_name+product_sku+sale_qty+status+store_ids+updated_at+__typename}new_to_date+short_description{html+__typename}productAttributes{name+value+__typename}price{regularPrice{amount{currency+value+__typename}__typename}__typename}special_price+special_to_date+thumbnail{file_small+url+__typename}url_key+url_suffix+mp_label_data{enabled+name+priority+label_template+label_image+to_date+__typename}...on+ConfigurableProduct{variants{product{sku+special_price+price{regularPrice{amount{currency+value+__typename}__typename}__typename}__typename}__typename}__typename}__typename}page_info{total_pages+__typename}total_count+__typename}}&operationName=category&variables={"currentPage":1,"id":24175,"filters":{"category_id":{"in":"24175"}},"pageSize":50,"sort":{"position":"DESC"}}']
        for url in urls:
            yield Request(url, self.parse)

    # function to parse
    def parse(self, response, **kwargs):
        data = response.json()
        print(data.keys())
        for item in data['data']["products"]["items"]:
            yield {
                "name": item["name"],
                "price": item["price"]["regularPrice"]["amount"]["value"],
                "description": item["short_description"]["html"]
            }

        # attempts pagination, but this selector never matches: the response is JSON, not HTML (this is the problem described above)
        next_url = response.css("nav.custom-pagination > a.next::attr(href)").get()
        if next_url:
            yield scrapy.Request(next_url, self.parse)

# main driver
if __name__ == "__main__":
    process = CrawlerProcess()
    process.crawl(NominCPIFoodsSpider)
    process.start()

Answer 1 (by g6baxovj):

All you have to do is deconstruct the API URL and figure out how to reverse-engineer the API endpoint.
For example, if you visit page 2 of the same site, you will notice that it sends a different request to fetch the data for the items listed on the second page. You can then compare the two URLs and work out how to reconstruct them for the remaining pages.
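
For instance, decoding the variables query parameter makes the knobs explicit. A minimal sketch (the long query= portion of the URL is dropped here for brevity, since it never changes between pages):

import json
from urllib.parse import urlparse, parse_qs

# page-1 URL from the question, without the long "query=" part
url = ('https://eshop.nomin.mn/graphql?operationName=category'
       '&variables={"currentPage":1,"id":24175,'
       '"filters":{"category_id":{"in":"24175"}},'
       '"pageSize":50,"sort":{"position":"DESC"}}')

# the "variables" value is plain JSON, so it can be decoded directly
variables = json.loads(parse_qs(urlparse(url).query)["variables"][0])
print(variables["currentPage"])                    # 1
print(variables["filters"]["category_id"]["in"])   # 24175
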
So for this particular API, it looks like all of the variables are contained at the end of the URL, specifically this part:

{
    "currentPage": 1,   # adding 1 to this variable gets you the next page
    "id": 24175,        # changing this value changes which category of items you get
    "filters": {
        "category_id": {
            "in": "24175"   # this needs to change for other categories too
        }
    },
    "pageSize": 50,     # adjusts the number of results per page
    "sort": {
        "position": "DESC"
    }
}

So all you need to do is increment the currentPage field of that dictionary (and swap in the appropriate id / category_id for each category), and use the resulting URL in your scrapy requests.
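
If you prefer not to splice the JSON by hand, you can also build that trailing dictionary with json.dumps. A sketch, assuming the BASE_URL constant defined in the spider below (build_page_url is an illustrative helper, not part of the site's API):

import json

def build_page_url(base_url, category_id, page, page_size=50):
    # assemble the variables dict, then serialize it compactly
    # (no spaces), matching the format the site itself sends
    variables = {
        "currentPage": page,
        "id": category_id,
        "filters": {"category_id": {"in": str(category_id)}},
        "pageSize": page_size,
        "sort": {"position": "DESC"},
    }
    return base_url + json.dumps(variables, separators=(",", ":"))

build_page_url(BASE_URL, 24175, 2) would then yield the page-2 URL for the foods category.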

import scrapy
from scrapy import Request
from scrapy.crawler import CrawlerProcess
from datetime import datetime

BASE_URL = "https://eshop.nomin.mn/graphql?query=query+category($pageSize:Int!$currentPage:Int!$filters:ProductAttributeFilterInput!$sort:ProductAttributeSortInput){products(pageSize:$pageSize+currentPage:$currentPage+filter:$filters+sort:$sort){items{id+name+sku+brand+salable_qty+brand_name+c21_available+c21_business_type+c21_reference+c21_street+c21_area+c21_bed_room+mp_daily_deal{created_at+date_from+date_to+deal_id+deal_price+remaining_time+deal_qty+discount_label+is_featured+product_id+product_name+product_sku+sale_qty+status+store_ids+updated_at+__typename}new_to_date+short_description{html+__typename}productAttributes{name+value+__typename}price{regularPrice{amount{currency+value+__typename}__typename}__typename}special_price+special_to_date+thumbnail{file_small+url+__typename}url_key+url_suffix+mp_label_data{enabled+name+priority+label_template+label_image+to_date+__typename}...on+ConfigurableProduct{variants{product{sku+special_price+price{regularPrice{amount{currency+value+__typename}__typename}__typename}__typename}__typename}__typename}__typename}page_info{total_pages+__typename}total_count+__typename}}&operationName=category&variables="

dt_today = datetime.now().strftime('%Y%m%d')
filename = dt_today + ' Nomin CPI Foods Data'

class NominCPIFoodsSpider(scrapy.Spider):
    name = 'nomin_cpi_foods'
    allowed_domains = ['eshop.nomin.mn']  # allowed_domains takes bare domains, not URLs
    custom_settings = {
        "FEEDS": {
            f'{filename}.csv': {
                'format': 'csv',
                'overwrite': True}}
    }

    # generate one request per page (currentPage starts at 1, not 0)
    def start_requests(self):
        for i in range(1, 51):
            url = BASE_URL + '{"currentPage":' + str(i) + ',"id":24175,"filters":{"category_id":{"in":"24175"}},"pageSize":50,"sort":{"position":"DESC"}}'
            yield Request(url, self.parse)

    # function to parse
    def parse(self, response, **kwargs):
        data = response.json()
        for item in data['data']["products"]["items"]:
            yield {
                "name": item["name"],
                "price": item["price"]["regularPrice"]["amount"]["value"],
                "description": item["short_description"]["html"]
            }

        # no HTML pagination to follow here: the response is JSON, and
        # pagination is already handled by the currentPage loop above

if __name__ == "__main__":
    process = CrawlerProcess()
    process.crawl(NominCPIFoodsSpider)
    process.start()
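
The question also asked for the category name as an identifier column. One way to get that (a sketch, not part of the original answer; CATEGORY_IDS is a hypothetical mapping you would fill in from the site's category pages) is to loop over categories and pass each name to parse() through Request.cb_kwargs, replacing the two methods in the spider above:

    CATEGORY_IDS = {"foods": 24175}  # hypothetical name -> id mapping

    def start_requests(self):
        for cat_name, cat_id in CATEGORY_IDS.items():
            for page in range(1, 51):
                variables = ('{"currentPage":%d,"id":%d,"filters":{"category_id"'
                             ':{"in":"%d"}},"pageSize":50,"sort":{"position":"DESC"}}'
                             % (page, cat_id, cat_id))
                yield Request(BASE_URL + variables, self.parse,
                              cb_kwargs={"category": cat_name})

    def parse(self, response, category=None, **kwargs):
        for item in response.json()["data"]["products"]["items"]:
            yield {
                "category": category,  # the extra identifier column
                "name": item["name"],
                "price": item["price"]["regularPrice"]["amount"]["value"],
                "description": item["short_description"]["html"],
            }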
