Scrapy: efficiently scraping a website by crawling multiple different pages/categories

yftpprvb, posted 2023-04-12 in Other

I'm having trouble moving my current scraping project/idea forward. I'm trying to scrape every product in an online store, by category. The site is: https://eshop.nomin.mn/
Currently, with the help of the great developers on this forum, I've managed to scrape the Food/Grocery category successfully using the store's data API (my code is provided at the bottom of this post). While I could repeat that success for the other categories by swapping out the data API URL, I believe that would be clumsy and inefficient.
Ideally, I'd like to crawl all of the site's categories with a single spider rather than writing one spider per category. I'm not sure how to go about this, because the site in my previous project listed every product on its home page, and this one doesn't. Also, adding multiple data API URLs doesn't seem to work for me. Each category has a different URL and a different data API, for example:
1. Electrical products (https://eshop.nomin.mn/6011.html)
2. Food (https://eshop.nomin.mn/n-foods.html)
3. Building materials and tools (https://eshop.nomin.mn/n-building-materials-tools.html)
4. Automotive products and parts (https://eshop.nomin.mn/n-autoparts-tools.html)
5. etc.
The image below shows how the site and its categories are navigated (translated into English).

Ideally, the end product of my scraping would be one long table like the one shown. I've included the original price and the listed price as separate columns, because some categories, such as electrical products, have two price elements in the HTML, as shown below.

<div class="item-specialPricetag-1JM">
<span class="item-oldPrice-1sY">
<span>1</span>
<span>,</span>
<span>899</span>
<span>,</span>
<span>990</span>
<span>₮</span>
</span>
</div>

<div class="item-webSpecial-Z6W">
<span>1</span>
<span>,</span>
<span>599</span>
<span>,</span>
<span>990</span>
<span>₮</span>
</div>
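For reference, if these prices ever have to be read from the rendered HTML rather than from the API, the digits split across individual span elements can simply be joined back together. A minimal sketch using parsel (the selector library Scrapy's own selectors are built on), assuming exactly the markup shown above; the class names are taken from the snippet and may change between page builds:

# Sketch: join the digit <span>s shown above into integer prices.
from parsel import Selector

html = """
<div class="item-specialPricetag-1JM">
  <span class="item-oldPrice-1sY">
    <span>1</span><span>,</span><span>899</span><span>,</span><span>990</span><span>₮</span>
  </span>
</div>
<div class="item-webSpecial-Z6W">
  <span>1</span><span>,</span><span>599</span><span>,</span><span>990</span><span>₮</span>
</div>
"""

sel = Selector(text=html)

# Concatenate the text of every child <span>, then strip separators and the currency sign.
old_price_raw = "".join(sel.css("span.item-oldPrice-1sY span::text").getall())
web_price_raw = "".join(sel.css("div.item-webSpecial-Z6W > span::text").getall())

def to_int(raw):
    # "1,899,990₮" -> 1899990
    return int(raw.replace(",", "").replace("₮", ""))

print(to_int(old_price_raw), to_int(web_price_raw))  # 1899990 1599990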

Below is my current working code, which successfully scrapes the Food category and retrieves 3,000+ product names, descriptions, and prices. *I also think that, since I'll be scraping multiple pages/categories, a rotating/randomly generated header/user-agent would probably be smart. What would be the best way to integrate that?*

import scrapy
from scrapy import Request
from scrapy.crawler import CrawlerProcess
from datetime import datetime

BASE_URL = "https://eshop.nomin.mn/graphql?query=query+category($pageSize:Int!$currentPage:Int!$filters:ProductAttributeFilterInput!$sort:ProductAttributeSortInput){products(pageSize:$pageSize+currentPage:$currentPage+filter:$filters+sort:$sort){items{id+name+sku+brand+salable_qty+brand_name+c21_available+c21_business_type+c21_reference+c21_street+c21_area+c21_bed_room+mp_daily_deal{created_at+date_from+date_to+deal_id+deal_price+remaining_time+deal_qty+discount_label+is_featured+product_id+product_name+product_sku+sale_qty+status+store_ids+updated_at+__typename}new_to_date+short_description{html+__typename}productAttributes{name+value+__typename}price{regularPrice{amount{currency+value+__typename}__typename}__typename}special_price+special_to_date+thumbnail{file_small+url+__typename}url_key+url_suffix+mp_label_data{enabled+name+priority+label_template+label_image+to_date+__typename}...on+ConfigurableProduct{variants{product{sku+special_price+price{regularPrice{amount{currency+value+__typename}__typename}__typename}__typename}__typename}__typename}__typename}page_info{total_pages+__typename}total_count+__typename}}&operationName=category&variables="

dt_today = datetime.now().strftime('%Y%m%d')
filename = dt_today + ' Nomin CPI Foods Data'

class NominCPIFoodsSpider(scrapy.Spider):
    name = 'nomin_cpi_foods'
    allowed_domains = ['eshop.nomin.mn']  # domains only, not full URLs
    custom_settings = {
        "FEEDS": {
            f'{filename}.csv': {
                'format': 'csv',
                'overwrite': True}}
    }

    # builds one API request per result page of the Foods category (id 24175)
    def start_requests(self):
        for i in range(1, 51):  # result pages are 1-indexed
            url = BASE_URL + '{"currentPage":' + str(i) + ',"id":24175,"filters":{"category_id":{"in":"24175"}},"pageSize":50,"sort":{"position":"DESC"}}'
            yield Request(url, self.parse)

    # parses the JSON API response and yields one item per product
    def parse(self, response, **kwargs):
        data = response.json()
        for item in data['data']["products"]["items"]:
            yield {
                "name": item["name"],
                "price": item["price"]["regularPrice"]["amount"]["value"],
                "description": item["short_description"]["html"]
            }

        # HTML pagination fallback (has no effect on the JSON API responses)
        next_url = response.css("nav.custom-pagination > a.next::attr(href)").get()
        if next_url:
            yield scrapy.Request(next_url, self.parse)

if __name__ == "__main__":
    process = CrawlerProcess()
    process.crawl(NominCPIFoodsSpider)
    process.start()
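On the rotating header/user-agent question above, one simple option that needs no extra library is to pick a User-Agent per request inside start_requests. A minimal sketch, assuming a hand-maintained list of user-agent strings (the ones below are illustrative placeholders, not a vetted set):

import random

from scrapy import Request

# Illustrative placeholder user-agent strings; swap in a maintained list.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.3 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/111.0",
]

# Drop-in variant of the spider's start_requests shown above.
def start_requests(self):
    for i in range(1, 51):
        url = BASE_URL + '{"currentPage":' + str(i) + ',"id":24175,"filters":{"category_id":{"in":"24175"}},"pageSize":50,"sort":{"position":"DESC"}}'
        # Attach a randomly chosen User-Agent header to each request.
        yield Request(url, self.parse, headers={"User-Agent": random.choice(USER_AGENTS)})

The same effect can be achieved site-wide with a downloader middleware that sets request.headers['User-Agent'] in process_request, which keeps the spider code unchanged.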

Any help would be greatly appreciated, thank you.

Answer 1 (kb5ga3dv)

What you can do is go to the site and visit each category, grab that category's API URL, check how many pages of data the category has, then pull the category ID out of the URL and build a dictionary in your code with the category IDs as keys and the page counts as values.
Then, in your start_requests method, you can loop over the categories the same way you currently loop over the current page, substituting both variables. The rest can stay largely unchanged.
One thing that is unnecessary is continuing to parse the actual web pages themselves. Everything you need is available from the API, so yielding requests for the individual pages doesn't really gain you anything.
Here is an example using a handful of the categories available on the site.

import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy import Request
from datetime import datetime

# category ID -> number of result pages observed on the site
categories = {
    "19653": 4,
    "24175": 67,
    "21297": 48,
    "19518": 16,
    "19487": 40,
    "26011": 46,
    "19767": 3,
    "19469": 5,
    "19451": 4
}

dt_today = datetime.now().strftime('%Y%m%d')
filename = dt_today + ' Nomin'

class Nomin(scrapy.Spider):
    name = 'nomin'
    custom_settings = {
        "FEEDS": {
            f'{filename}.csv': {
                'format': 'csv',
                'overwrite': True}}
    }

    def start_requests(self):
        # one API request per page of every category
        for cat, pages in categories.items():
            for i in range(1, pages + 1):  # include the last page
                url = f'https://eshop.nomin.mn/graphql?query=query+category%28%24pageSize%3AInt%21%24currentPage%3AInt%21%24filters%3AProductAttributeFilterInput%21%24sort%3AProductAttributeSortInput%29%7Bproducts%28pageSize%3A%24pageSize+currentPage%3A%24currentPage+filter%3A%24filters+sort%3A%24sort%29%7Bitems%7Bid+name+sku+brand+salable_qty+brand_name+c21_available+c21_business_type+c21_reference+c21_street+c21_area+c21_bed_room+mp_daily_deal%7Bcreated_at+date_from+date_to+deal_id+deal_price+remaining_time+deal_qty+discount_label+is_featured+product_id+product_name+product_sku+sale_qty+status+store_ids+updated_at+__typename%7Dnew_to_date+short_description%7Bhtml+__typename%7DproductAttributes%7Bname+value+__typename%7Dprice%7BregularPrice%7Bamount%7Bcurrency+value+__typename%7D__typename%7D__typename%7Dspecial_price+special_to_date+thumbnail%7Bfile_small+url+__typename%7Durl_key+url_suffix+mp_label_data%7Benabled+name+priority+label_template+label_image+to_date+__typename%7D...on+ConfigurableProduct%7Bvariants%7Bproduct%7Bsku+special_price+price%7BregularPrice%7Bamount%7Bcurrency+value+__typename%7D__typename%7D__typename%7D__typename%7D__typename%7D__typename%7D__typename%7Dpage_info%7Btotal_pages+__typename%7Dtotal_count+__typename%7D%7D&operationName=category&variables=%7B%22currentPage%22%3A{i}%2C%22id%22%3A{cat}%2C%22filters%22%3A%7B%22category_id%22%3A%7B%22in%22%3A%22{cat}%22%7D%7D%2C%22pageSize%22%3A50%2C%22sort%22%3A%7B%22news_from_date%22%3A%22ASC%22%7D%7D'
                yield Request(url, self.parse)

    def parse(self, response, **kwargs):
        data = response.json()
        if data and data['data'] and data['data']['products'] and data['data']['products']['items']:
            for item in data['data']["products"]["items"]:
                yield {
                    "name": item["name"],
                    "price": item["price"]["regularPrice"]["amount"]["value"],
                    "description": item["short_description"]["html"]
                }

if __name__ == "__main__":
    process = CrawlerProcess()
    process.crawl(Nomin)
    process.start()

P.S. My page counts may not be accurate; I just used what was visible at the bottom of the first page, and some categories may have more pages.
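Since the GraphQL query already selects page_info{total_pages}, the hard-coded page counts could also be replaced by asking the API itself: request page 1 of each category, read total_pages from the response, and schedule the remaining pages from there. A rough sketch of that idea, to be placed inside the spider above (build_url is a hypothetical helper that fills the category ID and page number into the URL used above):

def start_requests(self):
    # Only page 1 per category; the remaining pages are scheduled in parse_first_page.
    for cat in categories:
        yield Request(build_url(cat, 1), self.parse_first_page, cb_kwargs={"cat": cat})

def parse_first_page(self, response, cat):
    # Yield the items on page 1 using the normal parse method.
    yield from self.parse(response)
    # Then schedule pages 2..total_pages as reported by the API.
    total_pages = response.json()["data"]["products"]["page_info"]["total_pages"]
    for page in range(2, total_pages + 1):
        yield Request(build_url(cat, page), self.parse)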
Edit:
To send the category along with the request, you can simply store the category name in the dictionary alongside its ID and page count, and then set it in the cb_kwargs parameter of each request yielded from start_requests.
For example:

categories = {
    "19653": {
        "pages": 4, 
        "name": "Food"
     },
     "33456": {
         "pages": 12,
         "name": "Outdoor"
     }
}

# This is fake information I made up for the example

Then, in the start_requests method:

def start_requests(self):
    for cat, val in categories.items():
        for page in range(1, val["pages"]):
            url = .....
            yield scrapy.Request(
                url,
                callback=self.parse,
                cb_kwargs={"category": val["name"]}
            )

Then, in the parse method:

def parse(self, response, category=None):
    data = response.json()
    if data and data['data'] and data['data']['products'] and data['data']['products']['items']:
        for item in data['data']["products"]["items"]:
            yield {
                "category": category,
                "name": item["name"],
                "price": item["price"]["regularPrice"]["amount"]["value"],
                "special_price": item["special_price"],
                "description": item["short_description"]["html"]
            }
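A nice property of cb_kwargs is that the category name travels with each request and arrives in the callback as an ordinary keyword argument, so every yielded item stays correctly labelled even when responses come back out of order, without keeping any state on the spider itself.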
