How to scrape multi-category and multi-page data with Scrapy

hm2xizp9 · posted on 2023-06-29 · in: Other

I want to scrape data from this online store. Previously, I was able to scrape all the data I wanted **except** the category, sub-category and sub-sub-category, which can be found here.
However, the website seems to have changed recently, because I now get a DNSError for the allowed-domains URL, and running the previous version of the code also produces the following error:

if data and data['data'] and data['data']['products'] and data['data']['products']['items']:
KeyError: 'data'
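I understand the KeyError simply means the JSON response no longer contains a 'data' key. Guarding the lookups, for example as below, avoids the crash but of course still returns no products:

# defensive lookups: an empty list instead of a KeyError when the API shape changes
items = (((data or {}).get('data') or {}).get('products') or {}).get('items') or []
if items:
    ...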

According to a user's comment, there seems to be something wrong with my start request URLs. However, I can't figure out what, even after using the Network tab of Google DevTools. So I built a new scraper with error handling (*the full script can be found at the bottom of this post*), which produced the 3 errors/bugs below:
Error/Bug 1:

DEBUG: Rule at line 1 without any user agent to enforce it on.

Error/Bug 2:

File "src\lxml\etree.pyx", line 1582, in lxml.etree._Element.xpath
File "src\lxml\xpath.pxi", line 305, in lxml.etree.XPathElementEvaluator.__call__
File "src\lxml\xpath.pxi", line 225, in lxml.etree._XPathEvaluatorBase._handle_result
lxml.etree.XPathEvalError: Invalid expression

Error/Bug 3:

ValueError: XPath error: Invalid expression in .//a[contains(@class, 'MuiBox-root css-1efcy4n')]/text())
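The invalid expression seems to be caused by the stray closing parenthesis after text(); a version without it at least parses, though I am not sure the class name still matches the rendered page:

# the trailing ")" after text() is what lxml rejects; this form is syntactically valid
name = card.xpath(".//a[contains(@class, 'MuiBox-root css-1efcy4n')]/text()").get()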

Also, since I am no longer using the data API, I am struggling with the start request URLs and with pagination. I want to scrape products from the different categories that I included in the dict below; for example, 6011 is electrical goods while 24175 is groceries (a rough sketch of what I think the per-category start requests should look like follows the snippet). Since the website appears to be built with JavaScript, I am also having trouble scraping the data on the next pages. Do I need Selenium? Splash? Please advise.

categories = {
    "6011": {"pages": 60, "name": "Цахилгаан бараа"},
    "24175": {"pages": 70, "name": "Хүнс"},
    "24273": {"pages": 40, "name": "Гэр ахуй"},
    "21297": {"pages": 70, "name": "Гоо сайхан"},
    "19653": {"pages": 30, "name": "Гутал, хувцас"}
}
    def start_requests(self):
        yield scrapy.Request(url="https://e-shop.nomin.mn/t/6011", errback=self.parse_error)
# handling pagination
        next_page = response.xpath(
            "//a[contains(@class,'number-list-next js-page-filter number-list-line')]/@href").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
            print(f'Scraped {next_page}')
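
Here is roughly what I think the per-category start requests should look like (this would replace start_requests in the full spider below). The ?page= query parameter is only a guess on my part; I have not been able to confirm the real pagination URLs in DevTools:

    # Sketch: one request per category and page, using the categories dict above.
    # The "?page=" query parameter is an assumption and must be checked against the
    # site's real pagination URLs before relying on it.
    def start_requests(self):
        for cat_id, info in categories.items():
            for page in range(1, info["pages"] + 1):
                url = f"https://e-shop.nomin.mn/t/{cat_id}?page={page}"  # assumed pattern
                yield scrapy.Request(
                    url,
                    callback=self.parse,
                    errback=self.parse_error,
                    cb_kwargs={"category_name": info["name"]},  # carry the category along
                )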

Full code:

# -*- coding: utf-8 -*-
import scrapy
import re
from scrapy import Request
from datetime import datetime
from scrapy.crawler import CrawlerProcess
from twisted.internet.error import DNSLookupError

dt_today = datetime.now().strftime('%Y%m%d')
filename = dt_today + ' E-CPI Nomin'

categories = {
    "6011": {"pages": 60, "name": "Цахилгаан бараа"},
    "24175": {"pages": 70, "name": "Хүнс"},
    "24273": {"pages": 40, "name": "Гэр ахуй"},
    "21297": {"pages": 70, "name": "Гоо сайхан"},
    "19653": {"pages": 30, "name": "Гутал, хувцас"},
    "19451": {"pages": 10, "name": "Авто бараа"},
    "19518": {"pages": 40, "name": "Барилгын материал"},
    "19853": {"pages": 10, "name": "Аялал, Спорт бараа"},
    "19487": {"pages": 50, "name": "Ном"},
    "19767": {"pages": 20, "name": "Бичиг хэрэг"},
    "19469": {"pages": 10, "name": "Эрүүл мэнд"},
    "19545": {"pages": 20, "name": "Хүүхдийн бараа"},
}

# create Spider class
class ecpiNominSpider(scrapy.Spider):
    name = "cpi_nomin"
    allowed_domains = "www.e-shop.nomin.mn"
    custom_settings = {
        "USER_AGENT": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36",
        "FEEDS": {
            f'{filename}.csv': {
                'format': 'csv',
                'overwrite': True}}
    }

    def start_requests(self):
        yield scrapy.Request(url="https://e-shop.nomin.mn/t/6011", errback=self.parse_error)

    def parse_error(self, failure):
        if failure.check(DNSLookupError):
            # this is the original request
            request = failure.request
            yield {
                'URL': request.url,
                'Status': failure.value
            }

    def parse(self, response, **kwargs):
        cards = response.xpath("//*[contains(@class,'MuiBox-root css-1kmsi46')]")

        # parse details
        for card in cards:
            name = card.xpath(".//a[contains(@class, 'MuiBox-root css-1efcy4n')]/text())").extract_first()
            price = card.xpath(".//*[contains(@class, 'MuiBox-root css-qr51gz')]").extract_first().strip()
            link = card.xpath(".//a[contains(@href)]/@href").get()

            item = {'name': name,
                    'price': price,
                    'link': 'https://e-shop.nomin.mn/p/' + link
                    }
            # follow absolute link to scrape deeper level
            yield response.follow(link, callback=self.parse_item, meta={'item': item})

        # handling pagination
        next_page = response.xpath(
            "//a[contains(@class,'number-list-next js-page-filter number-list-line')]/@href").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
            print(f'Scraped {next_page}')

    def parse_item(self, response):
        # retrieve previously scraped item between callbacks
        item = response.meta['item']

        # parse additional details
        list_li = response.xpath(".//*[contains(@class, 'MuiBreadcrumbs-ol css-nhb8h9')]/text()").extract()

        # get next layer data
        cat1 = list_li[0].strip()
        cat2 = list_li[1].strip()
        cat3 = list_li[2].strip()
        cat4 = list_li[3].strip()
        skp = response.xpath(".//*[contains(@class, ' MuiBox-root css-jyp6ua')/text()").extract()

        # update item with next layer data
        item.update({
            'category': cat1,
            'sub_category': cat2,
            'sub_sub_category': cat3,
            'productName': cat4,
            'productCode': skp
        })

        yield item

# main driver
if __name__ == "__main__":
    process = CrawlerProcess()
    process.crawl(ecpiNominSpider)
    process.start()

I can't figure out whether I should try to fix the previous version of the code from my previous post, or fix the current version posted above. I would really appreciate it if you could help me scrape the data from all the pages (including the cat 1: category and cat 2: sub-category information). Sorry, I'm frustrated with myself for having asked multiple questions on Stack Overflow over the past few months without seeming to make any progress. Thanks again for your help!


5kgi1eie1#

I would suggest using Selenium, since its WebDriver implementation has exactly the functionality you are looking for. For example, you can use element.click() and define methods that always navigate to a specific sub-category. It does take some work to get used to, though; a minimal sketch follows the links below.
The Selenium site has a good "Getting started" section that helps you become familiar with all the important aspects of working with Selenium.
Getting started: https://www.selenium.dev/documentation/webdriver/getting_started/
Selenium documentation: https://www.selenium.dev/documentation/webdriver/
Python-specific documentation: https://selenium-python.readthedocs.io/
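
A minimal sketch of what that could look like (ChromeDriver is assumed to be available; the CSS class names are copied from your spider and may change at any time, so treat the selectors as placeholders):

# Sketch: render a category page with Selenium and read the product cards.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
try:
    driver.get("https://e-shop.nomin.mn/t/6011")
    wait = WebDriverWait(driver, 15)
    # wait until the JavaScript-rendered product cards are present
    cards = wait.until(
        EC.presence_of_all_elements_located((By.CSS_SELECTOR, ".MuiBox-root.css-1kmsi46"))
    )
    for card in cards:
        print(card.text)

    # pagination / sub-category navigation would be element.click() calls, e.g.
    # (the selector is a placeholder -- inspect the real "next page" button first):
    # driver.find_element(By.CSS_SELECTOR, "button[aria-label='Go to next page']").click()
finally:
    driver.quit()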
