scrapy runspider returns an empty file [DEBUG: Crawled (200)]

Asked by euoag5mw on 2022-11-09

To analyze the prices of different products, I created a function that downloads them through the scrapy library; however, when I run the routine it returns an error message.
I have saved the scrapy.exe file in the same working directory as the .py file being run.
This is my code:

import scrapy
from scrapy.item import Field
from scrapy.item import Item
from scrapy.spiders import CrawlSpider, Rule
from scrapy.selector import Selector
from scrapy.loader.processors import MapCompose
from scrapy.linkextractors import LinkExtractor
from scrapy.loader import ItemLoader
from bs4 import BeautifulSoup

class Articulo(Item):
    titulo = Field()
    precio = Field()
    descripcion = Field()

class MercadoLibreCrawler(CrawlSpider):
    name = 'mercadoLibre'
    custom_settings = {
        'USER_AGENT': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36',
        'CLOSESPIDER_PAGECOUNT': 5
    }
    download_delay = 1

    allowed_domains = ['articulo.mercadolibre.cl', 'listado.mercadolibre.cl']   # more domains can be added, separated by commas

    start_urls = ['https://listado.mercadolibre.cl/animales-mascotas/caballos/']

    rules = (
        Rule(  # RULE #1 => HORIZONTAL CRAWL VIA PAGINATION
            LinkExtractor(
                allow=r'/_Desde_\d+'
            ), follow=True),

        Rule(   # RULE #2 => VERTICAL CRAWL INTO PRODUCT DETAIL PAGES
            LinkExtractor(
                allow=r'/MCL-'
            ), follow=True, callback='parse_items'),
    )

    def limpiarTexto(self, texto):
        nuevoTexto = texto.replace('\n', ' ').replace('\r',' ').replace('\t', ' ').strip()
        return nuevoTexto

    def parse_items(self, response):
        item = ItemLoader(Articulo(), response)

        item.add_xpath('titulo', '//h1/text()')
        item.add_xpath('descripcion', '//div[@class="ui-pdp-description__content"]/p/text()', MapCompose(self. limpiarTexto))
        item.add_xpath('precio', '//span[@class="andes-money-amount__fraction"]/text()', MapCompose(self.limpiarTexto))

        yield item.load_item()

The code runs without any problem, yet the result is an empty file. I think the problem lies in this "DEBUG: Crawled (200) ... (referer: None)" message, but I don't quite understand how to fix it.

C:\Users\gusta\OneDrive\Documentos\Empresa>scrapy runspider 20220910_scraping_mercado_libre.py -o mercado_libre.csv -t csv
C:\Users\gusta\AppData\Local\Programs\Python\Python310\lib\site-packages\scrapy\commands\__init__.py:131: ScrapyDeprecationWarning: The -t command line option is deprecated in favor of specifying the output format within the output URI. See the documentation of the -o and -O options for more information.
  feeds = feed_process_params_from_cli(
2022-09-11 02:55:48 [scrapy.utils.log] INFO: Scrapy 2.6.2 started (bot: scrapybot)
2022-09-11 02:55:48 [scrapy.utils.log] INFO: Versions: lxml 4.9.1.0, libxml2 2.9.12, cssselect 1.1.0, parsel 1.6.0, w3lib 2.0.1, Twisted 22.8.0, Python 3.10.5 (tags/v3.10.5:f377153, Jun  6 2022, 16:14:13) [MSC v.1929 64 bit (AMD64)], pyOpenSSL 22.0.0 (OpenSSL 3.0.5 5 Jul 2022), cryptography 37.0.4, Platform Windows-10-10.0.19044-SP0
2022-09-11 02:55:49 [scrapy.crawler] INFO: Overridden settings:
{'CLOSESPIDER_PAGECOUNT': 1,
 'SPIDER_LOADER_WARN_ONLY': True,
 'USER_AGENT': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
               '(KHTML, like Gecko) Chrome/104.0.5112.102 Safari/537.36'}
2022-09-11 02:55:49 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2022-09-11 02:55:49 [scrapy.extensions.telnet] INFO: Telnet Password: 2f2010a00a1f6efa
2022-09-11 02:55:49 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.closespider.CloseSpider',
 'scrapy.extensions.feedexport.FeedExporter',
 'scrapy.extensions.logstats.LogStats']
2022-09-11 02:55:49 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2022-09-11 02:55:49 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2022-09-11 02:55:49 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2022-09-11 02:55:49 [scrapy.core.engine] INFO: Spider opened
2022-09-11 02:55:50 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-09-11 02:55:50 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2022-09-11 02:55:50 [filelock] DEBUG: Attempting to acquire lock 1646952198736 on C:\Users\gusta\AppData\Local\Programs\Python\Python310\lib\site-packages\tldextract\.suffix_cache/publicsuffix.org-tlds\de84b5ca2167d4c83e38fb162f2e8738.tldextract.json.lock
2022-09-11 02:55:50 [filelock] DEBUG: Lock 1646952198736 acquired on C:\Users\gusta\AppData\Local\Programs\Python\Python310\lib\site-packages\tldextract\.suffix_cache/publicsuffix.org-tlds\de84b5ca2167d4c83e38fb162f2e8738.tldextract.json.lock
2022-09-11 02:55:50 [filelock] DEBUG: Attempting to release lock 1646952198736 on C:\Users\gusta\AppData\Local\Programs\Python\Python310\lib\site-packages\tldextract\.suffix_cache/publicsuffix.org-tlds\de84b5ca2167d4c83e38fb162f2e8738.tldextract.json.lock
2022-09-11 02:55:50 [filelock] DEBUG: Lock 1646952198736 released on C:\Users\gusta\AppData\Local\Programs\Python\Python310\lib\site-packages\tldextract\.suffix_cache/publicsuffix.org-tlds\de84b5ca2167d4c83e38fb162f2e8738.tldextract.json.lock
2022-09-11 02:55:50 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://listado.mercadolibre.cl/animales-mascotas/caballos/> (referer: None)
2022-09-11 02:55:50 [scrapy.core.engine] INFO: Closing spider (closespider_pagecount)
2022-09-11 02:55:51 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 332,
 'downloader/request_count': 1,
 'downloader/request_method_count/GET': 1,
 'downloader/response_bytes': 105595,
 'downloader/response_count': 1,
 'downloader/response_status_count/200': 1,
 'elapsed_time_seconds': 1.003421,
 'finish_reason': 'closespider_pagecount',
 'finish_time': datetime.datetime(2022, 9, 11, 5, 55, 51, 184037),
 'httpcompression/response_bytes': 722794,
 'httpcompression/response_count': 1,
 'log_count/DEBUG': 6,
 'log_count/INFO': 10,
 'request_depth_max': 1,
 'response_received_count': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 52,
 'scheduler/enqueued/memory': 52,
 'start_time': datetime.datetime(2022, 9, 11, 5, 55, 50, 180616)}
2022-09-11 02:55:51 [scrapy.core.engine] INFO: Spider closed (closespider_pagecount)
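
One way to narrow this down is to test the XPaths outside Scrapy against a saved copy of a product page. A minimal sketch with parsel, which is installed together with Scrapy; the file name producto.html is a hypothetical saved page, not from the original post:

from parsel import Selector  # parsel ships as a Scrapy dependency

# Load a product page saved beforehand (hypothetical file name)
html = open('producto.html', encoding='utf-8').read()
sel = Selector(text=html)

# If these print None, the XPaths do not match the page structure
print(sel.xpath('//h1/text()').get())
print(sel.xpath('//span[@class="andes-money-amount__fraction"]/text()').get())
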
Answer by 0yg35tkg:

I found a typo in your rules:

Rule(   # RULE #2 => VERTICAL CRAWL INTO PRODUCT DETAIL PAGES
    LinkExtractor(
        allow=r'/MCL-'
    ), follow=True, callback='parse_items')

It should be MLC (not MCL):

Rule(   # RULE #2 => VERTICAL CRAWL INTO PRODUCT DETAIL PAGES
    LinkExtractor(
        allow=r'/MLC-'
    ), follow=True, callback='parse_items')
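
The difference is easy to confirm in isolation: MercadoLibre Chile product URLs carry the MLC country prefix, so the misspelled pattern never matches, parse_items is never called, and the feed stays empty. A quick check (the URL below is a made-up example of the pattern, not a real listing):

import re

# Hypothetical product URL following MercadoLibre Chile's MLC- pattern
url = 'https://articulo.mercadolibre.cl/MLC-123456789-ejemplo-_JM'
print(bool(re.search(r'/MCL-', url)))  # False -> rule #2 extracts nothing, so no items
print(bool(re.search(r'/MLC-', url)))  # True  -> parse_items runs for product pages
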

UPDATE: After that fix there is another typo, this time in the item handler: MapCompose(self. limpiarTexto) has a stray space after the dot. The line should read:

item.add_xpath('descripcion', '//div[@class="ui-pdp-description__content"]/p/text()', MapCompose(self.limpiarTexto))
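
For completeness, a sketch of both methods as they should sit inside the spider class, restating the poster's code with the fixes applied:

def limpiarTexto(self, texto):
    # Collapse line breaks and tabs into spaces, then trim the ends
    return texto.replace('\n', ' ').replace('\r', ' ').replace('\t', ' ').strip()

def parse_items(self, response):
    item = ItemLoader(Articulo(), response)
    item.add_xpath('titulo', '//h1/text()')
    item.add_xpath('descripcion', '//div[@class="ui-pdp-description__content"]/p/text()', MapCompose(self.limpiarTexto))
    item.add_xpath('precio', '//span[@class="andes-money-amount__fraction"]/text()', MapCompose(self.limpiarTexto))
    yield item.load_item()

As the deprecation warning in the log points out, the -t csv flag is also unnecessary: scrapy runspider 20220910_scraping_mercado_libre.py -o mercado_libre.csv infers the output format from the file extension.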
