Scrapy request does not trigger its callback

u7up0aaq  posted on 2022-11-09  in  Other

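My Scrapy request never triggers its callback: "1" is never printed. I have researched this for a long time and still cannot solve it. The callback does not fire for any URL I try.
ROBOTSTXT_OBEY = False is set in default_settings.py, and dont_filter=True is passed to the request as well.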

import scrapy as scrapy    
class TheSpider(scrapy.Spider):
    name = 'Test'
    headers = {
        'Accept': '*/*',
        'Accept-Encoding': 'gzip, deflate, br',
        'Accept-Language': 'en-GB,en-US;q=0.9,en;q=0.8',
        'Connection': 'keep-alive',
        'Host': 'www.eventscribe.com',
        'Referer': 'https://www.eventscribe.com/2018/ADEA/speakers.asp?h=Browse%20By%20Speaker',
        'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36',
        'X-Requested-With': 'XMLHttpRequest'
    }
    payload = {'as_epq': 'James Clark', 'tbs': 'cdr:1,cd_min:01/01/2015,cd_max:01/01/2015', 'tbm': 'nws'}

    def run(self):
        scrapy.Request(url='https://www.google.com/',
                              callback=self.parse, method='GET', headers=self.headers,
                              dont_filter=True)

    def parse(self, response,**kwargs):
        print('1')
        self.log("I just visited:" + response.url)
        scrapy.FormRequest.from_response(response, formdata={'startDate': '08.29.2021'},
                                         clickdata={'id': 'calendar-picker-submit'},
                                         method='POST',
                                         callback=self.new_response, headers=self.headers,
                                         dont_filter=True)

    def new_response(self, response):
        self.log("I just visited:" + response.url)
        response.xpath("//div[@class='row numbers-past-results']/div[@class='ball-number']/text()").extract()

theSpider = TheSpider(scrapy.Spider)
theSpider.run()

Can anyone help? Thanks in advance.

rseugnpd1#

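There are several issues with the way you are using Scrapy. I am assuming you intend to run the file as a script rather than through the scrapy CLI. Below are some of the problems in your code and possible fixes, but you should probably also read the getting-started section of the Scrapy documentation: https://docs.scrapy.org/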

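  • If you want a self-contained script with a standalone spider, you need to import CrawlerProcess and use it to run the spider.
  • The entry point of a spider's crawl is the start_requests method, not run. Scrapy never calls run.
  • None of your methods yield their requests. A Request that is only constructed, and not yielded or returned, is never scheduled.
  • Something in your headers also appears to cause the request to be rejected. Since I assume you are using those headers for a reason, I am not going to modify them; I simply will not send them on the first request.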
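The second and third points above can be illustrated without Scrapy at all. The following is only a toy model, not Scrapy's real engine: ToyRequest, ToyEngine, BrokenSpider and WorkingSpider are hypothetical names invented for this sketch. It assumes an engine that asks the spider for start_requests() and schedules only what the spider's methods hand back.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class ToyRequest:
    """Hypothetical stand-in for a request object (not scrapy.Request)."""
    url: str
    callback: Optional[Callable] = None


class ToyEngine:
    """Toy crawler loop; an assumption-laden sketch, NOT Scrapy's implementation."""

    def crawl(self, spider) -> List[str]:
        visited = []
        # The engine only ever calls start_requests(); a method named run() is ignored.
        queue = list(spider.start_requests() or [])
        while queue:
            request = queue.pop(0)
            visited.append(request.url)
            response = f"<fake response for {request.url}>"
            # Fall back to spider.parse when the request names no callback.
            callback = request.callback or spider.parse
            # Only requests that the callback yields or returns get scheduled.
            queue.extend(callback(response) or [])
        return visited


class BrokenSpider:
    def start_requests(self):
        # The request is created and immediately discarded: nothing is scheduled.
        ToyRequest("https://example.com/a")

    def parse(self, response):
        print("1")  # never reached


class WorkingSpider:
    def start_requests(self):
        yield ToyRequest("https://example.com/a")

    def parse(self, response):
        print("1")
        yield ToyRequest("https://example.com/b", callback=self.done)

    def done(self, response):
        return None


print(ToyEngine().crawl(BrokenSpider()))   # prints []
print(ToyEngine().crawl(WorkingSpider()))  # prints "1", then both URLs in order
```

In the broken spider the callback is never reached because the engine never receives a request to process, which mirrors the symptom in the question.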

With these changes, you will see 1 printed to the screen when the parse callback is invoked.

import scrapy
from scrapy.crawler import CrawlerProcess


class TheSpider(scrapy.Spider):
    name = 'Test'
    headers = {
        'Accept': '*/*',
        'Accept-Encoding': 'gzip, deflate, br',
        'Accept-Language': 'en-GB,en-US;q=0.9,en;q=0.8',
        'Connection': 'keep-alive',
        'Host': 'www.eventscribe.com',
        'Referer': 'https://www.eventscribe.com/2018/ADEA/speakers.asp?h=Browse%20By%20Speaker',
        'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36',
        'X-Requested-With': 'XMLHttpRequest'
    }
    payload = {'as_epq': 'James Clark', 'tbs': 'cdr:1,cd_min:01/01/2015,cd_max:01/01/2015', 'tbm': 'nws'}

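    # Scrapy calls start_requests() to obtain the initial requests; they must be yielded.
    def start_requests(self):
        # No callback is given, so the response is passed to self.parse by default.
        yield scrapy.Request(url='https://www.google.com')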

    def parse(self, response, **kwargs):
        print('1')
        # Yield the follow-up form request so that Scrapy actually schedules it.
        yield scrapy.FormRequest.from_response(
            response,
            formdata={'startDate': '08.29.2021'},
            clickdata={'id': 'calendar-picker-submit'},
            method='POST',
            callback=self.new_response,
            headers=self.headers,
            dont_filter=True,
        )

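    def new_response(self, response):
        self.log("I just visited: " + response.url)
        # Yield the extracted text as an item; a bare expression would be discarded.
        # ('numbers' is an illustrative key name, not part of the original code.)
        numbers = response.xpath("//div[@class='row numbers-past-results']/div[@class='ball-number']/text()").getall()
        yield {'numbers': numbers}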

# Run the spider from a plain Python script; start() blocks until the crawl finishes.
process = CrawlerProcess()
process.crawl(TheSpider)
process.start()
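On the headers point: the answer does not say why the headers are rejected. One plausible contributor, and this is only an assumption, is that the Host header names www.eventscribe.com while the request goes to www.google.com. A minimal standard-library sketch of a sanity check for that mismatch follows; host_header_matches is a hypothetical helper name, not part of Scrapy, and it deliberately ignores ports.

```python
from urllib.parse import urlsplit


def host_header_matches(url: str, headers: dict) -> bool:
    """Return True when no Host header is set or it equals the URL's hostname.

    Hypothetical helper for illustration only; ports are not handled.
    """
    # HTTP header names are case-insensitive, so normalize before looking up.
    lowered = {name.lower(): value for name, value in headers.items()}
    host = lowered.get("host")
    if host is None:
        return True
    return host.lower() == (urlsplit(url).hostname or "")


headers = {"Host": "www.eventscribe.com", "Accept": "*/*"}
print(host_header_matches("https://www.google.com/", headers))  # False
print(host_header_matches("https://www.eventscribe.com/2018/ADEA/speakers.asp", headers))  # True
```

If the check fails for the URL you are requesting, dropping the Host entry from the headers and letting the HTTP client fill it in is usually the simpler option.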
