python-3.x Scrapy requests - my own callback function is never called

xfb7svmp | Posted 2023-04-08 | in Python
Follow (0) | Answers (1) | Views (151)

I want to request the page at regular intervals to check whether its content has been updated, but my own callback function is never triggered.

allowed_domains = ['www1.hkexnews.hk']
start_urls = 'https://www1.hkexnews.hk/search/predefineddoc.xhtml?lang=zh&predefineddocuments=9'

The parsing code is:

    # Crawl all data first at each start
    def parse(self, response):
        Total_records = int(re.findall("\d+",response.xpath("//div[@class='PD-TotalRecords']/text()").extract()[0])[0])
        dict = {}
        is_Latest = True
        global Latest_info
        global previous_hash

        for i in range(1, Total_records + 1):
            content = response.xpath("//table/tbody/tr[{}]//text()".format(i)).extract()

            # Use the group function to group the list by key
            result = list(group(content, self.keys))
            Time = dict['Time'] = result[0].get(self.keys[0])
            Code = dict['Code'] = result[1].get(self.keys[1])
            dict['Name'] = result[2].get(self.keys[2])
            if is_Latest:
                Latest_info = str(Time) + " | " + str(Code)
                is_Latest = False

            yield dict

        previous_hash = get_hash(Latest_info.encode('utf-8'))
        #Monitor data updates and crawl for new data
        while True:
            time.sleep(10)
            # Request website content and calculate hash values
            yield scrapy.Request(url=self.start_urls, callback=self.parse_check, dont_filter=True)

My own callback function is:

    def parse_check(self, response):
        global previous_hash
        global Latest_info
        dict = {}
        content = response.xpath("//table/tbody/tr[1]//text()").extract()
        # Use the group function to group the list by key
        result = list(group(content, self.keys))
        Time =  result[0].get(self.keys[0])
        Code = result[1].get(self.keys[1])

        current_info = str(Time) + " | " + str(Code)
        current_hash = get_hash(current_info.encode('utf-8'))

        # Compare hash values to determine if website content is updated
        if current_hash != previous_hash:

            dict['Time'] = Time
            dict['Code'] = Code
            dict['Name'] = result[2].get(self.keys[2])

            previous_hash = current_hash
            Latest_info = current_info
        yield dict
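The `get_hash` helper isn't shown in the question, so here is a minimal stand-in (an assumption, not the asker's actual code) using `hashlib`, which illustrates the change-detection idea: hash the latest "Time | Code" string and compare against the previous hash.

```python
import hashlib

def get_hash(data: bytes) -> str:
    # SHA-256 digest of the encoded "Time | Code" string
    return hashlib.sha256(data).hexdigest()

# Simulated change detection: a new timestamp yields a different hash
previous_hash = get_hash("10:00 | 00001".encode("utf-8"))
current_hash = get_hash("10:05 | 00001".encode("utf-8"))
print(current_hash != previous_hash)  # content changed, so hashes differ
```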

I tried adding an errback, but it produced no output. Then I tried requesting the page with requests.get instead of yielding scrapy.Request, and that worked, but I still don't understand why my callback function is never called.

nfs0ujit 1#

I found out why, or at least this worked for me: avoid using time.sleep inside Scrapy, because it blocks the Twisted reactor (the framework underlying Scrapy). That completely blocks the spider and disables all of Scrapy's concurrency features. Use the DOWNLOAD_DELAY setting or AutoThrottle instead.
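In other words, let Scrapy's scheduler supply the delay instead of sleeping. A sketch of the settings (values assumed to match the question's 10-second interval):

```python
# settings.py (or the spider's custom_settings) — a configuration sketch
DOWNLOAD_DELAY = 10           # wait 10 s between consecutive requests

# or let Scrapy adapt the delay to server load automatically:
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 10
```

With a delay configured, the `while True` / `time.sleep(10)` loop can be dropped: `parse_check` can simply end by yielding another `scrapy.Request(..., callback=self.parse_check, dont_filter=True)`, and the scheduler will space the repeated checks out by the configured delay without blocking the reactor.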
