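I want to request the page at regular intervals to check whether its content has been updated, but my own callback function is never triggered.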
allowed_domains = ['www1.hkexnews.hk']
start_urls = 'https://www1.hkexnews.hk/search/predefineddoc.xhtml?lang=zh&predefineddocuments=9'
The parsing code is:
# Crawl all data first at each start
def parse(self, response):
    Total_records = int(re.findall("\d+",response.xpath("//div[@class='PD-TotalRecords']/text()").extract()[0])[0])
    dict = {}
    is_Latest = True
    global Latest_info
    global previous_hash
    for i in range(1, Total_records + 1):
        content = response.xpath("//table/tbody/tr[{}]//text()".format(i)).extract()
        # Use the group function to group the list by key
        result = list(group(content, self.keys))
        Time = dict['Time'] = result[0].get(self.keys[0])
        Code = dict['Code'] = result[1].get(self.keys[1])
        dict['Name'] = result[2].get(self.keys[2])
        if is_Latest:
            Latest_info = str(Time) + " | " + str(Code)
            is_Latest = False
        yield dict
    previous_hash = get_hash(Latest_info.encode('utf-8'))
    # Monitor data updates and crawl for new data
    while True:
        time.sleep(10)
        # Request website content and calculate hash values
        yield scrapy.Request(url=self.start_urls, callback=self.parse_check, dont_filter=True)
My own callback function is:
def parse_check(self, response):
    global previous_hash
    global Latest_info
    dict = {}
    content = response.xpath("//table/tbody/tr[1]//text()").extract()
    # Use the group function to group the list by key
    result = list(group(content, self.keys))
    Time = result[0].get(self.keys[0])
    Code = result[1].get(self.keys[1])
    current_info = str(Time) + " | " + str(Code)
    current_hash = get_hash(current_info.encode('utf-8'))
    # Compare hash values to determine if website content is updated
    if current_hash != previous_hash:
        dict['Time'] = Time
        dict['Code'] = Code
        dict['Name'] = result[2].get(self.keys[2])
        previous_hash = current_hash
        Latest_info = current_info
        yield dict
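The get_hash helper is not shown in the question, so the comparison step can only be sketched under assumptions. The version below assumes get_hash returns a hex digest of the bytes it is given (SHA-256 is a guess, not something the question states), and has_changed is a hypothetical name introduced here to isolate the "Time | Code" comparison from the spider.

```python
import hashlib


def get_hash(data: bytes) -> str:
    """Hypothetical stand-in for the question's helper: hex SHA-256 of the bytes."""
    return hashlib.sha256(data).hexdigest()


def has_changed(previous_hash: str, time_value: str, code_value: str):
    """Return (changed, current_hash) for the newest table row.

    Builds the same "Time | Code" string the callback uses and compares
    its hash with the one stored from the previous check.
    """
    current_info = f"{time_value} | {code_value}"
    current_hash = get_hash(current_info.encode("utf-8"))
    return current_hash != previous_hash, current_hash
```

Because only the first row's Time and Code go into the hash, this detects a new top row but not edits to older rows.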
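I tried attaching an errback, but it printed nothing. I then tried fetching the page with requests.get instead of yielding a scrapy.Request, and that worked, but I still don't understand why my callback function isn't called.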
1 Answer
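I figured out why, or at least this is what worked for me: avoid calling time.sleep inside Scrapy. It blocks the Twisted reactor (the event loop Scrapy is built on), which stalls the whole spider and disables all of Scrapy's concurrency. Use the DOWNLOAD_DELAY setting or the AutoThrottle extension instead.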
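The blocking effect can be reproduced without Scrapy. The sketch below uses asyncio from the standard library as a stand-in for the Twisted reactor (an assumption made only so the example runs on its own; it is not Scrapy code). A background ticker plays the role of the other work the event loop should keep servicing, and all function names are made up for this illustration.

```python
import asyncio
import contextlib
import time


async def ticker(counter, interval=0.01):
    """Stand-in for other work the event loop should keep servicing."""
    while True:
        counter["ticks"] += 1
        await asyncio.sleep(interval)


async def blocking_wait():
    # time.sleep() never hands control back to the loop while it waits.
    time.sleep(0.2)


async def cooperative_wait():
    # asyncio.sleep() suspends only this coroutine; the loop keeps running.
    await asyncio.sleep(0.2)


async def ticks_during(wait):
    """Count how often the ticker ran while wait() was in progress."""
    counter = {"ticks": 0}
    task = asyncio.create_task(ticker(counter))
    await asyncio.sleep(0)  # give the ticker a chance to start
    before = counter["ticks"]
    await wait()
    after = counter["ticks"]
    task.cancel()
    with contextlib.suppress(asyncio.CancelledError):
        await task
    return after - before


def compare():
    """Run both variants and return (ticks while blocked, ticks while cooperative)."""
    blocked = asyncio.run(ticks_during(blocking_wait))
    cooperative = asyncio.run(ticks_during(cooperative_wait))
    return blocked, cooperative


if __name__ == "__main__":
    blocked, cooperative = compare()
    print(f"ticks during time.sleep:    {blocked}")
    print(f"ticks during asyncio.sleep: {cooperative}")
```

The ticker makes no progress while time.sleep is running, which mirrors how a sleeping callback keeps Scrapy from downloading pages or dispatching responses to other callbacks. Letting the framework do the waiting, as DOWNLOAD_DELAY does, keeps the loop free.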