python - Why is the callback not executed immediately after calling yield in Scrapy?

jhkqcmku · posted 2023-02-11 in Python

I'm building a web scraper to crawl remote job listings. The spider behaves in a way I can't make sense of, and I'd be grateful if someone could explain why.
Here is the spider's code:

import scrapy
import time

class JobsSpider(scrapy.Spider):
    name = "jobs"
    start_urls = [
        "https://stackoverflow.com/jobs/remote-developer-jobs"
    ]
    already_visited_links = []

    def parse(self, response):
        jobs = response.xpath("//div[contains(@class, 'job')]")
        links_to_next_pages = response.xpath("//a[contains(@class, 's-pagination--item')]").css("a::attr(href)").getall()

        # visit each job page (as I do in the browser) and scrape the relevant information (Job title etc.)
        for job in jobs:
            job_id = int(job.xpath('@data-jobid').extract_first()) # there will always be one element
            # now visit the link with the job_id and get the info
            job_link_to_visit = "https://stackoverflow.com/jobs?id=" + str(job_id)
            request = scrapy.Request(job_link_to_visit,
                             callback=self.parse_job)
            yield request

        # sleep for 10 seconds before requesting the next page
        print("Sleeping for 10 seconds...")
        time.sleep(10)

        # go to the next job listings page (if you haven't already been there)
        # not sure if this solution is the best since it has a loop which has a recursion in it
        for link_to_next_page in links_to_next_pages:
            if link_to_next_page not in self.already_visited_links:
                self.already_visited_links.append(link_to_next_page)
                yield response.follow(link_to_next_page, callback=self.parse)

        print("End of parse method")

    def parse_job(self, response):
        print("In parse_job")  # matches the output shown below
        print(response.body)
        print("Sleeping for 10 seconds...")
        time.sleep(10)

Here is the output (the relevant parts):

Sleeping for 10 seconds...
End of parse method
2021-04-29 20:49:55 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://stackoverflow.com/jobs?id=525754> (referer: https://stackoverflow.com/jobs/remote-developer-jobs)
2021-04-29 20:49:55 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://stackoverflow.com/jobs?id=525748> (referer: https://stackoverflow.com/jobs/remote-developer-jobs)
2021-04-29 20:49:55 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://stackoverflow.com/jobs?id=497114> (referer: https://stackoverflow.com/jobs/remote-developer-jobs)
2021-04-29 20:49:55 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://stackoverflow.com/jobs?id=523136> (referer: https://stackoverflow.com/jobs/remote-developer-jobs)
2021-04-29 20:49:55 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://stackoverflow.com/jobs?id=525730> (referer: https://stackoverflow.com/jobs/remote-developer-jobs)
In parse_job
2021-04-29 20:50:05 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://stackoverflow.com/jobs/remote-developer-jobs?so_source=JobSearch&so_medium=Internal> (referer: https://stackoverflow.com/jobs/remote-developer-jobs)
2021-04-29 20:50:05 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://stackoverflow.com/jobs?id=523319> (referer: https://stackoverflow.com/jobs/remote-developer-jobs)
2021-04-29 20:50:05 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://stackoverflow.com/jobs?id=522480> (referer: https://stackoverflow.com/jobs/remote-developer-jobs)
2021-04-29 20:50:05 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://stackoverflow.com/jobs?id=511761> (referer: https://stackoverflow.com/jobs/remote-developer-jobs)
2021-04-29 20:50:06 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://stackoverflow.com/jobs?id=522483> (referer: https://stackoverflow.com/jobs/remote-developer-jobs)
2021-04-29 20:50:06 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://stackoverflow.com/jobs?id=249610> (referer: https://stackoverflow.com/jobs/remote-developer-jobs)
2021-04-29 20:50:06 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://stackoverflow.com/jobs?id=522481> (referer: https://stackoverflow.com/jobs/remote-developer-jobs)
In parse_job
In parse_job
In parse_job
In parse_job
...

**I don't understand why the parse method runs to completion before the parse_job method is called.** As far as I understand, as soon as I yield a job from jobs, the parse_job method should be called. The spider should go through each page of the job listings and visit the detail page of every job on that listings page. But that description doesn't match the output above. I also don't understand why there are several GET requests between each call to the parse_job method.
Can someone explain what is going on here?

cld4siwp · Answer 1

Scrapy is event-driven. Requests are first queued by the Scheduler, and the queued requests are handed to the Downloader. When a response has been downloaded and is ready, the callback is invoked with that response as its first argument.
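
To see why parse() runs to completion before any callback fires, here is a minimal plain-Python analogy (a sketch, not Scrapy's actual internals): the engine first drains the parse() generator, queuing everything it yields, and only afterwards dispatches responses to the callbacks.

# A plain-Python analogy: nothing "runs" at yield time. The consumer
# drains the generator first and processes the queued items afterwards,
# just as Scrapy schedules every yielded Request before callbacks fire.

def parse():
    for i in range(3):
        print(f"yielding request {i}")
        yield f"request {i}"
    print("End of parse method")

def parse_job(item):
    print(f"In parse_job: {item}")

queued = list(parse())  # drains parse() completely before anything else
for item in queued:
    parse_job(item)     # the "callbacks" only run after parse() finished
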
You are blocking the callback with time.sleep(). In the log shown, after the first callback invocation, execution was blocked inside parse_job() for 10 seconds, but in the meantime the Downloader kept working and prepared responses for the callback, as the consecutive DEBUG: Crawled (200) lines after the first parse_job() call show. So while the callback was blocked, the Downloader finished its work and the responses sat queued, waiting to be fed to the callback. As the last part of the log shows, handing responses to the callback became the bottleneck.
If you want a delay between requests, use the DOWNLOAD_DELAY setting rather than time.sleep().
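
For example, a minimal sketch (the 10-second value simply mirrors the sleep in the question): the delay can be set per spider via custom_settings, and Scrapy applies it inside the downloader without blocking your callbacks.

class JobsSpider(scrapy.Spider):
    name = "jobs"
    custom_settings = {
        # Scrapy waits between consecutive requests to the same site,
        # without freezing the whole process the way time.sleep() does.
        "DOWNLOAD_DELAY": 10,
        # Note: RANDOMIZE_DOWNLOAD_DELAY is True by default, so the actual
        # wait is a random value between 0.5x and 1.5x DOWNLOAD_DELAY.
    }
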
For more details, see the architecture overview in the Scrapy documentation.
