I'm building a web scraper to scrape remote jobs. The spider behaves in a way I can't understand, and I'd appreciate it if someone could explain why.
Here is the spider's code:
```python
import scrapy
import time


class JobsSpider(scrapy.Spider):
    name = "jobs"

    start_urls = [
        "https://stackoverflow.com/jobs/remote-developer-jobs"
    ]

    already_visited_links = []

    def parse(self, response):
        jobs = response.xpath("//div[contains(@class, 'job')]")
        links_to_next_pages = response.xpath("//a[contains(@class, 's-pagination--item')]").css("a::attr(href)").getall()

        # visit each job page (as I do in the browser) and scrape the relevant information (Job title etc.)
        for job in jobs:
            job_id = int(job.xpath('@data-jobid').extract_first())  # there will always be one element
            # now visit the link with the job_id and get the info
            job_link_to_visit = "https://stackoverflow.com/jobs?id=" + str(job_id)
            request = scrapy.Request(job_link_to_visit,
                                     callback=self.parse_job)
            yield request

        # sleep for 10 seconds before requesting the next page
        print("Sleeping for 10 seconds...")
        time.sleep(10)

        # go to the next job listings page (if you haven't already been there)
        # not sure if this solution is the best since it has a loop which has a recursion in it
        for link_to_next_page in links_to_next_pages:
            if link_to_next_page not in self.already_visited_links:
                self.already_visited_links.append(link_to_next_page)
                yield response.follow(link_to_next_page, callback=self.parse)

        print("End of parse method")

    def parse_job(self, response):
        print(response.body)
        print("Sleeping for 10 seconds...")
        time.sleep(10)
        pass
```
Here is the output (the relevant part):
```
Sleeping for 10 seconds...
End of parse method
2021-04-29 20:49:55 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://stackoverflow.com/jobs?id=525754> (referer: https://stackoverflow.com/jobs/remote-developer-jobs)
2021-04-29 20:49:55 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://stackoverflow.com/jobs?id=525748> (referer: https://stackoverflow.com/jobs/remote-developer-jobs)
2021-04-29 20:49:55 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://stackoverflow.com/jobs?id=497114> (referer: https://stackoverflow.com/jobs/remote-developer-jobs)
2021-04-29 20:49:55 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://stackoverflow.com/jobs?id=523136> (referer: https://stackoverflow.com/jobs/remote-developer-jobs)
2021-04-29 20:49:55 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://stackoverflow.com/jobs?id=525730> (referer: https://stackoverflow.com/jobs/remote-developer-jobs)
In parse_job
2021-04-29 20:50:05 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://stackoverflow.com/jobs/remote-developer-jobs?so_source=JobSearch&so_medium=Internal> (referer: https://stackoverflow.com/jobs/remote-developer-jobs)
2021-04-29 20:50:05 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://stackoverflow.com/jobs?id=523319> (referer: https://stackoverflow.com/jobs/remote-developer-jobs)
2021-04-29 20:50:05 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://stackoverflow.com/jobs?id=522480> (referer: https://stackoverflow.com/jobs/remote-developer-jobs)
2021-04-29 20:50:05 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://stackoverflow.com/jobs?id=511761> (referer: https://stackoverflow.com/jobs/remote-developer-jobs)
2021-04-29 20:50:06 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://stackoverflow.com/jobs?id=522483> (referer: https://stackoverflow.com/jobs/remote-developer-jobs)
2021-04-29 20:50:06 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://stackoverflow.com/jobs?id=249610> (referer: https://stackoverflow.com/jobs/remote-developer-jobs)
2021-04-29 20:50:06 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://stackoverflow.com/jobs?id=522481> (referer: https://stackoverflow.com/jobs/remote-developer-jobs)
In parse_job
In parse_job
In parse_job
In parse_job
...
```
**I don't understand why the `parse` method runs to completion before the `parse_job` method gets called.** As far as I understand it, as soon as I `yield` a job from `jobs`, the `parse_job` method should be called. The spider should go through each page of job listings and visit the details page of every job on each listings page. However, that description doesn't match the output. I also don't understand why there are multiple `GET` requests between calls to the `parse_job` method.

Can someone explain what is going on here?
1 Answer
Scrapy is event-driven. Requests are first queued by the `Scheduler`, and queued requests are handed to the `Downloader`. When a response has been downloaded and is ready, the callback is invoked with that response as its first argument. You are blocking the callbacks with `time.sleep()`. In the log shown, after the first callback call, execution inside `parse_job()` is blocked for 10 seconds, but in the meantime the `Downloader` keeps working and prepares responses for the callbacks, as is evident from the consecutive `DEBUG: Crawled (200)` log lines after the first `parse_job()` call. So while the callback is blocked, the `Downloader` finishes its work and the responses pile up, queued to be fed to the callbacks. As the last part of the log shows, passing the responses to the callbacks has become the bottleneck.
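To make the scheduling behavior concrete, here is a minimal sketch (not the spider from the question; the spider name and URLs are made up) showing that `yield`-ing a `Request` only enqueues it, while the callback runs later, whenever its response arrives:

```python
import scrapy


class DemoSpider(scrapy.Spider):
    name = "demo"
    start_urls = ["https://example.com/"]

    def parse(self, response):
        # This loop completes immediately: each yield merely hands a
        # Request to the Scheduler. Nothing is downloaded here.
        for n in range(3):
            yield scrapy.Request(
                f"https://example.com/?page={n}",  # hypothetical URLs
                callback=self.parse_page,
            )
        # This line is reached before any parse_page call -- exactly
        # the ordering observed in the question's log.
        self.logger.info("End of parse; no callback has run yet")

    def parse_page(self, response):
        # Invoked later, whenever the Downloader delivers the response.
        self.logger.info("parse_page called for %s", response.url)
```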
If you want a delay between requests, use the `DOWNLOAD_DELAY` setting instead of `time.sleep()`. See the Scrapy architecture documentation for more details on how Scrapy works.