Problem solved. The answer is in this post.
I have been running a crawl-and-scrape script. It works fine, but at some point during the run it always gets stuck. Here is what it shows:
[scrapy.extensions.logstats] INFO: Crawled 1795 pages (at 0 pages/min), scraped 1716 items (at 0 items/min)
I then stopped the run with Ctrl+Z and re-ran the spider. After crawling and scraping some more data, it got stuck again. Have you run into this problem before? How did you get past it?
Below is the complete code from the link.
This is the spider's code:
import scrapy
from scrapy.loader import ItemLoader
from healthgrades.items import HealthgradesItem
from scrapy_playwright.page import PageMethod


# parse a raw header string (as copied from the browser dev tools) into a dict
def get_headers(s, sep=': ', strip_cookie=True, strip_cl=True, strip_headers: list = []) -> dict:
    d = dict()
    for kv in s.split('\n'):
        kv = kv.strip()
        if kv and sep in kv:
            v = ''
            k = kv.split(sep)[0]
            if len(kv.split(sep)) == 1:
                v = ''
            else:
                v = kv.split(sep)[1]
            if v == '\'\'':
                v = ''
            if strip_cookie and k.lower() == 'cookie': continue
            if strip_cl and k.lower() == 'content-length': continue
            if k in strip_headers: continue
            d[k] = v
    return d
# spider class
class DoctorSpider(scrapy.Spider):
    name = 'doctor'
    allowed_domains = ['healthgrades.com']
    url = 'https://www.healthgrades.com/usearch?what=Massage%20Therapy&entityCode=PS444&where=New%20York&pageNum={}&sort.provider=bestmatch&='

    # send browser-like headers so the bot looks like a regular browser
    def start_requests(self):
        h = get_headers(
            '''
            accept: */*
            accept-encoding: gzip, deflate, br
            accept-language: en-US,en;q=0.9
            dnt: 1
            origin: https://www.healthgrades.com
            referer: https://www.healthgrades.com/
            sec-ch-ua: ".Not/A)Brand";v="99", "Google Chrome";v="103", "Chromium";v="103"
            sec-ch-ua-mobile: ?0
            sec-ch-ua-platform: "Windows"
            sec-fetch-dest: empty
            sec-fetch-mode: cors
            sec-fetch-site: cross-site
            user-agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36
            '''
        )
        for i in range(1, 6):  # change the range to cover the page numbers you need
            # GET request to each results page, rendered with Playwright
            yield scrapy.Request(self.url.format(i), headers=h, meta=dict(
                playwright=True,
                playwright_include_page=True,
                playwright_page_methods=[PageMethod('wait_for_selector', 'h3.card-name a')]  # wait for the result cards to load
            ))

    def parse(self, response):
        for link in response.css('div h3.card-name a::attr(href)'):  # individual doctor's link
            yield response.follow(link.get(), callback=self.parse_categories)  # follow into the doctor's page

    def parse_categories(self, response):
        l = ItemLoader(item=HealthgradesItem(), selector=response)
        l.add_xpath('name', '//*[@id="summary-section"]/div[1]/div[2]/div/div/div[1]/div[1]/h1')
        l.add_xpath('specialty', '//*[@id="summary-section"]/div[1]/div[2]/div/div/div[1]/div[1]/div[2]/p/span[1]')
        l.add_xpath('practice_name', '//*[@id="summary-section"]/div[1]/div[2]/div/div/div[2]/div[1]/p')
        l.add_xpath('address', 'string(//*[@id="summary-section"]/div[1]/div[2]/div/div/div[2]/div[1]/address)')
        yield l.load_item()
1 Answer
The problem is that your concurrency settings are limiting you.
Here is the solution.
Concurrent requests
Adding concurrency to Scrapy is actually a very simple task: there is already a setting for the number of concurrent requests allowed, and you only need to modify it.
You can modify it either in the custom settings of the spider you created, or in the global settings, which affect all spiders.
Global
To add it globally, simply add the following line to your settings file.
Here the number of concurrent requests is set to 30; you can use any value within reasonable limits.
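A minimal sketch of that line in the project's settings.py (30 is just the value used in this answer; adjust as needed):

# settings.py -- project-wide setting, applies to every spider
CONCURRENT_REQUESTS = 30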
Local
To add the setting locally, we have to use the spider's custom_settings to add the concurrent-requests value to the Scrapy spider, as shown below.
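A sketch of the per-spider version, using custom_settings on the DoctorSpider from the question:

class DoctorSpider(scrapy.Spider):
    name = 'doctor'
    # overrides the project-wide value for this spider only
    custom_settings = {
        'CONCURRENT_REQUESTS': 30,
    }
    # rest of the spider unchanged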
Other settings
There are many other settings you can use instead of, or together with, CONCURRENT_REQUESTS.
CONCURRENT_REQUESTS_PER_IP - sets the number of concurrent requests allowed per IP address.
CONCURRENT_REQUESTS_PER_DOMAIN - defines the number of concurrent requests allowed for each domain.
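For example, combined in settings.py (the values are illustrative, not a recommendation):

# settings.py
CONCURRENT_REQUESTS = 30             # total requests in flight across all domains
CONCURRENT_REQUESTS_PER_DOMAIN = 8   # cap for any single domain
CONCURRENT_REQUESTS_PER_IP = 0       # 0 keeps this disabled; a non-zero value is used instead of the per-domain cap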