I am trying to scrape author data from http://quotes.toscrape.com/, but when I run the spider, the author pages return 405. The same URLs return 200 in a browser and when I fetch them in the Scrapy shell.
import datetime

import scrapy


class AuthorsSpider(scrapy.Spider):
    name = 'authors'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com/']
    custom_settings = {
        'CONCURRENT_REQUESTS': 50,
        'DOWNLOAD_DELAY': 0.1,
        'FEED_URI': f'output/authors_{datetime.datetime.today().strftime("%Y-%m-%d %H-%M-%S")}.csv',
        'FEED_FORMAT': 'csv',
        'FEED_EXPORTERS': {'csv': 'scrapy.exporters.CsvItemExporter'},
        'FEED_EXPORT_ENCODING': 'utf-8',
        'FEED_EXPORT_FIELDS': ('name', 'birth_date', 'birth_location', 'description',)
    }

    def parse(self, response):
        for _ in response.xpath("//div[@class='quote']"):
            author_page = response.xpath("//a[text()='(about)']/@href").get()
            yield response.follow(author_page,
                                  method="GET",
                                  callback=self.parse_author)
        next_page = response.xpath("//li[@class='next']/a/@href").get()
        if next_page:
            yield response.follow(next_page, self.parse)

    def parse_author(self, response):
        yield {
            'name': response.xpath("//h3[@class='author-title']/text()").get(),
            'birth_date': response.xpath("//span[@class='author-born-date']/text()").get(),
            'birth_location': response.xpath("//span[@class='author-born-location']/text()").get(),
            'description': response.xpath("//div[@class='author-description']/text()").get()
        }
Here is part of the output when I run scrapy crawl authors:
2023-01-02 10:53:33 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/10/> (referer: http://quotes.toscrape.com/page/9/)
2023-01-02 10:53:33 [scrapy.core.engine] DEBUG: Crawled (405) <NONE http://quotes.toscrape.com/author/Suzanne-Collins/> (referer: http://quotes.toscrape.com/page/7/)
2023-01-02 10:53:34 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <405 http://quotes.toscrape.com/author/Suzanne-Collins/>: HTTP status code is not handled or not allowed
2023-01-02 10:53:34 [scrapy.core.engine] DEBUG: Crawled (405) <NONE http://quotes.toscrape.com/author/W-C-Fields/> (referer: http://quotes.toscrape.com/page/8/)
2023-01-02 10:53:34 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (308) to <NONE http://quotes.toscrape.com/author/John-Lennon/> from <GET http://quotes.toscrape.com/author/John-Lennon>
2023-01-02 10:53:34 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <405 http://quotes.toscrape.com/author/W-C-Fields/>: HTTP status code is not handled or not allowed
2023-01-02 10:53:34 [scrapy.core.engine] DEBUG: Crawled (405) <NONE http://quotes.toscrape.com/author/Alfred-Tennyson/> (referer: http://quotes.toscrape.com/page/8/)
1 Answer
2j4z5cfb #1
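Basically, with response.follow() you are asking the parse function to follow this URL again. If you want to pass the URL to another function, you need to use scrapy.Request() instead of response.follow(). If you want to pass the author page URL to parse_author, your code should look like the following.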
[Attached image: the answerer's code, not preserved]
If you have any further questions, please reply to this answer. Happy learning!