I'm working on the following problem: my boss wants me to create a `CrawlSpider` in Scrapy that scrapes article details such as the `title` and `description`, and paginates through only the first 5 pages. I created a `CrawlSpider`, but it follows the pagination through all pages. How can I limit the `CrawlSpider` to paginate through only the first 5 (most recent) pages?

List page markup (the page that opens when we click the pagination "Next" link):
<div class="list">
<div class="snippet-content">
<h2>
<a href="https://example.com/article-1">Article 1</a>
</h2>
</div>
<div class="snippet-content">
<h2>
<a href="https://example.com/article-2">Article 2</a>
</h2>
</div>
<div class="snippet-content">
<h2>
<a href="https://example.com/article-3">Article 3</a>
</h2>
</div>
<div class="snippet-content">
<h2>
<a href="https://example.com/article-4">Article 4</a>
</h2>
</div>
</div>
<ul class="pagination">
<li class="next">
<a href="https://www.example.com?page=2&keywords=&from=&topic=&year=&type="> Next </a>
</li>
</ul>
To do this I use a `Rule` object with the `restrict_xpaths` argument to collect all article links; its `parse_item` callback then reads the article `title` and `description` from the `meta` tags on the detail page.
```python
Rule(LinkExtractor(restrict_xpaths='//div[contains(@class, "snippet-content")]/h2/a'),
     callback="parse_item", follow=True)
```
Detail page markup:
<meta property="og:title" content="Article Title">
<meta property="og:description" content="Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum.">
After that I added another `Rule` object to handle the pagination: the `CrawlSpider` follows the "Next" link, opens the next list page, and repeats the same process again and again.
```python
Rule(LinkExtractor(restrict_xpaths='//ul[@class="pagination"]/li[@class="next"]/a'))
```
Here is my `CrawlSpider` code:
```python
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
import w3lib.html


class ExampleSpider(CrawlSpider):
    name = "example"
    allowed_domains = ["example.com"]
    start_urls = ["https://www.example.com/"]
    custom_settings = {
        'FEED_URI': 'articles.json',
        'FEED_FORMAT': 'json'
    }
    total = 0

    rules = (
        # Get the list of all articles on the one page and follow these links
        Rule(LinkExtractor(restrict_xpaths='//div[contains(@class, "snippet-content")]/h2/a'),
             callback="parse_item", follow=True),
        # After that get the pagination next link, follow it, and repeat the cycle
        Rule(LinkExtractor(restrict_xpaths='//ul[@class="pagination"]/li[@class="next"]/a'))
    )

    def parse_item(self, response):
        self.total = self.total + 1
        title = response.xpath('//meta[@property="og:title"]/@content').get() or ""
        description = w3lib.html.remove_tags(
            response.xpath('//meta[@property="og:description"]/@content').get() or "")
        return {
            'id': self.total,
            'title': title,
            'description': description
        }
```
Is there a way to restrict the crawler to scraping only the first 5 pages?
1 Answer
**Solution 1:** use `process_request` on the pagination rule; a sketch follows.
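A minimal sketch of this approach, assuming the listing URLs really do carry the page number in a `page` query parameter as in the markup above (the `limit_pagination` helper and the cut-off of 5 are illustrative, not part of the original spider; also note that Scrapy 2.0+ calls `process_request` with both the request and the response, while older versions pass only the request):

```python
from urllib.parse import parse_qs, urlparse

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class ExampleSpider(CrawlSpider):
    name = "example"
    allowed_domains = ["example.com"]
    start_urls = ["https://www.example.com/"]

    def limit_pagination(self, request, response):
        # The "Next" links look like ?page=2&..., so read the page number
        # from the query string; the start URL has no "page" and counts as 1.
        page = int(parse_qs(urlparse(request.url).query).get("page", ["1"])[0])
        if page > 5:
            return None  # returning None drops the request, stopping pagination
        return request

    rules = (
        # Article links: unchanged from the question.
        Rule(LinkExtractor(restrict_xpaths='//div[contains(@class, "snippet-content")]/h2/a'),
             callback="parse_item", follow=True),
        # Pagination links: every extracted request is passed through
        # limit_pagination before it is scheduled.
        Rule(LinkExtractor(restrict_xpaths='//ul[@class="pagination"]/li[@class="next"]/a'),
             process_request="limit_pagination"),
    )

    def parse_item(self, response):
        # Same extraction as in the question, kept minimal here.
        return {
            "title": response.xpath('//meta[@property="og:title"]/@content').get() or "",
            "description": response.xpath('//meta[@property="og:description"]/@content').get() or "",
        }
```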
**Solution 2:** override the `_requests_to_follow` method (this should be a bit slower); a sketch is below. Both solutions are fairly self-explanatory, but if you want me to add anything, please ask in the comments.
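A minimal sketch of the second approach, with the caveat that `_requests_to_follow` is a private `CrawlSpider` method whose behaviour may change between Scrapy versions; the `list_pages_seen` counter and the `"page="` check are assumptions about how to recognise pagination requests for this particular site:

```python
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class ExampleSpider(CrawlSpider):
    name = "example"
    allowed_domains = ["example.com"]
    start_urls = ["https://www.example.com/"]
    list_pages_seen = 1  # the start URL is listing page 1

    rules = (
        Rule(LinkExtractor(restrict_xpaths='//div[contains(@class, "snippet-content")]/h2/a'),
             callback="parse_item", follow=True),
        Rule(LinkExtractor(restrict_xpaths='//ul[@class="pagination"]/li[@class="next"]/a')),
    )

    def _requests_to_follow(self, response):
        # Let the stock CrawlSpider build the requests for both rules,
        # then drop pagination requests once 5 listing pages are scheduled.
        for request in super()._requests_to_follow(response):
            if request is not None and "page=" in request.url:  # a "Next" link
                if self.list_pages_seen >= 5:
                    continue  # stop paginating past the 5th listing page
                self.list_pages_seen += 1
            yield request

    def parse_item(self, response):
        # Same extraction as in the question, kept minimal here.
        return {
            "title": response.xpath('//meta[@property="og:title"]/@content').get() or "",
            "description": response.xpath('//meta[@property="og:description"]/@content').get() or "",
        }
```

Counting works here because each listing page yields exactly one "Next" request, so pagination is effectively sequential even though Scrapy crawls concurrently.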