Scrapy: crawl only the first 5 pages of a website

Asked by hjzp0vay on 2023-04-12

I'm working on the following problem: my boss wants me to create a Scrapy CrawlSpider that scrapes article details such as the title and description, but follows the pagination for only the first 5 pages.
I have created a CrawlSpider, but it follows the pagination through every page. How can I limit the CrawlSpider to paginate through only the first 5 (most recent) pages?
Markup of the site's article listing page, which opens when the pagination "Next" link is clicked:

Listing page markup

<div class="list">
      <div class="snippet-content">
        <h2>
          <a href="https://example.com/article-1">Article 1</a>
        </h2>
      </div>
      <div class="snippet-content">
        <h2>
          <a href="https://example.com/article-2">Article 2</a>
        </h2>
      </div>
      <div class="snippet-content">
        <h2>
          <a href="https://example.com/article-3">Article 3</a>
        </h2>
      </div>
      <div class="snippet-content">
        <h2>
          <a href="https://example.com/article-4">Article 4</a>
        </h2>
      </div>
    </div>
    <ul class="pagination">
      <li class="next">
        <a href="https://www.example.com?page=2&keywords=&from=&topic=&year=&type="> Next </a>
      </li>
    </ul>

To do this I use a Rule object with the restrict_xpaths argument to collect all the article links; for each of them the parse_item method is then executed, which reads the article's title and description from the page's meta tags:

Rule(LinkExtractor(restrict_xpaths='//div[contains(@class, "snippet-content")]/h2/a'), callback="parse_item",
             follow=True)

Detail page markup

<meta property="og:title" content="Article Title">
<meta property="og:description" content="Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum.">

After that I added another Rule object to handle the pagination: the CrawlSpider follows the "Next" link to open the next listing page and repeats the same process over and over.

Rule(LinkExtractor(restrict_xpaths='//ul[@class="pagination"]/li[@class="next"]/a'))
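
A Rule with no callback defaults to follow=True, which is why the spider keeps requesting every "Next" page it finds. Written out explicitly, the rule above is equivalent to:

Rule(
    LinkExtractor(restrict_xpaths='//ul[@class="pagination"]/li[@class="next"]/a'),
    callback=None,  # the listing pages themselves are not parsed as items
    follow=True,    # the default when callback is None, so pagination is followed indefinitely
)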

Here is my CrawlSpider code:

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
import w3lib.html

class ExampleSpider(CrawlSpider):
    name = "example"
    allowed_domains = ["example.com"]
    start_urls = ["https://www.example.com/"]
    custom_settings = {
        'FEED_URI': 'articles.json',
        'FEED_FORMAT': 'json'
    }
    total = 0

   
    rules = (
        # Get the list of all articles on the one page and follow these links
        Rule(LinkExtractor(restrict_xpaths='//div[contains(@class, "snippet-content")]/h2/a'), callback="parse_item",
             follow=True),
        # After that get pagination next link get href and follow it, repeat the cycle
        Rule(LinkExtractor(restrict_xpaths='//ul[@class="pagination"]/li[@class="next"]/a'))
    )

    def parse_item(self, response):
        self.total = self.total + 1
        title = response.xpath('//meta[@property="og:title"]/@content').get() or ""
        # Guard against a missing meta tag before stripping tags, so remove_tags never receives None
        description = w3lib.html.remove_tags(response.xpath('//meta[@property="og:description"]/@content').get() or "")

        return {
            'id': self.total,
            'title': title,
            'description': description
        }

Is there a way to limit the crawler so that it only crawls the first 5 pages?

Answer #1, from 1cklez4t:

**Option 1:** use process_request.

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

def limit_requests(request, response):
    # here we have the page number.
    # page_number = request.url[-1]
    # if int(page_number) >= 6:
    #     return None

    # here we use a counter
    if not hasattr(limit_requests, "page_number"):
        limit_requests.page_number = 0
    limit_requests.page_number += 1

    if limit_requests.page_number >= 5:
        return None

    return request

class ExampleSpider(CrawlSpider):
    name = 'example_spider'

    start_urls = ['https://scrapingclub.com/exercise/list_basic/']
    page = 0
    rules = (
        # Get the list of all articles on the one page and follow these links
        Rule(LinkExtractor(restrict_xpaths='//div[@class="card-body"]/h4/a'), callback="parse_item",
             follow=True),
        # After that get pagination next link get href and follow it, repeat the cycle
        Rule(LinkExtractor(restrict_xpaths='//li[@class="page-item"][last()]/a'), process_request=limit_requests)
    )
    total = 0

    def parse_item(self, response):
        title = response.xpath('//h3//text()').get(default='')
        price = response.xpath('//div[@class="card-body"]/h4//text()').get(default='')
        self.total = self.total + 1

        return {
            'id': self.total,
            'title': title,
            'price': price
        }
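
A small variation on the same idea (a sketch, not part of the original answer): Rule also accepts process_request as a string naming a spider method, so the counter can live on the spider instance instead of a function attribute. The names limit_pagination and pages_followed below are illustrative, and the snippet assumes Scrapy 2.x, where the hook receives both the request and the response.

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class LimitedPaginationSpider(CrawlSpider):
    # Hypothetical variant of the spider above; the method and attribute names are illustrative.
    name = 'example_spider_limited'
    start_urls = ['https://scrapingclub.com/exercise/list_basic/']
    pages_followed = 0  # how many "next page" requests have been allowed so far

    rules = (
        Rule(LinkExtractor(restrict_xpaths='//div[@class="card-body"]/h4/a'),
             callback='parse_item', follow=True),
        # process_request given as a string is resolved to the spider method below
        Rule(LinkExtractor(restrict_xpaths='//li[@class="page-item"][last()]/a'),
             process_request='limit_pagination'),
    )

    def limit_pagination(self, request, response):
        # Count "next page" requests; the 5th one (to page 6) is dropped,
        # so only listing pages 1-5 are crawled.
        self.pages_followed += 1
        if self.pages_followed >= 5:
            return None
        return request

    def parse_item(self, response):
        return {
            'title': response.xpath('//h3//text()').get(default=''),
            'price': response.xpath('//div[@class="card-body"]/h4//text()').get(default=''),
        }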

**Option 2:** override the _requests_to_follow method (this should be a bit slower).

from scrapy.http import HtmlResponse
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class ExampleSpider(CrawlSpider):
    name = 'example_spider'

    start_urls = ['https://scrapingclub.com/exercise/list_basic/']

    rules = (
        # Get the list of all articles on the one page and follow these links
        Rule(LinkExtractor(restrict_xpaths='//div[@class="card-body"]/h4/a'), callback="parse_item",
             follow=True),
        # After that get pagination next link get href and follow it, repeat the cycle
        Rule(LinkExtractor(restrict_xpaths='//li[@class="page-item"][last()]/a'))
    )
    total = 0
    page = 0
    
    def _requests_to_follow(self, response):
        if not isinstance(response, HtmlResponse):
            return
        if self.page >= 5:  # stopping condition
            return
        seen = set()
        for rule_index, rule in enumerate(self._rules):
            links = [
                lnk
                for lnk in rule.link_extractor.extract_links(response)
                if lnk not in seen
            ]
            for link in rule.process_links(links):
                if rule_index == 1: # assuming there's only one "next button"
                    self.page += 1
                seen.add(link)
                request = self._build_request(rule_index, link)
                yield rule.process_request(request, response)

    def parse_item(self, response):
        title = response.xpath('//h3//text()').get(default='')
        price = response.xpath('//div[@class="card-body"]/h4//text()').get(default='')
        self.total = self.total + 1

        return {
            'id': self.total,
            'title': title,
            'price': price
        }
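
Either spider can also be launched from a plain Python script; a minimal sketch, assuming Scrapy 2.1+ (for the FEEDS setting) and that the spider class is importable as ExampleSpider:

from scrapy.crawler import CrawlerProcess

process = CrawlerProcess(settings={
    # Export scraped items to a JSON file; FEEDS replaces the older
    # FEED_URI / FEED_FORMAT pair used in the question's spider.
    'FEEDS': {'items.json': {'format': 'json'}},
})
process.crawl(ExampleSpider)
process.start()  # blocks until the crawl finishes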

The solutions are pretty self-explanatory; if you would like me to add anything, please ask in the comments.
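
A related knob worth mentioning (not from the original answer) is Scrapy's built-in DEPTH_LIMIT setting. In a crawl shaped like this one, listing page N is only reached after N-1 "Next" hops, so capping the request depth roughly caps the pagination as well. It is coarser than the two options above because it applies to every request, but as a sketch it could be added to the spider's settings:

custom_settings = {
    # Page 1 is depth 0, page 2 is depth 1, ..., page 6 is depth 5.
    # With DEPTH_LIMIT = 5, article requests discovered on page 6 (depth 6)
    # are dropped by the default DepthMiddleware, so only articles from the
    # first 5 listing pages are scraped (page 6 itself is still fetched).
    'DEPTH_LIMIT': 5,
}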
