Using Scrapy to crawl all pages linked to a website, to any depth we want

Posted by wfypjpf4 on 2023-02-19 in Other

I'd like to know whether it is possible to crawl all the pages and links on a website to any depth, even if the top-level URL changes after following a few links. Here is an example:
Top URL: www.topURL.com
It has 3 links: www.topURL.com/link1, www.topURL.com/link2 and www.topURL.com/link3
If we click www.topURL.com/link1, it takes us to a page that itself has 2 links on it: www.topURL.com/link4 and www.topURL.com/link5
But if we click www.topURL.com/link4, it takes us to a page that has the following 2 links: www.anotherURL.com/link1 and www.thirdURL.com/link1
Can Scrapy, or any Python crawler/spider, start from www.topURL.com, follow links, and end up on www.thirdURL.com/link1?
Is there a limit to how deep it can go? Is there a code example that shows how to do this?
Thanks for your help.

scrapy

Source: https://stackoverflow.com/questions/54164310/using-scrapy-to-crawl-all-pages-that-are-linked-to-a-website-with-any-depth-we-w

1 Answer

vtwuwzda1#

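Take a look at Scrapy's CrawlSpider spider class.
CrawlSpider is the most commonly used spider for crawling regular websites, as it provides a convenient mechanism for following links by defining a set of rules.
To accomplish your goal, you only need to set up very basic rules: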

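from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class MySpider(CrawlSpider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com']

    rules = (
        # Extract and follow all links! The callback is an argument of Rule,
        # not of LinkExtractor.
        Rule(LinkExtractor(), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        # Log the URL of every page the rule hands to this callback.
        self.log('crawling {}'.format(response.url))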

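The spider above will crawl every URL on the website that matches allowed_domains and call back to parse_item.
It's worth noting that, by default, LinkExtractor ignores links to media files (such as pdf, mp4, etc.).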
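
Since the question is about ending up on a different site, it may help to see what an allowed_domains check does to off-site links. The following is only a rough sketch using the standard library, not Scrapy's own code: `is_allowed` is a hypothetical helper, and the `http://` prefixes are added to the question's example URLs so they parse. It assumes a link passes when its host equals a listed domain or is a subdomain of one, and that an empty list means no restriction.

```python
from urllib.parse import urlparse


def is_allowed(url, allowed_domains):
    """Return True if url's host is a listed domain or a subdomain of one.

    Hypothetical helper approximating an allowed_domains check; not Scrapy code.
    """
    if not allowed_domains:
        # An empty list is treated here as "no restriction".
        return True
    host = (urlparse(url).hostname or "").lower()
    return any(
        host == domain.lower() or host.endswith("." + domain.lower())
        for domain in allowed_domains
    )


# Links found on www.topURL.com/link4 in the question's example.
links = [
    "http://www.topURL.com/link4",
    "http://www.anotherURL.com/link1",
    "http://www.thirdURL.com/link1",
]

# Restricted to one site: the off-site links are filtered out.
same_site = [u for u in links if is_allowed(u, ["topURL.com"])]

# Listing every domain of interest lets the crawl cross sites.
cross_site = [
    u for u in links
    if is_allowed(u, ["topURL.com", "anotherURL.com", "thirdURL.com"])
]
```

So for the scenario in the question, the spider would need the other domains listed in allowed_domains, or no allowed_domains at all, to reach www.thirdURL.com/link1.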
To expand on the depth question: Scrapy does have a depth limit setting, but its default is 0 (i.e. unlimited depth):
https://doc.scrapy.org/en/0.9/topics/settings.html#depth-limit

# settings.py
DEPTH_LIMIT = 0
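
To make the effect of a depth limit concrete, here is a small standard-library sketch that walks the link graph from the question. It is an illustration only, not how Scrapy is implemented: `LINKS` and `crawl` are made-up names, the pages are an in-memory dictionary instead of real HTTP responses, and the start page is assumed to be depth 0 with each followed link one level deeper.

```python
from collections import deque

# Link graph taken from the question; the URLs stand in for real pages.
LINKS = {
    "www.topURL.com": [
        "www.topURL.com/link1",
        "www.topURL.com/link2",
        "www.topURL.com/link3",
    ],
    "www.topURL.com/link1": ["www.topURL.com/link4", "www.topURL.com/link5"],
    "www.topURL.com/link4": ["www.anotherURL.com/link1", "www.thirdURL.com/link1"],
}


def crawl(start, depth_limit=0):
    """Visit pages reachable from start; depth_limit=0 means no limit."""
    depths = {start: 0}  # page -> depth at which it was first seen
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for link in LINKS.get(page, []):
            depth = depths[page] + 1
            if link in depths:
                continue  # already scheduled, skip duplicates
            if depth_limit and depth > depth_limit:
                continue  # deeper than the limit, drop it
            depths[link] = depth
            queue.append(link)
    return depths


unlimited = crawl("www.topURL.com")  # depth_limit=0, no limit
shallow = crawl("www.topURL.com", depth_limit=2)
```

In this graph www.thirdURL.com/link1 sits three links away from the start page, so the unlimited run reaches it while the run limited to depth 2 stops at link4 and link5.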

Also, Scrapy crawls depth-first by default; if you want broad coverage sooner, breadth-first order might help: https://doc.scrapy.org/en/0.9/topics/settings.html#depth-limit

# settings.py
SCHEDULER_ORDER = 'BFO'
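
The links above point to the Scrapy 0.9 documentation. Assuming a more recent Scrapy release, SCHEDULER_ORDER may no longer be recognized; as far as I know, the Scrapy FAQ describes breadth-first order through the settings below, so treat this as a sketch to check against the documentation for your installed version.

```python
# settings.py
# Breadth-first crawl order, as described in the Scrapy FAQ for recent releases.
# A positive DEPTH_PRIORITY favours shallower requests, and the FIFO queues
# make the scheduler process requests in the order they were discovered.
DEPTH_PRIORITY = 1
SCHEDULER_DISK_QUEUE = "scrapy.squeues.PickleFifoDiskQueue"
SCHEDULER_MEMORY_QUEUE = "scrapy.squeues.FifoMemoryQueue"
```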
