Scrapy LinkExtractor doesn't get links past the second level of depth

js81xvg6 · asked on 2023-04-06

The link extractor does not return any links past the second level of depth.

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

class TestSpider(scrapy.Spider):

    name = "test"
    link_extractor = LinkExtractor()

    def parse(self, response):
        # Collect every link found on the current page
        unique_links = []
        for link in self.link_extractor.extract_links(response):
            unique_links.append(link)
        print(unique_links)

# Pass the spider class so the script runs standalone
process = CrawlerProcess(get_project_settings())
process.crawl(TestSpider, start_urls=['https://www.telusinternational.com/'])
process.start(stop_after_crawl=False)

For example, unique_links will include https://www.telusinternational.com/about/our-team, but not https://www.telusinternational.com/about/our-team/jeff-puritt?INTCMP=ti_our-team_card_jeff-puritt_leadership-team.


0yycz8jy · 1#

You need to yield a Request for each link that you want the spider to follow to the next page.

def parse(self, response):
    unique_links = []
    for link in self.link_extractor.extract_links(response):
        unique_links.append(link)
        # With no explicit callback, the response is routed back to
        # this same parse method, so extraction recurses deeper.
        yield scrapy.Request(link.url)
    print(unique_links)
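
For reference, Scrapy also ships a CrawlSpider class that automates exactly this extract-and-follow pattern. A minimal sketch using the same start URL; the spider name, callback name, and logging here are illustrative, not part of the original question:

import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class FollowSpider(CrawlSpider):
    name = "follow"
    start_urls = ["https://www.telusinternational.com/"]

    # follow=True keeps extracting and following links from every
    # matched page, not just the start page.
    rules = (
        Rule(LinkExtractor(), callback="parse_page", follow=True),
    )

    def parse_page(self, response):
        # Called for each page a rule follows to.
        self.logger.info("visited %s", response.url)

process = CrawlerProcess()
process.crawl(FollowSpider)
process.start()

Note that Scrapy's scheduler de-duplicates requests by default, and the built-in DEPTH_LIMIT setting can cap how deep the crawl recurses.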
