Scrapy LinkExtractor doesn't get links past the second level of depth

js81xvg6 · asked on 2023-04-06

The link extractor does not return any links past the second level of depth.

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

class TestSpider(scrapy.Spider):

    name = "test"
    link_extractor = LinkExtractor()

    def parse(self, response):
        # Collect every link found on the current page
        unique_links = []
        for link in self.link_extractor.extract_links(response):
            unique_links.append(link)
        print(unique_links)

# Pass the spider class so the script runs standalone
process = CrawlerProcess(get_project_settings())
process.crawl(TestSpider, start_urls=['https://www.telusinternational.com/'])
process.start(stop_after_crawl=False)

For example, unique_links will include https://www.telusinternational.com/about/our-team, but not https://www.telusinternational.com/about/our-team/jeff-puritt?INTCMP=ti_our-team_card_jeff-puritt_leadership-team.


0yycz8jy · 1#

You need to yield a Request for each link that you want the spider to follow to the next page.

def parse(self, response):
    unique_links = []
    for link in self.link_extractor.extract_links(response):
        unique_links.append(link)
        # With no explicit callback, the response is routed back to
        # this same parse method, so extraction recurses deeper.
        yield scrapy.Request(link.url)
    print(unique_links)
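
For reference, Scrapy also ships a CrawlSpider class that automates exactly this extract-and-follow pattern. A minimal sketch using the same start URL; the spider name, callback name, and logging here are illustrative, not part of the original question:

import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class FollowSpider(CrawlSpider):
    name = "follow"
    start_urls = ["https://www.telusinternational.com/"]

    # follow=True keeps extracting and following links from every
    # matched page, not just the start page.
    rules = (
        Rule(LinkExtractor(), callback="parse_page", follow=True),
    )

    def parse_page(self, response):
        # Called for each page a rule follows to.
        self.logger.info("visited %s", response.url)

process = CrawlerProcess()
process.crawl(FollowSpider)
process.start()

Note that Scrapy's scheduler de-duplicates requests by default, and the built-in DEPTH_LIMIT setting can cap how deep the crawl recurses.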
