The link extractor never gets the links that are one level deeper, i.e. links found on the pages it extracts.
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings


class TestSpider(scrapy.Spider):
    name = "test"
    link_extractor = LinkExtractor()

    def parse(self, response):
        unique_links = []
        for link in self.link_extractor.extract_links(response):
            unique_links.append(link)
        print(unique_links)


process = CrawlerProcess(get_project_settings())
process.crawl('test', start_urls=['https://www.telusinternational.com/'])
process.start(stop_after_crawl=False)
For example, unique_links will contain https://www.telusinternational.com/about/our-team, but not https://www.telusinternational.com/about/our-team/jeff-puritt?INTCMP=ti_our-team_card_jeff-puritt_leadership-team
1 Answer
You need to yield a Request for every link you want the spider to follow to the next page; extracting links only collects them, it does not schedule them to be crawled.