python-3.x How to find the current start_url in a Scrapy CrawlSpider?

Posted by qojgxg4l on 2023-05-02 in Python


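I run Scrapy from my own script, which loads URLs from a DB and follows all internal links on those websites, and I have run into a small problem. I need to know which start_url is currently in use, because I have to stay consistent with a database (SQL DB). The problem appears when Scrapy takes the links to follow from its built-in list named 'start_urls' and those websites redirect immediately. For example, once Scrapy has started, crawled the start_urls, and begun following all internal links found there, I can only determine the URL that is currently being visited, not the start_url that Scrapy began with.
Other answers on the web are wrong, either because they address other use cases or because they are deprecated, as there seems to have been a change in Scrapy's code last year.
MWE: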

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.crawler import CrawlerProcess

class CustomerSpider(CrawlSpider):
    name = "my_crawler"
    rules = [Rule(LinkExtractor(unique=True), callback="parse_obj", ), ]

    def parse_obj(self, response):
        print(response.url)  # find current start_url and do something

a = CustomerSpider
# The start_urls have to be handed over this way, because the class CustomerSpider is used in another class.
# Goal: re-identify upb.de during the crawl started by process.crawl(a), although it is redirected immediately.
a.start_urls = ["https://upb.de", "https://spiegel.de"]
a.allowed_domains = ["upb.de", "spiegel.de"]

process = CrawlerProcess()

process.crawl(a)
process.start()

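The MWE above shows Scrapy (my crawler) receiving a list of URLs the way I have to supply it. An example of a redirecting URL is https://upb.de, which redirects to https://uni-paderborn.de.
I am looking for an elegant way to handle this, because I want to use Scrapy's many features, such as parallel crawling. I therefore do not want to use something like the requests library in addition. I want to find the start_url that Scrapy is currently using internally (inside the Scrapy library). Thanks for your help.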

python-3.x

Source: https://stackoverflow.com/questions/52257928/how-to-find-the-current-start-url-in-scrapy-crawlspider

2 Answers


jc3wubiy1#

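Ideally, you would set a meta attribute on the original request and refer to it in later callbacks. Unfortunately, CrawlSpider does not support passing meta through a Rule (see #929).
You are better off building your own spider instead of subclassing CrawlSpider. First, pass your start URLs as a keyword argument to process.crawl, which makes them available as an attribute on the spider instance. In the start_requests method, yield a new Request for each URL, including the database key as a meta value.
When parse receives the response from loading your URL, run a LinkExtractor over it and yield a separate request for each extracted link. Here you can pass meta again, propagating the original database key down the chain.
The code looks like this: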

from scrapy.spiders import Spider
from scrapy import Request
from scrapy.linkextractors import LinkExtractor
from scrapy.crawler import CrawlerProcess

class CustomerSpider(Spider):
    name = 'my_crawler'

    def start_requests(self):
        for url in self.root_urls:
            yield Request(url, meta={'root_url': url})

    def parse(self, response):
        links = LinkExtractor(unique=True).extract_links(response)

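        for link in links:
            # Forward only the root_url key; copying all of response.meta would
            # also carry over Scrapy's internal keys (e.g. redirect bookkeeping).
            yield Request(
                link.url, callback=self.process_link,
                meta={'root_url': response.meta['root_url']})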

    def process_link(self, response):
        # print is a function in Python 3, so the dict goes inside the call
        print({
            'root_url': response.meta['root_url'],
            'resolved_url': response.url
        })

a = CustomerSpider
a.allowed_domains = ['upb.de', 'spiegel.de']

process = CrawlerProcess()

process.crawl(a, root_urls=['https://upb.de', 'https://spiegel.de'])
process.start()

# {'root_url': 'https://spiegel.de', 'resolved_url': 'http://www.spiegel.de/video/'}
# {'root_url': 'https://spiegel.de', 'resolved_url': 'http://www.spiegel.de/netzwelt/netzpolitik/'}
# {'root_url': 'https://spiegel.de', 'resolved_url': 'http://www.spiegel.de/thema/buchrezensionen/'}
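
If you would rather keep CrawlSpider, one possible alternative is the process_request hook of Rule, which (assuming Scrapy 2.0 or later) is called with both the extracted request and the response it came from. The sketch below is not from the answer above: the helper name propagate_root_url is hypothetical, and the wiring into a spider is shown only as a comment. Simple stand-in objects replace the real Scrapy classes so the helper can be tried without starting a crawl.

```python
from types import SimpleNamespace


def propagate_root_url(request, response):
    """Copy the 'root_url' meta key from a response onto a request extracted from it."""
    root_url = response.meta.get("root_url")
    if root_url is not None:
        request.meta["root_url"] = root_url
    return request


# Hypothetical wiring inside a CrawlSpider subclass (assumes Scrapy >= 2.0, not executed here):
#   rules = [Rule(LinkExtractor(unique=True), callback="parse_obj",
#                 process_request=propagate_root_url)]
#
#   def start_requests(self):
#       for url in self.start_urls:
#           yield Request(url, meta={"root_url": url})

# Stand-ins for Scrapy's Request/Response: only .url and .meta are used by the helper.
start_response = SimpleNamespace(
    url="https://www.uni-paderborn.de/", meta={"root_url": "https://upb.de"})
extracted = SimpleNamespace(url="https://www.uni-paderborn.de/studium", meta={})

tagged = propagate_root_url(extracted, start_response)
print(tagged.meta["root_url"])  # https://upb.de
```

With this wiring, parse_obj could read response.meta['root_url'] the same way process_link does above, because every request extracted by the rule inherits the key from the page it was found on.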

2023-05-02

sqserrrh2#

I usually just use response.url for this.
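
Note that response.url is the URL after any redirects. When Scrapy's RedirectMiddleware follows a redirect, it records the URLs the request passed through under the redirect_urls meta key, so the URL that was originally requested can be recovered from there. The helper below is a minimal sketch of that lookup; the function name first_requested_url is hypothetical, and it takes a plain dict so it can be tried outside a crawl.

```python
def first_requested_url(response_url, meta):
    """Return the URL that was requested before any redirects were followed."""
    # 'redirect_urls' is only present when at least one redirect happened.
    redirect_urls = meta.get("redirect_urls")
    return redirect_urls[0] if redirect_urls else response_url


# Redirected start URL: the first entry of the chain is the original URL.
print(first_requested_url(
    "https://www.uni-paderborn.de/", {"redirect_urls": ["https://upb.de"]}))
# No redirect: the response URL is already the requested URL.
print(first_requested_url("https://www.spiegel.de/", {}))
```

In a callback this would be called as first_requested_url(response.url, response.meta). It only covers the redirect chain of that one request, so on pages reached by following links it does not identify the start URL; for that, a key propagated through meta is still needed.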

2023-05-02