Scrapy recursive crawl (CrawlSpider) not crawling all links as expected

Asked by 8e2ybdfx on 2022-11-09 · 2 answers

So my problem is that I have a CrawlSpider:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class RecursiveSpider(CrawlSpider):
    name = 'recursiveSpider'
    allowed_domains = ['industrialnetworking.com']

    custom_settings = {
        'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter',
    }
    start_urls = [
        'https://www.industrialnetworking.com/Manufacturers/Hirschmann'
    ]
    rules = (
        Rule(LinkExtractor(restrict_css='div.catCell a::attr(href)'), follow=True),
        Rule(LinkExtractor(allow=r"/Manufacturers/Hirschmann*"), callback='parse_new_item')
    )

    # parse_new_item (not shown) is the callback that should handle the product pages

I'm trying to reach the product pages for all Hirschmann products. I know my mistake is in the second line of rules, where I allow anything matching Hirschmann*; however, I'm not sure how to use a response.css/response.xpath style selector as the allow parameter instead.
Ideally, I'd like the crawler to follow every "div.catCell a::attr(href)" link and recurse through them until it detects "response.css('td.cellDesc h2 a::attr(href)')", at which point it should send that link to my parse_new_item. If no such item is found on a page, it should keep following all the links matching "div.catCell a::attr(href)".
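To make the intent concrete, here is roughly the shape I imagine the rules taking (an untested sketch: the class name and the yielded fields are placeholders, and restrict_css points at elements rather than ::attr(href), since as far as I can tell LinkExtractor extracts the href values itself):

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class HirschmannSpider(CrawlSpider):  # placeholder class name
    name = 'recursiveSpider'
    allowed_domains = ['industrialnetworking.com']
    start_urls = ['https://www.industrialnetworking.com/Manufacturers/Hirschmann']

    rules = (
        # Follow category / sub-category / series cells and keep recursing.
        Rule(LinkExtractor(restrict_css='div.catCell'), follow=True),
        # Product rows: hand the matching links to parse_new_item.
        Rule(LinkExtractor(restrict_css='td.cellDesc h2'), callback='parse_new_item'),
    )

    def parse_new_item(self, response):
        # Placeholder extraction; the real fields would come from the product page.
        yield {'url': response.url}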

Example URL travel path ->
StartURL: https://www.industrialnetworking.com/Manufacturers/Hirschmann
Category: https://www.industrialnetworking.com/Manufacturers/Hirschmann-Rail-Switches
SubCategory: https://www.industrialnetworking.com/Manufacturers/Hirschmann-Switches-Unmanaged
Series: https://www.industrialnetworking.com/Manufacturers/Hirschmann-SPIDER-Family-Rail-Switches
END GOAL ->
Product: https://www.industrialnetworking.com/Manufacturers/Hirschmann-SPIDER-III-Rail-Switches/Hirschmann-SSL20-5TX-Rail-Switch-942-132-001

Edit - the reason I'm targeting xpath/css paths is that the links don't have any obvious URL pattern I could use to match them.
Thanks everyone!


1zmg4dgp #1

Personally, I'm not a big fan of CrawlSpider. It is handy in some situations, but in your case I think sticking to manually crawling the links is the easier approach.
Since you have multiple pages with the same layout, you can feed each extracted link back into the main parse method until it hits a page containing the td/h2/a links, at which point you assign a different callback and parse the final product page with the parse_new_item method.
For example:

import scrapy

class MySpider(scrapy.Spider):
    name = 'recursiveSpider'
    allowed_domains = ['industrialnetworking.com']
    start_urls = ['https://www.industrialnetworking.com/Manufacturers/Hirschmann']

    def parse(self, response):
        for url in response.xpath("//div[@class='catCell']/a/@href").getall():
            yield scrapy.Request(response.urljoin(url), callback=self.parse)
        for url in response.xpath("//td[@class='cellDesc']/h2/a/@href").getall():
            yield scrapy.Request(response.urljoin(url), callback=self.parse_new_item)

    def parse_new_item(self, response):
        print(response)
        item_name = response.xpath("//div[@id='itmNam']/h1/text()").get()
        item = {"name": item_name}
        yield item

The output is quite long, so I've only included the final stats below.

Output

<200 https://www.industrialnetworking.com/Manufacturers/Hirschmann-Greyhound-Switch-Power-Accessories/Hirschmann-Greyhound-1040-Industrial-Power-Supply-GPS1-KSY9HH>
2022-09-14 13:32:12 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.industrialnetworking.com/Manufacturers/Hirschmann-Greyhound-Switch-Power-Accessories/Hirschmann-Greyhound-1040-Industrial-Power-Supply-GPS1-KSY9HH>
{'name': 'GPS1-KSY9HH Power Supply'}
2022-09-14 13:32:12 [scrapy.core.engine] INFO: Closing spider (finished)
2022-09-14 13:32:12 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 380892,
 'downloader/request_count': 483,
 'downloader/request_method_count/GET': 483,
 'downloader/response_bytes': 9139340,
 'downloader/response_count': 483,
 'downloader/response_status_count/200': 471,
 'downloader/response_status_count/429': 12,
 'elapsed_time_seconds': 22.988552,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2022, 9, 14, 20, 32, 12, 287889),
 'httpcompression/response_bytes': 41356802,
 'httpcompression/response_count': 471,
 'httperror/response_ignored_count': 4,
 'httperror/response_ignored_status_count/429': 4,
 'item_scraped_count': 401,
 'log_count/DEBUG': 889,
 'log_count/ERROR': 4,
 'log_count/INFO': 14,
 'request_depth_max': 5,
 'response_received_count': 475,
 'retry/count': 8,
 'retry/max_reached': 4,
 'retry/reason_count/429 Unknown Status': 8,
 'scheduler/dequeued': 483,
 'scheduler/dequeued/memory': 483,
 'scheduler/enqueued': 483,
 'scheduler/enqueued/memory': 483,
 'start_time': datetime.datetime(2022, 9, 14, 20, 31, 49, 299337)}
2022-09-14 13:32:12 [scrapy.core.engine] INFO: Spider closed (finished)
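One side note: the stats show a dozen 429 (rate-limited) responses and a few retries, so an unlucky run can drop some pages. If that matters, throttling can be tightened through the spider's custom_settings; these are standard Scrapy settings, but the specific values below are only guesses.

    # Add to the spider class above; tune the values for the site.
    custom_settings = {
        'AUTOTHROTTLE_ENABLED': True,  # back off automatically based on server latency
        'DOWNLOAD_DELAY': 0.5,         # base delay between requests, in seconds
    }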

at0kjp5o #2

The page you mention above contains 14 listing URLs, so you can target them with just an xpath or css selector, and you have to use follow=False to get rid of the unwanted URLs.

from scrapy.crawler import CrawlerProcess
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class TestSpider(CrawlSpider):
    name = 'test'

    allowed_domains = ['industrialnetworking.com']
    start_urls = ['https://www.industrialnetworking.com/Manufacturers/Hirschmann']

    rules = (
        # Every /Manufacturers/Hirschmann-... link is both followed and parsed.
        Rule(LinkExtractor(allow=r'/Manufacturers/Hirschmann-'), follow=True, callback='parse_item'),
    )

    def parse_item(self, response):
        yield {
            'Title': response.xpath('//*[@id="itmNam"]/h1/text()').get()
            }

if __name__ == "__main__":
    process = CrawlerProcess()
    process.crawl(TestSpider)
    process.start()
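If you run it standalone like this and want the titles written to a file, CrawlerProcess also accepts a settings dict; a minimal sketch using Scrapy's FEEDS setting (the file name is just an example):

if __name__ == "__main__":
    # Export every yielded item to a JSON file; "hirschmann.json" is an arbitrary name.
    process = CrawlerProcess(settings={
        "FEEDS": {"hirschmann.json": {"format": "json"}},
    })
    process.crawl(TestSpider)
    process.start()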
