Scrapy SitemapSpider: how to yield entries from sitemap_filter so they are parsed

Posted by mi7gmzs6 on 2022-11-09 in Other

I am building a SitemapSpider. I am trying to filter the sitemap entries to exclude those whose link contains the substring '/p/':

<url>
       <loc>https://example.co.za/product-name/p/product-id</loc>
       <lastmod>2019-08-27</lastmod>
       <changefreq>daily</changefreq>
</url>

According to the Scrapy docs, we can define a sitemap_filter function:

for entry in entries:
    date_time = datetime.strptime(entry['lastmod'], '%Y-%m-%d')
    if date_time.year >= 2005:
        yield entry

In my case, I am filtering on entry['loc'] rather than entry['lastmod'].
Unfortunately, I have not found any examples of using sitemap_filter other than the one above.

from scrapy.spiders import SitemapSpider

class mySpider(SitemapSpider):
    name = 'spiderName'
    sitemap_urls = ['https://example.co.za/medias/sitemap']
    # sitemap_rules = [ ('donut/c', 'parse')]

    def sitemap_filter(self, entries):
        for entry in entries:
            if '/p/' not in entry['loc']:
                print(entry)
                yield entry

    def parse(self, response):
        ...

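The code runs fine without the sitemap_filter function, but defining all the sitemap_rules is not feasible.
When I run the code above, it prints the correct sitemap entries, but it never seems to reach the parse function. The log file shows no errors: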

2022-05-10 17:02:00 [scrapy.core.engine] INFO: Spider opened
2022-05-10 17:02:00 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-05-10 17:02:00 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2022-05-10 17:02:00 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://example.co.za/robots.txt> (referer: None)
2022-05-10 17:02:00 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://example.co.za/medias/sitemap.xml> (referer: None)
2022-05-10 17:02:05 [scrapy.crawler] INFO: Received SIGINT, shutting down gracefully. Send again to force 
2022-05-10 17:02:06 [scrapy.crawler] INFO: Received SIGINT twice, forcing unclean shutdown
2022-05-10 17:02:09 [scrapy.core.engine] INFO: Closing spider (shutdown)
2022-05-10 17:02:09 [scrapy.statscollectors] INFO: Dumping Scrapy stats:

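I am looking for a way to send the entries yielded by sitemap_filter to the parse function or, alternatively, to filter the sitemap entries before Scrapy opens the links.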
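
For the second option (filtering before Scrapy opens the links), one possible alternative is a single sitemap_rules pattern with a negative lookahead, since sitemap_rules accepts regular expressions. This is only a sketch and has not been tried against the site in question. It checks the pattern with the standard re module alone; the URLs and the name NOT_PRODUCT are made up for illustration.

```python
import re

# Hypothetical pattern: matches any URL that does NOT contain '/p/' anywhere.
NOT_PRODUCT = re.compile(r"^(?!.*/p/)")

# Made-up URLs for illustration only.
urls = [
    "https://example.co.za/product-name/p/product-id",
    "https://example.co.za/donut/c/donuts",
]

# Keep only the URLs the pattern accepts.
matches = [u for u in urls if NOT_PRODUCT.search(u)]
print(matches)
```

If the pattern behaves as intended, it could presumably be used in the spider as sitemap_rules = [(r'^(?!.*/p/)', 'parse')], leaving sitemap_filter undefined.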

Answer #1 (kknvjkwl)

Thanks, everyone, for the suggestions. Following @Georgiy's comment and an old answer, replacing entry['loc'] with entry.get('loc') worked.

from scrapy.spiders import SitemapSpider

class mySpider(SitemapSpider):
    name = 'spiderName'
    sitemap_urls = ['https://example.co.za/medias/sitemap']
    # sitemap_rules = [ ('donut/c', 'parse')]

    def sitemap_filter(self, entries):
        for entry in entries:
            if '/p/' not in entry.get('loc'):
                #print(entry)
                yield entry

    def parse(self, response):
        ...
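
One caveat: .get('loc') returns None when the key is absent, and '/p/' not in None raises a TypeError. Below is a minimal sketch of a more defensive variant. It is written as a plain generator over dicts so it can be tried without Scrapy. The name keep_non_product_entries and the sample entries are hypothetical.

```python
def keep_non_product_entries(entries, marker="/p/"):
    """Yield sitemap entries whose 'loc' URL does not contain `marker`."""
    for entry in entries:
        # Fall back to '' so a missing or None 'loc' does not raise TypeError.
        loc = entry.get("loc") or ""
        # Skip entries without a URL, and entries containing the marker.
        if loc and marker not in loc:
            yield entry


# Made-up entries shaped like the dicts that sitemap_filter iterates over.
sample = [
    {"loc": "https://example.co.za/product-name/p/product-id", "lastmod": "2019-08-27"},
    {"loc": "https://example.co.za/donut/c/donuts", "lastmod": "2019-08-27"},
    {"lastmod": "2019-08-27"},  # no 'loc' at all
]

kept = [e["loc"] for e in keep_non_product_entries(sample)]
print(kept)
```

Inside the spider, sitemap_filter could then delegate with yield from keep_non_product_entries(entries).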
