Loading a filtered RSS feed with Scrapy fails

o3imoua4 posted on 2023-02-22 in Other

My code is as follows:

import scrapy

headers = \
{'Host': 'log.rlsbb.cc',
'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/110.0',
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8',
'Accept-Language': 'en-US,en;q=0.5',
'Accept-Encoding': 'gzip, deflate, br',
'Connection': 'keep-alive',
'Referer': 'https://log.rlsbb.cc/',
'Cookie': 'filters=foreign-movies,movies,tv-shows,old-movies,_foreign-movies_f-webrip,_foreign-movies_f-dvdrip-bdrip,\
_foreign-movies_f-bluray-720p,_foreign-movies_f-bluray-1080p,_movies_bluray-1080p,_movies_bluray-720p,_movies_bdrip,\
_movies_webrip,_movies_dvdrip,_movies_4k-uhd,_tv-shows_top,_tv-shows_tv-packs,_movies_old,_foreign-movies_f-old',
'Upgrade-Insecure-Requests': '1',
'Sec-Fetch-Dest': 'document',
'Sec-Fetch-Mode': 'navigate',
'Sec-Fetch-Site': 'same-origin',
'Sec-Fetch-User': '?1',
'Sec-GPC': '1',
'DNT': '1',
'TE': 'trailers'}


class ScrapeRlsBBRssSpider(scrapy.Spider):
    name = 'scrape_rlsbb_rss'
    allowed_domains = ['log.rlsbb.cc']  # domain only, no path
    start_urls = ['https://log.rlsbb.cc/feed/']  # unused: start_requests() below takes precedence

    custom_settings={ 'FEED_URI': f"{name}_%(time)s.json",
                      'FEED_FORMAT': 'json'}

    def start_requests(self):
        urls = [
            'https://log.rlsbb.cc/feed/',
        ]

        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse, headers=headers)

    def parse(self, response):
        for post in response.xpath('//channel/item'):
            yield {
                'title' : post.xpath('title//text()').extract_first(),
                'link': post.xpath('link//text()').extract_first(),
                'pubDate' : post.xpath('pubDate//text()').extract_first(),
                'category': post.xpath('category//text()').extract_first(),
            }

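I filtered my RSS feed using the options (gear icon) provided on the website. When I click the RSS icon to get the feed link, it opens the link above and shows the filtered entries I want.
I then used that feed link in Scrapy to download the RSS XML. The output is not what I expected: it contains all the unfiltered links as well as some of the filtered ones. I then applied the filters in the Cookie field of the request headers (see the code), and that returned an empty file.
What am I doing wrong or misunderstanding?
Any help would be greatly appreciated.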

wbgh16ku #1

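I was able to extract the data by using relative XPath expressions and removing the custom headers, though I am not sure the relative XPath part is actually necessary.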

import scrapy


class ScrapeRlsBBRssSpider(scrapy.Spider):
    name = 'scrape_rlsbb_rss'
    allowed_domains = ['log.rlsbb.cc']  # domain only, no path
    start_urls = ['https://log.rlsbb.cc/feed/']  # unused: start_requests() below takes precedence

    custom_settings={ 'FEED_URI': f"{name}_%(time)s.json",
                      'FEED_FORMAT': 'json'}

    def start_requests(self):
        urls = [
            'https://log.rlsbb.cc/feed/',
        ]

        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)  # no headers

    def parse(self, response):
        for post in response.xpath('//channel/item'):
            yield {  # relative xpaths i.e. './'...
                'title' : post.xpath('./title//text()').extract_first(),
                'link': post.xpath('./link//text()').extract_first(),
                'pubDate' : post.xpath('./pubDate//text()').extract_first(),
                'category': post.xpath('./category//text()').extract_first(),
            }
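
On the doubt about relative XPath: a path that does not begin with `/` is already evaluated relative to the node it is called on, so `title` and `./title` should select the same child, which would make the removed headers the more likely reason the spider started returning items. The sketch below checks that equivalence with the standard library's `xml.etree.ElementTree` instead of Scrapy selectors. The sample feed and the `titles` helper are made up for illustration and are not part of the original code.

```python
import xml.etree.ElementTree as ET

# Made-up sample feed, used only for this illustration.
SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Sample feed</title>
    <item>
      <title>Show A S01E01</title>
      <category>TV Shows</category>
    </item>
    <item>
      <title>Film B 2023</title>
      <category>Movies</category>
    </item>
  </channel>
</rss>
"""


def titles(xml_text, path):
    """Return the title text of every <item>, looked up with the given path."""
    root = ET.fromstring(xml_text)
    return [item.findtext(path) for item in root.iter("item")]


plain = titles(SAMPLE_RSS, "title")      # no leading './'
dotted = titles(SAMPLE_RSS, "./title")   # explicit './'
print(plain == dotted)
```

Both lookups return the two item titles and skip the channel-level title, because each path is resolved from the `<item>` node. A path that starts with `//` is the case to watch for in nested selectors, since it searches the whole document rather than the current node.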

Output of the spider:

2023-02-21 18:51:46 [scrapy.core.scraper] DEBUG: Scraped from <200 https://log.rlsbb.cc/feed/>
{'title': 'Accused 2023 S01E05 WEB x264-TGX (246MB)', 'link': 'https://log.rlsbb.cc/accused-2023-s01e05-1080p-web-h264-cakes/', 'pubDate': 'Wed, 22 Feb 2023 02:34:47 +0000', 'category': 'TV Shows'}
2023-02-21 18:51:46 [scrapy.core.scraper] DEBUG: Scraped from <200 https://log.rlsbb.cc/feed/>
{'title': 'American Auto S02E05 HDTV x264-TGX (185MB)', 'link': 'https://log.rlsbb.cc/american-auto-s02e05-720p-hdtv-x264-syncopy/', 'pubDate': 'Wed, 22 Feb 2023 02:34:38 +0000', 'category': 'TV Shows'}
2023-02-21 18:51:46 [scrapy.core.scraper] DEBUG: Scraped from <200 https://log.rlsbb.cc/feed/>
{'title': 'Accused 2023 S01E05 720p HEVC X265-MeGusta (205MB)', 'link': 'https://log.rlsbb.cc/accused-2023-s01e05-1080p-web-h264-cakes/', 'pubDate': 'Wed, 22 Feb 2023 02:17:29 +0000', 'category': 'TV Shows'}
2023-02-21 18:51:46 [scrapy.core.scraper] DEBUG: Scraped from <200 https://log.rlsbb.cc/feed/>
{'title': 'FBI S05E14 480p x264-mSD (212MB)', 'link': 'https://log.rlsbb.cc/fbi-s05e14-720p-hdtv-x264-syncopy/', 'pubDate': 'Wed, 22 Feb 2023 02:16:43 +0000', 'category': 'TV Shows'}
2023-02-21 18:51:46 [scrapy.core.scraper] DEBUG: Scraped from <200 https://log.rlsbb.cc/feed/>
{'title': 'American Auto S02E05 720p HEVC X265-MeGusta (201MB)', 'link': 'https://log.rlsbb.cc/american-auto-s02e05-720p-hdtv-x264-syncopy/', 'pubDate': 'Wed, 22 Feb 2023 02:15:36 +0000', 'category': 'TV Shows'}
2023-02-21 18:51:46 [scrapy.core.scraper] DEBUG: Scraped from <200 https://log.rlsbb.cc/feed/>
{'title': 'Accused 2023 S01E05 480p x264-mSD (146MB)', 'link': 'https://log.rlsbb.cc/accused-2023-s01e05-1080p-web-h264-cakes/', 'pubDate': 'Wed, 22 Feb 2023 02:13:38 +0000', 'category': 'TV Shows'}
2023-02-21 18:51:46 [scrapy.core.scraper] DEBUG: Scraped from <200 https://log.rlsbb.cc/feed/>
{'title': 'American Auto S02E05 480p x264-mSD (135MB)', 'link': 'https://log.rlsbb.cc/american-auto-s02e05-720p-hdtv-x264-syncopy/', 'pubDate': 'Wed, 22 Feb 2023 02:10:14 +0000', 'category': 'TV Shows'}
2023-02-21 18:51:46 [scrapy.core.scraper] DEBUG: Scraped from <200 https://log.rlsbb.cc/feed/>
{'title': 'FBI S05E14 720p HEVC X265-MeGusta (261MB)', 'link': 'https://log.rlsbb.cc/fbi-s05e14-720p-hdtv-x264-syncopy/', 'pubDate': 'Wed, 22 Feb 2023 02:09:14 +0000', 'category': 'TV Shows'}
2023-02-21 18:51:46 [scrapy.core.scraper] DEBUG: Scraped from <200 https://log.rlsbb.cc/feed/>
{'title': 'Accused 2023 S01E05 HDTV x264-RBB (338MB)', 'link': 'https://log.rlsbb.cc/accused-2023-s01e05-1080p-web-h264-cakes/', 'pubDate': 'Wed, 22 Feb 2023 02:04:17 +0000', 'category': 'TV Shows'}
2023-02-21 18:51:46 [scrapy.core.scraper] DEBUG: Scraped from <200 https://log.rlsbb.cc/feed/>
{'title': 'FBI S05E14 HDTV x264-RBB (295MB)', 'link': 'https://log.rlsbb.cc/fbi-s05e14-720p-hdtv-x264-syncopy/', 'pubDate': 'Wed, 22 Feb 2023 02:03:37 +0000', 'category': 'TV Shows'}
2023-02-21 18:51:46 [scrapy.core.scraper] DEBUG: Scraped from <200 https://log.rlsbb.cc/feed/>
{'title': 'American Auto S02E05 HDTV x264-RBB (162MB)', 'link': 'https://log.rlsbb.cc/american-auto-s02e05-720p-hdtv-x264-syncopy/', 'pubDate': 'Wed, 22 Feb 2023 02:03:30 +0000', 'category': 'TV Shows'}
2023-02-21 18:51:46 [scrapy.core.scraper] DEBUG: Scraped from <200 https://log.rlsbb.cc/feed/>
{'title': 'Accused 2023 S01E05 720p WEB H264-CAKES (1.04GB)', 'link': 'https://log.rlsbb.cc/accused-2023-s01e05-1080p-web-h264-cakes/', 'pubDate': 'Wed, 22 Feb 2023 02:01:15 +0000', 'category': 'TV Shows'}
2023-02-21 18:51:46 [scrapy.core.scraper] DEBUG: Scraped from <200 https://log.rlsbb.cc/feed/>
{'title': 'Accused 2023 S01E05 1080p WEB H264-CAKES (1.53GB)', 'link': 'https://log.rlsbb.cc/accused-2023-s01e05-1080p-web-h264-cakes/', 'pubDate': 'Wed, 22 Feb 2023 02:01:15 +0000', 'category': 'TV Shows'}
2023-02-21 18:51:46 [scrapy.core.scraper] DEBUG: Scraped from <200 https://log.rlsbb.cc/feed/>
{'title': 'American Auto S02E05 720p HDTV x264-SYNCOPY (599MB)', 'link': 'https://log.rlsbb.cc/american-auto-s02e05-720p-hdtv-x264-syncopy/', 'pubDate': 'Wed, 22 Feb 2023 02:00:21 +0000', 'category': 'TV Shows'}
2023-02-21 18:51:46 [scrapy.core.scraper] DEBUG: Scraped from <200 https://log.rlsbb.cc/feed/>
{'title': 'FBI S05E14 720p HDTV x264-SYNCOPY (828MB)', 'link': 'https://log.rlsbb.cc/fbi-s05e14-720p-hdtv-x264-syncopy/', 'pubDate': 'Wed, 22 Feb 2023 01:58:32 +0000', 'category': 'TV Shows'}
2023-02-21 18:51:46 [scrapy.core.scraper] DEBUG: Scraped from <200 https://log.rlsbb.cc/feed/>
{'title': 'Night Court 2023 S01E07 HDTV x264-TGX (157MB)', 'link': 'https://log.rlsbb.cc/night-court-2023-s01e07-720p-hdtv-x264-syncopy/', 'pubDate': 'Wed, 22 Feb 2023 01:56:57 +0000', 'category': 'TV Shows'}
2023-02-21 18:51:46 [scrapy.core.scraper] DEBUG: Scraped from <200 https://log.rlsbb.cc/feed/>
{'title': 'Night Court 2023 S01E07 HDTV x264-RBB (163MB)', 'link': 'https://log.rlsbb.cc/night-court-2023-s01e07-720p-hdtv-x264-syncopy/', 'pubDate': 'Wed, 22 Feb 2023 01:56:57 +0000', 'category': 'TV Shows'}
2023-02-21 18:51:46 [scrapy.core.scraper] DEBUG: Scraped from <200 https://log.rlsbb.cc/feed/>
{'title': 'Night Court 2023 S01E07 480p x264-mSD (118MB)', 'link': 'https://log.rlsbb.cc/night-court-2023-s01e07-720p-hdtv-x264-syncopy/', 'pubDate': 'Wed, 22 Feb 2023 01:56:03 +0000', 'category': 'TV Shows'}
2023-02-21 18:51:46 [scrapy.core.scraper] DEBUG: Scraped from <200 https://log.rlsbb.cc/feed/>
{'title': 'Night Court 2023 S01E07 720p HEVC X265-MeGusta (162MB)', 'link': 'https://log.rlsbb.cc/night-court-2023-s01e07-720p-hdtv-x264-syncopy/', 'pubDate': 'Wed, 22 Feb 2023 01:55:38 +0000', 'category': 'TV Shows'}
2023-02-21 18:51:46 [scrapy.core.scraper] DEBUG: Scraped from <200 https://log.rlsbb.cc/feed/>
{'title': 'Deceived By My Mother-in-law 2021 1080p WEBRip x264-YIFY (1.42GB)', 'link': 'https://log.rlsbb.cc/deceived-by-my-mother-in-law-2021-1080p-amzn-webrip-x264-edph/', 'pubDate': 'Wed, 22 Feb 2023 01:42:04 +0000', 'category': 'Movies'}
2023-02-21 18:51:46 [scrapy.core.scraper] DEBUG: Scraped from <200 https://log.rlsbb.cc/feed/>
{'title': 'Varisu 2023 WEB H264-RBB (1.26GB)', 'link': 'https://log.rlsbb.cc/varisu-2023-720p-amzn-web-dl-h264-telly/', 'pubDate': 'Wed, 22 Feb 2023 01:34:11 +0000', 'category': 'Foreign Movies'}
2023-02-21 18:51:46 [scrapy.core.scraper] DEBUG: Scraped from <200 https://log.rlsbb.cc/feed/>
{'title': 'Nothing Is Impossible 2022 WEB H264-RBB (815MB)', 'link': 'https://log.rlsbb.cc/nothing-is-impossible-2022-1080p-web-dl-h264-ngp/', 'pubDate': 'Wed, 22 Feb 2023 01:33:44 +0000', 'category': 'Movies'}
2023-02-21 18:51:46 [scrapy.core.scraper] DEBUG: Scraped from <200 https://log.rlsbb.cc/feed/>
{'title': 'Inside Our Autistic Minds S01E02 WEB H264-RBB (451MB)', 'link': 'https://log.rlsbb.cc/inside-our-autistic-minds-s01e02-720p-ip-web-dl-h264-rng/', 'pubDate': 'Wed, 22 Feb 2023 01:33:25 +0000', 'category': 'TV Shows'}
2023-02-21 18:51:46 [scrapy.core.scraper] DEBUG: Scraped from <200 https://log.rlsbb.cc/feed/>
{'title': 'Night Court 2023 S01E07 720p HDTV x264-SYNCOPY (523MB)', 'link': 'https://log.rlsbb.cc/night-court-2023-s01e07-720p-hdtv-x264-syncopy/', 'pubDate': 'Wed, 22 Feb 2023 01:31:55 +0000', 'category': 'TV Shows'}
2023-02-21 18:51:46 [scrapy.core.scraper] DEBUG: Scraped from <200 https://log.rlsbb.cc/feed/>
{'title': '9-1-1 Lone Star S04E05 720p HEVC X265-MeGusta (176MB)', 'link': 'https://log.rlsbb.cc/9-1-1-lone-star-s04e05-720p-web-h264-cakes/', 'pubDate': 'Wed, 22 Feb 2023 01:26:15 +0000', 'category': 'TV Shows'}
2023-02-21 18:51:46 [scrapy.core.scraper] DEBUG: Scraped from <200 https://log.rlsbb.cc/feed/>
{'title': 'FBI Most Wanted S04E13 720p HEVC X265-MeGusta (258MB)', 'link': 'https://log.rlsbb.cc/fbi-most-wanted-s04e13-720p-hdtv-x264-syncopy/', 'pubDate': 'Wed, 22 Feb 2023 01:25:29 +0000', 'category': 'TV Shows'}
2023-02-21 18:51:46 [scrapy.core.scraper] DEBUG: Scraped from <200 https://log.rlsbb.cc/feed/>
{'title': '9-1-1 Lone Star S04E05 1080p HEVC X265-MeGusta (247MB)', 'link': 'https://log.rlsbb.cc/9-1-1-lone-star-s04e05-720p-web-h264-cakes/', 'pubDate': 'Wed, 22 Feb 2023 01:23:44 +0000', 'category': 'TV Shows'}
2023-02-21 18:51:46 [scrapy.core.scraper] DEBUG: Scraped from <200 https://log.rlsbb.cc/feed/>
{'title': 'Will Trent S01E07 HDTV x264-TGX (241MB)', 'link': 'https://log.rlsbb.cc/will-trent-s01e07-720p-hdtv-x264-syncopy/', 'pubDate': 'Wed, 22 Feb 2023 01:23:22 +0000', 'category': 'TV Shows'}
2023-02-21 18:51:46 [scrapy.core.scraper] DEBUG: Scraped from <200 https://log.rlsbb.cc/feed/>
{'title': 'The Rookie S05E16 HDTV x264-TGX (373MB)', 'link': 'https://log.rlsbb.cc/the-rookie-s05e16-720p-hdtv-x264-syncopy/', 'pubDate': 'Wed, 22 Feb 2023 01:23:14 +0000', 'category': 'TV Shows'}
2023-02-21 18:51:46 [scrapy.core.scraper] DEBUG: Scraped from <200 https://log.rlsbb.cc/feed/>
{'title': 'FBI Most Wanted S04E13 HDTV x264-TGX (247MB)', 'link': 'https://log.rlsbb.cc/fbi-most-wanted-s04e13-720p-hdtv-x264-syncopy/', 'pubDate': 'Wed, 22 Feb 2023 01:23:03 +0000', 'category': 'TV Shows'}
2023-02-21 18:51:46 [scrapy.core.scraper] DEBUG: Scraped from <200 https://log.rlsbb.cc/feed/>
{'title': '9-1-1 Lone Star S04E05 WEB x264-TGX (239MB)', 'link': 'https://log.rlsbb.cc/9-1-1-lone-star-s04e05-720p-web-h264-cakes/', 'pubDate': 'Wed, 22 Feb 2023 01:22:46 +0000', 'category': 'TV Shows'}
2023-02-21 18:51:46 [scrapy.core.scraper] DEBUG: Scraped from <200 https://log.rlsbb.cc/feed/>
{'title': 'Varisu 2023 1080p AMZN WEB-DL DDP5 1 H 264-DTR (10.0GB)', 'link': 'https://log.rlsbb.cc/varisu-2023-720p-amzn-web-dl-h264-telly/', 'pubDate': 'Wed, 22 Feb 2023 01:20:38 +0000', 'category': 'Foreign Movies'}
2023-02-21 18:51:46 [scrapy.core.scraper] DEBUG: Scraped from <200 https://log.rlsbb.cc/feed/>
{'title': 'Varisu 2023 720p AMZN WEB-DL DDP5 1 H 264-Telly (5.17GB)', 'link': 'https://log.rlsbb.cc/varisu-2023-720p-amzn-web-dl-h264-telly/', 'pubDate': 'Wed, 22 Feb 2023 01:20:38 +0000', 'category': 'Foreign Movies'}
2023-02-21 18:51:46 [scrapy.core.scraper] DEBUG: Scraped from <200 https://log.rlsbb.cc/feed/>
{'title': 'The Rookie S05E16 480p x264-mSD (280MB)', 'link': 'https://log.rlsbb.cc/the-rookie-s05e16-720p-hdtv-x264-syncopy/', 'pubDate': 'Wed, 22 Feb 2023 01:19:59 +0000', 'category': 'TV Shows'}
2023-02-21 18:51:46 [scrapy.core.scraper] DEBUG: Scraped from <200 https://log.rlsbb.cc/feed/>
{'title': 'The Rookie S05E16 720p HEVC X265-MeGusta (472MB)', 'link': 'https://log.rlsbb.cc/the-rookie-s05e16-720p-hdtv-x264-syncopy/', 'pubDate': 'Wed, 22 Feb 2023 01:19:59 +0000', 'category': 'TV Shows'}
2023-02-21 18:51:46 [scrapy.core.scraper] DEBUG: Scraped from <200 https://log.rlsbb.cc/feed/>
{'title': 'Inside Our Autistic Minds S01E02 720p IP WEB-DL AAC2 0 H 264-RNG (2.13GB)', 'link': 'https://log.rlsbb.cc/inside-our-autistic-minds-s01e02-720p-ip-web-dl-h264-rng/', 'pubDate': 'Wed, 22 Feb 2023 01:18:44 +0000', 'category': 'TV Shows'}
2023-02-21 18:51:46 [scrapy.core.scraper] DEBUG: Scraped from <200 https://log.rlsbb.cc/feed/>
{'title': 'Inside Our Autistic Minds S01E02 1080p IP WEB-DL AAC2 0 H 264-RNG (2.23GB)', 'link': 'https://log.rlsbb.cc/inside-our-autistic-minds-s01e02-720p-ip-web-dl-h264-rng/', 'pubDate': 'Wed, 22 Feb 2023 01:18:44 +0000', 'category': 'TV Shows'}
2023-02-21 18:51:46 [scrapy.core.scraper] DEBUG: Scraped from <200 https://log.rlsbb.cc/feed/>
{'title': 'FBI Most Wanted S04E13 480p x264-mSD (212MB)', 'link': 'https://log.rlsbb.cc/fbi-most-wanted-s04e13-720p-hdtv-x264-syncopy/', 'pubDate': 'Wed, 22 Feb 2023 01:18:35 +0000', 'category': 'TV Shows'}
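
This spider fetches the feed without the filter cookie, so any filtering still has to happen on the client side. A minimal sketch of one way to do that, assuming items shaped like the dicts in the output above. `WANTED_CATEGORIES`, `WANTED_KEYWORDS` and `keep_wanted` are hypothetical names chosen for this example; they are not part of Scrapy or of the site, and the keyword match is only a rough stand-in for the site's own filters.

```python
# Hypothetical filter settings; adjust to the categories and releases you want.
WANTED_CATEGORIES = {"Movies", "Foreign Movies"}
WANTED_KEYWORDS = ("1080p", "720p")


def keep_wanted(items):
    """Yield items with a wanted category whose title mentions a wanted keyword."""
    for item in items:
        title = item.get("title") or ""
        category_ok = item.get("category") in WANTED_CATEGORIES
        keyword_ok = any(keyword in title for keyword in WANTED_KEYWORDS)
        if category_ok and keyword_ok:
            yield item


# Three items copied from the output above (link and pubDate omitted for brevity).
scraped = [
    {"title": "Deceived By My Mother-in-law 2021 1080p WEBRip x264-YIFY (1.42GB)",
     "category": "Movies"},
    {"title": "Nothing Is Impossible 2022 WEB H264-RBB (815MB)",
     "category": "Movies"},
    {"title": "FBI S05E14 480p x264-mSD (212MB)",
     "category": "TV Shows"},
]

kept = list(keep_wanted(scraped))
for item in kept:
    print(item["title"])
```

In the spider, the same check could run in `parse` before each `yield`, or in an item pipeline. Note that `extract_first()` keeps only the first `<category>` of each item, so an entry carrying several categories would be matched on the first one alone.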
