Scrapy: scraping links from a website without duplicates

Posted by ndasle7k on 2022-11-09 in Other

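I am trying to scrape all the links from a website, together with the text of each page. Right now my code produces a lot of duplicates, which I want to avoid. Could you please help me and tell me where I am going wrong?
Here is my spider: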

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class SuperSpider(CrawlSpider):
    name = 'spider'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com/']
    base_url = 'http://quotes.toscrape.com'

    rules = [Rule(LinkExtractor(allow='/'),
                  callback='parse', follow=True)]

    def parse(self, response):
        url_list = []
        for quote in response.css('div'):
            name =  quote.xpath('.//a/@href').get()
            if name in url_list:
                continue
            url_list.append(name)
            yield {
                'Link_without_base_url': quote.xpath('.//a/@href').get(),
                'Text':  response.css("::text").extract(),
            }

An example of the JSON I get:

{"Link_without_base_url": "/", "Text": ["\n", "\n\t", "\n\t", "Quotes to Scrape", "\n    ", "\n    ", "\n", "\n", "\n    ", "\n        ", "\n            ", "\n                ", "\n                    ", "Quotes to Scrape", "\n                ", "\n            ", "\n            ", "\n                ", "\n                \n                    ", "Login", "\n                \n                ", "\n            ", "\n        ", "\n    \n\n", "Viewing tag: ", "better-life-empathy", "\n\n", "\n    ", "\n\n    ", "\n        ", "\u201cYou never really understand a person until you consider things from his point of view... Until you climb inside of his skin and walk around in it.\u201d", "\n        ", "by ", "Harper Lee", "\n        ", "(about)", "\n        ", "\n        ", "\n            Tags:\n            ", " \n            \n            ", "better-life-empathy", "\n            \n        ", "\n    ", "\n\n    ", "\n        ", "\n            \n            \n        ", "\n    ", "\n    ", "\n    ", "\n        \n            ", "Top Ten tags", "\n            \n            ", "\n            ", "love", "\n            ", "\n            \n            ", "\n            ", "inspirational", "\n            ", "\n            \n            ", "\n            ", "life", "\n            ", "\n            \n            ", "\n            ", "humor", "\n            ", "\n            \n            ", "\n            ", "books", "\n            ", "\n            \n            ", "\n            ", "reading", "\n            ", "\n            \n            ", "\n            ", "friendship", "\n            ", "\n            \n            ", "\n            ", "friends", "\n            ", "\n            \n            ", "\n            ", "truth", "\n            ", "\n            \n            ", "\n            ", "simile", "\n            ", "\n            \n        \n    ", "\n", "\n\n    ", "\n    ", "\n        ", "\n            ", "\n                Quotes by: ", "GoodReads.com", "\n            ", 
"\n            ", "\n                Made with ", "\u2764", " by ", "Scrapinghub", "\n            ", "\n        ", "\n    ", "\n", "\n"]},
{"Link_without_base_url": "/", "Text": ["\n", "\n\t", "\n\t", "Quotes to Scrape", "\n    ", "\n    ", "\n", "\n", "\n    ", "\n        ", "\n            ", "\n                ", "\n                    ", "Quotes to Scrape", "\n                ", "\n            ", "\n            ", "\n                ", "\n                \n                    ", "Login", "\n                \n                ", "\n            ", "\n        ", "\n    \n\n", "Viewing tag: ", "better-life-empathy", "\n\n", "\n    ", "\n\n    ", "\n        ", "\u201cYou never really understand a person until you consider things from his point of view... Until you climb inside of his skin and walk around in it.\u201d", "\n        ", "by ", "Harper Lee", "\n        ", "(about)", "\n        ", "\n        ", "\n            Tags:\n            ", " \n            \n            ", "better-life-empathy", "\n            \n        ", "\n    ", "\n\n    ", "\n        ", "\n            \n            \n        ", "\n    ", "\n    ", "\n    ", "\n        \n            ", "Top Ten tags", "\n            \n            ", "\n            ", "love", "\n            ", "\n            \n            ", "\n            ", "inspirational", "\n            ", "\n            \n            ", "\n            ", "life", "\n            ", "\n            \n            ", "\n            ", "humor", "\n            ", "\n            \n            ", "\n            ", "books", "\n            ", "\n            \n            ", "\n            ", "reading", "\n            ", "\n            \n            ", "\n            ", "friendship", "\n            ", "\n            \n            ", "\n            ", "friends", "\n            ", "\n            \n            ", "\n            ", "truth", "\n            ", "\n            \n            ", "\n            ", "simile", "\n            ", "\n            \n        \n    ", "\n", "\n\n    ", "\n    ", "\n        ", "\n            ", "\n                Quotes by: ", "GoodReads.com", "\n            ", 
"\n            ", "\n                Made with ", "\u2764", " by ", "Scrapinghub", "\n            ", "\n        ", "\n    ", "\n", "\n"]},
{"Link_without_base_url": "/", "Text": ["\n", "\n\t", "\n\t", "Quotes to Scrape", "\n    ", "\n    ", "\n", "\n", "\n    ", "\n        ", "\n            ", "\n                ", "\n                    ", "Quotes to Scrape", "\n                ", "\n            ", "\n            ", "\n                ", "\n                \n                    ", "Login", "\n                \n                ", "\n            ", "\n        ", "\n    \n\n", "Viewing tag: ", "better-life-empathy", "\n\n", "\n    ", "\n\n    ", "\n        ", "\u201cYou never really understand a person until you consider things from his point of view... Until you climb inside of his skin and walk around in it.\u201d", "\n        ", "by ", "Harper Lee", "\n        ", "(about)", "\n        ", "\n        ", "\n            Tags:\n            ", " \n            \n            ", "better-life-empathy", "\n            \n        ", "\n    ", "\n\n    ", "\n        ", "\n            \n            \n        ", "\n    ", "\n    ", "\n    ", "\n        \n            ", "Top Ten tags", "\n            \n            ", "\n            ", "love", "\n            ", "\n            \n            ", "\n            ", "inspirational", "\n            ", "\n            \n            ", "\n            ", "life", "\n            ", "\n            \n            ", "\n            ", "humor", "\n            ", "\n            \n            ", "\n            ", "books", "\n            ", "\n            \n            ", "\n            ", "reading", "\n            ", "\n            \n            ", "\n            ", "friendship", "\n            ", "\n            \n            ", "\n            ", "friends", "\n            ", "\n            \n            ", "\n            ", "truth", "\n            ", "\n            \n            ", "\n            ", "simile", "\n            ", "\n            \n        \n    ", "\n", "\n\n    ", "\n    ", "\n        ", "\n            ", "\n                Quotes by: ", "GoodReads.com", "\n            ", 
"\n            ", "\n                Made with ", "\u2764", " by ", "Scrapinghub", "\n            ", "\n        ", "\n    ", "\n", "\n"]},

Thanks, everyone, for your help.

Answer #1, by nmpmafwu

Simply put, you can select all the quote items, iterate over them, and pick the link and the text out of each item, as follows:

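import scrapy
from scrapy.crawler import CrawlerProcess


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        # Each quote on the page sits in an element with class "quote".
        for quote in response.xpath('//*[@class="quote"]'):
            yield {
                # Author link, made absolute by prepending the site root.
                'Link_without_base_url': 'http://quotes.toscrape.com' + quote.css('.text~span a::attr(href)').get(),
                # The leading dot keeps the XPath lookup inside this quote.
                'Text': quote.xpath('.//*[@class="text"]/text()').get(),
            }


if __name__ == "__main__":
    # CrawlerProcess takes settings, not a spider; the spider class is passed to crawl().
    process = CrawlerProcess()
    process.crawl(QuotesSpider)
    process.start()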

Output:

{'Link_without_base_url': 'http://quotes.toscrape.com/author/Albert-Einstein', 'Text': '“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”'}
2022-06-21 01:28:36 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/>
{'Link_without_base_url': 'http://quotes.toscrape.com/author/J-K-Rowling', 'Text': '“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”'}
2022-06-21 01:28:36 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/>
{'Link_without_base_url': 'http://quotes.toscrape.com/author/Albert-Einstein', 'Text': '“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”'}
2022-06-21 01:28:36 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/>
{'Link_without_base_url': 'http://quotes.toscrape.com/author/Jane-Austen', 'Text': '“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”'}
2022-06-21 01:28:36 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/>
{'Link_without_base_url': 'http://quotes.toscrape.com/author/Marilyn-Monroe', 'Text': '“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”'}
2022-06-21 01:28:36 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/>
{'Link_without_base_url': 'http://quotes.toscrape.com/author/Albert-Einstein', 'Text': '“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”'}
2022-06-21 01:28:36 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/>
{'Link_without_base_url': 'http://quotes.toscrape.com/author/Andre-Gide', 'Text': '“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”'}
2022-06-21 01:28:36 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/>
{'Link_without_base_url': 'http://quotes.toscrape.com/author/Thomas-A-Edison', 'Text': '“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”'}
2022-06-21 01:28:36 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/>
{'Link_without_base_url': 'http://quotes.toscrape.com/author/Eleanor-Roosevelt', 'Text': '“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”'}
2022-06-21 01:28:36 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/>
{'Link_without_base_url': 'http://quotes.toscrape.com/author/Steve-Martin', 'Text': '“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”'}
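
Note that the output above still lists some author links more than once, and in the question's spider `url_list` is re-created on every call to `parse`, so it can only filter repeats within a single page. A minimal sketch of one way to drop repeats across pages is shown below. The helper name `unique_links` and its `seen` parameter are hypothetical (they are not part of Scrapy or of the original answer); the idea is simply to keep one shared set of URLs already reported.

```python
from urllib.parse import urljoin


def unique_links(hrefs, base_url, seen=None):
    """Yield absolute URLs built from hrefs, skipping empty values and repeats.

    `seen` is a set shared between calls: pass the same set for every page
    so that a link found on several pages is reported only once.
    """
    if seen is None:
        seen = set()
    for href in hrefs:
        if not href:  # a selector's .get() returns None when nothing matches
            continue
        url = urljoin(base_url, href)  # resolve relative links against the page URL
        if url in seen:
            continue
        seen.add(url)
        yield url


if __name__ == "__main__":
    seen = set()  # shared across "pages"
    page1 = ["/author/Albert-Einstein", "/author/J-K-Rowling", "/author/Albert-Einstein", None]
    page2 = ["/author/Albert-Einstein", "/author/Jane-Austen"]
    print(list(unique_links(page1, "http://quotes.toscrape.com/", seen)))
    print(list(unique_links(page2, "http://quotes.toscrape.com/", seen)))
```

Inside a spider, one option would be to create the set once as an attribute of the spider and call `unique_links(response.css('a::attr(href)').getall(), response.url, self.seen)` from `parse`, yielding one item per returned URL. This assumes a single spider process; the set lives in memory and is not persisted between runs.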
