Scrapy: DropItem for URLs containing a substring

30byixjq · asked 2022-11-09

I'm fairly new to Python, and I'm using Scrapy. Right now I have two spiders, one for Google and one for the page itself. I plan to combine them, but haven't yet because I want to troubleshoot the page spider on its own. Both spiders work fine, but I want to be able to remove the internal links from my list of scraped links. I've tried a million different approaches, including find and regex, renaming the variables, not using variables at all, and adding "self" to the expression, but nothing seems to affect it. The pipeline is enabled, yet it appears to do nothing. Any help is appreciated.
pipelines.py

from scrapy.exceptions import DropItem

class SpiderValidationPipeline:
    def drop_links(self, item, spider):
        url = str(item.get('links'))
        marker = '#'

        if item.get('links'):
            if marker in url:
                raise DropItem("Internal Link")
        else:
            return item

items.py

import scrapy
from scrapy.loader import ItemLoader
from itemloaders.processors import TakeFirst, MapCompose
from w3lib.html import remove_tags

def remove_nt(text):
    return text.replace('\n', '').replace('\t', '').replace('[edit]', '').replace('/sæs/', '').replace('\"', '')\
        .replace('\u2014', '—')

class GoogleCrawlItem(scrapy.Item):

    title = scrapy.Field(input_processor=MapCompose(remove_tags), output_processor=TakeFirst())
    link = scrapy.Field(input_processor=MapCompose(remove_tags), output_processor=TakeFirst())
    desc = scrapy.Field(input_processor=MapCompose(remove_tags), output_processor=TakeFirst())

class PageCrawlItem(scrapy.Item):

    title = scrapy.Field(input_processor=MapCompose(remove_tags), output_processor=TakeFirst())
    meta = scrapy.Field()
    h1 = scrapy.Field(input_processor=MapCompose(remove_tags))
    h2 = scrapy.Field(input_processor=MapCompose(remove_tags, remove_nt))
    h3 = scrapy.Field(input_processor=MapCompose(remove_tags, remove_nt))
    h4 = scrapy.Field(input_processor=MapCompose(remove_tags, remove_nt))
    paragraph = scrapy.Field(input_processor=MapCompose(remove_tags, remove_nt))
    links = scrapy.Field(input_processor=MapCompose(remove_tags))

pagespider.py

import scrapy
from scrapy.loader import ItemLoader
from google_crawl.items import PageCrawlItem

class PageSpider(scrapy.Spider):
    name = 'page'
    start_urls = ['https://en.wikipedia.org/wiki/Software_as_a_service']

    def parse(self, response):

        for meta_element in response.css('head'):
            page_item = ItemLoader(item=PageCrawlItem(), selector=meta_element)

            page_item.add_css('title', 'title')
            page_item.add_css('meta', 'meta')

            yield page_item.load_item()

        for par_item in response.css('body'):
            par_item = ItemLoader(item=PageCrawlItem(), selector=par_item)

            par_item.add_css('paragraph', 'p')
            par_item.add_css('h1', 'h1')

            yield par_item.load_item()

        for h2s in response.css('body'):
            h2_item = ItemLoader(item=PageCrawlItem(), selector=h2s)

            h2_item.add_css('h2', 'h2')

            yield h2_item.load_item()

        for h3s in response.css('body'):
            h3_item = ItemLoader(item=PageCrawlItem(), selector=h3s)

            h3_item.add_css('h3', 'h3')

            yield h3_item.load_item()

        for h4s in response.css('body'):
            h4_item = ItemLoader(item=PageCrawlItem(), selector=h4s)

            h4_item.add_css('h4', 'h4')

            yield h4_item.load_item()

        for links in response.css('body'):
            link_item = ItemLoader(item=PageCrawlItem(), selector=links)

            link_item.add_css('links', 'a::attr(href)')

            yield link_item.load_item()

settings.py

BOT_NAME = 'google_crawl'

SPIDER_MODULES = ['google_crawl.spiders']
NEWSPIDER_MODULE = 'google_crawl.spiders'

USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36'

ROBOTSTXT_OBEY = False

DOWNLOAD_DELAY = 7

ITEM_PIPELINES = {
    'google_crawl.pipelines.SpiderValidationPipeline': 100,
}

disbfnqx · answer 1

The way your spiders are set up right now, all of the links are yielded in one list inside a single item. The method in your pipeline would only work if the item's links field were a string.
The other problem is that the method in the pipeline needs to be renamed to process_item in order to work with the Scrapy API. Also, since not every item outputs a "links" key, you need to test that the field is present in the item before trying to filter out the unwanted URLs.
For example, simply make the following changes:

pipelines.py

class SpiderValidationPipeline:
    def process_item(self, item, spider):
        # Scrapy only calls this hook when it is named process_item
        if "links" in item:
            # keep only the links that do not contain the '#' marker
            item["links"] = [i for i in item.get("links") if "#" not in i]
        return item
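
The list comprehension above silently strips internal links while keeping the item. If you instead want to drop the whole item whenever any of its links contains the marker, the DropItem approach from the question still works once the method is renamed. A minimal sketch under that assumption (the any() check is my addition, not part of the original code):

from scrapy.exceptions import DropItem

class SpiderValidationPipeline:
    def process_item(self, item, spider):
        # drop the entire item if any scraped link contains '#'
        if "links" in item and any("#" in link for link in item["links"]):
            raise DropItem("Internal Link")
        return item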

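Alternatively, if the goal is to filter link by link, the spider could yield one item per anchor instead of one list per page, so that DropItem discards exactly the unwanted links. A hedged sketch of the final loop in pagespider.py, replacing the existing links loop (untested against the full project):

for href in response.css('body a::attr(href)').getall():
    link_item = ItemLoader(item=PageCrawlItem())
    link_item.add_value('links', href)
    yield link_item.load_item()

With one item per link, each item's links field is a one-element list, so raising DropItem in the pipeline removes just that link.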