Scrapy: remove unnecessary URLs from scraped links

Posted by gmxoilav on 2022-11-09 in Other


import scrapy

class TestSpider(scrapy.Spider):
    name = 'test'
    start_urls = ['https://www.ifep.ro/justice/lawyers/lawyerspanel.aspx']
    custom_settings = {
        'CONCURRENT_REQUESTS_PER_DOMAIN': 1,
        'DOWNLOAD_DELAY': 1,
        'USER_AGENT': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36'
    }

    def parse(self, response):
        # Collect every href found inside the list-group container.
        books = response.xpath("//div[@class='list-group']//@href").getall()
        for book in books:
            # Resolve relative links against the page URL.
            url = response.urljoin(book)
            print(url)

I want to remove the following unnecessary URLs from the extracted links. The site is https://www.ifep.ro/justice/lawyers/lawyerspanel.aspx

http://www.unbr.ro
http://www.inppa.ro
http://www.uniuneanotarilor.ro/
http://www.caav.ro
http://www.executori.ro/
http://www.csm1909.ro
http://www.inm-lex.ro
http://www.just.ro

scrapy

Source: https://stackoverflow.com/questions/72671624/remove-unnecessary-url-from-scrapy

1 answer


cgyqldqp1#

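You can use the endswith method together with the continue keyword to skip the unwanted URLs. All of the links to be dropped end in .ro or .ro/, while the lawyer detail links end in a query string, so a suffix check separates the two groups.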

import scrapy

class TestSpider(scrapy.Spider):
    name = 'test'
    start_urls = ['https://www.ifep.ro/justice/lawyers/lawyerspanel.aspx']
    custom_settings = {
        'CONCURRENT_REQUESTS_PER_DOMAIN': 1,
        'DOWNLOAD_DELAY': 1,
        'USER_AGENT': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36'
    }

    def parse(self, response):
        # Collect every href found inside the list-group container.
        books = response.xpath("//div[@class='list-group']//@href").getall()
        for book in books:
            # Resolve relative links against the page URL.
            url = response.urljoin(book)
            # Skip bare site links such as http://www.unbr.ro or http://www.executori.ro/
            if url.endswith(('.ro', '.ro/')):
                continue
            print(url)

Output:

https://www.ifep.ro/justice/lawyers/LawyerFile.aspx?RecordId=1091&Signature=378270
https://www.ifep.ro/justice/lawyers/LawyerFile.aspx?RecordId=159077&Signature=378270
https://www.ifep.ro/justice/lawyers/LawyerFile.aspx?RecordId=159076&Signature=378270
https://www.ifep.ro/justice/lawyers/LawyerFile.aspx?RecordId=159075&Signature=378270
https://www.ifep.ro/justice/lawyers/LawyerFile.aspx?RecordId=159021&Signature=378270
https://www.ifep.ro/justice/lawyers/LawyerFile.aspx?RecordId=159020&Signature=378270
https://www.ifep.ro/justice/lawyers/LawyerFile.aspx?RecordId=159019&Signature=378270
https://www.ifep.ro/justice/lawyers/LawyerFile.aspx?RecordId=159018&Signature=378270
https://www.ifep.ro/justice/lawyers/LawyerFile.aspx?RecordId=21846&Signature=378270
https://www.ifep.ro/justice/lawyers/LawyerFile.aspx?RecordId=165927&Signature=378270
https://www.ifep.ro/justice/lawyers/LawyerFile.aspx?RecordId=83465&Signature=378270
https://www.ifep.ro/justice/lawyers/LawyerFile.aspx?RecordId=47724&Signature=378270
https://www.ifep.ro/justice/lawyers/LawyerFile.aspx?RecordId=32097&Signature=378270
https://www.ifep.ro/justice/lawyers/LawyerFile.aspx?RecordId=29573&Signature=378270
https://www.ifep.ro/justice/lawyers/LawyerFile.aspx?RecordId=19880&Signature=378270
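
The suffix check works here because every unwanted link happens to end in .ro or .ro/. A possible alternative, shown only as a sketch and not part of the original answer, is to keep just the links whose host matches the start page. The helper name same_site_links and the sample hrefs below are hypothetical, and the relative href form is an assumption about the page markup.

```python
from urllib.parse import urljoin, urlparse


def same_site_links(base_url, hrefs):
    """Hypothetical helper: resolve hrefs against base_url and keep
    only the ones that point at the same host as base_url."""
    base_host = urlparse(base_url).netloc.lower()
    kept = []
    for href in hrefs:
        # urljoin leaves absolute URLs untouched and resolves relative ones.
        url = urljoin(base_url, href)
        if urlparse(url).netloc.lower() == base_host:
            kept.append(url)
    return kept


if __name__ == "__main__":
    base = "https://www.ifep.ro/justice/lawyers/lawyerspanel.aspx"
    # Sample hrefs (assumed shapes): one relative detail link, two external sites.
    hrefs = [
        "LawyerFile.aspx?RecordId=1091&Signature=378270",
        "http://www.unbr.ro",
        "http://www.executori.ro/",
    ]
    for url in same_site_links(base, hrefs):
        print(url)
```

Inside the spider, the same comparison could be made between urlparse(url).netloc and urlparse(response.url).netloc before printing. This avoids depending on how the external URLs happen to end, at the cost of also dropping any external link that might be wanted.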

Answered on 2022-11-09
