python - Remove duplicate domain URLs from a text file using Bash

dauxcl2d asked on 2023-02-02 in Python

The text file:

https://www.google.com/1/
https://www.google.com/2/
https://www.google.com
https://www.bing.com
https://www.bing.com/2/
https://www.bing.com/3/

Expected output:

https://www.google.com/1/
https://www.bing.com

What I tried:

awk -F'/' '!a[$3]++' $file;

Output:

https://www.google.com/1/
https://www.google.com
https://www.bing.com
https://www.bing.com/2/

I have tried various snippets and none of them worked as expected. I just want to keep one URL per domain from the list.
Please show me how to do this with a Bash script or Python.
PS: I want to filter and keep the full URL, not just the root domain.


oaxa6hgo1#

awk with / as the field separator: splitting on /, $1 is https:, $2 is empty and $3 is the hostname, so !seen[$3]++ prints only the first line seen for each host:

awk -F '/' '!seen[$3]++' file

If your file contains Windows line endings, the stray carriage return sticks to $3 on lines without a trailing slash (www.bing.com\r and www.bing.com then count as different hosts), so I suggest:

dos2unix < file | awk -F '/' '!seen[$3]++'

Output:

https://www.google.com/1/
https://www.bing.com
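
If dos2unix is not available, the carriage returns can be stripped during the dedupe itself. A rough Python sketch of the same host-keyed logic (the file name file.txt is an assumption):

# Rough Python equivalent of the awk one-liner, tolerant of CRLF endings
# ("file.txt" is a hypothetical name)
seen = set()
with open("file.txt", newline="") as fin:  # newline="" keeps any \r visible
    for line in fin:
        line = line.rstrip("\r\n")
        parts = line.split("/")
        # parts[2] corresponds to awk's $3; like awk, fall back to "" when
        # the line has fewer than three fields
        host = parts[2] if len(parts) > 2 else ""
        if host not in seen:
            seen.add(host)
            print(line)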

yzckvree2#

A Python solution, using the unique_everseen itertools recipe together with urllib.parse.urlparse. Let file.txt contain:

https://www.google.com/1/
https://www.google.com/2/
https://www.google.com
https://www.bing.com
https://www.bing.com/2/
https://www.bing.com/3/

Then:

from itertools import filterfalse
from urllib.parse import urlparse

def unique_everseen(iterable, key=None):
    "List unique elements, preserving order. Remember all elements ever seen."
    # unique_everseen('AAAABBBCCDAABBB') --> A B C D
    # unique_everseen('ABBcCAD', str.lower) --> A B c D
    seen = set()
    if key is None:
        for element in filterfalse(seen.__contains__, iterable):
            seen.add(element)
            yield element
    else:
        for element in iterable:
            k = key(element)
            if k not in seen:
                seen.add(k)
                yield element

def get_netloc(url):
    return urlparse(url).netloc

with open("file.txt","r") as fin:
    with open("file_uniq.txt","w") as fout:
        for line in unique_everseen(fin,key=get_netloc):
            fout.write(line)

This creates a file file_uniq.txt with the following content:

https://www.google.com/1/
https://www.bing.com
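
For what it's worth, the same recipe ships in the third-party more-itertools package, so the hand-copied generator can be dropped if a dependency is acceptable. A sketch under that assumption, with the same file names:

from urllib.parse import urlparse
from more_itertools import unique_everseen  # assumes: pip install more-itertools

def get_netloc(url):
    return urlparse(url).netloc

# keep only the first URL seen for each network location (host)
with open("file.txt") as fin, open("file_uniq.txt", "w") as fout:
    for line in unique_everseen(fin, key=get_netloc):
        fout.write(line)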
