python: Remove duplicate domain URLs from a text file using Bash

dauxcl2d · posted 2023-02-02 in Python
Follow (0) | Answers (2) | Views (127)

The text file:

https://www.google.com/1/
https://www.google.com/2/
https://www.google.com
https://www.bing.com
https://www.bing.com/2/
https://www.bing.com/3/

Expected output:

https://www.google.com/1/
https://www.bing.com

What I've tried:

awk -F'/' '!a[$3]++' $file;

Output:

https://www.google.com/1/
https://www.google.com
https://www.bing.com
https://www.bing.com/2/

I've tried various snippets, but none of them worked as expected. I just want to keep one URL per domain from the list.
Please show me how to do this with a Bash script or with Python.
PS: I want to keep the full URL for each domain, not just the root domain.


oaxa6hgo1#

awk with / as the field separator:

awk -F '/' '!seen[$3]++' file
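
To see why this keys on the domain: with / as the separator, a URL such as https://www.google.com/1/ splits into $1 = "https:", $2 = "" (the empty field between the two slashes) and $3 = "www.google.com". A quick illustration (the echo line below is only a demonstration, not part of the solution):

echo 'https://www.google.com/1/' | awk -F '/' '{print $3}'
# prints: www.google.com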

If your file contains Windows line endings (carriage returns), then I suggest:

dos2unix < file | awk -F '/' '!seen[$3]++'
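
If dos2unix is not available, the carriage return can also be stripped inside awk itself; a minimal sketch of the same logic (modifying $0 makes awk re-split the fields, so $3 loses its trailing \r):

awk -F '/' '{sub(/\r$/, "")} !seen[$3]++' file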

Output:

https://www.google.com/1/
https://www.bing.com

yzckvree2#

A Python solution, using the unique_everseen recipe from the itertools documentation together with urllib.parse.urlparse. Let file.txt contain:

https://www.google.com/1/
https://www.google.com/2/
https://www.google.com
https://www.bing.com
https://www.bing.com/2/
https://www.bing.com/3/

then

from itertools import filterfalse
from urllib.parse import urlparse

def unique_everseen(iterable, key=None):
    "List unique elements, preserving order. Remember all elements ever seen."
    # unique_everseen('AAAABBBCCDAABBB') --> A B C D
    # unique_everseen('ABBcCAD', str.lower) --> A B c D
    seen = set()
    if key is None:
        for element in filterfalse(seen.__contains__, iterable):
            seen.add(element)
            yield element
    else:
        for element in iterable:
            k = key(element)
            if k not in seen:
                seen.add(k)
                yield element

def get_netloc(url):
    # Key function: extract the host part, e.g. "www.google.com"
    return urlparse(url).netloc

with open("file.txt", "r") as fin:
    with open("file_uniq.txt", "w") as fout:
        # Write only the first line seen for each netloc, keeping the full URL
        for line in unique_everseen(fin, key=get_netloc):
            fout.write(line)

creates a file file_uniq.txt with the following contents:

https://www.google.com/1/
https://www.bing.com
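
For reference, urlparse("https://www.google.com/1/").netloc evaluates to "www.google.com", so unique_everseen keys each line on its host and writes out only the first full URL seen per host.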
