Sorting and splitting a txt file in Linux based on how many times a particular line is repeated

Posted by ogsagwnx on 2023-03-22 in Linux


I have a 3 GB txt file:

Lucy
Mary 
Lily 
John 
Mary 
Ann
John 
Lily
Lily
Mary
Lucy 
Mark

The output must consist of files like the following:
3instances.txt:

Lily
Lily
Lily
Mary 
Mary
Mary

2instances.txt:

John 
John 
Lucy
Lucy

1instance.txt:

Ann
Mark

linux

Source: https://stackoverflow.com/questions/75787830/sorting-and-splitting-a-txt-file-in-linux-based-on-how-many-times-a-particular-l

3 Answers


siv3szwd1#

If you want to use the shell, you can use the following command, assuming that no name contains spaces:

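awk '{$1=$1; print $1}' INPUT_FILE | sort | uniq -c | awk '{ print $2 > ($1 "instances.txt") }'

awk '{$1=$1; print $1}': trims surrounding whitespace by printing only the first field
sort: groups equal names together
uniq -c: collapses each run of identical lines into a single line prefixed with its count, e.g. 1 Alice
awk '{ print $2 > ($1 "instances.txt") }': writes the name (column 2) to a file named after the count (column 1); the parentheses keep the file-name concatenation unambiguous across awk implementations
This produces: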
1instances.txt

Ann
Mark
.
.
.

2instances.txt

John
Lucy
.
.
.

3instances.txt

Lily
Mary
.
.
.
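The listing above contains each name once per file, whereas the expected output in the question repeats every name as many times as it occurs and uses the singular `1instance.txt` for a count of one. A minimal Python sketch of that variant follows. The function name `split_by_count` and the `out_dir` parameter are hypothetical, and skipping blank lines is an assumption rather than something the question states.

```python
from collections import Counter
from pathlib import Path


def split_by_count(lines, out_dir="."):
    """Write every distinct line to '<N>instance(s).txt', repeated N times.

    `lines` is any iterable of strings, for example an open file object.
    Returns the list of paths written, ordered by count.
    """
    # Strip whitespace so that 'Mary ' and 'Mary' count as the same name.
    counts = Counter(line.strip() for line in lines)
    counts.pop("", None)  # assumption: blank lines are not names

    # Group the names by how often they occurred, in alphabetical order.
    by_count = {}
    for name in sorted(counts):
        by_count.setdefault(counts[name], []).append(name)

    written = []
    for n, names in sorted(by_count.items()):
        suffix = "instance.txt" if n == 1 else "instances.txt"
        path = Path(out_dir) / f"{n}{suffix}"
        with open(path, "w") as f:
            for name in names:
                f.write((name + "\n") * n)  # repeat the name n times
        written.append(path)
    return written
```

It could be called as `with open("INPUT_FILE") as f: split_by_count(f)`, which reads the input line by line and writes the result files into the current directory.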

2023-03-22

ar7v8xwq2#

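There are wizards who can do these things with awk and other shell commands.
For ordinary people, there is Python.
Note that I have tested the code below on your small example, but not on a 3 GB file.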

#!/usr/bin/env python3

from collections import Counter
import sys

def group_by_count(filename):
    """Map each occurrence count to the list of distinct lines seen that many times."""
    with open(filename, 'r') as f:
        # Strip whitespace so 'Mary ' and 'Mary' are counted as the same name.
        c = Counter(line.strip() for line in f)
    groups = {}
    for line, count in c.items():
        groups.setdefault(count, []).append(line)
    return groups

def write_files(groups):
    """Write each group to '<count>instances.txt', one distinct line per row."""
    for n, lines in sorted(groups.items()):
        filename = f'{n}instances.txt'
        with open(filename, 'w') as f:
            for line in lines:
                f.write(line + '\n')

def main(argv):
    if len(argv) > 1:
        groups = group_by_count(argv[1])
        write_files(groups)
    else:
        print('Please specify a file name to read from.')

if __name__ == '__main__':
    main(sys.argv)

Result:

$ chmod +x sort_by_repetitions.py
$ cat test.txt
Lucy
Mary 
Lily 
John 
Mary 
Ann
John 
Lily
Lily
Mary
Lucy 
Mark
$ ./sort_by_repetitions.py test.txt
$ ls *instances*
1instances.txt  2instances.txt  3instances.txt
$ cat 1instances.txt 
Ann
Mark
$ cat 2instances.txt 
Lucy
John
$ cat 3instances.txt 
Mary
Lily
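Because the script was only tried on the small example, one way to gain some confidence before running it on the real file is to generate a larger synthetic input with known counts and check that the tally matches. The sketch below does this under stated assumptions: `make_test_file` and `count_lines` are hypothetical helper names, and the counting step mirrors the `Counter(line.strip() for line in f)` expression used in the script. A `Counter` fed from a generator holds one entry per distinct line, so its memory use depends on the number of distinct names rather than on the file size.

```python
import random
from collections import Counter


def make_test_file(path, counts, seed=0):
    """Write each name counts[name] times, in shuffled order, one per line."""
    lines = [name for name, n in counts.items() for _ in range(n)]
    random.Random(seed).shuffle(lines)  # fixed seed keeps the file reproducible
    with open(path, "w") as f:
        for name in lines:
            f.write(name + "\n")


def count_lines(path):
    """Tally stripped lines while reading the file one line at a time."""
    with open(path) as f:
        return Counter(line.strip() for line in f)
```

For example, `make_test_file("big.txt", {"Lily": 300000, "Mary": 200000, "Ann": 1})` followed by `count_lines("big.txt")` should return those same counts. Note that the generator builds the whole list in memory before writing, so it suits inputs of moderate size rather than a full 3 GB file.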

2023-03-22

bbuxkriu3#

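awk '
    {
        # $0 is compared as-is, so trailing whitespace makes lines distinct
        cnt[$0]++
    }
    END{
        # asorti() is a GNU awk (gawk) extension: sorted[1..n] holds the distinct lines in order
        n=asorti(cnt, sorted)
        for (i=1; i<=n; i++) {
            # singular file name for a count of 1, plural otherwise
            out = cnt[sorted[i]] (cnt[sorted[i]]>1 ? "instances.txt" : "instance.txt")
            # repeat the line as many times as it occurred
            for (j=1; j<=cnt[sorted[i]]; j++)
                print sorted[i] > out
        }
}' file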

$ head ?instance*.txt
==> 1instance.txt <==
Ann
Mark

==> 2instances.txt <==
John
John
Lucy
Lucy

==> 3instances.txt <==
Lily
Lily
Lily
Mary
Mary
Mary

2023-03-22
