# generate 10 million lines, 114M size
(for i in {1..10000000} ; do
# note the deliberate space after the comma, following the original question
echo hi, $i
done) > file
cat file | shuf -o file
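The loop above calls `echo` ten million times, which can take a while in bash. A possible alternative, assuming GNU coreutils (`seq`, `shuf`) and `sed` are available, builds the same kind of file in a single pipeline. The `make_test_file` helper and its arguments are made up for this sketch and are not part of the original post.

```shell
#!/bin/sh
# Hypothetical helper (not from the original post): write COUNT shuffled
# lines of the form "hi, <number>" to OUTFILE.
make_test_file() {
    count=$1
    outfile=$2
    # seq prints 1..count, sed prefixes each number with "hi, " (note the
    # space after the comma), and shuf randomises the line order.
    seq 1 "$count" | sed 's/^/hi, /' | shuf > "$outfile"
}

# Small demonstration; pass 10000000 to match the size used in the benchmark.
demo_file=$(mktemp)
make_test_file 1000 "$demo_file"
head -3 "$demo_file"
```

Writing to a fresh file also avoids reading and overwriting the same file in one pipeline.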
Measuring with the `time` command:
$ >choose_out.csv <file command time --verbose -- choose --sort-reverse -n --out=50 --field '^[^,]*+. \K.*+'
Command being timed: "choose --sort-reverse -n --out=50 --field ^[^,]*+. \K.*+"
User time (seconds): 12.06
System time (seconds): 0.06
Percent of CPU this job got: 99%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:12.13
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 4480
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 193
Voluntary context switches: 1
Involuntary context switches: 73
Swaps: 0
File system inputs: 0
File system outputs: 8
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
$ command time --verbose -- sort -t',' -k2 -nr file | head -50 >sort_out.csv
Command terminated by signal 13
Command being timed: "sort -t, -k2 -nr file"
User time (seconds): 48.07
System time (seconds): 1.46
Percent of CPU this job got: 332%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:14.90
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 675968
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 168596
Voluntary context switches: 16
Involuntary context switches: 3528
Swaps: 0
File system inputs: 0
File system outputs: 0
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
2 Answers

Answer 1
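You could say:

sort -t',' -k2 -nr file

For the given input, it returns the lines ordered by the second column, largest first. It uses `sort` with the following parameters:

- `-t','` sets the field separator to a comma.
- `-k2` tells `sort` to work on the 2nd column.
- `-nr` sorts numerically (`n`) and in reverse (`r`), so the highest value comes first.

To get the first 50 lines, pipe the result to `head -50`, which keeps the first 50 lines of the previous command's output:

sort -t',' -k2 -nr file | head -50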
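As a quick check of how these flags behave, the sketch below runs the same pipeline on a five-line sample. The sample rows and the temporary file are made up for illustration and do not come from the original question; `head -3` stands in for `head -50` so that the cut is visible on such a small input.

```shell
#!/bin/sh
# Made-up sample data (not from the original question): name,value rows.
sample=$(mktemp)
printf '%s\n' 'a,3' 'b,10' 'c,7' 'd,1' 'e,25' > "$sample"

# -t','   use a comma as the field separator
# -k2     sort on the key that starts at field 2
# -nr     compare numerically, largest first
# head -3 keeps only the first three lines of the sorted output
top3=$(sort -t',' -k2 -nr "$sample" | head -3)
echo "$top3"

rm -f "$sample"
```

On this sample the three lines printed are `e,25`, `b,10` and `c,7`, in that order.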
Answer 2

The currently accepted answer is:

sort -t',' -k2 -nr file | head -50
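This uses a lot of memory: GNU sort reads the whole file, sorts it, and only then sends it to the output. If it knew about the downstream context, it could do things differently, such as dropping lines that are no longer needed.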
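To illustrate what dropping lines that are no longer needed could look like, here is a minimal sketch that keeps only the best N lines seen so far, so memory grows with N rather than with the size of the file. It illustrates the idea with awk and is not a description of how choose or GNU sort are implemented; the `top_n` function name and its arguments are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: top_n FILE N
# Print the N lines of FILE whose second comma-separated field is numerically
# largest, highest first, while holding at most N lines in memory.
top_n() {
    awk -F',' -v n="$2" '
    BEGIN { n += 0 }                     # force N to be a number (N >= 1 assumed)
    {
        v = $2 + 0                       # numeric value of the second field
        if (count == n && v <= val[n])   # buffer full, and no better than the worst kept line
            next                         # this line can never reach the output: drop it
        if (count < n) count++           # buffer not full yet: open a new slot
        i = count                        # start at the last slot, overwriting the worst entry
        while (i > 1 && val[i-1] < v) {  # shift smaller entries down by one
            val[i] = val[i-1]; line[i] = line[i-1]
            i--
        }
        val[i] = v; line[i] = $0         # place the new line in sorted position
    }
    END { for (i = 1; i <= count; i++) print line[i] }
    ' "$1"
}

# Made-up sample in the same "hi, <number>" format as the benchmark file.
sample=$(mktemp)
printf '%s\n' 'hi, 3' 'hi, 10' 'hi, 7' 'hi, 1' 'hi, 25' > "$sample"
top_n "$sample" 2
```

For a small N such as 50, the insertion step touches at most N entries per accepted line, and most lines of a large file are rejected by the first comparison.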
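This is where choose comes in (I am the author). Here is the equivalent command:

cat file | tail -n +2 | choose --out=50 --sort-reverse -n --field '^[^,]*+. \K.*+'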
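Explanation:

- `cat file`: reads the file
- `tail -n +2`: removes the header line (the accepted answer does not do this, but it is needed for correctness)
- `--out=50`: limits the output to the first 50 lines
- `--sort-reverse -n`: sorts in reverse numeric order
- `--field '^[^,]*+. \K.*+'`: matches the number on each line. The expression is tailored to the specific file format in the question. Other examples include:
  - `^[^,]*+.\K[^,]*+` matches the second field of a CSV
  - `^(?>(?:[^,]*+.){N})\K[^,]*+` matches the Nth field, counting from zero (replace N; N=1 gives the second field)

From here on the header is omitted, but `tail` really should be used to remove the header line. For the benchmark that follows, the test file is the one generated by the script at the top of this post.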
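The field patterns listed in the explanation can be tried outside of choose. The sketch below uses GNU grep with `-o` (print only the matched part) and `-P` (Perl-compatible regular expressions); that is an assumption about the environment, since it needs a grep built with PCRE support. It only shows which part of a line each pattern selects, because `\K` discards everything matched before it. How choose applies `--field` internally is not covered here, and the sample lines are made up.

```shell
#!/bin/sh
# Made-up sample lines; the first mimics the benchmark file format.
line1='hi, 42'
line2='a,b,c,d'

# Question-specific pattern: skip the first field, the comma and one space,
# then keep the rest of the line.
num=$(printf '%s\n' "$line1" | grep -oP '^[^,]*+. \K.*+')

# Second CSV field: skip one field and its separator, keep up to the next comma.
second=$(printf '%s\n' "$line2" | grep -oP '^[^,]*+.\K[^,]*+')

# Generic form with N=2: two fields are skipped, so the match is the third one.
third=$(printf '%s\n' "$line2" | grep -oP '^(?>(?:[^,]*+.){2})\K[^,]*+')

echo "$num $second $third"
```

The possessive quantifiers (`*+`) and the atomic group (`(?>...)`) stop the regex engine from backtracking into fields it has already consumed.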
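Measured with the `time` command, the results are:

| | choose | sort |
|--|--|--|
| Elapsed (wall clock) time (m:ss) | 0:12.13 | 0:14.90 |
| User time (seconds) | 12.06 | 48.07 |
| Maximum resident set size (kbytes) | 4480 | 675968 |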
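For this test case, choose is faster in elapsed time, uses less CPU time, and uses less memory. Note that sort ran on several cores (332% CPU), which is why its user time is larger than its elapsed time.
The elapsed-time comparison will vary with hardware, but the general point of using less CPU and less memory stays the same.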
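To put the comparison in relative terms, the ratios can be computed directly from the figures in the `time` output. This is only arithmetic on those reported numbers, sketched with awk; the `ratio` helper is a made-up name.

```shell
#!/bin/sh
# Hypothetical helper: print (second argument / first argument) to two decimals.
ratio() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f\n", b / a }'
}

# Figures copied from the time output: choose first, sort second.
cpu=$(ratio 12.06 48.07)     # user time, seconds
mem=$(ratio 4480 675968)     # maximum resident set size, kbytes
wall=$(ratio 12.13 14.90)    # elapsed time, seconds

echo "sort relative to choose: ${cpu}x user time, ${mem}x peak memory, ${wall}x elapsed time"
```

In this run sort needed roughly 4 times the user CPU time and about 150 times the peak memory of choose, while the elapsed times differ by only about 23%.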