Removing duplicates in the second column of a CSV with sort

yrefmtwq, posted 2022-12-16 in Other

I have a fairly large CSV, about 5 GB, with entries like this:

"8976897","this is the abstract text of, some document, and can pretty much contain anything"
"23423","this is the subject text of, some document, and can pretty much contain anything"
"23","this is the full text of, some document, and can pretty much contain anything"
"3443","this is the subject text of, some document, and can pretty much contain anything"

The second column contains many exact duplicates, and I want to remove them. My questions are:

[1] is `sort` the correct tool for this job?
[2] how do I ask sort to work only on the second column?
[3] does sort find duplicates (via -u flag) anywhere on the file or just immediately next line duplicates?

I tried this:

sort -u infile > outfile

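It seems to work, but the file is so large that I cannot check whether it really did what I wanted, since I never told it on the command line to operate on the second column.
Apologies if this is a silly question.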

sulc1iza1#

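Because your data contains quoted fields with commas embedded in them, a simple tool like sort is not suited to this task; you need a tool that understands the CSV format.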
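To make the embedded-comma problem concrete, here is a minimal Python sketch (Python is not used in the original answer; it is chosen here only for illustration). It compares a naive split on commas with the standard library's `csv` parser on one of the sample rows from the question.

```python
import csv
import io

# One sample row from the question: two quoted fields, commas inside the second.
line = '"23","this is the full text of, some document, and can pretty much contain anything"\n'

# Naive approach: treat every comma as a field separator.
naive_fields = line.rstrip("\n").split(",")

# CSV-aware approach: commas inside quotes belong to the field.
parsed_fields = next(csv.reader(io.StringIO(line)))

print(len(naive_fields))   # 4 pieces: the text field is cut at each embedded comma
print(len(parsed_fields))  # 2 fields: the ID and the full text
```

Any tool that selects "column 2" by counting commas would be keying on only a fragment of the text, which is why a CSV-aware parser is needed.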
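Here is a perl one-liner that skips any line whose second column has already been printed once (in other words, when there are duplicates, only the first entry is printed):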

$ perl -MText::CSV_XS -e '
    my $csv = Text::CSV_XS->new({binary=>1, always_quote=>1});
    while (my $rec = $csv->getline(*ARGV)) {
      $csv->say(*STDOUT, $rec) unless $seen{$rec->[1]}++
    }' input.csv
"8976897","this is the abstract text of, some document, and can pretty much contain anything"
"23423","this is the subject text of, some document, and can pretty much contain anything"
"23","this is the full text of, some document, and can pretty much contain anything"

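It does depend on the non-standard Text::CSV_XS module, which is available through your OS package manager or your favorite CPAN client.
Warning: blatant self-promotion ahead.
A similar approach also works with my tawk utility (an awk-like program built around Tcl, with a CSV-aware input mode):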
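If installing an extra module is not an option, the same first-occurrence logic can be sketched with Python's standard `csv` module. This is an illustrative alternative, not part of the original answer; the function name `dedupe_by_column` and the in-memory sample are assumptions made for the example.

```python
import csv
import io

def dedupe_by_column(src, dst, column=1):
    """Copy CSV rows from src to dst, keeping the first row per value in `column`.

    `dedupe_by_column` is a hypothetical helper name; `column` is zero-based,
    so 1 means the second column.
    """
    reader = csv.reader(src)
    # QUOTE_ALL mirrors the always_quote option used in the Perl version.
    writer = csv.writer(dst, quoting=csv.QUOTE_ALL, lineterminator="\n")
    seen = set()
    for row in reader:
        if len(row) <= column:
            # Rows too short to have the key column are passed through unchanged.
            writer.writerow(row)
            continue
        key = row[column]
        if key not in seen:
            seen.add(key)
            writer.writerow(row)

# Sample rows taken from the question.
sample = (
    '"8976897","this is the abstract text of, some document, and can pretty much contain anything"\n'
    '"23423","this is the subject text of, some document, and can pretty much contain anything"\n'
    '"23","this is the full text of, some document, and can pretty much contain anything"\n'
    '"3443","this is the subject text of, some document, and can pretty much contain anything"\n'
)

out = io.StringIO()
dedupe_by_column(io.StringIO(sample), out)
print(out.getvalue(), end="")
```

For a real file, pass file objects opened with `newline=""` instead of the `StringIO` buffers. Note that `seen` holds every distinct second-column value in memory, which may be substantial for a 5 GB input; storing a fixed-size digest of each value instead is one possible way to reduce that.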

$ tawk -csv 'line { if {![info exists seen($F(2))]} { set seen($F(2)) 1; print }}' input.csv
"8976897","this is the abstract text of, some document, and can pretty much contain anything"
"23423","this is the subject text of, some document, and can pretty much contain anything"
"23","this is the full text of, some document, and can pretty much contain anything"
