Removing duplicates from a CSV with quoted fields and embedded commas

sc4hvdpw  posted 12 months ago in Other

I have a fairly large CSV, about 5 GB, with entries like this:

"8976897","this is the abstract text of, some document, and can pretty much contain anything"
"23423","this is the subject text of, some document, and can pretty much contain anything"
"23","this is the full text of, some document, and can pretty much contain anything"
"3443","this is the subject text of, some document, and can pretty much contain anything"

Eyeballing the second column shows many exact duplicates. I'd like to remove them. My questions are:

[1] Is `sort` the correct tool for this job?
[2] How do I ask sort to work only on the second column?
[3] Does sort (with the -u flag) find duplicates anywhere in the file, or only on immediately adjacent lines?

I tried this:

sort -u infile > outfile

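It seems to work, but the file is so large that I can't check whether it really did what I wanted, since nowhere on the command line do I tell it to operate on the second column.
Apologies if this is a silly question.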


isr3a4wc #1

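Because your data has quoted fields with embedded commas, simple tools like sort aren't suited to this task. You need something that natively understands the CSV format.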
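To make the embedded-comma problem concrete, here is a small Python sketch (Python is my choice for illustration, not something the question uses) comparing naive comma splitting with the standard `csv` module on one of the sample rows:

```python
import csv
import io

# One row from the question; the second field contains commas inside quotes.
line = '"23","this is the full text of, some document, and can pretty much contain anything"\n'

# Naive comma splitting cuts the quoted second field into pieces.
naive_fields = line.rstrip("\n").split(",")

# csv.reader honours the quotes and keeps the field whole.
parsed_fields = next(csv.reader(io.StringIO(line)))

print(len(naive_fields))   # 4
print(len(parsed_fields))  # 2
```

Any tool that treats every comma as a delimiter sees four fields here, so "the second column" means something different to it than it does to a CSV parser.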
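Here's a perl one-liner that skips any line whose second column has already been printed once (in other words, if there are duplicates, it prints only the first entry):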

$ perl -MText::CSV_XS -e '
    my $csv = Text::CSV_XS->new({binary=>1, always_quote=>1});
    while (my $rec = $csv->getline(*ARGV)) {
      $csv->say(*STDOUT, $rec) unless $seen{$rec->[1]}++
    }' input.csv
"8976897","this is the abstract text of, some document, and can pretty much contain anything"
"23423","this is the subject text of, some document, and can pretty much contain anything"
"23","this is the full text of, some document, and can pretty much contain anything"

It does rely on the non-standard Text::CSV_XS module, available through your OS package manager or favorite CPAN client.
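If installing an extra module isn't an option, the same keep-the-first-occurrence logic can be sketched with Python's standard `csv` module. This is only an illustration under my own assumptions: the function name `dedupe_by_column` and the in-memory sample are hypothetical, and column index 1 is assumed to be the text column.

```python
import csv
import io


def dedupe_by_column(src, dst, column=1):
    """Copy CSV rows from src to dst, keeping only the first row seen
    for each distinct value in `column` (0-based; 1 = second column)."""
    reader = csv.reader(src)
    # QUOTE_ALL mirrors the fully quoted style of the sample data.
    writer = csv.writer(dst, quoting=csv.QUOTE_ALL, lineterminator="\n")
    seen = set()
    for row in reader:
        if len(row) <= column:
            # Row too short to have the key column: pass it through unchanged.
            writer.writerow(row)
            continue
        key = row[column]
        if key in seen:
            continue
        seen.add(key)
        writer.writerow(row)


# Small in-memory demonstration using the rows from the question.
sample = (
    '"8976897","this is the abstract text of, some document, and can pretty much contain anything"\n'
    '"23423","this is the subject text of, some document, and can pretty much contain anything"\n'
    '"23","this is the full text of, some document, and can pretty much contain anything"\n'
    '"3443","this is the subject text of, some document, and can pretty much contain anything"\n'
)
out = io.StringIO()
dedupe_by_column(io.StringIO(sample), out)
result = out.getvalue()
print(result, end="")
```

For a real file you would pass handles opened with `newline=""`, as the `csv` documentation recommends. Like the perl version, this streams the input row by row but keeps every distinct second-column value in memory, so on a 5 GB file with mostly unique text the `seen` set could get large; storing a fixed-size hash of each value instead is one possible way to reduce that.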
Warning: blatant self-promotion ahead.
A similar approach, using my tawk utility, an awk-like program built around tcl that has a CSV-aware input mode:

$ tawk -csv 'line { if {![info exists seen($F(2))]} { set seen($F(2)) 1; print }}' input.csv
"8976897","this is the abstract text of, some document, and can pretty much contain anything"
"23423","this is the subject text of, some document, and can pretty much contain anything"
"23","this is the full text of, some document, and can pretty much contain anything"

y4ekin9u #2

I think choose is the tool for this workload (I'm the author).
Here's the solution:

$ cat file_contents | choose -u --field '^[^,]*+\K.*+'
"8976897","this is the abstract text of, some document, and can pretty much contain anything"
"23423","this is the subject text of, some document, and can pretty much contain anything"
"23","this is the full text of, some document, and can pretty much contain anything"

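This outputs the first instance of each line, with uniqueness determined only by the part of the line that the field arg matches. See the expression here. GNU sort can't do this because it can only key on what lies between commas, but your data fields can themselves contain commas.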
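As I read the pattern, `^[^,]*+\K.*+` skips the leading run of non-comma characters (the ID field) and uses the rest of the line as the uniqueness key. A rough Python sketch of the same idea follows; it is not how choose is implemented. Python's `re` has no `\K`, so a capture group stands in for it, and the name `unique_by_tail` is made up for this example.

```python
import re

# Capture everything after the leading run of non-comma characters.
# The group plays the role of the text that follows \K in the original pattern.
KEY = re.compile(r"^[^,]*(.*)")


def unique_by_tail(lines):
    """Return the first line seen for each distinct 'tail' (text from the first comma on)."""
    seen = set()
    kept = []
    for line in lines:
        key = KEY.match(line).group(1)
        if key not in seen:
            seen.add(key)
            kept.append(line)
    return kept


lines = [
    '"8976897","this is the abstract text of, some document, and can pretty much contain anything"',
    '"23423","this is the subject text of, some document, and can pretty much contain anything"',
    '"23","this is the full text of, some document, and can pretty much contain anything"',
    '"3443","this is the subject text of, some document, and can pretty much contain anything"',
]
kept = unique_by_tail(lines)
print(len(kept))  # 3
```

This line-based key only works because the first field never contains a comma and each record sits on a single line; a quoted field with an embedded newline would need a real CSV parser.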
