Parsing a large CSV in bash with unusual characters, spaces, brackets, and irregular line returns

pdtvr36n · posted 2023-06-03 · category: Other
Answers (6) · Views (320)

I have a very large (1.5 GB) malformed CSV file that I need to read into R. The file itself is a CSV, but the delimiters break down after a certain number of lines because of poorly placed line returns.
I have a simplified example attached, but a truncated visual representation of it looks like this:

SubID,Date1,date2,var1,var2,broken_column_var
000000000,0000-00-00,0000-00-00,0,FIRST TEXT FOR ZERO,"[[ -0.00000000   0.00000000  -0.00000000  -0.00000000   0.00000000
   -0.00000000  -0.00000000   0.00000000   0.00000000   0.00000000
    0.00000000   0.00000000   0.00000000]
 [ -0.00000000  -0.0000000   -0.00000000  -0.00000000  -0.0000000
   -0.0000000   -0.0000000    0.00000000   0.00000000  -0.00000000
   -0.00000000   0.00000000   0.0000000 ]]"
000000000,1111-11-11,1111-11-11,1,SECOND TEXT FOR ZERO,"[[  1.11111111  -1.11111111  -1.1111111   -1.1111111    1.1111111
    1.11111111   1.11111111   1.11111111]]"
000000000,2222-22-22,2222-22-22,2,THIRD TEXT FOR ZERO,"[[-2.2222222   2.22222222 -2.22222222 -2.22222222  2.2222222  -2.22222222
  -2.22222222 -2.22222222 -2.22222222  2.22222222  2.22222222  2.22222222]
 [-2.22222222 -2.22222222  2.22222222  2.2222222   2.22222222 -2.22222222
   2.2222222  -2.2222222   2.22222222  2.2222222   2.222222   -2.22222222]
 [-2.22222222 -2.2222222   2.22222222  2.2222222   2.22222222 -2.22222222
  -2.22222222 -2.2222222  -2.22222222  2.22222222  2.2222222   2.22222222]
 [-2.22222222 -2.22222222  2.2222222   2.2222222   2.2222222  -2.22222222
  -2.222222   -2.2222222  -2.2222222  -2.22222222  2.22222222  2.2222222 ]
 [-2.22222222 -2.222222    2.22222222  2.22222222  2.22222222 -2.2222222
  -2.2222222  -2.2222222  -2.2222222  -2.22222222  2.22222222 -2.222222  ]
 [ 2.22222222 -2.22222222 -2.222222   -2.222222   -2.2222222  -2.22222222
  -2.222222   -2.22222222  2.2222222  -2.2222222   2.2222222   2.22222222]]"
111111111,0000-00-00,0000-00-00,00,FIRST TEXT FOR ONE,"[[ -0.00000000   0.00000000  -0.00000000   0.000000    -0.00000000
   -0.00000000   0.00000000   0.00000000]]"

Newlines within the CSV are all \n.
To avoid loading it all into memory and trying to parse it as a dataframe in another environment, I have been trying to print the relevant snippets of the CSV to the terminal, remove the line returns, collapse the whitespace, and insert commas between the values, like this:

000000000,0000-00-00,0000-00-00,0,FIRST TEXT FOR ZERO,"[[-0.00000000,0.00000000,-0.00000000,-0.00000000,0.00000000,-0.00000000,-0.00000000,0.00000000,0.00000000,0.00000000,0.00000000,0.00000000,0.00000000],[-0.00000000,-0.0000000,-0.00000000,-0.00000000,-0.0000000,-0.0000000,-0.0000000,0.00000000,0.00000000,-0.00000000,-0.00000000,0.00000000,0.0000000]]"

My main attempt pulls everything between the brackets onto a single line:

awk '/\"\[\[/{found=1} found{print; if (/]]"/) exit}'  Malformed_csv_Abridged.csv | tr -d '\n\r' | tr -s ' ' | tr ' ' ','

Output:

000000000,0000-00-00,0000-00-00,0,FIRST,TEXT,FOR,ZERO,"[[,-0.00000000,0.00000000,-0.00000000,-0.00000000,0.00000000,-0.00000000,-0.00000000,0.00000000,0.00000000,0.00000000,0.00000000,0.00000000,0.00000000],[,-0.00000000,-0.0000000,-0.00000000,-0.00000000,-0.0000000,-0.0000000,-0.0000000,0.00000000,0.00000000,-0.00000000,-0.00000000,0.00000000,0.0000000,]]"

Very close, but:
1. It *only* prints the first instance, so I need a way to find the other occurrences.
2. It inserts commas into the spaces before the characters I am searching for ("[[ and ]]"), which I don't want it to do.
3. It leaves some extra commas next to the brackets that I haven't managed to remove with the right tr call because of the escape characters required.
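For comparison, a CSV-aware parser sidesteps all three problems at once: Python's csv module already reads the quoted, multi-line field as a single value, so only the whitespace-to-comma rewrite inside that field remains. A minimal sketch, where the SAMPLE string is a trimmed stand-in for the real file and compact_field is an illustrative helper name:

```python
import csv
import io
import re

# Trimmed stand-in for the malformed file: one logical row spanning
# three physical lines inside the quoted last field.
SAMPLE = (
    'SubID,Date1,date2,var1,var2,broken_column_var\n'
    '000000000,0000-00-00,0000-00-00,0,FIRST TEXT FOR ZERO,"[[ -0.1   0.2\n'
    '  -0.3]\n'
    ' [ 0.4  0.5 ]]"\n'
)

def compact_field(value):
    """Collapse whitespace runs (including newlines) into single commas,
    then drop the commas that end up hugging a bracket."""
    value = re.sub(r"\s+", ",", value.strip())
    return value.replace("[,", "[").replace(",]", "]")

out = io.StringIO()
reader = csv.reader(io.StringIO(SAMPLE))   # csv handles the quoted newlines
writer = csv.writer(out)
writer.writerow(next(reader))              # header passes through unchanged
for row in reader:
    row[-1] = compact_field(row[-1])
    writer.writerow(row)
print(out.getvalue())
```

In real use the two io.StringIO objects would be replaced by open() calls on the input and output files, which keeps memory use at one record at a time.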


iqih9akk1#

I don't understand your goal: the file looks like a well-formed CSV to me. If you just want to remove the newlines, you can use Miller and the clean-whitespace verb:

mlr --csv clean-whitespace Malformed.csv >Malformed_c.csv

which produces this: https://gist.githubusercontent.com/aborruso/538e964c0c84a8b27d4c3d3b61d23bb4/raw/1fa83f43238be4a6aeb9c743aaf2e4da36f6cc74/Malformed_c.csv


yqyhoc1h2#

Assumptions:

  • the only field that contains double quotes is the last one (broken_column_var)
  • within the last field, we don't have to worry about embedded/escaped double quotes (i.e., for each data line the last field contains exactly two double quotes)
  • every broken_column_var value contains at least one embedded newline (i.e., each broken_column_var value spans at least 2 physical lines); otherwise we would need some additional code to handle both double quotes residing on the same line ... doable, but skipped for now so as not to (further) complicate the proposed code

One (verbose) awk approach to removing the embedded newlines from broken_column_var while also replacing spaces with commas:

awk '
NR==1              { print; next }                      # print header
!in_merge && /["]/ { split($0,a,"\"")                   # 1st double quote found; split line on double quote
                     head     = a[1]                    # save 1st part of line
                     data     = "\"" a[2]               # save double quote and 2nd part of line
                     in_merge = 1                       # set flag
                     next
                   }
 in_merge          { data = data " " $0                 # append current line to "data"
                     if ( $0 ~ /["]/ ) {                # if 2nd double quote found => process "data"
                        gsub(/[ ]+/,",",data)           # replace consecutive spaces with single comma
                        gsub(/,[]]/,"]",data)           # replace ",]" with "]"
                        gsub(/[[],/,"[",data)           # replace "[," with "["
                        print head data                 # print new line
                        in_merge = 0                    # clear flag
                     }
                   }
' Malformed.csv

This generates:

SubID,Date1,date2,var1,var2,broken_column_var
000000000,0000-00-00,0000-00-00,0,FIRST TEXT FOR ZERO,"[[-0.00000000,0.00000000,-0.00000000,-0.00000000,0.00000000,-0.00000000,-0.00000000,0.00000000,0.00000000,0.00000000,0.00000000,0.00000000,0.00000000],[-0.00000000,-0.0000000,-0.00000000,-0.00000000,-0.0000000,-0.0000000,-0.0000000,0.00000000,0.00000000,-0.00000000,-0.00000000,0.00000000,0.0000000]]"
000000000,1111-11-11,1111-11-11,1,SECOND TEXT FOR ZERO,"[[1.11111111,-1.11111111,-1.1111111,-1.1111111,1.1111111,1.11111111,1.11111111,1.11111111]]"
000000000,2222-22-22,2222-22-22,2,THIRD TEXT FOR ZERO,"[[-2.2222222,2.22222222,-2.22222222,-2.22222222,2.2222222,-2.22222222,-2.22222222,-2.22222222,-2.22222222,2.22222222,2.22222222,2.22222222],[-2.22222222,-2.22222222,2.22222222,2.2222222,2.22222222,-2.22222222,2.2222222,-2.2222222,2.22222222,2.2222222,2.222222,-2.22222222],[-2.22222222,-2.2222222,2.22222222,2.2222222,2.22222222,-2.22222222,-2.22222222,-2.2222222,-2.22222222,2.22222222,2.2222222,2.22222222],[-2.22222222,-2.22222222,2.2222222,2.2222222,2.2222222,-2.22222222,-2.222222,-2.2222222,-2.2222222,-2.22222222,2.22222222,2.2222222],[-2.22222222,-2.222222,2.22222222,2.22222222,2.22222222,-2.2222222,-2.2222222,-2.2222222,-2.2222222,-2.22222222,2.22222222,-2.222222],[2.22222222,-2.22222222,-2.222222,-2.222222,-2.2222222,-2.22222222,-2.222222,-2.22222222,2.2222222,-2.2222222,2.2222222,2.22222222]]"
111111111,0000-00-00,0000-00-00,00,FIRST TEXT FOR ONE,"[[-0.00000000,0.00000000,-0.00000000,0.000000,-0.00000000,-0.00000000,0.00000000,0.00000000]]"

kzipqqlq3#

Use the double quote as the field separator. A complete record has 1 or 3 fields.

awk '
  BEGIN {FS = OFS = "\""}
  {$0 = prev $0; $1=$1}
  NF % 2 == 1 {print; prev = ""; next}
  {prev = $0}
  END {if (prev) print prev}
' file.csv
SubID,Date1,date2,var1,var2,broken_column_var
000000000,0000-00-00,0000-00-00,0,FIRST TEXT FOR ZERO,"[[ -0.00000000   0.00000000  -0.00000000  -0.00000000   0.00000000   -0.00000000  -0.00000000   0.00000000   0.00000000   0.00000000    0.00000000   0.00000000   0.00000000] [ -0.00000000  -0.0000000   -0.00000000  -0.00000000  -0.0000000   -0.0000000   -0.0000000    0.00000000   0.00000000  -0.00000000   -0.00000000   0.00000000   0.0000000 ]]"
000000000,1111-11-11,1111-11-11,1,SECOND TEXT FOR ZERO,"[[  1.11111111  -1.11111111  -1.1111111   -1.1111111    1.1111111    1.11111111   1.11111111   1.11111111]]"
000000000,2222-22-22,2222-22-22,2,THIRD TEXT FOR ZERO,"[[-2.2222222   2.22222222 -2.22222222 -2.22222222  2.2222222  -2.22222222  -2.22222222 -2.22222222 -2.22222222  2.22222222  2.22222222  2.22222222] [-2.22222222 -2.22222222  2.22222222  2.2222222   2.22222222 -2.22222222   2.2222222  -2.2222222   2.22222222  2.2222222   2.222222   -2.22222222] [-2.22222222 -2.2222222   2.22222222  2.2222222   2.22222222 -2.22222222  -2.22222222 -2.2222222  -2.22222222  2.22222222  2.2222222   2.22222222] [-2.22222222 -2.22222222  2.2222222   2.2222222   2.2222222  -2.22222222  -2.222222   -2.2222222  -2.2222222  -2.22222222  2.22222222  2.2222222 ] [-2.22222222 -2.222222    2.22222222  2.22222222  2.22222222 -2.2222222  -2.2222222  -2.2222222  -2.2222222  -2.22222222  2.22222222 -2.222222  ] [ 2.22222222 -2.22222222 -2.222222   -2.222222   -2.2222222  -2.22222222  -2.222222   -2.22222222  2.2222222  -2.2222222   2.2222222   2.22222222]]"
111111111,0000-00-00,0000-00-00,00,FIRST TEXT FOR ONE,"[[ -0.00000000   0.00000000  -0.00000000   0.000000    -0.00000000   -0.00000000   0.00000000   0.00000000]]"
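The quote-parity idea behind this (a record is complete once its accumulated text contains an even number of double quotes) can be sketched in Python as well; assemble_records and the sample lines below are illustrative, not from the question:

```python
def assemble_records(lines):
    """Glue physical lines back into logical CSV records: keep appending
    lines until the accumulated text holds an even number of double quotes."""
    buf = ""
    for line in lines:
        buf = buf + ("\n" if buf else "") + line
        if buf.count('"') % 2 == 0:   # balanced quotes => record complete
            yield buf
            buf = ""
    if buf:                           # trailing unterminated record, if any
        yield buf

physical = [
    'SubID,Date1,broken_column_var',
    'id1,2023-01-01,"[[ 1  2',
    '  3  4 ]]"',
    'id2,2023-01-02,"[[ 5 ]]"',
]
records = list(assemble_records(physical))
# records[1] is the two physical lines rejoined into one logical record
```

This relies on the same assumption as the awk version: quotes only ever appear as the delimiters of the last field, never escaped inside it.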

For a language with a CSV library, I find perl's Text::CSV useful for quoted newlines:

perl -e '
  use Text::CSV;
  my $csv = Text::CSV->new ({ binary => 1, auto_diag => 1 });
  open my $fh, "<:encoding(utf8)", "file.csv" or die "test.csv: $!";
  while (my $row = $csv->getline ($fh)) {
    $row->[-1] =~ s/\n//g;
    $csv->say(STDOUT, $row);
  }
'

4sup72z84#

This might work for you (GNU sed):

sed -E '1b
        :a;N;/"$/!ba
        s/"/\n&/
        h
        s/\n/ /2g
        s/.*\n//
        s/ +/,/g
        s/,\]/]/g
        s/\[,/[/g
        H
        g
        s/\n.*\n//' file

Forget about the header line.
Gather up each record.
Introduce a newline before the last field.
Make a copy of the amended record.
Replace all newlines from the second onward with spaces.
Remove everything up to and including the first introduced newline.
Replace runs of spaces with commas.
Remove any introduced commas before or after the square brackets.
Append the last field to the copy.
Make the copy current.
Remove everything between (and including) the introduced newlines.
N.B. Only the last field of each record is double quoted.
Alternative:

sed -E '1b;:a;N;/"$/!ba;y/\n/ /;:b;s/("\S+) +/\1,/;tb;s/,\[/[/g;s/\],/]/g' file

92vpleto5#

You can use GoCSV's replace command to easily strip out the newlines:

gocsv replace          \
  -c broken_column_var \
  -regex '\s+'         \
  -repl ' '            \
  input.csv

That normalizes all runs of whitespace (\s+) to a single space.
A very small Python script can also handle this:

import csv
import re

ws_re = re.compile(r"\s+")

f_in = open("input.csv", newline="")
reader = csv.reader(f_in)

f_out = open("output.csv", "w", newline="")
writer = csv.writer(f_out)

writer.writerow(next(reader))  # transfer header

for row in reader:
    row[5] = ws_re.sub(" ", row[5])
    writer.writerow(row)

bvjveswy6#

Using the bash loadable csv module:

Pure bash, from How to parse a CSV file in Bash?, slightly modified to check the parity of the double quotes.

#!/bin/bash

enable -f /usr/lib/bash/csv csv

exec {FD}< "$1"
read -ru $FD line
csv -a headline "$line"
printf -v fieldfmt '%-8s: "%%q"\\n' "${headline[@]}"

numcols=${#headline[@]}

while read -ru $FD line; do
    while chk=${line//[^\"]}
    csv -a row -- "$line"
    [[ -n ${chk//\"\"} ]] || (( ${#row[@]} < numcols )); do
        read -ru $FD sline || break 2
        line+=$'\n'"$sline"
    done
    printf "$fieldfmt\\n" "${row[@]}"
done
exec {FD}>&-

Run against your *broken_input.csv*, this shows:

SubID   : "000000000"
Date1   : "0000-00-00"
date2   : "0000-00-00"
var1    : "0"
var2    : "FIRST\ TEXT\ FOR\ ZERO"
broken_column_var: "$'[[ -0.00000000   0.00000000  -0.00000000  -0.00000000   0.00000000\n-0.00000000  -0.00000000   0.00000000   0.00000000   0.00000000\n0.00000000   0.00000000   0.00000000]\n[ -0.00000000  -0.0000000   -0.00000000  -0.00000000  -0.0000000\n-0.0000000   -0.0000000    0.00000000   0.00000000  -0.00000000\n-0.00000000   0.00000000   0.0000000 ]]'"

SubID   : "000000000"
Date1   : "1111-11-11"
date2   : "1111-11-11"
var1    : "1"
var2    : "SECOND\ TEXT\ FOR\ ZERO"
broken_column_var: "$'[[  1.11111111  -1.11111111  -1.1111111   -1.1111111    1.1111111\n1.11111111   1.11111111   1.11111111]]'"

SubID   : "000000000"
Date1   : "2222-22-22"
date2   : "2222-22-22"
var1    : "2"
var2    : "THIRD\ TEXT\ FOR\ ZERO"
broken_column_var: "$'[[-2.2222222   2.22222222 -2.22222222 -2.22222222  2.2222222  -2.22222222\n-2.22222222 -2.22222222 -2.22222222  2.22222222  2.22222222  2.22222222]\n[-2.22222222 -2.22222222  2.22222222  2.2222222   2.22222222 -2.22222222\n2.2222222  -2.2222222   2.22222222  2.2222222   2.222222   -2.22222222]\n[-2.22222222 -2.2222222   2.22222222  2.2222222   2.22222222 -2.22222222\n-2.22222222 -2.2222222  -2.22222222  2.22222222  2.2222222   2.22222222]\n[-2.22222222 -2.22222222  2.2222222   2.2222222   2.2222222  -2.22222222\n-2.222222   -2.2222222  -2.2222222  -2.22222222  2.22222222  2.2222222 ]\n[-2.22222222 -2.222222    2.22222222  2.22222222  2.22222222 -2.2222222\n-2.2222222  -2.2222222  -2.2222222  -2.22222222  2.22222222 -2.222222  ]\n[ 2.22222222 -2.22222222 -2.222222   -2.222222   -2.2222222  -2.22222222\n-2.222222   -2.22222222  2.2222222  -2.2222222   2.2222222   2.22222222]]'"

SubID   : "111111111"
Date1   : "0000-00-00"
date2   : "0000-00-00"
var1    : "00"
var2    : "FIRST\ TEXT\ FOR\ ONE"
broken_column_var: "$'[[ -0.00000000   0.00000000  -0.00000000   0.000000    -0.00000000\n-0.00000000   0.00000000   0.00000000]]'"

Your CSV doesn't look so broken any more!!

Note:

Working over a big file (1.5 GB) with bash is not recommended!! You will get better results with python or c and the appropriate libraries!
