读入变量双引号内包含逗号的多行CSV

weylhg0b  于 2023-04-27  发布在  其他
关注(0)|答案(2)|浏览(156)

我正在寻找一种方法在bash只为linux和posix环境(没有gawk,没有python,没有库或语言依赖性,除了linux/posix的默认cli工具)方法,用于阅读多行csv文件到变量中,一次一行进行处理。CSV值在双引号内有逗号,这会破坏现有的代码:

while IFS=, read -r field1 field2 field3 field4 field5 field6 field7 field8 field9 field10 field11 field12 field13 field14 field15 field18 field17 field18 field19 field29 
do

单行CSV输入示例:
"AC XA, S.A.","City of Commerce","00","0A348E541E6F5C258A12A5674AEF25F28BA7DCFAECEECC4EE63B71B361606AC3","Included","Included","Included","Not Included","09/30/2003","09/30/2037","","SZ=City of Commerce; FE=http://website.org; O=AC XA SA CIF A39201827; C=QR","--BINARY BLOB--maXJtYSBTQSBDSUYgQTgyNzQzMjg3MSMwIQYDVQQLExpodHRwOi8vd3d3LmNoYW1iZXJzaWduLm9yZzEiMCAGA1UEAxMZQ2hhbWJlcnMgb2YgQ29tbWVyY2UgUm9vdDAeFw0wMzA5MzAUIC0gMjAxNjESMBAGA1UEBRMJQTgyNzQzMjg3MRgwFgYDVQRhDA9WQVRFUy1BODI3NDMyODcxGzAZBgNVBAoMEkFDIENBTUVSRklSTUEgUy5BLjEnMCUGA1UEAwweR0xPQkFMIENIQU1CRVJTSUdOIFJPT1QgLSAyMDE2MB4XDTE2MDQxNDA3NTAwNloXDTQwMDQwODA3NTAwNlowggEIMQswCQYDVQQGEwJFUzEPMA0GA1UECAwGTUFEUklEMQ8wDQYDVQQHDAZNQURSSUQxOjA4BgNVBAsMMXNlZSBjdXJyZW50IGFkZHJlc3MgYXQgd3d3LmNhbWVyZmlybWEuY29tL2FkZHJlc3MxJzAlBgNVBAsMHkdMT0JBTCBDSEFNQkVSU0lHTiBST09UIC0gMjAxNjESMBAGA1UEBRMJQTgyNzQzMjg3MRgwFgYDVQRhDA9WQVRFUy1BODI3NDMyODcxGzAZBgNVBAoMEkFDIENBTUVSRklSTUEgUy5BLjEnMCUGA1UEAwweR0xPQkFMIENIQU1CRVJTSUdOIFJPT1QgLSAyMDE2MIICIjANBgkqhkiG9w0BAQEF--END--","SZ1qSRW","Email","AFX Client;Email","AFX Method;AFX Method;Entry Point;AFX Email;Time Stamping","No Entry","",""
所需的输出是将每行的以下值字符串读入变量:

"AC XA, S.A."
"City of Commerce"
"00"
"0A348E541E6F5C258A12A5674AEF25F28BA7DCFAECEECC4EE63B71B361606AC3"
"Included"
"Included"
"Included"
"Not Included"
"09/30/2003"
"09/30/2037"
""
"SZ=City of Commerce; FE=http://website.org; O=AC XA SA CIF A39201827; C=QR"
"--BINARY BLOB--maXJtYSBTQSBDSUYgQTgyNzQzMjg3MSMwIQYDVQQLExpodHRwOi8vd3d3LmNoYW1iZXJzaWduLm9yZzEiMCAGA1UEAxMZQ2hhbWJlcnMgb2YgQ29tbWVyY2UgUm9vdDAeFw0wMzA5MzAUIC0gMjAxNjESMBAGA1UEBRMJQTgyNzQzMjg3MRgwFgYDVQRhDA9WQVRFUy1BODI3NDMyODcxGzAZBgNVBAoMEkFDIENBTUVSRklSTUEgUy5BLjEnMCUGA1UEAwweR0xPQkFMIENIQU1CRVJTSUdOIFJPT1QgLSAyMDE2MB4XDTE2MDQxNDA3NTAwNloXDTQwMDQwODA3NTAwNlowggEIMQswCQYDVQQGEwJFUzEPMA0GA1UECAwGTUFEUklEMQ8wDQYDVQQHDAZNQURSSUQxOjA4BgNVBAsMMXNlZSBjdXJyZW50IGFkZHJlc3MgYXQgd3d3LmNhbWVyZmlybWEuY29tL2FkZHJlc3MxJzAlBgNVBAsMHkdMT0JBTCBDSEFNQkVSU0lHTiBST09UIC0gMjAxNjESMBAGA1UEBRMJQTgyNzQzMjg3MRgwFgYDVQRhDA9WQVRFUy1BODI3NDMyODcxGzAZBgNVBAoMEkFDIENBTUVSRklSTUEgUy5BLjEnMCUGA1UEAwweR0xPQkFMIENIQU1CRVJTSUdOIFJPT1QgLSAyMDE2MIICIjANBgkqhkiG9w0BAQEF--END--"
"SZ1qSRW"
"Email"
"AFX Client;Email"
"AFX Method;AFX Method;Entry Point;AFX Email;Time Stamping"
"No Entry"
""
""

目前,IFS方法由于第一个值“AC XA,S.A.”中的逗号而中断。我已经尝试了以下awk并取得了一些成功,但不清楚是否可以将每个项目读入变量并指定行号:
awk -F"," '{for(i=1;i<=NF;i++){if(i%NF==0){ORS="\n"} {if($i ~ /^\"/ || $i ~ /\"$/) {a=a OFS $i;j++;{if(j%2==0){sub(/^\,/,X,a); print a;j=0;a=""}}} else {print $i}}} {ORS=","}}' ORS="," OFS=, file.csv

pepwfjgg

pepwfjgg1#

不要将字段读入名为field1field2等的20个不同标量变量中,而是将它们读入一个数组中。使用bash和awk可以使用NUL作为ORS,例如GNU awk,但不仅仅是GNU awk,所以无论你已经使用的awk都可以工作:

$ cat tst.sh
#!/usr/bin/env bash

while IFS=$'\n' read -d '' -r -a fields; do
    printf '%s\n' "${fields[@]}"
done < <( awk -v ORS='\0' '{gsub(/","/,"\"\n\"")} 1' "${@:--}" )
$ ./tst.sh file
"AC XA, S.A."
"City of Commerce"
"00"
"0A348E541E6F5C258A12A5674AEF25F28BA7DCFAECEECC4EE63B71B361606AC3"
"Included"
"Included"
"Included"
"Not Included"
"09/30/2003"
"09/30/2037"
""
"SZ=City of Commerce; FE=http://website.org; O=AC XA SA CIF A39201827; C=QR"
"--BINARY BLOB--maXJtYSBTQSBDSUYgQTgyNzQzMjg3MSMwIQYDVQQLExpodHRwOi8vd3d3LmNoYW1iZXJzaWduLm9yZzEiMCAGA1UEAxMZQ2hhbWJlcnMgb2YgQ29tbWVyY2UgUm9vdDAeFw0wMzA5MzAUIC0gMjAxNjESMBAGA1UEBRMJQTgyNzQzMjg3MRgwFgYDVQRhDA9WQVRFUy1BODI3NDMyODcxGzAZBgNVBAoMEkFDIENBTUVSRklSTUEgUy5BLjEnMCUGA1UEAwweR0xPQkFMIENIQU1CRVJTSUdOIFJPT1QgLSAyMDE2MB4XDTE2MDQxNDA3NTAwNloXDTQwMDQwODA3NTAwNlowggEIMQswCQYDVQQGEwJFUzEPMA0GA1UECAwGTUFEUklEMQ8wDQYDVQQHDAZNQURSSUQxOjA4BgNVBAsMMXNlZSBjdXJyZW50IGFkZHJlc3MgYXQgd3d3LmNhbWVyZmlybWEuY29tL2FkZHJlc3MxJzAlBgNVBAsMHkdMT0JBTCBDSEFNQkVSU0lHTiBST09UIC0gMjAxNjESMBAGA1UEBRMJQTgyNzQzMjg3MRgwFgYDVQRhDA9WQVRFUy1BODI3NDMyODcxGzAZBgNVBAoMEkFDIENBTUVSRklSTUEgUy5BLjEnMCUGA1UEAwweR0xPQkFMIENIQU1CRVJTSUdOIFJPT1QgLSAyMDE2MIICIjANBgkqhkiG9w0BAQEF--END--"
"SZ1qSRW"
"Email"
"AFX Client;Email"
"AFX Method;AFX Method;Entry Point;AFX Email;Time Stamping"
"No Entry"
""
""

如果你的awk不能使用NUL s,那么使用任何awk:

$ cat tst.sh
#!/usr/bin/env bash

while IFS= read -r field; do
    if [[ -z "$field" ]]; then
        printf '%s\n' "${fields[@]}"
        fields=()
    else
        fields+=( "$field" )
    fi
done < <( awk -v ORS='\n\n' '{gsub(/","/,"\"\n\"")} 1' "${@:--}" )
$ ./tst.sh file
"AC XA, S.A."
"City of Commerce"
"00"
"0A348E541E6F5C258A12A5674AEF25F28BA7DCFAECEECC4EE63B71B361606AC3"
"Included"
"Included"
"Included"
"Not Included"
"09/30/2003"
"09/30/2037"
""
"SZ=City of Commerce; FE=http://website.org; O=AC XA SA CIF A39201827; C=QR"
"--BINARY BLOB--maXJtYSBTQSBDSUYgQTgyNzQzMjg3MSMwIQYDVQQLExpodHRwOi8vd3d3LmNoYW1iZXJzaWduLm9yZzEiMCAGA1UEAxMZQ2hhbWJlcnMgb2YgQ29tbWVyY2UgUm9vdDAeFw0wMzA5MzAUIC0gMjAxNjESMBAGA1UEBRMJQTgyNzQzMjg3MRgwFgYDVQRhDA9WQVRFUy1BODI3NDMyODcxGzAZBgNVBAoMEkFDIENBTUVSRklSTUEgUy5BLjEnMCUGA1UEAwweR0xPQkFMIENIQU1CRVJTSUdOIFJPT1QgLSAyMDE2MB4XDTE2MDQxNDA3NTAwNloXDTQwMDQwODA3NTAwNlowggEIMQswCQYDVQQGEwJFUzEPMA0GA1UECAwGTUFEUklEMQ8wDQYDVQQHDAZNQURSSUQxOjA4BgNVBAsMMXNlZSBjdXJyZW50IGFkZHJlc3MgYXQgd3d3LmNhbWVyZmlybWEuY29tL2FkZHJlc3MxJzAlBgNVBAsMHkdMT0JBTCBDSEFNQkVSU0lHTiBST09UIC0gMjAxNjESMBAGA1UEBRMJQTgyNzQzMjg3MRgwFgYDVQRhDA9WQVRFUy1BODI3NDMyODcxGzAZBgNVBAoMEkFDIENBTUVSRklSTUEgUy5BLjEnMCUGA1UEAwweR0xPQkFMIENIQU1CRVJTSUdOIFJPT1QgLSAyMDE2MIICIjANBgkqhkiG9w0BAQEF--END--"
"SZ1qSRW"
"Email"
"AFX Client;Email"
"AFX Method;AFX Method;Entry Point;AFX Email;Time Stamping"
"No Entry"
""
""

只要你的所有字段都用引号括起来,并且没有一个字段包含字符串","或换行符,它就可以工作。

pprl5pva

pprl5pva2#

最简单的方法是使用csv解析器。
一个常见的解析器是Ruby:

ruby -r csv -e 'CSV.parse($<.read).
                    flatten.
                    each{|field| 
                        puts [field].to_csv(force_quotes: true)
                    }' file

以你为例:

"AC XA, S.A."
"City of Commerce"
"00"
"0A348E541E6F5C258A12A5674AEF25F28BA7DCFAECEECC4EE63B71B361606AC3"
"Included"
"Included"
"Included"
"Not Included"
"09/30/2003"
"09/30/2037"
""
"SZ=City of Commerce; FE=http://website.org; O=AC XA SA CIF A39201827; C=QR"
"--BINARY BLOB--maXJtYSBTQSBDSUYgQTgyNzQzMjg3MSMwIQYDVQQLExpodHRwOi8vd3d3LmNoYW1iZXJzaWduLm9yZzEiMCAGA1UEAxMZQ2hhbWJlcnMgb2YgQ29tbWVyY2UgUm9vdDAeFw0wMzA5MzAUIC0gMjAxNjESMBAGA1UEBRMJQTgyNzQzMjg3MRgwFgYDVQRhDA9WQVRFUy1BODI3NDMyODcxGzAZBgNVBAoMEkFDIENBTUVSRklSTUEgUy5BLjEnMCUGA1UEAwweR0xPQkFMIENIQU1CRVJTSUdOIFJPT1QgLSAyMDE2MB4XDTE2MDQxNDA3NTAwNloXDTQwMDQwODA3NTAwNlowggEIMQswCQYDVQQGEwJFUzEPMA0GA1UECAwGTUFEUklEMQ8wDQYDVQQHDAZNQURSSUQxOjA4BgNVBAsMMXNlZSBjdXJyZW50IGFkZHJlc3MgYXQgd3d3LmNhbWVyZmlybWEuY29tL2FkZHJlc3MxJzAlBgNVBAsMHkdMT0JBTCBDSEFNQkVSU0lHTiBST09UIC0gMjAxNjESMBAGA1UEBRMJQTgyNzQzMjg3MRgwFgYDVQRhDA9WQVRFUy1BODI3NDMyODcxGzAZBgNVBAoMEkFDIENBTUVSRklSTUEgUy5BLjEnMCUGA1UEAwweR0xPQkFMIENIQU1CRVJTSUdOIFJPT1QgLSAyMDE2MIICIjANBgkqhkiG9w0BAQEF--END--"
"SZ1qSRW"
"Email"
"AFX Client;Email"
"AFX Method;AFX Method;Entry Point;AFX Email;Time Stamping"
"No Entry"
""
""

相关问题