csv BASH脚本:给定用户输入,对TSV文件执行归一化

92vpleto  于 2023-04-27  发布在  其他
关注(0)|答案(2)|浏览(132)

我想写一个动态BASH脚本,它将从命令行获取用户输入,并对提供的TSV文件执行标准化。
TSV输入示例:

COL1    COL2    COL3    COL4
YES    NOT    NAME,EX    LOL,OK
Y      N      NICE,ED    LOUD,ONE
YE     NO     NEED,ER    LONE,TWO

作为输入,脚本将采用分隔符、列号(一个或多个)和输入文件,并允许用户将列表中的每个项目的行复制为列字段。

COL1    COL2    COL3    COL4
YES    NOT     NAME      LOL
YES    NOT     EX        OK
Y      N       NICE      LOUD
Y      N       ED        ONE
YE     NO      NEED      LONE
YE     NO      ER        TWO

我尝试了很多不同的组合,例如这个:

#!/bin/bash
# Get user input for the selected columns (starting from 1)
echo "Enter the column numbers to split (separated by spaces):"
read -ra col_nums

# Get user input for the delimiter
echo "Enter the delimiter:"
read delimiter

# Get the input file name
echo "Enter the input file name:"
read input_file

# Get the output file name
echo "Enter the output file name:"
read output_file

# Create a temporary file for storing the modified data
temp_file=$(mktemp)

# Set the field separator to the delimiter
awk -v col_nums="${col_nums[*]}" -v delimiter="$delimiter" '
BEGIN { FS = OFS = "\t"; split(col_nums, cn, " ") }

{
 for (i in cn) {
    # Split the selected column
    split($(cn[i]), a, delimiter)

    # Loop through the split array and create the new columns
    for (j = 1; j <= length(a); j++) {
        # Print the original columns up to the selected column
        for (k = 1; k < cn[i]; k++) {
            printf("%s%s", $k, OFS)
        }

        # Print the new column
        printf("%s%s", a[j], OFS)

        # Print the original columns after the selected column
        for (k = cn[i] + 1; k <= NF; k++) {
            printf("%s%s", $k, (k == NF ? ORS : OFS))
        }
    }
}
}' "$input_file" > "$temp_file"

# Rename the temporary file to the output file name
mv "$temp_file" "$output_file"

echo "Done."

然而,在运行这个作为输出后,我得到:

COL1     COL2    COL4     COL4
YES     NOT     LOL      LOL,OK
YES     NOT     OK       LOL,OK
Y       N       LOUD     LOUD,ONE
Y       N       ONE      LOUD,ONE
YE      NO      LONE     LONE,TWO
YE      NO      TWO      LONE,TWO

你能告诉我我做错了什么吗?什么是解决这个问题的最佳方法?如果用一行程序解决这个问题,我也很好,但我的成功率更低。
如果定界符和列号是预先知道的,我就知道如何做到这一点,但对我来说,问题是输出取决于用户输入,并且可以提供多个列作为输入。

tkqqtvp1

tkqqtvp11#

使用任何awk:

$ cat tst.sh
#!/usr/bin/env bash

col_nums='3 4'
delimiter=','
input_file='file'
output_file='out'

awk -v col_nums="$col_nums" -v delimiter="$delimiter" '
    BEGIN {
        FS = OFS = "\t"
        split(col_nums, cn, " ")
        for ( i in cn ) {
            split_cols[cn[i]]
        }
    }
    {
        num_parts = split($(cn[1]),tmp,delimiter)
        for ( part_nr=1; part_nr<=num_parts; part_nr++ ) {
            for ( col_nr=1; col_nr<=NF; col_nr++ ) {
                if ( col_nr in split_cols ) {
                    split($col_nr,parts,delimiter)
                    val = parts[part_nr]
                }
                else {
                    val = $col_nr
                }
                printf "%s%s", val, (col_nr<NF ? OFS : ORS)
            }
        }
    }
' "$input_file" > "$output_file"
$ ./tst.sh
$ cat out
COL1    COL2    COL3    COL4
YES     NOT     NAME    LOL
YES     NOT     EX      OK
Y       N       NICE    LOUD
Y       N       ED      ONE
YE      NO      NEED    LONE
YE      NO      ER      TWO

上面的假设,正如你在评论中所说:
如果指定了两列,则它们具有相同数量的逗号分隔标记。
显然,编写任何您喜欢的shell代码来从用户输入中读取/填充col_nums等。
我将shell变量col_nums从数组更改为标量,因为在该代码中没有理由将其作为数组。

jtw3ybtb

jtw3ybtb2#

下面是一个Ruby来实现这一点:

ruby -r csv -r json -e '
opts=JSON.parse(File.open(ARGV[0]).read)
cols=opts["cols"]
delim=opts["delimiter"]
inp=CSV.read(ARGV[1], **{headers:false, col_sep:"\t"})

puts CSV.generate(**{headers:false, col_sep:"\t"}){|csv| 
    inp.each{|row| 
    row=[row]
    while row[-1].select.
            with_index{|e, i| cols.include?(i)}.any?{|s| s.include?(delim)} do
        row<<row[-1]
        row[-2]=row[-2].map.with_index{|s,i| 
            if cols.include?(i) then s.split(delim,2)[0] else s end}
        row[-1]=row[-1].map.with_index{|s,i| 
            if s[/,/] && cols.include?(i) then s.split(delim, 2)[1] else s end }
    end
    row.each{|sa| csv<<sa }    
    }    
}' <(echo '{"cols":[2,3],"delimiter":","}') your_file

打印此TSV输出:

COL1    COL2    COL3    COL4
YES NOT NAME    LOL
YES NOT EX  OK
Y   N   NICE    LOUD
Y   N   ED  ONE
YE  NO  NEED    LONE
YE  NO  ER  TWO

使用JSON来发送配置(我认为这是不言自明的),并使用shell脚本来填充这一简单的JSON行:

#!/bin/bash

echo "Enter the column numbers to split (separated by commas):"
read col_nums

# Get user input for the delimiter
echo "Enter the delimiter:"
read delimiter

opts=$(printf "{\"cols\":[%s], \"delimiter\":\"%s\"}\n" "$col_nums" "$delimiter") 
# opts = '{"cols":[2,3], "delimiter":","}'

如果你想要一个 table(其中列对齐)与 TSV,在TSV输出的管道中使用column -t

ruby [that ruby] <(echo "$opts") your_file | column -t

图纸:

COL1  COL2  COL3  COL4
YES   NOT   NAME  LOL
YES   NOT   EX    OK
Y     N     NICE  LOUD
Y     N     ED    ONE
YE    NO    NEED  LONE
YE    NO    ER    TWO

这里的优点是Ruby版本更好地支持兼容的CSV / TSV输入和输出,包括引用字段,日期和转换。

相关问题