shell - Normalize a list column: duplicate the row for each item in the list

mbzjlibv  posted on 2023-04-07  in Shell

In Bash, I want to create a separate row for each item in a comma-separated list column.
For example, I want to convert this table:

TYPE  NAME  
Fruit  apple,strawberry
Vegetable  potato

into this table:

TYPE  NAME  
Fruit  apple
Fruit  strawberry
Vegetable  potato

I tried this script:

#!/bin/bash

# define the name of the input file
input_file="plants.tsv"

# define the name of the output file
output_file="normalized_plants.tsv"

# define the index of the list column (counting from 1)
list_column=2

# create a new file with the headers for the output table
head -n 1 "$input_file" > "$output_file"

# read each line of the input file
tail -n +2 "$input_file" | while IFS=$'\t' read -r line; do
  # extract the values for the list column
  list_values=$(echo "$line" | awk -F$'\t' '{print $'"$list_column"'}' | tr ',' '\n')
  # iterate over each value in the list column
  echo "$line" | awk -F$'\t' -v OFS=$'\t' -v list_column="$list_column" -v list_values="$list_values" '
    NR == 1 { next } # skip the header row
    { 
      split(list_values, values, "\n")
      for (i in values) {
        $list_column = values[i]
        print $0
      }
    }' >> "$output_file"
done

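But I get an empty output file. Do you know what is going wrong here, or is there a better way to do this? I am a beginner in Bash, so this may not be the best way to implement the normalization.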

k97glaaz1#

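Don't use a shell read loop for this; see why-is-using-a-shell-loop-to-process-text-considered-bad-practice. A single awk script will run faster, be more portable, and be easier to write correctly. For example, you currently have 2 answers that use shell read loops: both will fail if the "type" contains whitespace, and one of them will also fail if the input contains any backslashes. Using any awk: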

$ cat tst.sh
#!/usr/bin/env bash

# define the name of the input file
input_file="plants.tsv"

# define the name of the output file
output_file="normalized_plants.tsv"

# define the index of the list column (counting from 1)
list_column=2

awk -v list_column="$list_column" '
    BEGIN { FS=OFS="\t" }
    {
        n = split($list_column,names,",")
        for ( i=1; i<=n; i++ ) {
            print $1, names[i]
        }
    }
' "$input_file" > "$output_file"
$ ./tst.sh
$ cat normalized_plants.tsv
TYPE    NAME
Fruit   apple
Fruit   strawberry
Vegetable       potato

I'm using for ( i=1; i<=n; i++ ) rather than for ( i in names ) to guarantee that the order of the names in the input is preserved in the output; see https://www.gnu.org/software/gawk/manual/gawk.html#Scanning-an-Array.
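
As for the empty output in the question's script: each loop iteration pipes a single line into a fresh awk process, so NR is 1 for that line and the `NR == 1 { next }` rule skips it every time. The header was already removed by `tail -n +2`, so that rule is not needed. A minimal sketch of the effect, using one sample row from the question:

```shell
#!/usr/bin/env bash
line=$'Fruit\tapple,strawberry'

# One line piped into its own awk process: NR is 1, so the rule skips it.
skipped=$(printf '%s\n' "$line" | awk 'NR == 1 { next } { print }')

# Without the NR test, the same line is printed.
kept=$(printf '%s\n' "$line" | awk '{ print }')

printf 'with the NR == 1 rule: [%s]\n' "$skipped"
printf 'without it:            [%s]\n' "$kept"
```

The first pair of brackets comes out empty, which matches the missing data rows in the output file.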

r55awzrz2#

This answer is just to show you that your script can be condensed to this in pure bash:

#!/bin/bash

while read -r type names; do
    echo "$type"$'\t'"${names//,/$'\n'$type$'\t'}"
done < plants.tsv > normalized_plants.tsv

In general, the awk solution should be preferred.
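
One reason to prefer awk here is that `read -r type names` splits on any whitespace, so a type that contains a space is cut in the wrong place. The sketch below uses a hypothetical row (`Root vegetable`, not part of the original data) to compare the loop above with a variant that limits IFS to a tab for the read:

```shell
#!/usr/bin/env bash
# Hypothetical row whose TYPE contains a space (not in the original data).
row=$'Root vegetable\tcarrot,beet'

# Default IFS: "Root" becomes the type and the rest ends up in names.
default_ifs=$(printf '%s\n' "$row" | while read -r type names; do
    echo "$type"$'\t'"${names//,/$'\n'$type$'\t'}"
done)

# IFS limited to a tab for the read: the type stays intact.
tab_ifs=$(printf '%s\n' "$row" | while IFS=$'\t' read -r type names; do
    echo "$type"$'\t'"${names//,/$'\n'$type$'\t'}"
done)

printf '%s\n' "$default_ifs"
printf -- '---\n'
printf '%s\n' "$tab_ifs"
```

Note that a tab still counts as IFS whitespace, so consecutive tabs (empty fields) are collapsed by `read`; awk with `FS="\t"` does not have that limitation.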

d6kp6zgx3#

bash

while read -r type name_list; do             # Read the two fields into type and name_list.
    readarray -d, -t names <<< "$name_list," # Split name_list on commas and save the items in the names array.
    unset 'names[-1]'                        # Drop the last element, which holds only the newline appended by the here-string.
    for name in "${names[@]}"; do            # For each name, ...
        echo "$type $name"                   # ... print type and name.
    done
done < plants.tsv > output_plants.tsv        # Input and output file redirection.
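
The reason for removing the last array element may not be obvious: a here-string always appends a newline, so after the final comma there is one more "item" that contains only that newline. A small sketch showing the array before and after, assuming bash 4.4 or later (needed for `readarray -d`):

```shell
#!/usr/bin/env bash
name_list="apple,strawberry"

# The here-string appends a newline, so the text after the last comma is "\n".
readarray -d, -t names <<< "$name_list,"
before=${#names[@]}          # 3 elements: apple, strawberry, and a lone newline

unset 'names[-1]'            # drop the newline-only element
after=${#names[@]}           # 2 elements remain

printf '%d -> %d\n' "$before" "$after"
printf '[%s]\n' "${names[@]}"
```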

awk version:

awk '{n = split($2, s, ","); for (i = 1; i <= n; i++) print $1, s[i]}' plants.tsv > output_plants.tsv
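
The one-liner above relies on awk's default whitespace splitting and prints only two fields. For a tab-separated file with additional columns, one option is to overwrite the list column in place and print the whole record. This is a minimal sketch under stated assumptions: the `ORIGIN` column and the temporary file are hypothetical and only serve to show a third column being carried through.

```shell
#!/usr/bin/env bash
# Hypothetical three-column sample (the ORIGIN column is not in the original data).
input_file=$(mktemp)
printf 'TYPE\tNAME\tORIGIN\n'              >  "$input_file"
printf 'Fruit\tapple,strawberry\tEurope\n' >> "$input_file"
printf 'Vegetable\tpotato\tAmericas\n'     >> "$input_file"

list_column=2

normalized=$(awk -v list_column="$list_column" '
    BEGIN { FS=OFS="\t" }
    {
        n = split($list_column, names, ",")
        for ( i=1; i<=n; i++ ) {
            $list_column = names[i]   # overwrite only the list column
            print                     # $0 is rebuilt with OFS
        }
    }
' "$input_file")

printf '%s\n' "$normalized"
rm -f "$input_file"
```

A row whose list column is empty produces no output with this loop, since split() returns 0 for an empty string; whether that is acceptable depends on the data.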

4smxwvx54#

Just for variety, a simple string-processing solution with sed (GNU sed, since it uses \t and \n in the replacement):

$: sed -E ':x ; s/^([^[:space:]]+)[[:space:]]+([^,]+),/\1\t\2\n\1\t/; tx' file
TYPE  NAME
Fruit   apple
Fruit   strawberry
Vegetable  potato

This works with the simple file given. Make sure you verify it against any more complex files.
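
To check the command against a slightly longer list, here is a minimal sketch assuming GNU sed and a hypothetical three-item row. The branch is written as `tx` so that it jumps back to the `:x` label and repeats the substitution until no comma is left; the script is split across `-e` options only to keep the label on its own.

```shell
#!/usr/bin/env bash
# Hypothetical input with a three-item list (banana is not in the original table).
input=$'TYPE\tNAME\nFruit\tapple,strawberry,banana\nVegetable\tpotato'

result=$(printf '%s\n' "$input" |
    sed -E -e ':x' \
           -e 's/^([^[:space:]]+)[[:space:]]+([^,]+),/\1\t\2\n\1\t/' \
           -e 'tx')

printf '%s\n' "$result"
```

On each pass the second group swallows the rows already produced, up to the next comma, so every remaining item ends up on its own line prefixed with the type.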

06odsfpq5#

echo '
TYPE  NAME  
Fruit  apple,strawberry,banana
Vegetable  potato' |
mawk 'NR==!_ || $NF!~/,/ || gsub(",[^,]+", "\n"$!_ " &", $NF) + gsub(",",_)'
TYPE  NAME  
Fruit apple
Fruit strawberry
Fruit banana
Vegetable  potato

If you want to nitpick about the difference in output spacing, then:

gawk 'NR==!_ ? OFS = substr($_, match($_, "[ \t]+"),RLENGTH) \
             : $NF!~/,/ || gsub(",[^,]+", "\n" $!_ OFS "&", $NF) gsub(",",_)'
TYPE  NAME  
Fruit  apple
Fruit  strawberry
Fruit  banana
Vegetable  potato
