Bash: normalize list columns - replicate row for each item in a list

Posted by mbzjlibv on 2023-04-07 in Shell


I want to create a separate row in Bash for each item in a comma-separated list.
For example, I want to convert this table:

TYPE  NAME  
Fruit  apple,strawberry
Vegetable  potato

into this table:

TYPE  NAME  
Fruit  apple
Fruit  strawberry
Vegetable  potato

I tried this script:

#!/bin/bash

# define the name of the input file
input_file="plants.tsv"

# define the name of the output file
output_file="normalized_plants.tsv"

# define the index of the list column (counting from 1)
list_column=2

# create a new file with the headers for the output table
head -n 1 "$input_file" > "$output_file"

# read each line of the input file
tail -n +2 "$input_file" | while IFS=$'\t' read -r line; do
  # extract the values for the list column
  list_values=$(echo "$line" | awk -F$'\t' '{print $'"$list_column"'}' | tr ',' '\n')
  # iterate over each value in the list column
  echo "$line" | awk -F$'\t' -v OFS=$'\t' -v list_column="$list_column" -v list_values="$list_values" '
    NR == 1 { next } # skip the header row
    { 
      split(list_values, values, "\n")
      for (i in values) {
        $list_column = values[i]
        print $0
      }
    }' >> "$output_file"
done

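But I get an empty output file. Do you know what is going wrong here? Or is there a better way to do this? I am a beginner in Bash, and this may not be the best way to implement the normalization.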


Source: https://stackoverflow.com/questions/75919219/bash-normalize-list-columns-replicate-row-for-each-item-in-a-list

5 Answers


k97glaaz1#

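Don't use a shell read loop for this; see why-is-using-a-shell-loop-to-process-text-considered-bad-practice. A single awk script will run faster, be more portable, and be easier to write. For example, you currently have 2 answers that use shell read loops: both fail if "type" contains whitespace, and one of them also fails if the input contains any backslashes. Using any awk: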

$ cat tst.sh
#!/usr/bin/env bash

# define the name of the input file
input_file="plants.tsv"

# define the name of the output file
output_file="normalized_plants.tsv"

# define the index of the list column (counting from 1)
list_column=2

awk -v list_column="$list_column" '
    BEGIN { FS=OFS="\t" }
    {
        n = split($list_column,names,",")
        for ( i=1; i<=n; i++ ) {
            print $1, names[i]
        }
    }
' "$input_file" > "$output_file"

$ ./tst.sh

$ cat normalized_plants.tsv
TYPE    NAME
Fruit   apple
Fruit   strawberry
Vegetable       potato

I used for ( i=1; i<=n; i++ ) rather than for ( i in names ) so that the order of the names in the input is guaranteed to be preserved in the output; see https://www.gnu.org/software/gawk/manual/gawk.html#Scanning-an-Array.
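
As for the empty output file in the question, one likely cause is that the loop starts a fresh awk for every line. Each of those awk processes sees a single-line input, so NR is always 1 and the NR == 1 { next } rule skips the only line it gets. A minimal sketch of that effect, not a full diagnosis of the script; the sample row and variable names are hypothetical, and tab-separated input is assumed:

```shell
#!/bin/sh
# Hypothetical one-row sample; tab-separated like plants.tsv is assumed to be.
line=$(printf 'Fruit\tapple,strawberry')

# Each pipeline starts a new awk that reads exactly one line, so NR is 1
# and the guard discards that line.
with_guard=$(printf '%s\n' "$line" | awk 'NR == 1 { next } { print }')

# The same pipeline without the guard passes the line through.
without_guard=$(printf '%s\n' "$line" | awk '{ print }')

printf 'with NR == 1 guard: [%s]\n' "$with_guard"
printf 'without the guard:  [%s]\n' "$without_guard"
```

The first result is empty and the second is the original line, which matches the symptom of an output file containing nothing beyond what head wrote.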

2023-04-07

r55awzrz2#

This answer just shows that your script can be condensed, in pure bash, to:

#!/bin/bash

while read -r type names; do
    echo "$type"$'\t'"${names//,/$'\n'$type$'\t'}"
done < plants.tsv > normalized_plants.tsv

In general, the awk solution should be preferred.

2023-04-07

d6kp6zgx3#

bash:

while read type name_list; do                # Read the 2 fields in type and name_list
    readarray -d, -t names <<< "$name_list," # Split the name_list by comma and save it in names array.
    unset 'names[-1]'                        # Drop the last element, which holds only the trailing newline added by <<<.
    for name in "${names[@]}"; do            # For each name, ...
        echo "$type $name"                   # ... print type and name
    done
done < plants.tsv > output_plants.tsv        # Input, output file redirection.

awk version:

awk '{split($2, s, ","); for(i in s){print $1, s[i]}}' plants.tsv > output_plants.tsv
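
If the file really is tab-separated and a type may contain spaces, a read loop can split on tabs only and print a tab between the two columns. A minimal sketch, assuming bash and tab-separated input; normalize_rows and the sample rows are hypothetical:

```shell
#!/usr/bin/env bash

# Read "type<TAB>list" rows from stdin and print one "type<TAB>item" row
# per comma-separated item.
normalize_rows() {
    local type names item
    local -a items
    while IFS=$'\t' read -r type names; do
        # Split the list field on commas only.
        IFS=, read -r -a items <<< "$names"
        for item in "${items[@]}"; do
            printf '%s\t%s\n' "$type" "$item"
        done
    done
}

result=$(printf 'TYPE\tNAME\nCitrus fruit\tlemon,lime\nVegetable\tpotato\n' | normalize_rows)
printf '%s\n' "$result"
```

Rows whose list field is empty produce no output in this sketch, and the header passes through unchanged because it contains no comma.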

2023-04-07

4smxwvx54#

Just for variety, a simple string-processing solution with sed.

$: sed -E ':x ; s/^([^[:space:]]+)[[:space:]]+([^,]+),/\1\t\2\n\1\t/; tx' file
TYPE  NAME
Fruit   apple
Fruit   strawberry
Vegetable  potato

This works with the simple file given. Make sure you validate it against any more complex files.
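
One more complex case worth checking is a row with three or more items. In sed, a t with no label branches to the end of the script, so the substitution would run only once per line; naming the label (tx) makes it repeat until no comma is left. A minimal sketch, assuming GNU sed (the \t and \n escapes in the replacement are GNU extensions); split_list and the sample row are hypothetical:

```shell
#!/bin/sh

# Repeat the substitution until the line contains no more commas.
# Separate -e expressions keep the label and branch unambiguous.
split_list() {
    sed -E -e ':x' \
        -e 's/^([^[:space:]]+)[[:space:]]+([^,]+),/\1\t\2\n\1\t/' \
        -e 'tx'
}

three_items=$(printf 'Fruit\tapple,strawberry,banana\n' | split_list)
printf '%s\n' "$three_items"
```

Each pass peels off one more comma because [^,]+ also matches the newlines that earlier passes inserted into the pattern space.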

2023-04-07

06odsfpq5#

echo '
TYPE  NAME  
Fruit  apple,strawberry,banana
Vegetable  potato' |

mawk 'NR==!_ || $NF!~/,/ || gsub(",[^,]+", "\n"$!_ " &", $NF) + gsub(",",_)'

TYPE  NAME  
Fruit apple
Fruit strawberry
Fruit banana
Vegetable  potato
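
The one-liner is terse: _ is an unset variable, so !_ evaluates to 1, which makes NR==!_ a test for the header line and $!_ the first field; passing _ as a replacement means the empty string. A more readable sketch of what appears to be the same logic, runnable with any awk; expand_lists and the sample rows are hypothetical:

```shell
#!/bin/sh

# For data rows whose last field contains commas, put a newline plus the
# first field before every ",item", then delete the commas from the record.
expand_lists() {
    awk '
        NR > 1 && $NF ~ /,/ {
            gsub(/,[^,]+/, "\n" $1 " &", $NF)   # & stands for the matched ",item"
            gsub(/,/, "")                       # strip the commas from $0
        }
        { print }
    '
}

readable=$(printf 'TYPE  NAME\nFruit  apple,strawberry,banana\nVegetable  potato\n' | expand_lists)
printf '%s\n' "$readable"
```

Assigning to $NF makes awk rebuild the record with a single space between fields, which is why the expanded rows lose the original double-space gap while untouched rows keep it.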

If you want to nitpick about the spacing differences in the output, then:

gawk 'NR==!_ ? OFS = substr($_, match($_, "[ \t]+"),RLENGTH) \
             : $NF!~/,/ || gsub(",[^,]+", "\n" $!_ OFS "&", $NF) gsub(",",_)'

TYPE  NAME  
Fruit  apple
Fruit  strawberry
Fruit  banana
Vegetable  potato

2023-04-07
