如何在AWK中获取CSV列的中值？(linux)

hc8w905p 于 2023-01-12 发布在 Linux

关注(0)|答案(2)|浏览(147)

我想从这个CSV文件的第二列捕捉中值按大陆分组在脚本awk文件：

Continent, Ratio
Asia,4.313743132
Europe,1.275057509
Africa,2.587215976
Europe,0.382270638
Africa,1.915438434
Latin America and the Caribbean,1.801681569
Latin America and the Caribbean,1.416472111
Asia,2.039169471
Latin America and the Caribbean,0.622595518

我试过了，但不起作用，因为我没有对CSV进行排序，甚至没有按大洲分组：

BEGIN {
    FS = ","
    Continent["Africa"];Continent["Asia"];Continent["Europe"];Continent["Latin America and the Caribbean"];
}
FNR>1 {
    if ($1!="" && $2!="") {
        a[i++]=$2
}

}END {
    for (i in Continent) {
        x=int((i+1)/2); if (x < (i+1)/2) print (a[x-1]+a[x])/2; else print a[x-1];
     }
}

然而，我希望首先按大陆和数字（desc或asc，无关紧要）订购csv：

Continent, Ratio
Africa,2.587215976
Africa,1.915438434
Asia,4.313743132
Asia,2.039169471
Europe,1.275057509
Europe,0.382270638
Latin America and the Caribbean,1.801681569
Latin America and the Caribbean,1.416472111
Latin America and the Caribbean,0.622595518

最后得到各洲的中位数：

Continent, Median
Africa 2,2513
Asia   3,1764
Europe 0,8286
Latin America and the Caribbean, 1.4164

linux

来源：https://stackoverflow.com/questions/75071894/how-to-get-the-median-value-from-a-csv-column-in-awk-linux

2条答案

按热度按时间

h4cxqtbf1#

使用gnu-awk时，您可以使用以下解决方案：

cat med.awk

function med(d) {
   return gensub(/\./, ",", "1", sprintf("%.4f",
      (n%2 ? d[(n+1)/2] : (d[n/2] + d[n/2+1]) / 2)))
}
BEGIN {
   FS = ","
   OFS = "\t"
}
NR > 1 {
   a[$0] = $2
}
END {
   PROCINFO["sorted_in"] = "@ind_num_asc"   # this is for sorting
   print "Continent", "Median
   for (i in a) {
      k = gensub(/,.*/, "", "1", i)
      if (k != pk) {
         if (pk)
            print pk, med(d)
         n = 0
         delete d
      }
      d[++n] = a[i]
      pk = k
   }
   print pk, med(d)
}

然后将其用作：

awk -f med.awk file | column -t -s $'\t'

Continent                        Median
Africa                           2,2513
Asia                             3,1765
Europe                           0,8287
Latin America and the Caribbean  1,4165

赞(0）回复(0）举报 2023-01-12

jv2fixgn2#

为了好玩，您可以在单个gawk脚本中编写此代码，但要与sort协调，使用two-way pipe (or coprocess)：

中值.awk

#!/usr/bin/env gawk -f

function median(values, _, len) {
  len = length(values)

  if (len % 2 == 1) {
    return values[(len + 1) / 2]
  } else {
    return (values[int(len / 2)] + values[int(len / 2) + 1]) / 2
  }
}

BEGIN {
  # Set (Input) Field and Output Field Separators.
  FS = OFS = ","

  # The `sort` process to which we'll send data.
  #
  # -t,    → set command as the input separator
  # -k1,1  → 1st sort key starts at the 1st field and ends at the 1st field
  # -k2,2n → 2nd sort key starts at the 2nd field and ends at the 2nd field
  #          and use numeric comparisons.
  sorted = "LC_ALL=C sort -t, -k1,1 -k2,2n"
}

NR == 1 {
  next # skip first line
}

{
  # Pipe incoming lines to a sorting coprocess.
  # See: https://www.gnu.org/software/gawk/manual/html_node/Two_002dway-I_002fO.html
  print $0 |& sorted
}

END {
  # Close the writing end of the `sorted` pipe.
  close(sorted, "to")

  while (1) {
    # Read sorted data.
    more = sorted |& getline

    # Print median when there's no more data or when we're switching between regions.
    if (!more || (region && region != $1)) {
      print region, sprintf("%.6f", median(vals))
      vals_i = 0
      delete vals
    }

    # Clean up when there's no more data.
    if (!more) {
      close(sorted)
      break
    }

    region = $1
    vals[++vals_i] = $2
  }
}

结果

$ ./median.awk data.txt 
Africa,2.251327
Asia,3.176456
Europe,0.828664
Latin America and the Caribbean,1.416472

赞(0）回复(0）举报 2023-01-12

我来回答

如何在AWK中获取CSV列的中值？(linux)

2条答案

中值.awk

结果

相关问题

热门标签

最新问答