Unix: creating histogram-like bins with awk

sycxhyv7 posted on 2022-11-04 in Unix

Here is my input file:

1.37987
1.21448
0.624999
1.28966
1.77084
1.088
1.41667

I want to create bins of a size I choose, to get histogram-like output. For example, bins of width 0.1 starting from 0 should give output like this:

0 0.1 0
...
0.5 0.6 0
0.6 0.7 1
...
1.0 1.1 1
1.1 1.2 0
1.2 1.3 2
1.3 1.4 1
...

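My file is too large for R, so I am looking for an awk solution (I am also open to anything else I can understand, since I am still a Linux beginner).
This has already been answered in this post: awk histogram in buckets, but that solution does not work for me.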


uqdfh47h1#

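This should be very close, if not exactly right. At the very least, treat it as a starting point and verify the math yourself (in particular, decide and verify which bucket an exact boundary match such as 0.2 should go into: 0.1 to 0.2 and/or 0.2 to 0.3?):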

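$ cat tst.awk
# Bucket width; override with -v delta=... on the command line.
BEGIN { delta = (delta == "" ? 0.1 : delta) }
{
    # 1-based bucket number: bucket n covers (n-1)*delta up to n*delta.
    bucketNr = int(($0+delta) / delta)
    cnt[bucketNr]++
    numBuckets = (numBuckets > bucketNr ? numBuckets : bucketNr)
}
END {
    # Print every bucket from 0 up to the highest one seen, including empty ones.
    for (bucketNr=1; bucketNr<=numBuckets; bucketNr++) {
        end = beg + delta
        printf "%0.1f %0.1f %d\n", beg, end, cnt[bucketNr]
        beg = end
    }
}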

$ awk -f tst.awk file
0.0 0.1 0
0.1 0.2 0
0.2 0.3 0
0.3 0.4 0
0.4 0.5 0
0.5 0.6 0
0.6 0.7 1
0.7 0.8 0
0.8 0.9 0
0.9 1.0 0
1.0 1.1 1
1.1 1.2 0
1.2 1.3 2
1.3 1.4 1
1.4 1.5 1
1.5 1.6 0
1.6 1.7 0
1.7 1.8 1

Note that you can set the bucket delta size on the command line; 0.1 is just the default:

$ awk -v delta='0.3' -f tst.awk file
0.0 0.3 0
0.3 0.6 0
0.6 0.9 1
0.9 1.2 1
1.2 1.5 4
1.5 1.8 1

$ awk -v delta='0.5' -f tst.awk file
0.0 0.5 0
0.5 1.0 1
1.0 1.5 5
1.5 2.0 1
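
On the boundary question raised above: awk stores numbers as binary floating point (typically IEEE 754 doubles), so a value that looks like an exact boundary can end up just below it after the addition and division. The following Python sketch is not part of the original answer. It mirrors the expression int(($0+delta)/delta) from tst.awk and compares it with exact decimal arithmetic. The function names awk_style_bucket and exact_bucket are made up for this illustration.

```python
from decimal import Decimal, ROUND_FLOOR


def awk_style_bucket(value, delta):
    """1-based bucket number computed with binary floats, as in tst.awk."""
    return int((value + delta) / delta)


def exact_bucket(value_text, delta_text):
    """Same formula with exact decimal arithmetic.

    Takes strings so the decimal digits are used exactly as written.
    """
    value = Decimal(value_text)
    delta = Decimal(delta_text)
    quotient = (value + delta) / delta
    return int(quotient.to_integral_value(rounding=ROUND_FLOOR))


if __name__ == "__main__":
    # Compare both versions on a few boundary-like values and one sample value.
    for text in ("0.2", "0.3", "0.7", "1.37987"):
        print(text, awk_style_bucket(float(text), 0.1), exact_bucket(text, "0.1"))
```

With IEEE 754 doubles, a value of 0.7 with delta 0.1 is one case where the two disagree: the float version yields bucket 7 (0.6 to 0.7) and the decimal version yields bucket 8 (0.7 to 0.8). Which behaviour you want is exactly the decision the answer asks you to make.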

wrrgggsh2#

This is also possible:

awk -v size=0.1 \
  '{ b=int($1/size); a[b]++; bmax=b>bmax?b:bmax; bmin=b<bmin?b:bmin }
   END { for(i=bmin;i<=bmax;++i) print i*size,(i+1)*size,a[i]+0 }' file

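It is essentially the same as EdMorton's solution, but prints the buckets starting from the minimum value (0 by default). It also accepts negative numbers, although awk's int() truncates toward zero, so values between -size and 0 are counted in the bucket labelled 0 to size.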
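
To see what truncation does around zero, here is a small Python sketch (an illustration only, not part of the answer above). trunc_bin mimics awk's int(), which truncates toward zero. floor_bin is a hypothetical alternative that rounds down, so that bin i always covers i*size up to (i+1)*size.

```python
import math


def trunc_bin(value, size):
    """Bin index via truncation toward zero, like awk's int()."""
    return int(value / size)


def floor_bin(value, size):
    """Bin index via floor: bin i covers i*size up to (i+1)*size."""
    return math.floor(value / size)


if __name__ == "__main__":
    size = 0.1
    # Columns: value, truncated index, floored index.
    for value in (-0.15, -0.05, 0.05, 0.15):
        print(value, trunc_bin(value, size), floor_bin(value, size))
```

With truncation, -0.05 and 0.05 share index 0, so the bin around zero is effectively twice as wide. With floor, -0.05 gets index -1 and the bins stay the same width on both sides of zero.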


ca1c2owp3#

Here is my attempt at solving this with Awk.
To run: awk -f belowscript.awk inputfile

BEGIN {
    PROCINFO["sorted_in"]="@ind_num_asc";
    delta = (delta == "") ? 0.1 : delta;
};

/^-?([0-9][0-9]*|[0-9]*(\.[0-9][0-9]*))/ {
    # Special case the [-delta - 0] case so it doesn't bin in the [0-delta] bin
    fractBin=$1/delta
    if (fractBin < 0 && int(fractBin) == fractBin)
        fractBin = fractBin+1
    prefix = (fractBin <= 0 && int(fractBin) == 0) ? "-" : ""
    bins[prefix int(fractBin)]++
}

END {
    for (var in bins)
    {
        srange = sprintf("%0.2f",delta * ((var >= 0) ? var : var-1))
        erange = sprintf("%0.2f",delta * ((var >= 0) ? var+1 : var))
        print srange " " erange " " bins[var]
    }
}

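Notes:

  • Like EdMorton, I added support for supplying the bin size on the command line.
  • It only prints bins that contain something.
  • Which bin an exact match goes into (the smaller or the larger one) naturally flips with this approach once negative values are involved, and needs adjusting to be consistent.
  • The 0 boundary needs special-casing for numbers in the first negative bin, because there is no such number as -0. Awk's associative arrays use strings as keys, so "-0" is possible, and with the @ind_num_asc sort order in the for loop, -0 appears to sort correctly, although this may not be portable.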
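
As a side note on keeping exact matches consistent across zero, one option is to pick a convention explicitly instead of relying on truncation. The Python sketch below is an assumption-laden illustration, not a translation of the Awk script above: lower_inclusive_bin and upper_inclusive_bin are hypothetical helper names, and a size of 0.5 is used because it is exactly representable in binary, which keeps the boundary cases free of rounding noise.

```python
import math


def lower_inclusive_bin(value, size):
    """Index i such that i*size <= value < (i+1)*size."""
    return math.floor(value / size)


def upper_inclusive_bin(value, size):
    """Index i such that i*size < value <= (i+1)*size."""
    return math.ceil(value / size) - 1


if __name__ == "__main__":
    size = 0.5  # exactly representable in binary floating point
    for value in (-1.0, -0.5, 0.0, 0.5, 1.0):
        lo = lower_inclusive_bin(value, size)
        up = upper_inclusive_bin(value, size)
        # Print the value and the (start, end) range chosen by each convention.
        print(value, (lo * size, (lo + 1) * size), (up * size, (up + 1) * size))
```

Because the index is a plain integer, the bin just below zero is simply index -1 and no "-0" key is needed. For bin sizes such as 0.1, which are not exactly representable, values on a boundary can still land on either side.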

gfttwv5a4#

Another solution, in Python:


# draw histogram in command line with Python
#
# usage: $ cat datafile.txt | python this_script.py [nbins] [nscale]
# The input should be one column of numbers to be piped in.
#
# forked from https://gist.github.com/bgbg

from __future__ import print_function
import sys
import numpy as np

def asciihist(it, bins=10, minmax=None, str_tag='',
              scale_output=30, generate_only=False, print_function=print):
    """Create an ASCII histogram from an iterable of numbers.
    Author: Boris Gorelik boris@gorelik.net. based on  http://econpy.googlecode.com/svn/trunk/pytrix/pytrix.py
    License: MIT
    """
    ret = []
    itarray = np.asanyarray(it)
    if minmax == 'auto':
        minmax = np.percentile(it, [5, 95])
        if minmax[0] == minmax[1]:
            # for very ugly distributions
            minmax = None
    if minmax is not None:
        # discard values that are outside minmax range
        mn = minmax[0]
        mx = minmax[1]
        itarray = itarray[itarray >= mn]
        itarray = itarray[itarray <= mx]
    if itarray.size:
        total = len(itarray)
        counts, cutoffs = np.histogram(itarray, bins=bins)
        cutoffs = cutoffs[1:]
        if str_tag:
            str_tag = '%s ' % str_tag
        else:
            str_tag = ''
        if scale_output is not None:
            scaled_counts = counts.astype(float) / counts.sum() * scale_output
        else:
            scaled_counts = counts

        if minmax is not None:
            ret.append('Trimmed to range (%s - %s)' % (str(minmax[0]), str(minmax[1])))
        for cutoff, original_count, scaled_count in zip(cutoffs, counts, scaled_counts):
            ret.append("{:s}{:>8.2f} |{:<7,d} | {:s}".format(
                str_tag,
                cutoff,
                original_count,
                "*" * int(scaled_count))
                       )
        ret.append(
            "{:s}{:s} |{:s} | {:s}".format(
                str_tag,
                '-' * 8,
                '-' * 7,
                '-' * 7
            )
        )
        ret.append(
            "{:s}{:>8s} |{:<7,d}".format(
                str_tag,
                'N=',
                total
            )
        )
    else:
        ret = []
    if not generate_only:
        for line in ret:
            print_function(line)
    ret = '\n'.join(ret)
    return ret

if __name__ == '__main__':

    nbins=30
    if len(sys.argv) >= 2:
        nbins = int(sys.argv[1])
    nscale=400
    if len(sys.argv) == 3:
        nscale = int(sys.argv[2])

    dataIn =[]

    for line in sys.stdin:
        if line.strip() != '':
            dataIn.append(float(line))

    asciihist(dataIn, bins=nbins, scale_output=nscale, minmax=None, str_tag='BIN')
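
The script above takes a number of bins and loads all values into memory. If you want fixed-width bins in the "lower upper count" format from the question, and the file is large, a streaming variant may be easier to work with. The following is a minimal sketch using only the standard library. It is not derived from the gist, the names count_bins and format_rows are made up here, and the "%.1f" output format assumes a width such as 0.1.

```python
import math
from collections import Counter


def count_bins(lines, width=0.1):
    """Count values per bin index, consuming one line at a time."""
    counts = Counter()
    for line in lines:
        text = line.strip()
        if text:  # skip blank lines
            counts[math.floor(float(text) / width)] += 1
    return counts


def format_rows(counts, width=0.1, start=0):
    """Return 'lower upper count' rows from bin `start` (or the lowest
    non-empty bin, if that is smaller) up to the highest non-empty bin."""
    if not counts:
        return []
    first = min(start, min(counts))
    last = max(counts)
    return ["%.1f %.1f %d" % (i * width, (i + 1) * width, counts[i])
            for i in range(first, last + 1)]


if __name__ == "__main__":
    # Sample values from the question; pass sys.stdin or an open file
    # instead of this list to process a large input line by line.
    sample = ["1.37987", "1.21448", "0.624999", "1.28966",
              "1.77084", "1.088", "1.41667"]
    for row in format_rows(count_bins(sample, 0.1), 0.1):
        print(row)
```

Only the per-bin counters are kept in memory, so the size of the input file does not matter much. Empty bins between the first and last row are printed with a count of 0.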
