Linux: percentage value with GNU diff

xkrw2x1b, posted on 2023-05-22 in Linux

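What is a good way to use diff to show the percentage difference between two files?
For example, if a file has 100 lines and a copy has 15 lines that have been changed, the difference would be 15%.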

v64noz0r #1

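Something like this, perhaps?
Two files, A1 and A2.
$ sdiff -B -b -s A1 A2 | wc -l will give you the number of lines that differ. wc -l on the file itself gives the total line count; just divide one by the other.
-b and -B ignore whitespace changes and blank lines, and -s suppresses the common lines.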
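
The division step can be sketched as a small shell function. This is only a sketch under assumptions: the name `pct_diff` is made up for illustration, and taking the non-blank lines of the first file as the total is a choice the answer leaves open.

```shell
#!/bin/bash
# pct_diff (hypothetical helper): integer percentage of lines that sdiff
# reports as different between two files.
# Assumption: the total is the number of non-empty lines in the first file.
pct_diff() {
    local a="$1" b="$2" differing total
    # -s suppresses common lines, so every output line is one difference
    differing=$(sdiff -B -b -s "$a" "$b" | wc -l)
    # grep -c . counts non-empty lines; "|| true" keeps a zero count from
    # aborting scripts that run with "set -e"
    total=$(grep -c . "$a" || true)
    # avoid dividing by zero when the reference file is empty
    if [ "$total" -eq 0 ]; then
        echo 0
        return
    fi
    echo $(( 100 * differing / total ))
}
```

Calling `pct_diff A1 A2` prints a whole number; for the 100-line file with 15 changed lines from the question it prints 15. Shell arithmetic truncates, so a tool such as bc or awk would be needed for fractional percentages.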

2izufjch #2

https://superuser.com/questions/347560/is-there-a-tool-to-measure-file-difference-percentage has a nice solution for this:

wdiff -s file1.txt file2.txt

See man wdiff for more options.

ibrsph3r #3

Here is a script that compares all .txt files and reports those with more than 15% duplication:

#!/bin/bash

# Walk through all files in the current directory (and subdirectories),
# compare each one with the other files, and show the percentage
# of duplication.

# Which type of files to compare?
# (It would not make sense to compare binary formats.)
ext="txt"

# Support filenames with spaces: split only on newline and backspace.
IFS=$(echo -en "\n\b")

working_dir="$PWD"
working_dir_name="${working_dir##*/}"
all_files="$working_dir/../$working_dir_name-filelist.txt"
remaining_files="$working_dir/../$working_dir_name-remaining.txt"

# List "size name" for every matching file, largest first,
# skipping hidden files and directories:
find . -type f -print0 | xargs -0 stat -c "%s %n" | grep -v "/\." | \
     grep "\.$ext\$" | sort -nr > "$all_files"

cp "$all_files" "$remaining_files"

while read -r string; do
    # strip the leading size field, keeping the "./path" part
    fileA=$(echo "$string" | sed 's/.[^.]*\./\./')
    # drop the first entry so that each pair is compared only once
    tail -n +2 "$remaining_files" > "$remaining_files.temp"
    mv "$remaining_files.temp" "$remaining_files"
    # remove empty lines since they produce false positives
    sed '/^$/d' "$fileA" > tempA

    echo "Comparing $fileA with other files..."

    while read -r string; do
        fileB=$(echo "$string" | sed 's/.[^.]*\./\./')
        sed '/^$/d' "$fileB" > tempB
        A_len=$(wc -l < tempA)
        B_len=$(wc -l < tempB)
        # skip empty files to avoid dividing by zero
        [[ $B_len -eq 0 ]] && continue

        differences=$(sdiff -B -s tempA tempB | wc -l)
        # rough estimate of the lines that A shares with B
        common=$((A_len - differences))

        percentage=$((100 * common / B_len))
        if [[ $percentage -gt 15 ]]; then
            echo "  $percentage% duplication in" \
                 "$(echo "$fileB" | sed 's|\./||')"
        fi
    done < "$remaining_files"
    echo " "
done < "$all_files"

# -f: tempB does not exist if fewer than two files were found
rm -f tempA tempB "$all_files" "$remaining_files"

e7arh2l6 #4

Here is a quick bash solution using comm. Blank lines are ignored.

file_1="file_1.txt"
file_2="file_2.txt"
lines_1="$(grep -c '.' "$file_1")"
lines_2="$(grep -c '.' "$file_2")"
max_lines=$((lines_1 > lines_2 ? lines_1 : lines_2))
same_lines="$(comm -1 -2 <(grep '.' "$file_1" |sort) <(grep '.' "$file_2" |sort) |grep -c '.')"
diff_lines=$((max_lines-same_lines))
pct_change=0
[[ $max_lines -gt 0 ]] && pct_change=$((100*$diff_lines/$max_lines))
echo "Percent change = ${pct_change}% ($diff_lines of $max_lines lines are different.)"

Example output:

Percent change = 33% (4 of 12 lines are different.)

The wdiff and sdiff solutions are nice, but those utilities are often not installed in default environments.
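
Where only diff itself is available, GNU diff's line-format options can do the counting. The following is a minimal sketch that assumes GNU diff (the `--old-line-format` family of options is a GNU extension). The helper name `changed_pct` is hypothetical. Unlike the comm version it respects line order, and it counts only lines of the first file that were changed or deleted, not lines added in the second file.

```shell
#!/bin/bash
# changed_pct (hypothetical helper): integer percentage of lines in the old
# file that GNU diff reports as changed or deleted, using only diff and wc.
changed_pct() {
    local old="$1" new="$2" changed total
    # Print only the old-file lines that have no match in the new file;
    # unchanged lines and new-file lines are formatted as empty strings.
    changed=$(diff --old-line-format='%L' --new-line-format='' \
                   --unchanged-line-format='' "$old" "$new" | wc -l)
    total=$(wc -l < "$old")
    # avoid dividing by zero when the old file is empty
    if [ "$total" -eq 0 ]; then
        echo 0
        return
    fi
    echo $(( 100 * changed / total ))
}
```

Usage would look like `changed_pct file_1.txt file_2.txt`. Blank lines are counted here; filtering both files through grep '.' first, as the comm version does, would ignore them.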
