Compare two columns if the value exists in both CSV files

58wvjzkj posted on 2023-09-27 in Other

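The goal is to collect the names of the files that must be copied to the target file system (using AWK). A file qualifies if any of the following holds:
1. It is listed in source.csv but not in target.csv
2. The file sizes differ
3. The timestamp in the source is later than the timestamp in the target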
Source.csv

"2023-08-25","test/test2/filename1","10.00 B"
"2023-07-25","test/test2/filename2","15.00 B"
"2023-07-25","test/test2/filename3","5.00 B"
"2023-07-25","test/test2/filename4","5.00 B"

Target.csv

"2023-08-25","test/test2/filename0","10.00 B"
"2023-07-25","test/test2/filename2","10.00 B"
"2023-07-24","test/test2/filename3","5.00 B"
"2023-07-25","test/test2/filename4","5.00 B"

Expected output:

"2023-08-25","test/test2/filename1","10.00 B"  ### Because it does not exist in target.csv
"2023-07-25","test/test2/filename2","10.00 B"  ### Because the size is different
"2023-07-24","test/test2/filename3","5.00 B"   ### Because the timestamp in source.csv is greater than in target.csv (meaning a new version in source, not in target)

For the files that exist only in the source, I use:
awk -v FS="," 'BEGIN { OFS = FS } FNR == NR { unique[$2]; next } !($2 in unique) { print $2 }' target.csv source.csv | tr -d "\"" > files_to_copy.txt
But I have not been able to write the code for the other two conditions; I lack the AWK knowledge. Any help? :)


nc1teljy1#

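Assumptions:

  • All fields are wrapped in a pair of double quotes
  • There are no embedded/escaped double quotes in the data fields
  • File names are unique within a file (i.e. a file name does not appear more than once in the same file)
  • All sizes are measured in B (bytes)
  • The first non-blank character in the size field is a digit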
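
These assumptions are what make the `-F'"'` field separator and the `+0` conversion used in this answer safe. A minimal sketch of both effects on one row from Source.csv (the variable names `line`, `fields` and `size` are just for illustration):

```shell
line='"2023-08-25","test/test2/filename1","10.00 B"'

# With -F'"' the quoted values land in the even-numbered fields:
# $2 = date, $4 = file name, $6 = size text.
fields=$(printf '%s\n' "$line" | awk -F'"' '{ print $2 "|" $4 "|" $6 }')
echo "$fields"

# Adding 0 forces a numeric conversion that reads the leading number
# and ignores the rest of the string (" B").
size=$(printf '%s\n' "$line" | awk -F'"' '{ print $6 + 0 }')
echo "$size"
```

If a size carried another unit (for example KB), `+0` would still return only the leading number, which is why the unit assumption matters.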

One awk idea:

awk -F'"' '                          # input field separator is double quote => data values are in even-numbered fields
FNR==NR { unique[$4]                 # use filename index for arrays
          size[$4]=$6+0              # "+0" will strip spaces and trailing "B", leaving us with just a number
          date[$4]=$2
          next
        }
!( $4 in unique       ) ||           #      if source file not in unique[] array then print current line
 ( size[$4] != ($6+0) ) ||           # (or) if sizes are different then print current line
 ( $2 > date[$4]      )              # (or) if source date is greater than target date then print current line
' target.csv source.csv

This produces:

"2023-08-25","test/test2/filename1","10.00 B"
"2023-07-25","test/test2/filename2","15.00 B"
"2023-07-25","test/test2/filename3","5.00 B"
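
Since the stated goal is a list of file names rather than whole CSV rows, the same three conditions can print only the name field. A sketch under the same assumptions as this answer; the scratch directory is only there to make the example self-contained, and `files_to_copy.txt` is the output name used in the question:

```shell
# Recreate the sample data from the question in a scratch directory.
tmp=$(mktemp -d)
cat > "$tmp/source.csv" <<'EOF'
"2023-08-25","test/test2/filename1","10.00 B"
"2023-07-25","test/test2/filename2","15.00 B"
"2023-07-25","test/test2/filename3","5.00 B"
"2023-07-25","test/test2/filename4","5.00 B"
EOF
cat > "$tmp/target.csv" <<'EOF'
"2023-08-25","test/test2/filename0","10.00 B"
"2023-07-25","test/test2/filename2","10.00 B"
"2023-07-24","test/test2/filename3","5.00 B"
"2023-07-25","test/test2/filename4","5.00 B"
EOF

# Same three conditions, but the action prints only the file name ($4).
awk -F'"' '
FNR==NR { seen[$4]; size[$4]=$6+0; date[$4]=$2; next }   # first file: target
!($4 in seen) || size[$4] != ($6+0) || $2 > date[$4] { print $4 }
' "$tmp/target.csv" "$tmp/source.csv" > "$tmp/files_to_copy.txt"

cat "$tmp/files_to_copy.txt"
```

Because the quote is the field separator, `$4` is already free of quotes, so the `tr -d "\""` step from the question is not needed here.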

rta7y2nd2#

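Using any POSIX awk, regardless of which characters (other than newlines) appear in the file names in the CSV, and assuming each file name is unique within a file: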

$ cat tst.awk
BEGIN { FS="," }
{
    name = $0
    gsub(/^"[^"]*|[^"]*"$/,"",name)
}
NR == FNR {
    d[name] = $1
    s[name] = $NF
    next
}
!(name in d) || ($1 > d[name]) || ($NF != s[name])
$ awk -f tst.awk Target.csv Source.csv
"2023-08-25","test/test2/filename1","10.00 B"
"2023-07-25","test/test2/filename2","15.00 B"
"2023-07-25","test/test2/filename3","5.00 B"

The code above assumes there are no commas or double quotes in the first or last field of the CSV.
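
To illustrate the claim about file names, here is a small sketch that runs the same program on hypothetical rows whose file name contains a comma. The sample rows, the `tmp2` directory and the `result` variable are made up for the demonstration:

```shell
tmp2=$(mktemp -d)

# The program from this answer, written to a scratch file.
cat > "$tmp2/tst.awk" <<'EOF'
BEGIN { FS="," }
{
    name = $0
    gsub(/^"[^"]*|[^"]*"$/,"",name)   # strip first and last field, keep the middle as the key
}
NR == FNR {
    d[name] = $1
    s[name] = $NF
    next
}
!(name in d) || ($1 > d[name]) || ($NF != s[name])
EOF

# Hypothetical rows: the middle (file name) field contains a comma.
printf '%s\n' '"2023-07-25","test/a,b","5.00 B"' \
              '"2023-07-24","test/c,d","5.00 B"' > "$tmp2/Target.csv"
printf '%s\n' '"2023-07-25","test/a,b","5.00 B"' \
              '"2023-07-25","test/c,d","5.00 B"' > "$tmp2/Source.csv"

result=$(awk -f "$tmp2/tst.awk" "$tmp2/Target.csv" "$tmp2/Source.csv")
echo "$result"
```

Only the `test/c,d` row is printed, because its source date is later than the target date. The comma inside the name does no harm, since the date and size are taken from `$1` and `$NF` and the key is whatever remains in between.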
