用于填充文件中日期之间空白的Unix命令

lstz6jyr  于 2023-10-18  发布在  Unix
关注(0)|答案(6)|浏览(152)

我想找到名字和年龄相同但日期不同的记录。然后,如果有一个前一行日期和下一行日期之间的差距,我必须填补差距。
样本数据文件. txt

20230907,Allan,29,Marketing
20230912,Allan,29,VirtualAssistant
20230913,Allan,29,Programmer
20230920,Daniel,28,Engineer
20230922,Daniel,28, Photographer

到目前为止我所做的:

#create zero byte file for all filled gaps
cat /dev/null > fillGap.txt
For line in `awk -F"," '{print $2","$3}' file.txt`;do
#if name,age NOT found in fillGap.txt then grep everything that matches in file.txt
if [[ -z `grep -w ${line} fillGap.txt` ]];then
grep -w ${line} file.txt > MatchNameAge.txt
#this is the part of checking if there is a gap between dates and if so, gaps will be filled. I haven't figured it out on how I should do it. maybe you could help me
#after filling the gap, the transformed will be appended in fileGap.txt
else
  #if name,age have already found in filledGap.txt, there's nothing to do.
fi
done

我的代码缺乏大量的上下文,因为我只有想法,并继续一点一点地尝试如何编码。

这个想法可能工作得很好,但是,对大文件使用for循环会导致长时间运行。
你能帮助我实现目标,尽可能缩短运行时间?
预期输出:

20230907,Allan,29,Marketing
20230908,Allan,29,Marketing
20230909,Allan,29,Marketing
20230910,Allan,29,Marketing
20230911,Allan,29,Marketing
20230912,Allan,29,VirtualAssistant
20230913,Allan,29,Programmer
20230920,Daniel,28,Engineer
20230921,Daniel,28,Engineer
20230922,Daniel,28, Photographer

我想你添加这样的情况下在我的数据如下。如果我有一个日期是月底,下一个日期是下个月,

20230930,Allan,29,Programmer
20231004,Allan,29,Engineer

输出应该如下所示

20230930,Allan,29,Programmer
20231001,Allan,29,Programmer
20231002,Allan,29,Programmer
20231003,Allan,29,Programmer
20231004,Allan,29,Engineer

重要的是,用相同的“名字,年龄”来填补日期之间的空白

xesrikrc

xesrikrc1#

使用任何sort加上GNU awk的时间函数和gensub()

$ cat tst.sh
#!/usr/bin/env bash

sort -t, -k2,2 -k1,1n "${@:--}" |
awk '
    BEGIN { FS=OFS="," }
    ($2 == p[2]) && ($3 == p[3]) {
        for ( date=nextDate(p[1]); date<$1; date=nextDate(date) ) {
            sub(/[^,]+/,date,p[0])
            print p[0]
        }
    }
    {
        print
        split($0,p)
        p[0] = $0
    }

    function nextDate(date,secs) {
        date = gensub(/(.{4})(..)(..)/,"\\1 \\2 \\3",1,date)
        secs = mktime(date " 12 0 0") + (24 * 60 * 60)
        return strftime("%Y%m%d",secs)
    }
'
$ ./tst.sh file.txt
20230907,Allan,29,Marketing
20230908,Allan,29,Marketing
20230909,Allan,29,Marketing
20230910,Allan,29,Marketing
20230911,Allan,29,Marketing
20230912,Allan,29,VirtualAssistant
20230913,Allan,29,Programmer
20230920,Daniel,28,Engineer
20230921,Daniel,28,Engineer
20230922,Daniel,28, Photographer

或者,使用任何awk只需将上面的nextDate()函数替换为以下版本:

function nextDate(date,     i,tmp,year,month,day,maxDays) {
        year  = substr(date,1,4)+0
        month = substr(date,5,2)+0
        day   = substr(date,7,2)+0

        # see https://www.timeanddate.com/date/leapyear.html
        if ( (month == 2) && !(year % 4) && ((year % 100) || !(year % 400)) ) {
            maxDays = 29
        }
        else {
            if ( !(1 in daysIn) ) {
                split("1 31 2 28 3 31 4 30 5 31 6 30 7 31 8 31 9 30 10 31 11 30 12 31",tmp," ")
                for ( i=1; i in tmp; i+=2 ) {
                    daysIn[i] = tmp[i+1]
                }
            }
            maxDays = daysIn[month]
        }

        if ( ++day > maxDays ) {
            day = 1
            if ( ++month > 12 ) {
                month = 1
                year++
            }
        }

        return sprintf("%04d%02d%02d",year,month,day)
    }
wko9yo5t

wko9yo5t2#

awk中处理日期有点棘手,所以我倾向于编写专门的函数来处理它们。
下面是一个假设排序输入的示例解决方案(使用gawkmawk进行测试):

  • parse.awk*
BEGIN { 
  sid = 60 * 60 * 24;   # Seconds in a day
  FS = OFS = ",";
}

# Convert date to seconds-since-the-unix-epoch (Jan 1, 1970)
function getdate(d) {
  year  = substr(d, 1, 4);
  month = substr(d, 5, 2);
  day   = substr(d, 7, 2);

  return mktime(year" "month" "day" 00 00 00");
}

# Add one day to date
function addday(d) {
  return strftime("%Y%m%d", getdate(d) + sid);
}

($2 SUBSEP $3) in nameage {
  d = nameage[$2,$3];

  while (addday(d) < $1) {
    print (addday(d), $2, $3, occupation[$2,$3]);
    d = addday(d);
  }
}

{
  nameage[$2,$3]    = $1;
  occupation[$2,$3] = $4;
}

1;

这样运行:

awk -f parse.awk infile

输出量:

20230907,Allan,29,Marketing
20230908,Allan,29,Marketing
20230909,Allan,29,Marketing
20230910,Allan,29,Marketing
20230911,Allan,29,Marketing
20230912,Allan,29,VirtualAssistant
20230913,Allan,29,Programmer
20230920,Daniel,28,Engineer
20230921,Daniel,28,Engineer
20230922,Daniel,28, Photographer
pengsaosao

pengsaosao3#

TXR中的解决方案:

$ txr filldate.txr data
20230907,Allan,29,Marketing
20230908,Allan,29,Marketing
20230909,Allan,29,Marketing
20230910,Allan,29,Marketing
20230911,Allan,29,Marketing
20230912,Allan,29,VirtualAssistant
20230913,Allan,29,Programmer
20230920,Daniel,28,Engineer
20230921,Daniel,28,Engineer
20230922,Daniel,28, Photographer

filldate.txr中的代码:

@(repeat)
@  (cases)
@date0,@name,@age,@rest0
@    (trailer)
@date1,@name,@age,@rest1
@    (bind dates @[giterate (op nequal date1)
                            (opip
                              (time-parse "%Y%m%d")
                              .(time-utc)
                              (+ 86400)
                              (time-str-utc "%Y%m%d"))
                            date0])
@    (output)
@      (repeat)
@dates,@name,@age,@rest0
@      (end)
@    (end)
@  (or)
@line
@    (do (put-line line))
@  (end)
@(end)

用Vim着色:

rdrgkggo

rdrgkggo4#

这可能对你有用(GNU sed和date):

sed -E '/\n/!N
       /(,[^,]*,[^,]*,).*\n.*\1/{/^([^,]*,).*\n\1/D
       s/^([^,]*)(,.*\n)/echo "&$(date -d "\1 +1 day" +%Y%m%d)\2"/e}
       P;D' file

使用N;P;D sed习惯用法,将下一行追加到当前行,打印/删除第一行并重复。
在这种情况下,如果模式空间中有一行或没有其他行,则追加下一行。
比较这些行的第二个和第三个字段,如果相同:
比较这些行的日期,如果相同,则删除第一行并重复。
如果日期不同,请在第一行和第二行之间插入一行明天的日期。
无论如何,打印/删除第一行并重复。
注意:工作的主要部分是由需要插值echo命令的求值替换完成的。
如果文件未排序,则用途:

sort -t, -k2,3 -k1,1 file |
sed -E '/\n/!N
       /(,[^,]*,[^,]*,).*\n.*\1/{/^([^,]*,).*\n\1/D
       s/^([^,]*)(,.*\n)/echo "&$(date -d "\1 +1 day" +%Y%m%d)\2"/e}
       P;D'
wpx232ag

wpx232ag5#

使用sqlite3

#!/bin/bash

# create header in file.txtA
a=$(head -1 file.txt | sed -e 's/[^,]*,*/#/g')
b=abcdefgh;
c=$(echo ${b:0:${#a}} | sed -E 's/(.)/\1,/g')
c=${c%?}
sed -i -e "1i$c" file.txt

# create output
sql=".import file.txt file
   update file set a=date(substr(a,1,4)||'-'||substr(a,5,2)||'-'||substr(a,7,2));
   with cte as (
      select a,b,c,d,lead(a) over (partition by b order by a) as m from file
      union all
      select date(a,'1 day'),b,c,d,m
      from cte
      where date(a,'1 day') < m
   )
   select replace(a,'-','') as a,b,c,d
   from cte
   order by a,b;
"
db=$(mktemp)
echo "$sql" | sqlite3 -csv $db

# strip header and remove db
sed -i -e '1d' file.txt
rm -f $db
  • 将为数据库创建一个临时文件(使用:mktemp
  • 头被插入到file.txt。这是使用我能想到的最简单的名称来完成的,所以第一列将命名为a,第二列将命名为b,.(见注解)
  • 在SQL部分,日期被格式化为20230907变成2023-09-07。这样SQLite明天就可以使用date('2023-09-07','1 day')进行计算。
  • 查询-时,日期中的符号再次被删除。
  • cte根据lead(a) over (partition by b order by a)的值添加记录,lead(a) over (partition by b order by a)是个人可用的下一个数据。
  • 因为我改变了原始的输入文件(添加了一个头),所以我也删除了脚本末尾的这一行。(见注解)
  • 最后,删除临时数据库。

注意:我不应该改变原来的输入,但使用另一个临时文件,.
注2:因为CSV是一种奇怪的格式,所以Photographer的输出不是您所期望的。输出为" Photographer"。更多关于这方面的阅读,请参阅:Spaces between separator and value in csv file

uqzxnwby

uqzxnwby6#

这里有一种方法可以做到这一点,而不需要以大量正则表达式为代价不停地调用mktime()strftime()(它应该能够处理所有闰年计算):

echo ' 
20230907,Allan,29,Marketing
20230912,Allan,29,VirtualAssistant
20230913,Allan,29,Programmer
20230920,Daniel,28,Engineer
20240317,Daniel,28, Photographer' |
mawk '
function ______(__, ___, ____, _____, _) {
      __ = (_ = "")__
    ____ = (_)____

    while((__ = sprintf("%d",__ += __~/([01][1-9]|10|2[0-7])$/ ?    1 \
                                 : __~                 /1231$/ ? 8870 \
       : __~/(11|[469])30$/ ? 71 : __~           /[013578]31$/ ?   70 \
       : __~/22[89]$/        &&    __~((_ = \ 
         __~"(([2468][048]|[13579][26]|(^|[0+-])[48])(00)?|"\
            "0000|^[+-]?0*)....$") ? "9$" : "8$") ? 73-_ : 1)) < ____)
        print __,
             ___, _____ 

 } BEGIN { FS =   (_ = "[ \t]*") (____ = OFS = ",")_ } {
 
     if (___[_ = $2____$3] && __[_] + 1 < +$1)
         ______(__[_], _, $1, ___[_])

     print __[_] = $1, _, ___[_] = $4 }'
20230907,Allan,29,Marketing
20230908,Allan,29,Marketing
20230909,Allan,29,Marketing
20230910,Allan,29,Marketing
20230911,Allan,29,Marketing
20230912,Allan,29,VirtualAssistant
20230913,Allan,29,Programmer
20230920,Daniel,28,Engineer
20230921,Daniel,28,Engineer
20230922,Daniel,28,Engineer
20230923,Daniel,28,Engineer
20230924,Daniel,28,Engineer
20230925,Daniel,28,Engineer
20230926,Daniel,28,Engineer
20230927,Daniel,28,Engineer
20230928,Daniel,28,Engineer
20230929,Daniel,28,Engineer
20230930,Daniel,28,Engineer
20231001,Daniel,28,Engineer
20231002,Daniel,28,Engineer
20231003,Daniel,28,Engineer
20231004,Daniel,28,Engineer
20231005,Daniel,28,Engineer
20231006,Daniel,28,Engineer
20231007,Daniel,28,Engineer
20231008,Daniel,28,Engineer
20231009,Daniel,28,Engineer
20231010,Daniel,28,Engineer
20231011,Daniel,28,Engineer
20231012,Daniel,28,Engineer
20231013,Daniel,28,Engineer
20231014,Daniel,28,Engineer
20231015,Daniel,28,Engineer
20231016,Daniel,28,Engineer
20231017,Daniel,28,Engineer
20231018,Daniel,28,Engineer
20231019,Daniel,28,Engineer
20231020,Daniel,28,Engineer
20231021,Daniel,28,Engineer
20231022,Daniel,28,Engineer
20231023,Daniel,28,Engineer
20231024,Daniel,28,Engineer
20231025,Daniel,28,Engineer
20231026,Daniel,28,Engineer
20231027,Daniel,28,Engineer
20231028,Daniel,28,Engineer
20231029,Daniel,28,Engineer
20231030,Daniel,28,Engineer
20231031,Daniel,28,Engineer
20231101,Daniel,28,Engineer
20231102,Daniel,28,Engineer
20231103,Daniel,28,Engineer
20231104,Daniel,28,Engineer
20231105,Daniel,28,Engineer
20231106,Daniel,28,Engineer
20231107,Daniel,28,Engineer
20231108,Daniel,28,Engineer
20231109,Daniel,28,Engineer
20231110,Daniel,28,Engineer
20231111,Daniel,28,Engineer
20231112,Daniel,28,Engineer
20231113,Daniel,28,Engineer
20231114,Daniel,28,Engineer
20231115,Daniel,28,Engineer
20231116,Daniel,28,Engineer
20231117,Daniel,28,Engineer
20231118,Daniel,28,Engineer
20231119,Daniel,28,Engineer
20231120,Daniel,28,Engineer
20231121,Daniel,28,Engineer
20231122,Daniel,28,Engineer
20231123,Daniel,28,Engineer
20231124,Daniel,28,Engineer
20231125,Daniel,28,Engineer
20231126,Daniel,28,Engineer
20231127,Daniel,28,Engineer
20231128,Daniel,28,Engineer
20231129,Daniel,28,Engineer
20231130,Daniel,28,Engineer
20231201,Daniel,28,Engineer
20231202,Daniel,28,Engineer
20231203,Daniel,28,Engineer
20231204,Daniel,28,Engineer
20231205,Daniel,28,Engineer
20231206,Daniel,28,Engineer
20231207,Daniel,28,Engineer
20231208,Daniel,28,Engineer
20231209,Daniel,28,Engineer
20231210,Daniel,28,Engineer
20231211,Daniel,28,Engineer
20231212,Daniel,28,Engineer
20231213,Daniel,28,Engineer
20231214,Daniel,28,Engineer
20231215,Daniel,28,Engineer
20231216,Daniel,28,Engineer
20231217,Daniel,28,Engineer
20231218,Daniel,28,Engineer
20231219,Daniel,28,Engineer
20231220,Daniel,28,Engineer
20231221,Daniel,28,Engineer
20231222,Daniel,28,Engineer
20231223,Daniel,28,Engineer
20231224,Daniel,28,Engineer
20231225,Daniel,28,Engineer
20231226,Daniel,28,Engineer
20231227,Daniel,28,Engineer
20231228,Daniel,28,Engineer
20231229,Daniel,28,Engineer
20231230,Daniel,28,Engineer
20231231,Daniel,28,Engineer
20240101,Daniel,28,Engineer
20240102,Daniel,28,Engineer
20240103,Daniel,28,Engineer
20240104,Daniel,28,Engineer
20240105,Daniel,28,Engineer
20240106,Daniel,28,Engineer
20240107,Daniel,28,Engineer
20240108,Daniel,28,Engineer
20240109,Daniel,28,Engineer
20240110,Daniel,28,Engineer
20240111,Daniel,28,Engineer
20240112,Daniel,28,Engineer
20240113,Daniel,28,Engineer
20240114,Daniel,28,Engineer
20240115,Daniel,28,Engineer
20240116,Daniel,28,Engineer
20240117,Daniel,28,Engineer
20240118,Daniel,28,Engineer
20240119,Daniel,28,Engineer
20240120,Daniel,28,Engineer
20240121,Daniel,28,Engineer
20240122,Daniel,28,Engineer
20240123,Daniel,28,Engineer
20240124,Daniel,28,Engineer
20240125,Daniel,28,Engineer
20240126,Daniel,28,Engineer
20240127,Daniel,28,Engineer
20240128,Daniel,28,Engineer
20240129,Daniel,28,Engineer
20240130,Daniel,28,Engineer
20240131,Daniel,28,Engineer
20240201,Daniel,28,Engineer
20240202,Daniel,28,Engineer
20240203,Daniel,28,Engineer
20240204,Daniel,28,Engineer
20240205,Daniel,28,Engineer
20240206,Daniel,28,Engineer
20240207,Daniel,28,Engineer
20240208,Daniel,28,Engineer
20240209,Daniel,28,Engineer
20240210,Daniel,28,Engineer
20240211,Daniel,28,Engineer
20240212,Daniel,28,Engineer
20240213,Daniel,28,Engineer
20240214,Daniel,28,Engineer
20240215,Daniel,28,Engineer
20240216,Daniel,28,Engineer
20240217,Daniel,28,Engineer
20240218,Daniel,28,Engineer
20240219,Daniel,28,Engineer
20240220,Daniel,28,Engineer
20240221,Daniel,28,Engineer
20240222,Daniel,28,Engineer
20240223,Daniel,28,Engineer
20240224,Daniel,28,Engineer
20240225,Daniel,28,Engineer
20240226,Daniel,28,Engineer
20240227,Daniel,28,Engineer
20240228,Daniel,28,Engineer
20240229,Daniel,28,Engineer
20240301,Daniel,28,Engineer
20240302,Daniel,28,Engineer
20240303,Daniel,28,Engineer
20240304,Daniel,28,Engineer
20240305,Daniel,28,Engineer
20240306,Daniel,28,Engineer
20240307,Daniel,28,Engineer
20240308,Daniel,28,Engineer
20240309,Daniel,28,Engineer
20240310,Daniel,28,Engineer
20240311,Daniel,28,Engineer
20240312,Daniel,28,Engineer
20240313,Daniel,28,Engineer
20240314,Daniel,28,Engineer
20240315,Daniel,28,Engineer
20240316,Daniel,28,Engineer
20240317,Daniel,28,Photographer

相关问题