csv Bash脚本查找重复值并修改它们

wmtdaxz3  于 2023-02-27  发布在  其他
关注(0)|答案(2)|浏览(151)

有一个脚本,解析csv文件,但它不工作,因为需要这里是一些内容,从csv文件

id,location_id,name,title,email,department
1,1,Susan houston,Director of Services,,
2,1,Christina Gonzalez,Director,,
3,2,Brenda brown,"Director, Second Career Services",,
4,3,Howard Lader,"Manager, Senior Counseling",,
5,4,Kimberly Pesavento,Commercial director,,
6,5,Joe Bloom,Financial Empowerment Programs Program Director,,
7,6,peter Olson,Director,,
8,6,Bart charlow,Executive Director,,
9,7,Bart Charlow,Executive Director,,
#!/bin/bash

#create path to redirect accounts.csv to same directory as accounts_new.csv
path=$(dirname $1)

# Substituted commmas with vertical lines, so sed command works 
awk -F'"' -v OFS='"' '{ for (i=2; i<=NF; i+=2) gsub(",", "|", $i) } 1' accounts.csv |

# Changed first letters of names to uppercase
awk -F, -v col=3 ' 
NR > 1{
  n=split(tolower($col),a," ")
  $col=toupper(substr(a[1],1,1)) substr(a[1],2)
  for(i=2;i<=n;i++) {
    $col=$col " " toupper(substr(a[i],1,1)) substr(a[i],2)
  }
}1' OFS="," |

# Generated email addresses
sed -E 's/([^,]*,([^,]*),) ?(([[:alpha:]])[^ ]* +)(([^,]*),[^,]*,)[^,]*/\1\u\3\u\5\L\4\6\@abc.com/' |

awk -F, '{for (i=5;i<=5;i++){if (v[i,$i]++){b[$i]=$i; 
  $i=split($3,arr," ")
  val=(substr($3,1,1) arr[2]$2"@abc.com")
  $5=tolower(val)
  }};print $0}' OFS="," |

# Added missing commas and sent output to new file
sed -E 's/\|/\,/g' > $path"/accounts_new.csv"

以下是脚本的输出

id,location_id,name,title,email,department
1,1,Susan Houston,Director of Services,shouston@abc.com,
2,1,Christina Gonzalez,Director,cgonzalez@abc.com,
3,2,Brenda Brown,Director, Second Career Services,bbrown@abc.com,
4,3,Howard Lader,Manager, Senior Counseling,hlader@abc.com,
5,4,Kimberly Pesavento,Commercial director,kpesavento@abc.com,
6,5,Joe Bloom,Financial Empowerment Programs Program Director,jbloom@abc.com,
7,6,Peter Olson,Director,polson@abc.com,
8,6,Bart Charlow,Executive Director,bcharlow@abc.com,
9,7,Bart Charlow,Executive Director,bcharlow7@abc.com,

但期望的输出是

id,location_id,name,title,email,department
1,1,Susan Houston,Director of Services,shouston@abc.com,
2,1,Christina Gonzalez,Director,cgonzalez@abc.com,
3,2,Brenda Brown,Director, Second Career Services,bbrown@abc.com,
4,3,Howard Lader,Manager, Senior Counseling,hlader@abc.com,
5,4,Kimberly Pesavento,Commercial director,kpesavento@abc.com,
6,5,Joe Bloom,Financial Empowerment Programs Program Director,jbloom@abc.com,
7,6,Peter Olson,Director,polson@abc.com,
8,6,Bart Charlow,Executive Director,bcharlow6@abc.com,
9,7,Bart Charlow,Executive Director,bcharlow7@abc.com,

正如您从最后2行中看到的,location_id仅附加到第二个相同的电子邮件,但第一个保持不变,我如何将location_id也添加到第一个?
你能帮帮我吗?谢谢!
我尝试使用while语句代替if语句,但没有用

jvidinwx

jvidinwx1#

假设/理解:

  • 某些输入字段可能会用双引号括起来;这些双引号 * 不 * 被删除(不像OP所需的输出)
  • 由于逗号同时显示为分隔符和数据,我们将临时将数据示例转换为另一个在数据中不显示的字符;在本练习中,我们将使用竖线字符(|),否则我们将需要选择另一个替换字符
  • 所有name字段都由两个空格分隔的字符串组成(即,不必担心Bob B Burns IIILastname, Firstname之类的名称)
  • name字段中的两个空格分隔字符串都只需要将首字母大写(例如,我们不必担心名称混合大小写,如dePalmaMcDonald
  • 如果我们以重复的电子邮件帐户结束,则location_id保证是唯一的
  • 输入文件可以放入内存(如awk数组)

一旦awk被拉入混合,就很少需要sed或额外的awk脚本,也就是说,我们应该能够使用单个awk脚本执行整个操作。
一个awk创意:

awk '
BEGIN { FS="\""; OFS="," }                              # input is delimited by double qutoes
NR==1 { print; next }
      { line=""
        for (i=1;i<NF;i+=2) {                           # loop through odd numbered fields
            gsub(/,/,"|",$(i+1))                        # in even numbered double-quote-delimited fields replace commas with pipes
            line=line $i FS $(i+1) FS                   # rebuild the current line
        }
        line=line $NF                                   # add last field to new line

        split(line,a,",")                               # split new line on commas
        split(tolower(a[3]),b,/[[:space:]]+/)           # split tolower(name field) on white space

        # rebuild name with first characters of first/last names uppercased

        name=toupper(substr(b[1],1,1)) substr(b[1],2) " " toupper(substr(b[2],1,1)) substr(b[2],2)

        acct=substr(b[1],1,1) b[2]                      # build email acct name

        lines[NR]=a[1] OFS a[2] OFS name OFS a[4]       # rebuild current line based on first 4 fields
        locid[NR]=a[2]                                  # make note of location_id for current line
        email[NR]=acct                                  # make note of email acct for current line
        count[acct]++                                   # keep count of number of times we see this email acct
      }

END   { for (i=2;i<=NR;i++) {                           # loop through our lines of output
            gsub(/\|/,",",lines[i])                     # replace pipes with original commas

            # print final line of output; if email acct has been seen more than once then append the location_id to the email acct; add the "@abc.com" domain and the trailing comma

            print lines[i] OFS email[i] (count[email[i]] > 1 ? locid[i] : "") "@abc.com" OFS
        }
      }
' accounts.csv

这将产生:

id,location_id,name,title,email,department
1,1,Susan Houston,Director of Services,shouston@abc.com,
2,1,Christina Gonzalez,Director,cgonzalez@abc.com,
3,2,Brenda Brown,"Director, Second Career Services",bbrown@abc.com,
4,3,Howard Lader,"Manager, Senior Counseling",hlader@abc.com,
5,4,Kimberly Pesavento,Commercial director,kpesavento@abc.com,
6,5,Joe Bloom,Financial Empowerment Programs Program Director,jbloom@abc.com,
7,6,Peter Olson,Director,polson@abc.com,
8,6,Bart Charlow,Executive Director,bcharlow6@abc.com,
9,7,Bart Charlow,Executive Director,bcharlow7@abc.com,
tzdcorbm

tzdcorbm2#

使用任何awk的2遍方法:

$ cat tst.awk
BEGIN { FS=OFS="," }
NR == 1 { print }
FNR == 1 { next }
{ lc_name = tolower($3) }
NR == FNR {
    cnt[lc_name]++
    next
}
{
    n = split(lc_name,parts," ")
    for ( i=1; i<=n; i++ ) {
        name = ( i>1 ? name " " : "" ) toupper(substr(parts[i],1,1)) substr(parts[i],2)
        email = ( i>1 ? email parts[i] : substr(parts[i],1,1) )
    }

    $3 = name
    $(NF-1) = email ( cnt[lc_name] > 1 ? $2 : "" ) "@example.com"
    print
}
$ awk -f tst.awk accounts.csv accounts.csv
id,location_id,name,title,email,department
1,1,Susan Houston,Director of Services,shouston@example.com,
2,1,Christina Gonzalez,Director,cgonzalez@example.com,
3,2,Brenda Brown,"Director, Second Career Services",bbrown@example.com,
4,3,Howard Lader,"Manager, Senior Counseling",hlader@example.com,
5,4,Kimberly Pesavento,Commercial director,kpesavento@example.com,
6,5,Joe Bloom,Financial Empowerment Programs Program Director,jbloom@example.com,
7,6,Peter Olson,Director,polson@example.com,
8,6,Bart Charlow,Executive Director,bcharlow6@example.com,
9,7,Bart Charlow,Executive Director,bcharlow7@example.com,

在上面的例子中,我假设一个包含两个以上部分的名字,比如Billy Jean King,应该会得到一个类似bjeanking@example.com的电子邮件地址,如果你想要其他的,这是一个很小的调整。
我也没有删除字段中的引号,因为这会破坏数据,但如果您真的想要这样做,那么,再一次,这是一个微不足道的调整。

相关问题