Extracting a specific word from the "remarks" column of a CSV file

smtd7mpg  posted on 2023-09-28  in  Other

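I have a CSV input file, 'out_test.csv', that contains a remarks column. I want to extract 'aaaa' from that column together with one word on either side of it (+- 1 word, centred on 'aaaa'). I then need to add an extra column named 'info' and store those 3 words there.
Note: 'aaaa' may be missing from the column; in that case, write 'NA' in the 'info' column.
This is what I have done so far: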

with open('out_test.csv','r+') as csvf:   # open csv file
    lines = csvf.read().split("\n")  # split contents into lines
    for i, line in enumerate(lines): 
        row = line.split(",")   #  split lines into columns
        for j, col in enumerate(row):   
            if "aaaa" in col:   # check if keyword in column
                row.append(str(j))   # append the row to last column 
        lines[i] = ','.join(row)     

with open('out_test.csv', 'wt') as csvfw:
    csvfw.write('\n'.join(lines))    # write lines back to file.

Notes:
1. I adapted this from another post about more or less the same problem.
2. I am also running into some permission errors.


ruarlubt1#

I would do it this way:

import csv, re

WORD, N = "aaaa", 1

pattern = (
    rf"((?:\S+ +){{0,{N}}}\S*"  # word(s) before target
    rf"\b{re.escape(WORD)}\b"   # the targeted word
    rf"\S*(?: +\S+){{0,{N}}})"  # word(s) after target
)

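# newline="" is what the csv module docs recommend for files passed to
# csv readers/writers; without it, "w" mode can emit blank rows on Windows
with open("input.csv", "r", newline="") as inpf, \
     open("output.csv", "w", newline="") as outf:

    # copy every row and add "info": the matched window, or "NA" if WORD is absent
    data = [
        {**line, "info": re.search(pattern, line["remarks"]).group()
        if re.search(pattern, line["remarks"]) else "NA"}
        for line in csv.DictReader(inpf)
    ]

    # the header is taken from the first row, so input.csv must have at least one data row
    writer = csv.DictWriter(outf, fieldnames=data[0].keys())
    writer.writeheader()
    writer.writerows(data)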
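
To see what the pattern captures before running it on a file, here is a minimal sketch that applies it to plain strings. The helper names build_pattern and extract_context are made up for this illustration and are not part of the answer; the sample strings are taken from the input.csv shown further down.

```python
import re

def build_pattern(word, n=1):
    # Hypothetical helper: the same regex as above, parameterised on word and n
    return (
        rf"((?:\S+ +){{0,{n}}}\S*"  # up to n words before the target
        rf"\b{re.escape(word)}\b"   # the target word, escaped for safety
        rf"\S*(?: +\S+){{0,{n}}})"  # up to n words after the target
    )

def extract_context(text, word="aaaa", n=1):
    # Hypothetical helper: return the matched window, or "NA" when word is absent
    match = re.search(build_pattern(word, n), text)
    return match.group() if match else "NA"

print(extract_context("xxxyyyy zzzz aaaa bbbb ccc dd erfdefgrgesrg"))  # zzzz aaaa bbbb
print(extract_context("aaaa bbbbb wwww dddd zzzzz"))                   # aaaa bbbbb
print(extract_context("loezlerl oooezp bbbll"))                        # NA
print(extract_context("xxxyyyy zzzz aaaa bbbb ccc dd erfdefgrgesrg", n=2))
# xxxyyyy zzzz aaaa bbbb ccc
```

Because the quantifiers are {0,n}, a target at the start or end of the remark still matches, just with fewer neighbouring words. This sketch also runs re.search once per string and reuses the match object, rather than searching twice.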

Or the pandas equivalent (using the same pattern as above):

#pip install pandas
import pandas as pd

(
    pd.read_csv("input.csv")
      .assign(info=lambda x: x["remarks"]
              .str.extract(pattern, expand=False).fillna("NA"))
      .to_csv("output.csv", index=False)
)

The last 2 fields of output.csv (in tabular form):

remarks                  info
0  xxxyyyy zzzz aaaa bbbb ccc dd erfdefgrgesrg        zzzz aaaa bbbb
1  xxxyyyy zzzz aaaa bbbb ccc dd erfdefgrgesrg        zzzz aaaa bbbb
2                   aaaa bbbbb wwww dddd zzzzz            aaaa bbbbb
3  xxxyyyy zzzz aaaa bbbb ccc dd erfdefgrgesrg        zzzz aaaa bbbb
4  xxxyyyy zzzz aaaa bbbb ccc dd erfdefgrgesrg        zzzz aaaa bbbb
5            chuifheurfhr asbfswgfduwsegf aaaa  asbfswgfduwsegf aaaa
6                        loezlerl oooezp bbbll                    NA

Using this input.csv:

id,f_name,1_name,fc_id,remarks
1,Raj,Sharma,1,xxxyyyy zzzz aaaa bbbb ccc dd erfdefgrgesrg
1,Raj,Sharma,2,xxxyyyy zzzz aaaa bbbb ccc dd erfdefgrgesrg
1,Raj,Sharma,2,aaaa bbbbb wwww dddd zzzzz
2,Ram,Kapoor,1,xxxyyyy zzzz aaaa bbbb ccc dd erfdefgrgesrg
2,Ram,Kapoor,2,xxxyyyy zzzz aaaa bbbb ccc dd erfdefgrgesrg
3,Raju,Verma,3,chuifheurfhr asbfswgfduwsegf aaaa
4,XXXX,Yyyyy,4,loezlerl oooezp bbbll # <-- added by me
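
The table above can be reproduced without touching the filesystem, which may help while the permission errors are being sorted out. The following is only a sketch under a few assumptions: the function name add_info and the SAMPLE constant are invented here, the trailing "# <-- added by me" annotation is left out of the last row, and rows are streamed one at a time instead of being collected in a list.

```python
import csv
import io
import re

WORD, N = "aaaa", 1
pattern = (
    rf"((?:\S+ +){{0,{N}}}\S*"  # word(s) before target
    rf"\b{re.escape(WORD)}\b"   # the targeted word
    rf"\S*(?: +\S+){{0,{N}}})"  # word(s) after target
)

# Same rows as input.csv, held in memory (illustrative constant)
SAMPLE = """id,f_name,1_name,fc_id,remarks
1,Raj,Sharma,1,xxxyyyy zzzz aaaa bbbb ccc dd erfdefgrgesrg
1,Raj,Sharma,2,xxxyyyy zzzz aaaa bbbb ccc dd erfdefgrgesrg
1,Raj,Sharma,2,aaaa bbbbb wwww dddd zzzzz
2,Ram,Kapoor,1,xxxyyyy zzzz aaaa bbbb ccc dd erfdefgrgesrg
2,Ram,Kapoor,2,xxxyyyy zzzz aaaa bbbb ccc dd erfdefgrgesrg
3,Raju,Verma,3,chuifheurfhr asbfswgfduwsegf aaaa
4,XXXX,Yyyyy,4,loezlerl oooezp bbbll
"""

def add_info(in_stream, out_stream):
    # Hypothetical helper: copy rows from in_stream to out_stream, adding "info"
    reader = csv.DictReader(in_stream)
    writer = csv.DictWriter(out_stream, fieldnames=[*reader.fieldnames, "info"])
    writer.writeheader()
    for row in reader:
        match = re.search(pattern, row["remarks"])
        row["info"] = match.group() if match else "NA"
        writer.writerow(row)

out = io.StringIO()
add_info(io.StringIO(SAMPLE), out)
print(out.getvalue())
```

The same function should work with real files by passing handles opened with newline="". Taking the header from reader.fieldnames rather than from the first data row means a file with a header but no data rows is also handled.
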
Update:

If you need to target more than one word, you can try the pattern below:

WORDS, N = ["aaaa", "tttt"], 1

pattern = (
    rf"((?:\S+ +){{0,{N}}}\S*"
    fr"\b(?:{'|'.join(map(re.escape, WORDS))})\b"
    rf"\S*(?: +\S+){{0,{N}}})"
)
