pandas: Using multiple regular expressions to get values from a pandas DataFrame column

oipij1gg  posted on 2023-03-11  in Other

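I'm still learning Python and pandas DataFrames.
My goal is to extract values (names) from the text in a DataFrame column using regular expressions. The text doesn't follow a single pattern, so I came up with several regexes, and I need to validate each result so that only the correct name is captured. This led me to loop over both the DataFrame and the list of regexes.
Here is my attempt in Python:
Data: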

rawdata = ['Current Trending Voice Actress Takahashi Rie was a..',
           'One of the legend voice actor Tsuda Kenjiro is a blabalabla he was',
           'The most popular amongs the fans voice actor Akari Kito is known',
           'From Demon Slayer series voice actor Hanae Natsuki said he was in problem with his friend',
           'Shibuya February 2023, voice actor Yuki Kaji and His wife announced birth of new child they was',
           'Most popular female voice actress Ayane Sakura began',
           'Known as Kirito from SAO Voice Actor Matsuoka Yoshitsugu was'
]

DataFrame:

import pandas as pd
import re

df = pd.DataFrame({'text': rawdata})

Regex list:

regex_list = [
    r'(?<=voice actor )(.*)(?= was)',
    r'(?<=voice actor )(.*)(?= is)',
    r'(?<=voice actor )(.*)(?= said)',
    r'(?<=voice actor )(.*)(?= and)'
]

Processing:

res = []
for text in df['text']:
    for rule in regex_list:
        result = re.findall(rule, text, re.MULTILINE | re.IGNORECASE)
        # Accept the first match short enough to be a plausible name;
        # a longer match means the greedy (.*) ran too far, so try the next rule
        if result and len(result[0]) <= 20:
            res.append(result[0])
            break
    else:
        # No rule produced an acceptable match for this row
        res.append('Not Found')

df["Result"] = res
print(df)

Result:

text               Result
0  Current Trending Voice Actress Takahashi Rie w...            Not Found
1  One of the legend voice actor Tsuda Kenjiro is...        Tsuda Kenjiro
2  The most popular amongs the fans voice actor A...           Akari Kito
3  From Demon Slayer series voice actor Hanae Nat...        Hanae Natsuki
4  Shibuya February 2023, voice actor Yuki Kaji a...            Yuki Kaji
5  Most popular female voice actress Ayane Sakura...            Not Found
6  Known as Kirito from SAO Voice Actor Matsuoka ...  Matsuoka Yoshitsugu

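I'm already happy with the result, but my concern is that when I process larger data with many regex patterns, this will take a lot of time and resources because of the many iterations involved.
Is there a better way?
Thanks.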

wljmcqd8 #1

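You can use str.extract directly to match your text and get the result, with a capture group around the characters of the name. You can then use fillna to replace any non-matches with Not Found: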

df['Result'] = df['text'].str.extract(r'voice (?:actor|actress)\s+(.*?)\s+(?:is|was|said|and)\b', re.I).fillna('Not Found')

Output:

text               Result
0                                             Current Trending Voice Actress Takahashi Rie was a..        Takahashi Rie
1                               One of the legend voice actor Tsuda Kenjiro is a blabalabla he was        Tsuda Kenjiro
2                                 The most popular amongs the fans voice actor Akari Kito is known           Akari Kito
3        From Demon Slayer series voice actor Hanae Natsuki said he was in problem with his friend        Hanae Natsuki
4  Shibuya February 2023, voice actor Yuki Kaji and His wife announced birth of new child they was            Yuki Kaji
5                                             Most popular female voice actress Ayane Sakura began            Not Found
6                                     Known as Kirito from SAO Voice Actor Matsuoka Yoshitsugu was  Matsuoka Yoshitsugu

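Note that I've updated the regex to use an alternation so that all the possible terminating words are matched at once, and added a \b (word boundary) after it to make sure it doesn't match something like Andrew: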

(?:is|was|said|and)

It also matches actor or actress in the same way:

(?:actor|actress)
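
To make the effect of the word boundary concrete, here is a small sketch using only the standard re module. The sentence is made up for illustration and is not part of the original data; it contains a name part (Andrew) that begins with one of the terminating words.

```python
import re

# Made-up sentence for illustration; "Andrew" starts with "and".
text = "voice actor Yuki Andrew was here"

without_boundary = r"voice (?:actor|actress)\s+(.*?)\s+(?:is|was|said|and)"
with_boundary = without_boundary + r"\b"

# Without \b, "And" in "Andrew" is accepted as the terminating word.
loose = re.search(without_boundary, text, re.I).group(1)
# With \b, "and" must end at a word boundary, so the match continues to "was".
strict = re.search(with_boundary, text, re.I).group(1)

print(loose)   # Yuki
print(strict)  # Yuki Andrew
```

Without the boundary the name is cut short at Yuki; with it, the full Yuki Andrew is captured.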

I've also placed the whitespace matching outside the capture group, so the name doesn't need to be trimmed:

\s+(.*?)\s+
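
If the list of keywords keeps growing, one possible way to keep this manageable is to build the single pattern from plain Python lists rather than editing the regex by hand. The following is only a sketch under that assumption: the roles and terminators lists and the build_pattern helper are hypothetical names introduced here, and the sample rows are taken from the question's data.

```python
import re
import pandas as pd

# Hypothetical keyword lists; extend them as new phrasings appear in the data.
roles = ["actor", "actress"]
terminators = ["is", "was", "said", "and"]

def build_pattern(roles, terminators):
    """Join the keyword lists into one regex with a single capture group."""
    role_alt = "|".join(map(re.escape, roles))
    end_alt = "|".join(map(re.escape, terminators))
    return rf"voice (?:{role_alt})\s+(.*?)\s+(?:{end_alt})\b"

df = pd.DataFrame({"text": [
    "Current Trending Voice Actress Takahashi Rie was a..",
    "Shibuya February 2023, voice actor Yuki Kaji and His wife announced birth of new child they was",
    "Most popular female voice actress Ayane Sakura began",
]})

pattern = build_pattern(roles, terminators)
# expand=False returns a Series because the pattern has exactly one group.
df["Result"] = (
    df["text"].str.extract(pattern, flags=re.IGNORECASE, expand=False)
    .fillna("Not Found")
)
print(df["Result"].tolist())  # ['Takahashi Rie', 'Yuki Kaji', 'Not Found']
```

The column is still scanned once with one compiled pattern, so adding a keyword changes the lists but not the number of passes over the data.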
