Regex: match multiple strings from lists and create a new row for each match

nzk0hqpo  posted on 2023-10-22 in Other

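I have a dataframe with text in one column, and I use regex-formatted strings to check whether I can find any matches from three lists. However, when there are multiple matches from list_1, I want to create a duplicate row for each match. Note that the matches must be consecutive, and the elements from list_2 and list_3 are optional.
Below is an example along with the desired output.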

list_1 = ['chest', 'test', 'west', 'nest']
list_2 = ['mike', 'bike', 'like', 'pike']
list_3 = ['hay', 'day', 'may', 'say']

Sample DF:
| text | match_1 | match_2 | match_3 |
| --|--|--|--|
| zzz zzz zz chest bike day zzzz z test mike zzz zzzz west zzz zz | chest | bike | day |
| aaa aa aaa a nest aa aaaa aaa nest bike may aaaa aaa | nest | NaN | NaN |
| ggg gg ggg ggg ggg test like hay ggg gg west ggg gggg west like | test | like | hay |
Desired output:
| text | match_1 | match_2 | match_3 |
| --|--|--|--|
| zzz zzz zz chest bike day zzzz z test mike zzz zzzz west zzz zz | chest | bike | day |
| zzz zzz zz chest bike day zzzz z test mike zzz zzzz west zzz zz | test | mike | NaN |
| zzz zzz zz chest bike day zzzz z test mike zzz zzzz west zzz zz | west | NaN | NaN |
| aaa aa aaa a nest aa aaaa aaa nest bike may aaaa aaa | nest | NaN | NaN |
| aaa aa aaa a nest aa aaaa aaa nest bike may aaaa aaa | nest | bike | may |
| ggg gg ggg ggg ggg test like hay ggg gg west ggg gggg west like | test | like | hay |
| ggg gg ggg ggg ggg test like hay ggg gg west ggg gggg west like | west | NaN | NaN |
| ggg gg ggg ggg ggg test like hay ggg gg west ggg gggg west like | west | like | NaN |
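I hope the description above is not too confusing. My current approach cannot find multiple matches from list_1 (as in the example above) while also requiring the optional matches from list_2 and list_3 to be consecutive.
Thanks for all your efforts!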

yhived7q  #1

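You can build a regex programmatically from your word lists, using nested optional groups so that the second, third, etc. matches are allowed to be missing: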

list_1 = ['chest', 'test', 'west', 'nest']
list_2 = ['mike', 'bike', 'like', 'pike']
list_3 = ['hay', 'day', 'may', 'say']
word_list = [list_1, list_2, list_3]
pattern = r'\b' + r'(?:\b\s+'.join(fr"(?P<match_{i+1}>{'|'.join(w)})" for i, w in enumerate(word_list)) + r'\b' + ''.join(')?' for _ in range(1, len(word_list)))

For your sample lists, this produces the pattern:

\b(?P<match_1>chest|test|west|nest)(?:\b\s+(?P<match_2>mike|bike|like|pike)(?:\b\s+(?P<match_3>hay|day|may|say)\b)?)?

You can see it in action on regex101.
You can then use that regex with extractall to find all matches in each text value, and join the result back to the original column.
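To see how the nested optional groups behave without pandas, here is a minimal sketch that runs the generated pattern through Python's `re.finditer` on the first sample sentence. The sentence is copied from the sample data used elsewhere on this page; the names `text` and `rows` are only illustrative.

```python
import re

# The pattern produced by the join expression above, split across lines for readability.
pattern = (
    r"\b(?P<match_1>chest|test|west|nest)"
    r"(?:\b\s+(?P<match_2>mike|bike|like|pike)"
    r"(?:\b\s+(?P<match_3>hay|day|may|say)\b)?)?"
)

text = " zzz zzz zz chest bike day zzzz z test mike zzz zzzz west zzz zz "

# Each match becomes one dict; groups that did not take part are None.
rows = [m.groupdict() for m in re.finditer(pattern, text)]
for row in rows:
    print(row)
```

Each optional group can only match if the group before it matched, which is what forces the words to be consecutive.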

out = df[['text']].join(
        df['text'].str.extractall(pattern)
        .droplevel(1)
      ).reset_index(drop=True)

For your sample data, this gives the following result:

text match_1 match_2 match_3
0   zzz zzz zz chest bike day zzzz z test mike zz...   chest    bike     day
1   zzz zzz zz chest bike day zzzz z test mike zz...    test    mike     NaN
2   zzz zzz zz chest bike day zzzz z test mike zz...    west     NaN     NaN
3   aaa aa aaa a nest aa aaaa aaa nest bike may a...    nest     NaN     NaN
4   aaa aa aaa a nest aa aaaa aaa nest bike may a...    nest    bike     may
5   ggg gg ggg ggg ggg test like hay ggg gg west ...    test    like     hay
6   ggg gg ggg ggg ggg test like hay ggg gg west ...    west     NaN     NaN
7   ggg gg ggg ggg ggg test like hay ggg gg west ...    west    like     NaN
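For anyone who wants to run this end to end, the following is a self-contained sketch that combines the pattern construction and the `extractall` + `join` step. It assumes the sample sentences from the second answer on this page as the contents of `df`; the intermediate name `matches` is not in the original answer.

```python
import pandas as pd

word_list = [
    ["chest", "test", "west", "nest"],
    ["mike", "bike", "like", "pike"],
    ["hay", "day", "may", "say"],
]

# One named group per list, each later group nested inside an optional group.
groups = [fr"(?P<match_{i + 1}>{'|'.join(w)})" for i, w in enumerate(word_list)]
pattern = r"\b" + r"(?:\b\s+".join(groups) + r"\b" + ")?" * (len(word_list) - 1)

# Assumed sample data (taken from the second answer's example).
df = pd.DataFrame({"text": [
    " zzz zzz zz chest bike day zzzz z test mike zzz zzzz west zzz zz ",
    " aaa aa aaa a nest aa aaaa aaa nest bike may aaaa aaa            ",
    " ggg gg ggg ggg ggg test like hay ggg gg west ggg gggg west like ",
]})

# extractall returns one row per match, indexed by (original row, match number).
matches = df["text"].str.extractall(pattern).droplevel(1)

# Joining on the (now duplicated) index repeats the text once per match.
out = df[["text"]].join(matches).reset_index(drop=True)
print(out)
```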

Note that using numbered variables such as list_1, list_2 is not good programming practice; you should use a list of lists (like word_list above).
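Building on the list-of-lists idea, the pattern construction can be wrapped in a small function. The sketch below is a variation, not the code from this answer: `build_pattern` is a hypothetical helper name, it escapes each word with `re.escape`, and it puts a `\b` after every group. In the pattern shown earlier the boundary after match_1 sits inside the optional group, so a word such as "testing" would still produce match_1 = test; whether that matters depends on your data.

```python
import re

def build_pattern(word_lists):
    """Hypothetical helper: nested-optional regex for consecutive words, one list per group."""
    groups = [
        "(?P<match_{}>{})\\b".format(i, "|".join(re.escape(w) for w in words))
        for i, words in enumerate(word_lists, start=1)
    ]
    # Fold from the innermost (last) group outwards, making each tail optional.
    pattern = groups[-1]
    for group in reversed(groups[:-1]):
        pattern = group + r"(?:\s+" + pattern + ")?"
    return r"\b" + pattern

strict = build_pattern([["chest", "test", "west", "nest"],
                        ["mike", "bike", "like", "pike"],
                        ["hay", "day", "may", "say"]])

print([m.groupdict() for m in re.finditer(strict, "a test mike hay b")])
print([m.groupdict() for m in re.finditer(strict, "testing mike")])
```

The function accepts any number of lists, so adding a fourth word list does not require touching the regex code.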

hof1towb  #2

Example

import pandas as pd
data1 = {'text': [' zzz zzz zz chest bike day zzzz z test mike zzz zzzz west zzz zz ',
  ' aaa aa aaa a nest aa aaaa aaa nest bike may aaaa aaa            ',
  ' ggg gg ggg ggg ggg test like hay ggg gg west ggg gggg west like ']}
df1 = pd.DataFrame(data1)

df1

text
0   zzz zzz zz chest bike day zzzz z test mike zz...
1   aaa aa aaa a nest aa aaaa aaa nest bike may a...
2   ggg gg ggg ggg ggg test like hay ggg gg west ...

Step 1

First, build the patterns:

pat_list = ['(?P<match_{}>{})'.format(i, '|'.join(globals()["list_%i" % i])) for i in range(1, 4)]

pat_list

['(?P<match_1>chest|test|west|nest)',
 '(?P<match_2>mike|bike|like|pike)',
 '(?P<match_3>hay|day|may|say)']

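I used a loop (a list comprehension) to pull in the values of list_1, list_2 and list_3. If your setup is different, you can also build the patterns manually.

Step 2

Next, extract the patterns from the "text" column of df1 and assign the resulting DataFrame to df2.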

df2 = df1['text'].str.extractall('|'.join(pat_list)).droplevel(1)

df2

match_1 match_2 match_3
0   chest   NaN     NaN
0   NaN     bike    NaN
0   NaN     NaN     day
0   test    NaN     NaN
0   NaN     mike    NaN
0   west    NaN     NaN
1   nest    NaN     NaN
1   nest    NaN     NaN
1   NaN     bike    NaN
1   NaN     NaN     may
2   test    NaN     NaN
2   NaN     like    NaN
2   NaN     NaN     hay
2   west    NaN     NaN
2   west    NaN     NaN
2   NaN     like    NaN

Step 3

Collapse the values of df2 in order (each match_1 value starts a new group) and join the result with df1.

grp = df2['match_1'].notna().groupby(df2.index).cumsum()
df3 = df2.groupby([df2.index, grp]).first().droplevel(1)
out = df1[['text']].join(df3)

out

text                                            match_1 match_2 match_3
0   zzz zzz zz chest bike day zzzz z test mike zz...    chest   bike    day
0   zzz zzz zz chest bike day zzzz z test mike zz...    test    mike    None
0   zzz zzz zz chest bike day zzzz z test mike zz...    west    None    None
1   aaa aa aaa a nest aa aaaa aaa nest bike may a...    nest    None    None
1   aaa aa aaa a nest aa aaaa aaa nest bike may a...    nest    bike    may
2   ggg gg ggg ggg ggg test like hay ggg gg west ...    test    like    hay
2   ggg gg ggg ggg ggg test like hay ggg gg west ...    west    None    None
2   ggg gg ggg ggg ggg test like hay ggg gg west ...    west    like    None
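The grouping key in Step 3 may be easier to follow on a tiny hand-built frame. The sketch below uses made-up rows shaped like df2 (one word per row, duplicated index) and shows that the running count of non-null match_1 values labels each "match_1 word plus its followers" block. The `astype(int)` call is an addition of mine to make the counter's dtype explicit; it is not in the original code.

```python
import pandas as pd

# Illustrative data only: two text rows (index 0 and 1), one extracted word per line.
df2 = pd.DataFrame(
    {
        "match_1": ["test", None, "west", "nest"],
        "match_2": [None, "mike", None, None],
    },
    index=[0, 0, 0, 1],
)

# Within each original row, count how many match_1 words have been seen so far.
grp = df2["match_1"].notna().astype(int).groupby(df2.index).cumsum()

# Rows sharing (original row, counter) belong together; first() keeps the
# first non-null value per column, collapsing each block into one row.
df3 = df2.groupby([df2.index, grp]).first().droplevel(1)
print(df3)
```

Note that this approach groups by position only: a list_2 word is attached to the most recent list_1 word in the same row even if other words sit between them, so it does not by itself enforce that the matches are consecutive.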

l2osamch  #3

You can programmatically build a regex to use with str.extractall:

lists = [list_1, list_2, list_3]
pats = [f"(?P<match_{i}>{'|'.join(l)})" for i, l in enumerate(lists, start=1)]
# ['(?P<match_1>chest|test|west|nest)',
#  '(?P<match_2>mike|bike|like|pike)',
#  '(?P<match_3>hay|day|may|say)']

pat = pats[-1]
for p in pats[-2::-1]:
    pat = f'{p}(?: +{pat})?'
# '(?P<match_1>chest|test|west|nest)(?: +(?P<match_2>mike|bike|like|pike)(?: +(?P<match_3>hay|day|may|say))?)?'
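One detail worth knowing about this pattern is that the separator ` +` matches literal spaces only. The sketch below wraps the same loop in a hypothetical function, `nested_pattern`, with the separator as a parameter, and compares ` +` with `\s+` on a string that contains a tab. The function name and the test string are assumptions for illustration.

```python
import re

lists = [
    ["chest", "test", "west", "nest"],
    ["mike", "bike", "like", "pike"],
    ["hay", "day", "may", "say"],
]

def nested_pattern(lists, sep):
    """Hypothetical wrapper around the loop above, with a configurable separator regex."""
    pats = [f"(?P<match_{i}>{'|'.join(l)})" for i, l in enumerate(lists, start=1)]
    pat = pats[-1]
    for p in pats[-2::-1]:          # walk from the second-to-last group back to the first
        pat = f"{p}(?:{sep}{pat})?"
    return pat

spaces_only = nested_pattern(lists, " +")
any_whitespace = nested_pattern(lists, r"\s+")

text = "test\tmike day"             # a tab between the first two words
print(re.search(spaces_only, text).groupdict())
print(re.search(any_whitespace, text).groupdict())
```

With ` +` the tab breaks the chain after "test"; with `\s+` all three words are captured. Pick whichever matches how your text is delimited.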

out = df['text'].str.extractall(pat).droplevel(1)

Output:

match_1 match_2 match_3
0   chest    bike     day
0    test    mike     NaN
0    west     NaN     NaN
1    nest     NaN     NaN
1    nest    bike     may
2    test    like     hay
2    west     NaN     NaN
2    west    like     NaN

regex demo
To join the result to the original DataFrame:

out = df.join(df['text'].str.extractall(pat).droplevel(1))

Output:

text match_1 match_2 match_3
0   zzz zzz zz chest bike day zzzz z test mike zz...   chest    bike     day
0   zzz zzz zz chest bike day zzzz z test mike zz...    test    mike     NaN
0   zzz zzz zz chest bike day zzzz z test mike zz...    west     NaN     NaN
1   aaa aa aaa a nest aa aaaa aaa nest bike may a...    nest     NaN     NaN
1   aaa aa aaa a nest aa aaaa aaa nest bike may a...    nest    bike     may
2   ggg gg ggg ggg ggg test like hay ggg gg west ...    test    like     hay
2   ggg gg ggg ggg ggg test like hay ggg gg west ...    west     NaN     NaN
2   ggg gg ggg ggg ggg test like hay ggg gg west ...    west    like     NaN
