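I have a DataFrame with a text column, and I use a regex-formatted string to check whether the text contains any matches from three lists. However, when there are multiple matches from list_1, I want to create a duplicate row for each match. Note that the matches must be contiguous, and the elements from list_2 and list_3 are optional.
I have an example below, along with my desired output.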
list_1 = ['chest', 'test', 'west', 'nest']
list_2 = ['mike', 'bike', 'like', 'pike']
list_3 = ['hay', 'day', 'may', 'say']
样品DF:
| 文本|匹配_1| match_2| match_3|
| --|--|--|--|
| zzz zzz zzz胸部自行车天zzz z测试迈克zzz zzz西zzz|胸部|自行车|天|
| 自行车可以骑自行车,|巢|楠|楠|
| gggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggg|测试|像|干草|
Desired output:
| text | match_1 | match_2 | match_3 |
| -- | -- | -- | -- |
| zzz zzz zzz chest bike day zzz z test mike zzz zzz west zzz | chest | bike | day |
| zzz zzz zzz chest bike day zzz z test mike zzz zzz west zzz | test | mike | NaN |
| zzz zzz zzz chest bike day zzz z test mike zzz zzz west zzz | west | NaN | NaN |
| 自行车可以骑自行车, | nest | NaN | NaN |
| 自行车可以骑自行车, | nest | bike | may |
| gggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggg | test | like | hay |
| gggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggg | west | NaN | NaN |
| gggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggg | west | like | NaN |
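I hope my description above is not too confusing. My current approach cannot handle multiple matches from list_1 (as in the example above) while keeping the optional matches from list_2 and list_3 contiguous.
Thanks for all your efforts!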
3 Answers
yhived7q #1
You can build a regex programmatically from your word lists, using nested optional parts to allow for a possibly missing second, third, etc. match.
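The answer's original code is not preserved here, so the following is only a minimal sketch of the idea, not the author's exact code. It assumes the three lists from the question are collected in a list of lists called `word_list`; the helper name `build_pattern` is made up for illustration.

```python
import re

list_1 = ['chest', 'test', 'west', 'nest']
list_2 = ['mike', 'bike', 'like', 'pike']
list_3 = ['hay', 'day', 'may', 'say']
word_list = [list_1, list_2, list_3]


def build_pattern(word_lists):
    """Hypothetical helper: one capture group per list, with every list
    after the first wrapped in a nested optional group."""
    groups = ["(" + "|".join(map(re.escape, words)) + ")" for words in word_lists]
    # Start from the innermost (last) list and wrap outwards, so that
    # list_3 can only match when list_2 matched directly before it.
    pattern = groups[-1]
    for group in reversed(groups[:-1]):
        pattern = group + r"(?:\s+" + pattern + ")?"
    # Word boundaries stop a keyword from matching inside a longer word.
    return r"\b" + pattern + r"\b"


regex = build_pattern(word_list)
print(regex)
# \b(chest|test|west|nest)(?:\s+(mike|bike|like|pike)(?:\s+(hay|day|may|say))?)?\b
```

Because the optional parts are nested, a word from the third list is only captured when a word from the second list sits directly between it and the first match, which is one way to read the "contiguous" requirement.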
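For your sample lists, this gives a single pattern with one capture group per list, where the groups for the second and third lists sit inside nested optional groups.
You can see this in action on regex101.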
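You can then use that regex with `extractall` to find all matches in each text value, and join that result back to the original column. Applied to your sample data, this gives one row per match.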
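As a rough sketch of that step (again not the answer's original code): the pattern is written out literally, the first row is taken from the question's sample data, and the second row is a made-up example of a text without any match.

```python
import pandas as pd

regex = r"\b(chest|test|west|nest)(?:\s+(mike|bike|like|pike)(?:\s+(hay|day|may|say))?)?\b"

df = pd.DataFrame({"text": [
    "zzz zzz zzz chest bike day zzz z test mike zzz zzz west zzz",
    "nothing to match here",  # hypothetical row, not from the question
]})

# One row per match; groups that did not take part in a match are NaN.
matches = df["text"].str.extractall(regex)
matches.columns = ["match_1", "match_2", "match_3"]

# extractall indexes its result by (original row, match number). Dropping
# the match level leaves the original row label, which join can align on.
result = df[["text"]].join(matches.droplevel("match"))
print(result)
```

`join` defaults to a left join, so a text without any match is kept as a single row of NaN values.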
Note that using variables named `list_1`, `list_2`, and so on is not good programming practice; you should use a list of lists instead (like `word_list` above).
hof1towb #2
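Example
df1
Step 1
First, build the pattern list `pat_list`.
I used a for loop to collect the values from list_1, list_2, and list_3 into the pattern. If that does not fit your setup, you can also create it by hand.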
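The code for this step is missing from the page, so here is a hedged sketch of what such a loop could look like. How the fragments are combined into the final pattern `pat` is an assumption on my part (each later list optional and directly following the previous match).

```python
import re

list_1 = ['chest', 'test', 'west', 'nest']
list_2 = ['mike', 'bike', 'like', 'pike']
list_3 = ['hay', 'day', 'may', 'say']

# One capture-group fragment per list, e.g. "(chest|test|west|nest)".
pat_list = []
for words in (list_1, list_2, list_3):
    pat_list.append("(" + "|".join(re.escape(w) for w in words) + ")")

# Assumed combination: the second list is optional after the first, and the
# third list is optional after the second, separated by whitespace.
pat = r"\b{}(?:\s+{}(?:\s+{})?)?\b".format(*pat_list)
print(pat_list)
print(pat)
```

Creating `pat_list` by hand, as the answer mentions, would simply mean typing the three group strings directly.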
Step 2
Next, extract the pattern from the 'text' column of df1, and define the resulting DataFrame as df2.
df2
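The contents of `df2` are not shown on this page. A small sketch of what the extraction step could look like, assuming a pattern like the one described in step 1; the second row of `df1` is a made-up text, not the question's data.

```python
import pandas as pd

pat = r"\b(chest|test|west|nest)(?:\s+(mike|bike|like|pike)(?:\s+(hay|day|may|say))?)?\b"

df1 = pd.DataFrame({"text": [
    "zzz zzz zzz chest bike day zzz z test mike zzz zzz west zzz",
    "aaa nest aaa nest bike may",  # hypothetical row, not from the question
]})

# extractall returns one row per match, indexed by a MultiIndex of
# (row label in df1, match number within that row).
df2 = df1["text"].str.extractall(pat)

print(list(df2.index.names))  # [None, 'match']
print(df2.loc[0])             # all matches found in the first text
```

The unnamed capture groups become the columns 0, 1, and 2, and groups that did not take part in a match show up as NaN.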
Step 3
Flatten the values of df2 in order and join them with df1.
Output
l2osamch #3
You can programmatically create a regex to use with `str.extractall`.
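The original snippet is not preserved here. One possible sketch, which is my own variation rather than the answer's code, uses named groups so that `str.extractall` labels the result columns by itself; the dictionary `word_lists` and its keys are assumptions chosen to mirror the question's column names.

```python
import re
import pandas as pd

# Assumed mapping of output column name to word list.
word_lists = {
    "match_1": ['chest', 'test', 'west', 'nest'],
    "match_2": ['mike', 'bike', 'like', 'pike'],
    "match_3": ['hay', 'day', 'may', 'say'],
}

# Named groups become the column names of the extractall result.
groups = []
for name, words in word_lists.items():
    alternatives = "|".join(re.escape(w) for w in words)
    groups.append(f"(?P<{name}>{alternatives})")

# Nest the optional parts from the last list outwards.
regex = groups[-1]
for group in reversed(groups[:-1]):
    regex = rf"{group}(?:\s+{regex})?"
regex = rf"\b{regex}\b"

df = pd.DataFrame({"text": [
    "zzz zzz zzz chest bike day zzz z test mike zzz zzz west zzz",
]})
out = df["text"].str.extractall(regex)
print(out)
```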
regex demo
The result can then be joined back to the original DataFrame.
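A sketch of one way to do that join, under the assumption that the regex uses named groups as column names. Keeping the match number as a column is my own choice for illustration, and the second row is a made-up text without keywords.

```python
import pandas as pd

regex = (
    r"\b(?P<match_1>chest|test|west|nest)"
    r"(?:\s+(?P<match_2>mike|bike|like|pike)"
    r"(?:\s+(?P<match_3>hay|day|may|say))?)?\b"
)

df = pd.DataFrame({"text": [
    "zzz zzz zzz chest bike day zzz z test mike zzz zzz west zzz",
    "no keywords here",  # hypothetical row with no match
]})

out = df["text"].str.extractall(regex)

# Move the "match" level into a column so that only the original row label
# remains in the index, then left-join so texts without a match are kept.
joined = df.join(out.reset_index(level="match"), how="left")
print(joined)
```

Passing `how="inner"` instead would drop texts that contain no match at all.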