Regex pattern to match 3 lists based on conditions

nc1teljy  posted on 2023-10-22  in Other

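Match multiple strings from a list and create a new row for each match
One of the solutions suggested there works as expected, but I would like to know whether a slight adjustment could achieve the following result. Keeping the same conditions as in the first question linked above, what would the new regex pattern look like if I want to match an element of list_3 regardless of where it appears in the text? (Keep in mind that the elements from list_1 and list_2 will always remain consecutive.)
For example: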

list_1 = ['chest', 'test', 'west', 'nest']
list_2 = ['mike', 'bike', 'like', 'pike']
list_3 = ['hay', 'day', 'may', 'say']

text = 'zzz zzz chest bike zz zz day zzz'

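The new regex should match chest, bike and day. In the original question, chest, bike and day were consecutive. Here, chest and bike still are, but there is extra text between those first two matches and the last one (day).
Finally, if an element from the first list (list_1) has already been matched, I want to skip any further matches of that same word.
For example: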

text = 'zzz zz west like say zz zzz west bike zzz lay zzz zz zz nest mike zzz'

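This would match west, like, say and nest, mike, because west appears more than once in the text while nest appears only once.
The text will still be in a DataFrame, and the output should be in that format as well.
The regex I am currently using is: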

pattern = r'\b' + r'(?:\b\s+'.join(fr"(?P<match_{i+1}>{'|'.join(w)})" for i, w in enumerate(word_list)) + r'\b' + ''.join(')?' for _ in range(1, len(word_list)))
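
To see why this pattern cannot meet the new requirement, it may help to print what it expands to. The sketch below assumes that word_list is simply [list_1, list_2, list_3] (the question does not show its definition), and first_match is a hypothetical helper added only for illustration. The expansion nests the third group inside the optional second one and leaves no room for filler text before it:

```python
import re

list_1 = ['chest', 'test', 'west', 'nest']
list_2 = ['mike', 'bike', 'like', 'pike']
list_3 = ['hay', 'day', 'may', 'say']
# Assumption: word_list is the three lists in order (not shown in the question).
word_list = [list_1, list_2, list_3]

# Same construction as in the question, only split over several lines.
pattern = (
    r'\b'
    + r'(?:\b\s+'.join(
        fr"(?P<match_{i+1}>{'|'.join(w)})" for i, w in enumerate(word_list)
    )
    + r'\b'
    + ''.join(')?' for _ in range(1, len(word_list)))
)
print(pattern)


def first_match(text):
    """Hypothetical helper: groups of the first match, or None if nothing matches."""
    m = re.search(pattern, text)
    return m.groups() if m else None


# match_3 is lost in both cases: it must directly follow match_2.
print(first_match('zzz zzz chest bike zz zz day zzz'))
print(first_match('zzz zzz chest zz zz day zzz'))
```

With this structure, match_3 can only be captured when match_2 is present and match_3 follows it immediately.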

Thank you.

q1qsirdb  #1

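You need to make significant changes to the regex to meet the new conditions. First, the third group has to be optional regardless of whether the second group is present. Second, to handle a string such as zzz zz west like zz zzz nest bike zz say zzz, the search for the optional third word must not run past another word from the first list (so that the string yields west, like and nest, bike, say rather than west, like, say). You can do that with a tempered greedy token. For the given sample data, the regex (with line breaks added for readability) should be: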

\b(?P<match_1>chest|test|west|nest)
(?:\b\s+(?P<match_2>mike|bike|like|pike))?
(?:\b\s+(?:(?:(?!\b(?:chest|test|west|nest)\b).)*?)(?P<match_3>hay|day|may|say)\b)?

Regex demo on regex101
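
As a rough illustration of what the tempered token buys you, the sketch below compares the pattern above with a naive variant that uses a plain lazy .*? instead. The names FIRST, SECOND, THIRD, naive, tempered and groups are made up for this example and are not part of the original answer:

```python
import re

FIRST = r"chest|test|west|nest"
SECOND = r"mike|bike|like|pike"
THIRD = r"hay|day|may|say"

# Naive variant: a plain lazy ".*?" is free to run past the next first-list word.
naive = re.compile(
    rf"\b(?P<match_1>{FIRST})"
    rf"(?:\b\s+(?P<match_2>{SECOND}))?"
    rf"(?:\b\s+.*?(?P<match_3>{THIRD})\b)?"
)

# Tempered variant: before consuming each character, a negative lookahead
# checks that a whole first-list word does not start at that position.
tempered = re.compile(
    rf"\b(?P<match_1>{FIRST})"
    rf"(?:\b\s+(?P<match_2>{SECOND}))?"
    rf"(?:\b\s+(?:(?!\b(?:{FIRST})\b).)*?(?P<match_3>{THIRD})\b)?"
)

text = "zzz zz west like zz zzz nest bike zz say zzz"


def groups(regex, s):
    """Return the (match_1, match_2, match_3) tuple of every match in s."""
    return [m.groups() for m in regex.finditer(s)]


print(groups(naive, text))     # [('west', 'like', 'say')]
print(groups(tempered, text))  # [('west', 'like', None), ('nest', 'bike', 'say')]
```

The naive variant swallows nest bike while looking for say, so the second pair is never reported. The tempered variant stops at nest and lets the next match start there.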
You can build the regex with the following code:

l1 = '|'.join(list_1)
l2 = '|'.join(list_2)
l3 = '|'.join(list_3)

pattern = fr"\b(?P<match_1>{l1})(?:\b\s+(?P<match_2>{l2}))?(?:\b\s+(?:(?:(?!\b(?:{l1})\b).)*?)(?P<match_3>{l3})\b)?"

You can then apply it to your DataFrame as described in the previous question:

out = df.join(
        df['text'].str.extractall(pattern)
        .droplevel(1)
      ).reset_index(drop=True)

Using this sample input data:

                                                                    text
0        zzz zzz zz chest bike day zzzz z test mike zzz zzzz west zzz zz
1                   aaa aa aaa a nest aa aaaa aaa nest bike may aaaa aaa
2        ggg gg ggg ggg ggg test like hay ggg gg west ggg gggg west like
3                                       zzz zzz chest bike zz zz day zzz
4  zzz zz west like say zz zzz west bike zzz lay zzz zz zz nest mike zzz
5                                            zzz zzz chest zz zz day zzz

This gives:

                                                 text  match_1  match_2  match_3
0   zzz zzz zz chest bike day zzzz z test mike zzz...   chest    bike     day
1   zzz zzz zz chest bike day zzzz z test mike zzz...    test    mike     NaN
2   zzz zzz zz chest bike day zzzz z test mike zzz...    west     NaN     NaN
3   aaa aa aaa a nest aa aaaa aaa nest bike may aa...    nest     NaN     NaN
4   aaa aa aaa a nest aa aaaa aaa nest bike may aa...    nest    bike     may
5   ggg gg ggg ggg ggg test like hay ggg gg west g...    test    like     hay
6   ggg gg ggg ggg ggg test like hay ggg gg west g...    west     NaN     NaN
7   ggg gg ggg ggg ggg test like hay ggg gg west g...    west    like     NaN
8                    zzz zzz chest bike zz zz day zzz   chest    bike     day
9   zzz zz west like say zz zzz west bike zzz lay ...    west    like     say
10  zzz zz west like say zz zzz west bike zzz lay ...    west    bike     NaN
11  zzz zz west like say zz zzz west bike zzz lay ...    nest    mike     NaN
12                        zzz zzz chest zz zz day zzz   chest     NaN     day

You can then drop duplicates based on text and match_1:

out = out.drop_duplicates(subset=['text', 'match_1'], keep='first').reset_index(drop=True)

This gives you the desired result:

                                                text  match_1  match_2  match_3
0  zzz zzz zz chest bike day zzzz z test mike zzz...   chest    bike     day
1  zzz zzz zz chest bike day zzzz z test mike zzz...    test    mike     NaN
2  zzz zzz zz chest bike day zzzz z test mike zzz...    west     NaN     NaN
3  aaa aa aaa a nest aa aaaa aaa nest bike may aa...    nest     NaN     NaN
4  ggg gg ggg ggg ggg test like hay ggg gg west g...    test    like     hay
5  ggg gg ggg ggg ggg test like hay ggg gg west g...    west     NaN     NaN
6                   zzz zzz chest bike zz zz day zzz   chest    bike     day
7  zzz zz west like say zz zzz west bike zzz lay ...    west    like     say
8  zzz zz west like say zz zzz west bike zzz lay ...    nest    mike     NaN
9                        zzz zzz chest zz zz day zzz   chest     NaN     day
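
If you want to sanity-check these rows without pandas, the following is a minimal standard-library sketch of the same two steps: find every match, then keep only the first one per (text, match_1) pair. The function extract_rows and the list texts are hypothetical names used only here. This is a cross-check under those assumptions, not a replacement for the DataFrame code above:

```python
import re

list_1 = ['chest', 'test', 'west', 'nest']
list_2 = ['mike', 'bike', 'like', 'pike']
list_3 = ['hay', 'day', 'may', 'say']

l1 = '|'.join(list_1)
l2 = '|'.join(list_2)
l3 = '|'.join(list_3)

# Same pattern as in the answer, compiled once.
pattern = re.compile(
    fr"\b(?P<match_1>{l1})(?:\b\s+(?P<match_2>{l2}))?"
    fr"(?:\b\s+(?:(?:(?!\b(?:{l1})\b).)*?)(?P<match_3>{l3})\b)?"
)

# The six sample texts from the input table.
texts = [
    'zzz zzz zz chest bike day zzzz z test mike zzz zzzz west zzz zz',
    'aaa aa aaa a nest aa aaaa aaa nest bike may aaaa aaa',
    'ggg gg ggg ggg ggg test like hay ggg gg west ggg gggg west like',
    'zzz zzz chest bike zz zz day zzz',
    'zzz zz west like say zz zzz west bike zzz lay zzz zz zz nest mike zzz',
    'zzz zzz chest zz zz day zzz',
]


def extract_rows(texts):
    """Return (text, match_1, match_2, match_3) rows, first hit per (text, match_1)."""
    rows, seen = [], set()
    for text in texts:
        for m in pattern.finditer(text):
            key = (text, m.group('match_1'))
            if key in seen:
                continue  # a later match of the same first-list word is skipped
            seen.add(key)
            rows.append((text, *m.groups()))
    return rows


rows = extract_rows(texts)
for text, m1, m2, m3 in rows:
    print(f"{text[:30]!r:35} {m1} {m2} {m3}")
```

Unmatched groups come back as None here, where the DataFrame shows NaN.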
