regex: Why doesn't this regex capture group stop at the set condition, and instead keep capturing until the end of the line? [closed]

Posted by igetnqfo on 2023-02-05 in Other


- Closed. This question needs details or clarity. It is not currently accepting answers.
- Want to improve this question? Add details and clarify the problem by editing this post.

import re

input_text = "((PL_ADVB)alrededor (NOUN)(del auto rojizo, algo grande y completamente veloz)). Luego dentro del baúl rápidamente abajo de una caja por sobre ello vimos una caña." #example input

#place_reference = r"((?i:\w\s*)+)?"
#place_reference = r"(?i:[\w,;.]\s*)+" <--- greedy regex
place_reference = r"(?i:[\w,;.]\s*)+?"

list_all_adverbs_of_place = ["adentro", "dentro", "al rededor", "alrededor", "abajo", "hacía", "hacia", "por sobre", "sobre"]
list_limiting_elements = list_all_adverbs_of_place + ["vimos", "hemos visto", "encontramos", "hemos encontrado", "rápidamente", "rapidamente", "intensamente", "durante", "luego", "ahora", ".", ":", ";", ",", "(", ")", "[", "]", "¿", "?", "¡", "!", "&", "="]

pattern = re.compile(rf"(?:(?<=\s)|^)({'|'.join(re.escape(x) for x in list_all_adverbs_of_place)})?(\s+{place_reference})\s*({'|'.join(re.escape(x) for x in list_limiting_elements)})", flags = re.IGNORECASE)

input_text = re.sub(pattern,
                    #lambda m: f"((PL_ADVB){m[1]}{m[2]}){m[3]}",
                    lambda m: f"((PL_ADVB){m[1]}{m[2]}){m[3]}" if m[2] else f"((PL_ADVB){m[1]} NO_DATA){m[3]}",
                    input_text)

print(repr(input_text)) #--> output

When I use lambda m: f"((PL_ADVB){m[1]}{m[2]}){m[3]}" if m[2] else f"((PL_ADVB){m[1]} NO_DATA){m[3]}", I get the following incorrect output:
'((PL_ADVB)alrededor (NOUN)(del auto rojizo, algo grande y completamente veloz)). Luego ((PL_ADVB)dentro del baúl rápidamente abajo de una caja por sobre ello vimos una caña).'
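Notice that capture group m[3] captures only the final ".".
This is not right: not everything should end up inside the parentheses. The correct output would be: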

"((PL_ADVB)alrededor ((NOUN)del auto rojizo, algo grande y completamente veloz)). Luego ((PL_ADVB)dentro del baúl) rápidamente ((PL_ADVB)abajo de una caja) ((PL_ADVB)por sobre ello) vimos una caña."

list_all_adverbs_of_place marks where a capture group starts, and list_limiting_elements marks where it ends.

regex

Source: https://stackoverflow.com/questions/75327510/why-doesnt-this-regex-capture-group-stop-with-the-set-condition-and-continue-ca

1 Answer


nue99wik1#

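If I understand your question correctly, the problem is that the text "por sobre ello" is not picked up by the regex.
The regex tries to find a word from list_all_adverbs_of_place, then the words we are interested in, and finally a word from list_limiting_elements.
If we run your example, these are the matches it makes on the given text: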

import re

input_text = "((PL_ADVB)alrededor (NOUN)(del auto rojizo, algo grande y completamente veloz)). Luego dentro del baúl rápidamente abajo de una caja por sobre ello vimos una caña."

# Same lazy pattern as in the question
place_reference = r"(?i:[\w,;.]\s*)+?"

list_all_adverbs_of_place = [
    "adentro",
    "dentro",
    "al rededor",
    "alrededor",
    "abajo",
    "hacía",
    "hacia",
    "por sobre",
    "sobre"]

list_limiting_elements = list_all_adverbs_of_place + [
    "vimos",
    "hemos visto",
    "encontramos",
    "hemos encontrado",
    "rápidamente", "rapidamente",
    "intensamente",
    "durante",
    "luego",
    "ahora", ".", ":", ";", ",", "(", ")", "[", "]", "¿", "?", "¡", "!", "&", "="]

# For the sake of this question, this could all be simplified
pattern = re.compile(
    rf"(?:(?<=\s)|^)({'|'.join(re.escape(x) for x in list_all_adverbs_of_place)})?(\s+{place_reference})\s*({'|'.join(re.escape(x) for x in list_limiting_elements)})", flags = re.IGNORECASE)

# Print the three capture groups of every match
for match in pattern.finditer(input_text):
    print(match.group(1, 2, 3))

This prints:

('dentro', ' del baúl ', 'rápidamente')
('abajo', ' de una caja ', 'por sobre')

Running the substitution from the question with this pattern then gives the following output:

'((PL_ADVB)alrededor (NOUN)(del auto rojizo, algo grande y completamente veloz)). Luego ((PL_ADVB)dentro del baúl )rápidamente ((PL_ADVB)abajo de una caja )por sobre ello vimos una caña.'
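Matches found by finditer and re.sub never overlap, so text consumed as the third group of one match cannot also serve as the first group of the next. A toy sketch of that effect, using a made-up string and single-letter "markers" (both are assumptions for illustration, not part of the question):

```python
import re

# Hypothetical toy input: "a" and "b" play the role of the marker words.
text = "a x b y a"

# The closing marker is consumed, so "b" cannot start a second match.
consuming = re.findall(r"(a|b) (\w) (a|b)", text)

# The closing marker is only checked by a lookahead, so "b" stays available.
lookahead = re.findall(r"(a|b) (\w) (?=a|b)", text)

print(consuming)  # [('a', 'x', 'b')]
print(lookahead)  # [('a', 'x'), ('b', 'y')]
```

The same thing happens with "por sobre" in the second match listed above: it is taken as the limiting element, so the search resumes after it.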

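As the matches show, the third group consumes "por sobre", so it can no longer act as the first group of the next match. Turning the third group into a lookahead, which checks for a limiting element without consuming it, avoids this:

pattern = re.compile(
    rf"(?:(?<=\s)|^)({'|'.join(re.escape(x) for x in list_all_adverbs_of_place)})?(\s+{place_reference})\s*((?={'|'.join(re.escape(x) for x in list_limiting_elements)}))", flags = re.IGNORECASE)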
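A minimal, self-contained sketch of how a lookahead version could be plugged into re.sub. The helper name wrap and the shortened variable names are assumptions made here, not part of the original code. The now-empty third group is dropped, and the trailing whitespace captured by group 2 is moved outside the closing parenthesis:

```python
import re

adverbs = ["adentro", "dentro", "al rededor", "alrededor", "abajo",
           "hacía", "hacia", "por sobre", "sobre"]
limits = adverbs + ["vimos", "hemos visto", "encontramos", "hemos encontrado",
                    "rápidamente", "rapidamente", "intensamente", "durante",
                    "luego", "ahora", ".", ":", ";", ",", "(", ")", "[", "]",
                    "¿", "?", "¡", "!", "&", "="]

place_reference = r"(?:[\w,;.]\s*)+?"  # lazy, as in the question
adverb_alt = "|".join(re.escape(x) for x in adverbs)
limit_alt = "|".join(re.escape(x) for x in limits)

# The limiting element is only looked at, never consumed.
pattern = re.compile(
    rf"(?:(?<=\s)|^)({adverb_alt})?(\s+{place_reference})\s*(?={limit_alt})",
    flags=re.IGNORECASE)

def wrap(m):
    """Wrap adverb + reference, keeping trailing whitespace outside the ')'."""
    adverb = m[1] or ""          # group 1 is optional in the pattern
    body = m[2].rstrip()
    trailing = m[2][len(body):]  # whitespace that group 2 swallowed
    return f"((PL_ADVB){adverb}{body}){trailing}"

input_text = ("((PL_ADVB)alrededor (NOUN)(del auto rojizo, algo grande y "
              "completamente veloz)). Luego dentro del baúl rápidamente "
              "abajo de una caja por sobre ello vimos una caña.")

result = pattern.sub(wrap, input_text)
print(repr(result))
```

On this input the first sentence is left unchanged and the second one becomes "Luego ((PL_ADVB)dentro del baúl) rápidamente ((PL_ADVB)abajo de una caja) ((PL_ADVB)por sobre ello) vimos una caña."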

Answered 2023-02-05
