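regex: Why, if I place a lookbehind constraint on the capturing group, does it ensure compliance but also capture what comes before the given constraint?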

Posted by s71maibg on 2023-08-08 in Other

A regex pattern in Python does not capture the correct substring and gives unexpected output.

import re

#example
input_text = "It is close to that place, the NY hospital was the place where I was born, the truth is that's all I know, and it happened in November of the year 2000."

#regex pattern
# partner_match = re.search(r"(?:(?:[^.,;\n]+)(?<=\s)|^)\s*(.+?)\s*(?:was|would be|is)\s*the\s*(?:place|side)", input_text, flags = re.IGNORECASE) # here I tried using the negation operator [^...] but it doesn't work
partner_match = re.search(r"(?:(?:\.|,|;|\n)(?<=\s)|^)\s*(.+?)\s*(?:was|would be|is)\s*the\s*(?:place|side)", input_text, flags=re.IGNORECASE)

#here print captured string
if partner_match: print(partner_match.group(1))

Why doesn't it give me this output:

the NY hospital

Instead, it incorrectly gives me this whole string:

It is close to that place, the NY hospital

What should I fix in the regex to restrict the capture?

regex

Source: https://stackoverflow.com/questions/76831309/why-if-im-placing-a-lookbehind-constraint-on-the-capturing-group-does-it-ensur

1 Answer

irtuqstp1#

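As explained in the comments, the regex engine first attempts a match at the start of the string. After failing to match (?:\.|,|;|\n)(?<=\s) at that position, it tries ^, which succeeds. It therefore never goes on to try (?:\.|,|;|\n)(?<=\s) at later positions, which produces the undesired result.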
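A minimal sketch of that behaviour, reusing the pattern and input from the question (the variable names are mine): the match begins at index 0 because the ^ alternative succeeds there, so the lazy group has to cover everything up to "was the place".

```python
import re

text = ("It is close to that place, the NY hospital was the place where I was born, "
        "the truth is that's all I know, and it happened in November of the year 2000.")

# Pattern from the question, unchanged.
pattern = r"(?:(?:\.|,|;|\n)(?<=\s)|^)\s*(.+?)\s*(?:was|would be|is)\s*the\s*(?:place|side)"

m = re.search(pattern, text, flags=re.IGNORECASE)
match_start = m.start()   # where the overall match begins
captured = m.group(1)     # what the lazy group ended up holding

print(match_start)
print(captured)
```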
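As an aside, note that (?:\.|,|;|\n)(?<=\s) is a faulty construct, because it can only match a newline (\n), never a period, comma, or semicolon. It reads, "match a period, comma, semicolon, or newline, provided that the following character is preceded by a whitespace character", but of course the character that precedes the character following X is X itself. Newline is the only one of the four characters that is whitespace.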
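This can be checked quickly. In the sketch below the sample string is made up for illustration; it contains all four delimiters, yet only the newline survives the lookbehind.

```python
import re

# Delimiter alternation followed by the lookbehind, as in the question.
broken = r"(?:\.|,|;|\n)(?<=\s)"

# Hypothetical sample containing a period, a comma, a semicolon and a newline.
sample = "a. b, c; d\ne"

hits = re.findall(broken, sample)
print(hits)  # only the newline is reported
```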
Another problem is that, by using \s* rather than \s+, the string "wastheside" (for example) is matched by the regular expression \s*(?:was|would be|is)\s*the\s*(?:place|side), which I assume is undesirable.
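A small sketch of the difference, assuming that at least one whitespace character between the words is what is wanted (the `strict` pattern is my variant, not part of the question):

```python
import re

loose = r"\s*(?:was|would be|is)\s*the\s*(?:place|side)"   # from the question
strict = r"\s+(?:was|would be|is)\s+the\s+(?:place|side)"  # assumed stricter variant

loose_hit = re.search(loose, "wastheside")             # matches the run-together string
strict_hit = re.search(strict, "wastheside")           # no match
strict_spaced = re.search(strict, "it was the side")   # still matches normal text

print(bool(loose_hit), bool(strict_hit), bool(strict_spaced))
```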
Note that (?:\.|,|;|\n) can be written more compactly as a character class: [.,;\n].
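As a quick sanity check, both spellings find the same characters in a made-up sample string:

```python
import re

# Hypothetical sample containing each of the four delimiters once.
sample = "one. two, three; four\nfive"

via_alternation = re.findall(r"(?:\.|,|;|\n)", sample)
via_char_class = re.findall(r"[.,;\n]", sample)

print(via_alternation == via_char_class)
```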
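I assume the goal is to match a substring that begins after a space preceded by a period, comma, semicolon, or newline, and that continues until it is followed by "was the place", "was the side", "would be the place", "would be the side", "is the place", or "is the side".
If the string contains more than one space preceded by a period, comma, semicolon, or newline, it must be decided which one marks the start of the match. I assume it is the last one that satisfies all the requirements.
You could therefore try matching the following regular expression.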

[.,;\n] (.+) +(?:was|would be|is) +the +(?:place|side)

where the desired result is held in capture group 1.
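A sketch of this pattern applied to the input from the question. Note that re.search starts at the leftmost qualifying delimiter, so on a string with several delimiters the capture can span more than one clause. The `last_delim` variant, which excludes the delimiters from the capture so that the match begins after the last one, is my own assumption and not part of the answer's pattern.

```python
import re

text = ("It is close to that place, the NY hospital was the place where I was born, "
        "the truth is that's all I know, and it happened in November of the year 2000.")

pattern = r"[.,;\n] (.+) +(?:was|would be|is) +the +(?:place|side)"
m = re.search(pattern, text, flags=re.IGNORECASE)
result = m.group(1) if m else None
print(result)

# Hypothetical string with two delimiters before the keyword phrase.
multi = "First, second, third was the place"
leftmost = re.search(pattern, multi, flags=re.IGNORECASE).group(1)

# Assumed variant: delimiters are not allowed inside the capture.
last_delim = r"[.,;\n] ([^.,;\n]+?) +(?:was|would be|is) +the +(?:place|side)"
after_last = re.search(last_delim, multi, flags=re.IGNORECASE).group(1)
print(leftmost, "|", after_last)
```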
Alternatively, a positive lookbehind and a positive lookahead can be used to simply match (but not capture) the desired string.

(?<=[.,;\n] ).+(?= +(?:was|would be|is) +the +(?:place|side))
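A sketch of the lookaround version on the same input. The lookbehind has a fixed width of two characters, which Python's re module accepts, and the desired text is the whole match (group 0) because no capture group is involved.

```python
import re

text = ("It is close to that place, the NY hospital was the place where I was born, "
        "the truth is that's all I know, and it happened in November of the year 2000.")

pattern = r"(?<=[.,;\n] ).+(?= +(?:was|would be|is) +the +(?:place|side))"
m = re.search(pattern, text, flags=re.IGNORECASE)
whole_match = m.group(0) if m else None
print(whole_match)
```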


Answered 2023-08-08
