Python Regex Engine - "look-behind requires fixed-width pattern" Error

nkhmeac6  posted on 2023-03-04  in  Python

I am trying to handle unmatched double quotes within a CSV-formatted string.
To be precise,

"It "does "not "make "sense", Well, "Does "it"

should be corrected to

"It" "does" "not" "make" "sense", Well, "Does" "it"

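So basically what I am trying to do is
replace every " that is
1. not preceded by the beginning of a line or a comma, and
2. not followed by a comma or the end of a line
with " ".
For this I use the regex below: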

(?<!^|,)"(?!,|$)

The problem is that while the Ruby regex engine (http://www.rubular.com/) is able to parse this regex, the Python regex engines (https://pythex.org/, http://www.pyregex.com/) throw the following error:

Invalid regular expression: look-behind requires fixed-width pattern

And in Python 2.7.3 it throws

sre_constants.error: look-behind requires fixed-width pattern

Can anyone tell me what is bothering Python here?
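For reference, the failure can be reproduced without an online tester. The snippet below is a minimal sketch that assumes Python 3, where the exception is exposed as re.error rather than sre_constants.error; the variable names are only illustrative.

```python
import re

# The pattern from the question: the look-behind alternates between
# ^ (width 0) and , (width 1), so its width is not fixed.
pattern = r'(?<!^|,)"(?!,|$)'

try:
    re.compile(pattern)
    message = None  # would mean the pattern compiled after all
except re.error as exc:
    message = str(exc)

# The message should mention "look-behind requires fixed-width pattern".
print(message)
```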

Edit:

Following Tim's response, I got the output below for a multi-line string:

>>> str = """ "It "does "not "make "sense", Well, "Does "it"
... "It "does "not "make "sense", Well, "Does "it"
... "It "does "not "make "sense", Well, "Does "it"
... "It "does "not "make "sense", Well, "Does "it" """
>>> re.sub(r'\b\s*"(?!,|$)', '" "', str)
' "It" "does" "not" "make" "sense", Well, "Does" "it" "\n"It" "does" "not" "make" "sense", Well, "Does" "it" "\n"It" "does" "not" "make" "sense", Well, "Does" "it" "\n"It" "does" "not" "make" "sense", Well, "Does" "it" " '

At the end of each line, two double quotes were added next to "it".
So I made a very small change to the call so that it handles newlines.

re.sub(r'\b\s*"(?!,|$)', '" "', str,flags=re.MULTILINE)

But this gives

>>> re.sub(r'\b\s*"(?!,|$)', '" "', str,flags=re.MULTILINE)
' "It" "does" "not" "make" "sense", Well, "Does" "it"\n... "It" "does" "not" "make" "sense", Well, "Does" "it"\n... "It" "does" "not" "make" "sense", Well, "Does" "it"\n... "It" "does" "not" "make" "sense", Well, "Does" "it" " '

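The last "it" still gets two double quotes.
I wonder why the "$" end-of-line anchor does not recognize that the line has ended there.
The final answer is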

re.sub(r'\b\s*"(?!,|[ \t]*$)', '" "', str,flags=re.MULTILINE)
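One likely explanation for the leftover quote is that the test string ends with a space before the closing triple quote, so "$" cannot match immediately after the last double quote. The sketch below uses a shortened, hypothetical two-line sample (not the original string) whose last line ends with a space, and compares the two look-aheads.

```python
import re

# Hypothetical sample: the second line ends with a trailing space,
# similar to the text right before the closing triple quote above.
text = '"It "does "it"\n"It "does "it" '

# "$" alone: the final quote is followed by a space, not by the end of
# the line, so the negative look-ahead lets it through.
naive = re.sub(r'\b\s*"(?!,|$)', '" "', text, flags=re.MULTILINE)

# Allowing optional spaces/tabs before "$" protects that final quote.
fixed = re.sub(r'\b\s*"(?!,|[ \t]*$)', '" "', text, flags=re.MULTILINE)

print(repr(naive))
print(repr(fixed))
```

With re.MULTILINE, "$" still matches before each "\n", which is why only the line with trailing whitespace is affected.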

58wvjzkj1#

Python re lookbehinds really do need to be fixed-width. When a lookbehind contains alternatives of different lengths, there are several ways to handle the situation:

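  • Rewrite the pattern so that you do not need the alternation (e.g. Tim's answer uses a word boundary; you could also use (?<=[^,])"(?!,|$), an exact equivalent of your current pattern that requires a character other than a comma before the double quote; or the common pattern for matching words surrounded by whitespace, (?<=\s|^)\w+(?=\s|$), can be written as (?<!\S)\w+(?!\S)), or
  • Split the lookbehinds:
      • Positive lookbehinds need to be alternated inside a group (e.g. (?<=a|bc) should be rewritten as (?:(?<=a)|(?<=bc))).
      • If the pattern in the lookbehind is an alternation of an anchor and a single character, you can reverse the sign of the lookbehind and use a negated character class containing that character. For example, (?<=\s|^) matches a whitespace character or the start of the string/line (if re.M is used), so in Python re use (?<!\S) instead. Likewise, (?<=^|;) becomes (?<![^;]). If you also want to make sure the start of a line is matched, add \n to the negated character class, e.g. (?<![^;\n]) (see Python Regex: Match start of line, or semi-colon, or start of string, none capturing group). Note that this is not necessary for (?<!\S), because \S does not match a line feed.
      • Negative lookbehinds can simply be concatenated (e.g. (?<!^|,)"(?!,|$) should look like (?<!^)(?<!,)"(?!,|$)).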
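The rewrites listed above can be checked with the standard re module. The following is a minimal sketch: the test strings other than the one from the question, and all variable names, are made up for illustration.

```python
import re

s = '"It "does "not "make "sense", Well, "Does "it"'

# Negative lookbehind with alternation, split into two concatenated
# lookbehinds: (?<!^|,) -> (?<!^)(?<!,)
split_negative = re.compile(r'(?<!^)(?<!,)"(?!,|$)')
quote_hits = split_negative.findall(s)

# Positive lookbehind with alternation, split into a non-capturing
# group: (?<=a|bc) -> (?:(?<=a)|(?<=bc))
split_positive = re.compile(r'(?:(?<=a)|(?<=bc))x')
x_positions = [m.start() for m in split_positive.finditer('ax bcx cx')]

# Whitespace-or-anchor lookarounds rewritten with negated \S.
whole_words = re.compile(r'(?<!\S)\w+(?!\S)')
words = whole_words.findall('foo bar-baz qux')

# Anchor-or-semicolon: (?<=^|;) -> (?<![^;])
after_semicolon = re.compile(r'(?<![^;])\w+')
fields = after_semicolon.findall('a;b c')

print(len(quote_hits), x_positions, words, fields)
```

All four patterns compile because every individual lookbehind now has a single, fixed width.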

Or simply install the PyPI regex module with pip install regex (or pip3 install regex) and enjoy infinite-width lookbehinds.
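As a rough sketch of that option, the snippet below runs the question's original pattern through the third-party regex package and compares the result with the split-lookbehind version compiled by the standard library. It assumes regex is installed; the try/except is only there so the snippet still runs when the package is missing.

```python
import re

try:
    import regex  # third-party package: pip install regex
except ImportError:
    regex = None  # assumption: the comparison is skipped without it

s = '"It "does "not "make "sense", Well, "Does "it"'

# Standard library: the lookbehind has to be split to compile.
with_re = re.sub(r'(?<!^)(?<!,)"(?!,|$)', '" "', s)

# regex module: the original alternation inside the lookbehind is accepted.
with_regex = regex.sub(r'(?<!^|,)"(?!,|$)', '" "', s) if regex else None

print(with_regex is None or with_regex == with_re)
```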


31moq8wy2#

Python lookbehind assertions need to be fixed-width, but you can try this:

>>> s = '"It "does "not "make "sense", Well, "Does "it"'
>>> re.sub(r'\b\s*"(?!,|$)', '" "', s)
'"It" "does" "not" "make" "sense", Well, "Does" "it"'
Explanation:
\b      # Start the match at the end of a "word"
\s*     # Match optional whitespace
"       # Match a quote
(?!,|$) # unless it's followed by a comma or end of string
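The same annotated pattern can be kept in source code by compiling it with re.VERBOSE, which ignores whitespace and allows "#" comments inside the pattern. This is a small sketch; the name quote_fix is only illustrative.

```python
import re

quote_fix = re.compile(r'''
    \b        # start the match at the end of a "word"
    \s*       # match optional whitespace
    "         # match a quote
    (?!,|$)   # unless it is followed by a comma or end of string
''', re.VERBOSE)

s = '"It "does "not "make "sense", Well, "Does "it"'
result = quote_fix.sub('" "', s)
print(result)
```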

ugmeyewa3#

The simplest solution would be:

import regex as re

The regex module supports variable-length look-behind patterns.
