regex: How to split on commas that are not inside parentheses?

Posted by koaltpgm on 2023-08-08 in Other

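Suppose I have a string like this, where items are separated by commas, but items with parenthesized content may also contain commas:
(EDIT: Sorry, I forgot to mention that some items may have no parenthesized content.)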

"Water, Titanium Dioxide (CI 77897), Black 2 (CI 77266), Iron Oxides (CI 77491, 77492, 77499), Ultramarines (CI 77007)"

How can I split the string only on the commas that are not inside parentheses? That is:

["Water", "Titanium Dioxide (CI 77897)", "Black 2 (CI 77266)", "Iron Oxides (CI 77491, 77492, 77499)", "Ultramarines (CI 77007)"]


I think I have to use a regular expression, perhaps something like this:

([(]?)(.*?)([)]?)(,|$)


But I am still struggling to get it to work.

3xiyfsfu #1

Use a negative lookahead to match every comma that is not inside parentheses. Splitting the input string on the matched commas gives the desired output.

,\s*(?![^()]*\))

Demo:

>>> import re
>>> s = "Water, Titanium Dioxide (CI 77897), Black 2 (CI 77266), Iron Oxides (CI 77491, 77492, 77499), Ultramarines (CI 77007)"
>>> re.split(r',\s*(?![^()]*\))', s)
['Water', 'Titanium Dioxide (CI 77897)', 'Black 2 (CI 77266)', 'Iron Oxides (CI 77491, 77492, 77499)', 'Ultramarines (CI 77007)']

ercv8c1e #2

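You can do this with str.replace and str.split. Any placeholder that does not occur in the string can be used to replace the "),". Note that this only splits at commas that directly follow a closing parenthesis, so items without parenthesized content (such as "Water") are not separated.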

a = "Titanium Dioxide (CI 77897), Black 2 (CI 77266), Iron Oxides (CI 77491, 77492, 77499), Ultramarines (CI 77007)"
a = a.replace('),', ')//').split('//')
print(a)

Output:

['Titanium Dioxide (CI 77897)', ' Black 2 (CI 77266)', ' Iron Oxides (CI 77491, 77492, 77499)', ' Ultramarines (CI 77007)']
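The leading spaces in that output can be stripped in the same pass. Below is a minimal sketch of the replace-then-split idea; the function name split_after_paren and the NUL placeholder are my own choices for illustration, and it assumes the placeholder never occurs in the input.

```python
def split_after_paren(text, placeholder="\x00"):
    """Split only at commas that directly follow a closing parenthesis."""
    # Assumption: `placeholder` does not occur anywhere in `text`.
    marked = text.replace("),", ")" + placeholder)
    # Strip the whitespace left over after each removed comma.
    return [item.strip() for item in marked.split(placeholder)]


print(split_after_paren(
    "Titanium Dioxide (CI 77897), Black 2 (CI 77266), Iron Oxides (CI 77491, 77492, 77499)"
))
```

An item without parentheses stays glued to its neighbour, e.g. "Water, Black 2 (CI 77266)" comes back as a single item, so this only fits input where every item ends in ")".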

w8f9ii69 #3

I believe I have a simpler regex:

import re
rx_comma = re.compile(r",(?![^(]*\))")
result = rx_comma.split(string_to_split)  # string_to_split is the input string

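Explanation of the regexp:

  • Match a ","
  • that is NOT followed by:
  • a run of characters ending with ")", where
  • the characters between the "," and the ")" do not include "("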
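The same pattern can be written with re.VERBOSE so that each part of the explanation sits next to the piece of regex it describes. This is only a sketch; the sample string is shortened from the question, and the strip() call is an addition of mine to drop the space left after each comma.

```python
import re

rx_comma = re.compile(r"""
    ,            # a literal comma
    (?!          # that is NOT followed by ...
        [^(]*    #   a run of characters containing no opening parenthesis
        \)       #   that ends with a closing parenthesis
    )
""", re.VERBOSE)

sample = "Water, Iron Oxides (CI 77491, 77492), Ultramarines (CI 77007)"
# The regex itself does not consume whitespace, so strip each piece.
items = [part.strip() for part in rx_comma.split(sample)]
print(items)
```

The comma after "77491" is followed by " 77492)" with no "(" in between, so the lookahead rejects it and the item stays in one piece.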

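This regex does not work with nested parentheses, as in a,b(c,d(e,f)). If you need that, one possible solution is to walk through the pieces produced by the split and, whenever a piece has an opening parenthesis without a matching close, merge it with the pieces that follow :), like this: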

"a"
"b(c" <- no closing, merge this 
"d(e" <- no closing, merge this
"f))"
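The merge step could look roughly like the sketch below. It assumes a plain split on every comma (not the regex split) and then glues pieces back together until the parentheses balance; the name split_and_merge is hypothetical, and unbalanced input is simply flushed at the end rather than reported.

```python
def split_and_merge(text, sep=","):
    """Split on every `sep`, then re-join pieces that sit inside open parentheses."""
    parts = []
    buffer = []   # pieces collected while parentheses are still open
    depth = 0     # number of currently unclosed "("
    for piece in text.split(sep):
        buffer.append(piece)
        depth += piece.count("(") - piece.count(")")
        if depth <= 0:
            # Balanced again: everything in the buffer forms one item.
            parts.append(sep.join(buffer).strip())
            buffer = []
            depth = 0
    if buffer:
        # Unbalanced input: keep whatever is left as a final item.
        parts.append(sep.join(buffer).strip())
    return parts


print(split_and_merge("a,b(c,d(e,f))"))
```

For the nested example this gives the two items a and b(c,d(e,f)).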

n6lpvg4x #4

This version seems to work with nested parentheses, brackets ([] or <>), and braces:

def split_top(string, splitter, openers="([{<", closers=")]}>", whitespace=" \n\t"):
    ''' Splits strings at occurrence of 'splitter' but only if not enclosed by brackets.
        Removes all whitespace immediately after each splitter.
        This assumes brackets, braces, and parens are properly matched - may fail otherwise '''

    outlist = []
    outstring = []

    depth = 0

    for c in string:
        if c in openers:
            depth += 1
        elif c in closers:
            depth -= 1

            if depth < 0:
                raise SyntaxError()

        if not depth and c == splitter:
            outlist.append("".join(outstring))
            outstring = []
        else:
            if len(outstring):
                outstring.append(c)
            elif c not in whitespace:
                outstring.append(c)

    outlist.append("".join(outstring))

    return outlist

Use it like this:

s = "Water, Titanium Dioxide (CI 77897), Black 2 (CI 77266), Iron Oxides (CI 77491, 77492, 77499), Ultramarines (CI 77007)"

split = split_top(s, ",") # splits on commas


I know this is probably not the fastest.

tyu7yeag #5

Here are two shorter (more elegant?) versions that handle nested parentheses.
Generator:

def split(s, sep=","):
    i = d = 0
    for j in range(len(s)):
        d += {"(": 1, ")": -1}.get(s[j], 0)
        if s[j] == sep and d == 0:
            yield s[i:j]
            i = j + 1
    yield s[i:]

A more functional style:

from itertools import accumulate, chain, starmap  # accumulate(..., initial=) needs Python 3.8+

def split(s, sep=","):
    b = accumulate(s, lambda br, ch: br + {"(": 1, ")": -1}.get(ch, 0), initial=0)
    c = (ch != sep for ch in s)
    st = [i for i, x in enumerate(chain([0], starmap(int.__or__, zip(b, c)), [0])) if x == 0]
    return [s[st[i]:st[i + 1] - 1] for i in range(len(st) - 1)]


If you don't mind depending on more_itertools, you can import locate from it and make the line that computes st slightly more readable: st = list(locate(chain([0], starmap(int.__or__, zip(b, c)), [0]), (0).__eq__))

0ve6wy6x #6

Try this regex:

[^()]*\([^()]*\),?

Code:

>>> import re
>>> x = "Titanium Dioxide (CI 77897), Black 2 (CI 77266), Iron Oxides (CI 77491, 77492, 77499), Ultramarines (CI 77007)"
>>> re.findall(r"[^()]*\([^()]*\),?", x)
['Titanium Dioxide (CI 77897),', ' Black 2 (CI 77266),', ' Iron Oxides (CI 77491, 77492, 77499),', ' Ultramarines (CI 77007)']


See how the regex works: http://regex101.com/r/pS9oV3/1
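The matches above still carry the trailing comma and a leading space. A small cleanup pass, sketched below with a hypothetical helper name find_items, removes both. Keep in mind that this pattern requires every item to contain parentheses, so an item like "Water" gets merged into the next match.

```python
import re


def find_items(text):
    """Collect 'name (content)' items and trim the separator residue."""
    raw = re.findall(r"[^()]*\([^()]*\),?", text)
    # Drop the optional trailing comma first, then surrounding whitespace.
    return [match.rstrip(",").strip() for match in raw]


x = "Titanium Dioxide (CI 77897), Black 2 (CI 77266), Iron Oxides (CI 77491, 77492, 77499), Ultramarines (CI 77007)"
print(find_items(x))
```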

vd2z7a6w #7

With regex, this can be done easily using the findall function.

import re
s = "Titanium Dioxide (CI 77897), Black 2 (CI 77266), Iron Oxides (CI 77491, 77492, 77499), Ultramarines (CI 77007)"
re.findall(r"\w.*?\(.*?\)", s) # returns what you want

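If you want to understand regular expressions better, try http://www.regexr.com/, and here is a link to the Python documentation: https://docs.python.org/2/library/re.html
Edit: I modified the regex to also accept items without parentheses: \w[^,(]*(?:\(.*?\))?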
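As a quick check, here is the edited pattern applied to the full string from the question, including the "Water" item that has no parentheses. This is just a sketch of how it could be run; the variable names are mine.

```python
import re

s = ("Water, Titanium Dioxide (CI 77897), Black 2 (CI 77266), "
     "Iron Oxides (CI 77491, 77492, 77499), Ultramarines (CI 77007)")

# \w          start each item at a word character (skips the ", " separators)
# [^,(]*      take everything up to the next comma or opening parenthesis
# (?:\(.*?\))? optionally take one parenthesized group, commas included
items = re.findall(r"\w[^,(]*(?:\(.*?\))?", s)
print(items)
```

Because the group is non-capturing, findall returns the whole match for each item. The lazy .*? stops at the first ")", so nested parentheses are not handled by this pattern.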
