Using regex to extract based on a recurring pattern, excluding newline characters

Posted by 6yt4nkrj on 2023-01-18 in Other

I have a string as follows:

27223525
 
West Food Group B.V.9
 
52608670
 
Westcon
 
Group European Operations Netherlands Branch
 
30221053
 
Westland Infra Netbeheer B.V.
 
27176688
 
Wetransfer  85 B.V.
 
34380998
 
WETRAVEL B.V.
 
70669783

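This string contains many newline characters, which I want to explicitly ignore, along with every number made up of six or more digits. I came up with the following regex: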

[^\n\d{6,}].+

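This almost gets me there, since it returns all the company names. However, when a company name itself contains a newline, it is returned as two separate names. For example, Westcon is one match and Group European Operations Netherlands Branch is another. I would like to adjust the expression above so that the final match is Westcon Group European Operations Netherlands Branch. Which regex concept should I use to achieve this?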
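The behaviour described above can be reproduced with a short sketch. One observation that is not part of the original post: inside square brackets, `\d{6,}` is not a quantified digit pattern, so the class `[^\n\d{6,}]` just excludes a newline, any digit, and the literal characters `{`, `,` and `}`. The `sample` string is an assumed abbreviation of the data shown in the question.

```python
import re

# Assumed abbreviation of the question's data ("\n \n" between entries).
sample = (
    "27223525\n \nWest Food Group B.V.9\n \n52608670\n \n"
    "Westcon\n \nGroup European Operations Netherlands Branch\n \n30221053\n"
)

# The question's pattern: one character that is not a newline, a digit,
# '{', ',' or '}', followed by at least one more character on the same line.
matches = re.findall(r'[^\n\d{6,}].+', sample)
print(matches)
# ['West Food Group B.V.9', 'Westcon', 'Group European Operations Netherlands Branch']
```

Because `.` never crosses a newline, each physical line becomes its own match, which is why the two halves of the Westcon name come back separately.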

Edit

Based on the comments below, I tried the following, but it gives the wrong result:

text = 'West Food Group B.V.9\n \n52608670\n \nWestcon\n \nGroup European Operations Netherlands Branch\n \n30221053\n \nWestland Infra Netbeheer B.V.\n \n27176688\n \nWetransfer 85 B.V.\n \n34380998\n \nWETRAVEL B.V.\n \n70669783\n \nWeWork Companies (International) B.V.\n \n61501220\n \nWeWork Netherlands B.V.\n \n61505439\n \nWexford Finance B.V.\n \n27124941\n \nWFC\n-\nFood Safety B.V.\n \n11069471\n \nWhale Cloud Technology Netherlands B.V.\n \n63774801\n \nWHILL Europe B.V.\n \n72465700\n \nWhirlpool Nederland B.V.\n \n20042061\n \nWhitaker\n-\nTaylor Netherlands B.V.\n \n66255163\n \nWhite Oak B.V.\n'

re.findall(r'[^\n\d{6,}](?:(?:[a-z\s.]+(\n[a-z\s.])*)|.+)',text)

regex

Source: https://stackoverflow.com/questions/54008208/using-regex-to-extract-based-on-a-recurring-pattern-excluding-newline-characters

4 Answers

koaltpgm1#

I assume you only need the company names. If so, this should work:

input = '''27223525

West Food Group B.V.9

52608670

Westcon

Group European Operations Netherlands Branch

30221053

Westland Infra Netbeheer B.V.

27176688

Wetransfer 85 B.V.

34380998

WETRAVEL B.V.

70669783

'''

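import re
from pprint import pprint

# Every line that starts with a letter is treated as (part of) a company name.
# The original second alternative, [A-Za-z].*\d{1,5}.*, could never be reached
# because the first alternative already matches the same lines.
company_name_regex = re.findall(r'[A-Za-z].*', input)

pprint(company_name_regex)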

['West Food Group B.V.9',
 'Westcon',
 'Group European Operations Netherlands Branch',
 'Westland Infra Netbeheer B.V.',
 'Wetransfer 85 B.V.',
 'WETRAVEL B.V.']

nukf8bse2#

This creates a group for the lines that are not numbers.
Regex: /(?!(\d{6,}|\n))[a-zA-Z .\n]+/g
Demo: https://regex101.com/r/MMLGw6/1
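As a rough illustration, the pattern can be tried in Python as sketched below. The sketch makes a few assumptions that are not in the answer: `re.finditer` stands in for the `/g` flag, the inner group is written without capturing parentheses, the matched whitespace is collapsed afterwards, and `sample` is an abbreviated version of the question's data.

```python
import re

# Assumed abbreviation of the question's data ("\n \n" between entries).
sample = (
    "27223525\n \nWest Food Group B.V.9\n \n52608670\n \n"
    "Westcon\n \nGroup European Operations Netherlands Branch\n \n30221053\n \n"
    "Wetransfer 85 B.V.\n \n34380998\n"
)

# Same pattern as in the answer, without the capturing group.
pattern = re.compile(r'(?!\d{6,}|\n)[a-zA-Z .\n]+')

names = []
for m in pattern.finditer(sample):
    # Collapse newlines and runs of spaces into single spaces.
    cleaned = " ".join(m.group().split())
    if cleaned:  # drop matches that were whitespace only
        names.append(cleaned)

print(names)
# ['West Food Group B.V.', 'Westcon Group European Operations Netherlands Branch',
#  'Wetransfer', 'B.V.']
```

In this sketch the multi-line Westcon name comes back as one match, but digits are not in the character class, so `West Food Group B.V.9` loses its trailing 9 and `Wetransfer 85 B.V.` is split in two.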

hiz5n14c3#

Assuming your company names start with a letter, you can use this regex with the re.M modifier:

^[a-zA-Z].*(?:\n+[a-zA-Z].*)*(?=\n+\d{6,}$)

In Python:

regex = re.compile(r"^[a-zA-Z].*(?:\n+[a-zA-Z].*)*(?=\n+\d{6,}$)", re.M)

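This matches a line that starts with [a-zA-Z] up to the end of that line, then matches any further lines that are separated by \n and also start with an [a-zA-Z] character.
(?=\n+\d{6,}$) is a lookahead assertion that ensures the company name is followed by a newline and a number of 6 or more digits.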
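A minimal sketch of how this might be used is shown below. It rests on assumptions that are not in the answer: `normalize` is a hypothetical helper that joins the matched lines with single spaces, `blank_sample` uses truly empty separator lines, and `spaced_regex` is an assumed variant that swaps `\n+` for `\n\s*` to cope with the `"\n \n"` separators in the question's `text`.

```python
import re

def normalize(name):
    """Hypothetical helper: join a multi-line match with single spaces."""
    return " ".join(name.split())

# The pattern from the answer, unchanged.
regex = re.compile(r"^[a-zA-Z].*(?:\n+[a-zA-Z].*)*(?=\n+\d{6,}$)", re.M)

# Assumed sample with empty separator lines.
blank_sample = (
    "27223525\n\nWest Food Group B.V.9\n\n52608670\n\n"
    "Westcon\n\nGroup European Operations Netherlands Branch\n\n30221053\n\n"
    "Wetransfer 85 B.V.\n\n34380998\n"
)
names = [normalize(m) for m in regex.findall(blank_sample)]
print(names)
# ['West Food Group B.V.9', 'Westcon Group European Operations Netherlands Branch',
#  'Wetransfer 85 B.V.']

# Assumed variant: \n\s* also skips separator lines that hold only a space.
spaced_regex = re.compile(
    r"^[a-zA-Z].*(?:\n\s*[a-zA-Z].*)*(?=\n\s*\d{6,}$)", re.M
)
spaced_sample = blank_sample.replace("\n\n", "\n \n")
spaced_names = [normalize(m) for m in spaced_regex.findall(spaced_sample)]
print(spaced_names)  # same three names as above
```

With the space-only separator lines, the unchanged pattern finds nothing in this sample, because `\n+` cannot step over the space. That is why the variant is included.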

hkmswyz64#

If you can solve this without regex, it *should* be solved without regex:

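useful = []

# splitlines() keeps each line whole; split() would break the names into single words
for line in text.splitlines():
    line = line.strip()
    # skip blank lines and lines that consist only of digits
    if line and not line.isdigit():
        useful.append(line)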

This should work, more or less. I'm replying from my phone, so I can't test it.
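The loop above keeps each line as a separate entry. One possible extension, sketched here under assumptions that are not in the answer, treats every all-digit line of six or more characters as a separator and joins the lines between two separators into one name. `extract_names` and `sample` are hypothetical names used only for illustration.

```python
def extract_names(text):
    """Group the non-numeric lines between ID lines into single names."""
    names, current = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank or space-only lines
        if line.isdigit() and len(line) >= 6:
            # An ID line closes the name collected so far, if any.
            if current:
                names.append(" ".join(current))
                current = []
        else:
            current.append(line)
    if current:  # a trailing name with no ID line after it
        names.append(" ".join(current))
    return names

# Assumed abbreviation of the question's data.
sample = (
    "27223525\n \nWest Food Group B.V.9\n \n52608670\n \n"
    "Westcon\n \nGroup European Operations Netherlands Branch\n \n30221053\n \n"
    "Wetransfer 85 B.V.\n \n34380998\n"
)
print(extract_names(sample))
# ['West Food Group B.V.9', 'Westcon Group European Operations Netherlands Branch',
#  'Wetransfer 85 B.V.']
```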
