Regex: capture the text between the start of one tag and the start of the next?

Posted by eulz3vhy on 2023-06-30 in Other

I have the following string:

<<John Smith, Youtube>> 
I'm having a great day today 
<<Jane Doe, Google>> 
I'm going to the gym later 
<<Speaker>>
Time for people to speak 
<<Beff Jezos>> 
Buy something from my online shop. You might like it

You can load this string in Python with the following snippet:

string = '''<<John Smith, Youtube>> 
I'm having a great day today 
<<Jane Doe, Google>> 
I'm going to the gym later 
<<Speaker>>
Time for people to speak 
<<Beff Jezos>> 
Buy something from my online shop. You might like it'''

Load the Python package:

import re

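I am trying to find a way to capture the following: I want to extract all the text from one **<<** up to the next **<<**.
For example, that means extracting the following strings: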

string1: John Smith, Youtube>> 
I'm having a great day today 

string2: Jane Doe, Google>> 
I'm going to the gym later 

string3: Speaker>>
Time for people to speak 

string4: Beff Jezos>> 
Buy something from my online shop. You might like it

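The output can be a list or a dictionary of key-value pairs. The values between << and >> are identifiers, but they are not always unique; some of them repeat.
Any help is appreciated. My current regex has only gotten me this far: /(?=<<)(.*)(?=<<)/gm
New string: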

Welcome to the first meeting today between Yotube, Google and Amazon  Special guest speaker today is Beff Jezos 
<<John Smith, Youtube>>  I'm having a great day today  
<<Jane Doe, Google>>  I'm going to the gym later  
<<Speaker>> Time for people to speak  
<<Beff Jezos>>  Buy something from my online shop. You might like it
czq61nw1 #1

Does this give you the result you want?

import re

string = '''<<John Smith, Youtube>> 
I'm having a great day today 
<<Jane Doe, Google>> 
I'm going to the gym later 
<<Speaker>>
Time for people to speak 
<<Beff Jezos>> 
Buy something from my online shop. You might like it'''

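# Group 1: the identifier between << and >>.
# Group 2: the text that follows, up to the next << or the end of the string.
pattern = r'<<(.*?)>>\s*(.*?)\s*(?=(?:<<|$))'
# re.DOTALL lets "." match newlines, so a block of text may span several lines.
matches = re.findall(pattern, string, re.DOTALL)

# With two groups, findall returns a list of (identifier, content) tuples.
result = []
for identifier, content in matches:
    result.append((identifier, content))

print(result)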

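So what do the parts of the pattern do?
<<(.*?)>> captures the content between << and >>.
\s* matches any whitespace characters, including newlines.
(.*?) captures the content after >>, up to the next << or the end of the string.
Edit: Tim's answer is simpler and better explained.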
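Since the question mentions that identifiers can repeat, a plain dict would overwrite earlier entries. Here is a minimal sketch of one way around that, collecting the text blocks into a list per identifier. The sample text and the helper name group_by_identifier are made up for illustration; the pattern is the one from this answer.

```python
import re
from collections import defaultdict

# Hypothetical sample in which one identifier appears twice.
text = (
    "<<John Smith, Youtube>> \n"
    "I'm having a great day today \n"
    "<<Speaker>>\n"
    "Time for people to speak \n"
    "<<John Smith, Youtube>> \n"
    "See you tomorrow"
)

pattern = r'<<(.*?)>>\s*(.*?)\s*(?=(?:<<|$))'


def group_by_identifier(s):
    """Map each identifier to the list of text blocks that follow it."""
    grouped = defaultdict(list)
    for identifier, content in re.findall(pattern, s, re.DOTALL):
        grouped[identifier].append(content)
    return dict(grouped)


grouped = group_by_identifier(text)
print(grouped)
```

Each value is a list in order of appearance, so a repeated speaker keeps all of their blocks.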

chy5wohz #2

We can do a regex find-all search as follows:

matches = re.findall(r'<<(.*?)(?=\s*<<|$)', string, flags=re.S)
print(matches)

["John Smith, Youtube>> \nI'm having a great day today",
 "Jane Doe, Google>> \nI'm going to the gym later",
 'Speaker>>\nTime for people to speak',
 'Beff Jezos>> \nBuy something from my online shop. You might like it']

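The regex pattern used here says to match:

  • << the literal start of a tag
  • (.*?) match and capture all content, lazily, up to the nearest point where the lookahead below succeeds
  • (?=\s*<<|$) look ahead for optional whitespace followed by <<, the start of the next tag, or for the end of the input

Note that we run the regex search in dotall mode, as indicated by the re.S flag, so that .* can match across lines.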
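To make the effect of the flag concrete, here is a small sketch that runs the same pattern with and without re.S and then splits each captured block into an identifier and its text. The sample text and the helper split_block are made up for illustration. The sample also starts with some text before the first tag, as in the question's second string, to show that such text is skipped.

```python
import re

pattern = r'<<(.*?)(?=\s*<<|$)'

# Made-up sample: intro text before the first tag, and one block spanning two lines.
text = (
    "Intro text before any tag\n"
    "<<A>>\n"
    "line one\n"
    "line two\n"
    "<<B>> last"
)

# Without re.S, "." cannot cross a newline, so the multi-line block A is missed.
without_flag = re.findall(pattern, text)

# With re.S, "." also matches newlines and both blocks are captured.
with_flag = re.findall(pattern, text, flags=re.S)


def split_block(block):
    """Split 'identifier>> content' into (identifier, content)."""
    identifier, _, content = block.partition('>>')
    return identifier.strip(), content.strip()


pairs = [split_block(block) for block in with_flag]

print(without_flag)  # ['B>> last']
print(with_flag)     # ['A>>\nline one\nline two', 'B>> last']
print(pairs)         # [('A', 'line one\nline two'), ('B', 'last')]
```

Using str.partition on the first >> keeps the regex as short as in the answer above; it assumes the identifier itself never contains >>.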
