Python: how to retrieve the text removed by a regex sub?

gxwragnw  posted on 2023-02-11  in Python

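I have a regular expression in Python that is supposed to remove every occurrence of the word "NOTE." together with the sentence that follows it. How can I do this correctly and also get back all of the sentences that were removed?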

import re
text = "NOTE. This is the subsequent sentence to be removed. The weather is good. NOTE. This is another subsequent sentence to be removed. The sky is blue. Note that it's a dummy text."
clean_text = re.sub(r"NOTE\..*?(?=\.)", "", text)

Expected result:

Cleaned text:

The weather is good. The sky is blue. Note that it's a dummy text.

Unique sentences removed:

["This is the subsequent sentence to be removed.", "This is another subsequent sentence to be removed."]

v1uwarro #1

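Borrowing The fourth bird's regex, but using re.split so that the text only has to be searched once. Because the pattern contains a capture group, re.split returns a list that alternates between the non-matching parts and the captured parts. Join the former to get the cleaned text; the latter are your removed sentences.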

import re
 
pattern = r"\bNOTE\.\s*([^.]*\.)\s*"
text = "NOTE. This is the subsequent sentence to be removed. The weather is good. NOTE. This is another subsequent sentence to be removed. The sky is blue. Note that it's a dummy text."
 
parts = re.split(pattern, text)
 
clean_text = ''.join(parts[::2])
print(clean_text)
 
unique_sentences_removed = parts[1::2]
print(unique_sentences_removed)

Output:

The weather is good. The sky is blue. Note that it's a dummy text.
['This is the subsequent sentence to be removed.', 'This is another subsequent sentence to be removed.']
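To see why the `[::2]` and `[1::2]` slices work, here is a minimal sketch that prints the raw list returned by re.split. The shortened `sample` string is an assumption made for readability; the pattern is the one used in this answer.

```python
import re

# Same pattern as in the answer: one capture group around the removed sentence.
pattern = r"\bNOTE\.\s*([^.]*\.)\s*"
# Shortened sample text (hypothetical, not the asker's text).
sample = "NOTE. Drop me. Keep me. NOTE. Drop me too. Keep me too."

parts = re.split(pattern, sample)
print(parts)
# ['', 'Drop me.', 'Keep me. ', 'Drop me too.', 'Keep me too.']

kept = parts[::2]      # even indices: text between matches
removed = parts[1::2]  # odd indices: contents of capture group 1
print("".join(kept))   # Keep me. Keep me too.
print(removed)         # ['Drop me.', 'Drop me too.']
```

The leading empty string appears because the first match starts at position 0. The even/odd slicing relies on the pattern having exactly one capture group; with more groups, each match would contribute more than one list entry.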

lmyy7pcs #2

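One option for removing the NOTE part is to use a pattern that also matches the dot at the end of the following sentence, followed by optional whitespace characters, instead of only asserting that dot with a lookahead.
If you add a capture group to the pattern, you can use re.findall with the same pattern to return the capture group values.
The pattern matches: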

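  • \bNOTE\.\s* matches the word NOTE, followed by . and optional whitespace characters
  • ([^.]*\.) is capture group 1; it matches zero or more characters other than . and then matches .
  • \s* matches optional whitespace characters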

See the matches and the capture group values in this regex101 demo and a Python demo.

import re
 
pattern = r"\bNOTE\.\s*([^.]*\.)\s*"
text = "NOTE. This is the subsequent sentence to be removed. The weather is good. NOTE. This is another subsequent sentence to be removed. The sky is blue. Note that it's a dummy text."
clean_text = re.sub(pattern, "", text)
print(clean_text)
 
unique_sentences_removed = re.findall(pattern, text)
print(unique_sentences_removed)

Output:

The weather is good. The sky is blue. Note that it's a dummy text.
['This is the subsequent sentence to be removed.', 'This is another subsequent sentence to be removed.']
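As a rough illustration of the difference between asserting the dot and matching it, the sketch below runs the question's lookahead pattern and this answer's pattern on the same input. The short `sample` string is an assumption chosen to keep the output readable.

```python
import re

# Hypothetical short input, not the asker's text.
sample = "NOTE. Drop me. Keep me."

# Pattern from the question: the lookahead asserts the dot without consuming it,
# so the dot and the space after it stay in the result.
lookahead_result = re.sub(r"NOTE\..*?(?=\.)", "", sample)

# Pattern from this answer: the sentence's dot and trailing whitespace are consumed.
consuming_result = re.sub(r"\bNOTE\.\s*([^.]*\.)\s*", "", sample)

print(repr(lookahead_result))  # '. Keep me.'
print(repr(consuming_result))  # 'Keep me.'
```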

zbq4xfa0 #3

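You can capture the removed sentences in a single pass by giving re.sub a replacement function that, as a side effect, saves each removed sentence: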

import re

def clean(text):
    removed = []
    def repl(m):
        removed.append(m.group(1))
        return ''
    clean_text = re.sub(r"NOTE\.\s*(.*?\.)\s*", repl, text)
    return clean_text, removed

text = "NOTE. This is the subsequent sentence to be removed. The weather is good. NOTE. This is another subsequent sentence to be removed. The sky is blue. Note that it's a dummy text."
result, removed = clean(text)
print(result)
print(removed)

Output:

The weather is good. The sky is blue. Note that it's a dummy text.
['This is the subsequent sentence to be removed.', 'This is another subsequent sentence to be removed.']
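If "unique" in the question is read as "without duplicates" (an interpretation, not something the answers state), the replacement-function approach could be adapted as sketched below. The name `clean_unique` and the `sample` text are hypothetical, and the `\b` word boundary is borrowed from the other answers.

```python
import re

def clean_unique(text):
    """Hypothetical variant of clean() that records each removed sentence once."""
    removed = {}  # dict keys preserve insertion order (Python 3.7+)

    def repl(m):
        # Store the captured sentence as a key; repeated sentences are ignored.
        removed.setdefault(m.group(1), None)
        return ""

    clean_text = re.sub(r"\bNOTE\.\s*(.*?\.)\s*", repl, text)
    return clean_text, list(removed)

# Hypothetical input in which the same sentence is removed twice.
sample = "NOTE. Drop me. Keep me. NOTE. Drop me. Keep me too."
result, removed = clean_unique(sample)
print(result)   # Keep me. Keep me too.
print(removed)  # ['Drop me.']
```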
