Use regex to identify and replace some strings stored in a list, which may or may not contain

6psbrbz9  posted on 2023-01-18 in Other
import re

#list of names to identify in input strings
result_list = ['Thomas Edd', 'Melissa Clark', 'Ada White', 'Louis Pasteur', 'Edd Thomas', 'Clark Melissa', 'White Eda', 'Pasteur Louis', 'Thomas', 'Melissa', 'Ada', 'Louis', 'Edd', 'Clark', 'White', 'Pasteur']

result_list.sort() # sorts normally by alphabetical order (optional)
result_list.sort(key=len, reverse=True) # sorts by descending length

#example 1
input_text = "Melissa went for a walk in the park, then Melisa Clark went to the cosmetics store. There Thomas showed her a wide variety of cosmetic products. Edd Thomas is a great salesman, even so Thomas Edd is a skilled but responsible salesman, as Edd is always honest with his customers. White is a new client who came to Edd's business due to the good social media reviews she saw from Melissa, her co-worker."

# Example 2 is almost the same, but some of the names are already encapsulated
# in the ((PERS)name) structure and should not be encapsulated again.
input_text = "((PERS)Melissa) went for a walk in the park, then Melisa Clark went to the cosmetics store. There Thomas showed her a wide variety of cosmetic products. Edd Thomas is a great salesman, even so ((PERS)Thomas Edd) is a skilled but responsible salesman, as Edd is always honest with his customers. White is a new client who came to Edd's business due to the good social media reviews she saw from Melissa, her co-worker." #example 2

for i in result_list:
    input_text = re.sub(r"\(\(PERS\)" + r"(" + str(i) + r")" + r"\)",
                        lambda m: (f"((PERS){m[1]})"),
                        input_text)

print(repr(input_text)) # --> output

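Note that a name must meet certain conditions to be identified: it must sit between two whitespace characters (\s*the searched name\s*), or be at the start of the input string (?:(?<=\s)|^) and/or at its end.
A name may also be followed by a comma, for example "Ada White, Melissa and Louis went shopping", or a space may have been omitted by accident, as in "Ada White,Melissa and Louis went shopping". For this reason it is important that a name can still be found when it sits directly next to one of [.,;].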
Cases where the name should NOT be encapsulated, for example...
"the Edd's business"
"The whitespace"
"the pasteurization process takes time"
"Those White-spaces in that text are unnecessary"
This is because in these cases the name is followed or preceded by other text that should not be part of the name being searched for.
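The conditions above can be checked one name at a time. The following is only a minimal sketch, assuming that "standalone" means the name does not touch a word character, an apostrophe, or a hyphen on either side; the helper name `is_standalone` is hypothetical and not part of the original post.

```python
import re

def is_standalone(name, text):
    """Hypothetical helper: True if `name` occurs in `text` without touching
    a word character, apostrophe, or hyphen on either side."""
    # re.escape keeps any regex metacharacters in the name literal.
    pattern = rf"(?<![\w'-]){re.escape(name)}(?![\w'-])"
    return re.search(pattern, text) is not None

sentence = "Ada White,Melissa and Louis went shopping"

# Names next to whitespace, the string start, or one of [.,;] are accepted.
print(is_standalone("Ada White", sentence))  # True
print(is_standalone("Melissa", sentence))    # True

# Names glued to other text are rejected.
print(is_standalone("Edd", "the Edd's business"))                                # False
print(is_standalone("White", "Those White-spaces in that text are unnecessary"))  # False
print(is_standalone("White", "The whitespace"))                                  # False
print(is_standalone("Pasteur", "the pasteurization process takes time"))         # False
```

The last two cases are already rejected because the match is case-sensitive; the lookarounds would also reject them if matching were made case-insensitive, since the name is followed by a letter in both.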
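For example 1 and example 2 (note that example 2 is the same as example 1, except that some names are already encapsulated and must not be encapsulated again), the following output should be obtained.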

"((PERS)Melissa) went for a walk in the park, then ((PERS)Melisa Clark) went to the cosmetics store. There ((PERS)Thomas) showed her a wide variety of cosmetic products. ((PERS)Edd Thomas) is a great salesman, even so ((PERS)Thomas Edd) is a skilled but responsible salesman, as ((PERS)Edd) is always honest with his customers. ((PERS)White) is a new client who came to Edd's business due to the good social media reviews she saw from ((PERS)Melissa), her co-worker."
kmpatx3s  #1

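You can use lookarounds to exclude names that are already encapsulated, as well as names that are followed by ', an alphanumeric character, or -: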

import re

result_list = ['Thomas Edd', 'Melissa Clark', 'Ada White', 'Louis Pasteur', 'Edd Thomas', 'Clark Melissa', 'White Eda', 'Pasteur Louis', 'Thomas', 'Melissa', 'Ada', 'Louis', 'Edd', 'Clark', 'White', 'Pasteur']
result_list.sort(key=len, reverse=True) # sorts by descending length

input_text = "((PERS)Melissa) went for a walk in the park, then Melissa Clark went to the cosmetics store. There Thomas showed her a wide variety of cosmetic products. Edd Thomas is a great salesman, even so ((PERS)Thomas Edd) is a skilled but responsible salesman, as Edd is always honest with his customers. White is a new client who came to Edd's business due to the good social media reviews she saw from Melissa, her co-worker." #example 2

# Escape each name so any regex metacharacters in it are matched literally.
pat = re.compile(rf"(?<!\(PERS\))({'|'.join(map(re.escape, result_list))})(?!['\w)-])")
input_text = pat.sub(r'((PERS)\1)', input_text)
print(input_text)

Output:

((PERS)Melissa) went for a walk in the park, then ((PERS)Melissa Clark) went to the cosmetics store. There ((PERS)Thomas) showed her a wide variety of cosmetic products. ((PERS)Edd Thomas) is a great salesman, even so ((PERS)Thomas Edd) is a skilled but responsible salesman, as ((PERS)Edd) is always honest with his customers. ((PERS)White) is a new client who came to Edd's business due to the good social media reviews she saw from ((PERS)Melissa), her co-worker.

Of course, you can refine what goes into the lookahead to cover further edge cases.
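As one possible refinement, the sketch below adds a second lookbehind so that a name glued to a preceding word character, apostrophe, or hyphen is also skipped. The function name `tag_names`, the shortened name list, and the sample sentence are illustrative assumptions, not part of the original answer.

```python
import re

# Shortened, illustrative name list (an assumption for this sketch).
names = ['Melissa Clark', 'Thomas Edd', 'Edd Thomas', 'Melissa', 'Thomas', 'Edd', 'White']

def tag_names(text, names):
    """Hypothetical helper: wrap standalone names in ((PERS)...)."""
    # Longest names first, so "Edd Thomas" is tried before "Edd".
    ordered = sorted(names, key=len, reverse=True)
    alternation = "|".join(re.escape(n) for n in ordered)
    pattern = re.compile(
        rf"(?<!\(PERS\))"      # not already encapsulated
        rf"(?<![\w'-])"        # not glued to preceding text
        rf"({alternation})"
        rf"(?!['\w)-])"        # not glued to following text or a closing )
    )
    return pattern.sub(r"((PERS)\1)", text)

text = "McThomas met Edd Thomas; ((PERS)Thomas Edd) and Melissa,White came to Edd's shop."
tagged = tag_names(text, names)
print(tagged)
# McThomas met ((PERS)Edd Thomas); ((PERS)Thomas Edd) and ((PERS)Melissa),((PERS)White) came to Edd's shop.

# Running the function again leaves the already tagged text unchanged.
print(tag_names(tagged, names) == tagged)  # True
```

One caveat of this approach: a word in the middle of an already encapsulated name of three or more words would be neither directly after (PERS) nor directly before ), so such names would need extra handling.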
