import re
#list of names to identify in input strings
result_list = ['Thomas Edd', 'Melissa Clark', 'Ada White', 'Louis Pasteur', 'Edd Thomas', 'Clark Melissa', 'White Eda', 'Pasteur Louis', 'Thomas', 'Melissa', 'Ada', 'Louis', 'Edd', 'Clark', 'White', 'Pasteur']
result_list.sort() # sorts normally by alphabetical order (optional)
result_list.sort(key=len, reverse=True) # sorts by descending length
#example 1
input_text = "Melissa went for a walk in the park, then Melisa Clark went to the cosmetics store. There Thomas showed her a wide variety of cosmetic products. Edd Thomas is a great salesman, even so Thomas Edd is a skilled but responsible salesman, as Edd is always honest with his customers. White is a new client who came to Edd's business due to the good social media reviews she saw from Melissa, her co-worker."
# Example 2 is almost the same, but some of the names are already encapsulated
# in the ((PERS)name) structure and should not be encapsulated again.
input_text = "((PERS)Melissa) went for a walk in the park, then Melisa Clark went to the cosmetics store. There Thomas showed her a wide variety of cosmetic products. Edd Thomas is a great salesman, even so ((PERS)Thomas Edd) is a skilled but responsible salesman, as Edd is always honest with his customers. White is a new client who came to Edd's business due to the good social media reviews she saw from Melissa, her co-worker." #example 2
for i in result_list:
    input_text = re.sub(r"\(\(PERS\)" + r"(" + str(i) + r")" + r"\)",
                        lambda m: (f"((PERS){m[1]})"),
                        input_text)
print(repr(input_text)) # --> output
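Note that a name should be identified only when it meets certain conditions: it must sit between two whitespace characters, \s*the searched name\s*, or at the beginning of the input string, (?:(?<=\s)|^), and/or at its end.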
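A name may also be followed by a comma, for example "Ada White, Melissa and Louis went shopping", or a space may have been left out by accident, as in "Ada White,Melissa and Louis went shopping". It is therefore important to allow for a name appearing right after [.,;].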
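Cases where a name should not be encapsulated include, for example:
"the Edd's business"
"The whitespace"
"the pasteurization process takes time"
"Those White-spaces in that text are unnecessary"
This is because, in these cases, the name is followed or preceded by another word that should not be part of the name being searched for.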
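For both example 1 and example 2 (note that example 2 is the same as example 1, except that some names are already encapsulated and must not be encapsulated again), the following output should be obtained.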
"((PERS)Melissa) went for a walk in the park, then ((PERS)Melisa Clark) went to the cosmetics store. There ((PERS)Thomas) showed her a wide variety of cosmetic products. ((PERS)Edd Thomas) is a great salesman, even so ((PERS)Thomas Edd) is a skilled but responsible salesman, as ((PERS)Edd) is always honest with his customers. ((PERS)White) is a new client who came to Edd's business due to the good social media reviews she saw from ((PERS)Melissa), her co-worker."
1 Answer
You can use lookarounds to exclude names that are already encapsulated, as well as names that are followed by ', an alphanumeric character, or -.
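The lookaround idea can be sketched roughly as follows. This is a minimal sketch rather than the answer's original code: the shortened name list, the names NAME_PATTERN and tag_names, and the sample sentence are all made up for illustration. The last lookahead, which skips anything sitting inside an existing pair of parentheses, is an assumption about how to keep "Edd" in ((PERS)Thomas Edd) from being wrapped a second time.

```python
import re

# Shortened version of the question's name list (illustrative only).
names = ["Thomas Edd", "Edd Thomas", "Ada White", "Melissa", "Louis", "Thomas", "Edd", "White"]
# Longest first, so "Thomas Edd" is tried before "Thomas" or "Edd".
names.sort(key=len, reverse=True)

NAME_PATTERN = re.compile(
    r"(?<![\w-])"        # not glued to a preceding word character or hyphen
    r"(" + "|".join(re.escape(n) for n in names) + r")"
    r"(?![\w'-])"        # not followed by ', a word character, or -
    r"(?![^()]*\))"      # assumption: not inside an existing (...) group
)

def tag_names(text):
    """Wrap each standalone, not yet tagged name as ((PERS)name)."""
    return NAME_PATTERN.sub(r"((PERS)\1)", text)

sample = ("((PERS)Thomas Edd) met Ada White,Melissa and Louis at Edd's shop. "
          "Those White-spaces bother Edd.")
print(tag_names(sample))
# ((PERS)Thomas Edd) met ((PERS)Ada White),((PERS)Melissa) and ((PERS)Louis) at Edd's shop. Those White-spaces bother ((PERS)Edd).
```

Because the parenthesis check treats any (...) group as off limits, a name inside ordinary parentheses, such as "(Edd said so)", would also be left untouched. Whether that is acceptable depends on the input texts.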
Of course, you can refine the contents of the lookahead to cover further edge cases.
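One way to make such refinements easier is to build the pattern from a configurable set of characters that may not follow a name. The helper below is a hypothetical sketch: build_name_pattern and its forbidden_after parameter are invented here, and it only covers the boundary checks, not the handling of names that are already encapsulated.

```python
import re

def build_name_pattern(names, forbidden_after="'-"):
    """Compile a pattern matching any of `names` that is not followed by a word
    character or by one of the characters in `forbidden_after` (hypothetical)."""
    ordered = sorted(names, key=len, reverse=True)   # longest names first
    alternation = "|".join(re.escape(n) for n in ordered)
    tail = re.escape(forbidden_after)                # safe inside a character class
    return re.compile(rf"(?<![\w-])({alternation})(?![\w{tail}])")

# Default set: a typographic apostrophe is not excluded, so "Edd’s" is still tagged.
default = build_name_pattern(["Edd", "White"])
print(default.sub(r"((PERS)\1)", "Edd’s shop, Edd said"))

# Refined set: also reject names followed by the typographic apostrophe.
strict = build_name_pattern(["Edd", "White"], forbidden_after="'’-")
print(strict.sub(r"((PERS)\1)", "Edd’s shop, Edd said"))
```

With the refined set, only the second "Edd" in the sample sentence is wrapped.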