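Take the following data as an example: input_corpus = "this is an example.\n I am trying to extract it.\n". I am trying to extract exactly 2 words before and after each \n with the following code.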
for m in re.finditer('(?:\S+\s+){2,}[\.][\n]\s*(?:\S+\b\s*){0,2}',input_corpus): print(m)
Expected output:
an example. I am extract it.
Actual output: nothing is captured. Can someone tell me what is wrong with the regex?
qzlgjiam1#
You can use this regex:
r'(?:^|\S+\s+\S+)\n(?:\s*\S+\s+\S+|$)'
Code:
>>> import re
>>> input_corpus = "this is an example.\n I am trying to extract it.\n"
>>> print(re.findall(r'(?:^|\S+\s+\S+)\n(?:\s*\S+\s+\S+|$)', input_corpus))
['an example.\n I am', 'extract it.\n']
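Details:
(?:^|\S+\s+\S+): matches the 2 preceding words, or the start of the input
\n: matches a newline
(?:\s*\S+\s+\S+|$): matches the next 2 words, or the end of the input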
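As for why the pattern in the question captures nothing: every repetition of (?:\S+\s+) has to end in whitespace, so the literal dot that follows can only match directly after a whitespace character, and the input never contains whitespace followed by a dot. The sketch below runs both patterns side by side on the question's input; the variable names are only illustrative.

```python
import re

input_corpus = "this is an example.\n I am trying to extract it.\n"

# Pattern from the question. Each (?:\S+\s+) repetition must END in whitespace,
# so the following [\.] would need a '.' right after whitespace, which never occurs here.
question_pattern = r'(?:\S+\s+){2,}[\.][\n]\s*(?:\S+\b\s*){0,2}'

# Pattern suggested in this answer.
answer_pattern = r'(?:^|\S+\s+\S+)\n(?:\s*\S+\s+\S+|$)'

question_matches = [m.group() for m in re.finditer(question_pattern, input_corpus)]
answer_matches = re.findall(answer_pattern, input_corpus)

print(question_matches)  # []
print(answer_matches)    # ['an example.\n I am', 'extract it.\n']
```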
ecr0jaav2#
A non-regex approach:
sen = 'Beneficiary Name / John Hunter Alex'
sub_sen = 'John Hunter'

def sentence_proximity(sentence, ner_string):
    # Split both strings into whitespace-separated words.
    sen_l = sentence.split()
    sub_sen_l = ner_string.split()
    if len(sub_sen_l) < 2:
        print('single word')
    else:
        print('multiple words')
    # list.index raises ValueError when the word is absent; it never returns a negative index.
    curr_start_idx = sen_l.index(sub_sen_l[0])
    curr_end_idx = sen_l.index(sub_sen_l[-1])
    # Keep one word of context on each side, clamped to the sentence bounds.
    start_idx = max(curr_start_idx - 1, 0)
    end_idx = min(curr_end_idx + 2, len(sen_l))
    search_string = ' '.join(sen_l[start_idx:end_idx])
    print(f'Generated string: {search_string}')

sentence_proximity(sen, sub_sen)
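The same split-based idea can be applied to the input from the question. This is a minimal sketch, not part of the answer above: words_around_newlines and its parameter n are hypothetical names, and it assumes that "words" means whitespace-separated tokens.

```python
def words_around_newlines(text, n=2):
    """For each newline, join up to n words before it with up to n words after it.

    Hypothetical helper for illustration; "word" means a whitespace-separated token.
    """
    lines = text.split("\n")
    results = []
    # Each pair of consecutive lines corresponds to one newline in the text.
    for before, after in zip(lines, lines[1:]):
        left = before.split()[-n:] if n else []  # guard: [-0:] would return every word
        right = after.split()[:n]
        results.append(" ".join(left + right))
    return results


input_corpus = "this is an example.\n I am trying to extract it.\n"
print(words_around_newlines(input_corpus))  # ['an example. I am', 'extract it.']
```

Unlike the regex, this returns the words joined by single spaces rather than the raw matched text, so the newline itself is not part of the result.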