regex python: How to extract the words before and after a match using a regular expression

uoifb46i  posted on 2023-06-25 in Python

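Take the following data as an example:
input_corpus = "this is an example.\n I am trying to extract it.\n"
I am trying to extract exactly 2 words before and after each \n using the following code.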

for m in re.finditer('(?:\S+\s+){2,}[\.][\n]\s*(?:\S+\b\s*){0,2}',input_corpus):
   print(m)

Expected output:

an example. I am
extract it.

Actual output: nothing is captured.
Can someone tell me what is wrong with the regex?

qzlgjiam #1

You can use this regex:

r'(?:^|\S+\s+\S+)\n(?:\s*\S+\s+\S+|$)'


Code:

>>> import re
>>> input_corpus = "this is an example.\n I am trying to extract it.\n"
>>> print(re.findall(r'(?:^|\S+\s+\S+)\n(?:\s*\S+\s+\S+|$)', input_corpus))
['an example.\n I am', 'extract it.\n']

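Details:

  • (?:^|\S+\s+\S+): matches the 2 preceding words, or the start of the string
  • \n: matches a newline
  • (?:\s*\S+\s+\S+|$): matches the next 2 words, or the end of the string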
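
The breakdown above can be checked with a short script. This is only a sketch: the helper name `words_around_newlines` and the two capture groups are illustrative additions, not part of the answer. It also runs the pattern from the question on the same input. That pattern finds nothing because `(?:\S+\s+){2,}` requires whitespace after every word, while `example` is followed directly by the `.`.

```python
import re

# Pattern from the answer; the two capture groups are added for illustration
# so the text before and after the newline can be read separately.
PATTERN = re.compile(r'(^|\S+\s+\S+)\n(\s*\S+\s+\S+|$)')

def words_around_newlines(text):
    """Return (before, after) pairs for each newline the pattern matches."""
    return [(m.group(1), m.group(2).strip()) for m in PATTERN.finditer(text)]

input_corpus = "this is an example.\n I am trying to extract it.\n"
print(words_around_newlines(input_corpus))
# [('an example.', 'I am'), ('extract it.', '')]

# The pattern from the question, unchanged, on the same input.
question_pattern = r'(?:\S+\s+){2,}[\.][\n]\s*(?:\S+\b\s*){0,2}'
print(re.findall(question_pattern, input_corpus))
# []
```

The second pair has an empty "after" part because the final newline is matched by the `$` alternative.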

ecr0jaav #2

A non-regex approach:

sen = 'Beneficiary Name / John Hunter Alex'
sub_sen = 'John Hunter'

def sentence_proximity(sentence, ner_string):
  # Work on the arguments rather than the module-level variables
  sen_l = sentence.split()
  sub_sen_l = ner_string.split()

  start_idx = 0
  end_idx = 0
  search_string = ''

  if len(sub_sen_l) < 2:
    print('single word')
    # list.index raises ValueError if the word is not in the sentence
    curr_idx = sen_l.index(sub_sen_l[0])
    # Include one word before the match, unless it is the first word
    if curr_idx > 0:
      start_idx = curr_idx - 1
    else: 
      start_idx = curr_idx

    # Include one word after the match, unless it is the last word
    if curr_idx < len(sen_l) - 1:
      end_idx = curr_idx + 2
    else:
      end_idx = curr_idx + 1

  else:
    print('multiple words')
    # Locate the first and last words of the phrase (first occurrence of each)
    curr_start_idx = sen_l.index(sub_sen_l[0])
    if curr_start_idx > 0:
      start_idx = curr_start_idx - 1
    else: 
      start_idx = curr_start_idx

    curr_end_idx = sen_l.index(sub_sen_l[-1])
    if curr_end_idx < len(sen_l) - 1:
      end_idx = curr_end_idx + 2
    else:
      end_idx = curr_end_idx + 1

  # The slice end is exclusive, so end_idx points one past the last word kept
  search_string = ' '.join(sen_l[start_idx:end_idx])
  print(f'Generated string: {search_string}')

sentence_proximity(sen, sub_sen)
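
A more compact variant of the same idea is sketched below. The function name `context_window` and the `n` parameter are hypothetical, not from the answer. Unlike the code above, it looks for the phrase as a contiguous run of words and returns `None` instead of raising when the phrase is absent.

```python
def context_window(sentence, phrase, n=1):
    """Return the phrase plus up to n words on each side, or None if absent."""
    words = sentence.split()
    target = phrase.split()
    if not target:
        return None
    # Look for the phrase as a contiguous run of words.
    for i in range(len(words) - len(target) + 1):
        if words[i:i + len(target)] == target:
            start = max(i - n, 0)                       # clamp at sentence start
            end = min(i + len(target) + n, len(words))  # clamp at sentence end
            return ' '.join(words[start:end])
    return None

print(context_window('Beneficiary Name / John Hunter Alex', 'John Hunter'))
# / John Hunter Alex
print(context_window('Beneficiary Name / John Hunter Alex', 'Beneficiary', n=2))
# Beneficiary Name /
```

Clamping with `max` and `min` replaces the four `if`/`else` branches, and a single code path handles both single-word and multi-word phrases.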
