Regex/Python text parsing: splitting a list into chunks, including the preceding delimiter

k3bvogb1  posted on 2023-02-10  in  Python

"What I have"
After OCR'ing some public deposition PDFs that are in a question-and-answer format, I have raw text like the following:

text = """\na\n\nQ So I do first want to bring up exhibit No. 46, which is in the binder 
in front of\nyou.\n\nAnd that is a letter [to] Alston\n& Bird...
\n\nIs that correct?\n\nA This is correct.\n\nQ Okay."""

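...which I would like to split into separate questions and answers. Each question or answer begins with '\nQ ', '\nA ', '\nQ_', or '\nA_' (e.g., it matches the regex "\n[QA]_?\s").
"What I have done so far"
I can get a list of all questions and answers with the following code: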

import re

# Raw string, so that \s reaches the regex engine as written
pattern = r"\n[QA]_?\s"
q_a_list = re.split(pattern, text)
print(q_a_list)

which gives q_a_list:

['\na\n', 
'So I do first want to bring up exhibit No. 46, which is in the binder \nin front of\nyou.\n\nAnd that is a letter [to] Alston\n& Bird...\n\n\nIs that correct?\n', 
'This is correct.\n', 
'Okay.']

"What I want"
This is close to what I want, but it has the following problems:

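  • it is not always clear whether a statement is a question or an answer, and
  • sometimes, as in this particular example, the first item in the list may be neither a question nor an answer, but just random text before the first '\nQ ' delimiter.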

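I would like a modified version of my q_a_list above that addresses both bullet points by linking each chunk of text to the delimiter that preceded it, for example: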

[{'0': '\na\n', 
  '\nQ': 'So I do first want to bring up exhibit No. 46, which is in the binder \nin front of\nyou.\n\nAnd that is a letter [to] Alston\n& Bird...\n\n\nIs that correct?\n',
  '\nA': 'This is correct.\n',
  '\nQ': 'Okay.'}]

[{'\nQ': 'So I do first want to bring up exhibit No. 46, which is in the binder \nin front of\nyou.\n\nAnd that is a letter [to] Alston\n& Bird...\n\n\nIs that correct?\n',
  '\nA': 'This is correct.\n',
  '\nQ': 'Okay.'}]

or maybe even just a list with the preceding delimiter prepended:

['\nQ: So I do first want to bring up exhibit No. 46, which is in the binder \nin front of\nyou.\n\nAnd that is a letter [to] Alston\n& Bird...\n\n\nIs that correct?\n',
'\nA: This is correct.\n',
'\nQ: Okay.'
]
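
As a quick check of which prefixes the pattern "\n[QA]_?\s" from the question actually treats as delimiters, here is a minimal sketch. The sample strings are made up for illustration and are not taken from the deposition text.

```python
import re

# Raw string, so that \s reaches the regex engine unchanged.
pattern = re.compile(r"\n[QA]_?\s")

# Hypothetical sample strings, chosen only to exercise the pattern.
samples = ["\nQ So", "\nA This", "\nQ_ Okay", "\nA_\tYes",
           "\nQuestion", "Q So", "\nQ_x"]

# pattern.match anchors at the start of each sample string.
matches = {s: bool(pattern.match(s)) for s in samples}
for s, ok in matches.items():
    print(repr(s), ok)
```

Note that the underscore variants only count as delimiters when whitespace follows the underscore, and that a marker without a leading newline (for example at the very start of the text) is not matched.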

hgb9j2n6  1#

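This may not be the most elegant answer, but it seems to work. I won't accept this answer for the next few days, in case someone posts a better one: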

# this gets me the location (index start & end) of each occurrence of my regex pattern 
delims = list(re.finditer(pattern, text))

# now let's iterate through each pair of delimiter and next-delimiter locations
q_a_list = []

for delim, next_delim in zip(delims[:-1], delims[1:]):

    # pull "Q" or "A" out of the current delimiter
    prefix = text[delim.span()[0]:delim.span()[1]].strip()

    # The actual question or answer text spans from the end of this 
    # delimiter to the start of the next delimiter
    text_chunk = text[delim.span()[1]:next_delim.span()[0]]

    q_a_list.append(f"{prefix}: {text_chunk}")

# q_a_list is missing the final prefix and text_chunk, because
# they have no next_delim, so the zip() above doesn't get to it
final_delim = delims[-1]

final_prefix = text[final_delim.span()[0]: final_delim.span()[1]].strip()
final_text_chunk = text[final_delim.span()[1]:]

q_a_list.append(f"{final_prefix}: {final_text_chunk}")

The result is now:

>>> print(q_a_list)
['Q: So I do first want to bring up exhibit No. 46, which is in the binder \nin front of\nyou.\n\nAnd that is a letter [to] Alston\n& Bird...\n\n\nIs that correct?\n', 
'A: This is correct.\n', 
'Q: Okay.']
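
A more compact variant, offered as a sketch rather than a drop-in replacement: when the pattern contains a capturing group, re.split also returns the captured delimiter text, so each chunk can be paired with the marker before it. The helper name split_with_prefix is hypothetical, and the sample text is shortened from the question.

```python
import re

def split_with_prefix(text):
    """Return (preamble, [(prefix, chunk), ...]) for Q/A-delimited text."""
    # The parentheses capture the marker ('Q', 'A', 'Q_' or 'A_'),
    # so re.split keeps it in the result instead of discarding it.
    parts = re.split(r"\n([QA]_?)\s", text)
    preamble = parts[0]      # text before the first delimiter, if any
    prefixes = parts[1::2]   # captured markers
    chunks = parts[2::2]     # text following each marker
    return preamble, list(zip(prefixes, chunks))

# Shortened sample, assumed to have the same shape as the OCR text.
sample = "\na\n\nQ Is that correct?\n\nA This is correct.\n\nQ Okay."
preamble, pairs = split_with_prefix(sample)
print(repr(preamble))
for prefix, chunk in pairs:
    print(f"{prefix}: {chunk!r}")
```

The text before the first marker comes back separately as preamble, so it can be kept or discarded, and a text with no markers at all simply yields an empty list of pairs.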

dwbf0jvd  2#

I'm not sure I fully understand the question, but I hope this helps.
Try this:

questions = []
answers = []
for item in text.split('\n\n'):
    # startswith accepts a tuple of prefixes; ('Q ' or 'Q_') would evaluate to just 'Q '
    if item.startswith(('Q ', 'Q_')):
        questions.append(item)
    else:
        answers.append(item)

print(f'questions: {questions}')
print(f'answers: {answers}')

Output:

questions: ['Q So I do first want to bring up exhibit No. 46, which is in the binder \nin front of\nyou.', 'Q Okay.']
answers: ['\na', 'And that is a letter [to] Alston\n& Bird...', '\nIs that correct?', 'A This is correct.']
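
In the output above, the continuation paragraphs of the first question ('And that is a letter...' and 'Is that correct?') and the leading '\na' end up in answers, because splitting on blank lines separates them from their 'Q ' marker. One possible sketch that instead sorts chunks by the marker that precedes them uses re.finditer with a lookahead. The names chunk_re, questions and answers are illustrative, and the sample text is shortened from the question.

```python
import re

# Each match starts at a '\nQ ' / '\nA ' style marker and runs lazily until
# the next marker or the end of the text (checked with a lookahead).
chunk_re = re.compile(r"\n([QA])_?\s(.*?)(?=\n[QA]_?\s|\Z)", re.DOTALL)

# Shortened sample, assumed to have the same shape as the OCR text.
sample = ("\na\n\nQ First part.\n\nIs that correct?"
          "\n\nA This is correct.\n\nQ Okay.")

questions, answers = [], []
for m in chunk_re.finditer(sample):
    speaker, body = m.group(1), m.group(2).strip()
    # Text before the first marker never matches, so it is skipped.
    (questions if speaker == "Q" else answers).append(body)

print(questions)
print(answers)
```

With this approach a multi-paragraph question stays in one piece, at the cost of assuming that every new speaker turn really does start with one of the markers.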
