"What I have"
After OCR'ing some public deposition PDF files, which are in a question-and-answer format, I have raw text like the following:
text = """\na\n\nQ So I do first want to bring up exhibit No. 46, which is in the binder
in front of\nyou.\n\nAnd that is a letter [to] Alston\n& Bird...
\n\nIs that correct?\n\nA This is correct.\n\nQ Okay."""
...I want to split it into separate questions and answers. Each question or answer begins with '\nQ ', '\nA ', '\nQ_', or '\nA_' (i.e., it matches the regex "\n[QA]_?\s").
"What I have done so far"
I can get a list of all the questions and answers with the following code:
import re

pattern = r"\n[QA]_?\s"  # raw string so \s reaches the regex engine unchanged
q_a_list = re.split(pattern, text)
print(q_a_list)
This gives q_a_list:
['\na\n',
'So I do first want to bring up exhibit No. 46, which is in the binder \nin front of\nyou.\n\nAnd that is a letter [to] Alston\n& Bird...\n\n\nIs that correct?\n',
'This is correct.\n',
'Okay.']
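"What I want"
This is close to what I want, but it has the following problems:
- it is not always clear whether a statement is a question or an answer, and
- sometimes, as in this particular example, the first item in the list is neither a question nor an answer, but just random text before the first '\nQ ' delimiter.
I would like a modified version of my q_a_list above that solves both bulleted problems by linking each text chunk to the delimiter that preceded it: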
[{'0': '\na\n',
'\nQ': 'So I do first want to bring up exhibit No. 46, which is in the binder \nin front of\nyou.\n\nAnd that is a letter [to] Alston\n& Bird...\n\n\nIs that correct?\n',
'\nA': 'This is correct.\n',
'\nQ': 'Okay.'}]
or
[{'\nQ': 'So I do first want to bring up exhibit No. 46, which is in the binder \nin front of\nyou.\n\nAnd that is a letter [to] Alston\n& Bird...\n\n\nIs that correct?\n',
'\nA': 'This is correct.\n',
'\nQ': 'Okay.'}]
or maybe even just a list with the delimiters prepended:
['\nQ: So I do first want to bring up exhibit No. 46, which is in the binder \nin front of\nyou.\n\nAnd that is a letter [to] Alston\n& Bird...\n\n\nIs that correct?\n',
'\nA: This is correct.\n',
'\nQ: Okay.'
]
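One caveat about the dict forms above: a Python dict cannot hold the key '\nQ' twice, so a literal written that way silently keeps only the last value for each repeated key. The short sketch below illustrates this with placeholder strings (the values and the name `pairs` are made up for illustration) and shows a list of (label, text) pairs as one structure that preserves every entry.

```python
# Placeholder values; only the repeated '\nQ' key matters here.
wanted = {
    "\nQ": "first question",
    "\nA": "an answer",
    "\nQ": "second question",  # same key again: overwrites the first value
}

print(len(wanted))    # 2
print(wanted["\nQ"])  # second question

# A list of (label, text) pairs keeps every entry, in order.
pairs = [
    ("Q", "first question"),
    ("A", "an answer"),
    ("Q", "second question"),
]
print(len(pairs))     # 3
```

For that reason the list form, or a list of small per-statement records, is probably the more practical target.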
2 Answers
hgb9j2n61#
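This may not be the most elegant answer, but it seems to work well. I won't accept this answer for the next few days in case someone posts a better one:
The result is now: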
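The answer's code and output are not shown here. One way to get the list-with-delimiters form, offered as a minimal sketch rather than the answerer's actual code, relies on the fact that `re.split` keeps the delimiters in its result when the pattern is wrapped in a capturing group. The names `parts`, `preamble`, `rest`, and `label` are illustrative.

```python
import re

text = (
    "\na\n\nQ So I do first want to bring up exhibit No. 46, "
    "which is in the binder \nin front of\nyou.\n\n"
    "And that is a letter [to] Alston\n& Bird...\n\n\n"
    "Is that correct?\n\nA This is correct.\n\nQ Okay."
)

# The capturing group makes re.split return the delimiters as well:
# [preamble, delim1, chunk1, delim2, chunk2, ...]
parts = re.split(r"(\n[QA]_?\s)", text)
preamble, rest = parts[0], parts[1:]

q_a_list = []
for delim, chunk in zip(rest[0::2], rest[1::2]):
    # '\nQ ' or '\nQ_ ' -> 'Q'
    label = delim.strip().rstrip("_")
    q_a_list.append(f"\n{label}: {chunk}")

print(repr(preamble))  # text before the first delimiter, kept separately
print(q_a_list)
```

The text before the first delimiter ends up in `preamble`, so it can be kept or discarded without being mistaken for a question or an answer.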
dwbf0jvd2#
I'm not sure I fully understood the question, but I hope this might help:
Try this,
Output:
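The code and output for this answer are not shown here either. As a hedged alternative sketch (not the answerer's code), `re.finditer` with a lookahead can capture the speaker letter and the text that follows it in one pass, skipping anything before the first delimiter. The record keys `speaker` and `text` are hypothetical names chosen for illustration.

```python
import re

text = (
    "\na\n\nQ So I do first want to bring up exhibit No. 46, "
    "which is in the binder \nin front of\nyou.\n\n"
    "And that is a letter [to] Alston\n& Bird...\n\n\n"
    "Is that correct?\n\nA This is correct.\n\nQ Okay."
)

# Group 1: the speaker letter. Group 2: everything (lazily) up to the
# next delimiter or the end of the string. DOTALL lets '.' span newlines.
pattern = re.compile(r"\n([QA])_?\s(.*?)(?=\n[QA]_?\s|\Z)", re.DOTALL)

records = [
    {"speaker": m.group(1), "text": m.group(2)}
    for m in pattern.finditer(text)
]

for record in records:
    print(record)
```

Because each statement is its own record, repeated 'Q' or 'A' labels do not collide the way repeated dict keys would.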