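I wrote the Python script below as a reproducible example of where my PDF parsing is currently stuck. It:
- downloads a PDF transcript from the web (Cassidy Hutchinson's September 14, 2022 interview with the J6C)
- reads/OCRs the PDF to text
- attempts to split that text into a series of Q&A passages
- runs a series of tests that I wrote based on my manual read of the transcript
Running the Python code below produces the following output: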
$ python stack_overflow_q_example.py
Test for passage0 passed.
Test for passage1 passed.
Test for passage7 passed.
Test for passage8 passed.
Traceback (most recent call last):
  File "/home/max/askliz/stack_overflow_q_example.py", line 91, in <module>
    assert nltk.edit_distance(passages[10][:len(actual_passage10_start)], actual_passage10_start) <= ACCEPTABLE_TEXT_DISCREPANCY, e_msg
AssertionError: Failed on passage 10
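Your mission, should you choose to accept it: I'm hoping a clever regex or other change to extract_q_a_locations below will get the failing test to pass, but I'm open to any solution that passes all of these tests, since I chose these test passages deliberately.
A little background on this text, in case you don't find reading it as fun as I do: sometimes a passage starts with "Q" or "A", and sometimes it starts with a name (e.g. "Ms. Cheney."). The failing test, for passage 10, covers a case where a staffer asks a question and the staffer's name has been redacted. The only way I've managed to get this test to pass has inadvertently broken one of the other tests, because not every redaction marks the start of a question. (Note: in the PDF/OCR library I'm using, pdfplumber, redacted text usually shows up as a bunch of extra spaces.)
The code follows: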
import nltk
import re
import requests
import pdfplumber


def extract_q_a_locations(examination_text: str) -> list:
    # (when parsed by pdfplumber) every Q/A starts with a newline, then spaces,
    # then a line number and more spaces
    prefix_regex = r'\n\s+\d+\s+'
    # sometimes what comes next is a 'Q' or 'A' and more spaces
    qa_regex = r'[QA]\s+'
    # other times what comes next is the name of a congressperson or lawyer for the witness
    speaker_regex = r"(?:(?:Mr\.|Ms\.) \w+\.|-\s+)"
    # the combined regex I've been using is looking for the prefix then QA or Speaker regex
    pattern = f"{prefix_regex}(?:{speaker_regex}|{qa_regex})"
    # search the argument that was passed in, not the module-level `text`
    delims = list(re.finditer(pattern, examination_text))
    return delims
def get_q_a_passages(qa_delimiters, text):
    q_a_list = []
    for delim, next_delim in zip(qa_delimiters[:-1], qa_delimiters[1:]):
        # prefix is either 'Q', 'A', or the name of the speaker
        prefix = text[delim.span()[0]:delim.span()[1]].strip().split()[-1]
        # the text chunk is the actual dialogue text. everything from current delim to next one
        text_chunk = text[delim.span()[1]:next_delim.span()[0]]
        # now we want to remove some of the extra cruft from layout=True OCR in pdfplumber
        text_chunk = re.sub(r"\n\s+\d+\s+", " ", text_chunk)  # remove line numbers
        text_chunk = " ".join(text_chunk.split())  # remove extra whitespace
        q_a_list.append(f"{prefix} {text_chunk}")
    return q_a_list
if __name__ == "__main__":
    # download pdf
    PDF_URL = "https://www.govinfo.gov/content/pkg/GPO-J6-TRANSCRIPT-CTRL0000928888/pdf/GPO-J6-TRANSCRIPT-CTRL0000928888.pdf"
    FILENAME = "interview_transcript_stackoverflow.pdf"
    response = requests.get(PDF_URL)
    with open(FILENAME, "wb") as f:
        f.write(response.content)
    # read pdf as text
    with pdfplumber.open(FILENAME) as pdf:
        text = "".join([p.extract_text(layout=True) for p in pdf.pages])
    # I care about the Q&A transcript, which starts after the "EXAMINATION" header
    startidx = text.find("EXAMINATION")
    text = text[startidx:]
    # extract Q&A passages
    passage_locations = extract_q_a_locations(text)
    passages = get_q_a_passages(passage_locations, text)
    # TESTS
    ACCEPTABLE_TEXT_DISCREPANCY = 2
    # The tests below all pass already.
    actual_passage0_start = "Q So I do first want to bring up exhibit"
    assert nltk.edit_distance(passages[0][:len(actual_passage0_start)], actual_passage0_start) <= ACCEPTABLE_TEXT_DISCREPANCY
    print("Test for passage0 passed.")
    actual_passage1 = "A This is correct."
    assert nltk.edit_distance(passages[1][:len(actual_passage1)], actual_passage1) <= ACCEPTABLE_TEXT_DISCREPANCY
    print("Test for passage1 passed.")
    # (Note: for the next two passages/texts, prefix/questioner is captured as "Cheney" &
    # "Jordan", not "Ms. Cheney" & "Mr. Jordan". I'm fine with either way.)
    actual_passage7_start = "Cheney. And we also, just as"
    assert nltk.edit_distance(passages[7][:len(actual_passage7_start)], actual_passage7_start) <= ACCEPTABLE_TEXT_DISCREPANCY
    print("Test for passage7 passed.")
    actual_passage8_start = "Jordan. They are pro bono"
    assert nltk.edit_distance(passages[8][:len(actual_passage8_start)], actual_passage8_start) <= ACCEPTABLE_TEXT_DISCREPANCY
    print("Test for passage8 passed.")
    # HERE'S MY PROBLEM.
    # This test fails because my regex fails to capture the question which starts with the
    # redacted name of the staff/questioner. The only way I've managed to get this test to
    # pass has also broken at least one of the tests above.
    actual_passage10_start = " So at this point, as we discussed earlier, I'm going to"
    e_msg = "Failed on passage 10"
    assert nltk.edit_distance(passages[10][:len(actual_passage10_start)], actual_passage10_start) <= ACCEPTABLE_TEXT_DISCREPANCY, e_msg
1 Answer
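I am assuming that the redactions which fall between passages are not needed. What I did was replace the run of spaces left by a redacted name with "Ms. Fakename." I did this because, as you mention in the question, the passages you want start either with a name or with Q or A. When a passage starts with a name, the name ends with a period and is followed by a capital letter. When the name is redacted, the passage has a long run of spaces in front of it. Combining these observations, I was able to get all the tests to pass by adding a short snippet that performs this substitution.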
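The answer's own snippet is not reproduced here. As a minimal sketch of the substitution described above, and not the answerer's actual code, the following assumes that a redacted name shows up as a run of at least 12 spaces between the line number and a capitalized word. The threshold MIN_REDACTION_SPACES, the helper name fill_redacted_speakers, and the sample text are all hypothetical; the real pdfplumber output may need a different threshold.

```python
import re

# Assumption: a redacted speaker name appears as an unusually long run of spaces
# between the line number and the first capitalized word of the turn.
MIN_REDACTION_SPACES = 12  # hypothetical threshold; tune against the real text
PLACEHOLDER = "Ms. Fakename."

# Group 1 keeps the newline, indentation and line number. The lookaheads require
# a capital letter next, but skip ordinary "Q ..." / "A ..." lines.
REDACTED_SPEAKER = re.compile(
    r"(\n[ \t]+\d+)[ \t]{%d,}(?=[A-Z])(?![QA][ \t])" % MIN_REDACTION_SPACES
)


def fill_redacted_speakers(text: str) -> str:
    """Replace a redaction-sized gap after a line number with a placeholder name."""
    return REDACTED_SPEAKER.sub(lambda m: f"{m.group(1)} {PLACEHOLDER} ", text)


# Synthetic stand-in for the layout=True text (not taken from the real transcript).
sample = (
    "\n    1      Q     So I do first want to bring up exhibit"
    "\n    2      A     This is correct."
    "\n    3" + " " * 20 + "So at this point, as we discussed earlier,"
    "\n    4      I'm going to ask a few questions."
)

# Same delimiter pattern as extract_q_a_locations in the question.
pattern = r"\n\s+\d+\s+(?:(?:(?:Mr\.|Ms\.) \w+\.|-\s+)|[QA]\s+)"

patched = fill_redacted_speakers(sample)
print(len(re.findall(pattern, sample)), "delimiters before patching")
print(len(re.findall(pattern, patched)), "delimiters after patching")
```

Under these assumptions, calling fill_redacted_speakers(text) just before extract_q_a_locations(text) should let the existing speaker_regex pick up the placeholder. The (?![QA][ \t]) guard keeps ordinary Q/A lines from being rewritten, at the cost of skipping a redacted turn that happens to begin with a standalone "A".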
Note that in the last test I added "Fakename" as the prefix. If you don't want that, you can update the passages list to remove the manually added prefix.
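One possible way to do that cleanup is sketched below. It assumes the placeholder reaches the passages list as the label "Fakename." (get_q_a_passages keeps only the last token of the delimiter, as it does for "Cheney."). The helper name strip_placeholder and the sample list are hypothetical.

```python
PLACEHOLDER_LABEL = "Fakename."  # assumed label produced via get_q_a_passages


def strip_placeholder(passages):
    """Drop the placeholder label from the front of any passage that carries it."""
    return [
        p[len(PLACEHOLDER_LABEL):] if p.startswith(PLACEHOLDER_LABEL) else p
        for p in passages
    ]


# Hypothetical passages, shaped like the output of get_q_a_passages.
passages = [
    "Q So I do first want to bring up exhibit",
    "Cheney. And we also, just as",
    "Fakename. So at this point, as we discussed earlier, I'm going to",
]
cleaned = strip_placeholder(passages)
print(repr(cleaned[2]))
```

Only the label is removed, so the space that followed it stays at the front of the passage. That matches the leading space in actual_passage10_start from the question, so the original assertion would not need to change.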