Trouble parsing an interview transcript (Q&As) with regex when the questioner's name is sometimes redacted

wfauudbj  posted on 2023-02-25  in  Other

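I wrote the Python script below as a reproducible example of where my PDF parsing is currently stuck. It does the following:

  • Downloads a PDF transcript from the web (Cassidy Hutchinson's interview with the J6 Committee on 2022-09-14)
  • Reads/OCRs the PDF to text
  • Tries to split that text into a series of Q&A passages
  • Runs a series of tests that I wrote based on my manual read of the transcript

Running the Python code below produces the following output: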

~/askliz  main !1 ?21  python stack_overflow_q_example.py                                                      ✔  docenv Py  22:41:00 
Test for passage0 passed.
Test for passage1 passed.
Test for passage7 passed.
Test for passage8 passed.
Traceback (most recent call last):
  File "/home/max/askliz/stack_overflow_q_example.py", line 91, in <module>
    assert nltk.edit_distance(passages[10][:len(actual_passage10_start)], actual_passage10_start) <= ACCEPTABLE_TEXT_DISCREPANCY, e_msg
AssertionError: Failed on passage 10

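Your mission, should you choose to accept it: I am hoping for a clever regex or some other modification inside extract_q_a_locations below that does the trick, but I am open to any solution that passes all of these tests, since I chose these test passages deliberately.
Some background on the transcript, in case you don't find it as fun a read as I do: sometimes a passage starts with a "Q" or an "A", and other times it starts with a name (e.g. "Ms. Cheney."). The failing test, for passage 10, covers a place where a staff member asks a question and their name is redacted. The only way I have managed to get that test to pass inadvertently broke one of the other tests, because not all redactions indicate the start of a question. (Note: in the PDF/OCR library I am using, pdfplumber, redacted text usually shows up as just a bunch of extra spaces.)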
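To make the failure concrete, here is a minimal sketch that runs the same pattern pieces used in extract_q_a_locations over a small made-up sample. The sample text (SAMPLE) and the helper speaker_tokens are hypothetical and only imitate the layout described above; they are not real transcript output.

```python
import re

# Same pattern pieces as in extract_q_a_locations, written as raw strings.
PREFIX = r"\n\s+\d+\s+"
QA = r"[QA]\s+"
SPEAKER = r"(?:(?:Mr\.|Ms\.) \w+\.|-\s+)"
PATTERN = re.compile(f"{PREFIX}(?:{SPEAKER}|{QA})")

# Hypothetical sample imitating layout=True output; line 4 stands in for
# a passage whose speaker name was redacted and shows up as extra spaces.
SAMPLE = (
    "\n          1     Q    Is that correct?"
    "\n          2     A    Yes."
    "\n          3     Ms. Cheney. Thank you."
    "\n          4" + " " * 23 + "So at this point, I will move on."
    "\n          5     A    Okay."
)


def speaker_tokens(text):
    """Return the last token of every delimiter match (Q, A, or a name)."""
    return [m.group().split()[-1] for m in PATTERN.finditer(text)]


print(speaker_tokens(SAMPLE))  # ['Q', 'A', 'Cheney.', 'A']
```

On this sample the line whose speaker is replaced by spaces produces no delimiter, so its text would be folded into the preceding passage. That is the same kind of miss the passage 10 test reports.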

The code is below:

import nltk
import re
import requests
import pdfplumber

def extract_q_a_locations(examination_text:str)->list:

    # (when parsed by pdfplumber) every Q/A starts with a newline, then spaces, 
    # then a line number and more spaces 
    prefix_regex = r'\n\s+\d+\s+'

    # sometimes what comes next is a 'Q' or 'A' and more spaces
    qa_regex = r'[QA]\s+'

    # other times what comes next is the name of a congressperson or lawyer for the witness
    speaker_regex = r"(?:(?:Mr\.|Ms\.) \w+\.|-\s+)"

    # the combined regex I've been using is looking for the prefix then QA or Speaker regex
    pattern = f"{prefix_regex}(?:{speaker_regex}|{qa_regex})"
    # search the argument that was passed in, not the module-level `text`
    delims = list(re.finditer(pattern, examination_text))
    return delims

def get_q_a_passages(qa_delimiters, text):
    q_a_list = []
    for delim, next_delim in zip(qa_delimiters[:-1], qa_delimiters[1:]):
        # prefix is either 'Q', 'A', or the name of the speaker
        prefix = text[delim.span()[0]:delim.span()[1]].strip().split()[-1]

        # the text chunk is the actual dialogue text. everything from current delim to next one
        text_chunk = text[delim.span()[1]:next_delim.span()[0]]
        
        # now we want to remove some of the extra cruft from layout=True OCR in pdfplumber
        text_chunk = re.sub(r"\n\s+\d+\s+", " ", text_chunk)  # remove line numbers
        text_chunk = " ".join(text_chunk.split())            # remove extra whitespace
        
        q_a_list.append(f"{prefix} {text_chunk}")

    return q_a_list

if __name__ == "__main__":

    # download pdf
    PDF_URL = "https://www.govinfo.gov/content/pkg/GPO-J6-TRANSCRIPT-CTRL0000928888/pdf/GPO-J6-TRANSCRIPT-CTRL0000928888.pdf"
    FILENAME = "interview_transcript_stackoverflow.pdf"

    response = requests.get(PDF_URL)
    with open(FILENAME, "wb") as f:
        f.write(response.content)

    # read pdf as text
    with pdfplumber.open(FILENAME) as pdf:
        text = "".join([p.extract_text(layout=True) for p in pdf.pages])

    # I care about the Q&A transcript, which starts after the "EXAMINATION" header
    startidx = text.find("EXAMINATION")
    text = text[startidx:]

    # extract Q&A passages
    passage_locations = extract_q_a_locations(text)
    passages = get_q_a_passages(passage_locations, text)

    # TESTS
    ACCEPTABLE_TEXT_DISCREPANCY = 2

    # The tests below all pass already.
    actual_passage0_start = "Q So I do first want to bring up exhibit"
    assert nltk.edit_distance(passages[0][:len(actual_passage0_start)], actual_passage0_start) <= ACCEPTABLE_TEXT_DISCREPANCY
    print("Test for passage0 passed.")

    actual_passage1 = "A This is correct."
    assert nltk.edit_distance(passages[1][:len(actual_passage1)], actual_passage1) <= ACCEPTABLE_TEXT_DISCREPANCY
    print("Test for passage1 passed.")

    # (Note: for the next two passages/texts, prefix/questioner is captured as "Cheney" & 
    # "Jordan", not "Ms. Cheney" & "Mr. Jordan". I'm fine with either way.
    actual_passage7_start = "Cheney. And we also, just as" 
    assert nltk.edit_distance(passages[7][:len(actual_passage7_start)], actual_passage7_start) <= ACCEPTABLE_TEXT_DISCREPANCY
    print("Test for passage7 passed.")

    actual_passage8_start = "Jordan. They are pro bono"
    assert nltk.edit_distance(passages[8][:len(actual_passage8_start)], actual_passage8_start) <= ACCEPTABLE_TEXT_DISCREPANCY
    print("Test for passage8 passed.")

    # HERE'S MY PROBLEM. 
    # This test fails because my regex fails to capture the question which starts with the 
    # redacted name of the staff/questioner. The only way I've managed to get this test to 
    # pass has also broken at least one of the tests above. 
    actual_passage10_start = " So at this point, as we discussed earlier, I'm going to"
    e_msg = "Failed on passage 10"
    assert nltk.edit_distance(passages[10][:len(actual_passage10_start)], actual_passage10_start) <= ACCEPTABLE_TEXT_DISCREPANCY, e_msg

q3qa4bjr1#

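I am assuming that redactions in the middle of a passage do not need to be handled. What I did was replace the run of spaces left by a redacted name with Ms. Fakename. I did this because, as you mention in the question, the passages you want start either with a name or with Q or A. When a passage starts with a name, the name ends with a period and the text after it starts with a capital letter. When the name is redacted, the passage instead has a long run of spaces in front of it. Combining these observations, I was able to get all the tests to pass by adding the following snippet: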

lines = text.splitlines()

for i in range(len(lines)):
    if re.fullmatch(r" {10,}\d{1,2} {15,}[A-Z].+", lines[i]):
        lines[i] = re.sub(r" {15,}", "       Ms. Fakename. ", lines[i], count=1)

text = "\n".join(lines)
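As an illustration, the same replacement can be wrapped in a function and tried on a made-up pair of lines. This is a sketch only: the function name fill_redacted_names, its placeholder parameter, and the sample lines are hypothetical, and the 10-space indent in the sample is an assumption about what the extracted layout looks like.

```python
import re

# A line counts as "redacted speaker" when it has an indent, a 1-2 digit line
# number, a long run of spaces, and then text starting with a capital letter.
REDACTED_LINE = re.compile(r" {10,}\d{1,2} {15,}[A-Z].+")


def fill_redacted_names(text, placeholder="       Ms. Fakename. "):
    """Replace the first 15+ space run on matching lines with a fake name."""
    lines = text.splitlines()
    for i, line in enumerate(lines):
        if REDACTED_LINE.fullmatch(line):
            lines[i] = re.sub(r" {15,}", placeholder, line, count=1)
    return "\n".join(lines)


# Hypothetical sample: line 3 has a named speaker, line 4 a redacted one.
sample = (
    "          3     Ms. Cheney. Thank you.\n"
    "          4" + " " * 23 + "So at this point, I will move on."
)
print(fill_redacted_names(sample))
```

One caveat with this approach: re.sub with count=1 replaces the first run of 15 or more spaces on the line, so if the indent before the line number were itself 15 spaces or longer, the indent would be replaced instead of the gap left by the name.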

The final code is:

import nltk
import re
import requests
import pdfplumber

def extract_q_a_locations(examination_text:str)->list:

    # (when parsed by pdfplumber) every Q/A starts with a newline, then spaces, 
    # then a line number and more spaces 
    prefix_regex = r'\n\s+\d+\s+'

    # sometimes what comes next is a 'Q' or 'A' and more spaces
    qa_regex = r'[QA]\s+'

    # other times what comes next is the name of a congressperson or lawyer for the witness
    speaker_regex = r"(?:(?:Mr\.|Ms\.) \w+\.|-\s+)"

    # the combined regex I've been using is looking for the prefix then QA or Speaker regex
    pattern = f"{prefix_regex}(?:{speaker_regex}|{qa_regex})"
    # search the argument that was passed in, not the module-level `text`
    delims = list(re.finditer(pattern, examination_text))
    return delims

def get_q_a_passages(qa_delimiters, text):
    q_a_list = []
    for delim, next_delim in zip(qa_delimiters[:-1], qa_delimiters[1:]):
        # prefix is either 'Q', 'A', or the name of the speaker
        prefix = text[delim.span()[0]:delim.span()[1]].strip().split()[-1]

        # the text chunk is the actual dialogue text. everything from current delim to next one
        text_chunk = text[delim.span()[1]:next_delim.span()[0]]
        
        # now we want to remove some of the extra cruft from layout=True OCR in pdfplumber
        text_chunk = re.sub(r"\n\s+\d+\s+", " ", text_chunk)  # remove line numbers
        text_chunk = " ".join(text_chunk.split())            # remove extra whitespace
        
        q_a_list.append(f"{prefix} {text_chunk}")

    return q_a_list

if __name__ == "__main__":

    # download pdf
    PDF_URL = "https://www.govinfo.gov/content/pkg/GPO-J6-TRANSCRIPT-CTRL0000928888/pdf/GPO-J6-TRANSCRIPT-CTRL0000928888.pdf"
    FILENAME = "interview_transcript_stackoverflow.pdf"

    response = requests.get(PDF_URL)
    with open(FILENAME, "wb") as f:
        f.write(response.content)

    # read pdf as text
    with pdfplumber.open(FILENAME) as pdf:
        text = "".join([p.extract_text(layout=True) for p in pdf.pages])
    
    lines = text.splitlines()

    for i in range(len(lines)):
        if re.fullmatch(r" {10,}\d{1,2} {15,}[A-Z].+", lines[i]):
            lines[i] = re.sub(r" {15,}", "       Ms. Fakename. ", lines[i], count=1)
    
    text = "\n".join(lines)

    # I care about the Q&A transcript, which starts after the "EXAMINATION" header
    startidx = text.find("EXAMINATION")
    text = text[startidx:]

    # extract Q&A passages
    passage_locations = extract_q_a_locations(text)
    passages = get_q_a_passages(passage_locations, text)

    # TESTS
    ACCEPTABLE_TEXT_DISCREPANCY = 2

    # The tests below all pass already.
    actual_passage0_start = "Q So I do first want to bring up exhibit"
    assert nltk.edit_distance(passages[0][:len(actual_passage0_start)], actual_passage0_start) <= ACCEPTABLE_TEXT_DISCREPANCY
    print("Test for passage0 passed.")

    actual_passage1 = "A This is correct."
    assert nltk.edit_distance(passages[1][:len(actual_passage1)], actual_passage1) <= ACCEPTABLE_TEXT_DISCREPANCY
    print("Test for passage1 passed.")

    # (Note: for the next two passages/texts, prefix/questioner is captured as "Cheney" & 
    # "Jordan", not "Ms. Cheney" & "Mr. Jordan". I'm fine with either way.
    actual_passage7_start = "Cheney. And we also, just as" 
    assert nltk.edit_distance(passages[7][:len(actual_passage7_start)], actual_passage7_start) <= ACCEPTABLE_TEXT_DISCREPANCY
    print("Test for passage7 passed.")

    actual_passage8_start = "Jordan. They are pro bono"
    assert nltk.edit_distance(passages[8][:len(actual_passage8_start)], actual_passage8_start) <= ACCEPTABLE_TEXT_DISCREPANCY
    print("Test for passage8 passed.")

    # This test used to fail because the regex did not capture the question that starts 
    # with the redacted name of the staff/questioner. With the "Ms. Fakename." placeholder 
    # inserted above, the passage is now picked up by the speaker regex.
    actual_passage10_start = "Fakename So at this point, as we discussed earlier, I'm going to"
    e_msg = "Failed on passage 10"
    assert nltk.edit_distance(passages[10][:len(actual_passage10_start)], actual_passage10_start) <= ACCEPTABLE_TEXT_DISCREPANCY, e_msg

Note that in the last test I added "Fakename" as the prefix. If that is not wanted, you can update the passages list to remove the manually added prefix.
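One possible way to do that cleanup is sketched below. The helper strip_placeholder is hypothetical, and it assumes the prefix ends up in each passage as "Fakename." (with the trailing period), which is what the last whitespace-separated token of the "Ms. Fakename." delimiter would be.

```python
PLACEHOLDER = "Fakename."  # assumed form of the prefix stored in each passage


def strip_placeholder(passages, placeholder=PLACEHOLDER):
    """Return a copy of passages with the fake speaker prefix removed."""
    cleaned = []
    for passage in passages:
        if passage.startswith(placeholder):
            # Drop the placeholder and the space that followed it.
            passage = passage[len(placeholder):].lstrip()
        cleaned.append(passage)
    return cleaned


# Made-up passages for illustration only.
example = ["Fakename. So at this point, I will move on.", "Q Is that correct?"]
print(strip_placeholder(example))
```

Passages that start with Q, A, or a real name are left untouched, so the earlier tests would not be affected by this step.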
