csv: When extracting text from a PDF to a txt file, the resulting txt file contains double quotes on some lines that I cannot get rid of

Posted by htrmnn0y 12 months ago in Other


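When I try to extract text from a PDF file into a txt file, some lines end up wrapped in double quotes "" for reasons I do not understand. The problem most likely lies in how I handle the text data, since I store it in a list before writing it.
I would be very grateful for any help or ideas on what to do.
I use pdfplumber to extract the text from the PDF page by page and split it into lines, which are appended to a list. However, some of the lines are added as strings with single quotes, e.g. 'Black Hole - The Lost Transcript', and some with double quotes, e.g. "Hi, I'm Varoujan Gorjian [...]."
Here is the first part of the list produced by the PDF extraction, in which the strings are quoted with different symbols:
['Black Hole - The Lost Transcript', 'Intro', "Hi, I'm Varoujan Gorjian, I'm a Research Astronomer at NASA's Jet Propulsion Laboratory.", "I've been challenged today to talk about once concept at five different levels of increasing", "complexity. Today we're going to be talking about black holes. A basic definition of a black", "hole is that it's a lot of mass crammed into a very tiny volume, such that the escaped", 'velocity is the speed of light.', 'Level 1 | 291 words (explainer: 221)', 'So, have you ever heard of something called a black hole?', ...]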
Here is my code:

```python
import csv
import os

import pdfplumber

# data, dataDirectory and goalDirectory are not shown in the question
for file in data:
    path = os.path.join(dataDirectory, str(file))

    lines = []
    pdffile = path

    with pdfplumber.open(pdffile) as pdf:
        pages = pdf.pages
        for page in pdf.pages:
            text = page.extract_text()
            for line in text.split('\n'):
                lines.append(line)

    csvFile = os.path.join(goalDirectory, str(file)[:-3]+'txt')

    # open CSV file with coded Text
    with open(csvFile, "w", newline='', encoding='utf-8') as converted:
        writer = csv.writer(converted)
        print(lines)

        for line in lines:
            writer.writerow([line])
```

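The PDF looks like this, so it contains no double quotes, and I want plain text in the generated txt file: [image: PDF from which the text is extracted]
The generated txt file looks like this, with double quotes added on some lines: [image: text file with unwanted double quotes]
I have tried the quotechar and quoting options of csv.writer but could not find a solution.
I may need a solution that prevents strings from being wrapped in double quotes in this block, because some lines already appear (seemingly at random) with double quotes when they are added to "lines":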

```python
with pdfplumber.open(pdffile) as pdf:
    pages = pdf.pages
    for page in pdf.pages:
        text = page.extract_text()
        for line in text.split('\n'):
            lines.append(line)
```

csv

Source: https://stackoverflow.com/questions/76849741/within-extracting-text-from-pdf-to-txt-file-the-resulting-txt-file-contains-doub

1 Answer


qq24tv8q1#

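When text is extracted from OCR or other formats and written as character-separated values, records are usually separated by line feeds (\n).

However, in some cases, such as here with csv.writer, the separator is likely to be a comma, so existing "strings with ,'s" need to be protected from being split. csv.writer provides that protection by wrapping such fields in double quotes.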
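The quoting behaviour can be reproduced without any PDF at all. The following is a minimal sketch that assumes the default csv.writer dialect, as in the question, and uses two sample lines copied from the question. The in-memory `buffer` is only a stand-in for the output file.

```python
import csv
import io

# Two sample lines from the question: one without a comma, one with commas.
lines = [
    "Intro",
    "Hi, I'm Varoujan Gorjian, I'm a Research Astronomer at NASA's Jet Propulsion Laboratory.",
]

buffer = io.StringIO()
# Default dialect: delimiter=',' and quoting=csv.QUOTE_MINIMAL,
# so only fields that contain a special character get quoted.
writer = csv.writer(buffer)
for line in lines:
    writer.writerow([line])

rows = buffer.getvalue().splitlines()
for row in rows:
    print(row)
```

Only the second row comes out wrapped in double quotes, because it is the one that contains the delimiter. This matches the pattern in the question, where the quotes appear on some lines and not on others.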

The quoting on such lines has to be unrolled before a comma is used as the line break:

Black Hole - The Lost Transcript

Intro

"Hi, I'm Varoujan Gorjian, I'm a Research Astronomer at NASA's Jet Propulsion Laboratory.",  
I've been challenged today to talk about once concept at five different levels of increasing,  
complexity. Today we're going to be talking about black holes. A basic definition of a black,  
"hole is that it's a lot of mass crammed into a very tiny volume, such that the escaped",  
velocity is the speed of light.,  

Level 1 | 291 words (explainer: 221)

"So, have you ever heard of something called a black hole?"

What is a black hole?

"Well, it has to do with, a lot with gravity, do you know what gravity is?"

"No, not at all."

It's what Keeps us on the earth.

What?

"The reason we're not just flying off the earth is because earth has gravity, so if we throw something up, it comes back down, so that's why when we're walking on the earth, we don't fly off the earth because the earth has gravity, and it keeps us down."

Nice.
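If the goal is a plain txt file rather than a CSV, one option is to skip csv.writer and write each line directly. This is a sketch under that assumption, not necessarily what the answer above intends. The helper name `write_plain_text` and the temporary file are hypothetical, and the sample lines stand in for the pdfplumber output.

```python
import os
import tempfile


def write_plain_text(lines, path):
    """Write each line to path verbatim, one per row, with no CSV quoting."""
    with open(path, "w", encoding="utf-8") as out:
        for line in lines:
            out.write(line + "\n")


# Sample lines from the question, standing in for the extracted PDF text.
lines = [
    "Black Hole - The Lost Transcript",
    "Hi, I'm Varoujan Gorjian, I'm a Research Astronomer at NASA's Jet Propulsion Laboratory.",
]

with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "transcript.txt")
    write_plain_text(lines, target)
    # Read the file back to confirm the lines were written unchanged.
    with open(target, encoding="utf-8") as check:
        written = check.read().splitlines()

print(written == lines)
```

In the question's loop, the same idea would replace the `writer.writerow([line])` call. Setting `quoting=csv.QUOTE_NONE` on its own is unlikely to help, since the writer then needs an `escapechar` for any field that contains the delimiter. The mix of single and double quotes in the printed list is only how Python displays strings that contain an apostrophe; those quotes are not part of the data.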

