Below is the code I use to extract the PDF content with pdfminer.six:
from pdfminer.high_level import extract_text
import pyttsx3
# pdf_file_path is assumed to be defined earlier; page_numbers is zero-indexed
text = extract_text(pdf_file_path, page_numbers=[1, 3])
# The content of text is shown below.
# It needs a regex pass to turn it into proper paragraphs.
engine = pyttsx3.init()
engine.say(text)
engine.runAndWait()
The content of text is shown below:
Introduction
The Book of Secrets became an Osho “classic” shortly after it was first
published. And no wonder – it contains not only a comprehensive
overview of Osho’s unique, contemporary take on the eternal human quest
for meaning, but also the most comprehensive set of meditation
techniques available to help find that meaning within our own lives.
As Osho explains in the first chapter:
These are the oldest, most ancient techniques. But you can call them
the latest, also, because nothing can be added to them. They have taken in
all the possibilities, all the ways of cleaning the mind, transcending the
mind. Not a single method could be added to [these] one hundred and
twelve methods. It is the most ancient and yet the latest, yet the newest.
Old like old hills – the methods seem eternal – and they are new like a
dewdrop before the sun, because they are so fresh. These one hundred and
twelve methods constitute the whole science of transforming mind.
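Issue: engine.say(text) works correctly, but while speaking it makes a long pause at every line break (for example after “first”, “comprehensive”, “quest”, ...), and that pause is as long as the one after a full stop (.). So, for *smooth reading*, I want to convert these paragraphs into a proper format first.
Solution: since the reader pauses the same way at the end of a sentence and at the end of a paragraph, we can pick either of the following approaches:
1. Convert each whole paragraph into a single line and pass that to the reader.
2. Or pass each sentence (the text between two full stops) to the reader.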
Expected text (Approach 1, preferred):
Introduction
The Book of Secrets became an Osho “classic” shortly after it was first published. And no wonder – it contains not only a comprehensive overview of Osho’s unique, contemporary take on the eternal human quest for meaning, but also the most comprehensive set of meditation techniques available to help find that meaning within our own lives.
As Osho explains in the first chapter:
These are the oldest, most ancient techniques. But you can call them the latest, also, because nothing can be added to them. They have taken in all the possibilities, all the ways of cleaning the mind, transcending the mind. Not a single method could be added to [these] one hundred and twelve methods. It is the most ancient and yet the latest, yet the newest. Old like old hills – the methods seem eternal – and they are new like a dewdrop before the sun, because they are so fresh. These one hundred and twelve methods constitute the whole science of transforming mind.
Expected text (Approach 2):
Introduction
The Book of Secrets became an Osho “classic” shortly after it was first published.
And no wonder – it contains not only a comprehensive overview of Osho’s unique, contemporary take on the eternal human quest for meaning, but also the most comprehensive set of meditation techniques available to help find that meaning within our own lives.
As Osho explains in the first chapter:
These are the oldest, most ancient techniques. But you can call them the latest, also, because nothing can be added to them.
They have taken in all the possibilities, all the ways of cleaning the mind, transcending the mind.
Not a single method could be added to [these] one hundred and twelve methods. It is the most ancient and yet the latest, yet the newest.
Old like old hills – the methods seem eternal – and they are new like a dewdrop before the sun, because they are so fresh.
These one hundred and twelve methods constitute the whole science of transforming mind.
I am fairly new to RegEx and could not come up with a pattern that removes the line breaks while keeping the paragraph structure.
1 Answer
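For the first approach, you can use re.sub(r"(?<!\n|:)\n(?!\n)", " ", text). There may be a better way, but it does work on the sample text. The pattern matches a line break that:
1. is a single line break (not next to another line break), and
2. does not come directly after a : character,
and replaces each match with a space. However, it does add a space after the first line and after some . characters. That is not ideal, but it probably will not matter much to the reader, which I am not familiar with.
Note: file.write was used only to copy and display the result, because it was too long to show in my terminal; it is not a necessary part of this step.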
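As a rough illustration of the substitution above, here is a minimal sketch that applies the same pattern to a shortened sample and then splits the result into sentences for the second approach. The helper names (join_wrapped_lines, split_sentences, speak) and the sample string are assumptions made for this example, and the sentence splitter is deliberately naive: it will also split after abbreviations such as "e.g.".

```python
import re

# Same pattern as in the answer: a newline that is not preceded by a newline
# or a colon, and not followed by another newline.
SINGLE_NEWLINE = re.compile(r"(?<!\n|:)\n(?!\n)")


def join_wrapped_lines(text):
    """Approach 1: replace hard line wraps inside a paragraph with spaces."""
    return SINGLE_NEWLINE.sub(" ", text)


def split_sentences(paragraph):
    """Approach 2 (naive): split after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", paragraph) if s]


def speak(engine, text):
    """Queue each sentence separately.

    `engine` is assumed to be an object with say() and runAndWait(),
    such as the one returned by pyttsx3.init() in the question.
    """
    for paragraph in join_wrapped_lines(text).split("\n"):
        for sentence in split_sentences(paragraph):
            engine.say(sentence)
    engine.runAndWait()


# Shortened, hypothetical sample in the shape of the extracted text.
sample = (
    "Introduction\n\n"
    "The Book of Secrets became an Osho classic shortly after it was first\n"
    "published. And no wonder.\n\n"
    "As Osho explains in the first chapter:\n"
    "These are the oldest, most ancient techniques."
)

joined = join_wrapped_lines(sample)
print(joined)
```

Note that the pattern also matches a single newline at the very end of the text, so calling text.strip() first avoids a trailing space.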