Python/RegEx: Convert bad paragraphs into good paragraphs

35g0bw71  posted on 2023-06-07  in  Python

Below is the code I use to extract the content of a PDF with pdfminer.six:

from pdfminer.high_level import extract_text
import pyttsx3

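pdf_file_path = "book.pdf"  # placeholder path; point this at your own PDF

# page_numbers is zero-indexed in pdfminer.six
text = extract_text(pdf_file_path, page_numbers=[1, 3])
# The text content is shown below.
# A RegEx needs to be applied to this text to turn it into proper paragraphs.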

engine = pyttsx3.init()
engine.say(text)
engine.runAndWait()

The content of text is shown below:

Introduction

The Book of Secrets became an Osho “classic” shortly after it was first
published.  And  no  wonder  –  it  contains  not  only  a  comprehensive
overview of Osho’s unique, contemporary take on the eternal human quest
for  meaning,  but  also  the  most  comprehensive  set  of  meditation
techniques available to help find that meaning within our own lives.

As Osho explains in the first chapter:
These  are  the  oldest,  most  ancient  techniques.  But  you  can  call  them
the latest, also, because nothing can be added to them. They have taken in
all  the  possibilities,  all  the  ways  of  cleaning  the  mind,  transcending  the
mind.  Not  a  single  method  could  be  added  to  [these]  one  hundred  and
twelve  methods.  It  is  the  most ancient  and yet  the  latest, yet  the  newest.
Old  like  old  hills  –  the  methods  seem  eternal  –  and  they  are  new  like  a
dewdrop before the sun, because they are so fresh. These one hundred and
twelve methods constitute the whole science of transforming mind.

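Issue: engine.say(text) works, but while speaking it makes a long pause after every line break (for example after "first", "comprehensive", "quest", ...), a gap as long as the pause after a period (.). So, for smooth reading, I want to convert these paragraphs into a proper format first.
Solution: since the reader makes the same pause at the end of a sentence and at the end of a paragraph, either of the following approaches would work:

1. Convert each whole paragraph into a single line and pass it to the reader.
2. Or pass each sentence (the text between two periods) to the reader one at a time.
Expected text (approach 1, preferred):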

Introduction

The Book of Secrets became an Osho “classic” shortly after it was first published.  And  no  wonder  –  it  contains  not  only  a  comprehensive overview of Osho’s unique,  contemporary take on the eternal human quest for  meaning,  but  also  the  most  comprehensive  set  of  meditation techniques available to help find that meaning within our own lives.

As Osho explains in the first chapter:
These  are  the  oldest,  most  ancient  techniques.  But  you  can  call  them     the latest, also, because nothing can be added to them. They have taken in     all  the  possibilities,  all  the  ways  of  cleaning  the  mind,  transcending  the     mind.  Not  a  single  method  could  be  added  to  [these]  one  hundred  and     twelve  methods.  It  is  the  most ancient  and yet  the  latest, yet  the  newest.    Old  like  old  hills  –  the  methods  seem  eternal  –  and  they  are  new  like  a     dewdrop before the sun, because they are so fresh. These one hundred and     twelve methods constitute the whole science of transforming mind.

Expected text (approach 2):

Introduction    
The Book of Secrets became an Osho “classic” shortly after it was first published.
And  no  wonder  –  it  contains  not  only  a  comprehensive overview of Osho’s unique,  contemporary take on the eternal human quest for  meaning,  but  also  the  most  comprehensive  set  of  meditation techniques available to help find that meaning within our own lives.
As Osho explains in the first chapter:
These  are  the  oldest,  most  ancient  techniques.  But  you  can  call  them     the latest, also, because nothing can be added to them. 
They have taken in     all  the  possibilities,  all  the  ways  of  cleaning  the  mind,  transcending  the     mind.  
Not  a  single  method  could  be  added  to  [these]  one  hundred  and     twelve  methods.  It  is  the  most ancient  and yet  the  latest, yet  the  newest.    
Old  like  old  hills  –  the  methods  seem  eternal  –  and  they  are  new  like  a     dewdrop before the sun, because they are so fresh. 
These one hundred and     twelve methods constitute the whole science of transforming mind.

I am fairly new to RegEx and could not come up with a RegEx that removes the line breaks but keeps the paragraph structure.

xxls0lw8


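For the first approach, you can use re.sub(r"(?<!\n|:)\n(?!\n)", " ", text). There may be a better way, but it does work on the sample text. It checks that a newline is:
1. a single newline (neither preceded nor followed by another newline)
2. not directly after a : character
and replaces each match with a space. However, it also adds a space at the start of the first line and after some . characters, which is not ideal, but it probably won't have much effect on the reader, which I'm not familiar with.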
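The stray spaces described above come from the newline at the very start and very end of the string, which the pattern also matches. A minimal sketch of one way around that, assuming it is acceptable to trim the text first and to collapse the doubled spaces left by the PDF extraction. The name join_wrapped_lines and the small sample string are hypothetical, chosen only for illustration:

```python
import re


def join_wrapped_lines(text):
    """Join hard-wrapped lines while keeping blank-line paragraph breaks."""
    # Trim first so a leading or trailing newline is never turned into a space.
    text = text.strip()
    # Same pattern as in the answer: a single newline that does not follow ':'.
    text = re.sub(r"(?<!\n|:)\n(?!\n)", " ", text)
    # Collapse runs of spaces or tabs (but not newlines) into one space.
    return re.sub(r"[ \t]+", " ", text)


sample = "\nTitle\n\nfirst  line\nsecond line.\n\nQuote:\nwrapped\ntext.\n"
print(join_wrapped_lines(sample))
```

The blank lines between paragraphs and the break after "Quote:" survive, while the wrapped lines inside each paragraph are joined with a single space.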

text = """
Introduction

The Book of Secrets became an Osho “classic” shortly after it was first
published.  And  no  wonder  –  it  contains  not  only  a  comprehensive
overview of Osho’s unique, contemporary take on the eternal human quest
for  meaning,  but  also  the  most  comprehensive  set  of  meditation
techniques available to help find that meaning within our own lives.

As Osho explains in the first chapter:
These  are  the  oldest,  most  ancient  techniques.  But  you  can  call  them
the latest, also, because nothing can be added to them. They have taken in
all  the  possibilities,  all  the  ways  of  cleaning  the  mind,  transcending  the
mind.  Not  a  single  method  could  be  added  to  [these]  one  hundred  and
twelve  methods.  It  is  the  most ancient  and yet  the  latest, yet  the  newest.
Old  like  old  hills  –  the  methods  seem  eternal  –  and  they  are  new  like  a
dewdrop before the sun, because they are so fresh. These one hundred and
twelve methods constitute the whole science of transforming mind.
"""

import re

with open("blank_txt.txt", mode="w", encoding="utf-8") as f:
    f.write(re.sub(r"(?<!\n|:)\n(?!\n)", " ", text))

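Note: file.write is used here only so the result can be copied and displayed, because it is too long to show in my terminal; it is not a required part of the step.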

Introduction

The Book of Secrets became an Osho “classic” shortly after it was first published.  And  no  wonder  –  it  contains  not  only  a  comprehensive overview of Osho’s unique, contemporary take on the eternal human quest for  meaning,  but  also  the  most  comprehensive  set  of  meditation techniques available to help find that meaning within our own lives.

As Osho explains in the first chapter:
These  are  the  oldest,  most  ancient  techniques.  But  you  can  call  them the latest, also, because nothing can be added to them. They have taken in all  the  possibilities,  all  the  ways  of  cleaning  the  mind,  transcending  the mind.  Not  a  single  method  could  be  added  to  [these]  one  hundred  and twelve  methods.  It  is  the  most ancient  and yet  the  latest, yet  the  newest. Old  like  old  hills  –  the  methods  seem  eternal  –  and  they  are  new  like  a dewdrop before the sun, because they are so fresh. These one hundred and twelve methods constitute the whole science of transforming mind.
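For the second approach from the question, one option is to split the text into sentences and hand them to the speech engine one at a time. The sketch below is an assumption and not part of the answer above: split_sentences is a hypothetical helper, and the rule "a sentence ends at ., !, ? or : followed by whitespace" is a simplification that will mis-split abbreviations such as "e.g.".

```python
import re


def split_sentences(text):
    """Split text into sentences, treating blank lines as paragraph breaks."""
    sentences = []
    # Paragraphs are separated by one or more blank lines.
    for paragraph in re.split(r"\n\s*\n", text.strip()):
        # Inside a paragraph, every run of whitespace becomes a single space.
        flat = re.sub(r"\s+", " ", paragraph).strip()
        # Split at whitespace that directly follows closing punctuation.
        sentences.extend(s for s in re.split(r"(?<=[.!?:])\s+", flat) if s)
    return sentences


sample = "Title\n\nFirst  sentence\ncontinues here. Second one.\n\nQuote:\nwrapped text?"
for sentence in split_sentences(sample):
    print(sentence)
```

Each returned sentence could then be passed to engine.say(...) in a loop, followed by a single engine.runAndWait(). Whether that removes the unwanted pauses is likely to depend on the speech driver in use.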
