Replace all newline characters using Python

gcuhipw9  posted on 2022-12-30  in Python
Follow (0) | Answers (7) | Views (246)

I am trying to read a PDF file with Python, and the content contains many line breaks (CRLF). I tried to remove them with the code below:

from tika import parser

filename = 'myfile.pdf'
raw = parser.from_file(filename)
content = raw['content']
content = content.replace("\r\n", "")
print(content)

But the output stays the same. I also tried using double backslashes, and that did not fix it either. Can anyone suggest a solution?


rqqzpn5f1#

content = content.replace("\\r\\n", "")

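You need to double-escape them. Note that this pattern matches the literal text backslash-r-backslash-n (four characters), not actual carriage return and line feed characters.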
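To make the difference concrete, here is a minimal sketch comparing the two patterns. The variable names and sample strings are made up for illustration and are not taken from the asker's PDF.

```python
# Real control characters: "a", carriage return, line feed, "b" (4 characters).
real = "a\r\nb"
# Literal text: "a", backslash, "r", backslash, "n", "b" (6 characters).
literal = r"a\r\nb"

# The double-escaped pattern only matches the literal backslash sequence.
real_after_escaped = real.replace("\\r\\n", "")
literal_after_escaped = literal.replace("\\r\\n", "")

# The single-escaped pattern matches the real CR/LF pair.
real_after_plain = real.replace("\r\n", "")

print(len(real), len(literal))
print(repr(real_after_escaped))
print(repr(literal_after_escaped))
print(repr(real_after_plain))
```

If the double-escaped version changes nothing, the text most likely contains real control characters, possibly lone `\n` or `\r` rather than the `\r\n` pair.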


3b6akqbq2#

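I don't have access to your PDF file, so I processed one on my own system. I also don't know whether you need to remove all newlines or only the double newlines. The code below collapses double newlines into single ones, which makes the output more readable.
Please let me know whether this fits your current needs.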

from tika import parser

filename = 'myfile.pdf'

# Parse the PDF
parsedPDF = parser.from_file(filename)

# Extract the text content from the parsed PDF
pdf = parsedPDF["content"]

# Convert double newlines into single newlines
pdf = pdf.replace('\n\n', '\n')

#####################################
# Do something with the PDF
#####################################
print(pdf)
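One caveat with a single `replace('\n\n', '\n')` pass is that a run of three or more newlines is only partly collapsed. The sketch below, using a made-up sample string, shows one possible way to collapse any run of newlines into one with `re.sub`. It is an alternative, not what the answer above does.

```python
import re

# Hypothetical extracted text with runs of blank lines.
sample = "Title\n\n\n\nFirst paragraph.\n\nSecond paragraph.\n"

# A single pass replaces non-overlapping pairs, so four newlines become two.
once = sample.replace("\n\n", "\n")

# A regular expression collapses any run of two or more newlines into one.
collapsed = re.sub(r"\n{2,}", "\n", sample)

print(repr(once))
print(repr(collapsed))
```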

jexiocij3#

If you run into different kinds of line breaks, try the str.splitlines() function and then rejoin the result with whatever string you want, like this:

content = "".join(l for l in content.splitlines() if l)

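Then you only need to change the value in the quotes to whatever you want to join with. This approach detects every line boundary that str.splitlines() recognizes. Be aware, though, that str.splitlines() returns a list rather than an iterator, so for large strings it can significantly increase memory usage. In that case you are better off using a file stream or io.StringIO and reading line by line.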
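The line-by-line idea could be sketched as follows. The helper name `iter_stripped_lines` and the sample text are hypothetical. The sketch assumes a text stream opened in universal-newlines mode (`newline=None`), which only treats `\n`, `\r`, and `\r\n` as line endings, so it covers fewer boundaries than str.splitlines().

```python
import io


def iter_stripped_lines(stream):
    """Yield each non-empty line of a text stream without its line ending."""
    for line in stream:
        stripped = line.rstrip("\r\n")
        if stripped:
            yield stripped


# Hypothetical sample; in practice the stream could be open("myfile.txt").
sample = "first\r\nsecond\rthird\n\nfourth"
source = io.StringIO(sample, newline=None)

# Write pieces to the output as they are read, instead of building a list.
out = io.StringIO()
for piece in iter_stripped_lines(source):
    out.write(piece)

result = out.getvalue()
print(repr(result))
```

Here the output is also held in memory for the sake of a short example; with a real file you would write each piece to an output file instead.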


vof42yt14#

with open('myfile.txt') as f:
    print(f.read().replace('\n', ''))

3wabscal5#

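When you write something like t.replace("\r\n", ""), Python looks for a carriage return immediately followed by a line feed.
Python will not replace a carriage return on its own, nor a line feed on its own.
Consider the following: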

t = "abc abracadabra abc"
t.replace("abc", "x")
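  • Does t.replace("abc", "x") replace every occurrence of the letter a with the letter x?
  • Does t.replace("abc", "x") replace every occurrence of the letter b with the letter x?
  • Does t.replace("abc", "x") replace every occurrence of the letter c with the letter x?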

What will t.replace("abc", "x") do?
t.replace("abc", "x") replaces each whole occurrence of the string "abc" with the letter "x".
Now consider the following:

test_input = "\r\nAPPLE\rORANGE\nKIWI\n\rPOMEGRANATE\r\nCHERRY\r\nSTRAWBERRY"

t = test_input
for _ in range(0, 3):
    t = t.replace("\r\n", "")
    print(repr(t))

result2 = "".join(test_input.split("\r\n"))
print(repr(result2))

The output printed to the console looks like this:

'APPLE\rORANGE\nKIWI\n\rPOMEGRANATECHERRYSTRAWBERRY'
'APPLE\rORANGE\nKIWI\n\rPOMEGRANATECHERRYSTRAWBERRY'
'APPLE\rORANGE\nKIWI\n\rPOMEGRANATECHERRYSTRAWBERRY'
'APPLE\rORANGE\nKIWI\n\rPOMEGRANATECHERRYSTRAWBERRY'

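Note that:

  • str.replace() replaces every occurrence of the target string, not just the leftmost one.
  • str.replace() replaces the target string as a whole, not each individual character of the target string.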
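Both points can be checked with a short sketch. The string `mixed` is a made-up sample, and the optional third argument of str.replace() limits how many matches are replaced.

```python
t = "abc abracadabra abc"

# Every whole occurrence of "abc" is replaced; "abracadabra" is untouched.
all_matches = t.replace("abc", "x")

# The optional count argument limits replacement to the first match only.
first_only = t.replace("abc", "x", 1)

# To drop lone CR and LF characters as well, replace each one separately.
mixed = "one\r\ntwo\rthree\nfour"
cleaned = mixed.replace("\r", "").replace("\n", "")

print(all_matches)
print(first_only)
print(repr(cleaned))
```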

If you want to remove all line feeds and carriage returns, something like the following will do the job:

in_string = "\r\n-APPLE-\r-ORANGE-\n-KIWI-\n\r-POMEGRANATE-\r\n-CHERRY-\r\n-STRAWBERRY-"

out_string = "".join(filter(lambda ch: ch not in "\n\r", in_string))

print(repr(out_string))
# prints '-APPLE--ORANGE--KIWI--POMEGRANATE--CHERRY--STRAWBERRY-'
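As a possible alternative to the filter/lambda version, the same result can be had with str.translate() or re.sub(). This is only a sketch of two equivalent approaches, reusing the same test string.

```python
import re

in_string = "\r\n-APPLE-\r-ORANGE-\n-KIWI-\n\r-POMEGRANATE-\r\n-CHERRY-\r\n-STRAWBERRY-"

# Map the CR and LF code points to None so that translate() deletes them.
via_translate = in_string.translate({ord("\r"): None, ord("\n"): None})

# Remove every run of CR and/or LF characters with a character class.
via_regex = re.sub(r"[\r\n]+", "", in_string)

print(repr(via_translate))
print(repr(via_regex))
```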

j8ag8udp6#

You can also use:

text = '''
As she said these words her foot slipped, and in another moment, splash! she
was up to her chin in salt water. Her first idea was that she had somehow
fallen into the sea, “and in that case I can go back by railway,”
she said to herself.”'''

text = ' '.join(text.splitlines())

print(text)
# As she said these words her foot slipped, and in another moment, splash! she was up to her chin in salt water. Her first idea was that she had somehow fallen into the sea, “and in that case I can go back by railway,” she said to herself.”
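One detail worth knowing: if the text starts with a newline, `' '.join(text.splitlines())` leaves a leading space, because the first element of the list is an empty string. A possible variant, sketched here with a shortened made-up sample, is `' '.join(text.split())`. Note that it also collapses every run of spaces and tabs, which may or may not be what you want.

```python
# Shortened, hypothetical sample with a leading newline and mixed line endings.
text = "\nAs she said these words\nher foot slipped,\r\n  and in another moment, splash!\n"

# Joining the lines keeps the empty first line and any indentation.
joined_lines = " ".join(text.splitlines())

# Splitting on all whitespace drops empty pieces and collapses runs of spaces.
normalized = " ".join(text.split())

print(repr(joined_lines))
print(repr(normalized))
```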

6ovsh4lw7#

# Write a sample file
with open("sample.txt", "w") as write_file:
    write_file.write("line1\nline2\nline3\nline4\nline5\nline6\n")

# Read the file back and replace each newline with a space
with open("sample.txt", "r") as open_file:
    contents = open_file.read()
replace_string = contents.replace("\n", " ")
print(replace_string, end=" ")

Output

line1 line2 line3 line4 line5 line6
