How to replace a word in a PDF with Python

Posted by uinbv5nw on 2023-02-21 in Python


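I want to replace a word in a PDF file, but whenever I try, the output is always identical to the original PDF. Here is my code block. I'm currently using PyPDF2, but I can switch libraries if there are any suggestions. What is missing from my code?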

with open(file_path, 'rb') as file:
    pdf_reader = PdfFileReader(file)
    # Encrypt the word in the PDF content
    encrypted_word = self.cipher.encrypt(word_to_encrypt_bytes)
    encrypted_word_b64 = base64.b64encode(encrypted_word)
    # Write the encrypted PDF content to a new PDF file
    pdf_writer = PdfFileWriter()
    for i in range(pdf_reader.getNumPages()):
        page = pdf_reader.getPage(i)
        page_content = page.extractText()
        page_content_b = page_content.encode('utf-8')
        page_content_b = page_content_b.replace(word_to_encrypt.encode(), encrypted_word_b64)
        page_content = page_content_b.decode('utf-8')
        pdf_writer.addPage(page)

    output_path = os.path.join(file_dir, file_name_without_ext + '_encryptedm' + ext)
    with open(output_path, 'wb') as output_file:
        pdf_writer.write(output_file)

I want to put a word into my PDF.

python

Source: https://stackoverflow.com/questions/75506843/how-to-replace-a-word-in-pdf-with-python

1 Answer


2q5ifsrm #1

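It looks like you only replace the word in the text returned by extractText and never write the change back to the page. extractText returns a plain Python string, so editing that string leaves the page object, and therefore the output file, unchanged. To change what the PDF displays, you have to edit the page's content stream and assign the edited stream back to the page.
Here is an updated code block that should work, assuming PyPDF2 2.x or 3.x (where the classes are named PdfReader and PdfWriter) and assuming the word is stored as plain text in a single Tj operand:

import base64
import os

from PyPDF2 import PdfReader, PdfWriter
from PyPDF2.generic import ContentStream, NameObject, TextStringObject

# Encrypt the word once, then base64-encode it so it can be written as text
encrypted_word = self.cipher.encrypt(word_to_encrypt_bytes)
encrypted_word_b64 = base64.b64encode(encrypted_word).decode('ascii')

with open(file_path, 'rb') as file:
    pdf_reader = PdfReader(file)
    pdf_writer = PdfWriter()
    for page in pdf_reader.pages:
        if page.get_contents() is not None:
            # Parse the page's content stream into (operands, operator) pairs
            content = ContentStream(page.get_contents(), pdf_reader)
            for operands, operator in content.operations:
                # Tj shows one string; TJ arrays and the ' and " operators are not handled here
                if operator == b'Tj' and isinstance(operands[0], TextStringObject):
                    operands[0] = TextStringObject(
                        operands[0].replace(word_to_encrypt, encrypted_word_b64))
            # Point the page at the edited stream before adding it to the writer
            page[NameObject('/Contents')] = content
        pdf_writer.add_page(page)

    output_path = os.path.join(file_dir, file_name_without_ext + '_encryptedm' + ext)
    with open(output_path, 'wb') as output_file:
        pdf_writer.write(output_file)

In this updated code, we parse each page's content stream into (operands, operator) pairs with ContentStream, replace the word inside the string operand of every Tj (show text) operator, assign the edited stream back to the page's /Contents entry, and add the page to the PDF writer. This only works when the word sits in one Tj string in a simple encoding. Text drawn with TJ arrays, split across several operators, or written with a subset font that uses custom glyph codes will not match, and the replacement may not render if the embedded font lacks the needed glyphs.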
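To make the idea of a content stream more concrete, here is a toy sketch that uses only the standard library. A page's content stream is a sequence of operands followed by operators, and text is commonly drawn by a literal string in parentheses followed by the Tj operator. The function names replace_in_tj and escape_pdf_literal and the sample stream are made up for illustration. The base64 token stands in for the encrypted word. The sketch assumes an already-decompressed stream and literal strings without nested unescaped parentheses, so treat it as an illustration of the mechanism and not as a substitute for a PDF library.

```python
import base64
import re

# A PDF literal string followed by the Tj (show text) operator.
# The string body may contain backslash escapes such as \( and \).
TJ_PATTERN = re.compile(rb"\(((?:\\.|[^\\()])*)\)(\s*)Tj", re.DOTALL)


def escape_pdf_literal(data: bytes) -> bytes:
    """Backslash-escape the characters that are special in a PDF literal string."""
    return re.sub(rb"([\\()])", rb"\\\1", data)


def replace_in_tj(stream: bytes, old: bytes, new: bytes) -> bytes:
    """Replace `old` with `new` inside every literal string shown with Tj."""
    # Compare and insert in escaped form, because that is how the stream stores text.
    safe_old = escape_pdf_literal(old)
    safe_new = escape_pdf_literal(new)

    def fix(match: re.Match) -> bytes:
        body = match.group(1).replace(safe_old, safe_new)
        return b"(" + body + b")" + match.group(2) + b"Tj"

    return TJ_PATTERN.sub(fix, stream)


# Hypothetical one-line content stream: select a font, move, show text.
sample = b"BT /F1 12 Tf 72 720 Td (Hello secret world) Tj ET"
token = base64.b64encode(b"secret")  # stand-in for the encrypted, encoded word
result = replace_in_tj(sample, b"secret", token)
print(result)
```

Only strings that are followed by Tj are touched, so font names and other operands stay intact. Real files usually compress their streams and often encode text per font, which is why a library that parses the stream is the more dependable route.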

Answered 2023-02-21
