OCR a multi-page PDF in Python

Posted by jutyujz0 on 2023-01-27 in Python


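I am using pytesseract to OCR images. I have statement PDFs that are 3-4 pages long. I need a way to convert them into multiple .jpg/.png images and OCR those images one by one. At the moment I convert a single page to an image and then run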

text=str(pytesseract.image_to_string(Image.open("imagename.jpg"),lang='eng'))

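After that I use regular expressions to extract the information and build a DataFrame. The regex logic is the same for every page. If I could read the image files in a loop, the process could be automated for any PDF in the same format.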
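The loop described above could look roughly like the sketch below. It is an illustration only: `extract_records`, the `ocr` parameter, and the `Invoice No` pattern are hypothetical and not part of the original post; the real regex would be whatever already works on a single page.

```python
import re
from typing import Callable, Dict, Iterable, List, Pattern


def ocr_image(path: str) -> str:
    """OCR one image file with pytesseract (needs Tesseract and Pillow)."""
    # Imported lazily so the looping logic can be tried without Tesseract.
    import pytesseract
    from PIL import Image

    with Image.open(path) as img:
        return pytesseract.image_to_string(img, lang="eng")


def extract_records(
    image_paths: Iterable[str],
    pattern: Pattern[str],
    ocr: Callable[[str], str] = ocr_image,
) -> List[Dict[str, object]]:
    """Apply the same regex to the OCR text of every page image."""
    rows: List[Dict[str, object]] = []
    for page_no, path in enumerate(image_paths, start=1):
        text = ocr(path)
        for match in pattern.finditer(text):
            # One row per match, tagged with the page it came from.
            rows.append({"page": page_no, **match.groupdict()})
    return rows


# Hypothetical pattern; replace it with the regex used for the real statements.
INVOICE_RE = re.compile(r"Invoice No[:.]?\s*(?P<invoice_no>\d+)")
```

Passing the result to `pandas.DataFrame(extract_records(paths, INVOICE_RE))` would then give one row per match, with a `page` column showing where each value was found.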

python-3.x

Source: https://stackoverflow.com/questions/62429161/ocr-a-multipage-pdf-in-python

3 Answers


ebdffaop1#

PyMuPDF is another option for looping over the pages as image files. Here is how it can be done:

import fitz
from PIL import Image
import pytesseract 

input_file = 'path/to/your/pdf/file'
pdf_file = input_file
fullText = ""

doc = fitz.open(pdf_file) # open the PDF with the PyMuPDF (fitz) bindings
### ---- If you need to scale a scanned image --- ###
zoom = 1.2 # scale each page by 120%
mat = fitz.Matrix(zoom, zoom)
# pageCount, loadPage, getPixmap and writePNG are the legacy camelCase names;
# newer PyMuPDF releases use page_count, load_page, get_pixmap and save instead
noOfPages = doc.pageCount # number of pages

for pageNo in range(noOfPages):
    page = doc.loadPage(pageNo) # load one page by its 0-based index
    pix = page.getPixmap(matrix = mat) # render the page, scaled by mat
    output = '/path/to/save/image/files' + str(pageNo) + '.png' # writePNG writes PNG data
    pix.writePNG(output) # save the rendered page so it can be opened for OCR

    text = pytesseract.image_to_string(Image.open(output))
    fullText += text

fullText = fullText.splitlines() # or do something here to extract information using regex

This is quite handy, depending on what you want to do with the PDF file. For more details on PyMuPDF, these links may help: PyMuPDF and the PyMuPDF tutorial on git.
Hope this helps.

EDIT: Another, more direct way to do this with PyMuPDF, if you have a clean PDF file, is to read the extracted text directly. After page = doc.loadPage(pageNo), the following is enough:

blocks = page.getText("blocks")
blocks.sort(key=lambda block: block[3])  # sort by 'y1' values

for block in blocks:
    print(block[4])  # print the lines of this block

Disclaimer: the idea of using blocks above comes from the repo maintainer. More details can be found here: issues discussion on git
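As a sketch of how the sorted blocks might be turned into plain text: the helper below assumes each block is a tuple of the form `(x0, y0, x1, y1, text, block_no, block_type)`, the layout PyMuPDF uses for "blocks", with `block_type == 0` marking a text block. `blocks_to_text` is a hypothetical name, and sorting by `(y1, x0)` is only one reasonable ordering, suited to single-column pages.

```python
from typing import Iterable, Sequence


def blocks_to_text(blocks: Iterable[Sequence]) -> str:
    """Join text blocks top-to-bottom, then left-to-right."""
    # Keep text blocks only; image blocks carry block_type 1 in position 6.
    text_blocks = [b for b in blocks if len(b) < 7 or b[6] == 0]
    # Sort by the bottom edge (y1), breaking ties with the left edge (x0).
    text_blocks.sort(key=lambda b: (b[3], b[0]))
    return "\n".join(b[4].strip() for b in text_blocks)
```

Called as `blocks_to_text(page.getText("blocks"))` inside the page loop, the result could go straight into the regex step without any OCR.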


xlpyo6sf2#

The answer from liamsuma appears to be deprecated.
This worked for me (Python 3.9):

import fitz
from PIL import Image
import pytesseract # the Tesseract executable must be on PATH

input_file = 'path/to/your/pdf/file.pdf'
full_text = ""
zoom = 1.2 

with fitz.open(input_file) as doc:
    mat = fitz.Matrix(zoom, zoom)
    for page in doc:
        pix = page.get_pixmap(matrix=mat)
        output = f'/path/to/save/image/files/{page.number}.jpg'
        pix.save(output)
        res = str(pytesseract.image_to_string(Image.open(output)))
        full_text += res

full_text = full_text.splitlines()
print(full_text)
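If the intermediate image files are not needed, each page can be handed to pytesseract in memory. The sketch below assumes a PyMuPDF version with the snake_case API (`get_pixmap`, `Pixmap.tobytes`) and an installed Tesseract; `zoom_for_dpi` and `ocr_pdf` are hypothetical helper names. PDF pages are measured in points of 1/72 inch, so a zoom of `dpi / 72` renders at roughly that DPI.

```python
import io
from typing import List


def zoom_for_dpi(dpi: float) -> float:
    """Zoom factor for fitz.Matrix that renders a page at about `dpi`."""
    return dpi / 72.0  # PDF pages are measured in points of 1/72 inch


def ocr_pdf(pdf_path: str, dpi: float = 300, lang: str = "eng") -> List[str]:
    """Return the OCR text of each page without writing image files."""
    # Third-party imports stay local so zoom_for_dpi has no dependencies.
    import fitz  # PyMuPDF
    import pytesseract
    from PIL import Image

    zoom = zoom_for_dpi(dpi)
    pages: List[str] = []
    with fitz.open(pdf_path) as doc:
        mat = fitz.Matrix(zoom, zoom)
        for page in doc:
            pix = page.get_pixmap(matrix=mat)
            # Encode the rendered page as PNG bytes and open it with Pillow.
            img = Image.open(io.BytesIO(pix.tobytes("png")))
            pages.append(pytesseract.image_to_string(img, lang=lang))
    return pages
```

Keeping the pages as a list rather than one long string also makes it easy to record which page each regex match came from.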


c7rzv4ha3#

The following works for me:

from wand.api import library
from wand.image import Image
with Image(filename=r"imagepath.pdf", resolution=300) as img:

    library.MagickResetIterator(img.wand)
    for idx in range(library.MagickGetNumberImages(img.wand)):
        library.MagickSetIteratorIndex(img.wand, idx)

    img.save(filename="output.tiff")

The problem now is reading each page of the TIFF file, because if I extract the text with

text=str(pytesseract.image_to_string(Image.open("test.tiff"),lang='eng'))

it only OCRs the first page.
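One possible way around this, offered as a sketch and not as a tested fix for this exact file: Pillow's `ImageSequence.Iterator` walks every frame of a multi-page TIFF, and each frame can be passed to pytesseract separately. `tiff_frames`, `tesseract_ocr` and `ocr_frames` are hypothetical helper names, and the `ocr` parameter exists only so the loop can be tried without Tesseract.

```python
from typing import Any, Callable, Iterable, Iterator, List


def tiff_frames(path: str) -> Iterator[Any]:
    """Yield every page (frame) of a multi-page TIFF as a PIL image."""
    # Imported under another name to avoid clashing with wand.image.Image.
    from PIL import Image as PILImage, ImageSequence

    with PILImage.open(path) as img:
        for frame in ImageSequence.Iterator(img):
            yield frame.copy()  # copy so the frame survives the next seek


def tesseract_ocr(frame: Any) -> str:
    """OCR a single PIL image with pytesseract."""
    import pytesseract

    return pytesseract.image_to_string(frame, lang="eng")


def ocr_frames(
    frames: Iterable[Any],
    ocr: Callable[[Any], str] = tesseract_ocr,
) -> List[str]:
    """OCR each frame in order and return one string per page."""
    return [ocr(frame) for frame in frames]
```

With the file saved above, `pages = ocr_frames(tiff_frames("output.tiff"))` would return one string per page instead of only the first.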
