How to solve a MemoryError with the Python 3.7 pdf2image library?

xxls0lw8 posted on 2023-04-22 in Python

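I am running a simple PDF-to-image conversion with the Python pdf2image library. I understand that the library is exceeding the maximum memory threshold and that this triggers the error. But the PDF is only about 6.6 MB, so why does the conversion need gigabytes of memory and raise a MemoryError?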

Python 3.7.0 (v3.7.0:1bf9cc5093, Jun 27 2018, 04:06:47) [MSC v.1914 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> from pdf2image import convert_from_path
>>> pages = convert_from_path(r'C:\Users\aakashba598\Documents\pwc-annual-report-2017-2018.pdf', 200)
Exception in thread Thread-3:
Traceback (most recent call last):
  File "C:\Users\aakashba598\AppData\Local\Programs\Python\Python37-32\lib\threading.py", line 917, in _bootstrap_inner
    self.run()
  File "C:\Users\aakashba598\AppData\Local\Programs\Python\Python37-32\lib\threading.py", line 865, in run
    self._target(*self._args, **self._kwargs)
  File "C:\Users\aakashba598\AppData\Local\Programs\Python\Python37-32\lib\subprocess.py", line 1215, in _readerthread
    buffer.append(fh.read())
MemoryError

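Also, what are the possible solutions?
Update: when I lower the dpi argument passed to convert_from_path, it works like a charm, but the resulting images are low quality (for obvious reasons). Is there a way around this, such as creating the images in batches and clearing memory after each batch? If so, how?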
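
A rough way to see why a 6.6 MB file can need gigabytes: the PDF stores compressed content, while every converted page is an uncompressed bitmap whose size depends only on the page dimensions and the dpi. The sketch below estimates that size. The A4 page size and 3 bytes per pixel (RGB) are assumptions, not values read from the actual PDF, the function name is hypothetical, and the 136-page count is the one mentioned in an answer below.

```python
def estimate_raster_bytes(width_in, height_in, dpi, pages, bytes_per_pixel=3):
    """Estimate the memory needed to hold uncompressed page bitmaps."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px * bytes_per_pixel * pages


# Assumed A4 pages (8.27 x 11.69 inches) at 200 dpi, 136 pages
total = estimate_raster_bytes(8.27, 11.69, dpi=200, pages=136)
print(f"{total / 1024**3:.2f} GiB")
```

Under these assumptions the bitmaps alone come to roughly 1.5 GiB. The traceback shows a 32-bit Python build on Windows, where a process normally has about 2 GB of address space, so a single call that returns every page at once can plausibly exhaust it.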

o4tp2gmn1#

Convert the PDF 10 pages at a time (1-10, 11-20, and so on):

from pdf2image import pdfinfo_from_path, convert_from_path

# pdf_file: path to the input PDF
info = pdfinfo_from_path(pdf_file, userpw=None, poppler_path=None)

maxPages = info["Pages"]
for page in range(1, maxPages + 1, 10):
    # Process or save each batch here, before the next batch replaces it
    images = convert_from_path(pdf_file, dpi=200, first_page=page, last_page=min(page + 10 - 1, maxPages))
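
Batching only helps if each batch is dealt with before the next one is converted. Here is a minimal sketch of that idea, which writes every page to disk as a PNG. The helper names, the output file naming, and the `convert` parameter (a hook that defaults to pdf2image's `convert_from_path`) are hypothetical additions for illustration.

```python
import os


def page_ranges(total_pages, chunk_size):
    """Yield inclusive (first_page, last_page) pairs covering 1..total_pages."""
    for first in range(1, total_pages + 1, chunk_size):
        yield first, min(first + chunk_size - 1, total_pages)


def save_in_batches(pdf_file, out_dir, total_pages, chunk_size=10, dpi=200, convert=None):
    """Convert one batch at a time and write each page to out_dir as a PNG."""
    if convert is None:
        # Imported lazily so page_ranges() works without pdf2image/Poppler
        from pdf2image import convert_from_path as convert

    saved = []
    for first, last in page_ranges(total_pages, chunk_size):
        batch = convert(pdf_file, dpi=dpi, first_page=first, last_page=last)
        for offset, image in enumerate(batch):
            target = os.path.join(out_dir, f"page-{first + offset:04d}.png")
            image.save(target)  # Pillow picks the format from the extension
            saved.append(target)
        # Drop the batch list before the next conversion starts
        del batch
    return saved
```

With the variables from the answer above, a call might look like `save_in_batches(pdf_file, "out", maxPages)`, assuming the `out` directory already exists.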

kpbpu0082#

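I'm a bit late to this, but the problem is indeed caused by all 136 pages being loaded into memory. There are three things you can do.
1. Specify the format of the converted images.
By default, pdf2image uses PPM as its image format, which is faster but also takes more memory (over 30 MB per image!). You can use a more memory-friendly format such as jpeg or png instead.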

convert_from_path(r'C:\path\to\your\pdf', fmt='jpeg')

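This will probably solve the problem, but mostly thanks to compression; at some point (say, with a 500-page PDF) the problem will come back.
2. Use an output directory.
This is the option I recommend, because it lets you process any PDF. The example on the README page explains it well: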

import tempfile
from pdf2image import convert_from_path

with tempfile.TemporaryDirectory() as path:
    images_from_path = convert_from_path(r'C:\path\to\your\pdf', output_folder=path)

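This writes the images to temporary storage on your computer, so you don't have to delete them manually. Just make sure you finish any processing you need before leaving the with block!
3. Process the PDF in chunks.
pdf2image lets you define the first and last page to process. In your case, with a 136-page PDF, you could do the following: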

for i in range(0, 136 // 10 + 1):
    # pages are 1-indexed: 1-10, 11-20, ..., 131-136
    convert_from_path(r'C:\path\to\your\pdf', first_page=i*10 + 1, last_page=min((i + 1)*10, 136))
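
The output-directory option can be combined with pdf2image's `paths_only` argument so that the call returns file paths instead of loaded images. Below is a hedged sketch of that combination. The function names and the `convert` hook are hypothetical, and `paths_only` may not be available in older pdf2image releases, so check your installed version.

```python
import tempfile


def convert_to_files(pdf_file, out_dir, dpi=200, fmt="jpeg", convert=None):
    """Render pages into out_dir and return file paths instead of PIL images."""
    if convert is None:
        # Imported lazily; requires pdf2image and Poppler to be installed
        from pdf2image import convert_from_path as convert

    return convert(
        pdf_file,
        dpi=dpi,
        fmt=fmt,
        output_folder=out_dir,
        paths_only=True,  # return paths, so no image is loaded into memory here
    )


def process_pdf(pdf_file, handle_page, **kwargs):
    """Call handle_page(path) for each rendered page, then clean up."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        for path in convert_to_files(pdf_file, tmp_dir, **kwargs):
            handle_page(path)  # must finish before the directory is removed
```

Each page can then be opened, processed, and closed one at a time inside `handle_page`, so only one image needs to be in memory at any moment.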

bmp9r5qi3#

The accepted answer has a small problem.
maxPages = pdf2image._page_count(pdf_file)
can no longer be used, because _page_count has been deprecated. I found a working alternative:

import pdf2image
from PyPDF2 import PdfFileReader  # older PyPDF2 API (PdfFileReader, numPages)

# pdf: path to the input PDF
with open(pdf, "rb") as f:
    maxPages = PdfFileReader(f).numPages
# maxPages + 1 so the final page is included when it starts a new batch
for page in range(1, maxPages + 1, 100):
    pil_images = pdf2image.convert_from_path(pdf, dpi=200, first_page=page,
                                             last_page=min(page + 100 - 1, maxPages), fmt='jpg',
                                             thread_count=1, userpw=None,
                                             use_cropbox=False, strict=False)

This way, no matter how large the file is, it is processed 100 pages at a time, so RAM usage is bounded by the size of one batch.
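
One detail worth checking in any loop of this kind is the stop value of `range`: with `range(1, maxPages, step)` the last page is skipped whenever maxPages is one more than a multiple of step, while `range(1, maxPages + 1, step)` covers every page. The short check below illustrates this with a hypothetical 101-page document; `pages_covered` is an illustrative helper, not part of any library.

```python
def pages_covered(max_pages, step, inclusive_stop):
    """Return the set of pages visited by a chunked loop."""
    stop = max_pages + 1 if inclusive_stop else max_pages
    covered = set()
    for first in range(1, stop, step):
        last = min(first + step - 1, max_pages)
        covered.update(range(first, last + 1))
    return covered


all_pages = set(range(1, 102))  # a hypothetical 101-page PDF
missing = all_pages - pages_covered(101, 100, inclusive_stop=False)
print(sorted(missing))
```

With the exclusive stop, page 101 is never converted; with `inclusive_stop=True` nothing is missing.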

s4n0splo4#

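A relatively large PDF will use up all your memory and cause the process to be killed (unless you use an output folder). I think https://github.com/Belval/pdf2image will help you understand why.
Solution: split the PDF into small parts and convert each part to images. The images can then be merged into a single image.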

from PyPDF2 import PdfFileWriter, PdfFileReader  # older PyPDF2 API

inputpdf = PdfFileReader(open("document.pdf", "rb"))

# write every page of the input to its own single-page PDF
for i in range(inputpdf.numPages):
    output = PdfFileWriter()
    output.addPage(inputpdf.getPage(i))
    with open("document-page%s.pdf" % i, "wb") as outputStream:
        output.write(outputStream)

Reference: Split a multi-page PDF file into multiple PDF files with Python?

import numpy as np
from PIL import Image

list_im = ['Test1.jpg', 'Test2.jpg', 'Test3.jpg']
imgs = [Image.open(i) for i in list_im]
# pick the smallest image and resize the others to match it (can be an arbitrary image shape here)
min_shape = sorted([(np.sum(i.size), i.size) for i in imgs])[0][1]
# np.hstack needs a sequence (list or tuple), not a generator
imgs_comb = np.hstack([np.asarray(i.resize(min_shape)) for i in imgs])

# save the combined picture
imgs_comb = Image.fromarray(imgs_comb)
imgs_comb.save('Trifecta.jpg')

# for vertical stacking, use vstack instead
imgs_comb = np.vstack([np.asarray(i.resize(min_shape)) for i in imgs])
imgs_comb = Image.fromarray(imgs_comb)
imgs_comb.save('Trifecta_vertical.jpg')

Reference: Combine several images horizontally with Python

64jmpszr5#

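In the end, combining these techniques, I wrote the code below. The goal is to convert a PDF to PPTX while avoiding memory overflow and keeping good speed: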

import os, sys, tempfile, pprint
from PIL import Image
from pdf2image import pdfinfo_from_path,convert_from_path
from pptx import Presentation
from pptx.util import Inches
from io import BytesIO

pdf_file = sys.argv[1]
print("Converting file: " + pdf_file)

# Prep presentation
prs = Presentation()
blank_slide_layout = prs.slide_layouts[6]

# Base name for the output .pptx file
base_name = pdf_file.split(".pdf")[0]

# Convert PDF to list of images
print("Starting conversion...")
print()
path: str = "C:/ppttemp"  #temp dir (use cron to delete files older than 1h hourly)
slideimgs = []
info = pdfinfo_from_path(pdf_file, userpw=None, poppler_path='C:/Program Files/poppler-0.90.1/bin/')
maxPages = info["Pages"]
for page in range(1, maxPages + 1, 5):
    slideimgs.extend(convert_from_path(pdf_file, dpi=250, output_folder=path, first_page=page, last_page=min(page + 5 - 1, maxPages), fmt='jpeg', thread_count=4, poppler_path='C:/Program Files/poppler-0.90.1/bin/', use_pdftocairo=True))

print("...complete.")
print()

# Loop over slides
for i, slideimg in enumerate(slideimgs):
    if i % 5 == 0:
        print("Saving slide: " + str(i))

    imagefile = BytesIO()
    slideimg.save(imagefile, format='jpeg')
    imagedata = imagefile.getvalue()
    imagefile.seek(0)
    width, height = slideimg.size

    # Set slide dimensions
    prs.slide_height = height * 9525
    prs.slide_width = width * 9525

    # Add slide
    slide = prs.slides.add_slide(blank_slide_layout)
    pic = slide.shapes.add_picture(imagefile, 0, 0, width=width * 9525, height=height * 9525)
    

# Save Powerpoint
print("Saving file: " + base_name + ".pptx")
prs.save(base_name + '.pptx')
print("Conversion complete. :)")
print()
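
The factor 9525 in the script above is the number of EMU (English Metric Units, the length unit used by PowerPoint files) per pixel at 96 pixels per inch, since one inch is 914400 EMU. A minimal sketch of that conversion follows. `px_to_emu` is a hypothetical helper, and treating the images as 96 dpi is an assumption implied by the constant rather than something the answer states.

```python
EMU_PER_INCH = 914400  # English Metric Units per inch


def px_to_emu(pixels, dpi=96):
    """Convert a pixel length to EMU for a given pixels-per-inch value."""
    return pixels * EMU_PER_INCH // dpi


# EMU per pixel at the assumed 96 dpi
print(px_to_emu(1))
```

Because the pages are rendered with dpi=250, calling `px_to_emu(width, dpi=250)` instead would size each slide to the physical dimensions of the PDF page rather than to a 96 dpi interpretation of the bitmap.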

iaqfqrcu6#

This code converts the PDF in chunks and appends the images to a list:

from pdf2image import pdfinfo_from_path, convert_from_path

PDF = "/path/to/pdf.pdf"
CHUNK_SIZE = 20 # depends on your RAM
MAX_PAGES = pdfinfo_from_path(PDF)["Pages"]

images = []
# MAX_PAGES + 1 so the last page is not skipped when it starts a new chunk
for page in range(1, MAX_PAGES + 1, CHUNK_SIZE):
    images += convert_from_path(PDF, first_page=page, last_page=min(page + CHUNK_SIZE - 1, MAX_PAGES))
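
Collecting every chunk into one list means all pages still end up in memory together, so peak usage can approach that of a single call. If the pages can be handled one at a time, a generator is one way to avoid holding them all. The sketch below assumes that workflow; `iter_page_images` and its `convert` hook are hypothetical names introduced for illustration.

```python
def iter_page_images(pdf_file, total_pages, chunk_size=20, convert=None, **kwargs):
    """Yield (page_number, image) pairs, converting chunk_size pages at a time."""
    if convert is None:
        # Imported lazily; requires pdf2image and Poppler to be installed
        from pdf2image import convert_from_path as convert

    for first in range(1, total_pages + 1, chunk_size):
        last = min(first + chunk_size - 1, total_pages)
        chunk = convert(pdf_file, first_page=first, last_page=last, **kwargs)
        for offset, image in enumerate(chunk):
            yield first + offset, image
```

A caller could then write `for number, image in iter_page_images(PDF, MAX_PAGES): image.save(f"page-{number}.png")`, keeping at most one chunk of pages alive at a time.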
