python-3.x: How to ignore a charset error when parsing a PDF with PDFMiner

5fjcxozz  posted on 2023-10-21 in Python

Hi everyone, I am getting an encoding error when parsing a PDF file with PDFMiner.

from io import BytesIO
from pdfminer import layout

from pdfminer.high_level import extract_pages
from pdfminer.layout import LTTextBoxHorizontal, LTTextContainer
from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfpage import PDFTextExtractionNotAllowed
from pdfminer.pdfinterp import PDFResourceManager
from pdfminer.pdfinterp import PDFPageInterpreter
from pdfminer.pdfdevice import PDFDevice
from pdfminer.layout import LAParams
from pdfminer.converter import PDFPageAggregator
import pdfminer

#pageno=0
#for page_layout in extract_pages("./Statements/Manoj Kotak.pdf"):
#    pageno+=1
#    print(str(len(page_layout))+" page No:"+str(pageno))
#    for element in page_layout:
#        if(isinstance(element,LTTextBoxHorizontal)):
#            
#            print(element.get_text())
#    if pageno==2:
#        break

# Open the PDF
fp=open("../pathto/pdffile.pdf")

# Instantiate the PDF parser
parser =PDFParser(fp)

# Read the parsed document

document=PDFDocument(parser)

# Check that text extraction is allowed

if not document.is_extractable:
    raise PDFTextExtractionNotAllowed

# Resource manager that stores objects shared across the PDF document
rsrmgr=PDFResourceManager()

# Begin page layout analysis
# Parameters for analysis
laparams=LAParams()

# Device initialisation
device=PDFPageAggregator(rsrmgr,laparams=laparams)

# PDF interpreter initialisation
interpreter=PDFPageInterpreter(rsrmgr,device)

# Function to walk the layout objects of a parsed page
def parse_obj(layout_objs):
        # looping Through the Pdf 
        for obj in layout_objs:
            if isinstance(obj,pdfminer.layout.LTTextBoxHorizontal):
                print ("%6 %6 %s".format(obj.bbox[0],obj.bbox[1],obj.get_text()))

for page in PDFPage.create_pages(document):
        interpreter.process_page(page)
        layout=device.get_result()

        parse_obj(layout.objs)

Above is the source code I use to parse the PDF in Python. It fails with the following traceback:

Traceback (most recent call last):
  File "bin\statementparser.py", line 37, in <module>
    document=PDFDocument(parser)
  File "D:\python\lib\site-packages\pdfminer\pdfdocument.py", line 571, in __init__
    pos = self.find_xref(parser)
  File "D:\python\lib\site-packages\pdfminer\pdfdocument.py", line 788, in find_xref
    for line in parser.revreadlines():
  File "D:\python\lib\site-packages\pdfminer\psparser.py", line 267, in revreadlines
    s = self.fp.read(prevpos-pos)
  File "D:\python\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 1803: character maps to <undefined>

gk7wooem  #1

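Make sure you open the file in binary read mode by passing 'rb' as the mode. By default, open() uses text mode ('r'), which decodes the bytes with the platform's default encoding (cp1252 in your traceback) and fails on bytes such as 0x90 that have no mapping in that encoding. PDFMiner expects raw bytes, so no decoding should happen at this stage.
In the original code, the file is opened like this: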

fp = open("../pathto/pdffile.pdf")

You should change it to this:

fp = open("../path/to/pdffile.pdf", 'rb')
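To see why the mode matters, here is a minimal sketch that reproduces the same kind of failure without PDFMiner. The sample bytes and the temporary file are made up for illustration, and the encoding is set to cp1252 explicitly to mimic the Windows default shown in the traceback.

```python
import os
import tempfile

# Hypothetical sample data: a PDF-like header followed by a raw 0x90 byte,
# the same byte that triggers the error in the traceback.
payload = b"%PDF-1.4\n\x90 binary stream bytes\n"

with tempfile.NamedTemporaryFile(delete=False, suffix=".pdf") as tmp:
    tmp.write(payload)
    path = tmp.name

# Text mode: Python decodes the bytes, and cp1252 has no mapping for 0x90.
try:
    with open(path, encoding="cp1252") as fp:
        fp.read()
    text_mode_error = None
except UnicodeDecodeError as exc:
    text_mode_error = exc

# Binary mode: the bytes are returned untouched, so nothing can fail to decode.
with open(path, "rb") as fp:
    raw = fp.read()

os.remove(path)

print("text mode error:", text_mode_error)
print("binary mode read", len(raw), "bytes")
```

Using a with block also closes the file automatically, which the original script never does.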

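In addition, you can use an f-string in the print statement to get more structured output. The original call mixes %-style placeholders with str.format(), so it prints the literal string "%6 %6 %s" instead of the values. For example: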

print(f"X: {obj.bbox[0]}, Y: {obj.bbox[1]}, Text: {obj.get_text()}")
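Putting both suggestions together, a shorter version of the script could look roughly like the sketch below. It assumes pdfminer.six is installed and uses extract_pages, which the question already imports. The helper names format_box and print_text_boxes are hypothetical, as is the fixed-width column format.

```python
def format_box(bbox, text):
    """Format one text box as 'x0 y0 text' with fixed-width coordinates."""
    x0, y0 = bbox[0], bbox[1]
    return f"{x0:8.2f} {y0:8.2f} {text.strip()}"


def print_text_boxes(pdf_path):
    """Print every horizontal text box found in the PDF at pdf_path."""
    # Imported here so format_box stays usable without pdfminer.six installed.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTTextBoxHorizontal

    # Binary mode: PDFMiner needs the raw bytes, not decoded text.
    with open(pdf_path, "rb") as fp:
        for page_layout in extract_pages(fp):
            for element in page_layout:
                if isinstance(element, LTTextBoxHorizontal):
                    print(format_box(element.bbox, element.get_text()))
```

Calling print_text_boxes("../path/to/pdffile.pdf") would then print one line per text box. The lower-level PDFParser / PDFPageInterpreter setup from the question should work the same way once the file is opened with 'rb'.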
