Getting PDF attachments using Python

zz2j4svz  posted on 2023-04-28  in  Python

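I admit I am new to Python. We have to process PDF files that contain attachments or annotated attachments. I am trying to extract the attachments from a PDF file using the PyPDF2 library.
The only (!) example I found on GitHub contains the following code: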

import PyPDF2

def getAttachments(reader):
    catalog = reader.trailer["/Root"]
    # VK
    print(catalog)

    fileNames = catalog['/Names']['/EmbeddedFiles']['/Names']

The call is:

rootdir = "C:/Users/***.pdf"  # My file path
handler = open(rootdir, 'rb')
reader = PyPDF2.PdfFileReader(handler)
dictionary = getAttachments(reader)

I get a KeyError: '/EmbeddedFiles'
Printing the catalog indeed shows no EmbeddedFiles entry: {'/Extensions': {'/ADBE': {'/BaseVersion': '/1.7', '/ExtensionLevel': 3}}, '/Metadata': IndirectObject(2, 0), '/Names': IndirectObject(5, 0), '/OpenAction': IndirectObject(6, 0), '/PageLayout': '/OneColumn', '/Pages': IndirectObject(3, 0), '/PieceInfo': IndirectObject(7, 0), '/Type': '/Catalog'}
This PDF contains 9 attachments. How can I get them?

pinkon5k #1

Later edit

With the rebase (Back to Roots) of pypdf replacing PyPDF2, this was closed by the change that adds the `reader.attachments` property, in pypdf 3.5.0 (February 2023); see
https://github.com/py-pdf/pypdf/commit/5e792c2519f101045e512ec047ebfcaf5e87ee28
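
As a minimal sketch of what that looks like in practice, assuming pypdf 3.5.0 or later is installed: `reader.attachments` is assumed here to behave like a mapping from attachment name to a list of byte strings (one name can occur more than once). The helper names `save_attachments` and `extract_with_pypdf` and the output folder are illustrative, not part of pypdf, and attachments that exist only as page annotations may not be covered by this property.

```python
import os
from typing import List, Mapping


def save_attachments(attachments: Mapping[str, List[bytes]], out_dir: str) -> List[str]:
    """Write every attachment to out_dir and return the paths written.

    Hypothetical helper: `attachments` maps a name to a list of contents,
    which is the shape pypdf's `reader.attachments` is assumed to have.
    """
    os.makedirs(out_dir, exist_ok=True)
    written = []
    for name, contents in attachments.items():
        # Drop any directory part stored in the PDF so files stay in out_dir.
        base = os.path.basename(name.replace("\\", "/")) or "attachment"
        for i, data in enumerate(contents):
            # Prefix a counter when the same name occurs more than once.
            target = os.path.join(out_dir, base if i == 0 else f"{i}_{base}")
            with open(target, "wb") as fh:
                fh.write(data)
            written.append(target)
    return written


def extract_with_pypdf(pdf_path: str, out_dir: str) -> List[str]:
    # Imported here so the helper above works without pypdf installed.
    from pypdf import PdfReader

    reader = PdfReader(pdf_path)
    return save_attachments(reader.attachments, out_dir)
```

Calling `extract_with_pypdf("input.pdf", "attachments")` (both arguments are placeholders) would then write one file per attachment into the `attachments` folder.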

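Older answer: too long for a comment. I have not personally tested this code; it looks very similar to the outline in your question, but I am adding it here for others to test. It is the subject of pull request https://github.com/mstamy2/PyPDF2/pull/440, and the full updated sequence is described by Kevin M Loeffler at https://kevinmloeffler.com/2018/07/08/how-to-extract-pdf-file-attachments-using-python-and-pypdf2/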

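It always helps if you can provide an example of the kind of input you are having problems with, so that others can adjust the extraction routine to suit.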

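When someone reported an error, the reply was: "I'm guessing the script is breaking because the embedded files section of the PDF doesn't always exist, so trying to access it throws an error." "I would try putting everything after the 'catalog' line in the get_attachments method in a try-catch."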
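
The same idea can be sketched without try/except by checking each key before descending. This is a minimal illustration over plain Python dicts standing in for the PDF objects; `embedded_file_entries` is a hypothetical name, and with a real reader the dictionaries would come from the catalog (indirect objects may need resolving first). It also follows /Kids, since a PDF name tree is allowed to split its /Names array across child nodes.

```python
from typing import Any, List


def embedded_file_entries(catalog: dict) -> List[Any]:
    """Return the flat [name, filespec, name, filespec, ...] list, or [] if absent."""
    names_dict = catalog.get("/Names")
    if not isinstance(names_dict, dict):
        return []  # no /Names dictionary at all
    root = names_dict.get("/EmbeddedFiles")
    if not isinstance(root, dict):
        return []  # /Names exists but holds no embedded-files tree

    entries: List[Any] = []
    stack = [root]
    while stack:
        node = stack.pop()
        entries.extend(node.get("/Names", []))
        # Intermediate nodes list their children under /Kids.
        stack.extend(reversed(node.get("/Kids", [])))
    return entries
```

An empty result then means "no document-level attachments here", which is a reason to look at page annotations next rather than an error.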

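Unfortunately, there were many pending pull requests that were never merged into PyPDF2 but are now included in its new reincarnation as pypdf: https://github.com/mstamy2/PyPDF2/pulls. Other requests may also be relevant, or may need help, to address this and other shortcomings, so you will need to check whether any of them help as well.
For one pending example of a try/catch that you may be able to include and adapt to other use cases, see https://github.com/mstamy2/PyPDF2/pull/551/commits/9d52ef517319b538f007669631ba6b778f8ec3a3
Besides /Type/EmbeddedFiles, the relevant keywords for embedded files include /Type /Filespec and /Subtype /FileAttachment. Note that these pairs may not always be separated by a space, so it may be worth checking whether they can be queried for attachments.
Also, on that last point, the example searches for /EmbeddedFiles, which is indexed as a plural, whereas each individual entry is itself identified in the singular.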
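
To illustrate the point about spacing, here is a rough sketch that counts those keyword pairs in the raw bytes of a file while tolerating missing or extra whitespace. It is only a quick diagnostic, not a parser: objects stored inside compressed object streams will not be visible to it, and `count_attachment_markers` and the label names are made up for this example.

```python
import re
from typing import Dict

# \s* lets "/Type/Filespec" and "/Type /Filespec" both match.
MARKERS = {
    "filespec": re.compile(rb"/Type\s*/Filespec\b"),
    "file_attachment_annot": re.compile(rb"/Subtype\s*/FileAttachment\b"),
    "embedded_file_stream": re.compile(rb"/Type\s*/EmbeddedFile\b"),
    "embedded_files_tree": re.compile(rb"/EmbeddedFiles\b"),
}


def count_attachment_markers(raw: bytes) -> Dict[str, int]:
    """Count occurrences of each marker in the raw PDF bytes."""
    return {label: len(pattern.findall(raw)) for label, pattern in MARKERS.items()}
```

Reading a file with `open(path, "rb")` and passing its bytes to `count_attachment_markers` gives a first hint of whether the attachments live in the name tree, in annotations, or in both.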

vxbzzdmp #2

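This could be improved, but it has been tested (using PyMuPDF).
It detects corrupt PDF files, encryption, attachments, annotations and portfolios.
I have not yet compared the output with our internal classification.
It produces a semicolon-separated file that can be imported into Excel.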

import fitz                      # = PyMuPDF
import os

outfile = open("C:/Users/me/Downloads/testPDF3.txt", "w", encoding="utf-8")
folder = "C:/Users/me/Downloads"

print ("filepath;","encrypted;","pages;", "embedded;","attachments;","annotations;","portfolio",  file = outfile)
enc=pages=count=names=annots=collection=''

for subdir, dirs, files in os.walk(folder):
    for file in files:
        #print (os.path.join(subdir, file))
        filepath = subdir + os.sep + file

        if filepath.endswith(".pdf"):
            #print (filepath, file = outfile)
            
            try:
                doc = fitz.open(filepath)
 
                enc = doc.is_encrypted
                #print("Encrypted? ", enc, file = outfile)
                pages = doc.page_count
                #print("Number of pages: ", pages, file = outfile)
                count = doc.embfile_count()
                #print("Number of embedded files:", count, file = outfile)     # shows number of embedded files
                names = doc.embfile_names()
                #print("Embedded files:", str(names), file = outfile) 
                #if count > 0:
                #    for emb in names:
                #        print(doc.embfile_info(emb), file = outfile)
                annots = doc.has_annots()
                #print("Has annots?", annots, file = outfile) 
                
                links = doc.has_links()
                #print("Has links?", links, file = outfile)
                trailer = doc.pdf_trailer()
                #print("Trailer: ", trailer, file = outfile)
                xreflen = doc.xref_length()  # length of objects table
                for xref in range(1, xreflen):  # skip item 0!
                    #print("", file = outfile)
                    #print("object %i (stream: %s)" % (xref, doc.is_stream(xref)), file = outfile)
                    #print(doc.xref_object(i, compressed=False), file = outfile)
                    
                    if "Collection" in doc.xref_object(xref, compressed=False): 
                        #print ("Portfolio", file = outfile)
                        collection ='True'
                        break
                    else: collection="False"
                    #print(doc.xref_object(xref, compressed=False), file = outfile)
                    
            except Exception:
                #print ("Not a valid PDF", file = outfile)
                enc=pages=count=names=annots=collection="Not a valid PDF"
            print(filepath,";", enc,";",pages, ";",count, ";",names, ";",annots, ";",collection, file = outfile )                
outfile.close()
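
One thing the script above does not guard against is a semicolon or double quote inside a file path or attachment name, which would shift the columns in Excel. A minimal sketch using the standard `csv` module with `delimiter=";"` follows; the `write_report` name and the example row are illustrative, and the column names are taken from the header line of the script.

```python
import csv
import io
from typing import Iterable, Sequence, TextIO

COLUMNS = ["filepath", "encrypted", "pages", "embedded",
           "attachments", "annotations", "portfolio"]


def write_report(rows: Iterable[Sequence[object]], out: TextIO) -> None:
    """Write a header plus one semicolon-separated line per PDF."""
    writer = csv.writer(out, delimiter=";", lineterminator="\n")
    writer.writerow(COLUMNS)
    for row in rows:
        # csv quotes any field that contains ';' or '"' so columns stay aligned.
        writer.writerow(row)


buffer = io.StringIO()
write_report([["C:/docs/a;b.pdf", False, 3, 1, ["note.txt"], True, "False"]], buffer)
print(buffer.getvalue())
```

When writing to a real file instead of a `StringIO`, opening it with `newline=""` is what the `csv` documentation recommends.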

zour9fqk #3

I ran into the same problem with several PDFs I have. I was able to make these changes to the referenced code to get it working for me:

import PyPDF2

def getAttachments(reader):
    """
    Retrieves the file attachments of the PDF as a dictionary of file names
    and the file data as a bytestring.

    :return: dictionary of filenames and bytestrings
    """
    attachments = {}
    #First, get those that are pdf attachments
    catalog = reader.trailer["/Root"]
    if "/Names" in catalog and "/EmbeddedFiles" in catalog["/Names"]:
        fileNames = catalog['/Names']['/EmbeddedFiles']['/Names']
        for f in fileNames:
            if isinstance(f, str):
                name = f
                dataIndex = fileNames.index(f) + 1
                fDict = fileNames[dataIndex].getObject()
                fData = fDict['/EF']['/F'].getData()
                attachments[name] = fData

    #Next, go through all pages and all annotations to those pages
    #to find any attached files
    for pagenum in range(0, reader.getNumPages()):
        page_object = reader.getPage(pagenum)
        if "/Annots" in page_object:
            for annot in page_object['/Annots']:
                annotobj = annot.getObject()
                if annotobj['/Subtype'] == '/FileAttachment':
                    fileobj = annotobj["/FS"]
                    attachments[fileobj["/F"]] = fileobj["/EF"]["/F"].getData()
    return attachments


handler = open(filename, 'rb')
reader = PyPDF2.PdfFileReader(handler)
dictionary = getAttachments(reader)
for fName, fData in dictionary.items():
    with open(fName, 'wb') as outfile:
        outfile.write(fData)
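
One fragile spot in the code above is `fileNames.index(f) + 1`: `index` returns the first match, so two attachments with the same name would both resolve to the first file specification. Since the /Names array alternates name and file specification, it can be walked in steps of two instead. A minimal sketch over a plain list (`iter_name_pairs` is a hypothetical helper, and the placeholder strings stand in for the filespec objects):

```python
from typing import Any, Iterator, List, Tuple


def iter_name_pairs(names: List[Any]) -> Iterator[Tuple[Any, Any]]:
    """Yield (name, filespec) from a flat [name, filespec, name, filespec, ...] list."""
    # Step by two; a trailing unpaired entry is ignored.
    for i in range(0, len(names) - 1, 2):
        yield names[i], names[i + 1]


flat = ["report.txt", "spec-1", "report.txt", "spec-2"]
print(list(iter_name_pairs(flat)))
```

Duplicate names would still overwrite each other in a dictionary keyed by name, so collecting the pairs in a list may be the safer choice when that can happen.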

jxct1oxe #4

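I know this is a late reply, but I only started looking into this yesterday. I used the PyMuPDF library to extract the embedded files. Here is my code: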

import os
import fitz

def get_embedded_pdfs(input_pdf_path, output_path=None):
  # Default output folder: "<input path without extension>_embeded_files"
  if not output_path:
    output_path = os.path.splitext(input_pdf_path)[0] + "_embeded_files"

  # Create the output folder if it does not exist yet
  os.makedirs(output_path, exist_ok=True)

  doc = fitz.open(input_pdf_path)

  # Map each embedded-file entry name to the file name stored in the PDF
  item_name_dict = {}
  for each_item in doc.embfile_names():
    item_name_dict[each_item] = doc.embfile_info(each_item)["filename"]

  for item_name, file_name in item_name_dict.items():
    out_pdf = os.path.join(output_path, file_name)
    # Get the embedded file as bytes (embfile_get is the newer name of embeddedFileGet)
    fData = doc.embfile_get(item_name)
    # Save the embedded file
    with open(out_pdf, 'wb') as outfile:
      outfile.write(fData)

o8x7eapl #5

Disclaimer: I am the author of borb (the library used in this answer)

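borb is an open-source, pure-Python PDF library. It abstracts away most of the unpleasantness of dealing with PDF (such as having to handle dictionaries and having to know PDF syntax and structure).
There is a large examples repository, which includes a section on working with embedded files.
For completeness, I will repeat the relevant example here: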

import typing
from borb.pdf.document.document import Document
from borb.pdf.pdf import PDF

def main():

    # read the Document
    doc: typing.Optional[Document] = None
    with open("output.pdf", "rb") as pdf_file_handle:
        doc = PDF.loads(pdf_file_handle)

    # check whether we have read a Document
    assert doc is not None

    # retrieve all embedded files and their bytes
    for k, v in doc.get_embedded_files().items():

        # display the file name, and the size
        print("%s, %d bytes" % (k, len(v)))

if __name__ == "__main__":
    main()

After reading the Document, you can simply ask it for a dict that maps file names to bytes.
