Python: is there a way to speed up file reading with PyPDF2.PdfFileReader? Reading multiple files takes too long

Posted by f45qwnt8 on 2023-01-19 in Python


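I have code that searches for .pdf files by reading the text inside them. My solution finds the correct files, but it is slow. Is there a way to make it faster?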

import os
import PyPDF2

# folder_path, keyword and tflist are assumed to be defined earlier in the script.
keyword = keyword.lower()

for subdir, dirs, files in os.walk(folder_path):
    for file in files:
        filepath = os.path.join(subdir, file)
        # A file whose name contains the keyword matches without being opened.
        if keyword in file.lower():
            if filepath not in tflist:
                tflist.append(filepath)
        if filepath.endswith(".pdf"):
            if filepath not in tflist:
                with open(filepath, "rb") as f:
                    reader = PyPDF2.PdfFileReader(f)
                    for i in range(reader.getNumPages()):
                        page = reader.getPage(i)
                        page_content = page.extractText().lower()
                        if keyword in page_content:
                            tflist.append(filepath)
                            break  # one matching page is enough

print(tflist)


Source: https://stackoverflow.com/questions/57033045/is-there-a-way-to-increase-the-file-reading-speed-of-pypdf2-pdffilereader-it-ta

1 Answer

csbfibhn1#

You can use multiprocessing.Pool.
Split the code into two parts. The first part uses os.walk to build a list of paths; let's call it list_of_filenames.
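The answer does not show the first part, so here is a minimal sketch of it. The helper name list_pdf_paths is hypothetical, and the case-insensitive extension check is an assumption (the question's code matches ".pdf" exactly).

```python
import os

def list_pdf_paths(folder_path):
    """Return the paths of all .pdf files below folder_path (recursive)."""
    paths = []
    for subdir, _dirs, files in os.walk(folder_path):
        for name in files:
            # Assumption: treat ".PDF" and ".pdf" alike.
            if name.lower().endswith(".pdf"):
                paths.append(os.path.join(subdir, name))
    return paths
```

The result would then be assigned once, before the pool is started: list_of_filenames = list_pdf_paths(folder_path).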
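The second part is a function that reads one file and returns the file name together with True or False for each page, depending on whether the page meets the condition: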

import PyPDF2

def worker(path):
    # keyword must be a lowercase module-level global so the worker processes can see it.
    rv = {}  # page index -> True if the keyword occurs on that page
    with open(path, "rb") as f:
        reader = PyPDF2.PdfFileReader(f)
        for i in range(reader.getNumPages()):
            page = reader.getPage(i)
            page_content = page.extractText().lower()
            rv[i] = keyword in page_content
    return (path, rv)
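Note that this worker extracts every page, while the loop in the question stops at the first matching page. If only a yes/no answer per file is needed, the early exit can probably be kept. The sketch below uses a hypothetical helper, first_matching_page, that takes any iterable of page texts, so it does not depend on PyPDF2 itself.

```python
def first_matching_page(page_texts, keyword):
    """Return the index of the first page containing keyword, or None."""
    keyword = keyword.lower()
    for i, text in enumerate(page_texts):
        if keyword in text.lower():
            return i  # stop here; later pages are never requested
    return None
```

Inside worker, it could be fed a generator such as (reader.getPage(i).extractText() for i in range(reader.getNumPages())), so that pages after the first hit are never extracted.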

Use it like this:

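import multiprocessing as mp

if __name__ == "__main__":
    # The guard is required on platforms that start worker processes with spawn (e.g. Windows).
    with mp.Pool() as p:  # defaults to one worker process per CPU
        for path, rv in p.imap_unordered(worker, list_of_filenames):
            print('File:', path)
            print('Results:', rv)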

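Assuming your CPU has n cores, this can be up to about n times faster than processing one file at a time, as long as text extraction rather than disk access is the bottleneck.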
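To get back to the flat list of matching files that the question builds (tflist), the per-page results can be reduced to one entry per file. A minimal sketch, where matching_files is a hypothetical helper and results is any iterable of (path, rv) pairs in the shape worker returns:

```python
def matching_files(results):
    """Keep the paths for which at least one page matched."""
    return [path for path, rv in results if any(rv.values())]

# Illustrative data in the shape returned by worker.
example = [("a.pdf", {0: False, 1: True}), ("b.pdf", {0: False})]
matches = matching_files(example)
```

With the pool, this might be called as tflist = matching_files(p.imap_unordered(worker, list_of_filenames)). Since imap_unordered yields results as they finish, the order of the list will generally differ from the order of list_of_filenames.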

Answered 2023-01-19
