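I have an image-based, multi-page PDF, and I need to extract the line containing the EPC heading together with the line that follows it.
Example: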
ENERGY PERFORMANCE CERTIFICATE
D(139)
I tried this code:
import os
from PIL import Image
import pytesseract
from pdf2image import convert_from_path
import re

poppler_path = "C:/Users/poddaral/temp/poppler-0.68.0/bin"
pytesseract.pytesseract.tesseract_cmd = r"C:/Users/poddaral/temp/tesseract.exe"
pdf_path = "C:/Users/poddaral/temp/4-6 Etloe Road, Westbury Park, Bristol, BS6 7PF.pdf"

images = convert_from_path(pdf_path=pdf_path, poppler_path=poppler_path)
for count, img in enumerate(images):
    img_name = f"page_{count}.png"
    img.save(img_name, "PNG")

png_files = [f for f in os.listdir(".") if f.endswith(".png")]
for png_file in png_files:
    extracted_text = pytesseract.image_to_string(Image.open(png_file))
    print(extracted_text)

pattern = re.compile('ENERGY PERFORMANCE CERTIFICATE')

def find_following_line(extracted_text):
    lines = extracted_text.splitlines()
    for i, line in enumerate(lines):
        if re.search(pattern, line):
            return lines[i+2]

print(find_following_line(extracted_text))
3 answers
yzuktlbb #1
A small change to the index inside the function
def find_following_line(extracted_text):
is all that is needed.
d4so4syb #2
I changed your index from 2 to 1, which returns exactly the line you are looking for:
This gives the following output:
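A minimal sketch of that index change, for illustration only. It reuses `pattern` and `find_following_line` from the question; the `sample_text` string is an assumption built from the two example lines in the question, not real OCR output, and the bounds check and `None` fallback are additions that the answer does not mention.

```python
import re

pattern = re.compile(r"ENERGY PERFORMANCE CERTIFICATE")

def find_following_line(extracted_text):
    """Return the line directly after the first match, or None if there is none."""
    lines = extracted_text.splitlines()
    for i, line in enumerate(lines):
        # i + 1 is the very next line; i + 2 would skip over it.
        if pattern.search(line) and i + 1 < len(lines):
            return lines[i + 1]
    return None

# Assumed sample text standing in for the OCR result of one page.
sample_text = "ENERGY PERFORMANCE CERTIFICATE\nD(139)\nsome other line"
print(find_following_line(sample_text))  # prints D(139)
```

If the OCR text happens to contain a blank line between the heading and the rating, the next non-empty line would be the one to return instead, so the offset may need adjusting for a given document.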
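xfb7svmp #3
You can avoid counting lines altogether: remember whether the previous line matched, and if it did, return the current line.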
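One way this could look, as a sketch rather than the answer's original code. The `previous_matched` flag is a hypothetical name; `pattern` and `find_following_line` are taken from the question.

```python
import re

pattern = re.compile(r"ENERGY PERFORMANCE CERTIFICATE")

def find_following_line(extracted_text):
    """Return the line after the first matching line, without using indexes."""
    previous_matched = False  # hypothetical flag: did the previous line match?
    for line in extracted_text.splitlines():
        if previous_matched:
            return line
        previous_matched = bool(pattern.search(line))
    # No match, or the match was on the last line.
    return None
```

Because nothing is indexed, a match on the very last line simply falls through to `None` instead of raising an `IndexError`.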
If anyone else wants to ***sometimes*** return the matching line as well,