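Describe the bug
A strange bug: IndexError: list index out of range
It occurs when OCR-ing a portion of a PDF document, but not always; whether it happens depends on the split size. My guess is that the first page matters.
Relevant stack trace: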
.venv/lib/python3.11/site-packages/unstructured/partition/ocr.py:171:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
filename = '/var/folders/5w/hcnw_g8d3cn9j_373dxm6jrm0000gn/T/tmpglza5pmp', out_layout = <unstructured_inference.inference.layout.DocumentLayout object at 0x16fddcdd0>, is_image = False
infer_table_structure = True, ocr_languages = 'eng', ocr_mode = 'entire_page', pdf_image_dpi = 200
    def process_file_with_ocr(
        filename: str,
        out_layout: "DocumentLayout",
        is_image: bool = False,
        infer_table_structure: bool = False,
        ocr_languages: str = "eng",
        ocr_mode: str = OCRMode.FULL_PAGE.value,
        pdf_image_dpi: int = 200,
    ) -> "DocumentLayout":
        """
        Process OCR data from a given file and supplement the output DocumentLayout
        from unsturcutured-inference with ocr.

        Parameters:
        - filename (str): The path to the input file, which can be an image or a PDF.
        - out_layout (DocumentLayout): The output layout from unstructured-inference.
        - is_image (bool, optional): Indicates if the input data is an image (True) or not (False).
            Defaults to False.
        - infer_table_structure (bool, optional): If true, extract the table content.
        - ocr_languages (str, optional): The languages for OCR processing. Defaults to "eng" (English).
        - ocr_mode (str, optional): The OCR processing mode, e.g., "entire_page" or "individual_blocks".
            Defaults to "entire_page". If choose "entire_page" OCR, OCR processes the entire image
            page and will be merged with the output layout. If choose "individual_blocks" OCR,
            OCR is performed on individual elements by cropping the image.
        - pdf_image_dpi (int, optional): DPI (dots per inch) for processing PDF images. Defaults to 200.

        Returns:
            DocumentLayout: The merged layout information obtained after OCR processing.
        """
        merged_page_layouts = []
        try:
            if is_image:
                with PILImage.open(filename) as images:
                    image_format = images.format
                    for i, image in enumerate(ImageSequence.Iterator(images)):
                        image = image.convert("RGB")
                        image.format = image_format
                        merged_page_layout = supplement_page_layout_with_ocr(
                            out_layout.pages[i],
                            image,
                            infer_table_structure=infer_table_structure,
                            ocr_languages=ocr_languages,
                            ocr_mode=ocr_mode,
                        )
                        merged_page_layouts.append(merged_page_layout)
                return DocumentLayout.from_pages(merged_page_layouts)
            else:
                with tempfile.TemporaryDirectory() as temp_dir:
                    _image_paths = pdf2image.convert_from_path(
                        filename,
                        dpi=pdf_image_dpi,
                        output_folder=temp_dir,
                        paths_only=True,
                    )
                    image_paths = cast(List[str], _image_paths)
                    for i, image_path in enumerate(image_paths):
                        with PILImage.open(image_path) as image:
                            merged_page_layout = supplement_page_layout_with_ocr(
>                               out_layout.pages[i],
                                image,
                                infer_table_structure=infer_table_structure,
                                ocr_languages=ocr_languages,
                                ocr_mode=ocr_mode,
                            )
E   IndexError: list index out of range
.venv/lib/python3.11/site-packages/unstructured/partition/ocr.py:161: IndexError
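Reading the trace, `out_layout.pages[i]` is indexed by the position of each image that pdf2image rendered, so the IndexError can only fire when more images were rendered than the layout has pages. The sketch below makes that mismatch explicit using plain lists as stand-ins. `pair_pages_with_images` is a hypothetical helper written for illustration; it is not part of unstructured, and it does not explain why the two counts differ for this PDF.

```python
from typing import List, Sequence, Tuple, TypeVar

P = TypeVar("P")
I = TypeVar("I")


def pair_pages_with_images(pages: Sequence[P], images: Sequence[I]) -> List[Tuple[P, I]]:
    """Pair each layout page with its rendered image, failing loudly on a count mismatch."""
    if len(pages) != len(images):
        raise ValueError(
            f"layout has {len(pages)} page(s) but {len(images)} image(s) were rendered"
        )
    return list(zip(pages, images))


# Stand-in data (assumption): three layout pages but four rendered images,
# which is the shape of input that would trigger the IndexError above.
layout_pages = ["page-0", "page-1", "page-2"]
rendered_images = ["img-0", "img-1", "img-2", "img-3"]

try:
    pair_pages_with_images(layout_pages, rendered_images)
except ValueError as exc:
    mismatch_message = str(exc)
```

A check like this would not fix the underlying problem, but it would turn the bare IndexError into a message that reports both counts, which may help narrow down which split is affected.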
To Reproduce
Provide a code snippet that reproduces the issue.
0uupv_Artisi+-+Brochure+-+FINAL06.06.23.pdf
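If you split it into chunks of 10 pages each, you'll find that the 30-40 range throws this error while the rest are fine. Splitting into chunks of 5 pages also triggers it. With other split sizes, such as 40, there is no error.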
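The report does not include a script, so here is a rough sketch of how the splitting experiment described above could be automated. It assumes pypdf is used to cut the PDF into chunks and that each chunk is passed to `partition_pdf` with `strategy="hi_res"`; the helper names `split_ranges` and `find_failing_splits` are made up for this example, and the reporter's actual splitting method is unknown.

```python
from typing import List, Tuple


def split_ranges(total_pages: int, split_size: int) -> List[Tuple[int, int]]:
    """Return half-open, zero-based (start, stop) page ranges of at most split_size pages."""
    if total_pages < 0 or split_size <= 0:
        raise ValueError("total_pages must be >= 0 and split_size must be > 0")
    return [
        (start, min(start + split_size, total_pages))
        for start in range(0, total_pages, split_size)
    ]


def find_failing_splits(pdf_path: str, split_size: int) -> List[Tuple[int, int]]:
    """Partition each split with hi_res and collect the ranges that raise IndexError."""
    # Imported lazily so split_ranges stays usable without these packages installed.
    import io

    from pypdf import PdfReader, PdfWriter
    from unstructured.partition.pdf import partition_pdf

    reader = PdfReader(pdf_path)
    failing = []
    for start, stop in split_ranges(len(reader.pages), split_size):
        # Copy the pages of this range into a fresh in-memory PDF.
        writer = PdfWriter()
        for index in range(start, stop):
            writer.add_page(reader.pages[index])
        buffer = io.BytesIO()
        writer.write(buffer)
        buffer.seek(0)
        try:
            partition_pdf(file=buffer, strategy="hi_res", infer_table_structure=True)
        except IndexError:
            failing.append((start, stop))
    return failing
```

Calling `find_failing_splits` on the attached PDF with split sizes of 5, 10, and 40 should show which ranges fail, if the behaviour described in this report reproduces in your environment.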
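Expected behavior
It shouldn't fail seemingly at random depending on the split size :)
Screenshots
None
Environment Info
Python 3.11 on Mac; also seen on Ubuntu
Additional context
hi_res extraction; I have only ever hit this error once, with this specific PDF file, as shown.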
1 Answer
k5hmc34c #1
I ran into the same issue when using the API; it fails to extract images.
I am using parallel mode, enabled through environment variables.