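Describe the bug
When element coordinates are used to draw bounding boxes, the default strategy and the 'hi_res' strategy produce different coordinates.
To Reproduce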
sudo apt-get install -y poppler-utils tesseract-ocr
pip install "unstructured[pdf]==0.12.5" PyMuPDF poppler-utils unstructured_inference==0.7.23
# unstructured_inference 0.7.24 has an Image.open() compatibility issue with unstructured 0.12.5, so downgrade to 0.7.23
# Partition the PDF into chunks
import fitz
from unstructured.partition.pdf import partition_pdf
from unstructured.documents.elements import Element

# Define the inputs before they are used
document = "/content/1706.03762v7.pdf"
chunk_size = 0

elements_high_res = partition_pdf(
    filename=document,
    chunk_size=chunk_size,
    extract_images_in_pdf=True,
    extract_image_block_output_dir="/content/images",
    strategy="hi_res",
    use_gpu=True,
)
elements = partition_pdf(
    filename=document,
    chunk_size=chunk_size,
)

# Using hi_res strategy
output_pdf_path = "/content/1706.03762v7_modded_high_res.pdf"
pdf_document = fitz.open(document)
for element in elements_high_res:
    if isinstance(element, Element):
        page_number = element.metadata.page_number
        coordinates = element.metadata.coordinates
        # Skip elements that have no page number or no coordinates
        if page_number is not None and coordinates is not None:
            bbox = coordinates.to_dict()
            top_left, bottom_right = bbox['points'][0], bbox['points'][2]
            page = pdf_document[page_number - 1]  # PyMuPDF uses 0-based indexing for pages
            rect = fitz.Rect(top_left, bottom_right)
            page.draw_rect(rect, color=(1, 0, 0), width=2)  # Draw a red rectangle with a width of 2
# Save the modified PDF
pdf_document.save(output_pdf_path)
pdf_document.close()
# Using default strategy
output_pdf_path = "/content/1706.03762v7_modded.pdf"
pdf_document = fitz.open(document)
for element in elements:
    if isinstance(element, Element):
        page_number = element.metadata.page_number
        coordinates = element.metadata.coordinates
        # Skip elements that have no page number or no coordinates
        if page_number is not None and coordinates is not None:
            bbox = coordinates.to_dict()
            top_left, bottom_right = bbox['points'][0], bbox['points'][2]
            page = pdf_document[page_number - 1]  # PyMuPDF uses 0-based indexing for pages
            rect = fitz.Rect(top_left, bottom_right)
            page.draw_rect(rect, color=(1, 0, 0), width=2)  # Draw a red rectangle with a width of 2
# Save the modified PDF
pdf_document.save(output_pdf_path)
pdf_document.close()
[1706.03762v7_modded_high_res.pdf](https://github.com/Unstructured-IO/unstructured/files/15441444/1706.03762v7_modded_high_res.pdf)
[1706.03762v7_modded.pdf](https://github.com/Unstructured-IO/unstructured/files/15441445/1706.03762v7_modded.pdf)
[1712.05889v2.pdf](https://github.com/Unstructured-IO/unstructured/files/15441446/1712.05889v2.pdf)
[1706.03762v7.pdf](https://github.com/Unstructured-IO/unstructured/files/15441447/1706.03762v7.pdf)
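Expected behavior
The bounding boxes should not change when the strategy changes.
Screenshots
The results are attached as PDFs above, but here are screenshots as well:
Default strategy
hi_res strategy
Environment info
Please run python scripts/collect_env.py and paste the output here. This will help us understand more about the environment in which the bug occurred.
Public notebook link https://colab.research.google.com/drive/1z2dwE9t6zsgTcejx9RQzj_nTDHOdS4Vv?usp=sharing
Additional context
None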
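4 Answers
ocebsuys1#
@leah1985 - does this look like a model output issue or a pre-/post-processing issue?
s3fp2yjn2#
I don't think this is a "hi_res" strategy issue; it looks more like a "fast" strategy issue caused by the CoordinateSystem. I'll investigate this further.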
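One way to check which strategy reports which coordinate system is to summarize the coordinate metadata of each element list. The helper below is a hypothetical sketch, not part of unstructured. It assumes that `element.metadata.coordinates.to_dict()` returns a dict with `system`, `layout_width`, and `layout_height` keys alongside `points`; the sample values are hand-written for illustration and are not real output.

```python
from collections import Counter


def summarize_coordinate_systems(bbox_dicts):
    """Count the distinct (system, layout_width, layout_height) triples.

    bbox_dicts: iterable of dicts shaped like the result of
    element.metadata.coordinates.to_dict() (key names are an assumption).
    """
    summary = Counter()
    for bbox in bbox_dicts:
        # .get() keeps the helper usable even if a key is missing
        key = (bbox.get("system"), bbox.get("layout_width"), bbox.get("layout_height"))
        summary[key] += 1
    return summary


# Hand-written sample dicts, for illustration only
sample = [
    {"points": [], "system": "PixelSpace", "layout_width": 1700, "layout_height": 2200},
    {"points": [], "system": "PixelSpace", "layout_width": 1700, "layout_height": 2200},
    {"points": [], "system": "PixelSpace", "layout_width": 612, "layout_height": 792},
]
print(summarize_coordinate_systems(sample))
```

Running it once on the dicts from `elements` and once on those from `elements_high_res` (skipping elements whose coordinates are None) would show whether the two strategies report different layout sizes.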
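ajsxfq5m3#
Sounds good, thanks!
pw9qyyiw4#
In my experience, hi_res outputs coordinates taken from the PDF after it has been converted to an image, which is a step the fast method does not have to perform. Converting the PDF to an image first gives a much higher pixel density, so the coordinates fall outside the page bounds of a document loaded with fitz. Use from unstructured_inference.inference.layout import convert_pdf_to_image to load the images, so that you get the correct format and coordinate system.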
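An alternative to drawing on the rendered images is to scale the hi_res coordinates back into the page's own units before building the rectangle. The function below is a minimal sketch under assumptions, not the library's method: `rescale_bbox` is a hypothetical helper, and it assumes the coordinate dict exposes the rendered image size as `layout_width` and `layout_height`, and that both systems place the origin at the top-left corner. The numbers use a 612 x 792 point page rendered at 1700 x 2200 pixels purely as an example.

```python
def rescale_bbox(points, layout_width, layout_height, page_width, page_height):
    """Map a bounding box from layout (image) space into page space.

    points: four corner points as in bbox['points']; index 0 is taken as
            the top-left corner and index 2 as the bottom-right corner.
    layout_width, layout_height: size of the space the points were reported in.
    page_width, page_height: size of the target page, e.g. page.rect.width
            and page.rect.height in PyMuPDF.
    Returns (x0, y0, x1, y1) in page units.
    """
    scale_x = page_width / layout_width
    scale_y = page_height / layout_height
    (x0, y0), (x1, y1) = points[0], points[2]
    return (x0 * scale_x, y0 * scale_y, x1 * scale_x, y1 * scale_y)


# Example values only: a box reported in a 1700 x 2200 pixel layout
example_points = [(170, 220), (170, 440), (340, 440), (340, 220)]
scaled = rescale_bbox(example_points, 1700, 2200, 612, 792)
print(scaled)
```

In the reproduction script, the returned tuple could be passed to `fitz.Rect(*scaled)` in place of `fitz.Rect(top_left, bottom_right)`. When the layout size already equals the page size, the scale factors are 1 and the box is unchanged.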