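Following the discussion in #510, I am testing Parsr on small files with no (or only a few) modules. This is the log for the README.pdf provided in the samples
(8 pages):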
[2022-10-24T16:39:19] INFO (parsr-api/7 on 32b79c3646c1): Using extractor: PDFJsExtractor
[2022-10-24T16:39:19] INFO (parsr-api/7 on 32b79c3646c1): Running extractor PDF.js
[2022-10-24T16:39:19] INFO (parsr-api/7 on 32b79c3646c1): executing command: qpdf --decrypt --no-warn /tmp/f2f1cf2c1053576eca2a6acd83e045/a02a5859e0d4634f2e54dd4cb23680.pdf /tmp/0fa7abe3f0b24684eccaef72eb454f.pdf
[2022-10-24T16:39:19] INFO (parsr-api/7 on 32b79c3646c1): Qpdf repair succeed --> /tmp/0fa7abe3f0b24684eccaef72eb454f.pdf
[2022-10-24T16:39:19] INFO (parsr-api/7 on 32b79c3646c1): executing command: mutool clean -g /tmp/0fa7abe3f0b24684eccaef72eb454f.pdf /tmp/c2003b6514e0b99f4dba757b46a3dc.pdf
[2022-10-24T16:39:19] INFO (parsr-api/7 on 32b79c3646c1): Mutool clean succeed --> /tmp/c2003b6514e0b99f4dba757b46a3dc.pdf
[2022-10-24T16:39:21] INFO (parsr-api/7 on 32b79c3646c1): Elapsed time: 1.428s
[2022-10-24T16:39:21] INFO (parsr-api/7 on 32b79c3646c1): Exporting json...
[2022-10-24T16:39:21] INFO (parsr-api/7 on 32b79c3646c1): Writing file: /tmp/ba66b32a7782915beef6706b8fdc9a.json
[2022-10-24T16:39:21] INFO (parsr-api/7 on 32b79c3646c1): Running cleaner...
[2022-10-24T16:39:21] INFO (parsr-api/7 on 32b79c3646c1): Running module: OutOfPageRemovalModule, Options: {}
[2022-10-24T16:39:21] INFO (parsr-api/7 on 32b79c3646c1): Elapsed time: 0.005s
[2022-10-24T16:39:21] INFO (parsr-api/7 on 32b79c3646c1): Running module: WhitespaceRemovalModule, Options: {"minWidth":0}
[2022-10-24T16:39:21] INFO (parsr-api/7 on 32b79c3646c1): Elapsed time: 0.02s
[2022-10-24T16:39:21] INFO (parsr-api/7 on 32b79c3646c1): Running module: RedundancyDetectionModule, Options: {"minOverlap":0.5}
[2022-10-24T16:39:21] INFO (parsr-api/7 on 32b79c3646c1): Elapsed time: 0.073s
[2022-10-24T16:39:21] INFO (parsr-api/7 on 32b79c3646c1): Running module: HeaderFooterDetectionModule, Options: {"ignorePages":[],"maxMarginPercentage":15,"similaritySizePercentage":10}
[2022-10-24T16:39:21] INFO (parsr-api/7 on 32b79c3646c1): Detecting marginals (headers and footers) with maxMarginPercentage: 15 ...
[2022-10-24T16:39:21] INFO (parsr-api/7 on 32b79c3646c1): Document margins for maxMarginPercentage 15: top: 125, bottom: 715, left: undefined, right: 559
[2022-10-24T16:39:21] INFO (parsr-api/7 on 32b79c3646c1): Elapsed time: 0.013s
[2022-10-24T16:39:21] INFO (parsr-api/7 on 32b79c3646c1): Running module: ReadingOrderDetectionModule, Options: {"minVerticalGapWidth":5,"minColumnWidthInPagePercent":15}
[2022-10-24T16:39:21] INFO (parsr-api/7 on 32b79c3646c1): Elapsed time: 0.07s
[2022-10-24T16:39:21] INFO (parsr-api/7 on 32b79c3646c1): Total elapsed time: 0.184s
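The first stage alone already takes 1.5 seconds. The total time to call the API and get the completed status is over 4 seconds. For comparison, PyMuPDF takes about 40 ms. For a 40-page document, the numbers are 10 s vs. 200 ms. Any ideas on how to speed it up? The configuration is as follows: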
[2022-10-24T16:39:19] INFO (parsr-api/7 on 32b79c3646c1): Config {
version: 0.9,
cleaner: [
'out-of-page-removal',
'whitespace-removal',
'redundancy-detection',
[
'header-footer-detection',
[Object]
],
[
'reading-order-detection',
[Object]
]
],
extractor: {
pdf: 'pdfjs',
ocr: 'tesseract',
language: [
'en'
]
},
output: {
granularity: 'word',
includeMarginals: true,
includeDrawings: false,
formats: {
json: true,
text: false,
csv: false,
markdown: false,
pdf: false
}
}
}
2 answers
qaxu7uf2 #1
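It looks like you have some overhead elsewhere, because the total elapsed time in the log is far less than 4 seconds.
What does your pipeline look like, and how are you calling Parsr's API?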
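To make the gap concrete, here is a rough sketch that adds up the timings reported in the log from the question. The log excerpt is copied from above with the timestamp prefix and module options trimmed; the `summarize` helper and the 4.0 s figure (used as a lower bound for "over 4 seconds") are assumptions for illustration only.

```python
import re

# Excerpt of the log posted in the question (prefix and options trimmed).
LOG = """\
Running extractor PDF.js
Elapsed time: 1.428s
Running cleaner...
Running module: OutOfPageRemovalModule
Elapsed time: 0.005s
Running module: WhitespaceRemovalModule
Elapsed time: 0.02s
Running module: RedundancyDetectionModule
Elapsed time: 0.073s
Running module: HeaderFooterDetectionModule
Elapsed time: 0.013s
Running module: ReadingOrderDetectionModule
Elapsed time: 0.07s
Total elapsed time: 0.184s
"""


def summarize(log: str) -> dict:
    """Split the reported times into extractor, per-module, and cleaner total."""
    extractor = 0.0
    modules = {}
    cleaner_total = 0.0
    current = None  # module whose "Elapsed time" line is expected next
    for line in log.splitlines():
        m = re.search(r"Running module: (\w+)", line)
        if m:
            current = m.group(1)
            continue
        m = re.search(r"Total elapsed time: ([\d.]+)s", line)
        if m:
            cleaner_total = float(m.group(1))
            continue
        m = re.search(r"Elapsed time: ([\d.]+)s", line)
        if m:
            if current is None:
                # No module announced yet, so this is the extraction stage.
                extractor = float(m.group(1))
            else:
                modules[current] = float(m.group(1))
                current = None
    return {"extractor": extractor, "modules": modules, "cleaner_total": cleaner_total}


summary = summarize(LOG)
accounted = summary["extractor"] + summary["cleaner_total"]
observed = 4.0  # assumption: lower bound for "over 4 seconds" end to end
print(f"accounted for in the log: {accounted:.3f}s")        # 1.612s
print(f"not visible in the log:   {observed - accounted:.3f}s")  # 2.388s
```

So at least roughly 2.4 s of the round trip happens outside the extractor and cleaner stages shown in the log.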
q7solyqu #2
To time the whole operation, I am using the Python client from a Jupyter notebook:
The client is instantiated via its Docker image.
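For reference, one generic way to take such an end-to-end measurement is `time.perf_counter` around the blocking call. This is only a sketch under stated assumptions: `send_and_wait` is a hypothetical stand-in for whatever client call submits the document and waits for the completed status. It is not the Parsr client API and not the code from the notebook.

```python
import time
from typing import Any, Callable, Tuple


def timed(fn: Callable[..., Any], *args: Any, **kwargs: Any) -> Tuple[Any, float]:
    """Call fn and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


def send_and_wait(path: str) -> str:
    """Hypothetical placeholder: submit `path` and block until processing is done."""
    time.sleep(0.01)  # simulates network and processing latency
    return "done"


result, seconds = timed(send_and_wait, "README.pdf")
print(f"status={result} total={seconds:.3f}s")
```

Timing the submit step and the status polling separately with the same helper would show which part of the round trip adds the overhead that does not appear in the server log.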