Parsr is slow even for small files with no (or very few) modules

Posted by pvabu6sv 5 months ago in Other

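Following the discussion in #510, I am testing Parsr on small files with no (or very few) modules. This is the log for README.pdf (8 pages), which is provided in the samples: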

[2022-10-24T16:39:19] INFO  (parsr-api/7 on 32b79c3646c1): Using extractor: PDFJsExtractor
[2022-10-24T16:39:19] INFO  (parsr-api/7 on 32b79c3646c1): Running extractor PDF.js
[2022-10-24T16:39:19] INFO  (parsr-api/7 on 32b79c3646c1): executing command: qpdf --decrypt --no-warn /tmp/f2f1cf2c1053576eca2a6acd83e045/a02a5859e0d4634f2e54dd4cb23680.pdf /tmp/0fa7abe3f0b24684eccaef72eb454f.pdf
[2022-10-24T16:39:19] INFO  (parsr-api/7 on 32b79c3646c1): Qpdf repair succeed --> /tmp/0fa7abe3f0b24684eccaef72eb454f.pdf
[2022-10-24T16:39:19] INFO  (parsr-api/7 on 32b79c3646c1): executing command: mutool clean -g /tmp/0fa7abe3f0b24684eccaef72eb454f.pdf /tmp/c2003b6514e0b99f4dba757b46a3dc.pdf
[2022-10-24T16:39:19] INFO  (parsr-api/7 on 32b79c3646c1): Mutool clean succeed --> /tmp/c2003b6514e0b99f4dba757b46a3dc.pdf
[2022-10-24T16:39:21] INFO  (parsr-api/7 on 32b79c3646c1): Elapsed time: 1.428s
[2022-10-24T16:39:21] INFO  (parsr-api/7 on 32b79c3646c1): Exporting json...
[2022-10-24T16:39:21] INFO  (parsr-api/7 on 32b79c3646c1): Writing file: /tmp/ba66b32a7782915beef6706b8fdc9a.json
[2022-10-24T16:39:21] INFO  (parsr-api/7 on 32b79c3646c1): Running cleaner...
[2022-10-24T16:39:21] INFO  (parsr-api/7 on 32b79c3646c1): Running module: OutOfPageRemovalModule, Options: {}
[2022-10-24T16:39:21] INFO  (parsr-api/7 on 32b79c3646c1):   Elapsed time: 0.005s
[2022-10-24T16:39:21] INFO  (parsr-api/7 on 32b79c3646c1): Running module: WhitespaceRemovalModule, Options: {"minWidth":0}
[2022-10-24T16:39:21] INFO  (parsr-api/7 on 32b79c3646c1):   Elapsed time: 0.02s
[2022-10-24T16:39:21] INFO  (parsr-api/7 on 32b79c3646c1): Running module: RedundancyDetectionModule, Options: {"minOverlap":0.5}
[2022-10-24T16:39:21] INFO  (parsr-api/7 on 32b79c3646c1):   Elapsed time: 0.073s
[2022-10-24T16:39:21] INFO  (parsr-api/7 on 32b79c3646c1): Running module: HeaderFooterDetectionModule, Options: {"ignorePages":[],"maxMarginPercentage":15,"similaritySizePercentage":10}
[2022-10-24T16:39:21] INFO  (parsr-api/7 on 32b79c3646c1): Detecting marginals (headers and footers) with maxMarginPercentage: 15 ...
[2022-10-24T16:39:21] INFO  (parsr-api/7 on 32b79c3646c1): Document margins for maxMarginPercentage 15: top: 125, bottom: 715, left: undefined, right: 559
[2022-10-24T16:39:21] INFO  (parsr-api/7 on 32b79c3646c1):   Elapsed time: 0.013s
[2022-10-24T16:39:21] INFO  (parsr-api/7 on 32b79c3646c1): Running module: ReadingOrderDetectionModule, Options: {"minVerticalGapWidth":5,"minColumnWidthInPagePercent":15}
[2022-10-24T16:39:21] INFO  (parsr-api/7 on 32b79c3646c1):   Elapsed time: 0.07s
[2022-10-24T16:39:21] INFO  (parsr-api/7 on 32b79c3646c1): Total elapsed time: 0.184s

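The first stage alone already takes about 1.5 seconds. The total time to call the API and get the completion status is over 4 seconds. For comparison, PyMuPDF takes about 40 ms. For a 40-page document, the numbers are 10 seconds vs. 200 ms. Any ideas on how to speed this up? The configuration is as follows: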

[2022-10-24T16:39:19] INFO  (parsr-api/7 on 32b79c3646c1): Config {
  version: 0.9,
  cleaner: [
    'out-of-page-removal',
    'whitespace-removal',
    'redundancy-detection',
    [
      'header-footer-detection',
      [Object]
    ],
    [
      'reading-order-detection',
      [Object]
    ]
  ],
  extractor: {
    pdf: 'pdfjs',
    ocr: 'tesseract',
    language: [
      'en'
    ]
  },
  output: {
    granularity: 'word',
    includeMarginals: true,
    includeDrawings: false,
    formats: {
      json: true,
      text: false,
      csv: false,
      markdown: false,
      pdf: false
    }
  }
}
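For reference, a config file of this shape could be generated with a short script. This is only a sketch: the options for the two modules printed as `[Object]` are copied from the "Running module" log lines and are an assumption about what the file actually contains (they may include defaults merged in by Parsr). `build_config` is a hypothetical helper name.

```python
import json


def build_config() -> dict:
    """Rebuild the configuration printed in the log as a plain dict."""
    return {
        "version": 0.9,
        "cleaner": [
            "out-of-page-removal",
            "whitespace-removal",
            "redundancy-detection",
            # Assumption: options taken from the HeaderFooterDetectionModule log line.
            [
                "header-footer-detection",
                {
                    "ignorePages": [],
                    "maxMarginPercentage": 15,
                    "similaritySizePercentage": 10,
                },
            ],
            # Assumption: options taken from the ReadingOrderDetectionModule log line.
            [
                "reading-order-detection",
                {"minVerticalGapWidth": 5, "minColumnWidthInPagePercent": 15},
            ],
        ],
        "extractor": {"pdf": "pdfjs", "ocr": "tesseract", "language": ["en"]},
        "output": {
            "granularity": "word",
            "includeMarginals": True,
            "includeDrawings": False,
            "formats": {
                "json": True,
                "text": False,
                "csv": False,
                "markdown": False,
                "pdf": False,
            },
        },
    }


# Serialize to JSON; write this string to a file to use it as a config.
config_json = json.dumps(build_config(), indent=2)
print(config_json)
```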
qaxu7uf2 #1

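It looks like you have some overhead elsewhere, because the total elapsed time in the log is far less than 4 seconds.
What does your pipeline look like, and how are you calling the Parsr API?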
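To make the gap concrete, the timings in the log can be added up with a small script. This is a rough sketch that assumes the log format shown in the question (an unindented "Elapsed time" for the extractor, one "Elapsed time" per module, and a final "Total elapsed time" for the cleaner); `summarize_timings` is a hypothetical helper, not part of Parsr.

```python
import re

# Timing-related lines copied from the log in the question.
LOG = """\
[2022-10-24T16:39:21] INFO  (parsr-api/7 on 32b79c3646c1): Elapsed time: 1.428s
[2022-10-24T16:39:21] INFO  (parsr-api/7 on 32b79c3646c1): Running module: OutOfPageRemovalModule, Options: {}
[2022-10-24T16:39:21] INFO  (parsr-api/7 on 32b79c3646c1):   Elapsed time: 0.005s
[2022-10-24T16:39:21] INFO  (parsr-api/7 on 32b79c3646c1): Running module: WhitespaceRemovalModule, Options: {"minWidth":0}
[2022-10-24T16:39:21] INFO  (parsr-api/7 on 32b79c3646c1):   Elapsed time: 0.02s
[2022-10-24T16:39:21] INFO  (parsr-api/7 on 32b79c3646c1): Running module: RedundancyDetectionModule, Options: {"minOverlap":0.5}
[2022-10-24T16:39:21] INFO  (parsr-api/7 on 32b79c3646c1):   Elapsed time: 0.073s
[2022-10-24T16:39:21] INFO  (parsr-api/7 on 32b79c3646c1): Running module: HeaderFooterDetectionModule, Options: {"ignorePages":[],"maxMarginPercentage":15,"similaritySizePercentage":10}
[2022-10-24T16:39:21] INFO  (parsr-api/7 on 32b79c3646c1):   Elapsed time: 0.013s
[2022-10-24T16:39:21] INFO  (parsr-api/7 on 32b79c3646c1): Running module: ReadingOrderDetectionModule, Options: {"minVerticalGapWidth":5,"minColumnWidthInPagePercent":15}
[2022-10-24T16:39:21] INFO  (parsr-api/7 on 32b79c3646c1):   Elapsed time: 0.07s
[2022-10-24T16:39:21] INFO  (parsr-api/7 on 32b79c3646c1): Total elapsed time: 0.184s
"""

MODULE_RE = re.compile(r"Running module: (\w+)")
TIME_RE = re.compile(r"(Total elapsed|Elapsed) time: ([\d.]+)s")


def summarize_timings(log: str) -> dict:
    """Split logged times into extraction, per-module, and cleaner total."""
    extraction = 0.0
    cleaner_total = 0.0
    modules = {}
    current_module = None
    for line in log.splitlines():
        module_match = MODULE_RE.search(line)
        if module_match:
            current_module = module_match.group(1)
            continue
        time_match = TIME_RE.search(line)
        if not time_match:
            continue
        seconds = float(time_match.group(2))
        if time_match.group(1) == "Total elapsed":
            cleaner_total = seconds
        elif current_module is None:
            # An "Elapsed time" before any module has run belongs to the extractor.
            extraction = seconds
        else:
            modules[current_module] = seconds
    return {
        "extraction": extraction,
        "cleaner_total": cleaner_total,
        "modules": modules,
    }


timings = summarize_timings(LOG)
logged = timings["extraction"] + timings["cleaner_total"]
print(f"extraction:    {timings['extraction']:.3f}s")
print(f"cleaner total: {timings['cleaner_total']:.3f}s")
print(f"logged work:   {logged:.3f}s")
# List modules from slowest to fastest.
for name, seconds in sorted(timings["modules"].items(), key=lambda kv: -kv[1]):
    print(f"  {name}: {seconds:.3f}s")
```

For this log the extractor and cleaner together account for roughly 1.6 s, so more than 2 s of the reported 4 s is not covered by these log lines.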

q7solyqu #2

To time the whole operation, I am using the Python client in a Jupyter notebook:

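%%timeit -n1 -r1
import time

# Submit the document and remember the request id on the client.
parsr.send_document(
    file_path=pdf_file,
    config_path='/tmp/parsr_config.json',
    document_name='Test',
    save_request_id=True,
)

# Poll every 100 ms until the status response no longer reports progress.
while 'progress-percentage' in parsr.get_status()['server_response']:
    time.sleep(0.1)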

The client is instantiated through its Docker image.
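To see how much of the wall-clock time is spent waiting rather than submitting, the polling loop could be timed on its own and the number of status requests counted. The sketch below assumes only what the snippet above shows: a `get_status()` callable that returns a dict whose `'server_response'` contains `'progress-percentage'` while the job is running. `wait_for_completion` and `make_fake_status` are hypothetical helpers, and the fake response shape is invented so the example runs without a server.

```python
import time


def wait_for_completion(get_status, poll_interval=0.1, timeout=60.0):
    """Poll get_status() until progress disappears; return (seconds, polls)."""
    start = time.perf_counter()
    polls = 0
    while True:
        polls += 1
        response = get_status()["server_response"]
        if "progress-percentage" not in response:
            return time.perf_counter() - start, polls
        if time.perf_counter() - start > timeout:
            raise TimeoutError("document was not processed in time")
        time.sleep(poll_interval)


def make_fake_status(pending_polls):
    """Stand-in for a real client: reports progress for the first N calls."""
    remaining = {"n": pending_polls}

    def get_status():
        if remaining["n"] > 0:
            remaining["n"] -= 1
            return {"server_response": {"progress-percentage": 50}}
        # Hypothetical "finished" payload; the real response may differ.
        return {"server_response": {"status": "done"}}

    return get_status


# Demo with the fake status function: three pending polls, then completion.
elapsed, polls = wait_for_completion(make_fake_status(3), poll_interval=0.01)
print(f"waited {elapsed:.3f}s over {polls} status requests")
```

With the real client, `wait_for_completion(parsr.get_status)` would take the place of the `while` loop, and `send_document` could be timed separately with `time.perf_counter()` to separate upload time from processing and polling time.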
