shell: Doing OCR in R

baubqpgj · published 2023-11-21 in Shell
Follow (0) | Answers (3) | Views (168)

I have been trying to do OCR in R (reading PDF files where the data are scanned images). I have been reading about this at http://electricarchaeology.ca/2014/07/15/doing-ocr-within-r/
It is a very good post.
The three steps that work:
1. Convert the PDF to PPM (an image format)
2. Convert the PPM to TIFF, ready for Tesseract (using ImageMagick for the conversion)
3. Convert the TIFF to a text file
Working code for the three steps above, following the linked post:

lapply(myfiles, function(i){
  # convert pdf to ppm (an image format), just pages 1-10 of the PDF
  # but you can change that easily, just remove or edit the 
  # -f 1 -l 10 bit in the line below
  shell(shQuote(paste0("F:/xpdf/bin64/pdftoppm.exe ", i, " -f 1 -l 10 -r 600 ocrbook")))
  # convert ppm to tif ready for tesseract
  shell(shQuote(paste0("F:/ImageMagick-6.9.1-Q16/convert.exe *.ppm ", i, ".tif")))
  # convert tif to text file
  shell(shQuote(paste0("F:/Tesseract-OCR/tesseract.exe ", i, ".tif ", i, " -l eng")))
  # delete tif file
  file.remove(paste0(i, ".tif" ))
  })

The first two steps run fine (although they took quite a long time for a 4-page PDF; I will look into scalability later, once I know whether this works at all).
When running step 3, i.e.

shell(shQuote(paste0("F:/Tesseract-OCR/tesseract.exe ", i, ".tif ", i, " -l eng")))


I get this error:
Error: evaluation nested too deeply: infinite recursion / options(expressions=)?
or Tesseract simply crashes.
Any workaround or root-cause analysis would be much appreciated.


fcwjkofz1#

Using the "tesseract" package, I created a working sample script. It even works for scanned PDFs.

library(tesseract)
library(pdftools)

# Render the PDF pages to TIFF images (one file per page)

img_file <- pdftools::pdf_convert("F:/gowtham/A/B/invoice.pdf", format = 'tiff',  dpi = 400)

# Extract text from the rendered images
text <- ocr(img_file)
write.table(text, "F:/gowtham/A/B/mydata.txt")

字符串
I am new to R and to programming, so please point out any mistakes. Hope this helps.
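For a multi-page PDF the same idea extends naturally: `pdf_convert()` returns one image file per page, and `ocr()` is vectorised over a vector of file paths. A minimal sketch, assuming a local file named `invoice.pdf` (the file names here are placeholders, not from the original answer):

library(pdftools)
library(tesseract)

# render every page of the PDF to a TIFF (one file per page)
img_files <- pdf_convert("invoice.pdf", format = "tiff", dpi = 400)

# ocr() accepts a vector of paths and returns one character string per page
pages <- ocr(img_files)

# write each page's text, then tidy up the intermediate images
writeLines(pages, "mydata.txt")
file.remove(img_files)

`writeLines()` is used here instead of `write.table()` so that the OCR output is written verbatim, one page after another, without quoting or row names.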


busg9geu2#

The newly released tesseract package may be worth a try; it lets you perform the whole process inside R, without any shell calls.
Using the process from the help documentation of the tesseract package, your function would look something like this:

library(pdftools)
library(tesseract)

lapply(myfiles, function(i){
  # convert pdf to tiff and perform tesseract OCR on the image

  # render page 1 of the PDF to a bitmap (numeric = TRUE so writeTIFF accepts it)
  bitmap <- pdf_render_page(i, page = 1, dpi = 300, numeric = TRUE)
  # write the bitmap out as a tiff
  tiff::writeTIFF(bitmap, paste0(i, ".tiff"))
  # perform OCR on the .tiff file
  out <- ocr(paste0(i, ".tiff"))
  # delete tiff file
  file.remove(paste0(i, ".tiff"))
  out
})
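Both the question and this answer assume that `myfiles` already holds the paths of the PDFs to process. It can be built with `list.files()`; a minimal sketch, where the directory and pattern are assumptions about your layout:

# collect the full paths of all PDFs in the current directory
myfiles <- list.files(".", pattern = "\\.pdf$", full.names = TRUE)

Passing `full.names = TRUE` keeps the directory prefix on each path, so the function above can be run from any working directory.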



p4tfgftt3#

Here is another approach you could consider:

library(reticulate)
conda_Env <- conda_list()

if(!any(conda_Env[, 1] == "ocrTable"))
{
  reticulate::conda_create(envname = "ocrTable", python_version = "3.7.16")
  reticulate::conda_install(envname = "ocrTable", packages = "transformers", pip = TRUE)
  reticulate::conda_install(envname = "ocrTable", packages = "torch", pip = TRUE)
  reticulate::conda_install(envname = "ocrTable", packages = "requests", pip = TRUE)
  reticulate::conda_install(envname = "ocrTable", packages = "Pillow", pip = TRUE)
}

reticulate::use_condaenv("ocrTable")

transformers <- import("transformers")
TrOCRProcessor <- transformers$TrOCRProcessor
VisionEncoderDecoderModel <- transformers$VisionEncoderDecoderModel
processor <- TrOCRProcessor$from_pretrained("microsoft/trocr-base-handwritten")
model <- VisionEncoderDecoderModel$from_pretrained("microsoft/trocr-base-handwritten")

requests <- import("requests")
PIL <- import("PIL")
Image <- PIL$Image
url <- "https://fki.tic.heia-fr.ch/static/img/a01-122-02.jpg"
image <- Image$open(requests$get(url, stream = TRUE)$raw)$convert("RGB")
pixel_values <- processor(image, return_tensors = "pt")$pixel_values
generated_ids <- model$generate(pixel_values)
generated_text <- processor$batch_decode(generated_ids, skip_special_tokens = TRUE)
generated_text

[1] "industry, \" Mr. Brown commented icily. \" Let us have a"
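The demo URL above points at a single handwritten text line; TrOCR models are trained on line-level images, so whole scanned pages usually need to be split into lines first. A hypothetical sketch of feeding a locally rendered PDF page through the same `processor`/`model` session as above (the file name `invoice.pdf` is a placeholder):

library(pdftools)

# render page 1 of a local scanned PDF to a PNG (path is hypothetical)
page_png <- pdf_convert("invoice.pdf", format = "png", pages = 1, dpi = 300)

# open it with PIL and run it through the same processor and model as above
image <- Image$open(page_png)$convert("RGB")
pixel_values <- processor(image, return_tensors = "pt")$pixel_values
generated_ids <- model$generate(pixel_values)
processor$batch_decode(generated_ids, skip_special_tokens = TRUE)

For printed rather than handwritten scans, the "microsoft/trocr-base-printed" checkpoint would be the more natural choice.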

