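I want to extract tables from a batch of PDFs. For this I am using an AWS Textract Python pipeline.
How can I do this without SNS and SQS? I want it to be synchronous: give my pipeline a PDF file, call AWS Textract, and get the results.
Here is what I am using at the moment; please suggest what I should change: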

    import boto3
    import time


    def startJob(s3BucketName, objectName):
        # Start an asynchronous text-detection job for a document stored in S3
        client = boto3.client('textract')
        response = client.start_document_text_detection(
            DocumentLocation={
                'S3Object': {
                    'Bucket': s3BucketName,
                    'Name': objectName
                }
            })
        return response["JobId"]


    def isJobComplete(jobId):
        # For production use cases, use SNS based notification
        # Details at: https://docs.aws.amazon.com/textract/latest/dg/api-async.html
        time.sleep(5)
        client = boto3.client('textract')
        response = client.get_document_text_detection(JobId=jobId)
        status = response["JobStatus"]
        print("Job status: {}".format(status))

        # Poll every 5 seconds until the job leaves the IN_PROGRESS state
        while status == "IN_PROGRESS":
            time.sleep(5)
            response = client.get_document_text_detection(JobId=jobId)
            status = response["JobStatus"]
            print("Job status: {}".format(status))

        # Return the final status so the caller can tell success from failure
        return status


    def getJobResults(jobId):
        pages = []
        client = boto3.client('textract')
        response = client.get_document_text_detection(JobId=jobId)
        pages.append(response)
        print("Resultset page received: {}".format(len(pages)))

        # Follow NextToken until every page of results has been fetched
        nextToken = response.get('NextToken')
        while nextToken:
            response = client.get_document_text_detection(JobId=jobId, NextToken=nextToken)
            pages.append(response)
            print("Resultset page received: {}".format(len(pages)))
            nextToken = response.get('NextToken')

        return pages


    # Document
    s3BucketName = "ki-textract-demo-docs"
    documentName = "Amazon-Textract-Pdf.pdf"

    jobId = startJob(s3BucketName, documentName)
    print("Started job with id: {}".format(jobId))

    status = isJobComplete(jobId)
    if status in ("SUCCEEDED", "PARTIAL_SUCCESS"):
        response = getJobResults(jobId)
        # print(response)

        # Print detected text
        for resultPage in response:
            for item in resultPage["Blocks"]:
                if item["BlockType"] == "LINE":
                    print('\033[94m' + item["Text"] + '\033[0m')
    else:
        print("Job did not succeed: {}".format(status))

2 Answers

Answer #1 (juzqafwq)
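You currently cannot process PDF documents synchronously with Textract directly. From the Textract documentation:

"Amazon Textract synchronous operations (DetectDocumentText and AnalyzeDocument) support the PNG and JPEG image formats. Asynchronous operations (StartDocumentTextDetection, StartDocumentAnalysis) also support the PDF file format."

One workaround is to convert the PDF document into images in your code, and then process the document by calling the synchronous API operations on those images.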
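
The workaround described above could be sketched roughly as follows. This is a minimal sketch under stated assumptions, not a definitive implementation: it assumes the third-party pdf2image package (which needs poppler installed) for rendering the pages, and the helper names `pdf_to_png_bytes`, `tables_from_blocks` and `extract_tables` are made up for illustration. Because the question is about tables, it calls the synchronous AnalyzeDocument operation with the TABLES feature instead of DetectDocumentText.

```python
import io


def pdf_to_png_bytes(pdf_path, dpi=200):
    """Render every page of a PDF to PNG bytes.

    Assumption: the third-party pdf2image package (and poppler) is installed.
    """
    from pdf2image import convert_from_path  # imported lazily: optional dependency

    pages = []
    for image in convert_from_path(pdf_path, dpi=dpi):
        buffer = io.BytesIO()
        image.save(buffer, format="PNG")
        pages.append(buffer.getvalue())
    return pages


def _child_ids(block):
    """Return the ids listed in all CHILD relationships of a Textract block."""
    return [
        child_id
        for rel in block.get("Relationships", [])
        if rel["Type"] == "CHILD"
        for child_id in rel["Ids"]
    ]


def tables_from_blocks(blocks):
    """Turn the Blocks of one AnalyzeDocument response into lists of rows."""
    by_id = {block["Id"]: block for block in blocks}
    tables = []
    for block in blocks:
        if block["BlockType"] != "TABLE":
            continue
        # Map (row, column) -> cell text; Textract indices start at 1
        cells = {}
        for cell_id in _child_ids(block):
            cell = by_id[cell_id]
            if cell["BlockType"] != "CELL":
                continue
            words = [
                by_id[word_id]["Text"]
                for word_id in _child_ids(cell)
                if by_id[word_id]["BlockType"] == "WORD"
            ]
            cells[(cell["RowIndex"], cell["ColumnIndex"])] = " ".join(words)
        n_rows = max((row for row, _ in cells), default=0)
        n_cols = max((col for _, col in cells), default=0)
        tables.append(
            [[cells.get((r, c), "") for c in range(1, n_cols + 1)]
             for r in range(1, n_rows + 1)]
        )
    return tables


def extract_tables(client, page_images):
    """Call the synchronous AnalyzeDocument API once per page image."""
    all_tables = []
    for image_bytes in page_images:
        response = client.analyze_document(
            Document={"Bytes": image_bytes},
            FeatureTypes=["TABLES"],
        )
        all_tables.extend(tables_from_blocks(response["Blocks"]))
    return all_tables


# Possible usage (needs boto3, AWS credentials, and a real file path):
#   import boto3
#   tables = extract_tables(boto3.client("textract"), pdf_to_png_bytes("document.pdf"))
```

The Textract client is passed in as an argument so the parsing logic can be tried against a stand-in object without AWS access. Images sent as bytes are subject to Textract's document size limits, so a lower `dpi` may be needed for large pages.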

Answer #2 (2g32fytz)
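Thanks for the answers; they helped me dig further. I found that the detect_document_text method in Textract can be used to extract text from a PDF document, on the condition that *the PDF document has only one page*. This is a synchronous process, and there is no need to convert the PDF into images at all.
Here is the reference link from AWS: https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/textract/client/detect_document_text.html
Below is the code snippet, where I pass the binary content from the S3 object.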
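
A minimal sketch of the approach described above, not the answerer's original snippet: it reads the object's bytes from S3 and passes them to the synchronous `detect_document_text` call. The function name `detect_single_page_pdf` and the bucket and key in the usage comment are illustrative assumptions, and both clients are passed in so the function can be tried with stand-ins.

```python
def detect_single_page_pdf(s3_client, textract_client, bucket, key):
    """Synchronously extract text lines from a single-page PDF stored in S3."""
    # Download the raw bytes of the document from S3
    pdf_bytes = s3_client.get_object(Bucket=bucket, Key=key)["Body"].read()

    # Synchronous call; per the answer above, the PDF must have only one page
    response = textract_client.detect_document_text(Document={"Bytes": pdf_bytes})

    # Keep only the LINE blocks, in the order Textract returned them
    return [
        block["Text"]
        for block in response["Blocks"]
        if block["BlockType"] == "LINE"
    ]


# Possible usage (needs boto3 and AWS credentials; names are hypothetical):
#   import boto3
#   lines = detect_single_page_pdf(
#       boto3.client("s3"),
#       boto3.client("textract"),
#       "my-example-bucket",
#       "single-page.pdf",
#   )
#   print("\n".join(lines))
```

For a multi-page PDF, this synchronous call is not expected to work, so the asynchronous job from the question or a page-by-page split would still be needed.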