Document AI batch process operation different payload returned
Question
I'm trying to split documents using Document AI, following the batch processing samples in the official Document AI GitHub repository.
The batch process function returns a long-running operation.
The operation is then polled, and the results are located using the operation metadata.
When I retrieve the operation separately using the long-running operations API, I get the operation object, but its metadata is in a different form, so I can't process the documents any further.
To retrieve the operation later, I'm using the get_operation function from the same repository.
Thanks in advance!
Repository link:
https://github.com/GoogleCloudPlatform/python-docs-samples/tree/239d42f8dcb564db35c0b9fc79d8c07f6f6fe489/documentai/snippets
The batch_process_sample works fine, but I need to get the same result when I retrieve the operation object separately and process it in the same way.
Answer 1
Score: 0
The Operation data returned from get_operation() is in a slightly different format than what batch_process_documents() returns directly: the BatchProcessMetadata comes back serialized. This seems to be a quirk of how Google APIs handle operations.
The code samples and documentation don't cover this, but I figured out how to do it using the built-in methods. (I'm in the process of adding features to the Document AI Toolbox SDK that pull the Document output from the GCS URIs in BatchProcessMetadata, or from an Operation name, to make this easier.)
Update: Code for Document AI Toolbox
from google.cloud import documentai
from google.cloud.documentai_toolbox import document

project_id = "YOUR_PROJECT_ID"
location = "YOUR_PROCESSOR_LOCATION"

# `client` and `request` are assumed to be set up as in batch_process_sample
operation = client.batch_process_documents(request)

# Format: projects/{project_id}/locations/{location}/operations/15842030886767182557
operation_name = operation.operation.name

# Use these wrapped documents to get the extraction information you need.
wrapped_documents = document.Document.from_batch_process_operation(
    location=location, operation_name=operation_name
)
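If only the operation name is saved for later (in a database or a queue message, say), the location needed for the regional endpoint can be recovered from the name itself. This is a minimal sketch that assumes the name follows the `projects/{project_id}/locations/{location}/operations/{id}` format shown in the comment above. `parse_operation_name` and `api_endpoint_for` are hypothetical helpers written for illustration, not part of any Google library.

```python
import re
from typing import NamedTuple


class OperationName(NamedTuple):
    """Components of a long-running operation resource name."""
    project_id: str
    location: str
    operation_id: str


# Assumed format: projects/{project_id}/locations/{location}/operations/{id}
_OPERATION_NAME = re.compile(
    r"^projects/(?P<project_id>[^/]+)"
    r"/locations/(?P<location>[^/]+)"
    r"/operations/(?P<operation_id>[^/]+)$"
)


def parse_operation_name(name: str) -> OperationName:
    """Split an operation name into project, location and operation ID."""
    match = _OPERATION_NAME.match(name)
    if match is None:
        raise ValueError(f"Unexpected operation name format: {name!r}")
    return OperationName(**match.groupdict())


def api_endpoint_for(location: str) -> str:
    """Build the regional endpoint string used in the answer's ClientOptions."""
    return f"{location}-documentai.googleapis.com"
```

With these, a stored name is enough to rebuild both arguments: `parse_operation_name(operation_name).location` gives the location for the Toolbox call or for `api_endpoint_for()`.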
Main APIs
import time

from google.api_core.client_options import ClientOptions
from google.cloud import documentai
from google.longrunning.operations_pb2 import GetOperationRequest

project_id = "YOUR_PROJECT_ID"
location = "YOUR_PROCESSOR_LOCATION"
operation_name = (
    f"projects/{project_id}/locations/{location}/operations/15842030886767182557"
)

client = documentai.DocumentProcessorServiceClient(
    client_options=ClientOptions(api_endpoint=f"{location}-documentai.googleapis.com")
)

# Poll until the operation has finished
while True:
    operation = client.get_operation(
        request=GetOperationRequest(name=operation_name)
    )
    if operation.done:
        break
    # Pause between polls so the API is not called in a tight loop
    time.sleep(5)

# The BatchProcessMetadata information is serialized and must be deserialized to access the values
metadata = documentai.BatchProcessMetadata.deserialize(operation.metadata.value)

# Get the individual_process_statuses
for process in list(metadata.individual_process_statuses):
    # Handle the response however you need
    print(process.output_gcs_destination)
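Each `output_gcs_destination` is a Cloud Storage URI string, so processing the output further usually starts with splitting it into a bucket name and an object prefix. The sketch below does that with the standard library only. `split_gcs_uri` is a hypothetical helper, and the trailing slash it appends is a design choice for illustration so that the prefix matches only objects under that folder.

```python
from typing import Tuple
from urllib.parse import urlparse


def split_gcs_uri(uri: str) -> Tuple[str, str]:
    """Split a gs://bucket/path URI into (bucket, prefix)."""
    parsed = urlparse(uri)
    if parsed.scheme != "gs" or not parsed.netloc:
        raise ValueError(f"Not a Cloud Storage URI: {uri!r}")
    prefix = parsed.path.lstrip("/")
    # Terminate the prefix so ".../0" does not also match ".../01"
    if prefix and not prefix.endswith("/"):
        prefix += "/"
    return parsed.netloc, prefix
```

The resulting pair can then be handed to a storage client to list and download the output JSON files, for example with `list_blobs(bucket, prefix=prefix)` from the google-cloud-storage package, if that is the client in use.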