Document AI batch process operation returns a different payload

Question

I'm trying to split documents with Document AI, following the batch processing sample in the official Document AI GitHub repository.

The batch process function returns a long-running operation.
The sample then polls that operation and reads the results from the operation metadata.

When I retrieve the operation separately through the long-running operations API, I get the operation object, but its metadata is in a different form, so I can't process the documents any further.

To retrieve the operation later, I'm using the get_operation function from the same repository.

Thanks in advance!

Repository link:
https://github.com/GoogleCloudPlatform/python-docs-samples/tree/239d42f8dcb564db35c0b9fc79d8c07f6f6fe489/documentai/snippets

batch_process_sample works fine, but I need to get the same result when I retrieve the operation object separately and process it in the same way.

Answer 1

Score: 0

The Operation returned by get_operation() is in a slightly different format from the one returned directly by batch_process_documents(). This seems to be a quirk of how Google APIs handle operations.
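
To make the difference concrete: as far as I can tell, batch_process_documents() hands back a wrapper whose metadata is already a BatchProcessMetadata message, while get_operation() hands back the raw long-running Operation, whose metadata is a packed Any with the serialized bytes in its value field. Here is a minimal sketch of a helper that accepts either shape. The names batch_metadata and deserialize are made up for this illustration, and the two shapes are imitated with stand-in objects so the sketch runs without the client library.

```python
from types import SimpleNamespace
from typing import Any, Callable


def batch_metadata(operation: Any, deserialize: Callable[[bytes], Any]) -> Any:
    """Return usable metadata from either kind of operation object.

    Hypothetical helper: `deserialize` is whatever turns the packed bytes
    into a message, e.g. documentai.BatchProcessMetadata.deserialize.
    """
    metadata = operation.metadata
    # A packed Any carries the serialized message in its `value` field.
    packed = getattr(metadata, "value", None)
    if isinstance(packed, (bytes, bytearray)):
        return deserialize(bytes(packed))
    # Otherwise assume the client library has already unpacked it.
    return metadata


# Stand-ins for the two shapes (assumptions, not real API objects).
raw = SimpleNamespace(metadata=SimpleNamespace(type_url="example", value=b"payload"))
wrapped = SimpleNamespace(metadata=SimpleNamespace(state="SUCCEEDED"))

print(batch_metadata(raw, bytes.decode))
print(batch_metadata(wrapped, bytes.decode).state)
```

With the real library you would pass documentai.BatchProcessMetadata.deserialize as the second argument instead of bytes.decode.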

The code samples and documentation don't cover this, but I figured out how to do it with the built-in methods. (I'm in the process of adding features to the Document AI Toolbox SDK that pull the Document output from the GCS URIs in BatchProcessMetadata, or from an Operation name, to make this easier.)

Update: Code for Document AI Toolbox

from google.cloud import documentai
from google.cloud.documentai_toolbox import document

project_id = "YOUR_PROJECT_ID"
location = "YOUR_PROCESSOR_LOCATION"

# `client` is a documentai.DocumentProcessorServiceClient and `request` is a
# documentai.BatchProcessRequest, both built as in batch_process_sample.
operation = client.batch_process_documents(request)
# Format: projects/{project_id}/locations/{location}/operations/15842030886767182557
operation_name = operation.operation.name

# Use the wrapped documents to get the extraction information you need.
wrapped_documents = document.Document.from_batch_process_operation(
    location=location, operation_name=operation_name
)
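
If only the operation name was saved for later, the location can be recovered from the name itself, since it follows the format shown in the comment above. The following is a small sketch under that assumption. OperationName and parse_operation_name are hypothetical names for this example, not part of any Google library, and the project name is a placeholder.

```python
import re
from typing import NamedTuple


class OperationName(NamedTuple):
    project: str
    location: str
    operation_id: str


# Matches projects/{project_id}/locations/{location}/operations/{id}
_PATTERN = re.compile(
    r"^projects/(?P<project>[^/]+)/locations/(?P<location>[^/]+)"
    r"/operations/(?P<operation_id>[^/]+)$"
)


def parse_operation_name(name: str) -> OperationName:
    """Split a stored operation name into its three parts."""
    match = _PATTERN.match(name)
    if match is None:
        raise ValueError(f"Unexpected operation name: {name!r}")
    return OperationName(**match.groupdict())


parsed = parse_operation_name(
    "projects/my-project/locations/us/operations/15842030886767182557"
)
print(parsed.location)  # the value to pass as `location`
```

The same location value is what the regional API endpoint needs when the client is created in a separate process.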

Main APIs

import time

from google.api_core.client_options import ClientOptions
from google.cloud import documentai
from google.longrunning.operations_pb2 import GetOperationRequest

project_id = "YOUR_PROJECT_ID"
location = "YOUR_PROCESSOR_LOCATION"
operation_name = (
    f"projects/{project_id}/locations/{location}/operations/15842030886767182557"
)
client = documentai.DocumentProcessorServiceClient(
    client_options=ClientOptions(api_endpoint=f"{location}-documentai.googleapis.com")
)

# Poll until the operation finishes, pausing between requests
while True:
    operation = client.get_operation(
        request=GetOperationRequest(name=operation_name)
    )

    if operation.done:
        break

    time.sleep(10)

# The BatchProcessMetadata is serialized and must be deserialized to access its values
metadata = documentai.BatchProcessMetadata.deserialize(operation.metadata.value)

# Get the individual_process_statuses
for process in metadata.individual_process_statuses:
    # Handle the response however you need
    print(process.output_gcs_destination)
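
Each output_gcs_destination printed above is a gs:// URI, so reading the results usually starts with splitting it into a bucket name and an object prefix. Here is a minimal sketch of that step. split_gcs_uri is a hypothetical helper written for this example, and the URI in the usage lines is made up.

```python
from typing import Tuple


def split_gcs_uri(uri: str) -> Tuple[str, str]:
    """Split 'gs://bucket/some/prefix' into ('bucket', 'some/prefix')."""
    scheme = "gs://"
    if not uri.startswith(scheme):
        raise ValueError(f"Not a GCS URI: {uri!r}")
    # Everything up to the first slash is the bucket; the rest is the prefix.
    bucket, _, prefix = uri[len(scheme):].partition("/")
    if not bucket:
        raise ValueError(f"Missing bucket name in: {uri!r}")
    return bucket, prefix


# Example URI (an assumption, not real output).
bucket, prefix = split_gcs_uri("gs://my-bucket/output/1234/0")
print(bucket, prefix)
```

The two values can then be handed to the Cloud Storage client, for example storage.Client().list_blobs(bucket, prefix=prefix), to find the output JSON files.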

huangapple
  • Posted on 2023-04-04 03:09:10
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75922977.html