Batch job submission error "Failed to process all documents", URIs seem correct?

Question

"I've been trying to get Document AI batch submission working and having some difficulty. I have single file submission working using RawDocument and suppose I could just iterate over my data set (27k images) but chose batch since it seems like the more appropriate technique.

When I run my code I am seeing an error: "Failed to process all documents". The first few lines of the debug information are:

O:17:"Google\Rpc\Status":5:{
s:7:"*code";i:3;s:10:"*message";s:32:"Failed to process all documents.";
s:26:"Google\Rpc\Statusdetails";
O:38:"Google\Protobuf\Internal\RepeatedField":4:{
s:49:"Google\Protobuf\Internal\RepeatedFieldcontainer";a:0:{}s:44:"Google\Protobuf\Internal\RepeatedFieldtype";i:11;s:45:"Google\Protobuf\Internal\RepeatedFieldklass";s:19:"Google\Protobuf\Any";s:52:"Google\Protobuf\Internal\RepeatedFieldlegacy_klass";s:19:"Google\Protobuf\Any";}s:38:"Google\Protobuf\Internal\Messagedesc";O:35:"Google\Protobuf\Internal\Descriptor":13:{s:46:"Google\Protobuf\Internal\Descriptorfull_name";s:17:"google.rpc.Status";s:42:"Google\Protobuf\Internal\Descriptorfield";a:3:{i:1;O:40:"Google\Protobuf\Internal\FieldDescriptor":14:{s:46:"Google\Protobuf\Internal\FieldDescriptorname";s:4:"code";```

The support page for this error states that the reason for the error is:

> The gcsUriPrefix and gcsOutputConfig.gcsUri parameters need to begin with gs:// and end with a trailing backslash character (/). Check the configuration for the Bucket URIs.

I am not using gcsUriPrefix (should I be? My bucket holds more files than the max batch limit), but my gcsOutputConfig.gcsUri is within these limits. The file list I've provided gives file names (pointed at the right bucket), so they should not have a trailing slash.

Advice welcome.

```php
// Assumed imports for the google/cloud-documentai and google/cloud-storage packages.
use Google\Cloud\DocumentAI\V1\BatchDocumentsInputConfig;
use Google\Cloud\DocumentAI\V1\DocumentOutputConfig;
use Google\Cloud\DocumentAI\V1\DocumentOutputConfig\GcsOutputConfig;
use Google\Cloud\DocumentAI\V1\DocumentProcessorServiceClient;
use Google\Cloud\DocumentAI\V1\GcsDocument;
use Google\Cloud\DocumentAI\V1\GcsDocuments;
use Google\Cloud\Storage\StorageClient;

function filesFromBucket( $directoryPrefix ) {
    // NOT recursive, does not search the structure
    $gcsDocumentList = [];

    // see https://cloud.google.com/storage/docs/samples/storage-list-files-with-prefix
    $bucketName = 'my-input-bucket';
    $storage = new StorageClient();
    $bucket = $storage->bucket($bucketName);
    $options = ['prefix' => $directoryPrefix];
    foreach ($bucket->objects($options) as $object) {
        $doc = new GcsDocument();
        $doc->setGcsUri('gs://'.$object->name());
        $doc->setMimeType($object->info()['contentType']);
        array_push( $gcsDocumentList, $doc );
    }

    $gcsDocuments = new GcsDocuments();
    $gcsDocuments->setDocuments($gcsDocumentList);
    return $gcsDocuments;
}
```

```php
function batchJob ( ) {
    $inputConfig = new BatchDocumentsInputConfig( ['gcs_documents'=>filesFromBucket('the-bucket-path/')] );

    // see https://cloud.google.com/php/docs/reference/cloud-document-ai/latest/V1.DocumentOutputConfig
    // nb: all uri paths must end with / or an error will be generated.
    $outputConfig = new DocumentOutputConfig(
        [ 'gcs_output_config' =>
               new GcsOutputConfig( ['gcs_uri'=>'gs://my-output-bucket/'] ) ]
    );

    // see https://cloud.google.com/php/docs/reference/cloud-document-ai/latest/V1.DocumentProcessorServiceClient
    $documentProcessorServiceClient = new DocumentProcessorServiceClient();
    try {
        // derived from the prediction endpoint
        $name = 'projects/######/locations/us/processors/#######';
        $operationResponse = $documentProcessorServiceClient->batchProcessDocuments($name, ['inputDocuments'=>$inputConfig, 'documentOutputConfig'=>$outputConfig]);
        $operationResponse->pollUntilComplete();
        if ($operationResponse->operationSucceeded()) {
            $result = $operationResponse->getResult();
            printf('<br>result: %s<br>', serialize($result));
            // doSomethingWith($result)
        } else {
            $error = $operationResponse->getError();
            printf('<br>error: %s<br>', serialize($error));
            // handleError($error)
        }
    } finally {
        $documentProcessorServiceClient->close();
    }
}
```

Answer 1 (score: 2)

This turns out to be an ID-10-T error, with definite PEBKAC overtones.

`$object->name()` does not return the bucket name as part of the path.

Changing `$doc->setGcsUri('gs://'.$object->name());` to `$doc->setGcsUri('gs://'.$bucketName.'/'.$object->name());` resolves the issue.
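
Applied to the loop in the question, the corrected URI construction looks like this:

```php
// Corrected: include the bucket name when building the gs:// URI.
foreach ($bucket->objects($options) as $object) {
    $doc = new GcsDocument();
    $doc->setGcsUri('gs://'.$bucketName.'/'.$object->name());
    $doc->setMimeType($object->info()['contentType']);
    $gcsDocumentList[] = $doc;
}
```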

Answer 2 (score: 1)

Usually, the reason for the error "Failed to process all documents" is incorrect syntax for the input files or output bucket: an incorrectly formatted path can still be a "valid" path for Cloud Storage, just not the files you're expecting. (Thank you for checking the error messages page first!)

You don't have to use gcsUriPrefix if you're providing a list of specific documents to process. That said, based on your code it looks like you're adding all of the files from a GCS directory to the BatchDocumentsInputConfig.gcs_documents field anyway, so it would make sense to send the prefix in BatchDocumentsInputConfig.gcs_uri_prefix instead of a list of individual files, as sketched below.
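
A minimal sketch of the prefix-based input, assuming the same google/cloud-documentai v1 classes the question already uses (GcsPrefix carries the gcs_uri_prefix field):

```php
// Sketch: point the batch request at a whole prefix instead of listing files.
use Google\Cloud\DocumentAI\V1\BatchDocumentsInputConfig;
use Google\Cloud\DocumentAI\V1\GcsPrefix;

$gcsPrefix = new GcsPrefix(['gcs_uri_prefix' => 'gs://my-input-bucket/the-bucket-path/']);
$inputConfig = new BatchDocumentsInputConfig(['gcs_prefix' => $gcsPrefix]);
// With a prefix, there is no need to build per-file GcsDocument entries.
```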

Note: There is a maximum number of files (1000) that can be sent in an individual batch processing request, and specific processors have their own limits for pages.

https://cloud.google.com/document-ai/quotas#content_limits

You can try splitting the files into multiple batch requests to avoid hitting this limit. The Document AI Toolbox Python SDK has built-in functions for this, but you can re-implement the idea in PHP for your own use case: https://github.com/googleapis/python-documentai-toolbox/blob/ba354d8af85cbea0ad0cd2501e041f21e9e5d765/google/cloud/documentai_toolbox/utilities/gcs_utilities.py#L213
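
A rough PHP sketch of that batching idea, reusing the question's setup ($gcsDocumentList is the array of GcsDocument objects built in filesFromBucket(), and $name and $outputConfig are as in batchJob()):

```php
// Sketch: split a large object list into batch requests of at most 1000 files.
use Google\Cloud\DocumentAI\V1\BatchDocumentsInputConfig;
use Google\Cloud\DocumentAI\V1\GcsDocuments;

$maxFilesPerRequest = 1000; // per-request limit from the quotas page

foreach (array_chunk($gcsDocumentList, $maxFilesPerRequest) as $chunk) {
    $gcsDocuments = new GcsDocuments();
    $gcsDocuments->setDocuments($chunk);
    $inputConfig = new BatchDocumentsInputConfig(['gcs_documents' => $gcsDocuments]);

    $operationResponse = $documentProcessorServiceClient->batchProcessDocuments(
        $name,
        ['inputDocuments' => $inputConfig, 'documentOutputConfig' => $outputConfig]
    );
    $operationResponse->pollUntilComplete(); // serial; the LROs could also be collected and polled together
}
```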
