July 13, 2023, 17:17:28

AWS Textract API not showing table data in multipage documents (only shows table on 1st page)

Question

I have been working on a Node.js script that uses AWS Textract to extract tables and forms from PDF documents. The problem I'm facing is that when I use Textract's async (and even sync) operations, I don't get any tables after the first page of documents uploaded to S3. All the text data and form key-values seem fine, but the response shows no tables after page 1.

Interestingly, the tables are recognized and shown in the CSV results of the Textract BulkUploader in the AWS Console, which is very strange!

When I use the aws-sdk, the "Blocks" in the Textract API response don't contain any block with a BlockType of "TABLE" on pages after page 1. Please help me with this, as the results shown in the AWS Console do in fact include the tables after page 1. So why is there a difference when I make the API calls through a script? Any help will be much appreciated!

Here is the code I have tried:

const fs = require("fs");
const {
  TextractClient,
  StartDocumentAnalysisCommand,
  GetDocumentAnalysisCommand,
} = require("@aws-sdk/client-textract");

// Client setup was not shown in the original post; this assumes the region
// and credentials come from the environment.
const textractClient = new TextractClient({});

// Start an asynchronous analysis job for a document stored in S3
const startJob = async (file, bucket) => {
  try {
    const params = {
      DocumentLocation: {
        S3Object: {
          Bucket: bucket,
          Name: file,
        },
      },
      FeatureTypes: ["FORMS", "TABLES"],
    };
    const command = new StartDocumentAnalysisCommand(params);
    const response = await textractClient.send(command);
    const jobId = response.JobId;
    console.log("Textract job started with ID:", jobId);
    // Wait for the job to complete
    await waitForJobCompletion(jobId, file);
  } catch (err) {
    console.log("Error starting Textract job:", err);
  }
};

// Wait for the Textract job to complete
const waitForJobCompletion = async (jobId, file) => {
  try {
    const jobParams = {
      JobId: jobId,
    };
    let response;
    let jobStatus;
    do {
      //   const command = new GetDocumentTextDetectionCommand(params); //for text detection
      const command = new GetDocumentAnalysisCommand(jobParams);
      response = await textractClient.send(command);
      jobStatus = response.JobStatus;
      console.log("Job status:", jobStatus);
      if (jobStatus === "SUCCEEDED") {
        // Job completed successfully, retrieve the results
        if (response && response.Blocks) {
          fs.writeFile(`./s3-textract-results/tabledata.json`, JSON.stringify(response), 'utf8', (err) => {
            if (err) {
              console.error('Error writing to file:', err);
            } else {
              console.log('Data written to file.');
            }
          });
          console.log(response.Blocks);
        }
      } else if (jobStatus === "FAILED" || jobStatus === "PARTIAL_SUCCESS") {
        // Job failed or partially succeeded, handle the error
        console.log("Job failed or partially succeeded:", response);
      } else {
        // Job is still in progress, wait for a while and check again
        await new Promise((resolve) => setTimeout(resolve, 10000)); // Wait for 10 seconds
      }
    } while (jobStatus === "IN_PROGRESS" || jobStatus === "PARTIAL_SUCCESS");
  } catch (err) {
    console.log("Error retrieving Textract job results:", err);
  }
};

Answer 1

Score: 1

The GetDocumentAnalysis response is paginated, as indicated by the presence of a NextToken in the response. I don't see you passing that token in your subsequent calls.

I'd recommend adding that in and seeing if you get the complete results.
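
One way to check whether the results are complete is to count TABLE blocks per page once the blocks have been gathered. The sketch below is only an illustration: the helper name countTablesByPage and the sample data are hypothetical, and it assumes each block carries the BlockType and Page fields that Textract returns.

```javascript
// Count TABLE blocks per page in a list of Textract Block objects.
// The helper name and the sample data are hypothetical, for illustration only.
function countTablesByPage(blocks) {
  const counts = {};
  for (const block of blocks) {
    if (block.BlockType !== "TABLE") continue;
    // Assumption: a block without a Page value belongs to page 1.
    const page = block.Page ?? 1;
    counts[page] = (counts[page] || 0) + 1;
  }
  return counts;
}

// Made-up blocks standing in for a merged Textract result.
const sampleBlocks = [
  { BlockType: "PAGE", Page: 1 },
  { BlockType: "TABLE", Page: 1 },
  { BlockType: "LINE", Page: 2 },
  { BlockType: "TABLE", Page: 2 },
  { BlockType: "TABLE", Page: 2 },
];

console.log(countTablesByPage(sampleBlocks));
```

If pages after the first are missing from the counts even though the document has tables there, the later parts of the response were probably never fetched.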

Answer 2

Score: 0

I made it work by using the "NextToken" in the response whenever the status was "SUCCEEDED" but the results had not yet been fully retrieved. You have to pass the NextToken in the jobParams of the subsequent requests to get the rest of the response. Textract returns a multi-part response when the result is too big (for example, when the file is large).
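
A minimal sketch of that loop, not the exact code used here: collectAllBlocks and its sendPage parameter are hypothetical names, and sendPage is assumed to take the request parameters and resolve to one GetDocumentAnalysis response. Injecting it keeps the loop independent of the SDK, so it can be tried with a fake.

```javascript
// Collect all Blocks of a finished analysis job by following NextToken.
// `sendPage` is a hypothetical injected function: it receives
// { JobId, NextToken? } and resolves to one response part.
async function collectAllBlocks(sendPage, jobId) {
  const blocks = [];
  let nextToken;
  do {
    const params = { JobId: jobId };
    if (nextToken) params.NextToken = nextToken; // omit on the first request
    const part = await sendPage(params);
    for (const block of part.Blocks ?? []) blocks.push(block);
    nextToken = part.NextToken; // undefined once the last part is reached
  } while (nextToken);
  return blocks;
}

// Possible wiring with the names from the question (assumes the job has
// already reached SUCCEEDED and this runs inside an async function):
// const blocks = await collectAllBlocks(
//   (params) => textractClient.send(new GetDocumentAnalysisCommand(params)),
//   jobId
// );
```

The loop stops when a response no longer contains a NextToken, which is the point at which all parts, including the TABLE blocks from later pages, should have been gathered.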
