AWS Textract API not showing table data in multipage documents (only shows table on 1st page)

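Question

I have been working on a Node.js script that uses AWS Textract to extract tables and forms from PDF documents. The problem is that with the async (and even the sync) Textract operations, I get no tables after the first page of documents uploaded to S3. All the text and the form key-value pairs look fine, but the response shows no tables after page 1.

Interestingly, the Textract BulkUploader in the AWS Console does recognize those tables and shows them in its CSV results, which is very strange.

When I use the aws-sdk, the "Blocks" in the Textract API response contain no block with a BlockType of "TABLE" on any page after page 1, yet the AWS Console results do show tables after page 1. Why is there a difference when I make the API calls from a script? Any help would be much appreciated!

Here is the code I have tried: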

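const fs = require("fs");
const {
  TextractClient,
  StartDocumentAnalysisCommand,
  GetDocumentAnalysisCommand,
} = require("@aws-sdk/client-textract");

// Assumption: region and credentials are resolved from the environment.
const textractClient = new TextractClient({});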

const startJob = async (file, bucket) => {
  try {
    const params = {
      DocumentLocation: {
        S3Object: {
          Bucket: bucket,
          Name: file,
        },
      },
      FeatureTypes: ["FORMS", "TABLES"],
    };
    const command = new StartDocumentAnalysisCommand(params);
    const response = await textractClient.send(command);
    const jobId = response.JobId;

    console.log("Textract job started with ID:", jobId);

    // Wait for the job to complete
    await waitForJobCompletion(jobId, file);
  } catch (err) {
    console.log("Error starting Textract job:", err);
  }
};

// Wait for the Textract job to complete
const waitForJobCompletion = async (jobId, file) => {
  try {
    const jobParams = {
      JobId: jobId,
    };

    let response;
    let jobStatus;

    do {
      //   const command = new GetDocumentTextDetectionCommand(params); //for text detection
      const command = new GetDocumentAnalysisCommand(jobParams);

      response = await textractClient.send(command);
      jobStatus = response.JobStatus;

      console.log("Job status:", jobStatus);

      if (jobStatus === "SUCCEEDED") {
        // Job completed successfully, retrieve the results
        if (response && response.Blocks) {
          fs.writeFile(`./s3-textract-results/tabledata.json`, JSON.stringify(response), 'utf8', (err) => {
            if (err) {
              console.error('Error writing to file:', err);
            } else {
              console.log('Data written to file.');
            }
          });
          console.log(response.Blocks);
        }
      } else if (jobStatus === "FAILED" || jobStatus === "PARTIAL_SUCCESS") {
        // Job failed or partially succeeded, handle the error
        console.log("Job failed or partially succeeded:", response);
      } else {
        // Job is still in progress, wait for a while and check again
        await new Promise((resolve) => setTimeout(resolve, 10000)); // Wait for 10 seconds
      }
    } while (jobStatus === "IN_PROGRESS" || jobStatus === "PARTIAL_SUCCESS");
  } catch (err) {
    console.log("Error retrieving Textract job results:", err);
  }
};

Answer 1

Score: 1

The GetDocumentAnalysis response is paginated, as indicated by the presence of a NextToken in the response. I don't see you using that token in your subsequent calls.

I'd recommend adding it and seeing whether you get the complete results.
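
As a minimal sketch of that suggestion, the loop below keeps requesting pages until no NextToken comes back and merges the Blocks from each page. The names `collectAllBlocks` and `fetchPage` are hypothetical helpers, not part of the SDK; `fetchPage` is assumed to wrap one GetDocumentAnalysis call.

```javascript
// Collects the Blocks from every page of a paginated Textract result.
// `fetchPage` is a hypothetical callback: given a NextToken (undefined for
// the first request), it resolves to one GetDocumentAnalysis response.
async function collectAllBlocks(fetchPage) {
  const blocks = [];
  let nextToken; // undefined on the first request

  do {
    const page = await fetchPage(nextToken);
    for (const block of page.Blocks ?? []) {
      blocks.push(block);
    }
    // A missing NextToken means there are no further pages.
    nextToken = page.NextToken;
  } while (nextToken);

  return blocks;
}

// Possible wiring with the SDK, assuming a configured `textractClient`
// and a job that already reports SUCCEEDED:
//
// const blocks = await collectAllBlocks((token) =>
//   textractClient.send(
//     new GetDocumentAnalysisCommand({ JobId: jobId, NextToken: token })
//   )
// );
```

Keeping the SDK call behind a callback makes the loop easy to try out with canned responses before pointing it at a real job.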

Answer 2

Score: 0

I got it working by using the "NextToken" in the response: the status can be "SUCCEEDED" while part of the result still remains to be fetched. You have to pass the NextToken in the jobParams of each subsequent request to get the rest of the response. Textract returns a multi-part response when the result is too big (for example, when the file is large).
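
To check that the later pages really contain tables once all parts of the response are merged, a small helper along these lines can count TABLE blocks per page. `countTablesByPage` is a hypothetical name, and treating a missing `Page` field as page 1 is an assumption of this sketch.

```javascript
// Counts TABLE blocks per document page in a merged list of Textract blocks.
function countTablesByPage(blocks) {
  const counts = {};
  for (const block of blocks) {
    if (block.BlockType !== "TABLE") continue;
    // Assumption: a block without a Page field belongs to page 1.
    const page = block.Page ?? 1;
    counts[page] = (counts[page] || 0) + 1;
  }
  return counts;
}
```

If the result only ever has an entry for page 1, the remaining parts of the response have probably not been fetched yet.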

Posted by huangapple on July 13, 2023 at 17:17:28
Source: https://go.coder-hub.com/76677782.html