Extract text from pdf/doc/docx file using AWS Lambda (Node.js)
Question
I'm trying to extract text from a doc/docx/pdf file in an AWS Lambda function written in Node.js.
I need to extract the text as an array of words. I've tried a few different npm packages, but I've noticed that it just skips over those function calls.
AWS Lambda Function:
import { PDFExtract } from "pdf.js-extract";
...
export const handler = async (event) => {
  ...
  const pdfExtract = new PDFExtract();
  const tempFilePath = join(tmpdir(), "resume.pdf");
  const buffer = readFileSync(tempFilePath);
  const wordsList = [];
  await pdfExtract.extractBuffer(buffer, {}, (err, data) => {
    if (err)
      return console.log(err);
    data.pages[0].content.forEach((e) => {
      const str = e.str.trim().split(" ");
      str.forEach((word) => {
        if (word.length > 1)
          wordsList.push(word);
      });
    });
  });
  console.log(wordsList);
}
The same code works perfectly fine on my local machine, but when I deploy it to AWS Lambda, it fails to extract any text.
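One thing that may be worth checking in the handler above: it passes a callback to extractBuffer and also awaits the call. If the callback form does not return a promise (an assumption, not something the question confirms), the await finishes immediately, console.log prints an empty array, and the handler can return before the callback ever runs. A minimal sketch of one way around that is to wrap the callback call in a Promise. The helper names extractBufferAsync and collectWords are hypothetical, and the sketch reads every page rather than only pages[0]:

```javascript
// Wrap a callback-style extractBuffer(buffer, options, callback) call in a
// Promise so that `await` really waits for the result.
// `extractor` is any object with that method, e.g. the PDFExtract instance
// from the question. Both helper names are hypothetical.
function extractBufferAsync(extractor, buffer, options = {}) {
  return new Promise((resolve, reject) => {
    extractor.extractBuffer(buffer, options, (err, data) => {
      if (err) reject(err);
      else resolve(data);
    });
  });
}

// Collect words longer than one character from every page
// (the question's code only looks at the first page).
function collectWords(data) {
  const words = [];
  for (const page of data.pages) {
    for (const item of page.content) {
      for (const word of item.str.trim().split(/\s+/)) {
        if (word.length > 1) words.push(word);
      }
    }
  }
  return words;
}
```

Inside the handler this would be used as `const data = await extractBufferAsync(pdfExtract, buffer);` followed by `const wordsList = collectWords(data);`, so the log line only runs once extraction has finished or thrown.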
Answer 1
Score: 0
I have tried something similar in Python, and it works for me. How are you deploying your code to Lambda? I am assuming your Node packages are deployed properly and that when you test your Lambda function, it runs but just doesn't return any text. If that's not the case, please include it in the question. Also, what kind of PDF are you using: text-based or image-based?
Answer 2
Score: 0
The Lambda function gets the file as a base64 string. I was finally able to figure out what the problem was: it was the timeout setting. Despite that, I'm now using Python instead of Node.js, and it works much faster. Thanks anyway.
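For readers who stay on Node.js, here is a rough sketch of turning a base64 payload like the one described above into a Buffer that an extractBuffer-style API can consume. The event field name fileBase64 and both helper names are assumptions made for illustration; the answer does not say which key carries the file.

```javascript
// Decode a base64-encoded upload into a Buffer.
// `fileBase64` is a hypothetical field name; use whatever key the event
// actually carries in your setup.
function decodeUploadedFile(event) {
  const encoded = event.fileBase64;
  if (typeof encoded !== "string" || encoded.length === 0) {
    throw new Error("event.fileBase64 is missing or empty");
  }
  return Buffer.from(encoded, "base64");
}

// PDF files normally start with the ASCII signature "%PDF-", which gives a
// cheap sanity check that the decoded bytes are what we expect.
function looksLikePdf(buffer) {
  return buffer.subarray(0, 5).toString("latin1") === "%PDF-";
}
```

Since the timeout turned out to be the culprit here, it can also help to log context.getRemainingTimeInMillis() before and after extraction to see how close the function runs to its configured limit.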