Extract text from PDF/DOC/DOCX files using AWS Lambda (Node.js)

Question

I'm trying to extract text from DOC/DOCX/PDF files in an AWS Lambda function written in Node.js.
I need the text as an array of words. I've tried a few different npm packages, but the function just seems to skip over the extraction calls.

AWS Lambda Function:

import { PDFExtract } from "pdf.js-extract";

...

export const handler = async (event) => {

	...

	const pdfExtract = new PDFExtract();
	const tempFilePath = join(tmpdir(), "resume.pdf");
	const buffer = readFileSync(tempFilePath);
	const wordsList = [];

	await pdfExtract.extractBuffer(buffer, {}, (err, data) => {
		if (err)
			return console.log(err);
		data.pages[0].content.forEach((e) => {
			const str = e.str.trim().split(" ");
			str.forEach((word) => {
				if (word.length > 1)
					wordsList.push(word);
			});
		});
	});
	console.log(wordsList);
}

[Image: file structure]

The same code works perfectly fine on my local machine, but when I deploy it to AWS Lambda, it fails to extract any text.
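
The word-splitting step in the handler above can be pulled out into a small pure function, which makes it easy to check locally without Lambda or a PDF. This is only a sketch: it assumes the result shape used in the question (`pages[].content[].str`), and `toWordList` and `minLength` are hypothetical names. Unlike the original, it walks every page rather than only `pages[0]` and splits on any run of whitespace.

```javascript
// Flatten extracted page content into an array of words.
// `pages` is assumed to look like the `data.pages` array in the question:
// each page has a `content` array, and each item has a string `str`.
function toWordList(pages, minLength = 2) {
  const words = [];
  for (const page of pages ?? []) {
    for (const item of page.content ?? []) {
      // Guard against items without text, then split on any whitespace run.
      const text = String(item.str ?? "").trim();
      for (const word of text.split(/\s+/)) {
        // Same filter as the question's `word.length > 1`.
        if (word.length >= minLength) words.push(word);
      }
    }
  }
  return words;
}

// Usage with hand-written sample data (not real extractor output):
const samplePages = [
  { content: [{ str: " Jane  Doe " }, { str: "a" }] },
  { content: [{ str: "Node.js developer" }] },
];
console.log(toWordList(samplePages));
```

Keeping this step separate from the extraction call means an empty result can be traced either to the extractor returning no content or to the filtering, rather than to both at once.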

Answer 1

Score: 0

I tried something similar in Python, and it works for me. How are you deploying your code to Lambda? I'm assuming your Node packages are deployed properly and that when you test the Lambda function, it runs but just doesn't return any text. If that's not the case, please include that in the question. Also, what kind of PDF are you using: text-based or image-based?

Answer 2

Score: 0

The Lambda function receives the file as a base64 string. I finally figured out that the problem was the timeout setting. Even so, I'm now using Python instead of Node.js, and it runs much faster. Thanks anyway.
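
For anyone hitting the same symptom: a Lambda function's default timeout is 3 seconds (configurable up to 900), so a slow extraction can be cut off before it logs anything. The sketch below shows one way to decode a base64 payload and check the remaining time budget before starting expensive work. The event field names (`body`, `isBase64Encoded`) assume an API Gateway proxy-style event, which the answer does not confirm; `decodeFileFromEvent` and `hasTimeFor` are hypothetical helper names. `context.getRemainingTimeInMillis()` is the standard method on the Node.js Lambda context object.

```javascript
// Decode the uploaded file from the event into a Buffer.
// Assumption: the payload arrives in `event.body`, flagged by `event.isBase64Encoded`.
function decodeFileFromEvent(event) {
  if (!event || typeof event.body !== "string") {
    throw new Error("event.body is missing or not a string");
  }
  const encoding = event.isBase64Encoded ? "base64" : "utf8";
  return Buffer.from(event.body, encoding);
}

// Return true if the invocation still has more than `neededMs` left.
function hasTimeFor(context, neededMs) {
  return context.getRemainingTimeInMillis() > neededMs;
}

// Usage with a fake event and context, so it runs outside Lambda:
const fakeEvent = {
  body: Buffer.from("%PDF-1.4 sample").toString("base64"),
  isBase64Encoded: true,
};
const fakeContext = { getRemainingTimeInMillis: () => 2500 };

const fileBuffer = decodeFileFromEvent(fakeEvent);
console.log(`decoded ${fileBuffer.length} bytes`);
if (!hasTimeFor(fakeContext, 5000)) {
  console.log("not enough time left for extraction; consider raising the timeout");
}
```

Logging the remaining time before and after the extraction call is a cheap way to tell a timeout apart from a package that was not bundled correctly.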

Posted by huangapple on March 4, 2023, 03:32:12
Source: https://go.coder-hub.com/75631195.html