Question

I'm trying to extract text from doc/docx/pdf files in an AWS Lambda function written in Node.js.
I need the text as an array of words. I've tried a few different npm packages, but I've noticed that the Lambda just skips those function calls.

AWS Lambda Function:

import { PDFExtract } from "pdf.js-extract";

...

export const handler = async (event) => {

	...

	const pdfExtract = new PDFExtract();
	const tempFilePath = join(tmpdir(), "resume.pdf");
	const buffer = readFileSync(tempFilePath);
	const wordsList = [];

	await pdfExtract.extractBuffer(buffer, {}, (err, data) => {
		if (err)
			return console.log(err);
		data.pages[0].content.forEach((e) => {
			const str = e.str.trim().split(" ");
			str.forEach((word) => {
				if (word.length > 1)
					wordsList.push(word);
			});
		});
	});
	console.log(wordsList);
}

File structure: (screenshot not included)

The same code works perfectly fine on my local machine, but when I deploy it to AWS Lambda, it fails to extract any text.

Answer 1

Score: 0

I have tried a similar thing in Python, and it works for me. How are you deploying your code to Lambda? I am assuming your Node packages are deployed properly and that when you test your Lambda function it runs but just doesn't return any text. If that's not the case, please include it in the question. Also, what kind of PDF are you using: a text-based PDF or an image-based PDF?

Answer 2

Score: 0

The Lambda function receives the file as a base64 string. I was finally able to figure out what the problem was: the timeout setting. Even so, I'm now using Python instead of Node.js, and it works much faster. Thanks anyway.
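
For anyone who wants to stay on Node.js, here is a minimal sketch of the pieces mentioned above: decoding the base64 payload, collecting words, and failing loudly before the Lambda timeout hits instead of ending with no output. All helper names (decodeBody, wordsFromPages, withDeadline, makeHandler) are hypothetical, and the sketch assumes the file arrives in event.body and that the extraction result has the pages[].content[].str shape used in the question. Unlike the question's code, it reads every page and splits on any whitespace.

```javascript
// Sketch only: every helper name below is hypothetical, not part of any library.

// Decode the uploaded file. Assumes the payload is a base64 string in event.body.
function decodeBody(event) {
  return Buffer.from(event.body, "base64");
}

// Collect words longer than one character from every page.
// Assumes the result shape used in the question: pages[].content[].str
function wordsFromPages(pages) {
  const words = [];
  for (const page of pages) {
    for (const item of page.content) {
      for (const word of item.str.trim().split(/\s+/)) {
        if (word.length > 1) words.push(word);
      }
    }
  }
  return words;
}

// Reject with a clear error if the promise has not settled within ms milliseconds.
function withDeadline(promise, ms) {
  let timer;
  const deadline = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error(`extraction exceeded ${ms} ms`)), ms);
  });
  return Promise.race([promise, deadline]).finally(() => clearTimeout(timer));
}

// Build a handler around any extractBuffer(buffer) function that returns a promise.
// marginMs is an assumed safety margin kept free for logging and the response.
function makeHandler(extractBuffer, marginMs = 1000) {
  return async (event, context) => {
    const buffer = decodeBody(event);
    // getRemainingTimeInMillis() is provided by the Lambda context object.
    const budget = Math.max(context.getRemainingTimeInMillis() - marginMs, 0);
    const data = await withDeadline(extractBuffer(buffer), budget);
    return wordsFromPages(data.pages);
  };
}
```

With pdf.js-extract, the injected function could be something like (buf) => new PDFExtract().extractBuffer(buf, {}), assuming that call returns a promise when no callback is passed. The deadline guard only makes the failure visible in the logs; the timeout itself still has to be raised in the function's configuration (the Lambda default is 3 seconds).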
