Is there a way to fetch a PDF from a URL and extract its text in Deno?

Question

I am trying to create a Supabase Edge Function that reads a PDF file from a URL and returns its text, but I can't find any working libraries for the Deno environment.

This is what I tried so far:

import { PDFDocument } from 'https://cdn.skypack.dev/pdf-lib';

async function fetchPDF(url: string): Promise<Uint8Array> {
	const response = await fetch(url);
	const data = await response.arrayBuffer();
	return new Uint8Array(data);
}

async function readPDFText(url: string): Promise<string> {
	const pdfBytes = await fetchPDF(url);
	const pdfDoc = await PDFDocument.load(pdfBytes);
	const pages = pdfDoc.getPages();

	let text = '';
	for (const page of pages) {
		const content = await page.extractText();
		text += content;
	}

	return text;
}

const pdfUrl = 'URL_GOES_HERE';
const pdfText = await readPDFText(pdfUrl);
console.log(pdfText);

However, I get a TypeError saying that .extractText() is not a function. I also tried getTextContent() and got the same error.

Answer 1

Score: 2

pdf-lib does not support text extraction:

> It is not currently possible to parse plain text out of a document
> with pdf-lib (but you can extract the content of acroform fields). I'd
> suggest you consider using PDF.js to parse/extract text.
>
> Of course, this isn't an ideal solution since it requires two
> different libraries for a seemingly simple task. But it's the best
> approach I know of for now, until pdf-lib gains support for text
> parsing.
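
The PDF.js route mentioned in the quote could look roughly like the sketch below. It is only an illustration: the interfaces are minimal shapes modeled on the `getDocument`, `getPage` and `getTextContent` calls of PDF.js, and the names `PdfJsLike` and `extractTextWithPdfJs` are made up for this example. This is also where `getTextContent()` comes from: it belongs to PDF.js pages, not to pdf-lib pages, which is why calling it on a pdf-lib page fails.

```typescript
// Minimal structural types for the parts of PDF.js used here. These are
// assumed shapes for illustration, not the library's full type definitions.
interface TextContentLike {
  items: Array<{ str?: string }>;
}
interface PageLike {
  getTextContent(): Promise<TextContentLike>;
}
interface DocumentLike {
  numPages: number;
  getPage(pageNumber: number): Promise<PageLike>;
}
interface PdfJsLike {
  getDocument(src: { data: Uint8Array }): { promise: Promise<DocumentLike> };
}

// Extracts text page by page. Page numbers start at 1 in PDF.js.
async function extractTextWithPdfJs(
  pdfjs: PdfJsLike,
  bytes: Uint8Array,
): Promise<string> {
  const doc = await pdfjs.getDocument({ data: bytes }).promise;
  const pages: string[] = [];
  for (let pageNumber = 1; pageNumber <= doc.numPages; pageNumber++) {
    const page = await doc.getPage(pageNumber);
    const content = await page.getTextContent();
    const parts: string[] = [];
    for (const item of content.items) {
      // Items without a `str` field (marked content) carry no text.
      if (typeof item.str === "string") parts.push(item.str);
    }
    pages.push(parts.join(" "));
  }
  // One line of output per page.
  return pages.join("\n");
}
```

The PDF.js module itself is passed in as an argument, so the sketch does not commit to a particular import specifier. In Deno it would presumably come from an `npm:pdfjs-dist` import, but the exact entry point and any worker setup depend on the pdfjs-dist version and are not confirmed here.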

As an alternative, you could use any npm package that has that functionality.

Here's a working example using pdf-parse:

// Import the library file directly through Deno's npm: specifier.
import pdf from 'npm:pdf-parse/lib/pdf-parse.js';

async function extractTextFromPDF(pdfUrl: string): Promise<string> {
    // Download the PDF and hand its raw bytes to pdf-parse.
    const response = await fetch(pdfUrl);
    const data = await pdf(await response.arrayBuffer());
    return data.text;
}

const pdfUrl = 'https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf';
const pdfText = await extractTextFromPDF(pdfUrl);
console.log(pdfText);
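
Since the question is about an edge function, here is one possible way to expose an extractor such as `extractTextFromPDF` from the example above over HTTP. Treat it as a sketch under assumptions: the `url` query parameter, the status codes, and the names `TextExtractor` and `makePdfTextHandler` are all hypothetical choices for this illustration, not part of Supabase or pdf-parse.

```typescript
// Hypothetical extractor signature: any function that maps a PDF URL to its
// text, such as extractTextFromPDF from the example above.
type TextExtractor = (pdfUrl: string) => Promise<string>;

// Builds a request handler that reads ?url=... and responds with the PDF text.
function makePdfTextHandler(extract: TextExtractor) {
  return async (req: Request): Promise<Response> => {
    const target = new URL(req.url).searchParams.get("url");
    if (!target) {
      return new Response("Missing 'url' query parameter", { status: 400 });
    }

    // Reject values that are not absolute http(s) URLs.
    let parsed: URL;
    try {
      parsed = new URL(target);
    } catch {
      return new Response("Invalid 'url' query parameter", { status: 400 });
    }
    if (parsed.protocol !== "https:" && parsed.protocol !== "http:") {
      return new Response("Only http(s) URLs are supported", { status: 400 });
    }

    try {
      const text = await extract(parsed.href);
      return new Response(text, {
        headers: { "content-type": "text/plain; charset=utf-8" },
      });
    } catch (err) {
      // Download or parsing failed; report it as an upstream error.
      const message = err instanceof Error ? err.message : String(err);
      return new Response(`Failed to extract text: ${message}`, { status: 502 });
    }
  };
}

// In an edge function the handler would be registered roughly like this:
// Deno.serve(makePdfTextHandler(extractTextFromPDF));
```

Passing the extractor in as an argument keeps the handler independent of the PDF library, so swapping pdf-parse for something else would only change the function handed to `makePdfTextHandler`.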
