Is there a way to fetch a PDF from a URL and extract its text in Deno?

Question

I am trying to create a Supabase Edge Function that reads a file from a URL and returns its text, but I can't find any working libraries for the Deno environment.

This is what I have tried so far:

import { PDFDocument } from 'https://cdn.skypack.dev/pdf-lib';

async function fetchPDF(url: string): Promise<Uint8Array> {
    const response = await fetch(url);
    const data = await response.arrayBuffer();
    return new Uint8Array(data);
}

async function readPDFText(url: string): Promise<string> {
    const pdfBytes = await fetchPDF(url);
    const pdfDoc = await PDFDocument.load(pdfBytes);
    const pages = pdfDoc.getPages();

    let text = '';
    for (const page of pages) {
        const content = await page.extractText();
        text += content;
    }

    return text;
}

const pdfUrl = 'URL_GOES_HERE';
const pdfText = await readPDFText(pdfUrl);
console.log(pdfText);

However, I get a TypeError saying that .extractText() is not a function. I also tried getTextContent() and got the same error.

Answer 1

Score: 2

That library (pdf-lib) does not support text extraction:

> It is not currently possible to parse plain text out of a document
> with pdf-lib (but you can extract the content of acroform fields). I'd
> suggest you consider using PDF.js to parse/extract text.
>
> Of course, this isn't an ideal solution since it requires two
> different libraries for a seemingly simple task. But it's the best
> approach I know of for now, until pdf-lib gains support for text
> parsing.

As an alternative, you could use any npm package that has that functionality.

Here's a working example using pdf-parse:

// Import the parser module directly via Deno's npm: specifier.
import pdf from 'npm:pdf-parse/lib/pdf-parse.js';

// Download the PDF and return its plain-text content.
async function extractTextFromPDF(pdfUrl: string): Promise<string> {
    const response = await fetch(pdfUrl);
    const data = await pdf(await response.arrayBuffer());
    return data.text;
}

const pdfUrl = 'https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf';
const pdfText = await extractTextFromPDF(pdfUrl);
console.log(pdfText);
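
The example above passes whatever the server returns straight to the parser. In an edge function it may be worth checking the response first, so that an error page or a non-PDF body fails with a clear message. The following is a minimal sketch of that idea, not part of the original answer: `looksLikePdf` and `fetchPdfBytes` are hypothetical helper names, and the check assumes the file starts with the usual `%PDF-` header.

```typescript
// Bytes of the ASCII string "%PDF-", which PDF files normally start with.
const PDF_MAGIC = [0x25, 0x50, 0x44, 0x46, 0x2d];

// Hypothetical helper: true if the buffer begins with the "%PDF-" header.
function looksLikePdf(bytes: Uint8Array): boolean {
    if (bytes.length < PDF_MAGIC.length) return false;
    return PDF_MAGIC.every((value, index) => bytes[index] === value);
}

// Hypothetical helper: fetch a URL and return its body only if the request
// succeeded and the body looks like a PDF.
async function fetchPdfBytes(url: string): Promise<Uint8Array> {
    const response = await fetch(url);
    if (!response.ok) {
        throw new Error(`Request failed: ${response.status} ${response.statusText}`);
    }
    const bytes = new Uint8Array(await response.arrayBuffer());
    if (!looksLikePdf(bytes)) {
        throw new Error('Response body does not start with the %PDF- header');
    }
    return bytes;
}
```

The validated bytes can then be handed to whichever text-extraction library is in use. Checking the header rather than the Content-Type is a design choice here, since servers do not always label PDFs consistently.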

Posted on 2023-05-15 06:43:28