Is there a way to fetch a PDF from a URL and extract text from it in Deno?
Question
I am trying to create a Supabase edge function that reads a file from a URL and returns its text. However, I can't find any libraries that work in the Deno environment.
This is what I tried so far:
    import { PDFDocument } from 'https://cdn.skypack.dev/pdf-lib';

    async function fetchPDF(url: string): Promise<Uint8Array> {
      const response = await fetch(url);
      const data = await response.arrayBuffer();
      return new Uint8Array(data);
    }

    async function readPDFText(url: string): Promise<string> {
      const pdfBytes = await fetchPDF(url);
      const pdfDoc = await PDFDocument.load(pdfBytes);
      const pages = pdfDoc.getPages();
      let text = '';
      for (const page of pages) {
        const content = await page.extractText();
        text += content;
      }
      return text;
    }

    const pdfUrl = 'URL_GOES_HERE';
    const pdfText = await readPDFText(pdfUrl);
    console.log(pdfText);
However, I get a TypeError saying that .extractText() is not a function. I also tried getTextContent() and got the same error.
Answer 1
Score: 2
That library, pdf-lib, does not support text extraction, which is why neither .extractText() nor getTextContent() exists on its page objects.
> It is not currently possible to parse plain text out of a document
> with pdf-lib (but you can extract the content of acroform fields). I'd
> suggest you consider using PDF.js to parse/extract text.
>
> Of course, this isn't an ideal solution since it requires two
> different libraries for a seemingly simple task. But it's the best
> approach I know of for now, until pdf-lib gains support for text
> parsing.
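For the PDF.js route suggested in the quote, the following is a minimal sketch rather than a tested recipe. It assumes you already have a loaded PDF.js document, which the library provides through `getDocument({ data }).promise`. The interfaces `TextItemLike`, `PageLike` and `DocumentLike`, and the helpers `joinTextItems` and `extractAllText`, are hypothetical names introduced here. The interfaces only mirror the parts of the PDF.js document object that the sketch touches, so the block stands alone without pinning an import path.

```typescript
// Structural types mirroring the small part of PDF.js this sketch relies on.
// They are assumptions written out so the block is self-contained.
interface TextItemLike {
  str?: string; // marked-content items have no `str`
}
interface PageLike {
  getTextContent(): Promise<{ items: TextItemLike[] }>;
}
interface DocumentLike {
  numPages: number;
  getPage(pageNumber: number): Promise<PageLike>;
}

// Join the text runs of one page, skipping items that carry no string.
function joinTextItems(items: TextItemLike[]): string {
  return items
    .filter((item) => typeof item.str === "string")
    .map((item) => item.str)
    .join(" ");
}

// Walk every page (PDF.js page numbers start at 1) and collect its text.
async function extractAllText(doc: DocumentLike): Promise<string> {
  const pages: string[] = [];
  for (let n = 1; n <= doc.numPages; n++) {
    const page = await doc.getPage(n);
    const content = await page.getTextContent();
    pages.push(joinTextItems(content.items));
  }
  return pages.join("\n");
}
```

Joining runs with a single space is a simplification: PDF.js returns positioned text fragments, so the result may not match the visual layout of the page.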
As an alternative, you could use any npm package that has that functionality.
Here's a working example using pdf-parse:
    import pdf from 'npm:pdf-parse/lib/pdf-parse.js';

    async function extractTextFromPDF(pdfUrl) {
      const response = await fetch(pdfUrl);
      const data = await pdf(await response.arrayBuffer());
      return data.text;
    }

    const pdfUrl = 'https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf';
    const pdfText = await extractTextFromPDF(pdfUrl);
    console.log(pdfText);
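The example above hands whatever the server returns straight to the parser. In an edge function it may be worth failing early when the download goes wrong. The sketch below is one possible way to do that, not part of the original answer. `looksLikePdf` and `fetchPdfBytes` are hypothetical helper names, and the signature check is deliberately strict: it assumes the file begins with `%PDF-`, so it would reject a PDF that has stray bytes before the header.

```typescript
// Hypothetical helper: true when the bytes start with the "%PDF-" signature.
function looksLikePdf(bytes: Uint8Array): boolean {
  const signature = [0x25, 0x50, 0x44, 0x46, 0x2d]; // "%PDF-"
  return bytes.length >= signature.length &&
    signature.every((byte, i) => bytes[i] === byte);
}

// Hypothetical helper: download a URL and reject anything that is not a PDF.
async function fetchPdfBytes(url: string): Promise<Uint8Array> {
  const response = await fetch(url);
  if (!response.ok) {
    throw new Error(`Download failed: HTTP ${response.status}`);
  }
  const bytes = new Uint8Array(await response.arrayBuffer());
  if (!looksLikePdf(bytes)) {
    throw new Error("Response does not look like a PDF");
  }
  return bytes;
}
```

The bytes returned by `fetchPdfBytes` should be usable in place of `await response.arrayBuffer()` in the call to `pdf(...)`, though that combination is an assumption and has not been verified here.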