How to create PDF.js Library in Apps Script

Question

I need to parse multiple PDF files in one of the folders in my Google Drive and write the parsed information to a Google Sheet. (I have already worked with parsing Gmail messages, so I don't think that part will be a problem for me.)

However, the research I have done about this indicates that I will need to import a Library first into my script editor that can parse through PDF files.

I am trying to import the PDF.js Library, but I cannot find the script ID, so instead, I am trying to import the code into a Script file that can then be added as a Library into other scripts.

I have downloaded the Zip file from the repository on GitHub: https://github.com/mozilla/pdf.js

However, I am not sure which file to copy into the script editor. Should it be the file called "Builder.js"?

Sorry, this is the first time I am interacting with GitHub.

EDIT: My current script looks something like this. Unlike with email, I cannot retrieve the contents of a PDF file as text, so I cannot pull out the information I need.

function getPDFfiles() {
  const pdfFolder = DriveApp.getFolderById("myfolderid");
  const files = pdfFolder.getFilesByType(MimeType.PDF);
  const pdfNames = [];
  // Collect one row per PDF: [file name].
  while (files.hasNext()) {
    const file = files.next();
    const fileName = file.getName();
    pdfNames.push([fileName]);
  }
  // `generator` is assumed to be a Sheet object defined elsewhere in the script.
  generator.getRange(2, 1, pdfNames.length, pdfNames[0].length).setValues(pdfNames);
} // getPDFfiles function ends

Answer 1

Score: 3

I believe your goal is as follows.

  • You want to convert PDF files to text data using Google Apps Script.

At present, I'm afraid that PDF.js probably cannot be used directly with Google Apps Script. So, in this case, I would like to propose a method that does not use PDF.js. How about modifying your script as follows?

Modified script:

This modified script uses the Drive API to convert each PDF to a Google Document, so please enable the Drive API under Advanced Google services.
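
If the advanced service has not been enabled, the identifier `Drive` does not exist and the script stops with a reference error. As a minimal sketch, a guard like the following could be called at the top of `getPDFfiles()` to fail with a clearer message. The function names are hypothetical, and the check assumes the service was added under its default identifier `Drive`.

```javascript
// Returns true when the Drive advanced service is available under the identifier `Drive`.
// `typeof` is used so that an undefined identifier does not raise a ReferenceError.
function isDriveServiceEnabled() {
  return typeof Drive !== "undefined" && typeof Drive.Files !== "undefined";
}

// Hypothetical guard: throws a descriptive error when the service is missing.
function assertDriveServiceEnabled() {
  if (!isDriveServiceEnabled()) {
    throw new Error(
      "Drive advanced service is not enabled: add the Drive API under Services in the script editor."
    );
  }
}
```

Calling `assertDriveServiceEnabled();` as the first statement of the script is enough; it does nothing when the service is present.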

function getPDFfiles() {
  const pdfFolder = DriveApp.getFolderById("myfolderid");
  const files = pdfFolder.getFilesByType(MimeType.PDF);
  const res = [];
  while (files.hasNext()) {
    const file = files.next();
    // Copy the PDF as a temporary Google Document (requires the Drive advanced service).
    const tempId = Drive.Files.copy({ mimeType: MimeType.GOOGLE_DOCS }, file.getId(), { supportsAllDrives: true }).id;
    // Read the plain text of the converted document.
    const text = DocumentApp.openById(tempId).getBody().getText();
    // Discard the temporary document.
    DriveApp.getFileById(tempId).setTrashed(true); // or Drive.Files.remove(tempId);
    const fileName = file.getName();
    res.push([fileName, text]);
  }
  if (res.length === 0) return; // No PDF files in the folder, so there is nothing to write.

  const generator = SpreadsheetApp.getActiveSheet(); // Please set your sheet.
  generator.getRange(2, 1, res.length, res[0].length).setValues(res);
}
  • When this script is run, each PDF is converted to a Google Document, the text data is retrieved from that Google Document, and the temporary Google Document is then removed. Finally, the filenames and the converted text are put into the Spreadsheet.
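
One possible refinement of the convert, read, and remove sequence described above: if reading the text fails, the temporary Google Document would be left behind in Drive. The sketch below wraps the same three calls in `try`/`finally` so the cleanup always runs. The helper `pdfToText` and the `appsScriptApi` wrapper object are hypothetical names introduced only for illustration; the underlying calls are the ones already used in the script.

```javascript
// Thin wrappers around the calls used in the script above.
// They only run inside Apps Script, where Drive, DocumentApp and DriveApp exist.
const appsScriptApi = {
  copyAsDoc: (fileId) =>
    Drive.Files.copy({ mimeType: MimeType.GOOGLE_DOCS }, fileId, { supportsAllDrives: true }).id,
  readText: (docId) => DocumentApp.openById(docId).getBody().getText(),
  trash: (docId) => DriveApp.getFileById(docId).setTrashed(true),
};

// Hypothetical helper: convert one PDF to text and always trash the temporary document.
function pdfToText(fileId, api) {
  const tempId = api.copyAsDoc(fileId);
  try {
    return api.readText(tempId);
  } finally {
    api.trash(tempId); // Runs even if readText throws.
  }
}
```

Inside the `while` loop, the three conversion lines could then be replaced by `const text = pdfToText(file.getId(), appsScriptApi);`. Passing the wrappers in as a parameter is a design choice that also lets the helper be exercised with stand-in functions outside Apps Script.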

Note:

  • This sample script retrieves the text data from PDF data. If you want to retrieve specific text from that text data, I think that this can also be achieved from the Google Document.
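
As an illustration of retrieving specific text, here is a minimal sketch that applies regular expressions to the string returned by `getText()`. The field names and patterns (`invoiceNumber`, `total`) are purely hypothetical, since the question does not say what the PDFs contain; they would need to be adapted to the actual wording of your files.

```javascript
// Hypothetical patterns: adjust them to the wording of your own PDFs.
const FIELD_PATTERNS = {
  invoiceNumber: /Invoice\s*(?:No\.?|Number)\s*[:#]?\s*(\S+)/i,
  total: /Total\s*:?\s*\$?\s*([\d,]+\.\d{2})/i,
};

// Return the first captured group for each pattern, or "" when nothing matches.
function extractFields(text, patterns) {
  const out = {};
  for (const [name, re] of Object.entries(patterns)) {
    const match = text.match(re);
    out[name] = match ? match[1] : "";
  }
  return out;
}
```

In the script, `const fields = extractFields(text, FIELD_PATTERNS);` followed by `res.push([fileName, fields.invoiceNumber, fields.total]);` would write one column per field instead of the whole text.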

Posted by huangapple on June 1, 2023 at 05:42:09
Source: https://go.coder-hub.com/76377483.html