June 1, 2023 05:42:09

How to create PDF.js Library in Apps Script

Question

I need to parse through multiple PDF files in one of the folders in my Google Drive, and return the parsed information into a Google Sheet. (I have already worked with parsing through Gmail so I don't think this will be a problem for me)

However, the research I have done about this indicates that I will need to import a Library first into my script editor that can parse through PDF files.

I am trying to import the PDF.js Library, but I cannot find the script ID, so instead, I am trying to import the code into a Script file that can then be added as a Library into other scripts.

I have downloaded the Zip file from the repository on GitHub: https://github.com/mozilla/pdf.js

However, I am not sure which file to copy into the script editor. Should it be the file called "Builder.js"?

Sorry, this is the first time I am interacting with GitHub.

EDIT: My current script looks something like this. Unlike with email, I cannot retrieve the contents of the PDF file as text, so I cannot pull out the information I need.

function getPDFfiles() {
  const pdfFolder = DriveApp.getFolderById("myfolderid");
  const files = pdfFolder.getFilesByType(MimeType.PDF);
  let pdfNames = [];
  while (files.hasNext()) {
    const file = files.next();
    const fileName = file.getName();
    pdfNames.push([fileName]);
  }
  generator.getRange(2, 1, pdfNames.length, pdfNames[0].length).setValues(pdfNames);
}; // getPDFfiles function ends

Answer 1

Score: 3

I believe your goal is as follows.

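You want to convert PDF files to text data using Google Apps Script.

At present, I'm afraid that PDF.js might not be directly usable with Google Apps Script. So, in this case, I would like to propose a method that does not use PDF.js. How about modifying your script as follows?

Modified script:

In this modified script, the Drive API is used to convert each PDF to a Google Document. So, please enable the Drive API under Advanced Google services.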

function getPDFfiles() {
  const pdfFolder = DriveApp.getFolderById("myfolderid");
  const files = pdfFolder.getFilesByType(MimeType.PDF);
  const res = [];
  while (files.hasNext()) {
    const file = files.next();
    // Copy the PDF as a temporary Google Document (requires the Drive advanced service).
    const tempId = Drive.Files.copy({ mimeType: MimeType.GOOGLE_DOCS }, file.getId(), { supportsAllDrives: true }).id;
    const text = DocumentApp.openById(tempId).getBody().getText();
    DriveApp.getFileById(tempId).setTrashed(true); // or Drive.Files.remove(tempId);
    const fileName = file.getName();
    res.push([fileName, text]);
  }
  if (res.length === 0) return; // No PDF files were found, so there is nothing to write.

  const generator = SpreadsheetApp.getActiveSheet(); // Please set your sheet.
  generator.getRange(2, 1, res.length, res[0].length).setValues(res);
}

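When this script is run, each PDF is converted to a Google Document. The text data is then retrieved from the Google Document, and the temporary Google Document is removed. Finally, the filename and the converted text are put into the Spreadsheet.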
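One practical caveat when the converted text is written to the sheet: as far as I know, Google Sheets does not accept more than 50,000 characters in a single cell, so a long PDF could make setValues fail. The following is only a minimal sketch of one way to guard against that. The helper name fitToCell and the constant MAX_CELL_CHARS are hypothetical and are not part of the script above.

```javascript
// Assumption: a single Google Sheets cell accepts at most 50,000 characters.
const MAX_CELL_CHARS = 50000;

// Hypothetical helper: returns the text unchanged when it fits in one cell,
// otherwise returns only the first `limit` characters.
function fitToCell(text, limit = MAX_CELL_CHARS) {
  return text.length <= limit ? text : text.slice(0, limit);
}
```

With this helper, the row could be built as res.push([fileName, fitToCell(text)]); instead of res.push([fileName, text]);. Truncating loses the tail of the text, so if the information you need may appear late in the document, it may be better to extract it before writing to the sheet.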

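Note:

This sample script retrieves text data from PDF data. If you want to retrieve specific text from that text data, I think that this can also be achieved from the Google Document.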
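As a rough illustration of retrieving specific text, the sketch below pulls a "Label: value" field out of a text string with a regular expression. It assumes that the converted text contains lines in that shape. The helper name extractField, the labels, and the sample text are all hypothetical, and your PDFs will probably need different patterns.

```javascript
// Hypothetical helper: returns the value that follows "label:" on the same line,
// or an empty string when the label is not found.
function extractField(text, label) {
  // Escape regex metacharacters so that the label is matched literally.
  const escaped = label.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  // "." does not match a newline, so the capture stops at the end of the line.
  const match = text.match(new RegExp(escaped + "\\s*:\\s*(.+)"));
  return match ? match[1].trim() : "";
}

// Illustrative sample text (an assumption, not real PDF output).
const sampleText = "Invoice No: 12345\nTotal: 99.50 USD\n";
const invoiceNo = extractField(sampleText, "Invoice No");
const total = extractField(sampleText, "Total");
```

In the script above, such a helper could be applied to the text value of each file, and the extracted values could be pushed into res as extra columns.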

Reference:

Method: files.copy
