Posted: 2023-03-03 18:20:02

Extracting data from multiple PDFs

Question

I have 200 PDF files, all formatted similarly.

Currently I am opening each PDF and looking for the two relevant values and typing them into an Excel table, all manually.

I'm wondering if there is a way to automate this. My (non-IT background) idea is to write a program that OCR scans all the files located in a folder, then finds and extracts the relevant data in CSV format, and transfers it to Excel.

I was wondering if anyone could give me some pointers on how to first approach this. Is something remotely similar possible at all? Is there a language that's better suited for this task than the others? Would VBA or PowerQuery be in any way helpful for this task?

Answer 1

Score: 1

OCR

For the OCR part there are tons of tools, just to name a few popular ones:

There are many guides on how to install these tools, unfortunately mostly for Linux.

Relevant Data

A good question, but not a detailed one (and you may get downvotes because you did not say whether you need tables extracted or just text).

Of course you can use any programming language. An easy approach would be to OCR each PDF into its own text file; then e.g. grep -l MYTERM myfiles will yield the names of the files that contain the term (on Linux, or in Git Bash under Windows).

Finally, generate a CSV that you import into Excel (the easy approach), or find a way to generate "real" Excel files.
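The last two steps (search the OCR output, then build a CSV) could be sketched in Python roughly as follows. This is only a minimal sketch under assumptions: it presumes the OCR step has already written one .txt file per PDF into a folder, and the labels "Invoice No" and "Total" are hypothetical placeholders for whatever text precedes your two values. The function names are made up for illustration too.

```python
import csv
import re
from pathlib import Path

# Hypothetical labels: replace them with the text that precedes your two values.
PATTERNS = {
    "invoice_no": re.compile(r"Invoice No[.:]?\s*(\S+)"),
    "total": re.compile(r"Total[.:]?\s*([\d.,]+)"),
}


def extract_values(text):
    """Return the first match for each pattern, or an empty string if absent."""
    row = {}
    for name, pattern in PATTERNS.items():
        match = pattern.search(text)
        row[name] = match.group(1) if match else ""
    return row


def build_csv(text_dir, csv_path):
    """Scan every .txt file in text_dir and write one CSV row per file."""
    rows = []
    for path in sorted(Path(text_dir).glob("*.txt")):
        # OCR output can contain odd bytes, so undecodable ones are replaced.
        text = path.read_text(encoding="utf-8", errors="replace")
        rows.append({"file": path.name, **extract_values(text)})
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=["file", *PATTERNS])
        writer.writeheader()
        writer.writerows(rows)
    return rows
```

Calling build_csv("ocr_output", "values.csv") (both names are placeholders) would leave a values.csv that Excel can open directly. Files where a value was not found get an empty cell, which makes it easy to spot the PDFs that still need a manual look.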

Regards
