Extracting data from multiple PDFs

Question

I have 200 PDF files, all formatted similarly.

Currently I am opening each PDF and looking for the two relevant values and typing them into an Excel table, all manually.

I'm wondering if there is a way to automate this. My (non-IT background) idea is to write a program that OCR-scans all the files in a folder, finds and extracts the relevant data in CSV format, and transfers it to Excel.

Could anyone give me some pointers on how to approach this? Is something even remotely similar possible? Is there a language that's better suited to this task than others? Would VBA or PowerQuery be helpful in any way?

Answer 1

Score: 1

OCR

For the OCR part there are tons of tools, just to name a few popular ones:

There is plenty of documentation on how to install these tools, though unfortunately most of it targets Linux.

Relevant Data

A good question, but not a detailed one (and you may get downvotes because you did not say whether you need tables extracted or just text).

Of course you can use any programming language.
An easy approach would be to OCR each PDF into its own text file,
then search those files, e.g. grep -l MYTERM myfiles will list the names of the files that contain MYTERM (on Linux, or in Git Bash under Windows),

and finally generate a CSV that you import into Excel (the easy approach), or find a way to generate "real" Excel files.
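
As a rough sketch of the last two steps, here is how the search and CSV generation might look in Python, using only the standard library. It assumes the OCR step has already produced one .txt file per PDF in a folder. The labels "Invoice No" and "Total", the field names, and the folder and file names are all hypothetical placeholders, since the question does not say which two values are needed.

```python
import csv
import re
from pathlib import Path

# Hypothetical labels: replace them with the text that precedes
# each of the two values in your own PDFs.
PATTERNS = {
    "invoice_no": re.compile(r"Invoice No[.:]?\s*(\S+)"),
    "total": re.compile(r"Total[.:]?\s*([\d.,]+)"),
}


def extract_values(text):
    """Return the first match for each pattern, or "" if it is absent."""
    row = {}
    for name, pattern in PATTERNS.items():
        match = pattern.search(text)
        row[name] = match.group(1) if match else ""
    return row


def build_csv(text_dir, csv_path):
    """Scan every .txt file in text_dir and write one CSV row per file."""
    rows = []
    for txt_file in sorted(Path(text_dir).glob("*.txt")):
        text = txt_file.read_text(encoding="utf-8", errors="replace")
        rows.append({"file": txt_file.name, **extract_values(text)})
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=["file", *PATTERNS])
        writer.writeheader()
        writer.writerows(rows)
    return rows
```

Calling build_csv("ocr_output", "extracted.csv") would then give a file that Excel can open directly. Files where a value was not found get an empty cell, which makes it easy to spot the PDFs that still need a manual check.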

Regards

huangapple
  • Published on March 3, 2023 at 18:20:02
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75625808.html