Extracting data from multiple PDFs
Question
I have 200 PDF files, all formatted similarly.
Currently I am opening each PDF and looking for the two relevant values and typing them into an Excel table, all manually.
I'm wondering if there is a way to automate this. My (non-IT background) idea is to write a program that runs OCR on all the files in a folder, finds and extracts the relevant data in CSV format, and transfers it to Excel.
Could anyone give me some pointers on how to approach this? Is something remotely similar possible at all? Is there a language that's better suited to this task than others? Would VBA or Power Query be of any help here?
Answer 1
Score: 1
OCR
For the OCR part there are tons of tools, just to name a few popular ones:
There are many guides on how to install these tools, unfortunately mostly for Linux.
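To make the OCR step concrete, here is a minimal sketch in Python. It assumes two command-line tools that this answer does not name: pdftoppm (from Poppler) to render the pages to images, and tesseract to recognise the text. Both must be installed and on the PATH, and the function names are just illustrative.

```python
import subprocess
from pathlib import Path


def render_command(pdf_path: Path, prefix: Path) -> list[str]:
    """pdftoppm call that renders every page to <prefix>-<n>.png at 300 DPI."""
    return ["pdftoppm", "-r", "300", "-png", str(pdf_path), str(prefix)]


def tesseract_command(image_path: Path) -> list[str]:
    """tesseract call that writes the recognised text next to the image as .txt."""
    return ["tesseract", str(image_path), str(image_path.with_suffix(""))]


def ocr_pdf(pdf_path: Path, out_dir: Path) -> Path:
    """OCR one PDF into a single text file, out_dir/<pdf name>.txt."""
    # One sub-folder per PDF keeps the page images of different PDFs apart.
    page_dir = out_dir / pdf_path.stem
    page_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run(render_command(pdf_path, page_dir / "page"), check=True)

    texts = []
    for image in sorted(page_dir.glob("page-*.png")):
        subprocess.run(tesseract_command(image), check=True)
        texts.append(image.with_suffix(".txt").read_text(encoding="utf-8"))

    text_file = out_dir / f"{pdf_path.stem}.txt"
    text_file.write_text("\n".join(texts), encoding="utf-8")
    return text_file


def ocr_folder(pdf_dir: Path, out_dir: Path) -> list[Path]:
    """Run ocr_pdf on every PDF in a folder and return the text files."""
    return [ocr_pdf(pdf, out_dir) for pdf in sorted(pdf_dir.glob("*.pdf"))]
```

Calling ocr_folder(Path("pdfs"), Path("ocr_out")) would leave one .txt file per PDF in ocr_out. The folder names are placeholders.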
Relevant Data
A good question, but not a detailed one (you may get downvotes because you did not say whether you need tables extracted or just text).
Of course you can use any programming language.
An easy approach would be to OCR each PDF to a single text file,
then run e.g. grep -l MYTERM myfiles,
which yields the names of the files that contain the term (on Linux, or in Git Bash under Windows),
and finally generate a CSV that you import into Excel (the easy approach), or find a way to generate "real" Excel files.
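The same steps can also be sketched in Python instead of grep, which makes it easier to pull out the values themselves rather than just the file names. This is only a sketch under assumptions: the OCR output is taken to be one .txt file per PDF, and the two patterns ("Invoice No" and "Total") are made-up stand-ins for whatever labels precede your two values.

```python
import csv
import re
from pathlib import Path

# Hypothetical patterns: replace them with the labels found in your own PDFs.
PATTERNS = {
    "invoice_no": re.compile(r"Invoice No\.?\s*:?\s*(\S+)"),
    "total": re.compile(r"Total\s*:?\s*([\d.,]+)"),
}


def extract_values(text: str) -> dict[str, str]:
    """Return the first match of each pattern, or "" when nothing matches."""
    row = {}
    for name, pattern in PATTERNS.items():
        match = pattern.search(text)
        row[name] = match.group(1) if match else ""
    return row


def build_csv(text_dir: Path, csv_path: Path) -> int:
    """Write one CSV row per .txt file in text_dir and return the row count."""
    rows = []
    for text_file in sorted(text_dir.glob("*.txt")):
        text = text_file.read_text(encoding="utf-8", errors="replace")
        rows.append({"file": text_file.name, **extract_values(text)})

    # utf-8-sig adds a byte order mark so that Excel recognises the encoding.
    with csv_path.open("w", newline="", encoding="utf-8-sig") as handle:
        writer = csv.DictWriter(handle, fieldnames=["file", *PATTERNS])
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```

Running build_csv(Path("ocr_out"), Path("values.csv")) would give a file that Excel opens directly. Rows with empty cells point at files where the OCR text did not match a pattern and need a manual look.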
Regards