September 18, 2020, 12:51:53

Getting text from PDF using Apache PDFBox

Question

How can I get information about the structure of a PDF, that is, whether it contains text or images? I need my program to move PDFs without text to another folder, but right now all I get is an empty text file.

// Both resources are closed automatically, even if extraction fails.
try (FileWriter writer = new FileWriter(outputFile);
     PDDocument document = PDDocument.load(file)) {
    PDFTextStripper pdfTextStripper = new PDFTextStripper();
    String text = pdfTextStripper.getText(document);
    writer.write(text);
} catch (IOException e) {
    e.printStackTrace();
}

Also, I have a problem getting text from web pages saved as PDF. The output looks like this:

I think something is wrong with the encoding, but I don't know what to do.

Answer 1

Score: 1

Your code works fine; your text viewer assumes the wrong encoding.

Using your code and the same PDFBox version as you, I get properly extracted text:

But when I force my viewer to assume UTF-16 encoding, I get something very similar to your output:

[![viewer screen shot, UTF-16 encoding assumed][2]][2]

The file itself does not indicate any specific encoding, by a BOM or otherwise:

[![viewer screen shot, hex dump view][3]][3]

So your text viewer either incorrectly guesses UTF-16 or is configured to use it.

To fix this, either switch your text viewer to UTF-8 or explicitly tell your FileWriter to use UTF-16.
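A minimal sketch of writing the extracted text with an explicit charset. It uses `Files.newBufferedWriter` from the standard library rather than `FileWriter`, because that call accepts a charset on Java 7 and later (`FileWriter` only gained charset constructors in Java 11). The class name `ExplicitEncodingWriter`, the method `writeText`, and the sample string are illustrative assumptions; in the real program the string would come from `pdfTextStripper.getText(document)`.

```java
import java.io.IOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class ExplicitEncodingWriter {

    // Hypothetical helper: writes text using the given charset
    // instead of relying on the platform default.
    public static void writeText(Path outputFile, String text, Charset charset)
            throws IOException {
        try (Writer writer = Files.newBufferedWriter(outputFile, charset)) {
            writer.write(text);
        }
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the text returned by pdfTextStripper.getText(document).
        String text = "Gr\u00fc\u00dfe";

        Path utf8File = Files.createTempFile("extracted-utf8", ".txt");
        Path utf16File = Files.createTempFile("extracted-utf16", ".txt");

        // UTF-8: no BOM is written, so the viewer must be told (or guess) the encoding.
        writeText(utf8File, text, StandardCharsets.UTF_8);
        // UTF-16: Java's encoder writes a big-endian BOM that viewers can detect.
        writeText(utf16File, text, StandardCharsets.UTF_16);

        System.out.println("UTF-8 bytes:  " + Files.size(utf8File));
        System.out.println("UTF-16 bytes: " + Files.size(utf16File));
    }
}
```

Whichever charset is chosen, the point is that the writer and the viewer have to agree on it.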

Depending on your specific installation, the actual file encoding might be different. Since my UTF-16 view looks so much like yours, though, the encoding is very likely at least similar to UTF-8, probably one of the ISO 8859-x encodings.
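For the other part of the question, moving PDFs without text to another folder, one possible approach is to treat a PDF as text-free when the extracted string contains only whitespace. The sketch below assumes the text has already been extracted with `PDFTextStripper`; the class `TextlessPdfSorter` and its methods are hypothetical names, and the whitespace-only check is an assumption about what the stripper returns for image-only PDFs, not something confirmed above.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class TextlessPdfSorter {

    // Assumption: an image-only PDF yields null, an empty string,
    // or only whitespace such as line separators.
    public static boolean hasNoText(String extractedText) {
        return extractedText == null || extractedText.trim().isEmpty();
    }

    // Moves the PDF into targetDir when no text was extracted.
    // Returns the file's resulting location (unchanged if it contains text).
    public static Path moveIfTextless(Path pdf, String extractedText, Path targetDir)
            throws IOException {
        if (!hasNoText(extractedText)) {
            return pdf;
        }
        Files.createDirectories(targetDir);
        Path target = targetDir.resolve(pdf.getFileName());
        return Files.move(pdf, target, StandardCopyOption.REPLACE_EXISTING);
    }
}
```

A call might look like `TextlessPdfSorter.moveIfTextless(file.toPath(), text, Paths.get("no-text"))`, made after the `PDDocument` has been closed so the file is no longer held open.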

[1]: https://i.stack.imgur.com/S2JNo.png "viewer screen shot, UTF-8 encoding assumed"
[2]: https://i.stack.imgur.com/60DsN.png "viewer screen shot, UTF-16 encoding assumed"
[3]: https://i.stack.imgur.com/0yIFn.png "viewer screen shot, hex dump view"
