Getting text from PDF using Apache PDFBox

Question

How can I get information about the structure of a PDF, that is, whether it contains text or images? I need my program to move PDFs without text to another folder, but right now I'm just getting an empty txt file.

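    import java.io.FileWriter;
    import java.io.IOException;
    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.text.PDFTextStripper;

    // load() is static; both resources are closed automatically
    try (PDDocument document = PDDocument.load(file);
         FileWriter writer = new FileWriter(outputFile)) {
        PDFTextStripper pdfTextStripper = new PDFTextStripper();
        String text = pdfTextStripper.getText(document);
        // FileWriter writes with the platform default encoding here
        writer.write(text);
    } catch (IOException e) {
        e.printStackTrace();
    }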

Also, I have a problem getting text from web pages saved as PDF. The extracted text comes out garbled (the screenshot of the output is not preserved here).

I think there is something wrong with the encoding, but I don't know what to do.

Answer 1

Score: 1

Your code works fine; your text viewer is assuming the wrong encoding.

Using your code and the same PDFBox version as you, I get properly extracted text:

[![viewer screen shot, UTF-8 encoding assumed][1]][1]

But when I force my viewer to assume UTF-16 encoding, I get something very similar to what you get:

[![viewer screen shot, UTF-16 encoding assumed][2]][2]
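The effect is easy to reproduce without any viewer. The following is a minimal sketch, not part of the original answer: the class name `WrongCharsetDemo` and the sample string are made up for illustration, and big-endian UTF-16 is assumed for the misconfigured viewer.

```java
import java.nio.charset.StandardCharsets;

public class WrongCharsetDemo {

    /**
     * Encodes the text as UTF-8 and then decodes the very same bytes as
     * UTF-16 (big-endian), which is what a viewer does when it assumes
     * the wrong encoding. Every pair of bytes becomes one unrelated character.
     */
    static String viewAsUtf16(String text) {
        byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8Bytes, StandardCharsets.UTF_16BE);
    }

    public static void main(String[] args) {
        String extracted = "Text extracted from a PDF"; // hypothetical sample
        System.out.println("As written : " + extracted);
        System.out.println("Misdecoded : " + viewAsUtf16(extracted));
    }
}
```

Plain ASCII text misread this way tends to show up as CJK-looking characters, because two ASCII bytes together form a code unit in that range.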

The file itself does not indicate any specific encoding, by a BOM or otherwise:

[![viewer screen shot, hex dump view][3]][3]
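If no hex viewer is at hand, the same check can be done in code. This is a rough sketch under stated assumptions: the class `BomCheck` and its method are hypothetical names, and only the UTF-8 and UTF-16 byte order marks are recognised.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

public class BomCheck {

    /** Returns the encoding announced by a leading BOM, or "none" if there is none. */
    static String detectBom(byte[] head) {
        if (head.length >= 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF) {
            return "UTF-8";
        }
        if (head.length >= 2 && (head[0] & 0xFF) == 0xFE && (head[1] & 0xFF) == 0xFF) {
            return "UTF-16BE";
        }
        if (head.length >= 2 && (head[0] & 0xFF) == 0xFF && (head[1] & 0xFF) == 0xFE) {
            return "UTF-16LE";
        }
        return "none";
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.out.println("usage: java BomCheck <text-file>");
            return;
        }
        // Reads the whole file for simplicity; only the first bytes matter
        byte[] bytes = Files.readAllBytes(Paths.get(args[0]));
        System.out.println("BOM: " + detectBom(bytes));
    }
}
```

A result of "none" only means the file does not announce its encoding; it says nothing about which encoding was actually used to write it.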

Thus, your text viewer either incorrectly guesses UTF-16 encoding or is configured to use it.

So either switch your text viewer to UTF-8 or explicitly tell your FileWriter to use UTF-16.
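One way to take the guesswork out of the writing side is to name the charset explicitly instead of relying on the platform default. The sketch below swaps `FileWriter` for `Files.newBufferedWriter`, which accepts a charset on all Java versions since 7; the class name `ExplicitCharsetWrite`, the helper `roundTrip` and the temp file are illustrative assumptions, not part of the question's code.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class ExplicitCharsetWrite {

    /** Writes the text to the file using exactly the given charset. */
    static void writeText(Path outputFile, String text, Charset charset) throws IOException {
        try (Writer writer = Files.newBufferedWriter(outputFile, charset)) {
            writer.write(text);
        }
    }

    /** Writes the text to a temp file and returns the raw bytes that ended up on disk. */
    static byte[] roundTrip(String text, Charset charset) {
        try {
            Path tmp = Files.createTempFile("extracted", ".txt");
            try {
                writeText(tmp, text, charset);
                return Files.readAllBytes(tmp);
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        for (Charset cs : new Charset[] {StandardCharsets.UTF_8, StandardCharsets.UTF_16}) {
            StringBuilder hex = new StringBuilder();
            for (byte b : roundTrip("A", cs)) {
                hex.append(String.format("%02X ", b & 0xFF));
            }
            System.out.println(cs + ": " + hex.toString().trim());
        }
    }
}
```

In the question's code, the string passed to `writeText` would be the result of `pdfTextStripper.getText(document)`. Java's `UTF_16` charset writes a big-endian BOM at the start of the file, which gives viewers a hint that `UTF_8` output does not carry.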


Depending on your specific installation, the file encoding might actually be different. Since my UTF-16 view looks so much like yours, though, the encoding is very likely at least similar to UTF-8, probably some ISO 8859-x variant.
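The question also asks how to move PDFs without text to another folder. A possible sketch, not taken from the answer above: treat a document as text-free when the string returned by `PDFTextStripper.getText` is empty or whitespace only. The class `NoTextSorter`, its method names and the `no-text` folder are hypothetical, and the extracted text is passed in as a plain string so the example runs without PDFBox.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class NoTextSorter {

    /** True if the extracted text contains anything besides whitespace. */
    static boolean hasText(String extractedText) {
        return extractedText != null && !extractedText.trim().isEmpty();
    }

    /**
     * Moves the PDF into noTextDir when no text was extracted from it.
     * Returns the file's location afterwards (unchanged if it has text).
     */
    static Path moveIfNoText(Path pdf, String extractedText, Path noTextDir) throws IOException {
        if (hasText(extractedText)) {
            return pdf;
        }
        Files.createDirectories(noTextDir);
        return Files.move(pdf, noTextDir.resolve(pdf.getFileName()),
                StandardCopyOption.REPLACE_EXISTING);
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("pdf-sort");
        // Empty placeholder file standing in for a scanned PDF
        Path pdf = Files.createFile(dir.resolve("scan.pdf"));
        // An empty string stands in for the result of getText(document)
        Path location = moveIfNoText(pdf, "", dir.resolve("no-text"));
        System.out.println("File is now at: " + location);
    }
}
```

This heuristic has limits: a scanned PDF may carry a hidden OCR text layer, and a PDF with visible text may still yield nothing if its fonts cannot be mapped to Unicode, so the result is best treated as a first filter.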

[1]: https://i.stack.imgur.com/S2JNo.png "viewer screen shot, UTF-8 encoding assumed"
[2]: https://i.stack.imgur.com/60DsN.png "viewer screen shot, UTF-16 encoding assumed"
[3]: https://i.stack.imgur.com/0yIFn.png "viewer screen shot, hex dump view"

Posted by huangapple on 2020-09-18 12:51:53
Source: https://go.coder-hub.com/63949522.html