Getting text from PDF using Apache PDFBox
Question
How can I get information about the structure of a PDF, that is, whether it contains text or images? I need my program to move PDFs without text into another folder, but right now I just get an empty text file.
    // PDDocument.load is static, so call it on the class instead of a throwaway instance;
    // try-with-resources closes both the document and the writer, even on failure
    try (PDDocument document = PDDocument.load(file);
         FileWriter writer = new FileWriter(outputFile)) {
        PDFTextStripper pdfTextStripper = new PDFTextStripper();
        String text = pdfTextStripper.getText(document);
        writer.write(text);
    } catch (IOException e) {
        e.printStackTrace();
    }
I also have a problem getting text from web pages saved as PDF: the output does not look right.
I think something is wrong with the encoding, but I don't know what to do.
Answer 1
Score: 1
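Your code works fine; your text viewer is assuming the wrong encoding.
Using your code and the same PDFBox version as you, I get properly extracted text:
[![viewer screen shot, UTF-8 encoding assumed](https://i.stack.imgur.com/S2JNo.png)](https://i.stack.imgur.com/S2JNo.png)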
But when I force my viewer to assume UTF-16 encoding, I get something very similar to what you get:
[![viewer screen shot, UTF-16 encoding assumed][2]][2]
The file itself does not indicate any specific encoding by a BOM or anything:
[![viewer screen shot, hex dump view][3]][3]
Thus, your text viewer either incorrectly guesses UTF-16 encoding or is configured to use it.
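To make the effect concrete, here is a minimal sketch of what a viewer does when it reads UTF-8 bytes as UTF-16. The class and method names (`EncodingMismatchDemo`, `hasBom`, `viewAsUtf16`) and the sample string are made up for illustration, and big-endian UTF-16 is an assumption about the viewer:

```java
import java.nio.charset.StandardCharsets;

class EncodingMismatchDemo {

    /** Returns true if the bytes start with a UTF-8 or UTF-16 byte order mark. */
    static boolean hasBom(byte[] b) {
        boolean utf8 = b.length >= 3
                && (b[0] & 0xFF) == 0xEF && (b[1] & 0xFF) == 0xBB && (b[2] & 0xFF) == 0xBF;
        boolean utf16 = b.length >= 2
                && (((b[0] & 0xFF) == 0xFE && (b[1] & 0xFF) == 0xFF)
                 || ((b[0] & 0xFF) == 0xFF && (b[1] & 0xFF) == 0xFE));
        return utf8 || utf16;
    }

    /** Decodes the bytes the way a viewer set to (big-endian) UTF-16 would. */
    static String viewAsUtf16(byte[] b) {
        return new String(b, StandardCharsets.UTF_16BE);
    }

    public static void main(String[] args) {
        // Hypothetical sample standing in for the extracted PDF text
        byte[] bytes = "Extracted text".getBytes(StandardCharsets.UTF_8);

        System.out.println("Has BOM: " + hasBom(bytes));
        // Every pair of bytes is fused into one unrelated character
        System.out.println("Viewed as UTF-16: " + viewAsUtf16(bytes));
        System.out.println("Viewed as UTF-8:  " + new String(bytes, StandardCharsets.UTF_8));
    }
}
```

Without a BOM the bytes carry no hint about their encoding, so the viewer has to guess or fall back to its configured default.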
So either switch your text viewer to UTF-8 or explicitly tell your `FileWriter` to use UTF-16.
Depending on your installation, the file encoding might actually be something else, because a `FileWriter` created without an explicit charset writes in the JVM's default charset. As my UTF-16 view looks so much like yours, though, your encoding is very likely at least similar to UTF-8 for this text, probably some ISO 8859-x variant.
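One way to take the guesswork out is to choose the charset in code rather than rely on the default. The following is only a sketch: it swaps `FileWriter` for `Files.newBufferedWriter`, which takes the charset as an argument, and the class name, method name, and sample text are hypothetical. In the real program the string would come from `pdfTextStripper.getText(document)`:

```java
import java.io.IOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class ExplicitCharsetWriter {

    /** Writes the text as UTF-8 regardless of the JVM's default charset. */
    static void writeUtf8(Path outputFile, String text) throws IOException {
        try (Writer writer = Files.newBufferedWriter(outputFile, StandardCharsets.UTF_8)) {
            writer.write(text);
        }
    }

    public static void main(String[] args) throws IOException {
        Path out = Files.createTempFile("extracted", ".txt");
        // Hypothetical sample with non-ASCII characters ("Größe"), written as escapes
        writeUtf8(out, "Gr\u00f6\u00dfe");

        // Reading back with the same charset returns the original text
        System.out.println(new String(Files.readAllBytes(out), StandardCharsets.UTF_8));
        Files.delete(out);
    }
}
```

If the viewer has to stay on UTF-16, passing `StandardCharsets.UTF_16` instead should work as well; Java's UTF-16 encoder also writes a byte order mark, which gives viewers something to detect.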
[1]: https://i.stack.imgur.com/S2JNo.png "viewer screen shot, UTF-8 encoding assumed"
[2]: https://i.stack.imgur.com/60DsN.png "viewer screen shot, UTF-16 encoding assumed"
[3]: https://i.stack.imgur.com/0yIFn.png "viewer screen shot, hex dump view"