How to search for a string in a PDF document
Question
I have a PDF document that contains images, hyperlinks, words, and many other things.
I want to search for a string in the words only, i.e. images and hyperlinks should be excluded.
How can I write Java code to do that? Could someone help?
Answer 1
Score: 2
You can use the Apache PDFBox library (https://pdfbox.apache.org/download.cgi).
Here is an example:
import java.io.File;
import java.io.IOException;
import java.util.Scanner;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

public class Main {
    public static void main(String[] args) throws IOException {
        Scanner scan = new Scanner(System.in);
        System.out.println("Type the path of the PDF file:");
        String PDFdir = scan.nextLine();
        System.out.println("Input the phrase to find:");
        String phrase = scan.nextLine();

        File file = new File(PDFdir);
        // PDFBox 2.x API; in PDFBox 3.x use org.apache.pdfbox.Loader.loadPDF(file) instead.
        PDDocument doc = PDDocument.load(file);
        // PDFTextStripper returns only the text content, so images are skipped.
        PDFTextStripper findPhrase = new PDFTextStripper();
        String text = findPhrase.getText(doc);
        String PDF_content = text;
        // contains() is a case-sensitive, literal substring check.
        String result = PDF_content.contains(phrase) ? "Yes" : "No";
        System.out.println(result);
        doc.close();
    }
}
Remember that you will have to download the PDFBox JAR file and add it to your project's classpath.
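One caveat with a plain contains() check: the extracted text typically has a line break wherever a line ends on the page, so a phrase that wraps across two lines will not match as typed. A minimal sketch of one way around this, assuming you want a case-insensitive match that ignores differences in whitespace. The class and method names here (NormalizedSearch, normalize, containsPhrase) are hypothetical helpers, not part of PDFBox, and the sample string only stands in for the result of getText(doc):

```java
import java.util.Locale;

class NormalizedSearch {
    // Collapse every run of whitespace (spaces, tabs, line breaks) into one space.
    static String normalize(String s) {
        return s.replaceAll("\\s+", " ").trim();
    }

    // Case-insensitive, whitespace-tolerant containment check.
    static boolean containsPhrase(String text, String phrase) {
        String haystack = normalize(text).toLowerCase(Locale.ROOT);
        String needle = normalize(phrase).toLowerCase(Locale.ROOT);
        // An empty phrase is treated as "not found" rather than matching everything.
        return !needle.isEmpty() && haystack.contains(needle);
    }

    public static void main(String[] args) {
        // Stand-in for the string returned by PDFTextStripper.getText(doc).
        String extracted = "Search for a\nstring in  a PDF\r\ndocument.";
        System.out.println(containsPhrase(extracted, "a string in a pdf document")); // true
        System.out.println(containsPhrase(extracted, "hyperlink"));                  // false
    }
}
```

In the program above, you would pass the value of text (or PDF_content) as the first argument instead of the sample string.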
Edit:
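You can also count how many times the phrase occurs in the PDF:
// The empty-phrase check avoids an infinite loop, since contains("") is always true.
if (!phrase.isEmpty() && result.equals("Yes")) {
    int counter = 0;
    while (PDF_content.contains(phrase)) {
        counter++;
        // replaceFirst() takes a regex, so quote the phrase to match it literally.
        PDF_content = PDF_content.replaceFirst(java.util.regex.Pattern.quote(phrase), "");
    }
    System.out.println(counter);
}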
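The loop above builds a new string on every iteration, and removing a match can join the text on either side of it into a new occurrence that was not in the document. A sketch of an alternative that scans once with indexOf and treats the phrase as literal text. PhraseCounter and countOccurrences are hypothetical names chosen for illustration, and the counting is non-overlapping by assumption:

```java
class PhraseCounter {
    // Count non-overlapping, case-sensitive occurrences of phrase in text.
    static int countOccurrences(String text, String phrase) {
        if (phrase.isEmpty()) {
            return 0; // indexOf("") always succeeds, so guard against it
        }
        int count = 0;
        int from = 0;
        while ((from = text.indexOf(phrase, from)) != -1) {
            count++;
            from += phrase.length(); // continue searching after this match
        }
        return count;
    }

    public static void main(String[] args) {
        // Stand-in for the extracted PDF text.
        String text = "a.b a.b axb";
        // "a.b" is matched literally, so "axb" is not counted.
        System.out.println(countOccurrences(text, "a.b")); // 2
    }
}
```

Because indexOf does no regex interpretation, characters such as '.' or '(' in the phrase need no escaping.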