Question

I have a PDF document that contains images, hyperlinks, text, and other content.

I want to search for a string in the text only, i.e. images and hyperlinks are excluded.
How can I write Java code to do this? Could someone help?

Answer 1

Score: 2

You can use the Apache PDFBox library (https://pdfbox.apache.org/download.cgi).
Here is an example:

import java.io.File;
import java.io.IOException;
import java.util.Scanner;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

public class Main {
    public static void main(String[] args) throws IOException {
        Scanner scan = new Scanner(System.in);
        System.out.println("Type the path of the PDF file: ");
        String PDFdir = scan.nextLine();
        System.out.println("Input the phrase to find: ");
        String phrase = scan.nextLine();
        File file = new File(PDFdir);
        // PDDocument.load is the PDFBox 2.x API; PDFBox 3.x uses Loader.loadPDF(file) instead.
        // try-with-resources closes the document even if text extraction throws.
        try (PDDocument doc = PDDocument.load(file)) {
            // PDFTextStripper returns the document's text content; image data is not part of it.
            PDFTextStripper findPhrase = new PDFTextStripper();
            String PDF_content = findPhrase.getText(doc);
            String result = PDF_content.contains(phrase) ? "Yes" : "No";
            System.out.println(result);
        }
    }
}

Remember to download the PDFBox JAR file and add it to your project's classpath.
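
One thing to watch for: extracted PDF text typically contains a line break at the end of each line, so a plain `contains` check can miss a phrase that wraps across two lines or differs only in case. Below is a minimal sketch of a whitespace- and case-insensitive check. The class name `NormalizedSearch` and its methods are hypothetical helpers, not part of PDFBox, and the sample string stands in for the text returned by `PDFTextStripper.getText`.

```java
import java.util.Locale;

class NormalizedSearch {
    // Collapses every run of whitespace (including line breaks) into one space and lowercases the text.
    static String normalize(String s) {
        return s.replaceAll("\\s+", " ").trim().toLowerCase(Locale.ROOT);
    }

    // Returns true if the phrase occurs in the text after both are normalized.
    static boolean containsNormalized(String text, String phrase) {
        return normalize(text).contains(normalize(phrase));
    }

    public static void main(String[] args) {
        // Made-up sample standing in for text extracted from a PDF.
        String extracted = "Search for a\nstring in the   PDF text.";
        System.out.println(containsNormalized(extracted, "a string in the pdf"));
    }
}
```

Whether to ignore case and line breaks depends on the search you need; for an exact match, the plain `contains` call is enough.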

Edit:

You can also count how many times the phrase occurs in the PDF:

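if (result.equals("Yes") && !phrase.isEmpty()) {
    int counter = 0;
    // indexOf matches the phrase literally; replaceFirst would interpret it as a regular expression.
    int index = PDF_content.indexOf(phrase);
    while (index != -1) {
        counter++;
        // Continue after the current match so occurrences are not counted twice.
        index = PDF_content.indexOf(phrase, index + phrase.length());
    }
    System.out.println(counter);
}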
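
When counting, keep in mind that `String.replaceFirst` treats its first argument as a regular expression, so a phrase containing characters such as `.` or `(` can be miscounted or can throw an exception. If you prefer a regex-based count, here is a minimal sketch that escapes the phrase with `Pattern.quote` first. The class `PhraseCounter` and the method `countOccurrences` are hypothetical names for illustration, and the sample text is made up.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PhraseCounter {
    // Counts non-overlapping literal occurrences of phrase in text.
    static int countOccurrences(String text, String phrase) {
        if (phrase.isEmpty()) {
            return 0; // an empty phrase would otherwise match at every position
        }
        // Pattern.quote escapes regex metacharacters so the phrase is matched literally.
        Matcher matcher = Pattern.compile(Pattern.quote(phrase)).matcher(text);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // Made-up sample text standing in for the extracted PDF content.
        System.out.println(countOccurrences("a.b a.b axb", "a.b"));
    }
}
```

Without the quoting, the `.` in `a.b` would also match `axb`, which is why the literal match matters here.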
