tika.parseToString returns empty string
Question
Why doesn't the following application print the file contents?
package org.example;

import org.apache.tika.Tika;
import java.io.File;

public class TikaFirstTry {
    public static void main(String[] args) throws Exception {
        Tika tika = new Tika();
        for (String fileName : args) {
            System.out.println(fileName);
            String text = tika.parseToString(new File(fileName));
            System.out.println("text is: " + text);
        }
    }
}
The file foo.txt contains:
pizzaaaaa
The program output is:
C:/Users/me/Desktop/foo.txt
text is:
No exception is thrown.
My pom.xml contains:
<dependencies>
    <dependency>
        <groupId>org.apache.tika</groupId>
        <artifactId>tika-async-cli</artifactId>
        <version>2.7.1-SNAPSHOT</version>
    </dependency>
</dependencies>
Answer 1
Score: 2
TL;DR
These are the relevant dependency sections of pom.xml that are required to run your example:
<project>
    <dependencyManagement>
        <dependencies>
            <dependency>
                <groupId>org.apache.tika</groupId>
                <artifactId>tika-bom</artifactId>
                <version>2.7.0</version>
                <type>pom</type>
                <scope>import</scope>
            </dependency>
        </dependencies>
    </dependencyManagement>
    <dependencies>
        <dependency>
            <groupId>org.apache.tika</groupId>
            <artifactId>tika-async-cli</artifactId>
        </dependency>
        <dependency>
            <groupId>org.apache.tika</groupId>
            <artifactId>tika-parsers-standard-package</artifactId>
        </dependency>
    </dependencies>
</project>
Full answer
First of all, as @khmarbaise noticed, your tika-async-cli dependency version looks wrong. As of 26 February, only two versions of the tika-async-cli artifact are available for download: 2.6.0 and 2.7.0. The version you've shared (2.7.1-SNAPSHOT) is not on that list, and mvn install throws an error when trying to fetch it from Maven Central.
You need both the tika-core and tika-parsers-* dependencies to run your example. You already have tika-core, since tika-async-cli includes it as a direct dependency:
$ mvn dependency:tree
# ...
[INFO] +- org.apache.tika:tika-async-cli:jar:2.7.0:compile
[INFO] | +- org.apache.tika:tika-core:jar:2.7.0:compile
[INFO] | | \- commons-io:commons-io:jar:2.11.0:compile
[INFO] | +- org.apache.logging.log4j:log4j-core:jar:2.19.0:compile
[INFO] | | \- org.apache.logging.log4j:log4j-api:jar:2.19.0:compile
[INFO] | \- org.apache.logging.log4j:log4j-slf4j2-impl:jar:2.19.0:compile
# ...
As @Gagravarr hinted, a tika-parsers-* dependency was missing from your dependencies section. Currently these come as three separate dependencies: tika-parsers-standard-package, tika-parser-scientific-module, and tika-parser-sqlite3-module.
As I understand it, this split came about with Tika 2.0 (more on that here). For your purposes, tika-parsers-standard-package seems sufficient.
The README at https://github.com/apache/tika suggests a Maven configuration, but it is unfortunately incomplete.
I suspect you do not see an exception because Tika falls back to an EmptyParser when no parsers are loaded. It creates an empty XHTML document in the background, and such a document has no text content. Hence your code outputs an empty string.
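Because the failure is silent, a quick classpath check can confirm whether a parser module is actually present before calling parseToString. The following is only a sketch and not part of Tika's API: the class ParserClasspathCheck and its helper isOnClasspath are hypothetical names, and org.apache.tika.parser.txt.TXTParser is my assumption for the plain-text parser class that tika-parsers-standard-package pulls in. Adjust the name if your Tika version differs.

```java
public class ParserClasspathCheck {

    // Assumption: fully qualified name of Tika's plain-text parser in Tika 2.x.
    static final String TEXT_PARSER_CLASS = "org.apache.tika.parser.txt.TXTParser";

    // Returns true if the named class can be located by this class's loader.
    // The class is looked up without being initialized.
    public static boolean isOnClasspath(String className) {
        try {
            Class.forName(className, false, ParserClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        if (isOnClasspath(TEXT_PARSER_CLASS)) {
            System.out.println("Plain-text parser found on the classpath.");
        } else {
            System.out.println("No plain-text parser found; "
                    + "parseToString will likely return an empty string.");
        }
    }
}
```

If this reports that the parser is missing, adding tika-parsers-standard-package to pom.xml as shown in the TL;DR is the likely fix.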