tika.parseToString returns empty string

Question

Why doesn't the following application print the file contents?

    package org.example;

    import org.apache.tika.Tika;

    import java.io.File;

    public class TikaFirstTry {
        public static void main(String[] args) throws Exception {
            Tika tika = new Tika();
            for (String fileName : args) {
                System.out.println(fileName);
                String text = tika.parseToString(new File(fileName));
                System.out.println("text is: " + text);
            }
        }
    }

The file foo.txt contains:

    pizzaaaaa

The program output is:

    C:/Users/me/Desktop/foo.txt
    text is:

No exception is thrown.

My pom.xml contains:

    <dependencies>
        <dependency>
            <groupId>org.apache.tika</groupId>
            <artifactId>tika-async-cli</artifactId>
            <version>2.7.1-SNAPSHOT</version>
        </dependency>
    </dependencies>

Answer 1

Score: 2

TL;DR

These are the dependency sections of pom.xml required to run your example:

    <project>
        <dependencyManagement>
            <dependencies>
                <dependency>
                    <groupId>org.apache.tika</groupId>
                    <artifactId>tika-bom</artifactId>
                    <version>2.7.0</version>
                    <type>pom</type>
                    <scope>import</scope>
                </dependency>
            </dependencies>
        </dependencyManagement>
        <dependencies>
            <dependency>
                <groupId>org.apache.tika</groupId>
                <artifactId>tika-async-cli</artifactId>
            </dependency>
            <dependency>
                <groupId>org.apache.tika</groupId>
                <artifactId>tika-parsers-standard-package</artifactId>
            </dependency>
        </dependencies>
    </project>

Full answer

First of all, as @khmarbaise noticed, your tika-async-cli dependency version looks wrong. As of 26 February, only two versions of the tika-async-cli artifact are available for download: 2.6.0 and 2.7.0. The version you shared (2.7.1-SNAPSHOT) is not on that list, and mvn install fails when it tries to fetch it from Maven Central.

You need both tika-core and tika-parsers-* dependencies to run your example.

You already have tika-core, because tika-async-cli pulls it in as a direct dependency:

    $ mvn dependency:tree
    # ...
    [INFO] +- org.apache.tika:tika-async-cli:jar:2.7.0:compile
    [INFO] |  +- org.apache.tika:tika-core:jar:2.7.0:compile
    [INFO] |  |  \- commons-io:commons-io:jar:2.11.0:compile
    [INFO] |  +- org.apache.logging.log4j:log4j-core:jar:2.19.0:compile
    [INFO] |  |  \- org.apache.logging.log4j:log4j-api:jar:2.19.0:compile
    [INFO] |  \- org.apache.logging.log4j:log4j-slf4j2-impl:jar:2.19.0:compile
    # ...

As @Gagravarr hinted, your dependencies section was missing a tika-parsers-* dependency. Currently the parsers ship as 3 separate dependencies:

  • tika-parsers-standard-package,
  • tika-parser-scientific-module,
  • tika-parser-sqlite3-module.

As I understand it, this split came about with Tika 2.0. For your purposes, tika-parsers-standard-package seems sufficient.
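If you want to confirm what is actually on the classpath before touching the POM again, a quick reflection check can help. This is only a minimal sketch: the helper isOnClasspath is hypothetical, and the two class names are my assumptions about where Tika 2.x keeps its facade and its plain-text parser, so adjust them if your version packages things differently.

```java
public class ParserClasspathCheck {

    // Hypothetical helper: returns true if the named class can be loaded
    // from the current classpath.
    static boolean isOnClasspath(String className) {
        try {
            // initialize=false: only check presence, do not run static initializers
            Class.forName(className, false, ParserClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Assumed class names for Tika 2.x; they are not taken from the answer above.
        String[] candidates = {
            "org.apache.tika.Tika",                 // expected to come with tika-core
            "org.apache.tika.parser.txt.TXTParser"  // expected to come with the standard parsers package
        };
        for (String name : candidates) {
            System.out.println(name + " -> " + (isOnClasspath(name) ? "found" : "MISSING"));
        }
    }
}
```

If the first class is found but the second is reported as MISSING, that would be consistent with tika-core being present while no parser package is.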

The README at https://github.com/apache/tika suggests a Maven configuration, but it is unfortunately incomplete.

I suspect you do not see an exception because Tika falls back to an EmptyParser when no parsers are loaded. It creates an empty XHTML document in the background, and such a document has no text content. Hence your code prints an empty string.
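To illustrate why a silent fallback produces exactly this symptom, here is a toy model of the pattern. It deliberately does not use Tika's classes: FallbackParserSketch, register and the media-type keys are all hypothetical names I made up for the illustration, and the real lookup inside Tika is more involved.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

public class FallbackParserSketch {

    // Hypothetical stand-in for a parser registry; NOT Tika's real implementation.
    private final Map<String, Function<String, String>> parsersByType = new HashMap<>();

    // Fallback that mimics the behaviour described above: accept anything, extract nothing.
    private static final Function<String, String> EMPTY = content -> "";

    void register(String mediaType, Function<String, String> parser) {
        parsersByType.put(mediaType, parser);
    }

    String parseToString(String mediaType, String content) {
        // No matching parser: silently use the empty fallback instead of throwing.
        return parsersByType.getOrDefault(mediaType, EMPTY).apply(content);
    }

    public static void main(String[] args) {
        // Registry with no parsers, like a classpath without any parser package.
        FallbackParserSketch noParsers = new FallbackParserSketch();
        System.out.println("text is: " + noParsers.parseToString("text/plain", "pizzaaaaa"));

        // Registry with a text parser, like a classpath that has the parsers.
        FallbackParserSketch withTextParser = new FallbackParserSketch();
        withTextParser.register("text/plain", String::trim);
        System.out.println("text is: " + withTextParser.parseToString("text/plain", "pizzaaaaa"));
    }
}
```

The first call prints an empty "text is: " line without any error, which matches what you observed; the second prints the file content once a parser for the type is registered.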

Posted by huangapple on 27 February 2023.