tika.parseToString returns empty string

Question

Why doesn't the following application print the file contents?

package org.example;

import org.apache.tika.Tika;
import java.io.File;

public class TikaFirstTry {
    public static void main(String[] args) throws Exception {
        Tika tika = new Tika();

        for (String fileName : args){
            System.out.println(fileName);
            String text = tika.parseToString(new File(fileName));
            System.out.println("text is: " + text);
        }
    }
}

The file foo.txt contains:

pizzaaaaa

The program output is:

C:/Users/me/Desktop/foo.txt
text is: 

and no exception is thrown...

My pom.xml contains:

<dependencies>
  <dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-async-cli</artifactId>
    <version>2.7.1-SNAPSHOT</version>
  </dependency>
</dependencies>

Answer 1

Score: 2

TL;DR

These are the relevant dependency sections in pom.xml which are required to run your example:

<project>
  <dependencyManagement>
    <dependencies>
      <dependency>
        <groupId>org.apache.tika</groupId>
        <artifactId>tika-bom</artifactId>
        <version>2.7.0</version>
        <type>pom</type>
        <scope>import</scope>
      </dependency>
    </dependencies>
  </dependencyManagement>

  <dependencies>
    <dependency>
      <groupId>org.apache.tika</groupId>
      <artifactId>tika-async-cli</artifactId>
    </dependency>
    <dependency>
      <groupId>org.apache.tika</groupId>
      <artifactId>tika-parsers-standard-package</artifactId>
    </dependency>
  </dependencies>
</project>

Full answer

First of all, as @khmarbaise has noticed, your tika-async-cli dependency version looks wrong. As of 26 February, only two versions of the tika-async-cli artifact are available for download: 2.6.0 and 2.7.0. The version you've shared (2.7.1-SNAPSHOT) is not among them, and mvn install fails when it tries to fetch that version from Maven Central.

You need both tika-core and tika-parsers-* dependencies to run your example.

You've already included tika-core since tika-async-cli includes it as a direct dependency:

$ mvn dependency:tree
# ...
[INFO] +- org.apache.tika:tika-async-cli:jar:2.7.0:compile
[INFO] |  +- org.apache.tika:tika-core:jar:2.7.0:compile
[INFO] |  |  \- commons-io:commons-io:jar:2.11.0:compile
[INFO] |  +- org.apache.logging.log4j:log4j-core:jar:2.19.0:compile       
[INFO] |  |  \- org.apache.logging.log4j:log4j-api:jar:2.19.0:compile     
[INFO] |  \- org.apache.logging.log4j:log4j-slf4j2-impl:jar:2.19.0:compile
# ...

As @Gagravarr has hinted, your dependencies section was missing one of the tika-parsers-* dependencies. Currently these come as three separate artifacts:

  • tika-parsers-standard-package,
  • tika-parser-scientific-module,
  • tika-parser-sqlite3-module.

As I understand it, this split came about with Tika 2.0 (more on that here). For your purposes, tika-parsers-standard-package should be sufficient.
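To confirm that a parser module actually reached the runtime classpath, a small probe like the one below can help. It is only a sketch: it uses plain JDK reflection, ParserClasspathCheck is a made-up name, and org.apache.tika.parser.txt.TXTParser is assumed to be the plain-text parser class shipped with tika-parsers-standard-package, so verify that name against your Tika version.

```java
// Minimal classpath probe; uses only the JDK, so it compiles without Tika.
// ParserClasspathCheck is a hypothetical helper, not part of Tika.
class ParserClasspathCheck {
    // Assumed class name of Tika's plain-text parser (verify for your version).
    static final String TXT_PARSER = "org.apache.tika.parser.txt.TXTParser";

    static boolean isOnClasspath(String className) {
        try {
            // 'false' = look the class up without running its static initializers.
            Class.forName(className, false, ParserClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("tika-core present:   " + isOnClasspath("org.apache.tika.Tika"));
        System.out.println("text parser present: " + isOnClasspath(TXT_PARSER));
    }
}
```

If the first line prints true and the second prints false, the project most likely has tika-core but no parser modules, which matches the situation in the question.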

The README at https://github.com/apache/tika suggests a Maven configuration, but it is unfortunately incomplete.

I suspect you do not see an exception because Tika falls back to an EmptyParser when parsers are not loaded. It creates an empty XHTML document in the background and such a document has no text content. Hence your code outputs an empty string.
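The silent fallback can be illustrated with a simplified model. This is not Tika's actual code: TextExtractor, FallbackDemo, and the media-type lookup are hypothetical stand-ins that only mimic the suspected behaviour, where a missing parser is replaced by one that accepts any input and produces no text.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified model of the suspected behaviour; NOT Tika's real implementation.
// All names here are hypothetical.
interface TextExtractor {
    String extract(String rawContent);
}

class FallbackDemo {
    // Stand-in for an "empty" fallback: accepts any input, yields no text.
    static final TextExtractor EMPTY = raw -> "";

    private final Map<String, TextExtractor> byType = new HashMap<>();

    void register(String mediaType, TextExtractor extractor) {
        byType.put(mediaType, extractor);
    }

    String parseToString(String mediaType, String rawContent) {
        // No extractor registered for this type: fall back silently, no exception.
        return byType.getOrDefault(mediaType, EMPTY).extract(rawContent);
    }

    public static void main(String[] args) {
        // No parsers registered, as when the parser modules are missing.
        FallbackDemo noParsers = new FallbackDemo();
        System.out.println("text is: " + noParsers.parseToString("text/plain", "pizzaaaaa"));

        // With a text extractor registered, the content comes through.
        FallbackDemo withParsers = new FallbackDemo();
        withParsers.register("text/plain", raw -> raw);
        System.out.println("text is: " + withParsers.parseToString("text/plain", "pizzaaaaa"));
    }
}
```

In this model the first call returns an empty string without any error, which is the same symptom as in the question; once an extractor for the type is registered, the text is returned.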

huangapple
  • Published on February 27, 2023 at 00:50:56
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75573537.html