Using default and custom stop words with Apache's Lucene (weird output)

Question

I'm removing stop words from a String, using Apache's Lucene (8.6.3) and the following Java 8 code:

private static final String CONTENTS = "contents";
final String text = "This is a short test! Bla!";
final List<String> stopWords = Arrays.asList("short", "test");
final CharArraySet stopSet = new CharArraySet(stopWords, true);

try {
    Analyzer analyzer = new StandardAnalyzer(stopSet);
    TokenStream tokenStream = analyzer.tokenStream(CONTENTS, new StringReader(text));
    CharTermAttribute term = tokenStream.addAttribute(CharTermAttribute.class);
    tokenStream.reset();

    while (tokenStream.incrementToken()) {
        System.out.print("[" + term.toString() + "] ");
    }

    tokenStream.close();
    analyzer.close();
} catch (IOException e) {
    System.out.println("Exception:\n");
    e.printStackTrace();
}

This outputs the desired result:

> [this] [is] [a] [bla]

Now I want to use both the default English stop set, which should also remove "this", "is" and "a" (according to GitHub) AND the custom stop set above (the actual one I'm going to use is a lot longer), so I tried this:

Analyzer analyzer = new EnglishAnalyzer(stopSet);

The output is:

> [thi] [is] [a] [bla]

Yes, the "s" in "this" is missing. What's causing this? It also didn't use the default stop set.

The following changes remove both the default and the custom stop words:

Analyzer analyzer = new EnglishAnalyzer();
TokenStream tokenStream = analyzer.tokenStream(CONTENTS, new StringReader(text));
tokenStream = new StopFilter(tokenStream, stopSet);

Question: What is the "right" way to do this? Is reassigning tokenStream to a filter that wraps it (see the code above) going to cause problems?

Bonus question: How do I output the remaining words with the right upper/lower case, i.e. as they appear in the original text?

Answer 1

Score: 4

I will tackle this in two parts:

  • stop-words
  • preserving original case

Handling the Combined Stop Words

To handle the combination of Lucene's English stop word list, plus your own custom list, you can create a merged list as follows:

import org.apache.lucene.analysis.en.EnglishAnalyzer;

...

final List<String> stopWords = Arrays.asList("short", "test");
final CharArraySet stopSet = new CharArraySet(stopWords, true);

CharArraySet enStopSet = EnglishAnalyzer.ENGLISH_STOP_WORDS_SET;
stopSet.addAll(enStopSet);

The above code simply takes the English stopwords bundled with Lucene and merges them with your list.
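
To be explicit, the merged set then goes into the same pipeline as in your question - only the analyzer's stop set changes. A minimal sketch:

// The merged stop set is passed to the StandardAnalyzer from the question.
// It now removes "this", "is" and "a" in addition to "short" and "test".
Analyzer analyzer = new StandardAnalyzer(stopSet);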

That gives the following output:

[bla]

Handling Word Case

This is a bit more involved. As you have noticed, the StandardAnalyzer includes a step in which all words are converted to lower case - so we can't use that.

Also, if you want to maintain your own custom list of stop words, and if that list is of any size, I would recommend storing it in its own text file, rather than embedding the list into your code.

So, let's assume you have a file called stopwords.txt. In this file, there will be one word per line - and the file will already contain the merged list of your custom stop words and the official list of English stop words.

You will need to prepare this file by hand (i.e. ignore the notes in part 1 of this answer).

My test file is just this:

short
this
is
a
test
the
him
it

I also prefer to use the CustomAnalyzer for something like this, as it lets me build an analyzer very simply.

import org.apache.lucene.analysis.custom.CustomAnalyzer;

...

Analyzer analyzer = CustomAnalyzer.builder()
        .withTokenizer("icu")
        .addTokenFilter("stop",
                "ignoreCase", "true",
                "words", "stopwords.txt",
                "format", "wordset")
        .build();

This does the following:

  1. It uses the "icu" tokenizer org.apache.lucene.analysis.icu.segmentation.ICUTokenizer, which takes care of tokenizing on Unicode whitespace, and handling punctuation (note the extra dependency this needs - see the snippet after this list).


  2. It applies the stopword list. Note the use of true for the ignoreCase attribute, and the reference to the stop-word file. The format of wordset means "one word per line" (there are other formats, also).
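
One practical note: the "icu" tokenizer is not part of lucene-core, so the lucene-analyzers-icu module needs to be on the classpath. A hedged Maven snippet, assuming the 8.6.3 version from the question:

<dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-analyzers-icu</artifactId>
    <version>8.6.3</version>
</dependency>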

The key here is that there is nothing in the above chain which changes word case.

So, now, using this new analyzer, the output is as follows:

[Bla]
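
To reproduce that output, the analyzer is driven with the same consume-loop as in the question. A sketch (error handling omitted; assumes stopwords.txt is on the classpath and CONTENTS/text are as defined earlier):

// Consume the token stream and print each surviving token.
TokenStream tokenStream = analyzer.tokenStream(CONTENTS, new StringReader(text));
CharTermAttribute term = tokenStream.addAttribute(CharTermAttribute.class);
tokenStream.reset();

while (tokenStream.incrementToken()) {
    System.out.print("[" + term.toString() + "] ");   // prints: [Bla]
}

tokenStream.end();   // finalize stream state before closing
tokenStream.close();
analyzer.close();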

Final Notes

Where do you put the stop list file? By default, Lucene expects to find it on the classpath of your application. So, for example, you can put it in the default package.

But remember that the file needs to be handled by your build process, so that it ends up alongside the application's class files (not left behind with the source code).

I mostly use Maven - and therefore I have this in my POM to ensure the ".txt" file gets deployed as needed:

    <build>
        <resources>
            <resource>
                <directory>src/main/java</directory>
                <excludes>
                    <exclude>**/*.java</exclude>
                </excludes>
            </resource>
        </resources>
    </build>

This tells Maven to copy files (except Java source files) to the build target - thus ensuring the text file gets copied.
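
(As an alternative, Maven's standard convention also works: put stopwords.txt under src/main/resources, and Maven copies it onto the classpath by default, with no extra POM configuration.)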

Final note - that truncated [thi] token is actually produced by the EnglishAnalyzer itself: its analysis chain ends with a PorterStemFilter, and the Porter stemmer strips the trailing "s" from "this", leaving "thi". Likewise, passing a custom stop set to the EnglishAnalyzer constructor replaces the default English stop list rather than adding to it - which is why your default stop words were not applied.


Follow-Up Questions

>After combining I have to use the StandardAnalyzer, right?

Yes, that is correct. The notes I provided in part 1 of the answer relate directly to the code in your question, and to the StandardAnalyzer you use.

>I want to keep the stop word file on a specific non-imported path - how to do that?

You can tell the CustomAnalyzer to look in a "resources" directory for the stop-words file. That directory can be anywhere on the file system (for easy maintenance, as you noted):

import java.nio.file.Path;
import java.nio.file.Paths;

...

Path resources = Paths.get("/path/to/resources/directory");

Analyzer analyzer = CustomAnalyzer.builder(resources)
        .withTokenizer("icu")
        .addTokenFilter("stop",
                "ignoreCase", "true",
                "words", "stopwords.txt",
                "format", "wordset")
        .build();

Instead of using .builder() we now use .builder(resources).
