Using default and custom stop words with Apache's Lucene (weird output)
Question
I'm removing stop words from a String, using Apache's Lucene (8.6.3) and the following Java 8 code:
    private static final String CONTENTS = "contents";

    final String text = "This is a short test! Bla!";
    final List<String> stopWords = Arrays.asList("short", "test");
    final CharArraySet stopSet = new CharArraySet(stopWords, true);

    try {
        Analyzer analyzer = new StandardAnalyzer(stopSet);
        TokenStream tokenStream = analyzer.tokenStream(CONTENTS, new StringReader(text));
        CharTermAttribute term = tokenStream.addAttribute(CharTermAttribute.class);
        tokenStream.reset();

        while (tokenStream.incrementToken()) {
            System.out.print("[" + term.toString() + "] ");
        }

        tokenStream.close();
        analyzer.close();
    } catch (IOException e) {
        System.out.println("Exception:\n");
        e.printStackTrace();
    }
This outputs the desired result:
> [this] [is] [a] [bla]
Now I want to use both the default English stop set, which should also remove "this", "is" and "a" (according to github) AND the custom stop set above (the actual one I'm going to use is a lot longer), so I tried this:
    Analyzer analyzer = new EnglishAnalyzer(stopSet);
The output is:
> [thi] [is] [a] [bla]
Yes, the "s" in "this" is missing. What's causing this? It also didn't use the default stop set.
The following changes remove both the default and the custom stop words:
    Analyzer analyzer = new EnglishAnalyzer();
    TokenStream tokenStream = analyzer.tokenStream(CONTENTS, new StringReader(text));
    tokenStream = new StopFilter(tokenStream, stopSet);
Question: What is the "right" way to do this? Is using the tokenStream within itself (see code above) going to cause problems?
Bonus question: How do I output the remaining words with their original upper/lower case, as they appear in the original text?
Answer 1
Score: 4
I will tackle this in two parts:
- stop-words
- preserving original case
Handling the Combined Stop Words
To handle the combination of Lucene's English stop word list, plus your own custom list, you can create a merged list as follows:
    import org.apache.lucene.analysis.en.EnglishAnalyzer;

    ...

    final List<String> stopWords = Arrays.asList("short", "test");
    final CharArraySet stopSet = new CharArraySet(stopWords, true);

    CharArraySet enStopSet = EnglishAnalyzer.ENGLISH_STOP_WORDS_SET;
    stopSet.addAll(enStopSet);
The above code simply takes the English stop words bundled with Lucene and merges them with your list.
Passing the merged stopSet to the StandardAnalyzer from your question gives the following output:
[bla]
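For completeness, here is a minimal end-to-end sketch of the merge, assuming Lucene 8.6.3 with the core and common analyzer modules on the classpath. The class name MergedStopWordsDemo and the method remainingTokens are made up for this illustration. It also uses try-with-resources so the stream and analyzer are closed even if an exception is thrown.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.CharArraySet;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.en.EnglishAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

// Hypothetical demo class, not part of Lucene.
class MergedStopWordsDemo {

    // Returns the tokens left after removing the custom and the default English stop words.
    public static List<String> remainingTokens(String text, List<String> customStopWords)
            throws IOException {
        // Start from the custom list (ignoreCase = true), then add Lucene's English defaults.
        CharArraySet stopSet = new CharArraySet(customStopWords, true);
        stopSet.addAll(EnglishAnalyzer.ENGLISH_STOP_WORDS_SET);

        List<String> tokens = new ArrayList<>();
        try (Analyzer analyzer = new StandardAnalyzer(stopSet);
             TokenStream stream = analyzer.tokenStream("contents", text)) {
            CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
            stream.reset();
            while (stream.incrementToken()) {
                tokens.add(term.toString());
            }
            stream.end();
        }
        return tokens;
    }

    public static void main(String[] args) throws IOException {
        List<String> custom = Arrays.asList("short", "test");
        System.out.println(remainingTokens("This is a short test! Bla!", custom));
    }
}
```

Note that the defaults are added to the custom set rather than the other way round: ENGLISH_STOP_WORDS_SET is a shared, unmodifiable set.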
Handling Word Case
This is a bit more involved. As you have noticed, the StandardAnalyzer includes a step in which all words are converted to lower case - so we can't use that.
Also, if you want to maintain your own custom list of stop words, and if that list is of any size, I would recommend storing it in its own text file, rather than embedding the list into your code.
So, let's assume you have a file called stopwords.txt. In this file, there will be one word per line - and the file will already contain the merged list of your custom stop words and the official list of English stop words.
You will need to prepare this file manually yourself (i.e. ignore the notes in part 1 of this answer).
My test file is just this:
    short
    this
    is
    a
    test
    the
    him
    it
I also prefer to use the CustomAnalyzer for something like this, as it lets me build an analyzer very simply.
    import org.apache.lucene.analysis.custom.CustomAnalyzer;

    ...

    Analyzer analyzer = CustomAnalyzer.builder()
            .withTokenizer("icu")
            .addTokenFilter("stop",
                    "ignoreCase", "true",
                    "words", "stopwords.txt",
                    "format", "wordset")
            .build();
This does the following:
- It uses the "icu" tokenizer, org.apache.lucene.analysis.icu.segmentation.ICUTokenizer, which takes care of tokenizing on Unicode whitespace and handling punctuation.
- It applies the stop-word list. Note the use of true for the ignoreCase attribute, and the reference to the stop-word file. The format of wordset means "one word per line" (there are other formats, also).
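One practical assumption behind this chain: the "icu" tokenizer ships in Lucene's separate ICU analysis module, and CustomAnalyzer in the common analyzers module, so both need to be on the classpath. With Maven and the 8.6.3 version from the question, the dependencies would look roughly like this:

```xml
<dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-analyzers-common</artifactId>
    <version>8.6.3</version>
</dependency>
<dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-analyzers-icu</artifactId>
    <version>8.6.3</version>
</dependency>
```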
The key here is that there is nothing in the above chain which changes word case.
So, now, using this new analyzer, the output is as follows:
[Bla]
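If a stopwords file or the ICU module is not wanted, a similar case-preserving chain can be sketched with a small Analyzer subclass. This is a variation on the approach above, not the same code: it swaps in StandardTokenizer for the icu tokenizer and builds the stop set in memory as in part 1. The names CasePreservingStopDemo, newAnalyzer and remainingTokens are hypothetical.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.CharArraySet;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.en.EnglishAnalyzer;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

// Hypothetical demo class, not part of Lucene.
class CasePreservingStopDemo {

    // Tokenizer followed by a stop filter only: no lowercasing and no stemming.
    public static Analyzer newAnalyzer(CharArraySet stopSet) {
        return new Analyzer() {
            @Override
            protected TokenStreamComponents createComponents(String fieldName) {
                Tokenizer source = new StandardTokenizer();
                TokenStream result = new StopFilter(source, stopSet);
                return new TokenStreamComponents(source, result);
            }
        };
    }

    public static List<String> remainingTokens(String text, List<String> customStopWords)
            throws IOException {
        // ignoreCase = true lets "This" match the stop word "this" without altering the token.
        CharArraySet stopSet = new CharArraySet(customStopWords, true);
        stopSet.addAll(EnglishAnalyzer.ENGLISH_STOP_WORDS_SET);

        List<String> tokens = new ArrayList<>();
        try (Analyzer analyzer = newAnalyzer(stopSet);
             TokenStream stream = analyzer.tokenStream("contents", text)) {
            CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
            stream.reset();
            while (stream.incrementToken()) {
                tokens.add(term.toString());
            }
            stream.end();
        }
        return tokens;
    }

    public static void main(String[] args) throws IOException {
        List<String> custom = Arrays.asList("short", "test");
        System.out.println(remainingTokens("This is a short test! Bla!", custom));
    }
}
```

The design point is the same as with the CustomAnalyzer: matching is case-insensitive because of the stop set, while the emitted tokens keep their original spelling.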
Final Notes
Where do you put the stop list file? By default, Lucene expects to find it on the classpath of your application. So, for example, you can put it in the default package.
But remember that the file needs to be handled by your build process, so that it ends up alongside the application's class files (not left behind with the source code).
I mostly use Maven - and therefore I have this in my POM to ensure the ".txt" file gets deployed as needed:
    <build>
        <resources>
            <resource>
                <directory>src/main/java</directory>
                <excludes>
                    <exclude>**/*.java</exclude>
                </excludes>
            </resource>
        </resources>
    </build>
This tells Maven to copy files (except Java source files) to the build target - thus ensuring the text file gets copied.
Final note - I did not investigate why you were getting that truncated [thi] token. If I get a chance I will take a closer look.
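On the truncated [thi] token, which this answer leaves uninvestigated: a likely explanation is that EnglishAnalyzer applies a Porter stemmer after lowercasing and stop-word removal, and the Porter algorithm strips the trailing "s" from "this". Separately, the EnglishAnalyzer(CharArraySet) constructor uses the supplied set in place of the default English stop words, which would explain why "is" and "a" survived. The sketch below illustrates the stemming step in isolation under those assumptions. It is not the analyzer's exact chain, and the names StemDemo and stem are made up for the example.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.en.PorterStemFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

// Hypothetical demo class, not part of Lucene.
class StemDemo {

    // Tokenizes, lowercases and Porter-stems the text, with no stop-word removal.
    public static List<String> stem(String text) throws IOException {
        List<String> tokens = new ArrayList<>();
        Tokenizer source = new StandardTokenizer();
        source.setReader(new StringReader(text));
        // Closing the outermost filter also closes the wrapped tokenizer.
        try (TokenStream stream = new PorterStemFilter(new LowerCaseFilter(source))) {
            CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
            stream.reset();
            while (stream.incrementToken()) {
                tokens.add(term.toString());
            }
            stream.end();
        }
        return tokens;
    }

    public static void main(String[] args) throws IOException {
        // "this" loses its trailing "s" in the stemmer, matching the output in the question.
        System.out.println(stem("This"));
    }
}
```

If stemming is not wanted, the StandardAnalyzer with the merged stop set from part 1 avoids it. If the goal is simply to remove both lists, stemmed tokens only matter for words that survive the stop filter.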
Follow-Up Questions
> After combining I have to use the StandardAnalyzer, right?
Yes, that is correct. The notes I provided in part 1 of the answer relate directly to the code in your question, and to the StandardAnalyzer you use.
> I want to keep the stop word file on a specific non-imported path - how to do that?
You can tell the CustomAnalyzer to look in a "resources" directory for the stop-words file. That directory can be anywhere on the file system (for easy maintenance, as you noted):
    import java.nio.file.Path;
    import java.nio.file.Paths;

    ...

    Path resources = Paths.get("/path/to/resources/directory");

    Analyzer analyzer = CustomAnalyzer.builder(resources)
            .withTokenizer("icu")
            .addTokenFilter("stop",
                    "ignoreCase", "true",
                    "words", "stopwords.txt",
                    "format", "wordset")
            .build();
Instead of using .builder() we now use .builder(resources).