I don't want to remove stop words by splitting words into letters

Question

I am writing this piece of code to remove stop words from my text.

Problem: This code removes stop words correctly, but it goes wrong when the text contains words such as "ant" or "ide". Both are removed, because "ant" occurs inside the stop words "important" and "want", and "ide" occurs inside "side". I don't want stop words to be matched by splitting them into their letters; only whole words should be removed.

String sCurrentLine;
List<String> stopWordsofwordnet = new ArrayList<>();
FileReader fr = new FileReader("G:\\stopwords.txt");
BufferedReader br = new BufferedReader(fr);
while ((sCurrentLine = br.readLine()) != null) {
    stopWordsofwordnet.add(sCurrentLine);
}
//out.println("<br>" + stopWordsofwordnet);
List<String> wordsList = new ArrayList<>();
String text = request.getParameter("textblock");
text = text.trim().replaceAll("[\\s,;]+", " ");
String[] words = text.split(" ");
//wordsList.addAll(Arrays.asList(words));
for (String word : words) {
    wordsList.add(word);
}
out.println("<br>");
// remove stop words here from the temp list
for (int i = 0; i < wordsList.size(); i++) {
    // get the item as string
    for (int j = 0; j < stopWordsofwordnet.size(); j++) {
        if (stopWordsofwordnet.get(j).contains(wordsList.get(i).toLowerCase())) {
            out.println(wordsList.get(i) + "&nbsp;");
            wordsList.remove(i);
            i--;
            break;
        }
    }
}
out.println("<br>");
for (String str : wordsList) {
    out.print(str + " ");
}
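
The behaviour described above follows from `String.contains`, which tests for a substring, combined with a condition that asks whether the stop word contains the text word. Below is a minimal sketch of the difference. The class name `ContainsVsEquals`, its two helper methods, and the three-word stop list are made up for illustration and stand in for the contents of `G:\\stopwords.txt`.

```java
import java.util.Arrays;
import java.util.List;

public class ContainsVsEquals {
    // Hypothetical stand-in for the contents of the stop-word file.
    static final List<String> STOP_WORDS = Arrays.asList("important", "want", "side");

    // Mirrors the condition in the question: true if any stop word
    // contains the given word as a substring.
    static boolean matchesBySubstring(String word) {
        for (String stop : STOP_WORDS) {
            if (stop.contains(word.toLowerCase())) {
                return true;
            }
        }
        return false;
    }

    // Whole-word comparison: true only if the word equals a stop word,
    // ignoring case.
    static boolean matchesWholeWord(String word) {
        for (String stop : STOP_WORDS) {
            if (stop.equalsIgnoreCase(word)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        for (String word : new String[] {"ant", "ide", "want"}) {
            System.out.println(word
                    + ": substring=" + matchesBySubstring(word)
                    + ", wholeWord=" + matchesWholeWord(word));
        }
    }
}
```

With this stop list, "ant" and "ide" match only under the substring test, while "want" matches under both.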

Answer 1

Score: 0

Your code is overly complicated, and can be reduced to this:

import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Load stop words from file
Set<String> stopWords = new TreeSet<>(String.CASE_INSENSITIVE_ORDER);
stopWords.addAll(Files.readAllLines(Paths.get("G:\\stopwords.txt")));
// Get text and split into words
String text = request.getParameter("textblock");
List<String> wordsList = new ArrayList<>(Arrays.asList(
        text.replaceAll("[\\s,;]+", " ").trim().split(" ")));
// Remove stop words from list of words
wordsList.removeAll(stopWords);

Because the set lookup compares each word in full (ignoring case) instead of searching for substrings, "ant" is no longer removed just because "want" is a stop word.
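
To try this approach without a servlet container or the stop-word file, here is a self-contained sketch. The class name `StopWordFilter`, the method `removeStopWords`, and the sample stop list are assumptions made for illustration; in the real code the stop words would still come from `G:\\stopwords.txt` and the text from `request.getParameter("textblock")`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

public class StopWordFilter {
    // Hypothetical helper: returns the words of `text` that are not stop words.
    static List<String> removeStopWords(String text, Collection<String> stopWordList) {
        // Case-insensitive set, so "The" matches the stop word "the".
        Set<String> stopWords = new TreeSet<>(String.CASE_INSENSITIVE_ORDER);
        stopWords.addAll(stopWordList);

        // Collapse whitespace, commas and semicolons into single spaces, then split.
        List<String> words = new ArrayList<>(Arrays.asList(
                text.replaceAll("[\\s,;]+", " ").trim().split(" ")));

        // ArrayList.removeAll asks the set whether it contains each word,
        // so only whole-word matches are removed.
        words.removeAll(stopWords);
        return words;
    }

    public static void main(String[] args) {
        // Sample stop list, standing in for the file contents.
        List<String> stop = Arrays.asList("important", "want", "side", "the");
        System.out.println(removeStopWords("The ant; want IDE, side", stop));
        // prints [ant, IDE]
    }
}
```

Here "ant" and "IDE" survive even though they are substrings of "want" and "side", while "The", "want" and "side" are removed.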

huangapple
  • Posted on 2020-10-13 23:01:35
  • When reposting, please keep the link to this article: https://go.coder-hub.com/64337797.html