I don't want to remove stop words by splitting words into letters
Question
I am writing this piece of code to remove stop words from my text.
Problem: the code removes stop words correctly, but it also removes words such as ant and ide from my text, because ant occurs inside important and want, and ide occurs inside side. I only want whole words that match a stop word to be removed; I don't want words matched piece by piece.
String sCurrentLine;
List<String> stopWordsofwordnet = new ArrayList<>();
FileReader fr = new FileReader("G:\\stopwords.txt");
BufferedReader br = new BufferedReader(fr);
while ((sCurrentLine = br.readLine()) != null)
{
    stopWordsofwordnet.add(sCurrentLine);
}
//out.println("<br>"+stopWordsofwordnet);
List<String> wordsList = new ArrayList<>();
String text = request.getParameter("textblock");
text = text.trim().replaceAll("[\\s,;]+", " ");
String[] words = text.split(" ");
// wordsList.addAll(Arrays.asList(words));
for (String word : words) {
    wordsList.add(word);
}
out.println("<br>");
//remove stop words here from the temp list
for (int i = 0; i < wordsList.size(); i++)
{
    // get the item as string
    for (int j = 0; j < stopWordsofwordnet.size(); j++)
    {
        if (stopWordsofwordnet.get(j).contains(wordsList.get(i).toLowerCase()))
        {
            out.println(wordsList.get(i) + "&nbsp;");
            wordsList.remove(i);
            i--;
            break;
        }
    }
}
out.println("<br>");
for (String str : wordsList) {
    out.print(str + " ");
}
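The behavior described above can be reproduced in isolation. The following is a minimal sketch, assuming a hypothetical three-entry stop list (important, want, side) in place of the real stopwords.txt; the class and method names are made up for illustration. It contrasts the contains test used in the loop with a whole-word comparison.

```java
import java.util.Arrays;
import java.util.List;

// Minimal reproduction of the substring match; the stop list is hypothetical.
class SubstringMatchDemo {

    // Same test as the loop above: true when any stop word merely contains the word.
    static boolean matchesBySubstring(List<String> stopWords, String word) {
        for (String stopWord : stopWords) {
            if (stopWord.contains(word.toLowerCase())) {
                return true;
            }
        }
        return false;
    }

    // Whole-word test: true only when the word equals a stop word, ignoring case.
    static boolean matchesWholeWord(List<String> stopWords, String word) {
        for (String stopWord : stopWords) {
            if (stopWord.equalsIgnoreCase(word)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> stopWords = Arrays.asList("important", "want", "side");
        for (String word : Arrays.asList("ant", "ide", "Want")) {
            System.out.println(word
                    + " substring=" + matchesBySubstring(stopWords, word)
                    + " wholeWord=" + matchesWholeWord(stopWords, word));
        }
    }
}
```

With this stop list, ant and ide match under the substring test but not under the whole-word test, while Want matches under both.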
Answer 1
Score: 0
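Your code is overly complicated. It can be reduced to the snippet below, which removes only whole words that equal a stop word (ignoring case), because removeAll checks membership in the set instead of testing substrings: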
// Load stop words from file
Set<String> stopWords = new TreeSet<>(String.CASE_INSENSITIVE_ORDER);
stopWords.addAll(Files.readAllLines(Paths.get("G:\\stopwords.txt")));
// Get text and split into words
String text = request.getParameter("textblock");
List<String> wordsList = new ArrayList<>(Arrays.asList(
        text.replaceAll("[\\s,;]+", " ").trim().split(" ")));
// Remove stop words from list of words
wordsList.removeAll(stopWords);
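To try the same approach outside a servlet, here is a self-contained sketch. It assumes the stop words are passed in as a list of lines instead of being read from G:\stopwords.txt, and the class name StopWordFilter, the method removeStopWords, and the sample text are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical wrapper around the snippet above, using in-memory stop words.
class StopWordFilter {

    // Returns the words of text that are not stop words, compared ignoring case.
    static List<String> removeStopWords(String text, List<String> stopWordLines) {
        Set<String> stopWords = new TreeSet<>(String.CASE_INSENSITIVE_ORDER);
        stopWords.addAll(stopWordLines);

        // Collapse whitespace, commas and semicolons into single spaces, then split.
        List<String> words = new ArrayList<>(Arrays.asList(
                text.replaceAll("[\\s,;]+", " ").trim().split(" ")));

        // Drops every word for which stopWords.contains(word) is true.
        words.removeAll(stopWords);
        return words;
    }

    public static void main(String[] args) {
        List<String> stopWordLines =
                Arrays.asList("important", "want", "side", "the", "is");
        String text = "The ant is on the side; I want an IDE";
        // prints [ant, on, I, an, IDE]
        System.out.println(removeStopWords(text, stopWordLines));
    }
}
```

In this sample, ant and IDE survive even though important, want and side are stop words, since only whole-word matches are removed.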