Remove stop words from a string in order to create clusters
Question
I need to use OpenRefine, which is written in Java, to build some clusters.
What I cannot work out is how to alter the code,
especially here at line 93:
s = s.trim(); // first off, remove whitespace around the string
s = s.toLowerCase(); // TODO: This is using the default locale. Is that what we want?
s = normalize(s);
s = punctctrl.matcher(s).replaceAll(""); // decomposition can generate punctuation so strip it after
String[] frags = StringUtils.split(s); // split by whitespace (excluding supplementary characters)
TreeSet<String> set = new TreeSet<String>();
for (String ss : frags) {
set.add(ss); // order fragments and dedupe
}
How can I change it so that the word "and" and the "&" symbol are also removed before the clusters are generated?
Thank you in advance for any help.
Answer 1
Score: 0
Place the words or characters you want to remove from the string into a String[] array, then use a loop to remove them:
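String[] alsoReplace = {"and", "the", "&"};
for (String str : alsoReplace) {
    // (?i) ignores case; the lookarounds restrict the match to whole whitespace-delimited
    // tokens, so "and" is removed but "sand" is left intact. quote() escapes regex metacharacters.
    s = s.replaceAll("(?i)(?<!\\S)" + java.util.regex.Pattern.quote(str) + "(?!\\S)\\s*", "");
}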
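An alternative is to drop the unwanted tokens inside the loop that fills the TreeSet, so no extra regex pass is needed. The following is only a rough sketch under assumptions, not OpenRefine's actual code: the class name StopWordFingerprint, the STOP_WORDS set and the PUNCT pattern are hypothetical, the normalize(s) step from the question is left out, and plain String.split stands in for StringUtils.split so the example has no external dependencies.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import java.util.TreeSet;
import java.util.regex.Pattern;

// Hypothetical stand-alone keyer for illustration; not OpenRefine's actual class.
class StopWordFingerprint {
    // Assumed pattern: strips ASCII punctuation, which includes '&'.
    private static final Pattern PUNCT = Pattern.compile("\\p{Punct}");
    // Illustrative stop-word list; "&" is kept in case punctuation stripping is changed.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("and", "&"));

    static String key(String s) {
        s = s.trim().toLowerCase(Locale.ROOT);  // fixed locale instead of the default one
        s = PUNCT.matcher(s).replaceAll("");    // remove punctuation before tokenizing
        TreeSet<String> set = new TreeSet<>();  // orders the fragments and dedupes them
        for (String token : s.split("\\s+")) {
            if (token.isEmpty() || STOP_WORDS.contains(token)) {
                continue;                       // skip empty fragments and stop words
            }
            set.add(token);
        }
        return String.join(" ", set);
    }

    public static void main(String[] args) {
        // Both lines print: johnson sons
        System.out.println(key("Johnson & Johnson and Sons"));
        System.out.println(key("Johnson and Johnson & Sons"));
    }
}
```

Because the filter works on whole tokens, a word such as "Sandy" is left untouched, and two strings that differ only in "and" versus "&" end up with the same key and therefore in the same cluster.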