Remove stop words from a string in order to create clusters

Question

I need to use OpenRefine, which is written in Java, to generate some clusters.

What I cannot work out is how I should alter the code in FingerprintKeyer.java:

https://github.com/OpenRefine/OpenRefine/blob/master/main/src/com/google/refine/clustering/binning/FingerprintKeyer.java

especially here, at line 93:

https://github.com/OpenRefine/OpenRefine/blob/c76e2b9a461ed5b353ebf5c80e0e0cad2163331c/main/src/com/google/refine/clustering/binning/FingerprintKeyer.java#L93

s = s.trim(); // first off, remove whitespace around the string
s = s.toLowerCase(); // TODO: This is using the default locale. Is that what we want?
s = normalize(s);
s = punctctrl.matcher(s).replaceAll(""); // decomposition can generate punctuation so strip it after
String[] frags = StringUtils.split(s); // split by whitespace (excluding supplementary characters)
TreeSet<String> set = new TreeSet<String>();
for (String ss : frags) {
   set.add(ss); // order fragments and dedupe
}

so that it also removes the word "and" and the "&" symbol before the clusters are generated?

Thank you in advance for any help.

Answer 1

Score: 0

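Place the words or characters you want to remove into a String[] array, then use a loop to strip each one from the string:

// requires: import java.util.regex.Pattern;
String[] alsoReplace = {"and", "the", "&"};
for (String str : alsoReplace) {
    // Pattern.quote escapes regex metacharacters; the lookarounds restrict the match to
    // whole whitespace-delimited tokens, so "and" inside "sandwich" is left alone
    s = s.replaceAll("(?i)(?<!\\S)" + Pattern.quote(str) + "(?!\\S)\\s*", "");
}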
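
Another option, which avoids regular-expression edge cases, is to drop stop words after the string has been split, inside the loop that fills the TreeSet. The sketch below is a standalone approximation of the steps quoted in the question, not OpenRefine's actual class: the class name StopWordFingerprint, the STOP_WORDS set and the simplified PUNCT pattern are assumptions made for illustration, and the normalize step is left out. Because "&" counts as punctuation, the punctuation-stripping step already removes it, so only real words need to go on the stop list.

```java
import java.util.Locale;
import java.util.Set;
import java.util.TreeSet;
import java.util.regex.Pattern;

public class StopWordFingerprint {
    // Hypothetical stop-word list; adjust it to your data.
    private static final Set<String> STOP_WORDS = Set.of("and", "the");

    // Simplified stand-in (assumption) for the punctctrl pattern used in the question.
    private static final Pattern PUNCT = Pattern.compile("\\p{Punct}");

    public static String key(String s) {
        s = s.trim().toLowerCase(Locale.ROOT);   // trim and lowercase with a fixed locale
        s = PUNCT.matcher(s).replaceAll("");     // strips punctuation, including "&"
        TreeSet<String> set = new TreeSet<>();   // sorts tokens and removes duplicates
        for (String token : s.split("\\s+")) {
            // Compare whole tokens, so "and" inside "sandwich" is never touched.
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                set.add(token);
            }
        }
        return String.join(" ", set);
    }

    public static void main(String[] args) {
        System.out.println(key("Smith & Sons and Co.")); // prints "co smith sons"
        System.out.println(key("Sandwich Brand"));       // prints "brand sandwich"
    }
}
```

Filtering whole tokens means the stop list is matched exactly, which is usually what you want for clustering keys; in the real FingerprintKeyer the same check would go in the existing for loop, just before set.add(ss).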

huangapple
  • Posted on October 5, 2020 at 20:13:58
  • When reposting, please keep the link to this article: https://go.coder-hub.com/64208413.html