Sorting a list of strings by ignoring (not replacing) non-alphanumeric characters, or by looking at the first alphanumeric character


Question


Basically, I need to sort a list of Strings based on very specific criteria; however, they're not so specific that I believe they need their own comparator.

Collections.sort gets me about 95% of the way there, since most of what I need is natural ordering; however, for strings like:

"-&4" and "%B", it will put "%B" before "-&4".

What I'd like is for the list to be sorted on the first alphanumeric character, so it would be comparing:

"4" and "B", putting:

"-&4" first, then "%B".

Doing a replaceAll on special characters can't really work because I have to retain the integrity of the strings. I went down a rabbit hole of replacing all special characters, sorting to generate a sort position, and then trying to re-sort the non-replaced list, to no avail (it also seems like overkill).

I've spent the past 4 hours googling this and am surprised it's such a novel situation. Most solutions rely on a replaceAll on non-alphanumeric characters, but I need to retain the integrity of the original strings.

Apologies if the wording is confusing.

Answer 1

Score: 1


> it's not so specific that I believe it needs its own comparator

If you don't supply a Comparator, the strings are sorted by their natural order. Since that's not what you want, you definitely need to supply a comparator, and since there is no built-in comparator doing exactly what you want, you do need to supply a custom comparator.

The code below creates a custom comparator using a helper method and either a lambda expression or a method reference. Just because you don't write your own class implementing Comparator doesn't mean you're not creating your own comparator.


To sort by only alphanumeric characters, ignoring spaces and special characters, you can do it like this:

List<String> list = ...

Pattern p = Pattern.compile("[^\\p{L}\\p{N}]+");
list.sort(Comparator.comparing(s -> p.matcher(s).replaceAll("")));
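
For completeness, here is a self-contained sketch of the snippet above with its imports and the sample strings from the question. The class name `AlnumSortDemo` and the helpers `sortKey` and `sortedIgnoringSymbols` are names made up for this illustration; the comparator itself is the same idea as above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.regex.Pattern;

class AlnumSortDemo {
    // Matches runs of characters that are neither Unicode letters nor numbers.
    private static final Pattern NON_ALNUM = Pattern.compile("[^\\p{L}\\p{N}]+");

    // Hypothetical helper: the sort key is the string with those runs stripped.
    static String sortKey(String s) {
        return NON_ALNUM.matcher(s).replaceAll("");
    }

    // Returns a sorted copy; the strings themselves are never modified.
    static List<String> sortedIgnoringSymbols(List<String> input) {
        List<String> copy = new ArrayList<>(input);
        copy.sort(Comparator.comparing(AlnumSortDemo::sortKey));
        return copy;
    }

    public static void main(String[] args) {
        List<String> data = Arrays.asList("%B", "-&4", "#a1", "C");
        // Keys are "B", "4", "a1", "C", compared in natural (case-sensitive) order.
        System.out.println(sortedIgnoringSymbols(data)); // prints [-&4, %B, C, #a1]
    }
}
```

Note that the keys are still compared case-sensitively, which is why "C" lands before "#a1". If that is not wanted, passing `String.CASE_INSENSITIVE_ORDER` as the second argument to `Comparator.comparing` should give a case-insensitive variant.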

If the list is large, you'd likely want to improve performance by caching the normalized string that the sort is using.

List<String> list = ...

Pattern p = Pattern.compile("[^\\p{L}\\p{N}]+");
Map<String, String> normalized = list.stream()
		.collect(Collectors.toMap(s -> s, s -> p.matcher(s).replaceAll(""), (a, b) -> a));
list.sort(Comparator.comparing(normalized::get));
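
A runnable sketch of the cached variant might look like the following. It is only an illustration: `CachedKeySortDemo` and `sortedWithCachedKeys` are hypothetical names, and it uses a plain `HashMap` with `computeIfAbsent` instead of `Collectors.toMap`, which is an equivalent way to compute each key once even when the list contains duplicates.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

class CachedKeySortDemo {
    private static final Pattern NON_ALNUM = Pattern.compile("[^\\p{L}\\p{N}]+");

    static List<String> sortedWithCachedKeys(List<String> input) {
        // Run the regex once per distinct string instead of once per comparison.
        Map<String, String> keys = new HashMap<>();
        for (String s : input) {
            keys.computeIfAbsent(s, k -> NON_ALNUM.matcher(k).replaceAll(""));
        }
        List<String> copy = new ArrayList<>(input);
        // The sort only looks keys up; the original strings stay untouched.
        copy.sort(Comparator.comparing(keys::get));
        return copy;
    }

    public static void main(String[] args) {
        List<String> data = Arrays.asList("b-2", "%B", "-&4", "%B");
        // Keys are "b2", "B", "4", "B".
        System.out.println(sortedWithCachedKeys(data)); // prints [-&4, %B, %B, b-2]
    }
}
```

Whether the cache pays off depends on the list size and string lengths, so it is worth measuring before adopting it.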

Regex explained

  • \p{L} matches all characters in Unicode category "Letter".
  • \p{N} matches all characters in Unicode category "Number".
  • [^\p{L}\p{N}] matches all characters that are not "Letter" or "Number".
  • "[^\\p{L}\\p{N}]+" is the Java string literal for that regex (backslashes doubled), with + added to match one or more of those characters.
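
The question also mentions sorting by just the first alphanumeric character. Using the same character classes without the negation, that could be sketched as below. This is an assumption about what that variant should do, not part of the answer above: `FirstAlnumSortDemo` and `firstAlnum` are hypothetical names, strings with no letter or number get an empty key and sort first, and ties fall back to natural order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FirstAlnumSortDemo {
    // Not negated: matches a single Unicode letter or number.
    private static final Pattern ALNUM = Pattern.compile("[\\p{L}\\p{N}]");

    // Hypothetical helper: first letter or number in s, or "" if there is none.
    static String firstAlnum(String s) {
        Matcher m = ALNUM.matcher(s);
        return m.find() ? m.group() : "";
    }

    static List<String> sortedByFirstAlnum(List<String> input) {
        Comparator<String> byFirst = Comparator.comparing(FirstAlnumSortDemo::firstAlnum);
        List<String> copy = new ArrayList<>(input);
        // Assumed tie-break: equal first characters fall back to natural order.
        copy.sort(byFirst.thenComparing(Comparator.naturalOrder()));
        return copy;
    }

    public static void main(String[] args) {
        List<String> data = Arrays.asList("%B", "-&4", "!!", "#Apple");
        // First characters are "B", "4", "" and "A".
        System.out.println(sortedByFirstAlnum(data)); // prints [!!, -&4, #Apple, %B]
    }
}
```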

huangapple
  • Posted on July 25, 2020 at 00:07:10
  • When reposting, please keep the link to this article: https://go.coder-hub.com/63077353.html