Sorting a list of strings by ignoring (not replacing) non-alphanumeric characters, or by looking at the first alphanumeric character







I need to sort a list of Strings by a fairly specific criterion, though I don't believe it's so specific that it needs its own comparator.

Collections.sort gets me about 95% of the way there, since most of what I need is its natural ordering. However, for strings like:

"-&4" and "%B", it will put "%B" before "-&4".

What I'd like is for the sort to use the first alphanumeric character, so it would be comparing:

"4" and "B", putting:

"-&4" first, then "%B".

Doing a replaceAll on special characters can't really work, because I have to retain the integrity of the string. I went down a rabbit hole of replacing everything, sorting to generate a sort position, and then trying to re-sort the non-replaced list, to no avail (it also seems like overkill).

I've spent the past 4 hours googling this and I'm surprised it's such a novel situation. Most solutions rely on a replaceAll on non-alphanumeric characters, but I need to retain the integrity of the original string.

Apologies if the wording is confusing.


Score: 1



> it's not so specific that I believe it needs its own comparator

If you don't supply a Comparator, the strings are sorted by their natural order. Since that's not what you want, you definitely need to supply a comparator, and since there is no built-in comparator doing exactly what you want, you do need to supply a custom comparator.

The code below creates a custom comparator using a helper method (Comparator.comparing) together with a lambda expression or a method reference. Just because you don't write your own class implementing Comparator doesn't mean you're not creating your own comparator.

To sort by only alphanumeric characters, ignoring spaces and special characters, you can do it like this:

    List<String> list = ...
    Pattern p = Pattern.compile("[^\\p{L}\\p{N}]+");
    list.sort(Comparator.comparing(s -> p.matcher(s).replaceAll("")));
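For reference, here is a minimal self-contained sketch of the same idea with sample data. The class name `IgnoreSymbolsSort` and the helpers `sortKey` and `sortIgnoringSymbols` are made up for illustration; only the regex and the `Comparator.comparing` call come from the snippet above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical demo class; the names here are illustrative only.
class IgnoreSymbolsSort {
    // Matches runs of characters that are neither Unicode letters nor numbers.
    private static final Pattern NON_ALNUM = Pattern.compile("[^\\p{L}\\p{N}]+");

    // Sort key: the string with every non-alphanumeric character stripped.
    static String sortKey(String s) {
        return NON_ALNUM.matcher(s).replaceAll("");
    }

    // Returns a sorted copy; the strings themselves are never modified.
    static List<String> sortIgnoringSymbols(List<String> input) {
        List<String> copy = new ArrayList<>(input);
        copy.sort(Comparator.comparing(IgnoreSymbolsSort::sortKey));
        return copy;
    }

    public static void main(String[] args) {
        List<String> list = Arrays.asList("%B", "-&4", "a#1", "A");
        // Keys are "B", "4", "a1", "A", compared in natural String order.
        System.out.println(sortIgnoringSymbols(list)); // prints [-&4, A, %B, a#1]
    }
}
```

Note that the keys are still compared in natural (case-sensitive) String order, so uppercase letters sort before lowercase ones.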

The comparator above re-runs the regex on both strings for every comparison, so if the list is large, you'd likely want to improve performance by caching the normalized string that the sort is using.

    List<String> list = ...
    Pattern p = Pattern.compile("[^\\p{L}\\p{N}]+");
    // Compute each normalized key once; the merge function keeps the first value for duplicate strings
    Map<String, String> normalized = list.stream()
            .collect(Collectors.toMap(s -> s, s -> p.matcher(s).replaceAll(""), (a, b) -> a));
    list.sort(Comparator.comparing(normalized::get));
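To see what the cache saves, the sketch below counts how often the regex runs in each variant. The class `CachedKeySort`, its `calls` counter, and the two sort helpers are hypothetical names added for this illustration; the exact uncached count depends on the sort implementation, so it is printed but not stated here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical demo class; the names here are illustrative only.
class CachedKeySort {
    private static final Pattern NON_ALNUM = Pattern.compile("[^\\p{L}\\p{N}]+");

    // Counts key computations so the two approaches can be compared.
    static final AtomicInteger calls = new AtomicInteger();

    static String sortKey(String s) {
        calls.incrementAndGet();
        return NON_ALNUM.matcher(s).replaceAll("");
    }

    // Uncached: the key is recomputed for both operands of every comparison.
    static List<String> sortUncached(List<String> input) {
        List<String> copy = new ArrayList<>(input);
        copy.sort(Comparator.comparing(CachedKeySort::sortKey));
        return copy;
    }

    // Cached: each element's key is computed once up front, then looked up.
    static List<String> sortCached(List<String> input) {
        Map<String, String> normalized = input.stream()
                .collect(Collectors.toMap(s -> s, CachedKeySort::sortKey, (a, b) -> a));
        List<String> copy = new ArrayList<>(input);
        copy.sort(Comparator.comparing(normalized::get));
        return copy;
    }

    public static void main(String[] args) {
        List<String> list = Arrays.asList("%B", "-&4", "a#1", "A");

        calls.set(0);
        List<String> uncached = sortUncached(list);
        System.out.println(uncached + " uncached key computations: " + calls.get());

        calls.set(0);
        List<String> cached = sortCached(list);
        // With the cache, the key is computed exactly once per list element.
        System.out.println(cached + " cached key computations: " + calls.get());
    }
}
```

Both variants produce the same order; the cached one trades a map of size n for fewer regex evaluations.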

Regex explained

  • \p{L} matches all characters in Unicode category "Letter".
  • \p{N} matches all characters in Unicode category "Number".
  • [^\p{L}\p{N}] matches all characters that are not "Letter" or "Number".
  • "[^\\p{L}\\p{N}]+" is the Java string literal for that regex (backslashes doubled), with the trailing + making it match one or more of those characters.
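The question also mentions the narrower variant of comparing on the first alphanumeric character only. A minimal sketch of that, assuming the same "Letter" and "Number" categories, could look like the following. The class `FirstAlnumSort` and its helpers are hypothetical, and the tie-break on the full string is an added assumption so that the result is deterministic.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the names here are illustrative only.
class FirstAlnumSort {
    // Matches a single Unicode letter or number.
    private static final Pattern ALNUM = Pattern.compile("[\\p{L}\\p{N}]");

    // Returns the first letter or number in s, or "" if s contains none.
    static String firstAlnum(String s) {
        Matcher m = ALNUM.matcher(s);
        return m.find() ? m.group() : "";
    }

    // Returns a sorted copy; strings without any alphanumeric character sort first.
    static List<String> sortByFirstAlnum(List<String> input) {
        List<String> copy = new ArrayList<>(input);
        // Assumption: ties on the first character fall back to natural String order.
        copy.sort(Comparator.comparing(FirstAlnumSort::firstAlnum)
                .thenComparing(Comparator.<String>naturalOrder()));
        return copy;
    }

    public static void main(String[] args) {
        List<String> list = Arrays.asList("%B", "-&4", "a#1", "#!");
        // Keys are "B", "4", "a", and "" respectively.
        System.out.println(sortByFirstAlnum(list)); // prints [#!, -&4, %B, a#1]
    }
}
```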

  • Posted on July 25, 2020, 00:07:10
