英文:
Arrays.sort() vs sorting using map
问题
我有一个需求,我必须循环遍历一个包含字符串列表的数组:
String[] arr = {"abc", "cda", "cka", "snd"}
并匹配字符串 bca
,忽略字符的顺序,如果它在数组中出现,则返回 true
(比如 abc
中包含了 bca
)。
为了解决这个问题,我有两种方法:
- 使用
Arrays.sort()
对这两个字符串进行排序,然后使用Arrays.equals
进行比较。 - 创建两个哈希映射,分别记录每个字符串中每个字符的频率,然后最终比较这两个字符频率的映射是否相同。
我阅读到使用 Arrays.sort()
方法的复杂度较高。因此,考虑尝试第二种方法。但是当我运行这两段代码时,第一种方法执行程序的时间非常短。
是否有任何建议可以解释这种情况?
英文:
I have a requirement where I have to loop through an array which has list of strings:
String[] arr = {"abc","cda","cka","snd"}
and match the string "bca"
, ignoring the order of the characters, which will return true
as it’s present in the array ("abc"
).
To solve this I have two approaches:
- Use
Arrays.sort()
to sort both the strings and then use Arrays.equals to compare them. - create 2 hashmaps and add frequency of each letter in string and then finally compare two map of char using equals method.
I read that complexity of using Arrays.sort()
method is more. So, thought of working on 2nd approach but when I am running both the code 1st approach is taking very less time to execute program.
Any suggestions why this is happening?
答案1
得分: 3
时间复杂性只告诉你,这种方法将如何随着(显著)更大的输入而扩展。它并不告诉你哪种方法更快。
完全有可能,某个解决方案在小输入大小(字符串长度和/或数组长度)上运行更快,但由于其时间复杂性,在更大的大小上扩展得很糟糕。但甚至有可能你永远不会遇到一个情况,在这种情况下,具有更好时间复杂性的算法变得更快,因为输入大小的自然限制阻止了这种情况。
你没有展示你的方法的代码,但你的第一个方法很可能在字符串上调用像 toCharArray()
这样的方法,然后跟着调用 Arrays.sort(char[])
。这意味着排序操作在原始数据上进行。
相反,当你的第二种方法使用 HashMap<Character,Integer>
来记录频率时,它将受到装箱开销的影响,对字符和计数都需要装箱,而且还会使用一个显著较大的需要处理的数据结构。
因此,哈希方法对于小字符串和数组来说速度较慢是不足为奇的,因为它有显著较大的固定开销,而且还有一个大小相关的(O(n)
)开销。
因此,第一个方法不得不显著地承受 O(n log n)
的时间复杂性才能实现这个结果。但事实并非如此。这种时间复杂性是一般排序的最坏情况。正如这个答案中所解释的,Arrays.sort的文档中指定的算法不应被视为理所当然。当你调用 Arrays.sort(char[])
并且数组大小超过一定阈值时,实现将转向具有O(n)时间复杂性的计数排序(但会使用更多的临时内存)。
因此,即使对于大字符串,你也不会受到更糟糕的时间复杂性的影响。事实上,计数排序与频率映射类似,但通常更高效,因为它避免了装箱开销,而是使用一个 int[]
数组来代替 HashMap<Character,Integer>
。
英文:
The Time Complexity only tells you, how the approach will scale with (significantly) larger input. It doesn’t tell you which approach is faster.
It’s perfectly possible that a solution is faster for small input sizes (string lengths and/or array length) but scales badly for larger sizes, due to its Time Complexity. But it’s even possible that you never encounter the point where an algorithm with a better Time Complexity becomes faster, when natural limits to the input sizes prevent it.
You didn’t show the code of your approaches, but it’s likely that your first approach calls a method like toCharArray()
on the strings, followed by Arrays.sort(char[])
. This implies that sort operates on primitive data.
In contrast, when your second approach uses a HashMap<Character,Integer>
to record frequencies, it will be subject to boxing overhead, for the characters and the counts, and also use a significantly larger data structure that needs to be processed.
So it’s not surprising that the hash approach is slower for small strings and arrays, as it has a significantly larger fixed overhead and also a size dependent (O(n)
) overhead.
So first approach had to suffer from the O(n log n)
time complexity significantly to turn this result. But this won’t happen. That time complexity is a worst case of sorting in general. As explained in this answer, the algorithms specified in the documentation of Arrays.sort
should not be taken for granted. When you call Arrays.sort(char[])
and the array size crosses a certain threshold, the implementation will turn to Counting Sort with an O(n) time complexity (but use more memory temporarily).
So even with large strings, you won’t suffer from a worse time complexity. In fact, the Counting Sort shares similarities with the frequency map, but usually is more efficient, as it avoids the boxing overhead, using an int[]
array instead of a HashMap<Character,Integer>
.
答案2
得分: 1
Sure, here is the translation:
方法 1: 将会是 O(NlogN)
方法 2: 将会是 O(N*M),其中 M 是数组中每个字符串的长度。
你应该在 O(N) 线性搜索中进行:
for (String str : arr) {
if (str.equals(target)) return true;
}
return false;
英文:
Approach 1: will be O(NlogN)
Approach 2: will be O(N*M), where M is the length of each string in your array.
You should search linearly in O(N):
for (String str : arr) {
if (str.equals(target)) return true;
}
return false;
答案3
得分: 1
让我们分解这个问题:
您需要一个按字符对字符串进行排序的函数(bccabc
-> abbccc
),以便能够将给定字符串与现有字符串进行比较。
Function<String, String> sortChars = s -> s.chars()
.sorted()
.mapToObj(i -> (char) i)
.map(String::valueOf)
.collect(Collectors.joining());
与每次比较字符串时对给定字符串的字符进行排序不同,您可以预先计算唯一标记的集合(来自您的数组的值,排序后的字符):
Set<String> tokens = Arrays.stream(arr)
.map(sortChars)
.collect(Collectors.toSet());
这将得到值 "abc", "acd", "ack", "dns"
。
然后,您可以创建一个函数,检查给定的字符串在按字符排序时是否与给定的标记之一匹配:
Predicate<String> match = s -> tokens.contains(sortChars.apply(s));
现在您可以轻松地检查任何给定的字符串,如下所示:
boolean matches = match.test("bca");
匹配只需要对给定输入进行排序,然后进行哈希集查找以检查是否匹配,因此非常高效。
当然,您也可以将Function和Predicate编写为方法(String sortChars(String s)
和 boolean matches(String s)
),如果您不熟悉函数式编程的话。
英文:
Let's decompose the problem:
You need a function to sort a string by its chars (bccabc
-> abbccc
) to be able to compare a given string with the existing ones.
Function<String, String> sortChars = s -> s.chars()
.sorted()
.mapToObj(i -> (char) i)
.map(String::valueOf)
.collect(Collectors.joining());
Instead of sorting the chars of the given strings anytime you compare them, you can precompute the set of unique tokens (values from your array, sorted chars):
Set<String> tokens = Arrays.stream(arr)
.map(sortChars)
.collect(Collectors.toSet());
This will result in the values "abc","acd","ack","dns"
.
Afterwards you can create a function which checks if a given string, when sorted by chars, matches any of the given tokens:
Predicate<String> match = s -> tokens.contains(sortChars.apply(s));
Now you can easily check any given string as follows:
boolean matches = match.test("bca");
Matching will only need to sort the given input and do a hash set lookup to check if it matches, so it's very efficient.
You can of course write the Function and Predicate as methods instead (String sortChars(String s)
and boolean matches(String s)
if you're unfamiliar with functional programming.
答案4
得分: 0
作为其他答案的补充。当然,您的两个选项具有不同的性能特征。但是要理解性能不一定是做出决策的唯一因素!
意思是:如果您正在谈论每分钟运行数百或数千次,在大数据集上运行的搜索:那么当然,您应该投入大量时间来找到一个能够提供最佳性能的解决方案。很可能,这包括在处理真实数据时进行各种实际测量的实验。时间复杂性是一个理论构造,在现实世界中,还存在诸如CPU缓存大小、线程问题、IO瓶颈等因素,这些因素可能会对实际数字产生重大影响。
但是:当您的代码每分钟只执行一次,即使在几十兆字节甚至几百兆字节的数据上... 那么可能没有必要专注于性能。
换句话说:「排序」解决方案听起来很直观。它易于理解,易于实现,使用一些体面的测试用例很难出错。如果该解决方案能够完成工作并达到“足够好”的效果,那么考虑使用它:简单的解决方案。
性能是一个奢侈的问题。只有在有理由的情况下才会考虑性能问题。
英文:
More of an addendum to the other answers. Of course, your two options have different performance characteristics. But: understand that performance is not necessarily the only factor to make a decision!
Meaning: if you are talking about a search that runs hundreds or thousands of time per minute, on large data sets: then for sure, you should invest a lot of time to come up with a solution that gives you best performance. Most likely, that includes doing various experiments with actual measurements when processing real data. Time complexity is a theoretical construct, in the real world, there are also elements such as CPU cache sizes, threading issues, IO bottlenecks, and whatnot that can have significant impact on real numbers.
But: when your code will doing its work just once a minute, even on a few dozen or hundred MB of data ... then it might not be worth to focus on performance.
In other words: the "sort" solution sounds straight forward. It is easy to understand, easy to implement, and hard to get wrong (with some decent test cases). If that solution gets the job done "good enough", then consider to use use that: the simple solution.
Performance is a luxury problem. You only address it if there is a reason to.
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论