提高在Java中按字符对大型字符串列表进行排序的速度

huangapple go评论78阅读模式
英文:

Increase Speed of Sorting Large List of Strings by Character in Java

问题

以下是您提供的代码翻译结果:

我已经编写了一段代码用于遍历字符串列表并返回整个字符串列表中找到的最不频繁的字符的列表

要以包含最不频繁字符的单词优先的方式返回排序后的单词列表最快的方法是什么?(我正在处理一个庞大的字符串列表所以我编写的代码运行得不够快下面的示例只是为了说明

例如如果给定的列表是:```["hello", "my", "name", "is", "inigo", "montoya", "you", "killed", "my", "father", "prepare", "to", "die"]```。我的代码返回:```[s, g, u, k, f, h, d, p, n, t, r, l, m, y, a, i, o, e]```,其中```s```是在字符串列表中找到的最不频繁的字母,```e```是在列表中找到的最常见的字母然后结果列表将首先返回包含最不频繁字母的单词依此类推例如:```["is", "inigo", "you", "killed", "father", "hello", "die", "prepare", "name", "montoya", "to", "my"]```

以下是找到最不频繁字母的代码
```java
public static void method(List<String> words)
{
    Map<Character, Integer> elemCount = new LinkedHashMap<>();
    for (String word : words)
    {
        for (int i = 0; i < word.length(); i++)
        {
            if (elemCount.containsKey(word.charAt(i)))
            {
                elemCount.put(word.charAt(i), elemCount.get(word.charAt(i)) + 1);
            }
            else
            {
                elemCount.put(word.charAt(i), 1);
            }
        }
    }
    ArrayList<Character> sortedElems = new ArrayList<>();
    LinkedList<String> sorted = new LinkedList<>();
    elemCount.entrySet().stream().sorted(
    Map.Entry.comparingByValue()).forEach(entry -> 
    { 
        for (int i = 1; i <= entry.getValue(); i++)
        {
            if (sortedElems.contains(entry.getKey()) == false)
            {
                sortedElems.add(entry.getKey());
            }
        }
    }
    );

以下是按最不频繁字符排序的代码部分:

    for (int i = 0; i < sortedElems.size(); i++)
    {
        for (String word : words)
        {
            char x = sortedElems.get(i);
            CharSequence c = x + "";
            if (word.contains(c) == true && sorted.contains(word) == false)
            {
                sorted.add(word);
            }
        }
    }
    System.out.println(sorted);
}

请注意,由于您要求只返回翻译的部分,我已经省略了代码的解释性文本。如果您对翻译的任何部分有疑问,请随时提问。

英文:

I have written a code to runs through a list of strings and returns a list of the least frequent characters found throughout the whole list of strings.

What would be the fastest way to return the list of the words sorted by those that contain the least frequent characters first? (I'm working with a huge list of strings, so the code I've written isn't running fast enough. The example below is just for example)

For example, if the given list is: [&quot;hello&quot;, &quot;my&quot;, &quot;name&quot;, &quot;is&quot;, &quot;inigo&quot;, &quot;montoya&quot;, &quot;you&quot;, &quot;killed&quot;, &quot;my&quot;, &quot;father&quot;, &quot;prepare&quot;, &quot;to&quot;, &quot;die&quot;]. My code, returns : [s, g, u, k, f, h, d, p, n, t, r, l, m, y, a, i, o, e], where s is the least frequent letter found in the list of strings, and e is the most frequent letter found in the list. The resulting list would then return the words containing the least frequent letter first and so on. For example: [&quot;is&quot;, &quot;inigo&quot;, &quot;you&quot;, &quot;killed&quot;, &quot;father&quot;, &quot;hello&quot;, &quot;die&quot;, &quot;prepare&quot;, &quot;name&quot;, &quot;montoya&quot;, &quot;to&quot;, &quot;my&quot;]

Here is my code that finds the least frequent letters:

    public static void method(List&lt;String&gt; words)
{
Map&lt;Character, Integer&gt; elemCount = new LinkedHashMap&lt;&gt;();
for (String word : words)
{
for (int i = 0; i  &lt; word.length(); i++)
{
if (elemCount.containsKey(word.charAt(i)))
{
elemCount.put(word.charAt(i), elemCount.get(word.charAt(i)) + 1);
}
else
{
elemCount.put(word.charAt(i), 1);
}
}
}
ArrayList&lt;Character&gt; sortedElems = new ArrayList&lt;&gt;();
LinkedList&lt;String&gt; sorted = new LinkedList&lt;&gt;();
elemCount.entrySet().stream().sorted(
Map.Entry.comparingByValue()).forEach(entry -&gt; 
{ 
for (int i = 1; i &lt;= entry.getValue(); i++)
{
if (sortedElems.contains(entry.getKey()) == false)
{
sortedElems.add(entry.getKey());
}
}
}
);

And here is the code where I'm trying to sort by least frequent characters found in the string:

        for (int i = 0; i &lt; sortedElems.size(); i++)
{
for (String word : words)
{
char x = sortedElems.get(i);
CharSequence c = x + &quot;&quot;;
if (word.contains(c) == true &amp;&amp; sorted.contains(word) == false)
{
sorted.add(word);
}
}
}
System.out.println(sorted);

答案1

得分: 1

这似乎运行得相当不错。我放弃了流方法,因为它太慢了。

		List<String> words = new ArrayList<>(List.of("hello", "my",
				"name", "is", "inigo", "montoya", "you", "killed",
				"my", "father", "prepare", "to", "die"));

        // 频率计数。
		int[] chars = new int[256];
		for (String w : words) {
			for (char c : w.toCharArray()) {
				chars[c]++;
			}
		}
		// 找到最少使用的字符
		Map<String, Integer> mins = new HashMap<>();
		for (String w : words) {
			if (!mins.containsKey(w)) {
				int v = Integer.MAX_VALUE;
				for (char c : w.toCharArray()) {
					if (chars[c] < v) {
						v = chars[c];
					}
				}
				mins.put(w, v);
			}
		}
		
		Comparator<String> comp1 =
				(String a, String b) -> a.compareTo(b);
		Comparator<String> comp =
				(a, b) -> mins.get(a).compareTo(mins.get(b));
		comp = comp.thenComparing(comp1);
		
        // 首先按字符排序,然后按词法排序
		words.sort(comp);
	    words.forEach(System.out::println);

这是频率计数结果:

1=[f, g, k, s, u]
2=[d, h, p]
3=[n, r, t]
4=[a, l, m, y]
5=[i]
6=[o]
7=[e]

以及按排序顺序的单词:

father
inigo
is
killed
you
die
hello
prepare
montoya
name
to
my
my
英文:

This seems to work pretty well. I abandoned the streams approach as it was too slow.

		List&lt;String&gt; words = new ArrayList&lt;&gt;(List.of(&quot;hello&quot;, &quot;my&quot;,
&quot;name&quot;, &quot;is&quot;, &quot;inigo&quot;, &quot;montoya&quot;, &quot;you&quot;, &quot;killed&quot;,
&quot;my&quot;, &quot;father&quot;, &quot;prepare&quot;, &quot;to&quot;, &quot;die&quot;));
// frequency count.
int[] chars = new int[256];
for (String w : words) {
for (char c : w.toCharArray()) {
chars[c]++;
}
}
// find minimum used character
Map&lt;String, Integer&gt; mins = new HashMap&lt;&gt;();
for (String w : words) {
if (!mins.containsKey(w)) {
int v = Integer.MAX_VALUE;
for (char c : w.toCharArray()) {
if (chars[c] &lt; v) {
v = chars[c];
}
}
mins.put(w, v);
}
}
Comparator&lt;String&gt; comp1 =
(String a, String b) -&gt; a.compareTo(b);
Comparator&lt;String&gt; comp =
(a, b) -&gt; mins.get(a).compareTo(mins.get(b));
comp = comp.thenComparing(comp1);
// sort first on character then lexically
words.sort(comp);
words.forEach(System.out::println);

Here is the frequency count

1=[f, g, k, s, u]
2=[d, h, p]
3=[n, r, t]
4=[a, l, m, y]
5=[i]
6=[o]
7=[e]

And the words in sorted order

father
inigo
is
killed
you
die
hello
prepare
montoya
name
to
my
my
</details>
# 答案2
**得分**: 0
以下是翻译好的部分:
这些是我考虑到的,无法放在评论中的部分。
根据您的描述,算法的一般流程是扫描数组以生成直方图,然后使用直方图信息对数组进行排序的两步过程。
第一步是不可避免的,除非数据直接从这些词的源头以流的形式到达。在这种情况下,您可以在新单词到达时简单地更新直方图,几乎没有任何额外开销。
无论如何,这是比较简单的部分。
对于比较困难的部分,我们可以扫描数组,并将每个单词(或指向每个单词的指针)插入到B树中,其分割机制使用实时计算的分数,该分数使用一个简单的函数根据当前单词和直方图的参数计算得出。最后,扫描B树,提取带有所有单词的最终数组,排列整齐。
在非流式场景中,计算复杂度将接近于 O(n+nlog(n)+n),但是如果对原始数组使用流式处理,并且您满意于将B树作为最终产品,而无需构建显式的最终数组,则计算复杂度最多为 O(nlog(n))。
<details>
<summary>英文:</summary>
These are my considerations that could not fit in a comment.
From your description, the general flow of the algorithm is the two-step process of scanning the array to generate the histogram and somehow sorting the array using histogram info.
The first step is unavoidable and cannot be further reduced unless the data arrives as stream directly from the source of these words. In that case you can simply update the histogram with barely any overhead as new words arrive.
Regardless, this was the easy part.
For the hard part, we can scan the array and insert each word (or a pointer to each word) into a b-tree whose splitting mechanism uses a Score computed on-the-fly using a simple function that takes as parameters the current word and the histogram.&lt;br&gt;
Finally, scan the b-tree to extract the final array with all the words neatly sorted.
In a non-streaming scenario, the computational complexity would be close to &lt;code&gt;O(n+nlog(n)+n)&lt;/code&gt;, however you can arrive at no more than &lt;code&gt;O(nlog(n))&lt;/code&gt; if streaming is used for the original array AND you are satisfied with having a b-tree as your final product without needing to build an explicit final array.
</details>

huangapple
  • 本文由 发表于 2020年4月6日 00:11:41
  • 转载请务必保留本文链接:https://go.coder-hub.com/61045547.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定