Hash table performance: why is C++ the slowest?


Question

I ran a simple performance test on hash tables, and the C++ version turned out to be slower than both the Perl version and the Go version.

  • The Perl version took about 200 milliseconds,
  • the C++ version took about 280 milliseconds,
  • the Go version took about 56 milliseconds.

On my PC with a Core(TM) i7-2670QM CPU @ 2.20GHz, Ubuntu 14.04.3 LTS.

Any ideas?

Perl version:

use Time::HiRes qw( usleep ualarm gettimeofday tv_interval nanosleep
                          clock_gettime clock_getres clock_nanosleep clock
                          stat );
sub getTS {
    my ($seconds, $microseconds) = gettimeofday;
    return $seconds + (0.0+ $microseconds)/1000000.0;
}
my %mymap;
$mymap{"U.S."} = "Washington";
$mymap{"U.K."} = "London";
$mymap{"France"} = "Paris";
$mymap{"Russia"} = "Moscow";
$mymap{"China"} = "Beijing";
$mymap{"Germany"} = "Berlin";
$mymap{"Japan"} = "Tokyo";
$mymap{"China"} = "Beijing";
$mymap{"Italy"} = "Rome";
$mymap{"Spain"} = "Madrad";
$x = "";
$start = getTS();
for ($i=0; $i<1000000; $i++) {
    $x = $mymap{"China"};
}
printf "took %f sec\n", getTS() - $start;

C++ version:

#include <iostream>
#include <cstdio>
#include <string>
#include <unordered_map>
#include <sys/time.h>

double getTS() {
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec/1000000.0;
}

using namespace std;

int main () {
    std::unordered_map<std::string,std::string> mymap;

    // populating container:
    mymap["U.S."] = "Washington";
    mymap["U.K."] = "London";
    mymap["France"] = "Paris";
    mymap["Russia"] = "Moscow";
    mymap["China"] = "Beijing";
    mymap["Germany"] = "Berlin";
    mymap["Japan"] = "Tokyo";
    mymap["China"] = "Beijing";
    mymap["Italy"] = "Rome";
    mymap["Spain"] = "Madrad";	

    double start = getTS();
    string x;
    for (int i=0; i<1000000; i++) {
        mymap["China"];
    }
    printf("took %f sec\n", getTS() - start);
    return 0;
}

Go version:

package main

import "fmt"
import "time"

func main() {
    var x string
    mymap := make(map[string]string)
    mymap["U.S."] = "Washington"
    mymap["U.K."] = "London"
    mymap["France"] = "Paris"
    mymap["Russia"] = "Moscow"
    mymap["China"] = "Beijing"
    mymap["Germany"] = "Berlin"
    mymap["Japan"] = "Tokyo"
    mymap["China"] = "Beijing"
    mymap["Italy"] = "Rome"
    mymap["Spain"] = "Madrad"
    t0 := time.Now()
    sum := 1
    for sum < 1000000 {
        x = mymap["China"]
        sum += 1
    }
    t1 := time.Now()
    fmt.Printf("The call took %v to run.\n", t1.Sub(t0))
    fmt.Println(x)
}

Update 1

To improve the C++ version, I changed x = mymap["China"]; to mymap["China"];, but it made very little difference in performance.

Update 2

The original result was obtained when compiling without any optimization: g++ -std=c++11 unorderedMap.cc. With -O2 optimization, it takes only about half the time (150 ms).

Update 3

To avoid a possible char* to string constructor call, I created a string constant. The time comes down to about 220 ms (compiled without optimization). Thanks to the suggestion from @neil-kirk, with optimization (-O2 flag), the time is about 80 ms.

double start = getTS();
string x = "China";
for (int i=0; i<1000000; i++) {
    mymap[x];
}

Update 4

Thanks to @steffen-ullrich, who pointed out a syntax error in the Perl version. I changed it. The performance number is now about 150 ms.

Update 5

It appears the number of executed instructions matters. Measured with valgrind --tool=cachegrind <cmd>:

For the Go version:

$ valgrind --tool=cachegrind ./te1
==2103== Cachegrind, a cache and branch-prediction profiler
==2103== Copyright (C) 2002-2013, and GNU GPL'd, by Nicholas Nethercote et al.
==2103== Using Valgrind-3.10.0.SVN and LibVEX; rerun with -h for copyright info
==2103== Command: ./te1
==2103== 
--2103-- warning: L3 cache found, using its data for the LL simulation.
The call took 1.647099s to run.
Beijing
==2103== 
==2103== I   refs:      255,763,381
==2103== I1  misses:          3,709
==2103== LLi misses:          2,743
==2103== I1  miss rate:        0.00%
==2103== LLi miss rate:        0.00%
==2103== 
==2103== D   refs:      109,437,132  (77,838,331 rd   + 31,598,801 wr)
==2103== D1  misses:        352,474  (   254,714 rd   +     97,760 wr)
==2103== LLd misses:        149,260  (    96,250 rd   +     53,010 wr)
==2103== D1  miss rate:         0.3% (       0.3%     +        0.3%  )
==2103== LLd miss rate:         0.1% (       0.1%     +        0.1%  )
==2103== 
==2103== LL refs:           356,183  (   258,423 rd   +     97,760 wr)
==2103== LL misses:         152,003  (    98,993 rd   +     53,010 wr)
==2103== LL miss rate:          0.0% (       0.0%     +        0.1%  )

For the C++ version (no optimization flag):

$ valgrind --tool=cachegrind ./a.out
==2180== Cachegrind, a cache and branch-prediction profiler
==2180== Copyright (C) 2002-2013, and GNU GPL'd, by Nicholas Nethercote et al.
==2180== Using Valgrind-3.10.0.SVN and LibVEX; rerun with -h for copyright info
==2180== Command: ./a.out
==2180== 
--2180-- warning: L3 cache found, using its data for the LL simulation.
took 64.657681 sec
==2180== 
==2180== I   refs:      5,281,474,482
==2180== I1  misses:            1,710
==2180== LLi misses:            1,651
==2180== I1  miss rate:          0.00%
==2180== LLi miss rate:          0.00%
==2180== 
==2180== D   refs:      3,170,495,683  (1,840,363,429 rd   + 1,330,132,254 wr)
==2180== D1  misses:           12,055  (       10,374 rd   +         1,681 wr)
==2180== LLd misses:            7,383  (        6,132 rd   +         1,251 wr)
==2180== D1  miss rate:           0.0% (          0.0%     +           0.0%  )
==2180== LLd miss rate:           0.0% (          0.0%     +           0.0%  )
==2180== 
==2180== LL refs:              13,765  (       12,084 rd   +         1,681 wr)
==2180== LL misses:             9,034  (        7,783 rd   +         1,251 wr)
==2180== LL miss rate:            0.0% (          0.0%     +           0.0%  )

For the optimized C++ version:

$ valgrind --tool=cachegrind ./a.out
==2157== Cachegrind, a cache and branch-prediction profiler
==2157== Copyright (C) 2002-2013, and GNU GPL'd, by Nicholas Nethercote et al.
==2157== Using Valgrind-3.10.0.SVN and LibVEX; rerun with -h for copyright info
==2157== Command: ./a.out
==2157== 
--2157-- warning: L3 cache found, using its data for the LL simulation.
took 9.419447 sec
==2157== 
==2157== I   refs:      1,451,459,660
==2157== I1  misses:            1,599
==2157== LLi misses:            1,549
==2157== I1  miss rate:          0.00%
==2157== LLi miss rate:          0.00%
==2157== 
==2157== D   refs:        430,486,197  (340,358,108 rd   + 90,128,089 wr)
==2157== D1  misses:           12,008  (     10,337 rd   +      1,671 wr)
==2157== LLd misses:            7,372  (      6,120 rd   +      1,252 wr)
==2157== D1  miss rate:           0.0% (        0.0%     +        0.0%  )
==2157== LLd miss rate:           0.0% (        0.0%     +        0.0%  )
==2157== 
==2157== LL refs:              13,607  (     11,936 rd   +      1,671 wr)
==2157== LL misses:             8,921  (      7,669 rd   +      1,252 wr)
==2157== LL miss rate:            0.0% (        0.0%     +        0.0%  )

Answer 1

Score: 16

From your Perl code (before your attempts to fix it):

> @mymap = ();
> $mymap["U.S."] = "Washington";
> $mymap["U.K."] = "London";

This is not a map but an array. The syntax for hash maps is:

  %mymap;
  $mymap{"U.S."} = ....

Thus what you effectively did was create an array, not a hash map, and access element 0 all the time. Please always use use strict; and use warnings; with Perl; even a simple syntax check with warnings would have shown you the problem:

perl -cw x.pl
Argument "U.S." isn't numeric in array element at x.pl line 9.
Argument "U.K." isn't numeric in array element at x.pl line 10.

Apart from that, the main part of the benchmark effectively does nothing useful (it assigns a variable and never uses it), and some compilers can detect that and simply optimize it away.

If you check the code generated by your Perl program, you will see:

$ perl -MO=Deparse x.pl
@mymap = ();
$mymap[0] = 'Washington';
$mymap[0] = 'London';
...
for ($i = 0; $i < 1000000; ++$i) {
    $x = $mymap[0];
}

That is, it detects the problem at compile time and replaces each lookup with an access to array index 0.

Thus whenever you benchmark, you need to:

  • Check that your program actually does what it is supposed to do.
  • Check that the compiler does not optimize things away, and does not execute at compile time what other languages do at run time. Any statement with no result, or a predictable one, is prone to such optimization.
  • Verify that you actually measure what you intend to measure, and not something else. Even small changes to a program can affect the run time, because a memory allocation may be needed where none was before, and such changes may be unrelated to what you intend to measure. In your benchmark you measure access to the same hash element again and again, without any access to other elements in between. This is an activity that plays very nicely with processor caches, but it probably does not reflect real-world use.

Also, using a simple timer is not a realistic benchmark. There are other processes on the system, there is the scheduler, there is cache trashing... and on today's CPUs the result depends heavily on system load, because the CPU may run one benchmark in a lower power mode than another, i.e. with a different CPU clock. For example, multiple runs of the same "benchmark" vary between 100 ms and 150 ms of measured time on my system.

Benchmarks are lies, and micro-benchmarks like yours doubly so.


Answer 2

Score: 4

I modified your example a little to get some details about the structure of the hash table:

#include <cstdio>
#include <string>
#include <unordered_map>
#include <chrono>

using namespace std;
int main()
{
    std::unordered_map<std::string, std::string> mymap;

    // populating container:
    mymap["U.S."] = "Washington";
    mymap["U.K."] = "London";
    mymap["France"] = "Paris";
    mymap["Russia"] = "Moscow";
    mymap["China"] = "Beijing";
    mymap["Germany"] = "Berlin";
    mymap["Japan"] = "Tokyo";
    mymap["China"] = "Beijing";
    mymap["Italy"] = "Rome";
    mymap["Spain"] = "Madrad";

    std::hash<std::string> h;
    for (auto const& i : mymap)
    {
        printf("hash(%s) = %zu\n", i.first.c_str(), h(i.first));
    }

    for (size_t i = 0; i != mymap.bucket_count(); ++i)
    {
        auto const bucketsize = mymap.bucket_size(i);
        if (0 != bucketsize)
        {
            printf("size of bucket %zu = %zu\n", i, bucketsize);
        }
    }

    auto const start = std::chrono::steady_clock::now();

    string const x = "China";
    std::string res;

    for (int i = 0; i < 1000000; i++)
    {
        mymap.find(x);
    }

    auto const elapsed = std::chrono::steady_clock::now() - start;
    printf("%s\n", res.c_str());
    printf("took %lld ms\n",
           (long long)std::chrono::duration_cast<std::chrono::milliseconds>(elapsed).count());
    return 0;
}

Running this on my system, I get a runtime of ~68 ms with the following output:

hash(Japan) = 3611029618
hash(Spain) = 749986602
hash(China) = 3154384700
hash(U.S.) = 2546447179
hash(Italy) = 2246786301
hash(Germany) = 2319993784
hash(U.K.) = 2699630607
hash(France) = 3266727934
hash(Russia) = 3992029278
size of bucket 0 = 0
size of bucket 1 = 0
size of bucket 2 = 1
size of bucket 3 = 1
size of bucket 4 = 1
size of bucket 5 = 0
size of bucket 6 = 1
size of bucket 7 = 0
size of bucket 8 = 0
size of bucket 9 = 2
size of bucket 10 = 3

It turns out that the hash table is not well optimized and contains some collisions. Further printing of the elements in each bucket shows that Spain and China are in bucket 9. That bucket is probably a linked list, with nodes spread across memory, which explains the performance drop.

If you choose another hash table size such that there are no collisions, you can get a speedup. I tried adding mymap.rehash(1001) and got a speedup of 20-30%, to somewhere between 44 and 52 ms.

Now, another point is the computation of the hash value for "China". The hash function is executed in every iteration. We can make this go away by switching to constant plain C strings:

#include <cstdio>
#include <string>
#include <unordered_map>
#include <chrono>

static auto constexpr us = "U.S.";
static auto constexpr uk = "U.K.";
static auto constexpr fr = "France";
static auto constexpr ru = "Russia";
static auto constexpr cn = "China";
static auto constexpr ge = "Germany";
static auto constexpr jp = "Japan";
static auto constexpr it = "Italy";
static auto constexpr sp = "Spain";

using namespace std;
int main()
{
    std::unordered_map<const char*, std::string> mymap;

    // populating container:
    mymap[us] = "Washington";
    mymap[uk] = "London";
    mymap[fr] = "Paris";
    mymap[ru] = "Moscow";
    mymap[cn] = "Beijing";
    mymap[ge] = "Berlin";
    mymap[jp] = "Tokyo";
    mymap[it] = "Rome";
    mymap[sp] = "Madrad";

    char const* res = nullptr;
    auto const start = std::chrono::steady_clock::now();
    for (int i = 0; i < 1000000; i++)
    {
        res = mymap[cn].c_str();
    }

    auto const elapsed = std::chrono::steady_clock::now() - start;
    printf("%s\n", res);
    printf("took %lld ms\n",
           (long long)std::chrono::duration_cast<std::chrono::milliseconds>(elapsed).count());
    return 0;
}

On my machine this reduces the runtime by 50%, to ~20 ms. The difference is that instead of computing the hash value from the string contents, it now just transforms the address into a hash value, which is much faster because it simply returns the address value as a size_t. We also do not need to rehash, because there are no collisions for the bucket containing cn.


Answer 3

Score: 3

This just shows that for this particular use case, Go's hash map implementation is very well optimized.

mymap["China"] calls mapaccess1_faststr, which is specifically optimized for string keys. In particular, for small one-bucket maps, the hash code is not even calculated for short strings (less than 32 bytes).


Answer 4

Score: 2

This is a guess:

unordered_map<string, string>::operator[] requires a string argument. You are providing a char*. Without optimizations on, the C++ version is likely calling the std::string(char*) constructor a million times in order to turn "China" into a std::string. Go's language spec likely makes "string literals" the same type as string, so no constructors need to be called.

With optimizations on, the string constructor will be hoisted out of the loop and you won't see the same problem. Or quite possibly no code will be generated at all, except the two system calls to get the time and the system call to print the difference, because ultimately none of it has any effect.

To confirm that, you'd have to look at the assembly actually being generated. That is a compiler option; see this question for the flags required for GCC.


huangapple
  • Published on 2015-11-27 13:03:52
  • Please keep this link when reposting: https://go.coder-hub.com/33950565.html