Why is a Golang channel much faster than Intel TBB concurrent_queue when tested with 8 producers and 1 consumer?


Question

I ran a test comparing the performance of a Golang channel against the C++ TBB concurrent queue, with 8 writers and 1 reader, each running on its own thread. The results show that the Golang version is much faster than the C++ version, in both latency and overall send/receive speed. Is that real, or is there a mistake in my code?

Golang result, unit is microseconds:

latency max:1505,avg:1073
send begin:1495593677683232,recv end:1495593677901854, time:218622

Here is the Golang code:

package main

import (
    "flag"
    "fmt"
    "runtime"
    "sync"
    "time"
)

var (
    producer     = flag.Int("producer", 8, "producer")
    consumer     = flag.Int("consumer", 1, "consumer")
    start_signal sync.WaitGroup
)

const (
    TEST_NUM = 1000000
)

type Item struct {
    id       int
    sendtime int64
    recvtime int64
}

var g_vec [TEST_NUM]Item

func sender(out chan int, begin int, end int) {
    start_signal.Wait()
    runtime.LockOSThread()
    println("i am in sender", begin, end)
    for i := begin; i < end; i++ {
        item := &g_vec[i]
        item.id = i
        item.sendtime = time.Now().UnixNano() / 1000
        out <- i
    }
    println("sender finish")
}

func reader(out chan int, total int) {
    //runtime.LockOSThread()
    start_signal.Done()
    for i := 0; i < total; i++ {
        tmp := <-out
        item := &g_vec[tmp]
        item.recvtime = time.Now().UnixNano() / 1000
    }
    var lsum int64 = 0
    var lavg int64 = 0
    var lmax int64 = 0
    var lstart int64 = 0
    var lend int64 = 0
    for _, item := range g_vec {
        if lstart > item.sendtime || lstart == 0 {
            lstart = item.sendtime
        }
        if lend < item.recvtime {
            lend = item.recvtime
        }

        ltmp := item.recvtime - item.sendtime
        lsum += ltmp
        if ltmp > lmax {
            lmax = ltmp
        }
    }
    lavg = lsum / TEST_NUM
    fmt.Printf("latency max:%v,avg:%v\n", lmax, lavg)
    fmt.Printf("send begin:%v,recv end:%v, time:%v", lstart, lend, lend-lstart)
}

func main() {
    flag.Parse() // parse the -producer/-consumer flags; without this the defaults are always used
    runtime.GOMAXPROCS(10)
    out := make(chan int, 5000)

    start_signal.Add(1)
    for i := 0; i < *producer; i++ {
        go sender(out, i*TEST_NUM/(*producer), (i+1)*TEST_NUM/(*producer))
    }
    reader(out, TEST_NUM)
}

C++ result, unit is microseconds:

max:558301,min:3,avg:403741
start:1495594232068580,end:1495594233497618,length:1429038

C++ code (main part only):

concurrent_bounded_queue<int> g_queue;

static void sender(int start, int end)
{   
    for (int i=start; i < end; i++)
    {
        using namespace std::chrono;
        auto now = system_clock::now();
        auto now_ms = time_point_cast<microseconds>(now);
        auto value = now_ms.time_since_epoch();
        int64_t duration = value.count();

        Item &item = g_pvec->at(i);
        item.id = i;
        item.sendTime = duration;
        //std::cout << "sending " << i << "\n";
        g_queue.push(i);
    }
}

static void reader(int num)
{
    barrier.set_value();    
    for (int i=0;i<num;i++)
    {
        int v;
        g_queue.pop(v);
        Item &el = g_pvec->at(v);

        using namespace std::chrono;
        auto now = system_clock::now();
        auto now_ms = time_point_cast<microseconds>(now);
        auto value = now_ms.time_since_epoch();
        int64_t duration = value.count();

        el.recvTime = duration;
        //std::cout << "recv " << item.id << ":" << duration << "\n";
    }
    // calculate the result.
    int64_t lmax = 0;
    int64_t lmin = 100000000;
    int64_t lavg = 0;
    int64_t lsum = 0;
    int64_t lbegin = 0;
    int64_t lend = 0;
    for (auto &item : *g_pvec)
    {
        if (item.sendTime<lbegin || lbegin==0)
        {
            lbegin = item.sendTime;
        }
        if (item.recvTime>lend )
        {
            lend = item.recvTime;
        }

        lsum += item.recvTime - item.sendTime;
        lmax = std::max(item.recvTime - item.sendTime, lmax);
        lmin = std::min(item.recvTime - item.sendTime, lmin);
    }
    lavg = lsum / num;
    std::cout << "max:" << lmax << ",min:" << lmin << ",avg:" << lavg << "\n";
    std::cout << "start:" << lbegin << ",end:" << lend << ",length:" << lend-lbegin << "\n";
}

DEFINE_CODE_TEST(plain_queue_test)
{
    g_pvec = new std::vector<Item>();
    g_pvec->resize(TEST_NUM);    
    
    auto sf = barrier.get_future().share();
    
    std::vector<std::thread> vt;
    for (int i = 0; i < SENDER_NUM; i++)
    {
        vt.emplace_back([sf, i]{    
            sf.wait();
            sender(i*TEST_NUM / SENDER_NUM, (i + 1)*TEST_NUM / SENDER_NUM);
        });
    }
    

    std::cout << "create reader\n";
    std::thread rt(std::bind(reader, TEST_NUM));
    for (auto& t : vt)
    {
        t.join();
    }    
    rt.join();
}
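
The snippet above leaves out the supporting declarations (Item, TEST_NUM, SENDER_NUM, barrier, g_pvec), and DEFINE_CODE_TEST is presumably a macro from the asker's own test harness. For reference, here is a minimal sketch of what those declarations would need to look like, inferred from how the code uses them; the exact definitions are my assumption:

#include <cstdint>
#include <future>
#include <vector>
#include <tbb/concurrent_queue.h>  // plus "using namespace tbb;" (or tbb:: qualification) for concurrent_bounded_queue

struct Item {
    int id;
    int64_t sendTime;  // microseconds since epoch, written by sender()
    int64_t recvTime;  // microseconds since epoch, written by reader()
};

const int TEST_NUM = 1000000;  // one million messages, matching the Go test
const int SENDER_NUM = 8;      // 8 producer threads

std::promise<void> barrier;    // reader() calls set_value(); senders wait on its shared_future
std::vector<Item> *g_pvec;     // one Item per message, indexed by message id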



(In the VTune CPU graphs below, red means CPU spin/overhead, green is idle.) From the VTune CPU graphs I feel that the Golang channel has a more efficient mutex (e.g., does it need a system call to put a goroutine to sleep, versus the C++ mutex?).

[VTune CPU graph screenshots for the Go and C++ tests omitted]

Answer 1

Score: 6

From the VTune traces I can conclude that the TBB queue does not sleep: it spends a lot of time spinning. The Go version, by contrast, has light-green areas indicating threads sleeping on OS synchronization. Why is that better? Usually it indicates oversubscription on your machine, so communicating through the OS pays off.

So, do you oversubscribe it? If yes, I'd say this is rather expected behavior, and it fits the philosophy of the corresponding libraries. TBB is designed for compute parallelism; it does not handle IO tasks well and plays badly against oversubscription. Go is designed exactly for IO tasks, hence the built-in concurrency and the FIFO policy of its scheduler, which is unfriendly to parallel number crunching. Oversubscription is rather recommended for IO tasks, while it hurts or even kills computational parallelism.
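
To make the answer's distinction concrete: a queue built on OS blocking primitives (a mutex plus condition variables) puts waiting threads to sleep in the kernel instead of burning cycles spinning, which is exactly what pays off when the machine is oversubscribed. Below is a minimal sketch of that sleeping strategy, with the same push/pop shape as concurrent_bounded_queue; it is neither the asker's code nor TBB's actual implementation:

#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <queue>

template <typename T>
class BlockingBoundedQueue {
public:
    explicit BlockingBoundedQueue(std::size_t cap) : cap_(cap) {}

    void push(const T &v) {  // blocks while full, like concurrent_bounded_queue::push
        std::unique_lock<std::mutex> lk(mu_);
        not_full_.wait(lk, [this] { return q_.size() < cap_; });  // sleep, don't spin
        q_.push(v);
        not_empty_.notify_one();  // wake one sleeping consumer
    }

    void pop(T &out) {  // blocks while empty, like concurrent_bounded_queue::pop
        std::unique_lock<std::mutex> lk(mu_);
        not_empty_.wait(lk, [this] { return !q_.empty(); });  // sleep, don't spin
        out = q_.front();
        q_.pop();
        not_full_.notify_one();  // wake one sleeping producer
    }

private:
    std::mutex mu_;
    std::condition_variable not_full_, not_empty_;
    std::queue<T> q_;
    std::size_t cap_;
};

As a quick check on the oversubscription question, compare std::thread::hardware_concurrency() against the roughly 10 threads this test creates (8 senders, 1 reader, 1 main thread): on a machine with fewer cores than threads, a queue that sleeps frees cores for the producers, while a spinning queue competes with them for those cores.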

Source: https://go.coder-hub.com/44148184.html (posted 2017-05-24 10:55:11)