
Delay in atomic variable update reflection across threads

Question

I am interested in exploring the minimum time in which a write to a variable can be reflected across threads.

For this I am using a global atomic variable and updating it periodically.

Meanwhile, another thread spins and checks for the updated value.

Both threads are pinned to separate, isolated cores (OS: Ubuntu).

// headers and using-directives needed by this snippet
#include <atomic>
#include <chrono>
#include <cstdint>
#include <iostream>

using namespace std;
using namespace std::chrono;

// globals
constexpr int total = 100;
atomic<int64_t> var;

void reader()
{
    int count = 0;
    int64_t tps[total];

    int64_t last = 0;
    while (count < total)
    {
        int64_t send_tp = var.load(std::memory_order_seq_cst);
        auto tp = high_resolution_clock::now();
        int64_t curr = duration_cast<nanoseconds>(tp.time_since_epoch()).count();

        if (send_tp != last)
        {
            // new timestamp observed: record how long it took to become visible
            last = send_tp;
            tps[count] = curr - send_tp;
            count++;
        }
    }

    for (auto i = 0; i < total; i++)
        cout << tps[i] << endl;
}

void writer()
{
    for (int i = 0; i < total; i++)
    {
        auto tp = high_resolution_clock::now();
        int64_t curr = duration_cast<nanoseconds>(tp.time_since_epoch()).count();
        var.store(curr, std::memory_order_seq_cst);

        // add a delay (100 ms) between writes, so that none are missed by the reader
        while (duration_cast<nanoseconds>(high_resolution_clock::now() - tp).count() < 100000000)
            ;
    }
}
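
The pinning itself isn't shown above. Assuming the two functions are launched as std::thread objects and pinned with pthread_setaffinity_np on Linux, a minimal sketch could look like this (the core numbers 2 and 3 are placeholders for whichever cores were actually isolated):

#include <pthread.h>   // pthread_setaffinity_np (GNU extension)
#include <thread>

// Pin a std::thread to a single CPU core (Linux-specific).
void pin_to_core(std::thread& t, int core)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    pthread_setaffinity_np(t.native_handle(), sizeof(cpu_set_t), &set);
}

int main()
{
    std::thread r(reader);
    std::thread w(writer);
    pin_to_core(r, 2);   // placeholder core numbers: use the isolated cores
    pin_to_core(w, 3);
    w.join();
    r.join();
}

Setting the affinity right after construction means each thread may briefly start on a different core; pinning at the top of reader()/writer(), before the measurement loop, would avoid that.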

Using this program, I'm getting a median time of around 70 nanoseconds.

I also tried to measure the overhead:

// Same measurement loop, but here the store and the load happen on the same
// thread, so this captures only the local overhead (clock calls + atomic ops).
void overhead()
{
    int count = 0;
    int64_t tps[total];

    int64_t last = 0;
    while(count < total)
    {
        auto tp1 = high_resolution_clock::now();
        int64_t to_send = duration_cast<nanoseconds>(tp1.time_since_epoch()).count();
        var.store(to_send, std::memory_order_seq_cst);

        int64_t send_tp = var.load(std::memory_order_seq_cst);
        auto tp = high_resolution_clock::now();
        int64_t curr = duration_cast<nanoseconds>(tp.time_since_epoch()).count();    

        if (send_tp != last)
        {
            last = send_tp;
            tps[count] = curr - send_tp;
            count++;
        }
    }

    for(auto i = 0; i<total; i++)
        cout << tps[i] << endl;
}

I know atomics don't have much overhead with single-threaded access, and this loop turned out to have a median of 30 nanoseconds (I guess mostly due to the chrono::high_resolution_clock::now() calls).
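
To check that guess, the cost of the clock itself can be estimated separately, e.g. by timing a batch of back-to-back now() calls. A rough sketch using the same headers as above (clock_overhead is just an illustrative name):

// Rough average cost of one high_resolution_clock::now() call:
// time n back-to-back calls and divide by n.
void clock_overhead()
{
    constexpr int n = 1000000;
    auto start = high_resolution_clock::now();
    for (int i = 0; i < n; i++)
    {
        auto t = high_resolution_clock::now();
        (void)t;   // silence unused-variable warning; the clock call itself is not elided
    }
    auto end = high_resolution_clock::now();
    cout << duration_cast<nanoseconds>(end - start).count() / n
         << " ns per now() call (average)" << endl;
}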

So this suggests that the inter-thread delay itself is around 40 nanoseconds (median). I tried different memory orderings, such as memory_order_relaxed and release/acquire, but the results were pretty similar.
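
For reference, switching to release/acquire only changes the ordering arguments on the two hot operations. A minimal standalone sketch of the pairing (var2, publish and observe are illustrative names, not from the code above):

#include <atomic>
#include <cstdint>

std::atomic<int64_t> var2;

// Writer side: a release store is enough to publish the timestamp.
void publish(int64_t ts)
{
    var2.store(ts, std::memory_order_release);
}

// Reader side: an acquire load pairs with the release store.
int64_t observe()
{
    return var2.load(std::memory_order_acquire);
}

On x86, acquire loads and release stores compile to plain MOV instructions; only the seq_cst store adds a full barrier on the writer side. The propagation delay itself is governed by the cache-coherence machinery either way, which is consistent with the similar results across orderings.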

From my understanding, the synchronization needed is just fetching the cache line from the adjacent core's L1, so why does it take around 40 nanoseconds? Am I missing something, and are there any suggestions on how the setup could be improved?

Hardware details:

Intel(R) Core(TM) i9-9900K CPU (hyperthreading disabled)

Compiled with: g++ file.cpp -lpthread -O3

Answer 1

Score: 3

40ns inter-thread latency (including measurement overhead) sounds about right for modern x86 CPUs.

And yeah, storing a timestamp and checking it against a time measurement in the reader sounds reasonable.

Cache-coherency messages between cores have to go over the ring bus to the L3 slice. When the load request (that missed in L2) gets to the right L3 slice, it will detect (from the inclusive L3 tags) that another thread owns the line in MESI Exclusive or Modified state, and generate a message to that core. That core will then do a write back (and perhaps send the data directly to the core that requested it?)

And that's on a desktop CPU where we know there are no other sockets to snoop for coherency: Intel server CPUs have significantly higher memory latency and inter-core latency.
