Delay in atomic variable update reflection across threads
Question
I am interested in exploring the minimum time in which a write to a variable can be reflected across threads.
For this I am using a global atomic variable and updating it periodically. Meanwhile, another thread spins and checks for the updated value. Both threads are pinned to separate isolated cores (OS: Ubuntu); a sketch of the launch-and-pin step is shown after the code below.
#include <atomic>
#include <chrono>
#include <cstdint>
#include <iostream>

using namespace std;
using namespace std::chrono;

// global
constexpr int total = 100;
atomic<int64_t> var;

void reader()
{
    int count = 0;
    int64_t tps[total];
    int64_t last = 0;
    while (count < total)
    {
        int64_t send_tp = var.load(std::memory_order_seq_cst);
        auto tp = high_resolution_clock::now();
        int64_t curr = duration_cast<nanoseconds>(tp.time_since_epoch()).count();
        if (send_tp != last)                 // a new timestamp was published
        {
            last = send_tp;
            tps[count] = curr - send_tp;     // observed write-to-read latency
            count++;
        }
    }
    for (auto i = 0; i < total; i++)
        cout << tps[i] << endl;
}

void writer()
{
    for (int i = 0; i < total; i++)
    {
        auto tp = high_resolution_clock::now();
        int64_t curr = duration_cast<nanoseconds>(tp.time_since_epoch()).count();
        var.store(curr, std::memory_order_seq_cst);
        // add a 100 ms delay between writes, so that none are missed
        while (duration_cast<nanoseconds>(high_resolution_clock::now() - tp).count() < 100000000)
            ;
    }
}
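The post states only that the threads are attached to separate isolated cores; the launch-and-pin step is not shown. Below is a minimal sketch of how it could look, assuming the reader()/writer() definitions above; the core numbers 2 and 3 and the use of pthread_setaffinity_np are illustrative assumptions, not part of the original program.

#include <pthread.h>
#include <thread>

// Hypothetical launch-and-pin helper; the core numbers are assumed to be
// isolated (e.g. via the isolcpus kernel parameter) and are not from the post.
static void pin_to_core(std::thread& t, int core)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    pthread_setaffinity_np(t.native_handle(), sizeof(set), &set);
}

int main()
{
    std::thread r(reader);
    std::thread w(writer);
    pin_to_core(r, 2);   // assumed isolated core
    pin_to_core(w, 3);   // assumed isolated core
    r.join();
    w.join();
}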
Using this program, I'm getting a median time of around 70 nanoseconds.
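The post only reports the median, not how it is computed; a hypothetical post-processing helper (not the author's code) could look like this:

#include <algorithm>

// Hypothetical helper: partially sorts the samples and returns the element at
// index total/2 as the (upper) median.
int64_t median_of(int64_t (&samples)[total])
{
    std::nth_element(samples, samples + total / 2, samples + total);
    return samples[total / 2];
}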
I also tried to measure the overheads:
void overhead()
{
    int count = 0;
    int64_t tps[total];
    int64_t last = 0;
    while (count < total)
    {
        auto tp1 = high_resolution_clock::now();
        int64_t to_send = duration_cast<nanoseconds>(tp1.time_since_epoch()).count();
        var.store(to_send, std::memory_order_seq_cst);
        int64_t send_tp = var.load(std::memory_order_seq_cst);  // same-thread store + load
        auto tp = high_resolution_clock::now();
        int64_t curr = duration_cast<nanoseconds>(tp.time_since_epoch()).count();
        if (send_tp != last)
        {
            last = send_tp;
            tps[count] = curr - send_tp;                        // pure measurement overhead
            count++;
        }
    }
    for (auto i = 0; i < total; i++)
        cout << tps[i] << endl;
}
I know atomics will not have much overhead for single-threaded access, and this loop turned out to have a median of 30 nanoseconds (I guess mostly due to chrono::high_resolution_clock).
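To separate the clock cost from the atomic cost, one assumed extra experiment (not in the original post) is to time back-to-back calls to high_resolution_clock::now() in the same file:

// Assumed extra experiment: the difference between consecutive now() calls
// approximates the cost of the clock read itself.
void clock_overhead()
{
    constexpr int n = 1000;
    int64_t deltas[n];
    auto prev = high_resolution_clock::now();
    for (int i = 0; i < n; i++)
    {
        auto t = high_resolution_clock::now();
        deltas[i] = duration_cast<nanoseconds>(t - prev).count();
        prev = t;
    }
    for (int i = 0; i < n; i++)
        cout << deltas[i] << endl;
}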
So this suggests that the delay itself is around 40 nanoseconds (median). I tried different memory orderings, like memory_order_relaxed or release-acquire, but the results were pretty similar.
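For concreteness, the release-acquire variant only changes the hot store/load pair (the relaxed variant substitutes memory_order_relaxed in both places). A minimal sketch, assuming the surrounding loops stay as shown above:

// Release-acquire variant of the hot store/load pair; the loops in writer()
// and reader() are unchanged.
inline void publish_timestamp(int64_t ts)
{
    var.store(ts, std::memory_order_release);    // writer side
}

inline int64_t read_timestamp()
{
    return var.load(std::memory_order_acquire);  // reader side
}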
From my understanding, the sync needed is just fetching the L1 cache line from the adjacent core, so why is it taking around 40 nanoseconds? Am I missing something, or does anyone have suggestions on how the setup can be improved?
Hardware details:
Intel(R) Core(TM) i9-9900K CPU (hyperthreading disabled)
Compiled: g++ file.cpp -lpthread -O3
Answer 1
Score: 3
40ns inter-thread latency (including measurement overhead) sounds about right for modern x86 CPUs.
And yeah, storing a timestamp and checking it against a time measurement in the reader sounds reasonable.
Cache-coherency messages between cores have to go over the ring bus to the L3 slice. When the load request (that missed in L2) gets to the right L3 slice, it will detect (from the inclusive L3 tags) that another thread owns the line in MESI Exclusive or Modified state, and generate a message to that core. That core will then do a write back (and perhaps send the data directly to the core that requested it?)
And that's on a desktop CPU where we know there are no other sockets to snoop for coherency: Intel server CPUs have significantly higher memory latency and inter-core latency.
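The answer itself contains no code. As an assumed illustration (not part of the answer) of another common way to estimate core-to-core latency without reading the clock on every sample, two pinned threads can ping-pong a shared atomic: the elapsed time divided by twice the number of round trips approximates the one-way latency. Pinning is omitted here for brevity; the same pin_to_core helper sketched earlier could be applied.

#include <atomic>
#include <chrono>
#include <cstdint>
#include <iostream>
#include <thread>

// Assumed alternative experiment (not from the answer): each round trip costs
// two cache-line transfers, so one-way latency ~= elapsed / (2 * rounds).
std::atomic<int64_t> flag{0};
constexpr int64_t rounds = 1000000;

int main()
{
    std::thread pong([] {
        for (int64_t i = 0; i < rounds; i++)
        {
            while (flag.load(std::memory_order_acquire) != 2 * i + 1) {}
            flag.store(2 * i + 2, std::memory_order_release);
        }
    });
    auto start = std::chrono::high_resolution_clock::now();
    for (int64_t i = 0; i < rounds; i++)
    {
        flag.store(2 * i + 1, std::memory_order_release);
        while (flag.load(std::memory_order_acquire) != 2 * i + 2) {}
    }
    auto stop = std::chrono::high_resolution_clock::now();
    pong.join();
    auto ns = std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start).count();
    std::cout << "approx one-way latency: " << ns / (2.0 * rounds) << " ns\n";
    return 0;
}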
Comments