With very short sleep times, why does a thread only finish zero or one iteration of printing before seeing the stop flag set?


Question

See the code below: `AsyncTask` creates a peer thread (timer) that increments an atomic variable and then sleeps for a while. The expected output is `counter_` printed 10 times, with values ranging from 1 to 10, but the actual result is strange:

- The actual result seems random: sometimes it's printed once, sometimes it's not printed at all.
- Further, I found that when I changed the thread sleep times (in both the peer thread and the main thread) to seconds or milliseconds, the program worked as expected.

```cpp
#include <atomic>
#include <chrono>
#include <thread>
#include <iostream>

class AtomicTest {
 public:
  int AsyncTask() {
    std::thread timer([this](){
      while (not stop_.load(std::memory_order_acquire)) {
        counter_.fetch_add(1, std::memory_order_relaxed);
        std::cout << "counter = " << counter_ << std::endl;
        std::this_thread::sleep_for(std::chrono::microseconds(1)); // both milliseconds and seconds work well
      }
    });
    timer.detach();

    std::this_thread::sleep_for(std::chrono::microseconds(10));
    stop_.store(true, std::memory_order_release);
    return 0;
  }

 private:
  std::atomic<int> counter_{0};
  std::atomic<bool> stop_{false};
};

int main(void) {
  AtomicTest test;
  test.AsyncTask();
  return 0;
}
```

I know that thread switching also takes time. Is it because the thread sleep time is too short?

My environment:

  • Apple clang version 14.0.0 (clang-1400.0.29.202)
  • Target: arm64-apple-darwin22.2.0

Answer 1

Score: 2


Yes, it's easily plausible that `stop_.store` runs before the new thread has even been scheduled onto a CPU core, or very soon after, so the thread's first check of the loop condition already reads the stop flag as true.

10 µs is much shorter than a typical OS scheduling timeslice (often 1 or 10 ms), in case that's relevant, and only a couple of orders of magnitude longer than the inter-core latency for an atomic store to become visible.
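To get a feel for these time scales on your own machine, here is a minimal sketch that measures how long `sleep_for` really takes for a very short request. `average_sleep` is a hypothetical helper name, not something from the question. `sleep_for` only promises to block for at least the requested duration, so the measured average should never be below the request and may be well above it.

```cpp
#include <chrono>
#include <thread>

// Hypothetical helper (not from the question): average wall-clock time that
// std::this_thread::sleep_for(requested) actually takes over `samples` calls.
// Assumes samples > 0.
std::chrono::nanoseconds average_sleep(std::chrono::nanoseconds requested,
                                       int samples) {
  using clock = std::chrono::steady_clock;
  std::chrono::nanoseconds total{0};
  for (int i = 0; i < samples; ++i) {
    const auto start = clock::now();
    std::this_thread::sleep_for(requested);
    total += std::chrono::duration_cast<std::chrono::nanoseconds>(
        clock::now() - start);
  }
  return total / samples;
}
```

Printing `average_sleep(std::chrono::microseconds(1), 1000).count()` gives the average in nanoseconds. How far it overshoots 1000 depends on the OS and its timer settings, so treat the number as a property of your system rather than of the program.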

The results you describe are exactly what I'd expect from a timing-dependent program like this, which is effectively written to detect which thread wins the race and by how much (with its slow `<< endl` and the sleep inside the writing thread).

I definitely wouldn't expect it to always print 10 times. That would rarely if ever happen, because thread startup overhead is significant relative to the 1 µs sleep interval inside the printing thread.
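Thread startup overhead can be estimated directly. The following is a rough sketch, with `thread_startup_latency` as a made-up name: it records a timestamp just before constructing the `std::thread` and another as the first statement of the thread body. It joins the thread so the second timestamp can be read safely.

```cpp
#include <chrono>
#include <thread>

// Hypothetical helper: time from just before constructing a std::thread
// until the first statement of its body runs.
std::chrono::nanoseconds thread_startup_latency() {
  using clock = std::chrono::steady_clock;
  clock::time_point started;  // written by the worker, read after join()
  const auto before = clock::now();
  std::thread worker([&started] { started = clock::now(); });
  worker.join();  // join() synchronizes, so reading `started` here is safe
  return std::chrono::duration_cast<std::chrono::nanoseconds>(started -
                                                              before);
}
```

If the value this returns is comparable to or larger than the 10 µs the main thread sleeps, then seeing zero or one iteration is the expected outcome rather than a surprise.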


BTW, your question was originally titled "A question about incrementing atomic variables?", but `counter_` is only ever accessed from one thread. It's probably in the same cache line as the stop flag, but without contention from the main thread the increment is basically trivial, a very fast operation.

It's irrelevant to what you're seeing; it could be a local non-atomic `int` inside the thread's lambda and you'd see the same timing effects. The significant things here are `cout << endl`, which forces a flush of the stream (and thus a system call) even if you redirect output to a file, and `this_thread::sleep_for()`.
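As an illustration of that point, here is a sketch of the same race with a plain `int` counter. `count_iterations` and its parameters are hypothetical names. Unlike the question's code, it joins the worker instead of detaching it, so the counter can be read safely once the worker has finished. The `cout << endl` and the `sleep_for` are kept, since those are what dominate the timing.

```cpp
#include <atomic>
#include <chrono>
#include <iostream>
#include <thread>

// Hypothetical variant of the question's AsyncTask: returns how many
// iterations the worker completed before it saw the stop flag.
int count_iterations(std::chrono::nanoseconds main_sleep,
                     std::chrono::nanoseconds worker_sleep) {
  std::atomic<bool> stop{false};
  int counter = 0;  // plain int: only the worker touches it until join()
  std::thread worker([&] {
    while (!stop.load(std::memory_order_acquire)) {
      ++counter;
      std::cout << "counter = " << counter << std::endl;  // flushes each time
      std::this_thread::sleep_for(worker_sleep);
    }
  });
  std::this_thread::sleep_for(main_sleep);
  stop.store(true, std::memory_order_release);
  worker.join();  // wait for the worker instead of detaching it
  return counter;
}
```

Calling it with `std::chrono::microseconds(10)` and `std::chrono::microseconds(1)` mirrors the question's timings, and the returned count will typically vary from run to run. With millisecond values the count becomes far more predictable, matching what the question observed.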

If the `write` system call goes to a terminal (not redirected to a file), it might even block while the terminal emulator draws on the screen, although for only a couple of small writes there's probably a big enough buffer somewhere (probably inside the kernel) to absorb it.

An atomic increment probably takes a few nanoseconds, and being `relaxed` it's something AArch64 can handle very efficiently, overlapping much of that time with surrounding code. (Modern x86 can do an atomic increment about once per 20 clock cycles at best, and that includes a full memory barrier. I expect Apple M1 to handle it more cheaply when it doesn't need to be a barrier.)
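A crude way to check the order of magnitude is to time a loop of uncontended relaxed increments. This is only a sketch: `time_relaxed_increments` and `IncrementTiming` are invented names, the loop measures back-to-back throughput on one thread rather than how well the increment overlaps with other work, and the result depends on compiler and optimization settings.

```cpp
#include <atomic>
#include <chrono>
#include <cstdint>

// Hypothetical result type for the micro-benchmark below.
struct IncrementTiming {
  std::uint64_t final_value;  // should equal the number of iterations
  double ns_per_increment;    // average wall-clock cost per fetch_add
};

// Times `iterations` uncontended relaxed increments on a single thread.
// Assumes iterations > 0.
IncrementTiming time_relaxed_increments(std::uint64_t iterations) {
  using clock = std::chrono::steady_clock;
  std::atomic<std::uint64_t> counter{0};
  const auto start = clock::now();
  for (std::uint64_t i = 0; i < iterations; ++i) {
    counter.fetch_add(1, std::memory_order_relaxed);
  }
  const std::chrono::duration<double, std::nano> elapsed =
      clock::now() - start;
  return {counter.load(std::memory_order_relaxed),
          elapsed.count() / static_cast<double>(iterations)};
}
```

Whatever per-increment figure this reports, the point is to compare it with the microseconds spent in `sleep_for` and in flushing `cout`, which is where the time in the question's loop actually goes.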

Posted by huangapple on May 17, 2023, 11:35:49 (source: https://go.coder-hub.com/76268390.html)