June 22, 2023, 04:57:04

Why is locale causing std::ostringstream to get slower as I use more threads?

Question

I'm building some formatted strings using std::ostringstream. When running on a single thread, code profiling shows no bottleneck caused by std::ostringstream.

When I start using more threads, std::ostringstream slows down due to std::__1::locale::locale.

This gets worse and worse as more threads are used.

I'm not performing any thread synchronization explicitly but I suspect something inside std::__1::locale::locale is causing my threads to block which gets worse as I use more threads. It's the difference between a single thread taking ~30 seconds and 10 threads taking 10 minutes.

The code in question is small but called many times:

template <typename T>
static std::string to_string(const T d) {
    std::ostringstream stream;
    stream << d;
    return stream.str();
}

When I change it to avoid constructing a new std::ostringstream every time,

thread_local static std::ostringstream stream;
const std::string clear;
template <typename T>
static std::string to_string(const T d) {
    stream.str(clear);  // reset the buffer instead of constructing a new stream
    stream << d;
    return stream.str();
}

I recover multithreaded performance, but single-threaded performance suffers. What can I do to avoid this problem? The strings built here never need to be human readable. They are only used so that I can work around the lack of a hash function for std::complex. Is there a way to avoid localization when building formatted strings?

#include <map>
#include <sstream>
#include <complex>
#include <iostream>
#include <string>
#include <thread>
#include <vector>
#include <chrono>
// Each thread fills its own cache, so no explicit synchronization is involved.
thread_local std::map<std::string, void *> cache;
int main(int argc, const char * argv[]) {
    // Repeat the benchmark with 1 to 10 threads.
    for (size_t i = 1; i <= 10; i++) {
        const std::chrono::high_resolution_clock::time_point start = std::chrono::high_resolution_clock::now();
        std::vector<std::thread> threads(i);
        for (auto &t : threads) {
            t = std::thread([] () -> void {
                for (size_t j = 0; j < 1000000; j++) {
                    // A new stream (and therefore a new locale copy) per iteration.
                    std::ostringstream stream;
                    stream << std::complex<double> (static_cast<double> (j));
                    cache[stream.str()] = reinterpret_cast<void *> (&j);
                }
            });
        }
        for (auto &t : threads) {
            t.join();
        }
        
        const std::chrono::high_resolution_clock::time_point end =
                  std::chrono::high_resolution_clock::now();
        const auto total_time = end - start;
        const std::chrono::nanoseconds total_time_ns =
                  std::chrono::duration_cast<std::chrono::nanoseconds> (total_time);
        // Print the elapsed time in the most readable unit.
        if (total_time_ns.count() < 1000) {
            std::cout << total_time_ns.count()                 << " ns"  << std::endl;
        } else if (total_time_ns.count() < 1000000) {
            std::cout << total_time_ns.count()/1000.0          << " μs"  << std::endl;
        } else if (total_time_ns.count() < 1000000000) {
            std::cout << total_time_ns.count()/1000000.0       << " ms"  << std::endl;
        } else if (total_time_ns.count() < 60000000000) {
            std::cout << total_time_ns.count()/1000000000.0    << " s"   << std::endl;
        } else if (total_time_ns.count() < 3600000000000) {
            std::cout << total_time_ns.count()/60000000000.0   << " min" << std::endl;
        } else {
            std::cout << total_time_ns.count()/3600000000000.0 << " h"   << std::endl;
        }
        std::cout << std::endl;
    }
    return 0;
}

Running on a 10-core (8 performance, 2 efficiency) Apple M1 produces the output below. Build settings are the standard Xcode defaults. For a Debug build, the timings are

3.90096 s
4.15853 s
4.48616 s
4.843 s
6.15202 s
8.14986 s
10.6319 s
12.7732 s
16.7492 s
19.9288 s

For a Release build, the timings are

844.28 ms
1.23803 s
2.05088 s
3.39994 s
7.43743 s
9.53968 s
11.2953 s
12.6878 s
20.3917 s
24.1944 s

Answer 1

Score: 1

<details>
<summary>English:</summary>
Doing some digging for alternatives, the [`std::to_string`][1] notes say:

> `std::to_string` relies on the current locale for formatting purposes, and therefore concurrent calls to `std::to_string` from multiple threads may result in partial serialization of calls. C++17 provides `std::to_chars` as a higher-performance locale-independent alternative.

Using `std::to_chars` in the minimal example instead results in much better performance, close to what I was expecting for embarrassingly parallel code.

    #include <array>
    #include <map>
    #include <sstream>
    #include <complex>
    #include <iostream>
    #include <thread>
    #include <vector>
    #include <chrono>
    #include <charconv>
    #include <limits>
    #include <string>
    #include <iomanip>

    thread_local std::map<std::string, void *> cache;
    thread_local std::map<std::string, void *> cache2;

    // Baseline: build each key with a freshly constructed std::ostringstream.
    void stream() {
        for (size_t i = 1; i <= 10; i++) {
            const std::chrono::high_resolution_clock::time_point start = std::chrono::high_resolution_clock::now();
            std::vector<std::thread> threads(i);
            for (auto &t : threads) {
                t = std::thread([] () -> void {
                    for (size_t j = 0; j < 1000000; j++) {
                        std::ostringstream stream;
                        stream << std::setprecision(std::numeric_limits<double>::max_digits10);
                        stream << std::complex<double> (static_cast<double> (j));
                        cache[stream.str()] = reinterpret_cast<void *> (&j);
                    }
                });
            }
            for (auto &t : threads) {
                t.join();
            }

            const std::chrono::high_resolution_clock::time_point end =
                      std::chrono::high_resolution_clock::now();
            const auto total_time = end - start;
            const std::chrono::nanoseconds total_time_ns =
                      std::chrono::duration_cast<std::chrono::nanoseconds> (total_time);
            if (total_time_ns.count() < 1000) {
                std::cout << total_time_ns.count()                 << " ns"  << std::endl;
            } else if (total_time_ns.count() < 1000000) {
                std::cout << total_time_ns.count()/1000.0          << " μs"  << std::endl;
            } else if (total_time_ns.count() < 1000000000) {
                std::cout << total_time_ns.count()/1000000.0       << " ms"  << std::endl;
            } else if (total_time_ns.count() < 60000000000) {
                std::cout << total_time_ns.count()/1000000000.0    << " s"   << std::endl;
            } else if (total_time_ns.count() < 3600000000000) {
                std::cout << total_time_ns.count()/60000000000.0   << " min" << std::endl;
            } else {
                std::cout << total_time_ns.count()/3600000000000.0 << " h"   << std::endl;
            }
            std::cout << std::endl;
        }
    }

    // Alternative: format the value with the locale-independent std::to_chars.
    void to_chars() {
        for (size_t i = 1; i <= 10; i++) {
            const std::chrono::high_resolution_clock::time_point start = std::chrono::high_resolution_clock::now();
            std::vector<std::thread> threads(i);
            constexpr int max_digits = std::numeric_limits<double>::max_digits10;
            for (size_t k = 0, ke = threads.size(); k < ke; k++) {
                threads[k] = std::thread([] () -> void {
                    std::array<char, 36> buffer;
                    for (size_t j = 0; j < 1000000; j++) {
                        // Pass raw pointers; std::array iterators are not guaranteed to be char*.
                        char *end = std::to_chars(buffer.data(), buffer.data() + buffer.size(),
                                                  static_cast<double> (j),
                                                  std::chars_format::general, max_digits).ptr;
                        cache2[std::string(buffer.data(), end)] = reinterpret_cast<void *> (&j);
                    }
                });
            }
            for (auto &t : threads) {
                t.join();
            }

            const std::chrono::high_resolution_clock::time_point end =
                      std::chrono::high_resolution_clock::now();
            const auto total_time = end - start;
            const std::chrono::nanoseconds total_time_ns =
                      std::chrono::duration_cast<std::chrono::nanoseconds> (total_time);
            if (total_time_ns.count() < 1000) {
                std::cout << total_time_ns.count()                 << " ns"  << std::endl;
            } else if (total_time_ns.count() < 1000000) {
                std::cout << total_time_ns.count()/1000.0          << " μs"  << std::endl;
            } else if (total_time_ns.count() < 1000000000) {
                std::cout << total_time_ns.count()/1000000.0       << " ms"  << std::endl;
            } else if (total_time_ns.count() < 60000000000) {
                std::cout << total_time_ns.count()/1000000000.0    << " s"   << std::endl;
            } else if (total_time_ns.count() < 3600000000000) {
                std::cout << total_time_ns.count()/60000000000.0   << " min" << std::endl;
            } else {
                std::cout << total_time_ns.count()/3600000000000.0 << " h"   << std::endl;
            }
            std::cout << std::endl;
        }
    }

    int main(int argc, const char * argv[]) {
        stream();
        std::cout << "-----------------------------------------------------------" << std::endl;
        to_chars();
        return 0;
    }

Results in timings of

    854.078 ms
    1.3472 s
    2.26556 s
    3.61298 s
    7.55469 s
    9.29697 s
    11.321 s
    12.6926 s
    19.607 s
    24.4866 s

    403.037 ms
    416.532 ms
    432.433 ms
    437.869 ms
    450.775 ms
    458.693 ms
    473.683 ms
    498.53 ms
    528.434 ms
    560.239 ms

Code profiling confirms the string hashes are no longer the largest bottleneck.

[![Code profiling using std::to_chars and 10 threads][2]][2]
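
Note that the benchmark above only formats a single `double`. For an actual `std::complex<double>` key, both parts need to go into the string. A minimal sketch of that, assuming a standard library that implements floating-point `std::to_chars`; the helper name `complex_key` and the comma separator are illustrative choices, not part of the original code:

```cpp
#include <cassert>
#include <charconv>
#include <complex>
#include <string>
#include <system_error>

// Hypothetical helper (not from the original post): builds a string key from
// both parts of a complex number without touching std::locale.
std::string complex_key(const std::complex<double> &z) {
    // The shortest round-trip form of a double needs at most 24 characters,
    // so 64 bytes comfortably holds two values plus the separator.
    char buffer[64];
    char *const last = buffer + sizeof(buffer);

    // Without a format argument, std::to_chars writes the shortest text that
    // parses back to exactly the same double.
    std::to_chars_result result = std::to_chars(buffer, last, z.real());
    assert(result.ec == std::errc());

    *result.ptr++ = ',';  // separator between real and imaginary parts

    result = std::to_chars(result.ptr, last, z.imag());
    assert(result.ec == std::errc());

    return std::string(buffer, result.ptr);
}
```

Used as `cache2[complex_key(z)] = ptr;`, this keeps the per-thread work free of any shared locale state. Because the shortest round-trip form is used, two distinct finite values never map to the same key.
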
[1]: https://en.cppreference.com/w/cpp/string/basic_string/to_string
[2]: https://i.stack.imgur.com/HnDB5.jpg
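
Since the strings only exist to work around the missing hash function for `std::complex`, another option is to skip string formatting entirely and hash the two parts directly. A rough sketch, assuming an `std::unordered_map` is acceptable in place of the `std::map` used above; `ComplexHash` and `ComplexCache` are hypothetical names, and the mixing step is the widely used Boost-style hash combine rather than anything from the original post:

```cpp
#include <complex>
#include <cstddef>
#include <functional>
#include <unordered_map>

// Hypothetical hasher (not from the original post): combines the standard
// hashes of the real and imaginary parts into one value.
struct ComplexHash {
    std::size_t operator()(const std::complex<double> &z) const noexcept {
        const std::size_t h1 = std::hash<double>{}(z.real());
        const std::size_t h2 = std::hash<double>{}(z.imag());
        // Boost-style hash_combine step; the constant is truncated on
        // platforms where std::size_t is 32 bits wide.
        const std::size_t magic = static_cast<std::size_t>(0x9e3779b97f4a7c15ULL);
        return h1 ^ (h2 + magic + (h1 << 6) + (h1 >> 2));
    }
};

// Key equality uses std::complex's own operator==.
using ComplexCache = std::unordered_map<std::complex<double>, void *, ComplexHash>;
```

A `thread_local ComplexCache cache;` could then be indexed as `cache[z]`. Keys are compared with `operator==`, so a NaN component never matches an existing entry; whether that matters depends on the data being cached.
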
</details>
