问题

以下是代码的翻译部分：

我有一个正在尝试加速的小型扩散模拟。这是“朴素”版本：

#include <chrono>
#include <cmath>
#include <fstream>
#include <vector>

std::chrono::nanoseconds diffuse_naive(std::vector<double> &c,
                            std::vector<double> &c_tmp,
                            const double T,
                            const double dt,
                            const double aux) {
    const size_t num_steps = (T / dt) + 1;
    const size_t length = sqrt(c.size());
    // M = N + 2，具有填充的长度
    const int M = (int) length;
    auto time_start = std::chrono::high_resolution_clock::now();
    for (size_t step = 0; step < num_steps; ++step) {
        for (size_t i = 1; i < length - 1; ++i) {
            for (size_t j = 1; j < length - 1; ++j) {
                c_tmp[i * M + j] = c[i * M + j] + aux * (
                    c[i * M + (j + 1)] + c[i * M + (j - 1)] +
                    c[(i + 1) * M + j] + c[(i - 1) * M + j] -
                    4 * c[i * M + j]);
            }
        }
        std::swap(c, c_tmp);
    }
    auto time_stop = std::chrono::high_resolution_clock::now();
    auto time_elapsed = std::chrono::duration_cast<std::chrono::nanoseconds>(time_stop - time_start);
    return time_elapsed;
}

我最初尝试使用“const”和“__restrict”来加速它，并在这样做时将内部的“for”循环移到它们自己的函数中：

#include <chrono>
#include <cmath>
#include <fstream>
#include <vector>

void step_const_c(const std::vector<double> &__restrict c,
                    std::vector<double> &__restrict c_tmp,
                    const size_t &length,
                    const double &M,
                    const double &aux) {
    for (size_t i = 1; i < length - 1; ++i) {
        for (size_t j = 1; j < length - 1; ++j) {
            c_tmp[i * M + j] = c[i * M + j] + aux * (
                c[i * M + (j + 1)] + c[i * M + (j - 1)] +
                c[(i + 1) * M + j] + c[(i - 1) * M + j] -
                4 * c[i * M + j]);
        }
    }
}

std::chrono::nanoseconds diffuse_const_c(std::vector<double> &c,
                            std::vector<double> &c_tmp,
                            const double T,
                            const double dt,
                            const double aux) {
    const size_t num_steps = (T / dt) + 1;
    const size_t length = sqrt(c.size());
    // M = N + 2，具有填充的长度
    const int M = (int) length;
    auto time_start = std::chrono::high_resolution_clock::now();
    for (size_t step = 0; step < num_steps; ++step) {
        step_const_c(c, c_tmp, length, M, aux);
        std::swap(c, c_tmp);
    }
    auto time_stop = std::chrono::high_resolution_clock::now();
    auto time_elapsed = std::chrono::duration_cast<std::chrono::nanoseconds>(time_stop - time_start);
    return time_elapsed;
}

它们都使用“g++ -O3 -Wall -Wextra -pedantic -std=c++14 -o diffusion diffusion.cpp”进行编译，并分别使用“./diffusion 0.1 2.0 200 0.5 1 naive”和“./diffusion 0.1 2.0 200 0.5 1 const”运行。

如下所示，“__restrict”在使用向量时几乎没有意义，所以我运行了这个程序，有和没有使用“const”和“__restrict”都是相同的，运行时间都是一样的，从朴素方法的0.05秒到调用“step_const_c”的0.3秒（无论是否使用“const”和“__restrict”）。：

是什么原因导致将内部的“for”循环移到它们自己的函数中，速度比将“for”循环放在同一个函数中要慢这么多？
谢谢！

**编辑**：这个问题最初是关于为什么添加“const”和“__restrict”会减慢程序运行速度，但我从“step_const_c”的参数中删除了“const”和“__restrict”，运行时间几乎相同，这表明它们不是问题，而是将“for”循环放在自己的函数中。

<details>
<summary>英文:</summary>

I have a small diffusion simulation that I&#39;m experimenting with speeding up.
Here is the &#39;naive&#39; version:

#include <chrono>
#include <cmath>
#include <fstream>
#include <vector>

std::chrono::nanoseconds diffuse_naive(std::vector<double> &c,
std::vector<double> &c_tmp,
const double T,
const double dt,
const double aux) {
const size_t num_steps = (T / dt) + 1;
const size_t length = sqrt(c.size());
// M = N + 2, the length with padding
const int M = (int) length;
auto time_start = std::chrono::high_resolution_clock::now();
for (size_t step = 0; step < num_steps; ++step) {
for (size_t i = 1; i < length - 1; ++i) {
for (size_t j = 1; j < length - 1; ++j) {
c_tmp[i * M + j] = c[i * M + j] + aux * (
c[i * M + (j + 1)] + c[i * M + (j - 1)] +
c[(i + 1) * M + j] + c[(i - 1) * M + j] -
4 * c[i * M + j]);
}
}
std::swap(c, c_tmp);
}
auto time_stop = std::chrono::high_resolution_clock::now();
auto time_elapsed = std::chrono::duration_cast<std::chrono::nanoseconds>(time_stop - time_start);
return time_elapsed;
}


I was originally experimenting with the use of `const` and `__restrict` to speed this up, and in doing so, moved the inner `for` loops to their own function:

#include <chrono>
#include <cmath>
#include <fstream>
#include <vector>

void step_const_c(const std::vector<double> &__restrict c,
std::vector<double> &__restrict c_tmp,
const size_t &length,
const double &M,
const double &aux) {
for (size_t i = 1; i < length - 1; ++i) {
for (size_t j = 1; j < length - 1; ++j) {
c_tmp[i * M + j] = c[i * M + j] + aux * (
c[i * M + (j + 1)] + c[i * M + (j - 1)] +
c[(i + 1) * M + j] + c[(i - 1) * M + j] -
4 * c[i * M + j]);
}
}
}

std::chrono::nanoseconds diffuse_const_c(std::vector<double> &c,
std::vector<double> &c_tmp,
const double T,
const double dt,
const double aux) {
const size_t num_steps = (T / dt) + 1;
const size_t length = sqrt(c.size());
// M = N + 2, the length with padding
const int M = (int) length;
auto time_start = std::chrono::high_resolution_clock::now();
for (size_t step = 0; step < num_steps; ++step) {
step_const_c(c, c_tmp, length, M, aux);
std::swap(c, c_tmp);
}
auto time_stop = std::chrono::high_resolution_clock::now();
auto time_elapsed = std::chrono::duration_cast<std::chrono::nanoseconds>(time_stop - time_start);
return time_elapsed;
}


Both are compiled with `g++ -O3 -Wall -Wextra -pedantic -Wall -Wextra -pedantic -std=c++14 -o diffusion diffusion.cpp` and run with `./diffusion 0.1 2.0 200 0.5 1 naive` and `./diffusion 0.1 2.0 200 0.5 1 const` respectively.
As pointed out by a comment below, the use of `__restrict` is pretty pointless when using vectors, so I ran this both with and without the use of `const` and `__restrict`. The runtime with the separate for-loop function increases by a factor of six, from 0.05 seconds with the naive approach to 0.3 seconds when calling `step_const_c` (with or without `const` and `__restrict`).
What is causing the version with the inner `for` loops in their own function to be so much slower than having the `for` loops in the same function?
Thanks!
**Edit**: This question was originally about why adding `const` and `__restrict` slowed the program down, but I removed `const` and `__restrict` from the arguments of `step_const_c`, and the runtime was pretty much identical, which indicates they aren&#39;t the issue, but rather putting the `for` loops in their own function.
The full content of `diffusion.cpp`:

#include <cmath>
#include <fstream>
#include <vector>

void save_results(const std::vector<double> &c,
const std::string &implementation) {
std::string filename = "cpp_" + implementation + ".txt";
size_t length = sqrt(c.size());
std::ofstream ofs(filename, std::ofstream::out);
for (size_t i = 0; i < length; ++i) {
for (size_t j = 0; j < length; ++j) {
ofs << c[i * length + j];
if (j == length - 1) {
ofs << "\n";
} else {
ofs << ",";
}
}
}
ofs.close();
}

void initialize_concentration(std::vector<double> &c,
const double &L,
const size_t &N,
const double &h) {
const double bound = 0.125 * L;
for (size_t i = 0; i < N; ++i) {
for (size_t j = 0; j < N; ++j) {
if (std::abs(i * h - 0.5 * L) < bound &&
std::abs(j * h - 0.5 * L) < bound) {
c[(i + 1) * (N + 2) + (j + 1)] = 1.0;
}
}
}
}

int main(int argc, char *argv[]) {
if (argc < 7) {
fprintf(stderr, "Usage: %s D L N T out implementation\n", argv[0]);
return 1;
}

const double D = std::stod(argv[1]);        // diffusion constant
const double L = std::stod(argv[2]);        // domain size
const size_t N = std::stoul(argv[3]);       // number of grid points along each dim
const double T = std::stod(argv[4]);        // simulation time in seconds
const size_t output = std::stoul(argv[5]);  // whether to store results, yes == 1
const std::string implementation = argv[6]; // version of program to run
const double h = L / (N - 1);               // length between grid elements
const double dt = h * h / (4 * D);          // maximum timestamp for numeric stability
const double aux = dt * D / (h * h);        // all constant terms in diffusion equation
std::vector&lt;double&gt; c((N+2) * (N+2), 0.0);
std::vector&lt;double&gt; c_tmp((N+2) * (N+2), 0.0);
initialize_concentration(c, L, N, h);
std::chrono::nanoseconds time_elapsed;
if (implementation.compare(&quot;naive&quot;) == 0) {
time_elapsed = diffuse_naive(c, c_tmp, T, dt, aux);
} else if (implementation.compare(&quot;const&quot;) == 0) {
time_elapsed = diffuse_const_c(c, c_tmp, T, dt, aux);
}
printf(&quot;Elapsed time: %luns\n&quot;, time_elapsed.count());
if (output == 1) {
save_results(c, implementation);
}

}


</details>
# 答案1
**得分**: 1
`diffuse_const_c()` 声明 `M` 为 `const int &amp;`，但 `step_const_c()` 请求 `const double &amp;M`。
我在C++方面经验有限，但如果我不得不猜测底层发生了什么，我会假设添加了两个额外的步骤，导致速度变慢：
1. 创建一个新的匿名整数变量
2. 强制类型转换 `M` 并将其放入该变量
考虑到函数的其余部分只是从预分配的向量中获取值，进行一些快速的代数运算，然后将结果放回预分配的向量，这些额外的步骤听起来相对昂贵。
将 `step_const_c()` 中的函数参数改为 `const int &amp;M` 会将运行时间降低到与简单实现相同的水平。
<details>
<summary>英文:</summary>
There was implicit typecasting happening (I think)!
`diffuse_const_c()` declares `M` as an `const int &amp;`, but `step_const_c()` was requesting `const double &amp;M`.
I&#39;m far from experienced in C++, but if I had to guess what&#39;s happening under the hood, I would assume two additional steps are being added that are slowing this down:
1. Create a new anonymous integer variable
2. Cast `M` and place it in that variable
Considering the rest of the function is just grabbing values from a pre-allocated vector, some quick algebra, then putting the result back into a pre-allocated vector, these additional steps sound relatively expensive.
Switching the function argument in `step_const_c()` to `const int &amp;M` brings the runtime down to the same level as the naive implementation.
</details>

通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库，让每个人都能够通过互相帮助和分享经验来进步。

将for循环移到它们自己的函数中为什么会减慢程序速度？

问题

优化字典的RAM使用情况

如何在单线程系统上停止阻塞任务？

为什么将字符串初始化为“”””比使用默认构造函数更高效？

Qt6工具提示和下拉菜单显示在错误的屏幕上。

What's the correct way to type hint an empty list as a literal in python?

如何在Highcharts Gantt中更改本地化的星期名称

如何在同一个流中使用多个过滤器和映射函数？

如何使用Map/Set来将代码优化到O(n)？

.NET MAUI Android在GitHub Actions上构建失败，错误代码为1。

如何在Playwright视觉比较中屏蔽多个定位器？

在C++中，可以使用可变模板参数来检索类型的内部类型。

selenium.common.exceptions.StaleElementReferenceException: Message: stale element reference: stale element not found

Creating and opening a URL to log in to Website via Basic Auth with Robot Framework/Selenium (Python)

AG Grid 在上下文菜单中以大文本形式打开

发表评论