将for循环移到它们自己的函数中为什么会减慢程序速度?

huangapple go评论69阅读模式
英文:

Why does moving for loops to their own function slow down the program?

问题

以下是代码的翻译部分:

我有一个正在尝试加速的小型扩散模拟。这是“朴素”版本:

#include <chrono>
#include <cmath>
#include <fstream>
#include <vector>

std::chrono::nanoseconds diffuse_naive(std::vector<double> &c,
                            std::vector<double> &c_tmp,
                            const double T,
                            const double dt,
                            const double aux) {
    const size_t num_steps = (T / dt) + 1;
    const size_t length = sqrt(c.size());
    // M = N + 2,具有填充的长度
    const int M = (int) length;
    auto time_start = std::chrono::high_resolution_clock::now();
    for (size_t step = 0; step < num_steps; ++step) {
        for (size_t i = 1; i < length - 1; ++i) {
            for (size_t j = 1; j < length - 1; ++j) {
                c_tmp[i * M + j] = c[i * M + j] + aux * (
                    c[i * M + (j + 1)] + c[i * M + (j - 1)] +
                    c[(i + 1) * M + j] + c[(i - 1) * M + j] -
                    4 * c[i * M + j]);
            }
        }
        std::swap(c, c_tmp);
    }
    auto time_stop = std::chrono::high_resolution_clock::now();
    auto time_elapsed = std::chrono::duration_cast<std::chrono::nanoseconds>(time_stop - time_start);
    return time_elapsed;
}

我最初尝试使用“const”和“__restrict”来加速它,并在这样做时将内部的“for”循环移到它们自己的函数中:

#include <chrono>
#include <cmath>
#include <fstream>
#include <vector>

void step_const_c(const std::vector<double> &__restrict c,
                    std::vector<double> &__restrict c_tmp,
                    const size_t &length,
                    const double &M,
                    const double &aux) {
    for (size_t i = 1; i < length - 1; ++i) {
        for (size_t j = 1; j < length - 1; ++j) {
            c_tmp[i * M + j] = c[i * M + j] + aux * (
                c[i * M + (j + 1)] + c[i * M + (j - 1)] +
                c[(i + 1) * M + j] + c[(i - 1) * M + j] -
                4 * c[i * M + j]);
        }
    }
}

std::chrono::nanoseconds diffuse_const_c(std::vector<double> &c,
                            std::vector<double> &c_tmp,
                            const double T,
                            const double dt,
                            const double aux) {
    const size_t num_steps = (T / dt) + 1;
    const size_t length = sqrt(c.size());
    // M = N + 2,具有填充的长度
    const int M = (int) length;
    auto time_start = std::chrono::high_resolution_clock::now();
    for (size_t step = 0; step < num_steps; ++step) {
        step_const_c(c, c_tmp, length, M, aux);
        std::swap(c, c_tmp);
    }
    auto time_stop = std::chrono::high_resolution_clock::now();
    auto time_elapsed = std::chrono::duration_cast<std::chrono::nanoseconds>(time_stop - time_start);
    return time_elapsed;
}

它们都使用“g++ -O3 -Wall -Wextra -pedantic -std=c++14 -o diffusion diffusion.cpp”进行编译,并分别使用“./diffusion 0.1 2.0 200 0.5 1 naive”和“./diffusion 0.1 2.0 200 0.5 1 const”运行。

如下所示,“__restrict”在使用向量时几乎没有意义,所以我运行了这个程序,有和没有使用“const”和“__restrict”都是相同的,运行时间都是一样的,从朴素方法的0.05秒到调用“step_const_c”的0.3秒(无论是否使用“const”和“__restrict”)。:

是什么原因导致将内部的“for”循环移到它们自己的函数中,速度比将“for”循环放在同一个函数中要慢这么多?
谢谢!

**编辑**:这个问题最初是关于为什么添加“const”和“__restrict”会减慢程序运行速度,但我从“step_const_c”的参数中删除了“const”和“__restrict”,运行时间几乎相同,这表明它们不是问题,而是将“for”循环放在自己的函数中。

<details>
<summary>英文:</summary>

I have a small diffusion simulation that I&#39;m experimenting with speeding up.
Here is the &#39;naive&#39; version:

#include <chrono>
#include <cmath>
#include <fstream>
#include <vector>

std::chrono::nanoseconds diffuse_naive(std::vector<double> &c,
std::vector<double> &c_tmp,
const double T,
const double dt,
const double aux) {
const size_t num_steps = (T / dt) + 1;
const size_t length = sqrt(c.size());
// M = N + 2, the length with padding
const int M = (int) length;
auto time_start = std::chrono::high_resolution_clock::now();
for (size_t step = 0; step < num_steps; ++step) {
for (size_t i = 1; i < length - 1; ++i) {
for (size_t j = 1; j < length - 1; ++j) {
c_tmp[i * M + j] = c[i * M + j] + aux * (
c[i * M + (j + 1)] + c[i * M + (j - 1)] +
c[(i + 1) * M + j] + c[(i - 1) * M + j] -
4 * c[i * M + j]);
}
}
std::swap(c, c_tmp);
}
auto time_stop = std::chrono::high_resolution_clock::now();
auto time_elapsed = std::chrono::duration_cast<std::chrono::nanoseconds>(time_stop - time_start);
return time_elapsed;
}


I was originally experimenting with the use of `const` and `__restrict` to speed this up, and in doing so, moved the inner `for` loops to their own function:

#include <chrono>
#include <cmath>
#include <fstream>
#include <vector>

void step_const_c(const std::vector<double> &__restrict c,
std::vector<double> &__restrict c_tmp,
const size_t &length,
const double &M,
const double &aux) {
for (size_t i = 1; i < length - 1; ++i) {
for (size_t j = 1; j < length - 1; ++j) {
c_tmp[i * M + j] = c[i * M + j] + aux * (
c[i * M + (j + 1)] + c[i * M + (j - 1)] +
c[(i + 1) * M + j] + c[(i - 1) * M + j] -
4 * c[i * M + j]);
}
}
}

std::chrono::nanoseconds diffuse_const_c(std::vector<double> &c,
std::vector<double> &c_tmp,
const double T,
const double dt,
const double aux) {
const size_t num_steps = (T / dt) + 1;
const size_t length = sqrt(c.size());
// M = N + 2, the length with padding
const int M = (int) length;
auto time_start = std::chrono::high_resolution_clock::now();
for (size_t step = 0; step < num_steps; ++step) {
step_const_c(c, c_tmp, length, M, aux);
std::swap(c, c_tmp);
}
auto time_stop = std::chrono::high_resolution_clock::now();
auto time_elapsed = std::chrono::duration_cast<std::chrono::nanoseconds>(time_stop - time_start);
return time_elapsed;
}


Both are compiled with `g++ -O3 -Wall -Wextra -pedantic -Wall -Wextra -pedantic -std=c++14 -o diffusion diffusion.cpp` and run with `./diffusion 0.1 2.0 200 0.5 1 naive` and `./diffusion 0.1 2.0 200 0.5 1 const` respectively.
As pointed out by a comment below, the use of `__restrict` is pretty pointless when using vectors, so I ran this both with and without the use of `const` and `__restrict`. The runtime with the separate for-loop function increases by a factor of six, from 0.05 seconds with the naive approach to 0.3 seconds when calling `step_const_c` (with or without `const` and `__restrict`).
What is causing the version with the inner `for` loops in their own function to be so much slower than having the `for` loops in the same function?
Thanks!
**Edit**: This question was originally about why adding `const` and `__restrict` slowed the program down, but I removed `const` and `__restrict` from the arguments of `step_const_c`, and the runtime was pretty much identical, which indicates they aren&#39;t the issue, but rather putting the `for` loops in their own function.
The full content of `diffusion.cpp`:

#include <cmath>
#include <fstream>
#include <vector>

void save_results(const std::vector<double> &c,
const std::string &implementation) {
std::string filename = "cpp_" + implementation + ".txt";
size_t length = sqrt(c.size());
std::ofstream ofs(filename, std::ofstream::out);
for (size_t i = 0; i < length; ++i) {
for (size_t j = 0; j < length; ++j) {
ofs << c[i * length + j];
if (j == length - 1) {
ofs << "\n";
} else {
ofs << ",";
}
}
}
ofs.close();
}

void step_const_c(const std::vector<double> &__restrict c,
std::vector<double> &__restrict c_tmp,
const size_t &length,
const double &M,
const double &aux) {
for (size_t i = 1; i < length - 1; ++i) {
for (size_t j = 1; j < length - 1; ++j) {
c_tmp[i * M + j] = c[i * M + j] + aux * (
c[i * M + (j + 1)] + c[i * M + (j - 1)] +
c[(i + 1) * M + j] + c[(i - 1) * M + j] -
4 * c[i * M + j]);
}
}
}

std::chrono::nanoseconds diffuse_const_c(std::vector<double> &c,
std::vector<double> &c_tmp,
const double T,
const double dt,
const double aux) {
const size_t num_steps = (T / dt) + 1;
const size_t length = sqrt(c.size());
// M = N + 2, the length with padding
const int M = (int) length;
auto time_start = std::chrono::high_resolution_clock::now();
for (size_t step = 0; step < num_steps; ++step) {
step_const_c(c, c_tmp, length, M, aux);
std::swap(c, c_tmp);
}
auto time_stop = std::chrono::high_resolution_clock::now();
auto time_elapsed = std::chrono::duration_cast<std::chrono::nanoseconds>(time_stop - time_start);
return time_elapsed;
}

std::chrono::nanoseconds diffuse_naive(std::vector<double> &c,
std::vector<double> &c_tmp,
const double T,
const double dt,
const double aux) {
const size_t num_steps = (T / dt) + 1;
const size_t length = sqrt(c.size());
// M = N + 2, the length with padding
const int M = (int) length;
auto time_start = std::chrono::high_resolution_clock::now();
for (size_t step = 0; step < num_steps; ++step) {
for (size_t i = 1; i < length - 1; ++i) {
for (size_t j = 1; j < length - 1; ++j) {
c_tmp[i * M + j] = c[i * M + j] + aux * (
c[i * M + (j + 1)] + c[i * M + (j - 1)] +
c[(i + 1) * M + j] + c[(i - 1) * M + j] -
4 * c[i * M + j]);
}
}
std::swap(c, c_tmp);
}
auto time_stop = std::chrono::high_resolution_clock::now();
auto time_elapsed = std::chrono::duration_cast<std::chrono::nanoseconds>(time_stop - time_start);
return time_elapsed;
}

void initialize_concentration(std::vector<double> &c,
const double &L,
const size_t &N,
const double &h) {
const double bound = 0.125 * L;
for (size_t i = 0; i < N; ++i) {
for (size_t j = 0; j < N; ++j) {
if (std::abs(i * h - 0.5 * L) < bound &&
std::abs(j * h - 0.5 * L) < bound) {
c[(i + 1) * (N + 2) + (j + 1)] = 1.0;
}
}
}
}

int main(int argc, char *argv[]) {
if (argc < 7) {
fprintf(stderr, "Usage: %s D L N T out implementation\n", argv[0]);
return 1;
}

const double D = std::stod(argv[1]);        // diffusion constant
const double L = std::stod(argv[2]);        // domain size
const size_t N = std::stoul(argv[3]);       // number of grid points along each dim
const double T = std::stod(argv[4]);        // simulation time in seconds
const size_t output = std::stoul(argv[5]);  // whether to store results, yes == 1
const std::string implementation = argv[6]; // version of program to run
const double h = L / (N - 1);               // length between grid elements
const double dt = h * h / (4 * D);          // maximum timestamp for numeric stability
const double aux = dt * D / (h * h);        // all constant terms in diffusion equation
std::vector&lt;double&gt; c((N+2) * (N+2), 0.0);
std::vector&lt;double&gt; c_tmp((N+2) * (N+2), 0.0);
initialize_concentration(c, L, N, h);
std::chrono::nanoseconds time_elapsed;
if (implementation.compare(&quot;naive&quot;) == 0) {
time_elapsed = diffuse_naive(c, c_tmp, T, dt, aux);
} else if (implementation.compare(&quot;const&quot;) == 0) {
time_elapsed = diffuse_const_c(c, c_tmp, T, dt, aux);
}
printf(&quot;Elapsed time: %luns\n&quot;, time_elapsed.count());
if (output == 1) {
save_results(c, implementation);
}

}


</details>
# 答案1
**得分**: 1
`diffuse_const_c()` 声明 `M` 为 `const int &amp;`,但 `step_const_c()` 请求 `const double &amp;M`。
我在C++方面经验有限,但如果我不得不猜测底层发生了什么,我会假设添加了两个额外的步骤,导致速度变慢:
1. 创建一个新的匿名整数变量
2. 强制类型转换 `M` 并将其放入该变量
考虑到函数的其余部分只是从预分配的向量中获取值,进行一些快速的代数运算,然后将结果放回预分配的向量,这些额外的步骤听起来相对昂贵。
将 `step_const_c()` 中的函数参数改为 `const int &amp;M` 会将运行时间降低到与简单实现相同的水平。
<details>
<summary>英文:</summary>
There was implicit typecasting happening (I think)!
`diffuse_const_c()` declares `M` as an `const int &amp;`, but `step_const_c()` was requesting `const double &amp;M`.
I&#39;m far from experienced in C++, but if I had to guess what&#39;s happening under the hood, I would assume two additional steps are being added that are slowing this down:
1. Create a new anonymous integer variable
2. Cast `M` and place it in that variable
Considering the rest of the function is just grabbing values from a pre-allocated vector, some quick algebra, then putting the result back into a pre-allocated vector, these additional steps sound relatively expensive.
Switching the function argument in `step_const_c()` to `const int &amp;M` brings the runtime down to the same level as the naive implementation.
</details>

huangapple
  • 本文由 发表于 2023年7月23日 15:47:08
  • 转载请务必保留本文链接:https://go.coder-hub.com/76747148.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定