英文:
Why does moving for loops to their own function slow down the program?
问题
以下是代码的翻译部分:
我有一个正在尝试加速的小型扩散模拟。这是“朴素”版本:
#include <chrono>
#include <cmath>
#include <fstream>
#include <vector>
std::chrono::nanoseconds diffuse_naive(std::vector<double> &c,
std::vector<double> &c_tmp,
const double T,
const double dt,
const double aux) {
const size_t num_steps = (T / dt) + 1;
const size_t length = sqrt(c.size());
// M = N + 2,具有填充的长度
const int M = (int) length;
auto time_start = std::chrono::high_resolution_clock::now();
for (size_t step = 0; step < num_steps; ++step) {
for (size_t i = 1; i < length - 1; ++i) {
for (size_t j = 1; j < length - 1; ++j) {
c_tmp[i * M + j] = c[i * M + j] + aux * (
c[i * M + (j + 1)] + c[i * M + (j - 1)] +
c[(i + 1) * M + j] + c[(i - 1) * M + j] -
4 * c[i * M + j]);
}
}
std::swap(c, c_tmp);
}
auto time_stop = std::chrono::high_resolution_clock::now();
auto time_elapsed = std::chrono::duration_cast<std::chrono::nanoseconds>(time_stop - time_start);
return time_elapsed;
}
我最初尝试使用“const”和“__restrict”来加速它,并在这样做时将内部的“for”循环移到它们自己的函数中:
#include <chrono>
#include <cmath>
#include <fstream>
#include <vector>
void step_const_c(const std::vector<double> &__restrict c,
std::vector<double> &__restrict c_tmp,
const size_t &length,
const double &M,
const double &aux) {
for (size_t i = 1; i < length - 1; ++i) {
for (size_t j = 1; j < length - 1; ++j) {
c_tmp[i * M + j] = c[i * M + j] + aux * (
c[i * M + (j + 1)] + c[i * M + (j - 1)] +
c[(i + 1) * M + j] + c[(i - 1) * M + j] -
4 * c[i * M + j]);
}
}
}
std::chrono::nanoseconds diffuse_const_c(std::vector<double> &c,
std::vector<double> &c_tmp,
const double T,
const double dt,
const double aux) {
const size_t num_steps = (T / dt) + 1;
const size_t length = sqrt(c.size());
// M = N + 2,具有填充的长度
const int M = (int) length;
auto time_start = std::chrono::high_resolution_clock::now();
for (size_t step = 0; step < num_steps; ++step) {
step_const_c(c, c_tmp, length, M, aux);
std::swap(c, c_tmp);
}
auto time_stop = std::chrono::high_resolution_clock::now();
auto time_elapsed = std::chrono::duration_cast<std::chrono::nanoseconds>(time_stop - time_start);
return time_elapsed;
}
它们都使用“g++ -O3 -Wall -Wextra -pedantic -std=c++14 -o diffusion diffusion.cpp”进行编译,并分别使用“./diffusion 0.1 2.0 200 0.5 1 naive”和“./diffusion 0.1 2.0 200 0.5 1 const”运行。
如下所示,“__restrict”在使用向量时几乎没有意义,所以我运行了这个程序,有和没有使用“const”和“__restrict”都是相同的,运行时间都是一样的,从朴素方法的0.05秒到调用“step_const_c”的0.3秒(无论是否使用“const”和“__restrict”)。:
是什么原因导致将内部的“for”循环移到它们自己的函数中,速度比将“for”循环放在同一个函数中要慢这么多?
谢谢!
**编辑**:这个问题最初是关于为什么添加“const”和“__restrict”会减慢程序运行速度,但我从“step_const_c”的参数中删除了“const”和“__restrict”,运行时间几乎相同,这表明它们不是问题,而是将“for”循环放在自己的函数中。
<details>
<summary>英文:</summary>
I have a small diffusion simulation that I'm experimenting with speeding up.
Here is the 'naive' version:
#include <chrono>
#include <cmath>
#include <fstream>
#include <vector>
std::chrono::nanoseconds diffuse_naive(std::vector<double> &c,
std::vector<double> &c_tmp,
const double T,
const double dt,
const double aux) {
const size_t num_steps = (T / dt) + 1;
const size_t length = sqrt(c.size());
// M = N + 2, the length with padding
const int M = (int) length;
auto time_start = std::chrono::high_resolution_clock::now();
for (size_t step = 0; step < num_steps; ++step) {
for (size_t i = 1; i < length - 1; ++i) {
for (size_t j = 1; j < length - 1; ++j) {
c_tmp[i * M + j] = c[i * M + j] + aux * (
c[i * M + (j + 1)] + c[i * M + (j - 1)] +
c[(i + 1) * M + j] + c[(i - 1) * M + j] -
4 * c[i * M + j]);
}
}
std::swap(c, c_tmp);
}
auto time_stop = std::chrono::high_resolution_clock::now();
auto time_elapsed = std::chrono::duration_cast<std::chrono::nanoseconds>(time_stop - time_start);
return time_elapsed;
}
I was originally experimenting with the use of `const` and `__restrict` to speed this up, and in doing so, moved the inner `for` loops to their own function:
#include <chrono>
#include <cmath>
#include <fstream>
#include <vector>
void step_const_c(const std::vector<double> &__restrict c,
std::vector<double> &__restrict c_tmp,
const size_t &length,
const double &M,
const double &aux) {
for (size_t i = 1; i < length - 1; ++i) {
for (size_t j = 1; j < length - 1; ++j) {
c_tmp[i * M + j] = c[i * M + j] + aux * (
c[i * M + (j + 1)] + c[i * M + (j - 1)] +
c[(i + 1) * M + j] + c[(i - 1) * M + j] -
4 * c[i * M + j]);
}
}
}
std::chrono::nanoseconds diffuse_const_c(std::vector<double> &c,
std::vector<double> &c_tmp,
const double T,
const double dt,
const double aux) {
const size_t num_steps = (T / dt) + 1;
const size_t length = sqrt(c.size());
// M = N + 2, the length with padding
const int M = (int) length;
auto time_start = std::chrono::high_resolution_clock::now();
for (size_t step = 0; step < num_steps; ++step) {
step_const_c(c, c_tmp, length, M, aux);
std::swap(c, c_tmp);
}
auto time_stop = std::chrono::high_resolution_clock::now();
auto time_elapsed = std::chrono::duration_cast<std::chrono::nanoseconds>(time_stop - time_start);
return time_elapsed;
}
Both are compiled with `g++ -O3 -Wall -Wextra -pedantic -Wall -Wextra -pedantic -std=c++14 -o diffusion diffusion.cpp` and run with `./diffusion 0.1 2.0 200 0.5 1 naive` and `./diffusion 0.1 2.0 200 0.5 1 const` respectively.
As pointed out by a comment below, the use of `__restrict` is pretty pointless when using vectors, so I ran this both with and without the use of `const` and `__restrict`. The runtime with the separate for-loop function increases by a factor of six, from 0.05 seconds with the naive approach to 0.3 seconds when calling `step_const_c` (with or without `const` and `__restrict`).
What is causing the version with the inner `for` loops in their own function to be so much slower than having the `for` loops in the same function?
Thanks!
**Edit**: This question was originally about why adding `const` and `__restrict` slowed the program down, but I removed `const` and `__restrict` from the arguments of `step_const_c`, and the runtime was pretty much identical, which indicates they aren't the issue, but rather putting the `for` loops in their own function.
The full content of `diffusion.cpp`:
#include <cmath>
#include <fstream>
#include <vector>
void save_results(const std::vector<double> &c,
const std::string &implementation) {
std::string filename = "cpp_" + implementation + ".txt";
size_t length = sqrt(c.size());
std::ofstream ofs(filename, std::ofstream::out);
for (size_t i = 0; i < length; ++i) {
for (size_t j = 0; j < length; ++j) {
ofs << c[i * length + j];
if (j == length - 1) {
ofs << "\n";
} else {
ofs << ",";
}
}
}
ofs.close();
}
void step_const_c(const std::vector<double> &__restrict c,
std::vector<double> &__restrict c_tmp,
const size_t &length,
const double &M,
const double &aux) {
for (size_t i = 1; i < length - 1; ++i) {
for (size_t j = 1; j < length - 1; ++j) {
c_tmp[i * M + j] = c[i * M + j] + aux * (
c[i * M + (j + 1)] + c[i * M + (j - 1)] +
c[(i + 1) * M + j] + c[(i - 1) * M + j] -
4 * c[i * M + j]);
}
}
}
std::chrono::nanoseconds diffuse_const_c(std::vector<double> &c,
std::vector<double> &c_tmp,
const double T,
const double dt,
const double aux) {
const size_t num_steps = (T / dt) + 1;
const size_t length = sqrt(c.size());
// M = N + 2, the length with padding
const int M = (int) length;
auto time_start = std::chrono::high_resolution_clock::now();
for (size_t step = 0; step < num_steps; ++step) {
step_const_c(c, c_tmp, length, M, aux);
std::swap(c, c_tmp);
}
auto time_stop = std::chrono::high_resolution_clock::now();
auto time_elapsed = std::chrono::duration_cast<std::chrono::nanoseconds>(time_stop - time_start);
return time_elapsed;
}
std::chrono::nanoseconds diffuse_naive(std::vector<double> &c,
std::vector<double> &c_tmp,
const double T,
const double dt,
const double aux) {
const size_t num_steps = (T / dt) + 1;
const size_t length = sqrt(c.size());
// M = N + 2, the length with padding
const int M = (int) length;
auto time_start = std::chrono::high_resolution_clock::now();
for (size_t step = 0; step < num_steps; ++step) {
for (size_t i = 1; i < length - 1; ++i) {
for (size_t j = 1; j < length - 1; ++j) {
c_tmp[i * M + j] = c[i * M + j] + aux * (
c[i * M + (j + 1)] + c[i * M + (j - 1)] +
c[(i + 1) * M + j] + c[(i - 1) * M + j] -
4 * c[i * M + j]);
}
}
std::swap(c, c_tmp);
}
auto time_stop = std::chrono::high_resolution_clock::now();
auto time_elapsed = std::chrono::duration_cast<std::chrono::nanoseconds>(time_stop - time_start);
return time_elapsed;
}
void initialize_concentration(std::vector<double> &c,
const double &L,
const size_t &N,
const double &h) {
const double bound = 0.125 * L;
for (size_t i = 0; i < N; ++i) {
for (size_t j = 0; j < N; ++j) {
if (std::abs(i * h - 0.5 * L) < bound &&
std::abs(j * h - 0.5 * L) < bound) {
c[(i + 1) * (N + 2) + (j + 1)] = 1.0;
}
}
}
}
int main(int argc, char *argv[]) {
if (argc < 7) {
fprintf(stderr, "Usage: %s D L N T out implementation\n", argv[0]);
return 1;
}
const double D = std::stod(argv[1]); // diffusion constant
const double L = std::stod(argv[2]); // domain size
const size_t N = std::stoul(argv[3]); // number of grid points along each dim
const double T = std::stod(argv[4]); // simulation time in seconds
const size_t output = std::stoul(argv[5]); // whether to store results, yes == 1
const std::string implementation = argv[6]; // version of program to run
const double h = L / (N - 1); // length between grid elements
const double dt = h * h / (4 * D); // maximum timestamp for numeric stability
const double aux = dt * D / (h * h); // all constant terms in diffusion equation
std::vector<double> c((N+2) * (N+2), 0.0);
std::vector<double> c_tmp((N+2) * (N+2), 0.0);
initialize_concentration(c, L, N, h);
std::chrono::nanoseconds time_elapsed;
if (implementation.compare("naive") == 0) {
time_elapsed = diffuse_naive(c, c_tmp, T, dt, aux);
} else if (implementation.compare("const") == 0) {
time_elapsed = diffuse_const_c(c, c_tmp, T, dt, aux);
}
printf("Elapsed time: %luns\n", time_elapsed.count());
if (output == 1) {
save_results(c, implementation);
}
}
</details>
# 答案1
**得分**: 1
`diffuse_const_c()` 声明 `M` 为 `const int &`,但 `step_const_c()` 请求 `const double &M`。
我在C++方面经验有限,但如果我不得不猜测底层发生了什么,我会假设添加了两个额外的步骤,导致速度变慢:
1. 创建一个新的匿名整数变量
2. 强制类型转换 `M` 并将其放入该变量
考虑到函数的其余部分只是从预分配的向量中获取值,进行一些快速的代数运算,然后将结果放回预分配的向量,这些额外的步骤听起来相对昂贵。
将 `step_const_c()` 中的函数参数改为 `const int &M` 会将运行时间降低到与简单实现相同的水平。
<details>
<summary>英文:</summary>
There was implicit typecasting happening (I think)!
`diffuse_const_c()` declares `M` as an `const int &`, but `step_const_c()` was requesting `const double &M`.
I'm far from experienced in C++, but if I had to guess what's happening under the hood, I would assume two additional steps are being added that are slowing this down:
1. Create a new anonymous integer variable
2. Cast `M` and place it in that variable
Considering the rest of the function is just grabbing values from a pre-allocated vector, some quick algebra, then putting the result back into a pre-allocated vector, these additional steps sound relatively expensive.
Switching the function argument in `step_const_c()` to `const int &M` brings the runtime down to the same level as the naive implementation.
</details>
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论