将for循环移到它们自己的函数中为什么会减慢程序速度?

huangapple go评论89阅读模式
英文:

Why does moving for loops to their own function slow down the program?

问题

以下是代码的翻译部分:

  1. 我有一个正在尝试加速的小型扩散模拟。这是“朴素”版本:
  2. #include <chrono>
  3. #include <cmath>
  4. #include <fstream>
  5. #include <vector>
  6. std::chrono::nanoseconds diffuse_naive(std::vector<double> &c,
  7. std::vector<double> &c_tmp,
  8. const double T,
  9. const double dt,
  10. const double aux) {
  11. const size_t num_steps = (T / dt) + 1;
  12. const size_t length = sqrt(c.size());
  13. // M = N + 2,具有填充的长度
  14. const int M = (int) length;
  15. auto time_start = std::chrono::high_resolution_clock::now();
  16. for (size_t step = 0; step < num_steps; ++step) {
  17. for (size_t i = 1; i < length - 1; ++i) {
  18. for (size_t j = 1; j < length - 1; ++j) {
  19. c_tmp[i * M + j] = c[i * M + j] + aux * (
  20. c[i * M + (j + 1)] + c[i * M + (j - 1)] +
  21. c[(i + 1) * M + j] + c[(i - 1) * M + j] -
  22. 4 * c[i * M + j]);
  23. }
  24. }
  25. std::swap(c, c_tmp);
  26. }
  27. auto time_stop = std::chrono::high_resolution_clock::now();
  28. auto time_elapsed = std::chrono::duration_cast<std::chrono::nanoseconds>(time_stop - time_start);
  29. return time_elapsed;
  30. }
  31. 我最初尝试使用“const”和“__restrict”来加速它,并在这样做时将内部的“for”循环移到它们自己的函数中:
  32. #include <chrono>
  33. #include <cmath>
  34. #include <fstream>
  35. #include <vector>
  36. void step_const_c(const std::vector<double> &__restrict c,
  37. std::vector<double> &__restrict c_tmp,
  38. const size_t &length,
  39. const double &M,
  40. const double &aux) {
  41. for (size_t i = 1; i < length - 1; ++i) {
  42. for (size_t j = 1; j < length - 1; ++j) {
  43. c_tmp[i * M + j] = c[i * M + j] + aux * (
  44. c[i * M + (j + 1)] + c[i * M + (j - 1)] +
  45. c[(i + 1) * M + j] + c[(i - 1) * M + j] -
  46. 4 * c[i * M + j]);
  47. }
  48. }
  49. }
  50. std::chrono::nanoseconds diffuse_const_c(std::vector<double> &c,
  51. std::vector<double> &c_tmp,
  52. const double T,
  53. const double dt,
  54. const double aux) {
  55. const size_t num_steps = (T / dt) + 1;
  56. const size_t length = sqrt(c.size());
  57. // M = N + 2,具有填充的长度
  58. const int M = (int) length;
  59. auto time_start = std::chrono::high_resolution_clock::now();
  60. for (size_t step = 0; step < num_steps; ++step) {
  61. step_const_c(c, c_tmp, length, M, aux);
  62. std::swap(c, c_tmp);
  63. }
  64. auto time_stop = std::chrono::high_resolution_clock::now();
  65. auto time_elapsed = std::chrono::duration_cast<std::chrono::nanoseconds>(time_stop - time_start);
  66. return time_elapsed;
  67. }
  68. 它们都使用“g++ -O3 -Wall -Wextra -pedantic -std=c++14 -o diffusion diffusion.cpp”进行编译,并分别使用“./diffusion 0.1 2.0 200 0.5 1 naive”和“./diffusion 0.1 2.0 200 0.5 1 const”运行。
  69. 如下所示,“__restrict”在使用向量时几乎没有意义,所以我运行了这个程序,有和没有使用“const”和“__restrict”都是相同的,运行时间都是一样的,从朴素方法的0.05秒到调用“step_const_c”的0.3秒(无论是否使用“const”和“__restrict”)。:
  70. 是什么原因导致将内部的“for”循环移到它们自己的函数中,速度比将“for”循环放在同一个函数中要慢这么多?
  71. 谢谢!
  72. **编辑**:这个问题最初是关于为什么添加“const”和“__restrict”会减慢程序运行速度,但我从“step_const_c”的参数中删除了“const”和“__restrict”,运行时间几乎相同,这表明它们不是问题,而是将“for”循环放在自己的函数中。
  73. <details>
  74. <summary>英文:</summary>
  75. I have a small diffusion simulation that I&#39;m experimenting with speeding up.
  76. Here is the &#39;naive&#39; version:

#include <chrono>
#include <cmath>
#include <fstream>
#include <vector>

std::chrono::nanoseconds diffuse_naive(std::vector<double> &c,
std::vector<double> &c_tmp,
const double T,
const double dt,
const double aux) {
const size_t num_steps = (T / dt) + 1;
const size_t length = sqrt(c.size());
// M = N + 2, the length with padding
const int M = (int) length;
auto time_start = std::chrono::high_resolution_clock::now();
for (size_t step = 0; step < num_steps; ++step) {
for (size_t i = 1; i < length - 1; ++i) {
for (size_t j = 1; j < length - 1; ++j) {
c_tmp[i * M + j] = c[i * M + j] + aux * (
c[i * M + (j + 1)] + c[i * M + (j - 1)] +
c[(i + 1) * M + j] + c[(i - 1) * M + j] -
4 * c[i * M + j]);
}
}
std::swap(c, c_tmp);
}
auto time_stop = std::chrono::high_resolution_clock::now();
auto time_elapsed = std::chrono::duration_cast<std::chrono::nanoseconds>(time_stop - time_start);
return time_elapsed;
}

  1. I was originally experimenting with the use of `const` and `__restrict` to speed this up, and in doing so, moved the inner `for` loops to their own function:

#include <chrono>
#include <cmath>
#include <fstream>
#include <vector>

void step_const_c(const std::vector<double> &__restrict c,
std::vector<double> &__restrict c_tmp,
const size_t &length,
const double &M,
const double &aux) {
for (size_t i = 1; i < length - 1; ++i) {
for (size_t j = 1; j < length - 1; ++j) {
c_tmp[i * M + j] = c[i * M + j] + aux * (
c[i * M + (j + 1)] + c[i * M + (j - 1)] +
c[(i + 1) * M + j] + c[(i - 1) * M + j] -
4 * c[i * M + j]);
}
}
}

std::chrono::nanoseconds diffuse_const_c(std::vector<double> &c,
std::vector<double> &c_tmp,
const double T,
const double dt,
const double aux) {
const size_t num_steps = (T / dt) + 1;
const size_t length = sqrt(c.size());
// M = N + 2, the length with padding
const int M = (int) length;
auto time_start = std::chrono::high_resolution_clock::now();
for (size_t step = 0; step < num_steps; ++step) {
step_const_c(c, c_tmp, length, M, aux);
std::swap(c, c_tmp);
}
auto time_stop = std::chrono::high_resolution_clock::now();
auto time_elapsed = std::chrono::duration_cast<std::chrono::nanoseconds>(time_stop - time_start);
return time_elapsed;
}

  1. Both are compiled with `g++ -O3 -Wall -Wextra -pedantic -Wall -Wextra -pedantic -std=c++14 -o diffusion diffusion.cpp` and run with `./diffusion 0.1 2.0 200 0.5 1 naive` and `./diffusion 0.1 2.0 200 0.5 1 const` respectively.
  2. As pointed out by a comment below, the use of `__restrict` is pretty pointless when using vectors, so I ran this both with and without the use of `const` and `__restrict`. The runtime with the separate for-loop function increases by a factor of six, from 0.05 seconds with the naive approach to 0.3 seconds when calling `step_const_c` (with or without `const` and `__restrict`).
  3. What is causing the version with the inner `for` loops in their own function to be so much slower than having the `for` loops in the same function?
  4. Thanks!
  5. **Edit**: This question was originally about why adding `const` and `__restrict` slowed the program down, but I removed `const` and `__restrict` from the arguments of `step_const_c`, and the runtime was pretty much identical, which indicates they aren&#39;t the issue, but rather putting the `for` loops in their own function.
  6. The full content of `diffusion.cpp`:

#include <cmath>
#include <fstream>
#include <vector>

void save_results(const std::vector<double> &c,
const std::string &implementation) {
std::string filename = "cpp_" + implementation + ".txt";
size_t length = sqrt(c.size());
std::ofstream ofs(filename, std::ofstream::out);
for (size_t i = 0; i < length; ++i) {
for (size_t j = 0; j < length; ++j) {
ofs << c[i * length + j];
if (j == length - 1) {
ofs << "\n";
} else {
ofs << ",";
}
}
}
ofs.close();
}

void step_const_c(const std::vector<double> &__restrict c,
std::vector<double> &__restrict c_tmp,
const size_t &length,
const double &M,
const double &aux) {
for (size_t i = 1; i < length - 1; ++i) {
for (size_t j = 1; j < length - 1; ++j) {
c_tmp[i * M + j] = c[i * M + j] + aux * (
c[i * M + (j + 1)] + c[i * M + (j - 1)] +
c[(i + 1) * M + j] + c[(i - 1) * M + j] -
4 * c[i * M + j]);
}
}
}

std::chrono::nanoseconds diffuse_const_c(std::vector<double> &c,
std::vector<double> &c_tmp,
const double T,
const double dt,
const double aux) {
const size_t num_steps = (T / dt) + 1;
const size_t length = sqrt(c.size());
// M = N + 2, the length with padding
const int M = (int) length;
auto time_start = std::chrono::high_resolution_clock::now();
for (size_t step = 0; step < num_steps; ++step) {
step_const_c(c, c_tmp, length, M, aux);
std::swap(c, c_tmp);
}
auto time_stop = std::chrono::high_resolution_clock::now();
auto time_elapsed = std::chrono::duration_cast<std::chrono::nanoseconds>(time_stop - time_start);
return time_elapsed;
}

std::chrono::nanoseconds diffuse_naive(std::vector<double> &c,
std::vector<double> &c_tmp,
const double T,
const double dt,
const double aux) {
const size_t num_steps = (T / dt) + 1;
const size_t length = sqrt(c.size());
// M = N + 2, the length with padding
const int M = (int) length;
auto time_start = std::chrono::high_resolution_clock::now();
for (size_t step = 0; step < num_steps; ++step) {
for (size_t i = 1; i < length - 1; ++i) {
for (size_t j = 1; j < length - 1; ++j) {
c_tmp[i * M + j] = c[i * M + j] + aux * (
c[i * M + (j + 1)] + c[i * M + (j - 1)] +
c[(i + 1) * M + j] + c[(i - 1) * M + j] -
4 * c[i * M + j]);
}
}
std::swap(c, c_tmp);
}
auto time_stop = std::chrono::high_resolution_clock::now();
auto time_elapsed = std::chrono::duration_cast<std::chrono::nanoseconds>(time_stop - time_start);
return time_elapsed;
}

void initialize_concentration(std::vector<double> &c,
const double &L,
const size_t &N,
const double &h) {
const double bound = 0.125 * L;
for (size_t i = 0; i < N; ++i) {
for (size_t j = 0; j < N; ++j) {
if (std::abs(i * h - 0.5 * L) < bound &&
std::abs(j * h - 0.5 * L) < bound) {
c[(i + 1) * (N + 2) + (j + 1)] = 1.0;
}
}
}
}

int main(int argc, char *argv[]) {
if (argc < 7) {
fprintf(stderr, "Usage: %s D L N T out implementation\n", argv[0]);
return 1;
}

  1. const double D = std::stod(argv[1]); // diffusion constant
  2. const double L = std::stod(argv[2]); // domain size
  3. const size_t N = std::stoul(argv[3]); // number of grid points along each dim
  4. const double T = std::stod(argv[4]); // simulation time in seconds
  5. const size_t output = std::stoul(argv[5]); // whether to store results, yes == 1
  6. const std::string implementation = argv[6]; // version of program to run
  7. const double h = L / (N - 1); // length between grid elements
  8. const double dt = h * h / (4 * D); // maximum timestamp for numeric stability
  9. const double aux = dt * D / (h * h); // all constant terms in diffusion equation
  10. std::vector&lt;double&gt; c((N+2) * (N+2), 0.0);
  11. std::vector&lt;double&gt; c_tmp((N+2) * (N+2), 0.0);
  12. initialize_concentration(c, L, N, h);
  13. std::chrono::nanoseconds time_elapsed;
  14. if (implementation.compare(&quot;naive&quot;) == 0) {
  15. time_elapsed = diffuse_naive(c, c_tmp, T, dt, aux);
  16. } else if (implementation.compare(&quot;const&quot;) == 0) {
  17. time_elapsed = diffuse_const_c(c, c_tmp, T, dt, aux);
  18. }
  19. printf(&quot;Elapsed time: %luns\n&quot;, time_elapsed.count());
  20. if (output == 1) {
  21. save_results(c, implementation);
  22. }

}

  1. </details>
  2. # 答案1
  3. **得分**: 1
  4. `diffuse_const_c()` 声明 `M` 为 `const int &amp;`,但 `step_const_c()` 请求 `const double &amp;M`。
  5. 我在C++方面经验有限,但如果我不得不猜测底层发生了什么,我会假设添加了两个额外的步骤,导致速度变慢:
  6. 1. 创建一个新的匿名整数变量
  7. 2. 强制类型转换 `M` 并将其放入该变量
  8. 考虑到函数的其余部分只是从预分配的向量中获取值,进行一些快速的代数运算,然后将结果放回预分配的向量,这些额外的步骤听起来相对昂贵。
  9. 将 `step_const_c()` 中的函数参数改为 `const int &amp;M` 会将运行时间降低到与简单实现相同的水平。
  10. <details>
  11. <summary>英文:</summary>
  12. There was implicit typecasting happening (I think)!
  13. `diffuse_const_c()` declares `M` as an `const int &amp;`, but `step_const_c()` was requesting `const double &amp;M`.
  14. I&#39;m far from experienced in C++, but if I had to guess what&#39;s happening under the hood, I would assume two additional steps are being added that are slowing this down:
  15. 1. Create a new anonymous integer variable
  16. 2. Cast `M` and place it in that variable
  17. Considering the rest of the function is just grabbing values from a pre-allocated vector, some quick algebra, then putting the result back into a pre-allocated vector, these additional steps sound relatively expensive.
  18. Switching the function argument in `step_const_c()` to `const int &amp;M` brings the runtime down to the same level as the naive implementation.
  19. </details>

huangapple
  • 本文由 发表于 2023年7月23日 15:47:08
  • 转载请务必保留本文链接:https://go.coder-hub.com/76747148.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定