Is this the fastest way to reformat a buffer of 64b values to 16b?

Question

I have a datastream that writes what are physically 64-bit values to a buffer. When the buffer reaches a certain level, it needs to be reformatted into consecutive 16-bit values. The real values never occupy more than 24 of the 64 bits of each value produced by the datastream, so this amounts to truncating a 24-bit value to 16 bits and rearranging the buffer so that the values are contiguous. I believe I have found the fastest way to do this, but I am not sure whether I am missing optimizations or faster approaches provided by C++ standard utilities. Below is an MRE showing my reformatting function along with a test harness that produces data like what I am encountering and times the reformatting.

#include <iostream>
#include <chrono>
#include <cstdint>   // uint8_t, uint16_t, uint64_t
#include <cstdlib>   // malloc, rand
#include <unistd.h>

int num_samples = 160000;

void fill_buffer(uint8_t** buffer){
  *buffer = (uint8_t*)malloc(num_samples * sizeof(uint64_t));
  for (int i = 0; i < num_samples; i += 8){
    (*buffer)[i] = rand() % 0xFF;
    (*buffer)[i + 1] = rand() % 0xFF;
    (*buffer)[i + 2] = rand() % 0xFF;
  }
}

void reformat_1(uint8_t* buf){
  uint64_t* p_8byte = (uint64_t*)buf;
  uint16_t* p_2byte = (uint16_t*)buf;

  for (int i = 0; i < num_samples; i++){
    p_2byte[i] = p_8byte[i] >> 8;
  }
}

int main(int argc, char const* argv[]){
  uint8_t* buffer = NULL;

  fill_buffer(&buffer);
  auto start = std::chrono::high_resolution_clock::now();
  reformat_1(buffer);
  auto stop = std::chrono::high_resolution_clock::now();
  auto duration = std::chrono::duration_cast<std::chrono::microseconds>(stop - start);
  std::cout << "Time taken by function one: " << duration.count() << " microseconds" << std::endl;

  return 0;
}

I am also open to feedback on my benchmarking setup. Interestingly, with -O3 I get ~130 µs on my actual sample data read from a file, while with randomly generated data I see closer to 1800 µs, so this is apparently not a perfectly representative example.

One other thing to note, which I would have expected to work against my actual times (vs. synthetic) but apparently does not: while num_samples is a magic number here, in practice it is calculated and usually (though not always) constant, and it is not something the compiler could replace with a constant in order to unroll loops and so on (I think).
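
On the benchmarking point, one thing visible in the MRE is that fill_buffer only writes bytes below index num_samples, which covers the first num_samples / 8 slots of the malloc'ed buffer, so reformat_1 is the first code to touch most of that memory. Whether this accounts for the 130 µs vs. 1800 µs gap is an assumption, not something measured here. Below is a minimal sketch of a harness that takes that variable out: every slot is written before timing, the count is a run-time parameter, and the fastest of several runs is reported. The names make_buffer, reformat_n, checksum16, best_time_us and g_sink are hypothetical, and memcpy replaces the pointer casts only to keep the sketch free of aliasing questions.

```cpp
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <random>
#include <vector>

// Build `count` 64-bit slots, each holding a random 24-bit value.
// Every slot is written, so all pages are touched before timing starts.
std::vector<std::uint8_t> make_buffer(std::size_t count, unsigned seed = 1) {
  std::mt19937 gen(seed);
  std::uniform_int_distribution<std::uint32_t> dist(0, 0xFFFFFF);
  std::vector<std::uint8_t> buf(count * sizeof(std::uint64_t));
  for (std::size_t i = 0; i < count; ++i) {
    std::uint64_t v = dist(gen);
    std::memcpy(buf.data() + i * sizeof v, &v, sizeof v);
  }
  return buf;
}

// Same operation as reformat_1 (keep bits 8..23 of each slot, pack in place),
// with the count passed in at run time.
void reformat_n(std::uint8_t* buf, std::size_t count) {
  for (std::size_t i = 0; i < count; ++i) {
    std::uint64_t v;
    std::memcpy(&v, buf + i * sizeof v, sizeof v);
    const std::uint16_t out = static_cast<std::uint16_t>(v >> 8);
    std::memcpy(buf + i * sizeof out, &out, sizeof out);
  }
}

// Sum of the packed 16-bit outputs; used so the result of the timed call is observed.
std::uint64_t checksum16(const std::uint8_t* buf, std::size_t count) {
  std::uint64_t sum = 0;
  for (std::size_t i = 0; i < count; ++i) {
    std::uint16_t v;
    std::memcpy(&v, buf + i * sizeof v, sizeof v);
    sum += v;
  }
  return sum;
}

volatile std::uint64_t g_sink = 0;  // keeps the optimizer from discarding the work

// Run `runs` passes, each on a fresh copy of the data, and return the fastest in microseconds.
long long best_time_us(std::size_t count, int runs) {
  const std::vector<std::uint8_t> original = make_buffer(count);
  long long best = -1;
  for (int r = 0; r < runs; ++r) {
    std::vector<std::uint8_t> work = original;  // the copy also touches every page
    const auto start = std::chrono::steady_clock::now();
    reformat_n(work.data(), count);
    const auto stop = std::chrono::steady_clock::now();
    g_sink = g_sink + checksum16(work.data(), count);  // outside the timed region
    const long long us =
        std::chrono::duration_cast<std::chrono::microseconds>(stop - start).count();
    if (best < 0 || us < best) best = us;
  }
  return best;
}
```

Calling best_time_us(160000, 20) from main and printing the result would give a number to compare against the single-shot measurement above; the run count of 20 is arbitrary.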

Answer 1

Score: 1

This micro-improvement increases performance by ~10%:

void reformat_2(uint8_t* buf){
  uint32_t* p_8byte = (uint32_t*)buf;
  uint16_t* p_2byte = (uint16_t*)buf;
  uint16_t* p_2end  = p_2byte + num_samples;

  while(p_2byte < p_2end){
    *p_2byte++ = *p_8byte >> 8;
    p_8byte += 2;
  }
}

To see clearer numbers, I increased the buffer size 100x, to 16M entries.
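
One assumption worth checking in reformat_2: loading a uint32_t from the start of each slot yields the low half of the 64-bit value only on a little-endian host. The following is a minimal sketch of an equivalence check between the two approaches, not a benchmark. The names pack_from_u64, pack_from_u32, is_little_endian and variants_agree are hypothetical, and memcpy stands in for the pointer casts.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Reference behaviour (as in reformat_1): load the whole 64-bit slot, keep bits 8..23.
void pack_from_u64(std::uint8_t* buf, std::size_t count) {
  for (std::size_t i = 0; i < count; ++i) {
    std::uint64_t v;
    std::memcpy(&v, buf + i * sizeof(std::uint64_t), sizeof v);
    const std::uint16_t out = static_cast<std::uint16_t>(v >> 8);
    std::memcpy(buf + i * sizeof out, &out, sizeof out);
  }
}

// Variant in the spirit of reformat_2: load only the first 4 bytes of each slot.
void pack_from_u32(std::uint8_t* buf, std::size_t count) {
  for (std::size_t i = 0; i < count; ++i) {
    std::uint32_t v;
    std::memcpy(&v, buf + i * sizeof(std::uint64_t), sizeof v);
    const std::uint16_t out = static_cast<std::uint16_t>(v >> 8);
    std::memcpy(buf + i * sizeof out, &out, sizeof out);
  }
}

// True when the first byte in memory is the least significant one.
bool is_little_endian() {
  const std::uint16_t probe = 1;
  std::uint8_t first;
  std::memcpy(&first, &probe, 1);
  return first == 1;
}

// Store each value (masked to 24 bits) in a 64-bit slot, run both variants
// on identical copies, and compare the packed 16-bit outputs.
bool variants_agree(const std::vector<std::uint32_t>& values) {
  const std::size_t n = values.size();
  std::vector<std::uint8_t> a(n * sizeof(std::uint64_t));
  for (std::size_t i = 0; i < n; ++i) {
    const std::uint64_t slot = values[i] & 0xFFFFFFu;
    std::memcpy(a.data() + i * sizeof slot, &slot, sizeof slot);
  }
  std::vector<std::uint8_t> b = a;
  pack_from_u64(a.data(), n);
  pack_from_u32(b.data(), n);
  const std::size_t packed_bytes = n * sizeof(std::uint16_t);
  return std::equal(a.begin(), a.begin() + packed_bytes, b.begin());
}
```

On a little-endian machine variants_agree should return true for any input; on a big-endian one the 32-bit load would read the upper, all-zero half of each slot instead.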

huangapple
  • Posted on June 29, 2023, 04:22:36
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76576499.html