Question

I have a datastream that writes what are physically 64-bit values to a buffer. When the buffer reaches a certain level, it needs to be reformatted to consecutive 16-bit values. The real values never occupy more than 24 of the 64 bits of each value produced by the datastream, so this amounts to truncating a 24-bit value to 16 bits and rearranging the buffer so the values are consecutive. I believe I have found the fastest way to do this, but I am not sure whether there are optimizations I am missing or faster ways provided by C++ standard utilities. Below is an MRE showing my reformatting function as well as a test harness that produces data like what I am encountering and times the reformatting.

#include <chrono>
#include <iostream>
#include <stdint.h>  // uint8_t, uint16_t, uint64_t
#include <stdlib.h>  // malloc, free, rand

int num_samples = 160000;

// Allocates room for num_samples 64-bit slots and writes random bytes into it.
// As written, the loop covers only the first num_samples bytes of the buffer.
void fill_buffer(uint8_t** buffer){
  *buffer = (uint8_t*)malloc(num_samples * sizeof(uint64_t));
  for (int i = 0; i < num_samples; i += 8){
    (*buffer)[i] = rand() % 0xFF;
    (*buffer)[i + 1] = rand() % 0xFF;
    (*buffer)[i + 2] = rand() % 0xFF;
  }
}

// Packs bits 8..23 of each 64-bit slot into consecutive 16-bit values, in place.
void reformat_1(uint8_t* buf){
  uint64_t* p_8byte = (uint64_t*)buf;
  uint16_t* p_2byte = (uint16_t*)buf;

  for (int i = 0; i < num_samples; i++){
    p_2byte[i] = p_8byte[i] >> 8;
  }
}

int main(int argc, char const* argv[]){
  uint8_t* buffer = NULL;

  fill_buffer(&buffer);
  auto start = std::chrono::high_resolution_clock::now();
  reformat_1(buffer);
  auto stop = std::chrono::high_resolution_clock::now();
  auto duration = std::chrono::duration_cast<std::chrono::microseconds>(stop - start);
  std::cout << "Time taken by function one: " << duration.count() << " microseconds" << std::endl;

  free(buffer);
  return 0;
}

I am also willing to hear feedback on my benchmarking setup. I find it interesting that with -O3 I get ~130 µs on my actual sample data read from a file, while with randomly generated data I am seeing closer to 1800 µs, so this is apparently not a perfectly representative example.

One other thing I will note, which I would expect to work against my actual times (vs. synthetic) but apparently does not: while num_samples is a magic number here, in practice it is calculated and usually (not always) constant, but it is not something the compiler would replace with a constant in order to unroll loops, etc. (I think).

Answer 1

Score: 1

This micro-improvement increases performance by ~10%:

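void reformat_2(uint8_t* buf){
  // Read only the first 32 bits of each 64-bit slot (the low half on little-endian).
  uint32_t* p_8byte = (uint32_t*)buf;
  uint16_t* p_2byte = (uint16_t*)buf;
  uint16_t* p_2end  = p_2byte + num_samples;

  while(p_2byte < p_2end){
    *p_2byte++ = *p_8byte >> 8;
    p_8byte += 2;  // advance two 32-bit words, i.e. one 64-bit slot
  }
}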

To see clearer numbers, I increased the buffer size 100x, to 16M entries.
