2023-02-16 11:34:52

Fastest modern C++ way to generate a 1GB file with arbitrary binary data

Question

So, I've tried this:

uint64_t size = 1 * 1000 * 1000 * 1000;
std::vector<char> data(size, 'a');

std::ofstream myfile;
myfile.open("test.dat");
for (auto it = data.begin(); it != data.end(); ++it)
{
	myfile << *it;
}

myfile.close();

This works, but takes more than 15 minutes in a release build. I'm running on a 1 TB NVMe SSD with 64 GB of RAM, so I feel like I should be able to do this in seconds, or even in less than a second.

I do have access to C++20. I prefer something modern for its readability, so long as it's reasonably performant. I feel like anything over 10 seconds is too slow.

To be clear, pushing all of the data into the array is nearly instant. Writing it is what takes minutes.

Answer 1

Score: 5

Writing one byte at a time carries a high overhead, especially if the file is unbuffered and every byte turns into a system call. Likewise, operator<< performs formatted output. That does the right thing for a char, but it is error-prone, because most other types get converted to text as they are written.
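To make the formatted/unformatted distinction concrete, here is a minimal sketch. The helper names formatted_bytes and raw_bytes are made up for illustration, and an in-memory std::ostringstream stands in for the file so the result is easy to inspect.

```cpp
#include <cstdint>
#include <sstream>
#include <string>

// Formatted output: operator<< converts the number to decimal text.
std::string formatted_bytes(std::uint32_t value)
{
	std::ostringstream out;
	out << value;
	return out.str();
}

// Unformatted output: write() copies the object's bytes unchanged.
std::string raw_bytes(std::uint32_t value)
{
	std::ostringstream out;
	out.write(reinterpret_cast<const char*>(&value), sizeof value);
	return out.str();
}
```

With these helpers, formatted_bytes(1000000) yields the seven characters 1000000, while raw_bytes(1000000) always yields sizeof(std::uint32_t) bytes in the machine's native byte order.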

Instead, you already have a buffer full of char. The simplest solution is to write the whole buffer in one call with ofstream::write:

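#include <cstdint>
#include <fstream>
#include <vector>

int main()
{
	const std::uint64_t size = 2ULL * 1000 * 1000 * 1000; // 2 GB
	std::vector<char> data;
	data.reserve(size);

	// Fill the buffer; change this loop to produce different data.
	for (std::uint64_t i = 0; i < size; ++i)
	{
		data.push_back('a');
	}

	// Binary mode avoids newline translation on platforms that do it.
	std::ofstream myfile;
	myfile.open("test.dat", std::ios::binary);

	// A single unformatted write of the whole buffer.
	myfile.write(data.data(), static_cast<std::streamsize>(size));

	myfile.close();
}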

Under strace, the example above uses a single system call for the write:

writev(3, [{iov_base=NULL, iov_len=0}, {iov_base="aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"..., iov_len=2000000000}], 2) = 2000000000

Better yet, write a smaller block repeatedly (e.g. a 1 MB block 1000 times), to avoid allocating a whole gigabyte of RAM:

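#include <cstdint>
#include <fstream>
#include <vector>

int main()
{
	const std::uint64_t size = 1000 * 1000; // one 1 MB block
	std::vector<char> data;
	data.reserve(size);

	// Fill the block; change this loop to produce different data.
	for (std::uint64_t i = 0; i < size; ++i)
	{
		data.push_back('a');
	}

	// Binary mode avoids newline translation on platforms that do it.
	std::ofstream myfile;
	myfile.open("test.dat", std::ios::binary);

	// 1000 writes of 1 MB each produce a 1 GB file.
	for (int i = 0; i < 1000; i++)
	{
		myfile.write(data.data(), static_cast<std::streamsize>(size));
	}
	myfile.close();
}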
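The same block-at-a-time idea can be wrapped in a small function that also reports failure, which the examples above do not check. This is only a sketch: write_repeated is a hypothetical name and the error handling is deliberately minimal.

```cpp
#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

// Writes `block` to `path` `count` times; returns false if any step fails.
bool write_repeated(const std::string& path, const std::vector<char>& block, std::size_t count)
{
	// Binary mode, truncating any existing file.
	std::ofstream out(path, std::ios::binary | std::ios::trunc);
	if (!out)
	{
		return false; // could not open the file
	}
	// Stop early if the stream enters a failed state.
	for (std::size_t i = 0; i < count && out; ++i)
	{
		out.write(block.data(), static_cast<std::streamsize>(block.size()));
	}
	out.flush();
	return static_cast<bool>(out);
}
```

Calling write_repeated("test.dat", block, 1000) with a 1 MB block would produce the 1 GB file from the question. Wrapping the call in two std::chrono::steady_clock::now() readings is one way to check whether it meets the 10-second budget.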

Finally, pay attention to what you're putting into the buffer. '\ba' is a multicharacter literal: \b is the escape for 0x08 and a is the letter a. Such a literal has type int and an implementation-defined value; on my machine, it gets truncated to just a when stored in a char. If you want more interesting data in your file, change the loop that fills the buffer instead.
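As one possible way to get more varied data, the fill loop could write a repeating byte pattern instead of a constant. The sketch below is illustrative only; make_pattern_block is a made-up helper, not part of the original answer.

```cpp
#include <cstddef>
#include <vector>

// Returns a block whose byte at index i has the value i % 256.
std::vector<char> make_pattern_block(std::size_t size)
{
	std::vector<char> block(size);
	for (std::size_t i = 0; i < size; ++i)
	{
		// Go through unsigned char so values 128..255 convert predictably.
		block[i] = static_cast<char>(static_cast<unsigned char>(i % 256));
	}
	return block;
}
```

The resulting block cycles through every byte value, including 0x00 and 0x0A, which makes it a reasonable check that the file really is written in binary mode.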

Answer 2

Score: 2

Don't make a 2 GB array in memory.

Take a chunk size, something like 65536 bytes. The exact optimal size depends on your OS and filesystem.

Fill a buffer of that chunk size and write it with std::ostream::write.

Something like this:

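const size_t sz = 65536;
char data[sz];  // ostream::write takes a const char*, so use a char buffer
for (/* blah */) {
    fill_random_data(data, sz);  // placeholder: fill the buffer however you like
    outfile.write(data, sz);
}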
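The fill_random_data call above is left as a placeholder. One way it might look is sketched below, assuming a std::mt19937_64 generator is acceptable as the source of arbitrary bytes; the extra rng parameter is an addition made here, and the generator is not suitable for cryptographic use. The buffer is a char* because std::ostream::write expects a const char*.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <random>

// Hypothetical stand-in for the fill_random_data placeholder.
// Fills data[0..sz) with pseudo-random bytes drawn from `rng`.
void fill_random_data(char* data, std::size_t sz, std::mt19937_64& rng)
{
	std::size_t pos = 0;
	while (pos < sz)
	{
		const std::uint64_t word = rng(); // 64 fresh pseudo-random bits
		const std::size_t n = std::min(sizeof word, sz - pos);
		std::memcpy(data + pos, &word, n); // copy up to 8 bytes per draw
		pos += n;
	}
}
```

Drawing eight bytes per generator call keeps the fill loop cheap, so for a 65536-byte chunk the write itself should remain the dominant cost.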
