Fastest modern C++ way to generate a 1GB file with arbitrary binary data
Question
So, I've tried this:
#include <cstdint>
#include <fstream>
#include <vector>

int main()
{
    uint64_t size = 1 * 1000 * 1000 * 1000;
    std::vector<char> data(size, 'a');
    std::ofstream myfile;
    myfile.open("test.dat");
    for (auto it = data.begin(); it != data.end(); ++it)
    {
        myfile << *it;    // one formatted write per byte
    }
    myfile.close();
}
This works, but it takes more than 15 minutes in a release build. I'm running with a 1 TB NVMe SSD and 64 GB of RAM; I feel like I should be able to do this in seconds, or even in less than a second.
I do have access to C++20. I'd prefer something modern for its readability, so long as it's reasonably performant. I feel like anything over 10 seconds is too slow.
To be clear, pushing all of the data into the array is nearly instant; writing it out is what takes minutes.
Answer 1
Score: 5
Writing one byte at a time leads to high overhead, especially if the file is unbuffered and you're making a system call for each byte. Likewise, operator<< writes formatted data, which happens to do the right thing for a char but is more error-prone, since other input types will get formatted during writing.
Instead, you already have access to a buffer full of char, so the simplest solution is to just write the whole buffer with ofstream::write:
#include <cstdint>
#include <fstream>
#include <iostream>
#include <vector>

int main()
{
    const uint64_t size = 2 * 1000 * 1000 * 1000;
    std::vector<char> data;
    data.reserve(size);                      // one up-front allocation
    for (uint64_t i = 0; i < size; ++i)
    {
        data.push_back('a');
    }
    std::ofstream myfile;
    myfile.open("test.dat");
    myfile.write(data.data(), size);         // a single bulk write instead of one per byte
    myfile.close();
}
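As an aside, since every byte here has the same value, the vector fill constructor from the question does the same job without the explicit push_back loop. A minimal sketch of that variant, using the same file name and size as above:

#include <cstdint>
#include <fstream>
#include <vector>

int main()
{
    const uint64_t size = 2 * 1000 * 1000 * 1000;
    std::vector<char> data(size, 'a');       // fill constructor: no per-element loop
    std::ofstream myfile("test.dat");
    myfile.write(data.data(), data.size());  // still a single bulk write
}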
When straced, the example above only uses one syscall to write:
writev(3, [{iov_base=NULL, iov_len=0}, {iov_base="aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"..., iov_len=2000000000}], 2) = 2000000000
Better yet, write (e.g.) a 1 MB block 1000 times, to avoid allocating a whole gigabyte of RAM:
#include <cstdint>
#include <fstream>
#include <iostream>
#include <vector>

int main()
{
    const uint64_t size = 1000 * 1000;       // 1 MB chunk
    std::vector<char> data;
    data.reserve(size);
    for (uint64_t i = 0; i < size; ++i)
    {
        data.push_back('a');
    }
    std::ofstream myfile;
    myfile.open("test.dat");
    for (int i = 0; i < 1000; i++) {         // 1000 chunks -> 1 GB total
        myfile.write(data.data(), size);
    }
    myfile.close();
}
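One caveat the examples gloss over, added here as a suggestion rather than part of the original answer: for genuinely arbitrary binary data (as the title asks), the stream should be opened with std::ios::binary, since Windows text mode otherwise translates 0x0A bytes, and checking the stream afterwards is cheap insurance. A sketch of the same chunked loop with both additions:

#include <cstdint>
#include <fstream>
#include <iostream>
#include <vector>

int main()
{
    const uint64_t chunk = 1000 * 1000;                  // 1 MB per write
    std::vector<char> data(chunk, 'a');
    std::ofstream myfile("test.dat", std::ios::binary);  // binary: no newline translation
    for (int i = 0; i < 1000; i++)
    {
        myfile.write(data.data(), data.size());
    }
    if (!myfile)                                         // catches disk-full, permission errors, ...
    {
        std::cerr << "write failed\n";
        return 1;
    }
}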
Finally, pay attention to what you're putting into the buffer. '\ba' is a multibyte character: \b is an escape for 0x08, and a is a. On my machine, this gets truncated to just a. If you want more interesting data in your file, change the loop that fills the buffer instead.
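To illustrate that last suggestion with one possibility (the specific pattern here is an assumption, not something from the answer): fill the buffer with a repeating 0x00..0xFF cycle instead of a constant byte, and write it out exactly as before:

#include <cstdint>
#include <fstream>
#include <vector>

int main()
{
    const uint64_t size = 1000 * 1000;
    std::vector<char> data(size);
    for (uint64_t i = 0; i < size; ++i)
    {
        data[i] = static_cast<char>(i % 256);  // repeating 0x00..0xFF pattern
    }
    std::ofstream myfile("test.dat", std::ios::binary);
    for (int i = 0; i < 1000; i++)
    {
        myfile.write(data.data(), data.size());
    }
}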
Answer 2
Score: 2
Don't make a 2 GB array in memory.
Pick a chunk size, something like 65536 bytes. The exact optimal size depends on your OS and filesystem.
Fill a buffer of that chunk size and write it with ostream::write.
Something like this:
const size_t sz = 65536;
uint8_t data[sz];
for (uint64_t written = 0; written < total; written += sz) {  // total: the target file size
    fill_random_data(data, sz);
    // ofstream::write takes const char*, so the uint8_t buffer needs a cast
    outfile.write(reinterpret_cast<const char*>(data), sz);
}
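fill_random_data above is the answerer's placeholder, not a real library function. A self-contained sketch of the whole approach, with a stand-in generator based on std::mt19937_64 (the generator choice and the 1 GB target are assumptions for illustration):

#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <fstream>
#include <random>

// Stand-in for the answer's fill_random_data placeholder: 8 random bytes per engine call.
void fill_random_data(uint8_t* buf, size_t n)
{
    static std::mt19937_64 rng{std::random_device{}()};
    size_t i = 0;
    for (; i + 8 <= n; i += 8)
    {
        const uint64_t r = rng();
        std::memcpy(buf + i, &r, sizeof r);
    }
    for (; i < n; ++i)                     // tail of fewer than 8 bytes, if any
    {
        buf[i] = static_cast<uint8_t>(rng());
    }
}

int main()
{
    const size_t sz = 65536;               // chunk size from the answer
    const uint64_t total = 1'000'000'000;  // 1 GB target
    uint8_t data[sz];
    std::ofstream outfile("test.dat", std::ios::binary);
    for (uint64_t written = 0; written < total; written += sz)
    {
        const size_t n = static_cast<size_t>(std::min<uint64_t>(sz, total - written));
        fill_random_data(data, n);
        outfile.write(reinterpret_cast<const char*>(data), n);
    }
}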