Expanding a temporary slice if more bytes are needed
Question
I'm generating random files programmatically in a directory, at least temporaryFilesTotalSize worth of random data (a bit more, who cares).
Here's my code:
var files []string
for size := int64(0); size < temporaryFilesTotalSize; {
    fileName := random.HexString(12)
    filePath := dir + "/" + fileName
    file, err := os.Create(filePath)
    if err != nil {
        return nil, err
    }
    size += rand.Int63n(1 << 32) // random dimension up to 4GB
    raw := make([]byte, size)
    _, err = rand.Read(raw)
    if err != nil {
        panic(err)
    }
    file.Write(raw)
    file.Close()
    files = append(files, filePath)
}
Is there any way I can avoid that raw := make([]byte, size) allocation in the for loop?
Ideally I'd like to keep a slice on the heap and only grow it if a bigger size is required. Any way to do this efficiently?
Answer 1
Score: 0
First of all, you should know that generating random data and writing it to disk is at least an order of magnitude slower than allocating contiguous memory for a buffer. This definitely falls under the "premature optimization" category: eliminating the creation of the buffer inside the iteration will not make your code noticeably faster.
Reusing the buffer
But to reuse the buffer, move it outside of the loop, create the biggest needed buffer, and slice it in each iteration to the needed size. It's OK to do this, because we'll overwrite the whole part we need with random data.
Note that I somewhat changed the size generation (there was likely an error in your code: since you used the accumulated size for each new file, the generated temporary files kept growing).
Also note that writing a file whose contents are already prepared in a []byte is easiest done with a single call to os.WriteFile().
Something like this:
bigRaw := make([]byte, 1 << 32) // the biggest buffer we may ever need (4GB!)
// Loop until at least temporaryFilesTotalSize bytes have been written;
// the chunk that crosses the limit is still written out.
for totalSize := int64(0); totalSize < temporaryFilesTotalSize; {
    size := rand.Int63n(1 << 32) // random dimension up to 4GB
    totalSize += size
    raw := bigRaw[:size]
    rand.Read(raw) // It's documented that rand.Read() always returns a nil error
    filePath := filepath.Join(dir, random.HexString(12))
    if err := os.WriteFile(filePath, raw, 0666); err != nil {
        panic(err)
    }
    files = append(files, filePath)
}
Solving the task without an intermediate buffer
Since you are writing big files (GBs), allocating that big a buffer is not a good idea: running the app would require GBs of RAM! We could improve on this with an inner loop that reuses a smaller buffer until the expected size is written, which solves the big-memory issue but increases complexity (a rough sketch of that idea follows below). Luckily for us, we can solve the task without any buffer at all, and even with decreased complexity!
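For reference, that inner-loop variant might look roughly like the following sketch (writeRandomFile and the 32 MB chunk size are illustrative assumptions, not part of the original answer):

// writeRandomFile fills f with size bytes of random data, reusing one
// fixed-size chunk buffer so memory use stays constant no matter how
// big the file is. (Hypothetical helper, shown for comparison only.)
func writeRandomFile(f *os.File, size int64) error {
    const chunkSize = 32 << 20 // 32 MB; an arbitrary, tunable buffer size
    chunk := make([]byte, chunkSize)
    for remaining := size; remaining > 0; {
        n := int64(chunkSize)
        if remaining < n {
            n = remaining
        }
        rand.Read(chunk[:n]) // math/rand's Read always returns a nil error
        if _, err := f.Write(chunk[:n]); err != nil {
            return err
        }
        remaining -= n
    }
    return nil
}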
We should somehow "channel" the random data from a rand.Rand directly to the file, something similar to what io.Copy() does. Note that rand.Rand implements io.Reader, and os.File implements io.ReaderFrom, which suggests we could simply pass a rand.Rand to file.ReadFrom(), and the file itself would then read the data to be written directly from the rand.Rand.
This sounds good, but ReadFrom() reads data from the given reader until EOF or an error, and neither will ever happen if we pass it a rand.Rand. We do, however, know how many bytes we want read and written: size.
To our "rescue" comes io.LimitReader()
: we pass an io.Reader
and a size to it, and the returned reader will supply no more than the given number of bytes, and after that will report EOF.
Note that creating our own rand.Rand will also be faster, because the source we pass to it is created with rand.NewSource(), which returns an "unsynchronized" source (not safe for concurrent use) that is in turn faster. The source used by the default/global rand.Rand is synchronized (and so safe for concurrent use), but slower.
Perfect! Let's see this in action:
r := rand.New(rand.NewSource(time.Now().Unix()))
for totalSize := int64(0); totalSize < temporaryFilesTotalSize; {
    size := r.Int63n(1 << 32)
    totalSize += size
    filePath := filepath.Join(dir, random.HexString(12))
    file, err := os.Create(filePath)
    if err != nil {
        return nil, err
    }
    // Let the file pull exactly size bytes straight from the random source.
    if _, err := file.ReadFrom(io.LimitReader(r, size)); err != nil {
        panic(err)
    }
    if err = file.Close(); err != nil {
        panic(err)
    }
    files = append(files, filePath)
}
Note that if os.File did not implement io.ReaderFrom, we could still use io.Copy(), providing the file as the destination and a limited reader (like the one used above) as the source.
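That variant would only change the copy step; a minimal sketch (assuming the same file, r and size variables as above):

// io.Copy() writes to any io.Writer; if the destination happens to
// implement io.ReaderFrom (as *os.File does), io.Copy() detects and
// uses ReadFrom() under the hood anyway.
if _, err := io.Copy(file, io.LimitReader(r, size)); err != nil {
    panic(err)
}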
Final note: closing the file (or any resource) is best done using defer, so it gets called no matter what. Using defer in a loop is a bit tricky though, as deferred functions run at the end of the enclosing function, not at the end of the loop's iteration. So you may wrap the loop body in a function. For details, see https://stackoverflow.com/questions/45617758/defer-in-the-loop-what-will-be-better/45620423#45620423
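For example, the per-file work could be moved into a helper so each deferred Close() runs when that helper returns; a sketch (createRandomFile is a hypothetical name, not from the answer):

// createRandomFile creates one file of the given size filled with random
// data. The deferred Close() runs when this function returns, i.e. once
// per created file rather than at the end of the whole caller.
func createRandomFile(dir string, size int64, r *rand.Rand) (string, error) {
    filePath := filepath.Join(dir, random.HexString(12))
    file, err := os.Create(filePath)
    if err != nil {
        return "", err
    }
    defer file.Close() // Close() error intentionally ignored for brevity
    if _, err := file.ReadFrom(io.LimitReader(r, size)); err != nil {
        return "", err
    }
    return filePath, nil
}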