Expanding a temporary slice if more bytes are needed

Question

I'm generating random files programmatically in a directory, at least temporaryFilesTotalSize worth of random data (a bit more, who cares).

Here's my code:

var files []string

for size := int64(0); size < temporaryFilesTotalSize; {
    fileName := random.HexString(12)
    filePath := dir + "/" + fileName
    file, err := os.Create(filePath)
    if err != nil {
        return nil, err
    }

    size += rand.Int63n(1 << 32) // random dimension up to 4GB
    raw := make([]byte, size)
    _, err = rand.Read(raw)
    if err != nil {
        panic(err)
    }

    file.Write(raw)
    file.Close()
    files = append(files, filePath)
}

Is there any way I can avoid the raw := make([]byte, size) allocation in the for loop?
Ideally I'd like to keep a slice on the heap and only grow it if a bigger size is required. Is there an efficient way to do this?

Answer 1

Score: 0

First of all, you should know that generating random data and writing it to disk is at least an order of magnitude slower than allocating contiguous memory for a buffer. This definitely falls under the "premature optimization" category. Eliminating the buffer creation inside the iteration will not make your code noticeably faster.
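If you want to check that claim on your own machine, here is a minimal benchmark sketch comparing buffer allocation with just the random-data generation (disk I/O would add even more time on top). The 64 MiB buffer size and the function names are arbitrary choices of mine; put this in a file ending in _test.go and run go test -bench=.:

package main

import (
    "math/rand"
    "testing"
)

var sink []byte // package-level sink so the compiler cannot drop the allocation

// Allocating a fresh buffer on every iteration.
func BenchmarkAlloc(b *testing.B) {
    for i := 0; i < b.N; i++ {
        sink = make([]byte, 64<<20) // 64 MiB
    }
}

// Filling an already-allocated buffer with random data.
func BenchmarkRandFill(b *testing.B) {
    buf := make([]byte, 64<<20)
    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        rand.Read(buf)
    }
}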

Reusing the buffer

But to reuse the buffer, move it outside of the loop, create the biggest buffer you will need, and re-slice it to the needed size in each iteration. It's OK to do this, because we overwrite the whole part we use with random data.

Note that I changed the size generation somewhat (there was likely a bug in your code: the generated temporary files kept growing, because you used the accumulated size when allocating the buffer for each new file).

Also note that writing a file with contents prepared in a []byte is easiest done using a single call to os.WriteFile().

Something like this:

bigRaw := make([]byte, 1 << 32)

for totalSize := int64(0); ; {
    size := rand.Int63n(1 << 32) // random dimension up to 4GB
    totalSize += size
    if totalSize >= temporaryFilesTotalSize {
        break
    }

    raw := bigRaw[:size]
    rand.Read(raw) // It's documented that rand.Read() always returns a nil error

    filePath := filepath.Join(dir, random.HexString(12))
    if err := os.WriteFile(filePath, raw, 0666); err != nil {
        panic(err)
    }

    files = append(files, filePath)
}

Solving the task without an intermediate buffer

Since you are writing big files (GBs), allocating that big a buffer is not a good idea: running the app will require GBs of RAM! We could improve on it with an inner loop that reuses a smaller buffer until the expected size is written, which solves the big-memory issue but increases complexity. Luckily for us, we can solve the task without any buffer at all, and even with decreased complexity!
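For completeness, that buffered inner-loop variant might look something like the sketch below; the helper name fillRandom, its signature, and the idea of a small caller-allocated buffer (say 1 MiB) are my own illustration, not part of the solution that follows:

// fillRandom writes size bytes of random data to file, reusing the small,
// caller-allocated buffer buf instead of allocating size bytes in one piece.
func fillRandom(file *os.File, buf []byte, size int64) error {
    for remaining := size; remaining > 0; {
        chunk := buf
        if remaining < int64(len(buf)) {
            chunk = buf[:remaining]
        }
        rand.Read(chunk) // math/rand's Read is documented to always return a nil error
        if _, err := file.Write(chunk); err != nil {
            return err
        }
        remaining -= int64(len(chunk))
    }
    return nil
}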

We should somehow "channel" the random data from a rand.Rand to the file directly, something similar what io.Copy() does. Note that rand.Rand implements io.Reader, and os.File implements io.ReaderFrom, which suggests we could simply pass a rand.Rand to file.ReadFrom(), and the file itself would get the data directly from rand.Rand that will be written.

This sounds good, but ReadFrom() reads data from the given reader until EOF or an error. Neither will ever happen if we pass rand.Rand. And we do know how many bytes we want read and written: size.

To our "rescue" comes io.LimitReader(): we pass an io.Reader and a size to it, and the returned reader will supply no more than the given number of bytes, and after that will report EOF.

Note that creating our own rand.Rand will also be faster, because the source we pass to it is created with rand.NewSource(), which returns an "unsynchronized" source (not safe for concurrent use) that is in turn faster! The source used by the default, global rand.Rand is synchronized (and so safe for concurrent use, but slower).

Perfect! Let's see this in action:

r := rand.New(rand.NewSource(time.Now().Unix()))

for totalSize := int64(0); ; {
    size := r.Int63n(1 << 32)
    totalSize += size
    if totalSize >= temporaryFilesTotalSize {
        break
    }

    filePath := filepath.Join(dir, random.HexString(12))
    file, err := os.Create(filePath)
    if err != nil {
        return nil, err
    }

    if _, err := file.ReadFrom(io.LimitReader(r, size)); err != nil {
        panic(err)
    }

    if err = file.Close(); err != nil {
        panic(err)
    }

    files = append(files, filePath)
}

Note that if os.File did not implement io.ReaderFrom, we could still use io.Copy(), providing the file as the destination and a limited reader (as used above) as the source.
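In that case the write above would become something like this sketch (same r and size as in the loop):

if _, err := io.Copy(file, io.LimitReader(r, size)); err != nil {
    panic(err)
}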

Final note: closing the file (or any resource) is best done using defer, so it gets called no matter what. Using defer in a loop is a bit tricky, though, as deferred functions run at the end of the enclosing function, not at the end of the loop's iteration. So you may wrap the loop body in a function. For details, see https://stackoverflow.com/questions/45617758/defer-in-the-loop-what-will-be-better/45620423#45620423
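For example, a sketch of such a wrapper (the helper name createRandomFile is my own, not from the original answer):

// createRandomFile creates a single file and fills it with size bytes of
// random data. Since this is its own function, the deferred Close runs at
// the end of every call, i.e. once per file.
func createRandomFile(dir string, r *rand.Rand, size int64) (string, error) {
    filePath := filepath.Join(dir, random.HexString(12))
    file, err := os.Create(filePath)
    if err != nil {
        return "", err
    }
    defer file.Close()

    _, err = file.ReadFrom(io.LimitReader(r, size))
    return filePath, err
}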
