How to efficiently replace string occurrences between two string delimiters using Go bytes?
Question
For example, my flat file (3 MB) content is similar to:
Lorem START ipsum END dolor sit amet, START adipiscing END elit.
Ipsum dolor START sit END amet, START elit. END
.....
I would like to replace all occurrences between the START and END delimiters. Since my file is 3 MB, it's a bad idea to load the whole content into memory.
Thanks.
Answer 1
Score: 5
You can use bufio.Scanner with bufio.ScanWords to tokenize on whitespace boundaries and compare the non-whitespace sequences to your delimiters:
scanner := bufio.NewScanner(reader)
scanner.Split(bufio.ScanWords) // you can implement your own split function,
// but ScanWords will suffice for your example
for scanner.Scan() {
	// scanner.Bytes() efficiently exposes the file contents
	// as slices of a larger buffer
	if bytes.HasPrefix(scanner.Bytes(), []byte("START")) {
		// ... keep scanning until the end delimiter
	}
	// copying unmodified input is quite simple; note that ScanWords strips
	// the separating whitespace, so write a separator back out as well:
	_, err := writer.Write(scanner.Bytes())
	if err != nil {
		return err
	}
	if _, err := writer.Write([]byte(" ")); err != nil {
		return err
	}
}
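For completeness, the elided part could be filled in roughly as follows. This is only a minimal sketch, not the answerer's implementation: it matches the delimiters as whole words (bytes.Equal rather than HasPrefix), assumes they always appear as standalone whitespace-separated tokens, collapses all whitespace in the output to single spaces, and the name replaceBetween and the replacement text are illustrative.
package main

import (
	"bufio"
	"bytes"
	"io"
	"os"
)

// replaceBetween streams whitespace-separated words from r to w and replaces
// everything found between the START and END delimiters with replacement.
// All whitespace in the output is rewritten as single spaces.
func replaceBetween(r io.Reader, w io.Writer, replacement string) error {
	scanner := bufio.NewScanner(r)
	scanner.Split(bufio.ScanWords)
	writer := bufio.NewWriter(w)
	skipping := false
	for scanner.Scan() {
		tok := scanner.Bytes()
		switch {
		case bytes.Equal(tok, []byte("START")):
			skipping = true
			writer.WriteString("START " + replacement + " ")
		case bytes.Equal(tok, []byte("END")):
			skipping = false
			writer.WriteString("END ")
		case skipping:
			// drop the original words between the delimiters
		default:
			writer.Write(tok)
			writer.WriteByte(' ')
		}
	}
	if err := scanner.Err(); err != nil {
		return err
	}
	// bufio.Writer remembers the first write error and reports it on Flush.
	return writer.Flush()
}

func main() {
	if err := replaceBetween(os.Stdin, os.Stdout, "REPLACED"); err != nil {
		panic(err)
	}
}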
This scanner-based approach ensures that the amount of data read from the file at any one time remains bounded (controlled by MaxScanTokenSize).
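If a single token could ever exceed that default limit (for instance, a very long run of text with no whitespace), the scanner's buffer can be raised explicitly before scanning begins; the 1 MB cap below is an arbitrary illustrative choice:
// bufio.MaxScanTokenSize (64 KiB) is the default cap on a single token.
// Buffer must be called before the first call to Scan.
buf := make([]byte, 0, 64*1024)
scanner.Buffer(buf, 1024*1024)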
Note that if you want to use multiple goroutines, you'll need to copy the data first, since scanner.Bytes() returns a slice that is only valid until the next call to .Scan(). If you choose to do that, though, I wouldn't bother with a scanner.
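If you did fan tokens out to other goroutines anyway, the required copy would look something like this (process is a hypothetical worker function, not something from the answer):
// scanner.Bytes() is only valid until the next call to Scan, so copy first.
token := make([]byte, len(scanner.Bytes()))
copy(token, scanner.Bytes())
go process(token)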
For what it's worth, loading a 3 MB file into memory is actually not such a bad idea on a general-purpose computer nowadays; I would only think twice if it were an order of magnitude bigger. It would almost certainly be faster to use bytes.Split with your delimiters.
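A sketch of that whole-file alternative, in case it helps: the file names, the replacement text, and the use of bytes.Cut (Go 1.18+) to find the closing delimiter are illustrative choices, not something the answer specifies.
package main

import (
	"bytes"
	"os"
)

func main() {
	// Read the whole file; at 3 MB this is perfectly reasonable.
	data, err := os.ReadFile("input.txt")
	if err != nil {
		panic(err)
	}

	var out bytes.Buffer
	parts := bytes.Split(data, []byte("START"))
	out.Write(parts[0]) // text before the first START is copied through
	for _, part := range parts[1:] {
		_, after, found := bytes.Cut(part, []byte("END"))
		if !found {
			// Unterminated START: keep the original text unchanged.
			out.WriteString("START")
			out.Write(part)
			continue
		}
		// Keep the delimiters, replace whatever was between them.
		out.WriteString("START REPLACED END")
		out.Write(after)
	}

	if err := os.WriteFile("output.txt", out.Bytes(), 0o644); err != nil {
		panic(err)
	}
}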