How to efficiently replace string occurrences between two string delimiters using Go bytes?
Question
For example, my flat file (3 MB) content is similar to:
Lorem START ipsum END dolor sit amet, START adipiscing END elit.
Ipsum dolor START sit END amet, START elit. END
.....
I would like to replace all occurrences between the START and END delimiters. Since my file is 3 MB, it's a bad idea to load the whole content into memory.
Thanks.
Answer 1
Score: 5
You can use bufio.Scanner with bufio.ScanWords to tokenize on whitespace boundaries and compare each non-whitespace sequence to your delimiters:
scanner := bufio.NewScanner(reader)
scanner.Split(bufio.ScanWords) // you can implement your own split function,
// but ScanWords will suffice for your example
for scanner.Scan() {
	// scanner.Bytes() efficiently exposes the file contents
	// as slices of a larger buffer
	if bytes.HasPrefix(scanner.Bytes(), []byte("START")) {
		// keep scanning until the end delimiter; the replacement
		// for the skipped text would be written here
		for scanner.Scan() && !bytes.HasPrefix(scanner.Bytes(), []byte("END")) {
		}
		continue
	}
	// copying unmodified inputs is quite simple
	// (ScanWords drops the separating whitespace, so write a space back):
	if _, err := writer.Write(scanner.Bytes()); err != nil {
		return err
	}
	if _, err := writer.Write([]byte(" ")); err != nil {
		return err
	}
}
This will ensure that the amount of data read in from the file remains bounded (this is controlled by MaxScanTokenSize).
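If a single token could exceed that default limit (64 KiB), the scanner's buffer can be enlarged before the first call to Scan; a minimal sketch, where the 1 MiB cap is an assumption chosen only for illustration:

scanner := bufio.NewScanner(reader)
scanner.Split(bufio.ScanWords)
// Raise the maximum token size from the default bufio.MaxScanTokenSize
// (64 KiB) to 1 MiB; Buffer must be called before the first Scan.
scanner.Buffer(make([]byte, 0, 64*1024), 1024*1024)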
Note that if you want to use multiple goroutines, you'll need to copy the data first, since scanner.Bytes() returns a slice that is only valid until the next call to .Scan(). If you choose to do that, though, I wouldn't bother with a scanner.
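If you did fan the work out, the copy could look something like this (a minimal sketch; the channel and its consumer are assumptions for illustration, not part of the original answer):

tokens := make(chan []byte)
go func() {
	for tok := range tokens {
		_ = tok // process each token concurrently
	}
}()
for scanner.Scan() {
	// scanner.Bytes() is only valid until the next Scan,
	// so copy the token before handing it to another goroutine.
	tokens <- append([]byte(nil), scanner.Bytes()...)
}
close(tokens)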
For what it's worth, loading a 3 MB file whole is actually not such a bad idea on a general-purpose computer nowadays; I would only think twice if it were an order of magnitude bigger. It would almost certainly be faster to use bytes.Split with your delimiters.
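A minimal sketch of that whole-file approach, where the file names, the REPLACED placeholder text, and the use of os.ReadFile/os.WriteFile are assumptions for illustration:

package main

import (
	"bytes"
	"os"
)

func main() {
	data, err := os.ReadFile("input.txt") // reads the whole file into memory
	if err != nil {
		panic(err)
	}

	var out bytes.Buffer
	chunks := bytes.Split(data, []byte("START"))
	out.Write(chunks[0]) // everything before the first START
	for _, chunk := range chunks[1:] {
		// Each chunk starts with the text that followed a START delimiter.
		parts := bytes.SplitN(chunk, []byte("END"), 2)
		if len(parts) == 2 {
			out.WriteString("REPLACED") // substitute for the delimited text
			out.Write(parts[1])         // everything after END
		} else {
			// No closing END: keep the original text, delimiter included.
			out.WriteString("START")
			out.Write(chunk)
		}
	}

	if err := os.WriteFile("output.txt", out.Bytes(), 0o644); err != nil {
		panic(err)
	}
}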