How to efficiently replace string occurrences between two string delimiters using Go bytes?

Question

For example, my flat file (3 MB) has content similar to:

Lorem START ipsum END dolor sit amet, START adipiscing END elit.
Ipsum dolor START sit END amet, START elit. END
.....

I would like to replace all occurrences between the START and END delimiters. Since the file is 3 MB, I think loading the whole content into memory is a bad idea.

Thanks.

Answer 1

Score: 5

You can use bufio.Scanner with bufio.ScanWords to tokenize on whitespace boundaries, then compare each non-whitespace sequence to your delimiters:

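scanner := bufio.NewScanner(reader)

// You can implement your own split function,
// but ScanWords will suffice for your example.
scanner.Split(bufio.ScanWords)

for scanner.Scan() {
    // scanner.Bytes() efficiently exposes the file contents
    // as slices of a larger buffer.
    if bytes.HasPrefix(scanner.Bytes(), []byte("START")) {
        ... // keep scanning until the end delimiter
    }

    // Copying unmodified input is quite simple. Note that ScanWords
    // drops the whitespace between tokens, so write a separator
    // yourself if the output must keep words apart.
    _, err := writer.Write(scanner.Bytes())
    if err != nil {
        return err
    }
}

// Scan returns false on both EOF and failure, so check for a read error.
if err := scanner.Err(); err != nil {
    return err
}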

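This keeps the amount of file data held in memory at any one time bounded. By default the scanner's buffer is capped at bufio.MaxScanTokenSize (64 KiB); a single token longer than that makes Scan stop with an error unless you supply a larger buffer via scanner.Buffer.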
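A fuller version of the loop above might look like the following minimal sketch. It assumes the delimiters always appear as standalone words, that every word between them should be replaced by one fixed string, and that collapsing the original whitespace to single spaces is acceptable. The names replaceStream and replaceString are hypothetical helpers introduced only for this illustration.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"strings"
)

// replaceStream copies r to w word by word. Every word found between a
// START token and the next END token is replaced with repl; the delimiters
// themselves are kept. Words are re-joined with single spaces, so the
// original whitespace and line breaks are NOT preserved (an assumption of
// this sketch, since ScanWords discards whitespace).
func replaceStream(r io.Reader, w io.Writer, repl string) error {
	scanner := bufio.NewScanner(r)
	scanner.Split(bufio.ScanWords)

	out := bufio.NewWriter(w)
	inside := false // true while between START and END
	first := true   // true until the first word has been written

	for scanner.Scan() {
		word := scanner.Text()
		switch {
		case word == "START":
			inside = true
		case word == "END":
			inside = false
		case inside:
			word = repl
		}
		if !first {
			out.WriteByte(' ')
		}
		first = false
		// bufio.Writer remembers the first write error and
		// reports it from Flush, so it is checked once below.
		out.WriteString(word)
	}
	if err := scanner.Err(); err != nil {
		return err
	}
	return out.Flush()
}

// replaceString is a small convenience wrapper for in-memory input.
func replaceString(s, repl string) (string, error) {
	var sb strings.Builder
	err := replaceStream(strings.NewReader(s), &sb, repl)
	return sb.String(), err
}

func main() {
	got, err := replaceString("Lorem START ipsum END dolor sit amet, START adipiscing END elit.", "X")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(got)
}
```

For a real file, replaceStream can be given an *os.File as the reader and another as the writer; only one scanner buffer's worth of input is held in memory at a time.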

Note that if you want to use multiple goroutines, you'll need to copy the data first, since scanner.Bytes() returns a slice that is only valid until the next call to .Scan(). If you choose to go that route, though, I wouldn't bother with a scanner.
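To illustrate the copying point, here is a small sketch that hands each token to a second goroutine over a channel. The function name collectTokens and the channel layout are assumptions made for the example, not part of the original answer.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// collectTokens scans the words of s and passes each one to a consumer
// goroutine, which gathers them into a slice of strings.
func collectTokens(s string) []string {
	scanner := bufio.NewScanner(strings.NewReader(s))
	scanner.Split(bufio.ScanWords)

	tokens := make(chan []byte)
	done := make(chan []string)

	// Consumer goroutine: reads tokens until the channel is closed.
	go func() {
		var got []string
		for tok := range tokens {
			got = append(got, string(tok))
		}
		done <- got
	}()

	for scanner.Scan() {
		// Copy before sending: the slice returned by scanner.Bytes()
		// may be overwritten by the next call to Scan.
		tok := append([]byte(nil), scanner.Bytes()...)
		tokens <- tok
	}
	close(tokens)
	return <-done
}

func main() {
	fmt.Println(collectTokens("Lorem START ipsum END dolor"))
}
```

Sending scanner.Bytes() directly would let the consumer read a slice while the scanner is reusing its buffer, which is the hazard the copy avoids.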

For what it's worth, loading a 3 MB file into memory is not such a bad idea on a general-purpose computer nowadays; I would only think twice if it were an order of magnitude bigger. It would almost certainly be faster to use bytes.Split with your delimiters.
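As a rough sketch of the in-memory route, the function below walks the data with bytes.Index rather than bytes.Split, which is a swapped-in technique chosen because it keeps the delimiters in place without re-joining pieces. The name replaceBetween and the choice to keep the delimiters in the output are assumptions for illustration.

```go
package main

import (
	"bytes"
	"fmt"
)

// replaceBetween returns a copy of data in which everything between each
// start delimiter and the following end delimiter is replaced by repl.
// The delimiters themselves are kept. A start delimiter with no matching
// end delimiter is copied through unchanged.
func replaceBetween(data, start, end, repl []byte) []byte {
	// Empty delimiters would match everywhere; return the input as is.
	if len(start) == 0 || len(end) == 0 {
		return data
	}
	var out bytes.Buffer
	for {
		i := bytes.Index(data, start)
		if i < 0 {
			break
		}
		bodyStart := i + len(start)
		j := bytes.Index(data[bodyStart:], end)
		if j < 0 {
			break
		}
		out.Write(data[:bodyStart]) // text up to and including the start delimiter
		out.Write(repl)
		out.Write(end)
		data = data[bodyStart+j+len(end):]
	}
	out.Write(data) // remainder without a complete start/end pair
	return out.Bytes()
}

func main() {
	in := []byte("Lorem START ipsum END dolor sit amet, START adipiscing END elit.")
	fmt.Printf("%s\n", replaceBetween(in, []byte("START"), []byte("END"), []byte(" X ")))
}
```

In practice data would come from reading the whole file at once (for example with ioutil.ReadFile), which is reasonable at the 3 MB size discussed here.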

huangapple
  • Posted on May 19, 2017 at 18:35:39
  • When reposting, please keep the link to this article: https://go.coder-hub.com/44067829.html