Improving performance of reading with bufio.NewScanner

Question

A simple program to serve one purpose:

  1. Read a script file line by line and build a single string, ignoring blank lines and comments (including the shebang), and adding a ';' at the end of a line if needed. (I know, I know: backslashes, ampersands, etc.)

My question is:

How to improve the performance of this small program? In a different answer I've read about utilizing scanner.Bytes() instead of scanner.Text(), but this doesn't seem feasible as a string is what I want.

Sample code with test file: https://play.golang.org/p/gzSTLkP3BoB

Here is the simple program:

    package main

    import (
        "bufio"
        "fmt"
        "log"
        "os"
        "strings"
    )

    func main() {
        file, err := os.Open("./script.sh")
        if err != nil {
            log.Fatalln(err)
        }
        defer file.Close()

        var a strings.Builder
        scanner := bufio.NewScanner(file)
        for scanner.Scan() {
            lines := scanner.Text()
            switch {
            case lines == "" || lines[:1] == "#":
                continue // skip blank lines and comments (including the shebang)
            case lines[len(lines)-1:] != ";":
                a.WriteString(lines + "; ")
            default:
                a.WriteString(lines + " ")
            }
        }
        fmt.Println(a.String())
    }

Answer 1

Score: 2

I used strings.Builder and ioutil.ReadAll to improve the performance. As you are dealing with small shell scripts, I assumed that reading the whole file at once should not put pressure on memory. I also call Grow once to reserve enough capacity in the strings.Builder up front, which reduces allocations.

  • doFast: faster implementation
  • doSlow: slower implementation (what you've originally done)
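
To illustrate the pre-sizing idea on its own, here is a minimal sketch. The helper names joinPlain and joinPresized and the sample lines are made up for this example and are not part of the answer's code. It compares a zero-value strings.Builder with one that calls Grow first, and uses testing.AllocsPerRun to report the average number of allocations per call.

```go
package main

import (
	"fmt"
	"strings"
	"testing"
)

// joinPlain appends to a zero-value Builder, which grows its buffer on demand.
func joinPlain(lines []string) string {
	var sb strings.Builder
	for _, l := range lines {
		sb.WriteString(l)
		sb.WriteString("; ")
	}
	return sb.String()
}

// joinPresized computes the final length first and reserves it with Grow,
// so the writes that follow should not need to reallocate.
func joinPresized(lines []string) string {
	n := 0
	for _, l := range lines {
		n += len(l) + len("; ")
	}
	var sb strings.Builder
	sb.Grow(n)
	for _, l := range lines {
		sb.WriteString(l)
		sb.WriteString("; ")
	}
	return sb.String()
}

func main() {
	// Sample input; an assumption made for this sketch.
	lines := []string{"echo one", "echo two", "ls -l /tmp", "cd /var/log", "tail -n 5 syslog"}
	plain := testing.AllocsPerRun(1000, func() { _ = joinPlain(lines) })
	presized := testing.AllocsPerRun(1000, func() { _ = joinPresized(lines) })
	fmt.Printf("allocs per call: plain=%.0f presized=%.0f\n", plain, presized)
}
```

Note that doFast reserves len(b) bytes, which is an estimate rather than an exact size: a line without a trailing ';' loses its one-byte newline but gains the two-byte "; ", so the output can end up slightly longer than the input and the builder may still grow once.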

Now, let's look at the benchmark results:

    goos: darwin
    goarch: amd64
    pkg: test
    cpu: Intel(R) Core(TM) i5-1038NG7 CPU @ 2.00GHz
    BenchmarkDoFast-8     342602    3334 ns/op    1280 B/op    3 allocs/op
    BenchmarkDoSlow-8     258896    4408 ns/op    4624 B/op    8 allocs/op
    PASS
    ok      test    2.477s

We can see that doFast is not only faster but also makes fewer allocations. For all three metrics (ns/op, B/op, allocs/op), lower is better.
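
The benchmark code itself is not shown in the answer. As a rough sketch of how comparable numbers could be collected, the program below runs two in-memory variants through testing.Benchmark with ReportAllocs enabled. The names viaText and viaSplit, and the script constant, are assumptions made for this example. Because the input lives in memory, file I/O is excluded, so the figures will not match the ones above.

```go
package main

import (
	"bufio"
	"bytes"
	"fmt"
	"strings"
	"testing"
)

// script stands in for the test file; its contents are an assumption.
const script = "#!/bin/sh\n# comment\n\necho one\necho two;\nls -l\n"

// viaText mirrors the question's loop: scanner.Text() plus string concatenation.
func viaText(src string) string {
	var sb strings.Builder
	sc := bufio.NewScanner(strings.NewReader(src))
	for sc.Scan() {
		line := sc.Text()
		switch {
		case line == "" || line[0] == '#':
			continue
		case line[len(line)-1] != ';':
			sb.WriteString(line + "; ")
		default:
			sb.WriteString(line + " ")
		}
	}
	return sb.String()
}

// viaSplit mirrors doFast: split the whole input, write into a pre-sized Builder.
func viaSplit(src []byte) string {
	var sb strings.Builder
	sb.Grow(len(src))
	for _, line := range bytes.Split(src, []byte("\n")) {
		switch {
		case len(line) == 0 || line[0] == '#':
			// skip blank lines and comments
		case line[len(line)-1] != ';':
			sb.Write(line)
			sb.WriteString("; ")
		default:
			sb.Write(line)
			sb.WriteByte(' ')
		}
	}
	return sb.String()
}

func main() {
	src := []byte(script)
	text := testing.Benchmark(func(b *testing.B) {
		b.ReportAllocs()
		for i := 0; i < b.N; i++ {
			_ = viaText(script)
		}
	})
	split := testing.Benchmark(func(b *testing.B) {
		b.ReportAllocs()
		for i := 0; i < b.N; i++ {
			_ = viaSplit(src)
		}
	})
	fmt.Println("viaText: ", text.String(), text.MemString())
	fmt.Println("viaSplit:", split.String(), split.MemString())
}
```

In a real project the more usual route would be Benchmark functions in a _test.go file, run with go test -bench . -benchmem; testing.Benchmark is used here only to keep the sketch in a single runnable file.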

    package main

    import (
        "bufio"
        "bytes"
        "fmt"
        "io"
        "io/ioutil"
        "os"
        "strings"
    )

    func open(filename string) (*os.File, error) {
        return os.Open(filename)
    }

    func main() {
        fd, err := open("test.sh")
        if err != nil {
            panic(err)
        }
        defer fd.Close()

        outputA, err := doFast(fd)
        if err != nil {
            panic(err)
        }

        // Rewind so that doSlow reads the same file from the start.
        if _, err := fd.Seek(0, io.SeekStart); err != nil {
            panic(err)
        }
        outputB, err := doSlow(fd)
        if err != nil {
            panic(err)
        }

        fmt.Println(outputA)
        fmt.Println(outputB)
    }

    // doFast reads the whole file at once and writes into a pre-sized builder.
    func doFast(fd *os.File) (string, error) {
        // ioutil.ReadAll is deprecated since Go 1.16; io.ReadAll is the replacement.
        b, err := ioutil.ReadAll(fd)
        if err != nil {
            return "", err
        }

        var res strings.Builder
        res.Grow(len(b)) // reserve capacity once, up front

        for _, line := range bytes.Split(b, []byte("\n")) {
            switch {
            case len(line) == 0 || line[0] == '#':
                // skip blank lines and comments (including the shebang)
            case line[len(line)-1] != ';':
                res.Write(line)
                res.WriteString("; ")
            default:
                res.Write(line)
                res.WriteByte(' ')
            }
        }
        return res.String(), nil
    }

    // doSlow is the original approach from the question.
    func doSlow(fd *os.File) (string, error) {
        var a strings.Builder
        scanner := bufio.NewScanner(fd)
        for scanner.Scan() {
            lines := scanner.Text()
            switch {
            case lines == "" || lines[:1] == "#":
                continue
            case lines[len(lines)-1:] != ";":
                a.WriteString(lines + "; ")
            default:
                a.WriteString(lines + " ")
            }
        }
        // Report read errors instead of silently returning partial output.
        if err := scanner.Err(); err != nil {
            return "", err
        }
        return a.String(), nil
    }

Note: I didn't use bufio.NewScanner; is it required?

Answer 2

Score: 1

It is feasible to use scanner.Bytes(). Here's the code:

    package main

    import (
        "bufio"
        "fmt"
        "log"
        "os"
        "strings"
    )

    func main() {
        file, err := os.Open("./script.sh")
        if err != nil {
            log.Fatalln(err)
        }
        defer file.Close()

        var a strings.Builder
        scanner := bufio.NewScanner(file)
        for scanner.Scan() {
            lines := scanner.Bytes() // no per-line string allocation
            switch {
            case len(lines) == 0 || lines[0] == '#':
                continue
            case lines[len(lines)-1] != ';':
                a.Write(lines) // Write copies the bytes into the builder
                a.WriteString("; ")
            default:
                a.Write(lines)
                a.WriteByte(' ')
            }
        }
        if err := scanner.Err(); err != nil {
            log.Fatalln(err)
        }
        fmt.Println(a.String())
    }

This program avoids the string allocation in scanner.Text(). It may not be faster in practice if its speed is limited by I/O.
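
One caveat when switching to scanner.Bytes(): according to the bufio documentation, the returned slice may point to data that is overwritten by a subsequent call to Scan. The code above is unaffected because Write copies the bytes into the builder straight away. If lines need to be kept around, they have to be copied first. A minimal sketch, in which collectLines is a hypothetical helper that is not part of the answer:

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// collectLines keeps every line of src for later use.
// scanner.Bytes() returns a view into the scanner's internal buffer, so each
// line is copied before it is stored.
func collectLines(src string) [][]byte {
	var out [][]byte
	scanner := bufio.NewScanner(strings.NewReader(src))
	for scanner.Scan() {
		line := append([]byte(nil), scanner.Bytes()...) // explicit copy
		out = append(out, line)
	}
	return out
}

func main() {
	for _, l := range collectLines("echo one\necho two;\n") {
		fmt.Printf("%q\n", l)
	}
}
```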

If your goal is to write the result to stdout, then write to a bufio.Writer instead of a strings.Builder. This change replaces one or more allocations in strings.Builder with a single allocation in bufio.Writer.

    package main

    import (
        "bufio"
        "log"
        "os"
    )

    func main() {
        file, err := os.Open("./script.sh")
        if err != nil {
            log.Fatalln(err)
        }
        defer file.Close()

        a := bufio.NewWriter(os.Stdout)
        defer a.Flush() // flush buffered data on return from main.

        scanner := bufio.NewScanner(file)
        for scanner.Scan() {
            lines := scanner.Bytes()
            switch {
            case len(lines) == 0 || lines[0] == '#':
                continue
            case lines[len(lines)-1] != ';':
                a.Write(lines)
                a.WriteString("; ")
            default:
                a.Write(lines)
                a.WriteByte(' ')
            }
        }
        if err := scanner.Err(); err != nil {
            log.Println(err) // not Fatalln: os.Exit would skip the deferred Flush
        }
    }

Bonus improvement: use lines := bytes.TrimSpace(scanner.Bytes()) to handle whitespace before a '#' and after a ';'.

Answer 3

Score: 0

You may be able to improve performance by buffering the output as well.

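    func main() {
        output := bufio.NewWriter(os.Stdout)
        defer output.Flush() // without a flush, buffered data may never reach stdout

        // ... build the strings.Builder a as in the question ...

        // instead of fmt.Println, use
        fmt.Fprintf(output, "%s\n", a.String())
    }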
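
When output is buffered this way, nothing reaches the destination until the buffer fills up or Flush is called. The sketch below makes that visible by writing through a bufio.Writer into an in-memory bytes.Buffer; pendingAndFlushed is a hypothetical helper written only for this illustration.

```go
package main

import (
	"bufio"
	"bytes"
	"fmt"
)

// pendingAndFlushed writes s through a bufio.Writer into an in-memory sink and
// reports how many bytes the sink holds before and after Flush.
func pendingAndFlushed(s string) (before, after int, err error) {
	var sink bytes.Buffer
	w := bufio.NewWriter(&sink)
	if _, err = fmt.Fprintf(w, "%s\n", s); err != nil {
		return 0, 0, err
	}
	before = sink.Len() // still buffered inside w for small inputs
	if err = w.Flush(); err != nil {
		return before, 0, err
	}
	return before, sink.Len(), nil
}

func main() {
	before, after, err := pendingAndFlushed("echo one; echo two; ")
	fmt.Println(before, after, err)
}
```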

huangapple
  • Published on June 25, 2021, 10:46:48
  • When reposting, please keep the link to this article: https://go.coder-hub.com/68124797.html