How to read a file character by character in Go


Question

I have some large JSON files I want to parse, and I want to avoid loading all of the data into memory at once. I'd like a function or loop that returns one character at a time.

I found this example for iterating over words in a string, and the ScanRunes function in the bufio package looks like it could return a character at a time. I also had the ReadRune function from bufio mostly working, but that felt like a pretty heavy approach.

EDIT

I compared 3 approaches. All used a loop to pull content from either a bufio.Reader or a bufio.Scanner.

  1. Read runes in a loop using .ReadRune on a bufio.Reader, checking the error returned by each .ReadRune call.
  2. Read bytes from a bufio.Scanner after calling .Split(bufio.ScanRunes) on the scanner. Called .Scan and .Bytes on each iteration, checking the .Scan call for errors.
  3. Same as #2, but read text from the bufio.Scanner with .Text instead of bytes. Instead of joining a slice of runes with string([]rune), I joined a slice of strings with strings.Join(parts, "") to form the final blobs of text.

The timing for 10 runs of each approach on a 23 MB JSON file was:

  1. 0.65 s
  2. 2.40 s
  3. 0.97 s

So it looks like ReadRune is not too bad after all. It also makes for a shorter, less verbose call, because each rune is fetched in one operation (.ReadRune) instead of two (.Scan and .Bytes).
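For anyone who wants to repeat the comparison, here is a rough sketch of the three approaches side by side. It is not the code used for the timings above: the function names (viaReadRune, viaScanBytes, viaScanText, runAll) and the generated sample input are assumptions made for illustration, and approach 2 is assumed to decode each token with utf8.DecodeRune.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"strings"
	"time"
	"unicode/utf8"
)

// viaReadRune is approach 1: one ReadRune call per rune.
func viaReadRune(r io.Reader) (string, error) {
	br := bufio.NewReader(r)
	var runes []rune
	for {
		c, _, err := br.ReadRune()
		if err == io.EOF {
			return string(runes), nil
		}
		if err != nil {
			return "", err
		}
		runes = append(runes, c)
	}
}

// viaScanBytes is approach 2: Scan plus Bytes, decoding each token to a rune.
func viaScanBytes(r io.Reader) (string, error) {
	s := bufio.NewScanner(r)
	s.Split(bufio.ScanRunes)
	var runes []rune
	for s.Scan() {
		c, _ := utf8.DecodeRune(s.Bytes())
		runes = append(runes, c)
	}
	return string(runes), s.Err()
}

// viaScanText is approach 3: Scan plus Text, joined with strings.Join.
func viaScanText(r io.Reader) (string, error) {
	s := bufio.NewScanner(r)
	s.Split(bufio.ScanRunes)
	var parts []string
	for s.Scan() {
		parts = append(parts, s.Text())
	}
	return strings.Join(parts, ""), s.Err()
}

// runAll feeds the same input to all three approaches, prints how long
// each one took, and returns the three reconstructed strings.
func runAll(input string) ([3]string, error) {
	approaches := [3]func(io.Reader) (string, error){viaReadRune, viaScanBytes, viaScanText}
	var out [3]string
	for i, f := range approaches {
		start := time.Now()
		s, err := f(strings.NewReader(input))
		if err != nil {
			return out, err
		}
		out[i] = s
		fmt.Printf("approach %d: %v\n", i+1, time.Since(start))
	}
	return out, nil
}

func main() {
	// Hypothetical sample input; substitute a real file for meaningful numbers.
	input := strings.Repeat(`{"sample":"json string","имя":"лиса"}`, 100000)
	if _, err := runAll(input); err != nil {
		fmt.Println("error:", err)
	}
}
```

The absolute numbers will depend on the machine and the input, so treat the output as a relative comparison only.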

Answer 1

Score: 11

Just read each rune one by one in a loop. See the example:

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"log"
	"strings"
)

var text = `
The quick brown fox jumps over the lazy dog #1.
Быстрая коричневая лиса перепрыгнула через ленивую собаку.
`

func main() {
	r := bufio.NewReader(strings.NewReader(text))
	for {
		// ReadRune returns the next UTF-8 encoded rune and its size in bytes.
		c, sz, err := r.ReadRune()
		if err == io.EOF {
			break // end of input
		}
		if err != nil {
			log.Fatal(err)
		}
		fmt.Printf("%q [%d]\n", string(c), sz)
	}
}
```
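Since the question is about files rather than in-memory strings, the same loop might be pointed at a file roughly as follows. This is a minimal sketch: forEachRune and demo are hypothetical helper names, and the temporary file merely stands in for a large JSON file.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"log"
	"os"
)

// forEachRune opens path and calls visit for every rune in the file.
// Only bufio's internal buffer is held in memory, not the whole file.
func forEachRune(path string, visit func(c rune, size int)) error {
	f, err := os.Open(path)
	if err != nil {
		return err
	}
	defer f.Close()

	r := bufio.NewReader(f)
	for {
		c, size, err := r.ReadRune()
		if err == io.EOF {
			return nil
		}
		if err != nil {
			return err
		}
		visit(c, size)
	}
}

// demo writes content to a temporary file (a stand-in for a large JSON
// file), streams it back, and returns the rune count and byte count.
func demo(content string) (runes, size int, err error) {
	f, err := os.CreateTemp("", "sample-*.json")
	if err != nil {
		return 0, 0, err
	}
	defer os.Remove(f.Name())
	if _, err := f.WriteString(content); err != nil {
		f.Close()
		return 0, 0, err
	}
	if err := f.Close(); err != nil {
		return 0, 0, err
	}
	err = forEachRune(f.Name(), func(_ rune, n int) {
		runes++
		size += n
	})
	return runes, size, err
}

func main() {
	runes, size, err := demo(`{"name":"лиса"}`)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("%d runes, %d bytes\n", runes, size)
}
```

Passing a callback keeps the file handling in one place; returning the runes on a channel or in a slice would also work, at the cost of extra allocation.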

Answer 2

Score: 9

This code reads runes from the input. No cast is necessary, and it is iterator-like:

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

func main() {
	in := `{"sample":"json string"}`
	s := bufio.NewScanner(strings.NewReader(in))
	s.Split(bufio.ScanRunes)
	for s.Scan() {
		fmt.Println(s.Text())
	}
}
```
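To make the iterator-like shape explicit, the scanner could be wrapped in a small type that yields rune values and exposes the scanner's error once the loop ends. This is only a sketch: runeIter, newRuneIter and collect are hypothetical names, not part of bufio.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"log"
	"strings"
	"unicode/utf8"
)

// runeIter wraps a bufio.Scanner configured with bufio.ScanRunes.
type runeIter struct{ s *bufio.Scanner }

func newRuneIter(r io.Reader) *runeIter {
	s := bufio.NewScanner(r)
	s.Split(bufio.ScanRunes)
	return &runeIter{s: s}
}

// Next returns the next rune, or false when the input is exhausted
// or the underlying reader failed.
func (it *runeIter) Next() (rune, bool) {
	if !it.s.Scan() {
		return 0, false
	}
	c, _ := utf8.DecodeRune(it.s.Bytes())
	return c, true
}

// Err reports the first non-EOF error seen by the scanner, if any.
func (it *runeIter) Err() error { return it.s.Err() }

// collect gathers all runes of in; a convenience for the demo below.
func collect(in string) ([]rune, error) {
	it := newRuneIter(strings.NewReader(in))
	var out []rune
	for c, ok := it.Next(); ok; c, ok = it.Next() {
		out = append(out, c)
	}
	return out, it.Err()
}

func main() {
	rs, err := collect(`{"sample":"json string"}`)
	if err != nil {
		log.Fatal(err)
	}
	for _, c := range rs {
		fmt.Printf("%c", c)
	}
	fmt.Println()
}
```

Checking Err after the loop matters because Scan returns false both at end of input and on a read error.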

Answer 3

Score: 1

If this is just about memory size: the upcoming release (due very soon) adds a token-style enhancement to the JSON decoder. You can see it here:

https://tip.golang.org/pkg/encoding/json/#Decoder.Token
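In Go versions that include Decoder.Token, streaming a document token by token might look roughly like this. The helper names tokens and tokensOf and the sample input are assumptions for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"log"
	"strings"
)

// tokens reads JSON tokens from r one at a time and returns a printable
// "type value" form of each, without decoding the whole document at once.
func tokens(r io.Reader) ([]string, error) {
	dec := json.NewDecoder(r)
	var out []string
	for {
		tok, err := dec.Token()
		if err == io.EOF {
			return out, nil
		}
		if err != nil {
			return out, err
		}
		// tok is a json.Delim, string, float64, bool, or nil.
		out = append(out, fmt.Sprintf("%T %v", tok, tok))
	}
}

// tokensOf is a convenience wrapper for in-memory input.
func tokensOf(in string) ([]string, error) {
	return tokens(strings.NewReader(in))
}

func main() {
	toks, err := tokensOf(`{"sample":"json string","n":[1,true]}`)
	if err != nil {
		log.Fatal(err)
	}
	for _, t := range toks {
		fmt.Println(t)
	}
}
```

For a real file, the reader passed to tokens would be the *os.File returned by os.Open, so only the decoder's buffer is held in memory.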

huangapple
  • Posted on August 6, 2015 at 22:04:05
  • When reposting, please keep the link to this article: https://go.coder-hub.com/31857891.html