How to read a file character by character in Go
Question
I have some large JSON files I want to parse, and I want to avoid loading all of the data into memory at once. I'd like a function or loop that returns the characters one at a time.

I found this example for iterating over words in a string, and the ScanRunes function in the bufio package looks like it could return a character at a time. I also had the ReadRune function from bufio mostly working, but that felt like a pretty heavy approach.

EDIT

I compared 3 approaches. All used a loop to pull content from either a bufio.Reader or a bufio.Scanner.

1. Read runes in a loop using .ReadRune on a bufio.Reader. Checked for errors from the call to .ReadRune.
2. Read bytes from a bufio.Scanner after calling .Split(bufio.ScanRunes) on the scanner. Called .Scan and .Bytes on each iteration, checking the .Scan call for errors.
3. Same as #2, but read text from the bufio.Scanner instead of bytes using .Text. Instead of joining a slice of runes with string([]runes), I joined a slice of strings with strings.Join([]strings, "") to form the final blobs of text.

The timing for 10 runs of each on a 23 MB JSON file was:

1. 0.65 s
2. 2.40 s
3. 0.97 s

So it looks like ReadRune is not too bad after all. It also makes for a shorter, less verbose loop, because each rune is fetched in 1 operation (.ReadRune) instead of 2 (.Scan and .Bytes).
Answer 1
Score: 11
Just read each rune one by one in a loop. See the example:
package main

import (
	"bufio"
	"fmt"
	"io"
	"log"
	"strings"
)

var text = `
The quick brown fox jumps over the lazy dog #1.
Быстрая коричневая лиса перепрыгнула через ленивую собаку.
`

func main() {
	r := bufio.NewReader(strings.NewReader(text))
	for {
		// ReadRune returns the rune, its size in bytes, and an error.
		if c, sz, err := r.ReadRune(); err != nil {
			if err == io.EOF {
				break
			} else {
				log.Fatal(err)
			}
		} else {
			fmt.Printf("%q [%d]\n", string(c), sz)
		}
	}
}
Answer 2
Score: 9
This code reads runes from the input. No type conversion is needed, and the loop reads like an iterator:
package main

import (
	"bufio"
	"fmt"
	"strings"
)

func main() {
	in := `{"sample":"json string"}`
	s := bufio.NewScanner(strings.NewReader(in))
	// Split on runes instead of the default lines.
	s.Split(bufio.ScanRunes)
	for s.Scan() {
		fmt.Println(s.Text())
	}
}
Answer 3
Score: 1
If it's just about memory size: the upcoming release (really soon) adds a token-style enhancement to the JSON decoder. You can see it here:
https://tip.golang.org/pkg/encoding/json/#Decoder.Token