Read an entire file of newline-delimited JSON blobs into memory and unmarshal each blob with the fewest conversions in golang?

Question

I'm new to Go, so I don't know a whole lot about the language-specific constructs.

My use case is first to read into memory an input file containing JSON blobs that are newline delimited. From this "array" of JSON source, I'd like to unmarshal each array element to deal with it in golang. The expected structure mapping is already defined.

I typically like to read all lines at once, so ioutil.ReadFile() as mentioned in https://stackoverflow.com/questions/13514184/how-can-i-read-a-whole-file-into-a-string-variable-in-golang seems like a good choice, and json.Unmarshal appears to take a byte slice as its source. But if I use ReadFile(), I get a single byte slice for the whole file. How might I extract slices of that byte array so that the newline bytes (the delimiters) are skipped and each slice is one of the JSON blobs? I assume the best technique is one that avoids, or at least minimizes, data type conversions. The easy hack would be to convert the byte slice to a string, split the newline-delimited string into a slice of strings, and then convert each string element back to bytes to pass to json.Unmarshal (sketched below). I'd prefer an optimized approach, but I'm not sure how to tackle the implementation details in Go, so I could use some tips here.
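For reference, here is a minimal sketch of that conversion-heavy hack; the Record type and the input.ndjson file name are hypothetical placeholders, not part of the question. It reads the file, converts it to a string, splits on newlines, and converts each line back to bytes for json.Unmarshal:

    // Sketch of the "easy hack": bytes -> string -> []string -> bytes.
    package main

    import (
        "encoding/json"
        "fmt"
        "io/ioutil"
        "log"
        "strings"
    )

    // Record is a hypothetical stand-in for the predefined structure mapping.
    type Record struct {
        Name  string `json:"name"`
        Value int    `json:"value"`
    }

    func main() {
        data, err := ioutil.ReadFile("input.ndjson") // hypothetical file name
        if err != nil {
            log.Fatal(err)
        }
        // Convert the whole file to a string, split it on newlines, then
        // convert each line back to a byte slice for json.Unmarshal.
        for _, line := range strings.Split(string(data), "\n") {
            if strings.TrimSpace(line) == "" {
                continue // skip blank lines, e.g. after a trailing newline
            }
            var r Record
            if err := json.Unmarshal([]byte(line), &r); err != nil {
                log.Fatal(err)
            }
            fmt.Printf("%+v\n", r)
        }
    }

The answer below shows how to avoid these extra string conversions.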

Ideally, I'd like the preprocessing done beforehand, so that I'm not dealing with the content of the file's JSON bytes while iterating over the slices. In other words, I'd like to preprocess the single byte slice read from the file into a slice of byte slices, with the newline bytes removed, each element holding one of the newline-delimited segments.

Answer 1

Score: 13

Use bufio.Scanner to read a line at a time:

    f, err := os.Open(fname)
    if err != nil {
        // handle error
    }
    s := bufio.NewScanner(f)
    for s.Scan() {
        var v ValueTypeToUnmarshalTo
        if err := json.Unmarshal(s.Bytes(), &v); err != nil {
            // handle error
        }
        // do something with v
    }
    if s.Err() != nil {
        // handle scan error
    }

or use ioutil.ReadFile to slurp up the entire file and bytes.Split to break the file into lines:

    p, err := ioutil.ReadFile(fname)
    if err != nil {
        // handle error
    }
    for _, line := range bytes.Split(p, []byte{'\n'}) {
        var v ValueTypeToUnmarshalTo
        if err := json.Unmarshal(line, &v); err != nil {
            // handle error
        }
        // do something with v
    }
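A small detail to watch with this variant (an observation about the input, not part of the original answer): if the file ends with a newline, bytes.Split produces a trailing empty slice, and json.Unmarshal returns an error for empty input, so it can help to skip blank lines:

    for _, line := range bytes.Split(p, []byte{'\n'}) {
        if len(bytes.TrimSpace(line)) == 0 {
            continue // skip the empty slice produced by a trailing newline
        }
        var v ValueTypeToUnmarshalTo
        if err := json.Unmarshal(line, &v); err != nil {
            // handle error
        }
        // do something with v
    }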

or use the json.Decoder's built-in streaming feature to read multiple values from the file:

    f, err := os.Open(fname)
    if err != nil {
        // handle error
    }
    d := json.NewDecoder(f)
    for {
        var v ValueTypeToUnmarshalTo
        if err := d.Decode(&v); err == io.EOF {
            break // done decoding file
        } else if err != nil {
            // handle error
        }
        // do something with v
    }

The ioutil.ReadFile approach uses more memory than the other approaches (one byte for each byte in the file, plus one slice header for each line).

Because the JSON decoder ignores whitespace following a value, all three approaches handle \r\n line terminators.

There are no data conversions in any of these approaches other than those inherent to unmarshalling JSON bytes to Go values.
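For completeness, here is a self-contained sketch that turns the json.Decoder variant into a runnable program; the Record type and the input.ndjson file name are hypothetical stand-ins for the question's already-defined structure mapping:

    package main

    import (
        "encoding/json"
        "fmt"
        "io"
        "log"
        "os"
    )

    // Record is a hypothetical stand-in for the predefined structure mapping.
    type Record struct {
        Name  string `json:"name"`
        Value int    `json:"value"`
    }

    func main() {
        f, err := os.Open("input.ndjson") // hypothetical file name
        if err != nil {
            log.Fatal(err)
        }
        defer f.Close()

        d := json.NewDecoder(f)
        for {
            var r Record
            if err := d.Decode(&r); err == io.EOF {
                break // no more JSON values in the file
            } else if err != nil {
                log.Fatal(err)
            }
            fmt.Printf("decoded: %+v\n", r)
        }
    }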
