Reading a non-UTF-8 text file in Go
Question
I need to read a text file that is encoded in GBK. Go's standard library assumes that all text is encoded in UTF-8.
How can I read files in other encodings?
Answer 1
Score: 22
Previously (as mentioned in an older answer), the "easy" way to do this involved third-party packages that needed cgo and wrapped the iconv library. That is undesirable for many reasons. Thankfully, for quite a while now there has been a better pure-Go way of doing this, using only packages provided by the Go Authors (not in the standard library, but in the Go sub-repositories).
The golang.org/x/text/encoding package defines an interface for generic character encodings that can convert to and from UTF-8. The golang.org/x/text/encoding/simplifiedchinese sub-package provides GB18030, GBK, and HZ-GB2312 encoding implementations.
Here is an example of reading and writing a GBK-encoded file. Note that the io.Reader and io.Writer do the conversion "on the fly" as data is read or written.
package main

import (
	"bufio"
	"fmt"
	"log"
	"os"

	"golang.org/x/text/encoding/simplifiedchinese"
	"golang.org/x/text/transform"
)

// Encoding to use. Since this implements the encoding.Encoding
// interface from golang.org/x/text/encoding you can trivially
// change this out for any of the other implemented encoders,
// e.g. `traditionalchinese.Big5`, `charmap.Windows1252`,
// `korean.EUCKR`, etc.
var enc = simplifiedchinese.GBK

func main() {
	const filename = "example_GBK_file"
	exampleWriteGBK(filename)
	exampleReadGBK(filename)
}

func exampleReadGBK(filename string) {
	// Read UTF-8 from a GBK encoded file.
	f, err := os.Open(filename)
	if err != nil {
		log.Fatal(err)
	}
	r := transform.NewReader(f, enc.NewDecoder())

	// Read converted UTF-8 from `r` as needed.
	// As an example we'll read line-by-line showing what was read:
	sc := bufio.NewScanner(r)
	for sc.Scan() {
		fmt.Printf("Read line: %s\n", sc.Bytes())
	}
	if err = sc.Err(); err != nil {
		log.Fatal(err)
	}

	if err = f.Close(); err != nil {
		log.Fatal(err)
	}
}

func exampleWriteGBK(filename string) {
	// Write UTF-8 to a GBK encoded file.
	f, err := os.Create(filename)
	if err != nil {
		log.Fatal(err)
	}
	w := transform.NewWriter(f, enc.NewEncoder())

	// Write UTF-8 to `w` as desired.
	// As an example we'll write some text from the Wikipedia
	// GBK page that includes Chinese.
	_, err = fmt.Fprintln(w,
		`In 1995, China National Information Technology Standardization
Technical Committee set down the Chinese Internal Code Specification
(Chinese: 汉字内码扩展规范(GBK); pinyin: Hànzì Nèimǎ
Kuòzhǎn Guīfàn (GBK)), Version 1.0, known as GBK 1.0, which is a
slight extension of Codepage 936. The newly added 95 characters were not
found in GB 13000.1-1993, and were provisionally assigned Unicode PUA
code points.`)
	if err != nil {
		log.Fatal(err)
	}

	if err = f.Close(); err != nil {
		log.Fatal(err)
	}
}
Answer 2
Score: 5
Try go-iconv. It wraps iconv and implements io.Reader and io.Writer.
A thread in the golang-china discussion group mentions some examples of using go-iconv.