Reading a non-UTF-8 text file in Go

Question

I need to read a text file that is encoded in GBK. The standard library of the Go programming language assumes that all text is encoded in UTF-8.

How can I read files in other encodings?

Answer 1

Score: 22

Previously (as mentioned in an older answer), the "easy" way to do this involved third-party packages that needed cgo and wrapped the iconv library. That is undesirable for many reasons. Thankfully, for quite a while now there has been a superior pure-Go way of doing this, using only packages provided by the Go Authors (not in the standard library, but in the Go sub-repositories).

The golang.org/x/text/encoding package defines an interface for generic character encodings that can convert to/from UTF-8. The golang.org/x/text/encoding/simplifiedchinese sub-package provides GB18030, GBK and HZ-GB2312 encoding implementations.

Here is an example of reading and writing a GBK-encoded file. Note that the io.Reader and io.Writer convert the encoding "on the fly" as data is read or written.

package main

import (
	"bufio"
	"fmt"
	"log"
	"os"

	"golang.org/x/text/encoding/simplifiedchinese"
	"golang.org/x/text/transform"
)

// Encoding to use. Since this implements the encoding.Encoding
// interface from golang.org/x/text/encoding you can trivially
// change this out for any of the other implemented encoders,
// e.g. `traditionalchinese.Big5`, `charmap.Windows1252`,
// `korean.EUCKR`, etc.
var enc = simplifiedchinese.GBK

func main() {
	const filename = "example_GBK_file"
	exampleWriteGBK(filename)
	exampleReadGBK(filename)
}

func exampleReadGBK(filename string) {
	// Read UTF-8 from a GBK encoded file.
	f, err := os.Open(filename)
	if err != nil {
		log.Fatal(err)
	}
	r := transform.NewReader(f, enc.NewDecoder())

	// Read converted UTF-8 from `r` as needed.
	// As an example we'll read line-by-line showing what was read:
	sc := bufio.NewScanner(r)
	for sc.Scan() {
		fmt.Printf("Read line: %s\n", sc.Bytes())
	}
	if err = sc.Err(); err != nil {
		log.Fatal(err)
	}

	if err = f.Close(); err != nil {
		log.Fatal(err)
	}
}

func exampleWriteGBK(filename string) {
	// Write UTF-8 to a GBK encoded file.
	f, err := os.Create(filename)
	if err != nil {
		log.Fatal(err)
	}
	w := transform.NewWriter(f, enc.NewEncoder())

	// Write UTF-8 to `w` as desired.
	// As an example we'll write some text from the Wikipedia
	// GBK page that includes Chinese.
	_, err = fmt.Fprintln(w,
		`In 1995, China National Information Technology Standardization
Technical Committee set down the Chinese Internal Code Specification
(Chinese: 汉字内码扩展规范(GBK); pinyin: Hànzì Nèimǎ
Kuòzhǎn Guīfàn (GBK)), Version 1.0, known as GBK 1.0, which is a
slight extension of Codepage 936. The newly added 95 characters were not
found in GB 13000.1-1993, and were provisionally assigned Unicode PUA
code points.`)
	if err != nil {
		log.Fatal(err)
	}

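	// Close the transforming writer first so that any bytes it has
	// buffered are flushed to f; this does not close f itself.
	if err = w.Close(); err != nil {
		log.Fatal(err)
	}
	if err = f.Close(); err != nil {
		log.Fatal(err)
	}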
}


Answer 2

Score: 5

Try go-iconv. It wraps iconv and implements io.Reader and io.Writer.

A message in the golang-china discussion group gives a few examples of go-iconv usage.

huangapple
  • Posted on April 23, 2012, 17:28:41
  • When reposting, please keep the link to this article: https://go.coder-hub.com/10277933.html