Read unicode characters with bufio scanner in Go

Question

I'm trying to read a plain text file that contains names like this: "CASTAÑEDA"

The code is basically like this:

file, err := os.Open("C:/Files/file.txt")
if err != nil {
    log.Fatal(err)
}
defer file.Close()
scanner := bufio.NewScanner(file)
for scanner.Scan() {
    fmt.Println(scanner.Text())
}

Then, when "CASTAÑEDA" is read, it prints "CASTA�EDA".

Is there any way to handle those characters when reading with bufio?

Thanks.

Answer 1

Score: 9

Your file is most probably not UTF-8 encoded. Because of that (Go assumes that text in strings is UTF-8), your console output looks mangled. In your case I would advise using the packages golang.org/x/text/encoding/charmap and golang.org/x/text/transform to convert the file's data to UTF-8. Judging by your file path, I presume you are on Windows, so your character encoding might be Windows-1252 (for example, if you edited the file with notepad.exe).

Try something like this:

package main

import (
	"bufio"
	"fmt"
	"log"
	"os"

	"golang.org/x/text/encoding/charmap"
	"golang.org/x/text/transform"
)

func main() {
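	file, err := os.Open("C:/temp/file.txt")
	if err != nil {
		log.Fatal(err)
	}
	defer file.Close() // defer only after the error check, once file is known to be valid

	// Wrap the file so its bytes are decoded to UTF-8; insert your encoding here.
	dec := transform.NewReader(file, charmap.Windows1252.NewDecoder())

	scanner := bufio.NewScanner(dec)
	for scanner.Scan() {
		fmt.Println(scanner.Text())
	}
	// Report any read or decoding error that stopped the scan early.
	if err := scanner.Err(); err != nil {
		log.Fatal(err)
	}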
}

You can find more encodings in the golang.org/x/text/encoding/charmap package and substitute any of them for Windows1252 in the example above.
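Before picking a decoder, it may help to confirm that the file really is not UTF-8. The following is a minimal sketch using only the standard library's unicode/utf8 package. The helper name firstInvalidUTF8 and the sample bytes are hypothetical and chosen for illustration; in practice you would pass in the bytes read from your own file.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// firstInvalidUTF8 returns the offset of the first byte that does not start a
// valid UTF-8 sequence, or -1 if data is entirely valid UTF-8.
// Hypothetical helper for illustration, not part of any library.
func firstInvalidUTF8(data []byte) int {
	for i := 0; i < len(data); {
		r, size := utf8.DecodeRune(data[i:])
		// DecodeRune signals an invalid sequence with (RuneError, 1).
		// A correctly encoded U+FFFD has size 3 and is not flagged.
		if r == utf8.RuneError && size == 1 {
			return i
		}
		i += size
	}
	return -1
}

func main() {
	// "CASTAÑEDA" as Windows-1252 bytes: Ñ is the single byte 0xD1.
	cp1252 := []byte{'C', 'A', 'S', 'T', 'A', 0xD1, 'E', 'D', 'A'}
	// The same name as a Go string literal, which is UTF-8 (Ñ is 0xC3 0x91).
	utf8Name := []byte("CASTAÑEDA")

	fmt.Println(firstInvalidUTF8(cp1252))   // offset of the 0xD1 byte
	fmt.Println(firstInvalidUTF8(utf8Name)) // -1: already valid UTF-8
}
```

A result other than -1 only tells you the data is not UTF-8; it does not identify which encoding was used, so choosing Windows1252 (or another charmap entry) remains a guess based on where the file came from.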

Answer 2

Score: 4

The issue you're encountering is that your input is likely not UTF-8 (which is what bufio and most of the Go language/stdlib expect). Instead, your input probably uses some extended-ASCII code page, which is why the unaccented characters pass through cleanly (UTF-8 is a superset of 7-bit ASCII) while the 'Ñ' does not.

In this situation, the byte representation of the accented character is not valid UTF-8, so the Unicode replacement character (U+FFFD) is produced in its place. You've got a few options:

  1. Convert your input files to UTF-8 before passing them to Go. There are many utilities that can do this, and editors often have this feature.
  2. Try using golang.org/x/text/encoding/charmap together with NewReader from golang.org/x/text/transform to transform your input to UTF-8. Pass the resulting Reader to bufio.NewScanner.
  3. Change the line in the loop to os.Stdout.Write(scanner.Bytes()); fmt.Println(). This might avoid the bytes being interpreted as UTF-8 beyond newline splitting. Writing the bytes directly to os.Stdout will further avoid any (mis)interpretation of the contents.
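As a rough illustration of the second option without the external packages, here is a sketch that decodes each scanned line by hand. It assumes the input is ISO-8859-1 (Latin-1), where every byte value equals its Unicode code point. That is a swapped-in simplification, not the charmap decoder: Windows-1252 differs from Latin-1 in the 0x80 to 0x9F range, although 'Ñ' is 0xD1 in both. The function name latin1ToUTF8 and the in-memory sample input are hypothetical.

```go
package main

import (
	"bufio"
	"bytes"
	"fmt"
	"log"
)

// latin1ToUTF8 converts ISO-8859-1 bytes to a UTF-8 Go string.
// Assumption: the input is Latin-1, so each byte maps directly to the
// Unicode code point with the same value. Hypothetical helper.
func latin1ToUTF8(b []byte) string {
	runes := make([]rune, len(b))
	for i, c := range b {
		runes[i] = rune(c)
	}
	return string(runes) // string([]rune) encodes the runes as UTF-8
}

func main() {
	// Stand-in for the file: two lines, with Ñ stored as the single byte 0xD1.
	input := []byte("CASTA\xD1EDA\nPEREZ\n")

	scanner := bufio.NewScanner(bytes.NewReader(input))
	for scanner.Scan() {
		// Line splitting works on raw bytes, because '\n' is the same byte
		// in Latin-1 and UTF-8; decoding happens per line afterwards.
		fmt.Println(latin1ToUTF8(scanner.Bytes()))
	}
	if err := scanner.Err(); err != nil {
		log.Fatal(err)
	}
}
```

For anything beyond Latin-1, the charmap decoders from the second option are the safer choice, since they carry the full mapping tables for each code page.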

huangapple
  • Posted on April 17, 2015, 05:55:14
  • When reposting, please keep the link to this article: https://go.coder-hub.com/29686673.html