2015-04-17 05:55:14 · go

Read unicode characters with bufio scanner in Go

Question

I'm trying to read a plain text file that contains names like this: "CASTAÑEDA"

The code is basically like this:

file, err := os.Open("C:/Files/file.txt")
defer file.Close()
if err != nil {
    log.Fatal(err)
}
scanner := bufio.NewScanner(file)
for scanner.Scan() {
    fmt.Println(scanner.Text())
}

Then, when "CASTAÑEDA" is read, it prints "CASTA�EDA".

Is there any way to handle such characters when reading with bufio?

Thanks.

Answer 1

Score: 9

Your file is most probably not UTF-8 encoded. Because Go treats string contents as UTF-8 text, your console output looks mangled. In your case I would advise using the packages golang.org/x/text/encoding/charmap and golang.org/x/text/transform to convert the file's data to UTF-8. Judging by your file path, you are on Windows, so your character encoding might be Windows-1252 (for example, if you edited the file with notepad.exe).

Try something like this:

package main

import (
	"bufio"
	"fmt"
	"log"
	"os"

	"golang.org/x/text/encoding/charmap"
	"golang.org/x/text/transform"
)

func main() {
	file, err := os.Open("C:/temp/file.txt")
	if err != nil {
		log.Fatal(err)
	}
	// Defer Close only after Open has succeeded.
	defer file.Close()

	// Decode Windows-1252 bytes into UTF-8 while reading; insert your encoding here.
	dec := transform.NewReader(file, charmap.Windows1252.NewDecoder())
	scanner := bufio.NewScanner(dec)
	for scanner.Scan() {
		fmt.Println(scanner.Text())
	}
	// Scan returns false on both EOF and errors, so check which one it was.
	if err := scanner.Err(); err != nil {
		log.Fatal(err)
	}
}

You can find more encodings in the package golang.org/x/text/encoding/charmap and substitute whichever one matches your file into the example above.
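
If pulling in golang.org/x/text is not an option, the same idea can be sketched with the standard library alone for the narrower case of ISO-8859-1, where every byte value equals its Unicode code point. This is only an illustration under that assumption, not a replacement for charmap.Windows1252: Windows-1252 differs from ISO-8859-1 in the range 0x80 to 0x9F. The helper names latin1ToUTF8 and lineToUTF8 are hypothetical, and the in-memory string stands in for the file.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
	"unicode/utf8"
)

// latin1ToUTF8 converts ISO-8859-1 bytes to a UTF-8 string.
// In ISO-8859-1 every byte value equals its Unicode code point,
// so each byte can be widened to a rune directly.
func latin1ToUTF8(b []byte) string {
	runes := make([]rune, len(b))
	for i, c := range b {
		runes[i] = rune(c)
	}
	return string(runes)
}

// lineToUTF8 returns the line unchanged when it is already valid UTF-8.
// Otherwise it ASSUMES the line is ISO-8859-1 and converts it.
func lineToUTF8(b []byte) string {
	if utf8.Valid(b) {
		return string(b)
	}
	return latin1ToUTF8(b)
}

func main() {
	// Stand-in for the file contents: 0xD1 is 'Ñ' in ISO-8859-1.
	input := strings.NewReader("CASTA\xd1EDA\nPLAIN\n")

	scanner := bufio.NewScanner(input)
	for scanner.Scan() {
		fmt.Println(lineToUTF8(scanner.Bytes()))
	}
	if err := scanner.Err(); err != nil {
		fmt.Println("read error:", err)
	}
}
```

The validity check keeps files that are already UTF-8 untouched, but it is a heuristic: a legacy-encoded line that happens to form valid UTF-8 would be passed through unconverted, so a known encoding should be preferred over guessing.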

Answer 2

Score: 4

The issue you're encountering is that your input is likely not UTF-8 (which is what bufio and most of the Go language/stdlib expect). Instead, your input probably uses some extended-ASCII code page, which is why the unaccented characters pass through cleanly (UTF-8 is a superset of 7-bit ASCII) but the 'Ñ' does not.

In this situation, the byte representation of the accented character is not valid UTF-8, so the Unicode replacement character (U+FFFD) is produced. You've got a few options:

1. Convert your input files to UTF-8 before passing them to Go. There are many utilities that can do this, and editors often have this feature.
2. Use an encoding from golang.org/x/text/encoding/charmap together with NewReader from golang.org/x/text/transform to transform your input to UTF-8, and pass the resulting Reader to bufio.NewScanner.
3. Change the line in the loop to os.Stdout.Write(scanner.Bytes()); fmt.Println(). This might avoid the bytes being interpreted as UTF-8 beyond newline splitting, since writing the bytes directly to os.Stdout passes them along without any (mis)interpretation of the contents.
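
Before picking one of these options, it can help to confirm that the input really is not UTF-8. The following is a minimal sketch using only unicode/utf8; firstInvalidByte is a hypothetical helper, and the two byte slices are stand-ins for a line read from a legacy-encoded file and from a UTF-8 file.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// firstInvalidByte returns the offset of the first byte that does not start
// a valid UTF-8 sequence, or -1 when the whole slice is valid UTF-8.
func firstInvalidByte(b []byte) int {
	for i := 0; i < len(b); {
		r, size := utf8.DecodeRune(b[i:])
		// DecodeRune signals an invalid sequence with RuneError and width 1.
		if r == utf8.RuneError && size == 1 {
			return i
		}
		i += size
	}
	return -1
}

func main() {
	legacy := []byte("CASTA\xd1EDA")      // 'Ñ' as the single byte 0xD1
	utf8Bytes := []byte("CASTA\u00d1EDA") // 'Ñ' as the UTF-8 bytes 0xC3 0x91

	fmt.Println(utf8.Valid(legacy), firstInvalidByte(legacy))       // false 5
	fmt.Println(utf8.Valid(utf8Bytes), firstInvalidByte(utf8Bytes)) // true -1

	// %q escapes bytes that are not valid UTF-8, which makes them easy to spot.
	fmt.Printf("%q\n", legacy)
}
```

Running a check like this on scanner.Bytes() inside the loop shows which lines contain legacy-encoded bytes and at which offset, which helps when guessing the file's actual code page.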
