Read unicode characters with bufio scanner in Go
Question
I'm trying to read a plain text file that contains names like "CASTAÑEDA".
The code is basically this:
file, err := os.Open("C:/Files/file.txt")
if err != nil {
	log.Fatal(err)
}
// Defer Close only after the error check, once file is known to be valid.
defer file.Close()
scanner := bufio.NewScanner(file)
for scanner.Scan() {
	fmt.Println(scanner.Text())
}
Then, when "CASTAÑEDA" is read, it prints "CASTA�EDA".
Is there any way to handle such characters when reading with bufio?
Thanks.
Answer 1
Score: 9
Your file is most probably not UTF-8 encoded. Because of that (Go expects all strings to be UTF-8), your console output looks mangled. In your case I would advise using the packages golang.org/x/text/encoding/charmap and golang.org/x/text/transform to convert the file's data to UTF-8. Judging by your file path, I presume you are on Windows, so your character encoding might be Windows-1252 (for example, if you edited the file with notepad.exe).
Try something like this:
package main

import (
	"bufio"
	"fmt"
	"log"
	"os"

	"golang.org/x/text/encoding/charmap"
	"golang.org/x/text/transform"
)

func main() {
	file, err := os.Open("C:/temp/file.txt")
	if err != nil {
		log.Fatal(err)
	}
	defer file.Close()

	// Wrap the file so its bytes are decoded to UTF-8 as they are read.
	dec := transform.NewReader(file, charmap.Windows1252.NewDecoder()) // insert your encoding here
	scanner := bufio.NewScanner(dec)
	for scanner.Scan() {
		fmt.Println(scanner.Text())
	}
	if err := scanner.Err(); err != nil {
		log.Fatal(err)
	}
}
You can find more encodings in the package golang.org/x/text/encoding/charmap and substitute whichever one matches your file into the example.
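If adding golang.org/x/text is not an option, the following is a minimal standard-library sketch of the same idea. It swaps in a different technique: it assumes the file is ISO-8859-1 (Latin-1), where every byte value equals its Unicode code point. The helper name decodeLatin1 and the in-memory sample input are hypothetical and stand in for the real file.

```go
package main

import (
	"bufio"
	"bytes"
	"fmt"
)

// decodeLatin1 converts ISO-8859-1 bytes to a UTF-8 string.
// It relies on each ISO-8859-1 byte value being equal to its Unicode code point.
func decodeLatin1(b []byte) string {
	runes := make([]rune, len(b))
	for i, c := range b {
		runes[i] = rune(c)
	}
	return string(runes)
}

func main() {
	// Hypothetical stand-in for the file contents: "Ñ" stored as the single byte 0xD1.
	input := []byte("CASTA\xd1EDA\nLOPEZ\n")

	scanner := bufio.NewScanner(bytes.NewReader(input))
	for scanner.Scan() {
		// Decode the raw line bytes instead of treating them as UTF-8.
		fmt.Println(decodeLatin1(scanner.Bytes()))
	}
	if err := scanner.Err(); err != nil {
		fmt.Println("scan error:", err)
	}
}
```

Windows-1252 matches ISO-8859-1 for "Ñ" but differs in the range 0x80 to 0x9F (for example, the euro sign), so for files that use those bytes the charmap decoder is likely the safer choice.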
Answer 2
Score: 4
The issue you're encountering is that your input is likely not UTF-8, which is what bufio and most of the Go language and standard library expect. Instead, your input probably uses some extended-ASCII code page. That is why the unaccented characters pass through cleanly (UTF-8 is a superset of 7-bit ASCII) while the 'Ñ' does not.
In this situation, the byte representation of the accented character is not valid UTF-8, so the Unicode replacement character (U+FFFD) is produced. You've got a few options:
- Convert your input files to UTF-8 before passing them to Go. There are many utilities that can do this, and editors often have this feature.
- Try using golang.org/x/text/encoding/charmap together with NewReader from golang.org/x/text/transform to transform your input to UTF-8. Pass the resulting Reader to bufio.NewScanner.
- Change the line in the loop to os.Stdout.Write(scanner.Bytes()); fmt.Println(). This might avoid the bytes being interpreted as UTF-8 beyond newline splitting, and writing the bytes directly to os.Stdout further avoids any (mis)interpretation of the contents.
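To see why the replacement character shows up, here is a small sketch using only the standard library. It assumes "Ñ" was stored as the single byte 0xD1, as in Windows-1252 or ISO-8859-1. The helper name describe is hypothetical.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// describe reports whether b is valid UTF-8 and returns the runes
// Go produces when decoding b as UTF-8.
func describe(b []byte) (bool, []rune) {
	return utf8.Valid(b), []rune(string(b))
}

func main() {
	// Assumed single-byte encoding: 0xD1 is "Ñ" in Windows-1252 and ISO-8859-1.
	legacy := []byte("CASTA\xd1EDA")
	// The same name as proper UTF-8, where "Ñ" takes two bytes.
	proper := []byte("CASTAÑEDA")

	for _, b := range [][]byte{legacy, proper} {
		ok, runes := describe(b)
		fmt.Printf("bytes=% x valid=%v runes=%q\n", b, ok, runes)
	}
}
```

In the legacy input, the byte 0xD1 looks like the start of a two-byte UTF-8 sequence, but the following byte is not a continuation byte, so decoding yields U+FFFD at that position.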