Output unquoted Unicode in Go
Question
I'm using goyaml as a YAML beautifier. By loading and dumping a YAML file, I can source-format it. I unmarshal the data from a YAML source file into a struct, marshal that struct back into bytes, and write the bytes to an output file. But the process turns my Unicode strings into quoted strings full of \u escape sequences, and I don't know how to reverse it.
Example input subtitle.yaml:
line: 你好
I've stripped everything down to the smallest reproducible problem. Here's the code, using _ to discard the returned errors (none of them occur here):
package main

import (
	"io/ioutil"
	//"unicode/utf8"
	//"fmt"

	// Aliased so that the goyaml.* calls below compile; the package itself is named yaml.
	goyaml "gopkg.in/yaml.v1"
)

type Subtitle struct {
	Line string
}

func main() {
	filename := "subtitle.yaml"
	in, _ := ioutil.ReadFile(filename)
	var subtitle Subtitle
	_ = goyaml.Unmarshal(in, &subtitle)
	out, _ := goyaml.Marshal(&subtitle)
	//for len(out) > 0 { // For debugging, see what the runes are
	//	r, size := utf8.DecodeRune(out)
	//	fmt.Printf("%c ", r)
	//	out = out[size:]
	//}
	_ = ioutil.WriteFile(filename, out, 0644)
}
Actual output subtitle.yaml:
line: "\u4F60\u597D"
I want to reverse the weirdness in goyaml after I get the variable out.
The commented-out rune-printing code block, which adds spaces between runes for clarity, outputs the following. It shows that Unicode runes like 你 are no longer present as runes in out; each one has been replaced by a literal backslash escape sequence:
l i n e : " \ u 4 F 6 0 \ u 5 9 7 D "
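To make the difference concrete without depending on goyaml, here is a minimal standalone sketch of the same rune dump applied to two hard-coded byte slices: one holding the escaped form that Marshal produced, and one holding the raw character. The helper name runeDump is hypothetical and only for illustration.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// runeDump (hypothetical helper) decodes b rune by rune and returns the
// runes separated by single spaces, like the debugging loop in the question.
func runeDump(b []byte) string {
	var parts []string
	for len(b) > 0 {
		r, size := utf8.DecodeRune(b)
		parts = append(parts, string(r))
		b = b[size:]
	}
	return strings.Join(parts, " ")
}

func main() {
	escaped := []byte(`"\u4F60"`) // six ASCII runes between the quotes
	raw := []byte(`"你"`)          // one three-byte rune between the quotes

	fmt.Println(runeDump(escaped)) // prints: " \ u 4 F 6 0 "
	fmt.Println(runeDump(raw))     // prints: " 你 "
}
```

The escaped slice contains only ASCII, so nothing is left for a UTF-8 decoder to decode; the escape sequences have to be expanded explicitly.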
How can I unquote out, before writing it to the output file, so that the output looks like the input (albeit beautified)?
Desired output subtitle.yaml:
line: "你好"
Temporary Solution
I've filed https://github.com/go-yaml/yaml/issues/11. In the meantime, @bobince's tip on yaml_emitter_set_unicode was helpful in uncovering the problem. The function is defined in the port of the C library but is never called, and there is no option to enable it! I changed encode.go and added yaml_emitter_set_unicode(&e.emitter, true) at line 20, and everything works as expected. It would be better to make this optional, but that would require a change to the Marshal API.
Answer 1
Score: 1
I had a similar issue and used this to work around the bug in goyaml.Marshal(). (*Regexp) ReplaceAllFunc is your friend: you can use it to expand the escaped Unicode runes in the byte slice. It may be a little too dirty for production, but it works for the example:
package main

import (
	"io/ioutil"
	"regexp"
	"strconv"
	"unicode/utf8"

	"launchpad.net/goyaml"
)

type Subtitle struct {
	Line string
}

// reFind matches a whole `key: "value"` line whose value contains a \u escape.
// (?m) makes ^ and $ match per line, so this also works on multi-line output.
var reFind = regexp.MustCompile(`(?m)^\s*[^\s\:]+\:\s*".*\\u.*"\s*$`)

// reFindU matches a single \uXXXX escape sequence.
var reFindU = regexp.MustCompile(`\\u[0-9a-fA-F]{4}`)

func expandUnicodeInYamlLine(line []byte) []byte {
	// TODO: restrict this to the quoted string value
	return reFindU.ReplaceAllFunc(line, expandUnicodeRune)
}

// expandUnicodeRune turns the six bytes of `\uXXXX` into the UTF-8 encoding of that rune.
func expandUnicodeRune(esc []byte) []byte {
	ri, _ := strconv.ParseInt(string(esc[2:]), 16, 32)
	r := rune(ri)
	// A UTFMax-sized buffer avoids a negative length when r is not a valid rune.
	repr := make([]byte, utf8.UTFMax)
	n := utf8.EncodeRune(repr, r)
	return repr[:n]
}

func main() {
	filename := "subtitle.yaml"
	filenameOut := "subtitleout.yaml"
	in, _ := ioutil.ReadFile(filename)
	var subtitle Subtitle
	_ = goyaml.Unmarshal(in, &subtitle)
	out, _ := goyaml.Marshal(&subtitle)
	out = reFind.ReplaceAllFunc(out, expandUnicodeInYamlLine)
	_ = ioutil.WriteFile(filenameOut, out, 0644)
}
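The TODO above hints that a plain \uXXXX search can over-match. As a hedged variation on the same ReplaceAllFunc idea, the sketch below uses only the standard library so it can be run on a hard-coded sample instead of real goyaml output. It assumes the emitter writes \uXXXX and \UXXXXXXXX escapes inside double-quoted scalars, skips an escaped backslash (so a literal `\\u4F60` in the data is left alone), and only expands printable runes. The names reEscape and expandYAMLEscapes are hypothetical, and the function still looks at the whole byte slice rather than only at double-quoted values.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
	"unicode"
)

// reEscape tries the escaped backslash first, so the backslash pair in
// `\\u4F60` is consumed before the \u alternative can see it.
var reEscape = regexp.MustCompile(`\\\\|\\u[0-9a-fA-F]{4}|\\U[0-9a-fA-F]{8}`)

// expandYAMLEscapes (hypothetical helper) replaces \uXXXX and \UXXXXXXXX
// escapes with the UTF-8 encoding of the rune they name.
func expandYAMLEscapes(out []byte) []byte {
	return reEscape.ReplaceAllFunc(out, func(esc []byte) []byte {
		if esc[1] == '\\' {
			return esc // escaped backslash: leave untouched
		}
		n, err := strconv.ParseUint(string(esc[2:]), 16, 32)
		if err != nil {
			return esc
		}
		r := rune(n)
		if !unicode.IsPrint(r) {
			return esc // keep control characters and invalid runes escaped
		}
		return []byte(string(r))
	})
}

func main() {
	sample := []byte(`line: "\u4F60\u597D"`)
	fmt.Println(string(expandYAMLEscapes(sample))) // prints: line: "你好"
}
```

Keeping non-printable runes escaped is a deliberate choice in this sketch: expanding something like \u0007 would put a raw control character into the file, which the emitter escaped for a reason.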