Unmarshal an ISO-8859-1 XML input in Go
Question
When your XML input isn't encoded in UTF-8, the Unmarshal function of the xml package seems to require a CharsetReader.
Where do you find such a thing?
Answer 1
Score: 53
Updated answer for 2015 & beyond:
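import (
	"bytes"
	"encoding/xml"

	"golang.org/x/net/html/charset"
)

// theXml is a []byte holding the raw document; parsed is the destination value.
reader := bytes.NewReader(theXml)
decoder := xml.NewDecoder(reader)
// NewReaderLabel picks a converter based on the encoding named in the XML declaration.
decoder.CharsetReader = charset.NewReaderLabel
err = decoder.Decode(&parsed)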
Answer 2
Score: 22
Expanding on @anschel-schaffer-cohen's suggestion and @mjibson's comment, the go-charset package (http://code.google.com/p/go-charset/) mentioned in another answer lets you achieve the required result in three lines:
decoder := xml.NewDecoder(reader)
decoder.CharsetReader = charset.NewReader
err = decoder.Decode(&parsed)
Just remember to tell charset where its data files are by setting
charset.CharsetDir = ".../src/code.google.com/p/go-charset/datafiles"
at some point when the app starts up.
EDIT
Instead of setting charset.CharsetDir as above, it is more sensible to import the data files directly. They are treated as an embedded resource:
import (
	"code.google.com/p/go-charset/charset"
	_ "code.google.com/p/go-charset/data"
	...
)
go install will then just do its thing, which also avoids the deployment headache (where and how do I find the data files relative to the running app?).
An import with an underscore only runs the package's init() function, which loads the required data into memory.
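The underscore-import mechanism is not specific to go-charset. As a minimal sketch using only the standard library (chosen purely for illustration, not taken from the answer): crypto/sha256 registers itself with package crypto from its init(), so a blank import is enough to make crypto.SHA256 usable. The helper names sha256Registered and digest are made up for this example.

```go
package main

import (
	"crypto"
	// Blank import: nothing from the package is referenced by name,
	// but its init() runs and registers SHA-256 with package crypto.
	_ "crypto/sha256"
	"fmt"
)

// sha256Registered reports whether a SHA-256 implementation is linked in.
// (Hypothetical helper, for illustration only.)
func sha256Registered() bool {
	return crypto.SHA256.Available()
}

// digest hashes data through the registry, never naming crypto/sha256.
// (Hypothetical helper, for illustration only.)
func digest(data []byte) []byte {
	h := crypto.SHA256.New()
	h.Write(data)
	return h.Sum(nil)
}

func main() {
	fmt.Println("registered:", sha256Registered())
	fmt.Printf("%x\n", digest([]byte("abc")))
}
```

The go-charset data package described above follows the same pattern: the import exists only for the side effect of its init().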
Answer 3
Score: 12
Here's a sample Go program which uses a CharsetReader function to convert XML input from ISO-8859-1 to UTF-8. The program prints the XML comments found in the test file.
// Note: this program targets a pre-Go 1 release; it relies on os.Error,
// the top-level "utf8" and "xml" packages, and xml.NewParser.
package main

import (
	"bytes"
	"fmt"
	"io"
	"os"
	"strings"
	"utf8"
	"xml"
)

type CharsetISO88591er struct {
	r   io.ByteReader
	buf *bytes.Buffer
}

func NewCharsetISO88591(r io.Reader) *CharsetISO88591er {
	buf := bytes.NewBuffer(make([]byte, 0, utf8.UTFMax))
	return &CharsetISO88591er{r.(io.ByteReader), buf}
}

func (cs *CharsetISO88591er) ReadByte() (b byte, err os.Error) {
	// http://unicode.org/Public/MAPPINGS/ISO8859/8859-1.TXT
	// Date: 1999 July 27; Last modified: 27-Feb-2001 05:08
	if cs.buf.Len() <= 0 {
		r, err := cs.r.ReadByte()
		if err != nil {
			return 0, err
		}
		if r < utf8.RuneSelf {
			return r, nil
		}
		cs.buf.WriteRune(int(r))
	}
	return cs.buf.ReadByte()
}

func (cs *CharsetISO88591er) Read(p []byte) (int, os.Error) {
	// Use ReadByte method.
	return 0, os.EINVAL
}

func isCharset(charset string, names []string) bool {
	charset = strings.ToLower(charset)
	for _, n := range names {
		if charset == strings.ToLower(n) {
			return true
		}
	}
	return false
}

func IsCharsetISO88591(charset string) bool {
	// http://www.iana.org/assignments/character-sets
	// (last updated 2010-11-04)
	names := []string{
		// Name
		"ISO_8859-1:1987",
		// Alias (preferred MIME name)
		"ISO-8859-1",
		// Aliases
		"iso-ir-100",
		"ISO_8859-1",
		"latin1",
		"l1",
		"IBM819",
		"CP819",
		"csISOLatin1",
	}
	return isCharset(charset, names)
}

func IsCharsetUTF8(charset string) bool {
	names := []string{
		"UTF-8",
		// Default
		"",
	}
	return isCharset(charset, names)
}

func CharsetReader(charset string, input io.Reader) (io.Reader, os.Error) {
	switch {
	case IsCharsetUTF8(charset):
		return input, nil
	case IsCharsetISO88591(charset):
		return NewCharsetISO88591(input), nil
	}
	return nil, os.NewError("CharsetReader: unexpected charset: " + charset)
}

func main() {
	// Print the XML comments from the test file, which should
	// contain most of the printable ISO-8859-1 characters.
	r, err := os.Open("ISO88591.xml")
	if err != nil {
		fmt.Println(err)
		return
	}
	defer r.Close()
	fmt.Println("file:", r.Name())
	p := xml.NewParser(r)
	p.CharsetReader = CharsetReader
	for t, err := p.Token(); t != nil && err == nil; t, err = p.Token() {
		switch t := t.(type) {
		case xml.ProcInst:
			fmt.Println(t.Target, string(t.Inst))
		case xml.Comment:
			fmt.Println(string([]byte(t)))
		}
	}
}
To unmarshal XML with encoding="ISO-8859-1" from an io.Reader r into a structure result, while using the CharsetReader function from the program to translate from ISO-8859-1 to UTF-8, write:
p := xml.NewParser(r)
p.CharsetReader = CharsetReader
err := p.Unmarshal(&result, nil)
Answer 4
Score: 7
There appears to be an external library which handles this: go-charset. I haven't tried it myself; does it work for you?
Answer 5
Score: 6
Edit: do not use this, use the go-charset answer.
Here's an updated version of @peterSO's code that works with go1:
package main

import (
	"bytes"
	"io"
	"strings"
)

type CharsetISO88591er struct {
	r   io.ByteReader
	buf *bytes.Buffer
}

func NewCharsetISO88591(r io.Reader) *CharsetISO88591er {
	// The type assertion panics if r does not implement io.ByteReader.
	return &CharsetISO88591er{r.(io.ByteReader), &bytes.Buffer{}}
}

func (cs *CharsetISO88591er) Read(p []byte) (n int, err error) {
	// Pull up to len(p) source bytes. Each ISO-8859-1 byte equals the Unicode
	// code point of the same value, so WriteRune emits its UTF-8 encoding.
	for range p {
		r, err := cs.r.ReadByte()
		if err != nil {
			break
		}
		cs.buf.WriteRune(rune(r))
	}
	// Yields io.EOF once no more source bytes can be read and the buffer is drained.
	return cs.buf.Read(p)
}

func isCharset(charset string, names []string) bool {
	charset = strings.ToLower(charset)
	for _, n := range names {
		if charset == strings.ToLower(n) {
			return true
		}
	}
	return false
}

func IsCharsetISO88591(charset string) bool {
	// http://www.iana.org/assignments/character-sets
	// (last updated 2010-11-04)
	names := []string{
		// Name
		"ISO_8859-1:1987",
		// Alias (preferred MIME name)
		"ISO-8859-1",
		// Aliases
		"iso-ir-100",
		"ISO_8859-1",
		"latin1",
		"l1",
		"IBM819",
		"CP819",
		"csISOLatin1",
	}
	return isCharset(charset, names)
}

func CharsetReader(charset string, input io.Reader) (io.Reader, error) {
	if IsCharsetISO88591(charset) {
		return NewCharsetISO88591(input), nil
	}
	// Any other charset is passed through unconverted.
	return input, nil
}
Called with:
d := xml.NewDecoder(reader)
d.CharsetReader = CharsetReader
err := d.Decode(&dst)
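If you only ever need ISO-8859-1 and would rather not pull in a dependency, the same idea can be sketched against the standard library alone. This is an illustration under stated assumptions, not a replacement for the charset packages recommended above: the names latin1Reader, latin1CharsetReader, decodeLatin1 and the note struct are all made up for this example, and only a handful of charset labels are recognised.

```go
package main

import (
	"bytes"
	"encoding/xml"
	"fmt"
	"io"
	"strings"
	"unicode/utf8"
)

// latin1Reader converts an ISO-8859-1 byte stream to UTF-8 on the fly.
type latin1Reader struct {
	src     io.Reader
	in      [512]byte
	out     [1024]byte // each input byte becomes at most 2 UTF-8 bytes
	pending []byte     // part of out not yet handed to the caller
	err     error      // sticky error from src
}

func (l *latin1Reader) Read(p []byte) (int, error) {
	if len(l.pending) == 0 && l.err == nil {
		var n int
		n, l.err = l.src.Read(l.in[:])
		w := 0
		for _, b := range l.in[:n] {
			// An ISO-8859-1 byte has the same value as its Unicode code point.
			w += utf8.EncodeRune(l.out[w:], rune(b))
		}
		l.pending = l.out[:w]
	}
	if len(l.pending) == 0 {
		return 0, l.err
	}
	n := copy(p, l.pending)
	l.pending = l.pending[n:]
	return n, nil
}

// latin1CharsetReader has the signature xml.Decoder.CharsetReader expects.
func latin1CharsetReader(label string, input io.Reader) (io.Reader, error) {
	switch strings.ToLower(strings.TrimSpace(label)) {
	case "iso-8859-1", "iso_8859-1", "latin1", "l1":
		return &latin1Reader{src: input}, nil
	case "utf-8", "":
		return input, nil
	}
	return nil, fmt.Errorf("unsupported charset %q", label)
}

// note is a hypothetical target type for the demo document.
type note struct {
	Body string `xml:"body"`
}

func decodeLatin1(doc []byte) (note, error) {
	var n note
	d := xml.NewDecoder(bytes.NewReader(doc))
	d.CharsetReader = latin1CharsetReader
	err := d.Decode(&n)
	return n, err
}

func main() {
	// \xe9 is "é" in ISO-8859-1.
	doc := []byte("<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><note><body>caf\xe9</body></note>")
	n, err := decodeLatin1(doc)
	if err != nil {
		fmt.Println("decode failed:", err)
		return
	}
	fmt.Println(n.Body)
}
```

Unlike the version above, this one rejects charsets it does not know instead of passing them through unconverted, and it keeps the first read error rather than treating every error as end of input.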
Answer 6
Score: 0
There aren't any provided in the Go distribution at the moment, or anywhere else I can find. That's not surprising, as the hook is less than a month old at the time of writing.
Since CharsetReader is defined as CharsetReader func(charset string, input io.Reader) (io.Reader, os.Error), you could write your own.
There's one example in the tests, but it might not be very useful to you.
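In Go 1 and later the hook has the same shape, except that it returns the built-in error type instead of os.Error. A minimal sketch of writing your own, assuming you only need to accept US-ASCII (a strict subset of UTF-8, so the input can be passed through untouched); asciiCharsetReader and rootName are hypothetical names for this example:

```go
package main

import (
	"encoding/xml"
	"fmt"
	"io"
	"strings"
)

// asciiCharsetReader matches the type of xml.Decoder.CharsetReader.
// It accepts only US-ASCII and hands the input back unchanged.
func asciiCharsetReader(label string, input io.Reader) (io.Reader, error) {
	if strings.EqualFold(label, "us-ascii") {
		return input, nil
	}
	return nil, fmt.Errorf("unsupported charset %q", label)
}

// rootName returns the local name of the first start element in doc.
func rootName(doc string) (string, error) {
	d := xml.NewDecoder(strings.NewReader(doc))
	d.CharsetReader = asciiCharsetReader
	for {
		tok, err := d.Token()
		if err != nil {
			return "", err
		}
		if se, ok := tok.(xml.StartElement); ok {
			return se.Name.Local, nil
		}
	}
}

func main() {
	name, err := rootName(`<?xml version="1.0" encoding="US-ASCII"?><greeting>hi</greeting>`)
	fmt.Println(name, err)

	// An encoding the hook does not recognise surfaces as a decode error.
	_, err = rootName(`<?xml version="1.0" encoding="KOI8-R"?><greeting>hi</greeting>`)
	fmt.Println(err != nil)
}
```

The decoder calls the hook only when the XML declaration names an encoding other than UTF-8, so UTF-8 documents never reach it.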