Unmarshal an ISO-8859-1 XML input in Go

Question

When your XML input isn't encoded in UTF-8, the Unmarshal function of the xml package seems to require a CharsetReader.

Where do you find such a thing?
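To make the symptom concrete, here is a minimal sketch of what happens when a document declares ISO-8859-1 and no CharsetReader is set. The Note type and the sample document are assumptions made up for this illustration; with current Go releases the returned error says that an encoding was declared but Decoder.CharsetReader is nil.

```go
package main

import (
	"encoding/xml"
	"fmt"
)

// Note is a hypothetical target type used only for this illustration.
type Note struct {
	Body string `xml:"body"`
}

// decodeWithoutCharsetReader runs xml.Unmarshal on a document that declares
// ISO-8859-1 and returns whatever error the decoder reports.
func decodeWithoutCharsetReader() error {
	// The byte 0xE9 is "é" in ISO-8859-1.
	doc := []byte("<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><note><body>caf\xe9</body></note>")
	var n Note
	return xml.Unmarshal(doc, &n)
}

func main() {
	fmt.Println(decodeWithoutCharsetReader())
}
```

xml.Unmarshal offers no way to supply the hook, which is why the answers below all build an xml.Decoder and set its CharsetReader field.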

Answer 1

Score: 53

Updated answer for 2015 & beyond:

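import (
	"bytes"
	"encoding/xml"

	"golang.org/x/net/html/charset"
)

// theXml is the raw XML as a []byte; parsed is the destination value.
reader := bytes.NewReader(theXml)
decoder := xml.NewDecoder(reader)
// NewReaderLabel converts from the charset named in the XML declaration to UTF-8.
decoder.CharsetReader = charset.NewReaderLabel
err = decoder.Decode(&parsed)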

Answer 2

Score: 22

Expanding on @anschel-schaffer-cohen's suggestion and @mjibson's comment, the go-charset package (http://code.google.com/p/go-charset/) mentioned in another answer gets you the required result in three lines:

decoder := xml.NewDecoder(reader)
decoder.CharsetReader = charset.NewReader
err = decoder.Decode(&parsed)

Just remember to tell charset where its data files are at some point when the app starts up:

charset.CharsetDir = ".../src/code.google.com/p/go-charset/datafiles"

EDIT

Instead of setting charset.CharsetDir as above, it's more sensible to just import the data files. They are treated as an embedded resource:

import (
    "code.google.com/p/go-charset/charset"
    _ "code.google.com/p/go-charset/data"
    ...
)

go install will just do its thing, which also avoids the deployment headache (where and how do I get data files relative to the executing app?).

Importing with an underscore only runs the package's init() function, which loads the required data into memory.
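The blank-import mechanism can be seen with the standard library alone. The sketch below does not use go-charset. It relies on the fact that crypto/sha256 registers itself with the crypto package in its init(), much as the data package above registers its tables. The helper name sha256Registered is made up for this illustration.

```go
package main

import (
	"crypto"
	_ "crypto/sha256" // imported only for its init(), which registers SHA-256
	"fmt"
)

// sha256Registered reports whether an implementation of SHA-256 has been
// registered with the crypto package.
func sha256Registered() bool {
	return crypto.SHA256.Available()
}

func main() {
	fmt.Println(sha256Registered())
}
```

Nothing in this file calls into crypto/sha256 directly. The registration is purely a side effect of linking the package in.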

Answer 3

Score: 12

Here's a sample Go program which uses a CharsetReader function to convert XML input from ISO-8859-1 to UTF-8. It was written against the pre-Go 1 API (os.Error, xml.NewParser). The program prints the XML comments from the test file.

package main

import (
	"bytes"
	"fmt"
	"io"
	"os"
	"strings"
	"utf8"
	"xml"
)

type CharsetISO88591er struct {
	r   io.ByteReader
	buf *bytes.Buffer
}

func NewCharsetISO88591(r io.Reader) *CharsetISO88591er {
	buf := bytes.NewBuffer(make([]byte, 0, utf8.UTFMax))
	return &CharsetISO88591er{r.(io.ByteReader), buf}
}

func (cs *CharsetISO88591er) ReadByte() (b byte, err os.Error) {
	// http://unicode.org/Public/MAPPINGS/ISO8859/8859-1.TXT
	// Date: 1999 July 27; Last modified: 27-Feb-2001 05:08
	if cs.buf.Len() <= 0 {
		r, err := cs.r.ReadByte()
		if err != nil {
			return 0, err
		}
		if r < utf8.RuneSelf {
			return r, nil
		}
		cs.buf.WriteRune(int(r))
	}
	return cs.buf.ReadByte()
}

func (cs *CharsetISO88591er) Read(p []byte) (int, os.Error) {
	// Use ReadByte method.
	return 0, os.EINVAL
}

func isCharset(charset string, names []string) bool {
	charset = strings.ToLower(charset)
	for _, n := range names {
		if charset == strings.ToLower(n) {
			return true
		}
	}
	return false
}

func IsCharsetISO88591(charset string) bool {
	// http://www.iana.org/assignments/character-sets
	// (last updated 2010-11-04)
	names := []string{
		// Name
		"ISO_8859-1:1987",
		// Alias (preferred MIME name)
		"ISO-8859-1",
		// Aliases
		"iso-ir-100",
		"ISO_8859-1",
		"latin1",
		"l1",
		"IBM819",
		"CP819",
		"csISOLatin1",
	}
	return isCharset(charset, names)
}

func IsCharsetUTF8(charset string) bool {
	names := []string{
		"UTF-8",
		// Default
		"",
	}
	return isCharset(charset, names)
}

func CharsetReader(charset string, input io.Reader) (io.Reader, os.Error) {
	switch {
	case IsCharsetUTF8(charset):
		return input, nil
	case IsCharsetISO88591(charset):
		return NewCharsetISO88591(input), nil
	}
	return nil, os.NewError("CharsetReader: unexpected charset: " + charset)
}

func main() {
	// Print the XML comments from the test file, which should
	// contain most of the printable ISO-8859-1 characters.
	r, err := os.Open("ISO88591.xml")
	if err != nil {
		fmt.Println(err)
		return
	}
	defer r.Close()
	fmt.Println("file:", r.Name())
	p := xml.NewParser(r)
	p.CharsetReader = CharsetReader
	for t, err := p.Token(); t != nil && err == nil; t, err = p.Token() {
		switch t := t.(type) {
		case xml.ProcInst:
			fmt.Println(t.Target, string(t.Inst))
		case xml.Comment:
			fmt.Println(string([]byte(t)))
		}
	}
}

To unmarshal XML with encoding="ISO-8859-1" from an io.Reader r into a structure result, while using the CharsetReader function from the program to translate from ISO-8859-1 to UTF-8, write:

p := xml.NewParser(r)
p.CharsetReader = CharsetReader
err := p.Unmarshal(&result, nil)
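For readers on Go 1 or later, the comment-printing loop could look roughly like the sketch below. It is not the program above ported line by line. It assumes Go 1.16+ (for io.ReadAll), and latin1ToUTF8 and comments are names made up here. For brevity the converter reads the whole remaining input into memory instead of streaming it.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"io"
	"strings"
)

// latin1ToUTF8 is a hypothetical CharsetReader for this sketch. It reads the
// whole input and maps every byte to the rune of the same value, which is
// how ISO-8859-1 relates to Unicode. It rejects every other charset.
func latin1ToUTF8(label string, input io.Reader) (io.Reader, error) {
	if !strings.EqualFold(label, "ISO-8859-1") {
		return nil, fmt.Errorf("unsupported charset: %s", label)
	}
	raw, err := io.ReadAll(input)
	if err != nil {
		return nil, err
	}
	runes := make([]rune, len(raw))
	for i, b := range raw {
		runes[i] = rune(b)
	}
	return strings.NewReader(string(runes)), nil
}

// comments returns the text of every XML comment in doc.
func comments(doc string) ([]string, error) {
	d := xml.NewDecoder(strings.NewReader(doc))
	d.CharsetReader = latin1ToUTF8
	var out []string
	for {
		tok, err := d.Token()
		if err == io.EOF {
			return out, nil
		}
		if err != nil {
			return nil, err
		}
		// string(c) copies the bytes, which are only valid until the next Token call.
		if c, ok := tok.(xml.Comment); ok {
			out = append(out, string(c))
		}
	}
}

func main() {
	doc := "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><r><!-- caf\xe9 --></r>"
	fmt.Println(comments(doc))
}
```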

Answer 4

Score: 7

There appears to be an external library which handles this: go-charset. I haven't tried it myself; does it work for you?

Answer 5

Score: 6

Edit: do not use this, use the go-charset answer.

Here's an updated version of @peterSO's code that works with go1:

package main

import (
	"bytes"
	"io"
	"strings"
)

type CharsetISO88591er struct {
	r   io.ByteReader
	buf *bytes.Buffer
}

func NewCharsetISO88591(r io.Reader) *CharsetISO88591er {
	buf := bytes.Buffer{}
	return &CharsetISO88591er{r.(io.ByteReader), &buf}
}

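func (cs *CharsetISO88591er) Read(p []byte) (n int, err error) {
	// Refill the buffer. Each ISO-8859-1 byte is the Unicode code point
	// of the same value, so WriteRune emits its UTF-8 encoding.
	for range p {
		r, err := cs.r.ReadByte()
		if err != nil {
			break
		}
		cs.buf.WriteRune(rune(r))
	}
	// bytes.Buffer.Read reports io.EOF once the buffer is empty.
	return cs.buf.Read(p)
}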

func isCharset(charset string, names []string) bool {
	charset = strings.ToLower(charset)
	for _, n := range names {
		if charset == strings.ToLower(n) {
			return true
		}
	}
	return false
}

func IsCharsetISO88591(charset string) bool {
	// http://www.iana.org/assignments/character-sets
	// (last updated 2010-11-04)
	names := []string{
		// Name
		"ISO_8859-1:1987",
		// Alias (preferred MIME name)
		"ISO-8859-1",
		// Aliases
		"iso-ir-100",
		"ISO_8859-1",
		"latin1",
		"l1",
		"IBM819",
		"CP819",
		"csISOLatin1",
	}
	return isCharset(charset, names)
}

func CharsetReader(charset string, input io.Reader) (io.Reader, error) {
	if IsCharsetISO88591(charset) {
		return NewCharsetISO88591(input), nil
	}
	return input, nil
}

Called with:

d := xml.NewDecoder(reader)
d.CharsetReader = CharsetReader
err := d.Decode(&dst)
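The Read method above discards any read error and relies on the empty buffer to signal end of input. If you do want a hand-rolled converter, a streaming variant that keeps the first error and reports it once all converted bytes are delivered might look like the sketch below. latin1Reader, newLatin1Reader, charsetReader, Note and decodeBody are illustrative names, not part of any library, and only the label "ISO-8859-1" is recognized here.

```go
package main

import (
	"bufio"
	"bytes"
	"encoding/xml"
	"fmt"
	"io"
	"strings"
	"unicode/utf8"
)

// latin1Reader converts an ISO-8859-1 byte stream to UTF-8 on the fly.
type latin1Reader struct {
	src     io.ByteReader
	buf     [utf8.UTFMax]byte
	pending []byte // encoded bytes not yet handed to the caller
	err     error  // first error from src, reported once pending is drained
}

func newLatin1Reader(r io.Reader) *latin1Reader {
	br, ok := r.(io.ByteReader)
	if !ok {
		br = bufio.NewReader(r)
	}
	return &latin1Reader{src: br}
}

func (l *latin1Reader) Read(p []byte) (int, error) {
	n := 0
	for n < len(p) {
		if len(l.pending) == 0 {
			if l.err != nil {
				break
			}
			b, err := l.src.ReadByte()
			if err != nil {
				l.err = err
				break
			}
			// An ISO-8859-1 byte is the Unicode code point of the same value.
			size := utf8.EncodeRune(l.buf[:], rune(b))
			l.pending = l.buf[:size]
		}
		c := copy(p[n:], l.pending)
		n += c
		l.pending = l.pending[c:]
	}
	if n == 0 && len(p) > 0 {
		return 0, l.err
	}
	return n, nil
}

// charsetReader has the signature expected by xml.Decoder.CharsetReader.
func charsetReader(label string, input io.Reader) (io.Reader, error) {
	if strings.EqualFold(label, "ISO-8859-1") {
		return newLatin1Reader(input), nil
	}
	return nil, fmt.Errorf("unsupported charset: %s", label)
}

// Note is a hypothetical target type for the example document.
type Note struct {
	Body string `xml:"body"`
}

func decodeBody(doc []byte) (string, error) {
	d := xml.NewDecoder(bytes.NewReader(doc))
	d.CharsetReader = charsetReader
	var n Note
	if err := d.Decode(&n); err != nil {
		return "", err
	}
	return n.Body, nil
}

func main() {
	doc := []byte("<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><note><body>caf\xe9</body></note>")
	fmt.Println(decodeBody(doc))
}
```

Unlike the CharsetReader above, this one rejects unknown labels instead of passing the input through unchanged, so a document in some other encoding fails loudly rather than producing mojibake.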

Answer 6

Score: 0

There aren't any provided in the Go distribution at the moment, or anywhere else I can find. That's not surprising, as the hook is less than a month old at the time of writing.

Since a CharsetReader is defined as CharsetReader func(charset string, input io.Reader) (io.Reader, os.Error), you could make your own.
There's one example in the tests, but that might not be exactly useful to you.
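In Go 1 and later the hook has the same shape, with error in place of os.Error. The sketch below probes when the decoder actually calls it. The helper name probe and the sample documents are assumptions for this illustration. It relies on encoding/xml skipping the hook for documents that declare UTF-8 and reporting the hook's error to the caller otherwise.

```go
package main

import (
	"encoding/xml"
	"errors"
	"fmt"
	"io"
	"strings"
)

// probe decodes doc with a CharsetReader that records the label it was
// asked for and then refuses it. It returns the recorded labels and the
// error reported by the decoder.
func probe(doc string) (labels []string, err error) {
	d := xml.NewDecoder(strings.NewReader(doc))
	d.CharsetReader = func(label string, input io.Reader) (io.Reader, error) {
		labels = append(labels, label)
		return nil, errors.New("unsupported charset")
	}
	var v struct{}
	err = d.Decode(&v)
	return labels, err
}

func main() {
	fmt.Println(probe("<?xml version=\"1.0\" encoding=\"UTF-8\"?><r/>"))
	fmt.Println(probe("<?xml version=\"1.0\" encoding=\"KOI8-R\"?><r/>"))
}
```

A custom CharsetReader therefore only has to handle the non-UTF-8 encodings you expect, and can return an error for everything else.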

huangapple
  • Posted on 2011-05-14 23:00:16
  • Please keep this link when reposting: https://go.coder-hub.com/6002619.html