Email subject and header decoding in different charsets such as ISO-2022-JP, GB-2312, etc.

Question

I am working on a project that needs to encode and decode email headers in different charsets. The Python code for this looks like the following:

    from email.header import Header, decode_header, make_header
    from charset import text_to_utf8

    class ....
        def decode_header(self, header):
            # The library's decode_header() returns (bytes, charset) pairs;
            # only the first pair is examined here.
            decoded_header = decode_header(header)
            if decoded_header[0][1] is None:
                # No charset was declared in the header.
                return text_to_utf8(decoded_header[0][0]).decode("utf-8", "replace")
            else:
                # Rewrite "windows-NNNN" labels as "cpNNNN" before decoding.
                return decoded_header[0][0].decode(decoded_header[0][1].replace("windows-", "cp"), "replace")

Basically, for text like "=?iso-2022-jp?b?GyRCRW1CQE86GyhCIDxtb21vQHRhcm8ubmUuanA=?=", the "decode_header" function extracts the declared encoding, 'iso-2022-jp', and the "decode" method then converts the bytes from that charset to Unicode.
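To make the structure of such an encoded-word concrete, here is a minimal Go sketch (Go being the target language of this question) that splits the sample into its charset, transfer encoding, and payload. The helper name splitEncodedWord is hypothetical and not part of any library; it only illustrates the "=?charset?encoding?payload?=" layout and does no charset conversion.

```go
package main

import (
	"encoding/base64"
	"fmt"
	"strings"
)

// splitEncodedWord breaks an encoded-word of the form
// "=?charset?encoding?payload?=" into its three fields.
// Hypothetical helper for illustration only.
func splitEncodedWord(word string) (charset, enc, payload string, ok bool) {
	if len(word) < 4 || !strings.HasPrefix(word, "=?") || !strings.HasSuffix(word, "?=") {
		return "", "", "", false
	}
	fields := strings.Split(word[2:len(word)-2], "?")
	if len(fields) != 3 {
		return "", "", "", false
	}
	return strings.ToLower(fields[0]), strings.ToLower(fields[1]), fields[2], true
}

func main() {
	word := "=?iso-2022-jp?b?GyRCRW1CQE86GyhCIDxtb21vQHRhcm8ubmUuanA=?="
	charset, enc, payload, ok := splitEncodedWord(word)
	if !ok {
		fmt.Println("not an encoded-word")
		return
	}
	fmt.Println("charset: ", charset)
	fmt.Println("encoding:", enc)

	// "b" means base64. The decoded bytes are still in ISO-2022-JP,
	// so a charset conversion step is needed before they are valid UTF-8.
	raw, err := base64.StdEncoding.DecodeString(payload)
	if err != nil {
		fmt.Println("bad base64:", err)
		return
	}
	fmt.Printf("decoded %d raw bytes, starting with % x\n", len(raw), raw[:3])
}
```

The remaining step, turning the raw bytes into UTF-8, is exactly the part that needs a charset library.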

Now, in Go, I can do something similar:

    import "mime"

    dec := new(mime.WordDecoder)
    text := "=?utf-8?q?=C3=89ric?= <eric@example.org>, =?utf-8?q?Ana=C3=AFs?= <anais@example.org>"
    header, err := dec.DecodeHeader(text)

It seems that mime.WordDecoder allows plugging in a charset decoder "hook":

    type WordDecoder struct {
        // CharsetReader, if non-nil, defines a function to generate
        // charset-conversion readers, converting from the provided
        // charset into UTF-8.
        // Charsets are always lower-case. utf-8, iso-8859-1 and us-ascii charsets
        // are handled by default.
        // One of the CharsetReader's result values must be non-nil.
        CharsetReader func(charset string, input io.Reader) (io.Reader, error)
    }
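As an illustration of how this hook is called, here is a minimal standard-library-only sketch that supports one extra charset, ISO-8859-15, by hand. The names latin9Overrides, charsetReader, and decodeSubject are made up for this example, and a real project would more likely delegate to a charset library instead of a hand-written table.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"mime"
)

// latin9Overrides lists the eight bytes where ISO-8859-15 differs from
// ISO-8859-1; every other byte maps to the rune with the same value.
var latin9Overrides = map[byte]rune{
	0xA4: '€', 0xA6: 'Š', 0xA8: 'š', 0xB4: 'Ž',
	0xB8: 'ž', 0xBC: 'Œ', 0xBD: 'œ', 0xBE: 'Ÿ',
}

// charsetReader is a hypothetical hook for mime.WordDecoder. The decoder
// passes the charset in lower case and expects a reader that yields UTF-8.
func charsetReader(charset string, input io.Reader) (io.Reader, error) {
	if charset != "iso-8859-15" {
		return nil, fmt.Errorf("unsupported charset %q", charset)
	}
	raw, err := io.ReadAll(input)
	if err != nil {
		return nil, err
	}
	var out bytes.Buffer
	for _, b := range raw {
		if r, ok := latin9Overrides[b]; ok {
			out.WriteRune(r)
		} else {
			out.WriteRune(rune(b))
		}
	}
	return &out, nil
}

// decodeSubject decodes a header value with the hook installed.
func decodeSubject(s string) (string, error) {
	dec := mime.WordDecoder{CharsetReader: charsetReader}
	return dec.DecodeHeader(s)
}

func main() {
	subject, err := decodeSubject("=?ISO-8859-15?Q?100_=A4?=")
	if err != nil {
		fmt.Println("decode failed:", err)
		return
	}
	fmt.Println(subject)
}
```

The hook is only consulted for charsets that the decoder does not handle by default, so utf-8, iso-8859-1, and us-ascii words never reach it.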

I am wondering whether there is a library that lets me convert arbitrary charsets, like the "decode" function in Python shown in the example above. I don't want to write a big "switch-case" like the one used in mime/encodedword.go:

    func (d *WordDecoder) convert(buf *bytes.Buffer, charset string, content []byte) error {
        switch {
        case strings.EqualFold("utf-8", charset):
            buf.Write(content)
        case strings.EqualFold("iso-8859-1", charset):
            for _, c := range content {
                buf.WriteRune(rune(c))
            }
        ....

Any help would be greatly appreciated.

Thanks.

Answer 1

Score: 1

Thanks. It seems that the package golang.org/x/net/html/charset already provides a map of the available encodings. The following code works for me:

    import (
        "fmt"
        "io"
        "mime"

        "golang.org/x/net/html/charset"
    )

    // charset.Lookup already accepts "windows-125x" labels and returns nil for unknown ones.
    CharsetReader := func(label string, input io.Reader) (io.Reader, error) {
        encoding, _ := charset.Lookup(label)
        if encoding == nil {
            return nil, fmt.Errorf("unsupported charset %q", label)
        }
        return encoding.NewDecoder().Reader(input), nil
    }
    dec := mime.WordDecoder{CharsetReader: CharsetReader}
    text := "=?iso-2022-jp?b?GyRCRW1CQE86GyhCIDxtb21vQHRhcm8ubmUuanA=?="
    header, err := dec.DecodeHeader(text)
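The snippet above leaves err unchecked. One possible way to handle it, loosely in the spirit of the lenient "replace" mode in the Python version, is to fall back to the raw header when decoding fails. The sketch below uses only the standard library, and the helper name decodeOrRaw is hypothetical; with no CharsetReader installed, any charset other than utf-8, iso-8859-1, and us-ascii takes the fallback path.

```go
package main

import (
	"fmt"
	"mime"
)

// decodeOrRaw tries to decode RFC 2047 encoded-words in header and
// returns the header unchanged if the decoder reports an error.
// Hypothetical helper for illustration only.
func decodeOrRaw(header string) string {
	// No CharsetReader is set here; plug one in to support more charsets.
	dec := new(mime.WordDecoder)
	decoded, err := dec.DecodeHeader(header)
	if err != nil {
		return header
	}
	return decoded
}

func main() {
	// Handled by default, so this prints the decoded text.
	fmt.Println(decodeOrRaw("=?utf-8?q?=C3=89ric?= <eric@example.org>"))

	// Not handled without a hook, so the raw header is printed instead.
	fmt.Println(decodeOrRaw("=?iso-2022-jp?b?GyRCRW1CQE86GyhCIDxtb21vQHRhcm8ubmUuanA=?="))
}
```

Whether silently keeping the raw header is acceptable depends on the application; logging the error alongside the fallback is a reasonable alternative.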

Thanks for your help!

Answer 2

Score: 0

I'm not sure this is what you are looking for, but there is the golang.org/x/text package, which I'm using to convert Windows-1251 to UTF-8. The code looks like this:

    import (
        "golang.org/x/text/encoding/charmap"
        "golang.org/x/text/transform"
        "io/ioutil"
        "strings"
    )

    // convert decodes a Windows-1251 string to UTF-8; it returns "" on error.
    func convert(s string) string {
        sr := strings.NewReader(s)
        tr := transform.NewReader(sr, charmap.Windows1251.NewDecoder())
        buf, err := ioutil.ReadAll(tr)
        if err != nil {
            return ""
        }
        return string(buf)
    }

I think in your case, if you want to avoid "a big 'switch-case'", you can create a map with the full list of available encodings and do something like:

    // encodings maps charset labels to decoders; extend it as needed.
    var encodings = map[string]transform.Transformer{
        "win-1251": charmap.Windows1251.NewDecoder(),
    }

    func convert(s, charset string) string {
        dec, ok := encodings[charset]
        if !ok {
            // A missing key yields a nil Transformer, which transform.NewReader cannot use.
            return ""
        }
        buf, err := ioutil.ReadAll(transform.NewReader(strings.NewReader(s), dec))
        if err != nil {
            return ""
        }
        return string(buf)
    }
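The same map-instead-of-switch idea can be sketched with the standard library alone, which makes the lookup and the unknown-charset path easy to see. Everything below (the decoder type, the decoders registry, decodeLatin1, and this convert signature) is made up for illustration and covers only two charsets; a real registry would be filled from a charset library.

```go
package main

import (
	"fmt"
	"strings"
)

// decoder converts raw bytes in some charset to a UTF-8 string.
type decoder func(raw []byte) string

// decodeLatin1 maps each ISO-8859-1 byte to the rune with the same value.
func decodeLatin1(raw []byte) string {
	var sb strings.Builder
	for _, b := range raw {
		sb.WriteRune(rune(b))
	}
	return sb.String()
}

// decoders is a hypothetical registry keyed by lower-case charset label.
var decoders = map[string]decoder{
	"utf-8":      func(raw []byte) string { return strings.ToValidUTF8(string(raw), "\uFFFD") },
	"iso-8859-1": decodeLatin1,
	"latin1":     decodeLatin1, // alias sharing the same decoder
}

// convert looks the charset up in the registry and reports an error for
// unknown labels instead of returning an empty string.
func convert(raw []byte, charset string) (string, error) {
	dec, ok := decoders[strings.ToLower(strings.TrimSpace(charset))]
	if !ok {
		return "", fmt.Errorf("unsupported charset %q", charset)
	}
	return dec(raw), nil
}

func main() {
	s, err := convert([]byte{'c', 'a', 'f', 0xE9}, "ISO-8859-1")
	fmt.Println(s, err)

	_, err = convert([]byte("abc"), "win-1251")
	fmt.Println(err)
}
```

Returning an error rather than "" lets the caller distinguish an empty input from an unsupported charset.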

huangapple
  • Posted on January 30, 2016, 10:30:23
  • When reposting, please keep the link to this article: https://go.coder-hub.com/35097318.html