Go: Making a transformer for code.google.com/p/go.text/transform

Question


For some time I've been normalizing & de-accenting text by doing:

// Local helper function for normalization of UTF8 strings.
func isMn(r rune) bool {
    return unicode.Is(unicode.Mn, r) // Mn: nonspacing marks
}

// This map is used by RemoveAccents function to convert non-accented characters.
var transliterations = map[rune]string{
    'Æ': "E", 'Ð': "D", 'Ł': "L", 'Ø': "OE", 'Þ': "Th", 'ß': "ss", 'æ': "e",
    'ð': "d", 'ł': "l", 'ø': "oe", 'þ': "th", 'Œ': "OE", 'œ': "oe",
}

// removeAccentsBytes converts accented UTF8 characters into their non-accented equivalents, from a []byte.
func removeAccentsBytesDashes(b []byte) ([]byte, error) {
    mnBuf := make([]byte, len(b))
    t := transform.Chain(norm.NFD, transform.RemoveFunc(isMn), norm.NFC)
    n, _, err := t.Transform(mnBuf, b, true)
    if err != nil {
        return nil, err
    }
    mnBuf = mnBuf[:n]
    tlBuf := bytes.NewBuffer(make([]byte, 0, len(mnBuf)*2))
    for i, w := 0, 0; i < len(mnBuf); i += w {
        r, width := utf8.DecodeRune(mnBuf[i:])
        if r == '-' {
            tlBuf.WriteByte(' ')
        } else {
            if d, ok := transliterations[r]; ok {
                tlBuf.WriteString(d)
            } else {
                tlBuf.WriteRune(r)
            }
        }
        w = width
    }
    return tlBuf.Bytes(), nil
}

After that I lowercase the whole thing and apply a series of regular expressions.

This approach is very heavy. I reckon I should be able to do the entire thing in one loop over the bytes instead of 10 loops, and the regular expressions are slow on top of that.

My first thought was to modify the above function to do the lowercasing directly in the loop (the second part of the removeAccentsBytesDashes function). But then I decided I'd like to combine it all into a single loop, including the transform function.
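A minimal sketch of that first idea, using only the standard library: one pass over already-normalized bytes that replaces dashes, applies the transliteration map, and lowercases as it goes. The name foldBytes is hypothetical, and the sketch assumes the accent-stripping step has already run (it does not remove combining marks itself).

```go
package main

import (
	"bytes"
	"fmt"
	"strings"
	"unicode"
	"unicode/utf8"
)

// transliterations mirrors the map from the question.
var transliterations = map[rune]string{
	'Æ': "E", 'Ð': "D", 'Ł': "L", 'Ø': "OE", 'Þ': "Th", 'ß': "ss", 'æ': "e",
	'ð': "d", 'ł': "l", 'ø': "oe", 'þ': "th", 'Œ': "OE", 'œ': "oe",
}

// foldBytes (hypothetical helper) walks b once, rune by rune:
// '-' becomes a space, mapped runes are replaced by their lowercased
// transliteration, and everything else is lowercased.
// Invalid UTF-8 bytes are emitted as U+FFFD.
func foldBytes(b []byte) []byte {
	out := bytes.NewBuffer(make([]byte, 0, len(b)*2))
	for i := 0; i < len(b); {
		r, width := utf8.DecodeRune(b[i:])
		i += width
		switch d, ok := transliterations[r]; {
		case r == '-':
			out.WriteByte(' ')
		case ok:
			out.WriteString(strings.ToLower(d))
		default:
			out.WriteRune(unicode.ToLower(r))
		}
	}
	return out.Bytes()
}

func main() {
	fmt.Printf("%s\n", foldBytes([]byte("Hello-WØRLD Þing")))
}
```

This removes the separate lowercasing pass, but the normalization chain still runs as its own pass before it.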

To that end, I first tried to extract the transformation tables from the transform source, and then tried copying and modifying the source, but I can't seem to get at whatever tables it uses for the transformation. It turns out that norm.NFD is simply 1 and norm.NFC is 0, and I have yet to figure out how it goes from a parameter of 0 or 1 to a transformation table.
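The likely explanation for the 0 and 1 is that norm.Form is declared as an integer type that has methods, so the constants themselves satisfy the Transformer interface and select their behaviour inside the method. The sketch below illustrates only that language mechanism with the standard library; the Form, Upper and Lower names and the local Transformer interface are stand-ins invented for the example, not the real norm or transform declarations.

```go
package main

import "fmt"

// Transformer is a local stand-in with the same Transform method shape
// as the interface in the transform package.
type Transformer interface {
	Transform(dst, src []byte, atEOF bool) (nDst, nSrc int, err error)
}

// Form is a hypothetical integer type; its constants are plain numbers.
type Form int

const (
	Upper Form = iota // 0
	Lower             // 1
)

// Transform is defined on the integer type itself, so the constants
// Upper and Lower can be passed wherever a Transformer is expected.
// The receiver value picks the behaviour.
func (f Form) Transform(dst, src []byte, atEOF bool) (int, int, error) {
	n := copy(dst, src)
	for i := 0; i < n; i++ {
		c := dst[i]
		switch {
		case f == Upper && 'a' <= c && c <= 'z':
			dst[i] = c - 'a' + 'A'
		case f == Lower && 'A' <= c && c <= 'Z':
			dst[i] = c - 'A' + 'a'
		}
	}
	return n, n, nil
}

// apply runs a Transformer over a whole string in one call.
func apply(t Transformer, s string) string {
	dst := make([]byte, len(s))
	n, _, _ := t.Transform(dst, []byte(s), true)
	return string(dst[:n])
}

func main() {
	fmt.Println(int(Upper), int(Lower))
	fmt.Println(apply(Upper, "abc"), apply(Lower, "ABC"))
}
```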

Reading its code I can see it's written efficiently anyway, and obviously beyond my beginner's Go skills, so I thought it might be better to use transform.Chain to add in my own transformers.

I can't find any instructions anywhere on how to write a transformer that will be accepted by transform.Chain. Nothing.

Does anyone have any information on how I can make a transformer for this?

Answer 1

Score: 2


transform.Chain

func Chain(t ...Transformer) Transformer

takes a variadic list of transform.Transformer values

type Transformer interface {
    Transform(dst, src []byte, atEOF bool) (nDst, nSrc int, err error)
}

so you just need to create a type that implements the Transformer interface:

type DenormalizeAndDeaccent struct{}

func (t *DenormalizeAndDeaccent) Transform(dst, src []byte, atEOF bool) (int, int, error) {
    result, err := removeAccentsBytesDashes(src)
    if err != nil {
        return 0, 0, err
    }
    if len(result) > len(dst) {
        // Not enough room for the whole result: report that nothing was
        // written or consumed so the caller can retry with a larger dst.
        return 0, 0, transform.ErrShortDst
    }
    n := copy(dst, result)
    return n, len(src), nil
}
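The wrapper above processes each src chunk as if it were the whole input. A transformer that is meant to sit in a chain usually also has to cope with a rune split across two chunks and with a dst that fills up part-way. Below is a rough sketch of that bookkeeping, written against the standard library only: ErrShortDst and ErrShortSrc are local stand-ins for the transform package's errors of the same names, and translit and feed are hypothetical names for this example. Newer releases of the package (golang.org/x/text/transform) also require a Reset method, which this sketch omits.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
	"unicode"
	"unicode/utf8"
)

// Local stand-ins for transform.ErrShortDst and transform.ErrShortSrc.
var (
	ErrShortDst = errors.New("short destination buffer")
	ErrShortSrc = errors.New("short source buffer")
)

var transliterations = map[rune]string{
	'Æ': "E", 'Ð': "D", 'Ł': "L", 'Ø': "OE", 'Þ': "Th", 'ß': "ss", 'æ': "e",
	'ð': "d", 'ł': "l", 'ø': "oe", 'þ': "th", 'Œ': "OE", 'œ': "oe",
}

// translit (hypothetical) replaces '-' with a space, transliterates and
// lowercases, one rune at a time. It is stateless between calls.
type translit struct{}

func (translit) Transform(dst, src []byte, atEOF bool) (nDst, nSrc int, err error) {
	var tmp [utf8.UTFMax]byte
	for nSrc < len(src) {
		r, width := utf8.DecodeRune(src[nSrc:])
		if r == utf8.RuneError && width == 1 && !atEOF && !utf8.FullRune(src[nSrc:]) {
			// The rune may continue in the next chunk: ask for more input.
			return nDst, nSrc, ErrShortSrc
		}
		var out []byte
		switch d, ok := transliterations[r]; {
		case r == '-':
			out = append(tmp[:0], ' ')
		case ok:
			out = []byte(strings.ToLower(d))
		default:
			out = tmp[:utf8.EncodeRune(tmp[:], unicode.ToLower(r))]
		}
		if nDst+len(out) > len(dst) {
			// Report progress so far; the caller drains dst and calls again.
			return nDst, nSrc, ErrShortDst
		}
		nDst += copy(dst[nDst:], out)
		nSrc += width
	}
	return nDst, nSrc, nil
}

// feed drives t over src, revealing chunk more bytes at a time (chunk must
// be >= 1) and using a deliberately small dst to exercise both errors.
func feed(t translit, src []byte, chunk int) ([]byte, error) {
	var out []byte
	buf := make([]byte, 8)
	start, end := 0, 0
	for start < len(src) {
		nDst, nSrc, err := t.Transform(buf, src[start:end], end == len(src))
		out = append(out, buf[:nDst]...)
		start += nSrc
		switch err {
		case nil, ErrShortSrc:
			// Window exhausted or rune incomplete: widen the window.
			end += chunk
			if end > len(src) {
				end = len(src)
			}
		case ErrShortDst:
			// buf has been drained into out; retry from the new start.
		default:
			return nil, err
		}
	}
	return out, nil
}

func main() {
	got, err := feed(translit{}, []byte("Łódź-ÞING"), 1)
	fmt.Printf("%s %v\n", got, err)
}
```

Returning the partial counts together with the error is what lets a caller resume without reprocessing input. Note that this sketch does no accent stripping, so precomposed letters such as ó pass through unchanged unless a normalization step runs before it.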

huangapple
  • Posted on August 20, 2014, 19:28:51
  • When reposting, please keep the link to this article: https://go.coder-hub.com/25403542.html