August 20, 2014, 19:28:51 · go

Go: Making a transformer for code.google.com/p/go.text/transform

Question

For some time I've been normalizing & de-accenting text by doing:

// Local helper function for normalization of UTF8 strings.
func isMn(r rune) bool {
    return unicode.Is(unicode.Mn, r) // Mn: nonspacing marks
}

// This map is used by RemoveAccents function to convert non-accented characters.
var transliterations = map[rune]string{
    'Æ': "E", 'Ð': "D", 'Ł': "L", 'Ø': "OE", 'Þ': "Th", 'ß': "ss", 'æ': "e", 'ð': "d", 'ł': "l", 'ø': "oe", 'þ': "th", 'Œ': "OE", 'œ': "oe",
}

// removeAccentsBytes converts accented UTF8 characters into their non-accented equivalents, from a []byte.
func removeAccentsBytesDashes(b []byte) ([]byte, error) {
    mnBuf := make([]byte, len(b))
    t := transform.Chain(norm.NFD, transform.RemoveFunc(isMn), norm.NFC)
    n, _, err := t.Transform(mnBuf, b, true)
    if err != nil {
        return nil, err
    }
    mnBuf = mnBuf[:n]
    tlBuf := bytes.NewBuffer(make([]byte, 0, len(mnBuf)*2))
    for i, w := 0, 0; i < len(mnBuf); i += w {
        r, width := utf8.DecodeRune(mnBuf[i:])
        if r == '-' {
            tlBuf.WriteByte(' ')
        } else {
            if d, ok := transliterations[r]; ok {
                tlBuf.WriteString(d)
            } else {
                tlBuf.WriteRune(r)
            }
        }
        w = width
    }
    return tlBuf.Bytes(), nil
}

After that I lowercase the whole thing and apply a series of regular expressions.

This approach is very heavy. I reckon I should be able to do the entire thing in one loop over the bytes instead of ten, and the regular expressions are slow on top of that.

My first thought was to modify the above function to do the lowercasing directly in its loop (the second part of the removeAccentsBytesDashes function). But then I decided I'd like to combine it all into a single loop, including the transform step.
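
That first idea, folding the lowercasing into the existing second loop, could look roughly like the sketch below. It uses only the standard library and leaves the normalization step out entirely; the name foldBytes is made up for illustration, and lowercasing the transliterated strings with strings.ToLower is an assumption about the desired output.

```go
package main

import (
	"bytes"
	"fmt"
	"strings"
	"unicode"
	"unicode/utf8"
)

// transliterations is the map from the question.
var transliterations = map[rune]string{
	'Æ': "E", 'Ð': "D", 'Ł': "L", 'Ø': "OE", 'Þ': "Th", 'ß': "ss",
	'æ': "e", 'ð': "d", 'ł': "l", 'ø': "oe", 'þ': "th", 'Œ': "OE", 'œ': "oe",
}

// foldBytes (hypothetical name) makes one pass over b: it turns '-' into ' ',
// applies the transliteration map, and lowercases every other rune.
func foldBytes(b []byte) []byte {
	out := bytes.NewBuffer(make([]byte, 0, len(b)*2))
	for i := 0; i < len(b); {
		r, width := utf8.DecodeRune(b[i:])
		i += width
		if r == '-' {
			out.WriteByte(' ')
		} else if d, ok := transliterations[r]; ok {
			// Assumption: transliterated output should be lowercased too.
			out.WriteString(strings.ToLower(d))
		} else {
			out.WriteRune(unicode.ToLower(r))
		}
	}
	return out.Bytes()
}

func main() {
	fmt.Printf("%s\n", foldBytes([]byte("Ærø-Sound")))
}
```

This removes the separate lowercasing pass, but the accent stripping still has to run before it as its own step.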

For this I first tried to get the transformation tables out of the transform source, then tried copying and modifying it, but I can't seem to get at whatever tables it uses for the transformation. It turns out that norm.NFD is just 1 and norm.NFC is just 0, and I have yet to figure out how it takes those 0 or 1 parameters and somehow gets a transformation table out of them.

Reading its code I can see it's written efficiently anyway, and it's obviously beyond my beginner's Go skills, so I thought it might be better to use transform.Chain to add in my own transformers.

I can't find any instructions anywhere on how to write a transformer that will be accepted by transform.Chain. Nothing.

Does anyone have any information on how I can make a transformer for this?

Answer 1

Score: 2

transform.Chain

func Chain(t ...Transformer) Transformer

takes a variadic list of transform.Transformer values:

type Transformer interface {
    Transform(dst, src []byte, atEOF bool) (nDst, nSrc int, err error)
}

so you just need to create a type that implements the Transformer interface:

type DenormalizeAndDeaccent struct{}

func (t *DenormalizeAndDeaccent) Transform(dst, src []byte, atEOF bool) (int, int, error) {
    result, err := removeAccentsBytesDashes(src)
    if err != nil {
        return 0, 0, err
    }
    // Report a short destination before copying, so no output is silently dropped.
    if len(dst) < len(result) {
        return 0, 0, transform.ErrShortDst
    }
    n := copy(dst, result)
    return n, len(src), nil
}
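
For the single-loop version the question asks about, a transformer can also work rune by rune instead of wrapping the existing function. The sketch below is one possible shape, not a definitive implementation. To stay runnable with only the standard library it declares local stand-ins errShortDst and errShortSrc; with the real package you would return transform.ErrShortDst and transform.ErrShortSrc instead. Current versions of golang.org/x/text/transform also require a Reset() method, which embedding transform.NopResetter can supply. The names foldTransformer and run are made up for illustration, and the map is abbreviated.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
	"unicode"
	"unicode/utf8"
)

// Stand-ins (assumption) for transform.ErrShortDst and transform.ErrShortSrc.
var (
	errShortDst = errors.New("short destination buffer")
	errShortSrc = errors.New("short source buffer")
)

// Abbreviated copy of the map from the question.
var transliterations = map[rune]string{
	'Æ': "E", 'Ø': "OE", 'Þ': "Th", 'ß': "ss", 'æ': "e", 'ø': "oe", 'þ': "th",
}

// foldTransformer (hypothetical) maps '-' to ' ', transliterates, and lowercases.
type foldTransformer struct{}

// Reset is a no-op: the transformer keeps no state between calls.
func (foldTransformer) Reset() {}

// Transform consumes whole runes from src and writes their replacements to dst,
// reporting how far it got in both buffers.
func (foldTransformer) Transform(dst, src []byte, atEOF bool) (nDst, nSrc int, err error) {
	for nSrc < len(src) {
		r, width := utf8.DecodeRune(src[nSrc:])
		// A rune cut off at the end of a non-final chunk: ask for more input.
		if r == utf8.RuneError && width == 1 && !atEOF && !utf8.FullRune(src[nSrc:]) {
			return nDst, nSrc, errShortSrc
		}
		var repl string
		if r == '-' {
			repl = " "
		} else if d, ok := transliterations[r]; ok {
			repl = strings.ToLower(d)
		} else {
			repl = string(unicode.ToLower(r))
		}
		// Never write a partial replacement: stop and ask for a bigger dst.
		if nDst+len(repl) > len(dst) {
			return nDst, nSrc, errShortDst
		}
		nDst += copy(dst[nDst:], repl)
		nSrc += width
	}
	return nDst, nSrc, nil
}

// run (hypothetical) drives the transformer the way a caller would: it resumes
// with a larger destination whenever errShortDst comes back.
func run(src []byte) ([]byte, error) {
	var t foldTransformer
	dst := make([]byte, 4) // deliberately tiny to exercise the retry path
	nDst, nSrc := 0, 0
	for {
		d, s, err := t.Transform(dst[nDst:], src[nSrc:], true)
		nDst += d
		nSrc += s
		if err == errShortDst {
			dst = append(dst, make([]byte, len(dst))...) // double the buffer
			continue
		}
		return dst[:nDst], err
	}
}

func main() {
	out, err := run([]byte("Ærø-Sound"))
	fmt.Printf("%s %v\n", out, err)
}
```

With the real package, a type like this would presumably go at the end of the chain from the question, for example transform.Chain(norm.NFD, transform.RemoveFunc(isMn), norm.NFC, foldTransformer{}), so that the accent stripping and the folding happen in one pipeline.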
