Remove all special characters but not accented letters

Question

I need to delete all symbols from a string in Go, except accented letters. My code instead deletes all symbols, including accented letters:

str := "cafè!?"
reg := regexp.MustCompile(`[^\w]`)
str = reg.ReplaceAllString(str, " ")

I expect the following output:

cafè

But the output with my code is:

caf

I want to keep è, é, à, ò, ì (and of course all letters from a to z and digits from 0 to 9).

How can I do this?
Thanks for your help.

Answer 1

Score: 1

To include è, é, à, ò, ì, just add them to the character class: [^\wèéàòìÈÉÀÒÌ]

You could also use [^\d\p{Latin}], but that keeps many more characters than the five accented letters listed above.

\d matches digits, and \p{Latin} is the Unicode script class covering all Latin-script letters, including precomposed letters with diacritics.

For example:

re := regexp.MustCompile(`[^\d\p{Latin}]`)
fmt.Println(re.ReplaceAllString(`Test123éËà-ŞŨğБла通用`, ""))

This prints:

Test123éËàŞŨğ
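
Applied to the string from the question, the same pattern can be sketched as a complete program. The helper name keepLatinAndDigits is an assumption made for illustration, and the sketch replaces matches with an empty string rather than the space used in the question. It is worth noting that in Go's regexp package \w is ASCII-only (equivalent to [0-9A-Za-z_]), which is why the original pattern removed è.

```go
package main

import (
	"fmt"
	"regexp"
)

// nonLatinOrDigit matches every rune that is neither an ASCII digit
// nor a letter of the Latin script.
var nonLatinOrDigit = regexp.MustCompile(`[^\d\p{Latin}]`)

// keepLatinAndDigits is a hypothetical helper (not from the answer):
// it deletes everything the pattern above matches.
func keepLatinAndDigits(s string) string {
	return nonLatinOrDigit.ReplaceAllString(s, "")
}

func main() {
	// \u00e8 is the precomposed form of è.
	fmt.Println(keepLatinAndDigits("caf\u00e8!?")) // prints "cafè"
}
```

One caveat: if the input is in decomposed form (a plain e followed by the combining grave accent U+0300), the combining mark does not belong to the Latin script and is removed, leaving "cafe". Normalizing the input to NFC beforehand avoids this.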

Answer 2

Score: 1

All "special" characters here are punctuation characters (and, I assume, symbol characters as well), so use

[\p{P}\p{S}]+

If you want to remove every character that is not a letter, use

\P{L}+

See regex demo #1 and regex demo #2.
Here,

  • \p{P} matches any punctuation character (such as commas and dots)
  • \p{S} matches any symbol (mathematical symbols, currency signs, etc.)
  • \P{L} matches any character other than a Unicode letter.
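
As a minimal sketch of how the two patterns differ in Go (the helper names and the sample input are assumptions for illustration, not part of the answer):

```go
package main

import (
	"fmt"
	"regexp"
)

var (
	// punctOrSymbol matches runs of punctuation and symbol characters.
	punctOrSymbol = regexp.MustCompile(`[\p{P}\p{S}]+`)
	// nonLetter matches runs of anything that is not a Unicode letter.
	nonLetter = regexp.MustCompile(`\P{L}+`)
)

// stripPunctAndSymbols removes punctuation and symbols only.
func stripPunctAndSymbols(s string) string {
	return punctOrSymbol.ReplaceAllString(s, "")
}

// stripNonLetters removes everything except letters.
func stripNonLetters(s string) string {
	return nonLetter.ReplaceAllString(s, "")
}

func main() {
	in := "caf\u00e8 123!?+"
	fmt.Println(stripPunctAndSymbols(in)) // prints "cafè 123"
	fmt.Println(stripNonLetters(in))      // prints "cafè"
}
```

Note that \P{L}+ also removes digits and whitespace. Since the question asks to keep the digits 0 to 9, the first pattern, or a class such as [^\p{L}\d]+, may be the closer fit.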

Answer 3

Score: -2

You can use a Unicode text segmentation library to iterate over grapheme clusters, and check that the first rune in each grapheme cluster has the right category (letter or digit).

import (
	"strings"
	"unicode"

	"github.com/rivo/uniseg"
)

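// stripSpecial keeps every grapheme cluster whose first rune is a letter
// or a digit and drops all other clusters.
func stripSpecial(s string) string {
	var b strings.Builder
	gr := uniseg.NewGraphemes(s)
	for gr.Next() {
		// The first rune is the base character of the cluster.
		r := gr.Runes()[0]
		if unicode.IsLetter(r) || unicode.IsDigit(r) {
			// Write the whole cluster, including any combining marks.
			b.WriteString(gr.Str())
		}
	}
	return b.String()
}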

The code works by first breaking the string into grapheme clusters:

"cafè!?" -> ["c", "a", "f", "è", "!", "?"]

Each grapheme cluster may contain multiple Unicode code points. The first code point determines the type of character, and the remaining code points (if any) are accent marks or other modifiers. So we filter and concatenate:

["c", "a", "f", "è"] -> "cafè"

This will pass through any accented or unaccented letters and digits, no matter how they are normalized and no matter what accents they carry (including z̶̰̬̰͈̅̒̚͝å̷̢̡̦̼̥̘̙̺̩̮̱̟̳̙͂́̇̓̉́͒̎͜ḽ̷̢̣̹̳̊̋ͅg̵̙̞͈̥̳̗͙͚͛̀͘o̴̧̟̞̞̠̯͈͔̽̎͋̅́̈̅̊̒ text).

It will, however, exclude certain characters such as zero-width joiners, which mangles words in scripts that rely on them, such as Devanagari. If you care about an international audience, check whether your audience uses zero-width joiners.
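
If pulling in a third-party dependency is not an option, the same idea can be roughly approximated with the standard library alone. The sketch below is not grapheme segmentation and is not the approach from the answer: it simply lets each combining mark share the fate of the base character before it. The name stripSpecialStd is hypothetical.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// stripSpecialStd is a hypothetical, standard-library-only approximation:
// letters and digits are kept, and a combining mark is kept only when the
// base character preceding it was kept.
func stripSpecialStd(s string) string {
	var b strings.Builder
	keep := false // whether the most recent base character was kept
	for _, r := range s {
		switch {
		case unicode.IsMark(r):
			// Combining mark: inherit the decision made for its base.
		case unicode.IsLetter(r) || unicode.IsDigit(r):
			keep = true
		default:
			keep = false
		}
		if keep {
			b.WriteRune(r)
		}
	}
	return b.String()
}

func main() {
	// e followed by U+0300 is the decomposed form of è.
	fmt.Println(stripSpecialStd("cafe\u0300!?"))
}
```

This handles precomposed and decomposed accents alike, but it shares the limitation described above: format characters such as zero-width joiners are dropped.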

huangapple
  • Posted on 2021-12-14 00:46:22
  • When reposting, please keep the link to this article: https://go.coder-hub.com/70338062.html