2021年12月14日 00:46:22go评论90阅读模式

英文:

Remove all special characters but not accented letters

问题

我需要从一个字符串中删除除了重音字母之外的所有符号。我的代码删除了包括重音字母在内的所有符号：

str := "caf&#232;!?"
reg, err := regexp.Compile(`[^\w]`)
str := reg.ReplaceAllString(str, " ")

我期望的输出是：

caf&#232;

但是我的代码输出是：

caf

我想要包括è、é、à、ò、ì（当然还有从a到z的所有字母和从0到9的数字）。

我该怎么做？
谢谢你的帮助。

英文:

I need to delete from a string all symbols exept accentend letters in GO. My code instead delete all symbols included accented letters:

str := &quot;caf&#232;!?&quot;
reg, err := regexp.Compile(`[^\w]`)
str := reg.ReplaceAllString(str, &quot; &quot;)

I expect the following output:

caf&#232;

But the output with my code is:

caf

I want to include è, é, à, ò, ì (and of course all letters from a to z and numbers from 0 to 9)

How can I do?
Thanks for your help

答案1

得分: 1

要包含 è, é, à, ò, ì，只需将它们添加到正则表达式中：[^\wèéàòìÈÉÀÒÌ]

你也可以使用 [^\d\p{Latin}]，但这将匹配更多的字符。

\d 用于匹配数字，\p{Latin} 是一个 Unicode 类，包括所有拉丁字符，包括所有的变音符号。

例如：

re := regexp.MustCompile(`[^\d\p{Latin}]`)
fmt.Println(re.ReplaceAllString(`Test123&#233;&#203;&#224;-ŞŨğБла通用`, ""))

将打印：

Test123&#233;&#203;&#224;ŞŨğ

英文:

To include è, é, à, ò, ì, just add them to the regex: [^\wèéàòìÈÉÀÒÌ]

You might also use [^\d\p{Latin}], but that'll match more characters.

\d is for digits and \p{Latin} is a Unicode class for all Latin characters, including all diacritics.

For example:

re := regexp.MustCompile(`[^\d\p{Latin}]`)
fmt.Println(re.ReplaceAllString(`Test123&#233;&#203;&#224;-ŞŨğБла通用`, &quot;&quot;))

Will print:

Test123&#233;&#203;&#224;ŞŨğ

答案2

得分: 1

这里的所有“特殊”字符都是标点符号（我假设也包括符号字符），所以使用以下正则表达式：

[\p{P}\p{S}]+

如果你想要移除除了字母以外的任何字符，你需要使用以下正则表达式：

\P{L}+

请参考正则表达式演示 #1和正则表达式演示 #2。
在这里，

\p{P} 匹配任何标点符号（如逗号、句号）
\p{S} 匹配符号，如数学符号等
\P{L} 匹配任何非 Unicode 字母的字符。

英文:

All "special" characters here are punctuation (and I assume also symbol) chars, so use

[\p{P}\p{S}]+

If you want to remove any chars but any letters you need to use

\P{L}+

See regex demo #1 and regex demo #2.
Here,

\p{P} matches any punctuation proper (like commas, dots)
\p{S} symbols, as mathematical, etc. symbols
\P{L} - any char other than a Unicode letter.

答案3

得分: -2

你可以使用一个Unicode文本分割库来迭代图形簇，并检查每个图形簇中的第一个符文的类别（字母或数字）。

import (
	"strings"
	"unicode"

	"github.com/rivo/uniseg"
)

func stripSpecial(s string) string {
	var b strings.Builder
	gr := uniseg.NewGraphemes(s)
	for gr.Next() {
		r := gr.Runes()[0]
		if unicode.IsLetter(r) || unicode.IsDigit(r) {
			b.WriteString(gr.Str())
		}
	}
	return b.String()
}

该代码首先将字符串分割成图形簇：

"cafè!?" -> ["c", "a", "f", "è", "!", "?"]

每个图形簇可能包含多个Unicode代码点。第一个代码点确定字符的类型，剩余的代码点（如果有）是重音符号或其他修饰符号。因此，我们进行过滤和连接：

["c", "a", "f", "è"] -> "cafè"

这将通过任何带重音符号或不带重音符号的字母和数字，无论它们如何规范化，以及无论有什么重音符号（包括z̶̰̬̰͈̅̒̚͝å̷̢̡̦̼̥̘̙̺̩̮̱̟̳̙͂́̇̓̉́͒̎͜ḽ̷̢̣̹̳̊̋ͅg̵̙̞͈̥̳̗͙͚͛̀͘o̴̧̟̞̞̠̯͈͔̽̎͋̅́̈̅̊̒文本）。它将排除某些字符，如零宽连接符，这会导致某些语言中的单词变形...因此，如果你关心国际受众，你可能需要检查你的受众是否使用零宽连接符。因此，这将破坏某些脚本，如天城文。

英文:

You can use a Unicode text segmentation library to iterate over grapheme clusters, and check that the first rune in each grapheme cluster has the right category (letter or digit).

import (
	&quot;strings&quot;
	&quot;unicode&quot;

	&quot;github.com/rivo/uniseg&quot;
)

func stripSpecial(s string) string {
	var b strings.Builder
	gr := uniseg.NewGraphemes(s)
	for gr.Next() {
		r := gr.Runes()[0]
		if unicode.IsLetter(r) || unicode.IsDigit(r) {
			b.WriteString(gr.Str())
		}
	}
	return b.String()
}

The code works by first breaking the string into grapheme clusters,

&quot;caf&#232;!?&quot; -&gt; [&quot;c&quot;, &quot;a&quot;, &quot;f&quot;, &quot;&#232;&quot;, &quot;!&quot;, &quot;?&quot;]

Each grapheme cluster may contain multiple Unicode code points. The first code point determines the type of character, and the remaining code points (if any) are accent marks or other modifiers. So we filter and concatenate:

[&quot;c&quot;, &quot;a&quot;, &quot;f&quot;, &quot;&#232;&quot;] -&gt; &quot;caf&#232;&quot;

This will pass through any accented or unaccented letters and digits, no matter how they are normalized, and no matter what accents (including z̶̰̬̰͈̅̒̚͝å̷̢̡̦̼̥̘̙̺̩̮̱̟̳̙͂́̇̓̉́͒̎͜ḽ̷̢̣̹̳̊̋ͅg̵̙̞͈̥̳̗͙͚͛̀͘o̴̧̟̞̞̠̯͈͔̽̎͋̅́̈̅̊̒ text). It will exclude certain characters like zero-width joiners which will cause words in certain languages to get mangled... so if you care about an international audience, you may want to review if your audience uses zero-width joiners. So, this will mangle certain scripts like Devanagari.

通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库，让每个人都能够通过互相帮助和分享经验来进步。

删除所有特殊字符，但保留重音字母。

问题

答案1

答案2

答案3

Go Variable Assignment with Interfaces and Dynamic Types

使用Golang和HereApi访问路线或地理位置的电动汽车充电站详细信息。

当使用”go run”命令时，避免输入主包中的所有go文件。

S3对象在使用Golang SDK时未过期

What's the correct way to type hint an empty list as a literal in python?

如何在Highcharts Gantt中更改本地化的星期名称

如何在同一个流中使用多个过滤器和映射函数？

如何使用Map/Set来将代码优化到O(n)？

.NET MAUI Android在GitHub Actions上构建失败，错误代码为1。

如何在Playwright视觉比较中屏蔽多个定位器？

在C++中，可以使用可变模板参数来检索类型的内部类型。

selenium.common.exceptions.StaleElementReferenceException: Message: stale element reference: stale element not found

Creating and opening a URL to log in to Website via Basic Auth with Robot Framework/Selenium (Python)

AG Grid 在上下文菜单中以大文本形式打开

发表评论