November 8, 2013 01:39:54 · go

Go, Regular Expression : very challenging regex on Characters

Question

Do you think this is possible with regex alone?

Here is my attempt on the Go Playground.

It works, but the code is messy:

http://play.golang.org/p/YysZCB3vlu

I want expanded Korean characters (individual jamo) to be combined into complete letters (syllable blocks).
For example, "ㅈㅗㅎㅇㅡㄴㄱㅏㅂㅅㅇㅣㅆㅏㅇㅛㅇㅏㅊㅣㅁㅇㅏㄴㄴㅕㅇㅎㅏㅅㅔㅇㅛㅇㅜㅔㄴ" should become 좋은값이싸요아침안녕하세요웬

> For browsers that don't render Korean characters correctly:<br/>
>> 좋   은   값   이   싸   요   아   침   안   녕   하   세   요   웬

The easy part is that a Korean letter always starts with one consonant followed by one or two vowels. That can be matched with (.([ㅏ-ㅣ])+).

The challenging part is the zero, one, or at most two optional consonants that follow the vowel. It is hard because, after those optional consonants, there is another consonant that does not belong to the previous letter; it marks the start of a new letter.

Like below:

ㄱㅏㅂㅅㅇㅣ
= ㄱㅏㅂㅅ  +  ㅇㅣ
= 값 + 이
= 값이

It is possible to catch all the patterns with if-conditions and basic regex, but I would like a shorter version.

My ultimate goal is to convert "ㅈㅗㅎㅇㅡㄴㄱㅏㅂㅅㅇㅣㅆㅏㅇㅛㅇㅏㅊㅣㅁㅇㅏㄴㄴㅕㅇㅎㅏㅅㅔㅇㅛㅇㅜㅔㄴ" to 좋은값이싸요아침안녕하세요웬

> For browsers that don't render Korean characters correctly:<br/>
>> 좋   은   값   이   싸   요   아   침   안   녕   하   세   요   웬

Answer 1

Score: 2

I don't know Korean, but it sounds like your possible input combinations are:

CV (C = consonant, V = vowel)
CVV
CVVC
CVVCC
CVC
CVCC

So a regex rule that captures one of these (without capturing the first consonant of the next letter) is:
CV{1,2}C{0,2}(?!V)

Then you just need to define your C and V character classes, for example by replacing V with [ㅏ-ㅣ].

Have your program loop through the matches found in the string and output the combined letter for each.

EDIT: Go's regexp package doesn't support negative lookahead, so I suggest doing the following:

1. Reverse the string (something like https://stackoverflow.com/questions/1752414/how-to-reverse-a-string-in-go, but be careful to reverse by Unicode character, not by byte)
2. Run a match on C{0,2}V{1,2}C
3. Reverse each match and perform the join/lookup

There are other ways of getting around the lack of negative lookahead, but they will probably involve a lot more code to control where the next match starts in the input string.

Also, when defining the sets of characters you will treat as vowels or consonants, it would be better to use Unicode escape sequences rather than the Korean glyphs themselves (normally, e.g., \x1161), but I'm not sure whether Go supports Unicode references in regex either...
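For what it's worth, Go's regexp syntax does accept code point escapes in the braced form `\x{...}`. The sketch below assumes the glyphs in the question are Hangul Compatibility Jamo (consonants U+3131 to U+314E, vowels U+314F to U+3163), which is a different block from the conjoining jamo that contains U+1161; the variable names are made up for this example.

```go
package main

import (
	"fmt"
	"regexp"
)

// The same classes as [ㄱ-ㅎ] and [ㅏ-ㅣ], written with code point
// escapes instead of glyphs (assumed compatibility jamo ranges).
var (
	consonant = regexp.MustCompile(`^[\x{3131}-\x{314E}]$`)
	vowel     = regexp.MustCompile(`^[\x{314F}-\x{3163}]$`)
)

func main() {
	fmt.Println(consonant.MatchString("ㅂ"), vowel.MatchString("ㅏ")) // true true
	// U+1161 is the conjoining vowel, which lies outside this range.
	fmt.Println(vowel.MatchString("\u1161")) // false
}
```

If the input turns out to use conjoining jamo instead, the ranges would need to change accordingly.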
