Go, Regular Expression: very challenging regex on characters

Question

Do you think this is possible with regex alone?

Here is my attempt on the Go Playground.

It works, but the code is messy:

http://play.golang.org/p/YysZCB3vlu

I want to convert expanded (decomposed) Korean characters into complete letters.
For example, "ㅈㅗㅎㅇㅡㄴㄱㅏㅂㅅㅇㅣㅆㅏㅇㅛㅇㅏㅊㅣㅁㅇㅏㄴㄴㅕㅇㅎㅏㅅㅔㅇㅛㅇㅜㅔㄴ" to 좋은값이싸요아침안녕하세요웬

> For browsers that don't render Korean characters correctly:
>> 좋   은   값   이   싸   요   아   침   안   녕   하   세   요   웬

The easy part is that a Korean letter always starts with one consonant followed by one or two vowels. That can be matched with (.([ㅏ-ㅣ])+).

The challenging part is the zero, one, or at most two optional consonants that follow the vowel. What makes it harder is that these optional consonants are followed by another consonant that does not belong to the previous letter; it marks the start of a new letter.

Like below:

ㄱㅏㅂㅅㅇㅣ
= ㄱㅏㅂㅅ  +  ㅇㅣ
= 값 + 이
= 값이

It is possible to catch all the patterns with if-conditions and basic regexes, but it would be nice to have a shorter version.

My ultimate goal is to convert "ㅈㅗㅎㅇㅡㄴㄱㅏㅂㅅㅇㅣㅆㅏㅇㅛㅇㅏㅊㅣㅁㅇㅏㄴㄴㅕㅇㅎㅏㅅㅔㅇㅛㅇㅜㅔㄴ" to 좋은값이싸요아침안녕하세요웬

> For browsers that don't render Korean characters correctly:
>> 좋   은   값   이   싸   요   아   침   안   녕   하   세   요   웬

Answer 1

Score: 2

I don't know Korean, but it sounds like your possible input combinations are:

CV (C = consonant, V = vowel)
CVV
CVVC
CVVCC
CVC
CVCC

So a regex rule to capture that (without capturing the first consonant of the next word) is:
CV{1,2}C{0,2}(?!V)

Then you just need to define your C and V character classes, for example by replacing V with [ㅏ-ㅣ].

Use your program to loop through the matches found in the string and output the combined word.
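As a rough sketch of the "output the combined word" step: once a match holds the letters of exactly one syllable, the precomposed syllable can be computed arithmetically from the Unicode Hangul syllable block (base 0xAC00, 21 vowels, 28 final slots). The names compose, runeIndex, and compounds are hypothetical helpers, and the code assumes the input uses Hangul Compatibility Jamo (U+3131 to U+3163), as the glyphs in the question appear to.

```go
package main

import (
	"fmt"
	"strings"
)

// Jamo tables in the order used by the Unicode Hangul syllable block.
const (
	initials = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ"
	finals   = "ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ"
)

// compounds merges a decomposed vowel or final pair into one compound jamo.
var compounds = strings.NewReplacer(
	"ㅗㅏ", "ㅘ", "ㅗㅐ", "ㅙ", "ㅗㅣ", "ㅚ", "ㅜㅓ", "ㅝ",
	"ㅜㅔ", "ㅞ", "ㅜㅣ", "ㅟ", "ㅡㅣ", "ㅢ",
	"ㄱㅅ", "ㄳ", "ㄴㅈ", "ㄵ", "ㄴㅎ", "ㄶ", "ㄹㄱ", "ㄺ",
	"ㄹㅁ", "ㄻ", "ㄹㅂ", "ㄼ", "ㄹㅅ", "ㄽ", "ㄹㅌ", "ㄾ",
	"ㄹㅍ", "ㄿ", "ㄹㅎ", "ㅀ", "ㅂㅅ", "ㅄ",
)

// runeIndex returns the position of r in set counted in runes, or -1.
func runeIndex(set string, r rune) int {
	i := 0
	for _, c := range set {
		if c == r {
			return i
		}
		i++
	}
	return -1
}

// compose turns the jamo of ONE syllable (e.g. "ㄱㅏㅂㅅ") into the
// precomposed syllable. It reports false for anything it cannot compose.
func compose(group string) (string, bool) {
	r := []rune(compounds.Replace(group))
	if len(r) < 2 || len(r) > 3 {
		return "", false
	}
	ini := runeIndex(initials, r[0])
	med := int(r[1]) - 0x314F // vowels ㅏ..ㅣ are contiguous from U+314F
	fin := 0                  // 0 means "no final consonant"
	if len(r) == 3 {
		fin = runeIndex(finals, r[2]) + 1
	}
	if ini < 0 || med < 0 || med > 20 || (len(r) == 3 && fin == 0) {
		return "", false
	}
	return string(rune(0xAC00 + (ini*21+med)*28 + fin)), true
}

func main() {
	for _, g := range []string{"ㄱㅏㅂㅅ", "ㅇㅣ", "ㅇㅜㅔㄴ"} {
		s, ok := compose(g)
		fmt.Println(g, s, ok)
	}
}
```

The replacer is only safe on a single syllable's jamo; applied to a whole sentence it could merge a final consonant with the next syllable's initial.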

EDIT: Go doesn't support negative lookahead, so I suggest doing the following:

  1. Reverse the string (something like https://stackoverflow.com/questions/1752414/how-to-reverse-a-string-in-go, but reverse by rune rather than by byte so multi-byte UTF-8 sequences stay intact)
  2. Run a match on C{0,2}V{1,2}C
  3. Reverse each match and perform the word join/lookup
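
The three steps above could be sketched in Go roughly as follows. The names reverseRunes and splitSyllables are made up for illustration, and the consonant class [ㄱ-ㅎ] is an assumption that mirrors the [ㅏ-ㅣ] vowel class from the question (both are ranges in the Hangul Compatibility Jamo block).

```go
package main

import (
	"fmt"
	"regexp"
)

// reversedSyllable is C{0,2}V{1,2}C with C = [ㄱ-ㅎ] and V = [ㅏ-ㅣ].
// It is meant to run on a rune-reversed string, so no lookahead is needed.
var reversedSyllable = regexp.MustCompile(`[ㄱ-ㅎ]{0,2}[ㅏ-ㅣ]{1,2}[ㄱ-ㅎ]`)

// reverseRunes reverses s by rune, keeping multi-byte characters intact.
func reverseRunes(s string) string {
	r := []rune(s)
	for i, j := 0, len(r)-1; i < j; i, j = i+1, j-1 {
		r[i], r[j] = r[j], r[i]
	}
	return string(r)
}

// splitSyllables returns the jamo of each syllable in reading order.
func splitSyllables(s string) []string {
	matches := reversedSyllable.FindAllString(reverseRunes(s), -1)
	out := make([]string, len(matches))
	for i, m := range matches {
		// Matches arrive last syllable first, so fill out from the end.
		out[len(matches)-1-i] = reverseRunes(m)
	}
	return out
}

func main() {
	fmt.Println(splitSyllables("ㄱㅏㅂㅅㅇㅣ"))
}
```

Reversing works because every syllable has exactly one leading consonant: in the reversed string that consonant comes right after the vowels, so a greedy match cannot swallow part of a neighbouring syllable.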

There are other ways of getting around the lack of negative lookahead, but it will probably involve a lot more code to manipulate where the next match will start in the input string.
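
One such alternative, as a sketch only: scan forward by rune, take the consonant, the vowels, and up to two trailing consonants, then give the last consonant back if the next rune is a vowel. That hand-rolls the effect of (?!V). The function names are hypothetical, and the code point ranges assume Hangul Compatibility Jamo input.

```go
package main

import "fmt"

// isConsonant and isVowel assume Hangul Compatibility Jamo code points.
func isConsonant(r rune) bool { return r >= 0x3131 && r <= 0x314E }
func isVowel(r rune) bool     { return r >= 0x314F && r <= 0x3163 }

// splitForward groups jamo into syllables as C V{1,2} C{0,2}, emulating the
// negative lookahead by returning a consonant that precedes a vowel.
func splitForward(s string) []string {
	r := []rune(s)
	var out []string
	for i := 0; i < len(r); {
		if !isConsonant(r[i]) {
			i++ // skip anything that cannot start a syllable
			continue
		}
		j, v, c := i+1, 0, 0
		for j < len(r) && v < 2 && isVowel(r[j]) {
			j++
			v++
		}
		if v == 0 {
			i++ // a consonant with no vowel after it: skip it
			continue
		}
		for j < len(r) && c < 2 && isConsonant(r[j]) {
			j++
			c++
		}
		// The last consonant belongs to the next syllable if a vowel follows.
		if c > 0 && j < len(r) && isVowel(r[j]) {
			j--
		}
		out = append(out, string(r[i:j]))
		i = j
	}
	return out
}

func main() {
	fmt.Println(splitForward("ㅈㅗㅎㅇㅡㄴㄱㅏㅂㅅㅇㅣ"))
}
```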

Also, when defining the sets of characters to treat as vowels or consonants, it would be better to use Unicode escape sequences rather than the Korean glyphs themselves. Go's regexp syntax accepts code points in the form \x{314F} (that is ㅏ), so the vowel class [ㅏ-ㅣ] can also be written as [\x{314F}-\x{3163}].
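
A small check of that idea, assuming the input uses Hangul Compatibility Jamo, where ㅏ is U+314F and ㅣ is U+3163. The variable names are illustrative only.

```go
package main

import (
	"fmt"
	"regexp"
)

// The same vowel class written two ways: with \x{...} code point escapes
// and with literal glyphs. Both are anchored to match exactly one character.
var (
	vowelByEscape = regexp.MustCompile(`^[\x{314F}-\x{3163}]$`)
	vowelByGlyph  = regexp.MustCompile(`^[ㅏ-ㅣ]$`)
)

func main() {
	for _, s := range []string{"ㅏ", "ㅣ", "ㄱ"} {
		fmt.Println(s, vowelByEscape.MatchString(s), vowelByGlyph.MatchString(s))
	}
}
```

The escaped form keeps the pattern readable in editors or terminals that cannot display Hangul, at the cost of being harder to eyeball against the input.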

huangapple
  • Posted on November 8, 2013 at 01:39:54
  • When reposting, please keep the link to this article: https://go.coder-hub.com/19842859.html