Golang regex replace excluding quoted strings

Question

I'm trying to port the removeComments function from this JavaScript implementation to Golang. The goal is to remove every comment from the text. For example:

/* this is comments, and should be removed */

However, "/* this is quoted, so it should not be removed*/"

In the JavaScript implementation, the quoted parts are matched but not captured in a group, so I can easily filter them out. In Golang, however, it does not seem easy to tell whether a matched part was captured in a group or not. So how can I implement the same removeComments logic in Golang as in the JavaScript version?

Answer 1

Score: 3

BACKGROUND

The correct way to do this is to match and capture quoted strings (bearing in mind that they can contain escape sequences) and to match multiline comments without capturing them.

REGEX IN-CODE DEMO

Here is the code to deal with that:

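package main

import (
    "fmt"
    "regexp"
)

func main() {
    // Group 1 captures double-quoted strings; the second alternative
    // matches /* ... */ comments without capturing them.
    reg := regexp.MustCompile(`("[^"\\]*(?:\\.[^"\\]*)*")|/\*[^*]*\*+(?:[^/*][^*]*\*+)*/`)
    txt := `random text
            /* removable comment */
            "but /* never remove this */ one"
             more random *text*`
    // "$1" puts quoted strings back and expands to nothing for comments.
    fmt.Println(reg.ReplaceAllString(txt, "$1"))
}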

See the Playground demo

EXPLANATION

The regex I suggest is written with the Best Regex Trick Ever concept in mind and consists of 2 alternatives:

  • ("[^"\\]*(?:\\.[^"\\]*)*") - double-quoted string literal regex. This is Group 1 (the capturing group formed by the outer pair of unescaped parentheses, later accessible via replacement backreferences); it matches double-quoted string literals that may contain escape sequences. This part matches:
    • " - a leading double quote
    • [^"\\]* - 0+ characters other than " and \ (the [^...] construct is a negated character class that matches any character except those listed inside it, and * is a quantifier matching zero or more occurrences)
    • (?:\\.[^"\\]*)* - 0+ sequences (see the last *; the non-capturing group only groups subpatterns without forming a capture) of an escape sequence (\\. matches a literal \ followed by any character other than a newline) followed by 0+ characters other than " and \
    • " - the trailing double quote
  • | - or
  • /\*[^*]*\*+(?:[^/*][^*]*\*+)*/ - multiline comment regex. This part matches without forming a capture group (so it is unavailable from the replacement pattern via backreferences) and matches:
    • / - a literal slash
    • \* - a literal asterisk
    • [^*]* - zero or more characters other than an asterisk
    • \*+ - 1 or more asterisks (+ is a quantifier matching one or more occurrences)
    • (?:[^/*][^*]*\*+)* - 0+ sequences (non-capturing, since we do not use them later) of any character but a / or * (see [^/*]), followed by 0+ characters other than an asterisk (see [^*]*) and then by 1+ asterisks (see \*+)
    • / - a literal (trailing, closing) slash.

NOTE: This multiline comment regex is the fastest I have ever tested. Same goes for the double quoted literal regex as "[^"\\]*(?:\\.[^"\\]*)*" is written with the unroll-the-loop technique in mind: no alternations, only character classes with * and + quantifiers are used in a specific order to allow the fastest matching.
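
The question asks how to tell whether a match came from the capturing group. As a minimal sketch (the helper name classify is hypothetical and not part of the answer), FindAllStringSubmatchIndex reports -1 for the indices of a group that did not participate in a match, which is also why "$1" expands to an empty string for comments:

```go
package main

import (
	"fmt"
	"regexp"
)

// Same pattern as in the answer: group 1 captures quoted strings,
// the second alternative matches block comments without capturing.
var commentOrString = regexp.MustCompile(`("[^"\\]*(?:\\.[^"\\]*)*")|/\*[^*]*\*+(?:[^/*][^*]*\*+)*/`)

// classify is a hypothetical helper that labels every match as either
// a quoted string (group 1 participated) or a comment (it did not).
func classify(src string) []string {
	var out []string
	for _, loc := range commentOrString.FindAllStringSubmatchIndex(src, -1) {
		// loc[0], loc[1] bound the whole match; loc[2], loc[3] bound group 1.
		// Both group indices are -1 when the group did not participate.
		if loc[2] >= 0 {
			out = append(out, "string: "+src[loc[2]:loc[3]])
		} else {
			out = append(out, "comment: "+src[loc[0]:loc[1]])
		}
	}
	return out
}

func main() {
	for _, m := range classify(`a /* gone */ "kept /* here */" b`) {
		fmt.Println(m)
	}
}
```

The same check could drive a manual rebuild of the text, but for plain removal ReplaceAllString with "$1" is shorter.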

NOTES ON PATTERN ENHANCEMENTS

If you plan to extend this to single-quoted literals, nothing could be easier: add another alternative to the 1st capture group by reusing the double-quoted string literal regex and replacing the double quotes with single ones:

reg := regexp.MustCompile(`("[^"\\]*(?:\\.[^"\\]*)*"|'[^'\\]*(?:\\.[^'\\]*)*')|/\*[^*]*\*+(?:[^/*][^*]*\*+)*/`)
                                                    ^-------------------------^

Here is the regex demo that supports single- and double-quoted literals and removes multiline comments

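Adding single-line comment support is similar: just add an alternative such as //[^\n\r]* to the end. The pattern below uses //.*[\r\n]* instead, which also consumes the line break(s) that follow the comment: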

reg := regexp.MustCompile(`("[^"\\]*(?:\\.[^"\\]*)*"|'[^'\\]*(?:\\.[^'\\]*)*')|/\*[^*]*\*+(?:[^/*][^*]*\*+)*/|//.*[\r\n]*`)
                                                                                                              ^-----------^

Here is the regex demo that supports single- and double-quoted literals and removes both multiline and single-line comments
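
Since the demo links are not reproduced here, the following is a minimal sketch that wraps the last pattern in a removeComments function (the function name comes from the question; the variable name stripRe and the sample input are assumptions made for illustration):

```go
package main

import (
	"fmt"
	"regexp"
)

// stripRe is the final pattern from the answer: group 1 keeps single- and
// double-quoted literals, the other alternatives match block comments and
// line comments (including the line breaks that follow a line comment).
var stripRe = regexp.MustCompile(`("[^"\\]*(?:\\.[^"\\]*)*"|'[^'\\]*(?:\\.[^'\\]*)*')|/\*[^*]*\*+(?:[^/*][^*]*\*+)*/|//.*[\r\n]*`)

// removeComments replaces every match with group 1, so quoted literals
// survive unchanged and comments are replaced with nothing.
func removeComments(src string) string {
	return stripRe.ReplaceAllString(src, "$1")
}

func main() {
	src := "x := 1 // note\ny := \"// kept\" /* gone */\n"
	fmt.Printf("%q\n", removeComments(src))
}
```

Note that because the line-comment alternative also consumes the trailing line break, the two source lines end up joined. Also keep in mind that an apostrophe in ordinary text (as in "don't") is treated as the start of a single-quoted literal by this pattern.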

Answer 2

Score: 2

I've never read or written anything in Go, so bear with me. Fortunately, I know regex. I did a little research on Go regexes, and it seems that they lack some features found in other modern engines (such as backreferences).

Despite that, I've developed a regex that seems to be what you're looking for. I'm assuming that all strings are single line. Here it is:

reg := regexp.MustCompile(`(?m)^([^"\n]*)/\*([^*]+|(\*+[^/]))*\*+/`)

txt := `random text
        /* removable comment */
        "but /* never remove this */ one"
        more random *text*`

// "${1}" re-inserts the captured non-comment prefix of each line.
fmt.Println(reg.ReplaceAllString(txt, "${1}"))

Variation: The version above will not remove comments that happen after quotation marks. This version will, but it may need to be run multiple times.

reg := regexp.MustCompile(
   `(?m)^(([^"\n]*|("[^"\n]*"))*)/\*([^*]+|(\*+[^/]))*\*+/`
)
txt := `
   random text
   what /* removable comment */
   hi "but /* never remove this */ one" then /*whats here*/ i don't know /*what*/
   more random *text*
`
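// "${1}" re-inserts the captured non-comment prefix. The replacement is run
// twice because each pass removes at most one comment per line.
newtxt := reg.ReplaceAllString(txt, "${1}")
fmt.Println(newtxt)
newtxt = reg.ReplaceAllString(newtxt, "${1}")
fmt.Println(newtxt)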

Explanation

  • (?m) means multiline mode. Regex101 gives a nice explanation of this:

    > The ^ and $ anchors now match at the beginning/end of each line respectively, instead of beginning/end of the entire string.

    The pattern needs to be anchored to the beginning of each line (with ^) to ensure a quote hasn't started.

  • The first regex has this: [^"\n]*. Essentially, it matches everything that's not " or \n. I've added parentheses because this text isn't part of a comment, so it needs to be put back.

  • The second regex has this: (([^"\n]*|("[^"\n]*"))*). With it, the regex can either match [^"\n]* (as the first regex does) or (|) match a pair of quotes and the content between them with "[^"\n]*". The group repeats, so it also works when there is more than one pair of quotes. Note that, as in the simpler regex, this non-comment text is captured.

  • Both regexes use this: /\*([^*]+|(\*+[^/]))*\*+/. It matches /* followed by any number of either:

    • [^*]+ - non-* characters

      or

    • \*+[^/] - one or more *s that are not followed by /

  • It then matches the closing */.

  • During replacement, ${1} refers to the non-comment text that was captured, so it is reinserted into the string.
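
Because the pattern is anchored at the start of a line, one pass can remove at most one comment per line, which is why the variation may need several runs. A minimal sketch of repeating the replacement until nothing changes (the names lineComment and stripUntilStable are hypothetical), using "${1}" to put the captured prefix back:

```go
package main

import (
	"fmt"
	"regexp"
)

// lineComment is the second pattern from the answer: group 1 captures the
// non-comment prefix of a line, including complete quoted strings.
var lineComment = regexp.MustCompile(`(?m)^(([^"\n]*|("[^"\n]*"))*)/\*([^*]+|(\*+[^/]))*\*+/`)

// stripUntilStable re-applies the replacement until the text stops
// changing, i.e. until no unquoted comment is left on any line.
func stripUntilStable(src string) string {
	for {
		next := lineComment.ReplaceAllString(src, "${1}")
		if next == src {
			return src
		}
		src = next
	}
}

func main() {
	fmt.Printf("%q\n", stripUntilStable(`"q /* a */" x /* b */ y /* c */`))
}
```

The loop terminates because every effective pass removes at least one comment opener from the text.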

Answer 3

Score: 2

Just for fun, here is another approach: a minimal lexer implemented as a state machine, inspired by (and well described in) Rob Pike's talk http://cuddle.googlecode.com/hg/talk/lex.html. The code is more verbose but more readable, understandable and hackable than a regexp. It also works with any Reader and Writer, not only strings, so it doesn't need to hold the whole input in memory and may even be faster.

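// Requires: import "io"

type stateFn func(*lexer) stateFn

// run drives the state machine until a state function returns nil.
func run(l *lexer) {
    for state := lexText; state != nil; {
        state = state(l)
    }
}

type lexer struct {
    io.RuneScanner // RuneScanner (not just RuneReader) so one rune of lookahead can be pushed back
    io.Writer
}

// lexText copies plain text and switches state on a quote or a comment opener.
func lexText(l *lexer) stateFn {
    for r, _, err := l.ReadRune(); err == nil; r, _, err = l.ReadRune() {
        switch r {
        case '"':
            l.Write([]byte(string(r)))
            return lexQuoted
        case '/':
            next, _, err := l.ReadRune()
            if err == nil && next == '*' {
                return lexComment
            }
            l.Write([]byte("/"))
            if err == nil {
                l.UnreadRune() // not a comment: let the loop handle this rune
            }
        default:
            l.Write([]byte(string(r)))
        }
    }
    return nil
}

// lexQuoted copies a quoted string verbatim, honoring backslash escapes.
func lexQuoted(l *lexer) stateFn {
    escaped := false
    for r, _, err := l.ReadRune(); err == nil; r, _, err = l.ReadRune() {
        l.Write([]byte(string(r)))
        if r == '"' && !escaped {
            return lexText
        }
        escaped = r == '\\' && !escaped
    }
    return nil
}

// lexComment discards everything up to and including the closing */.
func lexComment(l *lexer) stateFn {
    var prev rune
    for r, _, err := l.ReadRune(); err == nil; r, _, err = l.ReadRune() {
        if prev == '*' && r == '/' {
            return lexText
        }
        prev = r
    }
    return nil
}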

You can see it working at http://play.golang.org/p/HyvEeANs1u
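
For readers who cannot open the playground link, here is a condensed, self-contained sketch of the same state-machine idea. It is not the answer's code: it folds the states into a single loop, and the names stripComments and stripString are assumptions made for illustration. It only handles /* ... */ comments and double-quoted strings.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"strings"
)

// stripComments copies src to dst, dropping /* ... */ comments that
// appear outside double-quoted strings.
func stripComments(dst io.Writer, src io.Reader) error {
	const (
		text = iota
		quoted
		comment
	)
	in := bufio.NewReader(src)
	out := bufio.NewWriter(dst)
	state := text
	var prev rune // previous rune inside a string or comment, 0 if none
	for {
		r, _, err := in.ReadRune()
		if err == io.EOF {
			return out.Flush()
		}
		if err != nil {
			return err
		}
		switch state {
		case text:
			if r == '/' {
				// Peek one byte ahead to see whether a comment starts here.
				if b, _ := in.Peek(1); len(b) == 1 && b[0] == '*' {
					in.ReadByte() // consume the '*'
					state, prev = comment, 0
					continue
				}
			}
			if r == '"' {
				state, prev = quoted, 0
			}
			out.WriteRune(r)
		case quoted:
			out.WriteRune(r)
			if r == '"' && prev != '\\' {
				state = text
			}
			if r == '\\' && prev == '\\' {
				prev = 0 // an escaped backslash does not escape the next rune
			} else {
				prev = r
			}
		case comment:
			if r == '/' && prev == '*' {
				state = text
			}
			prev = r
		}
	}
}

// stripString is a convenience wrapper for in-memory strings.
func stripString(s string) string {
	var sb strings.Builder
	if err := stripComments(&sb, strings.NewReader(s)); err != nil {
		return s
	}
	return sb.String()
}

func main() {
	fmt.Println(stripString(`a /* gone */ "kept /* here */" b`))
}
```

Because it reads from an io.Reader and writes to an io.Writer, the same function can be pointed at files or network streams without loading the whole input.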

Answer 4

Score: 2

These do not preserve formatting


Preferred way (produces a NULL if group 1 is not matched)
works in golang playground -

# https://play.golang.org/p/yKtPk5QCQV
# fmt.Println(reg.ReplaceAllString(txt, "$1"))
# (?:/\*[^*]*\*+(?:[^/*][^*]*\*+)*/|//[^\n]*(?:\n|$))|("[^"\\]*(?:\\[\S\s][^"\\]*)*"|'[^'\\]*(?:\\[\S\s][^'\\]*)*'|[\S\s][^/"'\\]*)
(?:                              # Comments 
/\*                              # Start /* .. */ comment
[^*]* \*+
(?: [^/*] [^*]* \*+ )*
/                                # End /* .. */ comment
|  
//  [^\n]*                       # Start // comment
(?: \n | $ )                     # End // comment
)
|  
(                                # (1 start), Non - comments 
"
[^"\\]*                          # Double quoted text
(?: \\ [\S\s] [^"\\]* )*
"
|  
'
[^'\\]*                          # Single quoted text
(?: \\ [\S\s] [^'\\]* )*
' 
|  [\S\s]                           # Any other char
[^/"'\\]*                        # Chars which don't start a comment, string, escape, or line continuation (escape + newline)
)                                # (1 end)
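
As a minimal sketch of how the one-line form of this pattern can be used from Go (the expanded, commented layout above is for reading only, since Go's regexp package has no free-spacing flag), the following wraps it in a hypothetical stripAll function; the sample input is an assumption made for illustration:

```go
package main

import (
	"fmt"
	"regexp"
)

// allRe is the "preferred" pattern in its one-line form: the first
// alternative matches comments without capturing, group 1 captures
// quoted literals and every other run of ordinary text.
var allRe = regexp.MustCompile(`(?:/\*[^*]*\*+(?:[^/*][^*]*\*+)*/|//[^\n]*(?:\n|$))|("[^"\\]*(?:\\[\S\s][^"\\]*)*"|'[^'\\]*(?:\\[\S\s][^'\\]*)*'|[\S\s][^/"'\\]*)`)

// stripAll keeps whatever group 1 captured; for comment matches the group
// is unset, so "$1" expands to the empty string.
func stripAll(src string) string {
	return allRe.ReplaceAllString(src, "$1")
}

func main() {
	src := "x = 1 /* c */ + 2 // t\ny = \"a /* b */\""
	fmt.Printf("%q\n", stripAll(src))
}
```

Unlike the patterns in the first answer, every part of the input is matched here, either as a comment or as captured text, so the replacement effectively rebuilds the whole string from the group 1 captures.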

Alternative way (group 1 is always matched, but could be empty)
works in golang playground -

# https://play.golang.org/p/7FDGZSmMtP
# fmt.Println(reg.ReplaceAllString(txt, "$1"))
# (?:/\*[^*]*\*+(?:[^/*][^*]*\*+)*/|//[^\n]*(?:\n|$))?((?:"[^"\\]*(?:\\[\S\s][^"\\]*)*"|'[^'\\]*(?:\\[\S\s][^'\\]*)*'|[\S\s][^/"'\\]*)?)
(?:                              # Comments 
/\*                              # Start /* .. */ comment
[^*]* \*+
(?: [^/*] [^*]* \*+ )*
/                                # End /* .. */ comment
|  
//  [^\n]*                       # Start // comment
(?: \n | $ )                     # End // comment
)?
(                                # (1 start), Non - comments 
(?:
"
[^"\\]*                          # Double quoted text
(?: \\ [\S\s] [^"\\]* )*
"
|  
'
[^'\\]*                          # Single quoted text
(?: \\ [\S\s] [^'\\]* )*
' 
|  [\S\s]                           # Any other char
[^/"'\\]*                        # Chars which don't start a comment, string, escape, or line continuation (escape + newline)
)?
)                                # (1 end)

The Cadillac - Preserves Formatting

(Unfortunately, this can't be done in Golang, because Go's regexp package does not support lookahead assertions.)
Posted in case you move to a different regex engine.

# raw:   ((?:(?:^[ \t]*)?(?:/\*[^*]*\*+(?:[^/*][^*]*\*+)*/(?:[ \t]*\r?\n(?=[ \t]*(?:\r?\n|/\*|//)))?|//(?:[^\\]|\\(?:\r?\n)?)*?(?:\r?\n(?=[ \t]*(?:\r?\n|/\*|//))|(?=\r?\n))))+)|("[^"\\]*(?:\\[\S\s][^"\\]*)*"|'[^'\\]*(?:\\[\S\s][^'\\]*)*'|(?:\r?\n|[\S\s])[^/"'\\\s]*)
# delimited:  /((?:(?:^[ \t]*)?(?:\/\*[^*]*\*+(?:[^\/*][^*]*\*+)*\/(?:[ \t]*\r?\n(?=[ \t]*(?:\r?\n|\/\*|\/\/)))?|\/\/(?:[^\\]|\\(?:\r?\n)?)*?(?:\r?\n(?=[ \t]*(?:\r?\n|\/\*|\/\/))|(?=\r?\n))))+)|("[^"\\]*(?:\\[\S\s][^"\\]*)*"|'[^'\\]*(?:\\[\S\s][^'\\]*)*'|(?:\r?\n|[\S\s])[^\/"'\\\s]*)/
(                                # (1 start), Comments 
(?:
(?: ^ [ \t]* )?                  # <- To preserve formatting
(?:
/\*                              # Start /* .. */ comment
[^*]* \*+
(?: [^/*] [^*]* \*+ )*
/                                # End /* .. */ comment
(?:                              # <- To preserve formatting 
[ \t]* \r? \n                                      
(?=
[ \t]*                  
(?: \r? \n | /\* | // )
)
)?
|  
//                               # Start // comment
(?:                              # Possible line-continuation
[^\\] 
|  \\ 
(?: \r? \n )?
)*?
(?:                              # End // comment
\r? \n                               
(?=                              # <- To preserve formatting
[ \t]*                          
(?: \r? \n | /\* | // )
)
|  (?= \r? \n )
)
)
)+                               # Grab multiple comment blocks if need be
)                                # (1 end)
|                                 ## OR
(                                # (2 start), Non - comments 
"
[^"\\]*                          # Double quoted text
(?: \\ [\S\s] [^"\\]* )*
"
|  
'
[^'\\]*                          # Single quoted text
(?: \\ [\S\s] [^'\\]* )*
' 
|  
(?: \r? \n | [\S\s] )            # Linebreak or Any other char
[^/"'\\\s]*                      # Chars which don't start a comment, string, escape,
# or line continuation (escape + newline)
)                                # (2 end)

Answer 5

Score: 1

Demo

Play golang demo

(The workings at each stage are output and the end result can be seen by scrolling down.)

Method

A few "tricks" are used to work around Golang's somewhat limited regex syntax:

  1. Replace start quotes and end quotes with unique characters. Crucially, the characters used to identify start and end quotes must be different from each other and extremely unlikely to appear in the text being processed.
  2. Replace all comment starters (/*) that are not preceded by an unterminated start quote with a unique sequence of one or more characters.
  3. Similarly, replace all comment enders (*/) that are not followed by an end quote that does not have a start quote before it with a different unique sequence of one or more characters.
  4. Remove all remaining /*...*/ comment sequences.
  5. Unmask the previously "masked" comment starters/enders by reversing the replacements made in steps 2 and 3 above.

Limitations

The current demo doesn't address the possibility of a double quote appearing within a comment, e.g. /* Not expected: " */. Note: My feeling is this could be handled - just haven't put the effort in yet - so let me know if you think it could be an issue and I'll look into it.
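
Since the demo code is not reproduced here, the following is only a simplified sketch of the masking idea, not the answer's implementation. It masks the comment delimiters that sit inside double-quoted strings, removes the comments that remain, and then unmasks. The function name maskAndStrip and the NUL-based placeholder sequences are assumptions, and the sketch shares the limitation described above (a double quote inside a comment will confuse it).

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

const (
	openMask  = "\x00O\x00" // hypothetical placeholder for "/*" inside a string
	closeMask = "\x00C\x00" // hypothetical placeholder for "*/" inside a string
)

var (
	quotedRe  = regexp.MustCompile(`"[^"\\]*(?:\\.[^"\\]*)*"`)
	commentRe = regexp.MustCompile(`/\*[^*]*\*+(?:[^/*][^*]*\*+)*/`)
)

// maskAndStrip assumes the input contains no NUL bytes, so the
// placeholders cannot collide with real text.
func maskAndStrip(src string) string {
	// Mask comment delimiters that appear inside quoted strings.
	masked := quotedRe.ReplaceAllStringFunc(src, func(s string) string {
		s = strings.ReplaceAll(s, "/*", openMask)
		return strings.ReplaceAll(s, "*/", closeMask)
	})
	// Remove all comments that are still visible.
	stripped := commentRe.ReplaceAllString(masked, "")
	// Unmask by reversing the replacements.
	stripped = strings.ReplaceAll(stripped, openMask, "/*")
	return strings.ReplaceAll(stripped, closeMask, "*/")
}

func main() {
	fmt.Println(maskAndStrip(`a /* c */ "keep /* this */" b`))
}
```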

Answer 6

Score: 0

Try this example..

play golang

huangapple
  • Published on April 20, 2016, 01:16:52
  • When reposting, please keep the link to this article: https://go.coder-hub.com/36725194.html