How to create a parameterized regex (in C#) that matches strings delimited by a custom multi-character delimiter?

Question

I want to find strings in a text. The text can contain multiple lines. The strings are delimited by a custom delimiter, which should be a parameter. There can be multiple strings in the text, even on one line. For example, if the delimiter is three double quotation marks ("""), then in this text:

> lorem ipsum """findthis""" "but not this" 'nor this'
> """anotherstringtofind"""
>
> ""blabla"" """yet another""""""text to find"""

It should find: findthis, anotherstringtofind, yet another, text to find.
(Note that the delimiters are not part of the matched strings, although I can remove them in C# if needed.)

I can do something similar, but only for single-character delimiters,
with the regex: "[{0}](([^{0}])*)[{0}]"

Like this:

public static MatchCollection FindString(this string input, char delimiter, RegexOptions regexOptions = RegexOptions.Multiline)
{
    var regexString = string.Format("[{0}](([^{0}])*)[{0}]", delimiter);
    var rx = new Regex(regexString, regexOptions);

    MatchCollection matches = rx.Matches(input);

    return matches;
}

I guess the solution would use look-ahead operators, but I could not figure out how to combine them with something that has an effect similar to [^] for single characters. Is it even possible to "negate" a whole sequence of characters (so that they are not included in the matches)?

I think this question is similar, but I'm not familiar with Python.

Some clarification:
My expectation is that each delimiter pair is used exactly once. So, e.g., this test should pass:

var inputText = "??abc?? ??def?? ??xyz??";

var matches = inputText.FindString("??", RegexOptions.Singleline);

Assert.Equal(3, matches.Count);

Is it possible to solve this in C# using regex?
Thank you in advance!

Answer 1

Score: 1

You can use a lazy quantifier instead of a negated character class. In your example with """, this leads to a regex like """(.*?)"""

Also, note that your current attempt incorrectly uses character classes for the delimiters: ["""] is equivalent to ["], which in turn is equivalent to a plain ". Use your delimiter as is, without any additional wrappers.

But don't forget to escape your delimiter before using it in a regex. For example, if your delimiter is [], it should appear in the regex as \[\].
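To make the two points above concrete, here is a minimal sketch. The class and method names are hypothetical and only serve as illustration. It applies the question's character-class pattern to a multi-character delimiter, and shows what Regex.Escape does to the delimiter "??".

```csharp
using System;
using System.Text.RegularExpressions;

public static class DelimiterPitfalls
{
    // The question's original pattern, applied to a multi-character delimiter.
    // For "??" it becomes "[??](([^??])*)[??]", and "[??]" is a character
    // class, so it matches one single '?' rather than the sequence "??".
    public static int CountWithCharacterClass(string input, string delimiter)
    {
        string pattern = string.Format("[{0}](([^{0}])*)[{0}]", delimiter);
        return Regex.Matches(input, pattern).Count;
    }

    // Using the delimiter literally requires escaping regex metacharacters.
    public static string EscapedDelimiter(string delimiter)
    {
        return Regex.Escape(delimiter); // "??" becomes "\?\?"
    }
}
```

For the input "??abc?? ??def?? ??xyz??", CountWithCharacterClass returns 6 rather than the expected 3: every "??" is itself matched as an empty string between two single '?' characters, and abc, def and xyz are never captured.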

Your method would look like this:

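public static MatchCollection FindString(string input, string delimiter, RegexOptions regexOptions = RegexOptions.Multiline)
{
    // Escape the delimiter so that characters such as ? or [ are matched literally.
    string pattern = string.Format("{0}(.*?){0}", Regex.Escape(delimiter));
    var rx = new Regex(pattern, regexOptions);
    // Group 1 of each match holds the text between the delimiters.
    return rx.Matches(input);
}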
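As a usage sketch, the same lazy-quantifier pattern can be wrapped in a helper that returns only the text between the delimiters. The names DelimitedStringFinder and FindContents are assumptions made for this example; they do not come from the question or the answer.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class DelimitedStringFinder
{
    // Hypothetical helper: returns the text between each delimiter pair.
    public static List<string> FindContents(string input, string delimiter,
        RegexOptions options = RegexOptions.None)
    {
        // Escape the delimiter so it is matched literally.
        string d = Regex.Escape(delimiter);
        var rx = new Regex(d + "(.*?)" + d, options);

        // Match.Value would include the delimiters; group 1 is the content only.
        return rx.Matches(input)
                 .Cast<Match>()
                 .Select(m => m.Groups[1].Value)
                 .ToList();
    }
}
```

Calling DelimitedStringFinder.FindContents("??abc?? ??def?? ??xyz??", "??") yields abc, def and xyz, so each delimiter pair is consumed exactly once. If a delimited string may span several lines, pass RegexOptions.Singleline, because by default . does not match a newline.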

>Is it even possible to "negate" a whole sequence of characters

Yes, it is possible: (?:(?!foo).)+ matches a run of characters in which foo does not occur. For your example: """(?:(?!""").)*""". But it would perform much worse than a simple lazy quantifier.
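A parameterized version of this negative look-ahead form might look as follows. This is only a sketch: TemperedFinder and FindContents are hypothetical names, and the capturing group around the repeated part is an addition for convenience.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class TemperedFinder
{
    // Hypothetical helper: consumes one character at a time, but only while
    // the delimiter does not start at the current position.
    public static List<string> FindContents(string input, string delimiter,
        RegexOptions options = RegexOptions.None)
    {
        string d = Regex.Escape(delimiter);
        // For "??" this builds: \?\?((?:(?!\?\?).)*)\?\?
        string pattern = d + "((?:(?!" + d + ").)*)" + d;

        return Regex.Matches(input, pattern, options)
                    .Cast<Match>()
                    .Select(m => m.Groups[1].Value)
                    .ToList();
    }
}
```

For a pattern this simple, where nothing follows the closing delimiter, the look-ahead form finds the same strings as the lazy quantifier, so the lazy version is usually the better default.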

huangapple
  • Published on June 30, 2023
  • Please keep this link when reposting: https://go.coder-hub.com/76584544.html