Split on capitalized words not between underscores

Question

Given the following string: ThisIsA_SimpleTest_Case

I want to split before each capitalized word that is not between underscores, and at the underscores that enclose a group of words, keeping the text between them in one piece.

The expected split result: This Is A SimpleTest Case

I came up with the following non-working regex, for the Java regex flavor:

(?=_[a-zA-Z]*_|[A-Z])

But this of course doesn't work, since it's an OR and not an AND. It also splits on every capitalized word between the underscores, which is something I want to ignore.
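To make the failure concrete, here is a minimal sketch that runs the regex from the question through `String.split`. The class and method names (`NaiveSplitDemo`, `naiveSplit`) are made up for illustration, and the sketch assumes Java 8 or later, where a zero-width match at the start of the input does not produce a leading empty string.

```java
import java.util.Arrays;

public class NaiveSplitDemo {
    // The attempt from the question: split before "_letters_" OR before any capital letter.
    static final String NAIVE_REGEX = "(?=_[a-zA-Z]*_|[A-Z])";

    // Hypothetical helper, named here only for the demonstration.
    static String[] naiveSplit(String input) {
        return input.split(NAIVE_REGEX);
    }

    public static void main(String[] args) {
        // Both branches are lookaheads, so nothing is consumed:
        // the underscores stay in the output and "SimpleTest" is still cut in two.
        System.out.println(Arrays.toString(naiveSplit("ThisIsA_SimpleTest_Case")));
    }
}
```

Under those assumptions this should print `[This, Is, A, _, Simple, Test_, Case]`, which shows both problems: the underscores are never removed, and the capital `T` inside the underscores still triggers a split.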

Answer 1

Score: 1

Wiktor is right: it should be easier to match what you want than to split on what you don't want.

But since it's a fun challenge, here is a regex that splits the string the way you wanted:
_|(?<!_)(?=[A-Z])(?=[^_]*(?:_[^_]*_[^_]*)*[^_]*$)

It also works with multiple pairs of underscores.
(It can certainly be improved; I might try to simplify it.)

The idea is:

  • _| Split on any underscore, which removes it from the final list.
  • (?<!_) Not right after an underscore. Without it, you might get empty strings in the result (those positions are already handled by the _| branch). Can be skipped if you don't care.
  • (?=[A-Z]) Split before capital letters.
  • (?=[^_]*(?:_[^_]*_[^_]*)*[^_]*$) The position must be followed by an even number of underscores up to the end of the string. An odd number means the position lies between two underscores, so it should not split. This assumes the string as a whole never contains an odd number of underscores.
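The four pieces above can be tried out with a small sketch like the following. The class and method names (`UnderscoreAwareSplit`, `split`) and the second test string are assumptions added for illustration; only the regex itself comes from this answer.

```java
import java.util.Arrays;

public class UnderscoreAwareSplit {
    // Regex from the answer: split on "_" itself, or before a capital letter that is
    // not right after "_" and has an even number of underscores to its right.
    static final String SPLIT_REGEX =
            "_|(?<!_)(?=[A-Z])(?=[^_]*(?:_[^_]*_[^_]*)*[^_]*$)";

    // Hypothetical helper, named here only for the demonstration.
    static String[] split(String input) {
        return input.split(SPLIT_REGEX);
    }

    public static void main(String[] args) {
        // The string from the question.
        System.out.println(Arrays.toString(split("ThisIsA_SimpleTest_Case")));
        // A made-up string with two underscore pairs.
        System.out.println(Arrays.toString(split("OneTwo_ThreeFour_FiveSix_SevenEight_Nine")));
    }
}
```

On Java 8 or later the first call should give `[This, Is, A, SimpleTest, Case]` and the second `[One, Two, ThreeFour, Five, Six, SevenEight, Nine]`: the words inside each underscore pair stay together, while `FiveSix`, which sits between two pairs, is split normally.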

Test at https://regex101.com/r/Iov1Yl/1/
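For comparison, the matching approach mentioned at the top of this answer could look roughly like the sketch below. The pattern is an illustrative assumption, not the one Wiktor proposed, and the names `MatchTokens` and `tokens` are made up. It also assumes that underscores always come in pairs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MatchTokens {
    // Alternative 1: text enclosed in a pair of underscores, captured in group 1.
    // Alternative 2: a capital letter followed by characters that are neither capitals nor "_".
    // Alternative 3: a run without capitals or underscores (e.g. a lowercase prefix).
    private static final Pattern TOKEN =
            Pattern.compile("_([^_]+)_|[A-Z][^A-Z_]*|[^A-Z_]+");

    // Hypothetical helper: collects the tokens instead of splitting around delimiters.
    static List<String> tokens(String input) {
        List<String> result = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            // Use the captured inner text for underscore groups, the whole match otherwise.
            result.add(m.group(1) != null ? m.group(1) : m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(tokens("ThisIsA_SimpleTest_Case"));
    }
}
```

For the string from the question this should print `[This, Is, A, SimpleTest, Case]`. Because the underscore-delimited group is consumed as a single match, no lookarounds are needed to protect the capitals inside it.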

Answer 2

Score: 1

You might split on:

(?=(?<!_)[A-Z](?![A-Za-z]*_))|(?<!_[A-Za-z]{0,1000}|^)(?=[A-Z])|_
  • (?=(?<!_)[A-Z](?![A-Za-z]*_)) A position followed by a char A-Z that is not directly preceded by _ and is not followed by optional letters and then _
  • | Or
  • (?<!_[A-Za-z]{0,1000}|^)(?=[A-Z]) A position that is not preceded by _ plus 0 to 1000 letters, is not the start of the string, and is directly followed by a char A-Z
  • | Or
  • _ Match an underscore

Example code

String regex = "(?=(?<!_)[A-Z](?![A-Za-z]*_))|(?<!_[A-Za-z]{0,1000}|^)(?=[A-Z])|_";
String str = "ThisIsA_SimpleTest_Case";
String[] parts = str.split(regex);

for (String part : parts)
    System.out.println(part);

Output

This
Is
A
SimpleTest
Case

Answer 3

Score: 1

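Another approach: rewrite the string before splitting.

The string is changed in three steps before the split (this assumes the input itself contains neither "," nor "#"):

import java.util.Arrays;

public class Main {
    public static void main(String[] args) {
        String input = "ThisIsA_SimpleTest_Case";
        // Replace the opening underscore with "," and mark the inner word boundary with "#"
        String inputReplace1 = input.replaceAll("_(\\w+[a-z])([A-Z]\\w+)_", ",$1#$2");
        // Insert "," at every remaining lowercase-to-uppercase boundary
        String inputReplace2 = inputReplace1.replaceAll("(?<=[a-z])(?=[A-Z])", ",");
        // Drop the "#" marker so the protected words are joined again
        String inputReplace3 = inputReplace2.replaceAll("#", "");
        System.out.println(Arrays.asList(inputReplace3.split(",")));
    }
}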

Output:

[This, Is, A, SimpleTest, Case]

huangapple
  • Published on October 8, 2020, 21:33:35
  • When reposting, please keep the link to this article: https://go.coder-hub.com/64263648.html