How to split a string on whitespace and special characters in Java while getting their offset values

Question

    import re

    text = "I live, in India."
    # Python's re module does not support \p{Punct}; match a run of word
    # characters, or any single character that is neither word nor whitespace.
    pattern = r"\w+|[^\w\s]"
    output = []
    offsets = []
    for match in re.finditer(pattern, text):
        output.append(match.group(0))
        offsets.append((match.start(), match.end()))
    print(output)
    print(offsets)

I am trying to split a string on punctuation and whitespace, and I also need the offset of each token.

Example: "I live, in India."

I want output like ["I", "live", ",", "in", "India", "."],
together with the start and end index of each token.

I have tried:

    String text = "I live, in India.";
    Pattern p1 = Pattern.compile("\\S+");
    Pattern p2 = Pattern.compile("\\p{Punct}");
    Matcher m1 = p1.matcher(text);
    Matcher m2 = p2.matcher(text);

This gives the desired result, but can I combine both patterns into a single pattern?

Answer 1

Score: 4

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class Main {
        public static void main(String[] args) {
            Pattern pattern = Pattern.compile("\\b\\S+\\b|\\p{Punct}");
            Matcher matcher = pattern.matcher("I live, in India.");
            while (matcher.find()) {
                System.out.println(matcher.group() + " => " + matcher.start());
            }
        }
    }

Output:

    I => 0
    live => 2
    , => 6
    in => 8
    India => 11
    . => 16
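
The question also asks for the end index of each token. As a minimal sketch, assuming the same pattern as above: `Matcher#end` returns the offset just past the last matched character, so each token covers the half-open range from `start` to `end`. The class name `TokenOffsets` and the `Token` holder are hypothetical names chosen for illustration, not part of the original answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TokenOffsets {
    // Hypothetical holder for a token and its half-open span [start, end).
    static final class Token {
        final String text;
        final int start;
        final int end;

        Token(String text, int start, int end) {
            this.text = text;
            this.start = start;
            this.end = end;
        }

        @Override
        public String toString() {
            return text + "[" + start + "," + end + ")";
        }
    }

    // Same pattern as in the answer above.
    private static final Pattern TOKEN = Pattern.compile("\\b\\S+\\b|\\p{Punct}");

    static List<Token> tokenize(String input) {
        List<Token> tokens = new ArrayList<>();
        Matcher matcher = TOKEN.matcher(input);
        while (matcher.find()) {
            // start() is inclusive, end() is exclusive.
            tokens.add(new Token(matcher.group(), matcher.start(), matcher.end()));
        }
        return tokens;
    }

    public static void main(String[] args) {
        for (Token token : tokenize("I live, in India.")) {
            System.out.println(token);
        }
    }
}
```

For the sample sentence this should print `I[0,1)`, `live[2,6)`, `,[6,7)`, `in[8,10)`, `India[11,16)` and `.[16,17)`, one per line.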

Explanation of regex:

  1. `\b` specifies a [word boundary][2].
  2. `|` specifies `OR`.
  3. `\p{Punct}` specifies [punctuation][3].
  4. `\S+` specifies [one or more][4] non-whitespace characters.
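
One caveat follows from the explanation above: `\S+` is greedy and `\b` only requires a word boundary at each end of the match, so punctuation can stay inside a token when no whitespace separates two words (for example `live,in`). The sketch below compares the answer's pattern with a possible alternative, `\w+|\p{Punct}`. The alternative is an assumption for illustration, not something the original answer proposes, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PatternComparison {
    // Pattern from the answer: greedy \S+ bounded by word boundaries.
    static final Pattern ORIGINAL = Pattern.compile("\\b\\S+\\b|\\p{Punct}");
    // Alternative (an assumption, not the answer's pattern): a run of word
    // characters, or a single punctuation character.
    static final Pattern ALTERNATIVE = Pattern.compile("\\w+|\\p{Punct}");

    static List<String> tokens(Pattern pattern, String input) {
        List<String> result = new ArrayList<>();
        Matcher matcher = pattern.matcher(input);
        while (matcher.find()) {
            result.add(matcher.group());
        }
        return result;
    }

    public static void main(String[] args) {
        // Note the missing space after the comma.
        String input = "I live,in India.";
        System.out.println(tokens(ORIGINAL, input));
        System.out.println(tokens(ALTERNATIVE, input));
    }
}
```

With this input the first pattern keeps `live,in` as a single token, while the alternative splits it into `live`, `,` and `in`. Both `\w` and `\p{Punct}` are ASCII-only in Java by default, so text with non-ASCII letters or punctuation would likely need `Pattern.UNICODE_CHARACTER_CLASS` or a different character class.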

See also [`Matcher#start`][1], which supplies the offsets printed above.

  [1]: https://docs.oracle.com/javase/7/docs/api/java/util/regex/Matcher.html#start()
  [2]: https://docs.oracle.com/javase/tutorial/essential/regex/bounds.html
  [3]: https://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html
  [4]: https://docs.oracle.com/javase/tutorial/essential/regex/quant.html

huangapple
  • Posted on 2020-10-14 01:02:41
  • When reposting, please keep this link: https://go.coder-hub.com/64339842.html