How to split a string on whitespace and special characters in Java while getting their offset values

Question

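import re

text = "I live, in India."

# Python's built-in re module has no \p{Punct} class, so punctuation is
# approximated here as any character that is neither a word character
# nor whitespace.
pattern = r"\w+|[^\w\s]"

output = []   # matched tokens
offsets = []  # (start, end) pairs; end is exclusive

for match in re.finditer(pattern, text):
    output.append(match.group(0))
    offsets.append((match.start(), match.end()))

print(output)   # ['I', 'live', ',', 'in', 'India', '.']
print(offsets)  # [(0, 1), (2, 6), (6, 7), (8, 10), (11, 16), (16, 17)]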

I am trying to split/match a string on punctuation and whitespace, and I also need the offset values of the resulting tokens.

Example: "I live, in India."

I want output like ["I", "live", ",", "in", "India", "."],
along with the start and end index of each token.

I have tried using:

String text = "I live, in India.";

Pattern p1 = Pattern.compile("\\S+");
Pattern p2 = Pattern.compile("\\p{Punct}");		    
Matcher m1 = p1.matcher(text);
Matcher m2 = p2.matcher(text);

This gives the desired result, but can I combine both patterns into a single pattern?

Answer 1

Score: 4

Use [Matcher#start][1] to get the start offset of each match:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Main {
    public static void main(String[] args) {
        Pattern pattern = Pattern.compile("\\b\\S+\\b|\\p{Punct}");
        Matcher matcher = pattern.matcher("I live, in India.");
        while (matcher.find()) {
            System.out.println(matcher.group() + " => " + matcher.start());
        }
    }
}

Output:

I => 0
live => 2
, => 6
in => 8
India => 11
. => 16
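The question also asks for the end index of each token, which the loop above does not print. A minimal sketch, assuming the same pattern, that records both offsets with `Matcher#start` and `Matcher#end` (the end offset is exclusive). The names `TokenOffsets`, `Token`, and `tokenize` are hypothetical and introduced here only for illustration:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TokenOffsets {
    // Hypothetical holder for one token and its offsets (end is exclusive).
    public static final class Token {
        public final String text;
        public final int start;
        public final int end;

        Token(String text, int start, int end) {
            this.text = text;
            this.start = start;
            this.end = end;
        }

        @Override
        public String toString() {
            return text + " [" + start + ", " + end + ")";
        }
    }

    // Same pattern as in the answer above.
    private static final Pattern TOKEN = Pattern.compile("\\b\\S+\\b|\\p{Punct}");

    // Hypothetical helper: returns every match together with its offsets.
    public static List<Token> tokenize(String input) {
        List<Token> tokens = new ArrayList<>();
        Matcher matcher = TOKEN.matcher(input);
        while (matcher.find()) {
            tokens.add(new Token(matcher.group(), matcher.start(), matcher.end()));
        }
        return tokens;
    }

    public static void main(String[] args) {
        for (Token token : tokenize("I live, in India.")) {
            System.out.println(token);
        }
    }
}
```

Because the end offset is exclusive, `input.substring(token.start, token.end)` reproduces the token text.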

Explanation of regex:

  1. \b specifies [word boundary][2].
  2. | specifies OR.
  3. \p{Punct} specifies [punctuation][3].
  4. \S+ specifies [one or more][4] non-whitespace characters.
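One caveat that follows from the explanation above: because \S+ is greedy and \b only requires a word character on one side, punctuation that sits between two word characters with no surrounding whitespace (for example "a,b") is matched as a single token. The sketch below compares the answer's pattern with a possible alternative, \w+|\p{Punct}, which is an assumption on my part and not part of the original answer. The class and method names are hypothetical:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PatternComparison {
    // Hypothetical helper: collects every match of the regex in the input, in order.
    public static List<String> matches(String regex, String input) {
        List<String> found = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(input);
        while (matcher.find()) {
            found.add(matcher.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String input = "a,b";
        // Pattern from the answer: the whole string is one token.
        System.out.println(matches("\\b\\S+\\b|\\p{Punct}", input));
        // Alternative (an assumption, not from the answer): three tokens.
        System.out.println(matches("\\w+|\\p{Punct}", input));
    }
}
```

For the question's input "I live, in India." both patterns yield the same six tokens, so the choice only matters when punctuation can appear inside a run of non-whitespace characters.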

  [1]: https://docs.oracle.com/javase/7/docs/api/java/util/regex/Matcher.html#start()
  [2]: https://docs.oracle.com/javase/tutorial/essential/regex/bounds.html
  [3]: https://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html
  [4]: https://docs.oracle.com/javase/tutorial/essential/regex/quant.html



huangapple
  • Published on 2020-10-14 01:02:41
  • Source: https://go.coder-hub.com/64339842.html