How to split a string on whitespace and on special characters while getting their offset values in Java
Question
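import re
text = "I live, in India."
# Python's re module does not support \p{Punct}, so match either a run of word
# characters or a single character that is neither a word character nor whitespace.
pattern = r"\w+|[^\w\s]"
matches = re.finditer(pattern, text)
output = []
offsets = []
for match in matches:
    token = match.group(0)
    output.append(token)
    # start() is inclusive, end() is exclusive
    start = match.start()
    end = match.end()
    offsets.append((start, end))
print(output)
print(offsets)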
I am trying to split/match a string on punctuation and whitespace, and I also need the offset values of the resulting tokens.
Example: "I live, in India."
I want output like ["I", "live", ",", "in", "India", "."]
and also the start and end index of each token.
I have tried using:
String text = "I live, in India.";
Pattern p1 = Pattern.compile("\\S+");
Pattern p2 = Pattern.compile("\\p{Punct}");
Matcher m1 = p1.matcher(text);
Matcher m2 = p2.matcher(text);
This gives the desired result, but can I combine both patterns into a single pattern?
Answer 1
Score: 4
<details>
<summary>English:</summary>
## [`Matcher#start`][1]
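    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class Main {
        public static void main(String[] args) {
            // Either a run of non-whitespace characters that starts and ends
            // at a word boundary, or a single punctuation character
            Pattern pattern = Pattern.compile("\\b\\S+\\b|\\p{Punct}");
            Matcher matcher = pattern.matcher("I live, in India.");
            while (matcher.find()) {
                // group() is the matched token; start() is its offset in the input
                System.out.println(matcher.group() + " => " + matcher.start());
            }
        }
    }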
**Output:**
    I => 0
    live => 2
    , => 6
    in => 8
    India => 11
    . => 16
**Explanation of regex:**
 1. `\b` matches a [word boundary][2].
 2. `|` means `OR` (alternation between the two sub-patterns).
 3. `\p{Punct}` matches a single [punctuation][3] character.
 4. `\S+` matches [one or more][4] non-whitespace characters.
  [1]: https://docs.oracle.com/javase/7/docs/api/java/util/regex/Matcher.html#start()
  [2]: https://docs.oracle.com/javase/tutorial/essential/regex/bounds.html
  [3]: https://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html
  [4]: https://docs.oracle.com/javase/tutorial/essential/regex/quant.html
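
The question also asks for the end index of each token, which the output above does not show. Here is a minimal sketch that collects both offsets using `Matcher#start` and `Matcher#end`. The class names `TokenOffsets` and `Token` are made up for illustration, and the pattern `\w+|\p{Punct}` is an assumed variant, not the pattern from the answer. I swapped it in because `\b\S+\b` keeps input such as `a,b` as one token (`\S+` can run across the comma), which may or may not be what you want.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TokenOffsets {

    /** Hypothetical holder for one token and its [start, end) offsets. */
    static final class Token {
        final String text;
        final int start;
        final int end;

        Token(String text, int start, int end) {
            this.text = text;
            this.start = start;
            this.end = end;
        }

        @Override
        public String toString() {
            return text + " [" + start + ", " + end + ")";
        }
    }

    // Assumed variant of the answer's pattern: a run of word characters,
    // or a single punctuation character.
    private static final Pattern TOKEN = Pattern.compile("\\w+|\\p{Punct}");

    static List<Token> tokenize(String input) {
        List<Token> tokens = new ArrayList<>();
        Matcher matcher = TOKEN.matcher(input);
        while (matcher.find()) {
            // start() is inclusive; end() is the index just past the last matched character
            tokens.add(new Token(matcher.group(), matcher.start(), matcher.end()));
        }
        return tokens;
    }

    public static void main(String[] args) {
        for (Token token : tokenize("I live, in India.")) {
            System.out.println(token);
        }
    }
}
```

For the sample sentence this prints `I [0, 1)`, `live [2, 6)`, `, [6, 7)`, `in [8, 10)`, `India [11, 16)` and `. [16, 17)`, one per line. Note that `\w` includes the underscore, so with this variant an underscore stays attached to the surrounding word.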
</details>