How to split a string on whitespace and special characters while getting their offset values in Java
Question
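import re

text = "I live, in India."
# Python's re module does not support \p{Punct}, so match either a run of
# word characters or any single character that is neither word nor whitespace.
pattern = r"\w+|[^\w\s]"
output = []
offsets = []
for match in re.finditer(pattern, text):
    output.append(match.group(0))
    # start() is inclusive, end() is exclusive
    offsets.append((match.start(), match.end()))
print(output)   # ['I', 'live', ',', 'in', 'India', '.']
print(offsets)  # [(0, 1), (2, 6), (6, 7), (8, 10), (11, 16), (16, 17)]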
I am trying to split/match a string on punctuation and whitespace, and I also need the offset values of each token.
Example: "I live, in India."
I want output like ["I", "live", ",", "in", "India", "."]
along with the start and end index of each token.
I have tried:
String text = "I live, in India.";
Pattern p1 = Pattern.compile("\\S+");
Pattern p2 = Pattern.compile("\\p{Punct}");
Matcher m1 = p1.matcher(text);
Matcher m2 = p2.matcher(text);
This gives the desired result, but can I combine both patterns into a single pattern?
Answer 1
Score: 4
<details>
<summary>English:</summary>
## [`Matcher#start`][1]

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class Main {
        public static void main(String[] args) {
            Pattern pattern = Pattern.compile("\\b\\S+\\b|\\p{Punct}");
            Matcher matcher = pattern.matcher("I live, in India.");
            while (matcher.find()) {
                System.out.println(matcher.group() + " => " + matcher.start());
            }
        }
    }

**Output:**

    I => 0
    live => 2
    , => 6
    in => 8
    India => 11
    . => 16

**Explanation of regex:**
1. `\b` matches a [word boundary][2].
2. `|` means `OR` (alternation).
3. `\p{Punct}` matches a [punctuation][3] character.
4. `\S+` matches [one or more][4] non-whitespace characters.
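
The question also asks for the end index of each token, which the code above does not print. A minimal sketch of one way to collect both offsets follows. The class `OffsetTokenizer`, its nested `Token` type, and the `tokenize` method are hypothetical names introduced here for illustration. It also swaps in a slightly different pattern, `\w+|\p{Punct}`, on the assumption that a token is either a run of word characters or a single punctuation character. With the pattern above, an input such as `a,b` comes back as one token, because `\S+` can run across the comma and still end on a word boundary.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OffsetTokenizer {

    /** A token together with its [start, end) character offsets in the input. */
    static final class Token {
        final String text;
        final int start;
        final int end;

        Token(String text, int start, int end) {
            this.text = text;
            this.start = start;
            this.end = end;
        }

        @Override
        public String toString() {
            return text + " => [" + start + ", " + end + ")";
        }
    }

    // Assumption: a token is a run of word characters or one ASCII
    // punctuation character. This is not the pattern used in the answer.
    private static final Pattern TOKEN = Pattern.compile("\\w+|\\p{Punct}");

    static List<Token> tokenize(String text) {
        List<Token> tokens = new ArrayList<>();
        Matcher matcher = TOKEN.matcher(text);
        while (matcher.find()) {
            // Matcher#start is inclusive, Matcher#end is exclusive.
            tokens.add(new Token(matcher.group(), matcher.start(), matcher.end()));
        }
        return tokens;
    }

    public static void main(String[] args) {
        for (Token token : tokenize("I live, in India.")) {
            System.out.println(token);
        }
    }
}
```

For the sample sentence this gives the same six tokens and start offsets as above, with end offsets 1, 6, 7, 10, 16 and 17. By default `\w` and `\p{Punct}` cover ASCII only, so non-ASCII letters would be skipped unless the pattern is compiled with `Pattern.UNICODE_CHARACTER_CLASS`.
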
[1]: https://docs.oracle.com/javase/7/docs/api/java/util/regex/Matcher.html#start()
[2]: https://docs.oracle.com/javase/tutorial/essential/regex/bounds.html
[3]: https://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html
[4]: https://docs.oracle.com/javase/tutorial/essential/regex/quant.html
</details>