2020-10-14 01:02:41

How to split a string on whitespace and on special characters while getting their offset values in Java

Question

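import re

text = "I live, in India."
# Python's re module does not support \p{Punct}; [^\w\s] stands in for it here.
pattern = r"\w+|[^\w\s]"
output = []
offsets = []
for match in re.finditer(pattern, text):
    output.append(match.group(0))                  # the token itself
    offsets.append((match.start(), match.end()))   # end index is exclusive
print(output)   # ['I', 'live', ',', 'in', 'India', '.']
print(offsets)  # [(0, 1), (2, 6), (6, 7), (8, 10), (11, 16), (16, 17)]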

I am trying to split/match a string on punctuation and whitespace, and I also need the offset values of each token.

Example: "I live, in India."

I want output like ["I", "live", ",", "in", "India", "."],
along with the start and end index of each token.

I have tried using:

String text = "I live, in India.";

Pattern p1 = Pattern.compile("\\S+");
Pattern p2 = Pattern.compile("\\p{Punct}");
Matcher m1 = p1.matcher(text);
Matcher m2 = p2.matcher(text);

This gives the desired result, but can I combine both patterns into a single pattern?

Answer 1

Score: 4

import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class Main {
    public static void main(String[] args) {
        Pattern pattern = Pattern.compile("\\b\\S+\\b|\\p{Punct}");
        Matcher matcher = pattern.matcher("I live, in India.");
        while (matcher.find()) {
            System.out.println(matcher.group() + " => " + matcher.start());
        }
    }
}

Output:

I => 0
live => 2
, => 6
in => 8
India => 11
. => 16
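
The question also asks for the end index of each token, which the output above does not show. A minimal sketch, assuming the same pattern as in the answer, that records both `matcher.start()` and `matcher.end()`; the class name `TokenOffsets` and the helper `tokenize` are hypothetical names used only for illustration:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TokenOffsets {
    // Same pattern as in the answer above.
    private static final Pattern TOKEN = Pattern.compile("\\b\\S+\\b|\\p{Punct}");

    // Hypothetical helper: one "token [start, end)" string per match.
    // Matcher#end returns the index just past the last matched character.
    static List<String> tokenize(String text) {
        List<String> result = new ArrayList<>();
        Matcher matcher = TOKEN.matcher(text);
        while (matcher.find()) {
            result.add(matcher.group() + " [" + matcher.start() + ", " + matcher.end() + ")");
        }
        return result;
    }

    public static void main(String[] args) {
        // Prints:
        // I [0, 1)
        // live [2, 6)
        // , [6, 7)
        // in [8, 10)
        // India [11, 16)
        // . [16, 17)
        tokenize("I live, in India.").forEach(System.out::println);
    }
}
```

Because the end index is exclusive, `text.substring(start, end)` gives back the token.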

Explanation of regex:

\b specifies [word boundary][2].
| specifies OR.
\p{Punct} specifies [punctuation][3].
\S+ specifies [one or more][4] non-whitespace characters.
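
One caveat: `\b\S+\b` only stops at punctuation that sits at the edge of a whitespace-delimited chunk, so punctuation between two word characters (as in `a,b`) stays inside the token. If such input should be split as well, one possible alternative, not part of the original answer, is `\w+|\p{Punct}`. The sketch below compares the two; it assumes ASCII text, since by default `\w` and `\p{Punct}` in Java only cover ASCII characters, and the class `SplitInsideTokens` and helper `tokens` are hypothetical names:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SplitInsideTokens {
    // Hypothetical helper: returns every match of the regex in the text, in order.
    static List<String> tokens(String regex, String text) {
        List<String> result = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(text);
        while (matcher.find()) {
            result.add(matcher.group());
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "a,b c";
        // Pattern from the answer: the comma stays inside the first token.
        System.out.println(String.join(" | ", tokens("\\b\\S+\\b|\\p{Punct}", text))); // a,b | c
        // Alternative: runs of word characters, or a single punctuation character.
        System.out.println(String.join(" | ", tokens("\\w+|\\p{Punct}", text)));       // a | , | b | c
    }
}
```

Note that `_` is matched by both `\w` and `\p{Punct}`; with this alternative it is treated as part of a word because `\w+` is tried first.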


API used for the offsets: [`Matcher#start`][1]

  [1]: https://docs.oracle.com/javase/7/docs/api/java/util/regex/Matcher.html#start()
  [2]: https://docs.oracle.com/javase/tutorial/essential/regex/bounds.html
  [3]: https://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html
  [4]: https://docs.oracle.com/javase/tutorial/essential/regex/quant.html
