Question

I have a regular expression that contains a named group, (?P<next_tok>...). How can I get the index of that group's match?

Here is the complete regexp:
\S*[\.\?!](?P<after_tok>(?:[?!)";}\]\*:@\'\({\[])|\s+(?P<next_tok>\S+))

Example:
http://play.golang.org/p/7CYfK50W2Q

I want to get both the matched text and the index of each named group in the match. Is this possible in Go?

EDIT:
I couldn't figure out how to get next_tok by name, but I was able to get all the submatches via FindAllStringSubmatchIndex:

http://play.golang.org/p/SEaCLVKisr

Answer 1

Score: 2

You can use .FindAllStringSubmatchIndex:

package main

import (
	"fmt"
	"regexp"
	"unicode/utf8"
)

func main() {
	text := "Here... are some initials E.R.B. and also an etc. in the middle.\nPeriods that form part of an abbreviation but are taken to be end-of-sentence markers\nor vice versa do not only introduce errors in the determination of sentence boundaries.\nSegmentation errors propagate into further components which rely on accurate\nsentence segmentation and subsequent analyses are most likely affected negatively.\nWalker et al. (2001), for example, stress the importance of correct sentence boundary\ndisambiguation for machine translation and Kiss and Strunk (2002b) show that errors\nin sentence boundary detection lead to a higher error rate in part-of-speech tagging.\nIn this paper, we present an approach to sentence boundary detection that builds\non language-independent methods and determines sentence boundaries with high accuracy.\nIt does not make use of additional annotations, part-of-speech tagging, or precompiled\nlists to support sentence boundary detection but extracts all necessary data\nfrom the corpus to be segmented. Also, it does not use orthographic information as primary\nevidence and is thus suited to process single-case text. It focuses on robustness\nand flexibility in that it can be applied with good results to a variety of languages without\nany further adjustments. At the same time, the modular structure of the proposed\nsystem makes it possible in principle to integrate language-specific methods and clues\nto further improve its accuracy. The basic algorithm has been determined experimentally\non the basis of an unannotated development corpus of English. We have applied\nthe resulting system to further corpora of English text as well as to corpora from ten\nother languages: Brazilian Portuguese, Dutch, Estonian, French, German, Italian, Norwegian,\nSpanish, Swedish, and Turkish. Without further additions or amendments to\nthe system produced through experimentation on the development corpus, the mean\naccuracy of sentence boundary detection on newspaper corpora in eleven languages is\n98.74 %."

	periodContextFmt := `\S*[\.\?!](?P<after_tok>(?:[?!)";}\]\*:@\'\({\[])|\s+(?P<next_tok>\S+))`
	sent := regexp.MustCompile(periodContextFmt)
	matches := sent.FindAllStringSubmatchIndex(text, -1)

	for _, match := range matches {
		// match holds byte offsets: [0],[1] whole match; [2],[3] after_tok; [4],[5] next_tok.
		// Strings are sliced with byte offsets, not rune counts.
		fmt.Println("context: ", text[match[0]:match[1]])
		if match[4] >= 0 { // -1 means next_tok did not take part in this match
			fmt.Println("next_tok: ", text[match[4]:match[5]])
			// Convert byte offsets to rune (character) indices for reporting:
			// start of after_tok and start of next_tok.
			fmt.Println("start: ", utf8.RuneCountInString(text[:match[2]]))
			fmt.Println("end: ", utf8.RuneCountInString(text[:match[4]]))
		}
		fmt.Println("------")
	}
}

See the Go demo.

Note that the indices returned by FindAllStringSubmatchIndex are byte offsets, which is what Go string slicing expects. The unicode/utf8 import and utf8.RuneCountInString are needed only to convert those offsets into Unicode character (rune) indices. See Identify the correct hashtag indexes in tweet messages.
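
The difference between byte offsets and character indices only shows up with non-ASCII text. Here is a minimal sketch of the conversion, using a made-up sample string and a hypothetical helper named runeIndex:

```go
package main

import (
	"fmt"
	"regexp"
	"unicode/utf8"
)

// runeIndex (hypothetical helper) converts a byte offset into s to a rune
// (character) index. byteOff must lie on a rune boundary, which holds for
// offsets returned by the regexp package on valid UTF-8 input.
func runeIndex(s string, byteOff int) int {
	return utf8.RuneCountInString(s[:byteOff])
}

func main() {
	text := "Caf\u00e9. Ol\u00e9!" // each "é" occupies two bytes in UTF-8
	re := regexp.MustCompile(`\s+(?P<next_tok>\S+)`)
	m := re.FindStringSubmatchIndex(text)

	// Slice with the byte offsets as returned.
	fmt.Println(text[m[2]:m[3]]) // Olé!

	// Report both the byte offset and the rune index of the group start.
	fmt.Println(m[2], runeIndex(text, m[2])) // 7 6
}
```

Slicing with the rune index instead (text[6:...]) would start one byte too early here, so keep the byte offsets for slicing and convert only when a character position has to be shown or stored.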
