Golang Regular Expression: Getting index position of variable
Question
I have a regular expression that contains a named group ("variable"), (?P<next_tok>). How can I get the index of that group's match?
Here is the complete regexp:
\S*[\.\?!](?P<after_tok>(?:[?!)";}\]\*:@\'\({\[])|\s+(?P<next_tok>\S+))
Example:
http://play.golang.org/p/7CYfK50W2Q
I want to get the matches and the index of any named group in the regexp match. Is this possible in Go?
EDIT:
I couldn't figure out how to get next_tok by name, but I was able to get all the submatches via FindAllStringSubmatchIndex:
http://play.golang.org/p/SEaCLVKisr
Answer 1
Score: 2
You can use FindAllStringSubmatchIndex:
package main

import (
	"fmt"
	"regexp"
	"unicode/utf8"
)

func main() {
	text := "Here... are some initials E.R.B. and also an etc. in the middle.\nPeriods that form part of an abbreviation but are taken to be end-of-sentence markers\nor vice versa do not only introduce errors in the determination of sentence boundaries.\nSegmentation errors propagate into further components which rely on accurate\nsentence segmentation and subsequent analyses are most likely affected negatively.\nWalker et al. (2001), for example, stress the importance of correct sentence boundary\ndisambiguation for machine translation and Kiss and Strunk (2002b) show that errors\nin sentence boundary detection lead to a higher error rate in part-of-speech tagging.\nIn this paper, we present an approach to sentence boundary detection that builds\non language-independent methods and determines sentence boundaries with high accuracy.\nIt does not make use of additional annotations, part-of-speech tagging, or precompiled\nlists to support sentence boundary detection but extracts all necessary data\nfrom the corpus to be segmented. Also, it does not use orthographic information as primary\nevidence and is thus suited to process single-case text. It focuses on robustness\nand flexibility in that it can be applied with good results to a variety of languages without\nany further adjustments. At the same time, the modular structure of the proposed\nsystem makes it possible in principle to integrate language-specific methods and clues\nto further improve its accuracy. The basic algorithm has been determined experimentally\non the basis of an unannotated development corpus of English. We have applied\nthe resulting system to further corpora of English text as well as to corpora from ten\nother languages: Brazilian Portuguese, Dutch, Estonian, French, German, Italian, Norwegian,\nSpanish, Swedish, and Turkish. Without further additions or amendments to\nthe system produced through experimentation on the development corpus, the mean\naccuracy of sentence boundary detection on newspaper corpora in eleven languages is\n98.74 %."
	// Submatch 1 is after_tok, submatch 2 is next_tok.
	periodContextFmt := `\S*[\.\?!](?P<after_tok>(?:[?!)";}\]\*:@\'\({\[])|\s+(?P<next_tok>\S+))`
	sent := regexp.MustCompile(periodContextFmt)
	// Each match holds byte offsets: match[2*n] and match[2*n+1] bound submatch n.
	matches := sent.FindAllStringSubmatchIndex(text, -1)
	for _, match := range matches {
		// Slice the string with the byte offsets; convert to rune counts only for reporting.
		fmt.Println("context: ", text[match[0]:match[1]])
		// start: rune index where after_tok begins.
		fmt.Println("start: ", utf8.RuneCountInString(text[:match[2]]))
		// next_tok offsets are -1 when the first alternative matched, so guard before slicing.
		if match[4] >= 0 {
			fmt.Println("next_tok: ", text[match[4]:match[5]])
			// end: rune index where next_tok begins.
			fmt.Println("end: ", utf8.RuneCountInString(text[:match[4]]))
		}
		fmt.Println("------")
	}
}
See the Go demo.
Note that the unicode/utf8 import and utf8.RuneCountInString are needed to get Unicode character (rune) indices in Unicode strings; otherwise you will get byte offsets. See Identify the correct hashtag indexes in tweet messages.