How can I clean the text for search using RegEx
Question
I can use the code below to check whether the text `str` contains either or both of the `keys`, i.e. whether it contains "MS", "dynamics", or both:
    package main

    import (
        "fmt"
        "regexp"
    )

    func main() {
        keys := []string{"MS", "dynamics"}
        keysReg := fmt.Sprintf("(%s %s)|%s|%s", keys[0], keys[1], keys[0], keys[1]) // => "(MS dynamics)|MS|dynamics"
        fmt.Println(keysReg)
        str := "What is MS dynamics, is it a product from MS?"
        re := regexp.MustCompile(`(?i)` + keysReg)
        matches := re.FindAllString(str, -1)
        fmt.Println("We found", len(matches), "matches, that are:", matches)
    }
I want the user to enter a phrase, so I need to trim unwanted words and characters from it and then run the search as above.
Let's say the user input is `This,is,a,delimited,string`. I need to build the `keys` pattern dynamically as `(delimited string)|delimited|string`, so that I can search my variable `str` for all the matches. So I wrote the following:
    s := "This,is,a,delimited,string"
    t := regexp.MustCompile(`(?i),|\.|this|is|a`) // backticks are used here to contain the expression, (?i) for case-insensitive
    v := t.Split(s, -1)
    fmt.Println(len(v))
    fmt.Println(v)
But the output I got was:

    8
    [      delimited string]

What is wrong with the way I clean the input text? I expected the output to be:

    2
    [delimited string]
Here is my playground
Answer 1
Score: 1
To quote the famous quip from Jamie Zawinski,
> Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.
Two things:
- Instead of trying to weed out garbage from the string ("cleaning" it), extract complete words from it.
- Unicode is a complicated matter, so even after you have succeeded in extracting words, you have to make sure they are properly "escaped" before building a regexp out of them, so that they contain no characters that might be interpreted as RE syntax.
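
To illustrate the first point, here is a minimal sketch that runs both approaches on the input from the question. The helper names `splitClean` and `extractWords` are made up for this illustration. `Split` returns the text *between* matches, so when several matches sit next to each other (`This`, `,`, `is`, `,`, `a`, `,`) each gap is reported as an empty string, which is where the 8 elements come from.

```go
package main

import (
	"fmt"
	"regexp"
)

// splitClean reproduces the question's approach: split on separators
// and stop words. (Hypothetical helper name.)
func splitClean(s string) []string {
	return regexp.MustCompile(`(?i),|\.|this|is|a`).Split(s, -1)
}

// extractWords takes the opposite approach: pull out runs of letters.
// (Hypothetical helper name.)
func extractWords(s string) []string {
	return regexp.MustCompile(`\pL+`).FindAllString(s, -1)
}

func main() {
	s := "This,is,a,delimited,string"

	// Six empty strings (the gaps between adjacent matches) plus two words.
	fmt.Printf("%q\n", splitClean(s))
	// → ["" "" "" "" "" "" "delimited" "string"]

	// Only words, no empty strings; stop words still need separate filtering.
	fmt.Printf("%q\n", extractWords(s))
	// → ["This" "is" "a" "delimited" "string"]
}
```

Note also that the `a` and `is` alternatives in the split pattern are not anchored to word boundaries, so they would cut through any word that happens to contain those letters.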
    package main

    import (
        "errors"
        "fmt"
        "regexp"
        "strings"
    )

    // build compiles a regexp that matches either the whole phrase (all words
    // joined by single spaces) or any one of the individual words.
    func build(words ...string) (*regexp.Regexp, error) {
        var sb strings.Builder

        switch len(words) {
        case 0:
            return nil, errors.New("empty input")
        case 1:
            return regexp.Compile(regexp.QuoteMeta(words[0]))
        }

        // Escape each word so it cannot be interpreted as regexp syntax.
        quoted := make([]string, len(words))
        for i, w := range words {
            quoted[i] = regexp.QuoteMeta(w)
        }

        // First alternative: the full phrase, "(w1 w2 ... wn)".
        sb.WriteByte('(')
        for i, w := range quoted {
            if i > 0 {
                sb.WriteByte('\x20')
            }
            sb.WriteString(w)
        }
        sb.WriteString(`)|`)

        // Remaining alternatives: each word on its own, "w1|w2|...|wn".
        for i, w := range quoted {
            if i > 0 {
                sb.WriteByte('|')
            }
            sb.WriteString(w)
        }

        return regexp.Compile(sb.String())
    }

    // words matches runs of one or more Unicode letters.
    var words = regexp.MustCompile(`\pL+`)

    func main() {
        allWords := words.FindAllString("\tThis\v\x20\x20,\t\tis\t\t,?a!,¿delimited?,string‽", -1)

        re, err := build(allWords...)
        if err != nil {
            panic(err)
        }

        fmt.Println(re)
    }
Further reading:
- <https://pkg.go.dev/regexp/syntax>
- <https://pkg.go.dev/regexp#QuoteMeta>
- <https://pkg.go.dev/unicode#pkg-variables> and <https://pkg.go.dev/unicode#Categories>