How can I clean the text for search using RegEx?

Question

I can use the code below to check whether the text str contains either or both of the keys, i.e. whether it contains "MS", "dynamics", or both:

package main

import (
	"fmt"
	"regexp"
)

func main() {
	keys := []string{"MS", "dynamics"}
	keysReg := fmt.Sprintf("(%s %s)|%s|%s", keys[0], keys[1], keys[0], keys[1]) // => "(MS dynamics)|MS|dynamics"
	fmt.Println(keysReg)
	str := "What is MS dynamics, is it a product from MS?"
	re := regexp.MustCompile(`(?i)` + keysReg)
	matches := re.FindAllString(str, -1)
	fmt.Println("We found", len(matches), "matches, that are:", matches)
}

I want the user to enter a phrase, so I trim unwanted words and characters and then run the search as above.
Say the user input is "This,is,a,delimited,string". I need to build the search pattern from the keys dynamically as "(delimited string)|delimited|string", so that I can search my variable str for all the matches. So I wrote the following:

	s := "This,is,a,delimited,string"
	t := regexp.MustCompile(`(?i),|\.|this|is|a`) // backticks are used here to contain the expression, (?i) for case-insensitive
	v := t.Split(s, -1)
	fmt.Println(len(v))
	fmt.Println(v)

But I got the output as:

8
[      delimited string]

What is wrong with the way I clean the input text? I expect the output to be:

2
[delimited string]

Here is my playground

Answer 1

Score: 1

To quote the famous quip from Jamie Zawinski,

> Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.

Two things:

  • Rather than trying to weed out garbage from the string ("cleaning" it), extract complete words from it.
  • Unicode is a complicated matter, so even after you have succeeded in extracting words, you have to make sure they are properly "escaped" before building a regexp from them, so that they contain no characters which might be interpreted as RE syntax.
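To illustrate the first point on the question's own input, here is a minimal sketch comparing the two approaches. The helper names splitPieces and extractWords are made up for this illustration. Split returns the text between matches, so every pair of adjacent matches (for example "This" followed directly by ",") contributes an empty string, whereas FindAllString returns only the words themselves.

```go
package main

import (
	"fmt"
	"regexp"
)

var (
	// splitter is the pattern from the question.
	splitter = regexp.MustCompile(`(?i),|\.|this|is|a`)
	// wordRe matches runs of Unicode letters.
	wordRe = regexp.MustCompile(`\pL+`)
)

// splitPieces returns the text between matches of splitter,
// including the empty pieces that sit between adjacent matches.
func splitPieces(s string) []string {
	return splitter.Split(s, -1)
}

// extractWords returns every run of letters found in s.
func extractWords(s string) []string {
	return wordRe.FindAllString(s, -1)
}

func main() {
	s := "This,is,a,delimited,string"

	parts := splitPieces(s)
	fmt.Printf("%d %q\n", len(parts), parts) // 8 ["" "" "" "" "" "" "delimited" "string"]

	found := extractWords(s)
	fmt.Printf("%d %q\n", len(found), found) // 5 ["This" "is" "a" "delimited" "string"]
}
```

Note also that the unanchored alternatives this, is and a in the question's pattern match inside longer words as well, so an input word such as "data" would be cut apart at each "a".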

package main

import (
	"errors"
	"fmt"
	"regexp"
	"strings"
)

// build compiles a regexp of the form "(w1 w2 ...)|w1|w2|..." from words,
// quoting each word so that it is matched literally.
func build(words ...string) (*regexp.Regexp, error) {
	var sb strings.Builder

	switch len(words) {
	case 0:
		return nil, errors.New("empty input")
	case 1:
		return regexp.Compile(regexp.QuoteMeta(words[0]))
	}

	// Escape any characters that have a special meaning in RE syntax.
	quoted := make([]string, len(words))
	for i, w := range words {
		quoted[i] = regexp.QuoteMeta(w)
	}

	// First alternative: the whole phrase, words separated by single spaces.
	sb.WriteByte('(')
	for i, w := range quoted {
		if i > 0 {
			sb.WriteByte('\x20')
		}
		sb.WriteString(w)
	}
	sb.WriteString(`)|`)
	// Remaining alternatives: each word on its own.
	for i, w := range quoted {
		if i > 0 {
			sb.WriteByte('|')
		}
		sb.WriteString(w)
	}

	return regexp.Compile(sb.String())
}

// words matches runs of Unicode letters.
var words = regexp.MustCompile(`\pL+`)

func main() {
	allWords := words.FindAllString("\tThis\v\x20\x20,\t\tis\t\t,?a!,¿delimited?,string‽", -1)

	re, err := build(allWords...)
	if err != nil {
		panic(err)
	}

	fmt.Println(re)
}

Further reading:

  • <https://pkg.go.dev/regexp/syntax>
  • <https://pkg.go.dev/regexp#QuoteMeta>
  • <https://pkg.go.dev/unicode#pkg-variables> and <https://pkg.go.dev/unicode#Categories>

huangapple
  • Posted on August 30, 2022, 16:46:35
  • When reposting, please keep the link to this article: https://go.coder-hub.com/73539557.html