2022-08-30 16:46:35 · go

How can I clean the text for search using RegEx

Question

I can use the code below to check whether the text str contains either or both of the keys, i.e. whether it contains "MS", "dynamics", or both.

package main

import (
	"fmt"
	"regexp"
)

func main() {
	keys := []string{"MS", "dynamics"}
	keysReg := fmt.Sprintf("(%s %s)|%s|%s", keys[0], keys[1], keys[0], keys[1]) // => "(MS dynamics)|MS|dynamics"
	fmt.Println(keysReg)
	str := "What is MS dynamics, is it a product from MS?"
	re := regexp.MustCompile(`(?i)` + keysReg)
	matches := re.FindAllString(str, -1)
	fmt.Println("We found", len(matches), "matches, that are:", matches)
}

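I want the user to enter a phrase, so I strip out unwanted words and characters and then search as shown above.
Say the user input is "This,is,a,delimited,string". I need to build the keys variable dynamically as "(delimited string)|delimited|string" so that I can search my variable str for all the matches, so I wrote the following: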

s := "This,is,a,delimited,string"
t := regexp.MustCompile(`(?i),|\.|this|is|a`) // backticks delimit the raw pattern; (?i) makes it case-insensitive
v := t.Split(s, -1)
fmt.Println(len(v))
fmt.Println(v)

But the output I got is:

8
[      delimited string]

What is wrong with the part that cleans the input text? The output I expect is:

2
[delimited string]

Here is my playground link.

Answer 1

Score: 1

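To quote the famous quip from Jamie Zawinski:

> Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.

Two things:

Instead of trying to weed garbage out of the string ("cleaning" it), extract complete words from it.
Unicode is a complicated matter, so even once you have extracted the words, you must make sure they are properly escaped before building a regexp from them, so that they contain no characters that could be interpreted as regexp syntax.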
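To illustrate the second point, here is a minimal sketch of what escaping buys you. The helper name containsLiteral and the sample word "a.b" are assumptions made up for this illustration; only regexp.QuoteMeta, regexp.MustCompile and MatchString come from the standard library.

```go
package main

import (
	"fmt"
	"regexp"
)

// containsLiteral is a hypothetical helper: it reports whether text contains
// word verbatim, by escaping word before compiling it as a pattern.
func containsLiteral(word, text string) bool {
	re := regexp.MustCompile(regexp.QuoteMeta(word))
	return re.MatchString(text)
}

func main() {
	// Unescaped, the dot matches any character, so "a.b" also matches "axb".
	raw := regexp.MustCompile("a.b")
	fmt.Println(raw.MatchString("axb")) // true

	// Escaped, the pattern becomes `a\.b` and matches only the literal text.
	fmt.Println(regexp.QuoteMeta("a.b"))       // a\.b
	fmt.Println(containsLiteral("a.b", "axb")) // false
	fmt.Println(containsLiteral("a.b", "a.b")) // true
}
```

Words extracted with a letters-only pattern will not contain such characters, but escaping keeps the builder safe if the words ever come from another source.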

package main

import (
	"errors"
	"fmt"
	"regexp"
	"strings"
)

func build(words ...string) (*regexp.Regexp, error) {
	var sb strings.Builder

	switch len(words) {
	case 0:
		return nil, errors.New("empty input")
	case 1:
		return regexp.Compile(regexp.QuoteMeta(words[0]))
	}

	quoted := make([]string, len(words))
	for i, w := range words {
		quoted[i] = regexp.QuoteMeta(w)
	}

	sb.WriteByte('(')
	for i, w := range quoted {
		if i > 0 {
			sb.WriteByte('\x20')
		}
		sb.WriteString(w)
	}
	sb.WriteString(`)|`)
	for i, w := range quoted {
		if i > 0 {
			sb.WriteByte('|')
		}
		sb.WriteString(w)
	}

	return regexp.Compile(sb.String())
}

var words = regexp.MustCompile(`\pL+`)

func main() {
	allWords := words.FindAllString("\tThis\v\x20\x20,\t\tis\t\t,?a!,¿delimited?,string‽", -1)

	re, err := build(allWords...)
	if err != nil {
		panic(err)
	}

	fmt.Println(re)
}
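The program above keeps every extracted word, so the pattern it prints still contains This, is and a. As for the question's Split call: the pattern matched seven times in the input (This, is, a and four commas), and Split returns the text between matches, which is why six of the eight pieces were empty. An unanchored alternative such as a would also match the letter inside longer words. Below is a rough sketch of one way to reach the expected [delimited string] result, by extracting words first and then dropping unwanted ones through whole-word comparison. The stopWords set and the helpers keywords and pattern are assumptions for illustration, not part of the original answer.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// wordRe matches runs of Unicode letters, as in the answer's code.
var wordRe = regexp.MustCompile(`\pL+`)

// stopWords is an assumed list for illustration; adjust it as needed.
var stopWords = map[string]bool{"this": true, "is": true, "a": true}

// keywords (hypothetical helper) extracts the words from input and drops
// stop words, comparing whole words case-insensitively.
func keywords(input string) []string {
	var out []string
	for _, w := range wordRe.FindAllString(input, -1) {
		if !stopWords[strings.ToLower(w)] {
			out = append(out, w)
		}
	}
	return out
}

// pattern (hypothetical helper) joins the escaped keys into
// "(?i)(k1 k2)|k1|k2". It assumes keys is non-empty.
func pattern(keys []string) string {
	quoted := make([]string, len(keys))
	for i, k := range keys {
		quoted[i] = regexp.QuoteMeta(k)
	}
	return "(?i)(" + strings.Join(quoted, " ") + ")|" + strings.Join(quoted, "|")
}

func main() {
	v := keywords("This,is,a,delimited,string")
	fmt.Println(len(v)) // 2
	fmt.Println(v)      // [delimited string]

	p := pattern(v)
	fmt.Println(p) // (?i)(delimited string)|delimited|string

	re := regexp.MustCompile(p)
	fmt.Println(re.FindAllString("A Delimited string is a string with separators", -1))
}
```

Filtering after extraction means a stop word is removed only when it is the whole word, so the a in "database" would be left alone.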

Further reading:

<https://pkg.go.dev/regexp/syntax>
<https://pkg.go.dev/regexp#QuoteMeta>
<https://pkg.go.dev/unicode#pkg-variables> and <https://pkg.go.dev/unicode#Categories>
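
On the Unicode point, a small sketch of why the answer's word pattern uses `\pL+` rather than `\w+`: in Go's regexp syntax `\w` covers only ASCII letters, digits and underscore, while `\pL` covers any Unicode letter. The sample string and variable names here are assumptions chosen for the demonstration.

```go
package main

import (
	"fmt"
	"regexp"
)

var (
	asciiWord   = regexp.MustCompile(`\w+`)  // \w is ASCII-only in Go: [0-9A-Za-z_]
	unicodeWord = regexp.MustCompile(`\pL+`) // \pL is any Unicode letter
)

func main() {
	// "naïve café", written with escapes so the accented letters are precomposed.
	s := "na\u00efve caf\u00e9"

	fmt.Println(asciiWord.FindAllString(s, -1))   // [na ve caf]
	fmt.Println(unicodeWord.FindAllString(s, -1)) // [naïve café]
}
```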
