2022-11-02 11:06:35 · go

Go - regex inside loop

Question

I have a file with a list of 600 regex patterns that must be run in order to find a specific ID for a website.

Example:

regex/www\.effectiveperformanceformat\.com/5
regex/bam-cell\.nr-data\.net/5
regex/advgoogle\.com/5
regex/googleapi\.club/5
regex/doubleclickbygoogle\.com/5
regex/googlerank\.info/5
regex/google-pr7\.de/5
regex/usemarketings\.com/5
regex/google-rank\.org/5
regex/googleanalytcs\.com/5
regex/xml\.trafficmoose\.com/5
regex/265\.com/5
regex/app-measurement\.com/5
regex/loftsbaacad\.com/5
regex/toldmeflex\.com/5
regex/r\.baresi\.xyz/5
regex/molodgytot\.biz/5
regex/ec\.walkme\.com/5
regex/px\.ads\.linkedin\.com/5
regex/hinisanex\.biz/5
regex/buysellads\.com/5
regex/buysellads\.net/5
regex/servedby-buysellads\.com/5
regex/carbonads\.(net|com)/5
regex/oulddev\.biz/5
regex/click\.hoolig\.app/5
regex/engine\.blacraft\.com/5
regex/mc\.yandex\.ru/5
regex/ads\.gaming1\.com/5
regex/adform\.net/5
regex/luzulabeguile\.com/5
regex/ficanportio\.biz/5
regex/hidelen\.com/5
regex/earchmess\.fun/5
regex/acrvclk\.com/5
regex/track\.wg-aff\.com/5
regex/thumb\.tapecontent\.net/5
regex/betgorebysson\.club/5
regex/in-page-push\.com/5
regex/itphanpytor\.club/5
regex/mktoresp\.com/5
regex/xid\.i-mobile\.co\.jp/5
regex/ads\.tremorhub\.com/5

So far, I'm using something like this:

for _, line := range file {
	data := strings.Split(line, "/")
	if data[0] == "regex" {
		match, _ := regexp.MatchString(data[1], website)
		if match {
			id, _ = strconv.Atoi(data[2])
		}
	}
}

This works, but I wonder if there is a more optimized way to do it.
If the website matches a regex near the top, great; if not, I have to keep iterating through the loop until I find the matching pattern.

Can anyone help me improve this?

Best regards

Answer 1

Score: 1

To reduce the running time, you can compile the regexps once and cache them.

package main

import (
	"bufio"
	"bytes"
	"fmt"
	"log"
	"os"
	"regexp"
	"strconv"
	"strings"
	"time"

	csvutils "github.com/alessiosavi/GoGPUtils/csv"
)

func main() {
	now := time.Now()
	Precomputed("www.google.it")
	fmt.Println(time.Since(now))
	now = time.Now()
	NonPrecomputed("www.google.it")
	fmt.Println(time.Since(now))
}

// NonPrecomputed mirrors the snippet from the question: every call
// recompiles each pattern through regexp.MatchString.
func NonPrecomputed(website string) int {
	for _, line := range cachedLines {
		data := strings.Split(line, "/")
		if data[0] == "regex" {
			match, _ := regexp.MatchString(data[1], website)
			if match {
				id, _ := strconv.Atoi(data[2])
				return id
			}
		}
	}

	return -1
}

// Precomputed matches against the regexps compiled once in init.
// Note: Go does not specify map iteration order, so the patterns
// are not necessarily tried in file order.
func Precomputed(site string) int {
	for regex, id := range rawRegex {
		if ok := regex.MatchString(site); ok {
			return id
		}
	}
	return -1
}

var rawRegex = make(map[*regexp.Regexp]int) // compiled pattern -> ID
var cachedLines []string                    // raw lines of regex.txt
var sites []string                          // test domains for the benchmark

func init() {
	now := time.Now()
	file, err := os.ReadFile("regex.txt")
	if err != nil {
		panic(err)
	}

	scanner := bufio.NewScanner(bytes.NewReader(file))

	for scanner.Scan() {
		txt := scanner.Text()
		cachedLines = append(cachedLines, txt)
		split := strings.Split(txt, "/")
		if len(split) == 3 {
			// Compile each pattern once and cache it together with its ID.
			compile, err := regexp.Compile(split[1])
			if err != nil {
				panic(err)
			}
			if rawRegex[compile], err = strconv.Atoi(split[2]); err != nil {
				panic(err)
			}
		}
	}
	// Test data, loaded only for the benchmark.
	file, err = os.ReadFile("top500Domains.csv")
	if err != nil {
		panic(err)
	}
	_, csvData, err := csvutils.ReadCSV(file, ',')
	if err != nil {
		panic(err)
	}
	for _, line := range csvData {
		sites = append(sites, line[1])
	}
	log.Println("Init took:", time.Since(now))
}

The init function takes care of the regexp cache. It compiles every pattern once and stores it in a map from the compiled regexp to its ID (it also loads the test data, which is only needed for the benchmark).

Then you have two functions:

Precomputed: uses the map of cached regexps
NonPrecomputed: a copy-paste of your snippet
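One caveat if the patterns really must be tried in file order, as the question says: Go does not specify the iteration order of a map, so a map keyed by regexp will not necessarily check them in that order. A minimal sketch of an order-preserving variant is below. It assumes the `regex/<pattern>/<id>` line format from the question; the `rule` type and the `parseRules` and `firstMatch` helpers are hypothetical names, not part of the answer, and the ID 7 in `main` is made up for illustration.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
	"strings"
)

// rule pairs a compiled pattern with its ID (hypothetical type).
type rule struct {
	re *regexp.Regexp
	id int
}

// parseRules compiles every "regex/<pattern>/<id>" line once and keeps
// the rules in the order the lines were given. Lines that do not start
// with "regex/" are skipped.
func parseRules(lines []string) ([]rule, error) {
	rules := make([]rule, 0, len(lines))
	for _, line := range lines {
		if !strings.HasPrefix(line, "regex/") {
			continue
		}
		rest := strings.TrimPrefix(line, "regex/")
		// The ID follows the last slash, so a pattern may itself contain "/".
		i := strings.LastIndex(rest, "/")
		if i < 0 {
			return nil, fmt.Errorf("malformed rule %q", line)
		}
		re, err := regexp.Compile(rest[:i])
		if err != nil {
			return nil, fmt.Errorf("rule %q: %w", line, err)
		}
		id, err := strconv.Atoi(rest[i+1:])
		if err != nil {
			return nil, fmt.Errorf("rule %q: %w", line, err)
		}
		rules = append(rules, rule{re: re, id: id})
	}
	return rules, nil
}

// firstMatch returns the ID of the first rule, in file order, whose
// pattern matches site, or -1 when no rule matches.
func firstMatch(rules []rule, site string) int {
	for _, r := range rules {
		if r.re.MatchString(site) {
			return r.id
		}
	}
	return -1
}

func main() {
	rules, err := parseRules([]string{
		`regex/carbonads\.(net|com)/5`,
		`regex/ads\.gaming1\.com/7`, // ID 7 is invented for this example
	})
	if err != nil {
		panic(err)
	}
	fmt.Println(firstMatch(rules, "cdn.carbonads.com"))
	fmt.Println(firstMatch(rules, "example.org"))
}
```

The slice costs the same one-time compilation as the map, but the loop always walks the rules top to bottom, so when several patterns match the same site the earliest line in the file wins.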

In the benchmark below, NonPrecomputed completes 63 iterations while Precomputed completes 10000, which works out to roughly 268 times less time per operation (1113887 ns/op versus 298434740 ns/op).
NonPrecomputed also allocates about 66 MB per operation (65782238 B/op), whereas Precomputed allocates nothing, because all the compilation happens once in the initial cache.

C:\opt\SP\Workspace\Go\Temp>go test -bench=. -benchmem -benchtime=10s
2022/11/03 00:45:35 Init took: 10.8397ms
goos: windows
goarch: amd64
pkg: Temp
cpu: 11th Gen Intel(R) Core(TM) i7-1185G7 @ 3.00GHz
Benchmark_Precomputed-8            10000           1113887 ns/op               0 B/op          0 allocs/op
Benchmark_NonPrecomputed-8            63         298434740 ns/op        65782238 B/op     484595 allocs/op
PASS
ok      Temp    41.548s
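The answer does not include the benchmark functions themselves. As a rough illustration of the same comparison, here is a self-contained sketch that uses `testing.Benchmark` from an ordinary program instead of `go test`. The three-pattern list and the `matchCompiled` and `matchEachTime` helpers are assumptions made for this example, so its numbers will not match the output above.

```go
package main

import (
	"fmt"
	"regexp"
	"testing"
)

// patterns is a small stand-in for the 600-line file (hypothetical sample).
var patterns = []string{`advgoogle\.com`, `mc\.yandex\.ru`, `adform\.net`}

// compiled holds the same patterns, compiled once at start-up.
var compiled = compileAll(patterns)

func compileAll(ps []string) []*regexp.Regexp {
	out := make([]*regexp.Regexp, len(ps))
	for i, p := range ps {
		out[i] = regexp.MustCompile(p)
	}
	return out
}

// matchCompiled reuses the precompiled patterns.
func matchCompiled(site string) bool {
	for _, re := range compiled {
		if re.MatchString(site) {
			return true
		}
	}
	return false
}

// matchEachTime calls regexp.MatchString, which compiles the pattern
// again on every call.
func matchEachTime(site string) bool {
	for _, p := range patterns {
		if ok, _ := regexp.MatchString(p, site); ok {
			return true
		}
	}
	return false
}

func main() {
	pre := testing.Benchmark(func(b *testing.B) {
		b.ReportAllocs()
		for i := 0; i < b.N; i++ {
			matchCompiled("www.google.it")
		}
	})
	non := testing.Benchmark(func(b *testing.B) {
		b.ReportAllocs()
		for i := 0; i < b.N; i++ {
			matchEachTime("www.google.it")
		}
	})
	fmt.Println("precomputed:    ", pre, pre.MemString())
	fmt.Println("non-precomputed:", non, non.MemString())
}
```

Both helpers return the same result for any input; only the amount of work per call differs, which is what the two printed lines let you compare on your own machine.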
