Go - regex inside loop

Question


I have a file with a list of 600 regex patterns that must be applied in order to find a specific ID for a website.

Example:

regex/www\.effectiveperformanceformat\.com/5
regex/bam-cell\.nr-data\.net/5
regex/advgoogle\.com/5
regex/googleapi\.club/5
regex/doubleclickbygoogle\.com/5
regex/googlerank\.info/5
regex/google-pr7\.de/5
regex/usemarketings\.com/5
regex/google-rank\.org/5
regex/googleanalytcs\.com/5
regex/xml\.trafficmoose\.com/5
regex/265\.com/5
regex/app-measurement\.com/5
regex/loftsbaacad\.com/5
regex/toldmeflex\.com/5
regex/r\.baresi\.xyz/5
regex/molodgytot\.biz/5
regex/ec\.walkme\.com/5
regex/px\.ads\.linkedin\.com/5
regex/hinisanex\.biz/5
regex/buysellads\.com/5
regex/buysellads\.net/5
regex/servedby-buysellads\.com/5
regex/carbonads\.(net|com)/5
regex/oulddev\.biz/5
regex/click\.hoolig\.app/5
regex/engine\.blacraft\.com/5
regex/mc\.yandex\.ru/5
regex/ads\.gaming1\.com/5
regex/adform\.net/5
regex/luzulabeguile\.com/5
regex/ficanportio\.biz/5
regex/hidelen\.com/5
regex/earchmess\.fun/5
regex/acrvclk\.com/5
regex/track\.wg-aff\.com/5
regex/thumb\.tapecontent\.net/5
regex/betgorebysson\.club/5
regex/in-page-push\.com/5
regex/itphanpytor\.club/5
regex/mktoresp\.com/5
regex/xid\.i-mobile\.co\.jp/5
regex/ads\.tremorhub\.com/5

So far I'm using something like this:

for _, line := range file {
	l := line
	data := strings.Split(l, "/")
	if data[0] == "regex" {
		match, _ := regexp.MatchString(``+data[1]+``, website)
		if match {
			id, _ = strconv.Atoi(data[2])
		}
	}
}

This works, but I wonder whether there is a more optimized way to do it.
If the website matches a regex near the top of the list, great; if not, I have to keep iterating through the loop until I find a match.

Can anyone help me improve this?

Best regards

Answer 1

Score: 1


To reduce the time, you can compile the regexps once and cache them.

package main

import (
	"bufio"
	"bytes"
	"fmt"
	csvutils "github.com/alessiosavi/GoGPUtils/csv"
	"log"
	"os"
	"regexp"
	"strconv"
	"strings"
	"time"
)

func main() {
	now := time.Now()
	Precomputed("www.google.it")
	fmt.Println(time.Since(now))
	now = time.Now()
	NonPrecomputed("www.google.it")
	fmt.Println(time.Since(now))
}
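
// NonPrecomputed is the snippet from the question: regexp.MatchString
// recompiles the pattern on every call.
func NonPrecomputed(website string) int {
	for _, line := range cachedLines {
		data := strings.Split(line, "/")
		if data[0] == "regex" {
			match, _ := regexp.MatchString(data[1], website)
			if match {
				id, _ := strconv.Atoi(data[2])
				return id
			}
		}
	}
	return -1
}

// Precomputed matches against the regexps compiled once in init.
// Note: Go does not specify map iteration order, so the patterns are
// not necessarily tried in file order.
func Precomputed(site string) int {
	for regex, id := range rawRegex {
		if regex.MatchString(site) {
			return id
		}
	}
	return -1
}

// rawRegex maps each compiled pattern to its ID.
var rawRegex = make(map[*regexp.Regexp]int)

// cachedLines holds the raw lines of regex.txt for NonPrecomputed.
var cachedLines []string

// sites holds the benchmark domains loaded from top500Domains.csv.
var sites []string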

func init() {
	now := time.Now()
	file, err := os.ReadFile("regex.txt")
	if err != nil {
		panic(err)
	}

	scanner := bufio.NewScanner(bytes.NewReader(file))

	for scanner.Scan() {
		txt := scanner.Text()
		cachedLines = append(cachedLines, txt)
		split := strings.Split(txt, "/")
		if len(split) == 3 {
			compile, err := regexp.Compile(split[1])
			if err != nil {
				panic(err)
			}
			if rawRegex[compile], err = strconv.Atoi(split[2]); err != nil {
				panic(err)
			}
		}
	}
	file, err = os.ReadFile("top500Domains.csv")
	if err != nil {
		panic(err)
	}
	_, csvData, err := csvutils.ReadCSV(file, ',')
	if err != nil {
		panic(err)
	}
	for _, line := range csvData {
		sites = append(sites, line[1])
	}
	log.Println("Init took:", time.Since(now))
}

The init function takes care of the regexp cache. It compiles every regexp once and stores it in a map, keyed by the compiled pattern with its ID as the value. (It also loads the test data, which is used only by the benchmark.)

Then there are two functions:

  • Precomputed: uses the map of cached regexps
  • NonPrecomputed: a copy-paste of your snippet
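One caveat about iterating over a map: Go does not specify map iteration order, so if more than one pattern can match a site and the first one in the file should win, a slice may be a safer container. Below is a minimal sketch of that variant, not part of the original answer. The names `rule`, `parseRules` and `lookup` are hypothetical, and the `example.com` lines and their IDs are made up purely to show the ordering.

```go
package main

import (
	"bufio"
	"fmt"
	"regexp"
	"strconv"
	"strings"
)

// rule pairs one compiled pattern with its ID (hypothetical type).
type rule struct {
	re *regexp.Regexp
	id int
}

// parseRules compiles every "regex/<pattern>/<id>" line once, keeping file order.
func parseRules(text string) ([]rule, error) {
	var rules []rule
	sc := bufio.NewScanner(strings.NewReader(text))
	for sc.Scan() {
		parts := strings.Split(sc.Text(), "/")
		if len(parts) != 3 || parts[0] != "regex" {
			continue // skip lines that are not in the expected format
		}
		re, err := regexp.Compile(parts[1])
		if err != nil {
			return nil, err
		}
		id, err := strconv.Atoi(parts[2])
		if err != nil {
			return nil, err
		}
		rules = append(rules, rule{re: re, id: id})
	}
	return rules, sc.Err()
}

// lookup returns the ID of the first rule (in file order) that matches, or -1.
func lookup(rules []rule, site string) int {
	for _, r := range rules {
		if r.re.MatchString(site) {
			return r.id
		}
	}
	return -1
}

func main() {
	// Made-up sample: both of the last two patterns match "ads.example.com".
	const sample = `regex/carbonads\.(net|com)/5
regex/ads\.example\.com/7
regex/example\.com/9
`
	rules, err := parseRules(sample)
	if err != nil {
		panic(err)
	}
	fmt.Println(lookup(rules, "ads.example.com")) // the earlier rule wins
	fmt.Println(lookup(rules, "www.google.it"))   // no rule matches
}
```

The lookup is still a linear scan, so a site that matches nothing is tested against every pattern; the saving comes from compiling each pattern only once.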

As the benchmark below shows, with -benchtime=10s NonPrecomputed completes 63 iterations (about 298 ms per operation), whereas Precomputed completes 10000 (about 1.1 ms per operation).
NonPrecomputed also allocates about 66 MB per operation (65782238 B/op), while Precomputed allocates nothing, thanks to the initial cache.

C:\opt\SP\Workspace\Go\Temp>go test -bench=. -benchmem -benchtime=10s
2022/11/03 00:45:35 Init took: 10.8397ms
goos: windows
goarch: amd64
pkg: Temp
cpu: 11th Gen Intel(R) Core(TM) i7-1185G7 @ 3.00GHz
Benchmark_Precomputed-8            10000           1113887 ns/op               0 B/op          0 allocs/op
Benchmark_NonPrecomputed-8            63         298434740 ns/op        65782238 B/op     484595 allocs/op
PASS
ok      Temp    41.548s
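
The benchmark functions themselves are not shown in the answer. As a rough illustration of how such a comparison could be set up, here is a self-contained sketch that uses `testing.Benchmark` from a regular program. The three patterns are taken from the question's list; `matchUncached`, `matchCached` and `compileAll` are hypothetical names, and the loop bodies are an assumption about what the original benchmarks did, not a reconstruction of them.

```go
package main

import (
	"fmt"
	"regexp"
	"testing"
)

// A few patterns from the question's list; the real file has about 600.
var patterns = []string{`advgoogle\.com`, `adform\.net`, `mc\.yandex\.ru`}

// compiled holds each pattern compiled exactly once at program start.
var compiled = compileAll(patterns)

// compileAll compiles every pattern, panicking on an invalid one.
func compileAll(ps []string) []*regexp.Regexp {
	out := make([]*regexp.Regexp, 0, len(ps))
	for _, p := range ps {
		out = append(out, regexp.MustCompile(p))
	}
	return out
}

// matchUncached recompiles every pattern on each call via regexp.MatchString.
func matchUncached(site string) bool {
	for _, p := range patterns {
		if ok, _ := regexp.MatchString(p, site); ok {
			return true
		}
	}
	return false
}

// matchCached reuses the precompiled patterns.
func matchCached(site string) bool {
	for _, re := range compiled {
		if re.MatchString(site) {
			return true
		}
	}
	return false
}

func main() {
	// testing.Benchmark picks b.N itself and reports the time per iteration.
	uncached := testing.Benchmark(func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			matchUncached("www.google.it")
		}
	})
	cached := testing.Benchmark(func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			matchCached("www.google.it")
		}
	})
	fmt.Println("uncached:", uncached.NsPerOp(), "ns/op")
	fmt.Println("cached:  ", cached.NsPerOp(), "ns/op")
}
```

In a `_test.go` file the same loops would go into functions such as `func Benchmark_Precomputed(b *testing.B)` and be run with `go test -bench=. -benchmem`, as in the output above. Absolute numbers will differ by machine and by the number of patterns.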

Posted by huangapple on 2022-11-02 11:06:35
Original link: https://go.coder-hub.com/74283926.html