Go - regex inside loop
Question
I have a file with a list of 600 regex patterns that must be tried in order to find the specific ID for a website.
Example:
regex/www\.effectiveperformanceformat\.com/5
regex/bam-cell\.nr-data\.net/5
regex/advgoogle\.com/5
regex/googleapi\.club/5
regex/doubleclickbygoogle\.com/5
regex/googlerank\.info/5
regex/google-pr7\.de/5
regex/usemarketings\.com/5
regex/google-rank\.org/5
regex/googleanalytcs\.com/5
regex/xml\.trafficmoose\.com/5
regex/265\.com/5
regex/app-measurement\.com/5
regex/loftsbaacad\.com/5
regex/toldmeflex\.com/5
regex/r\.baresi\.xyz/5
regex/molodgytot\.biz/5
regex/ec\.walkme\.com/5
regex/px\.ads\.linkedin\.com/5
regex/hinisanex\.biz/5
regex/buysellads\.com/5
regex/buysellads\.net/5
regex/servedby-buysellads\.com/5
regex/carbonads\.(net|com)/5
regex/oulddev\.biz/5
regex/click\.hoolig\.app/5
regex/engine\.blacraft\.com/5
regex/mc\.yandex\.ru/5
regex/ads\.gaming1\.com/5
regex/adform\.net/5
regex/luzulabeguile\.com/5
regex/ficanportio\.biz/5
regex/hidelen\.com/5
regex/earchmess\.fun/5
regex/acrvclk\.com/5
regex/track\.wg-aff\.com/5
regex/thumb\.tapecontent\.net/5
regex/betgorebysson\.club/5
regex/in-page-push\.com/5
regex/itphanpytor\.club/5
regex/mktoresp\.com/5
regex/xid\.i-mobile\.co\.jp/5
regex/ads\.tremorhub\.com/5
So far I'm using something like this:
for _, line := range file {
	l := line
	data := strings.Split(l, "/")
	if data[0] == "regex" {
		match, _ := regexp.MatchString(``+data[1]+``, website)
		if match {
			id, _ = strconv.Atoi(data[2])
		}
	}
}
This works, but I wonder if there is a more optimized way to do it.
If the website matches one of the regexes near the top, great; if not, I have to keep iterating through the loop until I find a match.
Can anyone help me improve this?
Best regards
Answer 1
Score: 1
To reduce the time, you can compile the regular expressions once and cache them.
package main

import (
	"bufio"
	"bytes"
	"fmt"
	"log"
	"os"
	"regexp"
	"strconv"
	"strings"
	"time"

	csvutils "github.com/alessiosavi/GoGPUtils/csv"
)

func main() {
	// Time a single lookup with each approach.
	now := time.Now()
	Precomputed("www.google.it")
	fmt.Println(time.Since(now))
	now = time.Now()
	NonPrecomputed("www.google.it")
	fmt.Println(time.Since(now))
}

// NonPrecomputed mirrors the snippet from the question:
// regexp.MatchString recompiles the pattern on every call.
func NonPrecomputed(website string) int {
	for _, line := range cachedLines {
		data := strings.Split(line, "/")
		if data[0] == "regex" {
			match, _ := regexp.MatchString(data[1], website)
			if match {
				id, _ := strconv.Atoi(data[2])
				return id
			}
		}
	}
	return -1
}

// Precomputed matches against the regexps compiled once in init.
// Note: Go does not specify map iteration order, so the patterns
// are not necessarily tried in file order here.
func Precomputed(site string) int {
	for regex, id := range rawRegex {
		if ok := regex.MatchString(site); ok {
			return id
		}
	}
	return -1
}

// rawRegex maps each compiled pattern to its ID.
var rawRegex = make(map[*regexp.Regexp]int)

// cachedLines holds the raw lines of regex.txt for NonPrecomputed.
var cachedLines []string

// sites holds the test domains used by the benchmark.
var sites []string

func init() {
	now := time.Now()
	file, err := os.ReadFile("regex.txt")
	if err != nil {
		panic(err)
	}
	scanner := bufio.NewScanner(bytes.NewReader(file))
	for scanner.Scan() {
		txt := scanner.Text()
		cachedLines = append(cachedLines, txt)
		split := strings.Split(txt, "/")
		if len(split) == 3 {
			// Compile each pattern exactly once.
			compile, err := regexp.Compile(split[1])
			if err != nil {
				panic(err)
			}
			if rawRegex[compile], err = strconv.Atoi(split[2]); err != nil {
				panic(err)
			}
		}
	}
	if err := scanner.Err(); err != nil {
		panic(err)
	}
	// Load the test data (only needed for the benchmark).
	file, err = os.ReadFile("top500Domains.csv")
	if err != nil {
		panic(err)
	}
	_, csvData, err := csvutils.ReadCSV(file, ',')
	if err != nil {
		panic(err)
	}
	for _, line := range csvData {
		sites = append(sites, line[1])
	}
	log.Println("Init took:", time.Since(now))
}
The init function takes care of the regexp cache. It compiles every pattern once and stores it in a map from the compiled regexp to its ID (it also loads the test data, which is only used by the benchmark).
Then there are two functions:
Precomputed: uses the map of cached regexps
NonPrecomputed: a copy-paste of your snippet
As the benchmark output shows, NonPrecomputed completes only 63 iterations (about 298 ms per call), while Precomputed completes 10000 (about 1.1 ms per call).
NonPrecomputed also allocates about 66 MB per call (65782238 B/op), whereas Precomputed allocates nothing, because the patterns were already compiled in init.
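One caveat if the patterns really have to be tried in file order, as the question states: Go does not specify the iteration order of a map, so ranging over a map of compiled regexps may try them in a different order on each run. A minimal sketch of an order-preserving variant, assuming the same regex/<pattern>/<id> line format, could keep the compiled patterns in a slice. The names rule, loadRules and lookup are made up for this illustration, and the ID 7 in the sample is invented so that the ordering is visible.

```go
package main

import (
	"bufio"
	"fmt"
	"regexp"
	"strconv"
	"strings"
)

// rule pairs a compiled pattern with its ID (hypothetical type for this sketch).
type rule struct {
	re *regexp.Regexp
	id int
}

// loadRules compiles every "regex/<pattern>/<id>" line once, keeping file order.
// Like the original code it splits on "/", so patterns must not contain a slash.
func loadRules(src string) ([]rule, error) {
	var rules []rule
	scanner := bufio.NewScanner(strings.NewReader(src))
	for scanner.Scan() {
		parts := strings.Split(scanner.Text(), "/")
		if len(parts) != 3 || parts[0] != "regex" {
			continue // skip lines that are not in the expected format
		}
		re, err := regexp.Compile(parts[1])
		if err != nil {
			return nil, fmt.Errorf("compile %q: %w", parts[1], err)
		}
		id, err := strconv.Atoi(parts[2])
		if err != nil {
			return nil, fmt.Errorf("parse id %q: %w", parts[2], err)
		}
		rules = append(rules, rule{re: re, id: id})
	}
	return rules, scanner.Err()
}

// lookup returns the ID of the first rule, in file order, that matches site,
// or -1 if none matches.
func lookup(rules []rule, site string) int {
	for _, r := range rules {
		if r.re.MatchString(site) {
			return r.id
		}
	}
	return -1
}

func main() {
	// Sample input; the second ID is invented so the ordering is observable.
	const sample = `regex/buysellads\.com/5
regex/servedby-buysellads\.com/7`
	rules, err := loadRules(sample)
	if err != nil {
		panic(err)
	}
	// Both patterns match this host; the earlier line wins.
	fmt.Println(lookup(rules, "servedby-buysellads.com")) // prints 5
	fmt.Println(lookup(rules, "www.google.it"))           // prints -1
}
```

The per-lookup cost should be comparable to the map version, since the expensive step (compilation) still happens only once; the slice simply makes the evaluation order deterministic.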
C:\opt\SP\Workspace\Go\Temp>go test -bench=. -benchmem -benchtime=10s
2022/11/03 00:45:35 Init took: 10.8397ms
goos: windows
goarch: amd64
pkg: Temp
cpu: 11th Gen Intel(R) Core(TM) i7-1185G7 @ 3.00GHz
Benchmark_Precomputed-8 10000 1113887 ns/op 0 B/op 0 allocs/op
Benchmark_NonPrecomputed-8 63 298434740 ns/op 65782238 B/op 484595 allocs/op
PASS
ok Temp 41.548s
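The benchmark functions themselves are not shown in the answer. As a rough sketch of how such a comparison could be reproduced without a separate test file, the standard library's testing.Benchmark can be called from an ordinary program. The three patterns and the helper names matchCompiled and matchUncompiled below are assumptions for illustration, not the answer's actual benchmark code, and the numbers will differ from the output above.

```go
package main

import (
	"fmt"
	"regexp"
	"testing"
)

// patterns is a small stand-in for the 600-line file (assumption for this sketch).
var patterns = []string{`advgoogle\.com`, `mc\.yandex\.ru`, `carbonads\.(net|com)`}

// compiled holds the same patterns, compiled once at program start.
var compiled = func() []*regexp.Regexp {
	out := make([]*regexp.Regexp, 0, len(patterns))
	for _, p := range patterns {
		out = append(out, regexp.MustCompile(p))
	}
	return out
}()

// matchCompiled reports whether any precompiled pattern matches site.
func matchCompiled(site string) bool {
	for _, re := range compiled {
		if re.MatchString(site) {
			return true
		}
	}
	return false
}

// matchUncompiled recompiles each pattern on every call, which is what
// calling regexp.MatchString inside the loop does.
func matchUncompiled(site string) bool {
	for _, p := range patterns {
		if ok, _ := regexp.MatchString(p, site); ok {
			return true
		}
	}
	return false
}

func main() {
	pre := testing.Benchmark(func(b *testing.B) {
		b.ReportAllocs()
		for i := 0; i < b.N; i++ {
			matchCompiled("www.google.it")
		}
	})
	raw := testing.Benchmark(func(b *testing.B) {
		b.ReportAllocs()
		for i := 0; i < b.N; i++ {
			matchUncompiled("www.google.it")
		}
	})
	// Each line shows iterations, ns/op, B/op and allocs/op.
	fmt.Println("precompiled:", pre, pre.MemString())
	fmt.Println("recompiled: ", raw, raw.MemString())
}
```

With only three patterns the gap will be much smaller than with 600, but the recompiling version should still show allocations on every call while the precompiled one shows few or none.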