在长期运行中,正则表达式和缓存哪个更快?

huangapple go评论102阅读模式
英文:

Go regular expressions vs caching - which is faster in the long run?

问题

我有一个服务,在其内部对某些内容进行“允许”或“不允许”(为了简单起见)的验证,这是基于正则表达式匹配的。伪代码如下:

func isAllowed(s string) {
  return regex.match(pattern, s)
}

我知道正则表达式很慢,尽管Golang有一个稍微简化了的正则表达式以满足其性能要求,但它仍然不能与精确的字符串比较相提并论。我也知道我的函数会经常被调用,并且会重复使用相同的值。所以,我考虑使用缓存:

var cache = make(map[string]bool)

func isAllowed(s string) {
  if result, found := cache[s]; found {
    return result
  }
  allowed := regex.match(pattern, s) // 忽略语法;这里我简化为伪代码
  cache[s] = allowed
  return allowed
}

现在,如果字符串已经在缓存中,我就可以避免进行正则表达式操作。但是...这个缓存中可能会有很多值,比如成千上万个。所以,为了在缓存中查找值,我可能需要进行1万次字符串比较,而不是单个正则表达式操作。

所以,我的问题是,字符串比较比Go的正则表达式匹配快多少?缓存会提高还是降低效率?

英文:

I've got a service that somewhere in its internals does a validation on whether something is "allowed" or "not allowed" (to keep it simple), which is based on a regular expression match. In pseudo-code:

func isAllowed(s string) {
  return regex.match(pattern, s)
}

Now, I know that regex is slow, and even though Golang has a slightly dumbed-down flavor of regex to meet its performance SLAs, it's still not going to be the same as an exact string comparison. And I also know that my function is going to be called quite often with repeated values. So, I thought of making a cache:

var cache = make(map[string]bool)

func isAllowed(s string) {
  if result, found := cache
展开收缩
; found { return result } allowed := regex.match(pattern, s) // ignore syntax here; I'm simplifying this as pseudo-code cache
展开收缩
= allowed return allowed }

So now I can avoid the regex operation if the string is already in my cache. But...there are potentially going to be a lot, like thousands or 10,000s of values in this cache. So just to look up values in the cache I might have to do 10,000 string comparisons, rather than a single regex operation.

So, I guess my question is, how much faster is a string comparison than a Go regex match? Is caching going help or hurt my efficiency?

答案1

得分: 1

这个技术被称为记忆化

在Go的regexp包中,[哈希]map查找的时间复杂度是O(1)(常数时间)。Go的regexp包中的正则表达式保证在O(N)(线性时间)内运行,其中N是输入的长度(详见https://pkg.go.dev/regexp#pkg-overview和https://swtch.com/~rsc/regexp/regexp1.html)。

这意味着你在时间和空间之间进行了权衡:TANSTAAFL(没有免费的午餐)。

至于map查找相对于正则表达式的速度提升有多大,唯一的方法是对一些使用实际输入的代表性内容进行一些基准测试。

你可能需要考虑以下几个问题:

  • 从性能角度来看,这个授权函数所花费的时间是否真的很重要?
  • 你有多少次会得到缓存命中和缓存未命中?
  • 如果这是一个长时间运行的服务/守护程序,缓存会无限增长并最终导致服务/守护程序崩溃吗?
  • 你是否希望使用更复杂的缓存,其中缓存条目将过期或被驱逐以保持增长在限制范围内?

最后,

  • 如果你需要从字符串中解析出一些位用于授权目的,也许更好的性能改进方法是重新思考你的方法,并将授权规则/标志作为某种数据类型(结构体或位图)与执行授权测试的相关函数一起维护。
英文:

This technique is called memoization.

A [hash]map lookup is O(1) [constant] time. The regular expressions in Go's regexp package are guaranteed to run in O(N) (linear) time, where N is the length of the input (see https://pkg.go.dev/regexp#pkg-overview, and https://swtch.com/~rsc/regexp/regexp1.html for details).

So that means you are trading time for space: TANSTAAFL

As to how much faster a map lookup might be over the regular expression might be, the only way to find out would be to run some benchmarks on something using something representative of your actual input.

Some questions you might want to consider:

  • Is the time spent in this authorization function actually significant from a performance perspective?
  • How often will you get a cache hit versus a cache miss?
  • If this is a long-running service/daemon, is the cache going to grow without limit and ultimately crash your service/daemon?
  • Might you want to use a more sophisticated cache where cache entries will expire or get evicted to keep growth within limits?

And finally,

  • If you're having to parse bits out of a string for authorization purposes, perhaps a better performance improvement might be to rethink your approach and maintain your authorization rules/flags as some sort of datatype (a structure or bit map) with associated functions for performing authorization tests.

答案2

得分: 0

据记录,我进行了基准测试,并发现缓存将我的性能提高了两个数量级。因此,我认为我会在内存上付出代价,并获得性能的提升。 在长期运行中,正则表达式和缓存哪个更快?

英文:

For the record, I ran benchmarks and found that caching improved my performance by two orders of magnitude. So, I think I will pay the price in memory, and pocket the performance gain. 在长期运行中,正则表达式和缓存哪个更快?

huangapple
  • 本文由 发表于 2023年2月9日 02:05:21
  • 转载请务必保留本文链接:https://go.coder-hub.com/75389978.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定