Varnish – use the cache when UTM_, gclid and other campaign params are used, otherwise pass if other querystring present

Question

In short, how can the following rule be changed to allow caching if specified querystring parameters are present, but disallow caching if they are mixed with any other undefined parameters?

if (req.url ~ "\?.*$" && req.url !~ "(\?|&)(gclid|cx|ie|cof|siteurl|zanpid|origin|utm_[a-z]+|mr:[A-z]+)=") {
    set req.http.X-Cacheable = "NO:Contains Querystring";
    return (pass);
}

Long Explanation:
OK, so I have a Varnish instance running behind an Apache reverse proxy that terminates SSL, with a WordPress backend on Apache.

After deploying the default config, I quickly learned that any URL with a querystring is excluded from the cache, which is all well and good. However, when an AdWords visitor arrives, the URL carries utm_ and other campaign-specific parameters, which bypass the cache under the default config. This is not desired: the pages are still static, so it's better to ignore these parameters and still use the cache. This is what I have implemented, and the rule works well on static pages hit with any combination of the defined utm/gclid/fbclid parameters.

sub vcl_recv {
  if (req.url ~ "\?.*$" && req.url !~ "(\?|&)(gclid|cx|ie|cof|siteurl|zanpid|origin|utm_[a-z]+|mr:[A-z]+)=") {
    set req.http.X-Cacheable = "NO:Contains Querystring";
    return (pass);
  }

  if (req.url ~ "(\?|&)(gclid|cx|ie|cof|siteurl|zanpid|origin|utm_[a-z]+|fb_local|mr:[A-z]+)=") {
    set req.url = regsuball(req.url, "(gclid|cx|ie|cof|siteurl|zanpid|origin|utm_[a-z]+|fb_local|mr:[A-z]+)=[%.+-_A-z0-9]+&?", "");
  }
  set req.url = regsub(req.url, "(\?&?)$", "");
}

However there's a problem if there is a mix of defined and undefined params:

/home <-- varnish serves cached page
/home?gclid=x <-- varnish serves same cached page as above, great
/home?a=1 <-- caching disabled here
/home?a=1&gclid=x  <-- varnish redirects to /home?a=1 and serves an uncached page. I want varnish to not redirect here (retain the gclid for client in the url) and serve an uncached page.

I then tried changing the rules to the following:

sub vcl_recv {
  if (req.url ~ "\?.*$" && req.url !~ "(\?|&)(gclid|cx|ie|cof|siteurl|zanpid|origin|utm_[a-z]+|mr:[A-z]+)=") {
    set req.http.X-Cacheable = "NO:Contains Querystring";
    return (pass);
  }

  set req.http.x-cache-url = regsuball(req.url, "(gclid|cx|ie|cof|siteurl|zanpid|origin|utm_[a-z]+|fb_local|mr:[A-z]+)=[%.+-_A-z0-9]+&?", "");
}

sub vcl_hash {
    hash_data(req.http.x-cache-url);
    return (lookup);
}

This forces Varnish to never include the specified params in the hash. It avoids the redirect, but the undesired behaviour is that Varnish will now cache any URL containing both defined and undefined parameters, which effectively opens a way to poison the cache in the long run. So:

/home <-- varnish serves a cached page
/home?gclid=x <-- varnish serves same cached page as above, great
/home?a=1 <-- caching disabled here
/home?a=1&gclid=x  <-- varnish serves the cached page /home?a=1 however retains the original url. I want to avoid caching here with any undefined parameters in the querystring.

Has anyone got any ideas how I could define such a rule?

Answer 1

Score: 2

This is the VCL snippet I typically use to strip off tracking query string parameters:

sub vcl_recv {
    if (req.url ~ "(\?|&)(utm_source|utm_medium|utm_campaign|utm_content|gclid|cx|ie|cof|siteurl)=") {
        set req.url = regsuball(req.url, "&(utm_source|utm_medium|utm_campaign|utm_content|gclid|cx|ie|cof|siteurl)=([A-z0-9_\-\.%25]+)", "");
        set req.url = regsuball(req.url, "\?(utm_source|utm_medium|utm_campaign|utm_content|gclid|cx|ie|cof|siteurl)=([A-z0-9_\-\.%25]+)", "?");
        set req.url = regsub(req.url, "\?&", "?");
        set req.url = regsub(req.url, "\?$", "");
    }
}

Here's the varnishlog -g request -i requrl output that proves how this works:

$ varnishlog -g request -i requrl
*   << Request  >> 32770
-   ReqURL         /?gclid=x
-   ReqURL         /?gclid=x
-   ReqURL         /?
-   ReqURL         /?
-   ReqURL         /

*   << Request  >> 5
-   ReqURL         /?a=1&gclid=x
-   ReqURL         /?a=1
-   ReqURL         /?a=1
-   ReqURL         /?a=1
-   ReqURL         /?a=1
**  << BeReq    >> 6

All the ReqURL log lines illustrate how the URL evolves from its original value into the final value given the 4 changes it goes through once the regex pattern is matched.

  • If the URL is /?gclid=x, the parameter will be stripped off and the URL ends up being /
  • If the URL is /?a=1&gclid=x, the gclid parameter will be stripped off while the a parameter remains untouched.

Update

As mentioned by @sash in the comments, if certain query string parameters appear after having stripped off the tracking parameters, the cache needs to be bypassed.

Here's the original VCL where an extra if-statement is added to bypass the cache:

sub vcl_recv {
    if (req.url ~ "(\?|&)(utm_source|utm_medium|utm_campaign|utm_content|gclid|cx|ie|cof|siteurl)=") {
        set req.url = regsuball(req.url, "&(utm_source|utm_medium|utm_campaign|utm_content|gclid|cx|ie|cof|siteurl)=([A-z0-9_\-\.%25]+)", "");
        set req.url = regsuball(req.url, "\?(utm_source|utm_medium|utm_campaign|utm_content|gclid|cx|ie|cof|siteurl)=([A-z0-9_\-\.%25]+)", "?");
        set req.url = regsub(req.url, "\?&", "?");
        set req.url = regsub(req.url, "\?$", "");
    }

    if (req.url ~ "(\?|&)(a|b|c)=") {
        return(pass);
    }
}

In this example the appearance of the a, b or c querystring parameter causes the cache to be bypassed.

Update 2

After further feedback by @sash in the comments, here's a VCL snippet that removes the tracking query string parameters and then bypasses the cache if any other, non-tracking parameters remain:

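sub vcl_recv {
    # Strip the tracking parameters if at least one of them is present.
    if (req.url ~ "(\?|&)(utm_source|utm_medium|utm_campaign|utm_content|gclid|cx|ie|cof|siteurl)=") {
        # Tracking parameters that follow another parameter.
        set req.url = regsuball(req.url, "&(utm_source|utm_medium|utm_campaign|utm_content|gclid|cx|ie|cof|siteurl)=([A-z0-9_\-\.%25]+)", "");
        # A tracking parameter in first position: keep the "?".
        set req.url = regsuball(req.url, "\?(utm_source|utm_medium|utm_campaign|utm_content|gclid|cx|ie|cof|siteurl)=([A-z0-9_\-\.%25]+)", "?");
        # Tidy up a leftover "?&" or a trailing "?".
        set req.url = regsub(req.url, "\?&", "?");
        set req.url = regsub(req.url, "\?$", "");
    }

    # Any key=value pair still in the query string is a non-tracking one.
    if (req.url ~ "\?[^&]+=") {
        return(pass);
    }
}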
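
The question also asked that, when an undefined parameter is present, the request bypass the cache with its URL left intact (gclid included), whereas the snippet above strips the tracking parameters before passing. A minimal sketch of that variant follows. It assumes the same tracking-parameter list as above; the X-Stripped-Url header is a hypothetical scratch name, not something Varnish defines; the value pattern [^&]* is a simplification swapped in for the character class used above; and it has not been run against the asker's setup.

```vcl
sub vcl_recv {
    # Build the stripped URL in a scratch header (hypothetical name) so that
    # req.url is still untouched if we decide to pass.
    set req.http.X-Stripped-Url = regsuball(req.url, "&(utm_source|utm_medium|utm_campaign|utm_content|gclid|cx|ie|cof|siteurl)=[^&]*", "");
    set req.http.X-Stripped-Url = regsuball(req.http.X-Stripped-Url, "\?(utm_source|utm_medium|utm_campaign|utm_content|gclid|cx|ie|cof|siteurl)=[^&]*", "?");
    set req.http.X-Stripped-Url = regsub(req.http.X-Stripped-Url, "\?&", "?");
    set req.http.X-Stripped-Url = regsub(req.http.X-Stripped-Url, "\?$", "");

    if (req.http.X-Stripped-Url ~ "\?") {
        # A non-tracking parameter is left over: bypass the cache and keep
        # the original URL, tracking parameters included.
        unset req.http.X-Stripped-Url;
        return (pass);
    }

    # Only tracking parameters (or none) were present: look up the clean URL
    # so that /home and /home?gclid=x share one cache object.
    set req.url = req.http.X-Stripped-Url;
    unset req.http.X-Stripped-Url;
}
```

With this sketch, /home?gclid=x should be looked up as /home, while /home?a=1&gclid=x should be passed to the backend unchanged. Unlike the snippet above, it passes on any leftover query string, including a bare parameter without "=".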

huangapple
  • Published on March 8, 2023, 19:05:15
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75672194.html