Delete lines in a blocklist file where the end of the line matches an entry in a whitelist file

Question

I'm trying to delete lines in an ad blocklist file, but only if the end of the blocklist line matches an entry in a whitelist file. Blocklist lines must not be deleted if the match is at, e.g., the start or middle of the line.

E.g.:

**Blocklist file**

    randomsites.com
    calendar.google.com
    google.com
    google.com.fake.com

**Whitelist file**

    google.com

**Output to new_blocklist**

    randomsites.com
    google.com.fake.com

`google.com.fake.com` might not be a legitimate address, but the example demonstrates how I plan for this whitelist to work.

This line I've tried works, but takes many minutes (on an OpenWrt router) to process a ~300k-line blocklist:

    awk 'FNR==NR{a[$0];next} {for (i in a) {if ($0 ~ i "$") next}}1' /tmp/whitelist /tmp/blocklist > /tmp/new_blocklist

This line works on exact whole-line matches only, but is very quick, e.g., seconds only. Could it be edited to meet the criteria (and run faster than the one above)?

    awk 'NR==FNR{a[$0];next} !($0 in a)' /tmp/whitelist /tmp/blocklist > /tmp/tempfile

Thanks, everyone.

Answer 1

Score: 2

Instead of a lookup, you could assemble a single pattern once: join the whitelist entries into an alternation using `|`, group the whole expression in parentheses, and end it with `$`.

An unescaped dot matches any character, so you have to escape it to match a literal dot.

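    awk '
        FNR == NR {                  # first file: the whitelist
            gsub(/\./, "\\.")        # escape dots so they match literally
            tmp = tmp sep $0         # append the entry to the alternation
            sep = "|"
            next
        }
        FNR == 1 {                   # first line of the blocklist: build the regexp once
            # (^|[.]) makes the match start at the line start or right after a dot
            regexp = "(^|[.])(" tmp ")$"
        }
        $0 !~ regexp                 # print only lines that do not end in a whitelisted domain
    ' /tmp/whitelist /tmp/blocklist > /tmp/new_blocklist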
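
To see what the `(^|[.])` prefix in the pattern above buys you, here is a minimal sketch that runs the same idea on the sample data from the question. The temporary directory, the `filter_blocklist` function name, and the extra `notgoogle.com` entry are assumptions added for illustration. The sketch also writes the dot escape as `[.]`, a bracket expression that avoids backslash handling in the replacement string; that is a substitution made here, not part of the answer.

```shell
#!/bin/sh
# Hypothetical sample files; the directory and file names are made up for this sketch.
dir=$(mktemp -d)
printf '%s\n' google.com > "$dir/whitelist"
printf '%s\n' randomsites.com calendar.google.com google.com \
    google.com.fake.com notgoogle.com > "$dir/blocklist"

# filter_blocklist WHITELIST BLOCKLIST
# Prints the blocklist lines that do not end in a whitelisted domain.
filter_blocklist() {
    awk '
        FNR == NR {                 # first file: the whitelist
            gsub(/[.]/, "[.]")      # turn every dot into a literal-dot bracket expression
            tmp = tmp sep $0
            sep = "|"
            next
        }
        FNR == 1 {                  # first line of the blocklist: build the regexp once
            regexp = "(^|[.])(" tmp ")$"
        }
        $0 !~ regexp
    ' "$1" "$2"
}

filter_blocklist "$dir/whitelist" "$dir/blocklist"
```

With this data, `calendar.google.com` and `google.com` are dropped, while `notgoogle.com` is kept, because the match has to begin at the start of the line or right after a dot.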

Answer 2

Score: 1

This might do what you want:

    $ cat tst.awk
    BEGIN { FS="." }
    NR==FNR {                   # first file: store the allowlist entries as array keys
        allow[$0]
        next
    }
    {
        addr = $NF              # start from the rightmost label
        for ( i=NF-1; i>=1; i-- ) {
            addr = $i FS addr   # grow the suffix one label to the left
            if ( addr in allow ) {
                next            # suffix is allowed: skip this blocklist line
            }
        }
    }
    { print }

    $ awk -f tst.awk allow block
    randomsites.com
    google.com.fake.com

The above does literal string hash lookups of each `.`-separated suffix of every blocklist line, starting from the right side, and so will be fast and robust. For a simple domain name in your blocklist like `google.com` it'll only do 1 lookup of the allow array, just like your `!($0 in a)` does. For others like `google.com.fake.com` it'll do at most 1 fewer iterations/lookups than there are parts of the domain, i.e. 4 parts in this case so 3 iterations/lookups, stopping early if/when it finds a match in the allow array. Even then, it's just a hash lookup each time, so it should still be fast.

P.S. The old terminology for this was blacklist/whitelist; the current terminology is [blocklist/allowlist](https://www.linkedin.com/pulse/allowlist-blocklist-better-terms-everyone-lets-use-them-rob-black/), rather than blocklist/whitelist.
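
To make the lookup count concrete, the following sketch prints each suffix that the loop in `tst.awk` would test for a single name, in order. The `suffixes` helper is a hypothetical name introduced here, and unlike the real script it does not stop at the first match; it only lists the candidates.

```shell
#!/bin/sh
# suffixes NAME
# Prints, one per line, every dot-separated suffix that the loop would look up.
suffixes() {
    printf '%s\n' "$1" | awk '
        BEGIN { FS = "." }
        {
            addr = $NF                      # rightmost label; never looked up on its own
            for (i = NF - 1; i >= 1; i--) {
                addr = $i FS addr           # prepend the next label to the left
                print addr                  # this is the key tested against the allow array
            }
        }
    '
}

suffixes google.com.fake.com
```

For `google.com.fake.com` this lists `fake.com`, `com.fake.com`, and `google.com.fake.com`, which are the 3 lookups described above. One consequence worth noting: because the rightmost label is never tested by itself, an allowlist entry without a dot would not match anything with this loop.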



# Answer 3
**Score**: 1

A block-less, ternaries-only `awk` approach that escapes more characters than the OP's requirements call for:

---

    mawk 'NR == FNR ? (__ = __$_ "|")<_ : $_!~(!_ < FNR \
          ? _ : substr(_, gsub("[?./_:;&=]", "[&]", __),
                           sub(".$", ")$", __)))__' __='(' \
    <( printf '%s' 'google.com') <( printf '%s' 'randomsites.com
    calendar.google.com
    google.com
    google.com.fake.com' )

---

    randomsites.com
    google.com.fake.com
