May 6, 2023, 19:05:54

Delete lines in a blocklist file where the end of the line matches an entry in a whitelist file

Question

I'm trying to delete lines in an ad blocklist file, but only if the end of the blocklist line matches an entry in a whitelist file. A blocklist line must not be deleted if the match is at the start or in the middle of the line.

E.g.:

**Blocklist file**
randomsites.com
calendar.google.com
google.com
google.com.fake.com

**Whitelist file**
google.com

**Output to new_blocklist**
randomsites.com
google.com.fake.com

The address google.com.fake.com above might not be legitimate, but the example demonstrates how I plan for this whitelist to work.

This line I've tried works, but takes many minutes (on an OpenWrt router) to process a ~300k-line blocklist:

awk 'FNR==NR{a[$0];next} {for (i in a) {if ($0 ~ i "$") next}}1' /tmp/whitelist /tmp/blocklist > /tmp/new_blocklist

This line works on exact whole-line matches only, but is very quick (a matter of seconds). Could it be edited to meet the criteria above (and run faster than the first command)?

awk 'NR==FNR{a[$0];next} !($0 in a)' /tmp/whitelist /tmp/blocklist > /tmp/tempfile

Thanks, everyone.

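Answer 1

Score: 2

Instead of a lookup, you could assemble a single pattern once: join the whitelist entries into an alternation with |, group the whole expression in parentheses, and end it with $.

An unescaped dot matches any character, so you have to escape it to match a literal dot.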

awk '
    FNR == NR {
        gsub(/\./, "\\.")
        tmp = tmp sep $0
        sep = "|"
      next
    }
    FNR == 1 {
        regexp = "(^|[.])(" tmp ")$"
    }
    $0 !~ regexp
' /tmp/whitelist /tmp/blocklist > /tmp/new_blocklist
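
To see why the escaping and the `(^|[.])` prefix matter, here is a minimal sketch that tests a few hostnames against hand-written patterns of the same shape as the one the script builds. The hostnames `googleXcom` and `notgoogle.com` are made up for illustration, and `[.]` is used in place of `\.` only to keep the quoting simple.

```shell
# Unescaped: "." matches any character, so "googleXcom" is matched too.
unescaped=$(printf '%s\n' googleXcom | awk '$0 ~ "(^|[.])(google.com)$"')

# With a literal dot ("[.]" here), "googleXcom" no longer matches.
escaped=$(printf '%s\n' googleXcom | awk '$0 ~ "(^|[.])(google[.]com)$"')

# The "(^|[.])" prefix keeps "notgoogle.com" while "calendar.google.com" is dropped.
kept=$(printf '%s\n' notgoogle.com calendar.google.com |
       awk '$0 !~ "(^|[.])(google[.]com)$"')

printf 'unescaped=%s escaped=%s kept=%s\n' "$unescaped" "$escaped" "$kept"
```

The prefix requires a match to begin at the start of the line or right after a dot, which is what separates a true subdomain from a name that merely ends in the same characters.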

Answer 2

Score: 1

This might do what you want:

$ cat tst.awk
BEGIN { FS="." }
NR==FNR {
    allow[$0]
    next
}
{
    addr = $NF
    for ( i=NF-1; i>=1; i-- ) {
        addr = $i FS addr
        if ( addr in allow ) {
            next
        }
    }
}
{ print }

$ awk -f tst.awk allow block
randomsites.com
google.com.fake.com

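The above does literal string hash lookups of each `.`-separated suffix of a blocklist line, starting from the right side, so it will be fast and robust. For a simple domain name in your blocklist like `google.com` it does only 1 lookup in the allow array, just like your `!($0 in a)` does. For others like `google.com.fake.com` it does at most one fewer lookup than there are parts in the domain (4 parts in this case, so 3 lookups), stopping as soon as it finds a match in the allow array. Even then, each step is just a hash lookup, so it should still be fast.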
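
The lookup sequence described above can be traced with a small sketch. It reuses the loop from `tst.awk` but prints each candidate suffix instead of testing it against the allow array; the `lookups` variable name is only for illustration.

```shell
# Print every suffix that tst.awk would look up for one blocklist line.
lookups=$(printf '%s\n' google.com.fake.com |
awk -F. '{
    addr = $NF                        # start with the rightmost label
    for (i = NF - 1; i >= 1; i--) {
        addr = $i FS addr             # prepend the next label to the left
        print addr                    # tst.awk tests (addr in allow) here
    }
}')
printf '%s\n' "$lookups"
```

For this four-part name the sketch prints three candidates, from `fake.com` up to the full name. Because the loop starts at `NF-1`, a line with no dot at all is never looked up, so `tst.awk` as written would always print such a line.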

P.S. The old terminology for this was blacklist/whitelist; the current terminology is [blocklist/allowlist](https://www.linkedin.com/pulse/allowlist-blocklist-better-terms-everyone-lets-use-them-rob-black/), rather than blocklist/whitelist.

# Answer 3
**Score**: 1

A block-less, ternaries-only `awk` approach that escapes more characters than the OP's requirements call for:

---

    mawk 'NR == FNR ? (__ = __$_ "|")<_ : $_!~(!_ < FNR \
          ? _ : substr(_, gsub("[?./_:;=&]", "[&]", __), 
                           sub(".$", ")$", __)))__' __='(' \
    <( printf '%s' 'google.com') <( printf '%s' 'randomsites.com
                                                 calendar.google.com
                                                 google.com
                                                 google.com.fake.com' )
---

     1	randomsites.com

     2	google.com.fake.com
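
For readers who find the one-liner hard to follow, here is a rough, more conventional rendering of what it appears to do: collect the whitelist into an alternation, wrap each special character in a bracket expression, anchor the result with `$`, and print the blocklist lines that do not match. The variable names `pat` and `sep` and the temporary files are assumptions for illustration, not part of the original answer.

```shell
# Sample inputs in a scratch directory (paths are placeholders).
dir=$(mktemp -d)
printf '%s\n' google.com > "$dir/whitelist"
printf '%s\n' randomsites.com calendar.google.com google.com google.com.fake.com > "$dir/blocklist"

result=$(awk '
    NR == FNR {                           # first file: collect whitelist entries
        pat = pat sep $0
        sep = "|"
        next
    }
    FNR == 1 {                            # first blocklist line: finish the regex once
        gsub("[?./_:;=&]", "[&]", pat)    # wrap special characters, e.g. "." -> "[.]"
        pat = "(" pat ")$"
    }
    $0 !~ pat                             # print lines whose end does not match
' "$dir/whitelist" "$dir/blocklist")

printf '%s\n' "$result"
rm -rf "$dir"
```

Note that this pattern has no `(^|[.])` prefix, so with these semantics a name such as `notgoogle.com` would also be removed.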
