Delete lines in blocklist file, where the end of those lines match an entry from an allowlist file (Using Dnsmasq syntax)


Question

This is a modification of a question I've previously asked, this time to account for Dnsmasq syntax in the blocklist. I'm trying to delete lines in a blocklist file, but only if the end of the blocklist line matches an entry in an allowlist file. Subdomain blocklist entries should therefore also be removed. I'm trying to stick to awk, since that package is included in OpenWRT, as opposed to e.g. gawk, which is available but is an additional download.

Blocklist file (with dnsmasq-style syntax):

local=/randomsites.com/
local=/calendar.google.com/
local=/google.com/
local=/google.com.fake.com/
local=/fakegoogle.com/

Allowlist file:

google.com

Desired output in new_blocklist:

local=/randomsites.com/
local=/google.com.fake.com/
local=/fakegoogle.com/

Below is my attempt to modify a solution from a previous question of mine, this time to account for the dnsmasq syntax in the blocklist. That method processed large lists extremely quickly (~300k lines in around 10 seconds on a dual-core router), which is especially useful for lower-powered routers.

$ cat tst.awk
BEGIN { FS="." }
NR==FNR {
    allow[$0"/"]
    next
}
{
    addr = $NF
    for ( i=NF-1; i>=1; i-- ) {
        addr = $i FS addr
        if ( substr(addr,8) in allow ) {
            next
        }
    }
}
{ print }

Running awk -f tst.awk allow block produces this output:

local=/randomsites.com/
local=/calendar.google.com/
local=/google.com.fake.com/
local=/fakegoogle.com/

And so is not removing the local=/calendar.google.com/ entry as desired.
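To see where the attempt goes wrong, it may help to print the key that substr(addr,8) actually produces on each loop iteration. The sketch below is only a diagnostic and not part of the original question. It reuses the same FS="." and loop, and shows that the 7-character "local=/" prefix is only stripped correctly once addr has grown to include the first field, so the key google.com/ is never generated for the subdomain line.

```shell
# Diagnostic sketch: show addr and the lookup key on every iteration.
out=$(printf '%s\n' 'local=/google.com/' 'local=/calendar.google.com/' |
awk 'BEGIN { FS="." }
{
    addr = $NF
    for ( i=NF-1; i>=1; i-- ) {
        addr = $i FS addr
        # substr(addr,8) drops 7 characters even when addr has no "local=/" prefix yet
        print $0 ": addr=" addr " key=" substr(addr,8)
    }
}')
printf '%s\n' "$out"
```

For local=/calendar.google.com/ the keys tried are com/ and calendar.google.com/, neither of which is in the allow array, so the line is printed instead of being skipped.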

The exact previous solution by Ed Morton, which worked perfectly for a blocklist of e.g. google.com, calendar.google.com (i.e. without the dnsmasq syntax of local=/...../), was:

$ cat tst.awk
BEGIN { FS="." }
NR==FNR {
    allow[$0]
    next
}
{
    addr = $NF
    for ( i=NF-1; i>=1; i-- ) {
        addr = $i FS addr
        if ( addr in allow ) {
            next
        }
    }
}
{ print }

I realise I haven't modified this solution correctly, but I have spent quite a lot of time reading and experimenting to try to solve it myself first.

REPORTING ON SOLUTIONS BELOW

The solutions from both @markp-fuso and @Ed Morton (author of the original awk solution, which did not handle dnsmasq syntax) produce exactly the same, correct result. The interesting part is the run time of each solution on a Netgear R7800 OpenWRT router, which has a dual-core CPU from 2015. Multiple runs of each produced consistent run times:

@markp-fuso solution:

300k lines blocklist, 13 lines allowlist = 23.3 seconds
300k lines blocklist, 300k lines allowlist = 21.5 seconds

@Ed Morton solution:

300k lines blocklist, 13 lines allowlist = 47.4 seconds
300k lines blocklist, 300k lines allowlist = 46 seconds

Note that both solutions run faster with the larger allowlist!

Thank you both! This is really great, and a big contribution to a little project we have going for OpenWRT to block ads on the router. Please delete the link if it is not allowed here:

https://forum.openwrt.org/t/adblock-lean-set-up-adblock-using-dnsmasq-blocklist/157076/35

Answer 1

Score: 2

One idea for modifying the answer to OP's previous question:

$ cat tst.awk
BEGIN   { FS="/" }                       # split on "/"
NR==FNR { allow[$0]; next }
        { n=split($2,arr,".")            # further split on "."

          # process as with answer to previous question

          addr = arr[n]
          for ( i=n-1; i>=1; i-- ) {
              addr = arr[i] "." addr
              if ( addr in allow )
                 next
          }
        }
1                                        # print current line; behaves identically to "{ print }"

NOTE: by using FS="/" and then referencing $2 we are stripping off the local=/ and (trailing) /
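As a quick illustration of that note, here is a minimal sketch (not from the original answer) that runs the two splitting steps on a single sample line. With FS="/", the text before the first slash (local=) is $1, the domain is $2, and the trailing slash leaves an empty $3. split() then breaks the domain into its dot-separated labels.

```shell
# Step 1: FS="/" isolates the domain as $2.
domain=$(printf '%s\n' 'local=/calendar.google.com/' | awk -F/ '{ print $2 }')
printf '%s\n' "$domain"

# Step 2: split() on "." yields the labels; print the count, first and last label.
labels=$(printf '%s\n' "$domain" |
    awk '{ n = split($0, arr, "."); print n, arr[1], arr[n] }')
printf '%s\n' "$labels"
```

The loop in tst.awk then rebuilds the domain from the last label backwards (com, google.com, calendar.google.com) and checks each suffix against the allowlist.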

Taking for a test drive:

$ awk -f tst.awk allow block
local=/randomsites.com/
local=/google.com.fake.com/
local=/fakegoogle.com/

Answer 2

Score: 2

$ cat tst.awk
BEGIN { FS="." }
NR==FNR {
    allow[$0]
    next
}
{
    orig = $0
    gsub("^local=/|/$","")
    addr = $NF
    for ( i=NF-1; i>=1; i-- ) {
        addr = $i FS addr
        if ( addr in allow ) {
            next
        }
    }
}
{ print orig }

$ awk -f tst.awk allowlist blocklist
local=/randomsites.com/
local=/google.com.fake.com/
local=/fakegoogle.com/
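This answer comes without commentary, so here is a small sketch of what it appears to rely on: gsub() with no target operates on $0, and changing $0 makes awk re-split the record on FS. That is why the line is saved in orig first and why $1 and $NF no longer carry the local=/ prefix or the trailing slash afterwards. The print statement is only for illustration.

```shell
result=$(printf '%s\n' 'local=/calendar.google.com/' |
awk 'BEGIN { FS="." }
{
    orig = $0                  # keep the untouched line for printing later
    gsub("^local=/|/$","")     # edits $0, so the fields are re-split on "."
    print NF, $1, $NF, orig
}')
printf '%s\n' "$result"
```

After the gsub() the fields are calendar, google and com, so the suffix loop from the previous question can be reused unchanged.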

Answer 3

Score: 1

In your specific case you could also use grep. If you know the contents of allowlist, you can simply use grep -v with the pattern, e.g.

grep -v '\([.]\|/\)google.com/$' blocklist

If allowlist contains a single entry as shown, then in bash you could do:

grep -v '\([.]\|/\)'"$(<allowlist)"'/$' blocklist

And if you are using POSIX shell, then you could combine read -r and build the pattern from the single entry in allowlist, e.g.

read -r match <allowlist
pattern="\([.]\|/\)$match/$"
grep -v "$pattern" blocklist

Example Use/Output

All above provide the same output with the contents in allowlist and blocklist as you show, e.g.

$ grep -v '\([.]\|/\)'"$(<allowlist)"'/$' blocklist
local=/randomsites.com/
local=/google.com.fake.com/
local=/fakegoogle.com/

Note: if allowlist contains multiple entries, then you would need to use readarray -t in bash to fill an indexed array with the contents and build your pattern from there. In POSIX shell you would just loop while read -r match; do ... done < allowlist.
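For the multiple-entry case, one possible alternative to looping (a sketch, not something the answer itself proposes) is to turn every allowlist entry into a pattern with sed and pass the whole pattern file to a single grep -v -f call. The file name patterns and the second allowlist entry example.org are made up for illustration, and the sketch assumes entries contain only ordinary domain characters (letters, digits, hyphens, dots).

```shell
dir=$(mktemp -d) && cd "$dir" || exit 1
printf '%s\n' 'local=/randomsites.com/' 'local=/calendar.google.com/' \
    'local=/google.com/' 'local=/google.com.fake.com/' \
    'local=/fakegoogle.com/' 'local=/ads.example.org/' > blocklist
printf '%s\n' 'google.com' 'example.org' > allowlist

# Escape each "." as "[.]", then wrap every entry as  [./]ENTRY/$
# so it only matches after a "." or "/" and at the end of the line.
sed 's/[.]/[.]/g; s|.*|[./]&/$|' allowlist > patterns

# One grep call drops every blocklist line matching any pattern.
grep -v -f patterns blocklist > new_blocklist
cat new_blocklist
```

The bracket expression [./] plays the same role as \([.]\|/\) in the commands above, without relying on the \| alternation extension.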

grep just provides another approach. With a single entry in allowlist there would be little, if any, difference in efficiency between the use of awk and grep. If however allowlist contains multiple entries, then awk would provide a better solution avoiding the subshells created with multiple calls to grep. Let me know if you have questions.

  • Posted by huangapple on 2023-05-15 03:31:20
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76249336.html