Delete lines in a blocklist file where the end of the line matches an entry in a whitelist file
Question
I'm trying to delete lines in an ad blocklist file, but only if the end of the blocklist line matches an entry in a whitelist file. Blocklist lines should not be deleted when the match is only at the start or in the middle of the line.
For example:
**Blocklist file**
randomsites.com
calendar.google.com
google.com
google.com.fake.com
**Whitelist file**
google.com
**Output to new_blocklist**
randomsites.com
google.com.fake.com
google.com.fake.com might not be a legitimate address, but the example demonstrates how I plan for this whitelist to work.
This line I've tried works, but takes many minutes (on an OpenWrt router) to process a blocklist of ~300k lines:
awk 'FNR==NR{a[$0];next} {for (i in a) {if ($0 ~ i "$") next}}1' /tmp/whitelist /tmp/blocklist > /tmp/new_blocklist
This line works on exact whole-line matches only, but is very quick (a few seconds). Could it be edited to meet the criteria, and still be faster than the one above?
awk 'NR==FNR{a[$0];next} !($0 in a)' /tmp/whitelist /tmp/blocklist > /tmp/tempfile
Thanks, everyone.
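Answer 1
Score: 2
Maybe instead of a lookup, you could assemble a single pattern once, joining the whitelist entries with the alternation operator `|`, grouping the whole expression in parentheses, and anchoring it at the end with `$`.
An unescaped dot matches any character, so it has to be escaped to match a literal dot. The leading `(^|[.])` makes the match start either at the beginning of the line or right after a dot.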
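awk '
FNR == NR {                       # first file (whitelist)
    gsub(/\./, "\\.")             # escape dots so they match literally
    tmp = tmp sep $0              # join the entries with "|"
    sep = "|"
    next
}
FNR == 1 {                        # first line of the blocklist: build the pattern once
    regexp = "(^|[.])(" tmp ")$"
}
$0 !~ regexp                      # print lines that do not end in a whitelisted domain
' /tmp/whitelist /tmp/blocklist > /tmp/new_blocklist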
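To make the assembled pattern visible, here is a small sketch that runs the same idea on throwaway sample files and prints the regular expression to stderr before filtering. The temporary directory, the `filter_by_regex` function name, and the extra entries `example.org` and `notgoogle.com` are made up for illustration. It also uses a bracket expression `[.]` instead of a backslash escape, which is my substitution to sidestep backslash-handling differences between awk implementations, not part of the answer.

```shell
#!/bin/sh
# Illustration only: hypothetical sample data in a throwaway directory.
dir=$(mktemp -d)
printf '%s\n' google.com example.org > "$dir/whitelist"
printf '%s\n' randomsites.com calendar.google.com google.com \
    google.com.fake.com notgoogle.com > "$dir/blocklist"

# filter_by_regex WHITELIST BLOCKLIST
# Prints the assembled regexp to stderr and the surviving lines to stdout.
filter_by_regex() {
    awk '
    FNR == NR {                     # first file: collect the entries
        gsub(/[.]/, "[.]")          # make every dot literal
        tmp = tmp sep $0
        sep = "|"
        next
    }
    FNR == 1 {                      # first line of second file: build the pattern once
        regexp = "(^|[.])(" tmp ")$"
        print "regexp: " regexp > "/dev/stderr"
    }
    $0 !~ regexp                    # keep lines that do not match
    ' "$1" "$2"
}

result=$(filter_by_regex "$dir/whitelist" "$dir/blocklist" 2>/dev/null)
printf '%s\n' "$result"
rm -rf "$dir"
```

With this sample data the pattern is `(^|[.])(google[.]com|example[.]org)$`, and the lines kept are `randomsites.com`, `google.com.fake.com` and `notgoogle.com`. The last one survives because of the `(^|[.])` guard.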
Answer 2
Score: 1
This might do what you want:
$ cat tst.awk
BEGIN { FS="." }
NR==FNR {
    allow[$0]
    next
}
{
    addr = $NF
    for ( i=NF-1; i>=1; i-- ) {
        addr = $i FS addr
        if ( addr in allow ) {
            next
        }
    }
}
{ print }

$ awk -f tst.awk allow block
randomsites.com
google.com.fake.com
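The above does literal string hash lookups of each `.`-separated suffix of a blocklist line, starting from the right side, so it will be fast and robust. For a simple domain name in your blocklist like `google.com` it does only 1 lookup in the allow array, just like your `!($0 in a)` does. For others like `google.com.fake.com` it does at most one fewer iterations/lookups than there are parts in the domain (4 parts in this case, so 3 iterations/lookups), stopping early if it finds a match in the allow array. Even then, each step is just a hash lookup, so it should still be fast.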
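The lookup counts described above can be checked with a small instrumented sketch of the same right-to-left walk. The temporary files, the `lookups` and `verdict` variables, and the output format are illustrative additions; the script reports each line's verdict and lookup count instead of printing only the kept lines.

```shell
#!/bin/sh
# Illustration only: hypothetical sample data in a throwaway directory.
dir=$(mktemp -d)
printf '%s\n' google.com > "$dir/allow"
printf '%s\n' randomsites.com calendar.google.com google.com \
    google.com.fake.com > "$dir/block"

trace=$(awk '
BEGIN { FS = "." }
NR == FNR { allow[$0]; next }       # load the allowlist as hash keys
{
    addr = $NF                      # start from the rightmost label
    lookups = 0
    verdict = "keep"
    for (i = NF - 1; i >= 1; i--) {
        addr = $i FS addr           # grow the suffix one label to the left
        lookups++
        if (addr in allow) {        # plain hash lookup, no regexp involved
            verdict = "drop"
            break
        }
    }
    print $0, verdict, lookups
}
' "$dir/allow" "$dir/block")
printf '%s\n' "$trace"
rm -rf "$dir"
```

For the sample data this reports 1 lookup for `randomsites.com`, `calendar.google.com` and `google.com`, and 3 lookups for `google.com.fake.com`. The work per line depends only on the number of labels in that line, not on the size of the allowlist.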
P.S. The old terminology for this was blacklist/whitelist; the current terms are [blocklist/allowlist](https://www.linkedin.com/pulse/allowlist-blocklist-better-terms-everyone-lets-use-them-rob-black/), rather than blocklist/whitelist.
Answer 3
Score: 1
A block-less, ternaries-only `awk` approach that also escapes more characters than the OP requires:
---
mawk 'NR == FNR ? (__ = __$_ "|")<_ : $_!~(!_ < FNR \
? _ : substr(_, gsub("[?./_:;=&]", "[&]", __),
sub(".$", ")$", __)))__' __='(' \
<( printf '%s' 'google.com') <( printf '%s' 'randomsites.com
calendar.google.com
google.com
google.com.fake.com' )
---
randomsites.com
google.com.fake.com
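For readers who find the ternary-only form hard to follow, here is what I believe is an equivalent long-form version. It is my reading of the one-liner, not the answer author's code. The variable name `pat`, the temporary files, and the extra sample line `notgoogle.com` are assumptions added for illustration.

```shell
#!/bin/sh
# Illustration only: hypothetical sample data in a throwaway directory.
dir=$(mktemp -d)
printf '%s\n' google.com > "$dir/whitelist"
printf '%s\n' randomsites.com calendar.google.com google.com \
    google.com.fake.com notgoogle.com > "$dir/blocklist"

# pat starts as "(" through a command-line assignment, mirroring __="(" above.
kept=$(awk '
NR == FNR {                               # whitelist: append each entry plus "|"
    pat = pat $0 "|"
    next
}
FNR == 1 {                                # first blocklist line: finish the pattern once
    gsub("[?./_:;=&]", "[&]", pat)        # wrap each listed character in brackets
    sub(".$", ")$", pat)                  # replace the trailing "|" with ")$"
}
$0 !~ pat                                 # keep lines that do not match
' pat='(' "$dir/whitelist" "$dir/blocklist")
printf '%s\n' "$kept"
rm -rf "$dir"
```

Here the finished pattern is `(google[.]com)$`. Because it has no `(^|[.])` guard like the one in Answer 1, it also drops `notgoogle.com`, so only `randomsites.com` and `google.com.fake.com` are kept. Whether that is acceptable depends on how strictly "end of the line" should respect domain boundaries.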