Match line that contains two strings from another file
Question
I have a file source.txt containing two columns of strings separated by a single space.
foo bar
foo baz
goo gaa
Also, there is another file, pattern.txt, which is a list of strings (one per line) that serves as the pattern source. It could look like this:
foo
bar
goo
The goal is to extract only the lines that contain two strings from the pattern file. Repetitions are fine (e.g. foo foo would be valid).
So the desired output here would be
foo bar
I managed to extract lines that contain at least one term from the pattern file with grep:
grep -wFf pattern.txt source.txt
The command above would return all lines from source.txt, since at least one term from pattern.txt is present in each line. My approaches using piped grep commands (as shown in related questions that consider only two search terms) have not worked out.
grep is not mandatory; awk, sed, or perl work as well. I have a solution in Python, but it is terribly slow (¬blazinglyfast).
Thank you!
Response to Answers
My Python solution looks like this:
import sys

f_pattern = sys.argv[1]
f_source = sys.argv[2]

with open(f_pattern, "r", encoding="utf-8") as fp:
    pattern = set(fp.read().split("\n"))

with open(f_source, "r", encoding="utf-8") as fp:
    for line in fp:
        w1, w2 = line.strip("\n").split(" ")
        if w1 in pattern and w2 in pattern:
            print(line, end="")  # \n still present in line string
Indeed, it's not that bad (time-wise) compared to some answers.
(My) Python
time python matcher.py pattern.txt source.txt
>> 158,12s user 1,82s system 99% cpu 2:40,08 total
awk by @Avinash Chandravansi
time awk -F' ' 'FNR==NR {arr [$0];next} $2 in arr' pattern.txt source.txt
>> 106,72s user 5,69s system 99% cpu 1:52,88 total
Not quite sure yet, but I think that gives an incorrect result.
awk by @KamilCuk
time awk 'NR==FNR{a[$0];next} {cnt=0; for (k in a) { cnt += $0~k; if (cnt >= 2){ print; break; }}}' pattern.txt source.txt
>> Unclear, more than 20 minutes. Ctrl+C
awk by @Fravadona
time awk 'FNR==NR {patterns[$0]; next}($1 in patterns) && ($2 in patterns)' pattern.txt source.txt
>> 95,45s user 2,46s system 99% cpu 1:38,03 total
^-- This seems to be the accepted answer (for me).
Answer 1
Score: 4
You're using grep -F, so I guess that the "patterns" aren't regexps. Now, if you want to match full strings (and not substrings), then you can do:
awk '
FNR == NR { patterns[$0]; next }
($1 in patterns) && ($2 in patterns)
' pattern.txt source.txt
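For readers less familiar with the FNR == NR idiom, the same two-pass logic can be sketched in Python. This is only an illustration under stated assumptions: the function name filter_lines is made up here, fields are split on runs of whitespace as awk does by default, and lines with fewer than two fields are simply skipped.

```python
def filter_lines(pattern_lines, source_lines):
    """Yield source lines whose first two fields are both known patterns."""
    # First pass (awk: FNR == NR): remember every pattern line as a set member.
    patterns = {line.rstrip("\n") for line in pattern_lines}
    # Second pass: keep a line only if fields 1 and 2 are both in the set.
    for line in source_lines:
        fields = line.split()  # splits on runs of whitespace, like awk's default
        # Assumption of this sketch: lines with fewer than two fields are skipped.
        if len(fields) >= 2 and fields[0] in patterns and fields[1] in patterns:
            yield line.rstrip("\n")


if __name__ == "__main__":
    patterns = ["foo\n", "bar\n", "goo\n"]
    source = ["foo bar\n", "foo baz\n", "goo gaa\n"]
    for hit in filter_lines(patterns, source):
        print(hit)  # prints: foo bar
```

Each source line costs two hash lookups no matter how many patterns there are, which is why this approach scales well with large pattern files.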
Answer 2
Score: 1
With awk, store the patterns in an array and then check whether at least two of them match.
$ awk 'NR==FNR{a[$0];next} {cnt=0; for (k in a) { cnt += $0~k; if (cnt >= 2){ print; break; }}}' pattern.txt source.txt
foo bar
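Note that this loop tests every pattern as a regular expression against the whole line. That differs from an exact field comparison in two ways: a pattern can match as a substring, and a repeated word counts only once. A rough Python sketch of both behaviours follows. The function names are hypothetical, and re.search only stands in for awk's ~ operator, since the two regex dialects are not identical.

```python
import re


def regex_count_match(line, patterns):
    """Mimic the loop above: count distinct patterns found anywhere in the line."""
    count = 0
    for pattern in patterns:
        if re.search(pattern, line):  # substring/regex match, not whole-field
            count += 1
            if count >= 2:
                return True
    return False


def exact_field_match(line, patterns):
    """Exact comparison: the first two fields must both be pattern strings."""
    fields = line.split()
    return len(fields) >= 2 and fields[0] in patterns and fields[1] in patterns


if __name__ == "__main__":
    patterns = {"foo", "bar", "goo"}
    for line in ["foo bar", "foo foo", "foobar xyz"]:
        print(line, regex_count_match(line, patterns), exact_field_match(line, patterns))
```

With these inputs, the two functions agree on foo bar but disagree on foo foo (rejected by the counting loop, accepted by the exact comparison) and on foobar xyz (the reverse). The counting loop also visits every pattern for every line, so its cost grows with the size of the pattern file.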
Answer 3
Score: 0
This might work for you (GNU sed):
sed 'H;1h;$!d;x;y/\n/|/;s#.*#/(&).*(&)/p;d#' patternFile | sed -Ef - file
Create a sed script from the patternFile and apply it to the source file.
The script uses the same alternation regexp twice in a single match: if a line matches, it is printed; otherwise it is deleted.
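To make the generated expression easier to see, here is a small Python sketch of the same idea. It is not a translation of the sed command: the helper name build_two_match_regex is hypothetical, and the re.escape call is an addition, since the sed pipeline inserts the pattern lines unescaped.

```python
import re


def build_two_match_regex(patterns):
    """Build (p1|p2|...).*(p1|p2|...) from a list of literal pattern strings."""
    # re.escape is an addition in this sketch; the sed command inserts the
    # pattern lines unescaped, so regex metacharacters keep their meaning there.
    alternation = "|".join(re.escape(p) for p in patterns)
    return re.compile(f"({alternation}).*({alternation})")


if __name__ == "__main__":
    regex = build_two_match_regex(["foo", "bar", "goo"])
    for line in ["foo bar", "foo baz", "goo gaa"]:
        if regex.search(line):
            print(line)  # prints: foo bar
```

As in the sed version, the expression is not anchored to whole words, so a line such as foobar xyz would also match.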