

Match line that contains two strings from another file








I have a file source.txt containing two columns of strings separated by whitespace.

foo bar
foo baz
goo gaa

Also, there is another file, pattern.txt, which is a list of strings (one per line) that should serve as the pattern source. This could look like


The goal is to extract only lines that contain two strings from the pattern file.
Repetitions are fine (e.g. foo foo would be valid).

So the desired output here would be

foo bar

I managed to extract lines that contain at least one term from the pattern file with grep:

grep -wFf pattern.txt source.txt

The command above would return all lines from source.txt since at least one term from pattern.txt is present in each line. My approaches using piped grep commands (which are shown in related questions considering only two search terms) have not worked out.

grep is not mandatory. awk, sed, perl work as well. I have a solution in Python, but it is terribly slow (¬blazinglyfast).

Thank you!

Response to Answers

My Python solution looks like this:

import sys

f_pattern = sys.argv[1]
f_source = sys.argv[2]

with open(f_pattern, "r", encoding="utf-8") as fp:
    pattern = set(fp.read().split("\n"))

with open(f_source, "r", encoding="utf-8") as fp:
    for line in fp:
        w1, w2 = line.strip("\n").split(" ")
        if w1 in pattern and w2 in pattern:
            print(line, end="")  # \n still present in line string

Indeed, it's not that bad (time-wise) compared to some answers.
(My) Python

time python matcher.py pattern.txt source.txt 
>> 158,12s user 1,82s system 99% cpu 2:40,08 total

awk by @Avinash Chandravansi

time awk -F' ' 'FNR==NR {arr [$0];next} $2 in arr' pattern.txt source.txt
>> 106,72s user 5,69s system 99% cpu 1:52,88 total

Not quite sure yet, but I think that gives an incorrect result.
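
A small sketch of why testing only the second column can over-report. The pattern set and the extra line "zzz bar" are made up for illustration (the contents of pattern.txt are not shown above), and both helper functions are hypothetical names, not code from any answer.

```python
# Hypothetical data: the pattern set and the "zzz bar" line are assumptions.
patterns = {"foo", "bar", "goo"}
lines = ["foo bar", "foo baz", "goo gaa", "zzz bar"]


def second_column_only(line, patterns):
    """Mimics `$2 in arr`: only the second field is tested."""
    fields = line.split()
    return len(fields) >= 2 and fields[1] in patterns


def both_columns(line, patterns):
    """Mimics `($1 in patterns) && ($2 in patterns)`."""
    fields = line.split()
    return len(fields) >= 2 and fields[0] in patterns and fields[1] in patterns


only_second = [line for line in lines if second_column_only(line, patterns)]
both = [line for line in lines if both_columns(line, patterns)]

print(only_second)  # ['foo bar', 'zzz bar']
print(both)         # ['foo bar']
```

On this data the second-column check also lets "zzz bar" through, because the first column is never compared against the patterns.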

awk by @KamilCuk

time awk 'NR==FNR{a[$0];next} {cnt=0; for (k in a) { cnt += $0~k; if (cnt >= 2){ print; break; }}}' pattern.txt source.txt
>> Unclear, more than 20 minutes. Ctrl+C

awk by @Fravadona

time awk 'FNR==NR {patterns[$0]; next}($1 in patterns) && ($2 in patterns)' pattern.txt source.txt
>> 95,45s user 2,46s system 99% cpu 1:38,03 total

^-- This seems to be the accepted answer (for me).


Score: 4

You're using grep -F, so I guess that the "patterns" aren't regexps. Now, if you want to match full strings (and not substrings), then you can do:

awk '
    FNR == NR { patterns[$0]; next }
    ($1 in patterns) && ($2 in patterns)
' pattern.txt source.txt
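
For readers less familiar with the `FNR == NR` idiom, here is a rough Python rendering of the same two-pass logic. It is only a sketch: the in-memory file contents are assumed sample data, the function names are hypothetical, and unlike the awk one-liner it skips empty pattern lines.

```python
import io


def load_patterns(pattern_file):
    """First pass (awk: FNR == NR): keep every non-empty line as a set member."""
    return {line.rstrip("\n") for line in pattern_file if line.strip()}


def matching_lines(source_file, patterns):
    """Second pass: yield lines whose first two fields are both known patterns."""
    for line in source_file:
        fields = line.split()  # split on any run of whitespace, like awk's default
        if len(fields) >= 2 and fields[0] in patterns and fields[1] in patterns:
            yield line.rstrip("\n")


# In-memory stand-ins for pattern.txt and source.txt (contents are assumed).
pattern_file = io.StringIO("foo\nbar\ngoo\n")
source_file = io.StringIO("foo bar\nfoo baz\ngoo gaa\nfoo foo\n")

patterns = load_patterns(pattern_file)
result = list(matching_lines(source_file, patterns))
print(result)  # ['foo bar', 'foo foo']
```

Each field is looked up in a hash-based set, so the work per line does not grow with the number of patterns.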


# Answer 2
**Score**: 1


With awk, store the patterns in an array and then check whether at least two of them match.

$ awk 'NR==FNR{a[$0];next} {cnt=0; for (k in a) { cnt += $0~k; if (cnt >= 2){ print; break; }}}' pattern.txt source.txt
foo bar
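
To make the behaviour of this counting loop easier to see, here is a Python sketch of the same idea. The pattern list and the test lines are assumptions chosen for illustration, and `count_based_match` is a hypothetical name.

```python
import re


def count_based_match(line, patterns):
    """Mimics the awk loop: count distinct patterns that match anywhere in the line."""
    count = 0
    for pattern in patterns:
        if re.search(pattern, line):  # like `$0 ~ k`: an unanchored regex match
            count += 1
            if count >= 2:
                return True
    return False


patterns = ["foo", "bar", "goo"]  # assumed pattern list
lines = ["foo bar", "foo baz", "goo gaa", "foo foo", "food bars"]

matched = [line for line in lines if count_based_match(line, patterns)]
print(matched)  # ['foo bar', 'food bars']
```

Two things follow from this sketch: a repeated term such as "foo foo" counts only once, and substrings such as "food" count as hits. The loop also visits up to every pattern for every line, which may explain the long runtime reported in the question.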


Score: 0

This might work for you (GNU sed):

sed 'H;1h;$!d;x;y/\n/|/;s#.*#/(&).*(&)/p;d#' patternFile | sed -Ef - file

Create a sed script from patternFile and apply it to the source file.

The script uses the same alternation regexp twice in a single match: lines that match are printed, all other lines are deleted.
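
The regexp that the first sed command generates can be sketched in Python as follows. The pattern list and test lines are assumed, `build_double_alternation` is a hypothetical helper, and the patterns are joined unescaped to mirror the sed version (so regex metacharacters in a pattern would keep their special meaning).

```python
import re


def build_double_alternation(patterns):
    """Build `(p1|p2|...).*(p1|p2|...)`, the shape of regex the sed script matches."""
    alternation = "|".join(patterns)  # unescaped on purpose, as in the sed version
    return re.compile(f"({alternation}).*({alternation})")


patterns = ["foo", "bar", "goo"]  # assumed pattern list
regex = build_double_alternation(patterns)

lines = ["foo bar", "foo baz", "goo gaa", "foo foo", "food bars"]
matched = [line for line in lines if regex.search(line)]

print(regex.pattern)  # (foo|bar|goo).*(foo|bar|goo)
print(matched)        # ['foo bar', 'foo foo', 'food bars']
```

Repetitions such as "foo foo" are accepted here, but so are substring hits such as "food bars", since nothing anchors the alternatives to whole words.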

  • Posted on February 16, 2023, 02:49:31
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75464230.html



