Match line that contains two strings from another file

Question

I have a file source.txt containing two columns of strings separated by whitespace.

foo bar
foo baz
goo gaa

Also, there is another file pattern.txt which is a list of strings (1 per line) that should serve as pattern source. This could look like

foo
bar
goo

The goal is to extract only those lines in which both strings come from the pattern file.
Repetitions are fine (e.g. foo foo would be valid).

So the desired output here would be

foo bar

I managed to extract lines that contain at least one term from the pattern file with grep:

grep -wFf pattern.txt source.txt

The command above would return all lines from source.txt since at least one term from pattern.txt is present in each line. My approaches using piped grep commands (which are shown in related questions considering only two search terms) have not worked out.
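
The difference between what `grep -wFf` tests and what is wanted here can be sketched in Python. This is only an illustration on the sample data above; the function names `has_any` and `has_both` are made up for this example, and `has_any` is a rough approximation of grep's word matching.

```python
# Sample data from the question.
patterns = {"foo", "bar", "goo"}
lines = ["foo bar", "foo baz", "goo gaa"]


def has_any(line, patterns):
    """Roughly what `grep -wFf` tests: at least one word is a pattern."""
    return any(word in patterns for word in line.split())


def has_both(line, patterns):
    """What is wanted: the line has two words and both are patterns."""
    words = line.split()
    return len(words) == 2 and all(word in patterns for word in words)


print([l for l in lines if has_any(l, patterns)])   # all three lines
print([l for l in lines if has_both(l, patterns)])  # only "foo bar"
```

Piping one `grep` into another does not help here, because each stage still only asks whether some pattern occurs on the line, not whether every column is a pattern.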

grep is not mandatory. awk, sed, perl work as well. I have a solution in Python, but it is terribly slow (¬blazinglyfast).

Thank you!

Response to Answers

My Python solution looks like this:

import sys

f_pattern = sys.argv[1]
f_source = sys.argv[2]

with open(f_pattern, "r", encoding="utf-8") as fp:
    pattern = set(fp.read().split("\n"))

with open(f_source, "r", encoding="utf-8") as fp:
    for line in fp:
        w1, w2 = line.strip("\n").split(" ")
        if w1 in pattern and w2 in pattern:
            print(line, end="")  # \n still present in line string
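
A slightly more defensive variant of the script above, as a sketch only. It assumes the columns may be separated by any run of whitespace, skips lines that do not have exactly two fields, and ignores empty lines in the pattern file (the `split("\n")` above leaves an empty string in the set when the file ends with a newline). The helper names `load_patterns`, `matching_lines` and `main` are hypothetical, and no claim is made about speed.

```python
import sys


def load_patterns(path):
    """Read one pattern per line, skipping empty lines."""
    with open(path, "r", encoding="utf-8") as fp:
        return {line.rstrip("\n") for line in fp if line.strip()}


def matching_lines(lines, patterns):
    """Yield lines whose two whitespace-separated fields are both patterns."""
    for line in lines:
        fields = line.split()
        if len(fields) == 2 and fields[0] in patterns and fields[1] in patterns:
            yield line


def main(argv):
    if len(argv) != 3:
        print("usage: matcher.py pattern.txt source.txt", file=sys.stderr)
        return
    patterns = load_patterns(argv[1])
    with open(argv[2], "r", encoding="utf-8") as fp:
        # Lines keep their trailing newline, so they are written unchanged.
        sys.stdout.writelines(matching_lines(fp, patterns))


if __name__ == "__main__":
    main(sys.argv)
```

It is called the same way as the original: `python matcher.py pattern.txt source.txt`.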

Indeed, it's not that bad (time-wise) compared to some answers.
(My) Python

time python matcher.py pattern.txt source.txt 
>> 158,12s user 1,82s system 99% cpu 2:40,08 total

awk by @Avinash Chandravansi

time awk -F' ' 'FNR==NR {arr [$0];next} $2 in arr' pattern.txt source.txt
>> 106,72s user 5,69s system 99% cpu 1:52,88 total

Not quite sure yet, but I think that gives an incorrect result.
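
One way to see why the `$2 in arr` condition can give a different result: it only tests the second column, so a line whose first column is not in the pattern list still passes. A small Python sketch of the two conditions follows; the input line `baz bar` is made up for illustration and does not occur in the sample data.

```python
patterns = {"foo", "bar", "goo"}


def second_only(line, patterns):
    """Mirrors the awk condition `$2 in arr`."""
    fields = line.split()
    return len(fields) >= 2 and fields[1] in patterns


def both_columns(line, patterns):
    """Mirrors `($1 in patterns) && ($2 in patterns)`."""
    fields = line.split()
    return len(fields) >= 2 and fields[0] in patterns and fields[1] in patterns


line = "baz bar"  # hypothetical line: only the second column is a pattern
print(second_only(line, patterns))   # True
print(both_columns(line, patterns))  # False
```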

awk by @KamilCuk

time awk 'NR==FNR{a[$0];next} {cnt=0; for (k in a) { cnt += $0~k; if (cnt >= 2){ print; break; }}}' pattern.txt source.txt
>> Unclear, more than 20 minutes. Ctrl+C

awk by @Fravadona

time awk 'FNR==NR {patterns[$0]; next}($1 in patterns) && ($2 in patterns)' pattern.txt source.txt
>> 95,45s user 2,46s system 99% cpu 1:38,03 total

^-- This seems to be the accepted answer (for me).

Answer 1

Score: 4

You're using grep -F, so I guess that the "patterns" aren't regexps. Now, if you want to match the full strings (and not substrings), then you can do:

awk '
    FNR == NR { patterns[$0]; next }
    ($1 in patterns) && ($2 in patterns)
' pattern.txt source.txt

Answer 2

Score: 1

With awk, store the patterns in an array and then check whether at least two of them match.

$ awk 'NR==FNR{a[$0];next} {cnt=0; for (k in a) { cnt += $0~k; if (cnt >= 2){ print; break; }}}' pattern.txt source.txt
foo bar
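
Note that `$0~k` treats each pattern as a regular expression and matches it anywhere in the line, and that the loop visits the patterns one by one for every input line. A rough Python sketch of how that differs from an exact per-field lookup follows; the function names and the line `foobar goose` are invented for this illustration.

```python
import re

patterns = ["foo", "bar", "goo"]


def count_regex_hits(line, patterns):
    """Count patterns that match anywhere in the line, like `$0 ~ k`."""
    return sum(1 for p in patterns if re.search(p, line))


def both_fields_exact(line, patterns):
    """Exact lookup of each field, like `($1 in a) && ($2 in a)`."""
    wanted = set(patterns)
    fields = line.split()
    return len(fields) == 2 and all(f in wanted for f in fields)


line = "foobar goose"  # hypothetical: patterns occur only as substrings
print(count_regex_hits(line, patterns) >= 2)  # True: foo, bar and goo all match
print(both_fields_exact(line, patterns))      # False: neither field is a pattern
```

Counting distinct patterns also means that a line such as `foo foo` gets a count of 1, although the question treats it as valid. The per-line loop over all patterns is a plausible reason for the long run time reported in the question.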

Answer 3

Score: 0

This might work for you (GNU sed):

sed 'H;1h;$!d;x;y/\n/|/;s#.*#/(&).*(&)/p;d#' patternFile | sed -Ef - file

Create a sed script from the patternFile and apply it to the source file.

The generated script uses the same alternation regexp twice in a single match: if the line matches, print it; otherwise delete it.
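
The idea of repeating one alternation twice can be sketched in Python with the `re` module. This is an illustration of the technique, not a translation of the sed command: `build_regex` is a hypothetical helper, and the `re.escape` call is an addition of this sketch (the sed version passes the patterns through unescaped, so they are interpreted as regexps there).

```python
import re


def build_regex(patterns):
    """Build `(p1|p2|...).*(p1|p2|...)` from a list of pattern strings."""
    # Escape each pattern so it is matched literally (assumption of this sketch).
    alternation = "|".join(re.escape(p) for p in patterns)
    return re.compile(f"({alternation}).*({alternation})")


regex = build_regex(["foo", "bar", "goo"])
lines = ["foo bar", "foo baz", "goo gaa"]
print([line for line in lines if regex.search(line)])  # ['foo bar']
```

Because the expression is not anchored to whole fields, it also matches patterns that occur as substrings; for example, `regex.search("foobar")` succeeds.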

huangapple
  • Posted on February 16, 2023 at 02:49:31
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75464230.html