2023-02-16 02:49:31

Match line that contains two strings from another file

Question

I have a file source.txt containing two columns of strings separated by a whitespace.

foo bar
foo baz
goo gaa

Also, there is another file pattern.txt which is a list of strings (1 per line) that should serve as pattern source. This could look like

foo
bar
goo

The goal is to extract only the lines in which both strings appear in the pattern file.
Repetitions are fine (e.g. foo foo would be valid).

So the desired output here would be

foo bar

I managed to extract lines that contain at least one term from the pattern file with grep:

grep -wFf pattern.txt source.txt

The command above would return all lines from source.txt since at least one term from pattern.txt is present in each line. My approaches using piped grep commands (which are shown in related questions considering only two search terms) have not worked out.
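The gap between what `grep -wFf` selects and what is wanted can be sketched in Python, the language of the existing solution. This is only an illustration on the sample data above. The helper names `has_any` and `has_all` are made up for this sketch, and splitting on whitespace only approximates grep's `-w` word matching.

```python
# Sample data from the question.
patterns = {"foo", "bar", "goo"}
source = ["foo bar", "foo baz", "goo gaa"]


def has_any(line, patterns):
    """True if at least one word on the line is a pattern (roughly what grep -wFf selects)."""
    return any(word in patterns for word in line.split())


def has_all(line, patterns):
    """True only if the line is non-empty and every word on it is a pattern."""
    words = line.split()
    return bool(words) and all(word in patterns for word in words)


at_least_one = [line for line in source if has_any(line, patterns)]
both = [line for line in source if has_all(line, patterns)]

print(at_least_one)  # ['foo bar', 'foo baz', 'goo gaa']
print(both)          # ['foo bar']
```

Every sample line contains at least one pattern, which is why the "at least one" filter returns the whole file.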

grep is not mandatory. awk, sed, perl work as well. I have a solution in Python, but it is terribly slow (¬blazinglyfast).

Thank you!

Response to Answers

My Python solution looks like this:

import sys
f_pattern = sys.argv[1]
f_source = sys.argv[2]
with open(f_pattern, "r", encoding="utf-8") as fp:
    pattern = set(fp.read().split("\n"))
with open(f_source, "r", encoding="utf-8") as fp:
    for line in fp:
        w1, w2 = line.strip("\n").split(" ")
        if w1 in pattern and w2 in pattern:
            print(line, end="")  # \n still present in line string
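A slightly more defensive variant of the script above, offered as a sketch rather than a replacement. It wraps the filter in a function (the names `filter_lines` and `main` are assumptions of this example), splits on any whitespace, and skips lines that do not have exactly two fields instead of raising `ValueError`. No claim is made about its speed relative to the original.

```python
import sys


def filter_lines(patterns, lines):
    """Yield the lines whose two whitespace-separated fields are both in patterns."""
    for line in lines:
        fields = line.split()
        # Unlike the original, silently skip lines without exactly two fields.
        if len(fields) == 2 and fields[0] in patterns and fields[1] in patterns:
            yield line


def main(pattern_path, source_path):
    with open(pattern_path, "r", encoding="utf-8") as fp:
        # splitlines() avoids the empty trailing entry that split("\n")
        # produces when the file ends with a newline.
        patterns = set(fp.read().splitlines())
    with open(source_path, "r", encoding="utf-8") as fp:
        # Lines keep their trailing newline, so they can be written as-is.
        sys.stdout.writelines(filter_lines(patterns, fp))


# Only run when called as: python matcher.py pattern.txt source.txt
if __name__ == "__main__" and len(sys.argv) == 3:
    main(sys.argv[1], sys.argv[2])
```

Keeping the filter separate from the file handling makes it easy to check on a few in-memory lines before running it on the full input.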

Indeed, it's not that bad (time-wise) compared to some answers.
(My) Python

time python matcher.py pattern.txt source.txt
>> 158,12s user 1,82s system 99% cpu 2:40,08 total

awk by @Avinash Chandravansi

time awk -F' ' 'FNR==NR {arr [$0];next} $2 in arr' pattern.txt source.txt
>> 106,72s user 5,69s system 99% cpu 1:52,88 total

Not quite sure yet, but I think that gives an incorrect result.
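One reason to doubt that result: the command tests only the second field (`$2 in arr`), so a line whose first field is not in pattern.txt would still be printed. A Python sketch of the two conditions follows. The extra line `zzz bar` is hypothetical and not part of the question's data.

```python
patterns = {"foo", "bar", "goo"}
# "zzz bar" is a hypothetical line added for illustration.
source = ["foo bar", "foo baz", "goo gaa", "zzz bar"]

# Mirrors `$2 in arr`: only the second field is checked.
second_only = [line for line in source if line.split()[1] in patterns]

# Mirrors `($1 in patterns) && ($2 in patterns)`: both fields are checked.
both_fields = [
    line for line in source
    if line.split()[0] in patterns and line.split()[1] in patterns
]

print(second_only)  # ['foo bar', 'zzz bar']
print(both_fields)  # ['foo bar']
```

On the three sample lines from the question the two conditions happen to agree, so the difference only shows up on larger inputs.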

awk by @KamilCuk

time awk 'NR==FNR{a[$0];next} {cnt=0; for (k in a) { cnt += $0~k; if (cnt >= 2){ print; break; }}}' pattern.txt source.txt
>> Unclear, more than 20 minutes. Ctrl+C

awk by @Fravadona

time awk 'FNR==NR {patterns[$0]; next}($1 in patterns) && ($2 in patterns)' pattern.txt source.txt
>> 95,45s user 2,46s system 99% cpu 1:38,03 total

^-- This seems to be the accepted answer (for me).

Answer 1

Score: 4

You're using grep -F, so I guess that the "patterns" aren't regexps. If you want to match full strings (and not substrings), you can do:

awk '
    FNR == NR { patterns[$0]; next }
    ($1 in patterns) && ($2 in patterns)
' pattern.txt source.txt

Answer 2

Score: 1

With awk, store the patterns in an array and then check whether at least two of them match.

$ awk 'NR==FNR{a[$0];next} {cnt=0; for (k in a) { cnt += $0~k; if (cnt >= 2){ print; break; }}}' pattern.txt source.txt
foo bar
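
How this check differs from an exact field lookup can be sketched in Python. The emulation assumes the patterns contain no regex metacharacters, and `food bart` and `foo foo` are made-up test lines. Because `$0~k` is an unanchored regex match tried once per pattern, substrings count and a repeated word is counted only once. Looping over every pattern for every line is also a likely reason for long run times on large inputs.

```python
import re

patterns = ["foo", "bar", "goo"]


def awk_like(line, patterns):
    """True if at least two distinct patterns match anywhere in the line, as `$0~k` does."""
    count = 0
    for pat in patterns:
        if re.search(pat, line):
            count += 1
            if count >= 2:
                return True
    return False


def exact_fields(line, patterns):
    """True if the line is non-empty and every whitespace-separated field is a pattern."""
    fields = line.split()
    return bool(fields) and all(field in patterns for field in fields)


for line in ["foo bar", "foo baz", "food bart", "foo foo"]:
    print(line, awk_like(line, patterns), exact_fields(line, patterns))
# foo bar True True
# foo baz False False
# food bart True False
# foo foo False True
```

The last two lines are where the two approaches disagree: one is a substring hit that is not a full-field match, the other is a repetition that the question explicitly allows.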

Answer 3

Score: 0

This might work for you (GNU sed):

sed 'H;1h;$!d;x;y/\n/|/;s#.*#/(&).*(&)/p;d#' patternFile | sed -Ef - file

Create a sed script from patternFile and apply it to the source file.

The generated script uses the same alternation regexp twice in a single match: a line that matches is printed, and any other line is deleted.
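
The regexp that the first sed command generates can be approximated in Python, as a sketch. It assumes, as the sed version does, that the patterns contain no regex metacharacters. The function name `build_regex` and the extra line `foo foo` are made up for this example.

```python
import re


def build_regex(patterns):
    """Build `(p1|p2|...).*(p1|p2|...)`, the shape of regexp the generated sed script uses."""
    alternation = "|".join(patterns)
    return re.compile(f"({alternation}).*({alternation})")


patterns = ["foo", "bar", "goo"]
regex = build_regex(patterns)
print(regex.pattern)  # (foo|bar|goo).*(foo|bar|goo)

# "foo foo" is a hypothetical line; the other three come from the question.
source = ["foo bar", "foo baz", "goo gaa", "foo foo"]
matched = [line for line in source if regex.search(line)]
print(matched)  # ['foo bar', 'foo foo']
```

Like the sed version, this matches substrings rather than whole fields, so a line such as `foobar x` would also be selected.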
