July 18, 2023, 16:49:49

grep a pattern on all columns but one

Question

I have a tab-delimited file with ~10,000 columns.
I want to keep the rows where certain columns contain a pattern, but I don't want grep to check the first column.

$ cut -f2,1691-1725 myfile.txt
1402138 2331    2331
1402422 181     630
1402795 1401            1401
1405425 2331
1405771 1727    1727
1406169 2331    2331
1406475 2252    2252
1408259 1744    1744

I tried this:

cut -f2,1691-1725 myfile.txt | grep -E '^140|^141|^143|^144|^145|^146|^148|^149' > myoutput.txt

but it keeps every row, because grep also applies the pattern to the first column.

Desired output:

1402795 1401            1401

I looked for a way to do this with awk, but couldn't find an easy one that doesn't require listing every column.

Thank you

Answer 1

Score: 1

From what I understand of your question and of the sample input and output, you have the following requirement:

Question: Provide a command that, given a set of column numbers, returns all rows where each of those columns is either empty or matches a given regular expression

When it comes to columns and regular expressions, awk is the natural tool:

awk 'BEGIN{a[2]; for(i=1691;i<=1725;++i) a[i]}
     BEGIN{ere="^14[01345689]"}
     BEGIN{FS=OFS="\t"}
     {f=1;for(i=1;i<=NF;++i) if((i in a) && $i!="" && $i!~ere) {f=0;break}}
     f' file

How does it work:

BEGIN{a[2]; for(i=1691;i<=1725;++i) a[i]}: Define an associative array a that is indexed by the columns of interest. In this case, those are 2 and 1691-1725. Column 2 is therefore tested as well; drop a[2] if it should only be displayed.
BEGIN{ere="^14[01345689]"}: Define the regular expression that each column should match.
BEGIN{FS=OFS="\t"}: Set the input and output field separators to the <tab> character.
{f=1;for(i=1;i<=NF;++i) if((i in a) && $i!="" && $i!~ere) {f=0;break}}:
Each time a line is read, perform the following actions:
1. f=1: Set the variable f to 1. It serves as a flag that determines whether the line is printed.
2. for(i=1;i<=NF;++i) ...: Loop over all fields.
3. ... (i in a) ...: Check whether the field is a field of interest.
4. ... $i!="" && $i!~ere ... {f=0;break}: If a field of interest is non-empty and does not match the regular expression, set the flag f to zero and stop scanning the line. Because of the break, a later empty field cannot switch the flag back on.
f: If the flag f is not zero, print the current line. (This is shorthand for (f!=0){print $0}.)
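
The same idea can be sketched with the column range and the pattern passed in as parameters, so nothing has to be listed by hand. This is only an illustration under a few assumptions: the function name filter_cols, the temporary demo file and its four-column layout are all hypothetical, and only the columns in the given range are tested (the ID column is left out).

```shell
# Hypothetical helper: print rows in which every field in columns FROM..TO
# is either empty or matches the extended regular expression ERE.
# Usage: filter_cols FROM TO ERE FILE
filter_cols() {
    awk -v from="$1" -v to="$2" -v ere="$3" '
        BEGIN { FS = OFS = "\t" }
        {
            keep = 1
            # Stop at the first non-empty field that fails the pattern.
            for (i = from; i <= to && i <= NF; ++i)
                if ($i != "" && $i !~ ere) { keep = 0; break }
        }
        keep    # a bare true condition prints the current line
    ' "$4"
}

# Small demo file (assumed layout): column 1 is an ID, columns 2-4 are tested.
sample=$(mktemp)
printf '1402138\t2331\t2331\t\n1402795\t1401\t\t1401\n1405425\t2331\t\t\n' > "$sample"

filter_cols 2 4 '^14[01345689]' "$sample"
```

On the real file the call would presumably look like filter_cols 1691 1725 '^14[01345689]' myfile.txt, which prints the full matching rows; pipe the result through cut -f2,1691-1725 to see only the columns shown in the question.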

Answer 2

Score: 1

If your data always starts at the exact same character column, you could use ^.{n} to skip n characters, then start matching.

$ cat myfile.txt
1402138 2331    2331
1402422 181     630
1402795 1401            1401
1405425 2331
1405771 1727    1727
1406169 2331    2331
1406475 2252    2252
1408259 1744    1744
$ cat myfile.txt | grep -E '^.{24}(140|141|143|144|145|146|148|149)'
1402795 1401            1401

The pattern above first matches any 24 characters from the start of the line, then tries to match (140|141|143|144|145|146|148|149) starting at character position 25.

In this scenario, (140|141|143|144|145|146|148|149) could also be simplified to 14[01345689].
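
A fixed character offset only works when the columns line up. In a genuinely tab-delimited file each tab counts as a single character, so the offset varies from row to row. A possible tab-aware variant is sketched below; the function name match_after_first is hypothetical, and note the different semantics: it keeps a row as soon as any field after the first begins with the pattern, and it does not require every field to match.

```shell
# A literal tab character, since \t is not portable inside grep -E patterns.
tab=$(printf '\t')

# Hypothetical helper: skip field 1 entirely, then require that some later
# field starts with 14[01345689]. Reads from standard input.
match_after_first() {
    grep -E "^[^${tab}]*${tab}(.*${tab})?14[01345689]"
}

printf '1402138\t2331\t2331\n1402795\t1401\t\t1401\n1402422\t181\t630\n' |
    match_after_first
```

Because ^[^TAB]* must be followed by a tab, it always consumes exactly the first field, so the ID in column 1 can never satisfy the pattern on its own.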

Answer 3

Score: 0

For filtering on the values of a certain column, you are probably better off using awk (grep is better suited to filtering on the entire line), something like this:

prompt> cat file.txt
Name Value1 Value2
a 1 2
b 2 5
c 5 1
prompt> awk -F " " '{if ($2==1) print $0}' file.txt
a 1 2
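
The same single-column test can be written as a bare awk condition, and it can use a regular expression instead of an equality check. The sketch below is only an illustration: the function name col2_starts_with_14x and the sample values are made up, and it assumes whitespace-separated input with one header line.

```shell
# Hypothetical helper: skip the header (NR > 1) and print rows whose second
# field starts with 14[01345689]. A condition with no action prints the line.
col2_starts_with_14x() {
    awk 'NR > 1 && $2 ~ /^14[01345689]/'
}

printf 'Name Value1 Value2\na 1401 2\nb 2331 5\nc 1470 1\n' |
    col2_starts_with_14x
```

Row c is dropped because 147 is not one of the prefixes allowed by the character class.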
