grep pattern on multiple columns but one

Question

I have a tab-delimited file with ~10,000 columns.
I want to keep the rows where some columns contain a pattern, but I don't want grep to check the first column.

$ cut -f2,1691-1725 myfile.txt
1402138 2331    2331
1402422 181     630
1402795 1401            1401
1405425 2331
1405771 1727    1727
1406169 2331    2331
1406475 2252    2252
1408259 1744    1744

I tried this:

cut -f2,1691-1725 myfile.txt | grep -E '^140|^141|^143|^144|^145|^146|^148|^149' > myoutput.txt

but it keeps every row, because grep also applies to the first column.

Desired output:

1402795 1401            1401

I looked for a way to do this with awk, but can't find an easy one that doesn't require listing every column.

Thank you

Answer 1

Score: 1

From what I understand from your question, and from what I can see from your provided input and output, you have the following requirement:

Question: Provide a command that, given a set of column numbers, returns all rows where each of those columns is either empty or matches a given regular expression

When talking about columns and regular expressions, the best answer is clearly awk:

awk 'BEGIN{a[2]; for(i=1691;i<=1725;++i) a[i]}
     BEGIN{ere="^14[01345689]"}
     BEGIN{FS=OFS="\t"}
     {f=1; for(i=1;i<=NF;++i) if((i in a) && $i != "" && $i !~ ere) {f=0; break}}
     f' file

How does it work:

  1. BEGIN{a[2]; for(i=1691;i<=1725;++i) a[i]}: Define an associative array a that is indexed by the columns of interest. In this case, those are 2 and 1691-1725.
  2. BEGIN{ere="^14[01345689]"}: Define the regular expression that each column should match.
  3. BEGIN{FS=OFS="\t"}: Define the input and output field separator to be the <tab> character.
  4. {f=1; for(i=1;i<=NF;++i) if((i in a) && $i != "" && $i !~ ere) {f=0; break}}:
    Each time a line is read, perform the following actions:

    1. f=1: Set the variable f to 1. This is used as a flag to determine whether the line should be printed.
    2. for(i=1;i<=NF;++i) ...: Loop over all fields.
    3. ... (i in a) ...: Check whether the field is a field of interest.
    4. ... $i != "" && $i !~ ere ...: If the field is a field of interest, is not empty, and does not match the regular expression, set the flag f to zero and stop scanning the line with break. Empty fields and matching fields leave f at 1.
  5. f: If the flag f is not zero, print the current line. (This is shorthand for (f!=0){print $0}.)
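
The steps above can be tried on a small sample. The following is only a sketch under stated assumptions: the helper name filter_rows, the variable names, and the checked columns 2-4 (standing in for 2,1691-1725) are all hypothetical, and the input is built with printf instead of being read from myfile.txt.

```shell
# Hypothetical helper: keep a row only if every checked column is
# either empty or matches the regular expression.
# Columns 2-4 are an assumption standing in for 2,1691-1725.
filter_rows() {
  awk 'BEGIN { FS = OFS = "\t"; ere = "^14[01345689]"
               for (i = 2; i <= 4; ++i) cols[i] }
       {
         keep = 1
         for (i = 1; i <= NF; ++i)
           if ((i in cols) && $i != "" && $i !~ ere) { keep = 0; break }
       }
       keep'
}

# Made-up tab-separated rows: a label followed by three checked columns.
printf '%s\t%s\t%s\t%s\n' \
  r1 1402138 2331 '' \
  r2 1402795 1401 '' \
  r3 1405425 ''   1490 | filter_rows
```

Row r1 is dropped because 2331 is non-empty and does not match, even though an empty column follows it. Rows r2 and r3 are kept.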

Answer 2

Score: 1

If your data always starts at the exact same character column, you could use ^.{n} to skip n characters, then start matching.

$ cat myfile.txt
1402138 2331    2331
1402422 181     630
1402795 1401            1401
1405425 2331
1405771 1727    1727
1406169 2331    2331
1406475 2252    2252
1408259 1744    1744

$ cat myfile.txt | grep -E '^.{24}(140|141|143|144|145|146|148|149)'
1402795 1401            1401

The pattern above first matches any 24 characters (except newline) from the start of the line, then tries to match (140|141|143|144|145|146|148|149) starting at character position 25.

In this scenario (140|141|143|144|145|146|148|149) could be simplified to 14[01345689] as well.
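
As a rough illustration of the simplified pattern, here is a sketch on a made-up fixed-width sample. The helper make_sample and its layout (columns starting at characters 1, 9 and 25, padded with spaces) are assumptions for the demo, not the real file.

```shell
# Hypothetical fixed-width sample: printf pads the first value to 8
# characters and the second to 16, so the third starts at character 25.
make_sample() {
  printf '%-8s%-16s%s\n' \
    1402138 2331 '' \
    1402795 1401 1401 \
    1405771 1727 1727
}

# Skip exactly 24 characters, then require "14" followed by one of
# 0, 1, 3, 4, 5, 6, 8, 9.
make_sample | grep -E '^.{24}14[01345689]'
```

Only the 1402795 row is printed. Note that if the file contains real tab characters, each tab counts as a single character, so the offset to skip would no longer correspond to the visual alignment.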

Answer 3

Score: 0

For filtering on the values of a certain column, you are probably better off using awk (grep is better suited to filtering on the entire line), something like this:

prompt> cat file.txt
Name Value1 Value2
a 1 2
b 2 5
c 5 1

prompt> awk -F " " '{if ($2==1) print $0}' file.txt
a 1 2
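
The per-column test above could be extended to the question's case by looping from field 2 onward, so no column has to be listed and the first column is never checked. This is a sketch only: the helper name match_after_first and the sample rows are hypothetical, and it assumes "keep the row if any later field matches" is the intended rule.

```shell
# Hypothetical helper: print a row if any field after the first
# starts with one of the wanted prefixes.
match_after_first() {
  awk -F '\t' '{
    for (i = 2; i <= NF; ++i)
      if ($i ~ /^14[01345689]/) { print; next }
  }'
}

# Made-up tab-separated rows; the first field always starts with 140
# but is skipped by the loop.
printf '1402138\t2331\t2331\n1402795\t1401\t\t1401\n1405425\t2331\n' |
  match_after_first
```

On real data this might be used as cut -f2,1691-1725 myfile.txt | match_after_first, assuming the first column of the cut output is the one to ignore.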

huangapple
  • Posted on July 18, 2023 at 16:49:49
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76711009.html