Applying a grep pattern to all columns but one

Question

I have a tab-delimited file with ~10,000 columns.
I want to keep the rows in which certain columns match a pattern, but I don't want grep to check the first column.

    $ cut -f2,1691-1725 myfile.txt
    1402138 2331 2331
    1402422 181 630
    1402795 1401 1401
    1405425 2331
    1405771 1727 1727
    1406169 2331 2331
    1406475 2252 2252
    1408259 1744 1744

I tried this:

    cut -f2,1691-1725 myfile.txt | grep -E '^140|^141|^143|^144|^145|^146|^148|^149' > myoutput.txt

But it keeps every row, because grep also matches against the first column.

Desired output:

    1402795 1401 1401

I looked for a way to do this with awk but couldn't find an easy one that doesn't require listing every column.

Thank you

Answer 1

Score: 1

From what I understand of your question, and from the input and output you provided, you have the following requirement:

Question: Provide a command that, given a set of column numbers, returns all rows in which each of those columns is either empty or matches a given regular expression.

When it comes to columns and regular expressions, awk is the obvious choice:

    awk 'BEGIN{a[2]; for(i=1691;i<=1725;++i) a[i]}
         BEGIN{ere="^14[01345689]"}
         BEGIN{FS=OFS="\t"}
         {f=1; for(i=1;i<=NF;++i) if((i in a) && $i!="" && $i !~ ere) {f=0; break}}
         f' file

How it works:

  1. BEGIN{a[2]; for(i=1691;i<=1725;++i) a[i]}: Define an associative array a indexed by the columns of interest. In this case, those are 2 and 1691-1725.
  2. BEGIN{ere="^14[01345689]"}: Define the regular expression that each column should match.
  3. BEGIN{FS=OFS="\t"}: Set the input and output field separators to the <tab> character.
  4. f=1; for(i=1;i<=NF;++i) if((i in a) && $i!="" && $i !~ ere) {f=0; break}:
    Each time a line is read, perform the following actions:

    1. f=1: Set the variable f to 1. It serves as a flag that determines whether the line is printed.
    2. for(i=1;i<=NF;++i) ...: Loop over all fields.
    3. ... (i in a) ...: Check whether the field is a field of interest.
    4. ... && $i!="" && $i !~ ere) {f=0; break}: If a field of interest is non-empty and does not match the regular expression, set the flag f to zero and stop checking. Empty fields never clear the flag, and stopping at the first failure ensures that a later field cannot overwrite the verdict.
  5. f: If the flag f is nonzero, print the current line. (This is shorthand for (f!=0){print $0}.)
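Since the question asks for a way that avoids listing each column, the same check can also be applied to every field except the first by starting the loop at field 2. The following is only a sketch under assumptions: the temporary file `sample` and its five rows are made up from the question's listing, and the variable name `keep` is illustrative. When the input comes from `cut -f2,1691-1725`, field 1 of the piped output is the original column 2.

```shell
# Hypothetical sample: tab-separated, first column is an ID that must not be tested.
sample=$(mktemp)
printf '%s\t%s\t%s\n' \
  1402138 2331 2331 \
  1402422 181  630 \
  1402795 1401 1401 \
  1405425 2331 '' \
  1405771 1727 1727 > "$sample"

# Loop from field 2 so the first column is never compared with the regex.
# A row is kept when every remaining field is either empty or matches ere.
result=$(awk -F'\t' -v ere='^14[01345689]' '
  {
    keep = 1
    for (i = 2; i <= NF; ++i)
      if ($i != "" && $i !~ ere) { keep = 0; break }
  }
  keep
' "$sample")

printf '%s\n' "$result"
rm -f "$sample"
```

On this sample only the row starting with 1402795 is kept. Note that a row whose remaining fields are all empty would also pass this test; add an extra condition if such rows should be dropped.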

Answer 2

Score: 1

If your data always starts at the exact same character column, you could use ^.{n} to skip n characters, then start matching.

    $ cat myfile.txt
    1402138 2331 2331
    1402422 181 630
    1402795 1401 1401
    1405425 2331
    1405771 1727 1727
    1406169 2331 2331
    1406475 2252 2252
    1408259 1744 1744
    $ grep -E '^.{24}(140|141|143|144|145|146|148|149)' myfile.txt
    1402795 1401 1401

The pattern above first matches any 24 characters from the start of the line, then tries to match (140|141|143|144|145|146|148|149) starting at character position 25.

In this scenario, (140|141|143|144|145|146|148|149) could also be simplified to 14[01345689].
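If the columns are not aligned at a fixed character position, a similar trick is to anchor on the field separator instead of a character count: the first column is the only one not preceded by a tab, so a pattern that starts with a tab can never match it. This is a sketch under assumptions: the input is strictly tab-separated, the temporary file `sample` and its rows are made up from the question's listing, and a row is kept as soon as any column other than the first starts with the pattern.

```shell
# Hypothetical tab-separated sample; the first column is the ID to ignore.
sample=$(mktemp)
printf '%s\t%s\t%s\n' \
  1402138 2331 2331 \
  1402422 181  630 \
  1402795 1401 1401 \
  1405425 2331 '' \
  1405771 1727 1727 > "$sample"

# Store a literal tab in a variable; this works in any POSIX shell.
tab=$(printf '\t')

# A match must begin right after a tab, so the first column is never tested.
result=$(grep -E "${tab}14[01345689]" "$sample")

printf '%s\n' "$result"
rm -f "$sample"
```

Unlike the fixed-offset pattern, this does not depend on how wide the first column is, but it only works when the separator is a single known character.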

Answer 3

Score: 0

For filtering on the values of a particular column, you are probably better off using awk (grep is better suited to filtering on the entire line), something like this:

    prompt> cat file.txt
    Name Value1 Value2
    a 1 2
    b 2 5
    c 5 1
    prompt> awk -F " " '{if ($2==1) print $0}' file.txt
    a 1 2
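The same column-scoped test also works with a regular expression in place of the equality check. A minimal sketch, assuming the sample file shown above; the temporary file `file` and the pattern `^[12]$` are illustrative choices, not part of the original answer.

```shell
# Recreate the answer's sample file (space-separated, with a header line).
file=$(mktemp)
printf '%s %s %s\n' \
  Name Value1 Value2 \
  a 1 2 \
  b 2 5 \
  c 5 1 > "$file"

# NR > 1 skips the header; the regex is tested against column 2 only.
result=$(awk 'NR > 1 && $2 ~ /^[12]$/' "$file")

printf '%s\n' "$result"
rm -f "$file"
```

Here rows a and b are printed, because only their second column is 1 or 2; the 1 in the third column of row c is ignored.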

huangapple
  • Posted on July 18, 2023, 16:49:49
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76711009.html