grep for a pattern in all columns but one
Question
I have a tab-delimited file with ~10,000 columns.
I want to keep the rows in which certain columns match a pattern, but I don't want grep to check the first column.
$ cut -f2,1691-1725 myfile.txt
1402138 2331 2331
1402422 181 630
1402795 1401 1401
1405425 2331
1405771 1727 1727
1406169 2331 2331
1406475 2252 2252
1408259 1744 1744
I tried this:
cut -f2,1691-1725 myfile.txt | grep -E '^140|^141|^143|^144|^145|^146|^148|^149' > myoutput.txt
but it keeps every row, because grep also matches against the first column.
Desired output:
1402795 1401 1401
I looked for a way to do this with awk, but couldn't find an easy one that doesn't require listing every column.
Thank you
Answer 1
Score: 1
From what I understand of your question, and from the input and output you provided, you have the following requirement:
Question: Provide a command that, given a set of column numbers, returns all rows in which each of those columns is either empty or matches a given regular expression.
When it comes to columns and regular expressions, awk is the natural choice:
awk 'BEGIN{a[2]; for(i=1691;i<=1725;++i) a[i]}
BEGIN{ere="^14[01345689]"}
BEGIN{FS=OFS="\t"}
{f=1; for(i=1;i<=NF;++i) if((i in a) && $i!="" && $i!~ere) {f=0; break}}
f' file
How it works:
BEGIN{a[2]; for(i=1691;i<=1725;++i) a[i]}
: Define an associative array a that is indexed by the columns of interest. In this case, those are 2 and 1691-1725.
BEGIN{ere="^14[01345689]"}
: Define the regular expression that each column should match.
BEGIN{FS=OFS="\t"}
: Set the input and output field separators to the tab character.
{f=1; for(i=1;i<=NF;++i) if((i in a) && $i!="" && $i!~ere) {f=0; break}}
: Each time a line is read, perform the following actions:
f=1
: Set the variable f to 1. It serves as a flag that determines whether the line is printed.
for(i=1;i<=NF;++i) ...
: Loop over all fields.
... if((i in a) && $i!="" && $i!~ere) ...
: Check whether the field is a field of interest, is not empty, and does not match the regular expression.
... {f=0; break}
: If so, clear the flag and stop looking at this line. Clearing the flag once, rather than reassigning it for every field, keeps a later empty field from switching it back on.
f
: If the flag f is nonzero, print the current line. (This is shorthand for (f!=0){print $0}.)
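If the column selection has already been done by cut, as in the question, the explicit column list is not needed: the loop can simply start at field 2. The following is a minimal sketch under that assumption. The sample rows are a hypothetical tab-separated reconstruction of the data shown in the question, and the rule applied (at least one non-empty field after the first, and every non-empty field after the first matches) is one reading of the requirement, not necessarily the asker's.

```shell
#!/bin/sh
# Hypothetical sample: tab-separated, first field is the ID that must not be tested.
sample=$(mktemp)
printf '1402138\t2331\t2331\n1402422\t181\t630\n1402795\t1401\t1401\n1405425\t2331\t\n' > "$sample"

# seen: at least one non-empty field after the first was found.
# ok:   no non-empty field after the first failed the regular expression.
result=$(awk -F'\t' -v ere='^14[01345689]' '
{
    seen = 0; ok = 1
    for (i = 2; i <= NF; ++i) {      # start at 2 to skip the first field
        if ($i == "") continue       # empty fields are ignored
        seen = 1
        if ($i !~ ere) { ok = 0; break }
    }
}
seen && ok' "$sample")

printf '%s\n' "$result"
rm -f "$sample"
```

In a real pipeline the temporary file would be replaced by the output of cut, for example cut -f2,1691-1725 myfile.txt piped into the same awk program.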
Answer 2
Score: 1
If your data always starts at the exact same character column, you could use ^.{n} to skip n characters, then start matching.
$ cat myfile.txt
1402138 2331 2331
1402422 181 630
1402795 1401 1401
1405425 2331
1405771 1727 1727
1406169 2331 2331
1406475 2252 2252
1408259 1744 1744
$ cat myfile.txt | grep -E '^.{24}(140|141|143|144|145|146|148|149)'
1402795 1401 1401
The pattern above first matches any 24 characters from the start of the line, then tries to match (140|141|143|144|145|146|148|149) starting at character position 25.
In this scenario, (140|141|143|144|145|146|148|149) could also be simplified to 14[01345689].
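When the fields are separated by tabs rather than aligned at a fixed character position, the same idea can be anchored on the delimiter instead of a character count: every field except the first is preceded by a tab. The sketch below assumes tab-separated input and uses hypothetical sample rows. Note that it keeps a row as soon as any field after the first matches, which may be looser than what the question needs.

```shell
#!/bin/sh
# Hypothetical tab-separated sample; the first field starts with 140 on every row.
sample=$(mktemp)
printf '1402138\t2331\t2331\n1402795\t1401\t1401\n1405771\t1727\t1727\n' > "$sample"

# A literal tab character, built portably with printf.
tab=$(printf '\t')

# Requiring a tab immediately before the match excludes the first field.
result=$(grep -E "${tab}14[01345689]" "$sample")

printf '%s\n' "$result"
rm -f "$sample"
```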
Answer 3
Score: 0
To filter on the value of a particular column, awk is a better fit (grep is better suited to filtering on the entire line), something like this:
prompt> cat file.txt
Name Value1 Value2
a 1 2
b 2 5
c 5 1
prompt> awk -F " " '{if ($2==1) print $0}' file.txt
a 1 2
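The same per-column test also works with a regular expression in place of a numeric comparison, which is closer to the pattern in the question. A minimal sketch, assuming tab-separated input; the sample values are hypothetical.

```shell
#!/bin/sh
# Hypothetical tab-separated sample with a header row.
sample=$(mktemp)
printf 'Name\tValue1\tValue2\na\t1401\t2\nb\t2331\t5\nc\t1490\t1\n' > "$sample"

# Print rows whose second field starts with 140, 141, 143, 144, 145, 146, 148 or 149.
result=$(awk -F'\t' '$2 ~ /^14[01345689]/' "$sample")

printf '%s\n' "$result"
rm -f "$sample"
```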