Loop through columns, then transform cells, then show unique values in awk

Question

I'd like to transform some data using awk but need some help, please. For every column whose name starts with "sam" (the number of such columns is not known in advance), I want to keep everything before the first colon.

Input:

col1	col2	col3	col4	sam1	sam2	sam3
a	b	c	d	0/1:12	1/0:9	0/1:16
e	f	g	h	0/0:7	1/1:98	0/0:8

Desired output:

col1	col2	col3	col4	sam1	sam2	sam3
a	b	c	d	0/1	1/0	0/1
e	f	g	h	0/0	1/1	0/0

This is the best I've got so far... but it doesn't work.

awk -F"\t" '{ for(i=5; i<=NF; --i); split($i,a,":"); print a[1]}}' input > output

I know how to cut a column i.e. cut -d ':' -f2 but as far as I understand you can't combine cut with awk in a loop!

Then, I want to find all the unique values for columns starting with sam in the output file e.g.

0/1
1/0
0/0
1/1

I'm afraid I'm totally lost on an awk solution for that. I can do it in R but an awk solution would be preferred and much faster.

R solution:

output %>% pivot_longer(-c(col1:col4)) -> df_long
df_long %<>% select(value)
unique(df_long)

Answer 1

Score: 1

    $ awk 'NR==1 {for(i=1;i<=NF;i++) if($i~/^sam/) cols[i]} 
                 {for(i=1;i<=NF;i++) 
                    if(i in cols) 
                      {split($i,t,":"); $i=t[1]}}1' file | 
      column -t

    col1  col2  col3  col4  sam1  sam2  sam3
    a     b     c     d     0/1   1/0   0/1
    e     f     g     h     0/0   1/1   0/0

Or, if you are only interested in the unique values:

    $ awk 'NR==1 {for(i=1;i<=NF;i++) if($i~/^sam/) cols[i]; next} 
                 {for(i in cols) {split($i,t,":"); if(!vals[t[1]]++) print t[1]}}' file

    0/1
    1/0
    0/0
    1/1
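
One thing to watch with the first command: it splits on any whitespace and rejoins each modified row with single spaces, which is why column -t is needed to realign the output. If the result should stay tab-separated like the input, a minimal sketch could set both FS and OFS to a tab. The file name sample.tsv and the variable name trimmed are made up for this demo; the data is the two-row sample from the question.

```shell
# Rebuild the sample from the question as a tab-separated file (hypothetical name).
printf 'col1\tcol2\tcol3\tcol4\tsam1\tsam2\tsam3\n' > sample.tsv
printf 'a\tb\tc\td\t0/1:12\t1/0:9\t0/1:16\n' >> sample.tsv
printf 'e\tf\tg\th\t0/0:7\t1/1:98\t0/0:8\n' >> sample.tsv

# Header row: remember which column numbers start with "sam".
# Data rows: cut each of those cells at the first colon, then print with tabs.
trimmed=$(awk 'BEGIN { FS = OFS = "\t" }
  NR == 1 { for (i = 1; i <= NF; i++) if ($i ~ /^sam/) cols[i]; print; next }
  { for (i in cols) sub(/:.*/, "", $i); print }' sample.tsv)

printf '%s\n' "$trimmed"
```

Because the separators are preserved, the result can be redirected straight to a file and read back as a TSV without column -t.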

Answer 2

Score: 1

awk '
    BEGIN { FS = OFS = "\t" }
    NR==1 { while(i++<NF) if ($i ~ /^sam/) p[i] }
    NR>1 { for (i in p) { sub(/:.*$/,"",$i); u[$i] } }
    { print > "output" }
    END { for (i in u) print i > "unique" }
' input
  • Use the first row to record which columns are of interest (those whose name starts with "sam").
  • On subsequent rows, strip everything from the first colon onward in those columns and store each trimmed value as a key of a hash.
  • Print every line to the file called "output".
  • At the end, print the keys of the hash to a file called "unique".

awk's arrays are hashes, so storing items as array keys leaves exactly one entry per distinct item.
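
To make the "keys as a set" idea concrete, here is a small sketch run on a made-up list of values (values.txt, by_key and first_seen are hypothetical names). It contrasts the key-only approach, whose iteration order is unspecified, with the common counter idiom that prints each value the first time it is seen.

```shell
# Hypothetical input: the trimmed values from the question, with repeats.
printf '0/1\n1/0\n0/1\n0/0\n1/1\n0/0\n' > values.txt

# Keys only: referencing seen[$0] creates the key once; repeats change nothing.
# The order of "for (k in seen)" is unspecified, so sort makes the listing stable.
by_key=$(awk '{ seen[$0] } END { for (k in seen) print k }' values.txt | sort)

# Counter idiom: the pattern is true only the first time a value appears,
# so values come out in first-seen order without a separate sort.
first_seen=$(awk '!seen[$0]++' values.txt)

printf '%s\n\n%s\n' "$by_key" "$first_seen"
```

The first form suits an END block that writes a summary file, as in the answer above; the second is handy when the original order of appearance matters.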

Answer 3

Score: 1

You can do it all in one pass fairly easily. For example, use gsub() to remove the :XX suffix from each field, then collect the unique sam values as keys of a simple array:

awk -F"\t" '{gsub(/:[^[:space:]]+/,"")} FNR>1 {for (i=5; i<=NF; i++) a[$i]++} END {for (i in a) print i}1' file

Example Use/Output

With your content in file you would have:

$ awk -F"\t" '{gsub(/:[^[:space:]]+/,"")} FNR>1 {for (i=5; i<=NF; i++) a[$i]++} END {for (i in a) print i}1' file
col1    col2    col3    col4    sam1    sam2    sam3
a       b       c       d       0/1     1/0     0/1
e       f       g       h       0/0     1/1     0/0
1/0
1/1
0/0
0/1

In Awk Script Form

You can put the contents in a simple script file and make it executable with chmod +x and then just provide the filename to read as an argument. For example, create sam.awk as follows:

#!/bin/awk -f

BEGIN { FS = "\t" }
{
  gsub(/:[^[:space:]]+/,"")
  print
}
FNR>1 {
  for (i=5; i<=NF; i++)
    a[$i]++
}
END {
  for (i in a)
    print i
}

Now simply chmod +x sam.awk and execute ./sam.awk file to produce:

$ ./sam.awk file
col1    col2    col3    col4    sam1    sam2    sam3
a       b       c       d       0/1     1/0     0/1
e       f       g       h       0/0     1/1     0/0
1/0
1/1
0/0
0/1

Either way, as a one-liner or as a script is fine -- up to you entirely.
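
A caveat worth checking against your own data: gsub() with no target works on the whole record, so a colon in one of the first four columns would be trimmed too. The sketch below shows the difference on a made-up row whose second column contains a colon (colon.tsv, whole and fields are hypothetical names), and assumes, as the answer does, that the sam columns start at field 5.

```shell
# Hypothetical data: col2 holds "x:y", which is not part of the question's sample.
printf 'col1\tcol2\tcol3\tcol4\tsam1\tsam2\n' > colon.tsv
printf 'a\tx:y\tc\td\t0/1:12\t1/0:9\n' >> colon.tsv

# Record-wide gsub, as in the one-liner: col2 loses its ":y" as well.
whole=$(awk -F'\t' '{ gsub(/:[^[:space:]]+/, "") } 1' colon.tsv)

# Field-wise sub from column 5 onward: col2 is left untouched.
fields=$(awk 'BEGIN { FS = OFS = "\t" }
  { for (i = 5; i <= NF; i++) sub(/:.*/, "", $i) } 1' colon.tsv)

printf '%s\n---\n%s\n' "$whole" "$fields"
```

If none of the leading columns can ever contain a colon, the record-wide version is shorter and does the same job.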

Answer 4

Score: 0

"find all the unique values for columns starting with sam"

$ grep -o "[0-9]/[0-9]" inputfile | sort -u
0/0
0/1
1/0
1/1
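
Note that the grep pattern matches a digit/digit pair anywhere on a line, not only in the sam columns. Since the question mentions cut, here is a sketch of a coreutils-only pipeline that restricts itself to those columns. It assumes the sam columns are columns 5 onward, as in the sample; geno.tsv and genotypes.txt are made-up file names.

```shell
# Rebuild the sample from the question as a tab-separated file (hypothetical name).
printf 'col1\tcol2\tcol3\tcol4\tsam1\tsam2\tsam3\n' > geno.tsv
printf 'a\tb\tc\td\t0/1:12\t1/0:9\t0/1:16\n' >> geno.tsv
printf 'e\tf\tg\th\t0/0:7\t1/1:98\t0/0:8\n' >> geno.tsv

tail -n +2 geno.tsv |  # drop the header row
  cut -f5- |           # keep columns 5 onward; cut splits on tabs by default
  tr '\t' '\n' |       # put one cell on each line
  cut -d: -f1 |        # keep the part before the first colon
  sort -u > genotypes.txt

cat genotypes.txt
```

Unlike the awk answers, this does not look at the header names, so it only fits files where the position of the first sam column is known.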

huangapple
  • Published on March 4, 2023, 08:58:11
  • When reposting, please keep the link to this post: https://go.coder-hub.com/75633023.html