Loop through columns, then transform cells, then show unique values in awk
Question
I'd like to transform some data using awk but need some help pls. I want to extract for columns starting with "sam" (where the column number is undefined) everything before the first colon.
Input:
col1	col2	col3	col4	sam1	sam2	sam3
a	b	c	d	0/1:12	1/0:9	0/1:16
e	f	g	h	0/0:7	1/1:98	0/0:8
Desired output:
col1	col2	col3	col4	sam1	sam2	sam3
a	b	c	d	0/1	1/0	0/1
e	f	g	h	0/0	1/1	0/0
This is the best I've got so far... but it doesn't work.
awk -F"\t" '{ for(i=5; i<=NF; --i); split($i,a,":"); print a[1]}}' input > output
I know how to cut a column i.e. cut -d ':' -f2 but as far as I understand you can't combine cut with awk in a loop!
Then, I want to find all the unique values for columns starting with sam in the output file e.g.
0/1
1/0
0/0
1/1
I'm afraid I'm totally lost on an awk solution for that. I can do it in R but an awk solution would be preferred and much faster.
R solution:
output %>% pivot_longer(-c(col1:col4)) -> df_long
df_long %<>% select(value)
unique(df_long)
Answer 1
Score: 1
    $ awk 'NR==1 {for(i=1;i<=NF;i++) if($i~/^sam/) cols[i]} 
                 {for(i=1;i<=NF;i++) 
                    if(i in cols) 
                      {split($i,t,":"); $i=t[1]}}1' file | 
      column -t
    col1  col2  col3  col4  sam1  sam2  sam3
    a     b     c     d     0/1   1/0   0/1
    e     f     g     h     0/0   1/1   0/0
Or, if you are only interested in the unique values:
    $ awk 'NR==1 {for(i=1;i<=NF;i++) if($i~/^sam/) cols[i]; next} 
                 {for(i in cols) {split($i,t,":"); if(!vals[t[1]]++) print t[1]}}' file
    0/1
    1/0
    0/0
    1/1
Answer 2
Score: 1
awk '
    BEGIN { FS = OFS = "\t" }
    NR==1 { while(i++<NF) if ($i ~ /^sam/) p[i] }
    NR>1 { for (i in p) { sub(/:.*$/,"",$i); u[$i] } }
    { print > "output" }
    END { for (i in u) print i > "unique" }
' input
- use the first row to populate a list with the columns of interest
- on subsequent rows, process the relevant columns and copy the amended values to a hash
- print each line to the file called "output"
- at the end, print the keys of the hash to a file called "unique"

awk's arrays are hashes, so storing items as keys of an array gives the unique items.
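To illustrate the last point in isolation, here is a minimal sketch of deduplication through array keys. The function name unique_lines is made up for this example, and the sort is only there because the order of a for-in loop over an awk array is unspecified.

```shell
# Hypothetical helper (not part of the answer above): print each
# distinct input line exactly once.
unique_lines() {
    # Referencing seen[$0] creates the key; repeated lines reuse the same key.
    # for-in order is unspecified in awk, so sort to get stable output.
    awk '{ seen[$0] } END { for (k in seen) print k }' | LC_ALL=C sort
}

# Four input lines, three distinct values.
printf '0/1\n1/0\n0/1\n0/0\n' | unique_lines
```

With the four sample lines above, only the three distinct values are printed, in sorted order.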
Answer 3
Score: 1
You can do it all in one go fairly easily. For example, use gsub() to remove the :XX from each field and then a simple array to collect the unique sam fields (this assumes the sam columns start at field 5, as in your sample):
awk -F"\t" '{gsub(/:[^[:space:]]+/,"")} FNR>1 {for (i=5; i<=NF; i++) a[$i]++} END {for (i in a) print i}1' file
Example Use/Output
With your content in file you would have:
$ awk -F"\t" '{gsub(/:[^[:space:]]+/,"")} FNR>1 {for (i=5; i<=NF; i++) a[$i]++} END {for (i in a) print i}1' file
col1    col2    col3    col4    sam1    sam2    sam3
a       b       c       d       0/1     1/0     0/1
e       f       g       h       0/0     1/1     0/0
1/0
1/1
0/0
0/1
In Awk Script Form
You can put the contents in a simple script file and make it executable with chmod +x and then just provide the filename to read as an argument. For example, create sam.awk as follows:
#!/bin/awk -f
BEGIN { FS = "\t" }
{
  gsub(/:[^[:space:]]+/,"")
  print
}
FNR>1 {
  for (i=5; i<=NF; i++)
    a[$i]++
}
END {
  for (i in a)
    print i
}
Now simply chmod +x sam.awk and execute ./sam.awk file to produce:
$ ./sam.awk file
col1    col2    col3    col4    sam1    sam2    sam3
a       b       c       d       0/1     1/0     0/1
e       f       g       h       0/0     1/1     0/0
1/0
1/1
0/0
0/1
Either way works, as a one-liner or as a script; it is entirely up to you.
Answer 4
Score: 0
"find all the unique values for columns starting with sam"
$ grep -o "[0-9]/[0-9]" inputfile|sort -u
0/0
0/1
1/0
1/1
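Note that the grep above matches the pattern anywhere on a line, not only in the sam columns. If that matters for your data, a rough alternative using cut (which the question mentions) might look like the sketch below. It assumes tab-separated input with a single header row and the sam columns occupying fields 5 onward, as in the sample; sam_unique is a made-up helper name.

```shell
# Hypothetical helper; assumes tab-separated input with one header row
# and the sam columns in fields 5 onward (true for the sample data only).
sam_unique() {
    tail -n +2 "$1" |      # drop the header row
        cut -f5- |         # keep fields 5..NF (cut splits on tabs by default)
        tr '\t' '\n' |     # one value per line
        cut -d: -f1 |      # keep everything before the first colon
        LC_ALL=C sort -u   # unique values, sorted
}

# Sample data from the question, written to a temporary file.
tmp=$(mktemp)
printf 'col1\tcol2\tcol3\tcol4\tsam1\tsam2\tsam3\n' > "$tmp"
printf 'a\tb\tc\td\t0/1:12\t1/0:9\t0/1:16\n' >> "$tmp"
printf 'e\tf\tg\th\t0/0:7\t1/1:98\t0/0:8\n' >> "$tmp"
sam_unique "$tmp"
rm -f "$tmp"
```

On the sample data this should give the same four values as the grep pipeline, but it ignores anything in col1 to col4 that happens to look like a genotype.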