Loop through columns, then transform cells, then show unique values in awk
Question
I'd like to transform some data using awk but need some help, please. For columns whose names start with "sam" (the column positions are not fixed in advance), I want to extract everything before the first colon.
Input:
col1 col2 col3 col4 sam1 sam2 sam3
a b c d 0/1:12 1/0:9 0/1:16
e f g h 0/0:7 1/1:98 0/0:8
Desired output:
col1 col2 col3 col4 sam1 sam2 sam3
a b c d 0/1 1/0 0/1
e f g h 0/0 1/1 0/0
This is the best I've got so far... but it doesn't work.
awk -F"\t" '{ for(i=5; i<=NF; --i); split($i,a,":"); print a[1]}}' input > output
I know how to cut a column, i.e. cut -d ':' -f2, but as far as I understand you can't combine cut with awk in a loop!
Then, I want to find all the unique values for columns starting with sam in the output file e.g.
0/1
1/0
0/0
1/1
I'm afraid I'm totally lost on an awk solution for that. I can do it in R but an awk solution would be preferred and much faster.
R solution:
output %>% pivot_longer(-c(col1:col4)) -> df_long
df_long %<>% select(value)
unique(df_long)
Answer 1
Score: 1
$ awk 'NR==1 {for(i=1;i<=NF;i++) if($i~/^sam/) cols[i]}
{for(i=1;i<=NF;i++)
if(i in cols)
{split($i,t,":"); $i=t[1]}}1' file |
column -t
col1 col2 col3 col4 sam1 sam2 sam3
a b c d 0/1 1/0 0/1
e f g h 0/0 1/1 0/0
Or, if you're just interested in the unique values:
$ awk 'NR==1 {for(i=1;i<=NF;i++) if($i~/^sam/) cols[i]; next}
{for(i in cols) {split($i,t,":"); if(!vals[t[1]]++) print t[1]}}' file
0/1
1/0
0/0
1/1
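For reference, here is the first command above spelled out with comments. It is only a restatement of the same logic, assuming as before that the relevant column names start with sam and that the fields are whitespace- or tab-delimited:

awk '
  # header line: remember the numbers of the columns whose names start with "sam"
  NR==1 { for (i=1; i<=NF; i++) if ($i ~ /^sam/) cols[i] }
  # every line: in those columns, keep only the part before the first ":"
  { for (i=1; i<=NF; i++) if (i in cols) { split($i, t, ":"); $i = t[1] } }
  # a bare "1" is a true pattern with no action, so awk prints each (possibly modified) line
  1
' file | column -t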
Answer 2
Score: 1
awk '
BEGIN { FS = OFS = "\t" }
NR==1 { while(i++<NF) if ($i ~ /^sam/) p[i] }
NR>1 { for (i in p) { sub(/:.*$/,"",$i); u[$i] } }
{ print > "output" }
END { for (i in u) print i > "unique" }
' input
- use first row to populate a list with columns of interest
- on subsequent rows, process relevant columns and copy amended values to hash
- print each line to the file called "output"
- at the end print the keys of the hash to a file called "unique"
awk's arrays are hashes, so storing items as keys of an array gives the unique items.
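A minimal illustration of that idiom, using a made-up list of values rather than the poster's file:

printf '0/1\n1/0\n0/1\n0/0\n' |
awk '{ seen[$0] }                      # referencing seen[value] creates the key; repeats hit the same key
     END { for (v in seen) print v }'  # the array keys are exactly the distinct values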
Answer 3
Score: 1
You can do it all in one go fairly easily. For example, simply using gsub() to remove the :XX from each field, and then a simple array to collect the unique sam fields, you would have:
awk -F"\t" '{gsub(/:[^[:space:]]+/,"")} FNR>1 {for (i=5; i<=NF; i++) a[$i]++} END {for (i in a) print i}1' file
Example Use/Output
With your content in file, you would have:
$ awk -F"\t" '{gsub(/:[^[:space:]]+/,"")} FNR>1 {for (i=5; i<=NF; i++) a[$i]++} END {for (i in a) print i}1' file
col1 col2 col3 col4 sam1 sam2 sam3
a b c d 0/1 1/0 0/1
e f g h 0/0 1/1 0/0
1/0
1/1
0/0
0/1
In Awk Script Form
You can put the contents in a simple script file, make it executable with chmod +x, and then just provide the filename to read as an argument. For example, create sam.awk as follows:
#!/bin/awk -f
BEGIN { FS = "\t" }
{
gsub(/:[^[:space:]]+/,"")
print
}
FNR>1 {
for (i=5; i<=NF; i++)
a[$i]++
}
END {
for (i in a)
print i
}
Now simply chmod +x sam.awk and execute ./sam.awk file to produce:
$ ./sam.awk file
col1 col2 col3 col4 sam1 sam2 sam3
a b c d 0/1 1/0 0/1
e f g h 0/0 1/1 0/0
1/0
1/1
0/0
0/1
Either way, as a one-liner or as a script, is fine; it's entirely up to you.
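If, as in the question, you would rather have the stripped table and the unique values in two separate files instead of one combined stream, the same gsub() approach can redirect to both. This is only a sketch, reusing the output and unique file names that appear earlier on this page and still assuming the sam columns are fields 5 onward:

awk -F'\t' '
  { gsub(/:[^[:space:]]+/, ""); print > "output" }   # stripped table goes to "output"
  FNR>1 { for (i=5; i<=NF; i++) a[$i]++ }            # tally the sam-column values
  END   { for (v in a) print v > "unique" }          # distinct values go to "unique"
' file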
Answer 4
Score: 0
"查找以sam开头的列的所有唯一值"
$ grep -o "[0-9]/[0-9]" inputfile | sort -u
0/0
0/1
1/0
1/1
英文:
"find all the unique values for columns starting with sam"
$ grep -o "[0-9]/[0-9]" inputfile | sort -u
0/0
0/1
1/0
1/1
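The grep shortcut assumes every value of interest looks like digit/digit and that nothing else in the file happens to match that pattern. If you would rather not rely on that, the cut the asker already knows can be chained in a plain pipeline instead; a sketch assuming tab-separated input and that the sam columns are always fields 5 onward:

tail -n +2 inputfile |   # skip the header row
cut -f5- |               # keep only the sam columns (tab-separated fields 5 onward)
tr '\t' '\n' |           # put one value per line
cut -d':' -f1 |          # keep everything before the first colon
sort -u                  # reduce to the unique values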