英文:
Extracting vcf columns substring with awk
问题
ID INFO
rs3094315 2
rs3131972 2
英文:
I have vcf file like this:
##bcftools_annotateVersion=1.3.1+htslib-1.3.1
##bcftools_annotateCommand=annotate
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT HG005
chr1 817186 rs3094315 G A 50 PASS platforms=2;platformnames=Illumina,CG;datasets=3;datasetnames=HiSeq250x250,CGnormal,HiSeqMatePair;callsets=5;callsetnames=HiSeq250x250Sentieon,CGnormal,HiSeq250x250freebayes,HiSeqMatePairSentieon,HiSeqMatePairfreebayes;datasetsmissingcall=IonExome,SolidSE75bp;callable=CS_HiSeq250x250Sentieon_callable,CS_CGnormal_callable,CS_HiSeq250x250freebayes_callable;AN=2;AF=1;AC=2 GT:PS:DP:ADALL:AD:GQ 1/1:.:809:0,363:78,428:237
chr1 817341 rs3131972 A G 50 PASS platforms=3;platformnames=Illumina,CG,Solid;datasets=4;datasetnames=HiSeq250x250,CGnormal,HiSeqMatePair,SolidSE75bp;callsets=6;callsetnames=HiSeq250x250Sentieon,CGnormal,HiSeq250x250freebayes,HiSeqMatePairSentieon,HiSeqMatePairfreebayes,SolidSE75GATKHC;datasetsmissingcall=IonExome;callable=CS_HiSeq250x250Sentieon_callable,CS_CGnormal_callable,CS_HiSeq250x250freebayes_callable;AN=2;AF=1;AC=2 GT:PS:DP:ADALL:AD:GQ 1/1:.:732:1,330:99,391:302
I need to extract ID column and AN from INFO column to get:
ID INFO
rs3094315 2
rs3131972 2
I'm trying something like this awk '/^[^#]/ { print $3, gsub(/^[^AN=])/,"",$8)}' file.vcf
, but still not getting the desired result.
答案1
得分: 1
你可以尝试这个awk命令:
awk 'BEGIN{OFS="\t;"} /^##/{next} /^#/{print $3,$8; next} { split($8,a,";") for(i=1;i<=length(a);i++) if (a[i]~/^AN=/) {sub(/^AN=/,"",a[i]); break} printf "%s%s%s\n", $3, OFS, a[i] }' file
用这个示例,会输出:
ID INFO
rs3094315 2
rs3131972 2
英文:
You can try this awk:
awk 'BEGIN{OFS="\t"}
/^##/{next}
/^#/{print $3,$8; next}
{
split($8,a,";")
for(i=1;i<=length(a);i++) if (a[i]~/^AN=/) {sub(/^AN=/,"",a[i]); break}
printf "%s%s%s\n", $3, OFS, a[i]
}
' file
With the example, prints:
ID INFO
rs3094315 2
rs3131972 2
答案2
得分: 0
GNU sed
也许是这个更好的选择:
$ sed -En '/^#/!s/^(\w+\t){2}(\w+).*;AN=([^;]+).*/\t/p' file.vcf
rs3094315 2
rs3131972 3
英文:
GNU sed
is maybe a better choice for this:
$ sed -En '/^#/!s/^(\w+\t){2}(\w+).*;AN=([^;]+).*/\t/p' file.vcf
rs3094315 2
rs3131972 3
答案3
得分: 0
ID INFO
rs3094315 2
rs3131972 2
英文:
mawk NF=NF OFS='\t' FS='^##.+|^[^ \t]+[ \t]+[^ \t]+[ \t]+|[ \t]' \
'.+(FILTER[ \t]+|;AN=)|([ \t]+[^ \t]+[ \t]+[^ \t]+|;.+)$'
ID INFO
rs3094315 2
rs3131972 2
答案4
得分: 0
$ awk -F' *|AN=|;AF=' '{print(NR>2 ? $3"\t"$9 : $3"\t\t"$8)}' file
ID INFO
rs3094315 2
rs3131972 2
英文:
$ awk -F' *|AN=|;AF=' 'NR>2{print(NR==3 ? $3"\t\t"$8 : $3"\t"$9)}' file
ID INFO
rs3094315 2
rs3131972 2
答案5
得分: 0
通常提醒,有专门的工具用于这种情况,没有理由使用类似 awk 或 sed 的东西,特别是如果你是初学者,不真正了解命令在做什么,因为解析 vcf 文件存在许多陷阱。有足够的理由认为 awk 脚本会在 vcf 格式的新/旧版本上静默失败,产生奇怪而奇妙的结果。
简单的 bcftools 解决方案:
bcftools query -f'%ID [%AN]\n' in.vcf > out.txt
bcftools format
从 vcf 中提取字段。在 -f
格式标志中,对于 INFO
字段,使用格式 %FIELD
,对于 FORMAT
字段,使用格式 [%FIELD]
。
英文:
Usual reminder, there are dedicated tools for this kind of thing and there's no reason to use something like awk or sed, especially if you are a beginner and you don't really understand what the command is doing, as there are many pitfalls of parsing vcf files. There's every reason to think an awk script will silenty fail on a new/previous version of the vcf format in some weird and wonderful.
Simple bcftools solution:
bcftools query -f'%ID [%AN]\n' in.vcf > out.txt
bcftools format
extracts fields from the vcf. Within the -f
formqt flag, for INFO
fields, use the format %FIELD
and for FORMAT
fields, use the format [%FIELD]
.
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论