使用awk提取vcf列的子字符串

huangapple go评论85阅读模式
英文:

Extracting vcf columns substring with awk

问题

ID         INFO
rs3094315   2
rs3131972   2
英文:

I have vcf file like this:

##bcftools_annotateVersion=1.3.1+htslib-1.3.1
##bcftools_annotateCommand=annotate 
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	HG005
chr1	817186	rs3094315	G	A	50	PASS	platforms=2;platformnames=Illumina,CG;datasets=3;datasetnames=HiSeq250x250,CGnormal,HiSeqMatePair;callsets=5;callsetnames=HiSeq250x250Sentieon,CGnormal,HiSeq250x250freebayes,HiSeqMatePairSentieon,HiSeqMatePairfreebayes;datasetsmissingcall=IonExome,SolidSE75bp;callable=CS_HiSeq250x250Sentieon_callable,CS_CGnormal_callable,CS_HiSeq250x250freebayes_callable;AN=2;AF=1;AC=2	GT:PS:DP:ADALL:AD:GQ	1/1:.:809:0,363:78,428:237
chr1	817341	rs3131972	A	G	50	PASS	platforms=3;platformnames=Illumina,CG,Solid;datasets=4;datasetnames=HiSeq250x250,CGnormal,HiSeqMatePair,SolidSE75bp;callsets=6;callsetnames=HiSeq250x250Sentieon,CGnormal,HiSeq250x250freebayes,HiSeqMatePairSentieon,HiSeqMatePairfreebayes,SolidSE75GATKHC;datasetsmissingcall=IonExome;callable=CS_HiSeq250x250Sentieon_callable,CS_CGnormal_callable,CS_HiSeq250x250freebayes_callable;AN=2;AF=1;AC=2	GT:PS:DP:ADALL:AD:GQ	1/1:.:732:1,330:99,391:302

I need to extract ID column and AN from INFO column to get:

ID         INFO
rs3094315   2
rs3131972   2

I'm trying something like this awk '/^[^#]/ { print $3, gsub(/^[^AN=])/,"",$8)}' file.vcf, but still not getting the desired result.

答案1

得分: 1

你可以尝试这个awk命令:

awk 'BEGIN{OFS="\t;"} /^##/{next} /^#/{print $3,$8; next} { split($8,a,";") for(i=1;i<=length(a);i++) if (a[i]~/^AN=/) {sub(/^AN=/,"",a[i]); break} printf "%s%s%s\n", $3, OFS, a[i] }' file

用这个示例,会输出:

ID	INFO
rs3094315	2
rs3131972	2
英文:

You can try this awk:

awk &#39;BEGIN{OFS=&quot;\t&quot;}
/^##/{next}
/^#/{print $3,$8; next}
{
    split($8,a,&quot;;&quot;)
    for(i=1;i&lt;=length(a);i++) if (a[i]~/^AN=/) {sub(/^AN=/,&quot;&quot;,a[i]); break}
    printf &quot;%s%s%s\n&quot;, $3, OFS, a[i]
}
&#39; file

With the example, prints:

ID	INFO
rs3094315	2
rs3131972	2

答案2

得分: 0

GNU sed也许是这个更好的选择:

$ sed -En '/^#/!s/^(\w+\t){2}(\w+).*;AN=([^;]+).*/\t/p' file.vcf
rs3094315    2
rs3131972    3
英文:

GNU sed is maybe a better choice for this:

$ sed -En &#39;/^#/!s/^(\w+\t){2}(\w+).*;AN=([^;]+).*/\t/p&#39; file.vcf
rs3094315    2
rs3131972    3

答案3

得分: 0

ID INFO
rs3094315 2
rs3131972 2

英文:
 mawk NF=NF OFS=&#39;\t&#39; FS=&#39;^##.+|^[^ \t]+[ \t]+[^ \t]+[ \t]+|[ \t]&#39; \
               &#39;.+(FILTER[ \t]+|;AN=)|([ \t]+[^ \t]+[ \t]+[^ \t]+|;.+)$&#39; 

	ID	INFO	
	rs3094315	2	
	rs3131972	2

答案4

得分: 0

$ awk -F' *|AN=|;AF=' '{print(NR>2 ? $3"\t"$9 : $3"\t\t"$8)}' file
ID              INFO
rs3094315       2
rs3131972       2
英文:
$ awk -F&#39; *|AN=|;AF=&#39; &#39;NR&gt;2{print(NR==3 ? $3&quot;\t\t&quot;$8 : $3&quot;\t&quot;$9)}&#39; file
ID              INFO
rs3094315       2
rs3131972       2

答案5

得分: 0

通常提醒,有专门的工具用于这种情况,没有理由使用类似 awk 或 sed 的东西,特别是如果你是初学者,不真正了解命令在做什么,因为解析 vcf 文件存在许多陷阱。有足够的理由认为 awk 脚本会在 vcf 格式的新/旧版本上静默失败,产生奇怪而奇妙的结果。

简单的 bcftools 解决方案:

bcftools query -f'%ID [%AN]\n' in.vcf > out.txt

bcftools format 从 vcf 中提取字段。在 -f 格式标志中,对于 INFO 字段,使用格式 %FIELD,对于 FORMAT 字段,使用格式 [%FIELD]

英文:

Usual reminder, there are dedicated tools for this kind of thing and there's no reason to use something like awk or sed, especially if you are a beginner and you don't really understand what the command is doing, as there are many pitfalls of parsing vcf files. There's every reason to think an awk script will silenty fail on a new/previous version of the vcf format in some weird and wonderful.

Simple bcftools solution:

bcftools query -f&#39;%ID [%AN]\n&#39; in.vcf &gt; out.txt

bcftools format extracts fields from the vcf. Within the -f formqt flag, for INFO fields, use the format %FIELD and for FORMAT fields, use the format [%FIELD].

huangapple
  • 本文由 发表于 2023年6月1日 19:50:00
  • 转载请务必保留本文链接:https://go.coder-hub.com/76381570.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定