Split a SAM file in Awk keeping N number of lines as header

Question

I have a very large Sequence Alignment Map (SAM) file, as shown below:

 @X   YYYYYY ZZZZZ\
 @X   ssssss ddddd\
 @X   CCCCCC LLLLL
 
> FFFFFF	117	ch1	16448	0	*	=	16448	0	TCTTGCACTGATCTGATGGACAGCATTGATGACATAACACGGAGACTGTTGCTAAAAACATCCGATAAAACTCGTGCTCAGACACCAAATACTCAAGAAG	JJFEJDDDBDJJJHJDDDHDJJEFJDJJCDFDJEJCEHHFDDDJDJEHEEJFJJJHDIFJJJJJDJDDHHJCDDJJFJFJEJFEDJJJDH##########	MC:Z:55S22M23S	RG:Z:Sample_POP1	AS:i:0	XS:i:0
> FFFFFF	117	ch6	16448	0	*	=	16448	0	TCTTGCACTGATCTGATGGACAGCATTGATGACATAACACGGAGACTGTTGCTAAAAACATCCGATAAAACTCGTGCTCAGACACCAAATACTCAAGAAG	JJFEJDDDBDJJJHJDDDHDJJEFJDJJCDFDJEJCEHHFDDDJDJEHEEJFJJJHDIFJJJJJDJDDHHJCDDJJFJFJEJFEDJJJDH##########	MC:Z:55S22M23S	RG:Z:Sample_POP1	AS:i:0	XS:i:0
> FFFFFF	117	ch2	16448	0	*	=	16448	0	TCTTGCACTGATCTGATGGACAGCATTGATGACATAACACGGAGACTGTTGCTAAAAACATCCGATAAAACTCGTGCTCAGACACCAAATACTCAAGAAG	JJFEJDDDBDJJJHJDDDHDJJEFJDJJCDFDJEJCEHHFDDDJDJEHEEJFJJJHDIFJJJJJDJDDHHJCDDJJFJFJEJFEDJJJDH##########	MC:Z:55S22M23S	RG:Z:Sample_POP1	AS:i:0	XS:i:0
> FFFFFF	117	ch5	16448	0	*	=	16448	0	TCTTGCACTGATCTGATGGACAGCATTGATGACATAACACGGAGACTGTTGCTAAAAACATCCGATAAAACTCGTGCTCAGACACCAAATACTCAAGAAG	JJFEJDDDBDJJJHJDDDHDJJEFJDJJCDFDJEJCEHHFDDDJDJEHEEJFJJJHDIFJJJJJDJDDHHJCDDJJFJFJEJFEDJJJDH##########	MC:Z:55S22M23S	RG:Z:Sample_POP1	AS:i:0	XS:i:0
> FFFFFF	117	ch1	16448	0	*	=	16448	0	TCTTGCACTGATCTGATGGACAGCATTGATGACATAACACGGAGACTGTTGCTAAAAACATCCGATAAAACTCGTGCTCAGACACCAAATACTCAAGAAG	JJFEJDDDBDJJJHJDDDHDJJEFJDJJCDFDJEJCEHHFDDDJDJEHEEJFJJJHDIFJJJJJDJDDHHJCDDJJFJFJEJFEDJJJDH##########	MC:Z:55S22M23S	RG:Z:Sample_POP1	AS:i:0	XS:i:0

I want to split the file on column 3, which I can do with awk '{print > $3}' file.txt, and that works fine. Now I want to keep the lines

 @X   YYYYYY ZZZZZ\
 @X   ssssss ddddd\
 @X   CCCCCC LLLLL

as a header at the top of every split file. How can I do that?

I tried this:

awk '$1 ~ /^@/ {print > $3}'  file.txt

Answer 1

Score: 3

You have to keep track of whether the file is one you have seen before, and if not, write the header before you write to it for the first time.

awk '$1 ~ /^@/ { header = header $0 ORS; next }  # accumulate the @ header lines
   !seen[$3]++ { printf "%s", header > $3 }      # first record for this $3: write the header
   { print > $3 }' file.txt

The built-in variable ORS contains a newline by default, but it's customary to use the variable so that you only need to change the string in one place if you want a different output record separator.
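As a small illustration of the point about ORS, here is a sketch with made-up input (the records and the ";" separator are assumptions chosen only for the demo). ORS is changed once in BEGIN, and both the accumulated header and every print pick up the new separator:

```shell
# Made-up input: two header lines and one tab-separated record.
# ORS is set in a single place; the header string and print both use it.
result=$(printf '@X\th1\n@X\th2\nr1\t117\tch1\n' |
  awk 'BEGIN { ORS = ";" }
       $1 ~ /^@/ { header = header $0 ORS; next }
       { printf "%s", header; print }')
printf '%s\n' "$result"
```

With this input the three lines come out joined by semicolons rather than newlines, without touching the header-building rule.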

This can run out of file handles if $3 has more than a couple of dozen distinct values, but if your script otherwise works, I guess that's not a problem in your case.

(The brute-force fix is to close and reopen the file after each write, which makes the script much slower. A better fix if you have the memory is to collect all the results into RAM and only write when you have read all the data. A more sophisticated approach would keep a buffer of, say, 20 open file handles, and close the least recently used when you need to write to a file which isn't among them.)
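The brute-force variant mentioned above could look roughly like the following sketch. The sample data and the scratch directory are made up for the demo. Note that once a file has been closed, reopening it with > would truncate it, so this version writes the header with > on first sight and appends with >> afterwards:

```shell
# Made-up sample data in a scratch directory (names are illustrative only).
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '@X\th1\n@X\th2\n' > file.txt
printf 'r1\t117\tch1\t100\nr2\t117\tch2\t200\nr3\t117\tch1\t300\n' >> file.txt

awk '$1 ~ /^@/ { header = header $0 ORS; next }
     # First record for this name: create (or truncate) the file with the header.
     !seen[$3]++ { printf "%s", header > $3; close($3) }
     # Append the record, then close so only one output file is open at a time.
     { print >> $3; close($3) }' file.txt

cat ch1
```

This keeps at most one output file open at any moment, at the cost of an open and close per record, which is why it is expected to be noticeably slower on a large input.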

Answer 2

Score: -1

If the header lines always contain 3 fields, the criterion can be as follows:

For lines with more than 3 fields, set the first and second fields to ""; otherwise, print the line as it is:

file.txt used:

    cat file.txt
    @X   YYYYYY ZZZZZ\
    @X   ssssss ddddd\
    @X   CCCCCC LLLLL
    
    > FFFFFF    117 ch1 16448   0   *   =   16448   0   TCTTGCACTGATCTGATGGACAGCATTGATGACATAACACGGAGACTGTTGCTAAAAACATCCGATAAAACTCGTGCTCAGACACCAAATACTCAAGAAG    JJFEJDDDBDJJJHJDDDHDJJEFJDJJCDFDJEJCEHHFDDDJDJEHEEJFJJJHDIFJJJJJDJDDHHJCDDJJFJFJEJFEDJJJDH##########    MC:Z:55S22M23S  RG:Z:Sample_POP1    AS:i:0  XS:i:0
    > FFFFFF    117 ch6 16448   0   *   =   16448   0   TCTTGCACTGATCTGATGGACAGCATTGATGACATAACACGGAGACTGTTGCTAAAAACATCCGATAAAACTCGTGCTCAGACACCAAATACTCAAGAAG    JJFEJDDDBDJJJHJDDDHDJJEFJDJJCDFDJEJCEHHFDDDJDJEHEEJFJJJHDIFJJJJJDJDDHHJCDDJJFJFJEJFEDJJJDH##########    MC:Z:55S22M23S  RG:Z:Sample_POP1    AS:i:0  XS:i:0
    > FFFFFF    117 ch2 16448   0   *   =   16448   0   TCTTGCACTGATCTGATGGACAGCATTGATGACATAACACGGAGACTGTTGCTAAAAACATCCGATAAAACTCGTGCTCAGACACCAAATACTCAAGAAG    JJFEJDDDBDJJJHJDDDHDJJEFJDJJCDFDJEJCEHHFDDDJDJEHEEJFJJJHDIFJJJJJDJDDHHJCDDJJFJFJEJFEDJJJDH##########    MC:Z:55S22M23S  RG:Z:Sample_POP1    AS:i:0  XS:i:0
    > FFFFFF    117 ch5 16448   0   *   =   16448   0   TCTTGCACTGATCTGATGGACAGCATTGATGACATAACACGGAGACTGTTGCTAAAAACATCCGATAAAACTCGTGCTCAGACACCAAATACTCAAGAAG    JJFEJDDDBDJJJHJDDDHDJJEFJDJJCDFDJEJCEHHFDDDJDJEHEEJFJJJHDIFJJJJJDJDDHHJCDDJJFJFJEJFEDJJJDH##########    MC:Z:55S22M23S  RG:Z:Sample_POP1    AS:i:0  XS:i:0
    > FFFFFF    117 ch1 16448   0   *   =   16448   0   TCTTGCACTGATCTGATGGACAGCATTGATGACATAACACGGAGACTGTTGCTAAAAACATCCGATAAAACTCGTGCTCAGACACCAAATACTCAAGAAG    JJFEJDDDBDJJJHJDDDHDJJEFJDJJCDFDJEJCEHHFDDDJDJEHEEJFJJJHDIFJJJJJDJDDHHJCDDJJFJFJEJFEDJJJDH##########    MC:Z:55S22M23S  RG:Z:Sample_POP1    AS:i:0  XS:i:0

awk:

    awk '{ if( NF > 3) $1=$2=""; print }' file.txt
    @X   YYYYYY ZZZZZ\
    @X   ssssss ddddd\
    @X   CCCCCC LLLLL

    117 ch1 16448 0 * = 16448 0 TCTTGCACTGATCTGATGGACAGCATTGATGACATAACACGGAGACTGTTGCTAAAAACATCCGATAAAACTCGTGCTCAGACACCAAATACTCAAGAAG JJFEJDDDBDJJJHJDDDHDJJEFJDJJCDFDJEJCEHHFDDDJDJEHEEJFJJJHDIFJJJJJDJDDHHJCDDJJFJFJEJFEDJJJDH########## MC:Z:55S22M23S RG:Z:Sample_POP1 AS:i:0 XS:i:0
    117 ch6 16448 0 * = 16448 0 TCTTGCACTGATCTGATGGACAGCATTGATGACATAACACGGAGACTGTTGCTAAAAACATCCGATAAAACTCGTGCTCAGACACCAAATACTCAAGAAG JJFEJDDDBDJJJHJDDDHDJJEFJDJJCDFDJEJCEHHFDDDJDJEHEEJFJJJHDIFJJJJJDJDDHHJCDDJJFJFJEJFEDJJJDH########## MC:Z:55S22M23S RG:Z:Sample_POP1 AS:i:0 XS:i:0
    117 ch2 16448 0 * = 16448 0 TCTTGCACTGATCTGATGGACAGCATTGATGACATAACACGGAGACTGTTGCTAAAAACATCCGATAAAACTCGTGCTCAGACACCAAATACTCAAGAAG JJFEJDDDBDJJJHJDDDHDJJEFJDJJCDFDJEJCEHHFDDDJDJEHEEJFJJJHDIFJJJJJDJDDHHJCDDJJFJFJEJFEDJJJDH########## MC:Z:55S22M23S RG:Z:Sample_POP1 AS:i:0 XS:i:0
    117 ch5 16448 0 * = 16448 0 TCTTGCACTGATCTGATGGACAGCATTGATGACATAACACGGAGACTGTTGCTAAAAACATCCGATAAAACTCGTGCTCAGACACCAAATACTCAAGAAG JJFEJDDDBDJJJHJDDDHDJJEFJDJJCDFDJEJCEHHFDDDJDJEHEEJFJJJHDIFJJJJJDJDDHHJCDDJJFJFJEJFEDJJJDH########## MC:Z:55S22M23S RG:Z:Sample_POP1 AS:i:0 XS:i:0
    117 ch1 16448 0 * = 16448 0 TCTTGCACTGATCTGATGGACAGCATTGATGACATAACACGGAGACTGTTGCTAAAAACATCCGATAAAACTCGTGCTCAGACACCAAATACTCAAGAAG JJFEJDDDBDJJJHJDDDHDJJEFJDJJCDFDJEJCEHHFDDDJDJEHEEJFJJJHDIFJJJJJDJDDHHJCDDJJFJFJEJFEDJJJDH########## MC:Z:55S22M23S RG:Z:Sample_POP1 AS:i:0 XS:i:0



huangapple
  • Posted on 2023-02-16 18:26:02
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75470881.html