英文:
Split a SAM file in Awk keeping N number of lines as header
问题
我有一个非常大的序列比对映射(SAM)文件,如下所示:
@X YYYYYY ZZZZZ\
@X ssssss ddddd\
@X CCCCCC LLLLL
> FFFFFF 117 ch1 16448 0 * = 16448 0 TCTTGCACTGATCTGATGGACAGCATTGATGACATAACACGGAGACTGTTGCTAAAAACATCCGATAAAACTCGTGCTCAGACACCAAATACTCAAGAAG JJFEJDDDBDJJJHJDDDHDJJEFJDJJCDFDJEJCEHHFDDDJDJEHEEJFJJJHDIFJJJJJDJDDHHJCDDJJFJFJEJFEDJJJDH########## MC:Z:55S22M23S RG:Z:Sample_POP1 AS:i:0 XS:i:0
> FFFFFF 117 ch6 16448 0 * = 16448 0 TCTTGCACTGATCTGATGGACAGCATTGATGACATAACACGGAGACTGTTGCTAAAAACATCCGATAAAACTCGTGCTCAGACACCAAATACTCAAGAAG JJFEJDDDBDJJJHJDDDHDJJEFJDJJCDFDJEJCEHHFDDDJDJEHEEJFJJJHDIFJJJJJDJDDHHJCDDJJFJFJEJFEDJJJDH########## MC:Z:55S22M23S RG:Z:Sample_POP1 AS:i:0 XS:i:0
> FFFFFF 117 ch2 16448 0 * = 16448 0 TCTTGCACTGATCTGATGGACAGCATTGATGACATAACACGGAGACTGTTGCTAAAAACATCCGATAAAACTCGTGCTCAGACACCAAATACTCAAGAAG JJFEJDDDBDJJJHJDDDHDJJEFJDJJCDFDJEJCEHHFDDDJDJEHEEJFJJJHDIFJJJJJDJDDHHJCDDJJFJFJEJFEDJJJDH########## MC:Z:55S22M23S RG:Z:Sample_POP1 AS:i:0 XS:i:0
> FFFFFF 117 ch5 16448 0 * = 16448 0 TCTTGCACTGATCTGATGGACAGCATTGATGACATAACACGGAGACTGTTGCTAAAAACATCCGATAAAACTCGTGCTCAGACACCAAATACTCAAGAAG JJFEJDDDBDJJJHJDDDHDJJEFJDJJCDFDJEJCEHHFDDDJDJEHEEJFJJJHDIFJJJJJDJDDHHJCDDJJFJFJEJFEDJJJDH########## MC:Z:55S22M23S RG:Z:Sample_POP1 AS:i:0 XS:i:0
> FFFFFF 117 ch1 16448 0 * = 16448 0 TCTTGCACTGATCTGATGGACAGCATTGATGACATAACACGGAGACTGTTGCTAAAAACATCCGATAAAACTCGTGCTCAGACACCAAATACTCAAGAAG JJFEJDDDBDJJJHJDDDHDJJEFJDJJCDFDJEJCEHHFDDDJDJEHEEJFJJJHDIFJJJJJDJDDHHJCDDJJFJFJEJFEDJJJDH########## MC:Z:55S22M23S RG:Z:Sample_POP1 AS:i:0 XS:i:0
我想根据第3列将文件拆分,以便我可以执行 awk '{print > $3}' file.txt
,这已经可以正常工作。现在,我想将以下这些行:
@X YYYYYY ZZZZZ\
@X ssssss ddddd\
@X CCCCCC LLLLL
作为所有拆分文件的标题,我该如何实现?
我尝试过这样做:
awk '$1 ~ /^@/ {print > $3}' file.txt
这样是否符合您的要求?
英文:
I have a very big Sequence Alignment Map (SAM) file as depicted below
@X YYYYYY ZZZZZ\
@X ssssss ddddd\
@X CCCCCC LLLLL
> FFFFFF 117 ch1 16448 0 * = 16448 0 TCTTGCACTGATCTGATGGACAGCATTGATGACATAACACGGAGACTGTTGCTAAAAACATCCGATAAAACTCGTGCTCAGACACCAAATACTCAAGAAG JJFEJDDDBDJJJHJDDDHDJJEFJDJJCDFDJEJCEHHFDDDJDJEHEEJFJJJHDIFJJJJJDJDDHHJCDDJJFJFJEJFEDJJJDH########## MC:Z:55S22M23S RG:Z:Sample_POP1 AS:i:0 XS:i:0
> FFFFFF 117 ch6 16448 0 * = 16448 0 TCTTGCACTGATCTGATGGACAGCATTGATGACATAACACGGAGACTGTTGCTAAAAACATCCGATAAAACTCGTGCTCAGACACCAAATACTCAAGAAG JJFEJDDDBDJJJHJDDDHDJJEFJDJJCDFDJEJCEHHFDDDJDJEHEEJFJJJHDIFJJJJJDJDDHHJCDDJJFJFJEJFEDJJJDH########## MC:Z:55S22M23S RG:Z:Sample_POP1 AS:i:0 XS:i:0
> FFFFFF 117 ch2 16448 0 * = 16448 0 TCTTGCACTGATCTGATGGACAGCATTGATGACATAACACGGAGACTGTTGCTAAAAACATCCGATAAAACTCGTGCTCAGACACCAAATACTCAAGAAG JJFEJDDDBDJJJHJDDDHDJJEFJDJJCDFDJEJCEHHFDDDJDJEHEEJFJJJHDIFJJJJJDJDDHHJCDDJJFJFJEJFEDJJJDH########## MC:Z:55S22M23S RG:Z:Sample_POP1 AS:i:0 XS:i:0
> FFFFFF 117 ch5 16448 0 * = 16448 0 TCTTGCACTGATCTGATGGACAGCATTGATGACATAACACGGAGACTGTTGCTAAAAACATCCGATAAAACTCGTGCTCAGACACCAAATACTCAAGAAG JJFEJDDDBDJJJHJDDDHDJJEFJDJJCDFDJEJCEHHFDDDJDJEHEEJFJJJHDIFJJJJJDJDDHHJCDDJJFJFJEJFEDJJJDH########## MC:Z:55S22M23S RG:Z:Sample_POP1 AS:i:0 XS:i:0
> FFFFFF 117 ch1 16448 0 * = 16448 0 TCTTGCACTGATCTGATGGACAGCATTGATGACATAACACGGAGACTGTTGCTAAAAACATCCGATAAAACTCGTGCTCAGACACCAAATACTCAAGAAG JJFEJDDDBDJJJHJDDDHDJJEFJDJJCDFDJEJCEHHFDDDJDJEHEEJFJJJHDIFJJJJJDJDDHHJCDDJJFJFJEJFEDJJJDH########## MC:Z:55S22M23S RG:Z:Sample_POP1 AS:i:0 XS:i:0
I want to split the file based on column 3 so I can do awk '{print > $3}' file.txt
which is working fine. Now I want to keep the lines
@X YYYYYY ZZZZZ\
@X ssssss ddddd\
@X CCCCCC LLLLL
as header on top of all the splitted files, how can I do that?
I tried this:
awk '$1 ~ /^@/ {print > $3}' file.txt
答案1
得分: 3
你需要跟踪文件是否是之前见过的,如果不是,第一次写入之前写入头部。
!seen[$3]++ { printf "%s", header > $3 }
{ print > $3 }' file.txt
内部变量 ORS
通常包含一个换行符,但习惯上使用该变量,这样如果你想要使用不同的输出记录分隔符,只需要在一个地方更改该字符串。
如果 $3
中有超过几十个不同的值,这可能会耗尽文件句柄,但如果你的脚本在其他方面运行正常,那么在你的情况下可能不是问题。
(一种蛮力的解决方法是在每次写入后关闭并重新打开文件,这会使脚本运行得慢得多。如果你有足够的内存,更好的解决方法是将所有结果收集到内存中,只有在读取完所有数据后才写入。更复杂的方法是保持一个缓冲区,比如说,20 个打开的文件句柄,并在需要写入不在其中的文件时关闭最近最少使用的文件句柄。)
英文:
You have to keep track of whether the file is one you have seen before, and if not, write the header before you write to it for the first time.
awk '$1 ~ /^@/ { header = header $0 ORS; next }
!seen[$3]++ { printf "%s", header >$3 }
{ print > $3 }' file.txt
The internal variable ORS
usually contains a newline but it's customary to use the variable so that you only need to change the string in one place if you want to use a different output record separator.
This can run out of file handles if you have more than a couple of dozen distinct values in $3
but if your script otherwise works, I guess that's not a problem in your case.
(The brute-force fix is to close and reopen the file after each write, which makes the script much slower. A better fix if you have the memory is to collect all the results into RAM and only write when you have read all the data. A more sophisticated approach would keep a buffer of, say, 20 open file handles, and close the least recently used when you need to write to a file which isn't among them.)
答案2
得分: -1
如果“标题行”始终包含3个字段,则条件可以如下:
对于包含多于3个字段的行,将第一个和第二个字段设置为空;否则,将行原样打印:
file.txt中的内容:
cat file.txt
@X YYYYYY ZZZZZ\
@X ssssss ddddd\
@X CCCCCC LLLLL
> FFFFFF 117 ch1 16448 0 * = 16448 0 TCTTGCACTGATCTGATGGACAGCATTGATGACATAACACGGAGACTGTTGCTAAAAACATCCGATAAAACTCGTGCTCAGACACCAAATACTCAAGAAG JJFEJDDDBDJJJHJDDDHDJJEFJDJJCDFDJEJCEHHFDDDJDJEHEEJFJJJHDIFJJJJJDJDDHHJCDDJJFJFJEJFEDJJJDH########## MC:Z:55S22M23S RG:Z:Sample_POP1 AS:i:0 XS:i:0
> FFFFFF 117 ch6 16448 0 * = 16448 0 TCTTGCACTGATCTGATGGACAGCATTGATGACATAACACGGAGACTGTTGCTAAAAACATCCGATAAAACTCGTGCTCAGACACCAAATACTCAAGAAG JJFEJDDDBDJJJHJDDDHDJJEFJDJJCDFDJEJCEHHFDDDJDJEHEEJFJJJHDIFJJJJJDJDDHHJCDDJJFJFJEJFEDJJJDH########## MC:Z:55S22M23S RG:Z:Sample_POP1 AS:i:0 XS:i:0
> FFFFFF 117 ch2 16448 0 * = 16448 0 TCTTGCACTGATCTGATGGACAGCATTGATGACATAACACGGAGACTGTTGCTAAAAACATCCGATAAAACTCGTGCTCAGACACCAAATACTCAAGAAG JJFEJDDDBDJJJHJDDDHDJJEFJDJJCDFDJEJCEHHFDDDJDJEHEEJFJJJHDIFJJJJJDJDDHHJCDDJJFJFJEJFEDJJJDH########## MC:Z:55S22M23S RG:Z:Sample_POP1 AS:i:0 XS:i:0
> FFFFFF 117 ch5 16448 0 * = 16448 0 TCTTGCACTGATCTGATGGACAGCATTGATGACATAACACGGAGACTGTTGCTAAAAACATCCGATAAAACTCGTGCTCAGACACCAAATACTCAAGAAG JJFEJDDDBDJJJHJDDDHDJJEFJDJJCDFDJEJCEHHFDDDJDJEHEEJFJJJHDIFJJJJJDJDDHHJCDDJJFJFJEJFEDJJJDH########## MC:Z:55S22M23S RG:Z:Sample_POP1 AS:i:0 XS:i:0
> FFFFFF 117 ch1 16448 0 * = 16448 0 TCTTGCACTGATCTGATGGACAGCATTGATGACATAACACGGAGACTGTTGCTAAAAACATCCGATAAAACTCGTGCTCAGACACCAAATACTCAAGAAG JJFEJDDDBDJJJHJDDDHDJJEFJDJJCDFDJEJCEHHFDDDJDJEHEEJFJJJHDIFJJJJJDJDDHHJCDDJJFJFJEJFEDJJJDH########## MC:Z:55S22M23S RG:Z:Sample_POP1 AS:i:0 XS:i:0
awk:
awk '{ if( NF > 3) $1=$2=""; print }' file.txt
@X YYYYYY ZZZZZ\
@X ssssss ddddd\
@X CCCCCC LLLLL
117 ch1 16448 0 * = 16448 0 TCTTGCACTGATCTGATGGACAGCATTGATGACATAACACGGAGACTGTTGCTAAAAACATCCGATAAAACTCGTGCTCAGACACCAAATACTCAAGAAG JJFEJDDDBDJJJHJDDDHDJJEFJDJJCDFDJEJCEHHFDDDJDJEHEEJFJJJHDIFJJJJJDJDDHHJCDDJJFJFJEJFEDJJJDH########## MC:Z:55S22M23S RG:Z:Sample_POP1 AS:i:0 XS:i:0
117 ch6 16448 0 * = 16448 0 TCTTGCACTGATCTGATGGACAGCATTGATGACATAACACGGAGACTGTTGCTAAAAACATCCGATAAAACTCGTGCTCAGACACCAAATACTCAAGAAG JJFEJDDDBDJJJHJDDDHDJJEFJDJJCDFDJEJCEHHFDDDJDJEHEEJFJJJHDIFJJJJJDJDDHHJCDDJJFJFJEJFEDJJJDH########## MC:Z:55S22M23S RG:Z:Sample_POP1 AS:i:0 XS:i:0
117 ch2 16448 0 * = 16448 0 TCTTGCACTGATCTGATGGACAGCATTGATGACATAACACGGAGACTGTTGCTAAAAACATCCGATAAAACTCGTGCTCAGACACCAAATACTCAAGAAG JJFEJDDDBDJJJHJDDDHDJJEFJDJJCDFDJEJCEHHFDDDJDJEHEEJFJJJHDIFJJJJJDJDDHHJCDDJJFJFJEJFEDJJJDH########## MC:Z:55S22M23S RG:Z:Sample_POP1 AS:i:0 XS:i:0
117 ch5 16448 0 * = 16448 0 TCTTGCACTGATCTGATGGACAGCATTGATGACATAACACGGAGACTGTTGCTAAAAACATCCGATAAAACTCGTGCTCAGACACCAAATACTCAAGAAG JJFEJDDDBDJJJHJDDDHDJJEFJDJJCDFDJEJCEHHFDDDJDJEHEEJFJJJHDIFJJJJJDJDDHHJCDDJJFJFJEJFEDJJJDH########## MC:Z:
<details>
<summary>英文:</summary>
If "header lines" always contains 3 fields then criteria can be as follows:
For lines containing more than 3 fields, set first and second field as ""; else, print line as it is:
file.txt used:
cat file.txt
@X YYYYYY ZZZZZ\
@X ssssss ddddd\
@X CCCCCC LLLLL
> FFFFFF 117 ch1 16448 0 * = 16448 0 TCTTGCACTGATCTGATGGACAGCATTGATGACATAACACGGAGACTGTTGCTAAAAACATCCGATAAAACTCGTGCTCAGACACCAAATACTCAAGAAG JJFEJDDDBDJJJHJDDDHDJJEFJDJJCDFDJEJCEHHFDDDJDJEHEEJFJJJHDIFJJJJJDJDDHHJCDDJJFJFJEJFEDJJJDH########## MC:Z:55S22M23S RG:Z:Sample_POP1 AS:i:0 XS:i:0
> FFFFFF 117 ch6 16448 0 * = 16448 0 TCTTGCACTGATCTGATGGACAGCATTGATGACATAACACGGAGACTGTTGCTAAAAACATCCGATAAAACTCGTGCTCAGACACCAAATACTCAAGAAG JJFEJDDDBDJJJHJDDDHDJJEFJDJJCDFDJEJCEHHFDDDJDJEHEEJFJJJHDIFJJJJJDJDDHHJCDDJJFJFJEJFEDJJJDH########## MC:Z:55S22M23S RG:Z:Sample_POP1 AS:i:0 XS:i:0
> FFFFFF 117 ch2 16448 0 * = 16448 0 TCTTGCACTGATCTGATGGACAGCATTGATGACATAACACGGAGACTGTTGCTAAAAACATCCGATAAAACTCGTGCTCAGACACCAAATACTCAAGAAG JJFEJDDDBDJJJHJDDDHDJJEFJDJJCDFDJEJCEHHFDDDJDJEHEEJFJJJHDIFJJJJJDJDDHHJCDDJJFJFJEJFEDJJJDH########## MC:Z:55S22M23S RG:Z:Sample_POP1 AS:i:0 XS:i:0
> FFFFFF 117 ch5 16448 0 * = 16448 0 TCTTGCACTGATCTGATGGACAGCATTGATGACATAACACGGAGACTGTTGCTAAAAACATCCGATAAAACTCGTGCTCAGACACCAAATACTCAAGAAG JJFEJDDDBDJJJHJDDDHDJJEFJDJJCDFDJEJCEHHFDDDJDJEHEEJFJJJHDIFJJJJJDJDDHHJCDDJJFJFJEJFEDJJJDH########## MC:Z:55S22M23S RG:Z:Sample_POP1 AS:i:0 XS:i:0
> FFFFFF 117 ch1 16448 0 * = 16448 0 TCTTGCACTGATCTGATGGACAGCATTGATGACATAACACGGAGACTGTTGCTAAAAACATCCGATAAAACTCGTGCTCAGACACCAAATACTCAAGAAG JJFEJDDDBDJJJHJDDDHDJJEFJDJJCDFDJEJCEHHFDDDJDJEHEEJFJJJHDIFJJJJJDJDDHHJCDDJJFJFJEJFEDJJJDH########## MC:Z:55S22M23S RG:Z:Sample_POP1 AS:i:0 XS:i:0
awk:
awk '{ if( NF > 3) $1=$2=""; print }' file.txt
@X YYYYYY ZZZZZ\
@X ssssss ddddd\
@X CCCCCC LLLLL
117 ch1 16448 0 * = 16448 0 TCTTGCACTGATCTGATGGACAGCATTGATGACATAACACGGAGACTGTTGCTAAAAACATCCGATAAAACTCGTGCTCAGACACCAAATACTCAAGAAG JJFEJDDDBDJJJHJDDDHDJJEFJDJJCDFDJEJCEHHFDDDJDJEHEEJFJJJHDIFJJJJJDJDDHHJCDDJJFJFJEJFEDJJJDH########## MC:Z:55S22M23S RG:Z:Sample_POP1 AS:i:0 XS:i:0
117 ch6 16448 0 * = 16448 0 TCTTGCACTGATCTGATGGACAGCATTGATGACATAACACGGAGACTGTTGCTAAAAACATCCGATAAAACTCGTGCTCAGACACCAAATACTCAAGAAG JJFEJDDDBDJJJHJDDDHDJJEFJDJJCDFDJEJCEHHFDDDJDJEHEEJFJJJHDIFJJJJJDJDDHHJCDDJJFJFJEJFEDJJJDH########## MC:Z:55S22M23S RG:Z:Sample_POP1 AS:i:0 XS:i:0
117 ch2 16448 0 * = 16448 0 TCTTGCACTGATCTGATGGACAGCATTGATGACATAACACGGAGACTGTTGCTAAAAACATCCGATAAAACTCGTGCTCAGACACCAAATACTCAAGAAG JJFEJDDDBDJJJHJDDDHDJJEFJDJJCDFDJEJCEHHFDDDJDJEHEEJFJJJHDIFJJJJJDJDDHHJCDDJJFJFJEJFEDJJJDH########## MC:Z:55S22M23S RG:Z:Sample_POP1 AS:i:0 XS:i:0
117 ch5 16448 0 * = 16448 0 TCTTGCACTGATCTGATGGACAGCATTGATGACATAACACGGAGACTGTTGCTAAAAACATCCGATAAAACTCGTGCTCAGACACCAAATACTCAAGAAG JJFEJDDDBDJJJHJDDDHDJJEFJDJJCDFDJEJCEHHFDDDJDJEHEEJFJJJHDIFJJJJJDJDDHHJCDDJJFJFJEJFEDJJJDH########## MC:Z:55S22M23S RG:Z:Sample_POP1 AS:i:0 XS:i:0
117 ch1 16448 0 * = 16448 0 TCTTGCACTGATCTGATGGACAGCATTGATGACATAACACGGAGACTGTTGCTAAAAACATCCGATAAAACTCGTGCTCAGACACCAAATACTCAAGAAG JJFEJDDDBDJJJHJDDDHDJJEFJDJJCDFDJEJCEHHFDDDJDJEHEEJFJJJHDIFJJJJJDJDDHHJCDDJJFJFJEJFEDJJJDH########## MC:Z:55S22M23S RG:Z:Sample_POP1 AS:i:0 XS:i:0
</details>
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论