英文:
Filtering a fasta file with sequences that match a certain string in another file
问题
使用 BLAST,我得到了一个包含两列以制表符分隔的文件,一列是物种名称,另一列是基因名称(参考数据库中最相似基因的名称)。我的目标是在第一个文件中找到所有关联基因名称为最常见的物种名称,并使用这个物种列表过滤第二个文件(FASTA 格式),只保留这些物种的序列。
例如,如果这是 BLAST 文件:
sp1 XXX
sp2 AAA
sp3 XXX
sp4 XXX
sp5 BBB
匹配最常见基因的物种将是 sp1、sp3 和 sp4。然后在 FASTA 文件中,原本包含许多物种的,我想要保留只有 sp1、sp3 和 sp4 的序列。所以从这个:
>sp1
AATCGAGTCGT
>sp2
AATCGCGTCGT
>sp3
AATCGAGTAGT
>sp4
AATCGAGTCGG
>sp5
AAACGAGTCGT
到这个 FASTA 文件:
>sp1
AATCGAGTCGT
>sp3
AATCGAGTAGT
>sp4
AATCGAGTCGG
英文:
With BLAST I have obtained a file with two tab-separated columns, one with species names and the other with a gene name (the name of the most similar gene in a reference database). My goal is to find in the first file all the species names for which the associated gene name is the most common one, and use this list of species to filter a second file (FASTA-formatted) keeping only those species.
For example if this is BLAST file:
sp1 XXX
sp2 AAA
sp3 XXX
sp4 XXX
sp5 BBB
The species that match the most common gene would be sp1, sp3 and sp4. Then in the FASTA file, which originally contains many species, I would like to keep only the sequences of species sp1, sp3 and sp4. So going from this:
>sp1
AATCGAGTCGT
>sp2
AATCGCGTCGT
>sp3
AATCGAGTAGT
>sp4
AATCGAGTCGG
>sp5
AAACGAGTCGT
To this fasta file:
>sp1
AATCGAGTCGT
>sp3
AATCGAGTAGT
>sp4
AATCGAGTCGG
So far I've tried for a few days different approaches but for one reason or another it never works. I've tried creating this script (don.awk) to obtain the most common gene name (2nd column):
BEGIN {
FS="\t"
}
{a[$2]++; if (a[$2] > comV) { comN=$2; comV=a[$2]} }
END {
printf("%s\n", comN, comV)
}
And then running it with
nawk -f don.awk blastfile
Then I tried assigning the name to a variable (for example var1) and using it in
awk -F'\t' '$2 ~ /$var1/ {print $0}' blastfile > result
to first filter the original file. But for some reason I can't save the variables or I get errors. But I guess there must be an easier way.
答案1
得分: 3
使用 GNU awk
与 grep
:
grep -A 1 -f <(awk '
BEGIN{PROCINFO["sorted_in"] = "@val_num_desc";}
{a[$2]++; b[$1]=$2}
END{for (i in a){j++; if (j==1){tgt=i; break}}
for (ii in b){if(tgt==b[ii]){print ">"ii}}' blast.txt) input.fasta | grep -v '^--' > filtered.fasta
输出
>sp1
AATCGAGTCGT
>sp3
AATCGAGTAGT
>sp4
AATCGAGTCGG
描述
- 将数组排序顺序设置为降序,以数组值的数值为基准:
BEGIN{PROCINFO["sorted_in"] = "@val_num_desc";}
- 使用第二列值作为索引,填充第一个数组以统计第二列值的出现次数。同时,使用第一列作为键,第二列作为值填充第二个数组。
{a[$2]++; b[$1]=$2}
- 提取第一个数组的第一个索引作为目标基因名称,用于从第二个数组中筛选物种名称。
for (i in a){j++; if (j==1){tgt=i; break}}
- 使用上述提取的基因名来获取物种名称,用于筛选 input.fasta 文件中的物种:
for (ii in b){if(tgt==b[ii]){print ">"ii}}
- 使用
grep
的-A 1
选项和-f
选项来提取所需的 fasta 记录。
grep -A 1 -f <(awk...) input.fasta`
- 使用管道符
|
到另一个 grep 来去除匹配组之间的--
,并将输出重定向到一个文件。
| grep -v '^--' > filtered.fasta
- 或者只使用 GNU
awk
:
awk 'BEGIN{PROCINFO["sorted_in"] = "@val_num_desc";}
BEGINFILE{fnum++}
fnum==1{a[$2]++; b[$1]=$2}
ENDFILE{if(fnum==1){
for (i in a){j++; if (j==1){tgt=i; break}}}
for (ii in b){if(tgt!=b[ii]){delete b[ii]}}
}
fnum==2{for (i in b){if($0==">"i){
getline n;
printf "%s\n%s\n", $0, n}}}' blast.txt input.fasta
细节
- 将数组排序顺序设置为降序,以数组值的数值为基准:
BEGIN{PROCINFO["sorted_in"] = "@val_num_desc";}
- 对于每个处理的文件,增加变量
fnum
的值。
BEGINFILE{fnum++}
- 对于第一个处理的文件,使用第二列值作为索引,填充第一个数组以统计第二列值的出现次数。同时,使用第一列作为键,第二列作为值填充第二个数组。
fnum==1{a[$2]++; b[$1]=$2}
- 当完成处理第一个文件时,提取第一个数组的第一个索引作为目标基因名称,用于从第二个数组中筛选物种名称。然后,通过删除不匹配目标基因名的条目来筛选第二个数组。
ENDFILE{if(fnum==1){
for (i in a){j++; if (j==1){tgt=i; break}}}
for (ii in b){if(tgt!=b[ii]){delete b[ii]}}
}
- 如果处理第二个文件,则将当前行与筛选后的数组中的每个条目进行比较,如果有匹配,则打印当前行和接下来的一行,并将输出重定向到一个文件。
fnum==2{for (i in b){if($0==">"i){
getline n;
printf "%s\n%s\n", $0, n}}}' blast.txt input.fasta > filtered.fasta
英文:
Using GNU awk
with grep
:
grep -A 1 -f <(awk '
BEGIN{PROCINFO["sorted_in"] = "@val_num_desc"}
{a[$2]++; b[$1]=$2}
END{for (i in a){j++; if (j==1){tgt=i; break}}
for (ii in b){if(tgt==b[ii]){print ">"ii}}}' blast.txt) input.fasta | grep -v '^--' > filtered.fasta
Output
>sp1
AATCGAGTCGT
>sp3
AATCGAGTAGT
>sp4
AATCGAGTCGG
Description
- Set sort order for arrays to descending, numeric value of array values:
BEGIN{PROCINFO["sorted_in"] = "@val_num_desc"}
- Populate array count of column 2 values using value as index. And populate second array using column 1 as key and column 2 as value.
{a[$2]++; b[$1]=$2}
- Extract first array index as the target gene name for use in filtering species names from second array.
for (i in a){j++; if (j==1){tgt=i; break}}
- Use extracted gene name above to get species names for use in filtering input.fasta file:
for (ii in b){if(tgt==b[ii]){print ">"ii}}
- Use
grep
the-A 1
switch and the-f
switch to extract the desired fasta records.
grep -A 1 -f <(awk...) input.fasta`
- Pipe
|
to another grep to get rid of the--
between contiguous groups of matches and redirect the output to a file.
| grep -v '^--' > filtered.fasta
-or- Just using GNU awk
:
awk 'BEGIN{PROCINFO["sorted_in"] = "@val_num_desc"}
BEGINFILE{fnum++}
fnum==1{a[$2]++; b[$1]=$2}
ENDFILE{if(fnum==1){
for (i in a){j++; if (j==1){tgt=i; break}}}
for (ii in b){if(tgt!=b[ii]){delete b[ii]}}
}
fnum==2{for (i in b){if($0==">"i){
getline n;
printf "%s\n%s\n", $0, n}}
}' blast.txt input.fasta
Details
- Set sort order for arrays to descending, numeric value of array values:
BEGIN{PROCINFO["sorted_in"] = "@val_num_desc"}
- For each file processed, increment variable 'fnum'
BEGINFILE{fnum++}
- For first file processed, populate array count of column 2 values using value as index. And populate second array using column 1 as key and column 2 as value.
fnum==1{a[$2]++; b[$1]=$2}
- When finished processing first file, Extract first array index as the target gene name for use in filtering species names from second array. Then filter second array by deleting entries that do not match the target gene name.
ENDFILE{if(fnum==1){
for (i in a){j++; if (j==1){tgt=i; break}}}
for (ii in b){if(tgt!=b[ii]){delete b[ii]}}
}
- If processing the second file, then compare the current row to each entry in the filtered array and if there is a match print the current row and the following row and redirect the output to a file.
fnum==2{for (i in b){if($0==">"i){
getline n;
printf "%s\n%s\n", $0, n}}
}' blast.txt input.fasta > filtered.fasta
</details>
# 答案2
**得分**: 2
使用标准的UNIX工具和一个Perl单行命令来提取最常见的基因。使用`grep`和`cut`来从blast文件中提取物种信息。使用`grep`来提取FASTA头部,然后提取具有所需物种的头部,然后提取序列ID。最后,使用[`seqtk subseq`](https://github.com/lh3/seqtk) 来根据ID从FASTA文件中提取reads(使用 `conda` 安装它)。
```shell
cut -f2 blast.txt | sort | uniq -c | sort -k1,1n | tail -n1 | perl -lane 'print $F[1];' > most_frequent_gene.txt
grep -f most_frequent_gene.txt blast.txt | cut -f1 > species_w_most_frequent_gene.txt
grep '^>' input.fasta | grep -f species_w_most_frequent_gene.txt | grep -Po '^>\K\S+' > seqids_w_species_w_most_frequent_gene.txt
seqtk subseq input.fasta seqids_w_species_w_most_frequent_gene.txt > seqs_w_species_w_most_frequent_gene.fasta
在这个示例中使用的输入和输出文件如下:
head blast.txt input.fasta most_frequent_gene.txt seqids_w_species_w_most_frequent_gene.txt seqs_w_species_w_most_frequent_gene.fasta species_w_most_frequent_gene.txt
==> blast.txt <==
sp1 XXX
sp2 AAA
sp3 XXX
sp4 XXX
sp5 BBB
==> input.fasta <==
>seq1 sp1
ACTG
>seq2 sp2
ACTGA
>seq3 sp3
ACTGAC
>seq4 sp3
ACTGAC
==> most_frequent_gene.txt <==
XXX
==> seqids_w_species_w_most_frequent_gene.txt <==
seq1
seq3
seq4
==> seqs_w_species_w_most_frequent_gene.fasta <==
>seq1 sp1
ACTG
>seq3 sp3
ACTGAC
>seq4 sp3
ACTGAC
==> species_w_most_frequent_gene.txt <==
sp1
sp3
sp4
参考链接:
在这里,GNU的grep
使用以下选项:
-P
:使用Perl正则表达式。
-o
:仅打印匹配项(每行一个匹配项),而不是整个行。
-f file
:从file中获取模式,每行一个。
\K
:使正则表达式引擎在\K
之前匹配的所有内容保持不变,并在匹配时不包括在内。具体来说,在打印匹配项时忽略正则表达式的前面部分。
Perl单行命令使用这些命令行标志:
-e
:告诉Perl在行内查找代码,而不是在文件中查找。
-n
:逐行遍历输入,将其默认赋值给$_
。
-l
:在执行行内代码之前剥离输入行分隔符(在*NIX中默认为"\n"),并在打印时追加它。
-a
:在空格或-F
选项中指定的正则表达式上将$_
分割为数组@F
。
要安装seqtk
和许多其他未在此处列出的生物信息工具,请使用conda
,特别是miniconda
,例如:
conda create --channel bioconda --name seqtk seqtk
conda activate seqtk
# ... 在这里使用seqtk ...
conda deactivate
英文:
Use standard UNIX tools plus a perl one-liner to extract the most frequent gene. Use grep
and cut
to extract the species from the blast file. Use grep
to extract the FASTA headers, then extract those with the desired species, then to extract just the sequence ids. Finally, to extract reads from FASTA file by IDs, use seqtk subseq
(install it with conda
).
cut -f2 blast.txt | sort | uniq -c | sort -k1,1n | tail -n1 | perl -lane 'print $F[1];' > most_frequent_gene.txt
grep -f most_frequent_gene.txt blast.txt | cut -f1 > species_w_most_frequent_gene.txt
grep '^>' input.fasta | grep -f species_w_most_frequent_gene.txt | grep -Po '^>\K\S+' > seqids_w_species_w_most_frequent_gene.txt
seqtk subseq input.fasta seqids_w_species_w_most_frequent_gene.txt > seqs_w_species_w_most_frequent_gene.fasta
Inputs and outputs used in this example:
head blast.txt input.fasta most_frequent_gene.txt seqids_w_species_w_most_frequent_gene.txt seqs_w_species_w_most_frequent_gene.fasta species_w_most_frequent_gene.txt
==> blast.txt <==
sp1 XXX
sp2 AAA
sp3 XXX
sp4 XXX
sp5 BBB
==> input.fasta <==
>seq1 sp1
ACTG
>seq2 sp2
ACTGA
>seq3 sp3
ACTGAC
>seq4 sp3
ACTGAC
==> most_frequent_gene.txt <==
XXX
==> seqids_w_species_w_most_frequent_gene.txt <==
seq1
seq3
seq4
==> seqs_w_species_w_most_frequent_gene.fasta <==
>seq1 sp1
ACTG
>seq3 sp3
ACTGAC
>seq4 sp3
ACTGAC
==> species_w_most_frequent_gene.txt <==
sp1
sp3
sp4
See also:
Here, GNU grep
uses the following options:
-P
: Use Perl regexes.
-o
: Print the matches only (1 match per line), not the entire lines.
-f file
: Obtain patterns from file, one per line.
\K
: Cause the regex engine to "keep" everything it had matched prior to the \K
and not include it in the match. Specifically, ignore the preceding part of the regex when printing the match.
The Perl one-liner uses these command line flags:
-e
: Tells Perl to look for code in-line, instead of in a file.
-n
: Loop over the input one line at a time, assigning it to $_
by default.
-l
: Strip the input line separator ("\n"
on *NIX by default) before executing the code in-line, and append it when printing.
-a
: Split $_
into array @F
on whitespace or on the regex specified in -F
option.
perldoc perlrun
: how to execute the Perl interpreter: command line switchesperldoc perlrequick
: Perl regular expressions quick start
To install seqtk
and many other bioinformatics tools not listed here, use conda
, specifically miniconda
, for example:
conda create --channel bioconda --name seqtk seqtk
conda activate seqtk
# ... use seqtk here ...
conda deactivate
答案3
得分: 2
使用任何awk:
NR == FNR {
gene2species[$2] = ($2 in gene2species ? gene2species[$2] FS : "") $1
if ( ++cnt[$2] > maxCnt ) {
maxCnt = cnt[$2]
maxGene = $2
}
next
}
FNR == 1 {
split(gene2species[maxGene],tmp)
for ( i in tmp ) {
maxSpecies[">"tmp[i]]
}
}
/^>/ {
isMaxSpecies = ($1 in maxSpecies ? 1 : 0)
}
isMaxSpecies
例如:
awk -f tst.awk blastfile fastafile
英文:
Using any awk:
$ cat tst.awk
NR == FNR {
gene2species[$2] = ($2 in gene2species ? gene2species[$2] FS : "") $1
if ( ++cnt[$2] > maxCnt ) {
maxCnt = cnt[$2]
maxGene = $2
}
next
}
FNR == 1 {
split(gene2species[maxGene],tmp)
for ( i in tmp ) {
maxSpecies[">"tmp[i]]
}
}
/^>/ {
isMaxSpecies = ($1 in maxSpecies ? 1 : 0)
}
isMaxSpecies
<p>
$ awk -f tst.awk blastfile fastafile
>sp1
AATCGAGTCGT
>sp3
AATCGAGTAGT
>sp4
AATCGAGTCGG
The above assumes there will only be 1 species with the max count, as in the example you provided. If there can be multiple species with the same max count of genes then tweak it to:
$ cat tst.awk
NR == FNR {
gene2species[$2] = ($2 in gene2species ? gene2species[$2] FS : "") $1
if ( ++cnt[$2] > maxCnt ) {
maxCnt = cnt[$2]
}
next
}
FNR == 1 {
for ( gene in cnt ) {
if ( cnt[gene] == maxCnt ) {
split(gene2species[gene],tmp)
for ( i in tmp ) {
maxSpecies[">"tmp[i]]
}
}
}
}
/^>/ {
isMaxSpecies = ($1 in maxSpecies ? 1 : 0)
}
isMaxSpecies
e.g.:
$ cat blastfile
sp1 XXX
sp2 AAA
sp3 XXX
sp4 AAA
sp5 BBB
<p>
$ awk -f tst.awk blastfile fastafile
>sp1
AATCGAGTCGT
>sp2
AATCGCGTCGT
>sp3
AATCGAGTAGT
>sp4
AATCGAGTCGG
答案4
得分: 0
如果你想使用除了awk
之外的其他工具来完成,你可以执行以下操作:
#!/bin/bash
mostCommont=$(awk '{ print $2 }' ./input | sort | uniq -c | sort -nr | head -1 | awk '{print $2}')
grep "$mostCommont" ./input | awk '{print $1}'
或者作为一行命令:
grep "$(awk '{ print $2 }' ./input | sort | uniq -c | sort -nr | head -1 | awk '{print $2}')" ./input | awk '{print $1}'
只需将./input
替换为你的输入文件的路径。
根据你的示例数据,这将输出以下内容:
sp1
sp3
sp4
如果你想保留基因名称,只需删除最后的... | awk '{print $1}'
。
英文:
If you want to do it using other tools than awk
, you can do the following:
#!/bin/bash
mostCommont=$(awk '{ print $2 }' ./input | sort | uniq -c | sort -nr | head -1 | awk '{print $2}')
grep "$mostCommont" ./input | awk '{print $1}'
Or as a one-liner:
grep "$(awk '{ print $2 }' ./input | sort | uniq -c | sort -nr | head -1 | awk '{print $2}')" ./input | awk '{print $1}'
Just replace ./input
with the path to your input file.
Based on your sample data, this will print the following:
sp1
sp3
sp4
If you want to keep the gene name, just remove the final ... | awk '{print $1}
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论