在另一个文件中匹配特定字符串的序列,过滤Fasta文件。

huangapple go评论64阅读模式
英文:

Filtering a fasta file with sequences that match a certain string in another file

问题

使用 BLAST,我得到了一个包含两列以制表符分隔的文件,一列是物种名称,另一列是基因名称(参考数据库中最相似基因的名称)。我的目标是在第一个文件中找到所有关联基因名称为最常见的物种名称,并使用这个物种列表过滤第二个文件(FASTA 格式),只保留这些物种的序列。

例如,如果这是 BLAST 文件:

sp1 XXX
sp2 AAA
sp3 XXX
sp4 XXX
sp5 BBB

匹配最常见基因的物种将是 sp1、sp3 和 sp4。然后在 FASTA 文件中,原本包含许多物种的,我想要保留只有 sp1、sp3 和 sp4 的序列。所以从这个:

>sp1
AATCGAGTCGT
>sp2
AATCGCGTCGT
>sp3
AATCGAGTAGT
>sp4
AATCGAGTCGG
>sp5
AAACGAGTCGT

到这个 FASTA 文件:

>sp1
AATCGAGTCGT
>sp3
AATCGAGTAGT
>sp4
AATCGAGTCGG
英文:

With BLAST I have obtained a file with two tab-separated columns, one with species names and the other with a gene name (the name of the most similar gene in a reference database). My goal is to find in the first file all the species names for which the associated gene name is the most common one, and use this list of species to filter a second file (FASTA-formatted) keeping only those species.

For example if this is BLAST file:

sp1 XXX
sp2 AAA
sp3 XXX
sp4 XXX
sp5 BBB

The species that match the most common gene would be sp1, sp3 and sp4. Then in the FASTA file, which originally contains many species, I would like to keep only the sequences of species sp1, sp3 and sp4. So going from this:

>sp1
AATCGAGTCGT
>sp2
AATCGCGTCGT
>sp3
AATCGAGTAGT
>sp4
AATCGAGTCGG
>sp5
AAACGAGTCGT

To this fasta file:

>sp1
AATCGAGTCGT
>sp3
AATCGAGTAGT
>sp4
AATCGAGTCGG

So far I've tried for a few days different approaches but for one reason or another it never works. I've tried creating this script (don.awk) to obtain the most common gene name (2nd column):

BEGIN {
  FS="\t"
}
{a[$2]++; if (a[$2] > comV) { comN=$2; comV=a[$2]} }
END {
    printf("%s\n", comN, comV)
}

And then running it with

nawk -f don.awk blastfile

Then I tried assigning the name to a variable (for example var1) and using it in

awk -F'\t' '$2 ~ /$var1/ {print $0}' blastfile > result

to first filter the original file. But for some reason I can't save the variables or I get errors. But I guess there must be an easier way.

答案1

得分: 3

使用 GNU awkgrep

grep -A 1 -f <(awk '
BEGIN{PROCINFO["sorted_in"] = "@val_num_desc";}
{a[$2]++; b[$1]=$2}
END{for (i in a){j++; if (j==1){tgt=i; break}}
for (ii in b){if(tgt==b[ii]){print ">"ii}}' blast.txt) input.fasta | grep -v '^--' > filtered.fasta

输出

>sp1
AATCGAGTCGT
>sp3
AATCGAGTAGT
>sp4
AATCGAGTCGG

描述

  • 将数组排序顺序设置为降序,以数组值的数值为基准:
BEGIN{PROCINFO["sorted_in"] = "@val_num_desc";}
  • 使用第二列值作为索引,填充第一个数组以统计第二列值的出现次数。同时,使用第一列作为键,第二列作为值填充第二个数组。
{a[$2]++; b[$1]=$2}
  • 提取第一个数组的第一个索引作为目标基因名称,用于从第二个数组中筛选物种名称。
for (i in a){j++; if (j==1){tgt=i; break}}
  • 使用上述提取的基因名来获取物种名称,用于筛选 input.fasta 文件中的物种:
for (ii in b){if(tgt==b[ii]){print ">"ii}}
  • 使用 grep-A 1 选项和 -f 选项来提取所需的 fasta 记录。
grep -A 1 -f <(awk...) input.fasta`
  • 使用管道符 | 到另一个 grep 来去除匹配组之间的 --,并将输出重定向到一个文件。
| grep -v '^--' > filtered.fasta

  • 或者只使用 GNU awk
awk 'BEGIN{PROCINFO["sorted_in"] = "@val_num_desc";}
BEGINFILE{fnum++}
fnum==1{a[$2]++; b[$1]=$2}
ENDFILE{if(fnum==1){
    for (i in a){j++; if (j==1){tgt=i; break}}} 
    for (ii in b){if(tgt!=b[ii]){delete b[ii]}}
}
fnum==2{for (i in b){if($0==">"i){
    getline n; 
    printf "%s\n%s\n", $0, n}}}' blast.txt input.fasta

细节

  • 将数组排序顺序设置为降序,以数组值的数值为基准:
BEGIN{PROCINFO["sorted_in"] = "@val_num_desc";}
  • 对于每个处理的文件,增加变量 fnum 的值。
BEGINFILE{fnum++}
  • 对于第一个处理的文件,使用第二列值作为索引,填充第一个数组以统计第二列值的出现次数。同时,使用第一列作为键,第二列作为值填充第二个数组。
fnum==1{a[$2]++; b[$1]=$2}
  • 当完成处理第一个文件时,提取第一个数组的第一个索引作为目标基因名称,用于从第二个数组中筛选物种名称。然后,通过删除不匹配目标基因名的条目来筛选第二个数组。
ENDFILE{if(fnum==1){
    for (i in a){j++; if (j==1){tgt=i; break}}} 
    for (ii in b){if(tgt!=b[ii]){delete b[ii]}}
}
  • 如果处理第二个文件,则将当前行与筛选后的数组中的每个条目进行比较,如果有匹配,则打印当前行和接下来的一行,并将输出重定向到一个文件。
fnum==2{for (i in b){if($0==">"i){
    getline n; 
    printf "%s\n%s\n", $0, n}}}' blast.txt input.fasta > filtered.fasta
英文:

Using GNU awk with grep:

grep -A 1 -f &lt;(awk &#39;
BEGIN{PROCINFO[&quot;sorted_in&quot;] = &quot;@val_num_desc&quot;}
{a[$2]++; b[$1]=$2}
END{for (i in a){j++; if (j==1){tgt=i; break}}
for (ii in b){if(tgt==b[ii]){print &quot;&gt;&quot;ii}}}&#39; blast.txt) input.fasta | grep -v &#39;^--&#39; &gt; filtered.fasta

Output

&gt;sp1
AATCGAGTCGT
&gt;sp3
AATCGAGTAGT
&gt;sp4
AATCGAGTCGG

Description

  • Set sort order for arrays to descending, numeric value of array values:
BEGIN{PROCINFO[&quot;sorted_in&quot;] = &quot;@val_num_desc&quot;}
  • Populate array count of column 2 values using value as index. And populate second array using column 1 as key and column 2 as value.
{a[$2]++; b[$1]=$2}
  • Extract first array index as the target gene name for use in filtering species names from second array.
for (i in a){j++; if (j==1){tgt=i; break}}
  • Use extracted gene name above to get species names for use in filtering input.fasta file:
for (ii in b){if(tgt==b[ii]){print &quot;&gt;&quot;ii}}
  • Use grep the -A 1 switch and the -f switch to extract the desired fasta records.
grep -A 1 -f &lt;(awk...) input.fasta`
  • Pipe | to another grep to get rid of the -- between contiguous groups of matches and redirect the output to a file.
| grep -v &#39;^--&#39; &gt; filtered.fasta

-or- Just using GNU awk:

awk &#39;BEGIN{PROCINFO[&quot;sorted_in&quot;] = &quot;@val_num_desc&quot;}
BEGINFILE{fnum++}
fnum==1{a[$2]++; b[$1]=$2}
ENDFILE{if(fnum==1){
    for (i in a){j++; if (j==1){tgt=i; break}}} 
    for (ii in b){if(tgt!=b[ii]){delete b[ii]}}
}
fnum==2{for (i in b){if($0==&quot;&gt;&quot;i){
    getline n; 
    printf &quot;%s\n%s\n&quot;, $0, n}}
}&#39; blast.txt input.fasta

Details

  • Set sort order for arrays to descending, numeric value of array values:
BEGIN{PROCINFO[&quot;sorted_in&quot;] = &quot;@val_num_desc&quot;}
  • For each file processed, increment variable 'fnum'
BEGINFILE{fnum++}
  • For first file processed, populate array count of column 2 values using value as index. And populate second array using column 1 as key and column 2 as value.
fnum==1{a[$2]++; b[$1]=$2}
  • When finished processing first file, Extract first array index as the target gene name for use in filtering species names from second array. Then filter second array by deleting entries that do not match the target gene name.
ENDFILE{if(fnum==1){
    for (i in a){j++; if (j==1){tgt=i; break}}} 
    for (ii in b){if(tgt!=b[ii]){delete b[ii]}}
}
  • If processing the second file, then compare the current row to each entry in the filtered array and if there is a match print the current row and the following row and redirect the output to a file.
fnum==2{for (i in b){if($0==&quot;&gt;&quot;i){
    getline n; 
    printf &quot;%s\n%s\n&quot;, $0, n}}
}&#39; blast.txt input.fasta &gt; filtered.fasta

</details>



# 答案2
**得分**: 2

使用标准的UNIX工具和一个Perl单行命令来提取最常见的基因。使用`grep`和`cut`来从blast文件中提取物种信息。使用`grep`来提取FASTA头部,然后提取具有所需物种的头部,然后提取序列ID。最后,使用[`seqtk subseq`](https://github.com/lh3/seqtk) 来根据ID从FASTA文件中提取reads(使用 `conda` 安装它)。

```shell
cut -f2 blast.txt | sort | uniq -c | sort -k1,1n | tail -n1 | perl -lane 'print $F[1];' > most_frequent_gene.txt

grep -f most_frequent_gene.txt blast.txt | cut -f1 > species_w_most_frequent_gene.txt

grep '^>' input.fasta | grep -f species_w_most_frequent_gene.txt | grep -Po '^>\K\S+' > seqids_w_species_w_most_frequent_gene.txt

seqtk subseq input.fasta seqids_w_species_w_most_frequent_gene.txt > seqs_w_species_w_most_frequent_gene.fasta

在这个示例中使用的输入和输出文件如下:

head blast.txt input.fasta most_frequent_gene.txt seqids_w_species_w_most_frequent_gene.txt seqs_w_species_w_most_frequent_gene.fasta species_w_most_frequent_gene.txt
==> blast.txt <==
sp1     XXX
sp2     AAA
sp3     XXX
sp4     XXX
sp5     BBB

==> input.fasta <==
>seq1 sp1
ACTG
>seq2 sp2
ACTGA
>seq3 sp3
ACTGAC
>seq4 sp3
ACTGAC

==> most_frequent_gene.txt <==
XXX

==> seqids_w_species_w_most_frequent_gene.txt <==
seq1
seq3
seq4

==> seqs_w_species_w_most_frequent_gene.fasta <==
>seq1 sp1
ACTG
>seq3 sp3
ACTGAC
>seq4 sp3
ACTGAC

==> species_w_most_frequent_gene.txt <==
sp1
sp3
sp4

参考链接:

在这里,GNU的grep使用以下选项:
-P:使用Perl正则表达式。
-o:仅打印匹配项(每行一个匹配项),而不是整个行。
-f file:从file中获取模式,每行一个。

\K:使正则表达式引擎在\K之前匹配的所有内容保持不变,并在匹配时不包括在内。具体来说,在打印匹配项时忽略正则表达式的前面部分。

Perl单行命令使用这些命令行标志:
-e:告诉Perl在行内查找代码,而不是在文件中查找。
-n:逐行遍历输入,将其默认赋值给$_
-l:在执行行内代码之前剥离输入行分隔符(在*NIX中默认为"\n"),并在打印时追加它。
-a:在空格或-F选项中指定的正则表达式上将$_分割为数组@F

要安装seqtk和许多其他未在此处列出的生物信息工具,请使用conda,特别是miniconda,例如:

conda create --channel bioconda --name seqtk seqtk
conda activate seqtk
# ... 在这里使用seqtk ...
conda deactivate
英文:

Use standard UNIX tools plus a perl one-liner to extract the most frequent gene. Use grep and cut to extract the species from the blast file. Use grep to extract the FASTA headers, then extract those with the desired species, then to extract just the sequence ids. Finally, to extract reads from FASTA file by IDs, use seqtk subseq (install it with conda).

cut -f2 blast.txt | sort | uniq -c | sort -k1,1n | tail -n1 | perl -lane &#39;print $F[1];&#39; &gt; most_frequent_gene.txt

grep -f most_frequent_gene.txt blast.txt | cut -f1 &gt; species_w_most_frequent_gene.txt

grep &#39;^&gt;&#39; input.fasta | grep -f species_w_most_frequent_gene.txt | grep -Po &#39;^&gt;\K\S+&#39; &gt; seqids_w_species_w_most_frequent_gene.txt

seqtk subseq input.fasta seqids_w_species_w_most_frequent_gene.txt &gt; seqs_w_species_w_most_frequent_gene.fasta

Inputs and outputs used in this example:

head blast.txt input.fasta most_frequent_gene.txt seqids_w_species_w_most_frequent_gene.txt seqs_w_species_w_most_frequent_gene.fasta species_w_most_frequent_gene.txt
==&gt; blast.txt &lt;==
sp1     XXX
sp2     AAA
sp3     XXX
sp4     XXX
sp5     BBB

==&gt; input.fasta &lt;==
&gt;seq1 sp1
ACTG
&gt;seq2 sp2
ACTGA
&gt;seq3 sp3
ACTGAC
&gt;seq4 sp3
ACTGAC

==&gt; most_frequent_gene.txt &lt;==
XXX

==&gt; seqids_w_species_w_most_frequent_gene.txt &lt;==
seq1
seq3
seq4

==&gt; seqs_w_species_w_most_frequent_gene.fasta &lt;==
&gt;seq1 sp1
ACTG
&gt;seq3 sp3
ACTGAC
&gt;seq4 sp3
ACTGAC

==&gt; species_w_most_frequent_gene.txt &lt;==
sp1
sp3
sp4

See also:

Here, GNU grep uses the following options:
-P : Use Perl regexes.
-o : Print the matches only (1 match per line), not the entire lines.
-f file : Obtain patterns from file, one per line.

\K : Cause the regex engine to "keep" everything it had matched prior to the \K and not include it in the match. Specifically, ignore the preceding part of the regex when printing the match.

The Perl one-liner uses these command line flags:
-e : Tells Perl to look for code in-line, instead of in a file.
-n : Loop over the input one line at a time, assigning it to $_ by default.
-l : Strip the input line separator (&quot;\n&quot; on *NIX by default) before executing the code in-line, and append it when printing.
-a : Split $_ into array @F on whitespace or on the regex specified in -F option.

To install seqtk and many other bioinformatics tools not listed here, use conda, specifically miniconda, for example:

conda create --channel bioconda --name seqtk seqtk
conda activate seqtk
# ... use seqtk here ...
conda deactivate

答案3

得分: 2

使用任何awk:

NR == FNR {
    gene2species[$2] = ($2 in gene2species ? gene2species[$2] FS : "") $1
    if ( ++cnt[$2] > maxCnt ) {
        maxCnt = cnt[$2]
        maxGene = $2
    }
    next
}
FNR == 1 {
    split(gene2species[maxGene],tmp)
    for ( i in tmp ) {
        maxSpecies[">"tmp[i]]
    }
}
/^>/ {
    isMaxSpecies = ($1 in maxSpecies ? 1 : 0)
}
isMaxSpecies

例如:

awk -f tst.awk blastfile fastafile
英文:

Using any awk:

$ cat tst.awk
NR == FNR {
    gene2species[$2] = ($2 in gene2species ? gene2species[$2] FS : &quot;&quot;) $1
    if ( ++cnt[$2] &gt; maxCnt ) {
        maxCnt = cnt[$2]
        maxGene = $2
    }
    next
}
FNR == 1 {
    split(gene2species[maxGene],tmp)
    for ( i in tmp ) {
        maxSpecies[&quot;&gt;&quot;tmp[i]]
    }
}
/^&gt;/ {
    isMaxSpecies = ($1 in maxSpecies ? 1 : 0)
}
isMaxSpecies

<p>

$ awk -f tst.awk blastfile fastafile
&gt;sp1
AATCGAGTCGT
&gt;sp3
AATCGAGTAGT
&gt;sp4
AATCGAGTCGG

The above assumes there will only be 1 species with the max count, as in the example you provided. If there can be multiple species with the same max count of genes then tweak it to:

$ cat tst.awk
NR == FNR {
    gene2species[$2] = ($2 in gene2species ? gene2species[$2] FS : &quot;&quot;) $1
    if ( ++cnt[$2] &gt; maxCnt ) {
        maxCnt = cnt[$2]
    }
    next
}
FNR == 1 {
    for ( gene in cnt ) {
        if ( cnt[gene] == maxCnt ) {
            split(gene2species[gene],tmp)
            for ( i in tmp ) {
                maxSpecies[&quot;&gt;&quot;tmp[i]]
            }
        }
    }
}
/^&gt;/ {
    isMaxSpecies = ($1 in maxSpecies ? 1 : 0)
}
isMaxSpecies

e.g.:

$ cat blastfile
sp1 XXX
sp2 AAA
sp3 XXX
sp4 AAA
sp5 BBB

<p>

$ awk -f tst.awk blastfile fastafile
&gt;sp1
AATCGAGTCGT
&gt;sp2
AATCGCGTCGT
&gt;sp3
AATCGAGTAGT
&gt;sp4
AATCGAGTCGG

答案4

得分: 0

如果你想使用除了awk之外的其他工具来完成,你可以执行以下操作:

#!/bin/bash

mostCommont=$(awk '{ print $2 }' ./input | sort | uniq -c | sort -nr | head -1 | awk '{print $2}')
grep "$mostCommont" ./input | awk '{print $1}'

或者作为一行命令:

grep "$(awk '{ print $2 }' ./input | sort | uniq -c | sort -nr | head -1 | awk '{print $2}')" ./input | awk '{print $1}'

只需将./input替换为你的输入文件的路径。

根据你的示例数据,这将输出以下内容:

sp1
sp3
sp4

如果你想保留基因名称,只需删除最后的... | awk '{print $1}'

英文:

If you want to do it using other tools than awk, you can do the following:

#!/bin/bash

mostCommont=$(awk &#39;{ print $2 }&#39; ./input | sort | uniq -c | sort -nr | head -1 | awk &#39;{print $2}&#39;)
grep &quot;$mostCommont&quot; ./input | awk &#39;{print $1}&#39;

Or as a one-liner:

grep &quot;$(awk &#39;{ print $2 }&#39; ./input | sort | uniq -c | sort -nr | head -1 | awk &#39;{print $2}&#39;)&quot; ./input | awk &#39;{print $1}&#39;

Just replace ./input with the path to your input file.

Based on your sample data, this will print the following:

sp1
sp3
sp4

If you want to keep the gene name, just remove the final ... | awk &#39;{print $1}

huangapple
  • 本文由 发表于 2023年3月31日 19:15:22
  • 转载请务必保留本文链接:https://go.coder-hub.com/75897905.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定