Iterating through a file in a Nextflow process

Question

I am using Nextflow to create a pipeline, and I am having trouble with one of the processes.

I have a process that takes as input two regular files (output.kraken and $sequences) and a string ("Aspergillus", for example).

I have another file, 'fungal_species.txt', that contains multiple lines, and I want to iterate over this file and run the process on every line.

I tried this:

process fungal_reads_extraction {

     publishDir("${params.extraction_output}", mode: 'copy')

     input:
     path namesspecies

     output:
     path "*", emit: reads_extracted_out

     script:
     """
   while read -r species_name; do

//Extract lines from the Kraken file where the third word matches the species name
     awk -F'\t' -v "$species_name" 'BEGIN {OFS="\t"} \$3 ~ "$species_name" {print}' output.kraken > "${species_name}_lines.txt"

//Extract accessions from species lines
     awk -F'\t' '{print \$2}' "${species_name}_lines.txt" > "${species_name}_accessions.txt"

//Add "@" symbol to the beginning of each line in the accession file
     awk '{print "@" \$0}' "${species_name}_accessions.txt" > "${species_name}_full_accessions.txt"

//Extract reads assigned to the species
     cat $sequences | awk 'NR==FNR {accessions[\$1]=1; next} \$1 in accessions {print; getline; print; getline; print; getline; print}' "${species_name}_full_accessions.txt" - > "${species_name}_reads.fastq"

//Cleanup intermediate files
     rm "${species_name}_lines.txt" "${species_name}_accessions.txt" "${species_name}_full_accessions.txt"

   done < fungal_species.txt

     """

}

It seemed very logical to me to use a while loop and refer to each line as species_name.
But when I try to run the pipeline, that process fails with an error saying that species_name is unknown! This seems very strange. Can anyone help me, please? Maybe I am overlooking something important.

ERROR ~ Error executing process > 'fungal_reads_extraction (1)'

Caused by:
  No such variable: species_name -- Check script 'pipeline.nf' at line: 193

Thank you in advance!
Have a good day!

Answer 1

Score: 2

The $ in $species_name refers to a shell variable, not a Nextflow variable. It must be escaped to tell Nextflow not to interpolate it:

awk -F'\t' -v "\$species_name" 'BEGIN {..
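To make the escaping concrete, here is a minimal sketch of how the looping version of the process might look once every shell variable is escaped. It is not a drop-in fix: the input names species_list, kraken and sequences are assumptions (the original process only declares namesspecies), and the three intermediate awk steps are merged into one. Both $species_name and ${species_name} need the backslash, awk's -v option expects the form name=value, and comments inside the script block use # because the block is run by bash.

```nextflow
process fungal_reads_extraction {

    publishDir "${params.extraction_output}", mode: 'copy'

    input:
    path species_list   // assumed: fungal_species.txt, one species name per line
    path kraken         // assumed: the Kraken output file (output.kraken)
    path sequences      // assumed: an uncompressed FASTQ file

    output:
    path "*_reads.fastq", emit: reads_extracted_out

    script:
    """
    while read -r species_name; do
        # Keep Kraken lines whose third column matches the species and
        # print the second column (the read ID) prefixed with "@".
        awk -F'\\t' -v sp="\$species_name" '\$3 ~ sp {print "@" \$2}' ${kraken} > "\${species_name}_ids.txt"

        # Copy every 4-line FASTQ record whose header ID is in the list.
        awk 'NR==FNR {ids[\$1]=1; next} \$1 in ids {print; getline; print; getline; print; getline; print}' "\${species_name}_ids.txt" ${sequences} > "\${species_name}_reads.fastq"

        rm "\${species_name}_ids.txt"
    done < ${species_list}
    """
}
```

In this sketch the unescaped ${kraken}, ${sequences} and ${species_list} are filled in by Nextflow before the script runs, while everything written as \$ is passed through to bash and awk untouched.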

Furthermore, a better approach would be to split your fungal_species file and parallelize per species. Something like:

species_ch = Channel.fromPath(params.path_to_fungal_species).splitText().map{it.trim()}


(...)
process fungal_reads_extraction {
     input:
     val(one_name)
     (...)
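The elided parts of the snippet above could be filled in along the following lines. This is only a sketch under stated assumptions: params.kraken_output and params.sequences are hypothetical parameter names that do not appear in the question, the default paths are placeholders, and the awk commands are a condensed version of the steps in the question. Because one_name is a Nextflow variable here, it is written unescaped, while the awk field references still need a backslash.

```nextflow
params.path_to_fungal_species = 'fungal_species.txt'   // assumed default
params.kraken_output          = 'output.kraken'        // hypothetical parameter
params.sequences              = 'reads.fastq'          // hypothetical parameter
params.extraction_output      = 'extraction_output'    // assumed default

process fungal_reads_extraction {

    tag "${one_name}"
    publishDir "${params.extraction_output}", mode: 'copy'

    input:
    val one_name     // one species name per task
    path kraken
    path sequences

    output:
    path "*_reads.fastq", emit: reads_extracted_out

    script:
    """
    # Read IDs (prefixed with "@") of Kraken lines whose third column matches the species
    awk -F'\\t' -v sp="${one_name}" '\$3 ~ sp {print "@" \$2}' ${kraken} > ids.txt

    # Copy every 4-line FASTQ record whose header ID is in the list
    awk 'NR==FNR {ids[\$1]=1; next} \$1 in ids {print; getline; print; getline; print; getline; print}' ids.txt ${sequences} > "${one_name}_reads.fastq"
    """
}

workflow {
    species_ch = Channel.fromPath(params.path_to_fungal_species)
        .splitText()
        .map { it.trim() }
        .filter { it }      // drop blank lines

    // The two files are passed as plain values, so they are reused for every species
    fungal_reads_extraction(species_ch, file(params.kraken_output), file(params.sequences))
}
```

With this layout Nextflow launches one task per line of the species file, so the species can be processed in parallel and a failure for one species does not require rerunning the others.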
     

huangapple
  • Posted on July 11, 2023 at 00:24:37
  • Source: https://go.coder-hub.com/76655634.html