Iterating through a file in a Nextflow process

Question

I am building a pipeline with Nextflow and am having trouble with one of the processes.

The process takes two regular files (output.kraken and $sequences) and a string ("Aspergillus", for example) as input.

I have another file, fungal_species.txt, that contains multiple lines, and I want to iterate over it and run the process on every line.

This is what I tried:

  process fungal_reads_extraction {
      publishDir("${params.extraction_output}" , mode: 'copy')
      input:
      path namesspecies
      output:
      path "*" , emit: reads_extracted_out
      script:
      """
      while read -r species_name; do
      //Extract lines from the Kraken file where the third word matches the species name
      awk -F'\t' -v "$species_name" 'BEGIN {OFS="\t"} \$3 ~ "$species_name" {print}' output.kraken > "${species_name}_lines.txt"
      //Extract accessions from species lines
      awk -F'\t' '{print \$2}' "${species_name}_lines.txt" > "${species_name}_accessions.txt"
      //Add "@" symbol to the beginning of each line in the accession file
      awk '{print "@" \$0}' "${species_name}_accessions.txt" > "${species_name}_full_accessions.txt"
      //Extract reads assigned to the species
      cat $sequences | awk 'NR==FNR {accessions[\$1]=1; next} \$1 in accessions {print; getline; print; getline; print; getline; print}' "${species_name}_full_accessions.txt" - > "${species_name}_reads.fastq"
      //Cleanup intermediate files
      rm "${species_name}_lines.txt" "${species_name}_accessions.txt" "${species_name}_full_accessions.txt"
      done < fungal_species.txt
      """
  }

It seemed logical to me to use a while loop and refer to each line as species_name. But when I run the pipeline, the process fails with an error saying that species_name is unknown. This seems bizarre to me. Can anyone help? Maybe I am overlooking something important.

  ERROR ~ Error executing process > 'fungal_reads_extraction (1)'
  Caused by:
  No such variable: species_name -- Check script 'pipeline.nf' at line: 193

Thank you in advance, and have a good day!

Answer 1

Score: 2

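The $ in $species_name refers to a shell variable, not a Nextflow variable. Inside a double-quoted script block Nextflow tries to interpolate every $name and ${name} itself, which is why it reports "No such variable: species_name". Escape the dollar sign to tell Nextflow to leave it for the shell, for example: awk -F'\t' -v "\$species_name" 'BEGIN {.. The same applies to every other use of the variable in the script, such as "\${species_name}_lines.txt".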
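To make the escaping rule concrete, here is a minimal sketch of the looping process with every shell variable escaped. It is an illustration under assumptions, not the asker's exact pipeline: the three declared inputs, the output glob, and the parameters params.kraken_output and params.sequences are hypothetical names introduced here, and the intermediate files are collapsed into one. It also passes the species name to awk as -v name=..., because awk's -v option expects an assignment rather than a bare value.

```nextflow
// Sketch only: input names and params.kraken_output / params.sequences are assumptions.
process fungal_reads_extraction {
    publishDir "${params.extraction_output}", mode: 'copy'

    input:
    path kraken        // Kraken output, assumed tab-separated: read ID in column 2, name in column 3
    path sequences     // FASTQ file, assumed to use four-line records
    path species_list  // one species name per line

    output:
    path "*_reads.fastq", emit: reads_extracted_out

    script:
    """
    while read -r species_name; do
        # Skip blank lines in the species list
        [ -n "\$species_name" ] || continue

        # \$species_name, \$2 and \$3 are escaped, so bash and awk see them;
        # ${kraken} is not escaped, so Nextflow fills in the staged file name
        awk -F'\\t' -v name="\$species_name" '\$3 ~ name {print "@" \$2}' ${kraken} > accessions.txt

        # Copy the four-line FASTQ record of every read listed in accessions.txt
        awk 'NR==FNR {wanted[\$1]; next} (\$1 in wanted) {print; for (i = 0; i < 3; i++) {getline; print}}' accessions.txt ${sequences} > "\${species_name}_reads.fastq"

        rm accessions.txt
    done < ${species_list}
    """
}

workflow {
    fungal_reads_extraction(
        file(params.kraken_output),
        file(params.sequences),
        file(params.path_to_fungal_species)
    )
}
```

The comments inside the script block use # rather than //, since those lines are executed by bash. Declaring the Kraken file, the reads, and the species list as inputs lets Nextflow stage them into the task's work directory instead of relying on files that happen to be in the launch directory.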

Furthermore, a better approach would be to split fungal_species.txt into one item per line and run the process in parallel for each species. Something like:

  species_ch = Channel.fromPath(params.path_to_fungal_species).splitText().map{it.trim()}
  (...)
  process fungal_reads_extraction {
      input:
      val(one_name)
      (...)
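The parts elided with (...) above could be filled in along the following lines. This is a hedged sketch rather than the answerer's actual code: the process body, the extra kraken and sequences inputs, and the parameters params.kraken_output and params.sequences are assumptions made for illustration. Because the species name now arrives as a Nextflow value, ${one_name} is deliberately left unescaped, while the awk field references still need the backslash.

```nextflow
// Sketch only: params.kraken_output and params.sequences are hypothetical parameter names.
process fungal_reads_extraction {
    tag "${one_name}"
    publishDir "${params.extraction_output}", mode: 'copy'

    input:
    val one_name       // one species name per task
    path kraken        // Kraken output, assumed tab-separated: read ID in column 2, name in column 3
    path sequences     // FASTQ file, assumed to use four-line records

    output:
    path "*_reads.fastq", emit: reads_extracted_out

    script:
    """
    # ${one_name} is substituted by Nextflow before bash runs; \$2 and \$3 are left for awk
    awk -F'\\t' -v name="${one_name}" '\$3 ~ name {print "@" \$2}' ${kraken} > accessions.txt

    # Copy the four-line FASTQ record of every read listed in accessions.txt
    awk 'NR==FNR {wanted[\$1]; next} (\$1 in wanted) {print; for (i = 0; i < 3; i++) {getline; print}}' accessions.txt ${sequences} > "${one_name}_reads.fastq"
    """
}

workflow {
    // One channel item per non-empty line of the species file
    species_ch = Channel.fromPath(params.path_to_fungal_species)
        .splitText()
        .map { it.trim() }
        .filter { it }

    fungal_reads_extraction(
        species_ch,
        file(params.kraken_output),
        file(params.sequences)
    )
}
```

Each species becomes its own task, so Nextflow can schedule them in parallel and resume individual species after a failure, which the single while loop cannot do.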

Posted by huangapple on July 11, 2023
Link to this post: https://go.coder-hub.com/76655634.html