shell - Iterating inside a Nextflow process

y3bcpkx1  posted on 2023-08-07  in  Shell

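I am trying to create a process in Nextflow that takes two inputs: krakenfile, which is used directly, and fungalname, which contains multiple lines, each holding the name of one species.
I want to iterate over the fungalname file line by line and, for each line (species), find all lines in krakenfile whose third column contains that name.
For example, if my fungalname contains the following: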

Aspergillus fumigatus
Candida albicans

and krakenfile contains:

xxxx  548   Aspergillus fumigatus 
zzzz  566   Candida albicans 
aaaa  598   Aspergillus fumigatus
kkk   888   Candida albicans


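My output should be two files, Aspergillus_fumigatus_lines.txt and Candida_albicans_lines.txt, each containing 2 lines (for the example above).
The problem is that my output files are always empty, even though I am sure about the format and the location of my input files. I think the problem lies in the process itself. Can anyone please help? Here is my code: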

params.fungaalnames="/home/aziz/pipeline/results/extraction/fungal_species.txt"
params.krakeenfile="/home/aziz/pipeline/results/classification_before_filtration/output.kraken"

fungalnames = file(params.fungaalnames)
krakenfile = file(params.krakeenfile)

process fungal_reads_extraction {
     
     input:

     file fungalnames
     file krakenfile

     output:
     path "*" , emit: reads_extracted_out
     
     script:
     """
while IFS= read -r species_name; do
  awk -F'\t' '\$3 ~ "'\$species_name'" {print}' $krakenfile > "\${species_name}_lines.txt"
done < $fungalnames
     """

}

workflow {

fungalnames_ch=Channel.fromPath(params.fungaalnames)
krakenfile_ch=Channel.fromPath(params.krakeenfile)

fungal_reads_extraction(fungalnames_ch, krakenfile_ch) | view
}

uxh89sit  #1

This focuses only on the awk script; I will leave it to the OP to (re)format it as needed for inclusion in the Nextflow script file...
One awk approach:

awk '
BEGIN        { FS="\t" }
FNR==NR      { sp = $0                            # 1st file: copy specie
               gsub(/[[:space:]]/,"_",sp)         # replace spaces with "_"
               specie[$0]                         # save specie name as index to specie[] array
               fname[$0] = sp "_lines.txt"        # create filename associated with this specie
               next
             }
$3 in specie {                                    # 2nd file: if 3rd column is index in specie[] array then ...
               print $0 > fname[$3]               # print current line to associated file
             # close(fname[$3])                   # if awk complains of too many open file descriptors: uncomment, and change ">" above to ">>"
             }
' fungalname krakenfile


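Notes:

  • this replaces the OP's current while/read/awk loop
  • if the number of unique species is "too large", some versions of awk may complain about too many open file descriptors; uncommenting the close(fname[$3]) command can alleviate this, but the script will run slower. Because reopening a closed file with > truncates it, the print redirection must then also be changed to >> (which in turn assumes no stale *_lines.txt files from an earlier run)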
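
For comparison, here is a minimal sketch of how the OP's original loop could be repaired instead of replaced. The likely cause of the empty files is that the unquoted $species_name is word-split on the space in "Aspergillus fumigatus", so awk receives a broken program while the redirection still creates the (empty) output file. The sketch passes the name in with -v, uses an exact match on column 3, and assumes the sample data from the question with no trailing whitespace in that column. The temporary directory and the file names fungalname and krakenfile are only for illustration.

```shell
#!/bin/sh
# Work in a scratch directory so the demo does not touch real data.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Sample inputs modelled on the question (krakenfile is tab-separated).
printf '%s\n' 'Aspergillus fumigatus' 'Candida albicans' > fungalname
printf '%s\t%s\t%s\n' \
  xxxx 548 'Aspergillus fumigatus' \
  zzzz 566 'Candida albicans' \
  aaaa 598 'Aspergillus fumigatus' \
  kkk  888 'Candida albicans' > krakenfile

while IFS= read -r species_name; do
  [ -n "$species_name" ] || continue                    # skip blank lines
  # Replace spaces with underscores for the output file name.
  out="$(printf '%s' "$species_name" | tr ' ' '_')_lines.txt"
  # Pass the name with -v rather than splicing it into the awk program text.
  awk -F'\t' -v name="$species_name" '$3 == name' krakenfile > "$out"
done < fungalname
```

This still scans krakenfile once per species, so the single-pass awk script above should scale better for long species lists. Inside a Nextflow script block, every shell or awk `$` in the loop would additionally need to be escaped as `\$`.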

This produces:

$ head *_lines.txt
==> Aspergillus_fumigatus_lines.txt <==
xxxx    548     Aspergillus fumigatus
aaaa    598     Aspergillus fumigatus

==> Candida_albicans_lines.txt <==
zzzz    566     Candida albicans
kkk     888     Candida albicans
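
Since the answer leaves the Nextflow integration to the OP, one possible route is sketched here: wrap the awk program in a small shell helper so that none of its `$` characters need escaping inside a Nextflow script block. The function name split_by_species is hypothetical, and the awk body is a slightly condensed form of the script above (a single fname[] array serves as both lookup and file-name map). The demo data is the sample from the question.

```shell
#!/bin/sh
# Hypothetical helper: split_by_species <species_list> <kraken_file>
# Writes one <Species_name>_lines.txt file per species into the current directory.
split_by_species() {
  awk '
    BEGIN   { FS = "\t" }
    FNR==NR { sp = $0                        # 1st file: species names
              gsub(/[[:space:]]/, "_", sp)   # spaces -> underscores
              fname[$0] = sp "_lines.txt"    # output file for this species
              next }
    $3 in fname { print > fname[$3] }        # 2nd file: route line by column 3
  ' "$1" "$2"
}

# Demo in a scratch directory with the sample data from the question.
demo=$(mktemp -d)
cd "$demo" || exit 1
printf '%s\n' 'Aspergillus fumigatus' 'Candida albicans' > fungalname
printf '%s\t%s\t%s\n' \
  xxxx 548 'Aspergillus fumigatus' \
  zzzz 566 'Candida albicans' \
  aaaa 598 'Aspergillus fumigatus' \
  kkk  888 'Candida albicans' > krakenfile

split_by_species fungalname krakenfile
```

If the body of the function were saved as an executable script in the pipeline's bin/ directory (which Nextflow adds to the task PATH), the process script block could shrink to a single call with $fungalnames and $krakenfile as arguments. Narrowing the output declaration to "*_lines.txt" instead of "*" may also be worth considering; both suggestions are assumptions about the OP's setup rather than part of the original answer.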


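If the number of species is "too large" and requires an excessive number of close() calls, we can pre-sort krakenfile; reducing the number of close() calls improves performance: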

awk '
BEGIN        { FS="\t" }
FNR==NR      { sp = $0
               gsub(/[[:space:]]/,"_",sp)
               specie[$0]
               fname[$0] = sp "_lines.txt"
               next
             }
$3 in specie { if (prev != $3 )                   # if this is a new specie then ...
                  close(fname[prev])              # close the previous file
               prev = $3                          # save the new specie
               print $0 > fname[$3]               # print current line to associated file
             }
' fungalname <(sort -t$'\t' -k3,3 krakenfile)


This produces:

$ head *_lines.txt
==> Aspergillus_fumigatus_lines.txt <==
xxxx    548     Aspergillus fumigatus
aaaa    598     Aspergillus fumigatus

==> Candida_albicans_lines.txt <==
zzzz    566     Candida albicans
kkk     888     Candida albicans

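Notes:

  • the order of the lines in the *_lines.txt files is based on the order in which the lines are read from krakenfile
  • if the *_lines.txt contents need to be sorted by additional columns, the OP can include more -k#,# arguments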
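
As an illustration of the extra -k arguments, the following sketch sorts the sample data by species (column 3) and then by the numeric value in column 2, descending. The choice of column 2 as the secondary key and the output name sorted.kraken are assumptions made for the example; a literal tab is built with printf so the command does not depend on bash's $'\t' syntax.

```shell
#!/bin/sh
tmp=$(mktemp -d)
cd "$tmp" || exit 1

# Sample data from the question (tab-separated).
printf '%s\t%s\t%s\n' \
  xxxx 548 'Aspergillus fumigatus' \
  zzzz 566 'Candida albicans' \
  aaaa 598 'Aspergillus fumigatus' \
  kkk  888 'Candida albicans' > krakenfile

tab=$(printf '\t')
# Primary key: column 3 (species). Secondary key: column 2, numeric, reversed.
sort -t"$tab" -k3,3 -k2,2nr krakenfile > sorted.kraken
cat sorted.kraken
```

The sorted stream can be fed to the awk script in place of the plain sort -k3,3 output; each *_lines.txt file then lists its lines in descending order of column 2.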
