Filtering a tuple with a channel.collect in Nextflow
Question
I've created a tuple from the output of a channel.
ch_groups = INPUT_CHECK_GEX.out.group_samplesheet
    .splitCsv( header:true, sep:',', strip:true )
    .map { row ->
        def keyID = row["keyid"]
        def sampleID = row["sampleid"]
        return [keyID, sampleID]
    }
    .groupTuple()
ch_groups.view()
This is the output
[group1-group2, [sample1, sample2, sample3, sample4]]
I have another channel that also emits tuples. SEURAT_SINGLE.out.rds.view() prints:
[sample3, /gpfs/gsfs10/users/CCBR_Pipeliner/Pipelines/TechDev_scRNASeq_Dev2023/work/b1/92baee56b862a2187f1459e1e66a4d/sample3_seurat_object.rds]
[sample7, /gpfs/gsfs10/users/CCBR_Pipeliner/Pipelines/TechDev_scRNASeq_Dev2023/work/37/6df9873421a81170aa8156c303bb3c/sample7_seurat_object.rds]
[sample6, /gpfs/gsfs10/users/CCBR_Pipeliner/Pipelines/TechDev_scRNASeq_Dev2023/work/7a/ebe2243cd6dbc81c2374be9e80c24b/sample6_seurat_object.rds]
[sample1, /gpfs/gsfs10/users/CCBR_Pipeliner/Pipelines/TechDev_scRNASeq_Dev2023/work/65/888f0fb28a20fe1c034e8da8666eee/sample1_seurat_object.rds]
[sample5, /gpfs/gsfs10/users/CCBR_Pipeliner/Pipelines/TechDev_scRNASeq_Dev2023/work/78/a0ce478d03da5fb4f67b34fcd194e4/sample5_seurat_object.rds]
[sample2, /gpfs/gsfs10/users/CCBR_Pipeliner/Pipelines/TechDev_scRNASeq_Dev2023/work/ec/98b2b1e045db5b0664233052e28e37/sample2_seurat_object.rds]
[sample4, /gpfs/gsfs10/users/CCBR_Pipeliner/Pipelines/TechDev_scRNASeq_Dev2023/work/44/5c38598986b3a48e05a4bcb5c72c73/sample4_seurat_object.rds]
I need a list of all the RDS files associated with each group in the first channel. For example, for [group1-group2, [sample1, sample2, sample3, sample4]] I need this list:
/gpfs/gsfs10/users/CCBR_Pipeliner/Pipelines/TechDev_scRNASeq_Dev2023/work/65/888f0fb28a20fe1c034e8da8666eee/sample1_seurat_object.rds
/gpfs/gsfs10/users/CCBR_Pipeliner/Pipelines/TechDev_scRNASeq_Dev2023/work/ec/98b2b1e045db5b0664233052e28e37/sample2_seurat_object.rds
/gpfs/gsfs10/users/CCBR_Pipeliner/Pipelines/TechDev_scRNASeq_Dev2023/work/b1/92baee56b862a2187f1459e1e66a4d/sample3_seurat_object.rds
/gpfs/gsfs10/users/CCBR_Pipeliner/Pipelines/TechDev_scRNASeq_Dev2023/work/44/5c38598986b3a48e05a4bcb5c72c73/sample4_seurat_object.rds
EDIT, following Steve's advice:
Using his approach I got the desired result for one contrast. As soon as I added more contrasts, though, the output still contained only the first result.
For example, adding a contrast to INPUT_CHECK_GEX.out.group_samplesheet:
ch_groups = INPUT_CHECK_GEX.out.group_samplesheet
    .splitCsv( header:true, sep:',', strip:true )
    .map { row ->
        def keyID = row["keyid"]
        def sampleID = row["sampleid"]
        return [keyID, sampleID]
    }
    .groupTuple()
ch_groups.view()
Output of ch_groups.view():
[group1-group2, [sample1, sample2, sample3, sample4]]
[group1-group2-group3, [sample1, sample2, sample3, sample4, sample5, sample6]]
Running his suggestion still gives only this output, ignoring the added contrast:
[group1-group2, [/gpfs/gsfs10/users/CCBR_Pipeliner/Pipelines/TechDev_scRNASeq_Dev2023/work/a6/02a8bc99a1a0ea3549d774145facbe/sample3_seurat_object.rds, /gpfs/gsfs10/users/CCBR_Pipeliner/Pipelines/TechDev_scRNASeq_Dev2023/work/bf/2f9f884fe8868ee91ce077d598bd5d/sample4_seurat_object.rds, /gpfs/gsfs10/users/CCBR_Pipeliner/Pipelines/TechDev_scRNASeq_Dev2023/work/1f/a18fc5718d3a7869da2340149254e3/sample2_seurat_object.rds, /gpfs/gsfs10/users/CCBR_Pipeliner/Pipelines/TechDev_scRNASeq_Dev2023/work/8e/99e42901219cd3eba0981987033145/sample1_seurat_object.rds]]
I attempted to fix this with the following, but while it brings in the second contrast, it doesn't handle samples that belong to more than one contrast (i.e. sample1 is in BOTH contrasts):
INPUT_CHECK_GEX.out.group_samplesheet
    .splitCsv( header:true, sep:',', strip:true )
    .map { row ->
        def key = row["keyid"]
        def sample = row["sampleid"]
        tuple( key, sample )
    }
    .map { key, sample -> tuple( sample, key ) }
    .join( SEURAT_SINGLE.out.rds )
    .map { sample, key, rds_file -> tuple( key, rds_file ) }
    .groupTuple()
    .view()
Output:
[group1-group2-group3, [/gpfs/gsfs10/users/CCBR_Pipeliner/Pipelines/TechDev_scRNASeq_Dev2023/work/4c/747cbe34e3464a22c376d09be2cdb1/sample6_seurat_object.rds, /gpfs/gsfs10/users/CCBR_Pipeliner/Pipelines/TechDev_scRNASeq_Dev2023/work/51/9bb8aad780fd14e9ed7ad9b3f3b06f/sample5_seurat_object.rds]]
[group1-group2, [/gpfs/gsfs10/users/CCBR_Pipeliner/Pipelines/TechDev_scRNASeq_Dev2023/work/d8/b02c8c3ab57faefe4bb60e85b03743/sample3_seurat_object.rds, /gpfs/gsfs10/users/CCBR_Pipeliner/Pipelines/TechDev_scRNASeq_Dev2023/work/27/eb43d9f44534819f289831869270a8/sample1_seurat_object.rds, /gpfs/gsfs10/users/CCBR_Pipeliner/Pipelines/TechDev_scRNASeq_Dev2023/work/e2/2811ac1360970134456f34b7d55518/sample4_seurat_object.rds, /gpfs/gsfs10/users/CCBR_Pipeliner/Pipelines/TechDev_scRNASeq_Dev2023/work/1f/a18fc5718d3a7869da2340149254e3/sample2_seurat_object.rds]]
Expected Output:
[group1-group2, [/gpfs/gsfs10/users/CCBR_Pipeliner/Pipelines/TechDev_scRNASeq_Dev2023/work/d8/b02c8c3ab57faefe4bb60e85b03743/sample3_seurat_object.rds, /gpfs/gsfs10/users/CCBR_Pipeliner/Pipelines/TechDev_scRNASeq_Dev2023/work/27/eb43d9f44534819f289831869270a8/sample1_seurat_object.rds, /gpfs/gsfs10/users/CCBR_Pipeliner/Pipelines/TechDev_scRNASeq_Dev2023/work/e2/2811ac1360970134456f34b7d55518/sample4_seurat_object.rds, /gpfs/gsfs10/users/CCBR_Pipeliner/Pipelines/TechDev_scRNASeq_Dev2023/work/1f/a18fc5718d3a7869da2340149254e3/sample2_seurat_object.rds]]
[group1-group2-group3, [/gpfs/gsfs10/users/CCBR_Pipeliner/Pipelines/TechDev_scRNASeq_Dev2023/work/4c/747cbe34e3464a22c376d09be2cdb1/sample6_seurat_object.rds, /gpfs/gsfs10/users/CCBR_Pipeliner/Pipelines/TechDev_scRNASeq_Dev2023/work/51/9bb8aad780fd14e9ed7ad9b3f3b06f/sample5_seurat_object.rds, /gpfs/gsfs10/users/CCBR_Pipeliner/Pipelines/TechDev_scRNASeq_Dev2023/work/d8/b02c8c3ab57faefe4bb60e85b03743/sample3_seurat_object.rds, /gpfs/gsfs10/users/CCBR_Pipeliner/Pipelines/TechDev_scRNASeq_Dev2023/work/27/eb43d9f44534819f289831869270a8/sample1_seurat_object.rds, /gpfs/gsfs10/users/CCBR_Pipeliner/Pipelines/TechDev_scRNASeq_Dev2023/work/e2/2811ac1360970134456f34b7d55518/sample4_seurat_object.rds, /gpfs/gsfs10/users/CCBR_Pipeliner/Pipelines/TechDev_scRNASeq_Dev2023/work/1f/a18fc5718d3a7869da2340149254e3/sample2_seurat_object.rds]]
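The missing samples are consistent with how join appears to pair items: as far as I understand, it matches entries one-to-one per key, so a sample listed under two contrasts finds a partner RDS tuple only once. A minimal sketch with toy channels (the names samples_ch and rds_ch and the file names are made up for illustration, not taken from the pipeline) should reproduce the behaviour:

```nextflow
workflow {
    // Hypothetical stand-in for the [sample, key] pairs;
    // sample1 belongs to two contrasts
    samples_ch = channel.of(
        ['sample1', 'group1-group2'],
        ['sample2', 'group1-group2'],
        ['sample1', 'group1-group2-group3'],
        ['sample5', 'group1-group2-group3']
    )

    // Hypothetical stand-in for SEURAT_SINGLE.out.rds: one RDS file per sample
    rds_ch = channel.of(
        ['sample1', 'sample1.rds'],
        ['sample2', 'sample2.rds'],
        ['sample5', 'sample5.rds']
    )

    // join pairs items one-to-one per key, so only one of the two
    // sample1 entries gets matched; the other has no partner left
    samples_ch
        .join( rds_ch )
        .view()
}
```

With default settings this should print three tuples rather than four, and which contrast keeps sample1 depends on arrival order.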
Solution
For anyone else who finds themselves with this question, this was the solution I came up with:
ch_groups = INPUT_CHECK_GEX.out.group_samplesheet
    .splitCsv( header:true, sep:',', strip:true )
    .map { row ->
        def key = row["keyid"]
        def sample = row["sampleid"]
        return [sample, key]
    }
    .combine(SEURAT_SINGLE.out.rds, by: 0)
    .map { sample, key, rds_file -> tuple( key, rds_file ) }
    .groupTuple()
    .view()
Gives the output:
[group1-group2, [/gpfs/gsfs10/users/CCBR_Pipeliner/Pipelines/TechDev_scRNASeq_Dev2023/work/1f/a18fc5718d3a7869da2340149254e3/sample2_seurat_object.rds, /gpfs/gsfs10/users/CCBR_Pipeliner/Pipelines/TechDev_scRNASeq_Dev2023/work/a6/02a8bc99a1a0ea3549d774145facbe/sample3_seurat_object.rds, /gpfs/gsfs10/users/CCBR_Pipeliner/Pipelines/TechDev_scRNASeq_Dev2023/work/78/e7d26a4328f99d5984cdb1acd8e4b0/sample1_seurat_object.rds, /gpfs/gsfs10/users/CCBR_Pipeliner/Pipelines/TechDev_scRNASeq_Dev2023/work/da/ca761f3d5b389f1333736ec5ae1dfe/sample4_seurat_object.rds]]
[group1-group2-group3, [/gpfs/gsfs10/users/CCBR_Pipeliner/Pipelines/TechDev_scRNASeq_Dev2023/work/1f/a18fc5718d3a7869da2340149254e3/sample2_seurat_object.rds, /gpfs/gsfs10/users/CCBR_Pipeliner/Pipelines/TechDev_scRNASeq_Dev2023/work/a6/02a8bc99a1a0ea3549d774145facbe/sample3_seurat_object.rds, /gpfs/gsfs10/users/CCBR_Pipeliner/Pipelines/TechDev_scRNASeq_Dev2023/work/98/1063e9c6b025e59238d84db688ece5/sample5_seurat_object.rds, /gpfs/gsfs10/users/CCBR_Pipeliner/Pipelines/TechDev_scRNASeq_Dev2023/work/ec/c1924829b9e4298540c530aa37e919/sample6_seurat_object.rds, /gpfs/gsfs10/users/CCBR_Pipeliner/Pipelines/TechDev_scRNASeq_Dev2023/work/78/e7d26a4328f99d5984cdb1acd8e4b0/sample1_seurat_object.rds, /gpfs/gsfs10/users/CCBR_Pipeliner/Pipelines/TechDev_scRNASeq_Dev2023/work/da/ca761f3d5b389f1333736ec5ae1dfe/sample4_seurat_object.rds]]
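To see why combine with by: 0 handles shared samples, here is a small self-contained sketch. The channel names and file names are invented for illustration; only the operators come from the solution above.

```nextflow
workflow {
    // Hypothetical [sample, key] pairs; sample1 appears in both contrasts
    samples_ch = channel.of(
        ['sample1', 'group1-group2'],
        ['sample2', 'group1-group2'],
        ['sample1', 'group1-group2-group3'],
        ['sample5', 'group1-group2-group3']
    )

    // Hypothetical stand-in for SEURAT_SINGLE.out.rds
    rds_ch = channel.of(
        ['sample1', 'sample1.rds'],
        ['sample2', 'sample2.rds'],
        ['sample5', 'sample5.rds']
    )

    // combine(by: 0) pairs every left item with every right item that
    // shares the same first element, so both sample1 entries are kept
    samples_ch
        .combine( rds_ch, by: 0 )
        .map { sample, key, rds_file -> tuple( key, rds_file ) }
        .groupTuple()
        .view()
}
```

Each contrast should now collect the RDS file of every sample listed under it, including sample1.rds in both groups; the order of items inside each list is not guaranteed.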
Answer 1
Score: 1
Assuming your group samplesheet contains multiple groups, each with a different number of samples, you could use a groupKey object to associate the number of samples with each group. This approach lets the groupTuple operator stream the collected values as soon as possible. For example:
workflow {
    INPUT_CHECK_GEX.out.group_samplesheet
        .splitCsv( header:true, sep:',', strip:true )
        .map { row ->
            def keyID = row["keyid"]
            def sampleID = row["sampleid"]
            tuple( keyID, sampleID )
        }
        .groupTuple()
        .map { group, samples ->
            tuple( groupKey(group, samples.size()), samples )
        }
        .set { groups_ch }

    groups_ch
        .transpose()
        .map { key, sample -> tuple( sample, key ) }
        .join( SEURAT_SINGLE.out.rds )
        .map { sample, key, rds_file -> tuple( key, rds_file ) }
        .groupTuple()
        .view()
}
Expected results:
[group1-group2, [/gpfs/gsfs10/users/CCBR_Pipeliner/Pipelines/TechDev_scRNASeq_Dev2023/work/65/888f0fb28a20fe1c034e8da8666eee/sample1_seurat_object.rds, /gpfs/gsfs10/users/CCBR_Pipeliner/Pipelines/TechDev_scRNASeq_Dev2023/work/ec/98b2b1e045db5b0664233052e28e37/sample2_seurat_object.rds, /gpfs/gsfs10/users/CCBR_Pipeliner/Pipelines/TechDev_scRNASeq_Dev2023/work/b1/92baee56b862a2187f1459e1e66a4d/sample3_seurat_object.rds, /gpfs/gsfs10/users/CCBR_Pipeliner/Pipelines/TechDev_scRNASeq_Dev2023/work/44/5c38598986b3a48e05a4bcb5c72c73/sample4_seurat_object.rds]]
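As a rough illustration of what groupKey buys you, the sketch below uses made-up group and sample names. The first groupTuple still has to wait for all input before it can count the samples, but once each key carries its expected size, the second groupTuple can emit a group as soon as that many items have arrived.

```nextflow
workflow {
    // Hypothetical [key, sample] pairs standing in for the samplesheet rows
    channel.of(
        ['group1-group2', 'sample1'],
        ['group1-group2', 'sample2'],
        ['group1-group2-group3', 'sample1'],
        ['group1-group2-group3', 'sample2'],
        ['group1-group2-group3', 'sample5']
    )
    // First pass: collect samples per group to learn each group's size
    .groupTuple()
    // Wrap the key so it remembers how many items belong to the group
    .map { group, samples -> tuple( groupKey(group, samples.size()), samples ) }
    // Unroll back to one [key, sample] tuple per sample
    .transpose()
    // Second pass: each group can be released once its size is reached
    .groupTuple()
    .view()
}
```

In a real pipeline the per-sample processing would sit between transpose and the second groupTuple, which is where early emission matters.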
Note that if a sample can belong to more than one group, simply replace the join operator with the combine operator. Just make sure to use the second form, which lets you combine items that share a common matching key using the by parameter, for example:
groups_ch
    .transpose()
    .map { key, sample -> tuple( sample, key ) }
    .combine( SEURAT_SINGLE.out.rds, by: 0 )
    .map { sample, key, rds_file -> tuple( key, rds_file ) }
    .groupTuple()
    .view()
Expected results:
[group1-group2, [/gpfs/gsfs10/users/CCBR_Pipeliner/Pipelines/TechDev_scRNASeq_Dev2023/work/65/888f0fb28a20fe1c034e8da8666eee/sample1_seurat_object.rds, /gpfs/gsfs10/users/CCBR_Pipeliner/Pipelines/TechDev_scRNASeq_Dev2023/work/ec/98b2b1e045db5b0664233052e28e37/sample2_seurat_object.rds, /gpfs/gsfs10/users/CCBR_Pipeliner/Pipelines/TechDev_scRNASeq_Dev2023/work/b1/92baee56b862a2187f1459e1e66a4d/sample3_seurat_object.rds, /gpfs/gsfs10/users/CCBR_Pipeliner/Pipelines/TechDev_scRNASeq_Dev2023/work/44/5c38598986b3a48e05a4bcb5c72c73/sample4_seurat_object.rds]]
[group1-group2-group3, [/gpfs/gsfs10/users/CCBR_Pipeliner/Pipelines/TechDev_scRNASeq_Dev2023/work/65/888f0fb28a20fe1c034e8da8666eee/sample1_seurat_object.rds, /gpfs/gsfs10/users/CCBR_Pipeliner/Pipelines/TechDev_scRNASeq_Dev2023/work/ec/98b2b1e045db5b0664233052e28e37/sample2_seurat_object.rds, /gpfs/gsfs10/users/CCBR_Pipeliner/Pipelines/TechDev_scRNASeq_Dev2023/work/b1/92baee56b862a2187f1459e1e66a4d/sample3_seurat_object.rds, /gpfs/gsfs10/users/CCBR_Pipeliner/Pipelines/TechDev_scRNASeq_Dev2023/work/44/5c38598986b3a48e05a4bcb5c72c73/sample4_seurat_object.rds, /gpfs/gsfs10/users/CCBR_Pipeliner/Pipelines/TechDev_scRNASeq_Dev2023/work/78/a0ce478d03da5fb4f67b34fcd194e4/sample5_seurat_object.rds, /gpfs/gsfs10/users/CCBR_Pipeliner/Pipelines/TechDev_scRNASeq_Dev2023/work/7a/ebe2243cd6dbc81c2374be9e80c24b/sample6_seurat_object.rds]]
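For readers unfamiliar with transpose, this small sketch shows the intermediate [sample, key] pairs that the snippets above feed into join or combine. The input tuples are invented for illustration.

```nextflow
workflow {
    // Hypothetical grouped tuples, shaped like the output of groupTuple
    channel.of(
        ['group1-group2', ['sample1', 'sample2']],
        ['group1-group2-group3', ['sample1', 'sample2', 'sample5']]
    )
    // transpose emits one [key, sample] tuple per element of the nested list
    .transpose()
    // Put the sample first so it can serve as the matching key
    .map { key, sample -> tuple( sample, key ) }
    .view()
}
```

This should emit five tuples, one per (sample, group) membership, with sample1 and sample2 each appearing twice.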