Filtering a tuple with a channel.collect in Nextflow

Question

I've created a tuple from the output of a channel.

ch_groups = INPUT_CHECK_GEX.out.group_samplesheet
    .splitCsv( header:true, sep:',', strip:true )
    .map { row ->
        def keyID = row["keyid"]
        def sampleID = row["sampleid"]
        return [keyID, sampleID]
    }
    .groupTuple()
ch_groups.view()

This is the output

[group1-group2, [sample1, sample2, sample3, sample4]]

I have another output set up as a tuple as well: SEURAT_SINGLE.out.rds.view()

[sample3, /gpfs/gsfs10/users/CCBR_Pipeliner/Pipelines/TechDev_scRNASeq_Dev2023/work/b1/92baee56b862a2187f1459e1e66a4d/sample3_seurat_object.rds]
[sample7, /gpfs/gsfs10/users/CCBR_Pipeliner/Pipelines/TechDev_scRNASeq_Dev2023/work/37/6df9873421a81170aa8156c303bb3c/sample7_seurat_object.rds]
[sample6, /gpfs/gsfs10/users/CCBR_Pipeliner/Pipelines/TechDev_scRNASeq_Dev2023/work/7a/ebe2243cd6dbc81c2374be9e80c24b/sample6_seurat_object.rds]
[sample1, /gpfs/gsfs10/users/CCBR_Pipeliner/Pipelines/TechDev_scRNASeq_Dev2023/work/65/888f0fb28a20fe1c034e8da8666eee/sample1_seurat_object.rds]
[sample5, /gpfs/gsfs10/users/CCBR_Pipeliner/Pipelines/TechDev_scRNASeq_Dev2023/work/78/a0ce478d03da5fb4f67b34fcd194e4/sample5_seurat_object.rds]
[sample2, /gpfs/gsfs10/users/CCBR_Pipeliner/Pipelines/TechDev_scRNASeq_Dev2023/work/ec/98b2b1e045db5b0664233052e28e37/sample2_seurat_object.rds]
[sample4, /gpfs/gsfs10/users/CCBR_Pipeliner/Pipelines/TechDev_scRNASeq_Dev2023/work/44/5c38598986b3a48e05a4bcb5c72c73/sample4_seurat_object.rds]

I need to get a list of all the RDS files associated with each group in the first output. For example, for [group1-group2, [sample1, sample2, sample3, sample4]] I need this list:

/gpfs/gsfs10/users/CCBR_Pipeliner/Pipelines/TechDev_scRNASeq_Dev2023/work/65/888f0fb28a20fe1c034e8da8666eee/sample1_seurat_object.rds
/gpfs/gsfs10/users/CCBR_Pipeliner/Pipelines/TechDev_scRNASeq_Dev2023/work/ec/98b2b1e045db5b0664233052e28e37/sample2_seurat_object.rds
/gpfs/gsfs10/users/CCBR_Pipeliner/Pipelines/TechDev_scRNASeq_Dev2023/work/b1/92baee56b862a2187f1459e1e66a4d/sample3_seurat_object.rds
/gpfs/gsfs10/users/CCBR_Pipeliner/Pipelines/TechDev_scRNASeq_Dev2023/work/44/5c38598986b3a48e05a4bcb5c72c73/sample4_seurat_object.rds

EDITED with the advice from Steve

Using his approach I was able to get the desired result for one contrast. As soon as I added contrasts, the output still provided only the first result.

For example, adding additional contrasts to INPUT_CHECK_GEX.out.group_samplesheet:

ch_groups = INPUT_CHECK_GEX.out.group_samplesheet
    .splitCsv( header:true, sep:',', strip:true )
    .map { row ->
        def keyID = row["keyid"]
        def sampleID = row["sampleid"]
        return [keyID, sampleID]
    }
    .groupTuple()
ch_groups.view()

Output of ch_groups.view():

[group1-group2, [sample1, sample2, sample3, sample4]]
[group1-group2-group3, [sample1, sample2, sample3, sample4, sample5, sample6]]

Running his suggestion then still gives only the first group's output, ignoring the added contrast:

[group1-group2, [/gpfs/gsfs10/users/CCBR_Pipeliner/Pipelines/TechDev_scRNASeq_Dev2023/work/a6/02a8bc99a1a0ea3549d774145facbe/sample3_seurat_object.rds, /gpfs/gsfs10/users/CCBR_Pipeliner/Pipelines/TechDev_scRNASeq_Dev2023/work/bf/2f9f884fe8868ee91ce077d598bd5d/sample4_seurat_object.rds, /gpfs/gsfs10/users/CCBR_Pipeliner/Pipelines/TechDev_scRNASeq_Dev2023/work/1f/a18fc5718d3a7869da2340149254e3/sample2_seurat_object.rds, /gpfs/gsfs10/users/CCBR_Pipeliner/Pipelines/TechDev_scRNASeq_Dev2023/work/8e/99e42901219cd3eba0981987033145/sample1_seurat_object.rds]]

I attempted to fix this with the following. While it brings in the second contrast, it doesn't handle samples that appear in more than one contrast (i.e., sample1 is in both contrasts):

INPUT_CHECK_GEX.out.group_samplesheet
            .splitCsv( header:true, sep:',', strip:true )
            .map { row ->
                def key = row["keyid"]
                def sample = row["sampleid"]

                tuple( key, sample )
            }
            .map { key, sample -> tuple( sample, key ) }
            .join( SEURAT_SINGLE.out.rds )
            .map { sample, key, rds_file -> tuple( key, rds_file ) }
            .groupTuple()
            .view()

Output:

[group1-group2-group3, [/gpfs/gsfs10/users/CCBR_Pipeliner/Pipelines/TechDev_scRNASeq_Dev2023/work/4c/747cbe34e3464a22c376d09be2cdb1/sample6_seurat_object.rds, /gpfs/gsfs10/users/CCBR_Pipeliner/Pipelines/TechDev_scRNASeq_Dev2023/work/51/9bb8aad780fd14e9ed7ad9b3f3b06f/sample5_seurat_object.rds]]
[group1-group2, [/gpfs/gsfs10/users/CCBR_Pipeliner/Pipelines/TechDev_scRNASeq_Dev2023/work/d8/b02c8c3ab57faefe4bb60e85b03743/sample3_seurat_object.rds, /gpfs/gsfs10/users/CCBR_Pipeliner/Pipelines/TechDev_scRNASeq_Dev2023/work/27/eb43d9f44534819f289831869270a8/sample1_seurat_object.rds, /gpfs/gsfs10/users/CCBR_Pipeliner/Pipelines/TechDev_scRNASeq_Dev2023/work/e2/2811ac1360970134456f34b7d55518/sample4_seurat_object.rds, /gpfs/gsfs10/users/CCBR_Pipeliner/Pipelines/TechDev_scRNASeq_Dev2023/work/1f/a18fc5718d3a7869da2340149254e3/sample2_seurat_object.rds]]

Expected Output:

[group1-group2, [/gpfs/gsfs10/users/CCBR_Pipeliner/Pipelines/TechDev_scRNASeq_Dev2023/work/d8/b02c8c3ab57faefe4bb60e85b03743/sample3_seurat_object.rds, /gpfs/gsfs10/users/CCBR_Pipeliner/Pipelines/TechDev_scRNASeq_Dev2023/work/27/eb43d9f44534819f289831869270a8/sample1_seurat_object.rds, /gpfs/gsfs10/users/CCBR_Pipeliner/Pipelines/TechDev_scRNASeq_Dev2023/work/e2/2811ac1360970134456f34b7d55518/sample4_seurat_object.rds, /gpfs/gsfs10/users/CCBR_Pipeliner/Pipelines/TechDev_scRNASeq_Dev2023/work/1f/a18fc5718d3a7869da2340149254e3/sample2_seurat_object.rds]]

[group1-group2-group3, [/gpfs/gsfs10/users/CCBR_Pipeliner/Pipelines/TechDev_scRNASeq_Dev2023/work/4c/747cbe34e3464a22c376d09be2cdb1/sample6_seurat_object.rds, /gpfs/gsfs10/users/CCBR_Pipeliner/Pipelines/TechDev_scRNASeq_Dev2023/work/51/9bb8aad780fd14e9ed7ad9b3f3b06f/sample5_seurat_object.rds, /gpfs/gsfs10/users/CCBR_Pipeliner/Pipelines/TechDev_scRNASeq_Dev2023/work/d8/b02c8c3ab57faefe4bb60e85b03743/sample3_seurat_object.rds, /gpfs/gsfs10/users/CCBR_Pipeliner/Pipelines/TechDev_scRNASeq_Dev2023/work/27/eb43d9f44534819f289831869270a8/sample1_seurat_object.rds, /gpfs/gsfs10/users/CCBR_Pipeliner/Pipelines/TechDev_scRNASeq_Dev2023/work/e2/2811ac1360970134456f34b7d55518/sample4_seurat_object.rds, /gpfs/gsfs10/users/CCBR_Pipeliner/Pipelines/TechDev_scRNASeq_Dev2023/work/1f/a18fc5718d3a7869da2340149254e3/sample2_seurat_object.rds]]

Solution

For anyone else who finds themselves with this question, this was the solution I came up with:

        ch_groups = INPUT_CHECK_GEX.out.group_samplesheet
            .splitCsv( header:true, sep:',', strip:true )
            .map { row ->
                    def key = row["keyid"]
                    def sample = row["sampleid"]
                    return [sample, key]
                }
            .combine(SEURAT_SINGLE.out.rds, by: 0)
            .map { sample, key, rds_file -> tuple( key, rds_file ) }
            .groupTuple()
            .view()

Gives the output:

[group1-group2, [/gpfs/gsfs10/users/CCBR_Pipeliner/Pipelines/TechDev_scRNASeq_Dev2023/work/1f/a18fc5718d3a7869da2340149254e3/sample2_seurat_object.rds, /gpfs/gsfs10/users/CCBR_Pipeliner/Pipelines/TechDev_scRNASeq_Dev2023/work/a6/02a8bc99a1a0ea3549d774145facbe/sample3_seurat_object.rds, /gpfs/gsfs10/users/CCBR_Pipeliner/Pipelines/TechDev_scRNASeq_Dev2023/work/78/e7d26a4328f99d5984cdb1acd8e4b0/sample1_seurat_object.rds, /gpfs/gsfs10/users/CCBR_Pipeliner/Pipelines/TechDev_scRNASeq_Dev2023/work/da/ca761f3d5b389f1333736ec5ae1dfe/sample4_seurat_object.rds]]

[group1-group2-group3, [/gpfs/gsfs10/users/CCBR_Pipeliner/Pipelines/TechDev_scRNASeq_Dev2023/work/1f/a18fc5718d3a7869da2340149254e3/sample2_seurat_object.rds, /gpfs/gsfs10/users/CCBR_Pipeliner/Pipelines/TechDev_scRNASeq_Dev2023/work/a6/02a8bc99a1a0ea3549d774145facbe/sample3_seurat_object.rds, /gpfs/gsfs10/users/CCBR_Pipeliner/Pipelines/TechDev_scRNASeq_Dev2023/work/98/1063e9c6b025e59238d84db688ece5/sample5_seurat_object.rds, /gpfs/gsfs10/users/CCBR_Pipeliner/Pipelines/TechDev_scRNASeq_Dev2023/work/ec/c1924829b9e4298540c530aa37e919/sample6_seurat_object.rds, /gpfs/gsfs10/users/CCBR_Pipeliner/Pipelines/TechDev_scRNASeq_Dev2023/work/78/e7d26a4328f99d5984cdb1acd8e4b0/sample1_seurat_object.rds, /gpfs/gsfs10/users/CCBR_Pipeliner/Pipelines/TechDev_scRNASeq_Dev2023/work/da/ca761f3d5b389f1333736ec5ae1dfe/sample4_seurat_object.rds]]

Answer 1

Score: 1

Assuming your group samplesheet contains multiple groups, each with a different number of samples, you could use a groupKey object to associate the number of samples with each group. Because groupTuple then knows how many items to expect for each key, it can emit each group as soon as its last item arrives instead of waiting for the upstream channel to complete. For example:

workflow {

    INPUT_CHECK_GEX.out.group_samplesheet
        .splitCsv( header:true, sep:',', strip:true )
        .map { row ->
            def keyID = row["keyid"]
            def sampleID = row["sampleid"]

            tuple( keyID, sampleID )
        }
        .groupTuple()
        .map { group, samples ->
            tuple( groupKey(group, samples.size()), samples )
        }
        .set { groups_ch }

    groups_ch
        .transpose()
        .map { key, sample -> tuple( sample, key ) }
        .join( SEURAT_SINGLE.out.rds )
        .map { sample, key, rds_file -> tuple( key, rds_file ) }
        .groupTuple()
        .view()
}

Expected results:

[group1-group2, [/gpfs/gsfs10/users/CCBR_Pipeliner/Pipelines/TechDev_scRNASeq_Dev2023/work/65/888f0fb28a20fe1c034e8da8666eee/sample1_seurat_object.rds, /gpfs/gsfs10/users/CCBR_Pipeliner/Pipelines/TechDev_scRNASeq_Dev2023/work/ec/98b2b1e045db5b0664233052e28e37/sample2_seurat_object.rds, /gpfs/gsfs10/users/CCBR_Pipeliner/Pipelines/TechDev_scRNASeq_Dev2023/work/b1/92baee56b862a2187f1459e1e66a4d/sample3_seurat_object.rds, /gpfs/gsfs10/users/CCBR_Pipeliner/Pipelines/TechDev_scRNASeq_Dev2023/work/44/5c38598986b3a48e05a4bcb5c72c73/sample4_seurat_object.rds]]
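
To see the groupKey mechanism in isolation, here is a minimal sketch that assumes toy data built with Channel.of in place of the real samplesheet. The group and sample names are placeholders I made up for illustration; only groupTuple, groupKey and transpose come from the answer above.

```nextflow
// Toy sketch: the (group, sample) pairs below are placeholders, not real samplesheet rows.
workflow {
    Channel.of(
            ['group1-group2', 'sample1'],
            ['group1-group2', 'sample2'],
            ['group1-group2-group3', 'sample1'],
            ['group1-group2-group3', 'sample2'],
            ['group1-group2-group3', 'sample5']
        )
        // First pass: collect the samples for each group so their count is known.
        .groupTuple()
        // Wrap the group name in a groupKey that carries the expected group size.
        .map { group, samples ->
            tuple( groupKey(group, samples.size()), samples )
        }
        // Flatten back to one (key, sample) pair per item.
        .transpose()
        // Second pass: each group can be emitted once its expected number of items has arrived.
        .groupTuple()
        .view()
}
```

In a real pipeline the per-sample work (the join or combine with the RDS channel) would sit between transpose and the second groupTuple, as in the answer's code.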

Note that if a sample can belong to more than one group, simply replace the join operator with combine. Make sure to use the second form, which takes the by parameter and combines only items that share a matching key, for example:

    groups_ch
        .transpose()
        .map { key, sample -> tuple( sample, key ) }
        .combine( SEURAT_SINGLE.out.rds, by: 0 )
        .map { sample, key, rds_file -> tuple( key, rds_file ) }
        .groupTuple()
        .view()

Expected results:

[group1-group2, [/gpfs/gsfs10/users/CCBR_Pipeliner/Pipelines/TechDev_scRNASeq_Dev2023/work/65/888f0fb28a20fe1c034e8da8666eee/sample1_seurat_object.rds, /gpfs/gsfs10/users/CCBR_Pipeliner/Pipelines/TechDev_scRNASeq_Dev2023/work/ec/98b2b1e045db5b0664233052e28e37/sample2_seurat_object.rds, /gpfs/gsfs10/users/CCBR_Pipeliner/Pipelines/TechDev_scRNASeq_Dev2023/work/b1/92baee56b862a2187f1459e1e66a4d/sample3_seurat_object.rds, /gpfs/gsfs10/users/CCBR_Pipeliner/Pipelines/TechDev_scRNASeq_Dev2023/work/44/5c38598986b3a48e05a4bcb5c72c73/sample4_seurat_object.rds]]
[group1-group2-group3, [/gpfs/gsfs10/users/CCBR_Pipeliner/Pipelines/TechDev_scRNASeq_Dev2023/work/65/888f0fb28a20fe1c034e8da8666eee/sample1_seurat_object.rds, /gpfs/gsfs10/users/CCBR_Pipeliner/Pipelines/TechDev_scRNASeq_Dev2023/work/ec/98b2b1e045db5b0664233052e28e37/sample2_seurat_object.rds, /gpfs/gsfs10/users/CCBR_Pipeliner/Pipelines/TechDev_scRNASeq_Dev2023/work/b1/92baee56b862a2187f1459e1e66a4d/sample3_seurat_object.rds, /gpfs/gsfs10/users/CCBR_Pipeliner/Pipelines/TechDev_scRNASeq_Dev2023/work/44/5c38598986b3a48e05a4bcb5c72c73/sample4_seurat_object.rds, /gpfs/gsfs10/users/CCBR_Pipeliner/Pipelines/TechDev_scRNASeq_Dev2023/work/78/a0ce478d03da5fb4f67b34fcd194e4/sample5_seurat_object.rds, /gpfs/gsfs10/users/CCBR_Pipeliner/Pipelines/TechDev_scRNASeq_Dev2023/work/7a/ebe2243cd6dbc81c2374be9e80c24b/sample6_seurat_object.rds]]
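
To try the combine variant without the real pipeline, here is a minimal sketch that assumes toy channels built with Channel.of. The sample IDs, group names and .rds file names are placeholders standing in for the samplesheet rows and SEURAT_SINGLE.out.rds, not real outputs.

```nextflow
// Toy sketch: all values below are placeholders for illustration.
workflow {
    // (sample, group) pairs; sample1 and sample2 appear in both groups.
    samples_ch = Channel.of(
        ['sample1', 'group1-group2'],
        ['sample2', 'group1-group2'],
        ['sample1', 'group1-group2-group3'],
        ['sample2', 'group1-group2-group3'],
        ['sample5', 'group1-group2-group3']
    )

    // (sample, rds_file) pairs standing in for SEURAT_SINGLE.out.rds.
    rds_ch = Channel.of(
        ['sample1', 'sample1.rds'],
        ['sample2', 'sample2.rds'],
        ['sample5', 'sample5.rds']
    )

    samples_ch
        // Pair every (sample, group) item with the rds item that has the same sample ID.
        .combine( rds_ch, by: 0 )
        // Drop the sample ID and key the file by its group.
        .map { sample, key, rds_file -> tuple( key, rds_file ) }
        // Collect the files for each group.
        .groupTuple()
        .view()
}
```

Saved to a file and launched with nextflow run, this should list sample1.rds and sample2.rds under both groups, with sample5.rds only under group1-group2-group3. The order of groups and of files within a group may vary.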

huangapple
  • Published on May 26, 2023, 13:34:30
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76337909.html