Creating a mapping ID between two different data frames

Question

<details>
<summary>English:</summary>

I have two data frames that come out of a Seurat analysis pipeline:

 - The first is for the annotated cells. It contains each cell, its cell type annotation, and the numerical cluster associated with that cell.
 - The second is for the genes. The cell types are annotated based on the expression of these genes, which are probable markers for the annotated cell types.

That was the background.

The clusters for the cell types:

    > table(seurat@meta.data$louvain)
    
          CD4 T cells   CD14+ Monocytes           B cells       CD8 T cells          NK cells FCGR3A+ Monocytes 
                 1144               480               342               316               154               150 
      Dendritic cells    Megakaryocytes 
                   37                15 
    > table(seurat@meta.data$seurat_clusters)
    
       0    1    2    3    4    5    6    7 
    1140  467  351  247  219  167   32   15


The clusters for the genes:
 

    table(cl_markers$seurat_clusters)
    
      0   1   2   3   4   5   6   7 
    341 691 299 640 229 858 815 439 

The common factor here is the cluster number for both the gene and cell type.

Now I can't directly map each cell type to the genes because the two data frames have different dimensions.

    > dim(mke_cluster)
    [1] 2638    2
    > dim(mke_gene)
    [1] 4312    3

A small subset of the cluster data frame:

     head(mke_cluster)
                             louvain seurat_clusters
    AAACATACAACCAC-1     CD4 T cells               0
    AAACATTGAGCTAC-1         B cells               2
    AAACATTGATCAGC-1     CD4 T cells               0
    AAACCGTGCTTCCG-1 CD14+ Monocytes               5
    AAACCGTGTATGCG-1        NK cells               3
    AAACGCACTGGTAC-1     CD8 T cells               0

Gene subset

    head(mke_gene)
          seurat_clusters  gene avg_log2FC
    LDHB                0  LDHB  1.6653235
    RPS12               2 RPS12  0.8438077
    RPS25               2 RPS25  0.9089848
    CD3D                0  CD3D  1.5250903
    RPS27               1 RPS27  0.7858780
    RPS6                0  RPS6  0.7065248

My objective is:

 - Another column in the `mke_gene` data frame in which each gene is labelled with its `louvain` annotation.

I was not sure how to do the mapping. I tried merging, but I'm not sure whether this is the right approach, since the two data frames have different dimensions, which is expected biologically.

    > merged_data <- merge(mke_cluster, mke_gene, by.x = "seurat_clusters", by.y = "seurat_clusters")
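
To see why this merge produces far more rows than either input, here is a small sketch in base R with made-up data. The `cells` and `genes` data frames are hypothetical stand-ins for `mke_cluster` and `mke_gene`, not the real tables. Because both sides have many rows per cluster, `merge()` pairs every cell in a cluster with every gene in that cluster.

```r
# Toy data (made-up): 3 cells and 2 marker genes, all in cluster "0"
cells <- data.frame(
  louvain = c("CD4 T cells", "CD4 T cells", "CD8 T cells"),
  seurat_clusters = c("0", "0", "0")
)
genes <- data.frame(
  seurat_clusters = c("0", "0"),
  gene = c("LDHB", "CD3D")
)

# Many-to-many join: every cell row is paired with every gene row of its cluster
merged <- merge(cells, genes, by = "seurat_clusters")
nrow(merged)  # 3 cells x 2 genes = 6 rows

# Expected size in general: sum over clusters of n_cells * n_genes
n_cells <- table(cells$seurat_clusters)
n_genes <- table(genes$seurat_clusters)
sum(n_cells * n_genes[names(n_cells)])
```

The differing dimensions are not the problem in themselves; the repeated keys on both sides are. Collapsing one side to a single row per cluster before joining keeps the row count of the other side unchanged.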

Any suggestion or help would be appreciated 

    dput(head(cl_markers, n = 10))
    structure(list(p_val = c(6.64840265863089e-242, 1.13459111505844e-224, 
    5.8379603130071e-204, 1.10871523912512e-193, 1.2932367269787e-184, 
    3.62844688134858e-184, 2.7796055265616e-178, 9.24017697438729e-173, 
    9.23182535159467e-168, 4.61251612799725e-164), avg_log2FC = c(1.66532350813551, 
    0.843807664890472, 0.90898477536885, 1.52509026397192, 0.785877979378442, 
    0.706524751609463, 0.687290625486455, 0.656853693459277, 0.759283189548348, 
    0.824286733662864), pct.1 = c(0.936, 1, 1, 0.879, 0.999, 1, 1, 
    1, 0.996, 0.997), pct.2 = c(0.477, 0.989, 0.967, 0.247, 0.989, 
    0.993, 0.993, 0.993, 0.978, 0.953), p_val_adj = c(9.1176194060464e-238, 
    1.55597825519115e-220, 8.00617877325794e-200, 1.52049207893619e-189, 
    1.77354484737859e-180, 4.97605205308144e-180, 3.81195101912658e-174, 
    1.26719787026747e-168, 1.26605252871769e-163, 6.32560461793543e-160
    ), cluster = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
    1L), levels = c("0", "1", "2", "3", "4", "5", "6", "7"), class = "factor"), 
        gene = c("LDHB", "RPS12", "RPS25", "CD3D", "RPS27", "RPS6", 
        "RPS3", "RPS14", "TPT1", "RPL31")), row.names = c("LDHB", 
    "RPS12", "RPS25", "CD3D", "RPS27", "RPS6", "RPS3", "RPS14", "TPT1", 
    "RPL31"), class = "data.frame")
#######################################################

     dput(head(seurat@meta.data, n = 10))
    structure(list(orig.ident = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 
    1L, 1L, 1L, 1L), levels = "SeuratProject", class = "factor"), 
        nCount_RNA = c(2419, 4903, 3147, 2639, 980, 2163, 2175, 2260, 
        1275, 1103), nFeature_RNA = c(779L, 1352L, 1129L, 960L, 521L, 
        781L, 782L, 790L, 532L, 550L), n_genes = c(781, 1352, 1131, 
        960, 522, 782, 783, 790, 533, 550), percent_mito = c(0.0301777590066195, 
        0.0379359573125839, 0.00889736227691174, 0.0174308456480503, 
        0.0122448978945613, 0.016643550246954, 0.0381609201431274, 
        0.0309734512120485, 0.0117647061124444, 0.0290117859840393
        ), n_counts = c(2419, 4903, 3147, 2639, 980, 2163, 2175, 
        2260, 1275, 1103), louvain = structure(c(1L, 3L, 1L, 2L, 
        5L, 4L, 4L, 4L, 1L, 6L), levels = c("CD4 T cells", "CD14+ Monocytes", 
        "B cells", "CD8 T cells", "NK cells", "FCGR3A+ Monocytes", 
        "Dendritic cells", "Megakaryocytes"), class = "factor"), 
        percent.mt = c(3.01777594047127, 3.79359575769937, 0.889736256752463, 
        1.74308450170519, 1.22448979591837, 1.66435506241331, 3.81609195402299, 
        3.09734513274336, 1.17647058823529, 2.9011786038078), RNA_snn_res.1 = structure(c(1L, 
        3L, 1L, 6L, 4L, 1L, 5L, 5L, 5L, 6L), levels = c("0", "1", 
        "2", "3", "4", "5", "6", "7"), class = "factor"), seurat_clusters = structure(c(1L, 
        3L, 1L, 6L, 4L, 1L, 5L, 5L, 5L, 6L), levels = c("0", "1", 
        "2", "3", "4", "5", "6", "7"), class = "factor")), row.names = c("AAACATACAACCAC-1", 
    "AAACATTGAGCTAC-1", "AAACATTGATCAGC-1", "AAACCGTGCTTCCG-1", "AAACCGTGTATGCG-1", 
    "AAACGCACTGGTAC-1", "AAACGCTGACCAGT-1", "AAACGCTGGTTCTT-1", "AAACGCTGTAGCCA-1", 
    "AAACGCTGTTTCTG-1"), class = "data.frame")

    
Here is the code I used to generate `mke_cluster` and `mke_gene`:

    library(dplyr)  # provides %>% and select()
    mke_cluster = seurat@meta.data %>% select(louvain, seurat_clusters)
    mke_gene = cl_markers %>% select(cluster, gene, avg_log2FC)
    names(mke_gene)[1] = "seurat_clusters"  # rename `cluster` to match mke_cluster








</details>


# Answer 1
**Score**: 1

Is there a mistake in your `mke_cluster` example table, or does cluster 0 really have two different labels: CD4 T cells on the first and third rows and CD8 T cells on the sixth row? How did you set the labels?

If you just want to add the cluster annotations to the gene table and the annotations are unique per cluster, you can avoid the multiple-matches problem like this:

```R
library(dplyr)
# Keep one row per distinct (louvain, seurat_clusters) pair
clusters <- unique(mke_cluster)
merged_data <- left_join(mke_gene, clusters, by = "seurat_clusters")
```
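
If the annotations turn out not to be unique per cluster, as the example table suggests for cluster 0, `unique()` still leaves several rows per cluster and the join duplicates genes. One possible workaround, which is not part of the original answer, is to pick a single label per cluster first, for instance the label carried by most cells in that cluster. The sketch below uses base R and made-up toy data with the same column names; the majority-vote rule and the `lookup` and `labelled` names are assumptions for illustration.

```r
# Toy stand-ins for mke_cluster and mke_gene (made-up values, same column names)
mke_cluster <- data.frame(
  louvain = c("CD4 T cells", "CD4 T cells", "CD8 T cells", "B cells", "B cells"),
  seurat_clusters = factor(c("0", "0", "0", "2", "2")),
  row.names = paste0("cell", 1:5)
)
mke_gene <- data.frame(
  seurat_clusters = factor(c("0", "2", "0")),
  gene = c("LDHB", "CD79A", "CD3D"),
  avg_log2FC = c(1.67, 2.10, 1.53)
)

# Cross-tabulate clusters against labels: rows = clusters, columns = labels
counts <- table(mke_cluster$seurat_clusters, mke_cluster$louvain)

# For each cluster, take the label with the most cells (ties go to the first column)
lookup <- data.frame(
  seurat_clusters = rownames(counts),
  louvain = colnames(counts)[apply(counts, 1, which.max)]
)

# One lookup row per cluster, so the join adds a column without adding rows
labelled <- merge(mke_gene, lookup, by = "seurat_clusters", all.x = TRUE)
stopifnot(nrow(labelled) == nrow(mke_gene))
```

Printing `counts` on the real data is also a quick way to check how cleanly the `louvain` labels line up with `seurat_clusters` before deciding whether a majority vote is acceptable.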

huangapple
  • Posted on June 15, 2023, 04:16:33
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76477234.html