Creating a mapping ID between two different data frames

Question

I have data frames coming out of a Seurat analysis pipeline:

- The first covers the annotated cells. It contains each cell, its cell type annotation, and the numerical cluster associated with that cell type.
- The second covers the genes. The cell types are annotated based on the expression of genes that are probable markers for those cell types.

That was the background.

The clusters for the cell types:

    > table(seurat@meta.data$louvain)

          CD4 T cells   CD14+ Monocytes           B cells       CD8 T cells          NK cells FCGR3A+ Monocytes
                 1144               480               342               316               154               150
      Dendritic cells    Megakaryocytes
                   37                15

    > table(seurat@meta.data$seurat_clusters)

       0    1    2    3    4    5    6    7
    1140  467  351  247  219  167   32   15

The clusters for the genes:

    > table(cl_markers$seurat_clusters)

      0   1   2   3   4   5   6   7
    341 691 299 640 229 858 815 439

The common factor is the cluster number, which is shared by the genes and the cell types.

I can't map each cell type directly to the genes because the two data frames have different dimensions.

    > dim(mke_cluster)
    [1] 2638    2
    > dim(mke_gene)
    [1] 4312    3

A small subset of my cluster data frame:

    > head(mke_cluster)
                             louvain seurat_clusters
    AAACATACAACCAC-1     CD4 T cells               0
    AAACATTGAGCTAC-1         B cells               2
    AAACATTGATCAGC-1     CD4 T cells               0
    AAACCGTGCTTCCG-1 CD14+ Monocytes               5
    AAACCGTGTATGCG-1        NK cells               3
    AAACGCACTGGTAC-1     CD8 T cells               0

Gene subset:

    > head(mke_gene)
          seurat_clusters  gene avg_log2FC
    LDHB                0  LDHB  1.6653235
    RPS12               2 RPS12  0.8438077
    RPS25               2 RPS25  0.9089848
    CD3D                0  CD3D  1.5250903
    RPS27               1 RPS27  0.7858780
    RPS6                0  RPS6  0.7065248

My objective is:

- Another column in the `mke_gene` data frame in which each gene is labelled with its `louvain` annotation.

I was not sure how to do the mapping. I tried merging, but I'm not sure whether this is the right approach, since the two data frames have different dimensions, which is expected biologically.

    > merged_data <- merge(mke_cluster, mke_gene, by.x = "seurat_clusters", by.y = "seurat_clusters")

Any suggestion or help would be appreciated.

    dput(head(cl_markers, n = 10))
    structure(list(p_val = c(6.64840265863089e-242, 1.13459111505844e-224,
    5.8379603130071e-204, 1.10871523912512e-193, 1.2932367269787e-184,
    3.62844688134858e-184, 2.7796055265616e-178, 9.24017697438729e-173,
    9.23182535159467e-168, 4.61251612799725e-164), avg_log2FC = c(1.66532350813551,
    0.843807664890472, 0.90898477536885, 1.52509026397192, 0.785877979378442,
    0.706524751609463, 0.687290625486455, 0.656853693459277, 0.759283189548348,
    0.824286733662864), pct.1 = c(0.936, 1, 1, 0.879, 0.999, 1, 1,
    1, 0.996, 0.997), pct.2 = c(0.477, 0.989, 0.967, 0.247, 0.989,
    0.993, 0.993, 0.993, 0.978, 0.953), p_val_adj = c(9.1176194060464e-238,
    1.55597825519115e-220, 8.00617877325794e-200, 1.52049207893619e-189,
    1.77354484737859e-180, 4.97605205308144e-180, 3.81195101912658e-174,
    1.26719787026747e-168, 1.26605252871769e-163, 6.32560461793543e-160
    ), cluster = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
    1L), levels = c("0", "1", "2", "3", "4", "5", "6", "7"), class = "factor"),
    gene = c("LDHB", "RPS12", "RPS25", "CD3D", "RPS27", "RPS6",
    "RPS3", "RPS14", "TPT1", "RPL31")), row.names = c("LDHB",
    "RPS12", "RPS25", "CD3D", "RPS27", "RPS6", "RPS3", "RPS14", "TPT1",
    "RPL31"), class = "data.frame")

    #######################################################

    dput(head(seurat@meta.data, n = 10))
    structure(list(orig.ident = structure(c(1L, 1L, 1L, 1L, 1L, 1L,
    1L, 1L, 1L, 1L), levels = "SeuratProject", class = "factor"),
    nCount_RNA = c(2419, 4903, 3147, 2639, 980, 2163, 2175, 2260,
    1275, 1103), nFeature_RNA = c(779L, 1352L, 1129L, 960L, 521L,
    781L, 782L, 790L, 532L, 550L), n_genes = c(781, 1352, 1131,
    960, 522, 782, 783, 790, 533, 550), percent_mito = c(0.0301777590066195,
    0.0379359573125839, 0.00889736227691174, 0.0174308456480503,
    0.0122448978945613, 0.016643550246954, 0.0381609201431274,
    0.0309734512120485, 0.0117647061124444, 0.0290117859840393
    ), n_counts = c(2419, 4903, 3147, 2639, 980, 2163, 2175,
    2260, 1275, 1103), louvain = structure(c(1L, 3L, 1L, 2L,
    5L, 4L, 4L, 4L, 1L, 6L), levels = c("CD4 T cells", "CD14+ Monocytes",
    "B cells", "CD8 T cells", "NK cells", "FCGR3A+ Monocytes",
    "Dendritic cells", "Megakaryocytes"), class = "factor"),
    percent.mt = c(3.01777594047127, 3.79359575769937, 0.889736256752463,
    1.74308450170519, 1.22448979591837, 1.66435506241331, 3.81609195402299,
    3.09734513274336, 1.17647058823529, 2.9011786038078), RNA_snn_res.1 = structure(c(1L,
    3L, 1L, 6L, 4L, 1L, 5L, 5L, 5L, 6L), levels = c("0", "1",
    "2", "3", "4", "5", "6", "7"), class = "factor"), seurat_clusters = structure(c(1L,
    3L, 1L, 6L, 4L, 1L, 5L, 5L, 5L, 6L), levels = c("0", "1",
    "2", "3", "4", "5", "6", "7"), class = "factor")), row.names = c("AAACATACAACCAC-1",
    "AAACATTGAGCTAC-1", "AAACATTGATCAGC-1", "AAACCGTGCTTCCG-1", "AAACCGTGTATGCG-1",
    "AAACGCACTGGTAC-1", "AAACGCTGACCAGT-1", "AAACGCTGGTTCTT-1", "AAACGCTGTAGCCA-1",
    "AAACGCTGTTTCTG-1"), class = "data.frame")

Here is the code I used to generate `mke_cluster` and `mke_gene`:

    mke_cluster = seurat@meta.data %>% select(louvain,seurat_clusters)
    mke_gene = cl_markers %>% select(cluster,gene,avg_log2FC)
    names(mke_gene)[1] = "seurat_clusters"

**UPDATE**

Seurat data frame:

    dput(head(seurat, n = 10))
    structure(list(orig.ident = structure(c(1L, 1L, 1L, 1L, 1L, 1L,
    1L, 1L, 1L, 1L), levels = "SeuratProject", class = "factor"),
    nCount_RNA = c(2419, 4903, 3147, 2639, 980, 2163, 2175, 2260,
    1275, 1103), nFeature_RNA = c(779L, 1352L, 1129L, 960L, 521L,
    781L, 782L, 790L, 532L, 550L), n_genes = c(781, 1352, 1131,
    960, 522, 782, 783, 790, 533, 550), percent_mito = c(0.0301777590066195,
    0.0379359573125839, 0.00889736227691174, 0.0174308456480503,
    0.0122448978945613, 0.016643550246954, 0.0381609201431274,
    0.0309734512120485, 0.0117647061124444, 0.0290117859840393
    ), n_counts = c(2419, 4903, 3147, 2639, 980, 2163, 2175,
    2260, 1275, 1103), louvain = structure(c(1L, 3L, 1L, 2L,
    5L, 4L, 4L, 4L, 1L, 6L), levels = c("CD4 T cells", "CD14+ Monocytes",
    "B cells", "CD8 T cells", "NK cells", "FCGR3A+ Monocytes",
    "Dendritic cells", "Megakaryocytes"), class = "factor"),
    percent.mt = c(3.01777594047127, 3.79359575769937, 0.889736256752463,
    1.74308450170519, 1.22448979591837, 1.66435506241331, 3.81609195402299,
    3.09734513274336, 1.17647058823529, 2.9011786038078), RNA_snn_res.1 = structure(c(1L,
    3L, 1L, 6L, 4L, 1L, 5L, 5L, 5L, 6L), levels = c("0", "1",
    "2", "3", "4", "5", "6", "7"), class = "factor"), seurat_clusters = structure(c(1L,
    3L, 1L, 6L, 4L, 1L, 5L, 5L, 5L, 6L), levels = c("0", "1",
    "2", "3", "4", "5", "6", "7"), class = "factor")), row.names = c("AAACATACAACCAC-1",
    "AAACATTGAGCTAC-1", "AAACATTGATCAGC-1", "AAACCGTGCTTCCG-1", "AAACCGTGTATGCG-1",
    "AAACGCACTGGTAC-1", "AAACGCTGACCAGT-1", "AAACGCTGGTTCTT-1", "AAACGCTGTAGCCA-1",
    "AAACGCTGTTTCTG-1"), class = "data.frame")
# Answer 1

**Score**: 1

Is there a mistake in your `mke_cluster` example table, or does cluster 0 really have two different labels: CD4 on the first and third rows and CD8 on the sixth row? How did you set the labels?

If you just want to add the cluster annotations to the genes table and the annotations are unique per cluster, you can avoid the multiple-matches problem like this:

```R
library(dplyr)  # provides left_join()

clusters <- unique(mke_cluster)  # one row per distinct (louvain, seurat_clusters) pair
merged_data <- left_join(mke_gene, clusters, by = "seurat_clusters")
```
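If the annotations turn out not to be unique per cluster (as the `head(mke_cluster)` rows for cluster 0 suggest), one possible workaround is to reduce the cell table to a single label per cluster before joining. The sketch below is not part of the original answer: it assumes that taking the most frequent `louvain` label in each cluster is acceptable for your analysis, and the names `cluster_labels`, `n_cells`, and `annotated` are made up for illustration. The toy data frames only reproduce the six rows shown in the question.

```r
library(dplyr)

# Toy stand-ins built from the head() output in the question.
mke_cluster <- data.frame(
  louvain = c("CD4 T cells", "B cells", "CD4 T cells",
              "CD14+ Monocytes", "NK cells", "CD8 T cells"),
  seurat_clusters = factor(c("0", "2", "0", "5", "3", "0"),
                           levels = as.character(0:7)),
  row.names = c("AAACATACAACCAC-1", "AAACATTGAGCTAC-1", "AAACATTGATCAGC-1",
                "AAACCGTGCTTCCG-1", "AAACCGTGTATGCG-1", "AAACGCACTGGTAC-1")
)
mke_gene <- data.frame(
  seurat_clusters = factor(c("0", "2", "2", "0", "1", "0"),
                           levels = as.character(0:7)),
  gene = c("LDHB", "RPS12", "RPS25", "CD3D", "RPS27", "RPS6"),
  avg_log2FC = c(1.6653235, 0.8438077, 0.9089848,
                 1.5250903, 0.7858780, 0.7065248)
)

# Step 1: inspect how labels are distributed over clusters.
# More than one non-zero entry in a row means the mapping is not unique.
print(table(mke_cluster$seurat_clusters, mke_cluster$louvain))

# Step 2 (assumption: majority vote is appropriate): keep the most
# frequent louvain label for each cluster, exactly one row per cluster.
cluster_labels <- mke_cluster %>%
  count(seurat_clusters, louvain, name = "n_cells") %>%
  group_by(seurat_clusters) %>%
  slice_max(n_cells, n = 1, with_ties = FALSE) %>%
  ungroup() %>%
  select(seurat_clusters, louvain)

# Step 3: the join is now many-to-one, so mke_gene keeps its row count.
# Genes whose cluster has no label in cluster_labels get NA.
annotated <- left_join(mke_gene, cluster_labels, by = "seurat_clusters")
print(annotated)
```

Because `cluster_labels` has at most one row per cluster, the join cannot multiply rows the way the original `merge()` of the full cell table did. Whether a majority label is biologically meaningful depends on how the `louvain` annotations were assigned, so check the cross-table from step 1 before relying on it.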

Posted by huangapple on 2023-06-15 04:16:33
Please keep this link when reposting: https://go.coder-hub.com/76477234.html