Creating a mapping ID between two different data frames
Question
<details>
<summary>English:</summary>
I have two data frames that come out of a Seurat analysis pipeline:
- The first holds the annotated cells: each cell, its cell type annotation, and the numerical cluster associated with that cell type.
- The second holds the genes: the probable marker genes whose expression was used to annotate the cell types.
That is the background.
The clusters for the cell types:
> table(seurat@meta.data$louvain)
      CD4 T cells   CD14+ Monocytes           B cells       CD8 T cells          NK cells FCGR3A+ Monocytes
             1144               480               342               316               154               150
  Dendritic cells    Megakaryocytes
               37                15
> table(seurat@meta.data$seurat_clusters)
   0    1    2    3    4    5    6    7
1140  467  351  247  219  167   32   15
The clusters for the genes:
table(cl_markers$seurat_clusters)
  0   1   2   3   4   5   6   7
341 691 299 640 229 858 815 439
The common factor is the cluster number, which both the genes and the cell types share.
I can't map each cell type directly onto the genes because the two data frames have different dimensions.
> dim(mke_cluster)
[1] 2638 2
> dim(mke_gene)
[1] 4312 3
A small subset of the cluster data frame:
head(mke_cluster)
                         louvain seurat_clusters
AAACATACAACCAC-1     CD4 T cells               0
AAACATTGAGCTAC-1         B cells               2
AAACATTGATCAGC-1     CD4 T cells               0
AAACCGTGCTTCCG-1 CD14+ Monocytes               5
AAACCGTGTATGCG-1        NK cells               3
AAACGCACTGGTAC-1     CD8 T cells               0
A subset of the gene data frame:
head(mke_gene)
      seurat_clusters  gene avg_log2FC
LDHB                0  LDHB  1.6653235
RPS12               2 RPS12  0.8438077
RPS25               2 RPS25  0.9089848
CD3D                0  CD3D  1.5250903
RPS27               1 RPS27  0.7858780
RPS6                0  RPS6  0.7065248
My objective:
- Add another column to the `mke_gene` data frame that labels each gene with its `louvain` annotation.
I was not sure how to do the mapping. I tried merging, but I'm not sure this is the right approach, since the two data frames have different dimensions, which is expected biologically.
> merged_data <- merge(mke_cluster, mke_gene, by.x = "seurat_clusters", by.y = "seurat_clusters")
Any suggestions or help would be appreciated.
dput(head(cl_markers, n = 10))
structure(list(p_val = c(6.64840265863089e-242, 1.13459111505844e-224,
5.8379603130071e-204, 1.10871523912512e-193, 1.2932367269787e-184,
3.62844688134858e-184, 2.7796055265616e-178, 9.24017697438729e-173,
9.23182535159467e-168, 4.61251612799725e-164), avg_log2FC = c(1.66532350813551,
0.843807664890472, 0.90898477536885, 1.52509026397192, 0.785877979378442,
0.706524751609463, 0.687290625486455, 0.656853693459277, 0.759283189548348,
0.824286733662864), pct.1 = c(0.936, 1, 1, 0.879, 0.999, 1, 1,
1, 0.996, 0.997), pct.2 = c(0.477, 0.989, 0.967, 0.247, 0.989,
0.993, 0.993, 0.993, 0.978, 0.953), p_val_adj = c(9.1176194060464e-238,
1.55597825519115e-220, 8.00617877325794e-200, 1.52049207893619e-189,
1.77354484737859e-180, 4.97605205308144e-180, 3.81195101912658e-174,
1.26719787026747e-168, 1.26605252871769e-163, 6.32560461793543e-160
), cluster = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L), levels = c("0", "1", "2", "3", "4", "5", "6", "7"), class = "factor"),
gene = c("LDHB", "RPS12", "RPS25", "CD3D", "RPS27", "RPS6",
"RPS3", "RPS14", "TPT1", "RPL31")), row.names = c("LDHB",
"RPS12", "RPS25", "CD3D", "RPS27", "RPS6", "RPS3", "RPS14", "TPT1",
"RPL31"), class = "data.frame")
#######################################################
dput(head(seurat@meta.data, n = 10))
structure(list(orig.ident = structure(c(1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L), levels = "SeuratProject", class = "factor"),
nCount_RNA = c(2419, 4903, 3147, 2639, 980, 2163, 2175, 2260,
1275, 1103), nFeature_RNA = c(779L, 1352L, 1129L, 960L, 521L,
781L, 782L, 790L, 532L, 550L), n_genes = c(781, 1352, 1131,
960, 522, 782, 783, 790, 533, 550), percent_mito = c(0.0301777590066195,
0.0379359573125839, 0.00889736227691174, 0.0174308456480503,
0.0122448978945613, 0.016643550246954, 0.0381609201431274,
0.0309734512120485, 0.0117647061124444, 0.0290117859840393
), n_counts = c(2419, 4903, 3147, 2639, 980, 2163, 2175,
2260, 1275, 1103), louvain = structure(c(1L, 3L, 1L, 2L,
5L, 4L, 4L, 4L, 1L, 6L), levels = c("CD4 T cells", "CD14+ Monocytes",
"B cells", "CD8 T cells", "NK cells", "FCGR3A+ Monocytes",
"Dendritic cells", "Megakaryocytes"), class = "factor"),
percent.mt = c(3.01777594047127, 3.79359575769937, 0.889736256752463,
1.74308450170519, 1.22448979591837, 1.66435506241331, 3.81609195402299,
3.09734513274336, 1.17647058823529, 2.9011786038078), RNA_snn_res.1 = structure(c(1L,
3L, 1L, 6L, 4L, 1L, 5L, 5L, 5L, 6L), levels = c("0", "1",
"2", "3", "4", "5", "6", "7"), class = "factor"), seurat_clusters = structure(c(1L,
3L, 1L, 6L, 4L, 1L, 5L, 5L, 5L, 6L), levels = c("0", "1",
"2", "3", "4", "5", "6", "7"), class = "factor")), row.names = c("AAACATACAACCAC-1",
"AAACATTGAGCTAC-1", "AAACATTGATCAGC-1", "AAACCGTGCTTCCG-1", "AAACCGTGTATGCG-1",
"AAACGCACTGGTAC-1", "AAACGCTGACCAGT-1", "AAACGCTGGTTCTT-1", "AAACGCTGTAGCCA-1",
"AAACGCTGTTTCTG-1"), class = "data.frame")
Here is the code I used to generate `mke_cluster` and `mke_gene`:
mke_cluster = seurat@meta.data %>% select(louvain,seurat_clusters)
mke_gene = cl_markers %>% select(cluster,gene,avg_log2FC)
names(mke_gene)[1] = "seurat_clusters"
**UPDATE**
Seurat dataframe
dput(head(seurat, n = 10))
structure(list(orig.ident = structure(c(1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L), levels = "SeuratProject", class = "factor"),
nCount_RNA = c(2419, 4903, 3147, 2639, 980, 2163, 2175, 2260,
1275, 1103), nFeature_RNA = c(779L, 1352L, 1129L, 960L, 521L,
781L, 782L, 790L, 532L, 550L), n_genes = c(781, 1352, 1131,
960, 522, 782, 783, 790, 533, 550), percent_mito = c(0.0301777590066195,
0.0379359573125839, 0.00889736227691174, 0.0174308456480503,
0.0122448978945613, 0.016643550246954, 0.0381609201431274,
0.0309734512120485, 0.0117647061124444, 0.0290117859840393
), n_counts = c(2419, 4903, 3147, 2639, 980, 2163, 2175,
2260, 1275, 1103), louvain = structure(c(1L, 3L, 1L, 2L,
5L, 4L, 4L, 4L, 1L, 6L), levels = c("CD4 T cells", "CD14+ Monocytes",
"B cells", "CD8 T cells", "NK cells", "FCGR3A+ Monocytes",
"Dendritic cells", "Megakaryocytes"), class = "factor"),
percent.mt = c(3.01777594047127, 3.79359575769937, 0.889736256752463,
1.74308450170519, 1.22448979591837, 1.66435506241331, 3.81609195402299,
3.09734513274336, 1.17647058823529, 2.9011786038078), RNA_snn_res.1 = structure(c(1L,
3L, 1L, 6L, 4L, 1L, 5L, 5L, 5L, 6L), levels = c("0", "1",
"2", "3", "4", "5", "6", "7"), class = "factor"), seurat_clusters = structure(c(1L,
3L, 1L, 6L, 4L, 1L, 5L, 5L, 5L, 6L), levels = c("0", "1",
"2", "3", "4", "5", "6", "7"), class = "factor")), row.names = c("AAACATACAACCAC-1",
"AAACATTGAGCTAC-1", "AAACATTGATCAGC-1", "AAACCGTGCTTCCG-1", "AAACCGTGTATGCG-1",
"AAACGCACTGGTAC-1", "AAACGCTGACCAGT-1", "AAACGCTGGTTCTT-1", "AAACGCTGTAGCCA-1",
"AAACGCTGTTTCTG-1"), class = "data.frame")
</details>
# Answer 1
**Score**: 1
Is there a mistake in your `mke_cluster` example table, or does cluster 0 really have two different labels: CD4 T cells on the first and third rows and CD8 T cells on the sixth row? How did you set the labels?
If you just want to add the cluster annotations to the gene table, and each cluster has a unique annotation, you can avoid the multiple-matches problem like this:
```R
library(dplyr)
# Deduplicate the cell-level rows to one row per (louvain, seurat_clusters) pair, then join
clusters <- unique(mke_cluster)
merged_data <- left_join(mke_gene, clusters, by = "seurat_clusters")
```
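If a cluster does carry more than one label, as cluster 0 appears to in the example table, `unique(mke_cluster)` keeps one row per label and the join duplicates gene rows again. One possible workaround, sketched here in base R, is to cross-tabulate clusters against labels, take the most frequent label per cluster, and look it up with `match()`. The toy data frames copy the `head()` output from the question; the majority-vote rule and the names `counts`, `n_labels`, `cluster_labels`, and `idx` are assumptions for illustration, not part of the original answer or of Seurat.

```R
# Toy stand-ins copied from the head() output shown in the question
mke_cluster <- data.frame(
  louvain = c("CD4 T cells", "B cells", "CD4 T cells",
              "CD14+ Monocytes", "NK cells", "CD8 T cells"),
  seurat_clusters = c("0", "2", "0", "5", "3", "0"),
  row.names = c("AAACATACAACCAC-1", "AAACATTGAGCTAC-1", "AAACATTGATCAGC-1",
                "AAACCGTGCTTCCG-1", "AAACCGTGTATGCG-1", "AAACGCACTGGTAC-1"),
  stringsAsFactors = FALSE
)
mke_gene <- data.frame(
  seurat_clusters = c("0", "2", "2", "0", "1", "0"),
  gene = c("LDHB", "RPS12", "RPS25", "CD3D", "RPS27", "RPS6"),
  avg_log2FC = c(1.6653235, 0.8438077, 0.9089848,
                 1.5250903, 0.7858780, 0.7065248),
  row.names = c("LDHB", "RPS12", "RPS25", "CD3D", "RPS27", "RPS6"),
  stringsAsFactors = FALSE
)

# Cross-tabulate: rows are clusters, columns are louvain labels, cells are counts
counts <- table(mke_cluster$seurat_clusters, mke_cluster$louvain)

# Distinct labels per cluster; any value above 1 marks an ambiguous cluster
n_labels <- rowSums(counts > 0)

# Assumed rule: give each cluster its most frequent louvain annotation
cluster_labels <- data.frame(
  seurat_clusters = rownames(counts),
  louvain = colnames(counts)[apply(counts, 1, which.max)],
  stringsAsFactors = FALSE
)

# Look up each gene's cluster; mke_gene keeps its row count and row order
idx <- match(as.character(mke_gene$seurat_clusters),
             cluster_labels$seurat_clusters)
mke_gene$louvain <- cluster_labels$louvain[idx]
```

In this toy data cluster 1 has no annotated cell, so RPS27 gets `NA`. Whether a majority vote is biologically appropriate depends on how the `louvain` labels were assigned, so it is worth inspecting `counts` before relying on it.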