Display the distribution of two groups on the same plot, using two data frames
Question
<details>
<summary>English:</summary>
I have a data frame, `scores`, with 2000 rows and 25 columns. The columns are features and the rows are samples. This data frame holds the data I want to use to plot the distributions.
In another data frame, `metadata`, I have clinical information about each sample in the `scores` data frame, such as gender, age, type of disease, treatment, and most importantly outcome of treatment. This data frame serves as the labels: it gives the label for each sample.
The two data frames contain exactly the same samples.
Three columns each describe a different kind of response for each sample, and those columns are binary (yes or no).
My goal is to make a distribution plot comparing the samples in the yes group with those in the no group, for each of those 3 columns.
Here is an example. Say this is `scores`:
               Feature_1  Feature_2  Feature_3
    Patient_1  0.56       0.11       0.03
    Ptient_2   0.605      0.34       0.49
    P_3        0.1        0.76       0.42
    12312AX    0.9        0.382      0.12
    P_10       0.89       0.30       0.119
    12312BX    0.232      0.118      0.80
    12312CX    0.679      0.31       0.789
And this is `metadata`:
               Gender  Age  Outcome1  Outcome2  Outcome3
    Patient_1  M       54   1         0         0
    Ptient_2   M       28   0         0         1
    P_3        F       32   1         1         0
    12312AX    F       87   0         0         1
    P_10       F       43   0         0         1
    12312BX    M       90   1         1         0
    12312CX    F       65   1         0         0
Now, for example, I want to plot `Feature_1` for the samples with `Outcome1 = 1` vs. the samples with `Outcome1 = 0`, and put them on the same plot to see the difference. A plot that would look like this:
[![enter image description here][1]][1]
[1]: https://i.stack.imgur.com/lpflL.png
It doesn't matter if it's not filled with color.
Here is a subset of the data, starting with `scores`:
structure(list(`Feature_1` = c(0.58126387599574, 0.554773857342486,
0.73811669435931, 0.5993561705421, 0.549993884896126, 0.560952809292699,
0.514920708901865, 0.668611976328753, 0.579311040856707, 0.627079649056927,
0.549778821698995, 0.563433551362653, 0.566883741540508, 0.586839499814986,
0.527874599585146, 0.533974585406425, 0.583020804822263, 0.607821542253184,
0.570922624085177, 0.531065608748296), `Feature_2` = c(0.671868971517913,
0.657649690364772, 0.681277871841209, 0.633247301225077, 0.658829966989863,
0.649553434195565, 0.654719152272398, 0.678510931368968, 0.67606269281911,
0.657861486037168, 0.656157657102225, 0.654684442044789, 0.660668253143108,
0.680000904001928, 0.676215636114716, 0.68015840395165, 0.656533748483226,
0.654344382579621, 0.626207872177309, 0.640129803823085), `Feature10` = c(0.607691853076,
0.507746766229958, 0.642056075026442, 0.647793952813017, 0.571844979370279,
0.592183904204232, 0.473827520445559, 0.618900091543045, 0.60656936545554,
0.60603612041945, 0.510241627095173, 0.564418205496303, 0.561084611266194,
0.558495659089567, 0.503235910349171, 0.492768739941572, 0.551283907128425,
0.664425637003928, 0.541804175576185, 0.537845283573044)), row.names = c("Pt1",
"Pt10", "Pt101", "Pt103", "Pt106", "Pt11", "Pt17", "Pt18", "Pt2",
"Pt24", "Pt26", "Pt27", "Pt28", "Pt29", "Pt3", "Pt30", "Pt31",
"Pt34", "Pt36", "Pt37"), class = "data.frame")
And the `metadata`:
structure(list(Response = c("No", "No", "Yes",
"No", "Yes", "No", "No", "Yes",
"No", "Yes", "No", "No", "Yes",
"No", "Yes", "Yes", "No", "Yes",
"No", "No"), Gender = c("F", "M",
"F", "M", "M", "F", "M",
"M", "F", "M", "M", "M",
"M", "F", "F", "M", "F",
"F", "M", "F"), Response2 = c(1, 0, 0,
1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0)), row.names = c("Pt1",
"Pt10", "Pt101", "Pt103", "Pt106", "Pt11", "Pt17", "Pt18", "Pt2",
"Pt24", "Pt26", "Pt27", "Pt28", "Pt29", "Pt3", "Pt30", "Pt31",
"Pt34", "Pt36", "Pt37"), class = "data.frame")
</details>
# Answer 1
**Score**: 0
You can do this with the ggplot2 package. First merge the two data frames by row names, then plot the merged data with ggplot:
```R
# Merge by row names (by = 0); the row names end up in a column called Row.names
df <- merge(scores, metadata, by = 0, all = TRUE)

# Plot one density curve per Response group
library(ggplot2)
ggplot(data = df, aes(x = Feature_1, fill = Response)) + geom_density(alpha = .3)
```
If your categorical variable is numeric (0 or 1 instead of "Yes" or "No"), you can convert it to a factor:
```R
ggplot(data = df, aes(x = Feature_1, fill = factor(Response2))) + geom_density(alpha = .3)
```
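The question asks for this comparison in each of the three outcome columns. One possible way to extend the merge-then-plot idea is to stack the outcome columns into long format and draw one panel per outcome. This is only a sketch: it uses the small example tables from the question rather than the real 2000-row data, `facet_wrap` is an addition not used in the original answer, and the names `ids`, `outcome_cols`, `long`, and `label` are made up for illustration.

```r
library(ggplot2)

# Toy data copied from the example tables in the question (7 samples).
ids <- c("Patient_1", "Ptient_2", "P_3", "12312AX", "P_10", "12312BX", "12312CX")
scores <- data.frame(
  Feature_1 = c(0.56, 0.605, 0.1, 0.9, 0.89, 0.232, 0.679),
  Feature_2 = c(0.11, 0.34, 0.76, 0.382, 0.30, 0.118, 0.31),
  Feature_3 = c(0.03, 0.49, 0.42, 0.12, 0.119, 0.80, 0.789),
  row.names = ids
)
metadata <- data.frame(
  Gender   = c("M", "M", "F", "F", "F", "M", "F"),
  Age      = c(54, 28, 32, 87, 43, 90, 65),
  Outcome1 = c(1, 0, 1, 0, 0, 1, 1),
  Outcome2 = c(0, 0, 1, 0, 0, 1, 0),
  Outcome3 = c(0, 1, 0, 1, 1, 0, 0),
  row.names = ids
)

# Both data frames should describe the same samples before merging.
stopifnot(setequal(rownames(scores), rownames(metadata)))

# Merge by row names; the sample IDs land in the Row.names column.
df <- merge(scores, metadata, by = 0)

# Stack the outcome columns (hypothetical helper names):
# one row per sample and outcome column.
outcome_cols <- c("Outcome1", "Outcome2", "Outcome3")
long <- do.call(rbind, lapply(outcome_cols, function(col) {
  data.frame(
    sample    = as.character(df$Row.names),
    Feature_1 = df$Feature_1,
    outcome   = col,
    label     = factor(df[[col]], levels = c(0, 1))
  )
}))

# One panel per outcome column, one density curve per label (0 vs 1).
p <- ggplot(long, aes(x = Feature_1, fill = label)) +
  geom_density(alpha = 0.3) +
  facet_wrap(~ outcome)
print(p)
```

To look at a different feature, replace `Feature_1` in both the `data.frame()` call and `aes()`. With the real data, groups that contain very few samples will give unreliable density curves.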