Calculate deciles based on subgroup and apply to entire dataset

Question

I have a dataset with columns:

subgroup: [group1, group2]
distribution: continuous variable

I want to calculate deciles based on a subgroup of the dataset:

df <- df %>%
  filter(subgroup == "group1") %>%
  mutate(decile = ntile(distribution, 10))

Then I would like to apply the resulting deciles to the entire dataset (i.e. not just group1).

Is there a way to do this?

Here's an example dataset:

# A matrix holds a single type, so inserting the group labels below
# coerces every column (including value) to character.
df <- matrix(0, ncol = 3, nrow = 10000)
df[, 1] <- 1:10000
df[, 2] <- sample(c("group1", "group2"), 10000, replace = TRUE)
df[, 3] <- rnorm(10000)
df <- as.data.frame(df)
colnames(df) <- c("id", "subgroup", "value")

I select the subgroup group1 and calculate deciles based on the column value:

df %>% filter(subgroup == 'group1') %>%
  mutate(decile = ntile(value, 10))

Then I would like to classify subgroup == 'group2' using the deciles obtained from 'group1'.

The desired output is a 4th column in df holding a single value between 1 and 10 for each observation (i.e. the decile classification of each observation).

Answer 1

Score: 0

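We could use cut() to bin the values into decile groups whose boundaries come from "group1" only.

library(dplyr)

# Values outside (min, max] of group1, including group1's minimum itself, get NA.
df |>
  mutate(decile = cut(value,
                      # break points: decile boundaries of group1 only
                      quantile(value[subgroup == "group1"], seq(0, 1, 0.1)),
                      # labels = FALSE returns the bin number instead of a factor
                      labels = FALSE)
         )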

Output:

     id subgroup        value decile
1     1   group1  0.674098613      8
2     2   group1 -2.881811886      1
3     3   group1 -0.377427063      4
4     4   group1  0.461585185      7
5     5   group1  0.460216469      7
6     6   group1 -1.374041767      1
7     7   group1 -0.945986918      2
8     8   group2  0.472525168      7
9     9   group2  0.418391193      7
10   10   group2  0.746413150      8
11   11   group2  0.175323464      6
12   12   group1  0.879160602      9
13   13   group1  0.469811384      7
14   14   group2  0.639019379      8
15   15   group1 -0.328276877      4
16   16   group1 -0.099512041      5
17   17   group1 -0.714642875      3
18   18   group1 -0.404702209      4
19   19   group1 -2.181077079      1
20   20   group2 -2.298182006      1

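Data (value is stored as character in the example dataset, so convert it to numeric before running the code above):

df$value <- as.numeric(df$value)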
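
One caveat with the cut() call above: with default settings the bins are (lower, upper], so the smallest group1 value, and any group2 value outside group1's range, comes back as NA. A minimal base-R sketch of one way around that, assuming it is acceptable to fold out-of-range values into deciles 1 and 10, is to open the outer break points to -Inf and Inf. The toy data, the seed, and the variable name breaks are illustrative assumptions, not part of the original answer.

```r
set.seed(42)  # arbitrary seed, only so the sketch is reproducible

# Toy data shaped like the example in the question; building the data frame
# directly keeps value numeric, so no as.numeric() conversion is needed.
df <- data.frame(
  id       = 1:1000,
  subgroup = sample(c("group1", "group2"), 1000, replace = TRUE),
  value    = rnorm(1000)
)

# Decile boundaries estimated from group1 only (11 break points).
breaks <- quantile(df$value[df$subgroup == "group1"], probs = seq(0, 1, 0.1))

# Open the outer bins so values at or beyond group1's extremes still get a decile.
breaks[1] <- -Inf
breaks[length(breaks)] <- Inf

# Classify every row, group1 and group2 alike, against the group1 boundaries.
df$decile <- cut(df$value, breaks = breaks, labels = FALSE)

# Quick check: counts per decile for each subgroup.
table(df$subgroup, df$decile)
```

With this variant group1 rows should be spread roughly evenly across the ten bins, while group2 counts depend on how its distribution compares with group1's.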

Posted by huangapple on 2023-05-29 18:39:10
When reposting, please keep the link to this article: https://go.coder-hub.com/76356628.html