2023-05-29 18:39:10

Calculate deciles based on subgroup and apply to entire dataset

Question

I have a dataset with columns:

subgroup: [group1, group2]
distribution: continuous variable

I want to calculate deciles based on a subgroup of the dataset:

df <- df %>%
  filter(subgroup == "group1") %>%
  mutate(decile = ntile(distribution, 10))

Then I would like to apply the resulting deciles to the entire dataset (i.e. not just group1).

Is there a way to do this?

Here's an example dataset:

df <- matrix(0, ncol = 3, nrow = 10000)
df[,1] <- 1:10000
df[,2] <- sample(c("group1", "group2"), 10000, replace = TRUE)
df[,3] <- rnorm(10000)
df <- as.data.frame(df)
colnames(df) <- c("id", "subgroup", "value")

I select the subgroup group1 and calculate deciles based on the column value:

df %>% filter(subgroup == 'group1') %>%
  mutate(decile = ntile(value, 10))

Then I would like to take the decile boundaries obtained from group1 and use them to classify the observations where subgroup == 'group2'.

The desired output is a 4th column in df holding a single value between 1 and 10 for each observation (i.e. the decile classification of that observation).

Answer 1

Score: 0

We could use cut to divide the values into decile groups based on "group1".

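library(dplyr)
df |> 
  mutate(decile = cut(value, 
                      # breaks are the decile boundaries of group1 only
                      quantile(value[subgroup == "group1"], seq(0, 1, 0.1)),
                      labels = FALSE,
                      # keep group1's minimum in decile 1 instead of NA
                      include.lowest = TRUE)
         )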

Output:

     id subgroup        value decile
1     1   group1  0.674098613      8
2     2   group1 -2.881811886      1
3     3   group1 -0.377427063      4
4     4   group1  0.461585185      7
5     5   group1  0.460216469      7
6     6   group1 -1.374041767      1
7     7   group1 -0.945986918      2
8     8   group2  0.472525168      7
9     9   group2  0.418391193      7
10   10   group2  0.746413150      8
11   11   group2  0.175323464      6
12   12   group1  0.879160602      9
13   13   group1  0.469811384      7
14   14   group2  0.639019379      8
15   15   group1 -0.328276877      4
16   16   group1 -0.099512041      5
17   17   group1 -0.714642875      3
18   18   group1 -0.404702209      4
19   19   group1 -2.181077079      1
20   20   group2 -2.298182006      1

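Data (the example data frame is built from a matrix, so value is stored as character; convert it to numeric before running the code above):

df$value <- as.numeric(df$value)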
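
One caveat with cut(): any value that lies outside the range of group1 (for example a group2 observation below group1's minimum or above its maximum) falls outside the breaks and gets NA. A possible way around this, sketched below, is to use only the nine interior thresholds with base R's findInterval(), so the lowest and highest deciles are open-ended. The data frame is rebuilt here with data.frame() so that value is numeric from the start; the seed and the name inner_breaks are assumptions made for this illustration, not part of the original answer.

```r
# Assumption: same shape as the question's example data, but built with
# data.frame() so that `value` is numeric rather than character.
set.seed(1)  # arbitrary seed, only to make this sketch reproducible
df <- data.frame(
  id       = 1:10000,
  subgroup = sample(c("group1", "group2"), 10000, replace = TRUE),
  value    = rnorm(10000)
)

# The nine interior decile thresholds (10%, 20%, ..., 90%) of group1 only.
inner_breaks <- quantile(df$value[df$subgroup == "group1"],
                         probs = seq(0.1, 0.9, by = 0.1))

# findInterval() returns how many thresholds lie strictly below each value
# (0 to 9); adding 1 turns that into a decile from 1 to 10.
# left.open = TRUE makes the intervals right-closed, like cut()'s default.
# Values beyond group1's range land in decile 1 or 10 instead of NA.
df$decile <- findInterval(df$value, inner_breaks, left.open = TRUE) + 1L

# Counts per decile, by subgroup.
table(df$subgroup, df$decile)
```

For group1 the counts per decile should come out close to equal, while group2's counts show how its values are distributed relative to group1's thresholds.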
