Within dplyr::group_by, obtain the number of observations for ONE of multiple grouping variables

Question

It's very possible this has been asked before, however I am having a very difficult time articulating my problem.

Within my data, I have 3 variables: LOCATION, TOPIC, and RESP (the response). I would like to calculate the distribution of each combination of TOPIC and RESP within each LOCATION.

# Create toy data and perform initial data prep
library(dplyr)

responses <- data.frame(LOCATION = c("LOC_A", "LOC_A", "LOC_A", "LOC_A", "LOC_A", 
                                     "LOC_A", "LOC_A", "LOC_A", 
                                     "LOC_B", "LOC_B", "LOC_B", "LOC_B", "LOC_B", 
                                     "LOC_C", "LOC_C", "LOC_C", "LOC_C", "LOC_C", "LOC_C", 
                                     "LOC_C", "LOC_C", "LOC_C", "LOC_C", "LOC_C", "LOC_C", 
                                     "LOC_C", "LOC_C", "LOC_C", "LOC_C", "LOC_C"),
                        TOPIC = c("Dogs", "Dogs", "Dogs", "Dogs", "Dogs", "Dogs", 
                                  "Lizards", "Lizards", "Lizards",
                                  "Lizards", "Lizards", "Lizards", "Lizards", "Lizards", 
                                  "Lizards", "Lizards", "Snakes", "Snakes", "Snakes", "Snakes", "Snakes", 
                                  "Snakes", "Dogs", "Snakes", "Dogs", "Snakes", "Dogs", 
                                  "Snakes", "Dogs", "Snakes"),
                        RESP = c("Agree", "Disagree", "Agree", "Disagree", "Agree", 
                                 "Disagree", "Agree", "Disagree", 
                                 "Agree", "Disagree", "Agree", "Disagree", "Neither", "Agree",
                                 "Neither", "Agree", "Neither", "Agree", "Neither", 
                                 "Agree", "Neither", "Agree", "Agree", "Neither", 
                                 "Agree", "Neither", "Agree", "Disagree", "Disagree",
                                 "Neither"))

# Obtain counts for each combination of levels
distribution <- responses %>% 
  table() %>% 
  as.data.frame() %>% 
  # Make it more readable
  dplyr::arrange(LOCATION, TOPIC, RESP)

Here is an example solution which uses a loop to create my desired output:

# ugly loop solution :(
# Initialize output container
out <- list()
# Iterate over each location
for(loc in unique(distribution$LOCATION)){
  # Subset distribution for this location
  thisDist <- dplyr::filter(distribution, LOCATION == loc)
  # Calculate percent of each response for this location
  thisDist$percent <- thisDist$Freq/sum(thisDist$Freq)
  # Store distribution df with percent column
  out[[loc]] <- thisDist
}
# combine output into single df
out <- do.call("rbind", out)

What I would like to have is a concise tidyverse solution. Here is some pseudo-code which describes my imaginary solution.

# Imaginary tidyverse solution :)
out <- distribution %>% 
  group_by(LOCATION, TOPIC, RESP) %>% 
  summarise(#percent = Freq/(sum(<all-Freq-values-for-this-group's-LOCATION-value>))
            )

What I'm looking to do here is obtain the sum of all Freq values for the LOCATION value of the current group. Is there a nice way to do this within a group_by/summarise?

Thanks for reading, I hope this isn't completely inscrutable.

Answer 1

Score: 1

Is this what you're looking for?

distribution %>%
  mutate(percent = Freq/sum(Freq), .by = LOCATION)
#    LOCATION   TOPIC     RESP Freq    percent
# 1     LOC_A    Dogs    Agree    3 0.37500000
# 2     LOC_A    Dogs Disagree    3 0.37500000
# 3     LOC_A    Dogs  Neither    0 0.00000000
# 4     LOC_A Lizards    Agree    1 0.12500000
# 5     LOC_A Lizards Disagree    1 0.12500000
# 6     LOC_A Lizards  Neither    0 0.00000000
# 7     LOC_A  Snakes    Agree    0 0.00000000
# 8     LOC_A  Snakes Disagree    0 0.00000000
# 9     LOC_A  Snakes  Neither    0 0.00000000
# 10    LOC_B    Dogs    Agree    0 0.00000000
# 11    LOC_B    Dogs Disagree    0 0.00000000
# 12    LOC_B    Dogs  Neither    0 0.00000000
# 13    LOC_B Lizards    Agree    2 0.40000000
# 14    LOC_B Lizards Disagree    2 0.40000000
# 15    LOC_B Lizards  Neither    1 0.20000000
# 16    LOC_B  Snakes    Agree    0 0.00000000
# 17    LOC_B  Snakes Disagree    0 0.00000000
# 18    LOC_B  Snakes  Neither    0 0.00000000
# 19    LOC_C    Dogs    Agree    3 0.17647059
# 20    LOC_C    Dogs Disagree    1 0.05882353
# 21    LOC_C    Dogs  Neither    0 0.00000000
# 22    LOC_C Lizards    Agree    2 0.11764706
# 23    LOC_C Lizards Disagree    0 0.00000000
# 24    LOC_C Lizards  Neither    1 0.05882353
# 25    LOC_C  Snakes    Agree    3 0.17647059
# 26    LOC_C  Snakes Disagree    1 0.05882353
# 27    LOC_C  Snakes  Neither    6 0.35294118

If your dplyr is older than 1.1.0, which introduced the .by argument, use group_by instead:

distribution %>%
  group_by(LOCATION) %>%
  mutate(percent = Freq/sum(Freq))
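
As a quick sanity check, the sketch below compares the two forms on a small made-up table and confirms that the shares add up to 1 within each LOCATION. The data frame toy and the names by_arg, by_group, and totals are hypothetical and are not part of the question's data; the .by calls assume dplyr 1.1.0 or later.

```r
library(dplyr)

# Hypothetical miniature of the `distribution` table (not the question's data)
toy <- data.frame(
  LOCATION = c("LOC_A", "LOC_A", "LOC_B", "LOC_B", "LOC_B"),
  TOPIC    = c("Dogs", "Lizards", "Dogs", "Lizards", "Snakes"),
  Freq     = c(3, 1, 2, 2, 1)
)

# Per-operation grouping (assumes dplyr >= 1.1.0)
by_arg <- toy %>%
  mutate(percent = Freq / sum(Freq), .by = LOCATION)

# Persistent grouping, dropped again with ungroup()
by_group <- toy %>%
  group_by(LOCATION) %>%
  mutate(percent = Freq / sum(Freq)) %>%
  ungroup()

# Shares within each LOCATION should add up to 1
totals <- by_arg %>%
  summarise(total = sum(percent), .by = LOCATION)
```

One practical difference: the group_by version returns a grouped tibble unless ungroup() is called afterwards, while .by groups only for that single mutate call and returns an ungrouped result.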

Answer 2

Score: 1

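The key is to use mutate rather than summarise: mutate keeps every row, so sum(Freq) can be computed per LOCATION while Freq itself stays at the row level.

out <- distribution %>% 
  ungroup() %>% 
  group_by(LOCATION) %>% 
  mutate(percent = Freq / sum(Freq))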
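
To illustrate the difference between the two verbs, here is a minimal sketch on a made-up table. The data frame toy and the names collapsed and kept are hypothetical, chosen only for this example.

```r
library(dplyr)

# Hypothetical counts table (not the question's data)
toy <- data.frame(
  LOCATION = c("LOC_A", "LOC_A", "LOC_B"),
  RESP     = c("Agree", "Disagree", "Agree"),
  Freq     = c(3, 1, 2)
)

# summarise: one row per LOCATION, the per-row detail is gone
collapsed <- toy %>%
  group_by(LOCATION) %>%
  summarise(total = sum(Freq))

# mutate: same number of rows as the input, with the group total
# available to every row of that LOCATION
kept <- toy %>%
  group_by(LOCATION) %>%
  mutate(total = sum(Freq), percent = Freq / total) %>%
  ungroup()
```

Here collapsed has one row per LOCATION, whereas kept has as many rows as toy, which is what a per-row percentage needs.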

huangapple
  • Posted on August 11, 2023 at 01:46:56
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76878179.html