How to indicate a higher hierarchy level using dplyr?

Question

I have a dataframe (df) in which each row represents the start (Start) and the end (End) of a specific habitat (Habitat) within a transect (Transect) and site (Site) in meters. It is important to note that the length of the transects varies within and among sites. As an example:

df <- data.frame(Site = c("A","A","A","A","A","A","A","A","A","B","B","B","B","B","B","B","B"),
                 Transect = c(1,1,1,1,2,2,2,2,2,1,1,1,1,2,2,2,2),
                 Habitat = c("X","Y","X","Z","Z","Y","X","Z","X","X","Z","X","Y","Z","X","Y","Z"),
                 Start=c(0,2.8,3.4,5,0,1.5,5,8,12,0,2,5,7.5,0,4,8,12),
                 End=c(2.8,3.4,5,10,1.5,5,8,12,15,2,5,7.5,20,4,8,12,15))

df

  Site Transect Habitat Start  End
1     A        1       X   0.0  2.8  # Habitat `X` is between the meters 0 and 2.8
2     A        1       Y   2.8  3.4  # Habitat `Y` is between the meters 2.8 and 3.4
3     A        1       X   3.4  5.0  # Habitat `X` is between the meters 3.4 and 5.0
4     A        1       Z   5.0 10.0  # Habitat `Z` is between the meters 5 and 10.0
5     A        2       Z   0.0  1.5
6     A        2       Y   1.5  5.0
7     A        2       X   5.0  8.0
8     A        2       Z   8.0 12.0
9     A        2       X  12.0 15.0
10    B        1       X   0.0  2.0
11    B        1       Z   2.0  5.0
12    B        1       X   5.0  7.5
13    B        1       Y   7.5 20.0
14    B        2       Z   0.0  4.0
15    B        2       X   4.0  8.0
16    B        2       Y   8.0 12.0
17    B        2       Z  12.0 15.0

In this example, habitat X appears twice in transect 1 of site A. We can also see that the total lengths of transects 1 and 2 in site A are 10 and 15 meters, respectively. In site B, the total lengths of transects 1 and 2 are 20 and 15 meters, respectively.

What I want is to calculate, per Site and Transect, the percentage of the transect's total length (in meters) that each Habitat represents. For example, in site A, transect 1, habitat X covers 4.4 meters of a total transect length of 10 meters (44%). In site A, transect 2, habitat X covers 6 meters of a total transect length of 15 meters (40%).
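The arithmetic behind the 44% figure can be checked by hand with a minimal base-R sketch. The helper names `a1`, `x_length`, and `transect_length` are illustrative only and do not come from the original post.

```r
df <- data.frame(Site = c("A","A","A","A","A","A","A","A","A","B","B","B","B","B","B","B","B"),
                 Transect = c(1,1,1,1,2,2,2,2,2,1,1,1,1,2,2,2,2),
                 Habitat = c("X","Y","X","Z","Z","Y","X","Z","X","X","Z","X","Y","Z","X","Y","Z"),
                 Start = c(0,2.8,3.4,5,0,1.5,5,8,12,0,2,5,7.5,0,4,8,12),
                 End = c(2.8,3.4,5,10,1.5,5,8,12,15,2,5,7.5,20,4,8,12,15))

# Length of each habitat record (row)
df$Length <- df$End - df$Start

# Keep only site A, transect 1
a1 <- df[df$Site == "A" & df$Transect == 1, ]

# Habitat X occurs twice there: 2.8 m + 1.6 m = 4.4 m
x_length <- sum(a1$Length[a1$Habitat == "X"])

# All habitat records of the transect together: 10 m
transect_length <- sum(a1$Length)

# Share of habitat X in the transect, in percent (rounded to one decimal)
round(100 * x_length / transect_length, 1)
```

The rounding is there only because the subtraction of decimal values can leave tiny floating-point residue.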

To this end, I first calculate the length (Length) in meters of each habitat record (i.e., each row):

df$Length <- df$End - df$Start

Then I want to calculate, by site and transect, the percentage that the meters of each habitat represent relative to the total length of the transect. I tried this:

df2 <- as.data.frame(df %>% group_by(Site, Transect, Habitat) %>% summarise(Percentage = (sum(Length)/max(End))*100))

I want to replace max(End) with an expression that represents the total length of the transect. Right now, because the data are grouped by Habitat, max(End) is the last meter (End) at which a specific habitat was present. How can I use the maximum value of End within a specific Site and Transect, rather than within a specific Habitat?
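To make the difference between the two grouping levels concrete, here is a minimal base-R sketch using `aggregate()`. The names `per_habitat` and `per_transect` are illustrative and not part of the original post.

```r
df <- data.frame(Site = c("A","A","A","A","A","A","A","A","A","B","B","B","B","B","B","B","B"),
                 Transect = c(1,1,1,1,2,2,2,2,2,1,1,1,1,2,2,2,2),
                 Habitat = c("X","Y","X","Z","Z","Y","X","Z","X","X","Z","X","Y","Z","X","Y","Z"),
                 Start = c(0,2.8,3.4,5,0,1.5,5,8,12,0,2,5,7.5,0,4,8,12),
                 End = c(2.8,3.4,5,10,1.5,5,8,12,15,2,5,7.5,20,4,8,12,15))

# max(End) at the Site/Transect/Habitat level: the last meter of each habitat
per_habitat <- aggregate(End ~ Site + Transect + Habitat, data = df, FUN = max)

# max(End) at the Site/Transect level: the end of the whole transect
per_transect <- aggregate(End ~ Site + Transect, data = df, FUN = max)

per_habitat
per_transect
```

For site A, transect 1, habitat Y, for instance, the habitat-level maximum is 3.4, while the transect-level maximum is 10. The second value is the denominator the question is after.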

How can I do it? My desired output would be this:

   Site Transect Habitat Percentage
1     A        1       X       44.0
2     A        1       Y        6.0
3     A        1       Z       50.0
4     A        2       X       40.0
5     A        2       Y       23.3
6     A        2       Z       36.7
7     B        1       X       22.5
8     B        1       Y       62.5
9     B        1       Z       15.0
10    B        2       X       26.7
11    B        2       Y       26.7
12    B        2       Z       46.7

Does anyone know how to do it?

Thanks in advance!

Answer 1

Score: 1

With dplyr, when you have different levels of hierarchy that need managing, you may need multiple group_by() statements. In the code below, I use group_by(Site, Transect, Habitat) to calculate the total length of each habitat in the Site and Transect and then group_by(Site, Transect) to calculate the percentage.

library(dplyr)
df <- data.frame(Site = c("A","A","A","A","A","A","A","A","A","B","B","B","B","B","B","B","B"),
                 Transect = c(1,1,1,1,2,2,2,2,2,1,1,1,1,2,2,2,2),
                 Habitat = c("X","Y","X","Z","Z","Y","X","Z","X","X","Z","X","Y","Z","X","Y","Z"),
                 Start=c(0,2.8,3.4,5,0,1.5,5,8,12,0,2,5,7.5,0,4,8,12),
                 End=c(2.8,3.4,5,10,1.5,5,8,12,15,2,5,7.5,20,4,8,12,15))

df %>% 
  mutate(length = End-Start) %>% 
  group_by(Site, Transect, Habitat) %>% 
  summarise(tot_length = sum(length)) %>% 
  group_by(Site, Transect) %>% 
  mutate(percentage = 100*tot_length/sum(tot_length))
#> `summarise()` has grouped output by 'Site', 'Transect'. You can override using
#> the `.groups` argument.
#> # A tibble: 12 × 5
#> # Groups:   Site, Transect [4]
#>    Site  Transect Habitat tot_length percentage
#>    <chr>    <dbl> <chr>        <dbl>      <dbl>
#>  1 A            1 X              4.4       44  
#>  2 A            1 Y              0.6        6  
#>  3 A            1 Z              5         50  
#>  4 A            2 X              6         40  
#>  5 A            2 Y              3.5       23.3
#>  6 A            2 Z              5.5       36.7
#>  7 B            1 X              4.5       22.5
#>  8 B            1 Y             12.5       62.5
#>  9 B            1 Z              3         15  
#> 10 B            2 X              4         26.7
#> 11 B            2 Y              4         26.7
#> 12 B            2 Z              7         46.7

Created on 2023-02-16 by the reprex package (v2.0.1)

In your original code, the data are still grouped by Habitat when the percentage is calculated, so max(End) is evaluated within each Habitat rather than across all habitats within each Site and Transect pair.
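If you would rather keep max(End) as the denominator, as the question asks, one possible variant is to compute it at the Site and Transect level before grouping by Habitat. This is only a sketch: the column name `TransectLength` and the object `result` are illustrative, and it assumes every transect starts at meter 0, so that max(End) equals the transect's total length. `.groups = "drop"` requires dplyr 1.0.0 or later.

```r
library(dplyr)

df <- data.frame(Site = c("A","A","A","A","A","A","A","A","A","B","B","B","B","B","B","B","B"),
                 Transect = c(1,1,1,1,2,2,2,2,2,1,1,1,1,2,2,2,2),
                 Habitat = c("X","Y","X","Z","Z","Y","X","Z","X","X","Z","X","Y","Z","X","Y","Z"),
                 Start = c(0,2.8,3.4,5,0,1.5,5,8,12,0,2,5,7.5,0,4,8,12),
                 End = c(2.8,3.4,5,10,1.5,5,8,12,15,2,5,7.5,20,4,8,12,15))

result <- df %>%
  mutate(Length = End - Start) %>%
  # Higher level: attach the transect's maximum End to every row of that transect
  group_by(Site, Transect) %>%
  mutate(TransectLength = max(End)) %>%
  # Lower level: sum each habitat's meters and divide by the transect length
  group_by(Site, Transect, Habitat) %>%
  summarise(Percentage = 100 * sum(Length) / first(TransectLength),
            .groups = "drop")

result
```

Rounded to one decimal, the `Percentage` column should match the desired output in the question. If transects might not start at 0 or might contain gaps, dividing by the summed lengths, as in the code above, is probably the safer choice.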

huangapple
  • Posted on 2023-02-16 19:24:35
  • Please retain this link when reposting: https://go.coder-hub.com/75471570.html