英文:
How to indicate a higher hierarchy level using dplyr?
问题
我有一个数据框(df
),其中每一行表示一个横贯(Transect
)和站点(Site
)内特定栖息地(Habitat
)的起点(Start
)和终点(End
)的米数。需要注意的是,横贯的长度在站点内和站点之间都有所不同。例如:
df <- data.frame(Site = c("A","A","A","A","A","A","A","A","A","B","B","B","B","B","B","B","B"),
Transect = c(1,1,1,1,2,2,2,2,2,1,1,1,1,2,2,2,2),
Habitat = c("X","Y","X","Z","Z","Y","X","Z","X","X","Z","X","Y","Z","X","Y","Z"),
Start=c(0,2.8,3.4,5,0,1.5,5,8,12,0,2,5,7.5,0,4,8,12),
End=c(2.8,3.4,5,10,1.5,5,8,12,15,2,5,7.5,20,4,8,12,15))
df
在这个示例中,例如,栖息地X
在站点A
的横贯1
中出现两次。此外,我们可以观察到站点A
中横贯1
和横贯2
的总长度分别为10米和15米。在站点B
中,横贯1
和横贯2
的总长度分别为20米和15米。
我想要计算每个站点和横贯中每个栖息地相对于所有栖息地的总米数的百分比。例如,在站点A
和横贯1
中,栖息地X
占据了总长度为10米的横贯1
的44%。在站点A
和横贯2
中,栖息地X
占据了总长度为15米的横贯2
的6米,相当于40%。
为了实现这个目标,我首先计算每个栖息地记录(行)的长度(Length
):
df$Length <- df$End - df$Start
然后,我想要按站点和横贯分组,计算每个栖息地的米数相对于其他栖息地和横贯的总长度的百分比。我尝试了以下代码:
df2 <- as.data.frame(df %>% group_by(Site, Transect, Habitat) %>% summarise(Percentage = (sum(Length)/max(End))*100))
我想将代码中的 max(End)
更改为另一个表达式,该表达式表示横贯的总长度。目前,max(End)
表示特定栖息地最后出现的米数(End
)。如何在代码中包含“横贯的最大值
”,但是只在特定站点和横贯内,而不是特定栖息地?
如何做到这一点?我期望的输出如下:
Site Transect Habitat Percentage
1 A 1 X 44.0
2 A 1 Y 6.0
3 A 1 Z 50.0
4 A 2 X 40.0
5 A 2 Y 23.3
6 A 2 Z 36.7
7 B 1 X 22.5
8 B 1 Y 62.5
9 B 1 Z 15.0
10 B 2 X 26.7
11 B 2 Y 26.7
12 B 2 Z 46.7
有谁知道如何做到这一点?
提前感谢!
英文:
I have a dataframe (df
) in which each row represents the start (Start
) and the end (End
) of a specific habitat (Habitat
) within a transect (Transect
) and site (Site
) in meters. It is important to note that the length of the transects varies within and among sites. As an example:
df <- data.frame(Site = c("A","A","A","A","A","A","A","A","A","B","B","B","B","B","B","B","B"),
Transect = c(1,1,1,1,2,2,2,2,2,1,1,1,1,2,2,2,2),
Habitat = c("X","Y","X","Z","Z","Y","X","Z","X","X","Z","X","Y","Z","X","Y","Z"),
Start=c(0,2.8,3.4,5,0,1.5,5,8,12,0,2,5,7.5,0,4,8,12),
End=c(2.8,3.4,5,10,1.5,5,8,12,15,2,5,7.5,20,4,8,12,15))
df
Site Transect Habitat Start End
1 A 1 X 0.0 2.8 # Habitat `X` is between the meters 0 and 2.8
2 A 1 Y 2.8 3.4 # Habitat `Y` is between the meters 2.8 and 3.4
3 A 1 X 3.4 5.0 # Habitat `X` is between the meters 3.4 and 5.0
4 A 1 Z 5.0 10.0 # Habitat `Z` is between the meters 5 and 10.0
5 A 2 Z 0.0 1.5
6 A 2 Y 1.5 5.0
7 A 2 X 5.0 8.0
8 A 2 Z 8.0 12.0
9 A 2 X 12.0 15.0
10 B 1 X 0.0 2.0
11 B 1 Z 2.0 5.0
12 B 1 X 5.0 7.5
13 B 1 Y 7.5 20.0
14 B 2 Z 0.0 4.0
15 B 2 X 4.0 8.0
16 B 2 Y 8.0 12.0
17 B 2 Z 12.0 15.0
In this example, for instance, habitat X
is twice in the transect 1
in site A
. Also, we can observe that the total length of transects 1
and 2
in site A
are 10 and 15 m, respectively. In site B
, the total length of the transects 1
and 2
are 20
and 15
meters, respectively.
What I want is to calculate per Site
and Transect
the percentage that each Habitat
represents with respect to all the habitats presented in terms of meters. For example, in transect 1
and site A
habitat X
represents 4.4 meters of a total length of 10 meters for transect 1
. In site A
and transect 2
, habitat X
has 6 meters from a total length of 15 meters for transect B
.
To this aim, the first thing I do is to calculate the length (Length
) in meters of each habitat record (=row)
df$Length <- df$End - df$Start
Then, what I want is to calculate by site and transect the percentage that the meters of an habitat represents with respect the rest of habitats and the total length of the transect. I tried this:
df2 <- as.data.frame(df %>% group_by(Site, Transect, Habitat) %>% summarise(Porcentage = (sum(Length)/max(End))*100))
I want to change max(End)
to another expression that represents the total length OF THE TRANSECT. Right now max(End)
represents the last meter (End
) in which a specific habitat was present. How can I include in the code above "maximum value of End
" but within of a specific Site
and Transect
, but not for a specific Habitat
.
How can I do it? My desired output would be this:
Site Transect Habitat Percentage
1 A 1 X 44.0
2 A 1 Y 6.0
3 A 1 Z 50.0
4 A 2 X 40.0
5 A 2 Y 23.3
6 A 2 Z 36.7
7 B 1 X 22.5
8 B 1 Y 62.5
9 B 1 Z 15.0
10 B 2 X 26.7
11 B 2 Y 26.7
12 B 2 Z 46.7
Does anyone know how to do it?
Thanks in advance!
答案1
得分: 1
使用dplyr
,当你有不同层次的层级需要管理时,可能需要多个group_by()
语句。在下面的代码中,我使用group_by(Site, Transect, Habitat)
来计算每个Site
和Transect
中每个栖息地的总长度,然后使用group_by(Site, Transect)
来计算百分比。
library(dplyr)
df <- data.frame(Site = c("A","A","A","A","A","A","A","A","A","B","B","B","B","B","B","B","B"),
Transect = c(1,1,1,1,2,2,2,2,2,1,1,1,1,2,2,2,2),
Habitat = c("X","Y","X","Z","Z","Y","X","Z","X","X","Z","X","Y","Z","X","Y","Z"),
Start=c(0,2.8,3.4,5,0,1.5,5,8,12,0,2,5,7.5,0,4,8,12),
End=c(2.8,3.4,5,10,1.5,5,8,12,15,2,5,7.5,20,4,8,12,15))
df %>%
mutate(length = End-Start) %>%
group_by(Site, Transect, Habitat) %>%
summarise(tot_length = sum(length)) %>%
group_by(Site, Transect) %>%
mutate(percentage = 100*tot_length/sum(tot_length))
#> `summarise()` has grouped output by 'Site', 'Transect'. You can override using
#> the `.groups` argument.
#> # A tibble: 12 × 5
#> # Groups: Site, Transect [4]
#> Site Transect Habitat tot_length percentage
#> <chr> <dbl> <chr> <dbl> <dbl>
#> 1 A 1 X 4.4 44
#> 2 A 1 Y 0.6 6
#> 3 A 1 Z 5 50
#> 4 A 2 X 6 40
#> 5 A 2 Y 3.5 23.3
#> 6 A 2 Z 5.5 36.7
#> 7 B 1 X 4.5 22.5
#> 8 B 1 Y 12.5 62.5
#> 9 B 1 Z 3 15
#> 10 B 2 X 4 26.7
#> 11 B 2 Y 4 26.7
#> 12 B 2 Z 7 46.7
Created on 2023-02-16 by the reprex package (v2.0.1)
在你上面的代码中,当计算百分比时,你的数据仍然按Habitat
分组,所以你计算的百分比是在Habitat
内而不是在Site
和Transect
对之间的栖息地之间。
英文:
With dplyr
, when you have different levels of hierarchy that need managing, you may need multiple group_by()
statements. In the code below, I use group_by(Site, Transect, Habitat)
to calculate the total length of each habitat in the Site
and Transect
and then group_by(Site, Transect)
to calculate the percentage.
library(dplyr)
df <- data.frame(Site = c("A","A","A","A","A","A","A","A","A","B","B","B","B","B","B","B","B"),
Transect = c(1,1,1,1,2,2,2,2,2,1,1,1,1,2,2,2,2),
Habitat = c("X","Y","X","Z","Z","Y","X","Z","X","X","Z","X","Y","Z","X","Y","Z"),
Start=c(0,2.8,3.4,5,0,1.5,5,8,12,0,2,5,7.5,0,4,8,12),
End=c(2.8,3.4,5,10,1.5,5,8,12,15,2,5,7.5,20,4,8,12,15))
df %>%
mutate(length = End-Start) %>%
group_by(Site, Transect, Habitat) %>%
summarise(tot_length = sum(length)) %>%
group_by(Site, Transect) %>%
mutate(percentage = 100*tot_length/sum(tot_length))
#> `summarise()` has grouped output by 'Site', 'Transect'. You can override using
#> the `.groups` argument.
#> # A tibble: 12 × 5
#> # Groups: Site, Transect [4]
#> Site Transect Habitat tot_length percentage
#> <chr> <dbl> <chr> <dbl> <dbl>
#> 1 A 1 X 4.4 44
#> 2 A 1 Y 0.6 6
#> 3 A 1 Z 5 50
#> 4 A 2 X 6 40
#> 5 A 2 Y 3.5 23.3
#> 6 A 2 Z 5.5 36.7
#> 7 B 1 X 4.5 22.5
#> 8 B 1 Y 12.5 62.5
#> 9 B 1 Z 3 15
#> 10 B 2 X 4 26.7
#> 11 B 2 Y 4 26.7
#> 12 B 2 Z 7 46.7
<sup>Created on 2023-02-16 by the reprex package (v2.0.1)</sup>
In your code from above, when you are calculating the percentage, your data are still grouped by Habitat
, so the percentage you are calculating is within the Habitat
rather than across habitats within Site
and Transect
pairs.
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论