2023-02-16 19:24:35

How to indicate a higher hierarchy level using dplyr?

Question

I have a data frame (df) in which each row gives the start (Start) and end (End), in meters, of a specific habitat (Habitat) within a transect (Transect) at a site (Site). Note that transect length varies both within and among sites. For example:

df <- data.frame(Site = c("A","A","A","A","A","A","A","A","A","B","B","B","B","B","B","B","B"),
                 Transect = c(1,1,1,1,2,2,2,2,2,1,1,1,1,2,2,2,2),
                 Habitat = c("X","Y","X","Z","Z","Y","X","Z","X","X","Z","X","Y","Z","X","Y","Z"),
                 Start=c(0,2.8,3.4,5,0,1.5,5,8,12,0,2,5,7.5,0,4,8,12),
                 End=c(2.8,3.4,5,10,1.5,5,8,12,15,2,5,7.5,20,4,8,12,15))
df
  Site Transect Habitat Start  End
1     A        1       X   0.0  2.8  # Habitat `X` is between the meters 0 and 2.8
2     A        1       Y   2.8  3.4  # Habitat `Y` is between the meters 2.8 and 3.4
3     A        1       X   3.4  5.0  # Habitat `X` is between the meters 3.4 and 5.0
4     A        1       Z   5.0 10.0  # Habitat `Z` is between the meters 5 and 10.0
5     A        2       Z   0.0  1.5
6     A        2       Y   1.5  5.0
7     A        2       X   5.0  8.0
8     A        2       Z   8.0 12.0
9     A        2       X  12.0 15.0
10    B        1       X   0.0  2.0
11    B        1       Z   2.0  5.0
12    B        1       X   5.0  7.5
13    B        1       Y   7.5 20.0
14    B        2       Z   0.0  4.0
15    B        2       X   4.0  8.0
16    B        2       Y   8.0 12.0
17    B        2       Z  12.0 15.0

In this example, habitat X occurs twice in transect 1 of site A. The total lengths of transects 1 and 2 in site A are 10 m and 15 m, respectively. In site B, the total lengths of transects 1 and 2 are 20 m and 15 m, respectively.

For each Site and Transect, I want to calculate the percentage of the transect's length that each Habitat occupies. For example, in transect 1 of site A, habitat X covers 4.4 m of the 10 m total, i.e. 44%. In transect 2 of site A, habitat X covers 6 m of the 15 m total, i.e. 40%.
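
To make the arithmetic for habitat X in transect 1 of site A explicit, here is a small hand calculation. The values are copied from rows 1 and 3 of the data frame above, and the variable names are only illustrative:

```r
# Habitat X appears twice in site A, transect 1: 0-2.8 m and 3.4-5 m
x_length <- (2.8 - 0) + (5 - 3.4)

# Total length of site A, transect 1 (largest End value in that transect)
transect_length <- 10

# Share of the transect occupied by habitat X, in percent (about 44)
x_percentage <- 100 * x_length / transect_length
x_percentage
```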

To do this, I first calculate the length (Length) in meters of each habitat record (= row):

df$Length <- df$End - df$Start

Then, for each site and transect, I want the percentage of the transect's total length that each habitat represents. I tried this:

df2 <- as.data.frame(df %>% group_by(Site, Transect, Habitat) %>% summarise(Percentage = (sum(Length)/max(End))*100))

I want to replace max(End) with an expression that represents the total length of the transect. Right now, max(End) is the last meter (End) at which a specific habitat was present. How can I make the code above use the maximum value of End within a specific Site and Transect, rather than within a specific Habitat?

How can I do it? My desired output would be this:

   Site Transect Habitat Percentage
1     A        1       X       44.0
2     A        1       Y        6.0
3     A        1       Z       50.0
4     A        2       X       40.0
5     A        2       Y       23.3
6     A        2       Z       36.7
7     B        1       X       22.5
8     B        1       Y       62.5
9     B        1       Z       15.0
10    B        2       X       26.7
11    B        2       Y       26.7
12    B        2       Z       46.7

Does anyone know how to do it?

Thanks in advance!

Answer 1

Score: 1

With dplyr, when you need to work at more than one level of a hierarchy, you may need more than one group_by() call. In the code below, I use group_by(Site, Transect, Habitat) to calculate the total length of each habitat within each Site and Transect, and then group_by(Site, Transect) to calculate the percentage.

library(dplyr)
df <- data.frame(Site = c("A","A","A","A","A","A","A","A","A","B","B","B","B","B","B","B","B"),
                 Transect = c(1,1,1,1,2,2,2,2,2,1,1,1,1,2,2,2,2),
                 Habitat = c("X","Y","X","Z","Z","Y","X","Z","X","X","Z","X","Y","Z","X","Y","Z"),
                 Start=c(0,2.8,3.4,5,0,1.5,5,8,12,0,2,5,7.5,0,4,8,12),
                 End=c(2.8,3.4,5,10,1.5,5,8,12,15,2,5,7.5,20,4,8,12,15))
df %>%
  mutate(length = End-Start) %>%
  group_by(Site, Transect, Habitat) %>%
  summarise(tot_length = sum(length)) %>%
  group_by(Site, Transect) %>%
  mutate(percentage = 100*tot_length/sum(tot_length))
#> `summarise()` has grouped output by 'Site', 'Transect'. You can override using
#> the `.groups` argument.
#> # A tibble: 12 × 5
#> # Groups:   Site, Transect [4]
#>    Site  Transect Habitat tot_length percentage
#>    <chr>    <dbl> <chr>        <dbl>      <dbl>
#>  1 A            1 X              4.4       44  
#>  2 A            1 Y              0.6        6  
#>  3 A            1 Z              5         50  
#>  4 A            2 X              6         40  
#>  5 A            2 Y              3.5       23.3
#>  6 A            2 Z              5.5       36.7
#>  7 B            1 X              4.5       22.5
#>  8 B            1 Y             12.5       62.5
#>  9 B            1 Z              3         15  
#> 10 B            2 X              4         26.7
#> 11 B            2 Y              4         26.7
#> 12 B            2 Z              7         46.7

Created on 2023-02-16 by the reprex package (v2.0.1)
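
The message about .groups in the output is informational only. As a sketch of an equivalent variant (assuming dplyr 1.0.0 or later, where summarise() accepts a .groups argument), you can state the regrouping explicitly with .groups = "drop_last". That drops only the Habitat level, so the result is already grouped by Site and Transect and the second group_by() is not needed. The object name pct is just illustrative.

```r
library(dplyr)

df <- data.frame(Site = c("A","A","A","A","A","A","A","A","A","B","B","B","B","B","B","B","B"),
                 Transect = c(1,1,1,1,2,2,2,2,2,1,1,1,1,2,2,2,2),
                 Habitat = c("X","Y","X","Z","Z","Y","X","Z","X","X","Z","X","Y","Z","X","Y","Z"),
                 Start = c(0,2.8,3.4,5,0,1.5,5,8,12,0,2,5,7.5,0,4,8,12),
                 End = c(2.8,3.4,5,10,1.5,5,8,12,15,2,5,7.5,20,4,8,12,15))

pct <- df %>%
  mutate(length = End - Start) %>%
  group_by(Site, Transect, Habitat) %>%
  # Sum per habitat, then keep only the Site/Transect grouping
  summarise(tot_length = sum(length), .groups = "drop_last") %>%
  # sum(tot_length) is now the total length of each transect
  mutate(percentage = 100 * tot_length / sum(tot_length)) %>%
  ungroup()

pct
```

The final ungroup() is optional; it just returns a plain tibble so that later steps are not accidentally computed per transect.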

In your original code, the data are still grouped by Habitat when the percentage is calculated, so max(End) is taken within each Habitat rather than across all habitats in the Site and Transect pair.
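
If you would rather keep max(End) as the denominator, as in your attempt, one possible sketch is to compute it in a mutate() grouped by Site and Transect before regrouping by Habitat. This assumes every transect starts at 0 and has no gaps, so that max(End) equals the transect's total length. The column name Transect_length is illustrative.

```r
library(dplyr)

df <- data.frame(Site = c("A","A","A","A","A","A","A","A","A","B","B","B","B","B","B","B","B"),
                 Transect = c(1,1,1,1,2,2,2,2,2,1,1,1,1,2,2,2,2),
                 Habitat = c("X","Y","X","Z","Z","Y","X","Z","X","X","Z","X","Y","Z","X","Y","Z"),
                 Start = c(0,2.8,3.4,5,0,1.5,5,8,12,0,2,5,7.5,0,4,8,12),
                 End = c(2.8,3.4,5,10,1.5,5,8,12,15,2,5,7.5,20,4,8,12,15))

df2 <- df %>%
  mutate(Length = End - Start) %>%
  # Higher level: one value per Site/Transect, repeated on every row
  group_by(Site, Transect) %>%
  mutate(Transect_length = max(End)) %>%
  # Lower level: one row per Site/Transect/Habitat
  group_by(Site, Transect, Habitat) %>%
  summarise(Percentage = 100 * sum(Length) / first(Transect_length),
            .groups = "drop") %>%
  as.data.frame()

df2
```

If transects can have gaps or do not start at 0, dividing by the sum of the habitat lengths, as in the code above, is the safer denominator.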
