2023年6月1日 18:21:38go评论90阅读模式

英文:

Backconvert tree to data frame structure by keeping original rows

问题

让我们假设以下数据：

x <- structure(list(parent = c("Acme Inc.", "Acme Inc.", "Acme Inc.", 
"Accounting", "Accounting", "Research", "Research", "IT", "IT", 
"IT"), child = c("Accounting", "Research", "IT", "New Software", 
"New Accounting Standards", "New Product Line", "New Labs", "Outsource", 
"Go agile", "Switch to R"), misc = c("a", "b", "c", "d", "e", 
"f", "g", "h", "i", "j")), row.names = c(NA, 10L), class = "data.frame")

现在，你可以使用 dplyr 包来完成你想要的操作：

library(dplyr)
# 创建一个数据框来存储结果
result_df <- data.frame()
# 对每个行进行迭代
for (i in 1:nrow(x)) {
  row_data <- x[i, ]
  levels <- data.tree::GetAncestors(my_tree, node = row_data$child)
  levels_df <- data.frame(matrix(unlist(levels), ncol = length(levels), byrow = TRUE))
  colnames(levels_df) <- paste0("level_", 1:length(levels))
  
  # 合并原始数据和层级信息
  final_row <- c(row_data, levels_df)
  
  # 将结果添加到结果数据框中
  result_df <- rbind(result_df, final_row)
}
# 重命名结果数据框的列名
colnames(result_df)[(ncol(x) + 1):ncol(result_df)] <- colnames(levels_df)
# 打印结果数据框
print(result_df)

这段代码将为每个原始数据行计算其层级信息，并将其与原始数据合并，然后将结果存储在 result_df 数据框中。最后，它将列名重命名为你所期望的列名。这应该得到你想要的期望结果。

英文:

Let's assume the following data:

x &lt;- structure(list(parent = c(&quot;Acme Inc.&quot;, &quot;Acme Inc.&quot;, &quot;Acme Inc.&quot;, 
&quot;Accounting&quot;, &quot;Accounting&quot;, &quot;Research&quot;, &quot;Research&quot;, &quot;IT&quot;, &quot;IT&quot;, 
&quot;IT&quot;), child = c(&quot;Accounting&quot;, &quot;Research&quot;, &quot;IT&quot;, &quot;New Software&quot;, 
&quot;New Accounting Standards&quot;, &quot;New Product Line&quot;, &quot;New Labs&quot;, &quot;Outsource&quot;, 
&quot;Go agile&quot;, &quot;Switch to R&quot;), misc = c(&quot;a&quot;, &quot;b&quot;, &quot;c&quot;, &quot;d&quot;, &quot;e&quot;, 
&quot;f&quot;, &quot;g&quot;, &quot;h&quot;, &quot;i&quot;, &quot;j&quot;)), row.names = c(NA, 10L), class = &quot;data.frame&quot;)
       parent                    child misc
1   Acme Inc.               Accounting    a
2   Acme Inc.                 Research    b
3   Acme Inc.                       IT    c
4  Accounting             New Software    d
5  Accounting New Accounting Standards    e
6    Research         New Product Line    f
7    Research                 New Labs    g
8          IT                Outsource    h
9          IT                 Go agile    i
10         IT              Switch to R    j

I can now convert that to a tree structure with the data.tree package.

my_tree &lt;- data.tree::FromDataFrameNetwork(x)

What I actually want to get is the level information, more or less in wide format, which I can theoretically get via

my_data &lt;- data.tree::ToDataFrameTypeCol(my_tree)

which gives:

    level_1    level_2                  level_3
1 Acme Inc. Accounting             New Software
2 Acme Inc. Accounting New Accounting Standards
3 Acme Inc.   Research         New Product Line
4 Acme Inc.   Research                 New Labs
5 Acme Inc.         IT                Outsource
6 Acme Inc.         IT                 Go agile
7 Acme Inc.         IT              Switch to R

However, as you can see, this output has fewer rows than the original data (7 instead of 10). This is because the function only gives me the final leaves, if I see that correctly. But what I want is for each row in the original data frame, enhance the data with the full level information for that particular child. E.g. we know that "Accounting" is level 2, so I want to put that information as new columns to the original data.

The expected outcome would be:

       parent                    child misc   level_1    level_2                  level_3
1   Acme Inc.               Accounting    a Acme Inc. Accounting                       NA
2   Acme Inc.                 Research    b Acme Inc.   Research                       NA
3   Acme Inc.                       IT    c Acme Inc.         IT                       NA
4  Accounting             New Software    d Acme Inc. Accounting             New Software
5  Accounting New Accounting Standards    e Acme Inc. Accounting New Accounting Standards
6    Research         New Product Line    f Acme Inc.   Research         New Product Line
7    Research                 New Labs    g Acme Inc.   Research                 New Labs
8          IT                Outsource    h Acme Inc.         IT                Outsource
9          IT                 Go agile    i Acme Inc.         IT                 Go agile
10         IT              Switch to R    j Acme Inc.         IT              Switch to R

I'm stuck here how to achieve this. Any idea?

答案1

得分: 1

这可能不是最优雅的解决方案，但似乎可以工作。您想要的是首先在键 parent == level_2 和 child == level_3 上将两个数据连接在一起。接下来，您可以使用键 parent == level_1 和 child == level_2 将结果与树状数据连接起来。这将连接上在第一个连接中未匹配的剩余观察结果。您可以合并不同的 level_1 和 level_2 变量以在连接之间整合信息。最后，使用 distinct() 将消除在连接期间产生的重复项。

library(dplyr)
library(data.tree)
  
x <- structure(list(parent = c("Acme Inc.", "Acme Inc.", "Acme Inc.", 
                               "Accounting", "Accounting", "Research", "Research", "IT", "IT", 
                               "IT"), child = c("Accounting", "Research", "IT", "New Software", 
                                                "New Accounting Standards", "New Product Line", "New Labs", "Outsource", 
                                                "Go agile", "Switch to R"), misc = c("a", "b", "c", "d", "e", 
                                                                                     "f", "g", "h", "i", "j")), row.names = c(NA, 10L), class = "data.frame")
my_tree <- data.tree::FromDataFrameNetwork(x)
my_data <- data.tree::ToDataFrameTypeCol(my_tree)
left_join(x, my_data, by = c("parent" = "level_2", "child" = "level_3"), keep = TRUE) %>%
  left_join(my_data %>% select(level_1, level_2), by = c("parent" = "level_1", "child" = "level_2"), keep = TRUE) %>%
  mutate(level_1 = coalesce(level_1.x, level_1.y), 
         level_2 = coalesce(level_2.x, level_2.y)) %>%
  select(parent:misc, level_1, level_2, level_3) %>%
  distinct()
#>        parent                    child misc   level_1    level_2
#> 1   Acme Inc.               Accounting    a Acme Inc. Accounting
#> 2   Acme Inc.                 Research    b Acme Inc.   Research
#> 3   Acme Inc.                       IT    c Acme Inc.         IT
#> 4  Accounting             New Software    d Acme Inc. Accounting
#> 5  Accounting New Accounting Standards    e Acme Inc. Accounting
#> 6    Research         New Product Line    f Acme Inc.   Research
#> 7    Research                 New Labs    g Acme Inc.   Research
#> 8          IT                Outsource    h Acme Inc.         IT
#> 9          IT                 Go agile    i Acme Inc.         IT
#> 10         IT              Switch to R    j Acme Inc.         IT
#>                     level_3
#> 1                      <NA>
#> 2                      <NA>
#> 3                      <NA>
#> 4              New Software
#> 5  New Accounting Standards
#> 6          New Product Line
#> 7                  New Labs
#> 8                 Outsource
#> 9                  Go agile
#> 10              Switch to R

^{创建于2023年06月01日，使用reprex v2.0.2}

英文:

This is likely not the most elegant solution, but it does seem to work. What you want is to first join the two data together on the keys parent == level_2 and child == level_3. Next, you can join the result with the tree data with the keys parent == level_1 and child == level_2. This will join in the remaining observations that weren't matched in the first join. You can coalesce the different level_1 and level_2 variables to consolidate the information across the joins. Finally, using distinct() will eliminate the duplicates that were produced during the joins.

library(dplyr)
library(data.tree)
  
x &lt;- structure(list(parent = c(&quot;Acme Inc.&quot;, &quot;Acme Inc.&quot;, &quot;Acme Inc.&quot;, 
                               &quot;Accounting&quot;, &quot;Accounting&quot;, &quot;Research&quot;, &quot;Research&quot;, &quot;IT&quot;, &quot;IT&quot;, 
                               &quot;IT&quot;), child = c(&quot;Accounting&quot;, &quot;Research&quot;, &quot;IT&quot;, &quot;New Software&quot;, 
                                                &quot;New Accounting Standards&quot;, &quot;New Product Line&quot;, &quot;New Labs&quot;, &quot;Outsource&quot;, 
                                                &quot;Go agile&quot;, &quot;Switch to R&quot;), misc = c(&quot;a&quot;, &quot;b&quot;, &quot;c&quot;, &quot;d&quot;, &quot;e&quot;, 
                                                                                     &quot;f&quot;, &quot;g&quot;, &quot;h&quot;, &quot;i&quot;, &quot;j&quot;)), row.names = c(NA, 10L), class = &quot;data.frame&quot;)
my_tree &lt;- data.tree::FromDataFrameNetwork(x)
my_data &lt;- data.tree::ToDataFrameTypeCol(my_tree)
left_join(x, my_data, join_by(parent==level_2, child==level_3), keep=TRUE) %&gt;% 
  left_join(my_data %&gt;% select(level_1, level_2), join_by(parent==level_1, child==level_2), keep=TRUE) %&gt;%
  mutate(level_1 = coalesce(level_1.x, level_1.y), 
         level_2 = coalesce(level_2.x, level_2.y)) %&gt;% 
  select(parent:misc, level_1, level_2, level_3) %&gt;% 
  distinct()
#&gt;        parent                    child misc   level_1    level_2
#&gt; 1   Acme Inc.               Accounting    a Acme Inc. Accounting
#&gt; 2   Acme Inc.                 Research    b Acme Inc.   Research
#&gt; 3   Acme Inc.                       IT    c Acme Inc.         IT
#&gt; 4  Accounting             New Software    d Acme Inc. Accounting
#&gt; 5  Accounting New Accounting Standards    e Acme Inc. Accounting
#&gt; 6    Research         New Product Line    f Acme Inc.   Research
#&gt; 7    Research                 New Labs    g Acme Inc.   Research
#&gt; 8          IT                Outsource    h Acme Inc.         IT
#&gt; 9          IT                 Go agile    i Acme Inc.         IT
#&gt; 10         IT              Switch to R    j Acme Inc.         IT
#&gt;                     level_3
#&gt; 1                      &lt;NA&gt;
#&gt; 2                      &lt;NA&gt;
#&gt; 3                      &lt;NA&gt;
#&gt; 4              New Software
#&gt; 5  New Accounting Standards
#&gt; 6          New Product Line
#&gt; 7                  New Labs
#&gt; 8                 Outsource
#&gt; 9                  Go agile
#&gt; 10              Switch to R

<sup>Created on 2023-06-01 with reprex v2.0.2</sup>

答案2

得分: 0

我决定采用完全绕过树结构的解决方案（这也在我的实际用例中提高了性能约10000倍左右）。

我所做的基本上是创建一个循环，在每个循环中，我将父项与其子项进行匹配，然后向上移动一级，直到没有更多的父项为止。

然后，我对数据进行一些整理，使其具有我想要的结构和顺序。

使用我初始问题中的示例数据，我执行以下操作：

library(tidyverse)    
still_open <- nrow(x)
i <- 2
x2 <- x %>%
  mutate(level_1 = child)
while (still_open != 0)
{
  x2 <- x2 %>%
    mutate(paste0("level_{i}") := parent[match(.data[[paste0("level_", i - 1)]], child)], .after = .data[[paste0("level_", i - 1)]])
  still_open <- x2 %>%
    pull(paste0("level_", i)) %>%
    na.omit() %>%
    length()
  i <- i + 1
}
x2 <- x2 %>%
  pivot_longer(cols = starts_with("level_")) %>%
  filter(!is.na(value)) %>%
  mutate(value = rev(value), .by = child) %>%
  pivot_wider(names_from  = name,
              values_from = value)

结果如下：

# A tibble: 10 × 6
   parent     child                    misc  level_1   level_2    level_3                 
   <chr>      <chr>                    <chr> <chr>     <chr>      <chr>                   
 1 Acme Inc.  Accounting               a     Acme Inc. Accounting NA                      
 2 Acme Inc.  Research                 b     Acme Inc. Research   NA                      
 3 Acme Inc.  IT                       c     Acme Inc. IT         NA                      
 4 Accounting New Software             d     Acme Inc. Accounting New Software            
 5 Accounting New Accounting Standards e     Acme Inc. Accounting New Accounting Standards
 6 Research   New Product Line         f     Acme Inc. Research   New Product Line        
 7 Research   New Labs                 g     Acme Inc. Research   New Labs                
 8 IT         Outsource                h     Acme Inc. IT         Outsource               
 9 IT         Go agile                 i     Acme Inc. IT         Go agile                
10 IT         Switch to R              j     Acme Inc. IT         Switch to R

希望这有所帮助。

英文:

I decided for a solution completely circumventing a tree structure (which also boosts performance by a factor of 10000 or so in my real-life use case).

What I'm doing is essentially creating a loop where in each loop I match the parent to its child and then move one level up until there's no more parent left.

I then do some tidy cleanup of the data, so that it hast the structure and order I want.

using the example data from my initial question, I do:

library(tidyverse)    
still_open &lt;- nrow(x)
i &lt;- 2
x2 &lt;- x |&gt; 
mutate(level_1 = child)
while (still_open != 0)
{
x2 &lt;- x2 |&gt; 
mutate(&quot;level_{i}&quot; := parent[match(.data[[paste0(&quot;level_&quot;, i - 1)]], child)], .after = .data[[paste0(&quot;level_&quot;, i - 1)]])
still_open &lt;- x2 |&gt; 
pull(paste0(&quot;level_&quot;, i)) |&gt; 
na.omit() |&gt; 
length()
i &lt;- i + 1
}
x2 &lt;- x2 |&gt; 
pivot_longer(cols = starts_with(&quot;level_&quot;)) |&gt; 
filter(!is.na(value)) |&gt; 
mutate(value = rev(value), .by = child) |&gt; 
pivot_wider(names_from  = name,
values_from = value)

which gives:

# A tibble: 10 &#215; 6
parent     child                    misc  level_1   level_2    level_3                 
&lt;chr&gt;      &lt;chr&gt;                    &lt;chr&gt; &lt;chr&gt;     &lt;chr&gt;      &lt;chr&gt;                   
1 Acme Inc.  Accounting               a     Acme Inc. Accounting NA                      
2 Acme Inc.  Research                 b     Acme Inc. Research   NA                      
3 Acme Inc.  IT                       c     Acme Inc. IT         NA                      
4 Accounting New Software             d     Acme Inc. Accounting New Software            
5 Accounting New Accounting Standards e     Acme Inc. Accounting New Accounting Standards
6 Research   New Product Line         f     Acme Inc. Research   New Product Line        
7 Research   New Labs                 g     Acme Inc. Research   New Labs                
8 IT         Outsource                h     Acme Inc. IT         Outsource               
9 IT         Go agile                 i     Acme Inc. IT         Go agile                
10 IT         Switch to R              j     Acme Inc. IT         Switch to R

通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库，让每个人都能够通过互相帮助和分享经验来进步。

将树形结构回转为数据框架结构，保持原始行。

问题

答案1

答案2

查看正在执行的 R 脚本的已删除源代码

在R中使用用户定义的函数执行多个变量操作时遇到的困难。

Why am I getting zeros when computing growth rates by country-year-sector in my data using tidyverse?

如何将 any(diff(vec) < 0) 以管道形式表示？

如何在Playwright视觉比较中屏蔽多个定位器？

在C++中，可以使用可变模板参数来检索类型的内部类型。

selenium.common.exceptions.StaleElementReferenceException: Message: stale element reference: stale element not found

Creating and opening a URL to log in to Website via Basic Auth with Robot Framework/Selenium (Python)

AG Grid 在上下文菜单中以大文本形式打开

What's the correct way to type hint an empty list as a literal in python?

如何在Highcharts Gantt中更改本地化的星期名称

如何在同一个流中使用多个过滤器和映射函数？

如何使用Map/Set来将代码优化到O(n)？

.NET MAUI Android在GitHub Actions上构建失败，错误代码为1。