将树形结构回转为数据框架结构,保持原始行。

huangapple go评论57阅读模式
英文:

Backconvert tree to data frame structure by keeping original rows

问题

让我们假设以下数据:

x <- structure(list(parent = c("Acme Inc.", "Acme Inc.", "Acme Inc.", 
"Accounting", "Accounting", "Research", "Research", "IT", "IT", 
"IT"), child = c("Accounting", "Research", "IT", "New Software", 
"New Accounting Standards", "New Product Line", "New Labs", "Outsource", 
"Go agile", "Switch to R"), misc = c("a", "b", "c", "d", "e", 
"f", "g", "h", "i", "j")), row.names = c(NA, 10L), class = "data.frame")

现在,你可以使用 dplyr 包来完成你想要的操作:

library(dplyr)

# 创建一个数据框来存储结果
result_df <- data.frame()

# 对每个行进行迭代
for (i in 1:nrow(x)) {
  row_data <- x[i, ]
  levels <- data.tree::GetAncestors(my_tree, node = row_data$child)
  levels_df <- data.frame(matrix(unlist(levels), ncol = length(levels), byrow = TRUE))
  colnames(levels_df) <- paste0("level_", 1:length(levels))
  
  # 合并原始数据和层级信息
  final_row <- c(row_data, levels_df)
  
  # 将结果添加到结果数据框中
  result_df <- rbind(result_df, final_row)
}

# 重命名结果数据框的列名
colnames(result_df)[(ncol(x) + 1):ncol(result_df)] <- colnames(levels_df)

# 打印结果数据框
print(result_df)

这段代码将为每个原始数据行计算其层级信息,并将其与原始数据合并,然后将结果存储在 result_df 数据框中。最后,它将列名重命名为你所期望的列名。这应该得到你想要的期望结果。

英文:

Let's assume the following data:

x &lt;- structure(list(parent = c(&quot;Acme Inc.&quot;, &quot;Acme Inc.&quot;, &quot;Acme Inc.&quot;, 
&quot;Accounting&quot;, &quot;Accounting&quot;, &quot;Research&quot;, &quot;Research&quot;, &quot;IT&quot;, &quot;IT&quot;, 
&quot;IT&quot;), child = c(&quot;Accounting&quot;, &quot;Research&quot;, &quot;IT&quot;, &quot;New Software&quot;, 
&quot;New Accounting Standards&quot;, &quot;New Product Line&quot;, &quot;New Labs&quot;, &quot;Outsource&quot;, 
&quot;Go agile&quot;, &quot;Switch to R&quot;), misc = c(&quot;a&quot;, &quot;b&quot;, &quot;c&quot;, &quot;d&quot;, &quot;e&quot;, 
&quot;f&quot;, &quot;g&quot;, &quot;h&quot;, &quot;i&quot;, &quot;j&quot;)), row.names = c(NA, 10L), class = &quot;data.frame&quot;)

       parent                    child misc
1   Acme Inc.               Accounting    a
2   Acme Inc.                 Research    b
3   Acme Inc.                       IT    c
4  Accounting             New Software    d
5  Accounting New Accounting Standards    e
6    Research         New Product Line    f
7    Research                 New Labs    g
8          IT                Outsource    h
9          IT                 Go agile    i
10         IT              Switch to R    j

I can now convert that to a tree structure with the data.tree package.

my_tree &lt;- data.tree::FromDataFrameNetwork(x)

What I actually want to get is the level information, more or less in wide format, which I can theoretically get via

my_data &lt;- data.tree::ToDataFrameTypeCol(my_tree)

which gives:

    level_1    level_2                  level_3
1 Acme Inc. Accounting             New Software
2 Acme Inc. Accounting New Accounting Standards
3 Acme Inc.   Research         New Product Line
4 Acme Inc.   Research                 New Labs
5 Acme Inc.         IT                Outsource
6 Acme Inc.         IT                 Go agile
7 Acme Inc.         IT              Switch to R

However, as you can see, this output has fewer rows than the original data (7 instead of 10). This is because the function only gives me the final leaves, if I see that correctly. But what I want is for each row in the original data frame, enhance the data with the full level information for that particular child. E.g. we know that "Accounting" is level 2, so I want to put that information as new columns to the original data.

The expected outcome would be:

       parent                    child misc   level_1    level_2                  level_3
1   Acme Inc.               Accounting    a Acme Inc. Accounting                       NA
2   Acme Inc.                 Research    b Acme Inc.   Research                       NA
3   Acme Inc.                       IT    c Acme Inc.         IT                       NA
4  Accounting             New Software    d Acme Inc. Accounting             New Software
5  Accounting New Accounting Standards    e Acme Inc. Accounting New Accounting Standards
6    Research         New Product Line    f Acme Inc.   Research         New Product Line
7    Research                 New Labs    g Acme Inc.   Research                 New Labs
8          IT                Outsource    h Acme Inc.         IT                Outsource
9          IT                 Go agile    i Acme Inc.         IT                 Go agile
10         IT              Switch to R    j Acme Inc.         IT              Switch to R

I'm stuck here how to achieve this. Any idea?

答案1

得分: 1

这可能不是最优雅的解决方案,但似乎可以工作。您想要的是首先在键 parent == level_2child == level_3 上将两个数据连接在一起。接下来,您可以使用键 parent == level_1child == level_2 将结果与树状数据连接起来。这将连接上在第一个连接中未匹配的剩余观察结果。您可以合并不同的 level_1level_2 变量以在连接之间整合信息。最后,使用 distinct() 将消除在连接期间产生的重复项。

library(dplyr)
library(data.tree)
  
x <- structure(list(parent = c("Acme Inc.", "Acme Inc.", "Acme Inc.", 
                               "Accounting", "Accounting", "Research", "Research", "IT", "IT", 
                               "IT"), child = c("Accounting", "Research", "IT", "New Software", 
                                                "New Accounting Standards", "New Product Line", "New Labs", "Outsource", 
                                                "Go agile", "Switch to R"), misc = c("a", "b", "c", "d", "e", 
                                                                                     "f", "g", "h", "i", "j")), row.names = c(NA, 10L), class = "data.frame")

my_tree <- data.tree::FromDataFrameNetwork(x)
my_data <- data.tree::ToDataFrameTypeCol(my_tree)


left_join(x, my_data, by = c("parent" = "level_2", "child" = "level_3"), keep = TRUE) %>%
  left_join(my_data %>% select(level_1, level_2), by = c("parent" = "level_1", "child" = "level_2"), keep = TRUE) %>%
  mutate(level_1 = coalesce(level_1.x, level_1.y), 
         level_2 = coalesce(level_2.x, level_2.y)) %>%
  select(parent:misc, level_1, level_2, level_3) %>%
  distinct()
#>        parent                    child misc   level_1    level_2
#> 1   Acme Inc.               Accounting    a Acme Inc. Accounting
#> 2   Acme Inc.                 Research    b Acme Inc.   Research
#> 3   Acme Inc.                       IT    c Acme Inc.         IT
#> 4  Accounting             New Software    d Acme Inc. Accounting
#> 5  Accounting New Accounting Standards    e Acme Inc. Accounting
#> 6    Research         New Product Line    f Acme Inc.   Research
#> 7    Research                 New Labs    g Acme Inc.   Research
#> 8          IT                Outsource    h Acme Inc.         IT
#> 9          IT                 Go agile    i Acme Inc.         IT
#> 10         IT              Switch to R    j Acme Inc.         IT
#>                     level_3
#> 1                      <NA>
#> 2                      <NA>
#> 3                      <NA>
#> 4              New Software
#> 5  New Accounting Standards
#> 6          New Product Line
#> 7                  New Labs
#> 8                 Outsource
#> 9                  Go agile
#> 10              Switch to R

创建于2023年06月01日,使用reprex v2.0.2

英文:

This is likely not the most elegant solution, but it does seem to work. What you want is to first join the two data together on the keys parent == level_2 and child == level_3. Next, you can join the result with the tree data with the keys parent == level_1 and child == level_2. This will join in the remaining observations that weren't matched in the first join. You can coalesce the different level_1 and level_2 variables to consolidate the information across the joins. Finally, using distinct() will eliminate the duplicates that were produced during the joins.

library(dplyr)
library(data.tree)
  
x &lt;- structure(list(parent = c(&quot;Acme Inc.&quot;, &quot;Acme Inc.&quot;, &quot;Acme Inc.&quot;, 
                               &quot;Accounting&quot;, &quot;Accounting&quot;, &quot;Research&quot;, &quot;Research&quot;, &quot;IT&quot;, &quot;IT&quot;, 
                               &quot;IT&quot;), child = c(&quot;Accounting&quot;, &quot;Research&quot;, &quot;IT&quot;, &quot;New Software&quot;, 
                                                &quot;New Accounting Standards&quot;, &quot;New Product Line&quot;, &quot;New Labs&quot;, &quot;Outsource&quot;, 
                                                &quot;Go agile&quot;, &quot;Switch to R&quot;), misc = c(&quot;a&quot;, &quot;b&quot;, &quot;c&quot;, &quot;d&quot;, &quot;e&quot;, 
                                                                                     &quot;f&quot;, &quot;g&quot;, &quot;h&quot;, &quot;i&quot;, &quot;j&quot;)), row.names = c(NA, 10L), class = &quot;data.frame&quot;)

my_tree &lt;- data.tree::FromDataFrameNetwork(x)
my_data &lt;- data.tree::ToDataFrameTypeCol(my_tree)


left_join(x, my_data, join_by(parent==level_2, child==level_3), keep=TRUE) %&gt;% 
  left_join(my_data %&gt;% select(level_1, level_2), join_by(parent==level_1, child==level_2), keep=TRUE) %&gt;%
  mutate(level_1 = coalesce(level_1.x, level_1.y), 
         level_2 = coalesce(level_2.x, level_2.y)) %&gt;% 
  select(parent:misc, level_1, level_2, level_3) %&gt;% 
  distinct()
#&gt;        parent                    child misc   level_1    level_2
#&gt; 1   Acme Inc.               Accounting    a Acme Inc. Accounting
#&gt; 2   Acme Inc.                 Research    b Acme Inc.   Research
#&gt; 3   Acme Inc.                       IT    c Acme Inc.         IT
#&gt; 4  Accounting             New Software    d Acme Inc. Accounting
#&gt; 5  Accounting New Accounting Standards    e Acme Inc. Accounting
#&gt; 6    Research         New Product Line    f Acme Inc.   Research
#&gt; 7    Research                 New Labs    g Acme Inc.   Research
#&gt; 8          IT                Outsource    h Acme Inc.         IT
#&gt; 9          IT                 Go agile    i Acme Inc.         IT
#&gt; 10         IT              Switch to R    j Acme Inc.         IT
#&gt;                     level_3
#&gt; 1                      &lt;NA&gt;
#&gt; 2                      &lt;NA&gt;
#&gt; 3                      &lt;NA&gt;
#&gt; 4              New Software
#&gt; 5  New Accounting Standards
#&gt; 6          New Product Line
#&gt; 7                  New Labs
#&gt; 8                 Outsource
#&gt; 9                  Go agile
#&gt; 10              Switch to R

<sup>Created on 2023-06-01 with reprex v2.0.2</sup>

答案2

得分: 0

我决定采用完全绕过树结构的解决方案(这也在我的实际用例中提高了性能约10000倍左右)。

我所做的基本上是创建一个循环,在每个循环中,我将父项与其子项进行匹配,然后向上移动一级,直到没有更多的父项为止。

然后,我对数据进行一些整理,使其具有我想要的结构和顺序。

使用我初始问题中的示例数据,我执行以下操作:

library(tidyverse)    
still_open <- nrow(x)
i <- 2
x2 <- x %>%
  mutate(level_1 = child)

while (still_open != 0)
{
  x2 <- x2 %>%
    mutate(paste0("level_{i}") := parent[match(.data[[paste0("level_", i - 1)]], child)], .after = .data[[paste0("level_", i - 1)]])

  still_open <- x2 %>%
    pull(paste0("level_", i)) %>%
    na.omit() %>%
    length()

  i <- i + 1
}

x2 <- x2 %>%
  pivot_longer(cols = starts_with("level_")) %>%
  filter(!is.na(value)) %>%
  mutate(value = rev(value), .by = child) %>%
  pivot_wider(names_from  = name,
              values_from = value)

结果如下:

# A tibble: 10 × 6
   parent     child                    misc  level_1   level_2    level_3                 
   <chr>      <chr>                    <chr> <chr>     <chr>      <chr>                   
 1 Acme Inc.  Accounting               a     Acme Inc. Accounting NA                      
 2 Acme Inc.  Research                 b     Acme Inc. Research   NA                      
 3 Acme Inc.  IT                       c     Acme Inc. IT         NA                      
 4 Accounting New Software             d     Acme Inc. Accounting New Software            
 5 Accounting New Accounting Standards e     Acme Inc. Accounting New Accounting Standards
 6 Research   New Product Line         f     Acme Inc. Research   New Product Line        
 7 Research   New Labs                 g     Acme Inc. Research   New Labs                
 8 IT         Outsource                h     Acme Inc. IT         Outsource               
 9 IT         Go agile                 i     Acme Inc. IT         Go agile                
10 IT         Switch to R              j     Acme Inc. IT         Switch to R

希望这有所帮助。

英文:

I decided for a solution completely circumventing a tree structure (which also boosts performance by a factor of 10000 or so in my real-life use case).

What I'm doing is essentially creating a loop where in each loop I match the parent to its child and then move one level up until there's no more parent left.

I then do some tidy cleanup of the data, so that it hast the structure and order I want.

using the example data from my initial question, I do:

library(tidyverse)    
still_open &lt;- nrow(x)
i &lt;- 2
x2 &lt;- x |&gt; 
mutate(level_1 = child)
while (still_open != 0)
{
x2 &lt;- x2 |&gt; 
mutate(&quot;level_{i}&quot; := parent[match(.data[[paste0(&quot;level_&quot;, i - 1)]], child)], .after = .data[[paste0(&quot;level_&quot;, i - 1)]])
still_open &lt;- x2 |&gt; 
pull(paste0(&quot;level_&quot;, i)) |&gt; 
na.omit() |&gt; 
length()
i &lt;- i + 1
}
x2 &lt;- x2 |&gt; 
pivot_longer(cols = starts_with(&quot;level_&quot;)) |&gt; 
filter(!is.na(value)) |&gt; 
mutate(value = rev(value), .by = child) |&gt; 
pivot_wider(names_from  = name,
values_from = value)

which gives:

# A tibble: 10 &#215; 6
parent     child                    misc  level_1   level_2    level_3                 
&lt;chr&gt;      &lt;chr&gt;                    &lt;chr&gt; &lt;chr&gt;     &lt;chr&gt;      &lt;chr&gt;                   
1 Acme Inc.  Accounting               a     Acme Inc. Accounting NA                      
2 Acme Inc.  Research                 b     Acme Inc. Research   NA                      
3 Acme Inc.  IT                       c     Acme Inc. IT         NA                      
4 Accounting New Software             d     Acme Inc. Accounting New Software            
5 Accounting New Accounting Standards e     Acme Inc. Accounting New Accounting Standards
6 Research   New Product Line         f     Acme Inc. Research   New Product Line        
7 Research   New Labs                 g     Acme Inc. Research   New Labs                
8 IT         Outsource                h     Acme Inc. IT         Outsource               
9 IT         Go agile                 i     Acme Inc. IT         Go agile                
10 IT         Switch to R              j     Acme Inc. IT         Switch to R    

huangapple
  • 本文由 发表于 2023年6月1日 18:21:38
  • 转载请务必保留本文链接:https://go.coder-hub.com/76380920.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定