英文:
Backconvert tree to data frame structure by keeping original rows
问题
让我们假设以下数据:
x <- structure(list(parent = c("Acme Inc.", "Acme Inc.", "Acme Inc.",
"Accounting", "Accounting", "Research", "Research", "IT", "IT",
"IT"), child = c("Accounting", "Research", "IT", "New Software",
"New Accounting Standards", "New Product Line", "New Labs", "Outsource",
"Go agile", "Switch to R"), misc = c("a", "b", "c", "d", "e",
"f", "g", "h", "i", "j")), row.names = c(NA, 10L), class = "data.frame")
现在,你可以使用 dplyr
包来完成你想要的操作:
library(dplyr)
# 创建一个数据框来存储结果
result_df <- data.frame()
# 对每个行进行迭代
for (i in 1:nrow(x)) {
row_data <- x[i, ]
levels <- data.tree::GetAncestors(my_tree, node = row_data$child)
levels_df <- data.frame(matrix(unlist(levels), ncol = length(levels), byrow = TRUE))
colnames(levels_df) <- paste0("level_", 1:length(levels))
# 合并原始数据和层级信息
final_row <- c(row_data, levels_df)
# 将结果添加到结果数据框中
result_df <- rbind(result_df, final_row)
}
# 重命名结果数据框的列名
colnames(result_df)[(ncol(x) + 1):ncol(result_df)] <- colnames(levels_df)
# 打印结果数据框
print(result_df)
这段代码将为每个原始数据行计算其层级信息,并将其与原始数据合并,然后将结果存储在 result_df
数据框中。最后,它将列名重命名为你所期望的列名。这应该得到你想要的期望结果。
英文:
Let's assume the following data:
x <- structure(list(parent = c("Acme Inc.", "Acme Inc.", "Acme Inc.",
"Accounting", "Accounting", "Research", "Research", "IT", "IT",
"IT"), child = c("Accounting", "Research", "IT", "New Software",
"New Accounting Standards", "New Product Line", "New Labs", "Outsource",
"Go agile", "Switch to R"), misc = c("a", "b", "c", "d", "e",
"f", "g", "h", "i", "j")), row.names = c(NA, 10L), class = "data.frame")
parent child misc
1 Acme Inc. Accounting a
2 Acme Inc. Research b
3 Acme Inc. IT c
4 Accounting New Software d
5 Accounting New Accounting Standards e
6 Research New Product Line f
7 Research New Labs g
8 IT Outsource h
9 IT Go agile i
10 IT Switch to R j
I can now convert that to a tree structure with the data.tree
package.
my_tree <- data.tree::FromDataFrameNetwork(x)
What I actually want to get is the level information, more or less in wide format, which I can theoretically get via
my_data <- data.tree::ToDataFrameTypeCol(my_tree)
which gives:
level_1 level_2 level_3
1 Acme Inc. Accounting New Software
2 Acme Inc. Accounting New Accounting Standards
3 Acme Inc. Research New Product Line
4 Acme Inc. Research New Labs
5 Acme Inc. IT Outsource
6 Acme Inc. IT Go agile
7 Acme Inc. IT Switch to R
However, as you can see, this output has fewer rows than the original data (7 instead of 10). This is because the function only gives me the final leaves, if I see that correctly. But what I want is for each row in the original data frame, enhance the data with the full level information for that particular child. E.g. we know that "Accounting" is level 2, so I want to put that information as new columns to the original data.
The expected outcome would be:
parent child misc level_1 level_2 level_3
1 Acme Inc. Accounting a Acme Inc. Accounting NA
2 Acme Inc. Research b Acme Inc. Research NA
3 Acme Inc. IT c Acme Inc. IT NA
4 Accounting New Software d Acme Inc. Accounting New Software
5 Accounting New Accounting Standards e Acme Inc. Accounting New Accounting Standards
6 Research New Product Line f Acme Inc. Research New Product Line
7 Research New Labs g Acme Inc. Research New Labs
8 IT Outsource h Acme Inc. IT Outsource
9 IT Go agile i Acme Inc. IT Go agile
10 IT Switch to R j Acme Inc. IT Switch to R
I'm stuck here how to achieve this. Any idea?
答案1
得分: 1
这可能不是最优雅的解决方案,但似乎可以工作。您想要的是首先在键 parent == level_2
和 child == level_3
上将两个数据连接在一起。接下来,您可以使用键 parent == level_1
和 child == level_2
将结果与树状数据连接起来。这将连接上在第一个连接中未匹配的剩余观察结果。您可以合并不同的 level_1
和 level_2
变量以在连接之间整合信息。最后,使用 distinct()
将消除在连接期间产生的重复项。
library(dplyr)
library(data.tree)
x <- structure(list(parent = c("Acme Inc.", "Acme Inc.", "Acme Inc.",
"Accounting", "Accounting", "Research", "Research", "IT", "IT",
"IT"), child = c("Accounting", "Research", "IT", "New Software",
"New Accounting Standards", "New Product Line", "New Labs", "Outsource",
"Go agile", "Switch to R"), misc = c("a", "b", "c", "d", "e",
"f", "g", "h", "i", "j")), row.names = c(NA, 10L), class = "data.frame")
my_tree <- data.tree::FromDataFrameNetwork(x)
my_data <- data.tree::ToDataFrameTypeCol(my_tree)
left_join(x, my_data, by = c("parent" = "level_2", "child" = "level_3"), keep = TRUE) %>%
left_join(my_data %>% select(level_1, level_2), by = c("parent" = "level_1", "child" = "level_2"), keep = TRUE) %>%
mutate(level_1 = coalesce(level_1.x, level_1.y),
level_2 = coalesce(level_2.x, level_2.y)) %>%
select(parent:misc, level_1, level_2, level_3) %>%
distinct()
#> parent child misc level_1 level_2
#> 1 Acme Inc. Accounting a Acme Inc. Accounting
#> 2 Acme Inc. Research b Acme Inc. Research
#> 3 Acme Inc. IT c Acme Inc. IT
#> 4 Accounting New Software d Acme Inc. Accounting
#> 5 Accounting New Accounting Standards e Acme Inc. Accounting
#> 6 Research New Product Line f Acme Inc. Research
#> 7 Research New Labs g Acme Inc. Research
#> 8 IT Outsource h Acme Inc. IT
#> 9 IT Go agile i Acme Inc. IT
#> 10 IT Switch to R j Acme Inc. IT
#> level_3
#> 1 <NA>
#> 2 <NA>
#> 3 <NA>
#> 4 New Software
#> 5 New Accounting Standards
#> 6 New Product Line
#> 7 New Labs
#> 8 Outsource
#> 9 Go agile
#> 10 Switch to R
创建于2023年06月01日,使用reprex v2.0.2
英文:
This is likely not the most elegant solution, but it does seem to work. What you want is to first join the two data together on the keys parent == level_2
and child == level_3
. Next, you can join the result with the tree data with the keys parent == level_1
and child == level_2
. This will join in the remaining observations that weren't matched in the first join. You can coalesce the different level_1
and level_2
variables to consolidate the information across the joins. Finally, using distinct()
will eliminate the duplicates that were produced during the joins.
library(dplyr)
library(data.tree)
x <- structure(list(parent = c("Acme Inc.", "Acme Inc.", "Acme Inc.",
"Accounting", "Accounting", "Research", "Research", "IT", "IT",
"IT"), child = c("Accounting", "Research", "IT", "New Software",
"New Accounting Standards", "New Product Line", "New Labs", "Outsource",
"Go agile", "Switch to R"), misc = c("a", "b", "c", "d", "e",
"f", "g", "h", "i", "j")), row.names = c(NA, 10L), class = "data.frame")
my_tree <- data.tree::FromDataFrameNetwork(x)
my_data <- data.tree::ToDataFrameTypeCol(my_tree)
left_join(x, my_data, join_by(parent==level_2, child==level_3), keep=TRUE) %>%
left_join(my_data %>% select(level_1, level_2), join_by(parent==level_1, child==level_2), keep=TRUE) %>%
mutate(level_1 = coalesce(level_1.x, level_1.y),
level_2 = coalesce(level_2.x, level_2.y)) %>%
select(parent:misc, level_1, level_2, level_3) %>%
distinct()
#> parent child misc level_1 level_2
#> 1 Acme Inc. Accounting a Acme Inc. Accounting
#> 2 Acme Inc. Research b Acme Inc. Research
#> 3 Acme Inc. IT c Acme Inc. IT
#> 4 Accounting New Software d Acme Inc. Accounting
#> 5 Accounting New Accounting Standards e Acme Inc. Accounting
#> 6 Research New Product Line f Acme Inc. Research
#> 7 Research New Labs g Acme Inc. Research
#> 8 IT Outsource h Acme Inc. IT
#> 9 IT Go agile i Acme Inc. IT
#> 10 IT Switch to R j Acme Inc. IT
#> level_3
#> 1 <NA>
#> 2 <NA>
#> 3 <NA>
#> 4 New Software
#> 5 New Accounting Standards
#> 6 New Product Line
#> 7 New Labs
#> 8 Outsource
#> 9 Go agile
#> 10 Switch to R
<sup>Created on 2023-06-01 with reprex v2.0.2</sup>
答案2
得分: 0
我决定采用完全绕过树结构的解决方案(这也在我的实际用例中提高了性能约10000倍左右)。
我所做的基本上是创建一个循环,在每个循环中,我将父项与其子项进行匹配,然后向上移动一级,直到没有更多的父项为止。
然后,我对数据进行一些整理,使其具有我想要的结构和顺序。
使用我初始问题中的示例数据,我执行以下操作:
library(tidyverse)
still_open <- nrow(x)
i <- 2
x2 <- x %>%
mutate(level_1 = child)
while (still_open != 0)
{
x2 <- x2 %>%
mutate(paste0("level_{i}") := parent[match(.data[[paste0("level_", i - 1)]], child)], .after = .data[[paste0("level_", i - 1)]])
still_open <- x2 %>%
pull(paste0("level_", i)) %>%
na.omit() %>%
length()
i <- i + 1
}
x2 <- x2 %>%
pivot_longer(cols = starts_with("level_")) %>%
filter(!is.na(value)) %>%
mutate(value = rev(value), .by = child) %>%
pivot_wider(names_from = name,
values_from = value)
结果如下:
# A tibble: 10 × 6
parent child misc level_1 level_2 level_3
<chr> <chr> <chr> <chr> <chr> <chr>
1 Acme Inc. Accounting a Acme Inc. Accounting NA
2 Acme Inc. Research b Acme Inc. Research NA
3 Acme Inc. IT c Acme Inc. IT NA
4 Accounting New Software d Acme Inc. Accounting New Software
5 Accounting New Accounting Standards e Acme Inc. Accounting New Accounting Standards
6 Research New Product Line f Acme Inc. Research New Product Line
7 Research New Labs g Acme Inc. Research New Labs
8 IT Outsource h Acme Inc. IT Outsource
9 IT Go agile i Acme Inc. IT Go agile
10 IT Switch to R j Acme Inc. IT Switch to R
希望这有所帮助。
英文:
I decided for a solution completely circumventing a tree structure (which also boosts performance by a factor of 10000 or so in my real-life use case).
What I'm doing is essentially creating a loop where in each loop I match the parent to its child and then move one level up until there's no more parent left.
I then do some tidy cleanup of the data, so that it hast the structure and order I want.
using the example data from my initial question, I do:
library(tidyverse)
still_open <- nrow(x)
i <- 2
x2 <- x |>
mutate(level_1 = child)
while (still_open != 0)
{
x2 <- x2 |>
mutate("level_{i}" := parent[match(.data[[paste0("level_", i - 1)]], child)], .after = .data[[paste0("level_", i - 1)]])
still_open <- x2 |>
pull(paste0("level_", i)) |>
na.omit() |>
length()
i <- i + 1
}
x2 <- x2 |>
pivot_longer(cols = starts_with("level_")) |>
filter(!is.na(value)) |>
mutate(value = rev(value), .by = child) |>
pivot_wider(names_from = name,
values_from = value)
which gives:
# A tibble: 10 × 6
parent child misc level_1 level_2 level_3
<chr> <chr> <chr> <chr> <chr> <chr>
1 Acme Inc. Accounting a Acme Inc. Accounting NA
2 Acme Inc. Research b Acme Inc. Research NA
3 Acme Inc. IT c Acme Inc. IT NA
4 Accounting New Software d Acme Inc. Accounting New Software
5 Accounting New Accounting Standards e Acme Inc. Accounting New Accounting Standards
6 Research New Product Line f Acme Inc. Research New Product Line
7 Research New Labs g Acme Inc. Research New Labs
8 IT Outsource h Acme Inc. IT Outsource
9 IT Go agile i Acme Inc. IT Go agile
10 IT Switch to R j Acme Inc. IT Switch to R
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论