R: grouping and expanding the data.frame to include only possible pairs of names in a column
Question
My goal is to expand my data.frame in R to include possible combinations (but not all possible combinations) of the values in one column. This is similar to the expand.grid command, except that expand.grid gives you all possible combinations, not just those that are present.
To start, I need to group by each factor in the 1st column and keep the information in column 2. Column 3 holds character strings with the names of 'Animals'. I want to find each possible pair that occurs in this column, row by row (but not all possible pairs). For example, if I have 'Dreadwing' and 'Scorcher' in the first two rows, that would be one pair: Dreadwing-Scorcher, and the result should not also include Scorcher-Dreadwing. However, if rows 4 and 5 are T-Rex and T-Rex, the pair T-Rex-T-Rex should appear once, because T-Rex appears in 2 separate rows of the column 'Animals'. If T-Rex were to appear in 3 separate rows, the pair should appear 3 times, and so on.
Lastly, the pairs should expand the data.frame by 2 columns. In other words, the two members of the Dreadwing-Scorcher pair should each be in their own column, but in the same row.
I have manually put together this picture to show what my output should be from the data.frame I have (note: Area_1 and Area_2 are separated only to fit the results in one screenshot). The left: I have put arrows showing the desired combinations just from the first row, Dreadwing. On the right: the desired result for all of Area_1 and Area_2.
For the desired result, for Area_1, the Dreadwing-Dreadwing pair should not occur because Dreadwing does not appear in any other row for Area_1. However, T-Rex appears in 2 separate rows, so the combination T-Rex-T-Rex should be there, as well as each row of T-Rex combined with each row of Waterwing. That gives 4 T-Rex-Waterwing combinations.
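The counts described above can be sanity-checked with a little arithmetic. This is only a sketch; the variable names are made up for illustration and are not part of the original data preparation.

```r
# Unordered pairs among n rows holding the same animal: choose(n, 2)
pairs_two_rows   <- choose(2, 2)  # T-Rex in 2 rows -> 1 T-Rex-T-Rex pair
pairs_three_rows <- choose(3, 2)  # T-Rex in 3 rows -> 3 T-Rex-T-Rex pairs

# Each T-Rex row paired with each Waterwing row in Area_1
n_trex      <- 2
n_waterwing <- 2
cross_pairs <- n_trex * n_waterwing  # 4 T-Rex-Waterwing pairs

# Every area has 7 rows, so each area should yield choose(7, 2) pairs
pairs_per_area <- choose(7, 2)  # 21
```

With two areas, the full desired result should therefore have 42 rows.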
Reproducible Data
Creating the data.frame
v <- c(rep("Area_1", 7), rep("Area_2", 7))
w <- c(rep("Forest", 7), rep("Cave", 7))
y <- c("Waterwing", "Scorcher", "Snapmaw", "T-Rex", "T-Rex", "Dreadwing",
       "Waterwing", "Snake", "T-Rex", "T-Rex", "Dreadwing", "Snapmaw", "Scorcher",
       "Waterwing")
stack_df <- data.frame(Area = v, Location = w, Animals = y)
stack_df <- stack_df[order(stack_df$Area, stack_df$Location, stack_df$Animals), ]
row.names(stack_df) <- 1:nrow(stack_df)
Following the tidyr documentation, I found that expand() combined with nesting() (which keeps only combinations that appear in the data) does not work. For example:
library(tidyr)
stack_df %>%
  dplyr::group_by(Area) %>%
  expand(nesting(Location, Animals, Animals))
returns only 11 of the 14 rows.
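A likely explanation for the 11 rows, as far as I can tell, is that nesting() keeps only the distinct combinations present in the data, so the repeated T-Rex and Waterwing rows collapse. The base R sketch below is not the tidyr call itself; it only shows that the data has 11 distinct rows.

```r
# Rebuild the example data (same values as in the question)
v <- c(rep("Area_1", 7), rep("Area_2", 7))
w <- c(rep("Forest", 7), rep("Cave", 7))
y <- c("Waterwing", "Scorcher", "Snapmaw", "T-Rex", "T-Rex", "Dreadwing",
       "Waterwing", "Snake", "T-Rex", "T-Rex", "Dreadwing", "Snapmaw", "Scorcher",
       "Waterwing")
stack_df <- data.frame(Area = v, Location = w, Animals = y)

# Distinct Area/Location/Animals combinations
distinct_rows <- unique(stack_df)

nrow(stack_df)       # 14 rows in the original data
nrow(distinct_rows)  # 11 rows once duplicates are collapsed
```

Because the desired output depends on how many rows each animal occupies, any approach that deduplicates first will lose pairs such as T-Rex-T-Rex.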
I have tried multiple approaches using the expand and crossing commands. However, like expand.grid, these commands give you all possible combinations.
Despite this, using expand is the closest I have gotten to what I am aiming for.
stack_df %>%
  dplyr::group_by(Area) %>%
  expand(Location, Animals, Animals)
As you can see, all possibilities are included, which is not the desired result.
Any ideas on how I can get this done?
Answer 1
Score: 1
It sounds/looks to me like you want to find all combinations (within the Area/Location group) of pairs of Animals where the 1st Animal in the pair occurs on a row before the 2nd Animal in the pair.
We can do this by adding a row number index within each group and doing a self-join with an inequality constraint on the row numbers, so that each row is paired only with the rows after it. (This requires dplyr version >= 1.1.0.)
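library(dplyr)
# Number the rows within each Area/Location group
stack_df = stack_df |>
  mutate(group_i = row_number(), .by = c(Area, Location))
# Self-join, keeping only pairs where the left row comes before the right row
stack_df |>
  inner_join(
    stack_df,
    by = join_by(Area, Location, group_i < group_i),
    suffix = c("..2", "..3")
  ) |>
  select(-starts_with("group_"))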
# Area Location Animals..2 Animals..3
# 1 Area_1 Forest Dreadwing Scorcher
# 2 Area_1 Forest Dreadwing Snapmaw
# 3 Area_1 Forest Dreadwing T-Rex
# 4 Area_1 Forest Dreadwing T-Rex
# 5 Area_1 Forest Dreadwing Waterwing
# 6 Area_1 Forest Dreadwing Waterwing
# 7 Area_1 Forest Scorcher Snapmaw
# 8 Area_1 Forest Scorcher T-Rex
# 9 Area_1 Forest Scorcher T-Rex
# 10 Area_1 Forest Scorcher Waterwing
# 11 Area_1 Forest Scorcher Waterwing
# 12 Area_1 Forest Snapmaw T-Rex
# 13 Area_1 Forest Snapmaw T-Rex
# 14 Area_1 Forest Snapmaw Waterwing
# 15 Area_1 Forest Snapmaw Waterwing
# 16 Area_1 Forest T-Rex T-Rex
# 17 Area_1 Forest T-Rex Waterwing
# 18 Area_1 Forest T-Rex Waterwing
# 19 Area_1 Forest T-Rex Waterwing
# 20 Area_1 Forest T-Rex Waterwing
# 21 Area_1 Forest Waterwing Waterwing
# 22 Area_2 Cave Dreadwing Scorcher
# 23 Area_2 Cave Dreadwing Snake
# 24 Area_2 Cave Dreadwing Snapmaw
# 25 Area_2 Cave Dreadwing T-Rex
# 26 Area_2 Cave Dreadwing T-Rex
# 27 Area_2 Cave Dreadwing Waterwing
# 28 Area_2 Cave Scorcher Snake
# 29 Area_2 Cave Scorcher Snapmaw
# 30 Area_2 Cave Scorcher T-Rex
# 31 Area_2 Cave Scorcher T-Rex
# 32 Area_2 Cave Scorcher Waterwing
# 33 Area_2 Cave Snake Snapmaw
# 34 Area_2 Cave Snake T-Rex
# 35 Area_2 Cave Snake T-Rex
# 36 Area_2 Cave Snake Waterwing
# 37 Area_2 Cave Snapmaw T-Rex
# 38 Area_2 Cave Snapmaw T-Rex
# 39 Area_2 Cave Snapmaw Waterwing
# 40 Area_2 Cave T-Rex T-Rex
# 41 Area_2 Cave T-Rex Waterwing
# 42 Area_2 Cave T-Rex Waterwing
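For readers on an older dplyr, the same idea (index the rows within each group, self-join, keep only pairs where the first index is smaller) can be sketched in base R. This is an alternative under that assumption, not the answer's own code; the column name i and the suffixes "1" and "2" are chosen here for illustration.

```r
# Rebuild and sort the example data as in the question
v <- c(rep("Area_1", 7), rep("Area_2", 7))
w <- c(rep("Forest", 7), rep("Cave", 7))
y <- c("Waterwing", "Scorcher", "Snapmaw", "T-Rex", "T-Rex", "Dreadwing",
       "Waterwing", "Snake", "T-Rex", "T-Rex", "Dreadwing", "Snapmaw", "Scorcher",
       "Waterwing")
stack_df <- data.frame(Area = v, Location = w, Animals = y)
stack_df <- stack_df[order(stack_df$Area, stack_df$Location, stack_df$Animals), ]
row.names(stack_df) <- NULL

# Row index within each Area/Location group
stack_df$i <- ave(seq_len(nrow(stack_df)), stack_df$Area, stack_df$Location,
                  FUN = seq_along)

# Self-join on the grouping columns, then keep rows where left index < right index
pairs <- merge(stack_df, stack_df, by = c("Area", "Location"),
               suffixes = c("1", "2"))
pairs <- pairs[pairs$i1 < pairs$i2, ]
pairs <- pairs[order(pairs$Area, pairs$i1, pairs$i2),
               c("Area", "Location", "Animals1", "Animals2")]
row.names(pairs) <- NULL

nrow(pairs)  # 42
```

Note that merge() first builds every within-group combination (49 per area here) before filtering, so this sketch suits small groups better than large ones.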
Answer 2
Score: 0
If you have R > 4.3, there is a small package on GitHub that I use for multiple tests.
# devtools::install_github('oonyambu/SLR')
library(dplyr)  # mutate(), %>%
library(tidyr)  # separate()
stack_df %>%
  mutate(Area = paste(Area, Location), z = 1) %>%
  SLR::multiple_tests(z ~ Animals | Area, ., \(x, y) list(NULL)) %>%
  separate(Area, c('Area', 'Location'), sep = ' ') %>%
  separate(Value, c('Animal1', 'Animal2'), sep = ':')
Area Location response Animal1 Animal2
1 Area_1 Forest z Dreadwing Scorcher
2 Area_1 Forest z Dreadwing Snapmaw
3 Area_1 Forest z Dreadwing T-Rex
4 Area_1 Forest z Dreadwing Waterwing
5 Area_1 Forest z Scorcher Snapmaw
6 Area_1 Forest z Scorcher T-Rex
7 Area_1 Forest z Scorcher Waterwing
8 Area_1 Forest z Snapmaw T-Rex
9 Area_1 Forest z Snapmaw Waterwing
10 Area_1 Forest z T-Rex Waterwing
11 Area_2 Cave z Dreadwing Scorcher
12 Area_2 Cave z Dreadwing Snake
13 Area_2 Cave z Dreadwing Snapmaw
14 Area_2 Cave z Dreadwing T-Rex
15 Area_2 Cave z Dreadwing Waterwing
16 Area_2 Cave z Scorcher Snake
17 Area_2 Cave z Scorcher Snapmaw
18 Area_2 Cave z Scorcher T-Rex
19 Area_2 Cave z Scorcher Waterwing
20 Area_2 Cave z Snake Snapmaw
21 Area_2 Cave z Snake T-Rex
22 Area_2 Cave z Snake Waterwing
23 Area_2 Cave z Snapmaw T-Rex
24 Area_2 Cave z Snapmaw Waterwing
25 Area_2 Cave z T-Rex Waterwing
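The 25 rows above appear to be the pairs of distinct animal names within each area (choose(5, 2) + choose(6, 2) = 10 + 15), so repeated rows such as T-Rex-T-Rex are not counted. Assuming that reading, a similar table can be sketched in base R with combn(), without the GitHub package; the names pairs_by_group and unique_pairs are illustrative only.

```r
# Rebuild the example data (same values as in the question)
v <- c(rep("Area_1", 7), rep("Area_2", 7))
w <- c(rep("Forest", 7), rep("Cave", 7))
y <- c("Waterwing", "Scorcher", "Snapmaw", "T-Rex", "T-Rex", "Dreadwing",
       "Waterwing", "Snake", "T-Rex", "T-Rex", "Dreadwing", "Snapmaw", "Scorcher",
       "Waterwing")
stack_df <- data.frame(Area = v, Location = w, Animals = y)

# For each area, take all 2-element combinations of the distinct animal names
pairs_by_group <- lapply(split(stack_df, stack_df$Area), function(d) {
  p <- t(combn(sort(unique(d$Animals)), 2))
  data.frame(Area = d$Area[1], Location = d$Location[1],
             Animal1 = p[, 1], Animal2 = p[, 2])
})
unique_pairs <- do.call(rbind, pairs_by_group)
row.names(unique_pairs) <- NULL

nrow(unique_pairs)  # 25

# Without unique(), combn() pairs rows rather than names: 21 pairs for Area_1
n_row_pairs_area1 <- ncol(combn(stack_df$Animals[stack_df$Area == "Area_1"], 2))
```

The last line hints at why this output differs from the one requested in the question: pairing rows instead of distinct names is what produces the duplicated T-Rex and Waterwing combinations.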