R: grouping and expanding the data.frame to include only possible pairs of names in a column

Question

My goal is to expand my data.frame in R to include possible combinations (but not all possible combinations) of the values in one column. This is similar to expand.grid, except that expand.grid returns every possible combination rather than only those present in the data.

To start, I need to group by each factor in the 1st column and keep the information in column 2. Column 3 holds character strings with the names of 'Animals'. I want to find each pair that occurs in this column, row by row (but not all possible pairs). For example, if 'Dreadwing' and 'Scorcher' are in the first two rows, that is one pair: Dreadwing-Scorcher, and Scorcher-Dreadwing should not be included. However, if rows 4 and 5 are T-Rex and T-Rex, the pair T-Rex-T-Rex should appear once, because T-Rex appears in 2 separate rows of the column 'Animals'. If T-Rex appeared in 3 separate rows, the pair should appear 3 times, and so on.

Lastly, the pairs should expand the data.frame by 2 columns that store them. In other words, Dreadwing and Scorcher should each be in their own column, but in the same row.

I have manually put together this picture to show what my output should be from the data.frame I have (note: Area_1 and Area_2 are separated only to fit the results in one screenshot). On the left, arrows show the desired combinations for the first row only, Dreadwing. On the right is the desired result for all of Area_1 and Area_2.

[Image: hand-assembled table of the desired output for Area_1 and Area_2]

For Area_1, the Dreadwing-Dreadwing pair should not occur because Dreadwing does not appear in any other row of Area_1. However, T-Rex appears in 2 separate rows, so the T-Rex-T-Rex combination should be there, as should the combination of each T-Rex row with each Waterwing row: 4 T-Rex-Waterwing combinations in total.
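The counts in this example follow from simple combinatorics, which a short base R sketch can make concrete. It assumes the seven Area_1 animals from the reproducible data, and the variable names (animals_area1, counts, and so on) are illustrative rather than part of the original question.

```r
# Animals observed in Area_1, one entry per row (taken from the example data).
animals_area1 <- c("Dreadwing", "Scorcher", "Snapmaw", "T-Rex", "T-Rex",
                   "Waterwing", "Waterwing")
counts <- table(animals_area1)

# Every unordered pair of distinct rows: choose(7, 2).
total_pairs <- choose(length(animals_area1), 2)

# A same-name pair needs two different rows carrying that name.
trex_trex <- choose(counts[["T-Rex"]], 2)
dreadwing_dreadwing <- choose(counts[["Dreadwing"]], 2)

# Cross pairs: each T-Rex row combined with each Waterwing row.
trex_waterwing <- counts[["T-Rex"]] * counts[["Waterwing"]]

print(c(total = total_pairs,
        trex_trex = trex_trex,
        dreadwing_dreadwing = dreadwing_dreadwing,
        trex_waterwing = trex_waterwing))
```

Under this reading, Area_1 should yield 21 rows in total, with one T-Rex-T-Rex row, no Dreadwing-Dreadwing row, and four T-Rex-Waterwing rows.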

Reproducible Data

Creating the data.frame

    v <- c(rep("Area_1", 7), rep("Area_2", 7))
    w <- c(rep("Forest", 7), rep("Cave", 7))
    y <- c("Waterwing", "Scorcher", "Snapmaw", "T-Rex", "T-Rex", "Dreadwing",
           "Waterwing", "Snake", "T-Rex", "T-Rex", "Dreadwing", "Snapmaw", "Scorcher",
           "Waterwing")
    stack_df <- data.frame(Area = v, Location = w, Animals = y)
    # Sort so that identical animals sit on adjacent rows within each area
    stack_df <- stack_df[order(stack_df$Area, stack_df$Location, stack_df$Animals), ]
    row.names(stack_df) <- 1:nrow(stack_df)

Following the tidyr documentation, I tried the expand command together with the nesting command (which keeps only combinations that appear in the data), but it does not work. For example:

    library(tidyr)
    stack_df %>%
      dplyr::group_by(Area) %>%
      expand(nesting(Location, Animals, Animals))

returns only 11 of the 14 rows.
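A likely reason for the 11 rows is that nesting keeps only the distinct combinations present in the data, so repeated rows (the two T-Rex rows in each area, for instance) collapse into one. The base R check below uses unique() as a stand-in for that de-duplication; it illustrates the row count and is not tidyr's implementation. The name distinct_rows is hypothetical.

```r
# Rebuild the example data so the check is self-contained.
v <- c(rep("Area_1", 7), rep("Area_2", 7))
w <- c(rep("Forest", 7), rep("Cave", 7))
y <- c("Waterwing", "Scorcher", "Snapmaw", "T-Rex", "T-Rex", "Dreadwing",
       "Waterwing", "Snake", "T-Rex", "T-Rex", "Dreadwing", "Snapmaw", "Scorcher",
       "Waterwing")
stack_df <- data.frame(Area = v, Location = w, Animals = y)

# Drop rows that repeat an Area/Location/Animals combination.
distinct_rows <- unique(stack_df)

print(nrow(stack_df))             # all rows
print(nrow(distinct_rows))        # distinct combinations only
print(table(distinct_rows$Area))  # distinct animals per area
```

Area_1 has 5 distinct animals and Area_2 has 6, which accounts for the 11 rows. Any approach that de-duplicates first therefore cannot produce the T-Rex-T-Rex pair.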

I have tried several approaches with the expand and crossing commands. However, like expand.grid, they return all possible combinations.

Even so, expand is the closest I have gotten to what I am aiming for.

    stack_df %>%
      dplyr::group_by(Area) %>%
      expand(Location, Animals, Animals)

As you can see, all possibilities are included, which is not the desired result.

Any ideas on how I can get this done?

Answer 1

Score: 1

It sounds to me like you want all pairs of Animals (within each Area/Location group) in which the 1st Animal of the pair occurs on an earlier row than the 2nd.

We can do this by adding a row-number index and doing a self-join with an inequality constraint on the row numbers. (This requires dplyr version >= 1.1.0.)

    library(dplyr)

    # Number the rows within each Area/Location group
    # (the native pipe |> needs R >= 4.1)
    stack_df <- stack_df |>
      mutate(group_i = row_number(), .by = c(Area, Location))

    # Self-join: keep only pairs where the left row comes before the right row
    stack_df |>
      inner_join(
        stack_df,
        by = join_by(Area, Location, group_i < group_i),
        suffix = c("..2", "..3")
      ) |>
      select(-starts_with("group_"))
    #      Area Location Animals..2 Animals..3
    # 1  Area_1   Forest  Dreadwing   Scorcher
    # 2  Area_1   Forest  Dreadwing    Snapmaw
    # 3  Area_1   Forest  Dreadwing      T-Rex
    # 4  Area_1   Forest  Dreadwing      T-Rex
    # 5  Area_1   Forest  Dreadwing  Waterwing
    # 6  Area_1   Forest  Dreadwing  Waterwing
    # 7  Area_1   Forest   Scorcher    Snapmaw
    # 8  Area_1   Forest   Scorcher      T-Rex
    # 9  Area_1   Forest   Scorcher      T-Rex
    # 10 Area_1   Forest   Scorcher  Waterwing
    # 11 Area_1   Forest   Scorcher  Waterwing
    # 12 Area_1   Forest    Snapmaw      T-Rex
    # 13 Area_1   Forest    Snapmaw      T-Rex
    # 14 Area_1   Forest    Snapmaw  Waterwing
    # 15 Area_1   Forest    Snapmaw  Waterwing
    # 16 Area_1   Forest      T-Rex      T-Rex
    # 17 Area_1   Forest      T-Rex  Waterwing
    # 18 Area_1   Forest      T-Rex  Waterwing
    # 19 Area_1   Forest      T-Rex  Waterwing
    # 20 Area_1   Forest      T-Rex  Waterwing
    # 21 Area_1   Forest  Waterwing  Waterwing
    # 22 Area_2     Cave  Dreadwing   Scorcher
    # 23 Area_2     Cave  Dreadwing      Snake
    # 24 Area_2     Cave  Dreadwing    Snapmaw
    # 25 Area_2     Cave  Dreadwing      T-Rex
    # 26 Area_2     Cave  Dreadwing      T-Rex
    # 27 Area_2     Cave  Dreadwing  Waterwing
    # 28 Area_2     Cave   Scorcher      Snake
    # 29 Area_2     Cave   Scorcher    Snapmaw
    # 30 Area_2     Cave   Scorcher      T-Rex
    # 31 Area_2     Cave   Scorcher      T-Rex
    # 32 Area_2     Cave   Scorcher  Waterwing
    # 33 Area_2     Cave      Snake    Snapmaw
    # 34 Area_2     Cave      Snake      T-Rex
    # 35 Area_2     Cave      Snake      T-Rex
    # 36 Area_2     Cave      Snake  Waterwing
    # 37 Area_2     Cave    Snapmaw      T-Rex
    # 38 Area_2     Cave    Snapmaw      T-Rex
    # 39 Area_2     Cave    Snapmaw  Waterwing
    # 40 Area_2     Cave      T-Rex      T-Rex
    # 41 Area_2     Cave      T-Rex  Waterwing
    # 42 Area_2     Cave      T-Rex  Waterwing
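If dplyr >= 1.1.0 is not available, the same "earlier row paired with later row" idea can be sketched in base R with combn() on the row indices of each group. This is an alternative under stated assumptions, not the answer's method: the helper pair_rows and the column names Animal1 and Animal2 are hypothetical, and the data is rebuilt here so the block runs on its own.

```r
# Rebuild and sort the example data.
v <- c(rep("Area_1", 7), rep("Area_2", 7))
w <- c(rep("Forest", 7), rep("Cave", 7))
y <- c("Waterwing", "Scorcher", "Snapmaw", "T-Rex", "T-Rex", "Dreadwing",
       "Waterwing", "Snake", "T-Rex", "T-Rex", "Dreadwing", "Snapmaw", "Scorcher",
       "Waterwing")
stack_df <- data.frame(Area = v, Location = w, Animals = y)
stack_df <- stack_df[order(stack_df$Area, stack_df$Location, stack_df$Animals), ]

# One data.frame per Area/Location group, ordered by Area, then Location.
groups <- split(stack_df,
                interaction(stack_df$Area, stack_df$Location,
                            drop = TRUE, lex.order = TRUE))

# Hypothetical helper: all (i, j) row pairs with i < j inside one group.
pair_rows <- function(g) {
  if (nrow(g) < 2) return(NULL)  # a single row cannot form a pair
  idx <- combn(nrow(g), 2)       # each column holds one pair of row indices
  data.frame(Area = g$Area[1],
             Location = g$Location[1],
             Animal1 = g$Animals[idx[1, ]],
             Animal2 = g$Animals[idx[2, ]])
}

result <- do.call(rbind, lapply(groups, pair_rows))
row.names(result) <- NULL
print(nrow(result))
```

Because combn() works on row positions rather than on names, duplicated animals are paired once per pair of rows, which is what produces the single T-Rex-T-Rex row and the four T-Rex-Waterwing rows in Area_1.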

Answer 2

Score: 0

If you have R > 4.3, there is a small package on GitHub that I use to do multiple tests.

    # devtools::install_github('oonyambu/SLR')
    library(dplyr)  # mutate()
    library(tidyr)  # separate()

    stack_df %>%
      mutate(Area = paste(Area, Location), z = 1) %>%
      SLR::multiple_tests(z ~ Animals | Area, ., \(x, y) list(NULL)) %>%
      separate(Area, c('Area', 'Location'), sep = ' ') %>%
      separate(Value, c('Animal1', 'Animal2'), sep = ':')
    #      Area Location response   Animal1   Animal2
    # 1  Area_1   Forest        z Dreadwing  Scorcher
    # 2  Area_1   Forest        z Dreadwing   Snapmaw
    # 3  Area_1   Forest        z Dreadwing     T-Rex
    # 4  Area_1   Forest        z Dreadwing Waterwing
    # 5  Area_1   Forest        z  Scorcher   Snapmaw
    # 6  Area_1   Forest        z  Scorcher     T-Rex
    # 7  Area_1   Forest        z  Scorcher Waterwing
    # 8  Area_1   Forest        z   Snapmaw     T-Rex
    # 9  Area_1   Forest        z   Snapmaw Waterwing
    # 10 Area_1   Forest        z     T-Rex Waterwing
    # 11 Area_2     Cave        z Dreadwing  Scorcher
    # 12 Area_2     Cave        z Dreadwing     Snake
    # 13 Area_2     Cave        z Dreadwing   Snapmaw
    # 14 Area_2     Cave        z Dreadwing     T-Rex
    # 15 Area_2     Cave        z Dreadwing Waterwing
    # 16 Area_2     Cave        z  Scorcher     Snake
    # 17 Area_2     Cave        z  Scorcher   Snapmaw
    # 18 Area_2     Cave        z  Scorcher     T-Rex
    # 19 Area_2     Cave        z  Scorcher Waterwing
    # 20 Area_2     Cave        z     Snake   Snapmaw
    # 21 Area_2     Cave        z     Snake     T-Rex
    # 22 Area_2     Cave        z     Snake Waterwing
    # 23 Area_2     Cave        z   Snapmaw     T-Rex
    # 24 Area_2     Cave        z   Snapmaw Waterwing
    # 25 Area_2     Cave        z     T-Rex Waterwing
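The output above has 25 rows and pairs each distinct name only once per group, so it contains no T-Rex-T-Rex row and only one T-Rex-Waterwing row per area. A table with the same pairs can be sketched in base R without the SLR package. The sketch assumes the output represents pairs of unique names per group; it does not describe how SLR::multiple_tests works internally, and the helper distinct_pairs is hypothetical.

```r
# Rebuild and sort the example data.
v <- c(rep("Area_1", 7), rep("Area_2", 7))
w <- c(rep("Forest", 7), rep("Cave", 7))
y <- c("Waterwing", "Scorcher", "Snapmaw", "T-Rex", "T-Rex", "Dreadwing",
       "Waterwing", "Snake", "T-Rex", "T-Rex", "Dreadwing", "Snapmaw", "Scorcher",
       "Waterwing")
stack_df <- data.frame(Area = v, Location = w, Animals = y)
stack_df <- stack_df[order(stack_df$Area, stack_df$Location, stack_df$Animals), ]

groups <- split(stack_df,
                interaction(stack_df$Area, stack_df$Location,
                            drop = TRUE, lex.order = TRUE))

# Hypothetical helper: pairs of distinct animal names within one group.
distinct_pairs <- function(g) {
  animal_names <- unique(as.character(g$Animals))
  if (length(animal_names) < 2) return(NULL)  # nothing to pair
  p <- combn(animal_names, 2)                 # each column is one name pair
  data.frame(Area = g$Area[1],
             Location = g$Location[1],
             Animal1 = p[1, ],
             Animal2 = p[2, ])
}

name_pairs <- do.call(rbind, lapply(groups, distinct_pairs))
row.names(name_pairs) <- NULL
print(nrow(name_pairs))
```

This gives choose(5, 2) = 10 pairs for Area_1 and choose(6, 2) = 15 for Area_2. It differs from the row-wise pairing requested in the question, where repeated animals produce repeated pairs.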

huangapple
  • Posted on July 14, 2023 at 09:08:40
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76684121.html