July 14, 2023 09:08:40

R: grouping and expanding the data.frame to include only possible pairs of names in a column

Question

My goal is to expand my data.frame in R to include possible combinations (but not all possible combinations) of the values in one column. This is similar to the expand.grid command, except that expand.grid gives you every possible combination, not just those present in the data.

To start, I need to group by each factor in the 1st column and keep the information in column 2. In column 3, I have character strings with the names of 'Animals'. I want to find each possible pair that occurs in this column, row by row (but not all possible pairs). For example, if I have 'Dreadwing' and 'Scorcher' in the first two rows, that would be one pair: Dreadwing-Scorcher, and the result should not include Scorcher-Dreadwing. However, if rows 4 and 5 are T-Rex and T-Rex, the pair T-Rex-T-Rex should appear once, because T-Rex appears in 2 separate rows of the column 'Animals'. If T-Rex were to appear in 3 separate rows, the pair should appear 3 times, and so on.

Lastly, the pairs should expand the data.frame by 2 columns that store them. In other words, the two members of the Dreadwing and Scorcher pair should each be in their own column, but in the same row.
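
As a small illustration of the pairing rule described above, base R's combn() pairs elements by position, so a name that occupies two rows is paired with itself exactly once and reversed pairs never appear. The vector `animals` below is a made-up four-row example rather than the real data, and the column names `Animal1` and `Animal2` are placeholders.

```r
# Hypothetical four-row example (not the real data)
animals <- c("Dreadwing", "Scorcher", "T-Rex", "T-Rex")

# combn() combines positions, so repeated names count as separate rows
pairs <- t(combn(animals, 2))

# Store each member of a pair in its own column, on the same row
pairs_df <- data.frame(Animal1 = pairs[, 1], Animal2 = pairs[, 2])
print(pairs_df)
```

Four rows give choose(4, 2) = 6 pairs, one of which is T-Rex with T-Rex.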

I have manually put together this picture to show what my output should be from the data.frame I have (note: Area_1 and Area_2 are separated only to fit the results in one screenshot). The left: I have put arrows showing the desired combinations just from the first row, Dreadwing. On the right: the desired result for all of Area_1 and Area_2.

In the desired result for Area_1, the Dreadwing-Dreadwing pair should not occur, because Dreadwing does not appear in any other row for Area_1. However, T-Rex appears in 2 separate rows, so the T-Rex-T-Rex combination should be there, as well as each T-Rex row combined with each Waterwing row. That gives 4 T-Rex-Waterwing combinations.

Reproducible Data

Creating the data.frame

v <- c(rep("Area_1", 7), rep("Area_2", 7))
w <- c(rep("Forest", 7), rep("Cave", 7))
y <- c("Waterwing", "Scorcher", "Snapmaw", "T-Rex", "T-Rex", "Dreadwing", 
       "Waterwing", "Snake", "T-Rex", "T-Rex", "Dreadwing", "Snapmaw", "Scorcher", 
       "Waterwing")
stack_df <- data.frame(Area = v, Location = w, Animals = y)
stack_df <- stack_df[order(stack_df$Area, stack_df$Location, stack_df$Animals), ]
row.names(stack_df) <- 1:nrow(stack_df)

Following the tidyr documentation, I found that the expand command combined with the nesting command (which keeps only combinations that appear in the data) does not do what I need. For example:

library(tidyr)
stack_df %>%
    dplyr::group_by(Area) %>%
    expand(nesting(Location, Animals, Animals))

will return only 11/14 rows.
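
A likely reason for the 11 rows is that nesting() keeps only the distinct combinations found in the data, so the repeated T-Rex and Waterwing rows collapse. The base R check below is a sketch of that reading, not a trace of what tidyr does internally. It rebuilds the question's data and counts distinct rows with unique().

```r
# Rebuild the data from the question
v <- c(rep("Area_1", 7), rep("Area_2", 7))
w <- c(rep("Forest", 7), rep("Cave", 7))
y <- c("Waterwing", "Scorcher", "Snapmaw", "T-Rex", "T-Rex", "Dreadwing",
       "Waterwing", "Snake", "T-Rex", "T-Rex", "Dreadwing", "Snapmaw", "Scorcher",
       "Waterwing")
stack_df <- data.frame(Area = v, Location = w, Animals = y)

# Repeated (Area, Location, Animals) rows collapse to a single row
distinct_rows <- unique(stack_df)

nrow(stack_df)       # 14
nrow(distinct_rows)  # 11
```

Any approach that works on distinct values loses the repeated rows that the desired output depends on.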

I have tried multiple approaches using the expand and crossing commands. However, like expand.grid, these commands give you all possible combinations.

Despite this, using the expand command is the closest I have gotten to what I am aiming for.

stack_df %>%
    dplyr::group_by(Area) %>%
    expand(Location, Animals, Animals)

As you can see, all possibilities are included, which is not the desired result.

Any ideas on how I can get this done?

Answer 1

Score: 1

It sounds/looks to me like you want to find all combinations (within the Area/Location group) of pairs of Animals where the 1st Animal in the pair occurs on a row before the 2nd Animal in the pair.

We can do this by adding a row number index and doing a self-join with an inequality constraint on the row numbers. (This requires dplyr version >= 1.1.0)

library(dplyr)
# Number the rows within each Area/Location group
stack_df <- stack_df |>
  mutate(group_i = row_number(), .by = c(Area, Location))
# Self-join, keeping only pairs where the first row comes before the second
stack_df |>
  inner_join(
    stack_df,
    by = join_by(Area, Location, group_i < group_i),
    suffix = c("..2", "..3")
  ) |>
  select(-starts_with("group_"))
#      Area Location Animals..2 Animals..3
# 1  Area_1   Forest  Dreadwing   Scorcher
# 2  Area_1   Forest  Dreadwing    Snapmaw
# 3  Area_1   Forest  Dreadwing      T-Rex
# 4  Area_1   Forest  Dreadwing      T-Rex
# 5  Area_1   Forest  Dreadwing  Waterwing
# 6  Area_1   Forest  Dreadwing  Waterwing
# 7  Area_1   Forest   Scorcher    Snapmaw
# 8  Area_1   Forest   Scorcher      T-Rex
# 9  Area_1   Forest   Scorcher      T-Rex
# 10 Area_1   Forest   Scorcher  Waterwing
# 11 Area_1   Forest   Scorcher  Waterwing
# 12 Area_1   Forest    Snapmaw      T-Rex
# 13 Area_1   Forest    Snapmaw      T-Rex
# 14 Area_1   Forest    Snapmaw  Waterwing
# 15 Area_1   Forest    Snapmaw  Waterwing
# 16 Area_1   Forest      T-Rex      T-Rex
# 17 Area_1   Forest      T-Rex  Waterwing
# 18 Area_1   Forest      T-Rex  Waterwing
# 19 Area_1   Forest      T-Rex  Waterwing
# 20 Area_1   Forest      T-Rex  Waterwing
# 21 Area_1   Forest  Waterwing  Waterwing
# 22 Area_2     Cave  Dreadwing   Scorcher
# 23 Area_2     Cave  Dreadwing      Snake
# 24 Area_2     Cave  Dreadwing    Snapmaw
# 25 Area_2     Cave  Dreadwing      T-Rex
# 26 Area_2     Cave  Dreadwing      T-Rex
# 27 Area_2     Cave  Dreadwing  Waterwing
# 28 Area_2     Cave   Scorcher      Snake
# 29 Area_2     Cave   Scorcher    Snapmaw
# 30 Area_2     Cave   Scorcher      T-Rex
# 31 Area_2     Cave   Scorcher      T-Rex
# 32 Area_2     Cave   Scorcher  Waterwing
# 33 Area_2     Cave      Snake    Snapmaw
# 34 Area_2     Cave      Snake      T-Rex
# 35 Area_2     Cave      Snake      T-Rex
# 36 Area_2     Cave      Snake  Waterwing
# 37 Area_2     Cave    Snapmaw      T-Rex
# 38 Area_2     Cave    Snapmaw      T-Rex
# 39 Area_2     Cave    Snapmaw  Waterwing
# 40 Area_2     Cave      T-Rex      T-Rex
# 41 Area_2     Cave      T-Rex  Waterwing
# 42 Area_2     Cave      T-Rex  Waterwing
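
For readers without dplyr >= 1.1.0, the same index-and-self-join idea can be sketched in base R. This is an illustrative alternative rather than the answer's own code: merge() builds every within-group pairing and a row filter then applies the inequality. The suffixes "1" and "2" and the resulting column names `Animals1` and `Animals2` are choices made for this sketch.

```r
# Rebuild and sort the data from the question
v <- c(rep("Area_1", 7), rep("Area_2", 7))
w <- c(rep("Forest", 7), rep("Cave", 7))
y <- c("Waterwing", "Scorcher", "Snapmaw", "T-Rex", "T-Rex", "Dreadwing",
       "Waterwing", "Snake", "T-Rex", "T-Rex", "Dreadwing", "Snapmaw", "Scorcher",
       "Waterwing")
stack_df <- data.frame(Area = v, Location = w, Animals = y)
stack_df <- stack_df[order(stack_df$Area, stack_df$Location, stack_df$Animals), ]

# Row index within each Area/Location group
stack_df$group_i <- ave(seq_len(nrow(stack_df)),
                        stack_df$Area, stack_df$Location,
                        FUN = seq_along)

# Self-join on the group keys: every row paired with every row of its group
joined <- merge(stack_df, stack_df,
                by = c("Area", "Location"), suffixes = c("1", "2"))

# Keep only pairs where the first row precedes the second
pairs <- joined[joined$group_i1 < joined$group_i2,
                c("Area", "Location", "Animals1", "Animals2")]
pairs <- pairs[order(pairs$Area, pairs$Animals1, pairs$Animals2), ]
row.names(pairs) <- NULL

nrow(pairs)  # 42
```

merge() materialises all 49 pairings per group before filtering, which is fine at this size but could become expensive for large groups.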

Answer 2

Score: 0

If you have R > 4.3, there is a small package on GitHub that I use to do multiple tests.

# devtools::install_github('oonyambu/SLR')
library(dplyr)
library(tidyr)
stack_df %>%
  mutate(Area = paste(Area, Location), z = 1) %>%
  SLR::multiple_tests(z ~ Animals | Area, ., \(x, y) list(NULL)) %>%
  separate(Area, c('Area', 'Location'), sep = ' ') %>%
  separate(Value, c('Animal1', 'Animal2'), sep = ':')

     Area Location response   Animal1   Animal2
1  Area_1   Forest        z Dreadwing  Scorcher
2  Area_1   Forest        z Dreadwing   Snapmaw
3  Area_1   Forest        z Dreadwing     T-Rex
4  Area_1   Forest        z Dreadwing Waterwing
5  Area_1   Forest        z  Scorcher   Snapmaw
6  Area_1   Forest        z  Scorcher     T-Rex
7  Area_1   Forest        z  Scorcher Waterwing
8  Area_1   Forest        z   Snapmaw     T-Rex
9  Area_1   Forest        z   Snapmaw Waterwing
10 Area_1   Forest        z     T-Rex Waterwing
11 Area_2     Cave        z Dreadwing  Scorcher
12 Area_2     Cave        z Dreadwing     Snake
13 Area_2     Cave        z Dreadwing   Snapmaw
14 Area_2     Cave        z Dreadwing     T-Rex
15 Area_2     Cave        z Dreadwing Waterwing
16 Area_2     Cave        z  Scorcher     Snake
17 Area_2     Cave        z  Scorcher   Snapmaw
18 Area_2     Cave        z  Scorcher     T-Rex
19 Area_2     Cave        z  Scorcher Waterwing
20 Area_2     Cave        z     Snake   Snapmaw
21 Area_2     Cave        z     Snake     T-Rex
22 Area_2     Cave        z     Snake Waterwing
23 Area_2     Cave        z   Snapmaw     T-Rex
24 Area_2     Cave        z   Snapmaw Waterwing
25 Area_2     Cave        z     T-Rex Waterwing
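
The 25 rows above appear to be pairs of distinct names only: there is no T-Rex/T-Rex row and just one T-Rex/Waterwing row per area, whereas the question counts pairs of rows. A quick base R count, sketched under that reading of the output, shows where the two totals come from.

```r
# Rebuild the data from the question
v <- c(rep("Area_1", 7), rep("Area_2", 7))
w <- c(rep("Forest", 7), rep("Cave", 7))
y <- c("Waterwing", "Scorcher", "Snapmaw", "T-Rex", "T-Rex", "Dreadwing",
       "Waterwing", "Snake", "T-Rex", "T-Rex", "Dreadwing", "Snapmaw", "Scorcher",
       "Waterwing")
stack_df <- data.frame(Area = v, Location = w, Animals = y)

# Distinct names per area, and rows per area
n_names <- as.vector(tapply(stack_df$Animals, stack_df$Area,
                            function(a) length(unique(a))))
n_rows <- as.vector(table(stack_df$Area))

# Pairs of different names versus pairs of rows
name_pairs <- choose(n_names, 2)
row_pairs <- choose(n_rows, 2)

sum(name_pairs)  # 25
sum(row_pairs)   # 42
```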
