2023-03-03 19:34:11

Redefine factor levels and order within groups

Question

This is the simple example for illustration.
I want a summary of data presented in a predetermined order. I want to order col2 values depending on col1 values, and also include rows for factor levels within the col1 group that are not in the data (e.g. using group_by(..., .drop = FALSE)). Some values in col2 appear in more than one col1 group. There is no logic that can be applied to determine the order of col2. You might call it a two-level factor?

For example, my input data could be:

df <- read.table(
  header = TRUE,
  sep = ",",
  text = "
col1,col2
Tunnels,Dick
Tunnels,Tom
Tunnels,Tom
Beatles,George
Beatles,Paul
Beatles,Ringo
Beatles,Ringo
UK Artists,Gilbert
"
)

and my required output would be

 col1       col2        n
 Beatles    John        0
 Beatles    Paul        1
 Beatles    George      1
 Beatles    Ringo       2
 UK Artists Gilbert     1
 UK Artists George      0
 Tunnels    Tom         2
 Tunnels    Dick        1
 Tunnels    Harry       0

The following, of course, does not work:

library(dplyr)

col2_tunnels <- c("Tom", "Dick", "Harry")
col2_beatles <- c("John", "Paul", "George", "Ringo")
col2_artists <- c("Gilbert", "George")
col2_order <- unique(c(col2_tunnels, col2_beatles, col2_artists)) # cannot have duplicates
col1_order <- c("Beatles", "UK Artists", "Tunnels")
df %>%
  mutate(
    col1 = factor(col1, levels = col1_order),
    col2 = factor(col2, levels = col2_order)
  ) %>%
  group_by(col1, col2, .drop = FALSE) %>%
  summarise(n = n())
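
To make the failure concrete, here is a minimal sketch of the same attempt, using count() as shorthand for group_by() plus summarise() and a data.frame() copy of the example data. A factor has a single level order, so "George" can only occupy one position, and .drop = FALSE crosses every col1 level with every col2 level instead of only the names that belong to each group. The object name crossed is introduced here for illustration only.

```r
library(dplyr)

# Same example data as above, built directly as a data frame
df <- data.frame(
  col1 = c("Tunnels", "Tunnels", "Tunnels", "Beatles", "Beatles",
           "Beatles", "Beatles", "UK Artists"),
  col2 = c("Dick", "Tom", "Tom", "George", "Paul", "Ringo", "Ringo", "Gilbert")
)

col1_order <- c("Beatles", "UK Artists", "Tunnels")
col2_order <- unique(c(
  "Tom", "Dick", "Harry",             # Tunnels
  "John", "Paul", "George", "Ringo",  # Beatles
  "Gilbert", "George"                 # UK Artists ("George" is dropped as a duplicate)
))

# One global factor per column, keeping empty level combinations
crossed <- df %>%
  mutate(
    col1 = factor(col1, levels = col1_order),
    col2 = factor(col2, levels = col2_order)
  ) %>%
  count(col1, col2, .drop = FALSE)

nrow(crossed)                            # one row per col1 x col2 level pair
length(col1_order) * length(col2_order)  # 3 * 8 level pairs
```

The result has a row for every pairing, such as Beatles with Tom, and within UK Artists "George" sorts before "Gilbert" because that is its position in the global level order.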

The only way forward I can see is to split the data by col1 levels and use a named list of vectors defining the factor order for each level of col1. While writing the question I found that this works:

col2_fctlist <- list(
  Tunnels = c("Tom", "Dick", "Harry"),
  Beatles = c("John", "Paul", "George", "Ringo"),
  'UK Artists' = c("Gilbert", "George")
)
# One summary per col1 group, each with its own col2 level order
x <- lapply(col1_order, function(col1grp)
  df %>% filter(col1 == col1grp) %>%
    mutate(col2 = factor(col2, levels = col2_fctlist[[col1grp]])) %>%
    group_by(col1, col2, .drop = FALSE) %>%
    summarise(n = n())
)
do.call(rbind, x)

Although I have found a solution that I think works for me, I'm still posting in case anybody can offer a better solution?

Answer 1

Score: 2

I don't know if this is better than yours! Using data.table, if I first set up the required order for col1 and col2 in a list like this:

l1 <- list(Beatles = data.frame(col2 = c("John", "Paul", "George", "Ringo")),
           `UK Artists` = data.frame(col2 = c("Gilbert", "George")),
           `Tunnels` = data.frame(col2 = c("Tom", "Dick", "Harry")))

Then I can turn this into a data.table using rbindlist and join it with df to get the output that you want in the specified order:

library(data.table)
setDT(df)  # df comes from read.table(), so convert it to a data.table first
dt1 <- rbindlist(l1, idcol = "col1")
df[, n := 1][dt1, on = c("col1", "col2")][, sum(n, na.rm = TRUE), .(col1, col2)]
         col1    col2 V1
1:    Beatles    John  0
2:    Beatles    Paul  1
3:    Beatles  George  1
4:    Beatles   Ringo  2
5: UK Artists Gilbert  1
6: UK Artists  George  0
7:    Tunnels     Tom  2
8:    Tunnels    Dick  1
9:    Tunnels   Harry  0

Answer 2

Score: 1

With a join:

library(tidyverse)
enframe(col2_fctlist, name = "col1", value = "col2") %>% unnest(col2) %>%
  left_join(df %>% count(col1, col2)) %>%
  replace_na(list(n = 0))
        col1    col2 n
1    Tunnels     Tom 2
2    Tunnels    Dick 1
3    Tunnels   Harry 0
4    Beatles    John 0
5    Beatles    Paul 1
6    Beatles  George 1
7    Beatles   Ringo 2
8 UK Artists Gilbert 1
9 UK Artists  George 0

Or with imap_dfr:

imap_dfr(col2_fctlist,
         ~ df %>%
           filter(col1 == .y) %>%
           mutate(col2 = factor(col2, levels = .x)) %>%
           count(col2, .drop = FALSE),
         .id = "col1")
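
Both snippets return the groups in the order of col2_fctlist (Tunnels first) rather than in col1_order. A minimal sketch of one way to get the requested group order, assuming the col1_order and col2_fctlist objects from the question, is to reorder the list before building the lookup table. The names lookup and result are introduced here for illustration only.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Example data and ordering objects as defined in the question
df <- data.frame(
  col1 = c("Tunnels", "Tunnels", "Tunnels", "Beatles", "Beatles",
           "Beatles", "Beatles", "UK Artists"),
  col2 = c("Dick", "Tom", "Tom", "George", "Paul", "Ringo", "Ringo", "Gilbert")
)
col1_order <- c("Beatles", "UK Artists", "Tunnels")
col2_fctlist <- list(
  Tunnels = c("Tom", "Dick", "Harry"),
  Beatles = c("John", "Paul", "George", "Ringo"),
  "UK Artists" = c("Gilbert", "George")
)

# Reordering the list by col1_order fixes the group order;
# the order inside each vector fixes the col2 order within a group.
lookup <- enframe(col2_fctlist[col1_order], name = "col1", value = "col2") %>%
  unnest(col2)

# left_join() keeps the row order of lookup; pairs absent from df get n = 0
result <- lookup %>%
  left_join(count(df, col1, col2), by = c("col1", "col2")) %>%
  mutate(n = coalesce(n, 0L))

result
```

Because the order lives in the lookup table rather than in factor levels, a name such as "George" can sit at a different position in each group.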
