Redefine factor levels and order within groups

Question

This is a simple example for illustration.
I want a summary of data presented in a predetermined order. I want to order the col2 values depending on the col1 value, and also include rows for factor levels within a col1 group that are not in the data (e.g. using group_by(..., .drop = FALSE)). Some values in col2 appear in more than one col1 group. There is no logic that can be applied to determine the order of col2. You could call it a two-level factor, maybe?

For example, my input data could be:

df <- read.table(
  header = TRUE,
  sep=",",
  text = "
col1,col2
Tunnels,Dick
Tunnels,Tom
Tunnels,Tom
Beatles,George
Beatles,Paul
Beatles,Ringo
Beatles,Ringo
UK Artists,Gilbert
"
)

and my required output would be

 col1       col2        n
 Beatles    John        0
 Beatles    Paul        1
 Beatles    George      1
 Beatles    Ringo       2
 UK Artists Gilbert     1
 UK Artists George      0
 Tunnels    Tom         2
 Tunnels    Dick        1
 Tunnels    Harry       0

The following, of course, does not work, because the levels of a factor cannot contain duplicates:

col2_tunnels <- c("Tom", "Dick", "Harry")
col2_beatles <- c("John", "Paul", "George", "Ringo")
col2_artists <- c("Gilbert", "George")
col2_order <- unique(c(col2_tunnels, col2_beatles, col2_artists)) # cannot have duplicates
col1_order <- c("Beatles", "UK Artists", "Tunnels")

df %>%
  mutate(
    col1 = factor(col1, levels = col1_order),
    col2 = factor(col2, levels = col2_order)
  ) %>%
  group_by(col1, col2, .drop = FALSE) %>%
  summarise(n = n())
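
To make the limitation concrete, here is a minimal base-R sketch using only the Beatles and UK Artists vectors from the question. It illustrates two things: factor() refuses duplicated levels, and after unique() the shared value "George" keeps a single global position, so it cannot come after "Gilbert" in one group while sitting between "Paul" and "Ringo" in another. The names dup_attempt and col2_order_demo are introduced here purely for illustration.

```r
# Base R only; no packages needed.
col2_beatles <- c("John", "Paul", "George", "Ringo")
col2_artists <- c("Gilbert", "George")

# "George" would appear twice in the levels, which factor() rejects.
dup_attempt <- tryCatch(
  factor("George", levels = c(col2_beatles, col2_artists)),
  error = function(e) "error"
)

# unique() removes the duplicate, but George keeps only his first position.
col2_order_demo <- unique(c(col2_beatles, col2_artists))

# George now sorts before Gilbert everywhere, including within "UK Artists",
# where the required order is Gilbert first.
george_pos  <- match("George", col2_order_demo)
gilbert_pos <- match("Gilbert", col2_order_demo)
```

This is why the order has to be defined per col1 group rather than through one global set of levels.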

The only way forward I can see is to split the data by col1 levels and use a named list of vectors defining the factor order for each level of col1. While writing the question I found that this works:

col2_fctlist <- list(
  Tunnels = c("Tom", "Dick", "Harry"),
  Beatles = c("John", "Paul", "George", "Ringo"),
  'UK Artists' = c("Gilbert", "George")
)

x <- lapply(col1_order, function(col1grp)
  df %>% filter(col1 == col1grp) %>%
    mutate(col2 = factor(col2, levels = col2_fctlist[[col1grp]])) %>%
    group_by(col1, col2, .drop = FALSE) %>%
    summarise(n = n())
)

do.call(rbind, x)

Although I have found a solution that I think works for me, I'm still posting in case anybody can offer a better solution?

Answer 1

Score: 2

I don't know if this is better than yours! Using data.table, if I first set up the required order for col1 and col2 in a list like this:

library(data.table)

l1 <- list(Beatles = data.frame(col2 = c("John", "Paul", "George", "Ringo")),
           `UK Artists` = data.frame(col2 = c("Gilbert", "George")),
           `Tunnels` = data.frame(col2 = c("Tom", "Dick", "Harry")))

Then I can turn this into a data.table using rbindlist and use a join with df to get the output that you want in the specified order:


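dt1 <- rbindlist(l1, idcol = "col1")

# Assumption: df is still a plain data.frame at this point, so convert it
# by reference; the := and join syntax below needs a data.table.
setDT(df)
df[, n := 1][dt1, on = c("col1", "col2")][, sum(n, na.rm = TRUE), .(col1, col2)]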

         col1    col2 V1
1:    Beatles    John  0
2:    Beatles    Paul  1
3:    Beatles  George  1
4:    Beatles   Ringo  2
5: UK Artists Gilbert  1
6: UK Artists  George  0
7:    Tunnels     Tom  2
8:    Tunnels    Dick  1
9:    Tunnels   Harry  0

Answer 2

Score: 1

With a join:

library(tidyverse)
enframe(col2_fctlist, name = "col1", value = "col2") %>% unnest(col2) %>%
  left_join(df %>% count(col1, col2), by = c("col1", "col2")) %>%
  replace_na(list(n = 0L))

        col1    col2 n
1    Tunnels     Tom 2
2    Tunnels    Dick 1
3    Tunnels   Harry 0
4    Beatles    John 0
5    Beatles    Paul 1
6    Beatles  George 1
7    Beatles   Ringo 2
8 UK Artists Gilbert 1
9 UK Artists  George 0
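
Note that the join returns the groups in the order of the list (Tunnels first), not in the col1 order the question asks for. One possible adjustment, sketched here under the assumption that col1_order and col2_fctlist are defined as in the question, is to index the list by col1_order before enframing it. The variable name result is only for illustration.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Input data and ordering definitions, copied from the question.
df <- read.table(header = TRUE, sep = ",", stringsAsFactors = FALSE, text = "
col1,col2
Tunnels,Dick
Tunnels,Tom
Tunnels,Tom
Beatles,George
Beatles,Paul
Beatles,Ringo
Beatles,Ringo
UK Artists,Gilbert
")

col1_order <- c("Beatles", "UK Artists", "Tunnels")
col2_fctlist <- list(
  Tunnels = c("Tom", "Dick", "Harry"),
  Beatles = c("John", "Paul", "George", "Ringo"),
  `UK Artists` = c("Gilbert", "George")
)

# Reorder the list by col1_order so the lookup table already has the
# desired row order, then join the observed counts onto it.
result <- enframe(col2_fctlist[col1_order], name = "col1", value = "col2") %>%
  unnest(col2) %>%
  left_join(count(df, col1, col2), by = c("col1", "col2")) %>%
  replace_na(list(n = 0L))  # combinations absent from df get a count of 0

print(result)
```

Because left_join keeps the row order of its left-hand table, the lookup table alone determines both the group order and the within-group order, and no factors are needed.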

Or with imap_dfr:

imap_dfr(col2_fctlist,
         ~ df %>%
           filter(col1 == .y) %>%
           mutate(col2 = factor(col2, levels = .x)) %>%
           count(col2, .drop = FALSE),
         .id = "col1")

huangapple
  • Published on 2023-03-03 19:34:11
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75626575.html