用平均频率值替换“middle”频率。

huangapple go评论73阅读模式
英文:

Replace 'middle' frequencies with averaged frequency values

问题

以下是更简洁的 dplyr 代码,以获得所需的输出,而不使用 left_join

library(dplyr)

df %>%
  group_by(rowid) %>%
  mutate(avg_f = ifelse(position != min(position) & position != max(position),
                       mean(f, na.rm = TRUE), f)) %>%
  slice(c(1, 2, n())) %>%
  mutate(position = ifelse(position == min(position), 1,
                           ifelse(position == max(position), 2, 1.5))) %>%
  select(-avg_f)

这将生成与您提供的结果相同的输出,但不涉及 left_join

英文:

I have this type of data, with frequency data and position data grouped by rowid:

df
   rowid    word   f position
1      2       i 700        1
2      2      'm 600        2
3      2    fine   1        3
4      3     how 400        1
5      3      's 500        2
6      3     the 700        3
7      3 weather  20        4
8      4      it 390        1
9      4      's 500        2
10     4  really 177        3
11     4    very 200        4
12     4    cold  35        5
13     5       i 700        1
14     5    love 199        2
15     5     you 400        3

The task I'm facing seems simple: in those rowids where there are more than 3 positions, I need to replace the frequencies of all middle positions with their average. The following approach works but seems over-convoluted, so I'm almost certain there will be a more straightforward dplyrway to get the desired output:

df %>%
  group_by(rowid) %>%
  # filter for 'middle' positions:
  filter(position != first(position) & position != last(position)) %>%
  # summarise:
  summarize(across(position),
            # create average frequency:
            f_middle_position = mean(f, na.rm = TRUE),
            # concatenate words:
            word = str_c(word, collapse = " ")
            ) %>%
  filter(!duplicated(f_middle_position)) %>%
  # join with df:
  left_join(df, ., by = c("rowid", "position")) %>%
  # remove rows other than #1,#2, and last:
  group_by(rowid) %>%
  # create row count:
  mutate(rn = row_number()) %>%
  # filter first, second, and last row per group:
  filter(rn %in% c(1, 2, last(rn))) %>%
  # transfer frequencies for middle positions:
  mutate(f = ifelse(is.na(f_middle_position), f, f_middle_position)) %>%
  # make more changes:
  mutate(
    # change position labels:
    position = ifelse(position == first(position), 1,
                           ifelse(position == last(position), 2, 1.5)),
    # update word:
    word = ifelse(is.na(word.y), word.x, word.y)
         ) %>%
  # remove obsolete variables:
  select(-c(f_middle_position, word.y, word.x,rn))
 A tibble: 12 × 4
# Groups:   rowid [4]
   rowid     f position word          
   <dbl> <dbl>    <dbl> <chr>         
 1     2  700       1   i             
 2     2  600       1.5 'm            
 3     2    1       2   fine          
 4     3  400       1   how           
 5     3  600       1.5 's the        
 6     3   20       2   weather       
 7     4  390       1   it            
 8     4  292.      1.5 's really very
 9     4   35       2   cold          
10     5  700       1   i             
11     5  199       1.5 love          
12     5  400       2   you 

How can this result be obtained in a more concise way in dplyr and, preferably without the left_join, which causes problems with my actual data?

Data:

df <- data.frame(
  rowid = c(2,2,2,3,3,3,3,4,4,4,4,4,5,5,5),
  word = c("i","'m","fine",
           "how","'s","the","weather",
           "it","'s","really", "very","cold",
           "i","love","you"),
  f = c(700,600,1,
        400,500,700,20,
        390,500,177,200,35,
        700,199,400),
  position = c(1,2,3,
               1,2,3,4,
               1,2,3,4,5,
               1,2,3)
)

答案1

得分: 1

# 创建一个名为 `pos` 的组变量,将第一行标记为 `1`,中间行标记为 `1.5`,最后一行标记为 `2`。
# 然后按照 `rowid` 和 `pos` 进行分组,对 `f` 应用 `mean()` 函数,对 `word` 应用 `paste()` 函数。

library(dplyr)

df %>%
  group_by(rowid) %>%
  mutate(pos = case_when(position == 1 ~ 1, position == n() ~ 2, TRUE ~ 1.5)) %>%
  group_by(rowid, pos) %>%
  summarise(f = mean(f), word = paste(word, collapse = ' '), .groups = 'drop')
英文:

You can create a group variable pos that marks the first row with 1, the middle with 1.5, and the last with 2. Then group the data by rowid and pos and apply mean() and paste() on f and word respectively.

library(dplyr)

df %>%
  group_by(rowid) %>% 
  mutate(pos = case_when(position == 1 ~ 1, position == n() ~ 2, TRUE ~ 1.5)) %>%
  group_by(rowid, pos) %>%
  summarise(f = mean(f), word = paste(word, collapse = ' '), .groups = 'drop')

# # A tibble: 12 × 4
#    rowid   pos     f word          
#    <dbl> <dbl> <dbl> <chr>         
#  1     2   1    700  i             
#  2     2   1.5  600  'm            
#  3     2   2      1  fine          
#  4     3   1    400  how           
#  5     3   1.5  600  's the        
#  6     3   2     20  weather       
#  7     4   1    390  it            
#  8     4   1.5  292. 's really very
#  9     4   2     35  cold          
# 10     5   1    700  i             
# 11     5   1.5  199  love          
# 12     5   2    400  you

huangapple
  • 本文由 发表于 2023年1月9日 18:51:38
  • 转载请务必保留本文链接:https://go.coder-hub.com/75056207.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定