How to subtract the value of some rows from other rows in R according to categorical variables

Question

I am trying to determine the number of verbs (n) in each tense and aspect (TA) for each text (filename) in my dataset. I have my data saved as a tibble (see below), but the values in the n column are not yet accurate because some categories subsume others. For instance, to get an accurate count of past_simple, I need to subtract the numbers (n) for past_perfect and past_progressive in that same text. Essentially, I am looking to do something like this:

  1. Subtract the values (n) of past_perfect and past_progressive from the value of past_simple for filename BIO.G0.01.1

  2. Repeat this process for each individual file

tib <- structure(
  list(
    TA = c(
      "past_perfect",
      "past_progressive",
      "past_simple",
      "past_simple",
      "past_simple",
      "past_simple",
      "past_simple",
      "past_simple",
      "past_perfect",
      "past_progressive"
    ),
    tense = c(
      "past",
      "past",
      "past",
      "past",
      "past",
      "past",
      "past",
      "past",
      "past",
      "past"
    ),
    aspect = c(
      "perfect",
      "progressive",
      "simple",
      "simple",
      "simple",
      "simple",
      "simple",
      "simple",
      "perfect",
      "progressive"
    ),
    filename = c(
      "BIO.G0.01.1",
      "BIO.G0.01.1",
      "BIO.G0.01.1",
      "BIO.G0.02.1",
      "BIO.G0.02.2",
      "BIO.G0.02.4",
      "BIO.G0.02.5",
      "BIO.G0.02.6",
      "BIO.G0.03.1",
      "BIO.G0.03.1"
    ),
    discipline = c(
      "BIO",
      "BIO",
      "BIO",
      "BIO",
      "BIO",
      "BIO",
      "BIO",
      "BIO",
      "BIO",
      "BIO"
    ),
    nativeness = c("NS", "NS", "NS",
                   "NS", "NS", "NS", "NS", "NS", "NS", "NS"),
    year = c("G0",
             "G0", "G0", "G0", "G0", "G0", "G0", "G0", "G0", "G0"),
    gender = c("F",
               "F", "F", "M", "M", "M", "M", "M", "F", "F"),
    n = c(2L, 2L,
          57L, 39L, 3L, 4L, 49L, 103L, 1L, 1L)
  ),
  class = c("grouped_df",
            "tbl_df", "tbl", "data.frame"),
  row.names = c(NA, -10L),
  groups = structure(
    list(
      filename = c(
        "BIO.G0.01.1",
        "BIO.G0.02.1",
        "BIO.G0.02.2",
        "BIO.G0.02.4",
        "BIO.G0.02.5",
        "BIO.G0.02.6",
        "BIO.G0.03.1"
      ),
      discipline = c("BIO", "BIO", "BIO", "BIO", "BIO", "BIO",
                     "BIO"),
      nativeness = c("NS", "NS", "NS", "NS", "NS", "NS",
                     "NS"),
      year = c("G0", "G0", "G0", "G0", "G0", "G0", "G0"),
      gender = c("F", "M", "M", "M", "M", "M", "F"),
      .rows = structure(
        list(1:3, 4L, 5L, 6L, 7L, 8L, 9:10),
        ptype = integer(0),
        class = c("vctrs_list_of",
                  "vctrs_vctr", "list")
      )
    ),
    class = c("tbl_df", "tbl", "data.frame"),
    row.names = c(NA, -7L),
    .drop = TRUE
  )
)

I know how to subtract rows from other rows based on their position, like this:

tib[3, 9] <- tib[3, 9] - tib[2, 9] - tib[1, 9]

But the rows do not always appear in this predictable order because not all TA options are present in each text (filename). I'm also not sure how to write the code to restart this process again each time it comes across a new filename.

I am still learning how to manipulate data in R. Any suggestions would be very much appreciated!

Answer 1

Score: 0

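Based on the logic shown, we may need a grouped operation. Because some elements of 'TA' are missing for some filenames, it may be better to reshape to wide format first (filling the missing categories with 0), do the subtraction there, and then join the result back onto the original data.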

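library(dplyr)
library(tidyr)
tib %>%
   ungroup() %>%
   # tense and aspect vary with TA, so drop them before widening
   select(-tense, -aspect) %>%
   # one row per file; TA categories absent from a file are filled with 0
   pivot_wider(names_from = TA, values_from = n, values_fill = 0) %>%
   # corrected past_simple count; .keep = 'unused' drops the three count columns
   mutate(n1 = past_simple - past_progressive - past_perfect,
       TA = 'past_simple', .keep = 'unused') %>%
   # join back on the shared columns (TA, filename, discipline, ...)
   left_join(tib %>% ungroup(), .) %>%
   # use n1 where a past_simple row matched, otherwise keep the original n
   mutate(n = coalesce(n1, n), .keep = 'unused')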

-output

# A tibble: 10 × 9
   TA               tense aspect      filename    discipline nativeness year  gender     n
   <chr>            <chr> <chr>       <chr>       <chr>      <chr>      <chr> <chr>  <int>
 1 past_perfect     past  perfect     BIO.G0.01.1 BIO        NS         G0    F          2
 2 past_progressive past  progressive BIO.G0.01.1 BIO        NS         G0    F          2
 3 past_simple      past  simple      BIO.G0.01.1 BIO        NS         G0    F         53
 4 past_simple      past  simple      BIO.G0.02.1 BIO        NS         G0    M         39
 5 past_simple      past  simple      BIO.G0.02.2 BIO        NS         G0    M          3
 6 past_simple      past  simple      BIO.G0.02.4 BIO        NS         G0    M          4
 7 past_simple      past  simple      BIO.G0.02.5 BIO        NS         G0    M         49
 8 past_simple      past  simple      BIO.G0.02.6 BIO        NS         G0    M        103
 9 past_perfect     past  perfect     BIO.G0.03.1 BIO        NS         G0    F          1
10 past_progressive past  progressive BIO.G0.03.1 BIO        NS         G0    F          1
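
If the reshape and join feel heavy, a grouped mutate may also work. The following is only a minimal sketch under a few assumptions: `toy` is a cut-down stand-in for `tib` built from values in the question, `subsumed` is an illustrative name for the categories to subtract, and only past_perfect and past_progressive are assumed to be counted inside past_simple.

```r
library(dplyr)

# Hypothetical cut-down copy of the question's data (only the columns needed here)
toy <- tibble(
  TA = c("past_perfect", "past_progressive", "past_simple",
         "past_simple", "past_perfect", "past_progressive"),
  filename = c("BIO.G0.01.1", "BIO.G0.01.1", "BIO.G0.01.1",
               "BIO.G0.02.1", "BIO.G0.03.1", "BIO.G0.03.1"),
  n = c(2L, 2L, 57L, 39L, 1L, 1L)
)

# Assumption: these are the categories that past_simple subsumes
subsumed <- c("past_perfect", "past_progressive")

res <- toy %>%
  group_by(filename) %>%
  # within each file, subtract the subsumed counts from the past_simple row only;
  # sum() over an empty selection is 0, so files lacking those categories are unchanged
  mutate(n = if_else(TA == "past_simple", n - sum(n[TA %in% subsumed]), n)) %>%
  ungroup()

res
```

Because the subtraction is computed per `filename` group, the order of the rows within a file does not matter, and a file that lacks some TA categories simply subtracts nothing for them.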

Posted by huangapple on 2023-02-10 04:21:22
When reposting, please keep the link to this article: https://go.coder-hub.com/75404030.html