How to subtract the value of some rows from other rows in R according to categorical variables
Question
I am trying to determine the number of verbs (n) in each tense and aspect (TA) for each text (filename) in my dataset. I have my data saved as a tibble (see below), but the values in the n column are not yet accurate because some categories subsume others. For instance, to get an accurate count of past_simple, I need to subtract the counts (n) for past_perfect and past_progressive in that same text. Essentially, I am looking to do something like this:
- Subtract the value (n) of past_perfect and past_progressive from the value of past_simple for filename BIO.G0.01.1
- Repeat this process for each individual file
tib <- structure(
list(
TA = c(
"past_perfect",
"past_progressive",
"past_simple",
"past_simple",
"past_simple",
"past_simple",
"past_simple",
"past_simple",
"past_perfect",
"past_progressive"
),
tense = c(
"past",
"past",
"past",
"past",
"past",
"past",
"past",
"past",
"past",
"past"
),
aspect = c(
"perfect",
"progressive",
"simple",
"simple",
"simple",
"simple",
"simple",
"simple",
"perfect",
"progressive"
),
filename = c(
"BIO.G0.01.1",
"BIO.G0.01.1",
"BIO.G0.01.1",
"BIO.G0.02.1",
"BIO.G0.02.2",
"BIO.G0.02.4",
"BIO.G0.02.5",
"BIO.G0.02.6",
"BIO.G0.03.1",
"BIO.G0.03.1"
),
discipline = c(
"BIO",
"BIO",
"BIO",
"BIO",
"BIO",
"BIO",
"BIO",
"BIO",
"BIO",
"BIO"
),
nativeness = c("NS", "NS", "NS",
"NS", "NS", "NS", "NS", "NS", "NS", "NS"),
year = c("G0",
"G0", "G0", "G0", "G0", "G0", "G0", "G0", "G0", "G0"),
gender = c("F",
"F", "F", "M", "M", "M", "M", "M", "F", "F"),
n = c(2L, 2L,
57L, 39L, 3L, 4L, 49L, 103L, 1L, 1L)
),
class = c("grouped_df",
"tbl_df", "tbl", "data.frame"),
row.names = c(NA,-10L),
groups = structure(
list(
filename = c(
"BIO.G0.01.1",
"BIO.G0.02.1",
"BIO.G0.02.2",
"BIO.G0.02.4",
"BIO.G0.02.5",
"BIO.G0.02.6",
"BIO.G0.03.1"
),
discipline = c("BIO", "BIO", "BIO", "BIO", "BIO", "BIO",
"BIO"),
nativeness = c("NS", "NS", "NS", "NS", "NS", "NS",
"NS"),
year = c("G0", "G0", "G0", "G0", "G0", "G0", "G0"),
gender = c("F", "M", "M", "M", "M", "M", "F"),
.rows = structure(
list(1:3, 4L, 5L, 6L, 7L, 8L, 9:10),
ptype = integer(0),
class = c("vctrs_list_of",
"vctrs_vctr", "list")
)
),
class = c("tbl_df", "tbl", "data.frame"),
row.names = c(NA,-7L),
.drop = TRUE
)
)
I know how to subtract rows from other rows based on their position, like this:
tib[3, 9] <- tib[3, 9] - tib[2, 9] - tib[1, 9]
But the rows do not always appear in this predictable order because not all TA options are present in each text (filename). I'm also not sure how to write the code to restart this process again each time it comes across a new filename.
I am still learning how to manipulate data in R. Any suggestions would be very much appreciated!
Answer 1
Score: 0
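Based on the logic shown, this needs a grouped (per-filename) operation. Because some 'TA' categories are missing for some files, it may be better to reshape to wide format first, filling the missing categories with 0, and then join the corrected counts back onto the original data.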
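library(dplyr)
library(tidyr)
tib %>%
  ungroup() %>%
  # drop columns that would otherwise split each file across several wide rows
  select(-tense, -aspect) %>%
  # one row per file, one column per TA; categories absent from a file become 0
  pivot_wider(names_from = TA, values_from = n, values_fill = 0) %>%
  # corrected past_simple count; .keep = 'unused' drops the three TA columns
  mutate(n1 = past_simple - past_progressive - past_perfect,
         TA = 'past_simple', .keep = 'unused') %>%
  # attach n1 to the past_simple rows of the original data (joins on all shared columns)
  left_join(ungroup(tib), .) %>%
  # use the corrected count where present, otherwise keep the original n
  mutate(n = coalesce(n1, n), .keep = 'unused')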
-output
# A tibble: 10 × 9
TA tense aspect filename discipline nativeness year gender n
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <int>
1 past_perfect past perfect BIO.G0.01.1 BIO NS G0 F 2
2 past_progressive past progressive BIO.G0.01.1 BIO NS G0 F 2
3 past_simple past simple BIO.G0.01.1 BIO NS G0 F 53
4 past_simple past simple BIO.G0.02.1 BIO NS G0 M 39
5 past_simple past simple BIO.G0.02.2 BIO NS G0 M 3
6 past_simple past simple BIO.G0.02.4 BIO NS G0 M 4
7 past_simple past simple BIO.G0.02.5 BIO NS G0 M 49
8 past_simple past simple BIO.G0.02.6 BIO NS G0 M 103
9 past_perfect past perfect BIO.G0.03.1 BIO NS G0 F 1
10 past_progressive past progressive BIO.G0.03.1 BIO NS G0 F 1
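As an alternative sketch (not part of the original answer), the same correction can probably be done without reshaping, by grouping on filename and subtracting the summed counts of the nested categories from the past_simple row. The `toy` tibble and the `nested` vector below are hypothetical stand-ins introduced for illustration; with the real data, `toy` would be replaced by `tib`.

```r
library(dplyr)

# Hypothetical stand-in for `tib`, keeping only the columns the calculation needs.
toy <- tibble(
  filename = c("A", "A", "A", "B", "C", "C"),
  TA       = c("past_perfect", "past_progressive", "past_simple",
               "past_simple", "past_perfect", "past_progressive"),
  n        = c(2L, 2L, 57L, 39L, 1L, 1L)
)

# Assumption: these are the categories whose counts are included in past_simple.
nested <- c("past_perfect", "past_progressive")

adjusted <- toy %>%
  group_by(filename) %>%
  # Within each file, subtract the nested counts from the past_simple row only.
  # If a file has no nested rows, sum() over an empty vector is 0, so n is unchanged.
  mutate(n = if_else(TA == "past_simple",
                     n - sum(n[TA %in% nested]),
                     n)) %>%
  ungroup()

adjusted
```

Because the subtraction is computed per group, row order within a file does not matter, and files that lack some TA categories need no special handling.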