R - How to change column values based on a combination of values in other columns in that data frame?
Question
I'm working with translation data and am trying to distinguish between simple typos and actual text modifications; typos are identified as those alterations that do not occur within 7 keystrokes before or after another alteration or broken word (a longer pause within a word; those I've managed to identify). Ideally, the code would also check whether any other alteration or broken word occurs in the same word, regardless of how many keystrokes away from the present one it is.
The variable Issue takes the values 'BW' for broken word, 'ALT' for alterations, and 'none' if new text is produced smoothly. Each 'Id' represents a keystroke. 'Count' keeps track of the word count, i.e. all keystrokes contributing to the first word are labeled 1, the second word 2, etc.
I'd like to divvy up the 'ALT' group into 'ALT' and 'Typo' by determining for each 'ALT' if another issue ('ALT' or 'BW') pops up within 7 keystrokes before or after it, or within the same word, regardless of how long that word is (i.e. for each 'Id' with the same 'Count'). If this is not the case, the 'ALT' should be considered a 'Typo'.
I've tried a bunch of nested for and if statements, but these tend to get problematic or claim that arguments 'have length 0' when this is not the case; I'm not a skilled enough coder to actually get them to work. In my final attempt, below, I settled for checking for any issues within 7 keystrokes before or after the 'ALT' and did not take the word itself into account, although that is not ideal.
Example dataset:
T01 <- structure(list(Id = 1:100, Count = c(1L, 1L, 1L, 1L, 1L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L,
3L, 3L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 5L, 5L, 5L, 5L,
5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L,
5L, 5L, 5L, 5L, 5L, 5L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L,
6L, 6L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L,
7L, 7L, 8L, 8L, 8L, 8L, 8L, 9L, 9L, 9L, 9L, 9L, 9L, 10L), Issue = c("none",
"none", "none", "none", "none", "none", "BW", "none", "ALT",
"ALT", "ALT", "BW", "none", "none", "none", "none", "none", "none",
"none", "none", "none", "none", "none", "none", "none", "ALT",
"none", "none", "none", "none", "none", "none", "none", "none",
"none", "none", "none", "none", "none", "BW", "ALT", "ALT", "ALT",
"ALT", "ALT", "ALT", "BW", "BW", "none", "BW", "ALT", "ALT",
"BW", "none", "none", "none", "none", "none", "none", "none",
"none", "none", "none", "none", "none", "none", "none", "none",
"none", "ALT", "none", "none", "none", "ALT", "BW", "none", "BW",
"ALT", "ALT", "ALT", "BW", "BW", "none", "none", "BW", "none",
"none", "none", "none", "none", "none", "none", "none", "none",
"none", "none", "none", "none", "none", "none")), row.names = c(NA,
100L), class = "data.frame")
Final coding attempt:
library(dplyr)
Changes <- which(T01$Issue == 'Alt') # Identify which issues are alterations, and therefore possibly typos
T01$Typo <- F # Create a column in the dataframe to store values
for (index in 1:length(T01$Id)) { # i.e. for each row
  for (x in 1:7) { # check up to 7 rows before or after this one
    if (T01$Id[index] %in% Changes) { # Only check indices where T01$Issue == 'ALT'
      if ((T01$Issue[index-x] != 'ALT') & (T01$Issue[index-x] != 'BW') &
          # no alterations or broken words up to seven keystrokes before this one
          (T01$Issue[index+x] != 'ALT') & (T01$Issue[index+x] != 'BW'))
          # no alterations or broken words up to seven keystrokes after this one
      {
        T01$Typo[index] <- T
      }
    }
  }
}
Hoping someone here can help out!
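A likely source of the 'length 0' complaint is the indexing itself rather than the nesting. The snippet below is only a sketch on a made-up four-element vector (issue and the helper neighbours() are illustrative names, not part of the original post). It shows what R returns when index - x reaches 0, goes negative, or index + x runs past the last row, and why comparing against 'Alt' leaves Changes empty.

```r
issue <- c("none", "ALT", "BW", "none")  # toy data, not the real T01

# Index 0 selects nothing, so the comparison has length zero;
# `if` on such a value raises "argument is of length zero".
before_first <- issue[1 - 1]
length(before_first != "ALT")            # 0

# An index past the end returns NA, which `if` cannot use either.
after_last <- issue[length(issue) + 1]
is.na(after_last != "ALT")               # TRUE

# A negative index drops an element instead of looking backwards.
length(issue[1 - 2])                     # 3

# String comparison is case-sensitive: 'Alt' never matches 'ALT'.
length(which(issue == "Alt"))            # 0

# Hypothetical helper: clamp the window to valid row numbers, excluding row i.
neighbours <- function(i, n, window = 7) {
  setdiff(max(1, i - window):min(n, i + window), i)
}
neighbours(1, length(issue))             # 2 3 4
```

With the window clamped this way every index is valid, so a test such as any(issue[neighbours(i, n)] %in% c('ALT', 'BW')) always yields a single TRUE or FALSE.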
Answer 1
Score: 1
We can do it with group_by(Count) and zoo::rollapply(). rollapply creates a rolling window and applies a function to it; partial = TRUE allows incomplete windows, i.e. positions with fewer than 7 observations before or after them.
library(dplyr)
T01 %>%
  arrange(Id, Count) %>%
  group_by(Count) %>%
  mutate(teste = ifelse(Issue == "ALT",
                        ifelse(
                          zoo::rollapply(Issue,
                                         width = list(c(-7:-1,1:7)),
                                         \(x) any(x == "ALT" | x == "BW"),
                                         partial = TRUE),
                          "Typo","ALT"),
                        Issue
                        )
         )
#> # A tibble: 100 × 4
#> # Groups:   Count [10]
#>       Id Count Issue teste
#>    <int> <int> <chr> <chr>
#>  1     1     1 none  none
#>  2     2     1 none  none
#>  3     3     1 none  none
#>  4     4     1 none  none
#>  5     5     1 none  none
#>  6     6     2 none  none
#>  7     7     2 BW    BW
#>  8     8     2 none  none
#>  9     9     2 ALT   Typo
#> 10    10     2 ALT   Typo
#> # … with 90 more rows
Created on 2023-02-24 with reprex v2.0.2
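As posted, the inner ifelse appears to return "Typo" when the window does contain another issue, which reads as the reverse of the rule in the question, and group_by(Count) keeps the window from reaching into neighbouring words. The following is a minimal base-R sketch that follows the question's wording instead: an 'ALT' becomes 'Typo' only if no other 'ALT' or 'BW' lies within 7 keystrokes or in the same word. The function flag_typos(), its window argument, the Label column and the toy data frame are all hypothetical names chosen for illustration, and the sketch assumes the rows are sorted by Id with one row per keystroke.

```r
# Hypothetical helper, not from the original answer.
# Assumes rows are in keystroke order (sorted by Id), one row per keystroke.
flag_typos <- function(df, window = 7) {
  issue_rows <- which(df$Issue %in% c("ALT", "BW"))
  label <- df$Issue
  for (i in which(df$Issue == "ALT")) {
    others <- setdiff(issue_rows, i)                       # every other ALT/BW row
    near <- any(abs(others - i) <= window)                 # within `window` keystrokes
    same_word <- any(df$Count[others] == df$Count[i])      # anywhere in the same word
    if (!near && !same_word) label[i] <- "Typo"
  }
  df$Label <- label
  df
}

# Toy data: four words of 10, 12, 4 and 4 keystrokes.
toy <- data.frame(
  Id = 1:30,
  Count = rep(1:4, times = c(10, 12, 4, 4)),
  Issue = "none",
  stringsAsFactors = FALSE
)
toy$Issue[c(2, 11, 22, 24)] <- c("ALT", "ALT", "BW", "ALT")

result <- flag_typos(toy)
result[result$Issue != "none", ]
# Row 2:  isolated ALT                              -> Label "Typo"
# Row 11: BW at row 22 is 11 keystrokes away but
#         in the same word                          -> Label "ALT"
# Row 22: BW is left unchanged                      -> Label "BW"
# Row 24: BW at row 22 is in another word but only
#         2 keystrokes away                         -> Label "ALT"
```

Applied to the data in the question, the call would be flag_typos(T01). The loop only visits 'ALT' rows, so it stays cheap even on long keystroke logs.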