Non-rowwise method to change multiple values in columns if present in other columns
Question
In the example df below, for each row I want to determine if any values in b1:b3 are in a1:a3. If a match is not found, then change the unmatched b1:b3 value to NA. Note that the number of a and b columns in df may vary but the naming convention remains the same:
```r
df <- structure(list(row = c(1L, 12L, 17L, 44L, 55L, 90L),
                     b1 = c(55L, 17L, 12L, 12L, 1L, 12L),
                     b2 = c(NA, 44L, 44L, 17L, NA, 17L),
                     b3 = c(NA, 90L, 90L, 90L, NA, 44L),
                     a1 = c(55L, 17L, 12L, 12L, 1L, 12L),
                     a2 = c(NA, 44L, 44L, 17L, NA, 28L),
                     a3 = c(NA, 90L, NA, 90L, NA, 44L)),
                row.names = c(NA, -6L),
                class = "data.frame")
```
Desired output:
```
  row b1 b2 b3 a1 a2 a3
1   1 55 NA NA 55 NA NA
2  12 17 44 90 17 44 90
3  17 12 44 NA 12 44 NA
4  44 12 17 90 12 17 90
5  55  1 NA NA  1 NA NA
6  90 12 NA 44 12 28 44
```
I can achieve the desired output with the following for loop:
```r
acols <- grep("^a", colnames(df))
bcols <- grep("^b", colnames(df))

for(i in 1:nrow(df)) {
  for(j in bcols) {
    if(df[i, j] %in% df[i, acols]) {
      next
    } else {
      df[i, j] <- NA
    }
  }
}
```
However, I'm looking for a more efficient method because the real data are much larger. Does the tidyverse or base R have an alternative to this? I've searched for across() approaches but couldn't translate any of the solutions to my needs.
Answer 1
Score: 3
Here's a base solution that should be quite fast:
```r
acols <- grep("^a", colnames(df))
bcols <- grep("^b", colnames(df))

df[bcols] = lapply(
  df[bcols],
  \(bb) {
    bb[rowSums(bb == df[acols], na.rm = TRUE) == 0] = NA_integer_
    bb
  })
df
#   row b1 b2 b3 a1 a2 a3
# 1   1 55 NA NA 55 NA NA
# 2  12 17 44 90 17 44 90
# 3  17 12 44 NA 12 44 NA
# 4  44 12 17 90 12 17 90
# 5  55  1 NA NA  1 NA NA
# 6  90 12 NA 44 12 28 44
```
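To see why this works, it can help to look at the intermediate values for a single b column. The sketch below uses a two-row copy of the example data (rows 3 and 6 of df); the names ex, a_idx, hits, n_match, and no_match are made up for illustration and are not part of the answer.

```r
# Two-row copy of the example data (rows 3 and 6 of df).
ex <- data.frame(
  b1 = c(12L, 12L), b2 = c(44L, 17L), b3 = c(90L, 44L),
  a1 = c(12L, 12L), a2 = c(44L, 28L), a3 = c(NA, 44L)
)
a_idx <- grep("^a", colnames(ex))

# Comparing a vector with a data frame compares the vector against each
# column in turn, giving a logical matrix with one column per a-column.
hits <- ex$b2 == ex[a_idx]

# Count the matches in each row; comparisons against NA are dropped.
n_match <- rowSums(hits, na.rm = TRUE)

# Rows with zero matches are the ones whose b2 value should become NA.
no_match <- n_match == 0
```

Here the first row keeps its b2 value (44 appears in a2), while the second row is flagged because 17 appears in none of the a columns.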
I scaled your data up to 500k rows, and on that dataset this approach runs about 50x faster using 1/50 of the memory. I suspect the differences will be bigger as the number of columns increases.
```r
single_loop = function(df){
  df[bcols] = lapply(
    df[bcols],
    \(bb) {
      bb[rowSums(bb == df[acols], na.rm = TRUE) == 0] = NA
      bb
    })
  df
}

double_loop = function(df){
  for (i in 1:nrow(df)) {
    for (j in bcols) {
      if (df[i, j] %in% df[i, acols]) {
        next
      } else {
        df[i, j] <- NA
      }
    }
  }
  df
}

# Resample rows to build a 500k-row data frame. The trailing comma matters:
# without it, `[` selects columns instead of rows.
df_big = df[sample(1:nrow(df), size = 5e5, replace = TRUE), ]

bench::mark(
  op = double_loop(df_big),
  gregor = single_loop(df_big)
)
# Output as originally posted. The `<df [6 ×…` in the result column suggests
# that run was made without the comma, so timings on 500k rows may differ.
# # A tibble: 2 × 13
#   expression      min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result
#   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list>
# 1 op          624.8ms  624.8ms      1.60     381MB     9.60     1     6      625ms <df [6 ×…
# 2 gregor       12.8ms   13.4ms     48.5     7.63MB     9.33    26     5      536ms <df [6 ×…
# # … with 3 more variables: memory <list>, time <list>, gc <list>
# Warning message:
# Some expressions had a GC in every iteration; so filtering is disabled.
```
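Because the question mentions that the number of a and b columns can vary, one option is to wrap the same idea in a function that finds the columns by prefix on each call. This is only a sketch: na_unmatched_b and its arguments are hypothetical names, and the default patterns assume that no unrelated column names start with "a" or "b".

```r
# Hypothetical helper (not from the original answer): looks up the a/b columns
# by prefix on every call, so it works for any number of them.
na_unmatched_b <- function(data, a_pattern = "^a", b_pattern = "^b") {
  a_idx <- grep(a_pattern, colnames(data))
  b_idx <- grep(b_pattern, colnames(data))
  data[b_idx] <- lapply(data[b_idx], function(bb) {
    # Keep a b value only if it equals at least one a value in the same row.
    bb[rowSums(bb == data[a_idx], na.rm = TRUE) == 0] <- NA
    bb
  })
  data
}

# Example data from the question.
df <- data.frame(
  row = c(1L, 12L, 17L, 44L, 55L, 90L),
  b1 = c(55L, 17L, 12L, 12L, 1L, 12L),
  b2 = c(NA, 44L, 44L, 17L, NA, 17L),
  b3 = c(NA, 90L, 90L, 90L, NA, 44L),
  a1 = c(55L, 17L, 12L, 12L, 1L, 12L),
  a2 = c(NA, 44L, 44L, 17L, NA, 28L),
  a3 = c(NA, 90L, NA, 90L, NA, 44L)
)

res <- na_unmatched_b(df)
```

Passing different patterns (for example "^a[0-9]+$") would narrow the match if other columns share the prefix.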
Answer 2
Score: 2
If you need to check for values in any position you can round-trip pivot the data:
```r
library(tidyr)
library(dplyr)

df %>%
  pivot_longer(-row) %>%
  mutate(grp = substr(name, 1L, 1L),
         value = replace(value, !(value[grp == "b"] %in% value[grp == "a"])[1:n()], NA),
         grp = NULL,
         .by = row) %>%
  pivot_wider()

# # A tibble: 6 × 7
#     row    b1    b2    b3    a1    a2    a3
#   <int> <int> <int> <int> <int> <int> <int>
# 1     1    55    NA    NA    55    NA    NA
# 2    12    17    44    90    17    44    90
# 3    17    12    44    NA    12    44    NA
# 4    44    12    17    90    12    17    90
# 5    55     1    NA    NA     1    NA    NA
# 6    90    12    NA    44    12    28    44
```
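The replace() call above lines the logical index up with the start of each group, so it relies on the b columns coming before the a columns. A possible variant that does not depend on column order is sketched below; it is not from the original answer, the helper column is_b is a made-up name, and it assumes dplyr 1.1.0 or later (for .by) and that the values in row are unique.

```r
library(dplyr)
library(tidyr)

# Example data from the question.
df <- data.frame(
  row = c(1L, 12L, 17L, 44L, 55L, 90L),
  b1 = c(55L, 17L, 12L, 12L, 1L, 12L),
  b2 = c(NA, 44L, 44L, 17L, NA, 17L),
  b3 = c(NA, 90L, 90L, 90L, NA, 44L),
  a1 = c(55L, 17L, 12L, 12L, 1L, 12L),
  a2 = c(NA, 44L, 44L, 17L, NA, 28L),
  a3 = c(NA, 90L, NA, 90L, NA, 44L)
)

res <- df %>%
  pivot_longer(-row) %>%
  mutate(
    # Flag the values that came from a b column.
    is_b = startsWith(name, "b"),
    # Blank out b values that do not appear among the a values of the same row.
    value = if_else(is_b & !(value %in% value[!is_b]), NA_integer_, value),
    is_b = NULL,
    .by = row
  ) %>%
  pivot_wider()
```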
If it's ok to check by column pairs, you could do:
```r
df %>%
  mutate(replace(pick(b1:b3), pick(a1:a3) != pick(b1:b3) | is.na(pick(a1:a3)), NA))

#   row b1 b2 b3 a1 a2 a3
# 1   1 55 NA NA 55 NA NA
# 2  12 17 44 90 17 44 90
# 3  17 12 44 NA 12 44 NA
# 4  44 12 17 90 12 17 90
# 5  55  1 NA NA  1 NA NA
# 6  90 12 NA 44 12 28 44
```