How can I use a value extracted from a dataframe to specify columns to subset in R?
Question
I have a dataframe that I want to subset inside a function so that only rows where both columns are either 1 or NA remain. For df:
df <- data.frame(a = c(1,1,0,NA,0,1),
b = c(0,1,0,1,0, NA),
c = c(0,0,0,0,0,0))
I want:
a b c
2 1 1 0
4 NA 1 0
6 1 NA 0
The problem I'm having is I have many columns with names that change. So this works well:
subset(df, (is.na(a) | a == 1) & (is.na(b) | b == 1))
But it breaks when the column names 'a' and 'b' become 'd' and 'f' during the operation of the function. Specifying the columns by index is more robust:
subset(df, (is.na(df[,1]) | df[,1] == 1) & (is.na(df[,2]) | df[,2] == 1))
But that is cumbersome, and if a previous processing step goes wrong and column 'c' ends up before 'a' or 'b', I end up subsetting by the wrong columns.
I also have another dataframe that specifies what the column names to subset by will be:
cro_df <- data.frame(pop = c('c1', 'c2'),
p1 = c('a', 'd'),
p2 = c('b', 'f'))
pop p1 p2
1 c1 a b
2 c2 d f
I would like to be able to extract the column names from that dataframe to use in my subset function, e.g.:
col1 <- cro_df[cro_df[,'pop']=='c1', 'p1']
subset(df, is.na(col1) | col1 == 1)
This returns an empty dataframe. I have tried turning col1 into a symbol and a factor with no success:
subset(df, as.symbol(col1) == 1)
subset(df, sym(col1) == 1)
subset(df, as.factor(col1) == 1)
And they all return:
[1] a b c
<0 rows> (or 0-length row.names)
Is there a way I can specify my columns to subset using the second dataframe cro_df?
Answer 1
Score: 1
You can use filter and if_all from the dplyr package.
Select the names of the columns you want to filter on in whatever way best suits your case. Here I just created a variable cols that contains 'a' and 'b'.
Then all_of(cols) selects those columns, and filter keeps the rows for which the if_all condition is TRUE in every selected column:
library(dplyr) # packageVersion("dplyr") >= 1.1.0
cols <- c('a', 'b')
filter(df, if_all(all_of(cols), \(x) is.na(x) | x == 1))
#> a b c
#> 1 1 1 0
#> 2 NA 1 0
#> 3 1 NA 0
If you assign different column names to cols, you can reuse the same code.
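As a minimal sketch of how cols could be taken from cro_df instead of being typed by hand: this assumes the lookup key is pop == "c1" and that p1 and p2 hold the column names, as in the question. It also shows why the original attempt returned nothing: col1 holds the string "a", so col1 == 1 compares that string with 1 rather than looking up column a.

```r
library(dplyr)

df <- data.frame(a = c(1, 1, 0, NA, 0, 1),
                 b = c(0, 1, 0, 1, 0, NA),
                 c = c(0, 0, 0, 0, 0, 0))
cro_df <- data.frame(pop = c("c1", "c2"),
                     p1 = c("a", "d"),
                     p2 = c("b", "f"))

# Pull the two column names for population "c1" as a plain character vector
cols <- as.character(unlist(cro_df[cro_df$pop == "c1", c("p1", "p2")]))

# all_of() treats the strings as column names, not as values to compare
res <- filter(df, if_all(all_of(cols), \(x) is.na(x) | x == 1))
res
```

Changing "c1" to another value of pop picks up a different pair of names without touching the filter call.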
Answer 2
Score: 1
library(dplyr)
library(purrr)
df <- data.frame(a = c(1,1,0,NA,0,1),
b = c(0,1,0,1,0, NA),
c = c(0,0,0,0,0,0))
Let’s add a second data frame with different column names.
df2 <- data.frame(d = c(1,1,0,NA,0,1),
f = c(0,1,0,1,0, NA),
c = c(0,0,0,0,0,0))
We can use dplyr::if_all() in dplyr::filter() to apply the filter.
df |>
filter(if_all(c(a, b), \(x) is.na(x) | x == 1))
#> a b c
#> 1 1 1 0
#> 2 NA 1 0
#> 3 1 NA 0
Using that idea, we now write a custom function to accommodate changing column names.
custom_filter <-
function(data, v1, v2) {
filter(data,
if_all(c({{v1}}, {{v2}}), \(x) is.na(x) | x == 1))
}
Here is how that can work.
custom_filter(df, a, b)
#> a b c
#> 1 1 1 0
#> 2 NA 1 0
#> 3 1 NA 0
custom_filter(df2, d, f)
#> d f c
#> 1 1 1 0
#> 2 NA 1 0
#> 3 1 NA 0
Using your cro_df dataframe and placing all dataframes in a list(), we can now programmatically (with purrr::map2()) go through all of the dataframes and apply the filter.
cro_df <- data.frame(pop = c('c1', 'c2'),
p1 = c('a', 'd'),
p2 = c('b', 'f'))
cro_l <-
cro_df |>
split(1:nrow(cro_df))
data_l <- list(df, df2)
map2(data_l,
cro_l,
\(x, y) custom_filter(
x, y$p1, y$p2
))
#> [[1]]
#> a b c
#> 1 1 1 0
#> 2 NA 1 0
#> 3 1 NA 0
#>
#> [[2]]
#> d f c
#> 1 1 1 0
#> 2 NA 1 0
#> 3 1 NA 0
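The map2() call above pairs data frames and cro_df rows by position, so it depends on both being in the same order. A possible variant, sketched here under the assumption that each data frame can be named after the pop value it belongs to (the names in data_l and the helper filter_by_pop are hypothetical), looks the columns up by name instead:

```r
library(dplyr)

df <- data.frame(a = c(1, 1, 0, NA, 0, 1),
                 b = c(0, 1, 0, 1, 0, NA),
                 c = c(0, 0, 0, 0, 0, 0))
df2 <- data.frame(d = c(1, 1, 0, NA, 0, 1),
                  f = c(0, 1, 0, 1, 0, NA),
                  c = c(0, 0, 0, 0, 0, 0))
cro_df <- data.frame(pop = c("c1", "c2"),
                     p1 = c("a", "d"),
                     p2 = c("b", "f"))

# Assumption: each data frame is stored under the pop value it corresponds to
data_l <- list(c1 = df, c2 = df2)

# Hypothetical helper: look up the column names for one pop, then filter
filter_by_pop <- function(pop) {
  cols <- as.character(unlist(cro_df[cro_df$pop == pop, c("p1", "p2")]))
  filter(data_l[[pop]], if_all(all_of(cols), \(x) is.na(x) | x == 1))
}

pops <- as.character(cro_df$pop)
res <- lapply(setNames(pops, pops), filter_by_pop)
res
```

Because the strings go through all_of(), a misspelled or missing column name raises an error instead of silently filtering on the wrong column.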
Answer 3
Score: 0
Perhaps this is a good start?
with(cro_df[cro_df$pop == "c1",],
df[ (is.na(df[[p1]]) | df[[p1]] == 1) & (is.na(df[[p2]]) | df[[p2]] == 1), ]
)
# a b c
# 2 1 1 0
# 4 NA 1 0
# 6 1 NA 0
FYI, subset is intended for interactive use; its help page says:
Warning:
This is a convenience function intended for use interactively.
For programming it is better to use the standard subsetting
functions like [, and in particular the non-standard evaluation
of argument ‘subset’ can have unanticipated consequences.
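Following that advice, the same logic can be wrapped in a small base R function that uses only [ and [[. This is a rough sketch rather than a definitive implementation: the function name filter_one_or_na is made up for illustration, and it assumes the column names arrive as a character vector.

```r
df <- data.frame(a = c(1, 1, 0, NA, 0, 1),
                 b = c(0, 1, 0, 1, 0, NA),
                 c = c(0, 0, 0, 0, 0, 0))
cro_df <- data.frame(pop = c("c1", "c2"),
                     p1 = c("a", "d"),
                     p2 = c("b", "f"))

# Hypothetical helper: keep rows where every column in `cols` is 1 or NA
filter_one_or_na <- function(data, cols) {
  stopifnot(length(cols) > 0, all(cols %in% names(data)))
  # One logical vector per column; is.na(x) | x == 1 is never NA
  ok <- lapply(data[cols], function(x) is.na(x) | x == 1)
  # Combine the per-column checks with a row-wise AND
  keep <- Reduce(`&`, ok)
  data[keep, , drop = FALSE]
}

cols <- as.character(unlist(cro_df[cro_df$pop == "c1", c("p1", "p2")]))
res <- filter_one_or_na(df, cols)
res
```

The stopifnot() check makes the function fail loudly if an earlier processing step renamed or dropped one of the expected columns, and the original row names (2, 4, 6) are kept.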