How can I use a value extracted from a dataframe to specify columns to subset in R?

Question

I have a dataframe that I want to subset inside a function so that only rows where both columns are either 1 or NA remain. For df:

df <- data.frame(a = c(1,1,0,NA,0,1), 
                 b = c(0,1,0,1,0, NA),
                 c = c(0,0,0,0,0,0))

I want:

   a  b  c
2  1  1  0
4 NA  1  0
6  1 NA  0

The problem is that I have many columns whose names change. So this works well:

subset(df, (is.na(a) | a == 1) & (is.na(b) | b == 1))

but it breaks when the column names 'a' and 'b' become 'd' and 'f' during the operation of the function. Specifying the columns by index is more robust:

subset(df, (is.na(df[,1]) | df[,1] == 1) & (is.na(df[,2]) | df[,2] == 1))

But it is cumbersome, and if a previous processing step goes wrong and column 'c' ends up before 'a' or 'b', I end up subsetting by the wrong columns.

I also have another dataframe that specifies what the column names to subset by will be:

cro_df <- data.frame(pop = c('c1', 'c2'),
                     p1 = c('a', 'd'),
                     p2 = c('b', 'f'))

  pop p1 p2
1  c1  a  b
2  c2  d  f

I would like to be able to extract the column names from that dataframe to use in my subset function, e.g.:

col1 <- cro_df[cro_df[, 'pop'] == 'c1', 'p1']
subset(df, is.na(col1) | col1 == 1)

This returns an empty dataframe. I have tried turning col1 into a symbol and a factor with no success:

subset(df, as.symbol(col1) == 1)
subset(df, sym(col1) == 1)
subset(df, as.factor(col1) == 1)

And they all return:

[1] a b c
<0 rows> (or 0-length row.names)

Is there a way I can specify my columns to subset using the second dataframe cro_df?

Answer 1

Score: 1

You can use filter and if_all from the dplyr package.

Select the names of the columns you want to filter on in whatever way suits your case best. Here I just created a variable cols that contains 'a' and 'b'.

Then I select all_of() the column names in cols and keep the rows where if_all() of those columns satisfy the condition:

library(dplyr) # packageVersion("dplyr") >= 1.1.0

cols <- c('a', 'b')
filter(df, if_all(all_of(cols), \(x) is.na(x) | x == 1))
#>    a  b c
#> 1  1  1 0
#> 2 NA  1 0
#> 3  1 NA 0

If you assign different column names to cols you can reuse the same code.
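To tie this back to the question, the column names can be pulled out of cro_df instead of being typed by hand. The following is a minimal sketch, assuming R >= 4.1 (for the \(x) lambda) and that the p1 and p2 columns of cro_df are character vectors; the variable name res is only there for illustration.

```r
library(dplyr)

df <- data.frame(a = c(1, 1, 0, NA, 0, 1),
                 b = c(0, 1, 0, 1, 0, NA),
                 c = c(0, 0, 0, 0, 0, 0))

cro_df <- data.frame(pop = c("c1", "c2"),
                     p1 = c("a", "d"),
                     p2 = c("b", "f"))

# Look up the row for population "c1" and flatten its p1/p2 entries
# into a plain character vector of column names.
cols <- unlist(cro_df[cro_df$pop == "c1", c("p1", "p2")], use.names = FALSE)

# all_of() tells tidyselect that `cols` holds column names as strings.
res <- filter(df, if_all(all_of(cols), \(x) is.na(x) | x == 1))
res
```

Because cols is an ordinary character vector, switching to another population only means changing the "c1" lookup key.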

Answer 2

Score: 1

library(dplyr)
library(purrr)
  
df <- data.frame(a = c(1,1,0,NA,0,1), 
                 b = c(0,1,0,1,0, NA),
                 c = c(0,0,0,0,0,0))

Let’s add a second data frame with different column names.

df2 <- data.frame(d = c(1,1,0,NA,0,1), 
                  f = c(0,1,0,1,0, NA),
                  c = c(0,0,0,0,0,0))

We can use dplyr::if_all() in dplyr::filter() to apply the filter.

df |>
  filter(if_all(c(a, b), \(x) is.na(x) | x == 1))
#>    a  b c
#> 1  1  1 0
#> 2 NA  1 0
#> 3  1 NA 0

Using that idea, we now write a custom function to accommodate changing
column names.

custom_filter <- 
  function(data, v1, v2) {
    filter(data,
           if_all(c({{v1}}, {{v2}}), \(x) is.na(x) | x == 1))
  }

Here is how that can work.

custom_filter(df, a, b)
#>    a  b c
#> 1  1  1 0
#> 2 NA  1 0
#> 3  1 NA 0
custom_filter(df2, d, f)
#>    d  f c
#> 1  1  1 0
#> 2 NA  1 0
#> 3  1 NA 0

Using your cro_df dataframe and by placing all dataframes in a list()
we can now programmatically (purrr::map2()) go through all of the
dataframes and apply the filter.

cro_df <- data.frame(pop = c('c1', 'c2'),
                     p1 = c('a', 'd'),
                     p2 = c('b', 'f'))

cro_l <- 
  cro_df |>
  split(1:nrow(cro_df))

data_l <- list(df, df2)

map2(data_l,
     cro_l,
     \(x, y) custom_filter(
       x, y$p1, y$p2
     ))
#> [[1]]
#>    a  b c
#> 1  1  1 0
#> 2 NA  1 0
#> 3  1 NA 0
#> 
#> [[2]]
#>    d  f c
#> 1  1  1 0
#> 2 NA  1 0
#> 3  1 NA 0
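The map2() call above pairs data frames and cro_df rows by position, so it depends on both being in the same order. One possible variation, sketched here under the assumption that each data frame can be stored under its pop key, looks the columns up by name instead. The helper filter_by_pop and the named list are hypothetical and not part of the original answer.

```r
library(dplyr)

df  <- data.frame(a = c(1, 1, 0, NA, 0, 1), b = c(0, 1, 0, 1, 0, NA), c = 0)
df2 <- data.frame(d = c(1, 1, 0, NA, 0, 1), f = c(0, 1, 0, 1, 0, NA), c = 0)

cro_df <- data.frame(pop = c("c1", "c2"),
                     p1 = c("a", "d"),
                     p2 = c("b", "f"))

# Assumption: every data frame is stored under the `pop` value it belongs to.
data_l <- list(c1 = df, c2 = df2)

# Hypothetical helper: filter the data frame for one population using the
# column names recorded in the matching row of cro_df.
filter_by_pop <- function(pop, data_l, cro_df) {
  row <- cro_df[cro_df$pop == pop, ]
  stopifnot(nrow(row) == 1)          # exactly one specification per population
  cols <- c(row$p1, row$p2)
  filter(data_l[[pop]], if_all(all_of(cols), \(x) is.na(x) | x == 1))
}

# Iterate over the population keys; the result is a list named by `pop`.
res <- lapply(setNames(nm = names(data_l)), filter_by_pop,
              data_l = data_l, cro_df = cro_df)
res
```

Matching by key rather than by position means a reordered list or a reordered cro_df still picks the intended columns.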

Answer 3

Score: 0

Perhaps this is a good start?

with(cro_df[cro_df$pop == "c1",],
  df[ (is.na(df[[p1]]) | df[[p1]] == 1) & (is.na(df[[p2]]) | df[[p2]] == 1), ]
)
#    a  b c
# 2  1  1 0
# 4 NA  1 0
# 6  1 NA 0

FYI, subset is intended for interactive use; its help page says:

Warning:
This is a convenience function intended for use interactively.
For programming it is better to use the standard subsetting
functions like [, and in particular the non-standard evaluation
of argument ‘subset’ can have unanticipated consequences.
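Following that advice, the same `[`-based filter can be wrapped in a small function that takes any number of column names as strings. This is only a sketch: the helper name keep_one_or_na is made up for illustration, and it assumes the listed columns exist in the data frame.

```r
df <- data.frame(a = c(1, 1, 0, NA, 0, 1),
                 b = c(0, 1, 0, 1, 0, NA),
                 c = c(0, 0, 0, 0, 0, 0))

cro_df <- data.frame(pop = c("c1", "c2"),
                     p1 = c("a", "d"),
                     p2 = c("b", "f"))

# Hypothetical helper: keep rows where every column named in `cols`
# is either 1 or NA, using only standard subsetting.
keep_one_or_na <- function(data, cols) {
  stopifnot(length(cols) > 0, all(cols %in% names(data)))
  # One logical vector per column, combined with a row-wise AND.
  ok <- Reduce(`&`, lapply(data[cols], function(x) is.na(x) | x == 1))
  data[ok, , drop = FALSE]
}

# Column names for population "c1", taken from cro_df.
cols <- unlist(cro_df[cro_df$pop == "c1", c("p1", "p2")], use.names = FALSE)

res <- keep_one_or_na(df, cols)
res
```

Unlike the dplyr versions, base subsetting keeps the original row names (2, 4 and 6 here), which matches the output the question asks for.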

huangapple
  • Posted on May 17, 2023 at 21:36:46
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76272728.html