如何检查大型数据框中一组列的存在?

huangapple go评论66阅读模式
英文:

How to check for existence of a set of columns in large dataframes?

问题

我正在从多年的调查数据中构建重复的横截面数据(面板数据)。每年的调查数据都以不一致的变量名称(列名)发布为单独的数据框架,有时甚至完全缺少变量。有450列,因此无法手动预览。因此,我希望搜索每个数据框架,以检查感兴趣的变量列表是否存在。

我首先尝试了dplyr::select(),但这不够,因为它按顺序遍历列名,并且仅在第一个无法在数据框架中找到的列名时产生警告,而实际上可能有多个:

df <- data.frame(c1 = 1:5,
                  c2 = letters[1:5],
                  c3 = letters[1:5],
                  c4 = letters[1:5],
                  c5 = letters[1:5],
                  c6 = letters[1:5])
> 
> select(df, c(c1, c2, c7, c10, c11))
Error in `select()`:
! Can't subset columns that don't exist.
✖ Column `c7` doesn't exist.
Run `rlang::last_trace()` to see where the error occurred.

对于这个类似的问题,我也发现了类似的答案(https://stackoverflow.com/questions/50647011/grep-with-multiple-column-names-in-data-frame?newreg=63d1d79b4c504a65976efcec952f1005),但我觉得它不够令人满意。

which(!is.na(match(colnames(df), c("c1", "c2", "c7", "c10", "c11"))))
[1] 1 2

匹配列名的位置被返回(而不是名称),但对于我们来说,最重要的是非匹配的列名,真正的问题有10个或更多感兴趣的变量,因此友好的输出对我们很重要。

有人知道更好的方法吗?也许有一个可以为每个提供的列名生成匹配(TRUE/FALSE)表格的函数吗?

英文:

I am constructing repeated cross-sectional (a panel) from multiple years of survey data. The survey data for each year is published as a separate dataframe with inconsistencies in variable names (column names) and sometimes variables are missing entirely. There are 450 columns so manual previewing is out of the question. As such, I want to search each dataframe to check for the existence of a list of variables of interest.

I first tried dplyr::select() but this is insufficient because it walks through the column names in order and only produces a warning for the first column name that can't be found in the dataframe when there may be in fact mutliple:

df &lt;- data.frame(c1 = 1:5,
                  c2 = letters[1:5],
                  c3 = letters[1:5],
                  c4 = letters[1:5],
                  c5 = letters[1:5],
                  c6 = letters[1:5])
&gt; 
&gt; select(df, c(c1, c2, c7, c10, c11))
Error in `select()`:
! Can&#39;t subset columns that don&#39;t exist.
✖ Column `c7` doesn&#39;t exist.
Run `rlang::last_trace()` to see where the error occurred.

The answer to this similar question (https://stackoverflow.com/questions/50647011/grep-with-multiple-column-names-in-data-frame?newreg=63d1d79b4c504a65976efcec952f1005) I also find unsatisfactory.

which(!is.na(match(colnames(df), c(&quot;c1&quot;, &quot;c2&quot;, &quot;c7&quot;, &quot;c10&quot;, &quot;c11&quot;))))
[1] 1 2

The position of the matching column names is returned (not the names) but it is the non-matching column names which are the most important for us to know, the real problem has 10 or more variables of interest so a human friendly output is important.

Does anyone know of a better way? Perhaps a function that would produce a table of matches (TRUE/FALSE) for each supplied column name?

答案1

得分: 3

你可以使用%in%!来获得一个逻辑向量,该向量表示数据集中不存在的列:

cols <- c("c1", "c2", "c7", "c10", "c11")
!cols %in% colnames(df)
#[1] FALSE FALSE  TRUE  TRUE  TRUE

或者使用setdiff来获取列名:

setdiff(cols, colnames(df))
#[1] "c7"  "c10" "c11"

并且要直接选择数据框中的列而不出现错误消息,可以使用any_of

library(dplyr)
select(df, any_of(cols))
#   c1 c2
# 1  1  a
# 2  2  b
# 3  3  c
# 4  4  d
# 5  5  e
英文:

You can use %in% with ! to get a logical vector of columns that are not present in the data set:

cols &lt;- c(&quot;c1&quot;, &quot;c2&quot;, &quot;c7&quot;, &quot;c10&quot;, &quot;c11&quot;)
!cols %in% colnames(df)
#[1] FALSE FALSE  TRUE  TRUE  TRUE

Or with setdiff to get the names:

setdiff(cols, colnames(df))
#[1] &quot;c7&quot;  &quot;c10&quot; &quot;c11&quot;

And to select directly the columns in the data.frame without error messages, use any_of:

library(dplyr)
select(df, any_of(cols))
#   c1 c2
# 1  1  a
# 2  2  b
# 3  3  c
# 4  4  d
# 5  5  e

答案2

得分: 1

你可以使用一个“外部”方法。同时使用一个感兴趣的变量名称向量以及一个带有命名数据框的命名列表。

v <- c('X1', 'X3', 'X7', 'X10', 'X11') |> setNames(nm=_)

f <- \(x, y) x %in% names(y)
t(outer(v, mget(ls(pattern='^df\\d$')), Vectorize(f)))
#       X1    X3    X7  X10   X11
# df1 TRUE  TRUE  TRUE TRUE FALSE
# df2 TRUE FALSE  TRUE TRUE FALSE
# df3 TRUE  TRUE FALSE TRUE FALSE

而不是使用 mget(ls(.)),你也可以硬编码列表,如下所示:

list(df1=df1, df2=df2, df3=df3)

说明: 我们想要知道语句 v %in% names(df) 是否为真,对于多个数据框的每一个都要这样做。需要的是一种双重循环方法,外部循环遍历数据框,内部循环遍历 v。但是我们可以使用 outer,它可以用于这类问题,对于两个向量 X 和 Y,它迭代 Y 的每个元素与 X 的每个元素并生成一个矩阵。我们只需对语句使用 Vectorize

编辑

你可能希望对数据框集合中的变量命名应用一致的逻辑。例如,使用三个字符来缩写含义(abbreviate('ethnicity', 3) 将有帮助),但更重要的是后面跟着两位数字(或者任何数字,只要全局一致),这些数字带有一致的含义。这将允许你使用 endsWith

ptt <- sprintf('%02d', c(1, 3, 7, 10, 11)) |> setNames(nm=_)

注意: sprintf('%02d', .) 方便地填充为两位数带有前导零,%02d,而 _ 是基本占位符,用于将其传递给 setNames,即输出名称与每个元素的内容相同(我们希望它对于矩阵显示良好):

ptt
#   01   03   07   10   11 
# "01" "03" "07" "10" "11" 
f <- \(x, y) any(endsWith(names(x), y))

适应了 outer 解决方案:

outer(mget(ls(pattern='^df[4-6]$')), ptt, Vectorize(f))
#       01    03    07   10    11
# df4 TRUE  TRUE  TRUE TRUE FALSE
# df5 TRUE FALSE  TRUE TRUE FALSE
# df6 TRUE  TRUE FALSE TRUE FALSE

再次强调,搜索模式 ^df[4-6]$ 寻找 df4、df5 和 df6;但是你可以使用一个命名列表 list(df4=df4, df5=df5, df6=df6) 来代替 mget(.),但 mget 更方便。


数据:

df1 <- df2 <- df3 <- data.frame(matrix(,2, 10))

df2$X3  <- NULL
df3$X2  <- df3$X7  <- NULL

df4 <- df5 <- df6 <- data.frame(matrix(,2, 10))

names(df4) <- paste0('eth', sprintf('%02d', seq_along(df4)))
names(df5) <- paste0('rgn', sprintf('%02d', seq_along(df5)))
names(df6) <- paste0('inc', sprintf('%02d', seq_along(df6)))

df5[3] <- NULL
df6[2] <- df6[7]  <- NULL
英文:

You could use an outer approach. Use both a names vector of variables of interest as well as a named list of your data frames.

v &lt;- c(&#39;X1&#39;, &#39;X3&#39;, &#39;X7&#39;, &#39;X10&#39;, &#39;X11&#39;) |&gt; setNames(nm=_)

f &lt;- \(x, y) x %in% names(y)
t(outer(v, mget(ls(pattern=&#39;^df\\d$&#39;)), Vectorize(f)))
#       X1    X3    X7  X10   X11
# df1 TRUE  TRUE  TRUE TRUE FALSE
# df2 TRUE FALSE  TRUE TRUE FALSE
# df3 TRUE  TRUE FALSE TRUE FALSE

Instead of mget(ls(.)) you can also hard-code the list, using

list(df1=df1, df2=df2, df3=df3)

Explanation: We want to know if the statement v %in% names(df) is true, and that for each of multiple data frames. What's needed is kind of a double for-loop approach, with the outer looping over the data frames and the inner over v. However we can use outer, which can be used for such problems, of two vectors X and Y it iterates each element of Y over each element of X and throws a matrix. We just have to Vectorize the statement.

Edit

You might want to apply a consistent logic to variable naming in your set of data frames. For instance, three characters to abbreviate the meaning (abbreviate(&#39;ethnicity&#39;, 3) will help), but more importantly followed by two digits (or any number, as long as globally consistent) that carry congruent meaning. This would allow you to use endsWith.

ptt &lt;- sprintf(&#39;%02d&#39;, c(1, 3, 7, 10, 11)) |&gt; setNames(nm=_)

Note: The sprintf(&#39;%02d&#39;, .) conveniently fills up to two digits with a leading zero, %02d, and the _ is the base placeholder what was piped into setNames, i.e. output names are identical to content of each element (we want it for the matrix to be well displayed):

ptt
#   01   03   07   10   11 
# &quot;01&quot; &quot;03&quot; &quot;07&quot; &quot;10&quot; &quot;11&quot; 

f &lt;- \(x, y) any(endsWith(names(x), y))

Adapted outer solution:

outer(mget(ls(pattern=&#39;^df[4-6]$&#39;)), ptt, Vectorize(f))
#       01    03    07   10    11
# df4 TRUE  TRUE  TRUE TRUE FALSE
# df5 TRUE FALSE  TRUE TRUE FALSE
# df6 TRUE  TRUE FALSE TRUE FALSE

Again, the search pattern &#39;^df[4-6]$&#39; looks for df4, df5, and df6; but instead of the mget(.) you can use a named list list(df4=df4, df5=df5, df6=df6), but mget is neat.


Data:

df1 &lt;- df2 &lt;- df3 &lt;- data.frame(matrix(,2, 10))

df2$X3  &lt;- NULL
df3$X2  &lt;- df3$X7  &lt;- NULL


df4 &lt;- df5 &lt;- df6 &lt;- data.frame(matrix(,2, 10))

names(df4) &lt;- paste0(&#39;eth&#39;, sprintf(&#39;%02d&#39;, seq_along(df4)))
names(df5) &lt;- paste0(&#39;rgn&#39;, sprintf(&#39;%02d&#39;, seq_along(df5)))
names(df6) &lt;- paste0(&#39;inc&#39;, sprintf(&#39;%02d&#39;, seq_along(df6)))


df5[3] &lt;- NULL
df6[2] &lt;- df6[7]  &lt;- NULL

答案3

得分: 0

以下是翻译好的部分:

"Also we can use the below code to create the vectors vars_in and vars_notin
Once we know the variables that exists then we can use that vector vars_in with all_of() in select()

cols &lt;- c(&quot;c10&quot;, &quot;c2&quot;, &quot;c7&quot;, &quot;c1&quot;, &quot;c11&quot;)

vars_in &lt;- cols[which(!is.na(match(cols,colnames(df))))]
[1] &quot;c2&quot; &quot;c1&quot;

vars_notin &lt;- cols[which(is.na(match(cols,colnames(df))))]
[1] &quot;c10&quot; &quot;c7&quot;  &quot;c11&quot;

select(df, all_of(vars_in))

# output

  c2 c1
1  a  1
2  b  2
3  c  3
4  d  4
5  e  5
```"

<details>
<summary>英文:</summary>

Also we can use the below code to create the vectors `vars_in` and `vars_notin`
Once we know the variables that exists then we can use that vector `vars_in` with all_of() in select()

```r
cols &lt;- c(&quot;c10&quot;, &quot;c2&quot;, &quot;c7&quot;, &quot;c1&quot;, &quot;c11&quot;)

vars_in &lt;- cols[which(!is.na(match(cols,colnames(df))))]
[1] &quot;c2&quot; &quot;c1&quot;

vars_notin &lt;- cols[which(is.na(match(cols,colnames(df))))]
[1] &quot;c10&quot; &quot;c7&quot;  &quot;c11&quot;

select(df, all_of(vars_in))

# output

  c2 c1
1  a  1
2  b  2
3  c  3
4  d  4
5  e  5


</details>



huangapple
  • 本文由 发表于 2023年6月27日 19:43:34
  • 转载请务必保留本文链接:https://go.coder-hub.com/76564508.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定