2023年6月27日 19:43:34go评论95阅读模式

英文:

How to check for existence of a set of columns in large dataframes?

问题

我正在从多年的调查数据中构建重复的横截面数据（面板数据）。每年的调查数据都以不一致的变量名称（列名）发布为单独的数据框架，有时甚至完全缺少变量。有450列，因此无法手动预览。因此，我希望搜索每个数据框架，以检查感兴趣的变量列表是否存在。

我首先尝试了dplyr::select()，但这不够，因为它按顺序遍历列名，并且仅在第一个无法在数据框架中找到的列名时产生警告，而实际上可能有多个：

df <- data.frame(c1 = 1:5,
                  c2 = letters[1:5],
                  c3 = letters[1:5],
                  c4 = letters[1:5],
                  c5 = letters[1:5],
                  c6 = letters[1:5])
> 
> select(df, c(c1, c2, c7, c10, c11))
Error in `select()`:
! Can't subset columns that don't exist.
✖ Column `c7` doesn't exist.
Run `rlang::last_trace()` to see where the error occurred.

对于这个类似的问题，我也发现了类似的答案（https://stackoverflow.com/questions/50647011/grep-with-multiple-column-names-in-data-frame?newreg=63d1d79b4c504a65976efcec952f1005），但我觉得它不够令人满意。

which(!is.na(match(colnames(df), c("c1", "c2", "c7", "c10", "c11"))))
[1] 1 2

匹配列名的位置被返回（而不是名称），但对于我们来说，最重要的是非匹配的列名，真正的问题有10个或更多感兴趣的变量，因此友好的输出对我们很重要。

有人知道更好的方法吗？也许有一个可以为每个提供的列名生成匹配（TRUE/FALSE）表格的函数吗？

英文:

I am constructing repeated cross-sectional (a panel) from multiple years of survey data. The survey data for each year is published as a separate dataframe with inconsistencies in variable names (column names) and sometimes variables are missing entirely. There are 450 columns so manual previewing is out of the question. As such, I want to search each dataframe to check for the existence of a list of variables of interest.

I first tried dplyr::select() but this is insufficient because it walks through the column names in order and only produces a warning for the first column name that can't be found in the dataframe when there may be in fact mutliple:

df &lt;- data.frame(c1 = 1:5,
                  c2 = letters[1:5],
                  c3 = letters[1:5],
                  c4 = letters[1:5],
                  c5 = letters[1:5],
                  c6 = letters[1:5])
&gt; 
&gt; select(df, c(c1, c2, c7, c10, c11))
Error in `select()`:
! Can&#39;t subset columns that don&#39;t exist.
✖ Column `c7` doesn&#39;t exist.
Run `rlang::last_trace()` to see where the error occurred.

The answer to this similar question (https://stackoverflow.com/questions/50647011/grep-with-multiple-column-names-in-data-frame?newreg=63d1d79b4c504a65976efcec952f1005) I also find unsatisfactory.

which(!is.na(match(colnames(df), c(&quot;c1&quot;, &quot;c2&quot;, &quot;c7&quot;, &quot;c10&quot;, &quot;c11&quot;))))
[1] 1 2

The position of the matching column names is returned (not the names) but it is the non-matching column names which are the most important for us to know, the real problem has 10 or more variables of interest so a human friendly output is important.

Does anyone know of a better way? Perhaps a function that would produce a table of matches (TRUE/FALSE) for each supplied column name?

答案1

得分: 3

你可以使用%in%与!来获得一个逻辑向量，该向量表示数据集中不存在的列：

cols <- c("c1", "c2", "c7", "c10", "c11")
!cols %in% colnames(df)
#[1] FALSE FALSE  TRUE  TRUE  TRUE

或者使用setdiff来获取列名：

setdiff(cols, colnames(df))
#[1] "c7"  "c10" "c11"

并且要直接选择数据框中的列而不出现错误消息，可以使用any_of：

library(dplyr)
select(df, any_of(cols))
#   c1 c2
# 1  1  a
# 2  2  b
# 3  3  c
# 4  4  d
# 5  5  e

英文:

You can use %in% with ! to get a logical vector of columns that are not present in the data set:

cols &lt;- c(&quot;c1&quot;, &quot;c2&quot;, &quot;c7&quot;, &quot;c10&quot;, &quot;c11&quot;)
!cols %in% colnames(df)
#[1] FALSE FALSE  TRUE  TRUE  TRUE

Or with setdiff to get the names:

setdiff(cols, colnames(df))
#[1] &quot;c7&quot;  &quot;c10&quot; &quot;c11&quot;

And to select directly the columns in the data.frame without error messages, use any_of:

library(dplyr)
select(df, any_of(cols))
#   c1 c2
# 1  1  a
# 2  2  b
# 3  3  c
# 4  4  d
# 5  5  e

答案2

得分: 1

你可以使用一个“外部”方法。同时使用一个感兴趣的变量名称向量以及一个带有命名数据框的命名列表。

v <- c('X1', 'X3', 'X7', 'X10', 'X11') |> setNames(nm=_)
f <- \(x, y) x %in% names(y)
t(outer(v, mget(ls(pattern='^df\\d$')), Vectorize(f)))
#       X1    X3    X7  X10   X11
# df1 TRUE  TRUE  TRUE TRUE FALSE
# df2 TRUE FALSE  TRUE TRUE FALSE
# df3 TRUE  TRUE FALSE TRUE FALSE

而不是使用 mget(ls(.))，你也可以硬编码列表，如下所示：

list(df1=df1, df2=df2, df3=df3)

说明: 我们想要知道语句 v %in% names(df) 是否为真，对于多个数据框的每一个都要这样做。需要的是一种双重循环方法，外部循环遍历数据框，内部循环遍历 v。但是我们可以使用 outer，它可以用于这类问题，对于两个向量 X 和 Y，它迭代 Y 的每个元素与 X 的每个元素并生成一个矩阵。我们只需对语句使用 Vectorize。

编辑

你可能希望对数据框集合中的变量命名应用一致的逻辑。例如，使用三个字符来缩写含义（abbreviate('ethnicity', 3) 将有帮助），但更重要的是后面跟着两位数字（或者任何数字，只要全局一致），这些数字带有一致的含义。这将允许你使用 endsWith。

ptt <- sprintf('%02d', c(1, 3, 7, 10, 11)) |> setNames(nm=_)

注意: sprintf('%02d', .) 方便地填充为两位数带有前导零，%02d，而 _ 是基本占位符，用于将其传递给 setNames，即输出名称与每个元素的内容相同（我们希望它对于矩阵显示良好）：

ptt
#   01   03   07   10   11 
# "01" "03" "07" "10" "11"

f <- \(x, y) any(endsWith(names(x), y))

适应了 outer 解决方案：

outer(mget(ls(pattern='^df[4-6]$')), ptt, Vectorize(f))
#       01    03    07   10    11
# df4 TRUE  TRUE  TRUE TRUE FALSE
# df5 TRUE FALSE  TRUE TRUE FALSE
# df6 TRUE  TRUE FALSE TRUE FALSE

再次强调，搜索模式 ^df[4-6]$ 寻找 df4、df5 和 df6；但是你可以使用一个命名列表 list(df4=df4, df5=df5, df6=df6) 来代替 mget(.)，但 mget 更方便。

数据:

df1 <- df2 <- df3 <- data.frame(matrix(,2, 10))
df2$X3  <- NULL
df3$X2  <- df3$X7  <- NULL
df4 <- df5 <- df6 <- data.frame(matrix(,2, 10))
names(df4) <- paste0('eth', sprintf('%02d', seq_along(df4)))
names(df5) <- paste0('rgn', sprintf('%02d', seq_along(df5)))
names(df6) <- paste0('inc', sprintf('%02d', seq_along(df6)))
df5[3] <- NULL
df6[2] <- df6[7]  <- NULL

英文:

You could use an outer approach. Use both a names vector of variables of interest as well as a named list of your data frames.

v &lt;- c(&#39;X1&#39;, &#39;X3&#39;, &#39;X7&#39;, &#39;X10&#39;, &#39;X11&#39;) |&gt; setNames(nm=_)
f &lt;- \(x, y) x %in% names(y)
t(outer(v, mget(ls(pattern=&#39;^df\\d$&#39;)), Vectorize(f)))
#       X1    X3    X7  X10   X11
# df1 TRUE  TRUE  TRUE TRUE FALSE
# df2 TRUE FALSE  TRUE TRUE FALSE
# df3 TRUE  TRUE FALSE TRUE FALSE

Instead of mget(ls(.)) you can also hard-code the list, using

list(df1=df1, df2=df2, df3=df3)

Explanation: We want to know if the statement v %in% names(df) is true, and that for each of multiple data frames. What's needed is kind of a double for-loop approach, with the outer looping over the data frames and the inner over v. However we can use outer, which can be used for such problems, of two vectors X and Y it iterates each element of Y over each element of X and throws a matrix. We just have to Vectorize the statement.

Edit

You might want to apply a consistent logic to variable naming in your set of data frames. For instance, three characters to abbreviate the meaning (abbreviate('ethnicity', 3) will help), but more importantly followed by two digits (or any number, as long as globally consistent) that carry congruent meaning. This would allow you to use endsWith.

ptt &lt;- sprintf(&#39;%02d&#39;, c(1, 3, 7, 10, 11)) |&gt; setNames(nm=_)

Note: The sprintf('%02d', .) conveniently fills up to two digits with a leading zero, %02d, and the _ is the base placeholder what was piped into setNames, i.e. output names are identical to content of each element (we want it for the matrix to be well displayed):

ptt
#   01   03   07   10   11 
# &quot;01&quot; &quot;03&quot; &quot;07&quot; &quot;10&quot; &quot;11&quot; 
f &lt;- \(x, y) any(endsWith(names(x), y))

Adapted outer solution:

outer(mget(ls(pattern=&#39;^df[4-6]$&#39;)), ptt, Vectorize(f))
#       01    03    07   10    11
# df4 TRUE  TRUE  TRUE TRUE FALSE
# df5 TRUE FALSE  TRUE TRUE FALSE
# df6 TRUE  TRUE FALSE TRUE FALSE

Again, the search pattern '^df[4-6]$' looks for df4, df5, and df6; but instead of the mget(.) you can use a named list list(df4=df4, df5=df5, df6=df6), but mget is neat.

Data:

df1 &lt;- df2 &lt;- df3 &lt;- data.frame(matrix(,2, 10))
df2$X3  &lt;- NULL
df3$X2  &lt;- df3$X7  &lt;- NULL
df4 &lt;- df5 &lt;- df6 &lt;- data.frame(matrix(,2, 10))
names(df4) &lt;- paste0(&#39;eth&#39;, sprintf(&#39;%02d&#39;, seq_along(df4)))
names(df5) &lt;- paste0(&#39;rgn&#39;, sprintf(&#39;%02d&#39;, seq_along(df5)))
names(df6) &lt;- paste0(&#39;inc&#39;, sprintf(&#39;%02d&#39;, seq_along(df6)))
df5[3] &lt;- NULL
df6[2] &lt;- df6[7]  &lt;- NULL

答案3

得分: 0

以下是翻译好的部分：

"Also we can use the below code to create the vectors vars_in and vars_notin
Once we know the variables that exists then we can use that vector vars_in with all_of() in select()

cols &lt;- c(&quot;c10&quot;, &quot;c2&quot;, &quot;c7&quot;, &quot;c1&quot;, &quot;c11&quot;)
vars_in &lt;- cols[which(!is.na(match(cols,colnames(df))))]
[1] &quot;c2&quot; &quot;c1&quot;
vars_notin &lt;- cols[which(is.na(match(cols,colnames(df))))]
[1] &quot;c10&quot; &quot;c7&quot;  &quot;c11&quot;
select(df, all_of(vars_in))
# output
  c2 c1
1  a  1
2  b  2
3  c  3
4  d  4
5  e  5
```"
<details>
<summary>英文:</summary>
Also we can use the below code to create the vectors `vars_in` and `vars_notin`
Once we know the variables that exists then we can use that vector `vars_in` with all_of() in select()
```r
cols &lt;- c(&quot;c10&quot;, &quot;c2&quot;, &quot;c7&quot;, &quot;c1&quot;, &quot;c11&quot;)
vars_in &lt;- cols[which(!is.na(match(cols,colnames(df))))]
[1] &quot;c2&quot; &quot;c1&quot;
vars_notin &lt;- cols[which(is.na(match(cols,colnames(df))))]
[1] &quot;c10&quot; &quot;c7&quot;  &quot;c11&quot;
select(df, all_of(vars_in))
# output
  c2 c1
1  a  1
2  b  2
3  c  3
4  d  4
5  e  5
</details>

通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库，让每个人都能够通过互相帮助和分享经验来进步。

如何检查大型数据框中一组列的存在？

问题

答案1

答案2

编辑

Edit

答案3

将数据框变量转换为因子时指定级别。

Pandas 无法构建索引。

将图像数据框转换为pandas数据框。

如何在多个数据框上使用多个条件进行求和

如何在Playwright视觉比较中屏蔽多个定位器？

在C++中，可以使用可变模板参数来检索类型的内部类型。

selenium.common.exceptions.StaleElementReferenceException: Message: stale element reference: stale element not found

Creating and opening a URL to log in to Website via Basic Auth with Robot Framework/Selenium (Python)

AG Grid 在上下文菜单中以大文本形式打开

What's the correct way to type hint an empty list as a literal in python?

如何在Highcharts Gantt中更改本地化的星期名称

如何在同一个流中使用多个过滤器和映射函数？

如何使用Map/Set来将代码优化到O(n)？

.NET MAUI Android在GitHub Actions上构建失败，错误代码为1。