Check if multiple columns are identical in r (with missing data)

Question

I am trying to determine if multiple columns are identical to each other in R. Each variable contains string data. An example might look something like this. There is some missing data in the file.

    df <- data.frame(id    = 1:9,
                     var1  = c("a", "b", "c", "a", "a", "a", "c", "a", "b"),
                     var2  = c("a", "b", NA, "a", "a", "a", "c", "a", "b"),
                     var3  = c("a", "b", "c", "a", "a", "b", "c", "a", "b"),
                     var4  = c("a", "b", "c", "b", "a", "b", "c", "a", "b"),
                     var5  = c("a", NA, "c", "b", "a", NA, "c", "a", "b"),
                     var6  = c("a", NA, "c", "a", "c", NA, "c", "a", "b"),
                     var7  = c("a", NA, "c", "a", "c", "b", "c", "a", "b"),
                     var8  = c("a", "b", "c", "a", "c", "a", "c", "a", "b"),
                     var9  = c("a", "b", "c", "a", "c", "a", "c", "a", "b"),
                     var10 = c("a", "b", "c", "a", "c", NA, "c", "a", "b")
    )

I want to identify any differences across the variables while ignoring missing data (e.g., in id 4, all vars are "a" except for var4 and var5, which are "b").
I am trying to export a data frame that shows the id and all the variables for the rows where such differences occur.

Desired output:

    id var1 var2 var3 var4 var5 var6 var7 var8 var9 var10
     4    a    a    a    b    b    a    a    a    a     a
     5    a    a    a    a    a    c    c    c    c     c
    ...

I could create a flag variable for each individual circumstance that could occur:

    df$flag[df$var1 == "a" & df$var2 == "a" & df$var3 == "a" & df$var4 == "a" & df$var5 == "a" & df$var6 == "a" & df$var7 == "a" & df$var8 == "a" & df$var9 == "a" & df$var10 == "b"] <- 1
    df$flag[df$var1 == "a" & df$var2 == "a" & df$var3 == "a" & df$var4 == "a" & df$var5 == "a" & df$var6 == "a" & df$var7 == "a" & df$var8 == "a" & df$var9 == "a" & df$var10 == "c"] <- 1
    df$flag[df$var1 == "a" & df$var2 == "a" & df$var3 == "a" & df$var4 == "a" & df$var5 == "a" & df$var6 == "a" & df$var7 == "a" & df$var8 == "a" & df$var9 == "b" & df$var10 == "b"] <- 1
    df$flag[df$var1 == "a" & df$var2 == "a" & df$var3 == "a" & df$var4 == "a" & df$var5 == "a" & df$var6 == "a" & df$var7 == "a" & df$var8 == "a" & df$var9 == "c" & df$var10 == "c"] <- 1
    ...

This does not seem like an ideal solution as there are too many circumstances I would need to code for, many of which might not even occur in the data frame.

I tried using unique(), but that only identifies the distinct patterns; I would still need to write code for each of them. Since the data are strings, I can't use var(). On a smaller scale, with just two variables, I could use == or identical(), but I'm not sure how to approach this with a large number of variables.
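
To make the two-variable case concrete, here is a minimal sketch. The vectors x and y are made up for illustration and are not taken from the data above. It shows why a plain comparison is awkward once missing values are involved: != returns NA wherever either side is missing, and identical() compares the two vectors as a whole rather than element by element.

```r
# Two short illustrative vectors (hypothetical, not from the question's data).
x <- c("a", "b", "c", "a")
y <- c("a", "b", NA,  "b")

# Element-wise comparison: the third element is NA because y is missing there.
raw_comparison <- x != y

# identical() returns a single TRUE/FALSE for the whole vectors.
same_vectors <- identical(x, y)

# Treat a pair as different only when both values are present and unequal.
differs <- !is.na(x) & !is.na(y) & x != y
print(differs)
```

This works for one pair of columns, but repeating it for every pair of ten columns quickly becomes unwieldy, which is the motivation for a row-wise approach.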

Answer 1

Score: 1

With dplyr (the .by argument needs dplyr 1.1.0 or later), keep the rows that have more than one distinct non-missing value across the var columns:

    library(dplyr)

    df |>
      filter(n_distinct(c_across(starts_with("var")), na.rm = TRUE) > 1, .by = id)

A way to do this by row in base R would be:

    df[apply(df[-1], 1, \(x) nlevels(factor(x))) > 1, ]

Output

      id var1 var2 var3 var4 var5 var6 var7 var8 var9 var10
    1  4    a    a    a    b    b    a    a    a    a     a
    2  5    a    a    a    a    a    c    c    c    c     c
    3  6    a    a    b    b <NA> <NA>    b    a    a  <NA>
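
The row-wise idea can also be spelled out step by step. The following is a minimal base R sketch, not the answerer's code: the helper name n_distinct_non_na and the variables var_cols, differs, and result are hypothetical names introduced here for illustration. It assumes all compared columns are character columns, since apply() converts the selected columns to a matrix first.

```r
# Example data from the question.
df <- data.frame(
  id    = 1:9,
  var1  = c("a", "b", "c", "a", "a", "a", "c", "a", "b"),
  var2  = c("a", "b", NA, "a", "a", "a", "c", "a", "b"),
  var3  = c("a", "b", "c", "a", "a", "b", "c", "a", "b"),
  var4  = c("a", "b", "c", "b", "a", "b", "c", "a", "b"),
  var5  = c("a", NA, "c", "b", "a", NA, "c", "a", "b"),
  var6  = c("a", NA, "c", "a", "c", NA, "c", "a", "b"),
  var7  = c("a", NA, "c", "a", "c", "b", "c", "a", "b"),
  var8  = c("a", "b", "c", "a", "c", "a", "c", "a", "b"),
  var9  = c("a", "b", "c", "a", "c", "a", "c", "a", "b"),
  var10 = c("a", "b", "c", "a", "c", NA, "c", "a", "b")
)

# Hypothetical helper: count the distinct values in x, ignoring NA.
n_distinct_non_na <- function(x) length(unique(x[!is.na(x)]))

# Select the columns to compare by name instead of by position.
var_cols <- grep("^var", names(df), value = TRUE)

# A row "differs" when it holds more than one distinct non-missing value.
differs <- apply(df[var_cols], 1, n_distinct_non_na) > 1

# Keep only the rows with differences.
result <- df[differs, ]
print(result$id)
# ids 4, 5 and 6
```

Selecting columns by name avoids relying on id being the first column, and keeping differs as a separate logical vector makes it easy to store it as a flag column instead of filtering if that is preferred.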

huangapple
  • Posted on August 4, 2023, 03:52:36
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76831249.html