2023-02-14 19:07:59

R + dplyr: Partial Deduplication of Rows in a Tibble

Question

A very common question is how to remove all duplicated rows from a data frame in R, which can be done with a variety of tools (I like dplyr + distinct).

However, what if your dataset contains several duplicated rows and you do not want to remove all of them, only those matching some combination of the variables?

I do not know how to achieve that, so any suggestion is welcome.

Please have a look at the reprex at the end of the post.

Thanks!

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
df <- tibble(x=rep(seq(5), 3), y=rep(LETTERS[1:5],3),
             z=c(rep(c("h","j","k","t","u"), 2), LETTERS[1:5])
             )
df
#> # A tibble: 15 × 3
#>        x y     z    
#>    <int> <chr> <chr>
#>  1     1 A     h    
#>  2     2 B     j    
#>  3     3 C     k    
#>  4     4 D     t    
#>  5     5 E     u    
#>  6     1 A     h    
#>  7     2 B     j    
#>  8     3 C     k    
#>  9     4 D     t    
#> 10     5 E     u    
#> 11     1 A     A    
#> 12     2 B     B    
#> 13     3 C     C    
#> 14     4 D     D    
#> 15     5 E     E
df_ded <- df |>
    distinct()
df_ded
#> # A tibble: 10 × 3
#>        x y     z    
#>    <int> <chr> <chr>
#>  1     1 A     h    
#>  2     2 B     j    
#>  3     3 C     k    
#>  4     4 D     t    
#>  5     5 E     u    
#>  6     1 A     A    
#>  7     2 B     B    
#>  8     3 C     C    
#>  9     4 D     D    
#> 10     5 E     E
## I want to deduplicate only the rows with x==3 and z=="k"
df_ded_partial <- df |>
    distinct(x==3, z=="k") ## but this is not what I mean.
## How to achieve it?
df_ded_partial
#> # A tibble: 3 × 2
#>   `x == 3` `z == "k"`
#>   <lgl>    <lgl>     
#> 1 FALSE    FALSE     
#> 2 TRUE     TRUE      
#> 3 TRUE     FALSE

Created on 2023-02-14 with reprex v2.0.2

Answer 1

Score: 5

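We can use group_modify() and check the condition through the .y argument, which is a one-row tibble holding the key values (x and z) of the current group, while .x holds that group's rows without the grouping columns. So we can say: if the condition is met, return distinct(.x); otherwise return the whole group .x unchanged. Note that the result is still grouped and is ordered by the group keys, with the grouping columns first; add ungroup() if you need a plain tibble.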

library(dplyr)
df |>
  group_by(x, z) |>
  group_modify(~ if(.y$x == 3 && .y$z == "k") distinct(.x) else .x)
#> # A tibble: 14 x 3
#> # Groups:   x, z [10]
#>        x z     y    
#>    <int> <chr> <chr>
#>  1     1 A     A    
#>  2     1 h     A    
#>  3     1 h     A    
#>  4     2 B     B    
#>  5     2 j     B    
#>  6     2 j     B    
#>  7     3 C     C    
#>  8     3 k     C    
#>  9     4 D     D    
#> 10     4 t     D    
#> 11     4 t     D    
#> 12     5 E     E    
#> 13     5 u     E    
#> 14     5 u     E
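
If the original row and column order matters, one possible alternative (not part of the answer above, just a sketch) is to drop only the rows that are both inside the target subset and a repeat of an earlier row. It assumes that "duplicate" means identical across all columns, as with distinct(); the names is_dup and df_ded_partial are only illustrative.

```r
library(dplyr)

df <- tibble(
  x = rep(seq(5), 3),
  y = rep(LETTERS[1:5], 3),
  z = c(rep(c("h", "j", "k", "t", "u"), 2), LETTERS[1:5])
)

# TRUE for every row that repeats an earlier row across all columns
is_dup <- duplicated(df)

# Remove a row only when it matches the condition AND is a repeat;
# every other row, duplicated or not, is kept in its original position
df_ded_partial <- df |>
  filter(!(x == 3 & z == "k" & is_dup))

nrow(df_ded_partial)
```

This keeps the result ungrouped, so no ungroup() call is needed afterwards.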

Data from OP

df <- tibble(x=rep(seq(5), 3), y=rep(LETTERS[1:5],3),
             z=c(rep(c("h","j","k","t","u"), 2), LETTERS[1:5])
)

Created on 2023-02-14 by the reprex package (v2.0.1)
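
To see what the lambda in group_modify() actually receives, a quick check with group_map() may help. This is only an illustrative sketch: the data is filtered to x == 3 first so that just two groups remain, and the name pieces is made up for the example.

```r
library(dplyr)

df <- tibble(
  x = rep(seq(5), 3),
  y = rep(LETTERS[1:5], 3),
  z = c(rep(c("h", "j", "k", "t", "u"), 2), LETTERS[1:5])
)

# For each (x, z) group, capture the keys (.y) and the rows (.x)
# exactly as they are passed to the lambda
pieces <- df |>
  filter(x == 3) |>
  group_by(x, z) |>
  group_map(~ list(keys = .y, rows = .x))

# Second group is x == 3, z == "k"
pieces[[2]]$keys  # one-row tibble with the grouping columns x and z
pieces[[2]]$rows  # the group's rows, containing only the non-grouping column y
```

Because .x no longer contains x and z, distinct(.x) inside the lambda compares rows on the remaining columns only, which is enough here since the keys are constant within a group.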
