2023年2月10日 12:52:58go评论97阅读模式

英文:

Unique on a dataframe with two columns at a time

问题

我有一个包含11列和41,000行的数据框，我想通过逐一比较其中的两列来找到唯一的行。我希望这个问题很简单，但我无法通过unique或duplicated函数解决它。

我有以下示例数据框：

水果       PRS_001               PRS_002              PRS_015              PRS_016
苹果       0.5                  NA                   NA                   NA
橙子       0.2                  NA                   NA                   NA
苹果       1.3                  NA                   NA                   NA
苹果       0.5                  NA                   NA                   NA   
莓果       NA                   NA                   0.3                  NA
苹果       NA                   NA                   0.5                  NA
葡萄       NA                   0.2                  NA                   NA

我想要以下数据框：

水果       值             人员
苹果       1.3            PRS_001              
橙子       0.2            PRS_001      
莓果       0.3            PRS_015
苹果       0.5            PRS_015
葡萄       0.2            PRS_002

如果水果名称在水果和其他列之间重复出现，那么最大的数值将被保留。例如，在水果和PRS_001之间，苹果重复3次，但1.3是最高的数值，所以在最终表格中将保留1.3的数值。

英文:

I have a dataframe with 11 columns and 41k rows, and I would to find the unique rows by comparing two of the columns at a time. I'm hoping this is an easy one, but I can't get it to work with unique or duplicated myself.

I have the following example df.

Fruits    PRS_001               PRS_002              PRS_015              PRS_016
Apple       0.5                  NA                   NA                   NA
Orange      0.2                  NA                   NA                   NA
Apple       1.3                  NA                   NA                   NA
Apple       0.5                  NA                   NA                   NA   
Berry       NA                   NA                   0.3                  NA
Apple       NA                   NA                   0.5                  NA
Grape       NA                   0.2                  NA                   NA

I would like to have the following data frame:

Fruits     Value1         Person
Apple      1.3             PRS_001              
Orange     0.2             PRS_001      
Berry      0.3             PRS_015
Apple      0.5             PRS_015
Grape      0.2             PRS_002

If a fruit name is repeating between fruits and other column, then the highest number will come in place. for example, between fruits and PRS_001, apple is repeating 3 times but 1.3 is the highest value so in final table 1.3 value will come.

答案1

得分: 1

We could reshape to 'long' and then do a group by slice_max

library(dplyr)
library(tidyr)
pivot_longer(df1, cols = starts_with("PRS"), names_to = "Person",
    values_to = "value1", values_drop_na = TRUE) %>%
  group_by(Fruits, Person) %>%
  slice_max(value1, n = 1) %>%
  ungroup %>%
  relocate(Person, .after = last_col())

-output

# A tibble: 5 x 3
  Fruits value1 Person 
  <chr>   <dbl> <chr>  
1 Apple     1.3 PRS_001
2 Orange    0.2 PRS_001
3 Grape     0.2 PRS_002
4 Apple     0.5 PRS_015
5 Berry     0.3 PRS_015

Or with data.table

library(data.table)
melt(setDT(df1), measure.vars = patterns("^PRS"), 
  variable.name = "Person", na.rm = TRUE)[, .(value = max(value)), 
    .(Fruits, Person)]

-output

  Fruits  Person value
1:  Apple PRS_001   1.3
2: Orange PRS_001   0.2
3:  Grape PRS_002   0.2
4:  Berry PRS_015   0.3
5:  Apple PRS_015   0.5

Or use fmax from collapse and return the wide data

library(collapse)
fmax(df1[-1], g = df1$Fruits)
       PRS_001 PRS_002 PRS_015 PRS_016
Apple      1.3      NA     0.5      NA
Berry       NA      NA     0.3      NA
Grape       NA     0.2      NA      NA
Orange     0.2      NA      NA      NA

data

df1 <- structure(list(Fruits = c("Apple", "Orange", "Apple", "Apple", 
"Berry", "Apple", "Grape"), PRS_001 = c(0.5, 0.2, 1.3, 0.5, NA, 
NA, NA), PRS_002 = c(NA, NA, NA, NA, NA, NA, 0.2), PRS_015 = c(NA, 
NA, NA, NA, 0.3, 0.5, NA), PRS_016 = c(NA, NA, NA, NA, NA, NA, 
NA)), class = "data.frame", row names = c(NA, -7L))

英文:

We could reshape to 'long' and then do a group by slice_max

library(dplyr)
library(tidyr)
pivot_longer(df1, cols = starts_with(&quot;PRS&quot;), names_to = &quot;Person&quot;,
    values_to = &quot;value1&quot;, values_drop_na = TRUE) %&gt;%
  group_by(Fruits, Person) %&gt;%
  slice_max(value1, n = 1) %&gt;%
  ungroup %&gt;% 
  relocate(Person, .after = last_col())

-output

# A tibble: 5 &#215; 3
  Fruits value1 Person 
  &lt;chr&gt;   &lt;dbl&gt; &lt;chr&gt;  
1 Apple     1.3 PRS_001
2 Orange    0.2 PRS_001
3 Grape     0.2 PRS_002
4 Apple     0.5 PRS_015
5 Berry     0.3 PRS_015

Or with data.table

library(data.table)
melt(setDT(df1), measure.vars = patterns(&quot;^PRS&quot;), 
  variable.name = &quot;Person&quot;, na.rm = TRUE)[, .(value = max(value)), 
    .(Fruits, Person)]

-output

  Fruits  Person value
1:  Apple PRS_001   1.3
2: Orange PRS_001   0.2
3:  Grape PRS_002   0.2
4:  Berry PRS_015   0.3
5:  Apple PRS_015   0.5

Or use fmax from collapse and return the wide data

library(collapse)
fmax(df1[-1], g = df1$Fruits)
       PRS_001 PRS_002 PRS_015 PRS_016
Apple      1.3      NA     0.5      NA
Berry       NA      NA     0.3      NA
Grape       NA     0.2      NA      NA
Orange     0.2      NA      NA      NA

data

df1 &lt;- structure(list(Fruits = c(&quot;Apple&quot;, &quot;Orange&quot;, &quot;Apple&quot;, &quot;Apple&quot;, 
&quot;Berry&quot;, &quot;Apple&quot;, &quot;Grape&quot;), PRS_001 = c(0.5, 0.2, 1.3, 0.5, NA, 
NA, NA), PRS_002 = c(NA, NA, NA, NA, NA, NA, 0.2), PRS_015 = c(NA, 
NA, NA, NA, 0.3, 0.5, NA), PRS_016 = c(NA, NA, NA, NA, NA, NA, 
NA)), class = &quot;data.frame&quot;, row.names = c(NA, -7L))

答案2

得分: 1

在基本的R中，你可以执行以下操作：

aggregate(values~., cbind(df[1], stack(df,-1)), max)

在Tidyverse中，你可以使用以下代码：

library(tidyverse)
df %>%
  pivot_longer(-Fruits, names_to = 'Person', values_drop_na = TRUE) %>%
  summarise(Value1 = max(value), .by = c(Fruits, Person))

请注意，这两段代码执行的是相同的数据操作。

英文:

in base R you could do:

aggregate(values~., cbind(df[1], stack(df,-1)), max)
#&gt;   Fruits     ind values
#&gt; 1  Apple PRS_001    1.3
#&gt; 2 Orange PRS_001    0.2
#&gt; 3  Grape PRS_002    0.2
#&gt; 4  Apple PRS_015    0.5
#&gt; 5  Berry PRS_015    0.3

in tidyverse:

library(tidyverse)
df %&gt;%
  pivot_longer(-Fruits, names_to = &#39;Person&#39;, values_drop_na = TRUE) %&gt;%
  summarise(Value1 = max(value), .by = c(Fruits, Person))
#&gt; # A tibble: 5 &#215; 3
#&gt;   Fruits Person  Value1
#&gt;   &lt;chr&gt;  &lt;chr&gt;    &lt;dbl&gt;
#&gt; 1 Apple  PRS_001    1.3
#&gt; 2 Orange PRS_001    0.2
#&gt; 3 Berry  PRS_015    0.3
#&gt; 4 Apple  PRS_015    0.5
#&gt; 5 Grape  PRS_002    0.2

<sup>Created on 2023-02-09 with reprex v2.0.2</sup>

通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库，让每个人都能够通过互相帮助和分享经验来进步。

在一个数据框中，每次使用两列的唯一值。

问题

答案1

data

data

答案2

调整刻度标记

如何根据包含响应式图表的Rmarkdown生成HTML报告？

如何根据其他列的值填充Python数据框的列值？

基于一个字符串聚合列合并两个DataFrame

如何在Playwright视觉比较中屏蔽多个定位器？

在C++中，可以使用可变模板参数来检索类型的内部类型。

selenium.common.exceptions.StaleElementReferenceException: Message: stale element reference: stale element not found

Creating and opening a URL to log in to Website via Basic Auth with Robot Framework/Selenium (Python)

AG Grid 在上下文菜单中以大文本形式打开

What's the correct way to type hint an empty list as a literal in python?

如何在Highcharts Gantt中更改本地化的星期名称

如何在同一个流中使用多个过滤器和映射函数？

如何使用Map/Set来将代码优化到O(n)？

.NET MAUI Android在GitHub Actions上构建失败，错误代码为1。

如何在Playwright视觉比较中屏蔽多个定位器？

在C++中，可以使用可变模板参数来检索类型的内部类型。

selenium.common.exceptions.StaleElementReferenceException: Message: stale element reference: stale element not found

Creating and opening a URL to log in to Website via Basic Auth with Robot Framework/Selenium (Python)

AG Grid 在上下文菜单中以大文本形式打开

What's the correct way to type hint an empty list as a literal in python?

如何在Highcharts Gantt中更改本地化的星期名称

如何在同一个流中使用多个过滤器和映射函数？

如何使用Map/Set来将代码优化到O(n)？

.NET MAUI Android在GitHub Actions上构建失败，错误代码为1。

发表评论