在一个数据框中,每次使用两列的唯一值。

huangapple go评论67阅读模式
英文:

Unique on a dataframe with two columns at a time

问题

我有一个包含11列和41,000行的数据框,我想通过逐一比较其中的两列来找到唯一的行。我希望这个问题很简单,但我无法通过unique或duplicated函数解决它。

我有以下示例数据框:

水果       PRS_001               PRS_002              PRS_015              PRS_016
苹果       0.5                  NA                   NA                   NA
橙子       0.2                  NA                   NA                   NA
苹果       1.3                  NA                   NA                   NA
苹果       0.5                  NA                   NA                   NA   
莓果       NA                   NA                   0.3                  NA
苹果       NA                   NA                   0.5                  NA
葡萄       NA                   0.2                  NA                   NA  

我想要以下数据框:

水果       值             人员
苹果       1.3            PRS_001              
橙子       0.2            PRS_001      
莓果       0.3            PRS_015
苹果       0.5            PRS_015
葡萄       0.2            PRS_002

如果水果名称在水果和其他列之间重复出现,那么最大的数值将被保留。例如,在水果和PRS_001之间,苹果重复3次,但1.3是最高的数值,所以在最终表格中将保留1.3的数值。

英文:

I have a dataframe with 11 columns and 41k rows, and I would to find the unique rows by comparing two of the columns at a time. I'm hoping this is an easy one, but I can't get it to work with unique or duplicated myself.

I have the following example df.

Fruits    PRS_001               PRS_002              PRS_015              PRS_016
Apple       0.5                  NA                   NA                   NA
Orange      0.2                  NA                   NA                   NA
Apple       1.3                  NA                   NA                   NA
Apple       0.5                  NA                   NA                   NA   
Berry       NA                   NA                   0.3                  NA
Apple       NA                   NA                   0.5                  NA
Grape       NA                   0.2                  NA                   NA  

I would like to have the following data frame:

Fruits     Value1         Person
Apple      1.3             PRS_001              
Orange     0.2             PRS_001      
Berry      0.3             PRS_015
Apple      0.5             PRS_015
Grape      0.2             PRS_002

If a fruit name is repeating between fruits and other column, then the highest number will come in place. for example, between fruits and PRS_001, apple is repeating 3 times but 1.3 is the highest value so in final table 1.3 value will come.

答案1

得分: 1

We could reshape to 'long' and then do a group by slice_max

library(dplyr)
library(tidyr)
pivot_longer(df1, cols = starts_with("PRS"), names_to = "Person",
    values_to = "value1", values_drop_na = TRUE) %>%
  group_by(Fruits, Person) %>%
  slice_max(value1, n = 1) %>%
  ungroup %>%
  relocate(Person, .after = last_col())

-output

# A tibble: 5 x 3
  Fruits value1 Person 
  <chr>   <dbl> <chr>  
1 Apple     1.3 PRS_001
2 Orange    0.2 PRS_001
3 Grape     0.2 PRS_002
4 Apple     0.5 PRS_015
5 Berry     0.3 PRS_015

Or with data.table

library(data.table)
melt(setDT(df1), measure.vars = patterns("^PRS"), 
  variable.name = "Person", na.rm = TRUE)[, .(value = max(value)), 
    .(Fruits, Person)]

-output

  Fruits  Person value
1:  Apple PRS_001   1.3
2: Orange PRS_001   0.2
3:  Grape PRS_002   0.2
4:  Berry PRS_015   0.3
5:  Apple PRS_015   0.5

Or use fmax from collapse and return the wide data

library(collapse)
fmax(df1[-1], g = df1$Fruits)
       PRS_001 PRS_002 PRS_015 PRS_016
Apple      1.3      NA     0.5      NA
Berry       NA      NA     0.3      NA
Grape       NA     0.2      NA      NA
Orange     0.2      NA      NA      NA

data

df1 <- structure(list(Fruits = c("Apple", "Orange", "Apple", "Apple", 
"Berry", "Apple", "Grape"), PRS_001 = c(0.5, 0.2, 1.3, 0.5, NA, 
NA, NA), PRS_002 = c(NA, NA, NA, NA, NA, NA, 0.2), PRS_015 = c(NA, 
NA, NA, NA, 0.3, 0.5, NA), PRS_016 = c(NA, NA, NA, NA, NA, NA, 
NA)), class = "data.frame", row names = c(NA, -7L))
英文:

We could reshape to 'long' and then do a group by slice_max

library(dplyr)
library(tidyr)
pivot_longer(df1, cols = starts_with(&quot;PRS&quot;), names_to = &quot;Person&quot;,
    values_to = &quot;value1&quot;, values_drop_na = TRUE) %&gt;%
  group_by(Fruits, Person) %&gt;%
  slice_max(value1, n = 1) %&gt;%
  ungroup %&gt;% 
  relocate(Person, .after = last_col())

-output

# A tibble: 5 &#215; 3
  Fruits value1 Person 
  &lt;chr&gt;   &lt;dbl&gt; &lt;chr&gt;  
1 Apple     1.3 PRS_001
2 Orange    0.2 PRS_001
3 Grape     0.2 PRS_002
4 Apple     0.5 PRS_015
5 Berry     0.3 PRS_015

Or with data.table

library(data.table)
melt(setDT(df1), measure.vars = patterns(&quot;^PRS&quot;), 
  variable.name = &quot;Person&quot;, na.rm = TRUE)[, .(value = max(value)), 
    .(Fruits, Person)]

-output

  Fruits  Person value
1:  Apple PRS_001   1.3
2: Orange PRS_001   0.2
3:  Grape PRS_002   0.2
4:  Berry PRS_015   0.3
5:  Apple PRS_015   0.5

Or use fmax from collapse and return the wide data

library(collapse)
fmax(df1[-1], g = df1$Fruits)
       PRS_001 PRS_002 PRS_015 PRS_016
Apple      1.3      NA     0.5      NA
Berry       NA      NA     0.3      NA
Grape       NA     0.2      NA      NA
Orange     0.2      NA      NA      NA

data

df1 &lt;- structure(list(Fruits = c(&quot;Apple&quot;, &quot;Orange&quot;, &quot;Apple&quot;, &quot;Apple&quot;, 
&quot;Berry&quot;, &quot;Apple&quot;, &quot;Grape&quot;), PRS_001 = c(0.5, 0.2, 1.3, 0.5, NA, 
NA, NA), PRS_002 = c(NA, NA, NA, NA, NA, NA, 0.2), PRS_015 = c(NA, 
NA, NA, NA, 0.3, 0.5, NA), PRS_016 = c(NA, NA, NA, NA, NA, NA, 
NA)), class = &quot;data.frame&quot;, row.names = c(NA, -7L))

答案2

得分: 1

在基本的R中,你可以执行以下操作:

aggregate(values~., cbind(df[1], stack(df,-1)), max)

在Tidyverse中,你可以使用以下代码:

library(tidyverse)

df %>%
  pivot_longer(-Fruits, names_to = 'Person', values_drop_na = TRUE) %>%
  summarise(Value1 = max(value), .by = c(Fruits, Person))

请注意,这两段代码执行的是相同的数据操作。

英文:

in base R you could do:

aggregate(values~., cbind(df[1], stack(df,-1)), max)

#&gt;   Fruits     ind values
#&gt; 1  Apple PRS_001    1.3
#&gt; 2 Orange PRS_001    0.2
#&gt; 3  Grape PRS_002    0.2
#&gt; 4  Apple PRS_015    0.5
#&gt; 5  Berry PRS_015    0.3

in tidyverse:

library(tidyverse)

df %&gt;%
  pivot_longer(-Fruits, names_to = &#39;Person&#39;, values_drop_na = TRUE) %&gt;%
  summarise(Value1 = max(value), .by = c(Fruits, Person))
#&gt; # A tibble: 5 &#215; 3
#&gt;   Fruits Person  Value1
#&gt;   &lt;chr&gt;  &lt;chr&gt;    &lt;dbl&gt;
#&gt; 1 Apple  PRS_001    1.3
#&gt; 2 Orange PRS_001    0.2
#&gt; 3 Berry  PRS_015    0.3
#&gt; 4 Apple  PRS_015    0.5
#&gt; 5 Grape  PRS_002    0.2

<sup>Created on 2023-02-09 with reprex v2.0.2</sup>

huangapple
  • 本文由 发表于 2023年2月10日 12:52:58
  • 转载请务必保留本文链接:https://go.coder-hub.com/75407091.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定