英文:
Unique on a dataframe with two columns at a time
问题
我有一个包含11列和41,000行的数据框,我想通过逐一比较其中的两列来找到唯一的行。我希望这个问题很简单,但我无法通过unique或duplicated函数解决它。
我有以下示例数据框:
水果 PRS_001 PRS_002 PRS_015 PRS_016
苹果 0.5 NA NA NA
橙子 0.2 NA NA NA
苹果 1.3 NA NA NA
苹果 0.5 NA NA NA
莓果 NA NA 0.3 NA
苹果 NA NA 0.5 NA
葡萄 NA 0.2 NA NA
我想要以下数据框:
水果 值 人员
苹果 1.3 PRS_001
橙子 0.2 PRS_001
莓果 0.3 PRS_015
苹果 0.5 PRS_015
葡萄 0.2 PRS_002
如果水果名称在水果和其他列之间重复出现,那么最大的数值将被保留。例如,在水果和PRS_001之间,苹果重复3次,但1.3是最高的数值,所以在最终表格中将保留1.3的数值。
英文:
I have a dataframe with 11 columns and 41k rows, and I would to find the unique rows by comparing two of the columns at a time. I'm hoping this is an easy one, but I can't get it to work with unique or duplicated myself.
I have the following example df.
Fruits PRS_001 PRS_002 PRS_015 PRS_016
Apple 0.5 NA NA NA
Orange 0.2 NA NA NA
Apple 1.3 NA NA NA
Apple 0.5 NA NA NA
Berry NA NA 0.3 NA
Apple NA NA 0.5 NA
Grape NA 0.2 NA NA
I would like to have the following data frame:
Fruits Value1 Person
Apple 1.3 PRS_001
Orange 0.2 PRS_001
Berry 0.3 PRS_015
Apple 0.5 PRS_015
Grape 0.2 PRS_002
If a fruit name is repeating between fruits and other column, then the highest number will come in place. for example, between fruits and PRS_001, apple is repeating 3 times but 1.3 is the highest value so in final table 1.3 value will come.
答案1
得分: 1
We could reshape to 'long' and then do a group by slice_max
library(dplyr)
library(tidyr)
pivot_longer(df1, cols = starts_with("PRS"), names_to = "Person",
values_to = "value1", values_drop_na = TRUE) %>%
group_by(Fruits, Person) %>%
slice_max(value1, n = 1) %>%
ungroup %>%
relocate(Person, .after = last_col())
-output
# A tibble: 5 x 3
Fruits value1 Person
<chr> <dbl> <chr>
1 Apple 1.3 PRS_001
2 Orange 0.2 PRS_001
3 Grape 0.2 PRS_002
4 Apple 0.5 PRS_015
5 Berry 0.3 PRS_015
Or with data.table
library(data.table)
melt(setDT(df1), measure.vars = patterns("^PRS"),
variable.name = "Person", na.rm = TRUE)[, .(value = max(value)),
.(Fruits, Person)]
-output
Fruits Person value
1: Apple PRS_001 1.3
2: Orange PRS_001 0.2
3: Grape PRS_002 0.2
4: Berry PRS_015 0.3
5: Apple PRS_015 0.5
Or use fmax
from collapse
and return the wide data
library(collapse)
fmax(df1[-1], g = df1$Fruits)
PRS_001 PRS_002 PRS_015 PRS_016
Apple 1.3 NA 0.5 NA
Berry NA NA 0.3 NA
Grape NA 0.2 NA NA
Orange 0.2 NA NA NA
data
df1 <- structure(list(Fruits = c("Apple", "Orange", "Apple", "Apple",
"Berry", "Apple", "Grape"), PRS_001 = c(0.5, 0.2, 1.3, 0.5, NA,
NA, NA), PRS_002 = c(NA, NA, NA, NA, NA, NA, 0.2), PRS_015 = c(NA,
NA, NA, NA, 0.3, 0.5, NA), PRS_016 = c(NA, NA, NA, NA, NA, NA,
NA)), class = "data.frame", row names = c(NA, -7L))
英文:
We could reshape to 'long' and then do a group by slice_max
library(dplyr)
library(tidyr)
pivot_longer(df1, cols = starts_with("PRS"), names_to = "Person",
values_to = "value1", values_drop_na = TRUE) %>%
group_by(Fruits, Person) %>%
slice_max(value1, n = 1) %>%
ungroup %>%
relocate(Person, .after = last_col())
-output
# A tibble: 5 × 3
Fruits value1 Person
<chr> <dbl> <chr>
1 Apple 1.3 PRS_001
2 Orange 0.2 PRS_001
3 Grape 0.2 PRS_002
4 Apple 0.5 PRS_015
5 Berry 0.3 PRS_015
Or with data.table
library(data.table)
melt(setDT(df1), measure.vars = patterns("^PRS"),
variable.name = "Person", na.rm = TRUE)[, .(value = max(value)),
.(Fruits, Person)]
-output
Fruits Person value
1: Apple PRS_001 1.3
2: Orange PRS_001 0.2
3: Grape PRS_002 0.2
4: Berry PRS_015 0.3
5: Apple PRS_015 0.5
Or use fmax
from collapse
and return the wide data
library(collapse)
fmax(df1[-1], g = df1$Fruits)
PRS_001 PRS_002 PRS_015 PRS_016
Apple 1.3 NA 0.5 NA
Berry NA NA 0.3 NA
Grape NA 0.2 NA NA
Orange 0.2 NA NA NA
data
df1 <- structure(list(Fruits = c("Apple", "Orange", "Apple", "Apple",
"Berry", "Apple", "Grape"), PRS_001 = c(0.5, 0.2, 1.3, 0.5, NA,
NA, NA), PRS_002 = c(NA, NA, NA, NA, NA, NA, 0.2), PRS_015 = c(NA,
NA, NA, NA, 0.3, 0.5, NA), PRS_016 = c(NA, NA, NA, NA, NA, NA,
NA)), class = "data.frame", row.names = c(NA, -7L))
答案2
得分: 1
在基本的R中,你可以执行以下操作:
aggregate(values~., cbind(df[1], stack(df,-1)), max)
在Tidyverse中,你可以使用以下代码:
library(tidyverse)
df %>%
pivot_longer(-Fruits, names_to = 'Person', values_drop_na = TRUE) %>%
summarise(Value1 = max(value), .by = c(Fruits, Person))
请注意,这两段代码执行的是相同的数据操作。
英文:
in base R you could do:
aggregate(values~., cbind(df[1], stack(df,-1)), max)
#> Fruits ind values
#> 1 Apple PRS_001 1.3
#> 2 Orange PRS_001 0.2
#> 3 Grape PRS_002 0.2
#> 4 Apple PRS_015 0.5
#> 5 Berry PRS_015 0.3
in tidyverse:
library(tidyverse)
df %>%
pivot_longer(-Fruits, names_to = 'Person', values_drop_na = TRUE) %>%
summarise(Value1 = max(value), .by = c(Fruits, Person))
#> # A tibble: 5 × 3
#> Fruits Person Value1
#> <chr> <chr> <dbl>
#> 1 Apple PRS_001 1.3
#> 2 Orange PRS_001 0.2
#> 3 Berry PRS_015 0.3
#> 4 Apple PRS_015 0.5
#> 5 Grape PRS_002 0.2
<sup>Created on 2023-02-09 with reprex v2.0.2</sup>
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论