在一个数据框中,每次使用两列的唯一值。

huangapple go评论97阅读模式
英文:

Unique on a dataframe with two columns at a time

问题

我有一个包含11列和41,000行的数据框,我想通过逐一比较其中的两列来找到唯一的行。我希望这个问题很简单,但我无法通过unique或duplicated函数解决它。

我有以下示例数据框:

  1. 水果 PRS_001 PRS_002 PRS_015 PRS_016
  2. 苹果 0.5 NA NA NA
  3. 橙子 0.2 NA NA NA
  4. 苹果 1.3 NA NA NA
  5. 苹果 0.5 NA NA NA
  6. 莓果 NA NA 0.3 NA
  7. 苹果 NA NA 0.5 NA
  8. 葡萄 NA 0.2 NA NA

我想要以下数据框:

  1. 水果 人员
  2. 苹果 1.3 PRS_001
  3. 橙子 0.2 PRS_001
  4. 莓果 0.3 PRS_015
  5. 苹果 0.5 PRS_015
  6. 葡萄 0.2 PRS_002

如果水果名称在水果和其他列之间重复出现,那么最大的数值将被保留。例如,在水果和PRS_001之间,苹果重复3次,但1.3是最高的数值,所以在最终表格中将保留1.3的数值。

英文:

I have a dataframe with 11 columns and 41k rows, and I would to find the unique rows by comparing two of the columns at a time. I'm hoping this is an easy one, but I can't get it to work with unique or duplicated myself.

I have the following example df.

  1. Fruits PRS_001 PRS_002 PRS_015 PRS_016
  2. Apple 0.5 NA NA NA
  3. Orange 0.2 NA NA NA
  4. Apple 1.3 NA NA NA
  5. Apple 0.5 NA NA NA
  6. Berry NA NA 0.3 NA
  7. Apple NA NA 0.5 NA
  8. Grape NA 0.2 NA NA

I would like to have the following data frame:

  1. Fruits Value1 Person
  2. Apple 1.3 PRS_001
  3. Orange 0.2 PRS_001
  4. Berry 0.3 PRS_015
  5. Apple 0.5 PRS_015
  6. Grape 0.2 PRS_002

If a fruit name is repeating between fruits and other column, then the highest number will come in place. for example, between fruits and PRS_001, apple is repeating 3 times but 1.3 is the highest value so in final table 1.3 value will come.

答案1

得分: 1

We could reshape to 'long' and then do a group by slice_max

  1. library(dplyr)
  2. library(tidyr)
  3. pivot_longer(df1, cols = starts_with("PRS"), names_to = "Person",
  4. values_to = "value1", values_drop_na = TRUE) %>%
  5. group_by(Fruits, Person) %>%
  6. slice_max(value1, n = 1) %>%
  7. ungroup %>%
  8. relocate(Person, .after = last_col())

-output

  1. # A tibble: 5 x 3
  2. Fruits value1 Person
  3. <chr> <dbl> <chr>
  4. 1 Apple 1.3 PRS_001
  5. 2 Orange 0.2 PRS_001
  6. 3 Grape 0.2 PRS_002
  7. 4 Apple 0.5 PRS_015
  8. 5 Berry 0.3 PRS_015

Or with data.table

  1. library(data.table)
  2. melt(setDT(df1), measure.vars = patterns("^PRS"),
  3. variable.name = "Person", na.rm = TRUE)[, .(value = max(value)),
  4. .(Fruits, Person)]

-output

  1. Fruits Person value
  2. 1: Apple PRS_001 1.3
  3. 2: Orange PRS_001 0.2
  4. 3: Grape PRS_002 0.2
  5. 4: Berry PRS_015 0.3
  6. 5: Apple PRS_015 0.5

Or use fmax from collapse and return the wide data

  1. library(collapse)
  2. fmax(df1[-1], g = df1$Fruits)
  3. PRS_001 PRS_002 PRS_015 PRS_016
  4. Apple 1.3 NA 0.5 NA
  5. Berry NA NA 0.3 NA
  6. Grape NA 0.2 NA NA
  7. Orange 0.2 NA NA NA

data

  1. df1 <- structure(list(Fruits = c("Apple", "Orange", "Apple", "Apple",
  2. "Berry", "Apple", "Grape"), PRS_001 = c(0.5, 0.2, 1.3, 0.5, NA,
  3. NA, NA), PRS_002 = c(NA, NA, NA, NA, NA, NA, 0.2), PRS_015 = c(NA,
  4. NA, NA, NA, 0.3, 0.5, NA), PRS_016 = c(NA, NA, NA, NA, NA, NA,
  5. NA)), class = "data.frame", row names = c(NA, -7L))
英文:

We could reshape to 'long' and then do a group by slice_max

  1. library(dplyr)
  2. library(tidyr)
  3. pivot_longer(df1, cols = starts_with(&quot;PRS&quot;), names_to = &quot;Person&quot;,
  4. values_to = &quot;value1&quot;, values_drop_na = TRUE) %&gt;%
  5. group_by(Fruits, Person) %&gt;%
  6. slice_max(value1, n = 1) %&gt;%
  7. ungroup %&gt;%
  8. relocate(Person, .after = last_col())

-output

  1. # A tibble: 5 &#215; 3
  2. Fruits value1 Person
  3. &lt;chr&gt; &lt;dbl&gt; &lt;chr&gt;
  4. 1 Apple 1.3 PRS_001
  5. 2 Orange 0.2 PRS_001
  6. 3 Grape 0.2 PRS_002
  7. 4 Apple 0.5 PRS_015
  8. 5 Berry 0.3 PRS_015

Or with data.table

  1. library(data.table)
  2. melt(setDT(df1), measure.vars = patterns(&quot;^PRS&quot;),
  3. variable.name = &quot;Person&quot;, na.rm = TRUE)[, .(value = max(value)),
  4. .(Fruits, Person)]

-output

  1. Fruits Person value
  2. 1: Apple PRS_001 1.3
  3. 2: Orange PRS_001 0.2
  4. 3: Grape PRS_002 0.2
  5. 4: Berry PRS_015 0.3
  6. 5: Apple PRS_015 0.5

Or use fmax from collapse and return the wide data

  1. library(collapse)
  2. fmax(df1[-1], g = df1$Fruits)
  3. PRS_001 PRS_002 PRS_015 PRS_016
  4. Apple 1.3 NA 0.5 NA
  5. Berry NA NA 0.3 NA
  6. Grape NA 0.2 NA NA
  7. Orange 0.2 NA NA NA

data

  1. df1 &lt;- structure(list(Fruits = c(&quot;Apple&quot;, &quot;Orange&quot;, &quot;Apple&quot;, &quot;Apple&quot;,
  2. &quot;Berry&quot;, &quot;Apple&quot;, &quot;Grape&quot;), PRS_001 = c(0.5, 0.2, 1.3, 0.5, NA,
  3. NA, NA), PRS_002 = c(NA, NA, NA, NA, NA, NA, 0.2), PRS_015 = c(NA,
  4. NA, NA, NA, 0.3, 0.5, NA), PRS_016 = c(NA, NA, NA, NA, NA, NA,
  5. NA)), class = &quot;data.frame&quot;, row.names = c(NA, -7L))

答案2

得分: 1

在基本的R中,你可以执行以下操作:

  1. aggregate(values~., cbind(df[1], stack(df,-1)), max)

在Tidyverse中,你可以使用以下代码:

  1. library(tidyverse)
  2. df %>%
  3. pivot_longer(-Fruits, names_to = 'Person', values_drop_na = TRUE) %>%
  4. summarise(Value1 = max(value), .by = c(Fruits, Person))

请注意,这两段代码执行的是相同的数据操作。

英文:

in base R you could do:

  1. aggregate(values~., cbind(df[1], stack(df,-1)), max)
  2. #&gt; Fruits ind values
  3. #&gt; 1 Apple PRS_001 1.3
  4. #&gt; 2 Orange PRS_001 0.2
  5. #&gt; 3 Grape PRS_002 0.2
  6. #&gt; 4 Apple PRS_015 0.5
  7. #&gt; 5 Berry PRS_015 0.3

in tidyverse:

  1. library(tidyverse)
  2. df %&gt;%
  3. pivot_longer(-Fruits, names_to = &#39;Person&#39;, values_drop_na = TRUE) %&gt;%
  4. summarise(Value1 = max(value), .by = c(Fruits, Person))
  5. #&gt; # A tibble: 5 &#215; 3
  6. #&gt; Fruits Person Value1
  7. #&gt; &lt;chr&gt; &lt;chr&gt; &lt;dbl&gt;
  8. #&gt; 1 Apple PRS_001 1.3
  9. #&gt; 2 Orange PRS_001 0.2
  10. #&gt; 3 Berry PRS_015 0.3
  11. #&gt; 4 Apple PRS_015 0.5
  12. #&gt; 5 Grape PRS_002 0.2

<sup>Created on 2023-02-09 with reprex v2.0.2</sup>

huangapple
  • 本文由 发表于 2023年2月10日 12:52:58
  • 转载请务必保留本文链接:https://go.coder-hub.com/75407091.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定