如何在基于不同数据框的情况下,为每个人分配相同的ID编号?

huangapple go评论104阅读模式
英文:

R: How to give the same id number to each person in a data frame based on a different data frame?

问题

我有两个数据集,df 和 df2。这是两个非常庞大且混乱的数据框的过于简化版本。

在原始的 df 中,我通过将它们按腰带(belt)和体重(weight)进行分组为每个人创建了一个唯一的 id。我希望在 df2 中,与 df 中相同的人具有相同的 id 号码。它们需要具有相同的名称,并且应该按腰带和体重进行分组。请注意,df2 中有一些在 df 中没有的人。

简化的 df 外观如下:

  1. belt weight rank id name
  2. 1 purple open class 1 55 Tom Cruise
  3. 2 black rooster 2 79 Emma Watson
  4. 3 blue feather 3 63 John Doe
  5. 4 blue feather 4 63 John Doe
  6. 5 purple open class 5 55 Tom Cruise
  7. 6 brown heavy 6 3 James Bond
  8. 7 purple open class 7 55 Tom Cruise
  9. 8 purple heavy 8 61 Tom Cruise
  10. 9 black open class 9 70 Jane Doe
  11. 10 purple heavy 10 61 Tom Cruise

第二个数据框的外观如下。在 df2 中的人,但不在 df 中的人应该为他们的 id 收到 NA。请注意,id 必须按腰带和体重分配,因为有些人根据他们参加的体重分组有不同的积分。

  1. belt2 weight2 rank2 name points
  2. 1 purple open class 1 Tom Cruise 100
  3. 2 black rooster 2 Emma Watson 30
  4. 3 blue feather 3 John Doe 50
  5. 4 blue feather 4 John Doe 50
  6. 5 purple open class 5 Tom Cruise 100
  7. 6 brown heavy 6 James Bond 200
  8. 7 black rooster 7 Jon Snow 92
  9. 8 purple heavy 8 Tom Cruise 77
  10. 9 black open class 9 Jane Doe 88
  11. 10 purple heavy 10 Tom Cruise 77

这是我希望 df2 外观的方式:

  1. belt2 weight2 rank2 id name points
  2. 1 purple open class 1 55 Tom Cruise 100
  3. 2 black rooster 2 79 Emma Watson 30
  4. 3 blue feather 3 63 John Doe 50
  5. 4 blue feather 4 63 John Doe 50
  6. 5 purple open class 5 55 Tom Cruise 100
  7. 6 brown heavy 6 3 James Bond 200
  8. 7 black rooster 7 NA Jon Snow 92
  9. 8 purple heavy 8 61 Tom Cruise 77
  10. 9 black open class 9 70 Jane Doe 88
  11. 10 purple heavy 10 61 Tom Cruise 77

基本上,我希望 df2 中的 id 号码与 df 中的 id 号码匹配。如果没有匹配,用 NA 填充。

以下是 df 和 df2 的数据:

  1. # df
  2. belt <- c("purple", "black", "blue", "blue", "purple", "brown", "purple", "purple", "black", "purple")
  3. weight <- c("open class", "rooster", "feather", "feather", "open class", "heavy", "open class", "heavy", "open class", "heavy")
  4. rank <- 1:10
  5. id <- c(55, 79, 63, 63, 55, 3, 55, 61, 70, 61)
  6. names <- c("Tom Cruise", "Emma Watson", "John Doe", "John Doe", "Tom Cruise", "James Bond", "Tom Cruise", "Tom Cruise", "Jane Doe", "Tom Cruise")
  7. (df <- data.frame(belt, weight, rank, id, name = names))
  8. # df2
  9. belt2 <- c("purple", "black", "blue", "blue", "purple", "brown", "black", "purple", "black", "purple")
  10. weight2 <- c("open class", "rooster", "feather", "feather", "open class", "heavy", "rooster", "heavy", "open class", "heavy")
  11. rank2 <- 1:10
  12. names2 <- c("Tom Cruise", "Emma Watson", "John Doe", "John Doe", "Tom Cruise", "James Bond", "Jon Snow", "Tom Cruise", "Jane Doe", "Tom Cruise")
  13. points <- c(100, 30, 50, 50, 100, 200, 92, 77, 88, 77)
  14. (df2 <- data.frame(belt2, weight2, rank2, name = names2, points))
英文:

I have two data sets, df and df2. This is an oversimplified version of two very large and messy data frames.

In the original df, I created a unique id for each person by grouping them by belt and weight. I want the same id numbers each person has in df to be given to the same people in df2. They need to have the same name, and should be grouped by belt and weight. Note that there are people in df2 who are not in df.

The simplified df looks something like below

  1. belt weight rank id name
  2. 1 purple open class 1 55 Tom Cruise
  3. 2 black rooster 2 79 Emma Watson
  4. 3 blue feather 3 63 John Doe
  5. 4 blue feather 4 63 John Doe
  6. 5 purple open class 5 55 Tom Cruise
  7. 6 brown heavy 6 3 James Bond
  8. 7 purple open class 7 55 Tom Cruise
  9. 8 purple heavy 8 61 Tom Cruise
  10. 9 black open class 9 70 Jane Doe
  11. 10 purple heavy 10 61 Tom Cruise

The second data frame looks something like this. A person who is in df2, but not in df should receive an NA for their id. Note that the id's must be given by belt and weight because some people have different points depending on which weight division they competed in

  1. belt2 weight2 rank2 name points
  2. 1 purple open class 1 Tom Cruise 100
  3. 2 black rooster 2 Emma Watson 30
  4. 3 blue feather 3 John Doe 50
  5. 4 blue feather 4 John Doe 50
  6. 5 purple open class 5 Tom Cruise 100
  7. 6 brown heavy 6 James Bond 200
  8. 7 black rooster 7 Jon Snow 92
  9. 8 purple heavy 8 Tom Cruise 77
  10. 9 black open class 9 Jane Doe 88
  11. 10 purple heavy 10 Tom Cruise 77

This is what I would like df2 to look like:

  1. belt2 weight2 rank2 id name points
  2. 1 purple open class 1 55 Tom Cruise 100
  3. 2 black rooster 2 79 Emma Watson 30
  4. 3 blue feather 3 63 John Doe 50
  5. 4 blue feather 4 63 John Doe 50
  6. 5 purple open class 5 55 Tom Cruise 100
  7. 6 brown heavy 6 3 James Bond 200
  8. 7 black rooster 7 NA Jon Snow 92
  9. 8 purple heavy 8 61 Tom Cruise 77
  10. 9 black open class 9 70 Jane Doe 88
  11. 10 purple heavy 10 61 Tom Cruise 77

Basically, I want the id numbers in df2 to match the id numbers in df. If there is not a match, fill with NA.

  1. # df
  2. belt &lt;- c(&quot;purple&quot;, &quot;black&quot;, &quot;blue&quot;, &quot;blue&quot;, &quot;purple&quot;, &quot;brown&quot;, &quot;purple&quot;, &quot;purple&quot;, &quot;black&quot;, &quot;purple&quot;)
  3. weight &lt;- c(&quot;open class&quot;, &quot;rooster&quot;, &quot;feather&quot;, &quot;feather&quot;, &quot;open class&quot;, &quot;heavy&quot;, &quot;open class&quot;, &quot;heavy&quot;, &quot;open class&quot;, &quot;heavy&quot;)
  4. rank &lt;- 1:10
  5. id &lt;- c(55, 79, 63, 63, 55, 3, 55, 61, 70, 61)
  6. names &lt;- c(&quot;Tom Cruise&quot;, &quot;Emma Watson&quot;, &quot;John Doe&quot;, &quot;John Doe&quot;, &quot;Tom Cruise&quot;, &quot;James Bond&quot;, &quot;Tom Cruise&quot;, &quot;Tom Cruise&quot;, &quot;Jane Doe&quot;, &quot;Tom Cruise&quot;)
  7. (df &lt;- data.frame(belt, weight, rank, id, name = names))
  8. #df2
  9. belt2 &lt;- c(&quot;purple&quot;, &quot;black&quot;, &quot;blue&quot;, &quot;blue&quot;, &quot;purple&quot;, &quot;brown&quot;, &quot;black&quot;, &quot;purple&quot;, &quot;black&quot;, &quot;purple&quot;)
  10. weight2 &lt;- c(&quot;open class&quot;, &quot;rooster&quot;, &quot;feather&quot;, &quot;feather&quot;, &quot;open class&quot;, &quot;heavy&quot;, &quot;rooster&quot;, &quot;heavy&quot;, &quot;open class&quot;, &quot;heavy&quot;)
  11. rank2 &lt;- 1:10
  12. names2 &lt;- c(&quot;Tom Cruise&quot;, &quot;Emma Watson&quot;, &quot;John Doe&quot;, &quot;John Doe&quot;, &quot;Tom Cruise&quot;, &quot;James Bond&quot;, &quot;Jon Snow&quot;, &quot;Tom Cruise&quot;, &quot;Jane Doe&quot;, &quot;Tom Cruise&quot;)
  13. points &lt;- c(100, 30, 50, 50, 100, 200, 92, 77, 88, 77)
  14. (df2 &lt;- data.frame(belt2, weight2, rank2, name = names2, points))

答案1

得分: 3

以下是代码部分的翻译:

This can be solved with a right join and removing the duplicates after it. I will use base function merge.

  1. df3 &lt;- merge(
  2. df, df2,
  3. by.x = c(&quot;belt&quot;, &quot;weight&quot;, &quot;rank&quot;, &quot;name&quot;),
  4. by.y = c(&quot;belt2&quot;, &quot;weight2&quot;, &quot;rank2&quot;, &quot;name&quot;),
  5. all.y = TRUE
  6. )
  7. df3 &lt;- df3[!duplicated(df3),]
  8. df3[order(df3$rank),]
  9. #&gt; belt weight rank name id points
  10. #&gt; 9 purple open class 1 Tom Cruise 55 100
  11. #&gt; 2 black rooster 2 Emma Watson 79 30
  12. #&gt; 4 blue feather 3 John Doe 63 50
  13. #&gt; 5 blue feather 4 John Doe 63 50
  14. #&gt; 10 purple open class 5 Tom Cruise 55 100
  15. #&gt; 6 brown heavy 6 James Bond 3 200
  16. #&gt; 3 black rooster 7 Jon Snow NA 92
  17. #&gt; 7 purple heavy 8 Tom Cruise 61 77
  18. #&gt; 1 black open class 9 Jane Doe 70 88
  19. #&gt; 8 purple heavy 10 Tom Cruise 61 77

A dplyr right join is

  1. suppressPackageStartupMessages({
  2. library(dplyr)
  3. })
  4. df %&gt;%
  5. right_join(df2,
  6. by = c(&quot;belt&quot; = &quot;belt2&quot;, &quot;weight&quot; = &quot;weight2&quot;, &quot;rank&quot; = &quot;rank2&quot;),
  7. suffix = c(&quot;.x&quot;, &quot;&quot;)) %&gt;%
  8. select(-name.x) %&gt;%
  9. arrange(rank)
  10. #&gt; belt weight rank id name points
  11. #&gt; 1 purple open class 1 55 Tom Cruise 100
  12. #&gt; 2 black rooster 2 79 Emma Watson 30
  13. #&gt; 3 blue feather 3 63 John Doe 50
  14. #&gt; 4 blue feather 4 63 John Doe 50
  15. #&gt; 5 purple open class 5 55 Tom Cruise 100
  16. #&gt; 6 brown heavy 6 3 James Bond 200
  17. #&gt; 7 black rooster 7 NA Jon Snow 92
  18. #&gt; 8 purple heavy 8 61 Tom Cruise 77
  19. #&gt; 9 black open class 9 70 Jane Doe 88
  20. #&gt; 10 purple heavy 10 61 Tom Cruise 77

希望这些翻译对您有所帮助。

英文:

This can be solved with a right join and removing the duplicates after it. I will use base function merge.

  1. df3 &lt;- merge(
  2. df, df2,
  3. by.x = c(&quot;belt&quot;, &quot;weight&quot;, &quot;rank&quot;, &quot;name&quot;),
  4. by.y = c(&quot;belt2&quot;, &quot;weight2&quot;, &quot;rank2&quot;, &quot;name&quot;),
  5. all.y = TRUE
  6. )
  7. df3 &lt;- df3[!duplicated(df3),]
  8. df3[order(df3$rank),]
  9. #&gt; belt weight rank name id points
  10. #&gt; 9 purple open class 1 Tom Cruise 55 100
  11. #&gt; 2 black rooster 2 Emma Watson 79 30
  12. #&gt; 4 blue feather 3 John Doe 63 50
  13. #&gt; 5 blue feather 4 John Doe 63 50
  14. #&gt; 10 purple open class 5 Tom Cruise 55 100
  15. #&gt; 6 brown heavy 6 James Bond 3 200
  16. #&gt; 3 black rooster 7 Jon Snow NA 92
  17. #&gt; 7 purple heavy 8 Tom Cruise 61 77
  18. #&gt; 1 black open class 9 Jane Doe 70 88
  19. #&gt; 8 purple heavy 10 Tom Cruise 61 77

<sup>Created on 2023-02-08 with reprex v2.0.2</sup>


A dplyr right join is

  1. suppressPackageStartupMessages({
  2. library(dplyr)
  3. })
  4. df %&gt;%
  5. right_join(df2,
  6. by = c(&quot;belt&quot; = &quot;belt2&quot;, &quot;weight&quot; = &quot;weight2&quot;, &quot;rank&quot; = &quot;rank2&quot;),
  7. suffix = c(&quot;.x&quot;, &quot;&quot;)) %&gt;%
  8. select(-name.x) %&gt;%
  9. arrange(rank)
  10. #&gt; belt weight rank id name points
  11. #&gt; 1 purple open class 1 55 Tom Cruise 100
  12. #&gt; 2 black rooster 2 79 Emma Watson 30
  13. #&gt; 3 blue feather 3 63 John Doe 50
  14. #&gt; 4 blue feather 4 63 John Doe 50
  15. #&gt; 5 purple open class 5 55 Tom Cruise 100
  16. #&gt; 6 brown heavy 6 3 James Bond 200
  17. #&gt; 7 black rooster 7 NA Jon Snow 92
  18. #&gt; 8 purple heavy 8 61 Tom Cruise 77
  19. #&gt; 9 black open class 9 70 Jane Doe 88
  20. #&gt; 10 purple heavy 10 61 Tom Cruise 77

<sup>Created on 2023-02-08 with reprex v2.0.2</sup>

答案2

得分: 3

你可以通过在这两个数据框之间使用left join来完成此任务。

  1. df2 = df2 %>% left_join(df, by = c("belt2" = "belt", "weight2" = "weight", "name" = "name")) %>% select(belt2, weight2, rank2, name, points, id)

翻译后的结果如下:

  1. df2 = df2 %>% left_join(df, by = c("belt2" = "belt", "weight2" = "weight", "name" = "name")) %>% select(belt2, weight2, rank2, name, points, id)
英文:

You can achieve this task by using a left join between the two dataframes.

  1. df2 = df2 %&gt;% left_join(df, by = c(&quot;belt2&quot; = &quot;belt&quot;, &quot;weight2&quot; = &quot;weight&quot;, &quot;name&quot; = &quot;name&quot;)) %&gt;% select(belt2, weight2, rank2, name, points, id)
  2. belt2 weight2 rank2 name points id
  3. 1 purple open class 1 Tom Cruise 100 55
  4. 2 purple open class 1 Tom Cruise 100 55
  5. 3 purple open class 1 Tom Cruise 100 55
  6. 4 black rooster 2 Emma Watson 30 79
  7. 5 blue feather 3 John Doe 50 63
  8. 6 blue feather 3 John Doe 50 63
  9. 7 blue feather 4 John Doe 50 63
  10. 8 blue feather 4 John Doe 50 63
  11. 9 purple open class 5 Tom Cruise 100 55
  12. 10 purple open class 5 Tom Cruise 100 55
  13. 11 purple open class 5 Tom Cruise 100 55
  14. 12 brown heavy 6 James Bond 200 3
  15. 13 black rooster 7 Jon Snow 92 NA
  16. 14 purple heavy 8 Tom Cruise 77 61
  17. 15 purple heavy 8 Tom Cruise 77 61
  18. 16 black open class 9 Jane Doe 88 70
  19. 17 purple heavy 10 Tom Cruise 77 61
  20. 18 purple heavy 10 Tom Cruise 77 61

huangapple
  • 本文由 发表于 2023年2月8日 15:54:39
  • 转载请务必保留本文链接:https://go.coder-hub.com/75382752.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定