左连接并保留唯一值

huangapple go评论81阅读模式
英文:

left join and keep unique values

问题

我想将这两个数据框根据Country进行左连接,以便我可以获得每个国家的每个League。我不能在**left_join()中使用multiple="any"**参数,因为每个国家都有多个联赛。

期望的输出如下:

# A tibble: 6 × 2
  Country League 
  <chr>   <chr>  
1 France  Ligue 1
2 France  Ligue 2
3 Spain   Liga   
4 Spain   Liga 2 
5 England PL1    
6 England PL2

我可以通过以下方式获得这个结果,但会出现警告消息。而且我有一种印象,可以使用dplyr更友好的方式来完成这个任务。

dtl <- dtl %>% 
  left_join(select(t2, Country, League), by = c("Country")) %>% 
  dplyr::distinct(Country, League, .keep_all = TRUE)

谢谢你的帮助。

英文:

I have the following df :

t1&lt;-structure(list(Country = c(&quot;France&quot;, &quot;Spain&quot;, &quot;England&quot;)), class = c(&quot;tbl_df&quot;, 
&quot;tbl&quot;, &quot;data.frame&quot;), row.names = c(NA, -3L))

t2&lt;-structure(list(Country = c(&quot;France&quot;, &quot;France&quot;, &quot;France&quot;, &quot;France&quot;, 
&quot;France&quot;, &quot;France&quot;, &quot;Spain&quot;, &quot;Spain&quot;, &quot;Spain&quot;, &quot;Spain&quot;, &quot;Spain&quot;, 
&quot;Spain&quot;, &quot;Spain&quot;, &quot;England&quot;, &quot;England&quot;, &quot;England&quot;, &quot;England&quot;, 
&quot;England&quot;, &quot;England&quot;, &quot;England&quot;, &quot;England&quot;), League = c(&quot;Ligue 1&quot;, 
&quot;Ligue 1&quot;, &quot;Ligue 1&quot;, &quot;Ligue 2&quot;, &quot;Ligue 2&quot;, &quot;Ligue 2&quot;, &quot;Liga&quot;, 
&quot;Liga&quot;, &quot;Liga&quot;, &quot;Liga 2&quot;, &quot;Liga 2&quot;, &quot;Liga 2&quot;, &quot;Liga 2&quot;, &quot;PL1&quot;, 
&quot;PL1&quot;, &quot;PL1&quot;, &quot;PL1&quot;, &quot;PL2&quot;, &quot;PL2&quot;, &quot;PL2&quot;, &quot;PL2&quot;), ID = c(&quot;EUR&quot;, 
&quot;EUR&quot;, &quot;EUR&quot;, &quot;EUR&quot;, &quot;EUR&quot;, &quot;EUR&quot;, &quot;EUR&quot;, &quot;EUR&quot;, &quot;EUR&quot;, &quot;EUR&quot;, 
&quot;EUR&quot;, &quot;EUR&quot;, &quot;EUR&quot;, &quot;EUR&quot;, &quot;EUR&quot;, &quot;EUR&quot;, &quot;EUR&quot;, &quot;EUR&quot;, &quot;EUR&quot;, 
&quot;EUR&quot;, &quot;EUR&quot;)), class = c(&quot;tbl_df&quot;, &quot;tbl&quot;, &quot;data.frame&quot;), row.names = c(NA, 
-21L))

I want to left join the two based on Country so that I have every League per country. I can't use multiple="any" argument in left_join() since there are several leagues per country.
The desired output would be this:

# A tibble: 6 &#215; 2
  Country League 
  &lt;chr&gt;   &lt;chr&gt;  
1 France  Ligue 1
2 France  Ligue 2
3 Spain   Liga   
4 Spain   Liga 2 
5 England PL1    
6 England PL2

I'm able to get this result by doing the following but it throws a warning message. Also I'm under the impression that this could be done in a more R friendly way using dplyr.

dtl&lt;- dtl %&gt;% 
  left_join(select(t2,Country,League),by=c(&quot;Country&quot;)) %&gt;% 
  dplyr::distinct(Country,League,.keep_all = T)

Thank you for your help.

答案1

得分: 2

Darren的评论所示,问题在于t2中的重复行。移除这些重复行,您将获得所需的结果:

library(dplyr)

left_join(t1, distinct(t2), by = "Country") %>%
  select(-ID)

# 输出:
# 一个 tibble: 6 × 2
  Country League 
  <chr>   <chr>  
1 France  Ligue 1
2 France  Ligue 2
3 Spain   Liga   
4 Spain   Liga 2 
5 England PL1    
6 England PL2  
英文:

As Darren's comment illustrates, the issue is the duplicate rows in t2. Remove those, and you get the desired result:

library(dplyr)

left_join(t1, distinct(t2), by = &quot;Country&quot;) %&gt;% 
  select(-ID)

# Output:
# A tibble: 6 &#215; 2
  Country League 
  &lt;chr&gt;   &lt;chr&gt;  
1 France  Ligue 1
2 France  Ligue 2
3 Spain   Liga   
4 Spain   Liga 2 
5 England PL1    
6 England PL2  

答案2

得分: 1

你可以使用dplyr的group_by和summarize函数来组合每个国家的联赛以获得所需的输出。以下是用于实现结果的代码,没有警告消息:

library(dplyr)

t1 <- structure(list(Country = c("France", "Spain", "England")), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, -3L))
t2 <- structure(list(Country = c("France", "France", "France", "France", "France", "France", "Spain", "Spain", "Spain", "Spain", "Spain", "Spain", "Spain", "England", "England", "England", "England", "England", "England", "England", "England"), League = c("Ligue 1", "Ligue 1", "Ligue 1", "Ligue 2", "Ligue 2", "Ligue 2", "Liga", "Liga", "Liga", "Liga 2", "Liga 2", "Liga 2", "Liga 2", "PL1", "PL1", "PL1", "PL1", "PL2", "PL2", "PL2", "PL2"), ID = c("EUR", "EUR", "EUR", "EUR", "EUR", "EUR", "EUR", "EUR", "EUR", "EUR", "EUR", "EUR", "EUR", "EUR", "EUR", "EUR", "EUR", "EUR", "EUR", "EUR", "EUR")), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, -21L))

result <- t1 %>%
  left_join(t2, by = "Country") %>%
  group_by(Country) %>%
  summarize(League = toString(unique(League)))

print(result)

输出:

# A tibble: 3 × 2
  Country League                      
  <chr>   <chr>                       
1 England PL1, PL2                    
2 France  Ligue 1, Ligue 2            
3 Spain   Liga, Liga 2
英文:

You can achieve the desired output using dplyr's group_by and summarize functions to combine the leagues for each country. Here's the code to achieve the result without the warning message:

library(dplyr)

t1 &lt;- structure(list(Country = c(&quot;France&quot;, &quot;Spain&quot;, &quot;England&quot;)), class = c(&quot;tbl_df&quot;, &quot;tbl&quot;, &quot;data.frame&quot;), row.names = c(NA, -3L))
t2 &lt;- structure(list(Country = c(&quot;France&quot;, &quot;France&quot;, &quot;France&quot;, &quot;France&quot;, &quot;France&quot;, &quot;France&quot;, &quot;Spain&quot;, &quot;Spain&quot;, &quot;Spain&quot;, &quot;Spain&quot;, &quot;Spain&quot;, &quot;Spain&quot;, &quot;Spain&quot;, &quot;England&quot;, &quot;England&quot;, &quot;England&quot;, &quot;England&quot;, &quot;England&quot;, &quot;England&quot;, &quot;England&quot;, &quot;England&quot;), League = c(&quot;Ligue 1&quot;, &quot;Ligue 1&quot;, &quot;Ligue 1&quot;, &quot;Ligue 2&quot;, &quot;Ligue 2&quot;, &quot;Ligue 2&quot;, &quot;Liga&quot;, &quot;Liga&quot;, &quot;Liga&quot;, &quot;Liga 2&quot;, &quot;Liga 2&quot;, &quot;Liga 2&quot;, &quot;Liga 2&quot;, &quot;PL1&quot;, &quot;PL1&quot;, &quot;PL1&quot;, &quot;PL1&quot;, &quot;PL2&quot;, &quot;PL2&quot;, &quot;PL2&quot;, &quot;PL2&quot;), ID = c(&quot;EUR&quot;, &quot;EUR&quot;, &quot;EUR&quot;, &quot;EUR&quot;, &quot;EUR&quot;, &quot;EUR&quot;, &quot;EUR&quot;, &quot;EUR&quot;, &quot;EUR&quot;, &quot;EUR&quot;, &quot;EUR&quot;, &quot;EUR&quot;, &quot;EUR&quot;, &quot;EUR&quot;, &quot;EUR&quot;, &quot;EUR&quot;, &quot;EUR&quot;, &quot;EUR&quot;, &quot;EUR&quot;, &quot;EUR&quot;, &quot;EUR&quot;)), class = c(&quot;tbl_df&quot;, &quot;tbl&quot;, &quot;data.frame&quot;), row.names = c(NA, -21L))

result &lt;- t1 %&gt;% 
  left_join(t2, by = &quot;Country&quot;) %&gt;% 
  group_by(Country) %&gt;% 
  summarize(League = toString(unique(League)))

print(result)

Output:

# A tibble: 3 &#215; 2
  Country League                      
  &lt;chr&gt;   &lt;chr&gt;                       
1 England PL1, PL2                    
2 France  Ligue 1, Ligue 2            
3 Spain   Liga, Liga 2

huangapple
  • 本文由 发表于 2023年7月27日 19:27:48
  • 转载请务必保留本文链接:https://go.coder-hub.com/76779279.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定