Joining data frames that share columns, but differ in their observation units

Question

Unfortunately, I don't know how to produce a hypothetical dataset to illustrate my question, so I'll just describe what I'm trying to do and hope someone understands.

I have two datasets.

  • df1: it records which political party is in power in each country in each year. For example: the Republican Party was in power in the US from 2017 to 2021.

  • df2: it records how political parties change their profiles over time. For example: it records that party A was ideologically leftist in 1970 and center-left in 1980. This data frame does not have observations for every year. For example, there is no data on party A's ideology in 1971; it jumps from 1970 directly to 1980.

What I'm trying to do is pull each party's ideology from df2 into df1.

For example, df1 looks like this:

Country | Year | Government's Political Party ID
X       | 1990 | 340
X       | 1991 | 340
X       | 1992 | 340
X       | 1993 | 340

df2 looks like this:

Country | Year | Political Party ID | Ideology
X       | 1970 | 340                | center
X       | 1985 | 340                | center
X       | 1992 | 340                | center-left
X       | 1999 | 340                | center-left

df1 and df2 use the same coding of political parties. So you may be thinking: well, why not just use dplyr::left_join()?

If I do this, my data will look like this:

Country | Year | Government's Political Party ID | Ideology
X       | 1990 | 340                             | NA
X       | 1991 | 340                             | NA
X       | 1992 | 340                             | center-left
X       | 1993 | 340                             | NA

It would match country, year and ID but produce an unsatisfactory result.
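The NA pattern above can be reproduced with a small sketch. It assumes dplyr is installed and uses shortened, hypothetical column names (PartyID instead of the real ID columns); it only illustrates that an exact join on Year finds a partner row only where df2 happens to have an observation for that exact year.

```r
library(dplyr)

# Toy versions of the two tables (PartyID is a shorthand column name assumed here)
df1 <- data.frame(Country = "X",
                  Year = c(1990, 1991, 1992, 1993),
                  PartyID = 340)
df2 <- data.frame(Country = "X",
                  Year = c(1970, 1985, 1992, 1999),
                  PartyID = 340,
                  Ideology = c("center", "center", "center-left", "center-left"))

# An exact-match join: only 1992 exists in both tables
exact <- left_join(df1, df2, by = c("Country", "Year", "PartyID"))
exact$Ideology
```

Only the 1992 row receives a value; every other year is NA because nothing in df2 shares that year.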

Instead, I want my result to look like this:

Country | Year | Government's Political Party ID | Ideology
X       | 1990 | 340                             | center
X       | 1991 | 340                             | center-left
X       | 1992 | 340                             | center-left
X       | 1993 | 340                             | center-left

df2 does not have an X-1990 observation, but it can be inferred that 340's ideology has been center since 1985.

How do I perform this?

Answer 1

Score: 2

Try this as a starting point. I used a rolling join inside the by = argument of left_join(): for each row of df1, closest(Year1 >= Year2) picks the df2 row with the latest Year2 that is not after Year1.

And do try to provide data so that people can reproduce your question in the future. I copy-pasted your data tables and converted them to R by hand.

library(tidyverse)

# join_by() and closest() require dplyr >= 1.1.0
df1 <- 
  tribble(
  ~Country, ~Year1, ~`Government_Political Party ID`,
  'x'      , 1990, 340,
  'x'      , 1991, 340,
  'x'      , 1992, 340,
  'x'      , 1993, 340)


df2 <- 
  tribble(
    ~Country, ~Year2, ~`Political Party ID`, ~Ideology,
    'x'      , 1970, 340               , 'center',
    'x'      , 1985, 340               , 'center',
    'x'      , 1992, 340               , 'center-left',
    'x'      , 1999, 340               , 'center-left'
  )

# Match on country and party exactly; for the year, keep the df2 row
# with the latest Year2 that is <= Year1
left_join(df1, df2,
          by = join_by(Country, 
                       `Government_Political Party ID` == `Political Party ID`,
                       closest(Year1 >= Year2)))
#> # A tibble: 4 × 5
#>   Country Year1 `Government_Political Party ID` Year2 Ideology   
#>   <chr>   <dbl>                           <dbl> <dbl> <chr>      
#> 1 x        1990                             340  1985 center     
#> 2 x        1991                             340  1985 center     
#> 3 x        1992                             340  1992 center-left
#> 4 x        1993                             340  1992 center-left

Created on 2023-08-08 with reprex v2.0.2
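As a further sketch of the same rolling join, the example below assumes dplyr 1.1.0 or later and uses hypothetical table and column names (gov, ideo, PartyID, ObsYear). It tries to show two things the single-party example cannot: the equality condition on the party ID keeps parties separate, and a government year that precedes every observation for its party is left as NA rather than borrowing a later value.

```r
library(dplyr)

# Hypothetical data: two parties, plus one government year (1960)
# that is earlier than any ideology observation for party 340
gov <- tibble(Country = "X",
              Year = c(1990, 1991, 1960),
              PartyID = c(340, 500, 340))
ideo <- tibble(Country = "X",
               ObsYear = c(1970, 1985, 1980),
               PartyID = c(340, 340, 500),
               Ideology = c("center", "center", "right"))

# Rolling join: exact match on Country and PartyID,
# latest ObsYear that is <= Year; then drop the helper year column
res <- left_join(gov, ideo,
                 by = join_by(Country, PartyID, closest(Year >= ObsYear))) %>%
  select(-ObsYear)
res
```

Dropping ObsYear is optional; keeping it can be handy for checking how old the carried-forward observation is.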

Answer 2

Score: 0

Here is one possible solution using data.table's roll argument.

This carries each recorded ideology forward until the next recorded observation. In your expected result, for Year = 1991, I believe it should still be "center", since the change to "center-left" is only recorded in 1992.

The example below may need more fine-tuning depending on your actual data.

library(data.table)
df1 <- data.table(Country = rep("X", 4),
                 Year = 1990:1993,
                 PP_ID = rep(340, 4))

df2 <- data.table(Country = rep("X", 4),
                 Year = c(1970, 1985, 1992, 1999),
                 PP_ID = rep(340, 4),
                 Ideology = c("center","center", "center-left", "center-left"))

# roll = TRUE rolls on the last column in `on` (Year);
# Country and PP_ID must match exactly, so parties are not mixed
df2[df1, .(Country, Year, PP_ID, Ideology), on = .(Country, PP_ID, Year), roll = TRUE]

   Country Year PP_ID    Ideology
1:       X 1990   340      center
2:       X 1991   340      center
3:       X 1992   340 center-left
4:       X 1993   340 center-left
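To make the idea behind a rolling join concrete, here is a minimal base R sketch for a single party. It is not what data.table does internally; it only assumes the observation years are sorted in increasing order, and the vector names (obs_years, gov_years, carried) are made up for illustration. findInterval() returns, for each government year, the position of the latest observation year that is not after it.

```r
# Observation years (must be sorted ascending) and the ideology recorded in each
obs_years <- c(1970, 1985, 1992, 1999)
ideology  <- c("center", "center", "center-left", "center-left")

# Years for which we need an ideology
gov_years <- 1990:1993

# Index of the last observation at or before each government year (0 if none)
idx <- findInterval(gov_years, obs_years)
idx[idx == 0] <- NA  # no earlier observation: leave the value missing

carried <- ideology[idx]
carried
```

The result agrees with the rolled output above: 1990 and 1991 inherit the 1985 value, while 1992 and 1993 take the 1992 value.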

Answer 3

Score: 0

One way is to expand the range of years in df2 using tidyr::complete(), then use tidyr::fill() to fill the other variables, and then do the join.

library(dplyr)
library(tidyr)

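# Note: this assumes df2 holds a single country and party;
# with several, group_by() them before complete() and fill()
df2 %>% 
  # add a row for every year between the first and last observation
  complete(Year = seq(min(Year), max(Year), 1)) %>% 
  # carry the last known values down into the new rows
  fill(Country, `Political Party ID`, Ideology) %>% 
  right_join(df1, 
             by = c("Year", 
                    "Country", 
                    "Political Party ID" = "Government's Political Party ID"))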

Result (using your original examples where 1991 would be "center"):

# A tibble: 4 × 4
   Year Country `Political Party ID` Ideology   
  <dbl> <chr>                  <dbl> <chr>      
1  1990 X                        340 center     
2  1991 X                        340 center     
3  1992 X                        340 center-left
4  1993 X                        340 center-left

Data:

df1 <- structure(list(Country = c("X", "X", "X", "X"), 
                      Year = 1990:1993, 
                      `Government's Political Party ID` = c(340, 340, 340, 340)),
                 class = "data.frame", row.names = c(NA, -4L))

df2 <- structure(list(Country = c("X", "X", "X", "X"), 
                      Year = c(1970, 1985, 1992, 1999), 
                      `Political Party ID` = c(340, 340, 340, 340), 
                      Ideology = c("center", "center", "center-left", "center-left")), 
                 class = "data.frame", row.names = c(NA, -4L))

huangapple
  • Posted on 2023-08-09 06:27:54
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76863537.html