Joining data frames that share columns, but differ in their observation units
Question
Unfortunately, I don't know how to produce a hypothetical dataset to illustrate my question, so I'll just describe what I'm trying to do and hope someone understands.
I have two datasets.
- df1: records which political party is in power in each country in each year. For example, the Republican Party was in power in the US from 2017 to 2021.
- df2: records how political parties change their profile over time. For example, it records that party A was ideologically leftist in 1970 and center-left in 1980. This data frame does not have observations for every year: there is no data on party A's ideology in 1971, because it jumps directly from 1970 to 1980.
What I'm trying to do is pull each party's ideology from df2 into df1.
For example, df1 looks like this:
Country | Year | Government's Political Party ID
X | 1990 | 340
X | 1991 | 340
X | 1992 | 340
X | 1993 | 340
df2 looks like this:
Country | Year | Political Party ID | Ideology
X | 1970 | 340 | center
X | 1985 | 340 | center
X | 1992 | 340 | center-left
X | 1999 | 340 | center-left
df1 and df2 use the same coding of political parties. So you may be thinking: why not just use dplyr::left_join()?
If I do this, my data will look like this:
Country | Year | Government's Political Party ID | Ideology
X | 1990 | 340 | NA
X | 1991 | 340 | NA
X | 1992 | 340 | center-left
X | 1993 | 340 | NA
It would match country, year and ID but produce an unsatisfactory result.
Instead, I want my result to look like this:
Country | Year | Government's Political Party ID | Ideology
X | 1990 | 340 | center
X | 1991 | 340 | center-left
X | 1992 | 340 | center-left
X | 1993 | 340 | center-left
df2 does not have an X-1990 observation, but it can be inferred that 340's ideology has been center since 1985.
How do I perform this?
Answer 1
Score: 2
Try this as a starting point. I used a rolling join, specified with join_by() in the by = argument of left_join() (this needs dplyr 1.1.0 or later), to match each year in df1 to the closest year in df2 that is less than or equal to it.
And do try to provide data so that people can reproduce your question in the future. I copied and pasted your data tables and converted them to R by hand.
library(tidyverse)

df1 <- tribble(
  ~Country, ~Year1, ~`Government_Political Party ID`,
  'x',      1990,   340,
  'x',      1991,   340,
  'x',      1992,   340,
  'x',      1993,   340
)

df2 <- tribble(
  ~Country, ~Year2, ~`Political Party ID`, ~Ideology,
  'x',      1970,   340,                   'center',
  'x',      1985,   340,                   'center',
  'x',      1992,   340,                   'center-left',
  'x',      1999,   340,                   'center-left'
)

# Match on country and party, then roll to the latest Year2 <= Year1
left_join(df1, df2,
          by = join_by(Country,
                       `Government_Political Party ID` == `Political Party ID`,
                       closest(Year1 >= Year2)))
#> # A tibble: 4 × 5
#> Country Year1 `Government_Political Party ID` Year2 Ideology
#> <chr> <dbl> <dbl> <dbl> <chr>
#> 1 x 1990 340 1985 center
#> 2 x 1991 340 1985 center
#> 3 x 1992 340 1992 center-left
#> 4 x 1993 340 1992 center-left
Created on 2023-08-08 with reprex v2.0.2
Answer 2
Score: 0
Here is one possible solution using data.table's roll argument.
This takes into consideration only the times where the ideology has actually changed. In your expected result where Year = 1991, I believe it should still be "center" until there was a change to "center-left" in 1992.
The example below may need more fine tuning depending on your actual data.
library(data.table)

df1 <- data.table(Country = rep("X", 4),
                  Year = 1990:1993,
                  PP_ID = rep(340, 4))

df2 <- data.table(Country = rep("X", 4),
                  Year = c(1970, 1985, 1992, 1999),
                  PP_ID = rep(340, 4),
                  Ideology = c("center", "center", "center-left", "center-left"))

# Join on country and party; roll = TRUE applies to the last `on` column (Year)
# and carries the most recent earlier observation forward.
df2[df1, .(Country, Year, PP_ID = i.PP_ID, Ideology),
    on = .(Country, PP_ID, Year), roll = TRUE]
Country Year PP_ID Ideology
1: X 1990 340 center
2: X 1991 340 center
3: X 1992 340 center-left
4: X 1993 340 center-left
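For readers without data.table, the same roll-forward lookup can be sketched in base R. This is only an illustration, not part of the answer above: roll_lookup is a hypothetical helper name, and it assumes both tables use the Country, Year and PP_ID columns from the example. For each row of df1 it takes the latest df2 observation for the same country and party at or before that year, and returns NA when no earlier observation exists.

```r
# Hypothetical helper (not from the original answer): for each row of df1,
# find the most recent df2 row with the same Country and PP_ID whose Year
# is less than or equal to the df1 Year.
roll_lookup <- function(df1, df2) {
  mapply(function(country, party, year) {
    cand <- df2[df2$Country == country &
                  df2$PP_ID == party &
                  df2$Year <= year, ]
    if (nrow(cand) == 0) return(NA_character_)  # nothing to roll forward
    cand$Ideology[which.max(cand$Year)]         # latest earlier observation
  }, df1$Country, df1$PP_ID, df1$Year, USE.NAMES = FALSE)
}

df1 <- data.frame(Country = "X", Year = 1990:1993, PP_ID = 340,
                  stringsAsFactors = FALSE)
df2 <- data.frame(Country = "X",
                  Year = c(1970, 1985, 1992, 1999),
                  PP_ID = 340,
                  Ideology = c("center", "center", "center-left", "center-left"),
                  stringsAsFactors = FALSE)

df1$Ideology <- roll_lookup(df1, df2)
print(df1)
```

This loops over rows, so it is likely to be slow on large tables; the rolling joins shown in the answers are the better choice there.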
Answer 3
Score: 0
One way is to expand the range of years in df2 using tidyr::complete(), then use tidyr::fill() to fill the other variables, and then do the join.
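library(dplyr)
library(tidyr)

df2 %>%
  # add a row for every year between the first and last observation
  complete(Year = seq(min(Year), max(Year), 1)) %>%
  # carry the last known values down into the new rows
  fill(Country, `Political Party ID`, Ideology) %>%
  right_join(df1,
             by = c("Year",
                    "Country",
                    "Political Party ID" = "Government's Political Party ID"))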
Result (using your original examples where 1991 would be "center"):
# A tibble: 4 × 4
Year Country `Political Party ID` Ideology
<dbl> <chr> <dbl> <chr>
1 1990 X 340 center
2 1991 X 340 center
3 1992 X 340 center-left
4 1993 X 340 center-left
Data:
df1 <- structure(list(Country = c("X", "X", "X", "X"),
Year = 1990:1993,
`Government's Political Party ID` = c(340, 340, 340, 340)),
class = "data.frame", row.names = c(NA, -4L))
df2 <- structure(list(Country = c("X", "X", "X", "X"),
Year = c(1970, 1985, 1992, 1999),
`Political Party ID` = c(340, 340, 340, 340),
Ideology = c("center", "center", "center-left", "center-left")),
class = "data.frame", row.names = c(NA, -4L))
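If df2 holds more than one party or country, the complete-then-fill idea probably needs to run within each group so that one party's ideology is not filled into another's rows. A minimal sketch under that assumption follows; the second party (ID 500) and its rows are made up purely for illustration, and df2_filled and result are hypothetical names.

```r
library(dplyr)
library(tidyr)

df1 <- data.frame(Country = "X", Year = 1990:1993,
                  `Government's Political Party ID` = 340,
                  check.names = FALSE, stringsAsFactors = FALSE)

# Party 500 is invented here to show why grouping matters.
df2 <- data.frame(Country = "X",
                  Year = c(1970, 1985, 1992, 1999, 1988, 1995),
                  `Political Party ID` = c(340, 340, 340, 340, 500, 500),
                  Ideology = c("center", "center", "center-left", "center-left",
                               "right", "right"),
                  check.names = FALSE, stringsAsFactors = FALSE)

df2_filled <- df2 %>%
  group_by(Country, `Political Party ID`) %>%
  complete(Year = full_seq(Year, 1)) %>%   # every year within each party's range
  fill(Ideology, .direction = "down") %>%  # carry the last observation forward
  ungroup()

result <- left_join(df1, df2_filled,
                    by = c("Country", "Year",
                           "Government's Political Party ID" = "Political Party ID"))
print(result)
```

Note that this only fills years between a party's first and last observation in df2; years in df1 outside that range would still come back as NA.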