如何处理合并两个数据集的问题?

huangapple go评论73阅读模式
英文:

How to deal with issue merging two datasets?

问题

我在R中使用两个数据集:svolikest。为了背景,我开发了一个新的概念度量(立法权力分享),并正在使用它来复制以前的研究:Svolik(2012)。这个练习的目标是看看在使用我的度量时结果是否不同。

以下是svolik数据:链接

以下是est数据:链接

我从Svolik用来生成他的结果的数据集开始。我成功复制了他的结果(图中的模型1、3和5)。然后,我将他的数据集与包含我的新度量的数据集合并,丢弃了没有完全匹配的观测值:

# 加载原始数据(用于生成原始结果的数据)
svolik <- read_dta("svolik.dta")

# 加载包含我的新度量的数据
est <- read.csv("Merging with Svolik.csv")

# 合并
final <- merge(svolik, est, by = c("ccode", "year"), all = FALSE)

接下来,我再次运行他的模型,但我用我的立法权力分享变量替换了他的立法变量(图中的模型2、4和6)。请注意,尽管数据涵盖相同的时间段,但原始模型和我的模型包含略有不同数量的观测值(2,903与2,934相对比)。

我不能弄清楚为什么会有这些额外的观测值。我猜测这与合并/重复数据或类似的问题有关。您认为这可能是问题吗?如果是这样,您知道如何找出这些观测值是什么吗?解决方案可能非常简单,我可能只是想太多了。任何建议都将不胜感激!请注意,我尝试使用不同的合并策略——在dplyr()中使用left_join——但那没有起作用。

如何处理合并两个数据集的问题?

请注意,我正在Stata中运行结果。以下是原始结果的Stata代码(即模型1、3和5):

* 生存分析

use "leaders, institutions, covariates, updated tvc.dta" 

* 自然死亡
gen c_natural=censoring
replace c_natural=0 if exit!="natural"
replace c_natural=. if exit==""
tab c_natural

stset t, id(leadid) failure(c_natural)

stcox legislature lgdp_1 growth_1 exportersoffuelsmainlyoil_EL2008 ethfrac_FIXED communist mil cw age
outreg2 using survival, replace  ctitle(natural, leg) tex nonotes bdec(3) e(all) ef

* 政变
gen c_coup= censoring
replace c_coup=0 if exit!="coup"
replace c_coup=. if exit==""

stset t, id(leadid) failure(c_coup)
* 移除一些重复的观测值
* drop if (t[_n-1]==t &  leadid[_n-1]== leadid) 
stset t, id(leadid) failure(c_coup)

stcox legislature  lgdp_1 growth_1 exportersoffuelsmainlyoil_EL2008 ethfrac_FIXED communist mil cw age
outreg2 using survival, ctitle(coups, leg) tex nonotes bdec(3) e(all) ef
 

* 暴动
gen c_revolt= censoring
replace c_revolt=0 if exit!="revolt"
replace c_revolt=. if exit==""
tab c_revolt

stset t, id(leadid) failure(c_revolt)
* 去掉共产主义者,因为它是完美的预测器

stcox legislature lgdp_1 growth_1 exportersoffuelsmainlyoil_EL2008 ethfrac_FIXED  mil cw age
outreg2 using survival, ctitle(revolt, leg) tex nonotes bdec(3) e(all) ef

以下是新结果的Stata代码(即模型2、4和6):

* 生存分析

use "merged_test.dta" 

* 自然死亡
gen c_natural=censoring
replace c_natural=0 if exit!="natural"
replace c_natural=. if exit==""
tab c_natural

stset t, id(leadid) failure(c_natural)

stcox estimate lgdp_1 growth_1 exportersoffuelsmainlyoil_EL2008 ethfrac_FIXED communist mil cw age
outreg2 using survival, replace  ctitle(natural, leg) tex nonotes bdec(3) e(all) ef

* 政变
gen c_coup= censoring
replace c_coup=0 if exit!="coup"
replace c_coup=. if exit==""

stset t, id(leadid) failure(c_coup)
* 移除一些重复的观测值
* drop if (t[_n-1]==t &  leadid[_n-1]== leadid) 
stset t, id(leadid) failure(c_coup)

stcox estimate  lgdp_1 growth_1 exportersoffuelsmainlyoil_EL2008 ethfrac_FIXED communist mil cw age
outreg2 using survival, ctitle(coups, leg) tex nonotes bdec(3) e(all) ef
 

* 暴动
gen c_revolt= censoring
replace c_revolt=0 if exit!="revolt"
replace c_revolt=. if exit==""
tab c_revolt

stset t, id(leadid) failure(c_revolt)
* 去掉共产主义者,因为它是完美的预测器

stcox estimate lgdp_1 growth_1 exportersoffuelsmainlyoil_EL2008 ethfrac_FIXED  mil cw age
outreg2 using survival, ctitle(revolt, leg) tex nonotes bdec(3) e(all) ef

<details>
<summary>英文:</summary>

I am working with two datasets in `R`: `svolik` and `est`. For context, I have developed a new measure of a concept (legislative power sharing), and I am using it to replicate a previous study: Svolik (2012). The goal of the exercise is to see if the results are different when using my measure.

Here is the `svolik` data: https://drive.google.com/file/d/1nCBhRXNcBrLEr6-R2pkyuQ9mCtJKkdmm/view?usp=sharing

Here is the `est` data: https://drive.google.com/file/d/1D-UmHSi9LIEsmY5VBvU8nxu8u1gix7Ay/view?usp=sharing

I started with the dataset that Svolik used to generate his results. I successfully reproduced his results (models 1, 3, and 5 in the figure). I then merged his dataset with the dataset containing my new measure, discarding any observations for which there was no exact match:

load original data (the data used to produce original results)

svolik <- read_dta("svolik.dta")

load data containing my new measure

est <- read.csv("Merging with Svolik.csv")

merge

final <- merge(svolik, est, by = c("ccode", "year"), all = FALSE)

Next, I run his models again, but I replace his Legislature variable with my Legislative Power Sharing variable (models 2, 4, and 6 in the figure). Notice that, despite the data covering the same period of time, the original models and my own contain slightly different numbers of observations (2,903 as opposed to 2,934).

I cannot for the life of me figure out why I am getting these extra observations. My guess is that it has something to do with merging/duplicates or something like that. Does this seem like the likely problem to you? If so, do you know of a way to find out what those observations are? The solution is probably quite simple, and I am probably just overthinking things. Any advice would be appreciated! Note that I tried using a different merging strategy --- `left_join` in `dplyr()`--- but that didn&#39;t work.

[![enter image description here][1]][1]


  [1]: https://i.stack.imgur.com/DGQEJ.jpg

Note that I am running the results in Stata. Here is the Stata code for the original results (i.e., models 1, 3, and 5):

  • SURVIVAL ANALYSIS

use "leaders, institutions, covariates, updated tvc.dta"

  • NATURAL DEATHS
    gen c_natural=censoring
    replace c_natural=0 if exit!="natural"
    replace c_natural=. if exit==""
    tab c_natural

stset t, id(leadid) failure(c_natural)

stcox legislature lgdp_1 growth_1 exportersoffuelsmainlyoil_EL2008 ethfrac_FIXED communist mil cw age
outreg2 using survival, replace ctitle(natural, leg) tex nonotes bdec(3) e(all) ef

  • COUPS
    gen c_coup= censoring
    replace c_coup=0 if exit!="coup"
    replace c_coup=. if exit==""

stset t, id(leadid) failure(c_coup)

  • REMOVE SOM DUPLICATE OBSERVATIONS
  • drop if (t[_n-1]==t & leadid[_n-1]== leadid)
    stset t, id(leadid) failure(c_coup)

stcox legislature lgdp_1 growth_1 exportersoffuelsmainlyoil_EL2008 ethfrac_FIXED communist mil cw age
outreg2 using survival, ctitle(coups, leg) tex nonotes bdec(3) e(all) ef

  • REVOLTS
    gen c_revolt= censoring
    replace c_revolt=0 if exit!="revolt"
    replace c_revolt=. if exit==""
    tab c_revolt

stset t, id(leadid) failure(c_revolt)

  • COMMUNIST LEFT OUT BECAUSE IT IS A PERFECT PREDICTOR

stcox legislature lgdp_1 growth_1 exportersoffuelsmainlyoil_EL2008 ethfrac_FIXED mil cw age
outreg2 using survival, ctitle(revolt, leg) tex nonotes bdec(3) e(all) ef

Here is the Stata code for the new results (i.e., models 2, 4, and 6):

  • SURVIVAL ANALYSIS

use "merged_test.dta"

  • NATURAL DEATHS
    gen c_natural=censoring
    replace c_natural=0 if exit!="natural"
    replace c_natural=. if exit==""
    tab c_natural

stset t, id(leadid) failure(c_natural)

stcox estimate lgdp_1 growth_1 exportersoffuelsmainlyoil_EL2008 ethfrac_FIXED communist mil cw age
outreg2 using survival, replace ctitle(natural, leg) tex nonotes bdec(3) e(all) ef

  • COUPS
    gen c_coup= censoring
    replace c_coup=0 if exit!="coup"
    replace c_coup=. if exit==""

stset t, id(leadid) failure(c_coup)

  • REMOVE SOM DUPLICATE OBSERVATIONS
  • drop if (t[_n-1]==t & leadid[_n-1]== leadid)
    stset t, id(leadid) failure(c_coup)

stcox estimate lgdp_1 growth_1 exportersoffuelsmainlyoil_EL2008 ethfrac_FIXED communist mil cw age
outreg2 using survival, ctitle(coups, leg) tex nonotes bdec(3) e(all) ef

  • REVOLTS
    gen c_revolt= censoring
    replace c_revolt=0 if exit!="revolt"
    replace c_revolt=. if exit==""
    tab c_revolt

stset t, id(leadid) failure(c_revolt)

  • COMMUNIST LEFT OUT BECAUSE IT IS A PERFECT PREDICTOR

stcox estimate lgdp_1 growth_1 exportersoffuelsmainlyoil_EL2008 ethfrac_FIXED mil cw age
outreg2 using survival, ctitle(revolt, leg) tex nonotes bdec(3) e(all) ef


</details>


# 答案1
**得分**: 5

### 问题解释

问题不在于 `svolik` 有 2903 个观测值,`final` 有 2934 个观测值,因此 `final` 是由合并中的一些重复行引起的 `svolik` 的超集。您永远不会有2903行可以在两个回归中都使用。这是因为在 Svolik 回归中使用的并非所有完整观测值都在 `est` 数据集中有对应的行。首先让我们了解一下 `svolik` 中的 2903 个观测值是从哪里来的:

```r
svolik_reg_cols <- c("legislative", "lgdp_1", "growth_1", "exportersoffuelsmainlyoil_EL2008", "ethfrac_FIXED", "communist", "mil", "cw", "age")
svolik_is_complete <- complete.cases(svolik[, svolik_reg_cols])
sum(svolik_is_complete) # 2903

正如您可以看到的,这是回归中所有列的完整案例数。现在让我们使用您的连接方法对 final 进行相同的操作:

final <- merge(svolik, est, by = c("ccode", "year"), all = FALSE)
final_reg_cols <- svolik_reg_cols
final_reg_cols[final_reg_cols == "legislative"] <- "estimate"
final_is_complete <- complete.cases(final[, final_reg_cols])
sum(final_is_complete) # 2934

同样,2934 是没有任何协变量数据缺失的观测值数。

但是,让我们看一下您正在连接的数据集。在 svolik 中有 278 组 ccodeyear,而这些组在 est 中没有出现。

# svolik 中有但 est 中没有的 ccode 和 year 数量
dplyr::anti_join(
    svolik,
    est,
    by = c("ccode", "year")
) |>
    group_by(ccode, cabb, year) |>
    summarise(n = n()) |>
    arrange(desc(n)) |>
    print(n = 2)

# # A tibble: 278 × 3
# # Groups:   ccode [39]
#   ccode  year     n
#   <dbl> <dbl> <int>
# 1   990  1982     4
# 2   947  2001     3
# # … with 276 more rows

这意味着使用您目前的数据,不可能在所有观测值之间进行比较。

解决方案

您有三个选项:

  1. 获取更多数据。
  2. 对缺失值进行插补。
  3. 限制回归到共同的观测值。

您将知道 1. 或 2. 是否可行。但是,由于您的分析目的似乎是将您的新度量与 Svolik 进行比较,所以 3. 似乎是一个合理的方法,特别是因为您不会删除太多行。

首先找到共同的行(共有 2830 个),并保存到 dta 中:

all_complete <- complete.cases(final[, c("estimate", svolik_reg_cols)])
sum(all_complete) # 2830
final_complete <- final[all_complete, ]
write_dta(final_complete, "./tmp/svolik_est_merged.dta")

Stata 代码

现在可以在 Stata 中运行回归。首先加载并准备数据,与以前一样:

use svolik_est_merged.dta, clear

* 自然死亡
cap drop c_natural c_coup c_revolt _d _t _t0
gen c_natural=censoring
replace c_natural=0 if exit!="natural"
replace c_natural=. if exit==""
tab c_natural

stset t, id(leadid) failure(c_natural)

然后运行 Svolik 回归。您可以看到有 2830 个观测值:

stcox legislative lgdp_1 growth_1 exportersoffuelsmainlyoil_EL2008 ethfrac_FIXED communist mil cw age
Cox regression with Breslow method for ties

No. of subjects =   383                                 Number of obs =  2,830
No. of failures =    40
Time at risk    = 3,098
                                                        LR chi2(9)    =  28.46
Log likelihood = -157.48569                             Prob > chi2   = 0.0008

-------------------------------------------------------------------------------------
                 _t | Haz. ratio   Std. err.      z    P>|z|     [95% conf. interval]
--------------------+----------------------------------------------------------------
        legislative |   1.006541   .0088251     0.74   0.457     .9893923    1.023988
             lgdp_1 |   1.437144   .3138694     1.66   0.097     .9366983    2.204962
           growth_1 |   1.010814   .0283629     0.38   0.701      .956725    1.067962
exportersoffue~2008 |   2.487166   1.205382     1.88   0.060     .9620061    6.430308
      ethfrac_FIXED |   1.011694     .00645     1.82   0.068     .9991306    1.024415
          communist |     2.0526   1.610128     0.92   0.359     .4411573    9.550262
                mil |    1.06844   .3944057     0.18   0.858     .5182463    2.202744
                 cw |    4.15784   2.325053     2.55   0.011     1.389562    12.44106
                age |   1.057077   .0172812     3.40   0.001     1.023744    1.091496
-------------------------------------------------------------------------------------

然后运行您的回归:

stcox estimate lgdp_1 growth

<details>
<summary>英文:</summary>

### Explanation of the problem

Your problem is not that `svolik` has 2903 observations, `final` has 2934 and therefore `final` is a superset of `svolik` caused by some duplicate rows in the merge. You will *never* have 2903 rows that you can use in both regressions. This is because not all complete observations used in the Svolik regression have a corresponding row in the `est` dataset. Firstly let&#39;s understand where the 2903 observations in `svolik` come from:

```r
svolik_reg_cols &lt;- c(&quot;legislative&quot;, &quot;lgdp_1&quot;, &quot;growth_1&quot;, &quot;exportersoffuelsmainlyoil_EL2008&quot;, &quot;ethfrac_FIXED&quot;, &quot;communist&quot;, &quot;mil&quot;, &quot;cw&quot;, &quot;age&quot;)
svolik_is_complete &lt;- complete.cases(svolik[, svolik_reg_cols])
sum(svolik_is_complete) # 2903

As you can see, it is the number of complete cases for all the columns in the regression. Now let's do the same with final, using your join method:

final &lt;- merge(svolik, est, by = c(&quot;ccode&quot;, &quot;year&quot;), all = FALSE)
final_reg_cols &lt;- svolik_reg_cols
final_reg_cols[final_reg_cols == &quot;legislative&quot;] &lt;- &quot;estimate&quot;
final_is_complete &lt;- complete.cases(final[, final_reg_cols])
sum(final_is_complete) # 2934

Again, 2934 is the number of observations with no missing data for any of the covariates.

However, let's look at the datasets you are joining. There are 278 sets of ccode and year in svolik that do not appear in est.

# How many ccode and year are in svolik but not est
dplyr::anti_join(
    svolik,
    est,
    by = c(&quot;ccode&quot;, &quot;year&quot;)
) |&gt;
    group_by(ccode, cabb, year) |&gt;
    summarise(n = n()) |&gt;
    arrange(desc(n)) |&gt;
    print(n = 2)

# # A tibble: 278 &#215; 3
# # Groups:   ccode [39]
#   ccode  year     n
#   &lt;dbl&gt; &lt;dbl&gt; &lt;int&gt;
# 1   990  1982     4
# 2   947  2001     3
# # … with 276 more rows

This means that with the data you have, it is impossible to compare the results across all observations.

Solutions

You have three options:

  1. Get more data.
  2. Impute missing values.
  3. Restrict the regression to common observations.

You will know whether 1. or 2. are possible. However as the purpose of your analysis seems to be to compare your new metric to Svolik, 3. seems a reasonable approach, particularly as you do not end up dropping many rows.

First find the common rows (there are 2830) and save to dta:

all_complete &lt;- complete.cases(final[, c(&quot;estimate&quot;, svolik_reg_cols)])
sum(all_complete) # 2830
final_complete &lt;- final[all_complete, ]
write_dta(final_complete, &quot;./tmp/svolik_est_merged.dta&quot;)

Stata code

You can now run the regression in Stata. Firstly load and prepare the data as previously:

use svolik_est_merged.dta, clear

* NATURAL DEATHS
cap drop c_natural c_coup c_revolt _d _t _t0
gen c_natural=censoring
replace c_natural=0 if exit!=&quot;natural&quot;
replace c_natural=. if exit==&quot;&quot;
tab c_natural

stset t, id(leadid) failure(c_natural)

Now run the Svolik regression. You can see there are 2830 observations:

stcox legislative lgdp_1 growth_1 exportersoffuelsmainlyoil_EL2008 ethfrac_FIXED communist mil cw age
Cox regression with Breslow method for ties

No. of subjects =   383                                 Number of obs =  2,830
No. of failures =    40
Time at risk    = 3,098
                                                        LR chi2(9)    =  28.46
Log likelihood = -157.48569                             Prob &gt; chi2   = 0.0008

-------------------------------------------------------------------------------------
                 _t | Haz. ratio   Std. err.      z    P&gt;|z|     [95% conf. interval]
--------------------+----------------------------------------------------------------
        legislative |   1.006541   .0088251     0.74   0.457     .9893923    1.023988
             lgdp_1 |   1.437144   .3138694     1.66   0.097     .9366983    2.204962
           growth_1 |   1.010814   .0283629     0.38   0.701      .956725    1.067962
exportersoffue~2008 |   2.487166   1.205382     1.88   0.060     .9620061    6.430308
      ethfrac_FIXED |   1.011694     .00645     1.82   0.068     .9991306    1.024415
          communist |     2.0526   1.610128     0.92   0.359     .4411573    9.550262
                mil |    1.06844   .3944057     0.18   0.858     .5182463    2.202744
                 cw |    4.15784   2.325053     2.55   0.011     1.389562    12.44106
                age |   1.057077   .0172812     3.40   0.001     1.023744    1.091496
-------------------------------------------------------------------------------------

Then run your regression:

stcox estimate lgdp_1 growth_1 exportersoffuelsmainlyoil_EL2008 ethfrac_FIXED communist mil cw age

Output:

Cox regression with Breslow method for ties

No. of subjects =   383                                 Number of obs =  2,830
No. of failures =    40
Time at risk    = 3,098
                                                        LR chi2(9)    =  28.00
Log likelihood = -157.71273                             Prob &gt; chi2   = 0.0010

-------------------------------------------------------------------------------------
                 _t | Haz. ratio   Std. err.      z    P&gt;|z|     [95% conf. interval]
--------------------+----------------------------------------------------------------
           estimate |   .9742007   .1278445    -0.20   0.842     .7532603    1.259946
             lgdp_1 |   1.506868   .3265272     1.89   0.058     .9854309    2.304222
           growth_1 |   1.007996    .028074     0.29   0.775      .954447     1.06455
exportersoffue~2008 |   2.147553   1.257702     1.31   0.192     .6814636    6.767761
      ethfrac_FIXED |   1.011719   .0070275     1.68   0.093     .9980384    1.025587
          communist |   2.064115   1.619767     0.92   0.356     .4433766    9.609369
                mil |   1.018648   .3747256     0.05   0.960     .4953321    2.094845
                 cw |   3.961413   2.202203     2.48   0.013     1.332464    11.77727
                age |   1.054575   .0174756     3.21   0.001     1.020873    1.089389
-------------------------------------------------------------------------------------

Again 2830 observations. The results seem pretty similar to me: the same two covariates (cw and age) have small p-values and all coefficients are close to Svolik. If you're trying to develop a metric which tells you something new, perhaps not what you want to hear. However, if you are trying to find out whether your metric is robust by comparing to an established one, perhaps that's better news.

One caveat of course is to think about whether the rows that you cannot join are missing completely at random. If there is a systemic issue causing data to be missing related to your dependent variable (e.g. authoritarian countries are more likely to be prevent data being published for certain years), it may not be reasonable to infer that similar results here can be extrapolated to missing data. One way to look at this would be to compare your Svolik results with the 2830 rows to the results with the 2903 rows and see if they are meaningfully different.

Edit: how to see which rows are not included in both

To answer your question in the comments. You wanted to see which c_natural failure was included in svolik but not final:

# Create dataset of complete svolik observations
svolik_complete &lt;- svolik[svolik_is_complete, ]

# Subset data to failure of interest
svolik_failures &lt;- svolik_complete |&gt;
    filter(c_natural == 1)
final_failures &lt;- final_complete |&gt;
    filter(c_natural == 1)

# Find failures in svolik_complete but not in final_complete
anti_join(svolik_failures, final_failures, by = c(&quot;ccode&quot;, &quot;year&quot;)) |&gt;
    select(ccode, cabb, year)
# # A tibble: 1 &#215; 3
#   ccode cabb   year
#   &lt;dbl&gt; &lt;chr&gt; &lt;dbl&gt;
# 1   640 TUR    1972 

You can do the same for c_coup or c_revolt.

huangapple
  • 本文由 发表于 2023年7月17日 21:24:56
  • 转载请务必保留本文链接:https://go.coder-hub.com/76704935.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定