2023年7月17日 21:24:56go评论98阅读模式

英文:

How to deal with issue merging two datasets?

问题

我在R中使用两个数据集：svolik和est。为了背景，我开发了一个新的概念度量（立法权力分享），并正在使用它来复制以前的研究：Svolik（2012）。这个练习的目标是看看在使用我的度量时结果是否不同。

以下是svolik数据：链接

以下是est数据：链接

我从Svolik用来生成他的结果的数据集开始。我成功复制了他的结果（图中的模型1、3和5）。然后，我将他的数据集与包含我的新度量的数据集合并，丢弃了没有完全匹配的观测值：

# 加载原始数据（用于生成原始结果的数据）
svolik <- read_dta("svolik.dta")
# 加载包含我的新度量的数据
est <- read.csv("Merging with Svolik.csv")
# 合并
final <- merge(svolik, est, by = c("ccode", "year"), all = FALSE)

接下来，我再次运行他的模型，但我用我的立法权力分享变量替换了他的立法变量（图中的模型2、4和6）。请注意，尽管数据涵盖相同的时间段，但原始模型和我的模型包含略有不同数量的观测值（2,903与2,934相对比）。

我不能弄清楚为什么会有这些额外的观测值。我猜测这与合并/重复数据或类似的问题有关。您认为这可能是问题吗？如果是这样，您知道如何找出这些观测值是什么吗？解决方案可能非常简单，我可能只是想太多了。任何建议都将不胜感激！请注意，我尝试使用不同的合并策略——在dplyr()中使用left_join——但那没有起作用。

请注意，我正在Stata中运行结果。以下是原始结果的Stata代码（即模型1、3和5）：

* 生存分析
use "leaders, institutions, covariates, updated tvc.dta" 
* 自然死亡
gen c_natural=censoring
replace c_natural=0 if exit!="natural"
replace c_natural=. if exit==""
tab c_natural
stset t, id(leadid) failure(c_natural)
stcox legislature lgdp_1 growth_1 exportersoffuelsmainlyoil_EL2008 ethfrac_FIXED communist mil cw age
outreg2 using survival, replace  ctitle(natural, leg) tex nonotes bdec(3) e(all) ef
* 政变
gen c_coup= censoring
replace c_coup=0 if exit!="coup"
replace c_coup=. if exit==""
stset t, id(leadid) failure(c_coup)
* 移除一些重复的观测值
* drop if (t[_n-1]==t &  leadid[_n-1]== leadid) 
stset t, id(leadid) failure(c_coup)
stcox legislature  lgdp_1 growth_1 exportersoffuelsmainlyoil_EL2008 ethfrac_FIXED communist mil cw age
outreg2 using survival, ctitle(coups, leg) tex nonotes bdec(3) e(all) ef
 
* 暴动
gen c_revolt= censoring
replace c_revolt=0 if exit!="revolt"
replace c_revolt=. if exit==""
tab c_revolt
stset t, id(leadid) failure(c_revolt)
* 去掉共产主义者，因为它是完美的预测器
stcox legislature lgdp_1 growth_1 exportersoffuelsmainlyoil_EL2008 ethfrac_FIXED  mil cw age
outreg2 using survival, ctitle(revolt, leg) tex nonotes bdec(3) e(all) ef

以下是新结果的Stata代码（即模型2、4和6）：

* 生存分析
use "merged_test.dta" 
* 自然死亡
gen c_natural=censoring
replace c_natural=0 if exit!="natural"
replace c_natural=. if exit==""
tab c_natural
stset t, id(leadid) failure(c_natural)
stcox estimate lgdp_1 growth_1 exportersoffuelsmainlyoil_EL2008 ethfrac_FIXED communist mil cw age
outreg2 using survival, replace  ctitle(natural, leg) tex nonotes bdec(3) e(all) ef
* 政变
gen c_coup= censoring
replace c_coup=0 if exit!="coup"
replace c_coup=. if exit==""
stset t, id(leadid) failure(c_coup)
* 移除一些重复的观测值
* drop if (t[_n-1]==t &  leadid[_n-1]== leadid) 
stset t, id(leadid) failure(c_coup)
stcox estimate  lgdp_1 growth_1 exportersoffuelsmainlyoil_EL2008 ethfrac_FIXED communist mil cw age
outreg2 using survival, ctitle(coups, leg) tex nonotes bdec(3) e(all) ef
 
* 暴动
gen c_revolt= censoring
replace c_revolt=0 if exit!="revolt"
replace c_revolt=. if exit==""
tab c_revolt
stset t, id(leadid) failure(c_revolt)
* 去掉共产主义者，因为它是完美的预测器
stcox estimate lgdp_1 growth_1 exportersoffuelsmainlyoil_EL2008 ethfrac_FIXED  mil cw age
outreg2 using survival, ctitle(revolt, leg) tex nonotes bdec(3) e(all) ef
<details>
<summary>英文:</summary>
I am working with two datasets in `R`: `svolik` and `est`. For context, I have developed a new measure of a concept (legislative power sharing), and I am using it to replicate a previous study: Svolik (2012). The goal of the exercise is to see if the results are different when using my measure.
Here is the `svolik` data: https://drive.google.com/file/d/1nCBhRXNcBrLEr6-R2pkyuQ9mCtJKkdmm/view?usp=sharing
Here is the `est` data: https://drive.google.com/file/d/1D-UmHSi9LIEsmY5VBvU8nxu8u1gix7Ay/view?usp=sharing
I started with the dataset that Svolik used to generate his results. I successfully reproduced his results (models 1, 3, and 5 in the figure). I then merged his dataset with the dataset containing my new measure, discarding any observations for which there was no exact match:

load original data (the data used to produce original results)

svolik <- read_dta("svolik.dta")

load data containing my new measure

est <- read.csv("Merging with Svolik.csv")

merge

final <- merge(svolik, est, by = c("ccode", "year"), all = FALSE)

Next, I run his models again, but I replace his Legislature variable with my Legislative Power Sharing variable (models 2, 4, and 6 in the figure). Notice that, despite the data covering the same period of time, the original models and my own contain slightly different numbers of observations (2,903 as opposed to 2,934).
I cannot for the life of me figure out why I am getting these extra observations. My guess is that it has something to do with merging/duplicates or something like that. Does this seem like the likely problem to you? If so, do you know of a way to find out what those observations are? The solution is probably quite simple, and I am probably just overthinking things. Any advice would be appreciated! Note that I tried using a different merging strategy --- `left_join` in `dplyr()`--- but that didn&#39;t work.
[![enter image description here][1]][1]
  [1]: https://i.stack.imgur.com/DGQEJ.jpg
Note that I am running the results in Stata. Here is the Stata code for the original results (i.e., models 1, 3, and 5):

SURVIVAL ANALYSIS

use "leaders, institutions, covariates, updated tvc.dta"

NATURAL DEATHS
gen c_natural=censoring
replace c_natural=0 if exit!="natural"
replace c_natural=. if exit==""
tab c_natural

stset t, id(leadid) failure(c_natural)

stcox legislature lgdp_1 growth_1 exportersoffuelsmainlyoil_EL2008 ethfrac_FIXED communist mil cw age
outreg2 using survival, replace ctitle(natural, leg) tex nonotes bdec(3) e(all) ef

COUPS
gen c_coup= censoring
replace c_coup=0 if exit!="coup"
replace c_coup=. if exit==""

stset t, id(leadid) failure(c_coup)

REMOVE SOM DUPLICATE OBSERVATIONS
drop if (t[_n-1]==t & leadid[_n-1]== leadid)
stset t, id(leadid) failure(c_coup)

stcox legislature lgdp_1 growth_1 exportersoffuelsmainlyoil_EL2008 ethfrac_FIXED communist mil cw age
outreg2 using survival, ctitle(coups, leg) tex nonotes bdec(3) e(all) ef

REVOLTS
gen c_revolt= censoring
replace c_revolt=0 if exit!="revolt"
replace c_revolt=. if exit==""
tab c_revolt

stset t, id(leadid) failure(c_revolt)

COMMUNIST LEFT OUT BECAUSE IT IS A PERFECT PREDICTOR

stcox legislature lgdp_1 growth_1 exportersoffuelsmainlyoil_EL2008 ethfrac_FIXED mil cw age
outreg2 using survival, ctitle(revolt, leg) tex nonotes bdec(3) e(all) ef

Here is the Stata code for the new results (i.e., models 2, 4, and 6):

SURVIVAL ANALYSIS

use "merged_test.dta"

NATURAL DEATHS
gen c_natural=censoring
replace c_natural=0 if exit!="natural"
replace c_natural=. if exit==""
tab c_natural

stset t, id(leadid) failure(c_natural)

stcox estimate lgdp_1 growth_1 exportersoffuelsmainlyoil_EL2008 ethfrac_FIXED communist mil cw age
outreg2 using survival, replace ctitle(natural, leg) tex nonotes bdec(3) e(all) ef

COUPS
gen c_coup= censoring
replace c_coup=0 if exit!="coup"
replace c_coup=. if exit==""

stset t, id(leadid) failure(c_coup)

REMOVE SOM DUPLICATE OBSERVATIONS
drop if (t[_n-1]==t & leadid[_n-1]== leadid)
stset t, id(leadid) failure(c_coup)

stcox estimate lgdp_1 growth_1 exportersoffuelsmainlyoil_EL2008 ethfrac_FIXED communist mil cw age
outreg2 using survival, ctitle(coups, leg) tex nonotes bdec(3) e(all) ef

REVOLTS
gen c_revolt= censoring
replace c_revolt=0 if exit!="revolt"
replace c_revolt=. if exit==""
tab c_revolt

stset t, id(leadid) failure(c_revolt)

COMMUNIST LEFT OUT BECAUSE IT IS A PERFECT PREDICTOR

stcox estimate lgdp_1 growth_1 exportersoffuelsmainlyoil_EL2008 ethfrac_FIXED mil cw age
outreg2 using survival, ctitle(revolt, leg) tex nonotes bdec(3) e(all) ef


</details>
# 答案1
**得分**: 5
### 问题解释
问题不在于 `svolik` 有 2903 个观测值，`final` 有 2934 个观测值，因此 `final` 是由合并中的一些重复行引起的 `svolik` 的超集。您永远不会有2903行可以在两个回归中都使用。这是因为在 Svolik 回归中使用的并非所有完整观测值都在 `est` 数据集中有对应的行。首先让我们了解一下 `svolik` 中的 2903 个观测值是从哪里来的：
```r
svolik_reg_cols <- c("legislative", "lgdp_1", "growth_1", "exportersoffuelsmainlyoil_EL2008", "ethfrac_FIXED", "communist", "mil", "cw", "age")
svolik_is_complete <- complete.cases(svolik[, svolik_reg_cols])
sum(svolik_is_complete) # 2903

正如您可以看到的，这是回归中所有列的完整案例数。现在让我们使用您的连接方法对 final 进行相同的操作：

final <- merge(svolik, est, by = c("ccode", "year"), all = FALSE)
final_reg_cols <- svolik_reg_cols
final_reg_cols[final_reg_cols == "legislative"] <- "estimate"
final_is_complete <- complete.cases(final[, final_reg_cols])
sum(final_is_complete) # 2934

同样，2934 是没有任何协变量数据缺失的观测值数。

但是，让我们看一下您正在连接的数据集。在 svolik 中有 278 组 ccode 和 year，而这些组在 est 中没有出现。

# svolik 中有但 est 中没有的 ccode 和 year 数量
dplyr::anti_join(
    svolik,
    est,
    by = c("ccode", "year")
) |>
    group_by(ccode, cabb, year) |>
    summarise(n = n()) |>
    arrange(desc(n)) |>
    print(n = 2)
# # A tibble: 278 × 3
# # Groups:   ccode [39]
#   ccode  year     n
#   <dbl> <dbl> <int>
# 1   990  1982     4
# 2   947  2001     3
# # … with 276 more rows

这意味着使用您目前的数据，不可能在所有观测值之间进行比较。

解决方案

您有三个选项：

获取更多数据。
对缺失值进行插补。
限制回归到共同的观测值。

您将知道 1. 或 2. 是否可行。但是，由于您的分析目的似乎是将您的新度量与 Svolik 进行比较，所以 3. 似乎是一个合理的方法，特别是因为您不会删除太多行。

首先找到共同的行（共有 2830 个），并保存到 dta 中：

all_complete <- complete.cases(final[, c("estimate", svolik_reg_cols)])
sum(all_complete) # 2830
final_complete <- final[all_complete, ]
write_dta(final_complete, "./tmp/svolik_est_merged.dta")

Stata 代码

现在可以在 Stata 中运行回归。首先加载并准备数据，与以前一样：

use svolik_est_merged.dta, clear
* 自然死亡
cap drop c_natural c_coup c_revolt _d _t _t0
gen c_natural=censoring
replace c_natural=0 if exit!="natural"
replace c_natural=. if exit==""
tab c_natural
stset t, id(leadid) failure(c_natural)

然后运行 Svolik 回归。您可以看到有 2830 个观测值：

stcox legislative lgdp_1 growth_1 exportersoffuelsmainlyoil_EL2008 ethfrac_FIXED communist mil cw age

Cox regression with Breslow method for ties
No. of subjects =   383                                 Number of obs =  2,830
No. of failures =    40
Time at risk    = 3,098
                                                        LR chi2(9)    =  28.46
Log likelihood = -157.48569                             Prob > chi2   = 0.0008
-------------------------------------------------------------------------------------
                 _t | Haz. ratio   Std. err.      z    P>|z|     [95% conf. interval]
--------------------+----------------------------------------------------------------
        legislative |   1.006541   .0088251     0.74   0.457     .9893923    1.023988
             lgdp_1 |   1.437144   .3138694     1.66   0.097     .9366983    2.204962
           growth_1 |   1.010814   .0283629     0.38   0.701      .956725    1.067962
exportersoffue~2008 |   2.487166   1.205382     1.88   0.060     .9620061    6.430308
      ethfrac_FIXED |   1.011694     .00645     1.82   0.068     .9991306    1.024415
          communist |     2.0526   1.610128     0.92   0.359     .4411573    9.550262
                mil |    1.06844   .3944057     0.18   0.858     .5182463    2.202744
                 cw |    4.15784   2.325053     2.55   0.011     1.389562    12.44106
                age |   1.057077   .0172812     3.40   0.001     1.023744    1.091496
-------------------------------------------------------------------------------------

然后运行您的回归：

stcox estimate lgdp_1 growth
<details>
<summary>英文:</summary>
### Explanation of the problem
Your problem is not that `svolik` has 2903 observations, `final` has 2934 and therefore `final` is a superset of `svolik` caused by some duplicate rows in the merge. You will *never* have 2903 rows that you can use in both regressions. This is because not all complete observations used in the Svolik regression have a corresponding row in the `est` dataset. Firstly let&#39;s understand where the 2903 observations in `svolik` come from:
```r
svolik_reg_cols &lt;- c(&quot;legislative&quot;, &quot;lgdp_1&quot;, &quot;growth_1&quot;, &quot;exportersoffuelsmainlyoil_EL2008&quot;, &quot;ethfrac_FIXED&quot;, &quot;communist&quot;, &quot;mil&quot;, &quot;cw&quot;, &quot;age&quot;)
svolik_is_complete &lt;- complete.cases(svolik[, svolik_reg_cols])
sum(svolik_is_complete) # 2903

As you can see, it is the number of complete cases for all the columns in the regression. Now let's do the same with final, using your join method:

final &lt;- merge(svolik, est, by = c(&quot;ccode&quot;, &quot;year&quot;), all = FALSE)
final_reg_cols &lt;- svolik_reg_cols
final_reg_cols[final_reg_cols == &quot;legislative&quot;] &lt;- &quot;estimate&quot;
final_is_complete &lt;- complete.cases(final[, final_reg_cols])
sum(final_is_complete) # 2934

Again, 2934 is the number of observations with no missing data for any of the covariates.

However, let's look at the datasets you are joining. There are 278 sets of ccode and year in svolik that do not appear in est.

# How many ccode and year are in svolik but not est
dplyr::anti_join(
    svolik,
    est,
    by = c(&quot;ccode&quot;, &quot;year&quot;)
) |&gt;
    group_by(ccode, cabb, year) |&gt;
    summarise(n = n()) |&gt;
    arrange(desc(n)) |&gt;
    print(n = 2)
# # A tibble: 278 &#215; 3
# # Groups:   ccode [39]
#   ccode  year     n
#   &lt;dbl&gt; &lt;dbl&gt; &lt;int&gt;
# 1   990  1982     4
# 2   947  2001     3
# # … with 276 more rows

This means that with the data you have, it is impossible to compare the results across all observations.

Solutions

You have three options:

Get more data.
Impute missing values.
Restrict the regression to common observations.

You will know whether 1. or 2. are possible. However as the purpose of your analysis seems to be to compare your new metric to Svolik, 3. seems a reasonable approach, particularly as you do not end up dropping many rows.

First find the common rows (there are 2830) and save to dta:

all_complete &lt;- complete.cases(final[, c(&quot;estimate&quot;, svolik_reg_cols)])
sum(all_complete) # 2830
final_complete &lt;- final[all_complete, ]
write_dta(final_complete, &quot;./tmp/svolik_est_merged.dta&quot;)

Stata code

You can now run the regression in Stata. Firstly load and prepare the data as previously:

use svolik_est_merged.dta, clear
* NATURAL DEATHS
cap drop c_natural c_coup c_revolt _d _t _t0
gen c_natural=censoring
replace c_natural=0 if exit!=&quot;natural&quot;
replace c_natural=. if exit==&quot;&quot;
tab c_natural
stset t, id(leadid) failure(c_natural)

Now run the Svolik regression. You can see there are 2830 observations:

stcox legislative lgdp_1 growth_1 exportersoffuelsmainlyoil_EL2008 ethfrac_FIXED communist mil cw age

Cox regression with Breslow method for ties
No. of subjects =   383                                 Number of obs =  2,830
No. of failures =    40
Time at risk    = 3,098
                                                        LR chi2(9)    =  28.46
Log likelihood = -157.48569                             Prob &gt; chi2   = 0.0008
-------------------------------------------------------------------------------------
                 _t | Haz. ratio   Std. err.      z    P&gt;|z|     [95% conf. interval]
--------------------+----------------------------------------------------------------
        legislative |   1.006541   .0088251     0.74   0.457     .9893923    1.023988
             lgdp_1 |   1.437144   .3138694     1.66   0.097     .9366983    2.204962
           growth_1 |   1.010814   .0283629     0.38   0.701      .956725    1.067962
exportersoffue~2008 |   2.487166   1.205382     1.88   0.060     .9620061    6.430308
      ethfrac_FIXED |   1.011694     .00645     1.82   0.068     .9991306    1.024415
          communist |     2.0526   1.610128     0.92   0.359     .4411573    9.550262
                mil |    1.06844   .3944057     0.18   0.858     .5182463    2.202744
                 cw |    4.15784   2.325053     2.55   0.011     1.389562    12.44106
                age |   1.057077   .0172812     3.40   0.001     1.023744    1.091496
-------------------------------------------------------------------------------------

Then run your regression:

stcox estimate lgdp_1 growth_1 exportersoffuelsmainlyoil_EL2008 ethfrac_FIXED communist mil cw age

Output:

Cox regression with Breslow method for ties
No. of subjects =   383                                 Number of obs =  2,830
No. of failures =    40
Time at risk    = 3,098
                                                        LR chi2(9)    =  28.00
Log likelihood = -157.71273                             Prob &gt; chi2   = 0.0010
-------------------------------------------------------------------------------------
                 _t | Haz. ratio   Std. err.      z    P&gt;|z|     [95% conf. interval]
--------------------+----------------------------------------------------------------
           estimate |   .9742007   .1278445    -0.20   0.842     .7532603    1.259946
             lgdp_1 |   1.506868   .3265272     1.89   0.058     .9854309    2.304222
           growth_1 |   1.007996    .028074     0.29   0.775      .954447     1.06455
exportersoffue~2008 |   2.147553   1.257702     1.31   0.192     .6814636    6.767761
      ethfrac_FIXED |   1.011719   .0070275     1.68   0.093     .9980384    1.025587
          communist |   2.064115   1.619767     0.92   0.356     .4433766    9.609369
                mil |   1.018648   .3747256     0.05   0.960     .4953321    2.094845
                 cw |   3.961413   2.202203     2.48   0.013     1.332464    11.77727
                age |   1.054575   .0174756     3.21   0.001     1.020873    1.089389
-------------------------------------------------------------------------------------

Again 2830 observations. The results seem pretty similar to me: the same two covariates (cw and age) have small p-values and all coefficients are close to Svolik. If you're trying to develop a metric which tells you something new, perhaps not what you want to hear. However, if you are trying to find out whether your metric is robust by comparing to an established one, perhaps that's better news.

One caveat of course is to think about whether the rows that you cannot join are missing completely at random. If there is a systemic issue causing data to be missing related to your dependent variable (e.g. authoritarian countries are more likely to be prevent data being published for certain years), it may not be reasonable to infer that similar results here can be extrapolated to missing data. One way to look at this would be to compare your Svolik results with the 2830 rows to the results with the 2903 rows and see if they are meaningfully different.

Edit: how to see which rows are not included in both

To answer your question in the comments. You wanted to see which c_natural failure was included in svolik but not final:

# Create dataset of complete svolik observations
svolik_complete &lt;- svolik[svolik_is_complete, ]
# Subset data to failure of interest
svolik_failures &lt;- svolik_complete |&gt;
    filter(c_natural == 1)
final_failures &lt;- final_complete |&gt;
    filter(c_natural == 1)
# Find failures in svolik_complete but not in final_complete
anti_join(svolik_failures, final_failures, by = c(&quot;ccode&quot;, &quot;year&quot;)) |&gt;
    select(ccode, cabb, year)
# # A tibble: 1 &#215; 3
#   ccode cabb   year
#   &lt;dbl&gt; &lt;chr&gt; &lt;dbl&gt;
# 1   640 TUR    1972

You can do the same for c_coup or c_revolt.

通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库，让每个人都能够通过互相帮助和分享经验来进步。

如何处理合并两个数据集的问题？

问题

load original data (the data used to produce original results)

load data containing my new measure

merge

解决方案

Stata 代码

Solutions

Stata code

Edit: how to see which rows are not included in both

创建一个特定的选择矩阵

在R中的data.table中，如何引用列名称中包含`<=`而不让R将其视为操作？

解析函数输入名称作为输出名称。

在一个大型的 R 数据框中，通过 tidyverse 按日期分组线性插值数值。

如何在Playwright视觉比较中屏蔽多个定位器？

在C++中，可以使用可变模板参数来检索类型的内部类型。

selenium.common.exceptions.StaleElementReferenceException: Message: stale element reference: stale element not found

Creating and opening a URL to log in to Website via Basic Auth with Robot Framework/Selenium (Python)

AG Grid 在上下文菜单中以大文本形式打开

What's the correct way to type hint an empty list as a literal in python?

如何在Highcharts Gantt中更改本地化的星期名称

如何在同一个流中使用多个过滤器和映射函数？

如何使用Map/Set来将代码优化到O(n)？

.NET MAUI Android在GitHub Actions上构建失败，错误代码为1。