创建一个新变量仅当满足条件时。

huangapple go评论72阅读模式
英文:

How to create a new variable only if a condition if satisfied?

问题

我想创建一个新变量,从两个不同数据集中减去两个变量。但我需要R仅计算相同值的差异... 在我的情况下,"相同的事物" 是第三个变量 COD_PROV,它指示城市。

我需要R计算数据集1的 Real Wage 变量和数据集2的 Real Wage 变量之间的差异,但仅当这些实际工资的 COD_PROV 相同时。

这是数据集1的示例:

COD_PROV     Real wage 
1            1962.18
6            1742.85
5            1541.81
96           1612.2
4            1574
3            1823.53
103          1584.49
2            1666.21
7            1747.81
10           2066.42
8            1498.01
11           1871.34
9            1770.41
15           2240.03
16           1729.17
17           1773.38
13           1832.57

数据集2具有相同的框架,但缺少一些 COD_PROV 值:

COD_PROV     Real wage 
1            4962.18
6            1542.85
5            3541.81
4            1564
3            1223.53
2            1446.21
7            1557.81
10           2226.42
8            1458.01
11           1843.34
16           1439.17
17           1883.38
13           1992.57

我尝试了这个:

new <- mutate(dataset1, `Wage Difference` = dataset1$`Real wage` - dataset2$`Real wage`)

但R当然会回复:

Error in mutate():
ℹ In argument: Wage difference = ... - dataset1$Real wage.
Caused by error:
! Wage difference must be size 105 or 1, not 106.
Run rlang::last_error() to see where the error occurred.

我认为原因是数据集2的观测数量比数据集1少(特别是缺少一些 COD_PROV 值)... 我如何只对相同的 COD_PROV 值应用差异?

英文:

I want to create a new variable subtracting two variables from two different datasets. But I need that R make the difference only for the values that are referred to the same thing... The "same thing" in my case is the a third variable COD_PROV which indicates the city.

I need that R make the difference between the Real Wage variable of dataset 1 and the Real Wage variable of dataset 2, but only if the COD_PROV of these real wages is the same.

This is an example of the dataset 1

COD_PROV     Real wage 
1            1962,18
6            1742,85
5            1541,81
96           1612,2
4            1574
3            1823,53
103          1584,49
2            1666,21
7            1747,81
10           2066,42
8            1498,01
11           1871,34
9            1770,41
15           2240,03
16           1729,17
17           1773,38
13           1832,57

Datset 2 has the same framework, but some values of COD_PROV are missing

COD_PROV     Real wage 
1            4962,18
6            1542,85
5            3541,81
4            1564
3            1223,53
2            1446,21
7            1557,81
10           2226,42
8            1458,01
11           1843,34
16           1439,17
17           1883,38
13           1992,57

I've tried this

new &lt;- mutate( dataset1, `Wage Difference ` = dataset1$`Real wage` - dataset2$`Real wage` )

but R of course replies

Error in mutate():
ℹ In argument: Wage difference = ... - dataset1$Real wage.
Caused by error:
! Wage difference must be size 105 or 1, not 106.
Run rlang::last_error() to see where the error occurred.

I suppose that the reason is that dataset 2 has less observations than dataset1 ( in particular some values of COD_PROV are missing)... How can I apply the difference only for the same values of COD_PROV ?

答案1

得分: 1

我认为你应该将它们合并在一起,然后计算差异。

library(dplyr)
dataset1 %>%
  left_join(dataset2, by = "COD_PROV", suffix = c("", ".y")) %>%
  mutate(diff = `Real wage` - `Real wage.y`)
#    COD_PROV Real wage Real wage.y  diff
# 1         1   1962.18     4962.18 -3000
# 2         6   1742.85     1542.85   200
# 3         5   1541.81     3541.81 -2000
# 4        96   1612.20          NA    NA
# 5         4   1574.00     1564.00    10
# 6         3   1823.53     1223.53   600
# 7       103   1584.49          NA    NA
# 8         2   1666.21     1446.21   220
# 9         7   1747.81     1557.81   190
# 10       10   2066.42     2226.42  -160
# 11        8   1498.01     1458.01    40
# 12       11   1871.34     1843.34    28
# 13        9   1770.41          NA    NA
# 14       15   2240.03          NA    NA
# 15       16   1729.17     1439.17   290
# 16       17   1773.38     1883.38  -110
# 17       13   1832.57     1992.57  -160

有关"join/merge"的更多信息,请参见https://stackoverflow.com/q/1299871/3358272,https://stackoverflow.com/q/5706437/3358272。

数据:

dataset1 <- structure(list(COD_PROV = c(1L, 6L, 5L, 96L, 4L, 3L, 103L, 2L, 7L, 10L, 8L, 11L, 9L, 15L, 16L, 17L, 13L), "Real wage" = c(1962.18, 1742.85, 1541.81, 1612.2, 1574, 1823.53, 1584.49, 1666.21, 1747.81, 2066.42, 1498.01, 1871.34, 1770.41, 2240.03, 1729.17, 1773.38, 1832.57)), row.names = c(NA, -17L), class = "data.frame")
dataset2 <- structure(list(COD_PROV = c(1L, 6L, 5L, 4L, 3L, 2L, 7L, 10L, 8L, 11L, 16L, 17L, 13L), "Real wage" = c(4962.18, 1542.85, 3541.81, 1564, 1223.53, 1446.21, 1557.81, 2226.42, 1458.01, 1843.34, 1439.17, 1883.38, 1992.57)), row.names = c(NA, -13L), class = "data.frame")
英文:

I think you should join/merge these together and then calculate the difference.

library(dplyr)
dataset1 %&gt;%
  left_join(dataset2, by = &quot;COD_PROV&quot;, suffix = c(&quot;&quot;, &quot;.y&quot;)) %&gt;%
  mutate(diff = `Real wage` - `Real wage.y`)
#    COD_PROV Real wage Real wage.y  diff
# 1         1   1962.18     4962.18 -3000
# 2         6   1742.85     1542.85   200
# 3         5   1541.81     3541.81 -2000
# 4        96   1612.20          NA    NA
# 5         4   1574.00     1564.00    10
# 6         3   1823.53     1223.53   600
# 7       103   1584.49          NA    NA
# 8         2   1666.21     1446.21   220
# 9         7   1747.81     1557.81   190
# 10       10   2066.42     2226.42  -160
# 11        8   1498.01     1458.01    40
# 12       11   1871.34     1843.34    28
# 13        9   1770.41          NA    NA
# 14       15   2240.03          NA    NA
# 15       16   1729.17     1439.17   290
# 16       17   1773.38     1883.38  -110
# 17       13   1832.57     1992.57  -160

For more information on what join/merge means, see https://stackoverflow.com/q/1299871/3358272, https://stackoverflow.com/q/5706437/3358272.


Data

dataset1 &lt;- structure(list(COD_PROV = c(1L, 6L, 5L, 96L, 4L, 3L, 103L, 2L, 7L, 10L, 8L, 11L, 9L, 15L, 16L, 17L, 13L), &quot;Real wage&quot; = c(1962.18, 1742.85, 1541.81, 1612.2, 1574, 1823.53, 1584.49, 1666.21, 1747.81, 2066.42, 1498.01, 1871.34, 1770.41, 2240.03, 1729.17, 1773.38, 1832.57)), row.names = c(NA, -17L), class = &quot;data.frame&quot;)
dataset2 &lt;- structure(list(COD_PROV = c(1L, 6L, 5L, 4L, 3L, 2L, 7L, 10L, 8L, 11L, 16L, 17L, 13L), &quot;Real wage&quot; = c(4962.18, 1542.85, 3541.81, 1564, 1223.53, 1446.21, 1557.81, 2226.42, 1458.01, 1843.34, 1439.17, 1883.38, 1992.57)), row.names = c(NA, -13L), class = &quot;data.frame&quot;)

答案2

得分: 0

我们可以使用 {powerjoin}:

powerjoin::power_left_join(dataset1, dataset2, by = "COD_PROV", conflict = `-`)
#>    COD_PROV  实际工资
#> 1         1   -3000
#> 2         6    200
#> 3         5   -2000
#> 4        96   <NA>
#> 5         4     10
#> 6         3    600
#> 7       103   <NA>
#> 8         2    220
#> 9         7    190
#> 10       10   -160
#> 11        8     40
#> 12       11     28
#> 13        9   <NA>
#> 14       15   <NA>
#> 15       16    290
#> 16       17   -110
#> 17       13   -160

创建于2023年3月17日,使用 reprex v2.0.2

英文:

We can use {powerjoin}:

powerjoin::power_left_join(dataset1, dataset2, by = &quot;COD_PROV&quot;, conflict = `-`)
#&gt;    COD_PROV Real wage
#&gt; 1         1     -3000
#&gt; 2         6       200
#&gt; 3         5     -2000
#&gt; 4        96        NA
#&gt; 5         4        10
#&gt; 6         3       600
#&gt; 7       103        NA
#&gt; 8         2       220
#&gt; 9         7       190
#&gt; 10       10      -160
#&gt; 11        8        40
#&gt; 12       11        28
#&gt; 13        9        NA
#&gt; 14       15        NA
#&gt; 15       16       290
#&gt; 16       17      -110
#&gt; 17       13      -160

<sup>Created on 2023-03-17 with reprex v2.0.2</sup>

huangapple
  • 本文由 发表于 2023年2月27日 01:27:23
  • 转载请务必保留本文链接:https://go.coder-hub.com/75573785.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定