英文:
How do I use dcast and keep data from multiple years?
问题
我有一个包含世界银行发展数据的.csv表,时间范围为2016年至2020年。一开始,我只使用了2020年的数据,但后来不得不添加其他年份的数据,因为某些变量的数据不可用。
.csv表包含国家代码、系列名称和每年的数值。例如:
Country.Code | Series.Name |
---|---|
Arg | Gini index |
Arg | Population |
我使用了dcast函数,如下所示:
ControlM <- dcast(Control, Country.Code ~ Series.Name)
这个方法有效,我得到了一个只包含一个行的表,其中有国家代码以及数值列,但只保留了2020年的数值。
然后,我尝试指定要保留的列,添加如下内容:
ControlM <- dcast(
Control,
Country.Code ~ Series.Name,
value.var = c("2016", "2017", "2018", "2019", "2020")
)
但是,我遇到了以下错误:
Error in if (!(value.var %in% names(data))) { :
the condition has length > 1
我尝试了在此处找到的不同解决方案,例如:https://stackoverflow.com/questions/35371004/error-using-dcast-with-multiple-value-var,并将数据帧转换为数据表并将value.var
设置为NULL
。但错误消息保持不变。
如果有人能为我提供建议,我会非常高兴。
编辑:感谢您的快速回复!以下是dput(head(Contro))
的输出:
structure(list(Country.Name = c("Argentina", "Argentina", "Argentina",
"Argentina", "Argentina", "Armenia"), Country.Code = c("ARG",
"ARG", "ARG", "ARG", "ARG", "ARM"), Series.Name = c("Gini index",
"Trade (% of GDP)", "Population density (people per sq. km of land area)",
"Population, total", "Educational attainment, at least completed post-secondary, population 25+, total (%) (cumulative)",
"Gini index"), Series.Code = c("SI.POV.GINI", "NE.TRD.GNFS.ZS",
"EN.POP.DNST", "SP.POP.TOTL", "SE.SEC.CUAT.PO.ZS", "SI.POV.GINI"
), `2016` = c("42", "26.0938878488799", "15.9281350828921", "43590368",
"..", "32.5"), `2017` = c("41.1", "25.2896011376779", "16.0941907925267",
"44044811", "..", "33.6"), `2018` = c("41.3", "30.7625359549926",
"16.2585100979651", "44494502", "..", "34.4"), `2019` = c("42.9",
"32.6306150458499", "16.4208266190179", "44938712", "..", "29.9"
), `2020` = c("42.3", "30.2197998857878", "16.580892611147",
"45376763", "..", "25.2")), row.names = c(NA, -6L), class = c("data.table",
"data.frame"), .internal.selfref = <pointer: 0x00000193c0bd5930>)
问题是,我希望将系列名称作为列名。例如:
Country.Code | Gini Index |
---|---|
ARG | 50 |
ARM | 50 |
与上面的示例不同。问题在于,默认情况下,dcast函数只保留了2020年的数据,但我需要2016-2020年的数据。希望这有助于澄清我的问题。至于我使用了哪些包,我两周前安装了数据表和reshape2,不确定我在这里使用了哪个包。很抱歉,我还在努力学习R:D
英文:
I have a .csv table with world bank development data from the years 2016-2020. In the beginning I only used data from 2020, however I had to add years as for some of the variables the data was not available.
The .csv table contained countrycodes and series names and the values for each year. So for example
Country.Code | Series.Name |
---|---|
Arg | Gini index |
Arg | Population |
I used the dcast function like this:
ControlM <- dcast(Control, Country.Code ~ Series.Name)
This worked and I had a table with only one row with the Country code and then columns with the values, but it only kept the values for 2020.
Then I tried to specify, which columns it should keep by adding:
ControlM <- dcast(
Control,
Country.Code ~ Series.Name,
value.var = c("2016", "2017", "2018", "2019", "2020")
)
However, I get the following Error:
Error in if (!(value.var %in% names(data))) { :
the condition has length > 1
I tried different solutions I found here. For example: https://stackoverflow.com/questions/35371004/error-using-dcast-with-multiple-value-var and converted the data frame to a data table first and setting value.var = NULL
. The error message stays the same.
I would be really glad, if someone had a suggestion for me.
Edit: Thank you for your quick response! Here is the ouput for dput(head(Contro)):
structure(list(Country.Name = c("Argentina", "Argentina", "Argentina",
"Argentina", "Argentina", "Armenia"), Country.Code = c("ARG",
"ARG", "ARG", "ARG", "ARG", "ARM"), Series.Name = c("Gini index",
"Trade (% of GDP)", "Population density (people per sq. km of land area)",
"Population, total", "Educational attainment, at least completed post-secondary, population 25+, total (%) (cumulative)",
"Gini index"), Series.Code = c("SI.POV.GINI", "NE.TRD.GNFS.ZS",
"EN.POP.DNST", "SP.POP.TOTL", "SE.SEC.CUAT.PO.ZS", "SI.POV.GINI"
), `2016` = c("42", "26.0938878488799", "15.9281350828921", "43590368",
"..", "32.5"), `2017` = c("41.1", "25.2896011376779", "16.0941907925267",
"44044811", "..", "33.6"), `2018` = c("41.3", "30.7625359549926",
"16.2585100979651", "44494502", "..", "34.4"), `2019` = c("42.9",
"32.6306150458499", "16.4208266190179", "44938712", "..", "29.9"
), `2020` = c("42.3", "30.2197998857878", "16.580892611147",
"45376763", "..", "25.2")), row.names = c(NA, -6L), class = c("data.table",
"data.frame"), .internal.selfref = <pointer: 0x00000193c0bd5930>)
The problem is that I want the series names as column names. For example:
Country.Code | Gini Index |
---|---|
ARG | 50 |
ARM | 50 |
as opposed to the example above. The problem is that the dcast function by default only keeps the data for 2020, but I need the data for 2016-2020.
I hope that helps to clarify my problem. Oh and as to which packages I used I installed both weeks ago (data.table and reshape2) and am not sure, which one I used here. I am sorry, still really struggling with R
答案1
得分: 0
你的代码基本正确——主要问题是函数被另一个包屏蔽了。如果我用reshape2::dcast()
运行你的代码,就会出现你发布的错误。然而,用data.table::dcast()
运行就没有问题。
如果可能的话,最好不要使用reshape2
(即不调用library(reshape2)
),如果你在使用data.table
的话。前者是tidyverse
包,而后者不是。有时它们能很好地协同工作,有时却不行——就像你发现的那样。
在你的情况下,在转换为宽格式之前,我会先清理系列名称。"Educational attainment, at least completed post-secondary, population 25+, total (%)(cumulative)"
作为列名会有些尴尬。
Control[, series_clean := gsub(
"(.*?)\\s.+",
"\\L\\1",
Series.Name,
perl = TRUE)
][, series_clean := sub("population,", "pop", series_clean)
][, series_clean := sub("population", "pop_density", series_clean)]
Control[, unique(series_clean)]
# [1] "gini" "trade" "pop_density" "pop" "educational"
如果你只想将基尼系数转为宽格式,可以这样做:
ControlM <- data.table::dcast(
Control[series_clean == "gini"],
Country.Code ~ series_clean,
value.var = c("2016", "2017", "2018", "2019", "2020")
)
如果你想对整个数据框执行此操作,只需从dcast()
调用中移除子集([series_clean=="gini"]
)即可。
你的数据是character
格式而不是numeric
格式。你也会想要改变这一点:
cols_to_make_numeric <- grep(
"^\\d{4}",
names(ControlM),
value = TRUE
)
ControlM[, (cols_to_make_numeric) := lapply(.SD, as.numeric),
.SDcols = cols_to_make_numeric
]
# Key: <Country.Code>
# Country.Code 2016_gini 2017_gini 2018_gini 2019_gini 2020_gini
# <char> <num> <num> <num> <num> <num>
# 1: ARG 42.0 41.1 41.3 42.9 42.3
# 2: ARM 32.5 33.6 34.4 29.9 25.2
英文:
Your code was basically correct - the main problem was the function being masked by another package. If I run your code with reshape2::dcast()
I get the error you posted. However, with data.table::dcast()
it runs.
If you can avoid it, it is advisable not to use reshape2
(i.e. not to call library(reshape2)
) if you are using data.table
. The former is a tidyverse
package and the latter is not. Sometimes they play well together, sometimes they do not - as you have found out.
In your case, the other thing I would do before casting to wide is clean the series names. "Educational attainment, at least completed post-secondary, population 25+, total (%) (cumulative)"
is going to be awkward as a column name.
Control[, series_clean := gsub(
"(.*?)\\s.+",
"\\L\\1",
Series.Name,
perl = TRUE)
][, series_clean := sub("population,", "pop", series_clean)
][, series_clean := sub("population", "pop_density", series_clean)]
Control[, unique(series_clean)]
# [1] "gini" "trade" "pop_density" "pop" "educational"
If you just want to make the Gini Coefficient wide, you can do:
ControlM <- data.table::dcast(
Control[series_clean == "gini"],
Country.Code ~ series_clean,
value.var = c("2016", "2017", "2018", "2019", "2020")
)
If you want to do it for the whole data frame you can just remove the subset ([series_clean=="gini"]
) from the dcast()
call.
Your data is in character
rather than numeric
format. You will want to change that as well:
cols_to_make_numeric <- grep(
"^\\d{4}",
names(ControlM),
value = TRUE
)
ControlM[, (cols_to_make_numeric) :=
lapply(.SD, as.numeric),
.SDcols = cols_to_make_numeric
]
# Key: <Country.Code>
# Country.Code 2016_gini 2017_gini 2018_gini 2019_gini 2020_gini
# <char> <num> <num> <num> <num> <num>
# 1: ARG 42.0 41.1 41.3 42.9 42.3
# 2: ARM 32.5 33.6 34.4 29.9 25.2
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论