英文:
Best way to compute multiple data.frame columns from patterns
问题
我定期以如下宽格式的data.frame
形式接收数据:
head(data)
date NSW_1 AUS_1 NSW_2 AUS_2
1 2000-01-01 -0.38358623 -1.78824221 0.995984590 1.0744594
2 2001-01-01 -1.95910318 2.03124252 -1.695764903 0.2605978
3 2002-01-01 -0.84170506 -0.70314433 -0.533372143 -0.3142720
4 2003-01-01 1.90354747 0.15816476 -1.372269451 -0.7496301
5 2004-01-01 0.62249393 0.50623480 -2.207919779 -0.8621983
6 2005-01-01 1.99092044 -0.81999511 1.822122519 2.0480403
我需要根据这些数据计算多个新列,例如:
NSW_3 = NSW_1 + NSW_2
AUS_3 = NSW_1 + AUS_2
NET_1 = NSW_1 + AUS_1 - AUS_2
...
目前,生成这些列的代码使用基本的R如下所示:
data$NSW_3 = data$NSW_1 + data$NSW_2
data$AUS_3 = data$NSW_1 + data$AUS_2
data$NET_1 = data$NSW_1 + data$AUS_1 - data$AUS_2
...
对于少数列,使用dplyr
、data.table
或基本R中的任何可用解决方案都是可以的。然而,在实际情况中,我的初始数据有数千列和数百行,并且我需要在管道后续任务中创建数百个新列。
幸运的是,我已经在脚本中以文本形式存储了每个列的"patterns",例如"NSW_3 = NSW_1 + NSW_2"
(而且无论如何,很容易让ChatGPT将代码转换成这种格式)。一个问题是上述重复的data$
是否确实是最佳选项。
Q: 根据这些模式列表,计算多个新列的最佳方法是什么?
请注意,由于管道后续任务的依赖性,我在这个工作流程中有一些限制,更改起来成本高昂,因此我正在寻找一种解决此问题的清晰/最佳实践方法。理想情况下,解决方案应具备易于随时间更新/更改模式的属性。
在此过程中的一个限制是模式没有模式 - 列的构建相当随意,因为它来自(在我的情况下)不同的会计标准。
示例数据
structure(list(date = structure(c(10957, 11323, 11688, 12053,
12418, 12784, 13149, 13514, 13879, 14245, 14610, 14975, 15340,
15706, 16071, 16436, 16801, 17167, 17532, 17897), class = "Date"),
NSW_1 = c(1.24119982707737, 0.138858419103515, 1.71063158823657,
-0.430640974722993, -1.04422958092706, 0.537579524580529,
-0.669585987150229, 0.638805611438309, -1.72398983449257,
-1.74243008034091, 0.689804173282994, 0.330963177173467,
0.871067708948055, -2.01624558221344, 1.21257910351036, 1.20049469882194,
1.03206832593544, 0.786410256177216, 2.11007351377927, -1.45380984681329
), AUS_1 = c(-0.58310384813065, 0.409723982550305, -0.806981635414238,
0.0855504408545073, 0.746243168639741, -0.653673061331084,
0.657105983301959, 0.549909235009709, -0.806729358432671,
-0.997379717276235, 0.97589063842626, -0.169423180700716,
0.72219177943747, -0.844418606912503, 1.27729368500115, -1.34311054918022,
0.765340668860696, 0.464202569980373, 0.267993278040529,
0.667522687135242), NSW_2 = c(0.398467283863779, -0.638071030930068,
-0.267712904023511, 0.359879564885712, -1.31286609394071,
-0.883969609668706, 2.07709479458465, -2.09922563220098,
-1.23850596532828, 0.990433089664036, 1.08866186319808, 0.839852254188259,
0.0568586386833875, 0.32387805118027, -0.904668667590953,
-0.65218384837012, -0.262454637960949, -0.934662840905703,
0.821161212634175, -1.62425917345994), AUS_2 = c(-1.03040366948029,
-1.261929311973, 0.39218462589648, -1.13143826165372, 0.544144484008162,
<details>
<summary>英文:</summary>
I regularly receive data as a ``data.frame`` in wide format like the following:
head(data)
date NSW_1 AUS_1 NSW_2 AUS_2
1 2000-01-01 -0.38358623 -1.78824221 0.995984590 1.0744594
2 2001-01-01 -1.95910318 2.03124252 -1.695764903 0.2605978
3 2002-01-01 -0.84170506 -0.70314433 -0.533372143 -0.3142720
4 2003-01-01 1.90354747 0.15816476 -1.372269451 -0.7496301
5 2004-01-01 0.62249393 0.50623480 -2.207919779 -0.8621983
6 2005-01-01 1.99092044 -0.81999511 1.822122519 2.0480403
I need to compute several new columns from this data, for example:
NSW_3 = NSW_1 + NSW_2
AUS_3 = NSW_1 + AUS_2
NET_1 = NSW_1 + AUS_1 - AUS_2
...
Currently, the code to generate this uses Base R like this:
data$NSW_3 = df$NSW_1 + df$NSW_2
data$AUS_3 = data$NSW_1 + data$AUS_2
data$NET_1 = data$NSW_1 + data$AUS_1 - data$AUS_2
...
Doing this for a few columns is fine, using any available solution from `dplyr`, `data.table` or Base R. However, in reality my initial data is several thousand columns and hundreds of rows, and I need to create hundreds of new columns for tasks further down the pipeline.
Luckily, I have stored in a script the "patterns" for each column as text, for example `"NSW_3 = NSW_1 + NSW_2"` (and in any case, it is easy enough to get ChatGPT to convert the code into this). One question is whether the above style repeating `df$` is indeed the best option.
**Q: What is the best way to compute multiple new columns based off this list of patterns?**
Note that I am somewhat locked into this workflow due to dependencies further down the pipeline which are costly to change, so I am looking for a clean/best practice solution to this problem as given. The solution should ideally have the property that it is easy to update/change the patterns over time.
One constraint in this procedure is that there is no *pattern* to the patterns - the column construction is pretty arbitrary because it comes from (in my situation) varying accounting standards.
----------
Example data
structure(list(date = structure(c(10957, 11323, 11688, 12053,
12418, 12784, 13149, 13514, 13879, 14245, 14610, 14975, 15340,
15706, 16071, 16436, 16801, 17167, 17532, 17897), class = "Date"),
NSW_1 = c(1.24119982707737, 0.138858419103515, 1.71063158823657,
-0.430640974722993, -1.04422958092706, 0.537579524580529,
-0.669585987150229, 0.638805611438309, -1.72398983449257,
-1.74243008034091, 0.689804173282994, 0.330963177173467,
0.871067708948055, -2.01624558221344, 1.21257910351036, 1.20049469882194,
1.03206832593544, 0.786410256177216, 2.11007351377927, -1.45380984681329
), AUS_1 = c(-0.58310384813065, 0.409723982550305, -0.806981635414238,
0.0855504408545073, 0.746243168639741, -0.653673061331084,
0.657105983301959, 0.549909235009709, -0.806729358432671,
-0.997379717276235, 0.97589063842626, -0.169423180700716,
0.72219177943747, -0.844418606912503, 1.27729368500115, -1.34311054918022,
0.765340668860696, 0.464202569980373, 0.267993278040529,
0.667522687135242), NSW_2 = c(0.398467283863779, -0.638071030930068,
-0.267712904023511, 0.359879564885712, -1.31286609394071,
-0.883969609668706, 2.07709479458465, -2.09922563220098,
-1.23850596532828, 0.990433089664036, 1.08866186319808, 0.839852254188259,
0.0568586386833875, 0.32387805118027, -0.904668667590953,
-0.65218384837012, -0.262454637960949, -0.934662840905703,
0.821161212634175, -1.62425917345994), AUS_2 = c(-1.03040366948029,
-1.261929311973, 0.39218462589648, -1.13143826165372, 0.544144484008162,
1.17660893473457, 0.0252285692390373, 0.515133169692034,
-0.654109760017422, 0.503641990827566, -1.2721192223284,
-0.0767711536986575, -1.34531937567941, -0.266317560246187,
1.08756299501947, 0.700567795415207, -0.4427595146404, -0.78851996588786,
-0.856775709774438, -0.746419014623393)), class = "data.frame", row.names =
c(NA, -20L))
</details>
# 答案1
**得分**: 4
以下是翻译好的部分:
假设命令存储在字符向量 `cmds` 中,如下所示。如果它们存储在文本文件中,我们可以这样创建 `cmds`。
```R
cmds <- readLines("cmds.txt")
然后我们有一些替代方法。
1) Reduce 使用 Reduce
函数,其中包含一个运行 str2lang/eval
的函数。请注意,Reduce
会按顺序运行命令,因此可能会有一个命令使用先前命令中创建的变量。
不使用任何包。
cmds <- c(
"NSW_3 = NSW_1 + NSW_2",
"AUS_3 = NSW_1 + AUS_2",
"NET_1 = NSW_1 + AUS_1 - AUS_2"
)
Eval <- function(data, x) within(data, eval(str2lang(x)))
data1 <- Reduce(Eval, cmds, data)
2) for 这也可以工作:
data2 <- data
for (cmd in cmds) data2 <- Eval(data2, cmd)
3) single within 另一种方法是生成单个 within
语句的文本,然后解析和执行它。如果命令可能依赖于先前的命令,则省略 rev
,但如果这样做,则列的顺序与 (1) 和 (2) 不同。
data3 <- paste("within(data, {", paste(rev(cmds), collapse = "\n"), "})") %>%
str2lang() %>%
eval()
4) source 如果命令存储在文件 cmds.txt
中,那么我们可以简单地将其 source
到一个包含数据列的环境中。这可能会导致生成的列与前面的替代方案中的顺序不同。
source("cmds.txt", local = e <- list2env(data))
data4 <- modifyList(data, as.list(e))
英文:
Suppose the commands are held in a character vector cmds
as shown in (1) below. If they are in a text file we could create cmds
like this.
cmds <- readLines("cmds.txt")
Then we have some alternatives.
1) Reduce Use Reduce
with a function that runs str2lang/eval. Note that Reduce
will run the commands in order so that it would be possible for a command to use variables created in prior commands.
No packages are used.
cmds <- c("NSW_3 = NSW_1 + NSW_2",
"AUS_3 = NSW_1 + AUS_2",
"NET_1 = NSW_1 + AUS_1 - AUS_2")
Eval <- function(data, x) within(data, eval(str2lang(x)))
data1 <- Reduce(Eval, cmds, data)
2) for This would also work:
data2 <- data
for(cmd in cmds) data2 <- Eval(data2, cmd)
3) single within Another approach is to generate the text of a single within
statement and parse and execute it. Omit the rev
if commands can depend on prior commands but if this is done then the columns won't be in the same order as (1) and (2).
data3 <- paste("within(data, {", paste(rev(cmds), collapse = "\n"), "})") |>
str2lang() |>
eval()
4) source If the commands are in a file cmds.txt
then we can simply source
them into an environment holding the columns of data. This may cause the resulting columns to be in a different order than the prior alternatives.
source("cmds.txt", local = e <- list2env(data))
data4 <- modifyList(data, as.list(e))
答案2
得分: 3
如果您拥有表达式的命名字符串,您可以相对直接地使用 eval(parse(..))
来执行它们。
exprs <- c(NSW_3 = "NSW_1 + NSW_2", AUS_3 = "NSW_1 + AUS_2", NET_1 = "NSW_1 + AUS_1 - AUS_2")
quux[names(exprs)] <- lapply(exprs, function(expr) eval(parse(text = expr), envir = quux))
quux
# date NSW_1 AUS_1 NSW_2 AUS_2 NSW_3 AUS_3 NET_1
# 1 2000-01-01 1.2411998 -0.58310385 0.39846728 -1.03040367 1.63966711 0.21079616 1.68849965
# 2 2001-01-01 0.1388584 0.40972398 -0.63807103 -1.26192931 -0.49921261 -1.12307089 1.81051171
# 3 2002-01-01 1.7106316 -0.80698164 -0.26771290 0.39218463 1.44291868 2.10281621 0.51146533
# 4 2003-01-01 -0.4306410 0.08555044 0.35987956 -1.13143826 -0.07076141 -1.56207924 0.78634773
# 5 2004-01-01 -1.0442296 0.74624317 -1.31286609 0.54414448 -2.35709567 -0.50008510 -0.84213090
# 6 2005-01-01 0.5375795 -0.65367306 -0.88396961 1.17660893 -0.34639009 1.71418846 -1.29270247
# 7 2006-01-01 -0.6695860 0.65710598 2.07709479 0.02522857 1.40750881 -0.64435742 -0.03770857
# 8 2007-01-01 0.6388056 0.54990924 -2.09922563 0.51513317 -1.46042002 1.15393878 0.67358168
# 9 2008-01-01 -1.7239898 -0.80672936 -1.23850597 -0.65410976 -2.96249580 -2.37809959 -1.87660943
# 10 2009-01-01 -1.7424301 -0.99737972 0.99043309 0.50364199 -0.75199699 -1.23878809 -3.24345179
# 11 2010-01-01 0.6898042 0.97589064 1.08866186 -1.27211922 1.77846604 -0.58231505 2.93781403
# 12 2011-01-01 0.3309632 -0.16942318 0.83985225 -0.07677115 1.17081543 0.25419202 0.23831115
# 13 2012-01-01 0.8710677 0.72219178 0.05685864 -1.34531938 0.92792635 -0.47425167 2.93857886
# 14 2013-01-01 -2.0162456 -0.84441861 0.32387805 -0.26631756 -1.69236753 -2.28256314 -2.59434663
# 15 2014-01-01 1.2125791 1.27729369 -0.90466867 1.08756300 0.30791044 2.30014210 1.40230979
# 16 2015-01-01 1.2004947 -1.34311055 -0.65218385 0.70056780 0.54831085 1.90106249 -0.84318365
# 17 2016-01-01 1.0320683 0.76534067 -0.26245464 -0.44275951 0.76961369 0.58930881 2.24016851
# 18 2017-01-01 0.7864103 0.46420257 -0.93466284 -0.78851997 -0.14825258 -0.00210971 2.03913279
# 19 2018-01-01 2.1100735 0.26799328 0.82116121 -0.85677571 2.93123473 1.25329780 3.23484250
# 20 2019-01-01 -1.4538098 0.66752269 -1.62425917 -0.74641901 -3.07806902 -2.20022886 -0.03986815
数据
quux <- structure(list(date = structure(c(10957, 11323, 11688, 12053, 12418, 12784, 13149, 13514, 13879, 14245, 14610, 14975, 15340, 15706, 16071, 16436, 16801, 17167, 17532, 17897), class = "Date"), NSW_1 = c(1.24119982707737, 0.138858419103515, 1.71063158823657, -0.430640974722993, -1.04422958092706, 0.537579524580529, -0.669585987150229, 0.638805611438309, -1.72398983449257, -1.74243008034091, 0.689804173282994, 0.330963177173467, 0.871067708948055, -2.01624558221344, 1.
<details>
<summary>英文:</summary>
If you have _named_ strings of your expressions, you can `eval(parse(..))` them fairly directly.
```r
exprs <- c(NSW_3 = "NSW_1 + NSW_2", AUS_3 = "NSW_1 + AUS_2", NET_1 = "NSW_1 + AUS_1 - AUS_2")
quux[names(exprs)] <- lapply(exprs, function(expr) eval(parse(text = expr), envir = quux))
quux
# date NSW_1 AUS_1 NSW_2 AUS_2 NSW_3 AUS_3 NET_1
# 1 2000-01-01 1.2411998 -0.58310385 0.39846728 -1.03040367 1.63966711 0.21079616 1.68849965
# 2 2001-01-01 0.1388584 0.40972398 -0.63807103 -1.26192931 -0.49921261 -1.12307089 1.81051171
# 3 2002-01-01 1.7106316 -0.80698164 -0.26771290 0.39218463 1.44291868 2.10281621 0.51146533
# 4 2003-01-01 -0.4306410 0.08555044 0.35987956 -1.13143826 -0.07076141 -1.56207924 0.78634773
# 5 2004-01-01 -1.0442296 0.74624317 -1.31286609 0.54414448 -2.35709567 -0.50008510 -0.84213090
# 6 2005-01-01 0.5375795 -0.65367306 -0.88396961 1.17660893 -0.34639009 1.71418846 -1.29270247
# 7 2006-01-01 -0.6695860 0.65710598 2.07709479 0.02522857 1.40750881 -0.64435742 -0.03770857
# 8 2007-01-01 0.6388056 0.54990924 -2.09922563 0.51513317 -1.46042002 1.15393878 0.67358168
# 9 2008-01-01 -1.7239898 -0.80672936 -1.23850597 -0.65410976 -2.96249580 -2.37809959 -1.87660943
# 10 2009-01-01 -1.7424301 -0.99737972 0.99043309 0.50364199 -0.75199699 -1.23878809 -3.24345179
# 11 2010-01-01 0.6898042 0.97589064 1.08866186 -1.27211922 1.77846604 -0.58231505 2.93781403
# 12 2011-01-01 0.3309632 -0.16942318 0.83985225 -0.07677115 1.17081543 0.25419202 0.23831115
# 13 2012-01-01 0.8710677 0.72219178 0.05685864 -1.34531938 0.92792635 -0.47425167 2.93857886
# 14 2013-01-01 -2.0162456 -0.84441861 0.32387805 -0.26631756 -1.69236753 -2.28256314 -2.59434663
# 15 2014-01-01 1.2125791 1.27729369 -0.90466867 1.08756300 0.30791044 2.30014210 1.40230979
# 16 2015-01-01 1.2004947 -1.34311055 -0.65218385 0.70056780 0.54831085 1.90106249 -0.84318365
# 17 2016-01-01 1.0320683 0.76534067 -0.26245464 -0.44275951 0.76961369 0.58930881 2.24016851
# 18 2017-01-01 0.7864103 0.46420257 -0.93466284 -0.78851997 -0.14825258 -0.00210971 2.03913279
# 19 2018-01-01 2.1100735 0.26799328 0.82116121 -0.85677571 2.93123473 1.25329780 3.23484250
# 20 2019-01-01 -1.4538098 0.66752269 -1.62425917 -0.74641901 -3.07806902 -2.20022886 -0.03986815
Data
quux <- structure(list(date = structure(c(10957, 11323, 11688, 12053, 12418, 12784, 13149, 13514, 13879, 14245, 14610, 14975, 15340, 15706, 16071, 16436, 16801, 17167, 17532, 17897), class = "Date"), NSW_1 = c(1.24119982707737, 0.138858419103515, 1.71063158823657, -0.430640974722993, -1.04422958092706, 0.537579524580529, -0.669585987150229, 0.638805611438309, -1.72398983449257, -1.74243008034091, 0.689804173282994, 0.330963177173467, 0.871067708948055, -2.01624558221344, 1.21257910351036, 1.20049469882194, 1.03206832593544, 0.786410256177216, 2.11007351377927, -1.45380984681329), AUS_1 = c(-0.58310384813065, 0.409723982550305, -0.806981635414238, 0.0855504408545073, 0.746243168639741, -0.653673061331084, 0.657105983301959, 0.549909235009709, -0.806729358432671, -0.997379717276235, 0.97589063842626, -0.169423180700716, 0.72219177943747, -0.844418606912503, 1.27729368500115, -1.34311054918022, 0.765340668860696, 0.464202569980373, 0.267993278040529, 0.667522687135242), NSW_2 = c(0.398467283863779, -0.638071030930068, -0.267712904023511, 0.359879564885712, -1.31286609394071, -0.883969609668706, 2.07709479458465, -2.09922563220098, -1.23850596532828, 0.990433089664036, 1.08866186319808, 0.839852254188259, 0.0568586386833875, 0.32387805118027, -0.904668667590953, -0.65218384837012, -0.262454637960949, -0.934662840905703, 0.821161212634175, -1.62425917345994), AUS_2 = c(-1.03040366948029, -1.261929311973, 0.39218462589648, -1.13143826165372, 0.544144484008162, 1.17660893473457, 0.0252285692390373, 0.515133169692034, -0.654109760017422, 0.503641990827566, -1.2721192223284, -0.0767711536986575, -1.34531937567941, -0.266317560246187, 1.08756299501947, 0.700567795415207, -0.4427595146404, -0.78851996588786, -0.856775709774438, -0.746419014623393)), class = "data.frame", row.names = c(NA, -20L))
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论