2023年6月26日 09:42:05go评论78阅读模式

英文:

Best way to compute multiple data.frame columns from patterns

问题

我定期以如下宽格式的data.frame形式接收数据：

head(data)
         date       NSW_1       AUS_1        NSW_2      AUS_2
1  2000-01-01 -0.38358623 -1.78824221  0.995984590  1.0744594
2  2001-01-01 -1.95910318  2.03124252 -1.695764903  0.2605978
3  2002-01-01 -0.84170506 -0.70314433 -0.533372143 -0.3142720
4  2003-01-01  1.90354747  0.15816476 -1.372269451 -0.7496301
5  2004-01-01  0.62249393  0.50623480 -2.207919779 -0.8621983
6  2005-01-01  1.99092044 -0.81999511  1.822122519  2.0480403

我需要根据这些数据计算多个新列，例如：

NSW_3 = NSW_1 + NSW_2
AUS_3 = NSW_1 + AUS_2
NET_1 = NSW_1 + AUS_1 - AUS_2
...

目前，生成这些列的代码使用基本的R如下所示：

data$NSW_3 = data$NSW_1 + data$NSW_2
data$AUS_3 = data$NSW_1 + data$AUS_2
data$NET_1 = data$NSW_1 + data$AUS_1 - data$AUS_2
...

对于少数列，使用dplyr、data.table或基本R中的任何可用解决方案都是可以的。然而，在实际情况中，我的初始数据有数千列和数百行，并且我需要在管道后续任务中创建数百个新列。

幸运的是，我已经在脚本中以文本形式存储了每个列的"patterns"，例如"NSW_3 = NSW_1 + NSW_2"（而且无论如何，很容易让ChatGPT将代码转换成这种格式）。一个问题是上述重复的data$是否确实是最佳选项。

Q: 根据这些模式列表，计算多个新列的最佳方法是什么？

请注意，由于管道后续任务的依赖性，我在这个工作流程中有一些限制，更改起来成本高昂，因此我正在寻找一种解决此问题的清晰/最佳实践方法。理想情况下，解决方案应具备易于随时间更新/更改模式的属性。

在此过程中的一个限制是模式没有模式 - 列的构建相当随意，因为它来自（在我的情况下）不同的会计标准。

示例数据

structure(list(date = structure(c(10957, 11323, 11688, 12053, 
12418, 12784, 13149, 13514, 13879, 14245, 14610, 14975, 15340, 
15706, 16071, 16436, 16801, 17167, 17532, 17897), class = "Date"), 
    NSW_1 = c(1.24119982707737, 0.138858419103515, 1.71063158823657, 
    -0.430640974722993, -1.04422958092706, 0.537579524580529, 
    -0.669585987150229, 0.638805611438309, -1.72398983449257, 
    -1.74243008034091, 0.689804173282994, 0.330963177173467, 
    0.871067708948055, -2.01624558221344, 1.21257910351036, 1.20049469882194, 
    1.03206832593544, 0.786410256177216, 2.11007351377927, -1.45380984681329
    ), AUS_1 = c(-0.58310384813065, 0.409723982550305, -0.806981635414238, 
    0.0855504408545073, 0.746243168639741, -0.653673061331084, 
    0.657105983301959, 0.549909235009709, -0.806729358432671, 
    -0.997379717276235, 0.97589063842626, -0.169423180700716, 
    0.72219177943747, -0.844418606912503, 1.27729368500115, -1.34311054918022, 
    0.765340668860696, 0.464202569980373, 0.267993278040529, 
    0.667522687135242), NSW_2 = c(0.398467283863779, -0.638071030930068, 
    -0.267712904023511, 0.359879564885712, -1.31286609394071, 
    -0.883969609668706, 2.07709479458465, -2.09922563220098, 
    -1.23850596532828, 0.990433089664036, 1.08866186319808, 0.839852254188259, 
    0.0568586386833875, 0.32387805118027, -0.904668667590953, 
    -0.65218384837012, -0.262454637960949, -0.934662840905703, 
    0.821161212634175, -1.62425917345994), AUS_2 = c(-1.03040366948029, 
    -1.261929311973, 0.39218462589648, -1.13143826165372, 0.544144484008162, 
   

<details>
<summary>英文:</summary>

I regularly receive data as a ``data.frame`` in wide format like the following:

head(data)
date NSW_1 AUS_1 NSW_2 AUS_2
1 2000-01-01 -0.38358623 -1.78824221 0.995984590 1.0744594
2 2001-01-01 -1.95910318 2.03124252 -1.695764903 0.2605978
3 2002-01-01 -0.84170506 -0.70314433 -0.533372143 -0.3142720
4 2003-01-01 1.90354747 0.15816476 -1.372269451 -0.7496301
5 2004-01-01 0.62249393 0.50623480 -2.207919779 -0.8621983
6 2005-01-01 1.99092044 -0.81999511 1.822122519 2.0480403

I need to compute several new columns from this data, for example:

NSW_3 = NSW_1 + NSW_2
AUS_3 = NSW_1 + AUS_2
NET_1 = NSW_1 + AUS_1 - AUS_2
...

Currently, the code to generate this uses Base R like this:

data$NSW_3 = df$NSW_1 + df$NSW_2
data$AUS_3 = data$NSW_1 + data$AUS_2
data$NET_1 = data$NSW_1 + data$AUS_1 - data$AUS_2
...

Doing this for a few columns is fine, using any available solution from `dplyr`, `data.table` or Base R. However, in reality my initial data is several thousand columns and hundreds of rows, and I need to create hundreds of new columns for tasks further down the pipeline.

Luckily, I have stored in a script the &quot;patterns&quot; for each column as text, for example `&quot;NSW_3 = NSW_1 + NSW_2&quot;` (and in any case, it is easy enough to get ChatGPT to convert the code into this). One question is whether the above style repeating `df$` is indeed the best option.

**Q: What is the best way to compute multiple new columns based off this list of patterns?**

Note that I am somewhat locked into this workflow due to dependencies further down the pipeline which are costly to change, so I am looking for a clean/best practice solution to this problem as given. The solution should ideally have the property that it is easy to update/change the patterns over time.

One constraint in this procedure is that there is no *pattern* to the patterns - the column construction is pretty arbitrary because it comes from  (in my situation) varying accounting standards.

----------

Example data

structure(list(date = structure(c(10957, 11323, 11688, 12053,
12418, 12784, 13149, 13514, 13879, 14245, 14610, 14975, 15340,
15706, 16071, 16436, 16801, 17167, 17532, 17897), class = "Date"),
NSW_1 = c(1.24119982707737, 0.138858419103515, 1.71063158823657,
-0.430640974722993, -1.04422958092706, 0.537579524580529,
-0.669585987150229, 0.638805611438309, -1.72398983449257,
-1.74243008034091, 0.689804173282994, 0.330963177173467,
0.871067708948055, -2.01624558221344, 1.21257910351036, 1.20049469882194,
1.03206832593544, 0.786410256177216, 2.11007351377927, -1.45380984681329
), AUS_1 = c(-0.58310384813065, 0.409723982550305, -0.806981635414238,
0.0855504408545073, 0.746243168639741, -0.653673061331084,
0.657105983301959, 0.549909235009709, -0.806729358432671,
-0.997379717276235, 0.97589063842626, -0.169423180700716,
0.72219177943747, -0.844418606912503, 1.27729368500115, -1.34311054918022,
0.765340668860696, 0.464202569980373, 0.267993278040529,
0.667522687135242), NSW_2 = c(0.398467283863779, -0.638071030930068,
-0.267712904023511, 0.359879564885712, -1.31286609394071,
-0.883969609668706, 2.07709479458465, -2.09922563220098,
-1.23850596532828, 0.990433089664036, 1.08866186319808, 0.839852254188259,
0.0568586386833875, 0.32387805118027, -0.904668667590953,
-0.65218384837012, -0.262454637960949, -0.934662840905703,
0.821161212634175, -1.62425917345994), AUS_2 = c(-1.03040366948029,
-1.261929311973, 0.39218462589648, -1.13143826165372, 0.544144484008162,
1.17660893473457, 0.0252285692390373, 0.515133169692034,
-0.654109760017422, 0.503641990827566, -1.2721192223284,
-0.0767711536986575, -1.34531937567941, -0.266317560246187,
1.08756299501947, 0.700567795415207, -0.4427595146404, -0.78851996588786,
-0.856775709774438, -0.746419014623393)), class = "data.frame", row.names =
c(NA, -20L))




</details>


# 答案1
**得分**: 4

以下是翻译好的部分：

假设命令存储在字符向量 `cmds` 中，如下所示。如果它们存储在文本文件中，我们可以这样创建 `cmds`。

```R
cmds <- readLines("cmds.txt")

然后我们有一些替代方法。

1) Reduce 使用 Reduce 函数，其中包含一个运行 str2lang/eval 的函数。请注意，Reduce 会按顺序运行命令，因此可能会有一个命令使用先前命令中创建的变量。

不使用任何包。

cmds <- c(
  "NSW_3 = NSW_1 + NSW_2",
  "AUS_3 = NSW_1 + AUS_2",
  "NET_1 = NSW_1 + AUS_1 - AUS_2"
)

Eval <- function(data, x) within(data, eval(str2lang(x)))
data1 <- Reduce(Eval, cmds, data)

2) for 这也可以工作：

data2 <- data
for (cmd in cmds) data2 <- Eval(data2, cmd)

3) single within 另一种方法是生成单个 within 语句的文本，然后解析和执行它。如果命令可能依赖于先前的命令，则省略 rev，但如果这样做，则列的顺序与 (1) 和 (2) 不同。

data3 <- paste("within(data, {", paste(rev(cmds), collapse = "\n"), "})") %>%
  str2lang() %>%
  eval()

4) source 如果命令存储在文件 cmds.txt 中，那么我们可以简单地将其 source 到一个包含数据列的环境中。这可能会导致生成的列与前面的替代方案中的顺序不同。

source("cmds.txt", local = e <- list2env(data))
data4 <- modifyList(data, as.list(e))

英文:

Suppose the commands are held in a character vector cmds as shown in (1) below. If they are in a text file we could create cmds like this.

cmds &lt;- readLines(&quot;cmds.txt&quot;)

Then we have some alternatives.

1) Reduce Use Reduce with a function that runs str2lang/eval. Note that Reduce will run the commands in order so that it would be possible for a command to use variables created in prior commands.

No packages are used.

cmds &lt;- c(&quot;NSW_3 = NSW_1 + NSW_2&quot;,
          &quot;AUS_3 = NSW_1 + AUS_2&quot;,
          &quot;NET_1 = NSW_1 + AUS_1 - AUS_2&quot;)

Eval &lt;- function(data, x) within(data, eval(str2lang(x)))
data1 &lt;- Reduce(Eval, cmds, data)

2) for This would also work:

data2 &lt;- data
for(cmd in cmds) data2 &lt;- Eval(data2, cmd)

3) single within Another approach is to generate the text of a single within statement and parse and execute it. Omit the rev if commands can depend on prior commands but if this is done then the columns won't be in the same order as (1) and (2).

data3 &lt;- paste(&quot;within(data, {&quot;, paste(rev(cmds), collapse = &quot;\n&quot;), &quot;})&quot;) |&gt;
  str2lang() |&gt;
  eval()

4) source If the commands are in a file cmds.txt then we can simply source them into an environment holding the columns of data. This may cause the resulting columns to be in a different order than the prior alternatives.

source(&quot;cmds.txt&quot;, local = e &lt;- list2env(data))
data4 &lt;- modifyList(data, as.list(e))

答案2

得分: 3

如果您拥有表达式的命名字符串，您可以相对直接地使用 eval(parse(..)) 来执行它们。

exprs <- c(NSW_3 = "NSW_1 + NSW_2", AUS_3 = "NSW_1 + AUS_2", NET_1 = "NSW_1 + AUS_1 - AUS_2")
quux[names(exprs)] <- lapply(exprs, function(expr) eval(parse(text = expr), envir = quux))
quux
#          date      NSW_1       AUS_1       NSW_2       AUS_2       NSW_3       AUS_3       NET_1
# 1  2000-01-01  1.2411998 -0.58310385  0.39846728 -1.03040367  1.63966711  0.21079616  1.68849965
# 2  2001-01-01  0.1388584  0.40972398 -0.63807103 -1.26192931 -0.49921261 -1.12307089  1.81051171
# 3  2002-01-01  1.7106316 -0.80698164 -0.26771290  0.39218463  1.44291868  2.10281621  0.51146533
# 4  2003-01-01 -0.4306410  0.08555044  0.35987956 -1.13143826 -0.07076141 -1.56207924  0.78634773
# 5  2004-01-01 -1.0442296  0.74624317 -1.31286609  0.54414448 -2.35709567 -0.50008510 -0.84213090
# 6  2005-01-01  0.5375795 -0.65367306 -0.88396961  1.17660893 -0.34639009  1.71418846 -1.29270247
# 7  2006-01-01 -0.6695860  0.65710598  2.07709479  0.02522857  1.40750881 -0.64435742 -0.03770857
# 8  2007-01-01  0.6388056  0.54990924 -2.09922563  0.51513317 -1.46042002  1.15393878  0.67358168
# 9  2008-01-01 -1.7239898 -0.80672936 -1.23850597 -0.65410976 -2.96249580 -2.37809959 -1.87660943
# 10 2009-01-01 -1.7424301 -0.99737972  0.99043309  0.50364199 -0.75199699 -1.23878809 -3.24345179
# 11 2010-01-01  0.6898042  0.97589064  1.08866186 -1.27211922  1.77846604 -0.58231505  2.93781403
# 12 2011-01-01  0.3309632 -0.16942318  0.83985225 -0.07677115  1.17081543  0.25419202  0.23831115
# 13 2012-01-01  0.8710677  0.72219178  0.05685864 -1.34531938  0.92792635 -0.47425167  2.93857886
# 14 2013-01-01 -2.0162456 -0.84441861  0.32387805 -0.26631756 -1.69236753 -2.28256314 -2.59434663
# 15 2014-01-01  1.2125791  1.27729369 -0.90466867  1.08756300  0.30791044  2.30014210  1.40230979
# 16 2015-01-01  1.2004947 -1.34311055 -0.65218385  0.70056780  0.54831085  1.90106249 -0.84318365
# 17 2016-01-01  1.0320683  0.76534067 -0.26245464 -0.44275951  0.76961369  0.58930881  2.24016851
# 18 2017-01-01  0.7864103  0.46420257 -0.93466284 -0.78851997 -0.14825258 -0.00210971  2.03913279
# 19 2018-01-01  2.1100735  0.26799328  0.82116121 -0.85677571  2.93123473  1.25329780  3.23484250
# 20 2019-01-01 -1.4538098  0.66752269 -1.62425917 -0.74641901 -3.07806902 -2.20022886 -0.03986815

数据

quux <- structure(list(date = structure(c(10957, 11323, 11688, 12053, 12418, 12784, 13149, 13514, 13879, 14245, 14610, 14975, 15340, 15706, 16071, 16436, 16801, 17167, 17532, 17897), class = "Date"), NSW_1 = c(1.24119982707737, 0.138858419103515, 1.71063158823657, -0.430640974722993, -1.04422958092706, 0.537579524580529, -0.669585987150229, 0.638805611438309, -1.72398983449257, -1.74243008034091, 0.689804173282994, 0.330963177173467, 0.871067708948055, -2.01624558221344, 1.

<details>
<summary>英文:</summary>

If you have _named_ strings of your expressions, you can `eval(parse(..))` them fairly directly.

```r
exprs &lt;- c(NSW_3 = &quot;NSW_1 + NSW_2&quot;, AUS_3 = &quot;NSW_1 + AUS_2&quot;, NET_1 = &quot;NSW_1 + AUS_1 - AUS_2&quot;)
quux[names(exprs)] &lt;- lapply(exprs, function(expr) eval(parse(text = expr), envir = quux))
quux
#          date      NSW_1       AUS_1       NSW_2       AUS_2       NSW_3       AUS_3       NET_1
# 1  2000-01-01  1.2411998 -0.58310385  0.39846728 -1.03040367  1.63966711  0.21079616  1.68849965
# 2  2001-01-01  0.1388584  0.40972398 -0.63807103 -1.26192931 -0.49921261 -1.12307089  1.81051171
# 3  2002-01-01  1.7106316 -0.80698164 -0.26771290  0.39218463  1.44291868  2.10281621  0.51146533
# 4  2003-01-01 -0.4306410  0.08555044  0.35987956 -1.13143826 -0.07076141 -1.56207924  0.78634773
# 5  2004-01-01 -1.0442296  0.74624317 -1.31286609  0.54414448 -2.35709567 -0.50008510 -0.84213090
# 6  2005-01-01  0.5375795 -0.65367306 -0.88396961  1.17660893 -0.34639009  1.71418846 -1.29270247
# 7  2006-01-01 -0.6695860  0.65710598  2.07709479  0.02522857  1.40750881 -0.64435742 -0.03770857
# 8  2007-01-01  0.6388056  0.54990924 -2.09922563  0.51513317 -1.46042002  1.15393878  0.67358168
# 9  2008-01-01 -1.7239898 -0.80672936 -1.23850597 -0.65410976 -2.96249580 -2.37809959 -1.87660943
# 10 2009-01-01 -1.7424301 -0.99737972  0.99043309  0.50364199 -0.75199699 -1.23878809 -3.24345179
# 11 2010-01-01  0.6898042  0.97589064  1.08866186 -1.27211922  1.77846604 -0.58231505  2.93781403
# 12 2011-01-01  0.3309632 -0.16942318  0.83985225 -0.07677115  1.17081543  0.25419202  0.23831115
# 13 2012-01-01  0.8710677  0.72219178  0.05685864 -1.34531938  0.92792635 -0.47425167  2.93857886
# 14 2013-01-01 -2.0162456 -0.84441861  0.32387805 -0.26631756 -1.69236753 -2.28256314 -2.59434663
# 15 2014-01-01  1.2125791  1.27729369 -0.90466867  1.08756300  0.30791044  2.30014210  1.40230979
# 16 2015-01-01  1.2004947 -1.34311055 -0.65218385  0.70056780  0.54831085  1.90106249 -0.84318365
# 17 2016-01-01  1.0320683  0.76534067 -0.26245464 -0.44275951  0.76961369  0.58930881  2.24016851
# 18 2017-01-01  0.7864103  0.46420257 -0.93466284 -0.78851997 -0.14825258 -0.00210971  2.03913279
# 19 2018-01-01  2.1100735  0.26799328  0.82116121 -0.85677571  2.93123473  1.25329780  3.23484250
# 20 2019-01-01 -1.4538098  0.66752269 -1.62425917 -0.74641901 -3.07806902 -2.20022886 -0.03986815

Data

quux &lt;- structure(list(date = structure(c(10957, 11323, 11688, 12053, 12418, 12784, 13149, 13514, 13879, 14245, 14610, 14975, 15340, 15706, 16071, 16436, 16801, 17167, 17532, 17897), class = &quot;Date&quot;), NSW_1 = c(1.24119982707737, 0.138858419103515, 1.71063158823657, -0.430640974722993, -1.04422958092706, 0.537579524580529, -0.669585987150229, 0.638805611438309, -1.72398983449257, -1.74243008034091, 0.689804173282994, 0.330963177173467, 0.871067708948055, -2.01624558221344, 1.21257910351036, 1.20049469882194,  1.03206832593544, 0.786410256177216, 2.11007351377927, -1.45380984681329), AUS_1 = c(-0.58310384813065, 0.409723982550305, -0.806981635414238, 0.0855504408545073, 0.746243168639741, -0.653673061331084, 0.657105983301959, 0.549909235009709, -0.806729358432671, -0.997379717276235, 0.97589063842626, -0.169423180700716, 0.72219177943747, -0.844418606912503, 1.27729368500115, -1.34311054918022, 0.765340668860696, 0.464202569980373, 0.267993278040529, 0.667522687135242), NSW_2 = c(0.398467283863779, -0.638071030930068,  -0.267712904023511, 0.359879564885712, -1.31286609394071, -0.883969609668706, 2.07709479458465, -2.09922563220098, -1.23850596532828, 0.990433089664036, 1.08866186319808, 0.839852254188259, 0.0568586386833875, 0.32387805118027, -0.904668667590953, -0.65218384837012, -0.262454637960949, -0.934662840905703, 0.821161212634175, -1.62425917345994), AUS_2 = c(-1.03040366948029, -1.261929311973, 0.39218462589648, -1.13143826165372, 0.544144484008162, 1.17660893473457, 0.0252285692390373, 0.515133169692034,  -0.654109760017422, 0.503641990827566, -1.2721192223284, -0.0767711536986575, -1.34531937567941, -0.266317560246187, 1.08756299501947, 0.700567795415207, -0.4427595146404, -0.78851996588786, -0.856775709774438, -0.746419014623393)), class = &quot;data.frame&quot;, row.names = c(NA, -20L))

通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库，让每个人都能够通过互相帮助和分享经验来进步。

计算多个数据框列的最佳方法是使用模式

问题

答案2

正确的运算符使用

正则表达式搜索跨越两个表格

Pandas选择列按顺序排在前面，其余列保持不变。

使用ggplot2更改堆叠柱状图的颜色，根据每个堆叠的数值标签。

What's the correct way to type hint an empty list as a literal in python?

如何在Highcharts Gantt中更改本地化的星期名称

如何在同一个流中使用多个过滤器和映射函数？

如何使用Map/Set来将代码优化到O(n)？

.NET MAUI Android在GitHub Actions上构建失败，错误代码为1。

如何在Playwright视觉比较中屏蔽多个定位器？

在C++中，可以使用可变模板参数来检索类型的内部类型。

selenium.common.exceptions.StaleElementReferenceException: Message: stale element reference: stale element not found

Creating and opening a URL to log in to Website via Basic Auth with Robot Framework/Selenium (Python)

AG Grid 在上下文菜单中以大文本形式打开

发表评论