2023-06-26 11:19:11

Is there a faster way to perform a non-equi join and find the max of the joined values in R?

Question

I'm trying to speed up some R code. Due to the large volume of data (tens of millions of rows), it takes some time to process. Essentially, I have a small data.table called parameters with tax rates and thresholds and a large data.table called taxation_data with individual level data on incomes. I want to calculate each person's gross tax, which requires looking up the relevant tax rates and thresholds from the parameters table.

My first attempt (not shown) was to perform a non-equi join and to filter on the max of the joined values. That was very slow and I found a way to improve the speed using the cut function (see example below). I still think there must be a faster way to do this though. In particular, I find it interesting that the cut step is very fast, but the merge step is slow. Any ideas?

This is the best I have been able to come up with:

library(tidyverse)
library(data.table)
parameters <- data.table("Component" = c("A","A","B","B","C","C"),
                         "Year" = c(2020, 2021, 2020, 2021,
                                    2020, 2021),
                         "Threshold_lower" = c(0,0,18000,18000,40000,50000),
                         "Threshold_upper" = c(18000,18000,40000,50000,Inf,Inf),
                         "Rate" = c(0,0,0.2,0.2,0.4,0.45),
                         "Tax paid (up to MTR)" = c(0,0,0,0,4400,6400))
taxation_data <- data.table("Year" = c(2020,2020,2021,2021),
                            "Income" = c(20000, 15000,80000,45000))

# Based on the parameters, determine which "component" (threshold) applies to each
# individual in the taxation_data
lapply(unique(parameters$Year), function(x) {
  # Tax rates apply up to the upper part of the threshold "Threshold_upper"
  thresholds <- parameters[Year == x, .(Component, Threshold_upper)]
  thresholds <- setNames(c(thresholds$Threshold_upper), c(as.character(thresholds$Component)))
  taxation_data[Year == x, Component := cut(Income, breaks = thresholds,
                                            labels = names(thresholds)[2:length(thresholds)],
                                            include.lowest = TRUE)]
}) %>%
  invisible()
# Merge in the other variables from parameters
taxation_data <- merge(taxation_data,
                       parameters[, .(Component, Year, Threshold_lower, Rate, `Tax paid (up to MTR)`)],
                       by.x = c("Year", "Component"),
                       by.y = c("Year", "Component"),
                       all.x = TRUE)
# Calculate `gross tax`
setnafill(taxation_data, fill = 0, cols = c("Rate", "Tax paid (up to MTR)", "Threshold_lower"))
taxation_data[, `Gross tax` := (Income - Threshold_lower) * Rate + `Tax paid (up to MTR)`]
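
Since the merge step is the slow part, one thing that might be worth trying is a data.table update join, which adds the looked-up columns to taxation_data by reference instead of building a new merged table. The following is only a sketch on the example data, not a benchmarked solution. It assumes Component has already been assigned for every row (here it is typed in by hand rather than produced by the cut step), and whether it is actually faster on tens of millions of rows would need to be timed.

```R
library(data.table)

parameters <- data.table("Component" = c("A", "A", "B", "B", "C", "C"),
                         "Year" = c(2020, 2021, 2020, 2021, 2020, 2021),
                         "Threshold_lower" = c(0, 0, 18000, 18000, 40000, 50000),
                         "Rate" = c(0, 0, 0.2, 0.2, 0.4, 0.45),
                         "Tax paid (up to MTR)" = c(0, 0, 0, 0, 4400, 6400))

# Assumption: Component is already known for each person
# (filled in by hand here, for illustration only)
taxation_data <- data.table("Year" = c(2020, 2020, 2021, 2021),
                            "Income" = c(20000, 15000, 80000, 45000),
                            "Component" = c("B", "A", "C", "B"))

# Update join: for each (Year, Component) pair in parameters, find the matching
# rows of taxation_data and add the looked-up columns to them by reference.
# The i. prefix refers to columns of parameters.
taxation_data[parameters,
              on = .(Year, Component),
              `:=`(Threshold_lower = i.Threshold_lower,
                   Rate = i.Rate,
                   `Tax paid (up to MTR)` = `i.Tax paid (up to MTR)`)]

taxation_data[, `Gross tax` := (Income - Threshold_lower) * Rate + `Tax paid (up to MTR)`]
taxation_data[]
```

Row order in taxation_data is left unchanged, unlike merge, which sorts the result by the key columns.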

Answer 1

Score: 2

Not sure if I'm missing something, but isn't this just a simple non-equi join with no special handling required?

# Join on a copy of Threshold_lower: in the result, the columns used in a
# non-equi join condition carry the values from taxation_data (Income),
# so the original threshold values would otherwise be lost
parameters[, thlow := Threshold_lower]
parameters[taxation_data, on = .(Year, thlow <= Income, Threshold_upper >= Income)
  ][, c("Income", "thlow", "Threshold_upper") := .(thlow, NULL, NULL)
  ][, tax := (Income - Threshold_lower) * Rate + `Tax paid (up to MTR)`
  ][]
#    Component  Year Threshold_lower  Rate Tax paid (up to MTR) Income   tax
#       <char> <num>           <num> <num>                <num>  <num> <num>
# 1:         B  2020           18000  0.20                    0  20000   400
# 2:         A  2020               0  0.00                    0  15000     0
# 3:         C  2021           50000  0.45                 6400  80000 19900
# 4:         B  2021           18000  0.20                    0  45000  5400
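
One detail to watch with a join that is closed on both sides (`<=` and `>=`): an income that falls exactly on a threshold satisfies the condition for two adjacent brackets, so that person appears twice in the result. The sketch below uses a made-up single-year table and hypothetical incomes chosen to sit on the boundaries, and shows that making the upper bound strict (`>`) gives exactly one row per person.

```R
library(data.table)

# Hypothetical single-year brackets and incomes, for illustration only
brackets <- data.table(Component = c("A", "B", "C"),
                       Year = 2020,
                       Threshold_lower = c(0, 18000, 40000),
                       Threshold_upper = c(18000, 40000, Inf))
incomes <- data.table(Year = 2020, Income = c(15000, 18000, 40000))

# Closed on both sides: 18000 matches A and B, 40000 matches B and C
both_closed <- brackets[incomes,
                        on = .(Year, Threshold_lower <= Income, Threshold_upper >= Income)]
nrow(both_closed)
# [1] 5

# Half-open [lower, upper): each income matches exactly one bracket
half_open <- brackets[incomes,
                      on = .(Year, Threshold_lower <= Income, Threshold_upper > Income)]
half_open$Component
# [1] "A" "B" "C"
```

With continuous income data an exact hit on a threshold may be rare, but with rounded incomes it is worth checking that the joined result has the same number of rows as taxation_data.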

Answer 2

Score: 1

By adding a fixed amount to Income for every year after the first, so that the brackets of different years cannot overlap, we can perform the join manually with a single findInterval call. As a function:

library(data.table)
tax_join2 <- function(parameters, taxation_data) {
  # add an amount every year after the first so there is no overlap in
  # components between years
  interval <- max(parameters$Threshold_lower, taxation_data$Income) + 1
  min_year <- min(parameters$Year)
  # Threshold_upper is overwritten with the shifted lower thresholds, which
  # serve as the sorted breakpoints for findInterval
  parameters2 <- setorder(copy(parameters), Year, Threshold_lower)[
    , Threshold_upper := Threshold_lower + interval*(Year - min_year)
  ]
  # note: the new columns are added to taxation_data by reference
  setcolorder(
    taxation_data[
      , c(
        "Component",
        "Threshold_lower",
        "Rate",
        "Tax paid (up to MTR)"
      ) := parameters2[
        findInterval(
          Income + interval*(taxation_data$Year - min_year),
          parameters2$Threshold_upper
        ),
        c(1, 3, 5, 6)
      ]
    ][, tax := (Income - Threshold_lower)*Rate + `Tax paid (up to MTR)`],
    c(
      "Component",
      "Year",
      "Threshold_lower",
      "Rate",
      "Tax paid (up to MTR)",
      "Income",
      "tax"
    )
  )
}
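
To make the offset idea easier to follow, here is a small stand-alone sketch of just the lookup step, using the lower thresholds and incomes from the question's example. The variable names (`lower_2020`, `lower_2021`, `breaks`, `idx`) are only for illustration and do not appear in the function above.

```R
# Lower thresholds per year, taken from the example parameters table
lower_2020 <- c(0, 18000, 40000)
lower_2021 <- c(0, 18000, 50000)

income <- c(20000, 15000, 80000, 45000)
year   <- c(2020, 2020, 2021, 2021)

# An offset larger than any threshold or income, so the shifted values of one
# year can never reach into the range of the next year
interval <- max(lower_2020, lower_2021, income) + 1   # 80001

# One sorted vector of breakpoints: 2020 as is, 2021 shifted by the offset
breaks <- c(lower_2020, lower_2021 + interval)
# 0 18000 40000 80001 98001 130001

# Shift each income by its year's offset and find its bracket in one call
idx <- findInterval(income + interval * (year - 2020), breaks)
idx
# [1] 2 1 6 5

# Row idx of the sorted parameters table gives the matching component
rep(c("A", "B", "C"), 2)[idx]
# [1] "B" "A" "C" "B"
```

findInterval uses a binary search over the sorted breakpoints, which is presumably why this approach scales well to a large taxation_data.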

Test on the example data:

parameters <- data.table("Component" = c("A","A","B","B","C","C"),
                         "Year" = c(2020, 2021, 2020, 2021,
                                    2020, 2021),
                         "Threshold_lower" = c(0,0,18000,18000,40000,50000),
                         "Threshold_upper" = c(18000,18000,40000,50000,Inf,Inf),
                         "Rate" = c(0,0,0.2,0.2,0.4,0.45),
                         "Tax paid (up to MTR)" = c(0,0,0,0,4400,6400))
taxation_data <- data.table("Year" = c(2020,2020,2021,2021),
                            "Income" = c(20000, 15000,80000,45000))
tax_join2(parameters, taxation_data)[]
#>    Component Year Threshold_lower Rate Tax paid (up to MTR) Income   tax
#> 1:         B 2020           18000 0.20                    0  20000   400
#> 2:         A 2020               0 0.00                    0  15000     0
#> 3:         C 2021           50000 0.45                 6400  80000 19900
#> 4:         B 2021           18000 0.20                    0  45000  5400

Compare timings against a simple non-equi join as proposed by @r2evans (as a function).

tax_join1 <- function(parameters, taxation_data) {
  parameters <- copy(parameters)[, thlow := Threshold_lower]
  parameters[
    taxation_data, on = .(Year, thlow <= Income, Threshold_upper >= Income)
  ][
    , c("Income", "thlow", "Threshold_upper") := .(thlow, NULL, NULL)
  ][
    , tax := (Income - Threshold_lower) * Rate + `Tax paid (up to MTR)`
  ]
}

Larger example data set, with 100M rows:

set.seed(1673481669)
parameters <- data.table("Component" = rep(LETTERS[1:3], each = 13),
                         "Year" = rep(2010:2022, 3),
                         "Threshold_lower" = rep(c(0,18000,40000), each = 13),
                         "Threshold_upper" = rep(c(18000,40000,Inf), each = 13),
                         "Rate" = rep(c(0,0.2,0.4), each = 13),
                         "Tax paid (up to MTR)" = rep(c(0,0,4400), each = 13))
# the 1e5 incomes are recycled to fill all 1e8 rows
taxation_data <- data.table(Year = sample(2010:2022, 1e8, replace = TRUE),
                            Income = runif(1e5, 0, max(parameters$Threshold_lower)*1.3))

Timing:

system.time(dt1 <- tax_join1(parameters, taxation_data))
#>    user  system elapsed 
#>   41.21    3.86   42.06
system.time(dt2 <- tax_join2(parameters, taxation_data))
#>    user  system elapsed 
#>    9.06    2.17   12.41
identical(dt1, dt2)
#> [1] TRUE
