有没有一种更快的方法来执行非等值连接并在R中找到连接值的最大值?

huangapple go评论81阅读模式
英文:

Is there a faster way to perform a non-equi join and find the max of the joined values in R?

问题

我试图加速一些R代码。由于数据量很大(数千万行),处理需要一些时间。基本上,我有一个名为parameters的小数据表,其中包含税率和阈值,以及一个名为taxation_data的大数据表,其中包含有关收入的个人级别数据。我想计算每个人的“总税收”,这需要从parameters表中查找相关的税率和阈值。

我的第一次尝试(未显示)是执行一个非等连接,并过滤连接值的最大值。那非常慢,我找到了一种使用cut函数来提高速度的方法(请参见下面的示例)。尽管如此,我仍然认为应该有更快的方法来做到这一点。特别是,我发现cut步骤非常快,但合并步骤很慢。有什么建议吗?

这是我能想出的最好方法:

  1. library(tidyverse)
  2. library(data.table)
  3. parameters <- data.table("Component" = c("A","A","B","B","C","C"),
  4. "Year" = c(2020, 2021, 2020, 2021,
  5. 2020, 2021),
  6. "Threshold_lower" = c(0,0,18000,18000,40000,50000),
  7. "Threshold_upper" = c(18000,18000,40000,50000,Inf,Inf),
  8. "Rate" = c(0,0,0.2,0.2,0.4,0.45),
  9. "Tax paid (up to MTR)" = c(0,0,0,0,4400,6400))
  10. taxation_data <- data.table("Year" = c(2020,2020,2021,2021),
  11. "Income" = c(20000, 15000,80000,45000))
  12. # 根据参数确定每个个人在taxation_data中应用哪个“组件”(阈值)
  13. lapply(unique(parameters$Year), function(x) {
  14. # 税率适用于阈值的上部分“Threshold_upper”
  15. thresholds <- parameters[Year == x, .(Component, Threshold_upper)]
  16. thresholds <- setNames(c(thresholds$Threshold_upper), c(as.character(thresholds$Component)))
  17. taxation_data[Year == x, Component := cut(Income, breaks = thresholds,
  18. labels = names(thresholds)[2:length(thresholds)],
  19. include.lowest = TRUE)]
  20. }) %>%
  21. invisible()
  22. # 合并来自parameters的其他变量
  23. taxation_data <- merge(taxation_data,
  24. parameters[, .(Component, Year, Threshold_lower, Rate, `Tax paid (up to MTR)`)],
  25. by.x = c("Year", "Component"),
  26. by.y=c("Year", "Component"),
  27. all.x=TRUE)
  28. # 计算“总税收”
  29. setnafill(taxation_data, fill = 0, cols = c("Rate", "Tax paid (up to MTR)", "Threshold_lower"))
  30. taxation_data[, `Gross tax` := (Income - Threshold_lower) * Rate + `Tax paid (up to MTR)`]
英文:

I'm trying to speed up some R code. Due to the large volume of data (tens of millions of rows), it takes some time to process. Essentially, I have a small data.table called parameters with tax rates and thresholds and a large data.table called taxation_data with individual level data on incomes. I want to calculate each person's gross tax, which requires looking up the relevant tax rates and thresholds from the parameters table.

My first attempt (not shown) was to perform a non-equi join and to filter on the max of the joined values. That was very slow and I found a way to improve the speed using the cut function (see example below). I still think there must be a faster way to do this though. In particular, I find it interesting that the cut step is very fast, but the merge step is slow. Any ideas?

This is the best I have been able to come up with:

  1. library(tidyverse)
  2. library(data.table)
  3. parameters &lt;- data.table(&quot;Component&quot; = c(&quot;A&quot;,&quot;A&quot;,&quot;B&quot;,&quot;B&quot;,&quot;C&quot;,&quot;C&quot;),
  4. &quot;Year&quot; = c(2020, 2021, 2020, 2021,
  5. 2020, 2021),
  6. &quot;Threshold_lower&quot; = c(0,0,18000,18000,40000,50000),
  7. &quot;Threshold_upper&quot; = c(18000,18000,40000,50000,Inf,Inf),
  8. &quot;Rate&quot; = c(0,0,0.2,0.2,0.4,0.45),
  9. &quot;Tax paid (up to MTR)&quot; = c(0,0,0,0,4400,6400))
  10. taxation_data &lt;- data.table(&quot;Year&quot; = c(2020,2020,2021,2021),
  11. &quot;Income&quot; = c(20000, 15000,80000,45000))
  12. # Based on the parameters, determine which &quot;component&quot; (threshold) applies to each
  13. # individual in the taxation_data
  14. lapply(unique(parameters$Year), function(x) {
  15. # Tax rates apply up to the upper part of the threshold &quot;Threshold_upper&quot;
  16. thresholds &lt;- parameters[Year == x, .(Component, Threshold_upper)]
  17. thresholds &lt;- setNames(c(thresholds$Threshold_upper), c(as.character(thresholds$Component)))
  18. taxation_data[Year == x, Component := cut(Income, breaks = thresholds,
  19. labels = names(thresholds)[2:length(thresholds)],
  20. include.lowest = TRUE)]
  21. }) %&gt;%
  22. invisible()
  23. # Merge in the other variables from parameters
  24. taxation_data &lt;- merge(taxation_data,
  25. parameters[, .(Component, Year, Threshold_lower, Rate, `Tax paid (up to MTR)`)],
  26. by.x = c(&quot;Year&quot;, &quot;Component&quot;),
  27. by.y=c(&quot;Year&quot;, &quot;Component&quot;),
  28. all.x=TRUE)
  29. # Calculate `gross tax`
  30. setnafill(taxation_data, fill = 0, cols = c(&quot;Rate&quot;, &quot;Tax paid (up to MTR)&quot;, &quot;Threshold_lower&quot;))
  31. taxation_data[, `Gross tax` := (Income - Threshold_lower) * Rate + `Tax paid (up to MTR)`]

答案1

得分: 2

不确定是否漏掉了什么,这只是一个简单的非等值合并,不需要特殊处理。

  1. # 因为在合并过程中会丢失名称/值
  2. parameters[, thlow := Threshold_lower]
  3. parameters[taxation_data, on = .(Year, thlow <= Income, Threshold_upper >= Income)
  4. ][, c("Income", "thlow", "Threshold_upper") := .(thlow, NULL, NULL)
  5. ][, tax := (Income - Threshold_lower) * Rate + `Tax paid (up to MTR)`
  6. ][]
  7. # Component Year Threshold_lower Rate Tax paid (up to MTR) Income tax
  8. # <char> <num> <num> <num> <num> <num> <num>
  9. # 1: B 2020 18000 0.20 0 20000 400
  10. # 2: A 2020 0 0.00 0 15000 0
  11. # 3: C 2021 50000 0.45 6400 80000 19900
  12. # 4: B 2021 18000 0.20 0 45000 5400
英文:

Not sure if I'm missing something, isn't this just a simple non-equi merge with no special handling required?

  1. # because names/values are lost in the merge
  2. parameters[, thlow := Threshold_lower]
  3. parameters[taxation_data, on = .(Year, thlow &lt;= Income, Threshold_upper &gt;= Income)
  4. ][, c(&quot;Income&quot;, &quot;thlow&quot;, &quot;Threshold_upper&quot;) := .(thlow, NULL, NULL)
  5. ][, tax := (Income - Threshold_lower) * Rate + `Tax paid (up to MTR)`
  6. ][]
  7. # Component Year Threshold_lower Rate Tax paid (up to MTR) Income tax
  8. # &lt;char&gt; &lt;num&gt; &lt;num&gt; &lt;num&gt; &lt;num&gt; &lt;num&gt; &lt;num&gt;
  9. # 1: B 2020 18000 0.20 0 20000 400
  10. # 2: A 2020 0 0.00 0 15000 0
  11. # 3: C 2021 50000 0.45 6400 80000 19900
  12. # 4: B 2021 18000 0.20 0 45000 5400

答案2

得分: 1

通过每年向“Income”添加一个固定金额,我们可以使用单个“findInterval”调用手动执行连接。作为一个函数:

  1. library(data.table)
  2. tax_join2 <- function(parameters, taxation_data) {
  3. # add an amount every year after the first so there is no overlap in
  4. # components between years
  5. interval <- max(parameters$Threshold_lower, taxation_data$Income) + 1
  6. min_year <- min(parameters$Year)
  7. parameters2 <- setorder(copy(parameters), Year, Threshold_lower)[
  8. ,Threshold_upper := Threshold_lower + interval*(Year - min_year)
  9. ]
  10. setcolorder(
  11. taxation_data[
  12. ,c(
  13. "Component",
  14. "Threshold_lower",
  15. "Rate",
  16. "Tax paid (up to MTR)"
  17. ) := parameters2[
  18. findInterval(
  19. Income + interval*(taxation_data$Year - min_year),
  20. parameters2$Threshold_upper
  21. ),
  22. c(1, 3, 5, 6)
  23. ]
  24. ][, tax := (Income - Threshold_lower)*Rate + `Tax paid (up to MTR)`],
  25. c(
  26. "Component",
  27. "Year",
  28. "Threshold_lower",
  29. "Rate",
  30. "Tax paid (up to MTR)",
  31. "Income",
  32. "tax"
  33. )
  34. )
  35. }
  36. Test on the example data:
  37. ```R
  38. parameters <- data.table("Component" = c("A","A","B","B","C","C"),
  39. "Year" = c(2020, 2021, 2020, 2021,
  40. 2020, 2021),
  41. "Threshold_lower" = c(0,0,18000,18000,40000,50000),
  42. "Threshold_upper" = c(18000,18000,40000,50000,Inf,Inf),
  43. "Rate" = c(0,0,0.2,0.2,0.4,0.45),
  44. "Tax paid (up to MTR)" = c(0,0,0,0,4400,6400))
  45. taxation_data <- data.table("Year" = c(2020,2020,2021,2021),
  46. "Income" = c(20000, 15000,80000,45000))
  47. tax_join2(parameters, taxation_data)[]

Compare timings against a simple non-equi join as proposed by @r2evans (as a function).

  1. tax_join1 <- function(parameters, taxation_data) {
  2. parameters <- copy(parameters)[, thlow := Threshold_lower]
  3. parameters[
  4. taxation_data, on = .(Year, thlow <= Income, Threshold_upper >= Income)
  5. ][
  6. , c("Income", "thlow", "Threshold_upper") := .(thlow, NULL, NULL)
  7. ][
  8. , tax := (Income - Threshold_lower) * Rate + `Tax paid (up to MTR)`
  9. ]
  10. }
  11. Larger example data set, with 100M rows:
  12. ```R
  13. set.seed(1673481669)
  14. parameters <- data.table("Component" = rep(LETTERS[1:3], each = 13),
  15. "Year" = rep(2010:2022, 3),
  16. "Threshold_lower" = rep(c(0,18000,40000), each = 13),
  17. "Threshold_upper" = rep(c(18000,40000,Inf), each = 13),
  18. "Rate" = rep(c(0,0.2,0.4), each = 13),
  19. "Tax paid (up to MTR)" = rep(c(0,0,4400), each = 13))
  20. taxation_data <- data.table(Year = sample(2010:2022, 1e8, 1),
  21. Income = runif(1e5, 0, max(parameters$Threshold_lower)*1.3))
  22. Timing:
  23. ```R
  24. system.time(dt1 <- tax_join1(parameters, taxation_data))
  25. system.time(dt2 <- tax_join2(parameters, taxation_data))
  26. identical(dt1, dt2)

希望这些信息对你有所帮助。

英文:

By adding a fixed amount to Income for every year, we can perform the join manually with a single findInterval call. As a function:

  1. library(data.table)
  2. tax_join2 &lt;- function(parameters, taxation_data) {
  3. # add an amount every year after the first so there is no overlap in
  4. # components between years
  5. interval &lt;- max(parameters$Threshold_lower, taxation_data$Income) + 1
  6. min_year &lt;- min(parameters$Year)
  7. parameters2 &lt;- setorder(copy(parameters), Year, Threshold_lower)[
  8. ,Threshold_upper := Threshold_lower + interval*(Year - min_year)
  9. ]
  10. setcolorder(
  11. taxation_data[
  12. ,c(
  13. &quot;Component&quot;,
  14. &quot;Threshold_lower&quot;,
  15. &quot;Rate&quot;,
  16. &quot;Tax paid (up to MTR)&quot;
  17. ) := parameters2[
  18. findInterval(
  19. Income + interval*(taxation_data$Year - min_year),
  20. parameters2$Threshold_upper
  21. ),
  22. c(1, 3, 5, 6)
  23. ]
  24. ][, tax := (Income - Threshold_lower)*Rate + `Tax paid (up to MTR)`],
  25. c(
  26. &quot;Component&quot;,
  27. &quot;Year&quot;,
  28. &quot;Threshold_lower&quot;,
  29. &quot;Rate&quot;,
  30. &quot;Tax paid (up to MTR)&quot;,
  31. &quot;Income&quot;,
  32. &quot;tax&quot;
  33. )
  34. )
  35. }

Test on the example data:

  1. parameters &lt;- data.table(&quot;Component&quot; = c(&quot;A&quot;,&quot;A&quot;,&quot;B&quot;,&quot;B&quot;,&quot;C&quot;,&quot;C&quot;),
  2. &quot;Year&quot; = c(2020, 2021, 2020, 2021,
  3. 2020, 2021),
  4. &quot;Threshold_lower&quot; = c(0,0,18000,18000,40000,50000),
  5. &quot;Threshold_upper&quot; = c(18000,18000,40000,50000,Inf,Inf),
  6. &quot;Rate&quot; = c(0,0,0.2,0.2,0.4,0.45),
  7. &quot;Tax paid (up to MTR)&quot; = c(0,0,0,0,4400,6400))
  8. taxation_data &lt;- data.table(&quot;Year&quot; = c(2020,2020,2021,2021),
  9. &quot;Income&quot; = c(20000, 15000,80000,45000))
  10. tax_join2(parameters, taxation_data)[]
  11. #&gt; Component Year Threshold_lower Rate Tax paid (up to MTR) Income tax
  12. #&gt; 1: B 2020 18000 0.20 0 20000 400
  13. #&gt; 2: A 2020 0 0.00 0 15000 0
  14. #&gt; 3: C 2021 50000 0.45 6400 80000 19900
  15. #&gt; 4: B 2021 18000 0.20 0 45000 5400

Compare timings against a simple non-equi join as proposed by @r2evans (as a function).

  1. tax_join1 &lt;- function(parameters, taxation_data) {
  2. parameters &lt;- copy(parameters)[, thlow := Threshold_lower]
  3. parameters[
  4. taxation_data, on = .(Year, thlow &lt;= Income, Threshold_upper &gt;= Income)
  5. ][
  6. , c(&quot;Income&quot;, &quot;thlow&quot;, &quot;Threshold_upper&quot;) := .(thlow, NULL, NULL)
  7. ][
  8. , tax := (Income - Threshold_lower) * Rate + `Tax paid (up to MTR)`
  9. ]
  10. }

Larger example data set, with 100M rows:

  1. set.seed(1673481669)
  2. parameters &lt;- data.table(&quot;Component&quot; = rep(LETTERS[1:3], each = 13),
  3. &quot;Year&quot; = rep(2010:2022, 3),
  4. &quot;Threshold_lower&quot; = rep(c(0,18000,40000), each = 13),
  5. &quot;Threshold_upper&quot; = rep(c(18000,40000,Inf), each = 13),
  6. &quot;Rate&quot; = rep(c(0,0.2,0.4), each = 13),
  7. &quot;Tax paid (up to MTR)&quot; = rep(c(0,0,4400), each = 13))
  8. taxation_data &lt;- data.table(Year = sample(2010:2022, 1e8, 1),
  9. Income = runif(1e5, 0, max(parameters$Threshold_lower)*1.3))

Timing:

  1. system.time(dt1 &lt;- tax_join1(parameters, taxation_data))
  2. #&gt; user system elapsed
  3. #&gt; 41.21 3.86 42.06
  4. system.time(dt2 &lt;- tax_join2(parameters, taxation_data))
  5. #&gt; user system elapsed
  6. #&gt; 9.06 2.17 12.41
  7. identical(dt1, dt2)
  8. #&gt; [1] TRUE

huangapple
  • 本文由 发表于 2023年6月26日 11:19:11
  • 转载请务必保留本文链接:https://go.coder-hub.com/76553335.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定