英文:
Is there a faster way to perform a non-equi join and find the max of the joined values in R?
问题
我试图加速一些R代码。由于数据量很大(数千万行),处理需要一些时间。基本上,我有一个名为parameters
的小数据表,其中包含税率和阈值,以及一个名为taxation_data
的大数据表,其中包含有关收入的个人级别数据。我想计算每个人的“总税收”,这需要从parameters
表中查找相关的税率和阈值。
我的第一次尝试(未显示)是执行一个非等连接,并过滤连接值的最大值。那非常慢,我找到了一种使用cut
函数来提高速度的方法(请参见下面的示例)。尽管如此,我仍然认为应该有更快的方法来做到这一点。特别是,我发现cut
步骤非常快,但合并步骤很慢。有什么建议吗?
这是我能想出的最好方法:
library(tidyverse)
library(data.table)
parameters <- data.table("Component" = c("A","A","B","B","C","C"),
"Year" = c(2020, 2021, 2020, 2021,
2020, 2021),
"Threshold_lower" = c(0,0,18000,18000,40000,50000),
"Threshold_upper" = c(18000,18000,40000,50000,Inf,Inf),
"Rate" = c(0,0,0.2,0.2,0.4,0.45),
"Tax paid (up to MTR)" = c(0,0,0,0,4400,6400))
taxation_data <- data.table("Year" = c(2020,2020,2021,2021),
"Income" = c(20000, 15000,80000,45000))
# 根据参数确定每个个人在taxation_data中应用哪个“组件”(阈值)
lapply(unique(parameters$Year), function(x) {
# 税率适用于阈值的上部分“Threshold_upper”
thresholds <- parameters[Year == x, .(Component, Threshold_upper)]
thresholds <- setNames(c(thresholds$Threshold_upper), c(as.character(thresholds$Component)))
taxation_data[Year == x, Component := cut(Income, breaks = thresholds,
labels = names(thresholds)[2:length(thresholds)],
include.lowest = TRUE)]
}) %>%
invisible()
# 合并来自parameters的其他变量
taxation_data <- merge(taxation_data,
parameters[, .(Component, Year, Threshold_lower, Rate, `Tax paid (up to MTR)`)],
by.x = c("Year", "Component"),
by.y=c("Year", "Component"),
all.x=TRUE)
# 计算“总税收”
setnafill(taxation_data, fill = 0, cols = c("Rate", "Tax paid (up to MTR)", "Threshold_lower"))
taxation_data[, `Gross tax` := (Income - Threshold_lower) * Rate + `Tax paid (up to MTR)`]
英文:
I'm trying to speed up some R code. Due to the large volume of data (tens of millions of rows), it takes some time to process. Essentially, I have a small data.table called parameters
with tax rates and thresholds and a large data.table called taxation_data
with individual level data on incomes. I want to calculate each person's gross tax
, which requires looking up the relevant tax rates and thresholds from the parameters
table.
My first attempt (not shown) was to perform a non-equi join and to filter on the max of the joined values. That was very slow and I found a way to improve the speed using the cut
function (see example below). I still think there must be a faster way to do this though. In particular, I find it interesting that the cut
step is very fast, but the merge step is slow. Any ideas?
This is the best I have been able to come up with:
library(tidyverse)
library(data.table)
parameters <- data.table("Component" = c("A","A","B","B","C","C"),
"Year" = c(2020, 2021, 2020, 2021,
2020, 2021),
"Threshold_lower" = c(0,0,18000,18000,40000,50000),
"Threshold_upper" = c(18000,18000,40000,50000,Inf,Inf),
"Rate" = c(0,0,0.2,0.2,0.4,0.45),
"Tax paid (up to MTR)" = c(0,0,0,0,4400,6400))
taxation_data <- data.table("Year" = c(2020,2020,2021,2021),
"Income" = c(20000, 15000,80000,45000))
# Based on the parameters, determine which "component" (threshold) applies to each
# individual in the taxation_data
lapply(unique(parameters$Year), function(x) {
# Tax rates apply up to the upper part of the threshold "Threshold_upper"
thresholds <- parameters[Year == x, .(Component, Threshold_upper)]
thresholds <- setNames(c(thresholds$Threshold_upper), c(as.character(thresholds$Component)))
taxation_data[Year == x, Component := cut(Income, breaks = thresholds,
labels = names(thresholds)[2:length(thresholds)],
include.lowest = TRUE)]
}) %>%
invisible()
# Merge in the other variables from parameters
taxation_data <- merge(taxation_data,
parameters[, .(Component, Year, Threshold_lower, Rate, `Tax paid (up to MTR)`)],
by.x = c("Year", "Component"),
by.y=c("Year", "Component"),
all.x=TRUE)
# Calculate `gross tax`
setnafill(taxation_data, fill = 0, cols = c("Rate", "Tax paid (up to MTR)", "Threshold_lower"))
taxation_data[, `Gross tax` := (Income - Threshold_lower) * Rate + `Tax paid (up to MTR)`]
答案1
得分: 2
不确定是否漏掉了什么,这只是一个简单的非等值合并,不需要特殊处理。
# 因为在合并过程中会丢失名称/值
parameters[, thlow := Threshold_lower]
parameters[taxation_data, on = .(Year, thlow <= Income, Threshold_upper >= Income)
][, c("Income", "thlow", "Threshold_upper") := .(thlow, NULL, NULL)
][, tax := (Income - Threshold_lower) * Rate + `Tax paid (up to MTR)`
][]
# Component Year Threshold_lower Rate Tax paid (up to MTR) Income tax
# <char> <num> <num> <num> <num> <num> <num>
# 1: B 2020 18000 0.20 0 20000 400
# 2: A 2020 0 0.00 0 15000 0
# 3: C 2021 50000 0.45 6400 80000 19900
# 4: B 2021 18000 0.20 0 45000 5400
英文:
Not sure if I'm missing something, isn't this just a simple non-equi merge with no special handling required?
# because names/values are lost in the merge
parameters[, thlow := Threshold_lower]
parameters[taxation_data, on = .(Year, thlow <= Income, Threshold_upper >= Income)
][, c("Income", "thlow", "Threshold_upper") := .(thlow, NULL, NULL)
][, tax := (Income - Threshold_lower) * Rate + `Tax paid (up to MTR)`
][]
# Component Year Threshold_lower Rate Tax paid (up to MTR) Income tax
# <char> <num> <num> <num> <num> <num> <num>
# 1: B 2020 18000 0.20 0 20000 400
# 2: A 2020 0 0.00 0 15000 0
# 3: C 2021 50000 0.45 6400 80000 19900
# 4: B 2021 18000 0.20 0 45000 5400
答案2
得分: 1
通过每年向“Income”添加一个固定金额,我们可以使用单个“findInterval”调用手动执行连接。作为一个函数:
library(data.table)
tax_join2 <- function(parameters, taxation_data) {
# add an amount every year after the first so there is no overlap in
# components between years
interval <- max(parameters$Threshold_lower, taxation_data$Income) + 1
min_year <- min(parameters$Year)
parameters2 <- setorder(copy(parameters), Year, Threshold_lower)[
,Threshold_upper := Threshold_lower + interval*(Year - min_year)
]
setcolorder(
taxation_data[
,c(
"Component",
"Threshold_lower",
"Rate",
"Tax paid (up to MTR)"
) := parameters2[
findInterval(
Income + interval*(taxation_data$Year - min_year),
parameters2$Threshold_upper
),
c(1, 3, 5, 6)
]
][, tax := (Income - Threshold_lower)*Rate + `Tax paid (up to MTR)`],
c(
"Component",
"Year",
"Threshold_lower",
"Rate",
"Tax paid (up to MTR)",
"Income",
"tax"
)
)
}
Test on the example data:
```R
parameters <- data.table("Component" = c("A","A","B","B","C","C"),
"Year" = c(2020, 2021, 2020, 2021,
2020, 2021),
"Threshold_lower" = c(0,0,18000,18000,40000,50000),
"Threshold_upper" = c(18000,18000,40000,50000,Inf,Inf),
"Rate" = c(0,0,0.2,0.2,0.4,0.45),
"Tax paid (up to MTR)" = c(0,0,0,0,4400,6400))
taxation_data <- data.table("Year" = c(2020,2020,2021,2021),
"Income" = c(20000, 15000,80000,45000))
tax_join2(parameters, taxation_data)[]
Compare timings against a simple non-equi join as proposed by @r2evans (as a function).
tax_join1 <- function(parameters, taxation_data) {
parameters <- copy(parameters)[, thlow := Threshold_lower]
parameters[
taxation_data, on = .(Year, thlow <= Income, Threshold_upper >= Income)
][
, c("Income", "thlow", "Threshold_upper") := .(thlow, NULL, NULL)
][
, tax := (Income - Threshold_lower) * Rate + `Tax paid (up to MTR)`
]
}
Larger example data set, with 100M rows:
```R
set.seed(1673481669)
parameters <- data.table("Component" = rep(LETTERS[1:3], each = 13),
"Year" = rep(2010:2022, 3),
"Threshold_lower" = rep(c(0,18000,40000), each = 13),
"Threshold_upper" = rep(c(18000,40000,Inf), each = 13),
"Rate" = rep(c(0,0.2,0.4), each = 13),
"Tax paid (up to MTR)" = rep(c(0,0,4400), each = 13))
taxation_data <- data.table(Year = sample(2010:2022, 1e8, 1),
Income = runif(1e5, 0, max(parameters$Threshold_lower)*1.3))
Timing:
```R
system.time(dt1 <- tax_join1(parameters, taxation_data))
system.time(dt2 <- tax_join2(parameters, taxation_data))
identical(dt1, dt2)
希望这些信息对你有所帮助。
英文:
By adding a fixed amount to Income
for every year, we can perform the join manually with a single findInterval
call. As a function:
library(data.table)
tax_join2 <- function(parameters, taxation_data) {
# add an amount every year after the first so there is no overlap in
# components between years
interval <- max(parameters$Threshold_lower, taxation_data$Income) + 1
min_year <- min(parameters$Year)
parameters2 <- setorder(copy(parameters), Year, Threshold_lower)[
,Threshold_upper := Threshold_lower + interval*(Year - min_year)
]
setcolorder(
taxation_data[
,c(
"Component",
"Threshold_lower",
"Rate",
"Tax paid (up to MTR)"
) := parameters2[
findInterval(
Income + interval*(taxation_data$Year - min_year),
parameters2$Threshold_upper
),
c(1, 3, 5, 6)
]
][, tax := (Income - Threshold_lower)*Rate + `Tax paid (up to MTR)`],
c(
"Component",
"Year",
"Threshold_lower",
"Rate",
"Tax paid (up to MTR)",
"Income",
"tax"
)
)
}
Test on the example data:
parameters <- data.table("Component" = c("A","A","B","B","C","C"),
"Year" = c(2020, 2021, 2020, 2021,
2020, 2021),
"Threshold_lower" = c(0,0,18000,18000,40000,50000),
"Threshold_upper" = c(18000,18000,40000,50000,Inf,Inf),
"Rate" = c(0,0,0.2,0.2,0.4,0.45),
"Tax paid (up to MTR)" = c(0,0,0,0,4400,6400))
taxation_data <- data.table("Year" = c(2020,2020,2021,2021),
"Income" = c(20000, 15000,80000,45000))
tax_join2(parameters, taxation_data)[]
#> Component Year Threshold_lower Rate Tax paid (up to MTR) Income tax
#> 1: B 2020 18000 0.20 0 20000 400
#> 2: A 2020 0 0.00 0 15000 0
#> 3: C 2021 50000 0.45 6400 80000 19900
#> 4: B 2021 18000 0.20 0 45000 5400
Compare timings against a simple non-equi join as proposed by @r2evans (as a function).
tax_join1 <- function(parameters, taxation_data) {
parameters <- copy(parameters)[, thlow := Threshold_lower]
parameters[
taxation_data, on = .(Year, thlow <= Income, Threshold_upper >= Income)
][
, c("Income", "thlow", "Threshold_upper") := .(thlow, NULL, NULL)
][
, tax := (Income - Threshold_lower) * Rate + `Tax paid (up to MTR)`
]
}
Larger example data set, with 100M rows:
set.seed(1673481669)
parameters <- data.table("Component" = rep(LETTERS[1:3], each = 13),
"Year" = rep(2010:2022, 3),
"Threshold_lower" = rep(c(0,18000,40000), each = 13),
"Threshold_upper" = rep(c(18000,40000,Inf), each = 13),
"Rate" = rep(c(0,0.2,0.4), each = 13),
"Tax paid (up to MTR)" = rep(c(0,0,4400), each = 13))
taxation_data <- data.table(Year = sample(2010:2022, 1e8, 1),
Income = runif(1e5, 0, max(parameters$Threshold_lower)*1.3))
Timing:
system.time(dt1 <- tax_join1(parameters, taxation_data))
#> user system elapsed
#> 41.21 3.86 42.06
system.time(dt2 <- tax_join2(parameters, taxation_data))
#> user system elapsed
#> 9.06 2.17 12.41
identical(dt1, dt2)
#> [1] TRUE
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论