Leave out function parts referring to non-existent columns
Question
I have a large data frame with about 500k rows and a few thousand columns. I was also given a huge function, about 7,000 lines of code, that processes this data to make it easier to work with (setting dates, defining factor levels, and so on). Let's call it `huge_fun`. It looks like this:
huge_fun <- function(data) {
  data$a1 <- factor(data$a1, levels = letters)
  # .....
  data$j1 <- as.Date(data$j1, origin = "1899-12-30")
  # ....
  data$z1 <- factor(data$z1, levels = LETTERS)
  # .....
  data$z174920 <- as.Date(data$z174920, "%m/%d/%y")
  # .....
  return(data)
}
I do not need all the columns for my analysis, so to save RAM I load only the ones I need. The problem is that applying `huge_fun` to my data throws an error, because the function touches columns that I did not load. It looks like this:
df <- setNames(data.frame(matrix(rep(NA, 10*10), ncol= 10)), paste0(letters[1:10], "1"))
huge_fun(df)
# Error in `$<-.data.frame`(`*tmp*`, "z1", value = integer(0)) :
# replacement has 0 rows, data has 10
I tried using a `data.table` rather than a `data.frame` object, and indeed the error no longer occurs:
dt <- setNames(data.table::data.table(matrix(rep(NA, 10*10), ncol= 10)), paste0(letters[1:10], "1"))
huge_fun(dt)
This runs, but the problem is that `huge_fun` now actually adds the columns I did not load! As a result my data becomes massive, and the function is still running even after hours. How can I leave out the parts of the function that process columns I did not load? Or, referring to the two approaches above: how can I avoid the error with the `data.frame` approach, or stop the function from adding new columns with the `data.table` approach? Again, the function has about 7k lines of code and the columns I need differ from analysis to analysis, so I can't edit the function by hand.
Answer 1
Score: 2
As a quick fix, you could create an S3 class and define a method for `$<-`:
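DF <- iris
foo <- function(DF) {
  DF$b <- DF$a * 2
  DF
}
foo(DF)
#Error in `$<-.data.frame`(`*tmp*`, b, value = numeric(0)) :
#  replacement has 0 rows, data has 150
# Prepend a custom class so that `$<-` dispatches to the method below
class(DF) <- c("special_DF", class(DF))
`$<-.special_DF` <- function(x, name, value) {
  # A NULL or zero-length value means the source column was missing: skip it.
  # Note that this also turns `DF$col <- NULL` (column deletion) into a no-op.
  if (is.null(value) || length(value) == 0L) {
    warning("skipping zero-length assignment")
    return(x)
  }
  # Otherwise fall back to the regular data.frame behaviour
  `$<-.data.frame`(x, name, value)
}
foo(DF)
# works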
You might also need to create methods for `[<-` and `[[<-`.
However, ultimately, that horrible function needs to be rewritten.
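To illustrate the `[[<-` suggestion, here is a minimal sketch of the same idea extended to `[[<-`, using a cut-down stand-in for the question's function. The class name `skip_missing` and the helpers `as_skip_missing`, `is_empty_new_column` and `small_fun` are hypothetical names made up for this example. Unlike the method above, this variant only skips an assignment when the target column does not exist yet, so `df$col <- NULL` still deletes a column. It does not cover `[<-`.

```r
# Hypothetical class name "skip_missing"; not part of base R or any package.
as_skip_missing <- function(df) {
  class(df) <- c("skip_missing", class(df))
  df
}

# TRUE when an assignment would create a *new* column from an empty value,
# which is what happens when the right-hand side came from a missing column.
is_empty_new_column <- function(x, name, value) {
  is.character(name) && length(name) == 1L &&
    !(name %in% names(x)) && length(value) == 0L
}

`$<-.skip_missing` <- function(x, name, value) {
  if (is_empty_new_column(x, name, value)) return(x)
  `$<-.data.frame`(x, name, value)
}

`[[<-.skip_missing` <- function(x, i, j, value) {
  if (missing(j)) {
    # Column-style access: x[["col"]] <- value
    if (is_empty_new_column(x, i, value)) return(x)
    return(`[[<-.data.frame`(x, i, value = value))
  }
  # Cell-style access x[[i, j]] <- value is passed through unchanged
  `[[<-.data.frame`(x, i, j, value)
}

# Cut-down stand-in for huge_fun (assumption: the real one uses the same idioms)
small_fun <- function(data) {
  data$a1 <- factor(data$a1, levels = letters)
  data$z1 <- factor(data$z1, levels = LETTERS)            # z1 was never loaded
  data[["y1"]] <- factor(data[["y1"]], levels = LETTERS)  # neither was y1
  data
}

df <- data.frame(a1 = c("a", "b", "c"), b1 = 1:3)
res <- small_fun(as_skip_missing(df))
names(res)

# Drop the helper class again once processing is done
class(res) <- setdiff(class(res), "skip_missing")
```

Because the skipped assignments return the data unchanged, no new columns should appear and the statements for unloaded columns cost almost nothing. Statements that fail before the assignment is reached (for example, a conversion function that rejects `NULL` input) would still raise an error.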