Leave out function parts referring to non-existing columns


Question


I have a big data frame of about 500k rows and a few thousand columns. I was also given a huge function, about 7,000 lines of code, that processes this data to make it handier to work with (setting dates, defining factor levels, and so on). Let's call it `huge_fun`. It looks like this:

    huge_fun <- function(data) {
      data$a1 <- factor(data$a1, levels = letters)
      # .....
      data$j1 <- as.Date(data$j1, origin = "1899-12-30")
      # ....
      data$z1 <- factor(data$z1, levels = LETTERS)
      # .....
      data$z174920 <- as.Date(data$z174920, "%m/%d/%y")
      # .....

      return(data)
    }

I do not need all the columns for my analysis, so to save RAM I load only the ones I need. The problem is that applying `huge_fun` to my data returns an error, because the function refers to columns that I did not load. It looks like this:

    df <- setNames(data.frame(matrix(rep(NA, 10*10), ncol = 10)), paste0(letters[1:10], "1"))
    huge_fun(df)
    # Error in `$<-.data.frame`(`*tmp*`, "z1", value = integer(0)) :
    #   replacement has 0 rows, data has 10
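A minimal sketch of where this error comes from, using a toy three-row data frame rather than the real data (the names `toy`, `missing_col`, `empty_factor` and `err` are made up for illustration): `$` returns `NULL` for a column that does not exist, `factor()` turns that into a zero-length factor, and `$<-` on a data.frame refuses to assign a zero-length value to a data frame that has rows.

```r
# Toy data: only column a1 is "loaded"; z1 is deliberately absent.
toy <- data.frame(a1 = c("a", "b", "c"))

missing_col  <- toy$z1                                 # NULL: no such column
empty_factor <- factor(missing_col, levels = LETTERS)  # zero-length factor

# Assigning the zero-length factor back is the step that raises the error.
err <- tryCatch(
  {
    toy$z1 <- empty_factor
    NULL
  },
  error = function(e) e
)

is.null(missing_col)     # TRUE
length(empty_factor)     # 0
inherits(err, "error")   # TRUE
```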

I tried using a data.table instead of a data.frame object, and indeed the error no longer occurs:

    dt <- setNames(data.table::data.table(matrix(rep(NA, 10*10), ncol = 10)), paste0(letters[1:10], "1"))
    huge_fun(dt)

This works, but the problem is that `huge_fun` then actually adds the columns I did not load! As a result, my data becomes massive, and the function is still running even after hours. How can I leave out the parts of the function that process columns I did not load? Or, referring to the two approaches above: how can we avoid the error with the data.frame approach, or stop the function from adding new columns with the data.table approach? Again, the function has about 7k lines of code and the columns I need differ from analysis to analysis, so I can't edit the function by hand.

Answer 1

Score: 2

As a quick fix, you could create an S3 class and define a method for `$<-`:

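    DF <- iris

    # Toy function that refers to a column (a) that does not exist in DF
    foo <- function(DF) {
      DF$b <- DF$a * 2
      DF
    }

    foo(DF)
    # Error in `$<-.data.frame`(`*tmp*`, b, value = numeric(0)) :
    #   replacement has 0 rows, data has 150

    # Put a custom class in front so that `$<-` dispatches to the method below
    class(DF) <- c("special_DF", class(DF))

    # Skip assignments whose value is NULL or zero-length;
    # hand everything else to the data.frame method
    `$<-.special_DF` <- function(x, name, value) {
      if (is.null(value) || length(value) == 0L) {
        warning("skipping zero-length assignment")
        return(x)
      }
      `$<-.data.frame`(x, name, value)
    }

    foo(DF)
    # works (returns DF unchanged, with a warning)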

You might also need to create methods for `[<-` and `[[<-`.
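A minimal sketch of what such a `[[<-` method could look like, assuming the same `special_DF` class and the same "skip anything zero-length" rule as above. This is an illustration, not part of the original answer; `[<-` is not covered here because zero-length values can be legitimate when the selection itself is empty.

```r
# Hypothetical companion to the `$<-` method: skip zero-length
# assignments made with [[<- and delegate everything else to the
# next method in the class chain (the data.frame method).
`[[<-.special_DF` <- function(x, ..., value) {
  if (length(value) == 0L) {
    warning("skipping zero-length assignment")
    return(x)
  }
  NextMethod()
}

DF <- data.frame(a = 1:3)
class(DF) <- c("special_DF", class(DF))

DF[["b"]] <- DF[["missing"]]   # value is NULL: skipped with a warning
DF[["c"]] <- DF[["a"]] * 2     # ordinary assignment still goes through
names(DF)                      # "a" "c"
```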

However, ultimately, that horrible function needs to be rewritten.
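Until that rewrite happens, the quick fix can be wrapped so that the extra class is only present while the function runs. The sketch below is one way to do that; `with_missing_cols_skipped` and `toy_fun` are hypothetical names, and `toy_fun` is a two-line stand-in for the real `huge_fun`.

```r
# The `$<-` method from the answer, repeated so the sketch runs on its own.
`$<-.special_DF` <- function(x, name, value) {
  if (is.null(value) || length(value) == 0L) {
    warning("skipping zero-length assignment")
    return(x)
  }
  `$<-.data.frame`(x, name, value)
}

# Hypothetical helper: tag the data, run the function, untag the result.
with_missing_cols_skipped <- function(data, fun) {
  class(data) <- c("special_DF", class(data))
  out <- fun(data)
  class(out) <- setdiff(class(out), "special_DF")
  out
}

# Toy stand-in for huge_fun: z1 is referenced but was never loaded.
toy_fun <- function(data) {
  data$a1 <- factor(data$a1, levels = letters)
  data$z1 <- factor(data$z1, levels = LETTERS)
  data
}

toy <- data.frame(a1 = c("a", "b", "c"), j1 = 1:3)
res <- with_missing_cols_skipped(toy, toy_fun)
names(res)   # "a1" "j1"
class(res)   # "data.frame"
```

One caveat with this approach: because the method also skips `NULL` values, a line such as `data$col <- NULL`, which normally deletes a column, is skipped as well while the `special_DF` class is attached.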





huangapple
  • Published on April 4, 2023, 15:23:40
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75926560.html