Misunderstanding the use of 'apply'

Question

I have a function:

myFun <- function(x, y)
{
}

It is intended to process one column of a data frame:

myFun(dataFrame$Column, anotherParameterValue)

dataFrame$Column is a factor with 4 levels. The function recognizes it and works as expected. I attached a screenshot of the environment from the debugger (breakpoint on the first line inside the function).

[Screenshot: debugger environment inside myFun when called with dataFrame$Column]

It also works if passed by index:

myFun(dataFrame[1], anotherParameterValue)

[Screenshot: debugger environment inside myFun when called with dataFrame[1]]

But if I write:

apply(dataFrame, 2, myFun, y = anotherParameterValue)

The data passed to the function in 'x' is very different:

[Screenshot: debugger environment inside myFun when called through apply]

I suppose there must be something I don't understand about 'apply'...

If you need the code inside my function, tell me, but I don't think it's necessary, since the problem already shows up in the data received through the parameters.

Answer 1

Score: 1

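As explained in the comments, apply is meant for matrices (and arrays). When you give it a data frame, R silently converts the frame to a matrix first, and a matrix can hold only one type: if any column is non-numeric (a factor or a character column, for instance), every column ends up as character.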

A working example:

set.seed(42)
quux <- data.frame(int1=sample(1000,3), int2=sample(1000,3), num3=runif(3), num4=runif(3)) |>
  transform(fctr5 = factor(int1), chr6=as.character(int2))
quux
#   int1 int2      num3      num4 fctr5 chr6
# 1  561  153 0.7365883 0.7050648   561  153
# 2  997   74 0.1346666 0.4577418   997   74
# 3  321  228 0.6569923 0.7191123   321  228
myfun <- function(z, y = 0) y + mean(z)
myfun(quux$int2, 1000)
# [1] 1151.667
apply(quux, 2, myfun, y = 1000)
# Warning in mean.default(z) :
#   argument is not numeric or logical: returning NA
# Warning in mean.default(z) :
#   argument is not numeric or logical: returning NA
# Warning in mean.default(z) :
#   argument is not numeric or logical: returning NA
# Warning in mean.default(z) :
#   argument is not numeric or logical: returning NA
# Warning in mean.default(z) :
#   argument is not numeric or logical: returning NA
# Warning in mean.default(z) :
#   argument is not numeric or logical: returning NA
#  int1  int2  num3  num4 fctr5  chr6 
#    NA    NA    NA    NA    NA    NA 

If we debug myfun and step into what's going on, we'll immediately see a problem:

debug(myfun)
apply(quux, 2, myfun, y = 1000)
# debugging in: FUN(newX[, i], ...)
# debug at #1: y + mean(z)
y
# [1] 1000
z
# [1] "561" "997" "321"

You can continue through each call to myfun, once per column. You'll find that they are all class character.
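The coercion can be checked without the debugger. The following is a minimal sketch on a hypothetical two-column frame (toy is not the asker's data); it relies only on as.matrix, which apply uses internally for data frames:

```r
# Hypothetical toy frame: one numeric column, one factor column
toy <- data.frame(n = c(1.5, 2.5), f = factor(c("a", "b")))

# A matrix holds a single type, so one non-numeric column
# forces every cell to character
m_mixed <- as.matrix(toy)
typeof(m_mixed)
# [1] "character"

# With only numeric columns the matrix stays numeric
m_num <- as.matrix(toy["n"])
typeof(m_num)
# [1] "double"

# This is what each call of the function sees through apply()
col_classes <- apply(toy, 2, class)
col_classes
#           n           f 
# "character" "character" 
```

Note that even the numeric column n arrives as character, which is why every column in the example above produced NA.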

It may seem "obvious" that you cannot compute on strings. Factors are trickier: some math operations do run on a factor (mean does not), but you should not rely on them, because depending on the function the operation may act on the factor's integer codes or on the string representation of its levels, and those are very different things.
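To make the difference between codes and labels concrete, here is a small sketch using the same three values as fctr5 (the variable names are only illustrative):

```r
# A factor whose labels happen to look like numbers
f <- factor(c("561", "997", "321"))

# The integer codes are positions in levels(f), which are sorted
# as strings: "321" "561" "997"
codes <- as.integer(f)
codes
# [1] 2 3 1

# To get the numbers the labels spell out, go through character first
values <- as.numeric(as.character(f))
values
# [1] 561 997 321
```

Any function that quietly uses the codes would compute on 2, 3, 1 rather than on 561, 997, 321.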

How do we fix this? Subset the frame so that you're only operating on the number-like columns.

isnum <- sapply(quux, is.numeric)
isnum
#  int1  int2  num3  num4 fctr5  chr6 
#  TRUE  TRUE  TRUE  TRUE FALSE FALSE 
apply(quux[,isnum], 2, myfun, y = 1000)
#     int1     int2     num3     num4 
# 1626.333 1151.667 1000.509 1000.627 

Note that apply itself is not necessary here: lapply or sapply work too, depending on what you plan to do with the return value. For example, if you only need the averages as above, use:

sapply(quux[,isnum], myfun, y = 1000)
#     int1     int2     num3     num4 
# 1626.333 1151.667 1000.509 1000.627 
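The choice between these functions mostly comes down to the shape of the result. A minimal sketch on a hypothetical frame (toy and the result names are illustrative); vapply is shown as a stricter alternative to sapply, which is my addition and not part of the original answer:

```r
# Hypothetical all-numeric toy frame
toy <- data.frame(a = c(1, 2, 3), b = c(10, 20, 30))

# lapply always returns a list, one element per column
as_list <- lapply(toy, mean)

# sapply simplifies the list to a named vector when it can
as_vector <- sapply(toy, mean)
as_vector
#  a  b 
#  2 20 

# vapply does the same but fails loudly unless each result
# is exactly one number
as_strict <- vapply(toy, mean, numeric(1))
```

Because a data frame is a list of columns, all three iterate over columns directly and never convert the frame to a matrix.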

But if you want to replace the frame's values (for some reason ... work with me), you might do:

quux[isnum] <- lapply(quux[isnum], myfun, y = 1000)
quux
#       int1     int2     num3     num4 fctr5 chr6
# 1 1626.333 1151.667 1000.509 1000.627   561  153
# 2 1626.333 1151.667 1000.509 1000.627   997   74
# 3 1626.333 1151.667 1000.509 1000.627   321  228

Or, if you want to append the results to quux as new columns instead:

# (starting with the original quux)
isnum_ch <- names(isnum)[isnum]
isnum_ch <- paste0(isnum_ch, "_new")
isnum_ch
# [1] "int1_new" "int2_new" "num3_new" "num4_new"
cbind(quux, setNames(lapply(quux[isnum], myfun, y = 500), isnum_ch))
#   int1 int2      num3      num4 fctr5 chr6 int1_new int2_new num3_new num4_new
# 1  561  153 0.7365883 0.7050648   561  153 1126.333 651.6667 500.5094 500.6273
# 2  997   74 0.1346666 0.4577418   997   74 1126.333 651.6667 500.5094 500.6273
# 3  321  228 0.6569923 0.7191123   321  228 1126.333 651.6667 500.5094 500.6273
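The select-then-iterate pattern used throughout this answer can be bundled into one helper. This is only a sketch: apply_numeric is a hypothetical name, not a base R function, and it assumes the supplied function returns a single number per column.

```r
# Hypothetical helper: run `fun` on the numeric columns of a data frame only
apply_numeric <- function(frame, fun, ...) {
  stopifnot(is.data.frame(frame))
  # Logical vector marking the numeric (integer or double) columns
  isnum <- vapply(frame, is.numeric, logical(1))
  # Iterate over columns as a list, so nothing is coerced to a matrix
  vapply(frame[isnum], fun, numeric(1), ...)
}

myfun <- function(z, y = 0) y + mean(z)
toy <- data.frame(a = 1:3, b = c("x", "y", "z"), d = c(0.5, 1.5, 2.5))

res <- apply_numeric(toy, myfun, y = 10)
res
#    a    d 
# 12.0 11.5 
```

The character column b is skipped rather than turned into NA with a warning, which is the behavior the asker was missing with apply.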

huangapple
  • Posted on 2023-02-07 01:36:46
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75364714.html