英文:
How is R able to sum an integer sequence so fast?
问题
创建一个大的连续整数序列:
x <- 1:1e20
R如何能够如此快速地计算sum
?
sum(x)
它不需要循环遍历1e20个向量元素并求和吗?
英文:
Create a large contiguous sequence of integers:
x <- 1:1e20
How is R able to compute the sum
so fast?
sum(x)
Doesn't it have to loop over 1e20 elements in the vector and sum each element?
答案1
得分: 6
总结评论:
R引入了一种称为ALTREP(ALternate REPresentation)的东西,用于处理R对象。其目的是更高效地执行一些操作。从https://www.r-project.org/dsc/2017/slides/dsc2017.pdf,一些示例包括:
- 允许向量数据存储在内存映射文件或分布式系统中
- 允许紧凑的表示算术序列
- 允许向对象添加元数据
- 允许延迟进行计算/分配
- 支持环境的替代表示
第二和第四点似乎在这里适用。
我们可以通过查看我推测是R sum
原始函数的核心部分,位于https://github.com/wch/r-source/blob/7c0449d81c853f781fb13e9c7118065aedaf2f7f/src/main/altclasses.c#L262中,来看到这一点:
static SEXP compact_intseq_Sum(SEXP x, Rboolean narm)
{
#ifdef COMPACT_INTSEQ_MUTABLE
/* If the vector has been expanded it may have been modified. */
if (COMPACT_SEQ_EXPANDED(x) != R_NilValue)
return NULL;
#endif
double tmp;
SEXP info = COMPACT_SEQ_INFO(x);
R_xlen_t size = COMPACT_INTSEQ_INFO_LENGTH(info);
R_xlen_t n1 = COMPACT_INTSEQ_INFO_FIRST(info);
int inc = COMPACT_INTSEQ_INFO_INCR(info);
tmp = (size / 2.0) * (n1 + n1 + inc * (size - 1));
if(tmp > INT_MAX || tmp < R_INT_MIN)
/**** check for overflow of exact integer range? */
return ScalarReal(tmp);
else
return ScalarInteger((int) tmp);
}
换句话说,对于没有间隔的整数序列的减少是微不足道的。当存在间隔或NA
时,事情变得更加复杂。
在实际操作中:
vec <- 1:1e10
sum(vec)
# [1] 5e+19
sum(vec[-10])
# Error: cannot allocate vector of size 37.3 Gb
### win11, R-4.2.2
理想情况下,我们会看到sum(vec) == (sum(vec[-10]) + 10)
,但由于我们不能使用序列求和的优化,所以不能这样做。
英文:
Summing up the comments:
R introduced something called ALTREP, or ALternate REPresentation for R objects. Its intent is to do some things more efficiently. From https://www.r-project.org/dsc/2017/slides/dsc2017.pdf, some examples include:
- allow vector data to be in a memory-mapped file or distributed
- allow compact representation of arithmetic sequences;
- allow adding meta-data to objects;
- allow computations/allocations to be deferred;
- support alternative representations of environments.
The second and fourth bullets seem appropriate here.
We can see a hint of this in action by looking at what I'm inferring is at the core of the R sum
primitive for altreps, at https://github.com/wch/r-source/blob/7c0449d81c853f781fb13e9c7118065aedaf2f7f/src/main/altclasses.c#L262:
static SEXP compact_intseq_Sum(SEXP x, Rboolean narm)
{
#ifdef COMPACT_INTSEQ_MUTABLE
/* If the vector has been expanded it may have been modified. */
if (COMPACT_SEQ_EXPANDED(x) != R_NilValue)
return NULL;
#endif
double tmp;
SEXP info = COMPACT_SEQ_INFO(x);
R_xlen_t size = COMPACT_INTSEQ_INFO_LENGTH(info);
R_xlen_t n1 = COMPACT_INTSEQ_INFO_FIRST(info);
int inc = COMPACT_INTSEQ_INFO_INCR(info);
tmp = (size / 2.0) * (n1 + n1 + inc * (size - 1));
if(tmp > INT_MAX || tmp < R_INT_MIN)
/**** check for overflow of exact integer range? */
return ScalarReal(tmp);
else
return ScalarInteger((int) tmp);
}
Namely, the reduction of an integer sequence without gaps is trivial. It's when there are gaps or NA
s that things become a bit more complicated.
In action:
vec <- 1:1e10
sum(vec)
# [1] 5e+19
sum(vec[-10])
# Error: cannot allocate vector of size 37.3 Gb
### win11, R-4.2.2
Where ideally we would see that sum(vec) == (sum(vec[-10]) + 10)
, but we cannot since we can't use the optimization of sequence-summing.
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论