2023年4月10日 20:28:24go评论92阅读模式

英文:

Efficient way to take a product of vectors, and then apply an operator to the output matrix and then sum all of the matrices

问题

I have a dataset, X, with n rows and d columns.
I want to take the dot product with each row to the transpose of itself. In R code this would be x %*% t(x), and this gives a d by d matrix.

I then truncate the values in the matrix using a function:
Where tau is a d by d matrix.

trunc_operator = function(x, tau){
  x = ifelse(abs(x) &gt; tau, tau*sign(x), x)
  return(x)
}

I do this for each row, and then sequentially add them together.

acf_mat = matrix(0, ncol(data), ncol(data))
for(i in 1:nrow(data)){
  acf_mat = acf_mat + mat_prod_pairwise_trunc(data[i,], data[i,], tau = tau)
}

The reason I do this sequentially is so that I do not have to store many d by d matrices, but just one matrix at a time, and therefore less memory is used.

The mat_prod_pairwise_trunc() function is doing the dot product and then applying the trunc_operator function.

mat_prod_pairwise_trunc = function(x,y, tau){
  trunc_operator((x %*% t(y))/2,tau)
}

However the for loop is slow, and I want to improve the speed.

Ideas so far

writing the whole for loop in C++ using Rcpp
Just writing the trunc_operator() function in C++
Use parallelisation, by maybe splitting the data into batches?

What is an efficient way of carrying this out?

英文:

I then truncate the values in the matrix using a function:
Where tau is a d by d matrix.

trunc_operator = function(x, tau){
  x = ifelse(abs(x) &gt; tau, tau*sign(x), x)
  return(x)
}

I do this for each row, and then sequentially add them together.

acf_mat = matrix(0, ncol(data), ncol(data))
for(i in 1:nrow(data)){
  acf_mat = acf_mat + mat_prod_pairwise_trunc(data[i,], data[i,], tau = tau)
}

The reason I do this sequentially is so that I do not have to store many d by d matrices, but just one matrix at a time, and therefore less memory is used.

The mat_prod_pairwise_trunc() function is doing the dot product and then applying the trunc_operator function.

mat_prod_pairwise_trunc = function(x,y, tau){
  trunc_operator((x %*% t(y))/2,tau)
}

However the for loop is slow, and I want to improve the speed.

Ideas so far

writing the whole for loop in C++ using Rcpp
Just writing the trunc_operator() function in C++
Use parallelisation, by maybe splitting the data into batches?

What is an efficient way of carrying this out?

Reproducible Example
Below is the code of the full example:

# creating dataset
n = 200
d = 200
data = rt(n = n*d, df = 4.1)
data = matrix(data, nrow = n, ncol = d)
data_pairwise_fun = function(data){
  n = nrow(data)
  data_pairwise = matrix(nrow = (n*(n-1))/2, ncol = ncol(data))
  for(i in 1:(n-1)){
    for(j in 1:(n-i)){
      data_pairwise[((i-1) * n) - (i*(i-1)/2) + j, ] = data[i,] - data[i + j,]
    }
  }
  return(data_pairwise)
}
data = data_pairwise_fun(data)
# tau
t_tau_mat = function(df, delta, d, n){
  off = (df/(df-2))^2
  on = ((1/2)*((3*df^2)/(df - 2)*(df - 4))) + (3/2)*off
  V_mat = diag(on - off, nrow = d, ncol = d)
  V_mat = V_mat + off
  V_mat = sqrt(V_mat)
  V_mat * (sqrt(floor((n/2))/(2*log(d) + log(1/delta))))
}
tau = t_tau_mat(4.1,1/n,d,n)
# functions:
trunc_operator = function(x, tau){
  x = ifelse(abs(x) &gt; tau, tau*sign(x), x)
  return(x)
}
mat_prod_pairwise_trunc = function(x,y, tau){
  trunc_operator((x %*% t(y))/2,tau)
}
# Main job:
acf_mat = matrix(0, ncol(data), ncol(data))
for(i in 1:nrow(data)){
  acf_mat = acf_mat + mat_prod_pairwise_trunc(data[i,], data[i,], tau = tau)
}

答案1

得分: 8

以下是您要翻译的代码部分：

Here are some tricks for ~5 times improvement.
***trunc*** **operator**
The slow part here is the `ifelse()` statement which operates on each element separately. We can gain speed by vectorizing it and making it operate on all the passed elements at once:
trunc = function(x, tau){
  inds &lt;- abs(x) &gt; tau
  x[inds] &lt;- tau[inds] * sign(x[inds])
  x
}
**Outer products**
The slow part here is performing operations row by row. Your case has more rows than columns. The trick - achieve the same result by operating on columns.
res  = matrix(0, ncol(data), ncol(data))
tau2 = tau * 2
for(i in 1:ncol(data)) {
  inds = i:ncol(data)
  res[i,inds] = colSums(trunc(data[,inds,drop=FALSE] * data[,i], tau2[rep(i,nrow(data)),inds]))
}
res[lower.tri(res)] = t(res)[lower.tri(res)]
res = res / 2
**Short explanation**
Think about how the result is achieved. The final matrix has as many rows and columns as there are columns in the data. For each row you do an outer product and then sum those products.
Now, think about all the operations that went into producing the first element of the result. It takes the first element of the first column, squares it, then the second element of the first column, squares it, etc., applies the truncating operation on all squares and sums them up.
Hence, we can achieve the same result by doing:
res &lt;- sum(trunc_operator(data[,1] * data[,1] / 2, tau))
Next, think about how the second element is computed in your matrix. It takes the product of first element in the first row and the second element if the first row, adds the first element of the second row multiplied by the second element of the second row, etc., and again applies the truncating operator and sums the result. We can get that by:
res &lt;- sum(trunc_operator(data[,1] * data[,2] / 2, tau))
But there is one more trick. We can get the first elements for every column at once by multiplying with the whole matrix:
res &lt;- colSums(trunc_operator(data[,1] * data / 2, tau))
What we see is that we can construct the matrix by iterating over columns rather than over rows. Since you have fewer columns than rows this should be faster.
But there is one more trick. We note that the resulting matrix is always symmetrical. Hence after each iteration we can discard the result for the already computed columns and only store the results for the upper diagonal. Then restore the full matrix at the end.
Benchmark with all 3 proposed solutions:
microbenchmark::microbenchmark(
  original = {
    res = matrix(0, ncol(data), ncol(data))
    for(i in 1:nrow(data)) {
      res = res + mat_prod_pairwise_trunc(data[i,], data[i,], tau = tau)
    }
    res
  },
  newProd = {
    tau2 = tau * 2
    res = matrix(0, d, d)
    for (i in seq_len(nrow(dat))
      res = res + newProd(data[i, ], tau = tau2)
    res = res / 2
    res
  },
  byCol = {
    res  = matrix(0, ncol(data), ncol(data))
    tau2 = tau * 2
    for(i in 1:ncol(data)) {
      inds = i:ncol(data)
      res[i,inds] = colSums(trunc(data[,inds,drop=FALSE] * data[,i], tau2[rep(i,nrow(data)),inds]))
    }
    res[lower.tri(res)] = t(res)[lower.tri(res)]
    res = res / 2
    res
  },
  times = 2,
  unit = &quot;s&quot;,
  check = &quot;equal&quot;)

希望这能帮助您理解这些代码部分。

英文:

Here are some tricks for ~5 times improvement.

trunc operator

The slow part here is the ifelse() statement which operates on each element separately. We can gain speed by vectorizing it and making it operate on all the passed elements at once:

trunc = function(x, tau){
inds &lt;- abs(x) &gt; tau
x[inds] &lt;- tau[inds] * sign(x[inds])
x
}

Outer products

The slow part here is performing operations row by row. Your case has more rows than columns. The trick - achieve the same result by operating on columns.

res  = matrix(0, ncol(data), ncol(data))
tau2 = tau * 2
for(i in 1:ncol(data)) {
inds = i:ncol(data)
res[i,inds] = colSums(trunc(data[,inds,drop=FALSE] * data[,i], tau2[rep(i,nrow(data)),inds]))
}
res[lower.tri(res)] = t(res)[lower.tri(res)]
res = res / 2

Short explanation

Think about how the result is achieved. The final matrix has as many rows and columns as there are columns in the data. For each row you do an outer product and then sum those products.

Now, think about all the operations that went into producing the first element of the result. It takes the first element of the first column, squares it, then the second element of the first column, squares it, etc., applies the truncating operation on all squares and sums them up.

Hence, we can achieve the same result by doing:

res &lt;- sum(trunc_operator(data[,1] * data[,1] / 2, tau))

Next, think about how the second element is computed in your matrix. It takes the product of first element in the first row and the second element if the first row, adds the first element of the second row multiplied by the second element of the second row, etc., and again applies the truncating operator and sums the result. We can get that by:

res &lt;- sum(trunc_operator(data[,1] * data[,2] / 2, tau))

But there is one more trick. We can get the first elements for every column at once by multiplying with the whole matrix:

res &lt;- colSums(trunc_operator(data[,1] * data / 2, tau))

What we see is that we can construct the matrix by iterating over columns rather than over rows. Since you have fewer columns than rows this should be faster.

But there is one more trick. We note that the resulting matrix is always symmetrical. Hence after each iteration we can discard the result for the already computed columns and only store the results for the upper diagonal. Then restore the full matrix at the end.

Benchmark with all 3 proposed solutions:

microbenchmark::microbenchmark(
original = {
res = matrix(0, ncol(data), ncol(data))
for(i in 1:nrow(data)) {
res = res + mat_prod_pairwise_trunc(data[i,], data[i,], tau = tau)
}
res
},
newProd = {
tau2 = tau * 2
res = matrix(0, d, d)
for (i in seq_len(nrow(dat)))
res = res + newProd(data[i, ], tau = tau2)
res = res / 2
res
},
byCol = {
res  = matrix(0, ncol(data), ncol(data))
tau2 = tau * 2
for(i in 1:ncol(data)) {
inds = i:ncol(data)
res[i,inds] = colSums(trunc(data[,inds,drop=FALSE] * data[,i], tau2[rep(i,nrow(data)),inds]))
}
res[lower.tri(res)] = t(res)[lower.tri(res)]
res = res / 2
res
},
times = 2,
unit = &quot;s&quot;,
check = &quot;equal&quot;)
Unit: seconds
expr       min        lq     mean   median       uq      max neval cld
original 47.822215 47.822215 49.94679 49.94679 52.07136 52.07136     2  a
newProd 12.371246 12.371246 12.89065 12.89065 13.41006 13.41006     2   b
byCol  9.778319  9.778319 10.11568 10.11568 10.45303 10.45303     2   b

答案2

得分: 8

这是一个Rcpp方法。在提供的数据集上，它只需要不到一秒：

Rcpp::cppFunction(
  "std::vector<std::vector<double>> funCpp(std::vector<std::vector<double>> &data,
     std::vector<std::vector<double>> &tau){
   int p = data.size(), n = data[0].size();
   std::vector<std::vector<double>> results(p, std::vector<double>(p));
   for(int i = 0; i < p; i++){
    for(int j = i; j < p; j++){
      double res = 0;
      double _tau = tau[i][j];
      for(int k = 0; k < n; k++){
        double ans = data[i][k] * data[j][k]/2;
        double sgn = (0 < ans) - (ans < 0);
        res += ans * sgn > _tau ? _tau * sgn : ans;
      }
      results[i][j] = res;
      results[j][i] = res;
    }
   }
  return results;
}")
  
funR <- function(data, tau){
  out <- funCpp(data.frame(data), data.frame(tau))
  simplify2array(out)
}
  
s <- funR(data, tau)
all.equal(s, original)
[1] TRUE

最终结果s不到一秒就能计算出来。它比原始函数快约25倍。由于第一个答案与原问题的结果不同，所以我无法使用microbenchmark。

英文:

Here is an Rcpp approach. It takes less than a second on the provided dataset:

Rcpp::cppFunction(
&quot;std::vector&lt;std::vector&lt;double&gt;&gt;
funCpp(std::vector&lt;std::vector&lt;double&gt;&gt; &amp;data,
std::vector&lt;std::vector&lt;double&gt;&gt; &amp;tau){
int p = data.size(), n = data[0].size();
std::vector&lt;std::vector&lt;double&gt;&gt; results(p, std::vector&lt;double&gt;(p));
for(int i = 0; i &lt; p; i++){
for(int j = i; j &lt; p; j++){
double res = 0;
double _tau = tau[i][j];
for(int k = 0; k &lt; n; k++){
double ans = data[i][k] * data[j][k]/2;
double sgn = (0 &lt; ans) - (ans &lt; 0);
res += ans * sgn &gt; _tau ? _tau * sgn : ans;
}
results[i][j] = res;
results[j][i] = res;
}
}
return results;
}&quot;)
funR &lt;- function(data, tau){
out &lt;- funCpp(data.frame(data), data.frame(tau))
simplify2array(out)
}
s &lt;- funR(data, tau)
all.equal(s, original)
[1] TRUE

The final result, s, takes less than a second to compute. It is ~25 times faster than the original function. I was unable to use microbenchmark with the provided answers above as the first answer does not produce the same results as the original question.

答案3

得分: 7

You are evaluating x %*% t(x) instead of tcrossprod(x, NULL). These call BLAS routines DGEMM and DSYRK, respectively, where only DSYRK takes advantage of symmetry.

So your first step should be to call tcrossprod(x, NULL) where appropriate and not x %*% t(x) or even tcrossprod(x, x).

If the resulting code is still too slow for your purposes, then experiment with different BLAS implementations, depending on your hardware. But note that external BLAS implementations will not be used if the vector x contains special values, namely NaN, NA_real_, Inf, or -Inf.

I've retyped your example below, optimizing some of the set-up code and taking some liberties with style and formatting (partly for syntax highlighting in Emacs). There is a benchmark against your code at the end.

英文:

You are evaluating x %*% t(x) instead of tcrossprod(x, NULL). These call BLAS routines DGEMM and DSYRK, respectively, where only DSYRK takes advantage of symmetry.

So your first step should be to call tcrossprod(x, NULL) where appropriate and not x %*% t(x) or even tcrossprod(x, x).

set.seed(0)
n &lt;- 200L
d &lt;- 200L
df &lt;- 4.1
mkDat &lt;- function(x) {
n &lt;- nrow(x)
if (n &lt; 2L)
return(matrix(0, 0L, ncol(x)))
i1 &lt;- rep.int(1L:(n - 1L), (n - 1L):1L)
i2 &lt;- sequence.default((n - 1L):1L, 2L:n, 1L)
x[i1, ] - x[i2, ]
}
dat &lt;- mkDat(matrix(rt(n * d, df), n, d))
mkTau &lt;- function(df, delta, n, d) {
off &lt;- (df/(df - 2))^2
on &lt;- 0.5 * (3 * df^2)/(df - 2) * (df - 4) + 1.5 * off
sqrt(diag(on - off, d, d) + off) *
sqrt(floor(n / 2)/(2 * log(d) - log(delta)))
}
tau &lt;- mkTau(df, 1/n, n, d)
oldProd &lt;- function(x, y, tau) {
r &lt;- (x %*% t(y)) / 2
ifelse(abs(r) &gt; tau, tau * sign(r), r)
}
newProd &lt;- function(x, y = NULL, tau) {
r &lt;- tcrossprod(x, y)
if (length(i &lt;- which(abs(r) &gt; tau)))
r[i] &lt;- tau[i] * sign(r[i])
r
}
dat.invsqrt2 &lt;- dat / sqrt(2)
microbenchmark::microbenchmark(
oldLoop = {
res &lt;- matrix(0, d, d)
for (i in seq_len(nrow(dat))) 
res &lt;- res + oldProd(dat[i, ], dat[i, ], tau = tau)
res
},
newLoop = { 
res &lt;- matrix(0, d, d)
for (i in seq_len(nrow(dat)))
res &lt;- res + newProd(dat.invsqrt2[i, ], tau = tau)
res 
},
times = 10L,
unit = &quot;s&quot;,
check = &quot;equal&quot;)

Unit: seconds
expr       min        lq      mean    median        uq       max neval
oldLoop 15.422133 15.509987 15.626912 15.564334 15.681810 16.100875    10
newLoop  3.928236  4.021573  4.040398  4.053979  4.074112  4.087461    10

通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库，让每个人都能够通过互相帮助和分享经验来进步。

Efficient way to take a product of vectors, and then apply an operator to the output matrix and then sum all of the matrices

问题

答案1

答案2

答案3

简单特性 – 创建具有特定属性的新点

在组内进行抽样，匹配子组大小。

无法将命令的输出分配给 $GITHUB_OUTPUT 在 GitHub Actions 中

最新/截至2023年的世界地图

如何在Playwright视觉比较中屏蔽多个定位器？

在C++中，可以使用可变模板参数来检索类型的内部类型。

selenium.common.exceptions.StaleElementReferenceException: Message: stale element reference: stale element not found

Creating and opening a URL to log in to Website via Basic Auth with Robot Framework/Selenium (Python)

AG Grid 在上下文菜单中以大文本形式打开

What's the correct way to type hint an empty list as a literal in python?

如何在Highcharts Gantt中更改本地化的星期名称

如何在同一个流中使用多个过滤器和映射函数？

如何使用Map/Set来将代码优化到O(n)？

.NET MAUI Android在GitHub Actions上构建失败，错误代码为1。