Efficient way to take a product of vectors, and then apply an operator to the output matrix and then sum all of the matrices

huangapple go评论67阅读模式
英文:

Efficient way to take a product of vectors, and then apply an operator to the output matrix and then sum all of the matrices

问题

I have a dataset, X, with n rows and d columns.
I want to take the dot product with each row to the transpose of itself. In R code this would be x %*% t(x), and this gives a d by d matrix.

I then truncate the values in the matrix using a function:
Where tau is a d by d matrix.

trunc_operator = function(x, tau){

  x = ifelse(abs(x) > tau, tau*sign(x), x)
  return(x)

}

I do this for each row, and then sequentially add them together.

acf_mat = matrix(0, ncol(data), ncol(data))
for(i in 1:nrow(data)){
  acf_mat = acf_mat + mat_prod_pairwise_trunc(data[i,], data[i,], tau = tau)
}

The reason I do this sequentially is so that I do not have to store many d by d matrices, but just one matrix at a time, and therefore less memory is used.

The mat_prod_pairwise_trunc() function is doing the dot product and then applying the trunc_operator function.

mat_prod_pairwise_trunc = function(x,y, tau){
  trunc_operator((x %*% t(y))/2,tau)
}

However the for loop is slow, and I want to improve the speed.

Ideas so far

  1. writing the whole for loop in C++ using Rcpp
  2. Just writing the trunc_operator() function in C++
  3. Use parallelisation, by maybe splitting the data into batches?

What is an efficient way of carrying this out?


英文:

I have a dataset, X, with n rows and d columns.
I want to take the dot product with each row to the transpose of itself. In R code this would be x %*% t(x), and this gives a d by d matrix.

I then truncate the values in the matrix using a function:
Where tau is a d by d matrix.

trunc_operator = function(x, tau){

  x = ifelse(abs(x) > tau, tau*sign(x), x)
  return(x)

}

I do this for each row, and then sequentially add them together.

acf_mat = matrix(0, ncol(data), ncol(data))
for(i in 1:nrow(data)){
  acf_mat = acf_mat + mat_prod_pairwise_trunc(data[i,], data[i,], tau = tau)
}

The reason I do this sequentially is so that I do not have to store many d by d matrices, but just one matrix at a time, and therefore less memory is used.

The mat_prod_pairwise_trunc() function is doing the dot product and then applying the trunc_operator function.

mat_prod_pairwise_trunc = function(x,y, tau){
  trunc_operator((x %*% t(y))/2,tau)
}

However the for loop is slow, and I want to improve the speed.

Ideas so far

  1. writing the whole for loop in C++ using Rcpp
  2. Just writing the trunc_operator() function in C++
  3. Use parallelisation, by maybe splitting the data into batches?

What is an efficient way of carrying this out?


Reproducible Example
Below is the code of the full example:

# creating dataset
n = 200
d = 200

data = rt(n = n*d, df = 4.1)
data = matrix(data, nrow = n, ncol = d)

data_pairwise_fun = function(data){
  n = nrow(data)
  data_pairwise = matrix(nrow = (n*(n-1))/2, ncol = ncol(data))
  for(i in 1:(n-1)){
    for(j in 1:(n-i)){
      data_pairwise[((i-1) * n) - (i*(i-1)/2) + j, ] = data[i,] - data[i + j,]
    }
  }
  return(data_pairwise)
}

data = data_pairwise_fun(data)


# tau
t_tau_mat = function(df, delta, d, n){
  off = (df/(df-2))^2
  on = ((1/2)*((3*df^2)/(df - 2)*(df - 4))) + (3/2)*off

  V_mat = diag(on - off, nrow = d, ncol = d)
  V_mat = V_mat + off
  V_mat = sqrt(V_mat)
  V_mat * (sqrt(floor((n/2))/(2*log(d) + log(1/delta))))
}

tau = t_tau_mat(4.1,1/n,d,n)

# functions:
trunc_operator = function(x, tau){

  x = ifelse(abs(x) > tau, tau*sign(x), x)
  return(x)

}

mat_prod_pairwise_trunc = function(x,y, tau){
  trunc_operator((x %*% t(y))/2,tau)
}


# Main job:
acf_mat = matrix(0, ncol(data), ncol(data))
for(i in 1:nrow(data)){
  acf_mat = acf_mat + mat_prod_pairwise_trunc(data[i,], data[i,], tau = tau)
}

答案1

得分: 8

以下是您要翻译的代码部分:

Here are some tricks for ~5 times improvement.

***trunc*** **operator**

The slow part here is the `ifelse()` statement which operates on each element separately. We can gain speed by vectorizing it and making it operate on all the passed elements at once:

trunc = function(x, tau){
  inds <- abs(x) > tau
  x[inds] <- tau[inds] * sign(x[inds])
  x
}

**Outer products**

The slow part here is performing operations row by row. Your case has more rows than columns. The trick - achieve the same result by operating on columns.

res  = matrix(0, ncol(data), ncol(data))
tau2 = tau * 2
for(i in 1:ncol(data)) {
  inds = i:ncol(data)
  res[i,inds] = colSums(trunc(data[,inds,drop=FALSE] * data[,i], tau2[rep(i,nrow(data)),inds]))
}
res[lower.tri(res)] = t(res)[lower.tri(res)]
res = res / 2

**Short explanation**

Think about how the result is achieved. The final matrix has as many rows and columns as there are columns in the data. For each row you do an outer product and then sum those products.

Now, think about all the operations that went into producing the first element of the result. It takes the first element of the first column, squares it, then the second element of the first column, squares it, etc., applies the truncating operation on all squares and sums them up.

Hence, we can achieve the same result by doing:

res <- sum(trunc_operator(data[,1] * data[,1] / 2, tau))

Next, think about how the second element is computed in your matrix. It takes the product of first element in the first row and the second element if the first row, adds the first element of the second row multiplied by the second element of the second row, etc., and again applies the truncating operator and sums the result. We can get that by:

res <- sum(trunc_operator(data[,1] * data[,2] / 2, tau))

But there is one more trick. We can get the first elements for every column at once by multiplying with the whole matrix:

res <- colSums(trunc_operator(data[,1] * data / 2, tau))

What we see is that we can construct the matrix by iterating over columns rather than over rows. Since you have fewer columns than rows this should be faster.

But there is one more trick. We note that the resulting matrix is always symmetrical. Hence after each iteration we can discard the result for the already computed columns and only store the results for the upper diagonal. Then restore the full matrix at the end.

Benchmark with all 3 proposed solutions:

microbenchmark::microbenchmark(
  original = {
    res = matrix(0, ncol(data), ncol(data))
    for(i in 1:nrow(data)) {
      res = res + mat_prod_pairwise_trunc(data[i,], data[i,], tau = tau)
    }
    res
  },
  newProd = {
    tau2 = tau * 2
    res = matrix(0, d, d)
    for (i in seq_len(nrow(dat))
      res = res + newProd(data[i, ], tau = tau2)
    res = res / 2
    res
  },
  byCol = {
    res  = matrix(0, ncol(data), ncol(data))
    tau2 = tau * 2
    for(i in 1:ncol(data)) {
      inds = i:ncol(data)
      res[i,inds] = colSums(trunc(data[,inds,drop=FALSE] * data[,i], tau2[rep(i,nrow(data)),inds]))
    }
    res[lower.tri(res)] = t(res)[lower.tri(res)]
    res = res / 2
    res
  },
  times = 2,
  unit = "s",
  check = "equal")

希望这能帮助您理解这些代码部分。

英文:

Here are some tricks for ~5 times improvement.

trunc operator

The slow part here is the ifelse() statement which operates on each element separately. We can gain speed by vectorizing it and making it operate on all the passed elements at once:

trunc = function(x, tau){
inds <- abs(x) > tau
x[inds] <- tau[inds] * sign(x[inds])
x
}

Outer products

The slow part here is performing operations row by row. Your case has more rows than columns. The trick - achieve the same result by operating on columns.

res  = matrix(0, ncol(data), ncol(data))
tau2 = tau * 2
for(i in 1:ncol(data)) {
inds = i:ncol(data)
res[i,inds] = colSums(trunc(data[,inds,drop=FALSE] * data[,i], tau2[rep(i,nrow(data)),inds]))
}
res[lower.tri(res)] = t(res)[lower.tri(res)]
res = res / 2

Short explanation

Think about how the result is achieved. The final matrix has as many rows and columns as there are columns in the data. For each row you do an outer product and then sum those products.

Now, think about all the operations that went into producing the first element of the result. It takes the first element of the first column, squares it, then the second element of the first column, squares it, etc., applies the truncating operation on all squares and sums them up.

Hence, we can achieve the same result by doing:

res <- sum(trunc_operator(data[,1] * data[,1] / 2, tau))

Next, think about how the second element is computed in your matrix. It takes the product of first element in the first row and the second element if the first row, adds the first element of the second row multiplied by the second element of the second row, etc., and again applies the truncating operator and sums the result. We can get that by:

res <- sum(trunc_operator(data[,1] * data[,2] / 2, tau))

But there is one more trick. We can get the first elements for every column at once by multiplying with the whole matrix:

res <- colSums(trunc_operator(data[,1] * data / 2, tau))

What we see is that we can construct the matrix by iterating over columns rather than over rows. Since you have fewer columns than rows this should be faster.

But there is one more trick. We note that the resulting matrix is always symmetrical. Hence after each iteration we can discard the result for the already computed columns and only store the results for the upper diagonal. Then restore the full matrix at the end.

Benchmark with all 3 proposed solutions:

microbenchmark::microbenchmark(
original = {
res = matrix(0, ncol(data), ncol(data))
for(i in 1:nrow(data)) {
res = res + mat_prod_pairwise_trunc(data[i,], data[i,], tau = tau)
}
res
},
newProd = {
tau2 = tau * 2
res = matrix(0, d, d)
for (i in seq_len(nrow(dat)))
res = res + newProd(data[i, ], tau = tau2)
res = res / 2
res
},
byCol = {
res  = matrix(0, ncol(data), ncol(data))
tau2 = tau * 2
for(i in 1:ncol(data)) {
inds = i:ncol(data)
res[i,inds] = colSums(trunc(data[,inds,drop=FALSE] * data[,i], tau2[rep(i,nrow(data)),inds]))
}
res[lower.tri(res)] = t(res)[lower.tri(res)]
res = res / 2
res
},
times = 2,
unit = "s",
check = "equal")
Unit: seconds
expr       min        lq     mean   median       uq      max neval cld
original 47.822215 47.822215 49.94679 49.94679 52.07136 52.07136     2  a
newProd 12.371246 12.371246 12.89065 12.89065 13.41006 13.41006     2   b
byCol  9.778319  9.778319 10.11568 10.11568 10.45303 10.45303     2   b

答案2

得分: 8

这是一个Rcpp方法。在提供的数据集上,它只需要不到一秒:

Rcpp::cppFunction(
  "std::vector<std::vector<double>> funCpp(std::vector<std::vector<double>> &data,
     std::vector<std::vector<double>> &tau){
   int p = data.size(), n = data[0].size();
   std::vector<std::vector<double>> results(p, std::vector<double>(p));
   for(int i = 0; i < p; i++){
    for(int j = i; j < p; j++){
      double res = 0;
      double _tau = tau[i][j];
      for(int k = 0; k < n; k++){
        double ans = data[i][k] * data[j][k]/2;
        double sgn = (0 < ans) - (ans < 0);
        res += ans * sgn > _tau ? _tau * sgn : ans;
      }
      results[i][j] = res;
      results[j][i] = res;
    }
   }
  return results;
}")
  
funR <- function(data, tau){
  out <- funCpp(data.frame(data), data.frame(tau))
  simplify2array(out)
}
  
s <- funR(data, tau)
all.equal(s, original)
[1] TRUE

最终结果s不到一秒就能计算出来。它比原始函数快约25倍。由于第一个答案与原问题的结果不同,所以我无法使用microbenchmark

英文:

Here is an Rcpp approach. It takes less than a second on the provided dataset:

Rcpp::cppFunction(
&quot;std::vector&lt;std::vector&lt;double&gt;&gt;
funCpp(std::vector&lt;std::vector&lt;double&gt;&gt; &amp;data,
std::vector&lt;std::vector&lt;double&gt;&gt; &amp;tau){
int p = data.size(), n = data[0].size();
std::vector&lt;std::vector&lt;double&gt;&gt; results(p, std::vector&lt;double&gt;(p));
for(int i = 0; i &lt; p; i++){
for(int j = i; j &lt; p; j++){
double res = 0;
double _tau = tau[i][j];
for(int k = 0; k &lt; n; k++){
double ans = data[i][k] * data[j][k]/2;
double sgn = (0 &lt; ans) - (ans &lt; 0);
res += ans * sgn &gt; _tau ? _tau * sgn : ans;
}
results[i][j] = res;
results[j][i] = res;
}
}
return results;
}&quot;)
funR &lt;- function(data, tau){
out &lt;- funCpp(data.frame(data), data.frame(tau))
simplify2array(out)
}
s &lt;- funR(data, tau)
all.equal(s, original)
[1] TRUE

The final result, s, takes less than a second to compute. It is ~25 times faster than the original function. I was unable to use microbenchmark with the provided answers above as the first answer does not produce the same results as the original question.

答案3

得分: 7

You are evaluating x %*% t(x) instead of tcrossprod(x, NULL). These call BLAS routines DGEMM and DSYRK, respectively, where only DSYRK takes advantage of symmetry.

So your first step should be to call tcrossprod(x, NULL) where appropriate and not x %*% t(x) or even tcrossprod(x, x).

If the resulting code is still too slow for your purposes, then experiment with different BLAS implementations, depending on your hardware. But note that external BLAS implementations will not be used if the vector x contains special values, namely NaN, NA_real_, Inf, or -Inf.

I've retyped your example below, optimizing some of the set-up code and taking some liberties with style and formatting (partly for syntax highlighting in Emacs). There is a benchmark against your code at the end.

英文:

You are evaluating x %*% t(x) instead of tcrossprod(x, NULL). These call BLAS routines DGEMM and DSYRK, respectively, where only DSYRK takes advantage of symmetry.

So your first step should be to call tcrossprod(x, NULL) where appropriate and not x %*% t(x) or even tcrossprod(x, x).

If the resulting code is still too slow for your purposes, then experiment with different BLAS implementations, depending on your hardware. But note that external BLAS implementations will not be used if the vector x contains special values, namely NaN, NA_real_, Inf, or -Inf.

I've retyped your example below, optimizing some of the set-up code and taking some liberties with style and formatting (partly for syntax highlighting in Emacs). There is a benchmark against your code at the end.

set.seed(0)
n &lt;- 200L
d &lt;- 200L
df &lt;- 4.1
mkDat &lt;- function(x) {
n &lt;- nrow(x)
if (n &lt; 2L)
return(matrix(0, 0L, ncol(x)))
i1 &lt;- rep.int(1L:(n - 1L), (n - 1L):1L)
i2 &lt;- sequence.default((n - 1L):1L, 2L:n, 1L)
x[i1, ] - x[i2, ]
}
dat &lt;- mkDat(matrix(rt(n * d, df), n, d))
mkTau &lt;- function(df, delta, n, d) {
off &lt;- (df/(df - 2))^2
on &lt;- 0.5 * (3 * df^2)/(df - 2) * (df - 4) + 1.5 * off
sqrt(diag(on - off, d, d) + off) *
sqrt(floor(n / 2)/(2 * log(d) - log(delta)))
}
tau &lt;- mkTau(df, 1/n, n, d)
oldProd &lt;- function(x, y, tau) {
r &lt;- (x %*% t(y)) / 2
ifelse(abs(r) &gt; tau, tau * sign(r), r)
}
newProd &lt;- function(x, y = NULL, tau) {
r &lt;- tcrossprod(x, y)
if (length(i &lt;- which(abs(r) &gt; tau)))
r[i] &lt;- tau[i] * sign(r[i])
r
}
dat.invsqrt2 &lt;- dat / sqrt(2)
microbenchmark::microbenchmark(
oldLoop = {
res &lt;- matrix(0, d, d)
for (i in seq_len(nrow(dat))) 
res &lt;- res + oldProd(dat[i, ], dat[i, ], tau = tau)
res
},
newLoop = { 
res &lt;- matrix(0, d, d)
for (i in seq_len(nrow(dat)))
res &lt;- res + newProd(dat.invsqrt2[i, ], tau = tau)
res 
},
times = 10L,
unit = &quot;s&quot;,
check = &quot;equal&quot;)
Unit: seconds
expr       min        lq      mean    median        uq       max neval
oldLoop 15.422133 15.509987 15.626912 15.564334 15.681810 16.100875    10
newLoop  3.928236  4.021573  4.040398  4.053979  4.074112  4.087461    10

huangapple
  • 本文由 发表于 2023年4月10日 20:28:24
  • 转载请务必保留本文链接:https://go.coder-hub.com/75977129.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定