在Matlab中处理内存限制下的矩阵乘法策略。

huangapple go评论58阅读模式
英文:

strategies when memory limits matrix multiplication in matlab

问题

我需要计算sum(foo(x(:)*y),2)),其中x是一个1维向量(1x1000),y也是一个1维向量(1x1e8),都不是稀疏的,而foo由几个不可分离或没有求和规则的三角函数组成(请参见问题末尾的编辑)。为了提高效率,我将向量制作成了单精度数组。还有一个可能重要的信息是,x只是一个单调递增的向量,比如x=linspace(0,1,1000)

尝试执行x(:)*y时,我遇到了内存问题。所以我想使用for循环。

for i=1:length(y)
    A=A+sum(foo(x(:)*y(i)),2);
end

这种方法效率低下,需要很长时间,因此我考虑将总和分解为几个类似大小的块(最后一个块可能会略有不同大小):

nn = 10 ;% # of chunks 
idN = length(y);
splitn = floor(idN/nn);
split = [1:splitn:idN];

for i=1:nn-2
    A = A + sum(foo(x(:)*y(split(ni):split(ni)+splitn-1)),2);
end

A = A + sum(foo(x(:)*y(split(nn-1):end)),2);

这虽然需要很多时间,但并不是永远的(大约45分钟)。除了购买新计算机之外,还有什么其他方法可以加快速度?我不认为parfor会有所帮助,因为内存只有一定量。块的数量是否真的重要,只要每次迭代都包含在内存中?有什么策略或建议可以改进这个乘法运算?

编辑:

在我的初始简化示例中,我有点粗心。真正的问题不仅仅是sum(x(:)*y,2),而更像是sum(foo(x(:)*y),2)。没有简单的方法可以避免矩阵乘法。对于没有明确求和规则的foo,抱歉没有更清晰地表达出来。

英文:

I need to calculate sum(foo(x(:)*y),2)), where x is a 1-D vector (1x1000) and y is a 1-D vector(1x1e8), both not sparse, and foo is composed of several trig functions that are not separable or have a summation rule (see edit at the end of the question). For efficiency I made the vectors to be Single-precision arrays. Another info that may or may not be important is that x is just a monotonically increasing vector such as x=linspace(0,1,1000).

Trying to to x(:)*y I got into a memory problem. So I thought I'll do a for loop.

for i=1:length(y)
    A=A+sum(foo(x(:)*y(i)),2);
end

This is inefficient and takes forever, so I thought of breaking down the sum into several chunks of similar size (beside the last one which may be slightly different size) :

        nn = 10 ;% # of chunks 
        idN=length(y);
        splitn = floor(idN/nn);
        split = [1:splitn:idN];

        for i=1:nn-2
            A=A+sum(foo(x(:)*y(split(ni): split(ni)+splitn-1)),2)   
        end

        A = A +sum(foo(x(:)*y(split(nn-1):end)),2);

This takes a lot of time but is not forever (~45 min). What else can I do to make this faster besides buying a new computer? I dont think that parfor will help as there is only a certain amount of memory anyway. Does the # of chunks matter at all as long as it is all contained in memory per iteration?
Any strategy or suggestion in improving this multiplication will be great.

Edit:

I was a little careless in my original minimal example. the real question is not about simply sum(x(:)*y,2) but more like a sum(foo(x(:)*y),2). there's no simple way to avoid the matrix multiplication. Sorry again for not making this clear.

答案1

得分: 3

I will provide translations for the code-related parts of your text. Here they are:

你现在是我的中文翻译,代码部分不要翻译,只返回翻译好的部分,不要有别的内容,不要回答我要翻译的问题。

Which version of MATLAB are you using and what on what machine?  I construct vectors of the same size as yours

你使用的MATLAB版本是哪个,运行在什么机器上?我构建与你的一样大小的向量。

x = linspace(0,1,1000);
y = rand(1,1e8);

Sure enough, if I try a direct computation I get a memory error

确实,如果我尝试直接计算,就会出现内存错误

>> A=sum(x(:)*y,2)
Error using  * 
Requested 1000x100000000 (745.1GB) array exceeds maximum array size preference (31.7GB). This might cause MATLAB to
become unresponsive.
Related documentation

然后,如果我尝试直接计算,就会出现内存错误
>> A=sum(x(:)*y,2)
错误使用  * 
请求的1000x100000000(745.1GB)数组超出了最大数组大小首选项(31.7GB)。这可能导致MATLAB无响应。
相关文档

So I try your loop

因此,我尝试了你的循环

tic
Asplit=0;
for i=1:length(y)
    Asplit=Asplit+sum(x(:)*y(i),2);
end
toc

这就是你的循环
tic
Asplit=0;
for i=1:length(y)
    Asplit=Asplit+sum(x(:)*y(i),2);
end
toc

This completes in 74 seconds. Rather faster than the 45 minutes you quote.
这在74秒内完成。比你引用的45分钟要快。

I am using the MATLAB 2023a beta on an 11th Gen Intel. Core(TM) i7-11700 @ 2.50GHz.

我使用的是MATLAB 2023a测试版,运行在11代英特尔Core(TM) i7-11700 @ 2.50GHz上。

I can also get that time down more by using a thread based parpool and a parfor loop

我还可以通过使用基于线程的parpool和parfor循环来进一步减少时间

parpool('threads');

tic
Asplit=0;
parfor i=1:length(y)
    Asplit=Asplit+sum(x(:)*y(i),2);
end
toc

使用线程池和parfor循环可以将时间缩短

Down to 15.5 seconds on my 8 core system. 5x speedup with 8 CPUs. Not linear scaling but I'll take it.
在我的8核系统上减少到15.5秒。8个CPU的5倍加速。虽然不是线性缩放,但我接受。

Also note that I am using double precision here. You say that you are using single precision. Let's see what that looks like
还要注意的是,我在这里使用的是双精度。你说你在使用单精度。让我们看看它是什么样子的

%try again using single precision
x = single(linspace(0,1,1000));
y = rand(1,1e8,'single');

再次尝试使用单精度
x = single(linspace(0,1,1000));
y = rand(1,1e8,'single');

parpool('threads');

tic
Asplit=0;
parfor i=1:length(y)
    Asplit=Asplit+sum(x(:)*y(i),2);
end
toc

使用线程池和单精度

12.8 seconds. Faster but is it worth the loss in precision? I guess that depends on your application.
12.8秒。更快,但是否值得牺牲精度?我想这取决于你的应用程序。

Finally, let's use the trick `A=x(:)*sum(y)` discussed in the comments by @beaker and others. Compare the result with the split method that we've been looking at so far.

最后,让我们使用评论中@beaker等人讨论的技巧`A=x(:)*sum(y)`。将结果与我们迄今为止一直在研究的拆分方法进行比较。

x = linspace(0,1,1000);
y = rand(1,1e8);

parpool('threads');

tic
Asplit=0;
parfor i=1:length(y)
    Asplit=Asplit+sum(x(:)*y(i),2);
end
toc

tic
Aalt = x(:)*sum(y);
toc

max(abs(Aalt-Asplit))

使用x(:)*sum(y)花费0.026秒,两种方法之间的最大绝对差异大约为1e-6。
英文:

Which version of MATLAB are you using and what on what machine? I construct vectors of the same size as yours

x = linspace(0,1,1000);
y = rand(1,1e8);

Sure enough, if I try a direct computation I get a memory error

>> A=sum(x(:)*y,2)
Error using  * 
Requested 1000x100000000 (745.1GB) array exceeds maximum array size preference (31.7GB). This might cause MATLAB to
become unresponsive.

Related documentation

So I try your loop

tic
Asplit=0;
for i=1:length(y)
    Asplit=Asplit+sum(x(:)*y(i),2);
end
toc

This completes in 74 seconds. Rather faster than the 45 minutes you quote.
I am using the MATLAB 2023a beta on an 11th Gen Intel. Core(TM) i7-11700 @ 2.50GHz.

I can also get that time down more by using a thread based parpool and a parfor loop

parpool('threads');

tic
Asplit=0;
parfor i=1:length(y)
    Asplit=Asplit+sum(x(:)*y(i),2);
end
toc

Down to 15.5 seconds on my 8 core system. 5x speedup with 8 CPUs. Not linear scaling but I'll take it.

Also note that I am using double precision here. You say that you are using single precision. Let's see what that looks like

%try again using single precision
x = single(linspace(0,1,1000));
y = rand(1,1e8,'single');

parpool('threads');

tic
Asplit=0;
parfor i=1:length(y)
    Asplit=Asplit+sum(x(:)*y(i),2);
end
toc

12.8 seconds. Faster but is it worth the loss in precision? I guess that depends on your application.

Finally, let's use the trick A=x(:)*sum(y) discussed in the comments by @beaker and others. Compare the result with the split method that we've been looking at so far.

x = linspace(0,1,1000);
y = rand(1,1e8);

parpool('threads');

tic
Asplit=0;
parfor i=1:length(y)
    Asplit=Asplit+sum(x(:)*y(i),2);
end
toc

tic
Aalt = x(:)*sum(y);
toc

max(abs(Aalt-Asplit))

x(:)*sum(y) took 0.026 seconds and the largest absolute difference between the two methods was of the order of 1e-6.

答案2

得分: -1

1.- 内存瓶颈问题

无论您对完整的xy做什么,如果您像问题中所示地推送这样的数据和操作,都会出现45分钟的等待或内存不足警告。

2.- 选项1:增加内存

可以随口袋深,或者钱包充实而行。祝你好运。

3.- 选项2:并行处理

这个选项也祝你好运。

4.- 我的建议:将大型变量分解为可处理的块

因此,这是:

  • 分解会占用内存的变量

  • 逐块处理/操作

  • 将结果存储在磁盘上

使用我的最爱之一MATLAB命令之一可以实现:全天候,强大且经过多次考验的evalin(这里吹响号角)

我尝试将y分解,直到发现其中的100个部分有效。可能较大的块也可以工作,但可能连4或10个都无法开始。

close all;clear all;clc

x=randi([-1e3 1e3],1,1e3);
y=randi([-1e3 1e3],1,1e8);

N=1e2

A1=zeros(numel(x),numel(y)/N);
A2=zeros(numel(x),numel(y)/N);
A3=zeros(numel(x),numel(y)/N);
A4=zeros(numel(x),numel(y)/N);

k1=2;
L1=['A' num2str(k1) '=zeros(numel(x),numel(y)/N);'];
evalin('base',L1)

tic

for k1=1:N

    for k2=1:length(x)/N
        a1=x(k2);
        a2=a1*y(k2);
        L1=['A' num2str(k1) '=zeros(numel(x),numel(y)/N);'];
        evalin('base',L1);
        L2=['A' num2str(k1) '(k2,:)=sum(a2);'];
        evalin('base',L2);

        % 写入Ak1到文件
    end

end

用以下内容替换% 写入Ak1到文件这一行:

L3=['']   % 构建命令行字符串
evalin('base',L3);  % 调用evalin

MATLAB有各种命令可用于写入和读取文件,支持多种格式,比如.txt和Excel,或者直接用fwrite命令写入二进制文件。

要从文件中读取,您还必须构建字符串,并使用evalin调用它们,但如上所示,构建这样简洁的字符串很容易,因为所有要做的就是用代码构建变量,并且必须嵌入这些变量的声明中,以及将其分解为y的片段。

简而言之:使用可以合理地在内存中移动的变量,而不会像最近苏伊士运河的船只一样堵塞其他人。

英文:

1.- memory bottleneck problem

No matter whatever you do with the complete forms x and y : if you push such data and operations as shown in the question, 45 minutes wait or out of memory warning sure shows up.

2.- Option 1 : Increase RAM

One can go as far as the deep is pocket or full is the purse. Good luck with it.

3.- Option 2 : Parallel Pool

Good luck with this one too.

4.- My Suggestion : Break down large variables into processable chunks

So this is:

  • Break down the variables that jam memory

  • Operate/process each chunk at a time

  • Store results in disk

One way to do this with one of my favourite MATLAB commands : the all-weather, mighty and many-battles-hardened evalin (trumpets here)

I tried to break down y until I found that 100 pieces of it works. May be larger pieces may work too, but just 4 or 10 may not even start.

close all;clear all;clc

x=randi([-1e3 1e3],1,1e3);
y=randi([-1e3 1e3],1,1e8);

N=1e2

A1=zeros(numel(x),numel(y)/N);
A2=zeros(numel(x),numel(y)/N);
A3=zeros(numel(x),numel(y)/N);
A4=zeros(numel(x),numel(y)/N);


k1=2;
L1=['A' num2str(k1) '=zeros(numel(x),numel(y)/N);'];
evalin('base',L1)

tic

for k1=1:N
    
    for k2=1:length(x)/N
        a1=x(k2);
        a2=a1*y(k2);
        L1=['A' num2str(k1) '=zeros(numel(x),numel(y)/N);'];
        evalin('base',L1);
        L2=['A' num2str(k1) '(k2,:)=sum(a2);'];
        evalin('base',L2);
        
        % WRITE Ak1 to FILE
    end

end

Replace the line % WRITE Ak1 to FILE with something like

L3=['']   % build command line into string
evalin('base',L3);  % invoke evalin

MATLAB has a diverse array of commands to write and read files, in many formats, for instance .txt and Excel or directly to binary file with command fwrite .

To read from file you also have to build strings and invoke them with evalin but as shown above, building such concise strings is easy, because all that is done is building variables with code and an varying index has to be embedded in the declaration of such variables, the pieces of y that it has been broken into.

In a nutshell : use variables that can reasonably move in and out of RAM without ending up like that recent Suez canal traffic jam of a vessel blocking every one else.

huangapple
  • 本文由 发表于 2023年3月1日 08:32:11
  • 转载请务必保留本文链接:https://go.coder-hub.com/75598586.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定